Neural Turing Machine
Authors: Alex Graves, Greg Wayne, Ivo Danihelka
Presented by: Tinghui Wang (Steve), EECS, WSU – April 6, 2015

Introduction
• Neural Turing Machine (NTM): couples a neural network with external memory resources
• The combined system is analogous to a Turing machine, but is differentiable end to end
• The NTM can infer simple algorithms!

In Computation Theory… Finite State Machine (Σ, S, s₀, δ, F)
• Σ: alphabet
• S: finite set of states
• s₀: initial state
• δ: state transition function, δ: S × Σ → P(S)
• F: set of final states

In Computation Theory… Push-Down Automaton (Q, Σ, Γ, δ, q₀, Z, F)
• Σ: input alphabet
• Γ: stack alphabet
• Q: finite set of states
• q₀: initial state
• Z: initial stack symbol
• δ: state transition function, δ: Q × Σ × Γ → P(Q × Γ)
• F: set of final states

Turing Machine (Q, Γ, b, Σ, δ, q₀, F)
• Q: finite set of states
• q₀: initial state
• Γ: finite tape alphabet
• b: blank symbol (the only symbol that may occur infinitely often on the tape)
• Σ: input alphabet, Σ = Γ \ {b}
• δ: transition function, δ: (Q \ F) × Γ → Q × Γ × {L, R}
• F: set of final (accepting) states

Neural Turing Machine – Add Learning to the TM!
• Replace the finite state machine (the program) with a neural network (which can learn!)

A Little History of Neural Networks
• 1950s: Frank Rosenblatt's perceptron – classification with a linear predictor
• 1969: Minsky shows the perceptron's limits – a single-layer perceptron cannot learn XOR
• 1980s: back-propagation
• 1990s: recurrent neural networks, Long Short-Term Memory
• Late 2000s – now: deep learning, fast computers

Feed-Forward Neural Networks and Back-Propagation
• Total input of unit j: x_j = Σ_i y_i w_ji
• Output of unit j (logistic function): y_j = 1 / (1 + e^(−x_j))
• Total error (squared error over cases c and output units j): E_total = ½ Σ_c Σ_j (d_{c,j} − y_{c,j})²
• Gradient descent via the chain rule: ∂E/∂w_ji = (∂E/∂y_j) (∂y_j/∂x_j) (∂x_j/∂w_ji)
• Rumelhart, Hinton, and Williams (1986), "Learning representations by back-propagating errors", Nature 323(6088): 533–536

Recurrent Neural Networks
• y(t): n-tuple of outputs at time t
• x^net(t): m-tuple of external inputs at time t
• U: set of indices k such that x_k is the output of a unit in the network
• I: set of indices k such that x_k is an external input
• x_k(t) = x_k^net(t) if k ∈ I, and x_k(t) = y_k(t) if k ∈ U
• s_k(t+1) = Σ_{i ∈ I ∪ U} w_ki x_i(t)
• y_k(t+1) = f_k(s_k(t+1)), where f_k is a differentiable squashing function
• Williams and Zipser (1994), "Gradient-based learning algorithms for recurrent networks and their computational complexity"

Recurrent Neural Networks – Time Unrolling
• Error function: J(t) = −½ Σ_{k∈U} e_k(t)² = −½ Σ_{k∈U} (d_k(t) − y_k(t))²
• Total error over an interval: J_total(t′, t) = Σ_{τ=t′+1}^{t} J(τ)
• Back-propagation through time:
  ε_k(t) = ∂J(t)/∂y_k(t) = e_k(t) = d_k(t) − y_k(t)
  δ_k(τ) = ∂J(t)/∂s_k(τ) = (∂J(t)/∂y_k(τ)) (∂y_k(τ)/∂s_k(τ)) = f_k′(s_k(τ)) ε_k(τ)
  ε_k(τ−1) = ∂J(t)/∂y_k(τ−1) = Σ_{l∈U} (∂J(t)/∂s_l(τ)) (∂s_l(τ)/∂y_k(τ−1)) = Σ_{l∈U} w_lk δ_l(τ)
  ∂J_total/∂w_ji = Σ_{τ=t′+1}^{t} (∂J(t)/∂s_j(τ)) (∂s_j(τ)/∂w_ji)

RNNs and Long Short-Term Memory
• RNNs are Turing complete: with a proper configuration, an RNN can simulate any sequence generated by a Turing machine
• Vanishing and exploding gradients: for a single unit with a self loop,
  δ_k(τ) = ∂J(t)/∂s_k(τ) = f_k′(s_k(τ)) ε_k(τ) = f_k′(s_k(τ)) w_kk δ_k(τ+1)
  so the error signal is multiplied by f_k′(s_k) w_kk at every step back in time
• Long Short-Term Memory addresses this: Hochreiter and Schmidhuber (1997), "Long short-term memory", Neural Computation 9(8): 1735–1780
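To make the vanishing/exploding behaviour concrete, here is a minimal sketch of the recursion δ_k(τ) = f_k′(s_k(τ)) w_kk δ_k(τ+1) for a single self-looping unit. It is not taken from the slides or the paper: the logistic squashing function, the 50-step horizon, and the name backprop_through_self_loop are illustrative assumptions.

    import numpy as np

    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    def backprop_through_self_loop(w_kk, s, delta_T):
        # Propagate the error signal delta_k backwards through a single
        # self-looping unit: delta_k(tau) = f'(s_k(tau)) * w_kk * delta_k(tau+1).
        deltas = [delta_T]
        for s_tau in reversed(s[:-1]):
            f_prime = logistic(s_tau) * (1.0 - logistic(s_tau))  # derivative of the logistic squashing function
            deltas.append(f_prime * w_kk * deltas[-1])
        return list(reversed(deltas))

    s = np.zeros(50)  # 50 time steps with pre-activation 0, so f' = 0.25 at every step
    print(backprop_through_self_loop(1.0, s, 1.0)[0])   # ~0.25**49: the gradient vanishes
    print(backprop_through_self_loop(10.0, s, 1.0)[0])  # ~2.5**49: the gradient explodes

With w_kk = 1 the back-propagated signal shrinks by a factor of at most 0.25 per step and all but disappears after 50 steps; with w_kk = 10 it grows without bound. This is the problem LSTM's gated memory cells were designed to avoid.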
Purpose of the Neural Turing Machine
• Enrich the capabilities of a standard RNN so that the solution of algorithmic tasks becomes simpler

NTM Read/Write Operations
[Figure: the memory M_t is a bank of N locations, each holding an M-element vector, addressed through a weighting w_1, …, w_N over the locations]
• Read: r_t = Σ_i w_t(i) M_t(i), where w_t(i) is a vector of weightings over the N locations emitted by a read head at time t
• Write, erase step: M̃_t(i) = M_{t−1}(i) [1 − w_t(i) e_t]
• Write, add step: M_t(i) = M̃_t(i) + w_t(i) a_t
• e_t: erase vector emitted by the head; a_t: add vector emitted by the head; w_t: weighting vector

Attention and Focus
• Attention and focus are adjusted through the weighting vector over the whole memory bank

Content-Based Addressing
• Find similar data in memory
• k_t: key vector; β_t: key strength
• w_t^c(i) = exp(β_t K(k_t, M_t(i))) / Σ_j exp(β_t K(k_t, M_t(j)))
• Similarity measure (cosine similarity): K(u, v) = (u · v) / (‖u‖ ‖v‖)

Addressing May Not Depend on Content Only…
• Gate the content addressing with an interpolation gate g_t in the range (0, 1):
  w_t^g = g_t w_t^c + (1 − g_t) w_{t−1}

Convolutional Shift – Location-Based Addressing
• s_t: shift weighting, a convolution vector of size N
• w̃_t(i) = Σ_{j=0}^{N−1} w_t^g(j) s_t(i − j), with indices taken modulo N

Convolutional Shift – Sharpening
• γ_t: sharpening factor
• w_t(i) = w̃_t(i)^{γ_t} / Σ_j w̃_t(j)^{γ_t}
• (A combined code sketch of the addressing and read/write steps appears after the experiment slides below.)

Neural Turing Machine – Experiments
• Goal: demonstrate that the NTM is able to solve the problems by learning compact internal programs
• Three architectures: NTM with a feedforward controller, NTM with an LSTM controller, and a standard LSTM
• Tasks: Copy, Repeat Copy, Associative Recall, Dynamic N-Grams, Priority Sort

NTM Experiments: Copy
• Task: store and then recall a long sequence of arbitrary information
• Training: 8-bit random vectors, sequence lengths 1–20
• The network receives no inputs while it generates the targets
[Figure: copy outputs for the NTM and for LSTM]

NTM Experiments: Copy – Learning Curve
[Figure: learning curves]

NTM Experiments: Repeat Copy
• Task: repeat a sequence a specified number of times
• Training: random-length sequences of random binary vectors, followed by a scalar value indicating the desired number of copies

NTM Experiments: Repeat Copy – Learning Curve
[Figure: learning curves]

NTM Experiments: Associative Recall
• Task: after a sequence of items has been propagated to the network, ask the network to produce the next item given the current item
• Training: each item is composed of three six-bit binary vectors, with 2–8 items per episode

NTM Experiments: Dynamic N-Grams
• Task: learn an N-gram model and rapidly adapt to a new predictive distribution
• Training: 6-gram distributions over binary sequences; 200 successive bits are generated using a look-up table of 32 probabilities drawn from a Beta(0.5, 0.5) distribution
• Compared to the optimal estimator: P(B = 1 | N_1, N_0, c) = (N_1 + 0.5) / (N_1 + N_0 + 1), where N_1 and N_0 count the ones and zeros observed after context c
[Figure: example contexts c = 01111 and c = 00010]

NTM Experiments: Priority Sort
• Task: sort binary vectors according to their priorities
• Training: 20 binary vectors with corresponding priorities; the target output is the 16 highest-priority vectors

NTM Experiments: Learning Curves
[Figure: learning curves across tasks]

Some Detailed Parameters
[Table: experiment hyperparameters for the NTM with a feedforward controller, the NTM with an LSTM controller, and the standard LSTM]
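Before the literature pointers, here is the combined sketch promised above: a single addressing step followed by a read and a write, stringing together the content-addressing, interpolation, convolutional-shift, and sharpening equations from the earlier slides. It is a minimal NumPy illustration, not the authors' implementation; the function names, the 8×4 memory size, and the parameter values are assumptions made for the example.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def address(M, w_prev, k, beta, g, s, gamma):
        # Content addressing: cosine similarity of the key against every memory row, scaled by beta.
        sim = M @ k / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)
        w_c = softmax(beta * sim)
        # Interpolation with the previous weighting via the gate g in (0, 1).
        w_g = g * w_c + (1 - g) * w_prev
        # Circular convolution with the shift weighting s (indices taken modulo N).
        N = len(w_g)
        w_shift = np.array([sum(w_g[j] * s[(i - j) % N] for j in range(N)) for i in range(N)])
        # Sharpening with gamma >= 1, then renormalise.
        w = w_shift ** gamma
        return w / w.sum()

    def read(M, w):
        return w @ M                  # r_t = sum_i w_t(i) M_t(i)

    def write(M, w, e, a):
        M = M * (1 - np.outer(w, e))  # erase step
        return M + np.outer(w, a)     # add step

    # Illustrative use: N = 8 locations, each an M = 4 element vector.
    memory = np.random.rand(8, 4)
    w_prev = np.full(8, 1.0 / 8)
    shift = np.zeros(8); shift[1] = 1.0   # shift the focus by one location
    w = address(memory, w_prev, k=memory[3], beta=5.0, g=1.0, s=shift, gamma=2.0)
    print(read(memory, w))
    memory = write(memory, w, e=np.ones(4), a=np.zeros(4))  # erase the addressed slot

Every step is a smooth function of the quantities emitted by the controller (k_t, β_t, g_t, s_t, γ_t, e_t, a_t), which is what keeps the whole machine differentiable end to end and trainable with gradient descent.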
"Learning representations by back-propagating errors". Nature 323 (6088): 533–536 R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1994. Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–1780 Y. Bengio and Y. LeCun, Scaling Learning Algorithms towards AI, Large-Scale Kernel Machines, Bottou, ., Chapelle, O., DeCoste, D., and Weston, J., Eds., MIT Press, 2007. • NTM Implementation https://github.com/shawntan/neural-turing-machines • NTM Reddit Discussion 4/6/2015 https://www.reddit.com/r/MachineLearning/comments/2m9mga/implementation_of_neural_turing_ma chines/ Tinghui Wang, EECS, WSU 29 4/6/2015 Tinghui Wang, EECS, WSU 30