Neural Turing Machine

Authors: Alex Graves, Greg Wayne, Ivo Danihelka
Presented by: Tinghui Wang (Steve)
Introduction
• Neural Turing Machine: couples a neural network with external memory resources
• The combined system is analogous to a Turing machine, but is differentiable end to end
• NTM can infer simple algorithms!
In Computation Theory…
Finite State Machine
$(\Sigma, S, s_0, \delta, F)$
• $\Sigma$: alphabet
• $S$: finite set of states
• $s_0$: initial state
• $\delta$: state transition function, $\delta: S \times \Sigma \to P(S)$
• $F$: set of final states
In Computation Theory…
Push Down Automata
$(Q, \Sigma, \Gamma, \delta, q_0, Z, F)$
• $\Sigma$: input alphabet
• $\Gamma$: stack alphabet
• $Q$: finite set of states
• $q_0$: initial state
• $Z$: initial stack symbol
• $\delta$: state transition function, $\delta: Q \times \Sigma \times \Gamma \to P(Q \times \Gamma^*)$
• $F$: set of final states
Turing Machine
$(Q, \Gamma, b, \Sigma, \delta, q_0, F)$
• $Q$: finite set of states
• $q_0$: initial state
• $\Gamma$: finite tape alphabet
• $b \in \Gamma$: blank symbol (the only symbol allowed to occur on the tape infinitely often)
• $\Sigma$: input alphabet, $\Sigma = \Gamma \setminus \{b\}$
• $\delta$: transition function, $\delta: (Q \setminus F) \times \Gamma \to Q \times \Gamma \times \{L, R\}$
• $F$: set of final (accepting) states
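To make the tuple concrete, here is a minimal Python sketch of running such a machine; the example transition table, state names, and tape encoding are illustrative assumptions, not taken from the slides.

```python
# Minimal Turing machine runner (illustrative sketch).
# delta maps (state, symbol) -> (new_state, written_symbol, move), move in {"L", "R"}.
def run_tm(delta, q0, finals, tape, blank="b", max_steps=1000):
    cells = dict(enumerate(tape))            # sparse tape; unwritten cells hold the blank
    state, head = q0, 0
    for _ in range(max_steps):
        if state in finals:                  # halt as soon as an accepting state is reached
            return state, cells
        symbol = cells.get(head, blank)
        state, written, move = delta[(state, symbol)]
        cells[head] = written
        head += 1 if move == "R" else -1
    raise RuntimeError("no halt within the step budget")

# Hypothetical example: scan right over 1s, accept at the first blank.
delta = {("q0", "1"): ("q0", "1", "R"), ("q0", "b"): ("qf", "b", "R")}
print(run_tm(delta, "q0", {"qf"}, "111"))
```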
Neural Turing Machine - Add Learning to TM!!
(Figure: the Turing machine's finite state machine controller, the "program", is replaced by a neural network that can learn.)
A Little History on Neural Network
• 1950s: Frank Rosenblatt's perceptron: classification based on a linear predictor
• 1969: Minsky and Papert show the perceptron's limits: it cannot learn XOR
• 1980s: Back-propagation
• 1990s: Recurrent neural networks, Long Short-Term Memory
• Late 2000s to now: deep learning, fast computers
Feed-Forward Neural Net and Back Propagation
• Total input of unit $j$:
$x_j = \sum_i y_i w_{ji}$
• Output of unit $j$ (logistic function):
$y_j = \dfrac{1}{1 + e^{-x_j}}$
• Total error (squared error summed over cases $c$ and output units $j$):
$E_{total} = \dfrac{1}{2} \sum_c \sum_j \left( d_{c,j} - y_{c,j} \right)^2$
• Gradient descent via the chain rule:
$\dfrac{\partial E}{\partial w_{ji}} = \dfrac{\partial E}{\partial y_j} \dfrac{\partial y_j}{\partial x_j} \dfrac{\partial x_j}{\partial w_{ji}}$
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088):533–536.
Recurrent Neural Network
• $y(t)$: $n$-tuple of outputs at time $t$
• $x^{net}(t)$: $m$-tuple of external inputs at time $t$
• $U$: set of indices $k$ such that $x_k$ is the output of a unit in the network
• $I$: set of indices $k$ such that $x_k$ is an external input
• $x_k(t) = \begin{cases} x_k^{net}(t) & \text{if } k \in I \\ y_k(t) & \text{if } k \in U \end{cases}$
• $s_k(t+1) = \sum_{i \in I \cup U} w_{ki}\, x_i(t)$
• $y_k(t+1) = f_k\!\left( s_k(t+1) \right)$, where $f_k$ is a differentiable squashing function
Williams, R. J. and Zipser, D. (1994). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum.
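A small numpy sketch of this recurrence; the unit counts and the choice of a logistic squashing function are illustrative assumptions.

```python
import numpy as np

def rnn_step(W, x_net, y_prev):
    """s_k(t+1) = sum_i w_ki x_i(t) over external inputs and unit outputs;
    y_k(t+1) = f(s_k(t+1)) with a logistic squashing function."""
    x = np.concatenate([x_net, y_prev])   # x_k: external inputs (k in I) then unit outputs (k in U)
    return 1.0 / (1.0 + np.exp(-(W @ x)))

# Illustrative rollout: 3 units, 2 external inputs, 4 time steps.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(3, 5))
y = np.zeros(3)
for t in range(4):
    y = rnn_step(W, rng.random(2), y)
```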
Recurrent Neural Network – Time Unrolling
• Error function:
$J(t) = -\dfrac{1}{2} \sum_{k \in U} e_k(t)^2 = -\dfrac{1}{2} \sum_{k \in U} \left( d_k(t) - y_k(t) \right)^2$
$J_{Total}(t', t) = \sum_{\tau = t'+1}^{t} J(\tau)$
• Back-propagation through time:
$\varepsilon_k(t) = \dfrac{\partial J(t)}{\partial y_k(t)} = e_k(t) = d_k(t) - y_k(t)$
$\delta_k(\tau) = \dfrac{\partial J(t)}{\partial s_k(\tau)} = \dfrac{\partial J(t)}{\partial y_k(\tau)} \dfrac{\partial y_k(\tau)}{\partial s_k(\tau)} = f_k'\!\left( s_k(\tau) \right) \varepsilon_k(\tau)$
$\varepsilon_k(\tau - 1) = \dfrac{\partial J(t)}{\partial y_k(\tau - 1)} = \sum_{l \in U} \dfrac{\partial J(t)}{\partial s_l(\tau)} \dfrac{\partial s_l(\tau)}{\partial y_k(\tau - 1)} = \sum_{l \in U} w_{lk}\, \delta_l(\tau)$
$\dfrac{\partial J_{Total}}{\partial w_{ji}} = \sum_{\tau = t'+1}^{t} \dfrac{\partial J(t)}{\partial s_j(\tau)} \dfrac{\partial s_j(\tau)}{\partial w_{ji}}$
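A compact numpy sketch of these BPTT equations, reusing the logistic squashing of the RNN step above; for simplicity it treats every unit as an output unit with a target at every step, which is my simplification rather than anything stated on the slides.

```python
import numpy as np

def bptt_grad(W, x_nets, targets, y0):
    """Accumulate dJ_total/dW by propagating delta_k(tau) backwards through time."""
    ys, xs = [y0], []
    for x_net in x_nets:                              # forward pass, storing activities
        x = np.concatenate([x_net, ys[-1]])           # x_k: external inputs then unit outputs
        xs.append(x)
        ys.append(1.0 / (1.0 + np.exp(-(W @ x))))     # y_k(t+1) = f(s_k(t+1))
    grad = np.zeros_like(W)
    eps = np.zeros(W.shape[0])                        # epsilon_k carried backwards in time
    n_in = len(x_nets[0])
    for tau in reversed(range(len(x_nets))):
        eps = eps + (targets[tau] - ys[tau + 1])      # inject e_k(tau) = d_k(tau) - y_k(tau)
        delta = ys[tau + 1] * (1.0 - ys[tau + 1]) * eps   # delta_k(tau) = f'(s_k) * eps_k
        grad += np.outer(delta, xs[tau])              # accumulate dJ/dw_ji over tau
        eps = W[:, n_in:].T @ delta                   # eps_k(tau-1) = sum_l w_lk delta_l(tau)
    return grad

# Illustrative call: 3 units, 2 external inputs, 4 time steps of random data.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(3, 5))
grad = bptt_grad(W, [rng.random(2) for _ in range(4)],
                 [rng.random(3) for _ in range(4)], np.zeros(3))
```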
RNN & Long Short-Term Memory
• RNN is Turing complete
 With a proper configuration, it can simulate any sequence generated by a Turing machine
• Error vanishing and exploding
 Consider a single unit with a self loop:
$\delta_k(\tau) = \dfrac{\partial J(t)}{\partial s_k(\tau)} = \dfrac{\partial J(t)}{\partial y_k(\tau)} \dfrac{\partial y_k(\tau)}{\partial s_k(\tau)} = f_k'\!\left( s_k(\tau) \right) \varepsilon_k(\tau) = f_k'\!\left( s_k(\tau) \right) w_{kk}\, \delta_k(\tau + 1)$
 Each step back in time multiplies the error by $f_k'(s_k(\tau))\, w_{kk}$, so over long spans it shrinks to zero or blows up exponentially (see the numeric illustration below)
• Long Short-Term Memory was introduced to address this
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
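A tiny numeric illustration of that repeated multiplication; the self-loop weights, the worst-case logistic slope of 0.25, and the 50-step horizon are arbitrary choices for the demo.

```python
# delta_k(tau) = f'(s_k(tau)) * w_kk * delta_k(tau+1); for a logistic unit, f' <= 0.25.
for w_kk in (0.9, 5.0):
    delta = 1.0
    for _ in range(50):                  # 50 steps back through time
        delta *= 0.25 * w_kk             # worst-case logistic slope times self-loop weight
    print(w_kk, delta)                   # vanishes for w_kk = 0.9, explodes for w_kk = 5.0
```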
Purpose of Neural Turing Machine
• Enrich the capabilities of a standard RNN to simplify the solution of algorithmic tasks
NTM Read/Write Operation
(Figure: a memory bank $M_t$ of $N$ locations, each an M-size vector, addressed by a weight vector $w_t(1), w_t(2), \ldots, w_t(N)$.)
• Read:
$r_t = \sum_i w_t(i)\, M_t(i)$
$w_t(i)$: vector of weights over the $N$ locations emitted by a read head at time $t$
• Write (erase, then add):
Erase: $\tilde{M}_t(i) = M_{t-1}(i)\left[ \mathbf{1} - w_t(i)\, e_t \right]$
Add: $M_t(i) = \tilde{M}_t(i) + w_t(i)\, a_t$
$e_t$: erase vector, emitted by the write head
$a_t$: add vector, emitted by the write head
$w_t$: weight vector
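A minimal numpy sketch of the read and erase/add write operations; the memory sizes and the uniform weighting used in the example are illustrative assumptions.

```python
import numpy as np

def read(M, w):
    """r_t = sum_i w_t(i) M_t(i): a weighted sum of the N memory rows."""
    return w @ M                          # (N,) @ (N, M) -> (M,)

def write(M, w, erase, add):
    """Erase then add: M_t(i) = M_{t-1}(i) * (1 - w_t(i) e_t) + w_t(i) a_t."""
    M = M * (1.0 - np.outer(w, erase))
    return M + np.outer(w, add)

# Illustrative use: N = 8 locations, each an M-size vector of length 4.
rng = np.random.default_rng(0)
M = rng.random((8, 4))
w = np.full(8, 1 / 8)                     # uniform attention over the locations
M = write(M, w, erase=np.ones(4), add=rng.random(4))
r = read(M, w)
```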
Attention & Focus
• Attention and focus are adjusted by the weight vector across the whole memory bank!
Content Based Addressing
• Find similar data in memory
$k_t$: key vector
$\beta_t$: key strength
$w_t^c(i) = \dfrac{\exp\!\left( \beta_t\, K(k_t, M_t(i)) \right)}{\sum_j \exp\!\left( \beta_t\, K(k_t, M_t(j)) \right)}$
$K(u, v) = \dfrac{u \cdot v}{\lVert u \rVert\, \lVert v \rVert}$ (cosine similarity)
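A short numpy sketch of content addressing, i.e. a softmax over key strength times cosine similarity; the example key and memory contents are assumptions.

```python
import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)

def content_weights(M, key, beta):
    """w_c(i) = softmax_i( beta * K(key, M(i)) ) with K = cosine similarity."""
    scores = beta * np.array([cosine(key, row) for row in M])
    e = np.exp(scores - np.max(scores))    # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
M = rng.random((8, 4))
w_c = content_weights(M, key=M[3], beta=10.0)   # a key equal to row 3 focuses there
```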
May not depend on content only…
• Gating content addressing with the previous weighting
$g_t$: interpolation gate in the range $(0, 1)$
$w_t^g = g_t\, w_t^c + (1 - g_t)\, w_{t-1}$
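The interpolation itself is one line; a sketch continuing the content-addressing example, where the gate value and the uniform previous weighting are arbitrary assumptions.

```python
import numpy as np

def gate(w_c, w_prev, g):
    """w_g = g * w_c + (1 - g) * w_prev: blend content addressing with the previous
    step's weighting; a gate near 0 ignores the content-based head entirely."""
    return g * w_c + (1 - g) * w_prev

# A sharp content weighting on location 3, blended with a uniform previous weighting.
w_g = gate(np.eye(8)[3], np.full(8, 1 / 8), g=0.9)
```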
Convolutional Shift – Location Addressing
$s_t$: shift weighting, a convolution vector of size $N$
$\tilde{w}_t(i) = \sum_{j=0}^{N-1} w_t^g(j)\, s_t(i - j)$
(all index arithmetic is taken modulo $N$, i.e. a circular convolution)
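A sketch of the circular convolution; treating the shift weighting as a full length-$N$ vector indexed by shift amount is a simplification made for illustration.

```python
import numpy as np

def shift(w_g, s):
    """w~(i) = sum_j w_g(j) * s(i - j), with indices taken modulo N (circular convolution)."""
    N = len(w_g)
    return np.array([sum(w_g[j] * s[(i - j) % N] for j in range(N)) for i in range(N)])

# Example: a shift distribution with all mass on +1 rotates the focus by one slot.
w_g = np.eye(8)[3]
s = np.zeros(8); s[1] = 1.0              # s(k): weight assigned to a shift of k (mod N)
print(shift(w_g, s))                     # focus moves from location 3 to location 4
```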
Convolutional Shift – Location Addressing
$\gamma_t$: sharpening factor
$w_t(i) = \dfrac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}}$
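And the sharpening step, which counteracts the blurring introduced by the shift convolution; the example weighting and $\gamma$ value are assumptions.

```python
import numpy as np

def sharpen(w, gamma):
    """w(i) = w~(i)^gamma / sum_j w~(j)^gamma; gamma > 1 re-concentrates the focus."""
    p = w ** gamma
    return p / p.sum()

print(sharpen(np.array([0.1, 0.8, 0.1]), gamma=3.0))   # mass concentrates on the peak
```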
Neural Turing Machine - Experiments
• Goal: demonstrate that NTM is
 able to solve the problems
 by learning compact internal programs
• Three architectures:
 NTM with a feedforward controller
 NTM with an LSTM controller
 Standard LSTM network
• Applications
 Copy
 Repeat Copy
 Associative Recall
 Dynamic N-Grams
 Priority Sort
NTM Experiments: Copy
• Task: store and recall a long sequence of arbitrary information
• Training: 8-bit random vectors, with sequence lengths of 1-20
• The network receives no inputs while it generates the targets
(Figure: example copy outputs for NTM and LSTM.)
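A sketch of how a training episode for the copy task might be generated; the delimiter-channel convention is my assumption, since the slides only specify 8-bit random vectors of length 1 to 20.

```python
import numpy as np

def copy_episode(rng, max_len=20, bits=8):
    """Inputs: the sequence plus a delimiter flag, then zeros while the target is emitted."""
    length = rng.integers(1, max_len + 1)
    seq = rng.integers(0, 2, size=(length, bits)).astype(float)
    inputs = np.zeros((2 * length + 1, bits + 1))
    inputs[:length, :bits] = seq
    inputs[length, bits] = 1.0            # delimiter marks the end of the input phase
    targets = np.zeros_like(inputs)
    targets[length + 1:, :bits] = seq     # the network must reproduce the sequence
    return inputs, targets

inputs, targets = copy_episode(np.random.default_rng(0))
```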
NTM Experiments: Copy – Learning Curve
NTM Experiments: Repeated Copy
• Task: Repeat a sequence a specified
number of times
• Training: random length sequences of
random binary vectors followed by a
scalar value indicating the desired
number of copies
NTM Experiments: Repeated Copy – Learning Curve
NTM Experiments: Associative Recall
• Task: after propagating a sequence of items to the network, ask it to produce the next item given the current item
• Training: each item is composed of three six-bit binary vectors, with 2-8 items per episode
NTM Experiments: Dynamic N-Grams
• Task: learn an N-gram model and rapidly adapt to a new predictive distribution
• Training: 6-gram distributions over binary sequences; 200 successive bits are generated from a look-up table whose 32 probabilities are drawn from a Beta(0.5, 0.5) distribution
• Compare to the optimal Bayesian estimator:
$P(B = 1 \mid N_1, N_0, \mathbf{c}) = \dfrac{N_1 + 0.5}{N_1 + N_0 + 1}$
where $N_1$ and $N_0$ are the numbers of ones and zeros observed so far following context $\mathbf{c}$
(Figure: example contexts c = 01111 and c = 00010.)
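A quick sketch of that estimator, keeping per-context counts over an assumed example bit string.

```python
from collections import defaultdict

def ngram_estimator(bits, context_len=5):
    """P(B=1 | N1, N0, c) = (N1 + 0.5) / (N1 + N0 + 1), with counts kept per context c."""
    counts = defaultdict(lambda: [0, 0])       # context -> [N0, N1]
    preds = []
    for i in range(context_len, len(bits)):
        c = tuple(bits[i - context_len:i])
        n0, n1 = counts[c]
        preds.append((n1 + 0.5) / (n1 + n0 + 1))
        counts[c][bits[i]] += 1                # update after predicting, as the data streams in
    return preds

print(ngram_estimator([0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0])[:5])
```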
NTM Experiments: Priority Sorting
• Task: sort binary vectors according to their priorities
• Training: 20 binary vectors with corresponding priorities; the target output is the 16 highest-priority vectors
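For reference, the target mapping the network has to learn is simply the following; the vector width and the priority range used here are assumptions.

```python
import numpy as np

def priority_sort_target(vectors, priorities, k=16):
    """Return the k highest-priority vectors, ordered by decreasing priority."""
    order = np.argsort(priorities)[::-1][:k]
    return vectors[order]

rng = np.random.default_rng(0)
target = priority_sort_target(rng.integers(0, 2, size=(20, 7)), rng.uniform(-1, 1, 20))
```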
NTM Experiments: Learning Curve
Some Detailed Parameters
(Table: detailed experimental parameters for NTM with a feedforward controller, NTM with an LSTM controller, and standard LSTM.)
Additional Information
• Literature references:
 Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088):533–536.
 Williams, R. J. and Zipser, D. (1994). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum.
 Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
 Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In Bottou, L., Chapelle, O., DeCoste, D., and Weston, J., editors, Large-Scale Kernel Machines. MIT Press.
• NTM implementation
 https://github.com/shawntan/neural-turing-machines
• NTM Reddit discussion
 https://www.reddit.com/r/MachineLearning/comments/2m9mga/implementation_of_neural_turing_machines/