slides - Deep Learning for Natural Language Processing

CS224d Deep NLP
Lecture 7: Fancy Recurrent Neural Networks for Machine Translation
Richard Socher
[email protected]

Overview
•  Wrapping up: Bidirectional and deep RNNs for opinion labeling
•  Machine translation
•  RNN models tackling MT:
  •  Gated Recurrent Units by Cho et al. (2014)
  •  Long Short-Term Memories by Hochreiter and Schmidhuber (1997)

Deep Bidirectional RNNs by Irsoy and Cardie

Going Deep
[Figure: the input x feeds stacked hidden layers h(1), h(2), h(3), which feed the output y.]

→h_t^(i) = f(→W^(i) h_t^(i−1) + →V^(i) →h_{t−1}^(i) + →b^(i))
←h_t^(i) = f(←W^(i) h_t^(i−1) + ←V^(i) ←h_{t+1}^(i) + ←b^(i))
y_t = g(U [→h_t^(L); ←h_t^(L)] + c)

(→ and ← mark the forward and backward directions; layer i reads the layer below, h^(i−1).)
Each memory layer passes an intermediate sequential
representation to the next.
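To make the equations above concrete, here is a minimal numpy sketch of one bidirectional layer; the shapes and names are illustrative assumptions, not Irsoy and Cardie's implementation.

import numpy as np

# One layer of a deep bidirectional RNN: read the lower layer's sequence
# h_in (shape T x D) and return the concatenated forward/backward states
# (shape T x 2H). The top layer would then compute y_t = g(U [fwd; bwd] + c).
def bidirectional_layer(h_in, Wf, Vf, bf, Wb, Vb, bb):
    T, _ = h_in.shape
    H = bf.shape[0]
    h_fwd, h_bwd = np.zeros((T, H)), np.zeros((T, H))
    for t in range(T):                       # left to right
        prev = h_fwd[t - 1] if t > 0 else np.zeros(H)
        h_fwd[t] = np.tanh(Wf @ h_in[t] + Vf @ prev + bf)
    for t in reversed(range(T)):             # right to left
        nxt = h_bwd[t + 1] if t < T - 1 else np.zeros(H)
        h_bwd[t] = np.tanh(Wb @ h_in[t] + Vb @ nxt + bb)
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
T, D, H = 5, 4, 3
params = [rng.normal(0, 0.1, s) for s in
          [(H, D), (H, H), (H,), (H, D), (H, H), (H,)]]
print(bidirectional_layer(rng.normal(size=(T, D)), *params).shape)  # (5, 6)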
Data
•  MPQA 1.2 corpus (Wiebe et al., 2005)
•  Consists of 535 news articles (11,111 sentences)
•  Manually labeled at the phrase level
•  Evaluation: F1
Results: Deep vs Shallow RNNs
[Figure: Prop F1 and Bin F1 plotted against the number of layers (1-5) for DSE and ESE detection, for 24k and 200k training set sizes.]
Machine Translation
•  Methods are statistical
•  Use parallel corpora, e.g.:
  •  European Parliament
•  First parallel corpus:
  •  Rosetta Stone →
•  Traditional systems are very complex

Current statistical machine translation systems
•  Source language f, e.g. French
•  Target language e, e.g. English
•  Probabilistic formulation (using Bayes' rule): choose ê = argmax_e p(e|f) = argmax_e p(f|e) p(e)
•  Translation model p(f|e), trained on a parallel corpus
•  Language model p(e), trained on an English-only corpus (lots of it, and free!)

Pipeline: French → Translation Model p(f|e) → pieces of English → Decoder argmax p(f|e) p(e) → Language Model p(e) → proper English

Step 1: Alignment
Goal: know which words or phrases in the source language would translate to which words or phrases in the target language. Hard already!
We can factor the translation model p(f|e) by identifying alignments (correspondences) between words in f and words in e.
[Alignment example: "Japan shaken by two new quakes" ↔ "Le Japon secoué par deux nouveaux séismes"; the French "Le" is a spurious word with no English counterpart.]
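The lecture does not go into alignment models, but a toy version of the classic IBM Model 1 EM procedure shows how word-level translation probabilities (and hence alignments) can be learned from a parallel corpus. The tiny corpus below is made up, and the sketch omits the NULL source word and everything beyond Model 1.

# Toy IBM Model 1: learn word-translation probabilities t(f|e) with EM.
corpus = [
    ("the house".split(), "la maison".split()),
    ("the book".split(),  "le livre".split()),
    ("a book".split(),    "un livre".split()),
]
e_vocab = {e for e_sent, _ in corpus for e in e_sent}
f_vocab = {f for _, f_sent in corpus for f in f_sent}

# Uniform initialization of t(f|e)
t = {(f, e): 1.0 / len(f_vocab) for f in f_vocab for e in e_vocab}

for _ in range(20):                      # EM iterations
    count = {pair: 0.0 for pair in t}    # expected counts c(f, e)
    total = {e: 0.0 for e in e_vocab}    # expected counts c(e)
    for e_sent, f_sent in corpus:
        for f in f_sent:                 # E-step: soft-align each f word
            z = sum(t[(f, e)] for e in e_sent)
            for e in e_sent:
                delta = t[(f, e)] / z
                count[(f, e)] += delta
                total[e] += delta
    for (f, e) in t:                     # M-step: renormalize
        t[(f, e)] = count[(f, e)] / total[e]

print(sorted(((t[(f, "book")], f) for f in f_vocab), reverse=True)[:2])
# "livre" ends up as the most probable translation of "book".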
Alignment examples from Chris Manning / CS224n

Step 1: Alignment (harder)
[Example: "And the program has been implemented" ↔ "Le programme a été mis en application": a one-to-many alignment ("implemented" → "mis en application") and a zero-fertility word that is not translated.]

Step 1: Alignment (harder)
[Example: "The balance was the territory of the aboriginal people" ↔ "Le reste appartenait aux autochtones": many-to-one alignments.]

Step 1: Alignment (hardest)
Really hard :/
[Example: "The poor don't have any money" ↔ "Les pauvres sont démunis": a many-to-many (phrase) alignment.]
Step 1: Alignment
•  We could spend an entire lecture on alignment models
•  Not only single words but could use phrases, syntax
•  Then consider reordering of translated phrases

Translation Process
•  Task: translate this sentence from German into English
•  Pick a phrase in the input and translate it
[Example: "er geht ja nicht nach hause" → "he does not go home", built from the phrase pairs "er" → "he", "ja nicht" → "does not", "geht" → "go", "nach hause" → "home".]
Example from Philipp Koehn

After many steps: each phrase in the source language has many possible translations, resulting in a large search space.

Translation Options
[Figure: the lattice of translation options for "er geht ja nicht nach hause". Each word or phrase has many candidates, e.g. "er" → {he, it}; "geht" → {goes, go, is, are}; "ja" → {yes, of course, untranslated}; "nicht" → {not, do not, does not, is not}; "nach" → {after, to, according to, in}; "hause" → {house, home, chamber, at home}; plus multi-word options such as "it is", "he will be", "he goes", "is after all", "return home", "not to", "is not a". Many translation options to choose from.]
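The blow-up can be made concrete with a few lines of code; the option table below is a small, made-up subset of the figure above, with a fixed one-word segmentation and no reordering.

from itertools import product

# Even this crude setting multiplies per-word option counts.
options = {
    "er":    ["he", "it"],
    "geht":  ["goes", "go", "is"],
    "ja":    ["yes", "of course", ""],     # "" = left untranslated
    "nicht": ["not", "does not", "is not"],
    "nach":  ["after", "to", "in"],
    "hause": ["house", "home", "at home"],
}
sentence = ["er", "geht", "ja", "nicht", "nach", "hause"]
print(len(list(product(*(options[w] for w in sentence)))))  # 2*3*3*3*3*3 = 486
# Allowing multi-word phrases and reorderings makes the space far larger still.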
Decode: Search for the best of many hypotheses
A hard search problem that also includes the language model.

Decoding: Find the Best Path
[Figure: the decoding lattice for "er geht ja nicht nach hause", expanding partial hypotheses such as "yes", "he", "are", "it", "goes", "does not", "go", "home", "to"; backtrack from the highest-scoring complete hypothesis.]
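To make the argmax p(f|e) p(e) formulation concrete, here is a toy noisy-channel "decoder" over an explicit candidate list. The candidates and their translation-model and language-model log-scores are made-up numbers; a real decoder searches an exponentially large space of segmentations and reorderings with beam search instead.

f = "er geht ja nicht nach hause"
candidates = {
    "he does not go home":     {"log_tm": -4.1, "log_lm": -7.2},
    "he goes not after house": {"log_tm": -3.5, "log_lm": -13.4},
    "it is not to home":       {"log_tm": -5.9, "log_lm": -9.8},
}

def score(e):
    s = candidates[e]
    return s["log_tm"] + s["log_lm"]   # log p(f|e) + log p(e)

best = max(candidates, key=score)
print(best, score(best))               # fluent candidate wins via the LM term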
Traditional MT
•  Skipped hundreds of important details
•  A lot of human feature engineering
•  Very complex systems
•  Many different, independent machine learning problems

Deep learning to the rescue! … ?
Maybe we could translate directly with an RNN?
[Figure: an encoder RNN reads the source words x1, x2, x3 ("Echt dicke Kiste") into hidden states h1, h2, h3; the decoder then emits the translation y1, y2 ("Awesome sauce"). The last encoder hidden state needs to capture the entire phrase!]

MT with RNNs – Simplest Model
In the simplest model the encoder and decoder reuse the same recurrence weights W (hence extension 1 below); minimize the cross-entropy error for all target words conditioned on the source words.
It's not quite that simple ;)

RNN Translation Model Extensions
1. Train different RNN weights for encoding and decoding.
[Figure: the same encoder-decoder picture for "Echt dicke Kiste" → "Awesome sauce", now with separate weight matrices W for the encoder and the decoder.]

RNN Translation Model Extensions
Notation: each input of φ has its own linear transformation matrix.
2. Compute every hidden state in the decoder from:
  •  the previous hidden state (standard)
  •  the last hidden vector of the encoder, c = h_T
  •  the previous predicted output word, y_{t−1}

(Background from Cho et al. 2014, Section 2 "RNN Encoder-Decoder": an RNN consists of a hidden state h and an optional output y operating on a variable-length sequence x = (x_1, ..., x_T). At each time step t the hidden state is updated by h_t = f(h_{t−1}, x_t), where f is a non-linear activation function; f may be as simple as an element-wise logistic sigmoid and as complex as a long short-term memory (LSTM) unit (Hochreiter and Schmidhuber, 1997). An RNN can learn a probability distribution over a sequence by being trained to predict the next symbol, in which case the output at each time step is a conditional distribution. The paper's Figure 1 illustrates the proposed RNN Encoder-Decoder; note that the input and output sequence lengths T and T′ may differ.)
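As a companion to the extensions above, here is a minimal numpy sketch of an encoder-decoder forward pass with separate encoder/decoder weights (extension 1) and a decoder conditioned on its previous state, the encoder summary c = h_T, and the previous output word (extension 2). All sizes and weight names are illustrative assumptions, not the lecture's reference code.

import numpy as np

rng = np.random.default_rng(0)
V, D, H = 10, 8, 16          # vocab size, word-vector size, hidden size

E   = rng.normal(0, 0.1, (V, D))      # word embeddings
W_e = rng.normal(0, 0.1, (H, H))      # encoder: hidden-to-hidden
U_e = rng.normal(0, 0.1, (H, D))      # encoder: input-to-hidden
W_d = rng.normal(0, 0.1, (H, H))      # decoder: hidden-to-hidden
C_d = rng.normal(0, 0.1, (H, H))      # decoder: encoder summary c = h_T
Y_d = rng.normal(0, 0.1, (H, D))      # decoder: previous output word
W_s = rng.normal(0, 0.1, (V, H))      # softmax output layer

def softmax(v):
    v = v - v.max()
    e = np.exp(v)
    return e / e.sum()

def encode(src_ids):
    h = np.zeros(H)
    for t in src_ids:                 # read the source left to right
        h = np.tanh(W_e @ h + U_e @ E[t])
    return h                          # c = h_T summarizes the source

def decode(c, max_len=5):
    h, y_prev, out = np.zeros(H), np.zeros(D), []
    for _ in range(max_len):
        h = np.tanh(W_d @ h + C_d @ c + Y_d @ y_prev)
        p = softmax(W_s @ h)          # distribution over target words
        w = int(p.argmax())           # greedy decoding, for the sketch only
        out.append(w)
        y_prev = E[w]
    return out

print(decode(encode([1, 2, 3])))      # random weights, so arbitrary output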
Different picture, same idea
[Figure from Kyunghyun Cho et al. 2014: the encoder maps the source words e = (Economic, growth, has, slowed, down, in, recent, years, .) through 1-of-K coding, continuous-space word representations, and recurrent states into a summary; the decoder's recurrent state then produces word probabilities and word samples for f = (La, croissance, économique, s'est, ralentie, ces, dernières, années, .).]
RNN Translation Model Extensions
3. Train stacked/deep RNNs with multiple layers.
4. Potentially train a bidirectional encoder.
[Figure: the same "Going Deep" stack as before: the input x feeds hidden layers h(1), h(2), h(3) and then the output y, using the forward/backward equations for h_t^(i) and y_t given earlier. Each memory layer passes an intermediate representation to the next.]
5. Train the input sequence in reverse order, for an easier optimization problem: instead of A B C → X Y, train with C B A → X Y.

6. Main Improvement: Better Units
•  More complex hidden unit computation in the recurrence!
•  Gated Recurrent Units (GRU), introduced by Cho et al. 2014 (see reading list)
•  Main ideas:
  •  keep around memories to capture long-distance dependencies
  •  allow error messages to flow at different strengths depending on the inputs

GRUs
•  A standard RNN computes the hidden layer at the next time step directly: h_t = f(W^(hh) h_{t−1} + W^(hx) x_t)
•  A GRU first computes an update gate (another layer) based on the current input word vector and the hidden state: z_t = σ(W^(z) x_t + U^(z) h_{t−1})
•  Compute the reset gate similarly, but with different weights: r_t = σ(W^(r) x_t + U^(r) h_{t−1})

GRUs
•  Update gate: z_t = σ(W^(z) x_t + U^(z) h_{t−1})
•  Reset gate: r_t = σ(W^(r) x_t + U^(r) h_{t−1})
•  New memory content: h̃_t = tanh(W x_t + r_t ∘ U h_{t−1}). If a reset gate unit is ~0, it ignores the previous memory and only stores the new word information.
•  Final memory at time step t combines the current and previous time steps: h_t = z_t ∘ h_{t−1} + (1 − z_t) ∘ h̃_t

Attempt at a clean illustration
[Figure: at time steps t−1 and t, the input x_t and previous hidden state h_{t−1} feed the reset gate r_t and update gate z_t; the reset gate controls the candidate memory h̃_t, and the update gate mixes h_{t−1} and h̃_t into the final memory h_t.]

GRU intuition
•  If the reset gate is close to 0, ignore the previous hidden state → allows the model to drop information that is irrelevant in the future.
•  The update gate z controls how much of the past state should matter now.
•  If z is close to 1, we can copy information in that unit through many time steps: less vanishing gradient!
•  Units with short-term dependencies often have very active reset gates.

GRU intuition
•  Units with long-term dependencies have active update gates z.
•  Illustration:
[Figure 2 from Cho et al. 2014: the proposed hidden activation function. The update gate z selects whether the hidden state is to be updated with a new hidden state h̃; the reset gate r decides whether the previous hidden state is ignored (see Eqs. (5)–(8) of the paper for the detailed equations of r, z, h̃ and h).]
•  Derivative of ...? → the rest is the same chain rule, but implement it with modularization or automatic differentiation.
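Before moving on to LSTMs, a minimal numpy sketch of a single GRU step, implementing exactly the update-gate, reset-gate, new-memory and final-memory equations above (a bias-free toy version, not the lecture's code):

import numpy as np

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU step. W* have shape (H, D), U* have shape (H, H)."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(Wz @ x_t + Uz @ h_prev)              # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate
    h_tilde = np.tanh(W @ x_t + r * (U @ h_prev))    # new memory content
    return z * h_prev + (1.0 - z) * h_tilde          # final memory

# Tiny usage example with random parameters (D = 4 inputs, H = 3 hidden units).
rng = np.random.default_rng(0)
D, H = 4, 3
params = [rng.normal(0, 0.1, s) for s in
          [(H, D), (H, H), (H, D), (H, H), (H, D), (H, H)]]
print(gru_step(rng.normal(size=D), np.zeros(H), *params))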
Long Short-Term Memories (LSTMs)
•  We can make the units even more complex
•  Allow each time step to modify:
  •  Input gate (whether the current cell matters): i_t = σ(W^(i) x_t + U^(i) h_{t−1})
  •  Forget gate (0 means forget the past): f_t = σ(W^(f) x_t + U^(f) h_{t−1})
  •  Output gate (how much the cell is exposed): o_t = σ(W^(o) x_t + U^(o) h_{t−1})
  •  New memory cell: c̃_t = tanh(W^(c) x_t + U^(c) h_{t−1})
•  Final memory cell: c_t = f_t ∘ c_{t−1} + i_t ∘ c̃_t
•  Final hidden state: h_t = o_t ∘ tanh(c_t)
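A matching numpy sketch of a single LSTM step for the gate equations above (again bias-free and purely illustrative); stacking several such layers, with each layer reading the h of the layer below, gives the deep LSTMs discussed next.

import numpy as np

def lstm_step(x_t, h_prev, c_prev, Wi, Ui, Wf, Uf, Wo, Uo, Wc, Uc):
    """One LSTM step. W* have shape (H, D), U* have shape (H, H)."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    i = sigmoid(Wi @ x_t + Ui @ h_prev)        # input gate
    f = sigmoid(Wf @ x_t + Uf @ h_prev)        # forget gate
    o = sigmoid(Wo @ x_t + Uo @ h_prev)        # output gate
    c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev)  # new memory cell candidate
    c = f * c_prev + i * c_tilde               # final memory cell
    h = o * np.tanh(c)                         # final hidden state
    return h, c

rng = np.random.default_rng(0)
D, H = 4, 3
params = [rng.normal(0, 0.1, s) for s in [(H, D), (H, H)] * 4]
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), *params)
print(h, c)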
Illustrations a bit overwhelming ;)
[Figure: the original LSTM memory-cell diagram, with the cell state update s_c = s_c + g·y^in, input and output gates y^in and y^out with their nets net_in, net_out and weights w_in, w_out, w_c, and the self-recurrent connection of weight 1.0 (the constant error carousel).]
Long Short-Term Memory by Hochreiter and Schmidhuber (1997)
http://people.idsia.ch/~juergen/lstm/sld017.htm
http://deeplearning.net/tutorial/lstm.html
Intuition: memory cells can keep information intact unless the inputs make them forget it or overwrite it with new input. The cell can decide to output this information or just store it.

LSTMs are currently very hip!
•  En vogue default model for most sequence labeling tasks
•  Very powerful, especially when stacked and made even deeper (each hidden layer is already computed by a deep internal network)
•  Most useful if you have lots and lots of data

Deep LSTMs don't outperform traditional MT yet
Method | test BLEU score (ntst14)
Bahdanau et al. [2] | 28.45
Baseline System [29] | 33.30
Single forward LSTM, beam size 12 | 26.17
Single reversed LSTM, beam size 12 | 30.59
Ensemble of 5 reversed LSTMs, beam size 1 | 33.00
Ensemble of 2 reversed LSTMs, beam size 12 | 33.27
Ensemble of 5 reversed LSTMs, beam size 2 | 34.50
Ensemble of 5 reversed LSTMs, beam size 12 | 34.81
Table 1: The performance of the LSTM on the WMT'14 English to French test set (ntst14). Note that an ensemble of 5 LSTMs with a beam of size 2 is cheaper than a single LSTM with a beam of size 12.

Method | test BLEU score (ntst14)
Baseline System [29] | 33.30
Cho et al. [5] | 34.54
Best WMT'14 result [9] | 37.0
Rescoring the baseline 1000-best with a single forward LSTM | 35.61
Rescoring the baseline 1000-best with a single reversed LSTM | 35.85
Rescoring the baseline 1000-best with an ensemble of 5 reversed LSTMs | 36.5
Oracle Rescoring of the Baseline 1000-best lists | ∼45
Table 2: Methods that use neural networks together with an SMT system on the WMT'14 English to French test set (ntst14).

Tables from Sequence to Sequence Learning by Sutskever et al. 2014
Deep LSTM for Machine Translation
Model Analysis: PCA of the vectors from the last time step's hidden layer.
[Figure 2 from Sequence to Sequence Learning, Sutskever et al. 2014: a 2-dimensional PCA projection of the LSTM hidden states obtained after processing phrases such as "John admires Mary" / "Mary admires John" / "John is in love with Mary" and "I gave her a card in the garden" / "In the garden, she gave me a card" / "She was given a card by me in the garden". The phrases are clustered by meaning, which in these examples is primarily a function of word order and would be difficult to capture with a bag-of-words model; both clusters have similar internal structure. As the paper notes, one attractive feature of the model is its ability to turn a sequence of words into a vector, and the LSTM was surprisingly good on long sentences.]
Further Improvements: More Gates!
Gated Feedback Recurrent Neural Networks, Chung et al. 2015
[Figure 1 from the paper: (a) the conventional stacking approach and (b) the gated-feedback approach to forming a deep RNN architecture; bullets in (b) correspond to global reset gates, and skip connections are omitted to simplify the visualization.]
The hidden state of layer j is computed by
h_t^j = tanh( W^(j−1→j) h_t^(j−1) + Σ_{i=1..L} g^(i→j) U^(i→j) h_{t−1}^i ),
where the global reset gates g^(i→j) are computed (with a sigmoid) from the current input to layer j and the hidden states of all layers at the previous time step.
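A small numpy sketch of the gated-feedback update above, with the gates g taken as given (in the paper they are produced by per-pair global reset gates); names and sizes are illustrative only.

import numpy as np

def gated_feedback_layer(h_below_t, h_all_prev, g, W, U_list):
    # h_below_t: h_t^{j-1}; h_all_prev: [h_{t-1}^1, ..., h_{t-1}^L]
    feedback = sum(g[i] * (U_list[i] @ h_all_prev[i]) for i in range(len(U_list)))
    return np.tanh(W @ h_below_t + feedback)

rng = np.random.default_rng(0)
L_layers, H = 3, 4
h_all_prev = [rng.normal(size=H) for _ in range(L_layers)]   # states at t-1
h_below_t = rng.normal(size=H)                               # layer below at t
W = rng.normal(0, 0.1, (H, H))
U_list = [rng.normal(0, 0.1, (H, H)) for _ in range(L_layers)]
g = rng.uniform(0, 1, L_layers)                              # global reset gates
print(gated_feedback_layer(h_below_t, h_all_prev, g, W, U_list))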
Summary
•  Recurrent Neural Networks are powerful
•  A lot of ongoing work right now
•  Gated Recurrent Units even better
•  LSTMs maybe even better (jury still out)
•  This was an advanced lecture → gain intuition, encourage exploration
•  Next up: Recursive Neural Networks, simpler and also powerful :)