CS224d: Deep NLP
Lecture 7: Fancy Recurrent Neural Networks for Machine Translation
Richard Socher
[email protected]
4/22/15

Overview
• Wrapping up: bidirectional and deep RNNs for opinion labeling
• Machine translation
• RNN models tackling MT:
  • Gated Recurrent Units by Cho et al. (2014)
  • Long Short-Term Memories by Hochreiter and Schmidhuber (1997)

Deep Bidirectional RNNs by Irsoy and Cardie
Going deep: each memory layer passes an intermediate sequential representation to the next.
  \overrightarrow{h}_t^{(i)} = f(\overrightarrow{W}^{(i)} h_t^{(i-1)} + \overrightarrow{V}^{(i)} \overrightarrow{h}_{t-1}^{(i)} + \overrightarrow{b}^{(i)})
  \overleftarrow{h}_t^{(i)} = f(\overleftarrow{W}^{(i)} h_t^{(i-1)} + \overleftarrow{V}^{(i)} \overleftarrow{h}_{t+1}^{(i)} + \overleftarrow{b}^{(i)})
  y_t = g(U[\overrightarrow{h}_t^{(L)}; \overleftarrow{h}_t^{(L)}] + c)

Data
• MPQA 1.2 corpus (Wiebe et al., 2005)
• consists of 535 news articles (11,111 sentences)
• manually labeled at the phrase level
• Evaluation: F1

Results: Deep vs. Shallow RNNs
[Plots: Prop F1 and Bin F1 as a function of the number of layers (1-5), for DSE and ESE, with 24k and 200k curves.]

Machine translation
• Methods are statistical
• Use parallel corpora, e.g. the European Parliament proceedings
• First parallel corpus: the Rosetta Stone
• Traditional systems are very complex

Current statistical machine translation systems
• Source language f, e.g. French
• Target language e, e.g. English
• Probabilistic formulation (using Bayes' rule):
    \hat{e} = argmax_e p(e | f) = argmax_e p(f | e) p(e)
• Translation model p(f | e), trained on a parallel corpus
• Language model p(e), trained on an English-only corpus (lots of it, and free!)
• Pipeline: French → translation model p(f | e) → pieces of English → decoder argmax_e p(f | e) p(e), together with the language model p(e) → proper English
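To make the p(f | e) p(e) factorization concrete, here is a toy sketch of noisy-channel scoring in Python. The two log-probability functions and the candidate list are hypothetical stand-ins for a real translation model, language model, and decoder search; they are not part of the lecture.

```python
import math  # not strictly needed for the toy scores, kept for clarity

# Hypothetical toy scores standing in for real models:
# a translation model p(f|e) and a language model p(e).
def log_p_f_given_e(f, e):
    # In a real system this comes from an alignment/phrase model
    # trained on a parallel corpus (e.g. European Parliament).
    return -0.5 * abs(len(f.split()) - len(e.split()))

def log_p_e(e):
    # In a real system this comes from a language model trained on
    # large amounts of monolingual English text.
    return -0.1 * len(e.split())

def decode(f, candidates):
    """Pick argmax_e p(f|e) * p(e), working in log space."""
    return max(candidates, key=lambda e: log_p_f_given_e(f, e) + log_p_e(e))

french = "Le Japon secoué par deux nouveaux séismes"
candidates = ["Japan shaken by two new quakes",
              "Japan is shaken by two new earthquakes",
              "The Japan shake by two new quake"]
print(decode(french, candidates))
```

A real decoder searches over a huge space of candidate translations rather than a fixed list, which is exactly where the alignment and phrase-table machinery below comes in.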
Step 1: Alignment
Goal: know which words or phrases in the source language would translate to which words or phrases in the target language. Hard already!
We can factor the translation model p(f | e) by identifying alignments (correspondences) between words in f and words in e.
Example: "Le Japon secoué par deux nouveaux séismes" ↔ "Japan shaken by two new quakes" (with "Le" as a spurious word).
Alignment examples from Chris Manning / CS224n

Step 1: Alignment (harder)
One-to-many alignment: "Le programme a été mis en application" ↔ "And the program has been implemented" (e.g. "implemented" ↔ "mis en application"), plus a zero-fertility word that is not translated.

Step 1: Alignment (harder)
Many-to-one alignments: "Le reste appartenait aux autochtones" ↔ "The balance was the territory of the aboriginal people".

Step 1: Alignment (hardest)
Many-to-many (phrase) alignment: "Les pauvres sont démunis" ↔ "The poor don't have any money". Really hard :/

Step 1: Alignment
• We could spend an entire lecture on alignment models
• Not only single words but could use phrases, syntax
• Then consider reordering of the translated phrases

Translation process
• Task: translate this sentence from German into English: "er geht ja nicht nach hause"
• Pick a phrase in the input and translate it: step by step this yields "he does not go home"
• After many steps of picking and translating phrases, the sentence is fully translated
Example from Philipp Koehn

Translation options
Each phrase in the source language has many possible translations, resulting in a large search space:
[Phrase table excerpt with translation options for each word and phrase of "er geht ja nicht nach hause", e.g. er → he / it, geht → goes / go / is / are, nicht → not / do not / does not / is not, nach → after / to / according to / in, hause → house / home / chamber / at home.]
• Many translation options to choose from

Decode: search for the best of many hypotheses
Hard search problem that also includes the language model.
[Decoding lattice for "er geht ja nicht nach hause"; backtrack from the highest-scoring complete hypothesis.]

Traditional MT
• Skipped hundreds of important details
• A lot of human feature engineering
• Very complex systems
• Many different, independent machine learning problems

Deep learning to the rescue! … ?
Maybe we could translate directly with an RNN?
[Encoder–decoder RNN: the encoder reads x_1, x_2, x_3 = "Echt dicke Kiste" into hidden states h_1, h_2, h_3; the decoder then emits y_1, y_2 = "Awesome sauce".]
The last encoder hidden state needs to capture the entire phrase!

MT with RNNs – simplest model
Encoder: h_t = \phi(h_{t-1}, x_t) = f(W^{(hh)} h_{t-1} + W^{(hx)} x_t)
Decoder: h_t = \phi(h_{t-1}) = f(W^{(hh)} h_{t-1}),   y_t = softmax(W^{(S)} h_t)
Minimize the cross-entropy error for all target words conditioned on the source words.
It's not quite that simple ;)

RNN translation model extensions
1. Train different RNN weights for encoding and decoding (same picture as above, but with separate recurrence matrices for the encoder and the decoder; see the sketch below).
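A minimal NumPy sketch of extension 1, with separate encoder and decoder recurrence matrices, randomly initialized toy weights, and greedy decoding for a fixed number of steps. All names and sizes are illustrative, not the lecture's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, vocab = 8, 8, 20            # tiny illustrative sizes

# Separate recurrence matrices for encoder and decoder (extension 1).
W_enc = rng.normal(scale=0.1, size=(d_h, d_h))
W_dec = rng.normal(scale=0.1, size=(d_h, d_h))
W_x   = rng.normal(scale=0.1, size=(d_h, d_x))   # input word vectors -> hidden
U_out = rng.normal(scale=0.1, size=(vocab, d_h)) # hidden -> output word scores

def softmax(v):
    v = v - v.max()
    e = np.exp(v)
    return e / e.sum()

def encode(source_vectors):
    h = np.zeros(d_h)
    for x in source_vectors:              # read the source left to right
        h = np.tanh(W_enc @ h + W_x @ x)
    return h                              # last hidden state summarizes the phrase

def decode(h, n_steps):
    outputs = []
    for _ in range(n_steps):
        h = np.tanh(W_dec @ h)            # decoder recurrence (simplest form)
        y = softmax(U_out @ h)            # distribution over target words
        outputs.append(int(y.argmax()))   # greedy choice of the next word id
    return outputs

source = [rng.normal(size=d_x) for _ in range(3)]  # e.g. word vectors for "Echt dicke Kiste"
print(decode(encode(source), n_steps=3))
```

In a trained system the weights would come from minimizing the cross-entropy objective above, and decoding would stop at an end-of-sentence symbol rather than after a fixed number of steps.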
RNN translation model extensions
2. Compute every hidden state in the decoder from:
  • the previous hidden state (standard)
  • the last hidden vector of the encoder, c = h_T
  • the previous predicted output word, y_{t-1}
Notation: each input of \phi gets its own linear transformation matrix, so the simple recurrence h_t = \phi(h_{t-1}, x_t) becomes, in the decoder, h_t = f(h_{t-1}, y_{t-1}, c).

Background (Cho et al. 2014): an RNN consists of a hidden state h and an optional output y, operating on a variable-length sequence x = (x_1, ..., x_T). At each time step t the hidden state is updated by
  h_{\langle t \rangle} = f(h_{\langle t-1 \rangle}, x_t),    (1)
where f is a non-linear activation function; f may be as simple as an element-wise logistic sigmoid and as complex as a long short-term memory (LSTM) unit (Hochreiter and Schmidhuber, 1997). An RNN can learn a probability distribution over a sequence by being trained to predict the next symbol; the output at each time step t is then a conditional distribution. In the RNN Encoder–Decoder, the input and output sequence lengths T and T' may differ, and the encoder is an RNN that reads each symbol of the input sequence in turn.
[Figure 1 (Cho et al. 2014): an illustration of the proposed RNN Encoder–Decoder.]

Different picture, same idea
[Figure (Kyunghyun Cho et al. 2014): the encoder maps the 1-of-K coded source words e = (Economic, growth, has, slowed, down, in, recent, years, .) through continuous-space word representations into recurrent encoder states; the decoder's recurrent states produce word probabilities from which the target words f = (La, croissance, économique, s'est, ralentie, ces, dernières, années, .) are sampled.]

RNN translation model extensions
3. Train stacked/deep RNNs with multiple layers (going deep: each memory layer passes an intermediate sequential representation to the next).
4. Potentially train a bidirectional encoder.
5. Train on the input sequence in reverse order for a simpler optimization problem: instead of A B C → X Y, train with C B A → X Y.
6. Main improvement: better units
  • More complex hidden unit computation in the recurrence!
  • Gated Recurrent Units (GRU), introduced by Cho et al. 2014 (see reading list)
  • Main ideas:
    • keep around memories to capture long-distance dependencies
    • allow error messages to flow at different strengths depending on the inputs

GRUs
• A standard RNN computes the hidden layer at the next time step directly:
    h_t = f(W^{(hh)} h_{t-1} + W^{(hx)} x_t)
• A GRU first computes an update gate (another layer) based on the current input word vector and the hidden state:
    z_t = \sigma(W^{(z)} x_t + U^{(z)} h_{t-1})
• It computes a reset gate similarly, but with different weights:
    r_t = \sigma(W^{(r)} x_t + U^{(r)} h_{t-1})

GRUs
• Update gate: z_t = \sigma(W^{(z)} x_t + U^{(z)} h_{t-1})
• Reset gate: r_t = \sigma(W^{(r)} x_t + U^{(r)} h_{t-1})
• New memory content: \tilde{h}_t = tanh(W x_t + r_t \circ U h_{t-1}). If the reset gate is ~0, this ignores the previous memory and only stores the new word information.
• Final memory at time step t combines the current and previous time steps: h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t

Attempt at a clean illustration
[Diagram: the input x_t feeds the reset gate r_t, the update gate z_t and the new memory \tilde{h}_t; \tilde{h}_t combines the input with the reset-gated previous state h_{t-1}; the update gate then mixes h_{t-1} and \tilde{h}_t into the final memory h_t.]

GRU intuition
• If the reset gate is close to 0, the previous hidden state is ignored → the model can drop information that is irrelevant in the future.
• The update gate z controls how much of the past state should matter now.
• If z is close to 1, we can copy information in that unit through many time steps: less vanishing gradient!
• Units with short-term dependencies often have very active reset gates.

GRU intuition
• Units with long-term dependencies have active update gates z.
• Illustration [Figure 2, Cho et al. 2014]: the update gate z selects whether the hidden state is to be updated with a new hidden state \tilde{h}; the reset gate r decides whether the previous hidden state is ignored (see Eqs. (5)-(8) of Cho et al. for the detailed equations of r, z, \tilde{h} and h).
• Derivative of the gated recurrence? The rest is the same chain rule, but implement it with modularization or automatic differentiation.
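Putting the update gate, reset gate, new memory content and final memory together, one GRU step can be sketched as follows. This is a minimal NumPy sketch; the parameter names mirror the equations above and the sizes are illustrative.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, params):
    """One GRU step following the equations above (element-wise gates)."""
    Wz, Uz, Wr, Ur, W, U = params
    z = sigmoid(Wz @ x_t + Uz @ h_prev)              # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate
    h_tilde = np.tanh(W @ x_t + r * (U @ h_prev))    # new memory content
    return z * h_prev + (1.0 - z) * h_tilde          # final memory: mix old and new

d = 4
rng = np.random.default_rng(0)
params = [rng.normal(scale=0.1, size=(d, d)) for _ in range(6)]
h = np.zeros(d)
for x in rng.normal(size=(5, d)):                    # run over a short toy input sequence
    h = gru_step(x, h, params)
print(h)
```

Note how a unit with z near 1 simply copies its previous value, which is exactly the "copy information through many time steps" intuition above.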
Long short-term memories (LSTMs)
• We can make the units even more complex
• Allow each time step to modify:
  • Input gate (does the current cell matter?): i_t = \sigma(W^{(i)} x_t + U^{(i)} h_{t-1})
  • Forget gate (if 0, forget the past): f_t = \sigma(W^{(f)} x_t + U^{(f)} h_{t-1})
  • Output gate (how much of the cell is exposed): o_t = \sigma(W^{(o)} x_t + U^{(o)} h_{t-1})
  • New memory cell: \tilde{c}_t = tanh(W^{(c)} x_t + U^{(c)} h_{t-1})
• Final memory cell: c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t
• Final hidden state: h_t = o_t \circ tanh(c_t)

Illustrations are a bit overwhelming ;)
[LSTM cell diagram from Long Short-Term Memory by Hochreiter and Schmidhuber (1997)]
http://people.idsia.ch/~juergen/lstm/sld017.htm
http://deeplearning.net/tutorial/lstm.html
Intuition: memory cells can keep information intact, unless the inputs make them forget it or overwrite it with new input. A cell can decide to output this information or just store it.

LSTMs are currently very hip!
• En vogue default model for most sequence labeling tasks
• Very powerful, especially when stacked and made even deeper (each hidden layer is already computed by a deep internal network)
• Most useful if you have lots and lots of data
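For comparison with the GRU sketch, here is a minimal NumPy sketch of one LSTM step using the gates and memory cell described above; parameter names and sizes are illustrative.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step with input, forget and output gates (as above)."""
    Wi, Ui, Wf, Uf, Wo, Uo, Wc, Uc = params
    i = sigmoid(Wi @ x_t + Ui @ h_prev)           # input gate: does the current cell matter?
    f = sigmoid(Wf @ x_t + Uf @ h_prev)           # forget gate: 0 means forget the past
    o = sigmoid(Wo @ x_t + Uo @ h_prev)           # output gate: how much of the cell is exposed
    c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev)     # new memory cell candidate
    c = f * c_prev + i * c_tilde                  # final memory cell
    h = o * np.tanh(c)                            # final hidden state
    return h, c

d = 4
rng = np.random.default_rng(0)
params = [rng.normal(scale=0.1, size=(d, d)) for _ in range(8)]
h, c = np.zeros(d), np.zeros(d)
for x in rng.normal(size=(5, d)):                 # run over a short toy input sequence
    h, c = lstm_step(x, h, c, params)
print(h)
```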
Deep LSTMs don't outperform traditional MT yet

Table 1 (Sutskever et al. 2014): performance of the LSTM on the WMT'14 English-to-French test set (ntst14). Note that an ensemble of 5 LSTMs with a beam of size 2 is cheaper than a single LSTM with a beam of size 12.
  Method                                       | test BLEU (ntst14)
  Bahdanau et al. [2]                          | 28.45
  Baseline System [29]                         | 33.30
  Single forward LSTM, beam size 12            | 26.17
  Single reversed LSTM, beam size 12           | 30.59
  Ensemble of 5 reversed LSTMs, beam size 1    | 33.00
  Ensemble of 2 reversed LSTMs, beam size 12   | 33.27
  Ensemble of 5 reversed LSTMs, beam size 2    | 34.50
  Ensemble of 5 reversed LSTMs, beam size 12   | 34.81

Table 2 (Sutskever et al. 2014): methods that use neural networks together with an SMT system on the WMT'14 English-to-French test set (ntst14).
  Method                                                                | test BLEU (ntst14)
  Baseline System [29]                                                  | 33.30
  Cho et al. [5]                                                        | 34.54
  Best WMT'14 result [9]                                                | 37.0
  Rescoring the baseline 1000-best with a single forward LSTM           | 35.61
  Rescoring the baseline 1000-best with a single reversed LSTM          | 35.85
  Rescoring the baseline 1000-best with an ensemble of 5 reversed LSTMs | 36.5
  Oracle rescoring of the baseline 1000-best lists                      | ~45

Sequence to Sequence Learning by Sutskever et al. 2014. The authors were surprised to discover that the LSTM did well on long sentences.

Deep LSTM for machine translation
PCA of vectors from the last time step's hidden layer:
[Figure 2 (Sutskever et al. 2014): a 2-dimensional PCA projection of the LSTM hidden states obtained after processing phrases such as "John admires Mary" / "Mary admires John" and "She gave me a card in the garden" / "In the garden, she gave me a card". The phrases are clustered by meaning, which in these examples is primarily a function of word order and would be difficult to capture with a bag-of-words model. Both clusters have similar internal structure.]
One of the attractive features of the model is its ability to turn a sequence of words into a vector.

Further improvements: more gates!
Gated Feedback Recurrent Neural Networks, Chung et al. 2015
[Figure 1 (Chung et al. 2015): illustrations of (a) the conventional stacking approach and (b) the gated-feedback approach to forming a deep RNN architecture. Bullets in (b) correspond to global reset gates. Skip connections are omitted to simplify the visualization.]
The global reset gates g^{i→j} gate the feedback from every layer i at the previous time step into layer j; the hidden state of a tanh layer j is then computed as
  h_t^j = tanh( W^{j-1→j} h_t^{j-1} + \sum_{i=1}^{L} g^{i→j} U^{i→j} h_{t-1}^i )
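As a rough, simplified sketch of the gated-feedback idea (not the authors' exact parameterization): each layer j receives feedback from every layer i at the previous time step, scaled by a scalar global reset gate g^{i→j} computed from the layer below and the concatenated previous hidden states. Weight shapes and names are illustrative.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gated_feedback_step(x_t, h_prev, W, U, w_g, u_g):
    """One time step of an L-layer gated-feedback tanh-RNN stack (simplified).

    h_prev:  list of L hidden vectors from time t-1.
    W[j]:    input -> layer-j matrix (layer 0 sees x_t, higher layers see h_t^{j-1}).
    U[i][j]: feedback matrix from layer i (at t-1) to layer j (at t).
    w_g[j], u_g[i][j]: vectors producing the scalar global reset gates g^{i->j}.
    """
    L = len(h_prev)
    h_star = np.concatenate(h_prev)               # all previous hidden states
    h_new, below = [], x_t
    for j in range(L):
        feedback = np.zeros_like(h_prev[j])
        for i in range(L):
            g = sigmoid(w_g[j] @ below + u_g[i][j] @ h_star)  # global reset gate g^{i->j}
            feedback += g * (U[i][j] @ h_prev[i])
        h_j = np.tanh(W[j] @ below + feedback)
        h_new.append(h_j)
        below = h_j                               # input to the next layer up
    return h_new

d, L = 4, 3
rng = np.random.default_rng(0)
W   = [rng.normal(scale=0.1, size=(d, d)) for _ in range(L)]
U   = [[rng.normal(scale=0.1, size=(d, d)) for _ in range(L)] for _ in range(L)]
w_g = [rng.normal(scale=0.1, size=d) for _ in range(L)]
u_g = [[rng.normal(scale=0.1, size=L * d) for _ in range(L)] for _ in range(L)]
h   = [np.zeros(d) for _ in range(L)]
for x in rng.normal(size=(5, d)):                 # run over a short toy input sequence
    h = gated_feedback_step(x, h, W, U, w_g, u_g)
print(h[-1])
```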
Summary
• Recurrent Neural Networks are powerful
• A lot of ongoing work right now
• Gated Recurrent Units even better
• LSTMs maybe even better (jury still out)
• This was an advanced lecture → gain intuition, encourage exploration
• Next up: Recursive Neural Networks, simpler and also powerful :)