Slide - Deep Learning for Natural Language Processing

CS224d: Deep NLP
Lecture 10: Advanced Recursive Neural Networks
Richard Socher
[email protected]

Recursive Neural Networks
•  Focused on compositional representation learning of hierarchical structure, features and predictions
•  Different combinations of:
1.  Training Objective
2.  Composition Function
3.  Tree Structure
Overview
Last lecture: Recursive Neural Networks
This lecture: Different RNN composition functions and NLP tasks
1.  Standard RNNs: Paraphrase detection
2.  Matrix-Vector RNNs: Relation classification
3.  Recursive Neural Tensor Networks: Sentiment Analysis
4.  Tree LSTMs: Phrase Similarity

Next lecture
•  Review for Midterm. Going over PSet solutions and common problems/questions from office hours. Please prepare questions.
Richard Socher, 4/28/15

Applications and Models
•  Note: All models can be applied to all tasks
•  More powerful models are needed for harder tasks
•  Models get increasingly more expressive and powerful:
1.  Standard RNNs: Paraphrase detection
2.  Matrix-Vector RNNs: Relation classification
3.  Recursive Neural Tensor Networks: Sentiment Analysis
4.  Tree LSTMs: Phrase Similarity
Paraphrase Detection
Pollack said the plaintiffs failed to show that Merrill and Blodget directly caused their losses
Basically, the plaintiffs did not show that omissions in Merrill's research caused the claimed losses
The initial report was made to Modesto Police December 28
It stems from a Modesto police report
How to compare the meaning of two sentences?

RNNs for Paraphrase Detection
Unsupervised RNNs and a pair-wise sentence comparison of nodes in parsed trees (Socher et al., NIPS 2011)
Experiments on Microsoft Research Paraphrase Corpus (Dolan et al. 2004)

Method | Acc. | F1
Rus et al. (2008) | 70.6 | 80.5
Mihalcea et al. (2006) | 70.3 | 81.3
Islam et al. (2007) | 72.6 | 81.3
Qiu et al. (2006) | 72.0 | 81.6
Fernando et al. (2008) | 74.1 | 82.4
Wan et al. (2006) | 75.6 | 83.0
Das and Smith (2009) | 73.9 | 82.3
Das and Smith (2009) + 18 Surface Features | 76.1 | 82.7
F. Bu et al. (ACL 2012): String Re-writing Kernel | 76.3 | --
Unfolding Recursive Autoencoder (NIPS 2011) | 76.8 | 83.6

The dataset is problematic; a better evaluation is introduced later.

Recursive Deep Learning
1.  Standard RNNs: Paraphrase Detection
2.  Matrix-Vector RNNs: Relation classification
3.  Recursive Neural Tensor Networks: Sentiment Analysis
4.  Tree LSTMs: Phrase Similarity

Compositionality Through Recursive Matrix-Vector Spaces
p = tanh(W [c1; c2] + b)
•  One way to make the composition function more powerful was by untying the weights W
•  But what if words act mostly as an operator, e.g. "very" in very good
•  Proposal: A new composition function

Compositionality Through Recursive Matrix-Vector Recursive Neural Networks
Standard RNN: p = tanh(W [c1; c2] + b)
MV-RNN: p = tanh(W [C2 c1; C1 c2] + b), where each word/phrase also carries a matrix C
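The two composition functions can be compared directly in code. A minimal NumPy sketch; the toy dimension, random weights, and identity word matrices are illustrative assumptions, not the lecture's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy embedding size (hypothetical)

# Standard RNN composition: p = tanh(W [c1; c2] + b)
W = rng.standard_normal((d, 2 * d)) * 0.1
b = np.zeros(d)

def compose_rnn(c1, c2):
    return np.tanh(W @ np.concatenate([c1, c2]) + b)

# MV-RNN composition: each word also carries a matrix (C1, C2);
# the children first transform each other, then compose as before:
# p = tanh(W [C2 c1; C1 c2] + b)
def compose_mvrnn(c1, C1, c2, C2):
    return np.tanh(W @ np.concatenate([C2 @ c1, C1 @ c2]) + b)

c1, c2 = rng.standard_normal(d), rng.standard_normal(d)
# With identity word matrices, the MV-RNN reduces to the standard RNN:
C1, C2 = np.eye(d), np.eye(d)
assert np.allclose(compose_rnn(c1, c2), compose_mvrnn(c1, C1, c2, C2))
```

A non-identity C for a word like "very" lets that word act as an operator that rescales its sibling's meaning before composition.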
Predicting Sentiment Distributions
Good example for non-linearity in language

MV-RNN for Relationship Classification
Relationship | Sentence with labeled nouns for which to predict relationships
Cause-Effect(e2,e1) | Avian [influenza]e1 is an infectious disease caused by type a strains of the influenza [virus]e2.
Entity-Origin(e1,e2) | The [mother]e1 left her native [land]e2 about the same time and they were married in that city.
Message-Topic(e2,e1) | Roadside [attractions]e1 are frequently advertised with [billboards]e2 to attract tourists.

Sentiment Detection
Sentiment detection is crucial to business intelligence, stock trading, …

Sentiment Detection and Bag-of-Words Models
Most methods start with a bag of words + linguistic features/processing/lexica
But such methods (including tf-idf) can't distinguish:
+ white blood cells destroying an infection
− an infection destroying white blood cells

Sentiment Detection and Bag-of-Words Models
•  A common misconception is that sentiment is "easy"
•  Detection accuracy for longer documents ∼90%
•  Lots of easy cases (… horrible … or … awesome …)
•  For the dataset of single-sentence movie reviews (Pang and Lee, 2005), accuracy never reached above 80% for >7 years
•  Harder cases require actual understanding of negation and its scope + other semantic effects

Data: Movie Reviews
Stealing Harvard doesn't care about cleverness, wit or any other kind of intelligent humor.
There are slow and repetitive parts but it has just enough spice to keep it interesting.

Two missing pieces for improving sentiment
1.  Compositional Training Data
2.  Better Compositional Model

1. New Sentiment Treebank
•  Parse trees of 11,855 sentences
•  215,154 phrases with labels
•  Allows training and evaluating with compositional information

Better Dataset Helped All Models
•  Positive/negative full sentence classification
[Bar chart: accuracy (75–84%) of Bi NB, RNN, and MV-RNN, training with sentence labels vs. training with the treebank]
•  This improved performance for full sentence positive/negative classification by 2 – 3 %
•  Yay!
•  But a more in-depth analysis shows: hard negation cases are still mostly incorrect
•  We also need a more powerful model!

2. New Compositional Model
•  Recursive Neural Tensor Network
•  More expressive than previous RNNs
•  Idea: Allow more interactions of vectors

Recursive Neural Tensor Network
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, Socher et al. 2013
Details: Tensor Backpropagation Training
•  Main matrix derivatives needed for a tensor (from "Derivatives of Matrices, Vectors and Scalar Forms", Petersen & Pedersen, The Matrix Cookbook, Version: November 15, 2012, page 10):

∂(x^T a)/∂x = ∂(a^T x)/∂x = a                                  (69)
∂(a^T X b)/∂X = a b^T                                          (70)
∂(a^T X^T b)/∂X = b a^T                                        (71)
∂(a^T X a)/∂X = ∂(a^T X^T a)/∂X = a a^T                        (72)
∂X/∂X_ij = J^ij                                                (73)
∂(XA)_ij/∂X_mn = δ_im (A)_nj = (J^mn A)_ij                     (74)
∂(X^T A)_ij/∂X_mn = δ_in (A)_mj = (J^nm A)_ij                  (75)

(J^ij denotes the single-entry matrix with a 1 at position (i, j) and zeros elsewhere; δ_im is the Kronecker delta.)
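These identities are exactly what the RNTN needs: its composition adds a bilinear tensor term x^T V^[1:d] x to the standard W x term, and the gradient of each tensor slice is an outer product as in eq. (72). A NumPy sketch, with toy sizes and random initialization as illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3  # toy word-vector size (illustrative)
V = rng.standard_normal((d, 2 * d, 2 * d)) * 0.1  # one (2d x 2d) slice per output coordinate
W = rng.standard_normal((d, 2 * d)) * 0.1

def rntn_compose(c1, c2):
    """RNTN composition p = tanh(x^T V^[1:d] x + W x), x = [c1; c2].
    The bilinear term lets the two children interact multiplicatively,
    unlike the purely additive W x term of the standard RNN."""
    x = np.concatenate([c1, c2])
    bilinear = np.array([x @ V[k] @ x for k in range(d)])
    return np.tanh(bilinear + W @ x)

c1, c2 = rng.standard_normal(d), rng.standard_normal(d)
x = np.concatenate([c1, c2])
# For each slice V[k]: d(x^T V[k] x)/dV[k] = x x^T (cf. eq. 72),
# scaled during backprop by the error signal at output coordinate k.
slice_grad = np.outer(x, x)  # shape (2d, 2d), one such term per slice
```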
Details: Tensor Backpropagation Training
•  Minimizing cross entropy error
•  Standard softmax error message
•  For each slice, we have an update
•  Main backprop rule to pass error down from parent
•  Finally, add errors from parent and current softmax

Positive/Negative Results on Treebank
Classifying Sentences: Accuracy improves to 85.4
[Bar chart: accuracy (74–86%) of Bi NB, RNN, MV-RNN, and RNTN, training with sentence labels vs. training with the treebank]

Fine Grained Results on Treebank

Negation Results
•  Most methods capture that negation often makes things more negative (see Potts, 2010)
•  Analysis on negation dataset
•  Accuracy:

Results on Negating Negatives
•  But how about negating negatives?
•  No flips, but positive activation should increase! (e.g., not bad)
•  Evaluation: Positive activation should increase

Visualizing Deep Learning: Word Embeddings

LSTMs
•  Remember LSTMs?
•  Historically only used over temporal sequences

Tree LSTMs
•  We can use those ideas in grammatical tree structures!
•  Paper: Tai et al. 2015: Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks
•  Idea: Sum the child vectors in a tree structure
•  Each child has its own forget gate
•  Same softmax on h
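The Child-Sum Tree-LSTM update in the bullets above (summed child hidden states, one forget gate per child) can be sketched as follows; the toy size, random weights, and omitted bias terms are simplifying assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d = 4  # toy hidden size
rng = np.random.default_rng(0)
# One weight matrix per gate, acting on [x; h] (biases omitted for brevity)
Wi, Wo, Wu, Wf = (rng.standard_normal((d, 2 * d)) * 0.1 for _ in range(4))

def child_sum_tree_lstm(x, children):
    """children: list of (h, c) pairs from already-processed child nodes.
    Child hidden states are summed for the input/output/update gates,
    but each child gets its OWN forget gate on its memory cell."""
    h_sum = sum((h for h, _ in children), np.zeros(d))
    z = np.concatenate([x, h_sum])
    i, o, u = sigmoid(Wi @ z), sigmoid(Wo @ z), np.tanh(Wu @ z)
    c = i * u
    for h_k, c_k in children:
        f_k = sigmoid(Wf @ np.concatenate([x, h_k]))  # per-child forget gate
        c = c + f_k * c_k
    h = o * np.tanh(c)
    return h, c

# Leaves (no children), then an internal node with two leaf children:
leaf1 = child_sum_tree_lstm(rng.standard_normal(d), [])
leaf2 = child_sum_tree_lstm(rng.standard_normal(d), [])
h, c = child_sum_tree_lstm(rng.standard_normal(d), [leaf1, leaf2])
```

With zero children this reduces to a leaf update, so the same function handles every node of the parse tree; the same softmax classifier can then be applied to each node's h.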
Results on Stanford Sentiment Treebank

Method | Fine-grained | Binary
RAE (Socher et al., 2013) | 43.2 | 82.4
MV-RNN (Socher et al., 2013) | 44.4 | 82.9
RNTN (Socher et al., 2013) | 45.7 | 85.4
DCNN (Blunsom et al., 2014) | 48.5 | 86.8
Paragraph-Vec (Le and Mikolov, 2014) | 48.7 | 87.8
CNN-non-static (Kim, 2014) | 48.0 | 87.2
CNN-multichannel (Kim, 2014) | 47.4 | 88.1
DRNN (Irsoy and Cardie, 2014) | 49.8 | 86.6
LSTM | 45.8 | 86.7
Bidirectional LSTM | 49.1 | 86.8
2-layer LSTM | 47.5 | 85.5
2-layer Bidirectional LSTM | 46.2 | 84.8
Constituency Tree LSTM (no tuning) | 46.7 | 86.6
Constituency Tree LSTM | 50.6 | 86.9
Table 2: Test set accuracies on the Stanford Sentiment Treebank. Fine-grained: 5-class sentiment.
Semantic Similarity
•  Better than binary paraphrase classification!
•  Dataset from a competition: SemEval-2014 Task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness [and textual entailment]

Relatedness score | Example
1.6 | A: "A man is jumping into an empty pool" / B: "There is no biker jumping in the air"
2.9 | A: "Two children are lying in the snow and are making snow angels" / B: "Two angels are making snow on the lying children"
3.6 | A: "The young boys are playing outdoors and the man is smiling nearby" / B: "There is no boy playing outdoors and there is no man smiling"
4.9 | A: "A person in a black jacket is doing tricks on a motorbike" / B: "A man in a black jacket is doing tricks on a motorbike"

Table 1: Examples of sentence pairs with their gold relatedness scores (on a 5-point rating scale).
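Given sentence vectors h_L and h_R from any of the models above, Tai et al. (2015) predict the relatedness score from their elementwise product and absolute difference. The sketch below follows that recipe; the toy sizes, random weights, and omitted biases are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, nh, K = 4, 8, 5  # sentence-vector size, similarity layer size, score classes 1..5 (toy)
rng = np.random.default_rng(0)
Wx = rng.standard_normal((nh, d)) * 0.1  # weights for the product features
Wp = rng.standard_normal((nh, d)) * 0.1  # weights for the difference features
Ws = rng.standard_normal((K, nh)) * 0.1  # score classifier

def relatedness_score(hL, hR):
    """Predict a 1-5 relatedness score from two sentence vectors, using
    elementwise product and absolute difference as comparison features."""
    h_mul = hL * hR           # sign/magnitude agreement per dimension
    h_dif = np.abs(hL - hR)   # per-dimension distance
    hs = 1.0 / (1.0 + np.exp(-(Wx @ h_mul + Wp @ h_dif)))
    p = softmax(Ws @ hs)      # distribution over scores 1..5
    return float(np.arange(1, K + 1) @ p)  # expected score under p

score = relatedness_score(rng.standard_normal(d), rng.standard_normal(d))
```

Because the output is an expected value over a distribution on {1, …, 5}, the model can be trained against real-valued gold scores rather than hard class labels.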
SemanCc Similarity Results (correlaCon and MSE) Pearson’s r, Spearman’s ρ Method
r
⇢
MSE
Mean vectors
DT-RNN (Socher et al., 2014)
SDT-RNN (Socher et al., 2014)
0.8046
0.7863
0.7886
0.7294
0.7305
0.7280
0.3595
0.3983
0.3859
Illinois-LH (Lai and Hockenmaier, 2014)
UNAL-NLP (Jimenez et al., 2014)
Meaning Factory (Bjerva et al., 2014)
ECNU (Zhao et al., 2014)
0.7993
0.8070
0.8268
0.8414
0.7538
0.7489
0.7721
–
0.3692
0.3550
0.3224
–
LSTM
Bidirectional LSTM
2-layer LSTM
2-layer Bidirectional LSTM
0.8477
0.8522
0.8411
0.8488
0.7921
0.7952
0.7849
0.7926
0.2949
0.2850
0.2980
0.2893
Constituency Tree LSTM
Dependency Tree LSTM
0.8491
0.8627
0.7873
0.8032
0.2852
0.2635
Semantic Similarity Results, Pearson Correlation
[Figure: Pearson's r (0.78–0.90) vs. mean sentence length (4–20) for LSTM, Bi-LSTM, ConstTree-LSTM, and DepTree-LSTM; sentences at the tail of the length distribution are batched in the final window (ℓ ≥ 45).]
Next week: Review Session and Midterm
•  Go over materials: hidden layer backprop derivatives
•  Visit office hours for PSet solutions

Hidden layer backprop derivatives:

∂E_R/∂W_ij^(n_l−2) = ((δ^(n_l))^T W_{·i}^(n_l−1)) f'(z_i^(n_l−1)) a_j^(n_l−2)                  (56)
                   = (Σ_{j=1}^{s_{l+1}} W_{ji}^(n_l−1) δ_j^(n_l)) f'(z_i^(n_l−1)) a_j^(n_l−2)  (57)

where the sigmoid derivative from eq. 14 gives f'(z^(l)) = (1 − a^(l)) a^(l).

Derive the errors of all layers l (except the top layer) in vector format, using the Hadamard product ∘:

δ^(l) = ((W^(l))^T δ^(l+1)) ∘ f'(z^(l))

For the weight derivatives:

∂E_R/∂W_ij^(l) = a_j^(l) δ_i^(l+1) + λ W_ij^(l)                                                 (58)

which in one simplified vector notation becomes:

∂E_R/∂W^(l) = δ^(l+1) (a^(l))^T + λ W^(l)                                                       (59)

Note that the top layer is linear. This is a very detailed account of essentially standard backpropagation.

In summary, the backprop procedure consists of four steps:
1.  Apply an input x_n and forward propagate it through the network to find the hidden unit activations, using eq. 18.
2.  Evaluate δ^(n_l) for the output units, using eq. 42.
3.  Backpropagate the δ's to obtain a δ^(l) for each hidden layer in the network.
4.  Evaluate the required derivatives ∂E_R/∂W^(l), using eq. 59.
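The four-step procedure can be sketched end-to-end for a toy fully-connected net with sigmoid hidden layers, a linear top layer, and squared error; the layer sizes and random weights are illustrative, and the λW regularization term of eqs. 58–59 is omitted:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
sizes = [3, 5, 4, 2]  # toy layer sizes (input, two hidden, output)
Ws = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]

x = rng.standard_normal(sizes[0])  # input
t = rng.standard_normal(sizes[-1])  # target

# 1. Forward propagate to find all activations
a = [x]
for l, W in enumerate(Ws):
    z = W @ a[-1]
    a.append(z if l == len(Ws) - 1 else sigmoid(z))  # linear top layer

# 2. Output-layer error for squared error E = 0.5 ||a_out - t||^2
delta = a[-1] - t

# 3.-4. Backpropagate deltas and collect derivatives:
#       dE/dW(l) = delta(l+1) a(l)^T;  delta(l) = (W(l)^T delta(l+1)) * f'(z(l))
grads = []
for l in range(len(Ws) - 1, -1, -1):
    grads.insert(0, np.outer(delta, a[l]))
    if l > 0:
        delta = (Ws[l].T @ delta) * a[l] * (1 - a[l])  # sigmoid: f'(z) = (1 - a) a

assert all(g.shape == W.shape for g, W in zip(grads, Ws))
```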