Deep Learning for Language and Image Understanding

Richard Socher. Joint work with the MetaMind team: Romain Paulus, Elliot English, Brian Pierce; and with Mohit Iyyer, Andrej Karpathy, and Chris Manning.
Single Word Representations
•  Many learning algorithms can represent and label single words
Single Word Representations
•  Continuous vector representations can capture more information
[Figure: 2-D projection of word vectors in which nouns and verbs form separate clusters]
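To make the idea concrete, here is a minimal sketch (not from the talk) of how similarity falls out of such vectors: nearest-neighbor lookup under cosine similarity, with tiny made-up 4-dimensional vectors standing in for learned embeddings.

```python
# Minimal sketch: nearest neighbors of a word under cosine similarity.
# The 4-d vectors below are made up for illustration; real models use
# 50-300 dimensions learned from large corpora.
import numpy as np

vectors = {
    "france":  np.array([0.9, 0.1, 0.0, 0.2]),
    "germany": np.array([0.8, 0.2, 0.1, 0.3]),
    "monday":  np.array([0.1, 0.9, 0.3, 0.0]),
    "tuesday": np.array([0.2, 0.8, 0.4, 0.1]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(word, k=2):
    sims = {w: cosine(vectors[word], v) for w, v in vectors.items() if w != word}
    return sorted(sims.items(), key=lambda kv: -kv[1])[:k]

print(nearest("france"))   # countries end up closer to each other than to weekdays
```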
Single Word Representations
[Figure: zooming into the noun cluster, countries group together]
What About Larger Semantic Units?
How can we know when larger units are similar in meaning?
–  Two senators received contributions engineered by lobbyist Jack Abramoff in return for political favors.
–  Jack Abramoff attempted to bribe two legislators.
People interpret the meaning of larger text units – entities, descriptive terms, facts, arguments, stories – by semantic composition of smaller elements.
Potential Solution for Longer Phrases: Bag of Words
•  Count vector of vocabulary size; ignore word order
•  Good for information retrieval and topic modeling
(positive)  white blood cells destroying an infection
(negative)  an infection destroying white blood cells
[Figure: both phrases map to the same count vector over the vocabulary (aardvark, an, blood, bold, …, weird, yes, zebra)]
(negative)  This film doesn't care about cleverness, wit or any other kind of intelligent humor.
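A small illustration (mine, not the slides') of why the count-vector representation loses word order — the two phrases above receive identical vectors:

```python
# Minimal sketch: bag-of-words count vectors ignore word order, so the two
# phrases from the slide map to exactly the same vector.
from collections import Counter

vocab = ["an", "blood", "cells", "destroying", "infection", "white"]

def bow(phrase):
    counts = Counter(phrase.lower().split())
    return [counts[w] for w in vocab]

a = bow("white blood cells destroying an infection")
b = bow("an infection destroying white blood cells")
print(a == b)  # True: word order is lost
```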
Potential Solution for Longer Phrases: Windows
•  Good for part-of-speech tagging, named entity tagging, and phoneme classification in speech (Collobert et al., 2011; Hinton et al., 2012)
[Figure: a classifier applied to a fixed-size window of words]
Example — Label: positive
"If you enjoy being rewarded by a script that assumes you aren't very bright, then BloodWork is for you"
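A rough sketch of the window approach in the spirit of Collobert et al. (2011), with randomly initialized placeholder weights: the vectors of the words in a fixed window around the center word are concatenated and fed to a softmax classifier.

```python
# Minimal sketch of window-based classification: concatenate word vectors in a
# fixed window and classify the center word. Weights are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
dim, window, n_classes = 50, 2, 5          # 2 words of context on each side
vocab = {"if": 0, "you": 1, "enjoy": 2, "being": 3, "rewarded": 4, "<pad>": 5}
E = rng.normal(size=(len(vocab), dim))     # word embedding matrix
W = rng.normal(size=(n_classes, dim * (2 * window + 1)))
b = np.zeros(n_classes)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(tokens, i):
    padded = ["<pad>"] * window + tokens + ["<pad>"] * window
    x = np.concatenate([E[vocab[w]] for w in padded[i:i + 2 * window + 1]])
    return softmax(W @ x + b)

print(classify(["if", "you", "enjoy", "being", "rewarded"], 2))  # label probs for "enjoy"
```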
Potential Solution: Discrete Phrase Representations
•  Formal logic and λ-calculus (Montague, 1974; Zettlemoyer, 2007)
•  Discrete categories: noun phrase, prepositional phrase
   –  Many subcategories (Petrov et al., 2006)
   –  Lexicalized subcategories (Collins, 2003)
   –  Manually designed subcategories (Klein and Manning, 2003)
   –  Many careful features (Taskar et al., 2004; Finkel et al., 2008)
[Figure: "a cat" labeled with the category NP vs. the lexicalized category NP(cat)]
New Proposal: Represent Phrases as Vectors
[Figure: 2-D vector space (x1, x2) in which Germany and France, Monday and Tuesday, and the phrases "the country of my birth" and "the place where I was born" each land near one another]
If the vector space captures syntactic and semantic information, the vectors can be used as features.
How should we map phrases into a vector space?
Use the principle of compositionality: the meaning (vector) of a sentence is determined by (1) the meanings of its words and (2) the rules that combine them.
[Figure: the phrases "the country of my birth" and "the place where I was born" placed near Germany and France in the same 2-D space as the single words]
The model jointly learns compositional vector representations and the tree structure (e.g., for "the country of my birth").
Composition is Everywhere
Richard works-at MetaMind  +  MetaMind is-in Palo Alto  →  Richard works-in Palo Alto (True)
Outline: Recursive Deep Learning
•  Goal: learning models that capture compositional meaning and jointly learn structure, feature representations, and prediction tasks.
1.  Sentiment Analysis
2.  Question Answering
3.  Grounding Meaning in the Visual World
Models for Composition
Model family: Recursive Neural Network
Inputs: the representations of two candidate children
Outputs:
1.  The semantic feature vector representing the two nodes
2.  A sentiment prediction
[Figure: a recursive neural network composing the word vectors of "was", "not", "great" into phrase vectors, with a sentiment score at each node]
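A minimal sketch of one such composition step (random placeholder parameters; real models are trained end-to-end): concatenate the two children, apply a learned affine map and a nonlinearity to get the parent vector, and attach a softmax sentiment classifier to every node.

```python
# Minimal sketch of one recursive-neural-network composition step:
# parent = f(W [c1; c2] + b), plus a softmax sentiment prediction per node.
import numpy as np

rng = np.random.default_rng(0)
d, n_classes = 8, 5
W  = rng.normal(scale=0.1, size=(d, 2 * d))      # composition matrix
b  = np.zeros(d)
Ws = rng.normal(scale=0.1, size=(n_classes, d))  # sentiment softmax weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def compose(c1, c2):
    parent = np.tanh(W @ np.concatenate([c1, c2]) + b)
    sentiment = softmax(Ws @ parent)
    return parent, sentiment

# "was not great": compose ("not", "great") first, then with "was"
was, not_, great = (rng.normal(size=d) for _ in range(3))
not_great, s1 = compose(not_, great)
was_not_great, s2 = compose(was, not_great)
print(s2)  # sentiment distribution for the full phrase
```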
Recursive Neural Tensor Network
•  Goal: a function that composes two vectors
•  More expressive than previous RNN composition functions
•  Idea: allow both additive and mediated multiplicative interactions between the vectors
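A sketch of the RNTN composition under its usual formulation, with placeholder parameters: alongside the additive term W[c1; c2], each output dimension gets a bilinear term [c1; c2]ᵀ V[k] [c1; c2] that mediates multiplicative interactions between the two children.

```python
# Minimal sketch of the RNTN composition: in addition to the additive term
# W [c1; c2], each output dimension k gets a bilinear ("mediated multiplicative")
# term [c1; c2]^T V[k] [c1; c2]. Parameters are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(scale=0.1, size=(d, 2 * d))
V = rng.normal(scale=0.1, size=(d, 2 * d, 2 * d))  # one 2d x 2d slice per output dim
b = np.zeros(d)

def rntn_compose(c1, c2):
    c = np.concatenate([c1, c2])
    bilinear = np.einsum('i,kij,j->k', c, V, c)    # [c^T V[k] c] for every k
    return np.tanh(bilinear + W @ c + b)

not_vec, bad_vec = rng.normal(size=d), rng.normal(size=d)
print(rntn_compose(not_vec, bad_vec))
```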
Applying an RNTN to a Sentence Tree
[Figure: the RNTN is applied bottom-up over the parse of "Not bad, pretty smart actually": word vectors are composed into phrase vectors node by node, a sentiment label is predicted at every node, and the full sentence receives a positive label]
Positive/Negative Results on Treebank
Classifying sentences: accuracy improves to 85.4
[Figure: bar chart comparing BiNB, RNN, MV-RNN, and RNTN, trained with sentence labels vs. with the Sentiment Treebank]
Since then, a deep RNN by Irsoy and Cardie has obtained the best performance.
Experimental Result on Treebank
•  Can the RNTN capture "X but Y" constructions? There are 131 such cases in the dataset.
•  The RNTN obtains an accuracy of 41%, compared to MV-RNN (37%), RNN (36%), and biNB (27%).
Results on Negating Negatives
•  But how about negating negatives, e.g., "not bad"?
•  The sentiment should not flip, but the positive activation should increase.
But Not All Sentiment Is Context Independent!
Example: "Android beats iOS" — the word "Android" is neutral in isolation but becomes positive in this context, which is a problem for sentiment classification that uses only local information.
Instead: Global Belief RNN
[Figure 2: Propagation steps of the GB-RNN. Step 1 is the standard bottom-up RNN feedforward pass, in which the vector representation of "Android" is independent of the rest of the sentence. Step 2 computes additional vectors at each node using information from higher-level nodes in the tree, allowing "Android" and "iOS" to have different representations given the context.]
•  Compared to models that unfold the same autoencoder multiple times, the GB-RNN takes more information into account at each step and can make better local predictions by using global context.
•  Most prior approaches to sentiment analysis use bag-of-words representations that ignore phrase structure; here we test the model's ability to determine contextual sentiment on Twitter.
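The following is a minimal sketch of the two-pass idea suggested by the figure (node class, parameter names, and shapes are my own illustration, not the paper's code): a bottom-up pass composes children as in a standard RNN, and a top-down pass then gives every node a second, context-dependent vector computed from its parent.

```python
# Minimal sketch of the global-belief idea: a standard bottom-up pass, followed
# by a top-down pass that gives every node a second, context-dependent vector.
# Parameter names/shapes are illustrative placeholders, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)
d = 6
W_up   = rng.normal(scale=0.1, size=(d, 2 * d))   # bottom-up composition
W_down = rng.normal(scale=0.1, size=(d, 2 * d))   # top-down: [parent_down; node_up] -> node_down

class Node:
    def __init__(self, word=None, left=None, right=None):
        self.word, self.left, self.right = word, left, right
        self.up = rng.normal(size=d) if word else None  # leaf vectors stand in for embeddings
        self.down = None

def forward(node):
    if node.word is None:
        forward(node.left); forward(node.right)
        node.up = np.tanh(W_up @ np.concatenate([node.left.up, node.right.up]))

def backward(node, parent_down):
    node.down = np.tanh(W_down @ np.concatenate([parent_down, node.up]))
    if node.word is None:
        backward(node.left, node.down); backward(node.right, node.down)

# (Android (beats iOS)): after the top-down pass, the vector for "Android"
# also reflects the context it appears in.
tree = Node(left=Node("Android"), right=Node(left=Node("beats"), right=Node("iOS")))
forward(tree)
backward(tree, parent_down=np.zeros(d))
print(tree.left.down)   # context-aware representation of "Android"
```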
Instead: Global Belief RNN
•  SemEval 2013 Sentiment Competition (Task 2): F1 for predicting sentiment of phrases in context
   –  SVM (stemming, word clusters, SentiWordNet scores, negation): Twitter 85.19, SMS 88.37
   –  SVM (POS, lexicons, negations, emoticons, elongated words, scores, syntactic dependencies, PMI): Twitter 87.38, SMS 85.79
   –  SVM (punctuation, word and character n-grams, emoticons, elongated words, upper case, stopwords, phrase length, negation, phrase position, large sentiment lexicons, microblogging features): Twitter 88.93, SMS 88.00
   –  GB-RNN (parser, unsupervised word vectors, ensemble): Twitter 89.41, SMS 88.40
   –  Baseline: bigram Naive Bayes: Twitter 80.45, SMS 78.53
•  The GB-RNN obtains state-of-the-art performance on both datasets.
[Table 3: F1 comparison of word vectors on the SemEval 2013 Task 2 test set — semantic word vectors (100 dimensions) vs. hybrid word vectors (100 + 34 dimensions), each with forward-only vs. forward-backward propagation; the best configuration reaches 87.15 F1.]
[Figure 5: Change in sentiment predictions for the tweet "chelski want this so bad that it makes me even happier thinking we may beat them twice in 4 days at SB" between the standard RNN (left) and the GB-RNN (right); the top-down pass changes several node-level predictions.]
Outline: Recursive Deep Learning
•  Goal: learning models that capture compositional meaning and jointly learn structure, feature representations, and language prediction tasks.
1.  Sentiment Analysis
2.  Question Answering
3.  Grounding Meaning in the Visual World
Question Answering: Quiz Bowl Competition
•  QUESTION: He left unfinished a novel whose title character forges his father's signature to get out of school and avoids the draft by feigning desire to join. A more famous work by this author tells of the rise and fall of the composer Adrian Leverkühn. Another of his novels features the Jesuit Naphta and his opponent Settembrini, while his most famous work depicts the aging writer Gustav von Aschenbach. Name this German author of The Magic Mountain and Death in Venice.
•  ANSWER: Thomas Mann
Discussion: Compositional Structure
•  Use dependency-tree recursive neural networks, which capture more semantic structure.
Discussion: Compositional Structure
[Figure 2: Dependency parse of a sentence from a question about Sparta: "This city's economy depended on subjugated peasants called helots." (relations include POSS, NSUBJ, PREP, POBJ, AMOD, VMOD, DOBJ)]
•  This improves compositionality over the standard RNN model by taking into account the dependency relation identity along with the tree structure. An additional d × d matrix W_v incorporates the word vector x_w at a node into the node vector h_n.
•  Leaf representations are computed first, e.g.
     h_helots = f(W_v · x_helots + b),                                        (1)
   where f is a nonlinear activation function such as tanh and b is a bias term.
•  The parent computation handles a variable number of children: for any node n with children K(n) and word vector x_w,
     h_n = f(W_v · x_w + b + Σ_{k ∈ K(n)} W_{R(n,k)} · h_k),
   where R(n, k) is the dependency relation between node n and its k-th child. For example,
     h_depended = f(W_NSUBJ · h_economy + W_PREP · h_on + W_v · x_depended + b).
•  Questions about an answer often mention related answers (a question about World War II might mention the Battle of the Bulge and vice versa), so answer word vectors can be trained in the same vector space as the question text, enabling the model to capture relationships between answers instead of assuming incorrectly that all answers are independent. Unlike Socher et al. (2014), answers and questions are trained jointly in a single model, rather than training each separately and holding embeddings fixed.
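A minimal sketch of that dependency-tree composition rule (the relation inventory, dimensions, and random parameters are placeholders): each node combines its own word vector with its children's hidden vectors, each transformed by a relation-specific matrix.

```python
# Minimal sketch of the dependency-tree composition
#   h_n = f(Wv·x_w + b + sum_k W_{R(n,k)}·h_k),
# with relation-specific matrices. All parameters are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
d = 10
Wv = rng.normal(scale=0.1, size=(d, d))
b = np.zeros(d)
relations = ["NSUBJ", "PREP", "POBJ", "AMOD", "DET"]
Wr = {r: rng.normal(scale=0.1, size=(d, d)) for r in relations}
embed = lambda word: rng.normal(size=d)      # stand-in for learned word vectors

def node_vector(word, children=()):
    """children: iterable of (relation, child_hidden_vector) pairs."""
    total = Wv @ embed(word) + b
    for rel, h_child in children:
        total += Wr[rel] @ h_child
    return np.tanh(total)

# "economy depended on ...": leaves first, then the head word.
h_economy = node_vector("economy")
h_on      = node_vector("on")
h_depended = node_vector("depended",
                         [("NSUBJ", h_economy), ("PREP", h_on)])
print(h_depended)
```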
Training
•  Train on ranked pairs of sentences and correct entity vectors. With sentence hidden vector h_s, the error over a set of sentences S and incorrect answers Z is
     C(S, θ) = Σ_{s ∈ S} Σ_{z ∈ Z} L(rank(c, s, Z)) · max(0, 1 − x_c · h_s + x_z · h_s),        (5)
   where rank(c, s, Z) is the rank of the correct answer c with respect to the incorrect answers Z. Following Usunier et al. (2009), this rank is transformed into a loss function that optimizes the top of the ranked list:
     L(r) = Σ_{i=1}^{r} 1/i.
•  Since rank(c, s, Z) is expensive to compute exactly, it is approximated by randomly sampling K incorrect answers until a violation (x_c · h_s < 1 + x_z · h_s) is observed and estimating the rank from the number of samples.
•  While past recursive models were restricted to the sentence level, sentence representations can easily be combined across sentences; the simplest combination is to average them, which turns out to be powerful and performs better than the baselines. The resulting model is called QANTA: a question answering neural network with trans-sentential averaging.
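A rough sketch of a ranking objective of this flavor (simplified: the rank approximation here just counts sampled violations, and all names and shapes are placeholders): the correct answer's embedding should score at least a margin above sampled incorrect answers, with violations weighted by L(r).

```python
# Minimal sketch of a WARP-style ranking loss: push the correct answer's
# embedding above sampled incorrect answers by a margin of 1, weighting the
# hinge by L(r) = sum_{i<=r} 1/i. Simplified; names and shapes are placeholders.
import numpy as np

rng = np.random.default_rng(0)
d, n_answers = 16, 100
answers = rng.normal(size=(n_answers, d))     # one embedding x_c per answer
h_s = rng.normal(size=d)                      # hidden vector of a question sentence
correct = 3

def rank_weight(r):
    return sum(1.0 / i for i in range(1, r + 1))

def ranking_loss(h_s, correct, n_samples=20):
    score_c = answers[correct] @ h_s
    violations, hinge = 0, 0.0
    for z in rng.choice(n_answers, size=n_samples, replace=False):
        if z == correct:
            continue
        margin = 1.0 - score_c + answers[z] @ h_s
        if margin > 0:
            violations += 1
            hinge += margin
    approx_rank = max(violations, 1)          # crude stand-in for rank(c, s, Z)
    return rank_weight(approx_rank) * hinge

print(ranking_loss(h_s, correct))
```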
Score of Plot (No Named Entities) and Author
[Figure preview: scores of a plot-only question against the top candidate answers — Thomas Mann, Henrik Ibsen, Joseph Conrad, Henry James, Franz Kafka]
Results
QANTA: question answering neural network with trans-sentential averaging

Accuracy for history and literature questions at the first two sentence positions of each question and on the full question:

                  History                  Literature
Model             Pos 1   Pos 2   Full     Pos 1   Pos 2   Full
bow               27.5    51.3    53.1     19.3    43.4    46.7
bow-dt            35.4    57.7    60.2     24.4    51.8    55.7
ir-qb             37.5    65.9    71.4     27.4    54.0    61.9
fixed-qanta       38.3    64.4    66.2     28.9    57.7    62.3
qanta             47.1    72.1    73.7     36.4    68.2    69.1
ir-wiki           53.7    76.6    77.5     41.8    74.0    73.3
qanta+ir-wiki     59.8    81.8    82.3     44.7    78.7    76.6

The top half of the table compares models trained on questions only, while the IR models in the bottom half have access to Wikipedia. qanta outperforms all baselines that are restricted to just the question data, and it substantially improves an IR model with access to Wikipedia despite being trained on much less data.
Pushing Facts into Entity Vectors
Qanta Model Can Defeat Human Players
[Figure 4: Comparison of qanta+ir-wiki to human quiz bowl players; each bar represents an individual human.]
Literature Questions Are Hard!
[Figure: comparison of qanta+ir-wiki to human quiz bowl players on literature questions; each bar represents an individual human.]
Outline: Recursive Deep Learning
•  Goal: learning models that capture compositional meaning and jointly learn structure, feature representations, and language prediction tasks.
1.  Sentiment Analysis
2.  Question Answering
3.  Grounding Meaning in the Visual World
Visual Grounding of Full Sentences
•  Idea: map sentences and images into a joint vector space
Compositional Sentence Structure
•  Use the same dependency-tree RNNs, which capture more semantic structure
Convolutional Neural Network for Images
•  CNN trained on ImageNet (Le et al., 2013)
•  The RNN is trained to give large inner products between sentence and image vectors
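A minimal sketch of such a joint-space objective (random vectors stand in for CNN image features and DT-RNN sentence vectors; projection matrices and margin are placeholders): project both modalities into a common space and penalize mismatched image–sentence pairs that score within a margin of the matching pair.

```python
# Minimal sketch of a joint image-sentence embedding objective: matching pairs
# should have a larger inner product than mismatched pairs by a margin.
# Random vectors stand in for CNN image features and DT-RNN sentence vectors.
import numpy as np

rng = np.random.default_rng(0)
n_pairs, img_dim, sent_dim, d = 5, 32, 20, 10
images    = rng.normal(size=(n_pairs, img_dim))
sentences = rng.normal(size=(n_pairs, sent_dim))
W_img  = rng.normal(scale=0.1, size=(d, img_dim))    # image projection
W_sent = rng.normal(scale=0.1, size=(d, sent_dim))   # sentence projection

def scores():
    vi = images @ W_img.T        # (n_pairs, d)
    vs = sentences @ W_sent.T    # (n_pairs, d)
    return vi @ vs.T             # inner products between every image and sentence

def max_margin_loss(margin=1.0):
    S = scores()
    correct = np.diag(S)
    loss = 0.0
    for i in range(n_pairs):
        for j in range(n_pairs):
            if i != j:
                # image i vs. wrong sentence j, and sentence i vs. wrong image j
                loss += max(0.0, margin - correct[i] + S[i, j])
                loss += max(0.0, margin - correct[i] + S[j, i])
    return loss

print(max_margin_loss())
```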
Results
[Figure: example sentence–image retrieval results; checkmarks and crosses indicate correct and incorrect matches]
Results
Mean rank (lower is better):

Model                                        Describing Images   Image Search
Random                                       92.1                52.1
Bag of Words                                 21.1                14.6
CT-RNN                                       23.9                16.1
Recurrent Neural Network                     27.1                19.2
Kernelized Canonical Correlation Analysis    18.0                15.9
DT-RNN                                       16.9                12.5
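For reference, the mean-rank metric above can be computed as sketched below (random scores stand in for the model's image–sentence inner products): rank all images for each sentence and average the position of the correct one, so lower is better.

```python
# Minimal sketch of the mean-rank metric used above: rank all images for each
# sentence by score and average the position of the correct image (1 = best).
import numpy as np

rng = np.random.default_rng(0)
n = 50
score = rng.normal(size=(n, n))       # score[i, j]: sentence i vs. image j (placeholder)

def mean_rank(score):
    ranks = []
    for i in range(n):
        order = np.argsort(-score[i])              # best-scoring image first
        ranks.append(int(np.where(order == i)[0][0]) + 1)
    return float(np.mean(ranks))

print(mean_rank(score))   # roughly n/2 for random scores
```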
Grounded Sentence–Image Search
•  Demo: image–sentence retrieval
•  Demo: train your own deep vision model
Conclusion
•  Deep learning can learn grounded representations
•  Recursive deep learning can learn compositional representations
•  The combination can be employed in a variety of tasks requiring world or visual knowledge

[Figure 5: t-SNE 2-D projections of 451 answer vectors divided into six major clusters. The blue cluster is predominantly populated by U.S. presidents; the zoomed plot reveals temporal clustering among the presidents based on the years they spent in office.]
[Figure 6: A question about the German novelist Thomas Mann that contains no named entities, along with the five top answers as scored by qanta (Thomas Mann, Henrik Ibsen, Joseph Conrad, Henry James, Franz Kafka). Each cell in the heatmap corresponds to the score (inner product) between a node in the parse tree and the given answer, and the dependency parse of the sentence is shown on the left. All of the baselines, including ir-wiki, are wrong, while qanta uses the plot description to identify the correct answer.]