Refresher: The word2vec model

Given a sequence of training words w_1, w_2, w_3, …, w_T, the objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document.

•  Main cost function J:

   J(θ) = (1/T) ∑_{t=1}^{T} ∑_{−c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)

   where c is the size of the training context.

•  With probabilities defined as:

   p(w_O | w_I) = exp(v′_{w_O}^⊤ v_{w_I}) / ∑_{w=1}^{W} exp(v′_w^⊤ v_{w_I})

   where v_w and v′_w are the "input" and "output" vector representations of w, and W is the number of words in the vocabulary. This formulation is impractical because the cost of computing ∇ log p(w_O | w_I) is proportional to W, which is often large (10^5 – 10^7 terms).

•  We derived the gradient for the internal vectors (v on the board).
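As a concrete reference, here is a minimal numpy sketch of the softmax probability above. The matrices `V_in` and `V_out` (one row per vocabulary word) are illustrative assumptions, not the course's starter-code interface.

```python
import numpy as np

def skipgram_softmax_prob(center_idx, outside_idx, V_in, V_out):
    """Naive p(w_O | w_I) under the skip-gram softmax.

    V_in, V_out: (W, d) arrays of "input" and "output" vectors (assumed layout).
    Cost is O(W) per probability, which is why approximations are needed.
    """
    scores = V_out @ V_in[center_idx]   # v'_w^T v_{w_I} for every word w
    scores -= scores.max()              # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[outside_idx] / exp_scores.sum()
```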
Calculating all gradients!

•  We went through gradients for each center vector v in a window
•  We also need gradients for external vectors v′ (u on the board)
•  Derive!
•  Generally, in each window we will compute updates for all parameters that are being used in that window.
•  For example, window size c = 1, sentence: "I like learning ."
•  First window computes gradients for: internal vector v_like and external vectors v′_I and v′_learning
•  Next window in that sentence?

Compute all vector gradients!

•  We often define the set of ALL parameters in a model in terms of one long vector θ
•  In our case, with d-dimensional vectors and V many words, θ stacks all v and v′ vectors, so θ ∈ R^{2dV}

Gradient Descent

•  To minimize over the full batch (the entire training data) would require us to compute gradients for all windows
•  Updates would be for each element of θ, with step size α:

   θ_j^{new} = θ_j^{old} − α ∂J(θ)/∂θ_j^{old}

•  In matrix notation, for all parameters:

   θ^{new} = θ^{old} − α ∇_θ J(θ)

Vanilla Gradient Descent Code
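A hedged sketch of what such a vanilla update loop looks like in numpy; the `evaluate_gradient` callable and the step size are placeholders for illustration, not the course's actual code.

```python
import numpy as np

def gradient_descent(theta, evaluate_gradient, data, alpha=0.01, n_steps=1000):
    """Vanilla batch gradient descent: theta <- theta - alpha * grad_J(theta).

    evaluate_gradient(data, theta) is assumed to return the gradient of the
    full-batch objective J with respect to theta (an array shaped like theta).
    """
    for _ in range(n_steps):
        grad = evaluate_gradient(data, theta)  # one full pass over all windows
        theta = theta - alpha * grad           # update every parameter at once
    return theta
```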
Intuition

•  For a simple convex function over two parameters.
•  Contour lines show levels of the objective function.
•  See whiteboard.

Stochastic Gradient Descent

•  But the corpus may have 40B tokens and windows
•  You would wait a very long time before making a single update!
•  Very bad idea for pretty much all neural nets!
•  Instead: we will update parameters after each window t → Stochastic gradient descent (SGD):

   θ^{new} = θ^{old} − α ∇_θ J_t(θ)

Stochastic gradients with word vectors!

•  But in each window, we only have at most 2c − 1 words, so ∇_θ J_t(θ) is very sparse!

Stochastic gradients with word vectors!

•  We may as well only update the word vectors that actually appear!
•  Solution: either keep around a hash for word vectors, or only update certain columns of the full embedding matrices L and L′ (d × V, one column per word)
•  Important if you have millions of word vectors and do distributed computing, so you do not have to send gigantic updates around.
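A hedged sketch of that sparse update, assuming a column-per-word layout for L and L′ and a `grad_fn` that returns gradients only for the columns touched by the window; the names are illustrative, not the PSet interface.

```python
import numpy as np

def sgd_window_step(L, L_prime, window_word_ids, grad_fn, alpha=0.025):
    """One SGD step that touches only the columns used by the current window.

    L, L_prime: (d, V) embedding matrices, one column per vocabulary word.
    window_word_ids: indices of the few words in this window (assumed unique).
    grad_fn: assumed to return (grad_L_cols, grad_Lp_cols) for those columns only.
    """
    grad_L_cols, grad_Lp_cols = grad_fn(L, L_prime, window_word_ids)
    L[:, window_word_ids] -= alpha * grad_L_cols         # sparse update of L
    L_prime[:, window_word_ids] -= alpha * grad_Lp_cols  # sparse update of L'
```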
Approximations: PSet 1

•  The normalization factor is too computationally expensive.
•  Hence, in PSet 1 you will implement the skip-gram model.
•  Main idea: train binary logistic regressions for a true pair (center word and word in its context window) and a couple of random pairs (the center word with a random word).
•  From the paper, two alternatives to the full softmax: the hierarchical softmax, a computationally efficient approximation first introduced for neural network language models by Morin and Bengio, which needs to evaluate only about log2(W) nodes instead of W output nodes; and noise contrastive estimation (NCE), introduced by Gutmann and Hyvärinen and applied to language modeling by Mnih and Teh. NCE posits that a good model should be able to differentiate data from noise by means of logistic regression. This is similar to the hinge loss used by Collobert and Weston, who trained the models by ranking the data above noise.
PSet 1: The skip-gram model and negative sampling

•  From the paper "Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al. 2013):
   While NCE can be shown to approximately maximize the log probability of the softmax, the Skip-gram model is only concerned with learning high-quality vector representations, so we are free to simplify NCE as long as the vector representations retain their quality. We define negative sampling (NEG) by the objective

   log σ(v′_{w_O}^⊤ v_{w_I}) + ∑_{i=1}^{k} E_{w_i ∼ P_n(w)} [ log σ(−v′_{w_i}^⊤ v_{w_I}) ]

•  Where k is the number of negative samples and we use the sigmoid function (we'll become good friends soon): σ(x) = 1 / (1 + e^{−x})
•  So we maximize the probability of two words co-occurring in the first log term
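A small numpy sketch of this objective, written as a loss to minimize (the negative of the expression above), with negatives drawn from the unigram distribution raised to the 3/4 power as described on the next slide. The array names and the `unigram_counts` argument are illustrative assumptions, not the starter-code API.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center_vec, outside_idx, V_out, unigram_counts, k=10, rng=None):
    """Negative-sampling loss for one (center, outside) pair.

    center_vec: (d,) vector v_{w_I};  V_out: (W, d) matrix of "output" vectors v'.
    unigram_counts: (W,) raw word counts used to build P_n(w) ~ U(w)**(3/4).
    """
    rng = rng or np.random.default_rng(0)
    p_n = unigram_counts.astype(float) ** 0.75
    p_n /= p_n.sum()
    neg_ids = rng.choice(len(p_n), size=k, p=p_n)            # k sampled noise words

    pos = np.log(sigmoid(V_out[outside_idx] @ center_vec))   # log sigma(v'_{w_O}^T v_{w_I})
    neg = np.log(sigmoid(-V_out[neg_ids] @ center_vec))      # log sigma(-v'_{w_i}^T v_{w_I})
    return -(pos + neg.sum())
```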
PSet 1: The skip-gram model and negative sampling

•  Slightly clearer notation:
•  Maximize the probability that the real outside word appears, minimize the probability that random words appear around the center word
•  P_n = U(w)^{3/4} / Z, the unigram distribution U(w) raised to the 3/4 power (we provide this function in the starter code).
•  The power makes less frequent words be sampled more often

PSet 1: The continuous bag of words model

•  Main idea for continuous bag of words (CBOW): predict the center word from the sum of surrounding word vectors, instead of predicting surrounding single words from the center word as in the skip-gram model
•  To make the PSet slightly easier: the implementation of the CBOW model is not required and is for bonus points!
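The CBOW idea in a few lines: sum the context vectors, then score every candidate center word. This is only a forward-pass sketch with assumed matrix names, not the bonus-part reference implementation.

```python
import numpy as np

def cbow_center_probs(context_ids, V_in, V_out):
    """CBOW forward pass: predict the center word from summed context vectors.

    context_ids: indices of the surrounding words; V_in, V_out: (W, d) matrices.
    Returns a length-W softmax distribution over candidate center words.
    """
    h = V_in[context_ids].sum(axis=0)   # one summed context representation
    scores = V_out @ h
    scores -= scores.max()              # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()
```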
What to do with the two sets of vectors?

•  We end up with L and L′ from all the vectors v and v′
•  Both capture similar co-occurrence information. It turns out the best solution is to simply sum them up: L_final = L + L′
•  One of many hyperparameters explored in GloVe: Global Vectors for Word Representation (Pennington et al. 2014)

How to evaluate word vectors?

•  Related to general evaluation in NLP: intrinsic vs. extrinsic
•  Intrinsic:
   •  Evaluation on a specific/intermediate subtask
   •  Fast to compute
   •  Helps to understand that system
   •  Not clear if really helpful unless correlation to a real task is established
•  Extrinsic:
   •  Evaluation on a real task
   •  Can take a long time to compute accuracy
   •  Unclear if the subsystem is the problem, or its interaction with other subsystems
   •  If replacing one subsystem with another improves accuracy → winning!

Intrinsic word vector evaluation

•  Word vector analogies, syntactic and semantic:  a:b :: c:?   man:woman :: king:?
•  Evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions
•  Discarding the input words from the search!
•  Problem: what if the information is there but not linear?
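A sketch of that search: maximize cosine similarity to x_b − x_a + x_c while discarding the three input words. The `vocab` dictionary and the row-per-word matrix layout are assumptions for illustration.

```python
import numpy as np

def analogy(a, b, c, vocab, vectors):
    """Answer a:b :: c:? by cosine similarity, e.g. man:woman :: king:? -> queen.

    vocab: dict word -> row index;  vectors: (W, d) matrix of word vectors.
    """
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query = unit[vocab[b]] - unit[vocab[a]] + unit[vocab[c]]
    sims = unit @ (query / np.linalg.norm(query))
    for w in (a, b, c):                  # discard the input words from the search
        sims[vocab[w]] = -np.inf
    best = int(np.argmax(sims))
    return next(w for w, i in vocab.items() if i == best)
```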
Intrinsic word vector evaluation

•  Word vector analogies, semantic examples from
   http://code.google.com/p/word2vec/source/browse/trunk/questions-words.txt
•  city-in-state (problem: different cities may have the same name):
   Chicago Illinois Houston Texas
   Chicago Illinois Philadelphia Pennsylvania
   Chicago Illinois Phoenix Arizona
   Chicago Illinois Dallas Texas
   Chicago Illinois Jacksonville Florida
   Chicago Illinois Indianapolis Indiana
   Chicago Illinois Austin Texas
   Chicago Illinois Detroit Michigan
   Chicago Illinois Memphis Tennessee
   Chicago Illinois Boston Massachusetts

Intrinsic word vector evaluation

•  Semantic examples: capital-world (problem: can change):
   Abuja Nigeria Accra Ghana
   Abuja Nigeria Algiers Algeria
   Abuja Nigeria Amman Jordan
   Abuja Nigeria Ankara Turkey
   Abuja Nigeria Antananarivo Madagascar
   Abuja Nigeria Apia Samoa
   Abuja Nigeria Ashgabat Turkmenistan
   Abuja Nigeria Asmara Eritrea
   Abuja Nigeria Astana Kazakhstan

Intrinsic word vector evaluation

•  Syntactic examples: gram4-superlative:
   bad worst big biggest
   bad worst bright brightest
   bad worst cold coldest
   bad worst cool coolest
   bad worst dark darkest
   bad worst easy easiest
   bad worst fast fastest
   bad worst good best
   bad worst great greatest

Intrinsic word vector evaluation

•  Syntactic examples: gram7-past-tense:
   dancing danced decreasing decreased
   dancing danced describing described
   dancing danced enhancing enhanced
   dancing danced falling fell
   dancing danced feeding fed
   dancing danced flying flew
   dancing danced generating generated
   dancing danced going went
   dancing danced hiding hid
   dancing danced hitting hit
Analogy evaluation and hyperparameters

•  Most careful analysis so far: GloVe word vectors (which also capture co-occurrence counts more directly than the skip-gram)

   Table 2 (GloVe paper): Results on the word analogy task, given as percent accuracy. Underlined scores are best within groups of similarly-sized models; bold scores are best overall. HPCA vectors are publicly available; (i)vLBL results are from (Mnih et al., 2013); skip-gram (SG) and CBOW results are from (Mikolov et al., 2013a,b); we trained SG† and CBOW† using the word2vec tool. See text for details and a description of the SVD models.

   Model    Dim.  Size  Sem.  Syn.  Tot.
   ivLBL    100   1.5B  55.9  50.1  53.2
   HPCA     100   1.6B   4.2  16.4  10.8
   GloVe    100   1.6B  67.5  54.3  60.3
   SG       300   1B    61    61    61
   CBOW     300   1.6B  16.1  52.6  36.1
   vLBL     300   1.5B  54.2  64.8  60.0
   ivLBL    300   1.5B  65.2  63.0  64.0
   GloVe    300   1.6B  80.8  61.5  70.3
   SVD      300   6B     6.3   8.1   7.3
   SVD-S    300   6B    36.7  46.6  42.1
   SVD-L    300   6B    56.6  63.0  60.1
   CBOW†    300   6B    63.6  67.4  65.7
   SG†      300   6B    73.0  66.0  69.1
   GloVe    300   6B    77.4  67.0  71.7
   CBOW     1000  6B    57.3  68.9  63.7
   SG       1000  6B    66.1  65.1  65.6
   SVD-L    300   42B   38.4  58.2  49.2
   GloVe    300   42B   81.9  69.3  75.0
Analogy evaluation and hyperparameters

•  Asymmetric context (only words to the left) is not as good
•  Best dimensionality ~300, with a slight drop-off afterwards
•  But this might be different for downstream tasks!
•  Window size of 8 around each center word is good for GloVe vectors

   [Figure 2 (GloVe paper): Accuracy (semantic, syntactic, overall) on the analogy task as a function of vector size and window size/type: (a) symmetric context, accuracy vs. vector dimension; (b) symmetric and (c) asymmetric context, accuracy vs. window size. All models are trained on the 6 billion token corpus. In (a), the window size is 10. In (b) and (c), the vector size is 100.]
Analogy evaluation and hyperparameters

•  More training time helps

   [Figure (GloVe paper): overall accuracy as a function of training time in hours, comparing GloVe (by number of iterations) with CBOW and Skip-Gram (by number of negative samples); panel (b) shows GloVe vs. Skip-Gram.]
Analogy evaluation and hyperparameters

•  More data helps; Wikipedia is better than news text!

   [Figure 3 (GloVe paper): Accuracy (semantic, syntactic, overall) on the analogy task for 300-dimensional vectors trained on different corpora: Wiki2010 (1B tokens), Wiki2014 (1.6B tokens), Gigaword5 (4.3B tokens), Gigaword5 + Wiki2014 (6B tokens), Common Crawl (42B tokens).]
Intrinsic word vector evaluation

•  Word vector distances and their correlation with human judgments
•  Example dataset: WordSim353, http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/

   Word 1      Word 2      Human (mean)
   tiger       cat          7.35
   tiger       tiger       10.00
   book        paper        7.46
   computer    internet     7.58
   plane       car          5.77
   professor   doctor       6.62
   stock       phone        1.62
   stock       CD           1.31
   stock       jaguar       0.92

Correlation evaluation

•  Word vector distances and their correlation with human judgments
•  Some ideas from the GloVe paper have been shown to improve the skip-gram (SG) model also (e.g. summing both vectors)

   Table 3 (GloVe paper): Spearman rank correlation on word similarity tasks. All vectors are 300-dimensional. The CBOW* vectors are from the word2vec website and differ in that they contain phrase vectors.

   Model    Size  WS353  MC    RG    SCWS  RW
   SVD      6B    35.3   35.1  42.5  38.3  25.6
   SVD-S    6B    56.5   71.5  71.0  53.6  34.7
   SVD-L    6B    65.7   72.7  75.1  56.5  37.0
   CBOW†    6B    57.2   65.6  68.2  57.0  32.5
   SG†      6B    62.8   65.2  69.7  58.1  37.2
   GloVe    6B    65.8   72.7  77.8  53.9  38.1
   SVD-L    42B   74.0   76.4  74.1  58.3  39.9
   GloVe    42B   75.9   83.6  82.9  59.6  47.8
   CBOW*    100B  68.4   79.6  75.4  59.4  45.5
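Table 3's numbers are Spearman rank correlations between human ratings and model cosine similarities; here is a hedged sketch of how such a score is computed, assuming `pairs` is a list of (word1, word2, human_score) tuples and SciPy is available.

```python
import numpy as np
from scipy.stats import spearmanr

def word_similarity_score(pairs, vocab, vectors):
    """Spearman correlation between human similarity ratings and cosine similarity."""
    human, model = [], []
    for w1, w2, rating in pairs:
        if w1 in vocab and w2 in vocab:          # skip out-of-vocabulary pairs
            v1, v2 = vectors[vocab[w1]], vectors[vocab[w2]]
            cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
            human.append(rating)
            model.append(cos)
    rho, _ = spearmanr(human, model)
    return rho
```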
But what about ambiguity?

•  You may hope that one vector captures both kinds of information (run = verb and noun), but then the vector is pulled in different directions
•  Alternative described in: Improving Word Representations Via Global Context And Multiple Word Prototypes (Huang et al. 2012)
•  Idea: cluster word windows around words, retrain with each word assigned to multiple different clusters: bank1, bank2, etc.

But what about ambiguity?

•  Improving Word Representations Via Global Context And Multiple Word Prototypes (Huang et al. 2012)

Extrinsic word vector evaluation

•  Extrinsic evaluation of word vectors: all subsequent tasks in this class
•  One example where good word vectors should help directly: named entity recognition, i.e. finding a person, organization or location

   Table 4 (GloVe paper): F1 score on NER task with 50d vectors. Discrete is the baseline without word vectors. We use publicly-available vectors for HPCA, HSMN, and CW. See text for details.

   Model     Dev   Test  ACE   MUC7
   Discrete  91.0  85.4  77.4  73.4
   SVD       90.8  85.7  77.3  73.7
   SVD-S     91.0  85.5  77.6  74.3
   SVD-L     90.5  84.8  73.6  71.5
   HPCA      92.6  88.7  81.7  80.7
   HSMN      90.5  85.7  78.7  74.7
   CW        92.2  87.4  81.7  80.2
   CBOW      93.1  88.2  82.2  81.1
   GloVe     93.2  88.3  82.9  82.2

•  Next: how to use word vectors in neural net models!
Simple single word classification

•  What is the major benefit of deep learned word vectors?
•  Ability to also classify words accurately
   •  Countries cluster together → classifying location words should be possible with word vectors
•  Incorporate any information into them for other tasks
   •  Project sentiment into words to find the most positive/negative words in a corpus
The softmax

•  Logistic regression = softmax classification on word vector x to obtain the probability for class y:

   p(y | x) = exp(W_y· x) / ∑_{c=1}^{C} exp(W_c· x)

   [slide figure: a small network with outputs a1, a2 over inputs x1, x2, x3]

•  Generalizes to >2 classes (for just binary, a sigmoid unit would suffice, as in skip-gram)

The softmax - details

•  Terminology: loss function = cost function = objective function
•  Loss for softmax: cross entropy
•  To compute p(y|x): first take the y'th row of W and multiply that row with x:

   f_y = W_y· x = ∑_i W_{yi} x_i

•  Compute all f_c for c = 1, …, C
•  Normalize to obtain the probability with the softmax function:

   p(y | x) = e^{f_y} / ∑_{c=1}^{C} e^{f_c}
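A numpy sketch of exactly these steps (f = Wx, softmax normalization, then the cross-entropy loss of the next slide); W, x and the class index y are generic placeholders.

```python
import numpy as np

def softmax(f):
    f = f - f.max()              # shift for numerical stability; probabilities unchanged
    exp_f = np.exp(f)
    return exp_f / exp_f.sum()

def cross_entropy_loss(W, x, y):
    """-log p(y | x) for a softmax classifier: W is (C, d), x is (d,), y a class index."""
    f = W @ x                    # f_c = W_c. x for c = 1..C
    p = softmax(f)
    return -np.log(p[y])
```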
The softmax and cross-entropy error

•  The loss wants to maximize the probability of the correct class y
•  Hence, we minimize the negative log probability of that class:

   −log p(y | x) = −log ( e^{f_y} / ∑_{c=1}^{C} e^{f_c} )

•  As before: we sum up multiple cross entropy errors if we have multiple classifications in our total error function over the corpus (more next lecture)

Background: The cross entropy error

•  Assuming a ground truth (or gold or target) probability distribution that is 1 at the right class and 0 everywhere else, p = [0, …, 0, 1, 0, …, 0], and our computed probability is q, then the cross entropy is:

   H(p, q) = − ∑_{c=1}^{C} p_c log q_c

•  Because of the one-hot p, the only term left is the negative log probability of the true class
•  Cross entropy can be re-written in terms of the entropy and the Kullback-Leibler divergence between the two distributions (spelled out below)

The KL divergence

•  Cross entropy: H(p, q) = H(p) + D_KL(p ∥ q)
•  Because H(p) is zero in our case (and even if it weren't, it would be fixed and contribute nothing to the gradient), minimizing the cross entropy is equal to minimizing the KL divergence
•  The KL divergence is not a distance but a non-symmetric measure of the difference between two probability distributions p and q:

   D_KL(p ∥ q) = ∑_{c=1}^{C} p_c log (p_c / q_c)
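To make the one-hot reduction and the entropy/KL decomposition above concrete, here is the short derivation (standard identities, written out for reference):

```latex
p = e_y \;(\text{one-hot}) \;\Rightarrow\;
H(p, q) = -\sum_{c=1}^{C} p_c \log q_c = -\log q_y ,
\qquad
H(p, q) = \underbrace{-\sum_c p_c \log p_c}_{H(p)\,=\,0 \text{ for one-hot } p}
        + \underbrace{\sum_c p_c \log \frac{p_c}{q_c}}_{D_{\mathrm{KL}}(p\,\|\,q)} .
```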
PSet 1

•  Derive the gradient of the cross entropy error with respect to the input word vector x and the matrix W
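Not the solution itself, just the standard shape of the result for the softmax classifier above, useful as a sanity check for your own derivation (δ denotes the prediction error vector):

```latex
\text{With } f = W x,\;\; p = \mathrm{softmax}(f),\;\; J = -\log p_y:\qquad
\frac{\partial J}{\partial f} = p - e_y =: \delta,\qquad
\frac{\partial J}{\partial W} = \delta\, x^{\top},\qquad
\frac{\partial J}{\partial x} = W^{\top} \delta .
```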
Simple single word classification

•  Example: sentiment
•  Two options: train only the softmax weights W and fix the word vectors, or also train the word vectors
•  Question: what are the advantages and disadvantages of training the word vectors?
   •  Pro: better fit on the training data
   •  Con: worse generalization, because the words move in the vector space

Visualization of sentiment trained word vectors

Next level up: Window classification

•  Single word classification has no context!
•  Let's add context by taking in windows and classifying the center word of that window!
•  Possible: softmax and cross entropy error, or max-margin loss
•  Next class!

References