Refresher: The word2vec model
• Given a sequence of training words w_1, w_2, w_3, …, w_T, the objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document.
• Main cost function J (the average log probability, which we want to maximize):
  J(θ) = (1/T) Σ_{t=1}^{T} Σ_{−c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)
  where c is the size of the training context (larger c gives more training examples and can lead to higher accuracy, at the expense of training time).
• With probabilities defined as:
  p(w_O | w_I) = exp(v'_{w_O}^T v_{w_I}) / Σ_{w=1}^{W} exp(v'_w^T v_{w_I})
  where v_w and v'_w are the "input" and "output" vector representations of w, and W is the number of words in the vocabulary. This formulation is impractical because the cost of computing ∇ log p(w_O | w_I) is proportional to W, which is often large (10^5–10^7 terms).
• We derived the gradient for the internal vectors (v on the board)

Calculating all gradients!
• We went through gradients for each center vector v in a window
• We also need gradients for external vectors v' (u on the board)
• Derive!
• Generally, in each window we will compute updates for all parameters that are being used in that window.
• For example, window size c = 1, sentence: "I like learning ."
• First window computes gradients for: internal vector v_like and external vectors v'_I and v'_learning
• Next window in that sentence?

Compute all vector gradients!
• We often define the set of ALL parameters in a model in terms of one long vector θ
• In our case, with d-dimensional vectors and V many words, θ stacks every v and v' into a single vector in R^{2dV}

Gradient Descent
• To minimize over the full batch (the entire training data) would require us to compute gradients for all windows
• Updates would be for each element of θ:
  θ_j^{new} = θ_j^{old} − α ∂J(θ)/∂θ_j^{old}
• With step size α
• In matrix notation, for all parameters:
  θ^{new} = θ^{old} − α ∇_θ J(θ)

Vanilla Gradient Descent Code
• (a sketch of this loop, together with its stochastic, sparse variant, appears after the next few slides)

Intuition
• For a simple convex function over two parameters.
• Contour lines show levels of the objective function
• See whiteboard

Stochastic Gradient Descent
• But the corpus may have 40B tokens and windows
• You would wait a very long time before making a single update!
• Very bad idea for pretty much all neural nets!
• Instead: we will update parameters after each window t → stochastic gradient descent (SGD)

Stochastic gradients with word vectors!
• But in each window, we only have at most 2c + 1 distinct words, so ∇_θ J_t(θ) is very sparse!

Stochastic gradients with word vectors!
• We may as well only update the word vectors that actually appear!
• Solution: either keep around a hash for word vectors, or only update certain columns of the full embedding matrices L and L' (each of size d × V)
• Important if you have millions of word vectors and do distributed computing: you do not want to send gigantic updates around.
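To make the last few slides concrete, here is a minimal NumPy sketch of a vanilla gradient descent loop and of the stochastic variant with sparse updates. It is not the course starter code: `grad_full`, `window_gradient`, and the calling conventions are hypothetical placeholders, and `L`, `L_prime` stand for the d × V "input" and "output" embedding matrices.

```python
import numpy as np

def gradient_descent(theta, grad_full, alpha=0.01, n_steps=100):
    """Vanilla batch gradient descent: one update per full pass over the data."""
    for _ in range(n_steps):
        # theta_new = theta_old - alpha * grad J(theta)
        theta = theta - alpha * grad_full(theta)
    return theta

def sgd_word2vec(L, L_prime, windows, window_gradient, alpha=0.025):
    """Stochastic gradient descent over windows with sparse updates.

    L, L_prime      : d x V embedding matrices ("input" and "output" vectors).
    windows         : iterable of (center_idx, context_idxs) pairs.
    window_gradient : hypothetical callback returning gradients only for the
                      vectors that appear in this window.
    """
    for center, context in windows:
        grad_center, grad_context = window_gradient(L, L_prime, center, context)
        # Only the columns that actually appear in the window are touched,
        # so we never materialize (or communicate) a full 2*d*V gradient.
        L[:, center] -= alpha * grad_center
        for j, idx in enumerate(context):
            L_prime[:, idx] -= alpha * grad_context[:, j]
    return L, L_prime
```

The second function is the point of the "sparse gradients" slides: per window, only a handful of columns of L and L' change, which is what makes distributed training feasible.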
Approximations: PSet 1
• The normalization factor of the softmax is too computationally expensive:
  p(w_O | w_I) = exp(v'_{w_O}^T v_{w_I}) / Σ_{w=1}^{W} exp(v'_w^T v_{w_I})
  This formulation is impractical because the cost of evaluating it is proportional to W, which is often large (10^5–10^7 terms).
• Hence, in PSet 1 you will implement the skip-gram model
• Main idea: train binary logistic regressions for a true pair (center word and a word in its context window) and a couple of random pairs (the center word with a random word)

PSet 1: The skip-gram model and negative sampling
• From the paper "Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al. 2013):
  • A computationally efficient approximation of the full softmax is the hierarchical softmax, first introduced for neural network language models by Morin and Bengio; instead of evaluating W output nodes, only about log2(W) nodes need to be evaluated.
  • Noise Contrastive Estimation (NCE), introduced by Gutmann and Hyvarinen and applied to language modeling by Mnih and Teh, posits that a good model should be able to differentiate data from noise by means of logistic regression. This is similar to the hinge loss used by Collobert and Weston, who trained models by ranking the data above noise.
  • While NCE can be shown to approximately maximize the log probability of the softmax, the Skip-gram model is only concerned with learning high-quality vector representations, so we are free to simplify NCE as long as the vector representations retain their quality. Negative sampling (NEG) is defined by the objective:
    log σ(v'_{w_O}^T v_{w_I}) + Σ_{i=1}^{k} E_{w_i ∼ P_n(w)} [ log σ(−v'_{w_i}^T v_{w_I}) ]
• Where k is the number of negative samples and we use the sigmoid function σ(x) = 1 / (1 + e^{−x}) (we'll become good friends soon)
• So we maximize the probability of two words co-occurring in the first log term →

PSet 1: The skip-gram model and negative sampling
• Slightly clearer notation: maximize the probability that the real outside word appears, and minimize the probability that random words appear around the center word
• P_n = U(w)^{3/4} / Z, the unigram distribution U(w) raised to the 3/4 power (we provide this function in the starter code)
• The power makes less frequent words be sampled more often
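A sketch (not the PSet starter code) of this objective for a single (center, outside) pair, with negatives drawn from the unigram distribution raised to the 3/4 power. The function and variable names are made up for illustration; for simplicity the sampler may occasionally draw the true outside word as a "negative".

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_sampler(unigram_counts):
    """Sampling distribution P_n(w) proportional to U(w)^{3/4}; boosts rarer words."""
    probs = np.asarray(unigram_counts, dtype=float) ** 0.75
    probs /= probs.sum()
    return lambda k: np.random.choice(len(probs), size=k, p=probs)

def neg_sampling_loss_and_grads(v_c, L_prime, o, sample, k=10):
    """Loss = -log sigma(v'_o^T v_c) - sum_i log sigma(-v'_{w_i}^T v_c), w_i ~ P_n(w).

    v_c: center word vector (d,), L_prime: output vectors (d, V), o: outside word index.
    """
    neg = sample(k)
    u_o, u_neg = L_prime[:, o], L_prime[:, neg]        # (d,), (d, k)
    pos_score = sigmoid(u_o @ v_c)
    neg_score = sigmoid(-u_neg.T @ v_c)                # (k,)
    loss = -np.log(pos_score) - np.log(neg_score).sum()

    # Gradients for minimizing the negated objective.
    grad_vc = (pos_score - 1.0) * u_o + u_neg @ (1.0 - neg_score)
    grad_uo = (pos_score - 1.0) * v_c
    grad_uneg = np.outer(v_c, 1.0 - neg_score)         # one column per negative sample
    return loss, grad_vc, grad_uo, grad_uneg
```

Per window this touches only the center vector, the true outside vector, and k sampled output vectors, which is exactly why negative sampling sidesteps the expensive softmax normalization.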
PSet 1: The continuous bag of words model
• Main idea for continuous bag of words (CBOW): predict the center word from the sum of the surrounding word vectors, instead of predicting surrounding single words from the center word as in the skip-gram model
• To make the PSet slightly easier: the implementation of the CBOW model is not required and is for bonus points!

What to do with the two sets of vectors?
• We end up with L and L' from all the vectors v and v'
• Both capture similar co-occurrence information. It turns out, the best solution is to simply sum them up: L_final = L + L'
• One of many hyperparameters explored in GloVe: Global Vectors for Word Representation (Pennington et al. 2014)

How to evaluate word vectors?
• Related to general evaluation in NLP: intrinsic vs. extrinsic
• Intrinsic:
  • Evaluation on a specific/intermediate subtask
  • Fast to compute
  • Helps to understand that system
  • Not clear if really helpful unless correlation to a real task is established
• Extrinsic:
  • Evaluation on a real task
  • Can take a long time to compute accuracy
  • Unclear if the subsystem is the problem, or its interaction with other subsystems
  • If replacing one subsystem with another improves accuracy → winning!

Intrinsic word vector evaluation
• Word vector analogies, syntactic and semantic: a:b :: c:?  e.g. man:woman :: king:?
• Evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions: find the word whose vector is most cosine-similar to x_b − x_a + x_c
• Discarding the input words from the search! (a code sketch of this evaluation appears after the example slides below)
• Problem: what if the information is there but not linear?

Intrinsic word vector evaluation
• Word vector analogies: syntactic and semantic examples from http://code.google.com/p/word2vec/source/browse/trunk/questions-words.txt
• city-in-state (problem: different cities may have the same name)
  Chicago Illinois Houston Texas
  Chicago Illinois Philadelphia Pennsylvania
  Chicago Illinois Phoenix Arizona
  Chicago Illinois Dallas Texas
  Chicago Illinois Jacksonville Florida
  Chicago Illinois Indianapolis Indiana
  Chicago Illinois Austin Texas
  Chicago Illinois Detroit Michigan
  Chicago Illinois Memphis Tennessee
  Chicago Illinois Boston Massachusetts

Intrinsic word vector evaluation
• capital-world (problem: capitals can change over time)
  Abuja Nigeria Accra Ghana
  Abuja Nigeria Algiers Algeria
  Abuja Nigeria Amman Jordan
  Abuja Nigeria Ankara Turkey
  Abuja Nigeria Antananarivo Madagascar
  Abuja Nigeria Apia Samoa
  Abuja Nigeria Ashgabat Turkmenistan
  Abuja Nigeria Asmara Eritrea
  Abuja Nigeria Astana Kazakhstan

Intrinsic word vector evaluation
• gram4-superlative
  bad worst big biggest
  bad worst bright brightest
  bad worst cold coldest
  bad worst cool coolest
  bad worst dark darkest
  bad worst easy easiest
  bad worst fast fastest
  bad worst good best
  bad worst great greatest

Intrinsic word vector evaluation
• gram7-past-tense
  dancing danced decreasing decreased
  dancing danced describing described
  dancing danced enhancing enhanced
  dancing danced falling fell
  dancing danced feeding fed
  dancing danced flying flew
  dancing danced generating generated
  dancing danced going went
  dancing danced hiding hid
  dancing danced hitting hit
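Here is a minimal sketch of how the "cosine similarity after addition" analogy test can be scored, excluding the three input words from the search. It assumes `vectors` is a V × d NumPy matrix of word vectors plus word/index lookup dicts; all names are placeholders, not the official evaluation script.

```python
import numpy as np

def analogy(word_a, word_b, word_c, vectors, word2idx, idx2word):
    """Answer a:b :: c:? by maximizing cosine similarity to x_b - x_a + x_c."""
    a, b, c = (word2idx[w] for w in (word_a, word_b, word_c))
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)

    norms = np.linalg.norm(vectors, axis=1)
    sims = vectors @ target / np.maximum(norms, 1e-8)  # cosine similarity to every word
    sims[[a, b, c]] = -np.inf                          # discard the input words from the search
    return idx2word[int(np.argmax(sims))]

# e.g. analogy("man", "woman", "king", vectors, word2idx, idx2word) should return "queen"
# if the relevant relation is encoded (approximately) linearly in the vector space.
```

The accuracy reported on questions-words.txt is simply the fraction of such questions answered exactly right.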
Analogy evaluation and hyperparameters
• Most careful analysis so far: GloVe word vectors (which also capture co-occurrence counts more directly than the skip-gram)
• Table 2: Results on the word analogy task, given as percent accuracy. ivLBL results are from Mnih et al. (2013); skip-gram (SG) and CBOW results are from Mikolov et al. (2013a,b); SG† and CBOW† were trained with the word2vec tool.

  Model   Dim.  Size  Sem.  Syn.  Tot.
  ivLBL   100   1.5B  55.9  50.1  53.2
  HPCA    100   1.6B   4.2  16.4  10.8
  GloVe   100   1.6B  67.5  54.3  60.3
  SG      300   1B    61    61    61
  CBOW    300   1.6B  16.1  52.6  36.1
  vLBL    300   1.5B  54.2  64.8  60.0
  ivLBL   300   1.5B  65.2  63.0  64.0
  GloVe   300   1.6B  80.8  61.5  70.3
  SVD     300   6B     6.3   8.1   7.3
  SVD-S   300   6B    36.7  46.6  42.1
  SVD-L   300   6B    56.6  63.0  60.1
  CBOW†   300   6B    63.6  67.4  65.7
  SG†     300   6B    73.0  66.0  69.1
  GloVe   300   6B    77.4  67.0  71.7
  CBOW    1000  6B    57.3  68.9  63.7
  SG      1000  6B    66.1  65.1  65.6
  SVD-L   300   42B   38.4  58.2  49.2
  GloVe   300   42B   81.9  69.3  75.0

Analogy evaluation and hyperparameters
• Asymmetric context (only words to the left) is not as good
• Best dimensionality is around 300, with a slight drop-off afterwards
• But this might be different for downstream tasks!
• Window size of 8 around each center word is good for GloVe vectors
• [Figure 2: accuracy on the analogy task as a function of vector size and window size/type; all models trained on the 6-billion-token corpus]

Analogy evaluation and hyperparameters
• More training time helps
• [Figure: analogy accuracy vs. training time, GloVe vs. CBOW and GloVe vs. Skip-Gram (iterations vs. number of negative samples)]

Analogy evaluation and hyperparameters
• More data helps; Wikipedia is better than news text!
• [Figure 3: accuracy on the analogy task for 300-dimensional vectors trained on different corpora: Wiki2010 (1B tokens), Wiki2014 (1.6B), Gigaword5 (4.3B), Gigaword5 + Wiki2014 (6B), Common Crawl (42B)]

Intrinsic word vector evaluation
• Word vector distances and their correlation with human judgments
• Example dataset: WordSim353 http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/

  Word 1     Word 2     Human (mean)
  tiger      cat         7.35
  tiger      tiger      10.00
  book       paper       7.46
  computer   internet    7.58
  plane      car         5.77
  professor  doctor      6.62
  stock      phone       1.62
  stock      CD          1.31
  stock      jaguar      0.92

Correlation evaluation
• Word vector distances and their correlation with human judgments (a sketch of this evaluation follows below)
• Some ideas from the GloVe paper have been shown to improve the skip-gram (SG) model as well (e.g. summing both sets of vectors)
• Table 3: Spearman rank correlation on word similarity tasks. All vectors are 300-dimensional; the CBOW* vectors are from the word2vec website and differ in that they contain phrase vectors.

  Model   Size  WS353  MC    RG    SCWS  RW
  SVD     6B    35.3   35.1  42.5  38.3  25.6
  SVD-S   6B    56.5   71.5  71.0  53.6  34.7
  SVD-L   6B    65.7   72.7  75.1  56.5  37.0
  CBOW†   6B    57.2   65.6  68.2  57.0  32.5
  SG†     6B    62.8   65.2  69.7  58.1  37.2
  GloVe   6B    65.8   72.7  77.8  53.9  38.1
  SVD-L   42B   74.0   76.4  74.1  58.3  39.9
  GloVe   42B   75.9   83.6  82.9  59.6  47.8
  CBOW*   100B  68.4   79.6  75.4  59.4  45.5
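A sketch of the word-similarity evaluation behind Table 3: compute a cosine similarity per word pair and report the Spearman rank correlation against the human ratings. `scipy.stats.spearmanr` is a real function; the data-loading names (`pairs`, `vectors`, `word2idx`) are placeholders.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_similarity_eval(pairs, vectors, word2idx):
    """pairs: list of (word1, word2, human_score) triples, e.g. from WordSim353."""
    model_scores, human_scores = [], []
    for w1, w2, human in pairs:
        if w1 in word2idx and w2 in word2idx:   # skip out-of-vocabulary pairs
            model_scores.append(cosine(vectors[word2idx[w1]], vectors[word2idx[w2]]))
            human_scores.append(human)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho

# Example rows from the WordSim353 slide:
pairs = [("tiger", "cat", 7.35), ("book", "paper", 7.46), ("stock", "jaguar", 0.92)]
```

Rank correlation is used rather than Pearson correlation because only the relative ordering of similarities needs to match the human judgments.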
But what about ambiguity?
• You may hope that one vector captures both kinds of information (run = verb and noun), but then the vector is pulled in different directions
• Alternative described in: Improving Word Representations Via Global Context And Multiple Word Prototypes (Huang et al. 2012)
• Idea: cluster word windows around words, then retrain with each word assigned to multiple different clusters: bank1, bank2, etc.

But what about ambiguity?
• Improving Word Representations Via Global Context And Multiple Word Prototypes (Huang et al. 2012)

Extrinsic word vector evaluation
• Extrinsic evaluation of word vectors: all subsequent tasks in this class
• One example where good word vectors should help directly: named entity recognition (finding a person, organization or location)
• Table 4: F1 score on the NER task with 50d vectors. Discrete is the baseline without word vectors; publicly available vectors are used for HPCA, HSMN, and CW.

  Model     Dev   Test  ACE   MUC7
  Discrete  91.0  85.4  77.4  73.4
  SVD       90.8  85.7  77.3  73.7
  SVD-S     91.0  85.5  77.6  74.3
  SVD-L     90.5  84.8  73.6  71.5
  HPCA      92.6  88.7  81.7  80.7
  HSMN      90.5  85.7  78.7  74.7
  CW        92.2  87.4  81.7  80.2
  CBOW      93.1  88.2  82.2  81.1
  GloVe     93.2  88.3  82.9  82.2

• Next: how to use word vectors in neural net models!

Simple single word classification
• What is the major benefit of deep learned word vectors?
• Ability to also classify words accurately
  • Countries cluster together → classifying location words should be possible with word vectors
• Incorporate any information into them for other tasks
  • Project sentiment into words to find the most positive/negative words in a corpus

The softmax
• Logistic regression = softmax classification on word vector x to obtain the probability for class y:
  p(y | x) = exp(W_y· x) / Σ_{c=1}^{C} exp(W_c· x)
• Generalizes to more than 2 classes (for just binary classification, a sigmoid unit would suffice, as in skip-gram)

The softmax – details
• Terminology: loss function = cost function = objective function
• Loss for softmax: cross entropy
• To compute p(y|x): first take the y'th row of W and multiply that row with x: f_y = W_y· x = Σ_i W_{yi} x_i
• Compute all f_c for c = 1, …, C
• Normalize to obtain the probability with the softmax function: p(y|x) = exp(f_y) / Σ_{c=1}^{C} exp(f_c) (a NumPy sketch of this computation appears after the cross-entropy slides below)

The softmax and cross-entropy error
• The loss wants to maximize the probability of the correct class y
• Hence, we minimize the negative log probability of that class: −log p(y|x)
• As before: we sum up multiple cross-entropy errors if we have multiple classifications in our total error function over the corpus (more next lecture)

Background: The cross entropy error
• Assuming a ground truth (or gold or target) probability distribution that is 1 at the right class and 0 everywhere else, p = [0, …, 0, 1, 0, …, 0], and our computed probability is q, then the cross entropy is:
  H(p, q) = −Σ_{c=1}^{C} p(c) log q(c)
• Because of the one-hot p, the only term left is the negative log probability of the true class
• Cross entropy can be re-written in terms of the entropy and the Kullback-Leibler divergence between the two distributions:
  H(p, q) = H(p) + D_KL(p ‖ q)
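A minimal sketch of the softmax classification and cross-entropy loss just described, for a single word vector x. Here W is a C × d weight matrix and y the index of the correct class; the function names are illustrative, not part of the PSet code.

```python
import numpy as np

def softmax(f):
    f = f - np.max(f)        # shift for numerical stability; does not change the probabilities
    e = np.exp(f)
    return e / e.sum()

def cross_entropy_loss(W, x, y):
    """p(y|x) = exp(W_y . x) / sum_c exp(W_c . x);  loss = -log p(y|x)."""
    f = W @ x                # f_c = W_c . x for c = 1..C
    probs = softmax(f)
    return -np.log(probs[y]), probs

# With a one-hot target p, the cross entropy H(p, q) = -sum_c p_c log q_c
# reduces to -log q_y, which is exactly what cross_entropy_loss returns.
```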
The KL divergence
• Cross entropy: H(p, q) = H(p) + D_KL(p ‖ q)
• Because the entropy H(p) is zero in our case with a one-hot p (and even if it weren't, it would be fixed and have no contribution to the gradient), minimizing the cross entropy is equal to minimizing the KL divergence
• The KL divergence is not a distance but a non-symmetric measure of the difference between two probability distributions p and q:
  D_KL(p ‖ q) = Σ_{c=1}^{C} p(c) log ( p(c) / q(c) )

PSet 1
• Derive the gradient of the cross-entropy error with respect to the input word vector x and the matrix W

Simple single word classification
• Example: sentiment
• Two options: train only the softmax weights W and fix the word vectors, or also train the word vectors
• Question: what are the advantages and disadvantages of training the word vectors? (see the sketch below)
  • Pro: better fit on the training data
  • Con: worse generalization because the words move in the vector space
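Relating to the PSet 1 derivation and the sentiment example above, here is a sketch of one SGD step for the softmax classifier, with a flag for whether the word vector itself is also trained. The gradient expressions are the standard result for the cross-entropy loss (δ = predicted probabilities minus the one-hot target), which you should still verify yourself by differentiating; the names are illustrative only.

```python
import numpy as np

def sgd_step(W, x, y, alpha=0.1, train_word_vector=False):
    """One SGD step on -log p(y|x) for a softmax classifier over a word vector x.

    With delta = q - p (predicted probabilities minus one-hot target):
      dJ/dW = outer(delta, x)   and   dJ/dx = W^T delta.
    """
    f = W @ x
    q = np.exp(f - np.max(f))
    q /= q.sum()
    delta = q.copy()
    delta[y] -= 1.0                       # q - p with one-hot p

    grad_W = np.outer(delta, x)
    grad_x = W.T @ delta

    W = W - alpha * grad_W                # always train the softmax weights
    if train_word_vector:
        # Fine-tuning: better fit on the training data, but words seen in training
        # move away from unseen words, which can hurt generalization.
        x = x - alpha * grad_x
    return W, x
```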
Visualization of sentiment trained word vectors

Next level up: Window classification
• Single word classification has no context!
• Let's add context by taking in windows and classifying the center word of that window!
• Possible: softmax and cross-entropy error, or max-margin loss
• Next class!

References