CS 224D: Deep Learning for NLP — Midterm Review
Neural Networks: Terminology, Forward Pass, Backpropagation
Rohit Mundra, May 4, 2015

Overview
• Neural network example
• Terminology
• Example 1: forward pass; backpropagation using the chain rule; what is delta? From chain rule to modular error flow
• Example 2: forward pass; backpropagation

Neural Networks
• One of many different types of non-linear classifiers (i.e., they lead to non-linear decision boundaries)
• The most common design stacks affine transformations, each followed by a point-wise (element-wise) non-linearity

An example of a neural network
• This is a 4-layer neural network.
• Equivalently, a 2-hidden-layer neural network.
• A 2-10-10-3 neural network (the complete architecture definition).

Our first example
[Figure: inputs x1–x4 feed layer-1 nodes z1(1)–z4(1) with activations a1(1)–a4(1); these feed two layer-2 nodes z1(2), z2(2) with activations a1(2), a2(2); these feed a single layer-3 node z1(3) with activation a1(3), which is the output s.]
• This is a 3-layer neural network
• Equivalently, a 1-hidden-layer neural network

Our first example: Terminology
• Model input: x1, x2, x3, x4
• Model output: s
• Activation units: the a values at each layer

Our first example: Activation unit terminology
• We draw a single node for z1(2) → a1(2), but what is actually going on is an affine combination followed by a non-linearity:
  z1(2) = W11(1)a1(1) + W12(1)a2(1) + W13(1)a3(1) + W14(1)a4(1)
  a1(2) = σ(z1(2))
• a1(2) is the 1st activation unit of layer 2.

Our first example: Forward pass
• The input layer passes the input straight through:
  z1(1) = x1, z2(1) = x2, z3(1) = x3, z4(1) = x4
  a1(1) = z1(1), a2(1) = z2(1), a3(1) = z3(1), a4(1) = z4(1)
• Layer-2 pre-activations:
  z1(2) = W11(1)a1(1) + W12(1)a2(1) + W13(1)a3(1) + W14(1)a4(1)
  z2(2) = W21(1)a1(1) + W22(1)a2(1) + W23(1)a3(1) + W24(1)a4(1)
• In matrix form, with W(1) the 2×4 matrix of entries Wij(1):
  z(2) = W(1)a(1)        (affine transformation)
  a(2) = σ(z(2))         (point-wise/element-wise non-linearity)
  z(3) = W(2)a(2)        (affine transformation)
  a(3) = z(3), and the model output is s = a(3)
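As a concrete illustration, here is a minimal NumPy sketch of this forward pass, assuming a 4-2-1 network (4 inputs, 2 hidden sigmoid units, 1 linear output) with random weights and no biases, as on the slides; the names W1, W2, s are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.RandomState(0)
W1 = rng.randn(2, 4)   # W(1): layer 1 -> layer 2
W2 = rng.randn(1, 2)   # W(2): layer 2 -> layer 3

x = rng.randn(4)

a1 = x                 # a(1) = z(1) = x (the input layer is the identity)
z2 = W1.dot(a1)        # affine transformation: z(2) = W(1) a(1)
a2 = sigmoid(z2)       # point-wise non-linearity: a(2) = sigma(z(2))
z3 = W2.dot(a2)        # affine transformation: z(3) = W(2) a(2)
s = z3                 # a(3) = z(3), s = a(3)
```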
Our first example: Backpropagation using the chain rule
• Let us try to calculate the error gradient w.r.t. W14(1); that is, we want ∂s/∂W14(1).
• Since a(3) = z(3) and s = a(3), the derivative ∂s/∂z1(3) is simply 1, so
  ∂s/∂W14(1) = ∂z1(3)/∂W14(1) = ∂(W11(2)a1(2) + W12(2)a2(2))/∂W14(1)
• Only a1(2) depends on W14(1), so
  ∂s/∂W14(1) = W11(2) ∂a1(2)/∂W14(1) = W11(2) σ′(z1(2)) ∂z1(2)/∂W14(1)
• Finally ∂z1(2)/∂W14(1) = a4(1), giving
  ∂s/∂W14(1) = W11(2) σ′(z1(2)) a4(1) = δ1(2) a4(1), where δ1(2) = W11(2) σ′(z1(2))

Our first example: Backpropagation observations
• We got the error gradient w.r.t. W14(1). We required:
  • the signal forwarded by W14(1), which is a4(1)
  • the error propagating backwards, W11(2)
  • the local gradient, σ′(z1(2))
• We can do this for all of W(1) at once as an outer product:
  ∂s/∂W(1) = [ δ1(2)a1(1)  δ1(2)a2(1)  δ1(2)a3(1)  δ1(2)a4(1)
               δ2(2)a1(1)  δ2(2)a2(1)  δ2(2)a3(1)  δ2(2)a4(1) ]
           = [δ1(2); δ2(2)] [a1(1) a2(1) a3(1) a4(1)] = δ(2) a(1)ᵀ

Our first example: Let us define δ
• Recall the forward pass at this node: z1(2) → σ → a1(2).
• In backpropagation, δ1(2) is the error flowing backwards at the same point where z1(2) passed forwards. Thus it is simply the gradient of the error w.r.t. z1(2).

Our first example: Backpropagation using error vectors
The chain rule of differentiation boils down to two very simple patterns in error backpropagation:
1. An error x flowing backwards passes through a neuron by getting amplified by the local gradient: δ = σ′(z)x.
2. An error δ that needs to go back through an affine transformation distributes itself the way the signal was combined in the forward pass: the forward contributions a1w1, a2w2, a3w3 become backward flows δw1, δw2, δw3.
(In the original figure, orange = backprop and green = forward pass.)
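To make the outer-product form concrete, here is a self-contained NumPy sketch of the same gradients for the 4-2-1 example; the weights are random and the names delta2, grad_W1 are introduced here, not taken from the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.RandomState(0)
W1, W2 = rng.randn(2, 4), rng.randn(1, 2)    # W(1), W(2) for the 4-2-1 example
x = rng.randn(4)

a1 = x
z2 = W1.dot(a1)
a2 = sigmoid(z2)

# delta(2): error arriving at z(2) = (error from above) * (local gradient sigma'(z(2)))
delta2 = W2.flatten() * a2 * (1 - a2)        # delta_i(2) = W_1i(2) * sigma'(z_i(2))

# Gradient of s w.r.t. the single weight W14(1):
ds_dW14 = delta2[0] * a1[3]                  # = W11(2) * sigma'(z1(2)) * a4(1)

# Gradient w.r.t. all of W(1), as the outer product from the slide:
grad_W1 = np.outer(delta2, a1)               # delta(2) a(1)^T, shape (2, 4)
```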
Our first example: Backpropagation using error vectors
The network as a pipeline: x = z(1) → a(1) → [W(1)] → z(2) → σ → a(2) → [W(2)] → z(3) → s
• δ(3) is the error vector at z(3) (this is for a softmax output).
• Gradient w.r.t. W(2) = δ(3) a(2)ᵀ
• Reuse δ(3) for the downstream updates: moving an error vector back across an affine transformation simply requires multiplication with the transpose of the forward matrix, giving W(2)ᵀ δ(3). Notice that the dimensions line up perfectly too.
• Moving an error vector back across a point-wise non-linearity requires point-wise multiplication with the local gradient of the non-linearity:
  δ(2) = σ′(z(2)) ∘ W(2)ᵀ δ(3)
• Gradient w.r.t. W(1) = δ(2) a(1)ᵀ

Our second example (4-layer network): Backpropagation using error vectors
Pipeline: x = z(1) → a(1) → [W(1)] → z(2) → σ → a(2) → [W(2)] → z(3) → σ → a(3) → [W(3)] → z(4) → softmax → yp
• δ(4) = yp − y
• Grad W(3) = δ(4) a(3)ᵀ
• δ(3) = σ′(z(3)) ∘ W(3)ᵀ δ(4)
• Grad W(2) = δ(3) a(2)ᵀ
• δ(2) = σ′(z(2)) ∘ W(2)ᵀ δ(3)
• Grad W(1) = δ(2) a(1)ᵀ
• Gradient w.r.t. the input vector = W(1)ᵀ δ(2)
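The same modular pattern, written out for the 4-layer example as a hedged NumPy sketch; the layer widths, the one-hot target y, and all variable names are assumptions for illustration, not part of the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.RandomState(1)
W1 = rng.randn(5, 4)                     # W(1)
W2 = rng.randn(5, 5)                     # W(2)
W3 = rng.randn(3, 5)                     # W(3)

x = rng.randn(4)
y = np.zeros(3); y[0] = 1.0              # illustrative one-hot target

# Forward pass
a1 = x
z2 = W1.dot(a1); a2 = sigmoid(z2)
z3 = W2.dot(a2); a3 = sigmoid(z3)
z4 = W3.dot(a3); yp = softmax(z4)

# Backward pass: propagate error vectors delta
d4 = yp - y                              # delta(4) for softmax + cross-entropy
grad_W3 = np.outer(d4, a3)               # Grad W(3) = delta(4) a(3)^T
d3 = a3 * (1 - a3) * W3.T.dot(d4)        # delta(3) = sigma'(z(3)) o W(3)^T delta(4)
grad_W2 = np.outer(d3, a2)               # Grad W(2) = delta(3) a(2)^T
d2 = a2 * (1 - a2) * W2.T.dot(d3)        # delta(2) = sigma'(z(2)) o W(2)^T delta(3)
grad_W1 = np.outer(d2, a1)               # Grad W(1) = delta(2) a(1)^T
grad_x = W1.T.dot(d2)                    # gradient w.r.t. the input vector
```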
CS224D Midterm Review
Ian Tenney, May 4, 2015

Outline
- Backpropagation (continued)
- RNN structure
- RNN backpropagation
- Backprop on a DAG
- Example: Gated Recurrent Units (GRUs)
- GRU backpropagation

Basic RNN Structure
[Figure: at each timestep, the input x(t) and the previous hidden state h(t−1) feed h(t), which produces the prediction ŷ(t).]
- Basic RNN ("Elman network")
- You've seen this on Assignment #2 (and also in Lecture #5)
- Two layers between input and prediction, plus the recurrent hidden state:
  h(t) = sigmoid(H h(t−1) + W x(t) + b1)
  ŷ(t) = softmax(U h(t) + b2)

Unrolled RNN
- It helps to think of this as an "unrolled" network: distinct nodes for each timestep.
- Just do backprop on this! Then combine the gradients of shared parameters.

Backprop on RNN
- Usual cross-entropy loss (k classes):
  P(y(t) = j | x(t), …, x(1)) = ŷj(t)
  J(t)(θ) = − Σj=1..k yj(t) log ŷj(t)
- Just do backprop on this! At the current timestep (s = 0) we need:
  ∂J(t)/∂U, ∂J(t)/∂b2, ∂J(t)/∂h(t), ∂J(t)/∂W, ∂J(t)/∂H, ∂J(t)/∂x(t)
- Going back in time (s = 1, 2, …, τ − 1) we also need:
  ∂J(t)/∂H and ∂J(t)/∂W at step t−s, plus ∂J(t)/∂h(t−s) and ∂J(t)/∂x(t−s)
- Yuck, that's a lot of math! Actually, it's not so bad. Solution: error vectors (δ).

Making sense of the madness
- Chain rule to the rescue!
- a(t) = U h(t) + b2,  ŷ(t) = softmax(a(t))
- The gradient is the transpose of the Jacobian:
  ∇a J = (∂J(t)/∂a(t))ᵀ = ŷ(t) − y(t) = δ(2)(t) ∈ R^(k×1)
- Now the dimensions work out:
  ∂J(t)/∂a(t) · ∂a(t)/∂b2 = (δ(2)(t))ᵀ ∈ R^((1×k)·(k×k)) = R^(1×k)
- Matrix dimensions get weird for U: ∂a(t)/∂U ∈ R^(k×(k×Dh))
- But we don't need fancy tensors:
  ∇U J(t) = (∂J(t)/∂a(t) · ∂a(t)/∂U)ᵀ = δ(2)(t) (h(t))ᵀ ∈ R^(k×Dh)
- NumPy: self.grads.U += outer(d2, hs[t])

Going deeper
- We really just need one simple pattern:
  z(t) = H h(t−1) + W x(t) + b1,  h(t) = f(z(t))
- Compute the error delta for s = 0, 1, 2, …:
  - From the top: δ(t) = h(t) ∘ (1 − h(t)) ∘ Uᵀ δ(2)(t)
  - Deeper: δ(t−s) = h(t−s) ∘ (1 − h(t−s)) ∘ Hᵀ δ(t−s+1)
- These are just chain-rule expansions:
  ∂J(t)/∂z(t) = ∂J(t)/∂a(t) · ∂a(t)/∂h(t) · ∂h(t)/∂z(t) = (δ(t))ᵀ
  ∂J(t)/∂b1 |(t) = ∂J(t)/∂a(t) · ∂a(t)/∂h(t) · ∂h(t)/∂z(t) · ∂z(t)/∂b1 = (δ(t))ᵀ ∂z(t)/∂b1
  ∂J(t)/∂H |(t) = (δ(t))ᵀ ∂z(t)/∂H
  ∂J(t)/∂z(t−1) = (δ(t))ᵀ ∂z(t)/∂h(t−1) · ∂h(t−1)/∂z(t−1)
- And there are shortcuts for them too:
  (∂J(t)/∂b1 |(t))ᵀ = δ(t)
  (∂J(t)/∂H |(t))ᵀ = δ(t) (h(t−1))ᵀ
  (∂J(t)/∂z(t−1))ᵀ = h(t−1) ∘ (1 − h(t−1)) ∘ Hᵀ δ(t) = δ(t−1)
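A sketch of these delta recursions as truncated backprop through time for a single loss J(t); the container names (xs, hs, grads) and the function signature are assumptions in the spirit of Assignment #2, not its exact interface:

```python
import numpy as np

def backprop_one_timestep(t, nsteps, xs, hs, y_true, yhat, H, W, U, grads):
    """Accumulate gradients of J(t) into the dict `grads`, going `nsteps` back in time.

    hs[t] is h(t) (hs[t-1] etc. must exist for the steps used), xs[t] is x(t),
    y_true is the one-hot label, and yhat = softmax(U h(t) + b2).
    """
    d2 = yhat - y_true                         # delta(2)(t) = yhat(t) - y(t)
    grads['U'] += np.outer(d2, hs[t])          # grad_U J(t) = delta(2)(t) h(t)^T
    grads['b2'] += d2

    # From the top: delta(t) = h(t) o (1 - h(t)) o U^T delta(2)(t)
    delta = hs[t] * (1 - hs[t]) * U.T.dot(d2)
    for s in range(nsteps):                    # s = 0, 1, 2, ...
        grads['b1'] += delta                   # (dJ(t)/db1 at step t-s)^T = delta(t-s)
        grads['H'] += np.outer(delta, hs[t - s - 1])   # delta(t-s) h(t-s-1)^T
        grads['W'] += np.outer(delta, xs[t - s])       # delta(t-s) x(t-s)^T
        # Deeper: delta(t-s-1) = h(t-s-1) o (1 - h(t-s-1)) o H^T delta(t-s)
        delta = hs[t - s - 1] * (1 - hs[t - s - 1]) * H.T.dot(delta)
    return grads
```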
Backprop on a DAG — Example: Gated Recurrent Units (GRUs)

Motivation
- Gated units with "reset" and "output" gates
- Reduce problems with vanishing gradients
[Figure: GRU unit diagram, captioned "You are likely to be eaten by a GRU." (Figure from Chung et al., 2014)]

Intuition
- Gates zi and ri for each hidden-layer neuron, with zi, ri ∈ [0, 1]
- h̃ is a "candidate" hidden layer
- h̃, z, r all depend on x(t) and h(t−1)
- h(t) is h(t−1) mixed with h̃(t)

Equations
- z(t) = σ(Wz x(t) + Uz h(t−1))
- r(t) = σ(Wr x(t) + Ur h(t−1))
- h̃(t) = tanh(W x(t) + r(t) ∘ U h(t−1))
- h(t) = z(t) ∘ h(t−1) + (1 − z(t)) ∘ h̃(t)
- Optionally these can have biases; omitted for clarity.
- Same equations as Lecture 8, subscripts/superscripts as in Assignment #2.

Backpropagation: multi-path chain rule to compute ∂J/∂x(t)
- Start with δ(t) = (∂J/∂h(t))ᵀ ∈ R^d
- h(t) = z(t) ∘ h(t−1) + (1 − z(t)) ∘ h̃(t)
- Expand the chain rule into a sum (a.k.a. the product rule):
  ∂J/∂x(t) = ∂J/∂h(t) · [ z(t) ∘ ∂h(t−1)/∂x(t) + ∂z(t)/∂x(t) ∘ h(t−1) ]
           + ∂J/∂h(t) · [ (1 − z(t)) ∘ ∂h̃(t)/∂x(t) + ∂(1 − z(t))/∂x(t) ∘ h̃(t) ]

It gets (a little) better
- Drop the terms that don't depend on x(t):
  ∂J/∂x(t) = ∂J/∂h(t) · [ ∂z(t)/∂x(t) ∘ h(t−1) + (1 − z(t)) ∘ ∂h̃(t)/∂x(t) ]
           − ∂J/∂h(t) · ∂z(t)/∂x(t) ∘ h̃(t)

Almost there!
- Now we really just need to compute two things:
- Output gate:
  ∂z(t)/∂x(t) = z(t) ∘ (1 − z(t)) ∘ Wz
- Candidate h̃:
  ∂h̃(t)/∂x(t) = (1 − (h̃(t))²) ∘ W + (1 − (h̃(t))²) ∘ ∂r(t)/∂x(t) ∘ U h(t−1)
- Ok, I lied — there's a third. Don't forget to check all paths!
- Last one:
  ∂r(t)/∂x(t) = r(t) ∘ (1 − r(t)) ∘ Wr
- Now we can just add things up! (I'll spare you the pain... see the sketch below.)
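A hedged NumPy sketch of the three pieces and their sum for ∂J/∂x(t). The slide's element-wise expressions like z ∘ (1 − z) ∘ Wz are applied here as the Jacobian transpose (Wzᵀ acting on an element-wise weighted error); the function and variable names are mine:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_grad_x(delta, x, h_prev, Wz, Uz, Wr, Ur, W, U):
    """Gradient of J w.r.t. x(t) for one GRU step (biases omitted, as on the slides).

    delta is (dJ/dh(t))^T as a length-d vector; parameter names mirror the equations above.
    """
    # Recompute the forward quantities (in practice these would be cached).
    z = sigmoid(Wz.dot(x) + Uz.dot(h_prev))          # output/update gate z(t)
    r = sigmoid(Wr.dot(x) + Ur.dot(h_prev))          # reset gate r(t)
    h_tilde = np.tanh(W.dot(x) + r * U.dot(h_prev))  # candidate hidden state

    # Path 1: through z(t); the two z-terms combine into the (h_prev - h_tilde) factor.
    path_z = Wz.T.dot(delta * (h_prev - h_tilde) * z * (1 - z))
    # Path 2: h_tilde's direct dependence on x via W (tanh local gradient 1 - h_tilde^2).
    path_h = W.T.dot(delta * (1 - z) * (1 - h_tilde ** 2))
    # Path 3: h_tilde also depends on x through the reset gate r(t).
    path_r = Wr.T.dot(delta * (1 - z) * (1 - h_tilde ** 2) * U.dot(h_prev) * r * (1 - r))

    return path_z + path_h + path_r                  # dJ/dx(t), same length as x
```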
Whew.
- Why three derivatives? There are three arrows from x(t) to distinct nodes, and four paths total (∂z(t)/∂x(t) appears twice).
- GRUs are complicated, but all the pieces are simple: the same matrix gradients you've seen before.

Summary
- Check your dimensions!
- Write error vectors δ; they are just parentheses around the chain rule.
- Combine simple operations to make a complex network:
  - matrix-vector products
  - activation functions (tanh, sigmoid, softmax)
- Good luck on Wednesday! May the 4th be with you!

CS 224d Midterm Review (word vectors)
Peng Qi, May 5, 2015

Outline
- word2vec and GloVe revisited
  - Skip-gram revisited
  - (Optional) CBOW and its connection to Skip-gram
  - (Optional) word2vec as matrix factorization (conceptually)
  - GloVe vs. word2vec
- word2vec with backpropagation

Skip-gram
[Figure 1 from Mikolov et al.: the CBOW architecture predicts the current word w(t) from its context words, and the Skip-gram architecture predicts the surrounding words w(t−2), w(t−1), w(t+1), w(t+2) given the current word. All word2vec figures are from http://arxiv.org/pdf/1301.3781.pdf]
- Task: given a center word, predict its context words.
- For each word w, we have an "input vector" vw and an "output vector" v′w.

Skip-gram costs
- We have seen two types of costs for an expected word wi given a vector prediction r:
  CE(wi | r) = − log [ exp(rᵀ v′wi) / Σj=1..|V| exp(rᵀ v′wj) ]
  NEG(wi | r) = − log σ(rᵀ v′wi) − Σk=1..K log σ(−rᵀ v′wk)
- In the case of skip-gram, the vector prediction r is just the "input vector" of the center word, vwi. σ(·) is the sigmoid (logistic) function.

Skip-gram window cost
- Now we have all the pieces of skip-gram. The cost for a context window [wi−C, …, wi−1, wi, wi+1, …, wi+C], where wi is the center word, is
  Jskip-gram([wi−C, …, wi+C]) = Σ over i−C ≤ j ≤ i+C, j ≠ i of F(wj | vwi)
  where F is one of the cost functions defined on the previous slide.
- You might ask: but why are we introducing so many notations?

Skip-gram vs. CBOW
                  Skip-gram                 CBOW
  Task            Center word → Context     Context → Center word
  Prediction r    vwi                       f(vwi−C, …, vwi−1, vwi+1, …, vwi+C)
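A small self-contained sketch of the two costs and the window cost above, assuming the input/output vectors are stored as rows of matrices V_in, V_out and that negative samples are drawn elsewhere; these names and helpers are assumptions, not the assignment's API:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ce_cost(r, V_out, target):
    """Softmax cross-entropy cost CE(w_target | r) over the whole vocabulary."""
    scores = V_out.dot(r)                       # r^T v'_wj for every word j
    scores -= scores.max()                      # for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[target]

def neg_cost(r, V_out, target, neg_samples):
    """Negative-sampling cost NEG(w_target | r) with K pre-sampled negative indices."""
    cost = -np.log(sigmoid(V_out[target].dot(r)))
    for k in neg_samples:
        cost -= np.log(sigmoid(-V_out[k].dot(r)))
    return cost

def skipgram_window_cost(center, context, V_in, V_out, cost_fn=ce_cost):
    """Skip-gram window cost: sum of F(w_j | v_center) over the context words.

    For the NEG cost, pass e.g. cost_fn=lambda r, V, j: neg_cost(r, V, j, neg_samples).
    """
    r = V_in[center]                            # prediction r = input vector of the center word
    return sum(cost_fn(r, V_out, j) for j in context)
```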
word2vec as matrix factorization (conceptually)
- Matrix factorization: M (n×n) ≈ Aᵀ B, with A, B ∈ R^(k×n), so that Mij ≈ aiᵀ bj.
- Imagine M is a matrix of counts of co-occurring events, but we only get to observe the co-occurrences one at a time. E.g.
    M = [ 1 0 4
          0 0 2
          1 3 0 ]
  but we only see (1,1), (2,3), (3,2), (2,3), (1,3), …
- Whenever we see a pair (i, j) co-occur, we try to increase aiᵀ bj.
- We also try to make all the other inner products smaller, to account for pairs never observed (or not observed yet), by decreasing a¬iᵀ bj and aiᵀ b¬j.
- Remember from lecture that the word co-occurrence matrix usually captures the semantic meaning of a word? For word2vec models, roughly speaking, M is the windowed word co-occurrence matrix, A is the output-vector matrix, and B is the input-vector matrix.
- Why not just use one set of vectors? That is equivalent to setting A = B in this formulation, but having fewer constraints is usually easier for optimization.

GloVe vs. word2vec
                                        Direct prediction (word2vec)   GloVe
  Fast training                         No                             Yes
  Efficient usage of statistics         No*                            Yes
  Quality affected by size of corpora   Yes                            No
  Captures complex patterns             Yes                            Yes
  Scales with size of corpus            Yes
  * Skip-gram and CBOW are qualitatively different when it comes to smaller corpora.

word2vec with backpropagation
- CE(wi | r) = − log [ exp(rᵀ v′wi) / Σj=1..|V| exp(rᵀ v′wj) ]
- Equivalently, CE(wi | r) = CE(ŷ, yi) with ŷ = softmax(θ) and θ = (V′)ᵀ r
- δ = ∂CE/∂θ = ŷ − yi
- ∂CE/∂V′ = r δᵀ
- ∂CE/∂r = V′ δ
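These lines translate almost directly into NumPy; a sketch, assuming V′ is stored as a d×|V| matrix Vp whose columns are the output vectors (the names Vp, grad_Vp, grad_r are mine):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ce_grads(r, Vp, target):
    """Gradients of CE(w_target | r) w.r.t. the output vectors V' and the prediction r."""
    theta = Vp.T.dot(r)            # theta = (V')^T r, one score per vocabulary word
    yhat = softmax(theta)
    y = np.zeros_like(yhat)
    y[target] = 1.0
    delta = yhat - y               # delta = dCE/dtheta = yhat - y_i
    grad_Vp = np.outer(r, delta)   # dCE/dV' = r delta^T, same shape as V'
    grad_r = Vp.dot(delta)         # dCE/dr = V' delta
    return grad_Vp, grad_r
```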
Thanks for your attention and best of luck with the mid-term! Any questions?