
CS 224D: DEEP LEARNING FOR NLP
MIDTERM REVIEW
Neural Networks: Terminology, Forward Pass, Backpropagation.
Rohit Mundra
May 4, 2015
Overview
•  Neural Network Example
•  Terminology
•  Example 1:
   •  Forward Pass
   •  Backpropagation Using Chain Rule
   •  What is delta? From Chain Rule to Modular Error Flow
•  Example 2:
   •  Forward Pass
   •  Backpropagation
Neural Networks
•  One of many different types of non-linear classifiers (i.e. they lead to non-linear decision boundaries)
•  The most common design involves stacking affine transformations, each followed by a point-wise (element-wise) non-linearity
An example of a neural network
•  This is a 4-layer neural network
•  i.e. a 2-hidden-layer neural network
•  a 2-10-10-3 neural network (a complete architecture definition)
Our first example
[Figure: inputs x1, x2, x3, x4 feed the layer-1 units (z1(1), a1(1)) through (z4(1), a4(1)); these feed two layer-2 units (z1(2), a1(2)) and (z2(2), a2(2)); these feed a single layer-3 unit (z1(3), a1(3)), whose output is the score s]
•  This is a 3-layer neural network
•  i.e. a 1-hidden-layer neural network
Our first example: Terminology
[Figure: the same network, with the inputs x1, ..., x4 labeled "Model Input" and the output s labeled "Model Output"]
Our first example: Terminology
[Figure: the same network, now additionally labeling the a units as "Activation Units"]
Our first example: Activation Unit Terminology
We draw a single node for (z1(2), a1(2)), but what is actually going on is a weighted sum followed by the non-linearity σ:
z1(2) = W11(1)a1(1) + W12(1)a2(1) + W13(1)a3(1) + W14(1)a4(1)
a1(2) = σ(z1(2))
a1(2) is the 1st activation unit of layer 2.
Our first example: Forward Pass
z1(1) = x1
z2(1) = x2
z3(1) = x3
z4(1) = x4
Our first example: Forward Pass
a1(1) = z1(1)
a2(1) = z2(1)
a3(1) = z3(1)
a4(1) = z4(1)
Our first example: Forward Pass
z1(2) = W11(1)a1(1) + W12(1)a2(1) + W13(1)a3(1) + W14(1)a4(1)
z2(2) = W21(1)a1(1) + W22(1)a2(1) + W23(1)a3(1) + W24(1)a4(1)
Our first example: Forward Pass
In matrix form:
[ z1(2) ]   [ W11(1)  W12(1)  W13(1)  W14(1) ]   [ a1(1) ]
[ z2(2) ] = [ W21(1)  W22(1)  W23(1)  W24(1) ] . [ a2(1) ]
                                                 [ a3(1) ]
                                                 [ a4(1) ]
Our first example: Forward Pass
z(2) = W(1)a(1)    (affine transformation)
Our first example: Forward Pass
a(2) = σ(z(2))    (point-wise/element-wise non-linearity)
Our first example: Forward Pass
z(3) = W(2)a(2)    (affine transformation)
Our first example: Forward Pass
a(3) = z(3)
s = a(3)
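To tie the pieces together, here is a minimal NumPy sketch of this forward pass. The concrete sizes (4 inputs, 2 hidden units, 1 output score), the random weights, and the helper names are assumptions for illustration only; biases are omitted, as in the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1 = rng.standard_normal((2, 4))   # W(1): 2 hidden units, 4 inputs (assumed sizes)
W2 = rng.standard_normal((1, 2))   # W(2): 1 output score, 2 hidden units

def forward(x):
    z1 = x                 # z(1) = x
    a1 = z1                # a(1) = z(1)            (identity at the input layer)
    z2 = W1 @ a1           # z(2) = W(1)a(1)        (affine transformation)
    a2 = sigmoid(z2)       # a(2) = sigmoid(z(2))   (point-wise non-linearity)
    z3 = W2 @ a2           # z(3) = W(2)a(2)        (affine transformation)
    s = z3                 # a(3) = z(3), s = a(3)  (score)
    return z1, a1, z2, a2, z3, s

x = np.array([1.0, 2.0, 3.0, 4.0])
print(forward(x)[-1])
```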
Our first example: Backpropagation using chain rule
Let us try to calculate the error gradient wrt W14(1). Thus we want to find:
∂s/∂W14(1)
Recall from the forward pass that s = a1(3) = z1(3), so ∂s/∂z1(3) is simply 1. Applying the chain rule:
∂s/∂W14(1) = ∂z1(3)/∂W14(1)
           = ∂/∂W14(1) [ W11(2)a1(2) + W12(2)a2(2) ]
           = W11(2) ∂a1(2)/∂W14(1)
           = W11(2) σ'(z1(2)) ∂z1(2)/∂W14(1)
           = W11(2) σ'(z1(2)) a4(1)
We will call the first part, W11(2) σ'(z1(2)), the error δ1(2), so that ∂s/∂W14(1) = δ1(2) a4(1).
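As a sanity check on this derivation, here is a small hedged sketch that compares the analytic gradient W11(2) σ'(z1(2)) a4(1) with a finite-difference estimate of ∂s/∂W14(1); all shapes, weights, and inputs are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
W1 = rng.standard_normal((2, 4))   # W(1)
W2 = rng.standard_normal((1, 2))   # W(2)
a1 = rng.standard_normal(4)        # a(1) = x

def score(W1_):
    # Forward pass: s = W(2) sigmoid(W(1) a(1))
    return float(W2 @ sigmoid(W1_ @ a1))

# Analytic gradient from the chain rule: W11(2) * sigma'(z1(2)) * a4(1)
z2 = W1 @ a1
analytic = W2[0, 0] * sigmoid(z2[0]) * (1 - sigmoid(z2[0])) * a1[3]

# Central finite-difference estimate of ds/dW14(1)
eps = 1e-6
W1p = W1.copy(); W1p[0, 3] += eps
W1m = W1.copy(); W1m[0, 3] -= eps
numeric = (score(W1p) - score(W1m)) / (2 * eps)

print(analytic, numeric)   # the two values should agree closely
```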
Our first example: Backpropagation Observations
We got the error gradient wrt W14(1). It required:
•  the signal forwarded by W14(1), namely a4(1)
•  the error propagating backwards, W11(2)
•  the local gradient, σ'(z1(2))
We can do this for all of W(1) at once, as an outer product:
[ δ1(2)a1(1)   δ1(2)a2(1)   δ1(2)a3(1)   δ1(2)a4(1) ]
[ δ2(2)a1(1)   δ2(2)a2(1)   δ2(2)a3(1)   δ2(2)a4(1) ]
i.e. the column vector [δ1(2), δ2(2)] times the row vector [a1(1), a2(1), a3(1), a4(1)].
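In NumPy this outer product is a one-liner; a tiny sketch with made-up values for δ(2) and a(1):

```python
import numpy as np

delta2 = np.array([0.1, -0.3])          # [delta1(2), delta2(2)]  (made-up values)
a1 = np.array([1.0, 2.0, 3.0, 4.0])     # [a1(1), a2(1), a3(1), a4(1)]

grad_W1 = np.outer(delta2, a1)          # 2x4 matrix with entries delta_i(2) * a_j(1)
print(grad_W1)
```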
Our first example: Let us define δ
In the forward pass, the inputs are summed (+) to give z1(2), which passes through σ to give a1(2). In backpropagation, the error δ1(2) flows backwards through the same σ node.
δ1(2) is the error flowing backwards at the same point where z1(2) passed forwards. Thus it is simply the gradient of the error wrt z1(2).
Our first example: Backpropagation using error vectors
The chain rule of differentiation boils down to very simple patterns in error backpropagation:
1.  An error x flowing backwards passes a neuron by getting amplified by the local gradient: δ = σ'(z)x.
2.  An error δ that needs to go back through an affine transformation distributes itself in the same way the signals were combined in the forward pass: the edge that carried a_i w_i forwards carries δw_i backwards.
[Figure: forward pass in green (edges carry a1w1, a2w2, a3w3 into the sum +, giving z, then σ); backprop in orange (the error x arrives at σ, becomes δ = σ'(z)x at z, and δw1, δw2, δw3 flow back along the edges)]
Our first example: Backpropagation using error vectors
[Computation graph: z(1) → a(1) → z(2) → a(2) → z(3) → s, where a(1) = z(1), z(2) = W(1)a(1), a(2) = σ(z(2)), z(3) = W(2)a(2), s = z(3)]
Our first example: Backpropagation using error vectors
The backward pass starts from the output with an error vector δ(3) at z(3). (For a softmax output with cross-entropy loss, this is where ŷ − y would enter; here s = z(3), so δ(3) is simply the gradient of the cost wrt the score.)
Our first example: Backpropagation using error vectors
Gradient w.r.t W(2) = δ(3)a(2)T
Our first example: Backpropagation using error vectors
Moving δ(3) back across the affine transformation gives W(2)Tδ(3) at a(2).
--  Reusing the δ(3) for downstream updates.
--  Moving an error vector across an affine transformation simply requires multiplication with the transpose of the forward matrix.
--  Notice that the dimensions will line up perfectly too!
Our first example: Backpropagation using error vectors
δ(2) = σ'(z(2)) ◦ W(2)Tδ(3)
--  Moving an error vector across a point-wise non-linearity requires point-wise multiplication with the local gradient of the non-linearity.
Our first example: Backpropagation using error vectors
Gradient w.r.t W(1) = δ(2)a(1)T
Moving δ(2) back across W(1) gives W(1)Tδ(2) at a(1).
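Putting the error-vector rules together, here is a minimal sketch of the full backward pass for this network, under the same assumed shapes as the forward-pass sketch above (biases again omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
W1 = rng.standard_normal((2, 4))
W2 = rng.standard_normal((1, 2))
x = np.array([1.0, 2.0, 3.0, 4.0])

# Forward pass (as before)
a1 = x
z2 = W1 @ a1
a2 = sigmoid(z2)
z3 = W2 @ a2                                      # s = z(3)

# Backward pass with error vectors
delta3 = np.ones_like(z3)                         # ds/dz(3) = 1 for the plain score output
grad_W2 = np.outer(delta3, a2)                    # gradient wrt W(2) = delta(3) a(2)^T
delta2 = a2 * (1 - a2) * (W2.T @ delta3)          # delta(2) = sigma'(z(2)) o W(2)^T delta(3)
grad_W1 = np.outer(delta2, a1)                    # gradient wrt W(1) = delta(2) a(1)^T
grad_x = W1.T @ delta2                            # gradient wrt the input vector
```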
Our second example (4-layer network): Backpropagation using error vectors
[Computation graph: z(1) → a(1) → z(2) → a(2) → z(3) → a(3) → z(4) → yp, where a(1) = z(1), z(2) = W(1)a(1), a(2) = σ(z(2)), z(3) = W(2)a(2), a(3) = σ(z(3)), z(4) = W(3)a(3), yp = softmax(z(4))]
Our second example (4-layer network): Backpropagation using error vectors
yp − y = δ(4)    (the error at z(4) for the softmax output)
Our second example (4-layer network): Backpropagation using error vectors
Grad W(3) = δ(4)a(3)T
Moving δ(4) back across W(3) gives W(3)Tδ(4) at a(3).
Our second example (4-layer network): Backpropagation using error vectors
δ(3) = σ'(z(3)) ◦ W(3)Tδ(4)
Our second example (4-layer network): Backpropagation using error vectors
Grad W(2) = δ(3)a(2)T
Moving δ(3) back across W(2) gives W(2)Tδ(3) at a(2).
Our second example (4-layer network): Backpropagation using error vectors
δ(2) = σ'(z(2)) ◦ W(2)Tδ(3)
Our second example (4-layer network): Backpropagation using error vectors
Grad W(1) = δ(2)a(1)T
Moving δ(2) back across W(1) gives W(1)Tδ(2) at a(1).
Our second example (4-layer network): Backpropagation using error vectors
Grad wrt input vector = W(1)Tδ(2)
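The same two rules (an outer product for each weight gradient, then transpose-and-local-gradient to move the error one layer back) extend mechanically to the 4-layer network. A hedged sketch; the layer sizes, weights, input, and one-hot target are arbitrary illustrations, and biases are omitted:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(3)
sizes = [4, 5, 5, 3]                                                    # assumed layer widths
W = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(3)]   # W(1), W(2), W(3)

x = rng.standard_normal(4)
y = np.array([0.0, 1.0, 0.0])                           # one-hot target

# Forward pass
a = [x]                                                 # a(1) = x
z = [None]                                              # placeholder so z[i] lines up with a[i]
for i in range(3):
    z.append(W[i] @ a[-1])                              # z(i+2) = W(i+1) a(i+1)
    a.append(sigmoid(z[-1]) if i < 2 else softmax(z[-1]))
yp = a[-1]

# Backward pass
delta = yp - y                                          # delta(4) = yp - y
grads = [None] * 3
for i in reversed(range(3)):
    grads[i] = np.outer(delta, a[i])                    # grad W(i+1) = delta(i+2) a(i+1)^T
    back = W[i].T @ delta                               # move the error across the affine map
    if i > 0:
        delta = a[i] * (1 - a[i]) * back                # delta(i+1) = sigma'(z(i+1)) o back
    else:
        grad_x = back                                   # gradient wrt the input vector
```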
CS224D Midterm Review
Ian Tenney
May 4, 2015
Outline
Backpropagation (continued)
RNN Structure
RNN Backpropagation
Backprop on a DAG
Example: Gated Recurrent Units (GRUs)
GRU Backpropagation
Basic RNN Structure
[Figure: a single RNN cell: h(t−1) and x(t) feed into h(t), which produces ŷ(t)]
•  Basic RNN ("Elman network")
•  You've seen this on Assignment #2 (and also in Lecture #5)
Basic RNN Structure
•  Two layers between input and prediction, plus hidden state:
   h(t) = sigmoid(H h(t−1) + W x(t) + b1)
   ŷ(t) = softmax(U h(t) + b2)
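A minimal NumPy sketch of this forward step. The sizes Dh, d, k, the small random initialization, and the function name are illustrative assumptions, not the Assignment #2 API:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

Dh, d, k = 50, 100, 10                       # hidden size, input size, number of classes (assumed)
rng = np.random.default_rng(0)
H = rng.standard_normal((Dh, Dh)) * 0.01
W = rng.standard_normal((Dh, d)) * 0.01
U = rng.standard_normal((k, Dh)) * 0.01
b1 = np.zeros(Dh)
b2 = np.zeros(k)

def rnn_step(h_prev, x_t):
    h_t = sigmoid(H @ h_prev + W @ x_t + b1)     # hidden state h(t)
    y_hat = softmax(U @ h_t + b2)                # prediction yhat(t)
    return h_t, y_hat
```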
Unrolled RNN
[Figure: the RNN unrolled through time: each h(t−s) receives h(t−s−1) and x(t−s) and produces ŷ(t−s)]
•  Helps to think about it as an "unrolled" network: distinct nodes for each timestep
•  Just do backprop on this! Then combine shared gradients.
Backprop on RNN
•  Usual cross-entropy loss (k-class):
   P̄(y(t) = j | x(t), ..., x(1)) = ŷj(t)
   J(t)(θ) = − Σj=1..k yj(t) log ŷj(t)
•  Just do backprop on this!
•  First timestep (s = 0):
   ∂J(t)/∂H(t),  ∂J(t)/∂U,  ∂J(t)/∂b2,  ∂J(t)/∂h(t),  ∂J(t)/∂W(t),  ∂J(t)/∂x(t)
•  Back in time (s = 1, 2, ..., τ − 1):
   ∂J(t)/∂H(t−s),  ∂J(t)/∂W(t−s),  ∂J(t)/∂h(t−s),  ∂J(t)/∂x(t−s)
   (the superscript on H and W marks the contribution from that timestep; the matrices themselves are shared across time)
Backprop on RNN
Yuck, that's a lot of math!
•  Actually, it's not so bad.
•  Solution: error vectors (δ)
Making sense of the madness
•  Chain rule to the rescue!
•  a(t) = U h(t) + b2
•  ŷ(t) = softmax(a(t))
•  Gradient is the transpose of the Jacobian:
   ∇a J = (∂J(t)/∂a(t))ᵀ = ŷ(t) − y(t) = δ(2)(t) ∈ R^(k×1)
•  Now the dimensions work out:
   ∂J(t)/∂a(t) · ∂a(t)/∂b2 = (δ(2)(t))ᵀ I ∈ R^((1×k)·(k×k)) = R^(1×k)
Making sense of the madness
•  Chain rule to the rescue!
•  a(t) = U h(t) + b2
•  ŷ(t) = softmax(a(t))
•  Matrix dimensions get weird:
   ∂a(t)/∂U ∈ R^(k×(k×Dh))
•  But we don't need fancy tensors:
   ∇U J(t) = (∂J(t)/∂a(t) · ∂a(t)/∂U)ᵀ = δ(2)(t) (h(t))ᵀ ∈ R^(k×Dh)
•  NumPy: self.grads.U += outer(d2, hs[t])
Going deeper
•  Really just need one simple pattern:
   z(t) = H h(t−1) + W x(t) + b1
   h(t) = f(z(t))
•  Compute the error deltas (s = 0, 1, 2, ...):
   From the top: δ(t) = h(t) ◦ (1 − h(t)) ◦ Uᵀ δ(2)(t)
   Deeper: δ(t−s) = h(t−s) ◦ (1 − h(t−s)) ◦ Hᵀ δ(t−s+1)
•  These are just chain-rule expansions!
   ∂J(t)/∂z(t) = ∂J(t)/∂a(t) · ∂a(t)/∂h(t) · ∂h(t)/∂z(t) = (δ(t))ᵀ
Going deeper
•  These are just chain-rule expansions!
   ∂J(t)/∂b1(t) = ∂J(t)/∂a(t) · ∂a(t)/∂h(t) · ∂h(t)/∂z(t) · ∂z(t)/∂b1 = (δ(t))ᵀ · ∂z(t)/∂b1
   ∂J(t)/∂H(t) = ∂J(t)/∂a(t) · ∂a(t)/∂h(t) · ∂h(t)/∂z(t) · ∂z(t)/∂H = (δ(t))ᵀ · ∂z(t)/∂H
   ∂J(t)/∂z(t−1) = ∂J(t)/∂a(t) · ∂a(t)/∂h(t) · ∂h(t)/∂z(t) · ∂z(t)/∂h(t−1) · ∂h(t−1)/∂z(t−1) = (δ(t))ᵀ · ∂z(t)/∂h(t−1) · ∂h(t−1)/∂z(t−1)
Going deeper
•  And there are shortcuts for them too:
   (∂J(t)/∂b1(t))ᵀ = δ(t)
   (∂J(t)/∂H(t))ᵀ = δ(t) · (h(t−1))ᵀ
   (∂J(t)/∂z(t−1))ᵀ = [h(t−1) ◦ (1 − h(t−1))] ◦ Hᵀ δ(t) = δ(t−1)
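These shortcuts translate almost line for line into code. Below is a hedged sketch of backprop through time for a single loss J(t), using the sigmoid non-linearity so that f'(z) = h ◦ (1 − h); the container names hs and xs and the function signature are assumptions for illustration, not the Assignment #2 API.

```python
import numpy as np

def bptt_one_loss(t, tau, xs, hs, y, y_hat, H, W, U):
    """Accumulate the gradients of J(t), unrolling tau steps back in time.

    Assumes hs[s] = h(s) (with hs[0] the initial hidden state), xs[s] = x(s),
    y is the one-hot label at time t, y_hat the prediction, and tau <= t.
    """
    dH, dW, dU = np.zeros_like(H), np.zeros_like(W), np.zeros_like(U)
    db1, db2 = np.zeros(H.shape[0]), np.zeros(U.shape[0])

    delta2 = y_hat - y                              # delta(2)(t) = yhat(t) - y(t)
    dU += np.outer(delta2, hs[t])
    db2 += delta2

    # From the top: delta(t) = h(t) o (1 - h(t)) o U^T delta(2)(t)
    delta = hs[t] * (1 - hs[t]) * (U.T @ delta2)
    for s in range(tau):
        dH += np.outer(delta, hs[t - s - 1])        # (dJ(t)/dH)^T contribution at step t-s
        dW += np.outer(delta, xs[t - s])            # (dJ(t)/dW)^T contribution at step t-s
        db1 += delta
        # Deeper: delta(t-s-1) = h(t-s-1) o (1 - h(t-s-1)) o H^T delta(t-s)
        delta = hs[t - s - 1] * (1 - hs[t - s - 1]) * (H.T @ delta)

    return dH, dW, dU, db1, db2
```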
Outline
Backpropagation (continued)
RNN Structure
RNN Backpropagation
Backprop on a DAG
Example: Gated Recurrent Units (GRUs)
GRU Backpropagation
Motivation
•  Gated units with "reset" and "output" gates
•  Reduce problems with vanishing gradients
Figure: You are likely to be eaten by a GRU. (Figure from Chung, et al. 2014)
Intuition
•  Gates zi and ri for each hidden-layer neuron
•  zi, ri ∈ [0, 1]
•  h̃ is a "candidate" hidden layer
•  h̃, z, r all depend on x(t) and h(t−1)
•  h(t) depends on h(t−1) mixed with h̃(t)
Figure: You are likely to be eaten by a GRU. (Figure from Chung, et al. 2014)
Equations
•  z(t) = σ(Wz x(t) + Uz h(t−1))
•  r(t) = σ(Wr x(t) + Ur h(t−1))
•  h̃(t) = tanh(W x(t) + r(t) ◦ U h(t−1))
•  h(t) = z(t) ◦ h(t−1) + (1 − z(t)) ◦ h̃(t)
•  Optionally can have biases; omitted for clarity.
Figure: You are likely to be eaten by a GRU. (Figure from Chung, et al. 2014)
Same eqs. as Lecture 8, subscripts/superscripts as in Assignment #2.
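A direct, minimal transcription of these four equations into NumPy; the weight names follow the slide, biases are omitted as noted above, and the function name is an assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)              # update ("output") gate z(t)
    r = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate r(t)
    h_tilde = np.tanh(W @ x_t + r * (U @ h_prev))    # candidate hidden state
    h = z * h_prev + (1 - z) * h_tilde               # mix previous state and candidate
    return h, z, r, h_tilde                          # cache z, r, h_tilde for backprop
```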
Backpropagation
Multi-path to compute ∂J/∂x(t)
•  Start with δ(t) = (∂J/∂h(t))ᵀ ∈ R^d
•  h(t) = z(t) ◦ h(t−1) + (1 − z(t)) ◦ h̃(t)
•  Expand the chain rule into a sum (a.k.a. the product rule):
   ∂J/∂x(t) = ∂J/∂h(t) · [ z(t) ◦ ∂h(t−1)/∂x(t) + ∂z(t)/∂x(t) ◦ h(t−1) ]
            + ∂J/∂h(t) · [ (1 − z(t)) ◦ ∂h̃(t)/∂x(t) + ∂(1 − z(t))/∂x(t) ◦ h̃(t) ]
It gets (a little) better
Multi-path to compute ∂J/∂x(t)
•  Drop terms that don't depend on x(t):
   ∂J/∂x(t) = ∂J/∂h(t) · [ z(t) ◦ ∂h(t−1)/∂x(t) + ∂z(t)/∂x(t) ◦ h(t−1) ]
            + ∂J/∂h(t) · [ (1 − z(t)) ◦ ∂h̃(t)/∂x(t) + ∂(1 − z(t))/∂x(t) ◦ h̃(t) ]
            = ∂J/∂h(t) · [ ∂z(t)/∂x(t) ◦ h(t−1) + (1 − z(t)) ◦ ∂h̃(t)/∂x(t) ]
            − ∂J/∂h(t) · ∂z(t)/∂x(t) ◦ h̃(t)
Almost there!
Multi-path to compute ∂J/∂x(t)
•  Now we really just need to compute two things:
•  Output gate:
   ∂z(t)/∂x(t) = z(t) ◦ (1 − z(t)) ◦ Wz
•  Candidate h̃:
   ∂h̃(t)/∂x(t) = (1 − (h̃(t))²) ◦ W + (1 − (h̃(t))²) ◦ ∂r(t)/∂x(t) ◦ U h(t−1)
•  Ok, I lied - there's a third.
•  Don't forget to check all paths!
Almost there!
Multi-path to compute ∂J/∂x(t)
•  Last one:
   ∂r(t)/∂x(t) = r(t) ◦ (1 − r(t)) ◦ Wr
•  Now we can just add things up! (see the sketch below)
•  (I'll spare you the pain...)
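Adding the three pieces up, here is a hedged sketch of ∂J/∂x(t), given δ(t) = (∂J/∂h(t))ᵀ and the quantities cached by the forward step sketched earlier; the function name and argument list are assumptions, and only the gradient wrt x(t) is shown:

```python
import numpy as np

def gru_dx(delta, h_prev, z, r, h_tilde, Wz, Wr, W, U):
    """delta = dJ/dh(t) as a vector; returns dJ/dx(t)."""
    # Path through the update gate: h(t) = z o h_prev + (1 - z) o h_tilde
    dz = delta * (h_prev - h_tilde)                  # dJ/dz(t)
    dx = Wz.T @ (dz * z * (1 - z))                   # dz(t)/dx(t) = z o (1 - z) o Wz

    # Path through the candidate h_tilde
    dh_tilde = delta * (1 - z)                       # dJ/dh_tilde(t)
    dpre = dh_tilde * (1 - h_tilde ** 2)             # tanh' = 1 - tanh^2
    dx += W.T @ dpre                                 # direct dependence on x(t) through W

    # Path through the reset gate inside h_tilde
    dr = dpre * (U @ h_prev)                         # dJ/dr(t)
    dx += Wr.T @ (dr * r * (1 - r))                  # dr(t)/dx(t) = r o (1 - r) o Wr
    return dx
```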
Whew.
•  Why three derivatives?
•  Three arrows from x(t) to distinct nodes
•  Four paths total (∂z(t)/∂x(t) appears twice)
Whew.
•  GRUs are complicated
•  All the pieces are simple
•  Same matrix gradients that you've seen before
Summary
•  Check your dimensions!
•  Write error vectors δ; they are just parentheses around the chain rule
•  Combine simple operations to make a complex network:
   •  Matrix-vector products
   •  Activation functions (tanh, sigmoid, softmax)
Good luck on Wednesday!
May the 4th be with you!
CS 224d Midterm Review (word vectors)
Peng Qi
May 5, 2015
Outline
•  word2vec and GloVe revisited
•  word2vec with backpropagation
Outline
•  word2vec and GloVe revisited
   •  Skip-gram revisited
   •  (Optional) CBOW and its connection to Skip-gram
   •  (Optional) word2vec as matrix factorization (conceptually)
   •  GloVe v.s. word2vec
•  word2vec with backpropagation
Skip-gram
•  Task: given a center word, predict its context words
•  For each word, we have an "input vector" vw and an "output vector" v′w
[Figure: the CBOW and Skip-gram architectures from Mikolov et al.; CBOW predicts the current word based on the context, and Skip-gram predicts surrounding words given the current word]
Skip-gram
•  We have seen two types of costs for an expected word given a vector prediction r:
   CE(wi | r) = − log( exp(rᵀ v′wi) / Σj=1..|V| exp(rᵀ v′wj) )
   NEG(wi | r) = − log(σ(rᵀ v′wi)) − Σk=1..K log(σ(−rᵀ v′wk))
•  In the case of skip-gram, the vector prediction r is just the "input vector" of the center word, vwi.
•  σ(·) is the sigmoid (logistic) function.
Skip-gram
•  Now we have all the pieces of skip-gram. The cost for a context window
   [wi−C, ..., wi−1, wi, wi+1, ..., wi+C], where wi is the center word, is
   Jskip-gram([wi−C, ..., wi+C]) = Σ (i−C ≤ j ≤ i+C, j ≠ i)  F(wj | vwi)
   where F is one of the cost functions we defined on the previous slide.
•  You might ask: but why are we introducing so many notations?
Skip-gram v.s. CBOW
         Skip-gram                 CBOW
Task:    Center word → Context     Context → Center word
r:       vwi                       f(vwi−C, ..., vwi−1, vwi+1, ..., vwi+C)
All word2vec figures are from http://arxiv.org/pdf/1301.3781.pdf
word2vec as matrix factorization (conceptually)
•  Matrix factorization:
   M (n×n) ≈ Aᵀ (n×k) · B (k×n),   i.e.   Mij ≈ aiᵀ bj
•  Imagine M is a matrix of counts of co-occurring events, but we only get to observe the co-occurrences one at a time. E.g.
   M = [ 1 0 4 ]
       [ 0 0 2 ]
       [ 1 3 0 ]
   but we only see
   (1,1), (2,3), (3,2), (2,3), (1,3), ...
word2vec as matrix factorization (conceptually)
   Mij ≈ aiᵀ bj
•  Whenever we see a pair (i, j) co-occur, we try to increase aiᵀ bj.
•  We also try to make all the other inner products smaller, to account for pairs never observed (or not observed yet), by decreasing a¬iᵀ bj and aiᵀ b¬j.
•  Remember from the lecture that the word co-occurrence matrix usually captures the semantic meaning of a word?
•  For word2vec models, roughly speaking, M is the windowed word co-occurrence matrix, A is the output vector matrix, and B is the input vector matrix.
•  Why not just use one set of vectors? That is equivalent to A = B in our formulation here, but fewer constraints are usually easier for optimization.
GloVe v.s. word2vec
[Table comparing direct prediction (word2vec) with GloVe on five criteria: fast training, efficient usage of statistics, quality affected by size of corpora, captures complex patterns, scales with size of corpus]
* Skip-gram and CBOW are qualitatively different when it comes to smaller corpora
Outline
•  word2vec and GloVe revisited
•  word2vec with backpropagation
word2vec with backpropagation
•  CE(wi | r) = − log( exp(rᵀ v′wi) / Σj=1..|V| exp(rᵀ v′wj) )
•  CE(wi | r) = CE(ŷ, yi), where
   ŷ = softmax(θ),   θ = (V′)ᵀ r
•  δ = ∂CE/∂θ = ŷ − yi
•  ∂CE/∂V′ = r δᵀ
   ∂CE/∂r = V′ δ
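A small NumPy sketch of these three gradients for one observed context word. Here V_out plays the role of V′ (its columns are the output vectors), and the sizes and the word index are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d, V = 50, 10000                              # embedding size and vocabulary size (assumed)
rng = np.random.default_rng(0)
V_out = rng.standard_normal((d, V)) * 0.01    # V': column j is the output vector v'_{w_j}
r = rng.standard_normal(d) * 0.01             # vector prediction (input vector of the center word)
i = 42                                        # index of the observed context word w_i (assumed)

theta = V_out.T @ r                           # theta = (V')^T r
y_hat = softmax(theta)                        # yhat
delta = y_hat.copy()
delta[i] -= 1.0                               # delta = yhat - y_i   (y_i is one-hot)

cost = -np.log(y_hat[i])                      # CE(w_i | r)
grad_V_out = np.outer(r, delta)               # dCE/dV' = r delta^T
grad_r = V_out @ delta                        # dCE/dr  = V' delta
```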
Thanks for your attention
and best of luck with the mid-term!
Any questions?