Vocabulary for Deep Learning
(In 40 Minutes! With Pictures!)
Graham Neubig
Nara Institute of Science and Technology (NAIST)
Prediction Problems

Given x, predict y
Example: Sentiment Analysis

x            y
I am happy   +1
I am sad     -1
Linear Classifiers

y = step(w·φ(x))

● x: the input
● φ(x): vector of feature functions
● w: the weight vector
● y: the prediction, +1 if "yes", -1 if "no"

[Figure: plot of y = step(w·φ), jumping from -1 to +1 as w·φ crosses zero]
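As a minimal sketch, the classifier is a dot product followed by the step function. Representing w and φ(x) as sparse dicts is an implementation choice, not from the slides:

    def step(z):
        """Step activation: +1 for a positive weighted sum, -1 otherwise."""
        return 1 if z > 0 else -1

    def predict(w, phi):
        """y = step(w . phi(x)) with sparse feature dicts."""
        score = sum(w.get(name, 0.0) * value for name, value in phi.items())
        return step(score)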
Example Feature Functions: Unigram Features

Feature:        I   was  am   sad  the  happy  …
φ(I am happy):  1   0    1    0    0    1      …
φ(I am sad):    1   0    1    1    0    0      …
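A sketch of the unigram feature function φ in Python; the slides show binary indicators, so every seen word maps to 1:

    def unigram_features(sentence):
        """phi(x): a sparse binary vector with one indicator per word."""
        return {word: 1 for word in sentence.split()}

    unigram_features("I am happy")   # {'I': 1, 'am': 1, 'happy': 1}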
The Perceptron
● Think of it as a "machine" to calculate a weighted sum and insert it into an activation function

[Figure: features φ(x) are multiplied by weights w, summed, and passed through step(w·φ(x)) to produce y]
Sigmoid Function (Logistic Function)
● The sigmoid function is a "softened" version of the step function:

P(y=1|x) = e^{w·φ(x)} / (1 + e^{w·φ(x)})

[Figure: the step function jumps from 0 to 1 at w·φ(x) = 0, while the logistic function rises smoothly from 0 to 1]

● Can account for uncertainty
● Differentiable
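A sketch of this probability in Python, written in the algebraically equal form 1 / (1 + e^{-w·φ(x)}):

    import math

    def prob_positive(w, phi):
        """P(y=1|x) = e^(w.phi(x)) / (1 + e^(w.phi(x)))."""
        z = sum(w.get(name, 0.0) * value for name, value in phi.items())
        return 1.0 / (1.0 + math.exp(-z))   # equal to e^z / (1 + e^z)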
Logistic Regression
● Train based on conditional likelihood
● Find the parameters w that maximize the conditional likelihood of all answers y_i given the examples x_i:

ŵ = argmax_w ∏_i P(y_i | x_i; w)

● How do we solve this?
Stochastic Gradient Descent
● Online training algorithm for probabilistic models (including logistic regression)

    create map w
    for I iterations
        for each labeled pair x, y in the data
            w += α * dP(y|x)/dw

● In other words:
    ● For every training example, calculate the gradient (the direction that will increase the probability of y)
    ● Move in that direction, multiplied by learning rate α
Gradient of the Sigmoid Function
● Take the derivative of the probability:

d/dw P(y=1|x) = d/dw [ e^{w·φ(x)} / (1 + e^{w·φ(x)}) ]
              = φ(x) e^{w·φ(x)} / (1 + e^{w·φ(x)})²

d/dw P(y=-1|x) = d/dw [ 1 - e^{w·φ(x)} / (1 + e^{w·φ(x)}) ]
               = -φ(x) e^{w·φ(x)} / (1 + e^{w·φ(x)})²

[Figure: the gradient plotted against w·φ(x): a bell-shaped curve that peaks at w·φ(x) = 0 and approaches zero for large |w·φ(x)|]
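Putting the last three slides together, a sketch of SGD for logistic regression in Python. It follows the slides in ascending the probability itself rather than the log-likelihood; the toy data and hyperparameter values are illustrative choices:

    import math
    from collections import defaultdict

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def train_sgd(data, iterations=10, alpha=0.1):
        w = defaultdict(float)                    # "create map w"
        for _ in range(iterations):               # "for I iterations"
            for phi, y in data:                   # each labeled pair x, y
                z = sum(w[k] * v for k, v in phi.items())
                # dP(y|x)/dw = y * phi(x) * e^z / (1 + e^z)^2 for y in {+1, -1},
                # and e^z / (1 + e^z)^2 = sigmoid(z) * (1 - sigmoid(z))
                scale = y * sigmoid(z) * (1.0 - sigmoid(z))
                for k, v in phi.items():          # w += alpha * dP(y|x)/dw
                    w[k] += alpha * scale * v
        return w

    data = [({"I": 1, "am": 1, "happy": 1}, +1),
            ({"I": 1, "am": 1, "sad": 1}, -1)]
    w = train_sgd(data)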
Neural Networks
Problem: Linear Constraint
● Perceptrons cannot achieve high accuracy on non-linear functions

[Figure: four points in an XOR arrangement (X O over O X) that no single line can separate]

● Example: "I am not happy"
Neural Networks (Multi-Layer Perceptron)
● Neural networks connect multiple perceptrons together

[Figure: input features φ(x) and a bias input feed a layer of perceptrons, whose outputs feed a final perceptron]
Example:
● Build two classifiers:

[Figure: four points in the φ1-φ2 plane: φ(x1) = {-1, 1} X, φ(x2) = {1, 1} O, φ(x3) = {-1, -1} O, φ(x4) = {1, -1} X]

y1 = step(w1 · {φ1, φ2, 1})   with w1 = {1, 1, -1}
y2 = step(w2 · {φ1, φ2, 1})   with w2 = {-1, -1, -1}
Example:
● These classifiers map the points to a new space:

φ(x1) = {-1, 1}   →  y(x1) = {-1, -1}
φ(x2) = {1, 1}    →  y(x2) = {1, -1}
φ(x3) = {-1, -1}  →  y(x3) = {-1, 1}
φ(x4) = {1, -1}   →  y(x4) = {-1, -1}

[Figure: in the y1-y2 plane the two X points collapse onto {-1, -1}, with the O points at {1, -1} and {-1, 1}]
Example:
● In the new space, the examples are classifiable!

y3 = step(w3 · {y1, y2, 1})   with w3 = {1, 1, 1}

[Figure: a single line in the y1-y2 plane now separates the O points at {1, -1} and {-1, 1} from the X points at {-1, -1}]
Example:
● Final neural network:

[Figure: φ1, φ2, and a bias input of 1 feed hidden units w1 = {1, 1, -1} and w2 = {-1, -1, -1}; their outputs y1, y2 and a bias feed the output unit w3 = {1, 1, 1}, which produces y4]
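A sketch of the finished network in Python, hard-coding the weights from the example (the bias is written as a constant input of 1):

    def step(z):
        return 1 if z > 0 else -1

    def xor_net(phi1, phi2):
        y1 = step(1 * phi1 + 1 * phi2 + 1 * -1)     # w1 = {1, 1, -1}
        y2 = step(-1 * phi1 + -1 * phi2 + 1 * -1)   # w2 = {-1, -1, -1}
        return step(1 * y1 + 1 * y2 + 1 * 1)        # w3 = {1, 1, 1}

    # x1, x2, x3, x4 from the example: X -> -1, O -> +1
    [xor_net(-1, 1), xor_net(1, 1), xor_net(-1, -1), xor_net(1, -1)]  # [-1, 1, 1, -1]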
Hidden Layer Activation Functions

Step Function: step(w·φ)
- Cannot calculate gradient

Linear Function: w·φ
- Whole net becomes linear

Tanh Function: tanh(w·φ)
Standard (also the 0-1 sigmoid)

Rectified Linear Function: relu(w·φ)
+ Gradients at large values

[Figure: four plots of y against w·φ, one per activation function]
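A sketch of the four activations in Python:

    import math

    def step(z):       # no gradient to propagate
        return 1 if z > 0 else -1

    def linear(z):     # stacking these keeps the whole net linear
        return z

    def tanh(z):       # the standard choice; the 0-1 sigmoid is an alternative
        return math.tanh(z)

    def relu(z):       # nonzero gradient even at large positive values
        return max(0.0, z)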
Deep Networks

[Figure: φ(x) feeding several stacked hidden layers]

● Why deep?
● Gradually more detailed functions (e.g. [Lee+ 09])
Learning Neural Nets
Learning: Don't Know Derivative for Hidden Units!
● For NNs, we only know the correct tag for the last layer:

d P(y=1|x) / d w4 = h(x) e^{w4·h(x)} / (1 + e^{w4·h(x)})²

d P(y=1|x) / d w1 = ?
d P(y=1|x) / d w2 = ?
d P(y=1|x) / d w3 = ?

[Figure: φ(x) feeds hidden units w1, w2, w3, whose outputs h(x) feed the output unit w4 with target y=1]
Deep Learning Vocabulary
Answer: Back-Propogation
●
Calculate derivative w/ chain rule
d P( y=1∣x) d P( y=1∣x ) d w 4 h(x ) d h1 (x )
=
d w1
d w 4 h( x) d h1 (x) d w 1
w 4⋅h( x )
e
w ⋅h( x) 2
(1+e
)
w 1,4
4
Error of
next unit (δ4)
In General
Calculate i based
on next units j:
Weight Gradient of
this unit
d P( y=1∣x) d hi ( x)
=
δ
w
∑
j
i, j
j
wi
d wi
22
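A sketch of these gradients in numpy for one hidden layer feeding a sigmoid output unit. The tanh hidden activation and the array shapes are my assumptions; the δ recursion is the one above, with the output unit as the only next unit j:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradients(phi, W1, w4):
        """Gradients of P(y=1|x) for output weights w4 and hidden weights W1."""
        h = np.tanh(W1 @ phi)                   # hidden layer h(x)
        z = w4 @ h
        delta4 = sigmoid(z) * (1 - sigmoid(z))  # error of the output unit
        d_w4 = delta4 * h                       # dP/dw4
        # chain rule: local tanh derivative (1 - h_i^2) times w_{i,4} * delta4
        delta = (1 - h ** 2) * w4 * delta4
        d_W1 = np.outer(delta, phi)             # row i is dP/dw_i = delta_i * phi(x)
        return d_w4, d_W1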
BP in Deep Networks: Vanishing Gradients

[Figure: a deep network from φ(x) to y, annotated with error terms that shrink from δ ≈ 0.1 near the output to |δ| ≈ 1e-6 near the input, e.g. δ = 0.1, 0.02, 5e-3, -1e-3, 7e-5, 1e-6 along one path]

● Exploding gradients can occur as well
Layerwise Training
● Train one layer at a time

[Figure: animation highlighting one layer of the network at a time, moving from the layer nearest φ(x) toward y]
Autoencoders
● Initialize the NN by training it to reproduce its input

[Figure: φ(x) encoded through a hidden layer and decoded back to φ(x)]

● Advantage: No need for labels y!
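A minimal sketch of one reconstruction update in numpy. Tied encoder/decoder weights and a squared-error loss are my assumptions; the point is that the target is φ(x) itself, so no labels are needed:

    import numpy as np

    def autoencoder_step(phi, W, alpha=0.01):
        """One gradient step on ||decode(encode(phi)) - phi||^2."""
        h = np.tanh(W @ phi)                 # encode
        recon = W.T @ h                      # decode with tied weights
        err = recon - phi                    # target is the input itself: no labels y
        d_h = (W @ err) * (1 - h ** 2)       # backprop through the encoder
        grad = np.outer(h, err) + np.outer(d_h, phi)   # both uses of W
        return W - alpha * grad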
Dropout
● Problem: Overfitting

[Figure: a network with many of its units crossed out]

● Deactivate part of the network on each update
● Can be seen as an ensemble method
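A sketch of dropout as a random mask in numpy. The drop probability of 0.5 and the train-time rescaling ("inverted" dropout) are common choices, not from the slides:

    import numpy as np

    def dropout(h, p_drop=0.5, train=True):
        """Deactivate a random subset of units on each training update."""
        if not train:
            return h                             # use the full "ensemble" at test time
        mask = np.random.rand(*h.shape) >= p_drop
        return h * mask / (1.0 - p_drop)         # rescale so expected values match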
Network Architectures
Recurrent Neural Nets

[Figure: inputs x1…x4 feed a chain of NET units, each passing its state to the next and emitting y1…y4]

● Good for modeling sequence data
● e.g. Speech
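A sketch of the recurrence in numpy: the same NET is applied at every step, carrying a hidden state across the sequence. The tanh update and the parameter names (W_xh, W_hh, W_hy) are illustrative assumptions:

    import numpy as np

    def rnn(xs, W_xh, W_hh, W_hy):
        """Run one shared NET over x1, x2, ... producing y1, y2, ..."""
        h = np.zeros(W_hh.shape[0])
        ys = []
        for x in xs:
            h = np.tanh(W_xh @ x + W_hh @ h)   # state depends on input and history
            ys.append(W_hy @ h)
        return ys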
Convolutional Neural Nets

[Figure: overlapping windows (x1…x3), (x2…x4), (x3…x5) of the input feed higher units that combine into y]

● Good for modeling data with spatial correlation
● e.g. Images
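A sketch of the windowed computation in numpy: one shared weight vector is slid over the windows (x1…x3), (x2…x4), (x3…x5) shown in the figure. The window width of 3 matches the figure; tanh and max-pooling are my assumptions:

    import numpy as np

    def conv1d(xs, w, width=3):
        """Apply shared weights w to every window, then pool into one value."""
        feats = [np.tanh(w @ xs[i:i + width])
                 for i in range(len(xs) - width + 1)]
        return max(feats)                     # pooling over window positions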
Recursive Neural Nets

[Figure: inputs x1…x5 combined pairwise by NET units following a tree structure, producing y at the root]

● Good for modeling data with tree structures
● e.g. Language
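A sketch of the tree composition in Python, assuming nested tuples for the tree and one shared NET that merges two child vectors; the concatenate-then-tanh merge is an illustrative choice:

    import numpy as np

    def compose(tree, W):
        """Apply the same NET recursively at every internal node of the tree."""
        if not isinstance(tree, tuple):       # a leaf: some input vector x_i
            return tree
        left, right = (compose(child, W) for child in tree)
        return np.tanh(W @ np.concatenate([left, right]))

    # e.g. y = compose(((x1, x2), (x3, (x4, x5))), W) with W of shape (d, 2*d)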
Other Topics
Other Topics
● Deep belief networks
    ● Restricted Boltzmann machines
    ● Contrastive estimation
● Generating with Neural Nets
● Visualizing internal states
● Batch/mini-batch update strategies
● Long short-term memory
● Learning on GPUs
● Tools for implementing deep learning