Deep Learning Vocabulary
Vocabulary for Deep Learning
(In 40 Minutes! With Pictures!)
Graham Neubig
Nara Institute of Science and Technology (NAIST)
Prediction Problems
Given x, predict y
Example: Sentiment Analysis
x            y
I am happy   +1
I am sad     -1
Linear Classifiers
y = step(w^T φ(x))

● x: the input
● φ(x): vector of feature functions
● w: the weight vector
● y: the prediction, +1 if “yes”, -1 if “no”

[Figure: plot of step(w·φ), which jumps from -1 to +1 where w·φ crosses 0]
Example Feature Functions: Unigram Features

φ(I am happy) = { I: 1, was: 0, am: 1, sad: 0, the: 0, happy: 1, … }
φ(I am sad)   = { I: 1, was: 0, am: 1, sad: 1, the: 0, happy: 0, … }
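The unigram features above can be sketched as a small function; the name `unigram_features` and the whitespace tokenization are assumptions for illustration, not from the slides.

```python
from collections import defaultdict

def unigram_features(sentence):
    # Count how many times each word appears in the sentence;
    # words that never appear implicitly get 0, matching the sparse "..." above.
    phi = defaultdict(int)
    for word in sentence.split():
        phi[word] += 1
    return dict(phi)

print(unigram_features("I am happy"))  # {'I': 1, 'am': 1, 'happy': 1}
```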
The Perceptron
● Think of it as a “machine” to calculate a weighted sum and insert it into an activation function

[Figure: inputs φ(x) and weights w feed step(w^T φ(x)), which outputs y]
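A minimal sketch of the perceptron "machine" over sparse dict features; the example weights and the tie-breaking convention (step(0) = +1) are assumptions.

```python
def step(v):
    # Step activation: +1 for "yes", -1 for "no" (ties broken toward +1).
    return 1 if v >= 0 else -1

def predict(w, phi):
    # y = step(w^T phi(x)), with w and phi(x) as sparse feature->value dicts.
    score = sum(weight * phi.get(name, 0) for name, weight in w.items())
    return step(score)

w = {"happy": 1.0, "sad": -1.0}  # hypothetical weights, not from the deck
print(predict(w, {"I": 1, "am": 1, "happy": 1}))  # 1
```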
Sigmoid Function (Logistic Function)
● The sigmoid function is a “softened” version of the step function:

  P(y=1|x) = e^{w·φ(x)} / (1 + e^{w·φ(x)})

[Figure: the logistic function rises smoothly from 0 to 1, passing through p(y|x) = 0.5 at w·φ(x) = 0; the step function jumps from 0 to 1 at that same point]

● Can account for uncertainty
● Differentiable
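The sigmoid above, written as code; the branch on the sign of the score is a standard numerical-stability refinement, not something the slide specifies.

```python
import math

def sigmoid(score):
    # P(y=1|x) = e^s / (1 + e^s), equivalent to 1 / (1 + e^{-s}),
    # where s = w . phi(x).  Branching avoids overflow for large |s|.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)

print(sigmoid(0.0))  # 0.5
```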
Logistic Regression
● Train based on conditional likelihood
● Find the parameters w that maximize the conditional likelihood of all answers y_i given the examples x_i:

  ŵ = argmax_w ∏_i P(y_i | x_i ; w)

● How do we solve this?
Stochastic Gradient Descent
● Online training algorithm for probabilistic models (including logistic regression)

create map w
for I iterations
    for each labeled pair x, y in the data
        w += α * dP(y|x)/dw

● In other words:
  ● For every training example, calculate the gradient (the direction that will increase the probability of y)
  ● Move in that direction, multiplied by learning rate α
Gradient of the Sigmoid Function
Take the derivative of the probability:

  d/dw P(y=1|x)  = d/dw [ e^{w·φ(x)} / (1 + e^{w·φ(x)}) ]
                 = φ(x) · e^{w·φ(x)} / (1 + e^{w·φ(x)})²

  d/dw P(y=-1|x) = d/dw [ 1 − e^{w·φ(x)} / (1 + e^{w·φ(x)}) ]
                 = −φ(x) · e^{w·φ(x)} / (1 + e^{w·φ(x)})²

[Figure: the gradient (scaled by φ(x)) peaks where w·φ(x) = 0 and falls toward 0 as |w·φ(x)| grows]
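The SGD pseudocode and the gradients above can be combined into one sketch. It follows the slide's update w += α·dP(y|x)/dw literally (ascending the probability itself rather than the log-likelihood); the sparse dict representation, the function name, and the hyperparameter defaults are assumptions.

```python
import math

def train_lr(data, iterations=100, alpha=0.1):
    """Stochastic gradient descent for logistic regression.

    data: list of (phi, y) pairs, phi a sparse feature->value dict, y in {+1, -1}.
    Gradient from the derivation above, with s = w . phi(x):
      dP(y=1|x)/dw  =  phi(x) * e^s / (1 + e^s)^2
      dP(y=-1|x)/dw = -phi(x) * e^s / (1 + e^s)^2
    """
    w = {}
    for _ in range(iterations):
        for phi, y in data:
            s = sum(w.get(f, 0.0) * v for f, v in phi.items())
            # Both gradients share the magnitude e^s / (1+e^s)^2;
            # only the sign differs, so multiply by y.
            g = math.exp(s) / (1.0 + math.exp(s)) ** 2
            for f, v in phi.items():
                w[f] = w.get(f, 0.0) + alpha * y * g * v
    return w

data = [({"I": 1, "am": 1, "happy": 1}, 1),
        ({"I": 1, "am": 1, "sad": 1}, -1)]
w = train_lr(data)
```

After training on the two sentiment examples, "happy" ends up with a positive weight and "sad" with a negative one, while the shared words "I" and "am" stay near zero.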
Neural Networks
Problem: Linear Constraint
● Perceptrons cannot achieve high accuracy on non-linear functions

[Figure: an XOR-style arrangement of X and O examples that no single line can separate]

● Example: “I am not happy”
Neural Networks
(Multi-Layer Perceptron)
● Neural networks connect multiple perceptrons together

[Figure: features φ(x) and a bias input -1 feeding a network of connected perceptrons]
Example:
● Build two classifiers:

[Figure: the four points φ(x1) = {-1, 1}, φ(x2) = {1, 1}, φ(x3) = {-1, -1}, φ(x4) = {1, -1}, with X marks at x1 and x4 and O marks at x2 and x3]

Classifier 1: y1 = step(w1 · {φ1, φ2, 1}), with w1 = {1, 1, -1}
Classifier 2: y2 = step(w2 · {φ1, φ2, 1}), with w2 = {-1, -1, -1}
Example:
● These classifiers map the points to a new space:

[Figure: the points φ(x1) = {-1, 1}, φ(x2) = {1, 1}, φ(x3) = {-1, -1}, φ(x4) = {1, -1} in φ1-φ2 space map into y1-y2 space as
  y(x1) = {-1, -1}, y(x2) = {1, -1}, y(x3) = {-1, 1}, y(x4) = {-1, -1},
so the two X points x1 and x4 land on the same point {-1, -1}]
Example:
● In the new space, the examples are classifiable!

[Figure: in y1-y2 space the O points y(x2) = {1, -1} and y(x3) = {-1, 1} are separated from the X points y(x1) = y(x4) = {-1, -1} by a single line]

Classifier 3: y3 = step(w3 · {y1, y2, 1}), with w3 = {1, 1, 1}
Example:
● Final neural network:

[Figure: φ1 and φ2 (plus a bias input 1) feed two hidden units with weights w1 = {1, 1, -1} and w2 = {-1, -1, -1}; their outputs y1 and y2 (plus a bias input 1) feed the output unit with weights w3 = {1, 1, 1}, producing y4]
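The final network can be written out directly. The weights below are the ones read off the figure (w1 = {1, 1, -1}, w2 = {-1, -1, -1}, w3 = {1, 1, 1}, with a +1 bias input); the tie convention step(0) = +1 is an assumption.

```python
def step(v):
    # Step activation: +1 if v >= 0, else -1.
    return 1 if v >= 0 else -1

def mlp(phi1, phi2):
    # Hidden layer: the two classifiers from the previous slides.
    y1 = step(1 * phi1 + 1 * phi2 + (-1) * 1)    # w1 = {1, 1, -1}
    y2 = step(-1 * phi1 + -1 * phi2 + (-1) * 1)  # w2 = {-1, -1, -1}
    # Output layer: the third classifier in the new y1-y2 space.
    return step(1 * y1 + 1 * y2 + 1 * 1)         # w3 = {1, 1, 1}
```

On the four example points, the O inputs {1, 1} and {-1, -1} map to +1 and the X inputs {-1, 1} and {1, -1} map to -1, which no single perceptron could do.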
Hidden Layer Activation Functions
● Step Function: step(w·φ)
  - Cannot calculate gradient
● Linear Function: w·φ
  - Whole net becomes linear
● Tanh Function: tanh(w·φ)
  + Standard choice (also the 0-1 sigmoid)
● Rectified Linear Function: relu(w·φ)
  + Gradients at large values

[Figure: plots of the four activation functions]
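The four activation functions above, as one-liners; the ±1 output convention for `step` and the function names are assumptions.

```python
import math

def step(v):
    # No useful gradient: the derivative is 0 everywhere it exists.
    return 1.0 if v >= 0 else -1.0

def linear(v):
    # Stacking linear layers keeps the whole net linear.
    return v

def tanh(v):
    # The "standard" hidden-layer choice on the slide.
    return math.tanh(v)

def relu(v):
    # Keeps a nonzero gradient for arbitrarily large positive inputs.
    return max(0.0, v)
```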
Deep Networks
● Why deep?
● Gradually more detailed functions (e.g. [Lee+ 09])

[Figure: a network with several hidden layers between φ(x) and the output]
Learning Neural Nets
Learning:
Don't Know Derivative for Hidden Units!
● For NNs, we only know the correct tag for the last layer:

  d P(y=1|x) / d w4 = h(x) · e^{w4·h(x)} / (1 + e^{w4·h(x)})²

  d P(y=1|x) / d w1 = ?
  d P(y=1|x) / d w2 = ?
  d P(y=1|x) / d w3 = ?

[Figure: φ(x) feeds hidden units with weights w1, w2, w3, producing h(x), which feeds the output unit with weights w4; the label y = 1 is only observed at the output]
Answer: Back-Propagation

● Calculate the derivative with the chain rule:

  d P(y=1|x)/d w1 = [d P(y=1|x)/d(w4·h(x))] · [d(w4·h(x))/d h1(x)] · [d h1(x)/d w1]

where the first factor is the error of the next unit, δ4 = e^{w4·h(x)} / (1 + e^{w4·h(x)})², the second is the weight w_{1,4}, and the third is the weight gradient of this unit.

● In general, calculate the gradient for unit i based on the next units j:

  d P(y=1|x)/d w_i = ( Σ_j δ_j w_{i,j} ) · d h_i(x)/d w_i
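The general rule above can be sketched for a net with one tanh hidden layer and a sigmoid output. This is a toy illustration, not the slides' code: the shapes, the tanh activation, and the function names are assumptions.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def backprop(phi, w_hidden, w_out):
    """Gradient of P(y=1|x) w.r.t. each hidden weight vector, via the
    chain rule: grad_i = delta_i * dh_i(x)/dw_i, where the error of
    hidden unit i is delta_i = delta_out * w_out[i]."""
    # Forward pass: hidden activations h, output probability p.
    h = [math.tanh(sum(wi * x for wi, x in zip(w_i, phi))) for w_i in w_hidden]
    s = sum(wo * hi for wo, hi in zip(w_out, h))
    p = sigmoid(s)

    # Error of the output unit: dP/ds = e^s / (1 + e^s)^2 = p * (1 - p).
    delta_out = p * (1.0 - p)

    grads = []
    for i, w_i in enumerate(w_hidden):
        delta_i = delta_out * w_out[i]  # error passed back along w_{i,out}
        dh_ds = 1.0 - h[i] ** 2         # derivative of tanh at this unit
        grads.append([delta_i * dh_ds * x for x in phi])
    return grads
```

With one input, one zero-initialized hidden unit, and output weight 1, the gradient is 0.25 (the peak of the sigmoid's derivative), matching the curve on the gradient slide.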
BP in Deep Networks: Vanishing Gradients
[Figure: as the errors δ are propagated back from y toward φ(x), they shrink layer by layer, from around 0.1 near the output to around 1e-6 near the input]

● Exploding gradients can occur as well
Layerwise Training
● Train one layer at a time

[Figure: a deep network from φ(x) to y with one layer trained at a time while the others are held fixed]
Autoencoders
● Initialize the NN by training it to reproduce its own input

[Figure: a network mapping φ(x) through a hidden layer back to φ(x)]

● Advantage: no need for labels y!
Dropout
● Problem: overfitting
● Deactivate part of the network on each update

[Figure: a network from φ(x) to y with several units crossed out (X) on this update]

● Can be seen as an ensemble method
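Dropout on a layer of activations can be sketched as below. The 1/(1-rate) scaling of surviving units ("inverted dropout", so test-time activations need no rescaling) is a common refinement, not something the slide states.

```python
import random

def dropout(h, rate=0.5, train=True):
    # Deactivate each hidden activation with probability `rate` during
    # training; at test time the layer is left untouched.
    if not train:
        return list(h)
    keep = 1.0 - rate
    return [v / keep if random.random() < keep else 0.0 for v in h]
```

Each update samples a different sub-network, which is why the method can be viewed as training an ensemble.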
Network Architectures
Recurrent Neural Nets
[Figure: the same NET applied at each step, reading x1, x2, x3, x4 and outputting y1, y2, y3, y4, with the hidden state passed from one step to the next]

● Good for modeling sequence data
  ● e.g. speech
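One step of the repeated NET in the figure can be sketched with a simple (Elman-style) recurrence; scalar weights and the tanh activation are assumptions to keep the example tiny.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    # The same unit is applied at every position, mixing the current
    # input x_t with the previous hidden state h_{t-1}.
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(xs, w_x=1.0, w_h=0.5, b=0.0):
    h = 0.0
    states = []
    for x in xs:  # x1, x2, x3, x4 ... as in the figure
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states
```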
Convolutional Neural Nets
[Figure: a sliding window over the inputs x1 … x5 computes features over (x1-x3), (x2-x4), (x3-x5), which are combined into the output y]

● Good for modeling data with spatial correlation
  ● e.g. images
Recursive Neural Nets
[Figure: NET units combine the inputs x1 … x5 bottom-up along a tree structure, producing y at the root]

● Good for modeling data with tree structures
  ● e.g. language
Other Topics
Other Topics
● Deep belief networks
  ● Restricted Boltzmann machines
  ● Contrastive estimation
● Generating with neural nets
● Visualizing internal states
● Batch/mini-batch update strategies
● Long short-term memory
● Learning on GPUs
● Tools for implementing deep learning