
Deep Learning: A Quick Overview
Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 790-784, Korea
http://mlg.postech.ac.kr/~seungjin
April 24, 2015
Flat vs Deep
Figure: Flat model (e.g., linear models and RBM)
Figure: Deep model (e.g., deep belief network and deep Boltzmann machine)
Representation learning: embedding, features, clusters, topics
Deep Learning Everywhere:
Acoustic Modeling in Speech Recognition
G. Hinton, et al. (2012),
"Deep Neural Networks for Acoustic Modeling in Speech Recognition"
Deep Learning Everywhere:
ImageNet Classification
Deep CNN: Classify 1.2 million images into 1000 classes
A. Krizhevsky, I. Sutskever, G. Hinton (2012),
"ImageNet Classification with Deep Convolutional Neural Networks"
- SIFT BoW: 256K-dimensional features, MAP 35%
- Deep learning: 1K-dimensional features, MAP 38%
Deep Learning Everywhere:
Approximate Similarity Search
Figure: Deep learning to hash with multiple representations (panels (a), (b), (c))
Y. Kang, S. Kim, S. Choi (2012),
"Deep learning to hash with multiple representations"
Figure: Query images and retrieved images
Success and Issues
- Why successful?
  - Pre-training: Restricted Boltzmann machine (RBM), auto-encoder, nonnegative matrix factorization (NMF)
  - Training: Dropout
  - Rectified units: No vanishing gradient, sparse activation (see the sketch below)
- Issues
  - Distributions: Exponential family harmonium
  - Multi-modal extensions: Dual-wing harmonium, restricted deep belief nets, multi-modal stacked auto-encoders, and multi-modal deep Boltzmann machines
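A minimal NumPy sketch of the two training ingredients named above, dropout and rectified (ReLU) units. The layer sizes, keep probability, and inverted-dropout scaling are illustrative assumptions of mine, not taken from the slides:

import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # Rectified linear unit: zero for negative inputs, identity otherwise,
    # so active units pass gradients unchanged and activations are sparse.
    return np.maximum(z, 0.0)

def dropout(a, keep_prob, training=True):
    # Inverted dropout: randomly zero units during training and rescale,
    # so no change is needed at test time.
    if not training:
        return a
    mask = rng.random(a.shape) < keep_prob
    return a * mask / keep_prob

# Illustrative forward pass through one hidden layer.
x = rng.standard_normal(100)               # input vector (size chosen arbitrarily)
W = 0.01 * rng.standard_normal((50, 100))  # weights for 50 hidden units
b = np.zeros(50)
h = dropout(relu(W @ x + b), keep_prob=0.5)
print("fraction of active hidden units:", np.mean(h > 0))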
Deep Learning
- Single view
  - Multiplicative up-prop (Ahn, Oh, Choi, 2004)
  - Deep belief networks (Hinton et al., 2006)
  - Deep auto-encoders (Hinton and Salakhutdinov, 2006)
  - Stacked denoising auto-encoders (Vincent et al., 2010)
  - Deep Boltzmann machines (Salakhutdinov and Hinton, 2009)
  - Deep convolutional neural networks (Krizhevsky et al., 2012)
- Multiple views
  - Restricted deep belief networks (Kang and Choi, 2011)
  - Multi-modal stacked auto-encoders (Ngiam et al., 2011)
  - Multi-modal deep Boltzmann machines (Srivastava and Salakhutdinov, 2014)
Nonnegative Matrix Factorization
(Lee and Seung, 1999)
Figure: Parts-based representation (Lee and Seung, 1999)
Least Squares NMF
Figure: $V \approx W H$, with $V \geq 0$ element-wise
- Involves the optimization:
  \[
  \arg\min_{W, H} \| V - W H \|^2, \quad \text{subject to } W \geq 0 \text{ and } H \geq 0.
  \]
- Multiplicative updates (see the sketch below):
  \[
  W \leftarrow W \odot \frac{V H^\top}{W H H^\top} = W \odot \frac{R H^\top}{N H^\top}, \qquad
  H \leftarrow H \odot \frac{W^\top V}{W^\top W H} = H \odot \frac{W^\top R}{W^\top N},
  \]
  where $R = V$, $N = W H$, and the multiplication $\odot$ and division are element-wise.
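A minimal NumPy sketch of the multiplicative updates above. The small epsilon added to the denominators is my own numerical safeguard, not part of the slide:

import numpy as np

def nmf_multiplicative(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Least-squares NMF, V ≈ W H with W, H >= 0, via multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # W <- W * (V H^T) / (W H H^T)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Usage: factorize a random nonnegative matrix and check the fit.
V = np.abs(np.random.default_rng(1).standard_normal((30, 20)))
W, H = nmf_multiplicative(V, rank=5)
print("reconstruction error:", np.linalg.norm(V - W @ H))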
Multiplicative Up-Prop
(Ahn, Oh, and Choi, ICML-2004)
\[
W^{(l)}_{ia} \leftarrow W^{(l)}_{ia}
\left( \frac{\left[ R^{(l)} H^{(l)\top} \right]_{ia}}{\left[ N^{(l)} H^{(l)\top} \right]_{ia}} \right)^{\eta},
\qquad
H^{(L)}_{a\mu} \leftarrow H^{(L)}_{a\mu}
\left( \frac{\left[ W^{(L)\top} R^{(L)} \right]_{a\mu}}{\left[ W^{(L)\top} N^{(L)} \right]_{a\mu}} \right)^{\eta},
\]
where $0 < \eta \leq 1$ is the learning rate. $R^{(l+1)}_{i\mu}$ and $N^{(l+1)}_{i\mu}$ are up-propagated:
\[
R^{(l+1)}_{i\mu} = \left[ W^{(l)\top} R^{(l)} \right]_{i\mu}
g'\!\left( \left[ W^{(l)} H^{(l)} \right]_{i\mu} \right),
\qquad
N^{(l+1)}_{i\mu} = \left[ W^{(l)\top} N^{(l)} \right]_{i\mu}
g'\!\left( \left[ W^{(l)} H^{(l)} \right]_{i\mu} \right),
\]
where $l = 1, 2, \ldots, L-1$ and
\[
R^{(1)}_{i\mu} = V_{i\mu}\, g'\!\left( \left[ W^{(1)} H^{(1)} \right]_{i\mu} \right),
\qquad
N^{(1)}_{i\mu} = \left[ W^{(1)} H^{(1)} \right]_{i\mu}\, g'\!\left( \left[ W^{(1)} H^{(1)} \right]_{i\mu} \right).
\]
Hierarchical Representation: From Bottom to Top
Harmonium (or Restricted Boltzmann Machine)
Figure: Complete bipartite graph between visible units x1, ..., x5 and hidden units h1, h2, h3
- Harmonium (Smolensky, 1986) or RBM (Hinton and Sejnowski, 1986) is an undirected model which allows only inter-layer connections (complete bipartite graph).
- An energy-based probabilistic model defines a probability distribution through an energy function, associating an energy with each configuration of the variables of interest:
  \[
  p(x) = \sum_{h} p(x, h) = \sum_{h} \frac{1}{Z}\, e^{-E(x, h)},
  \]
  where $E(x, h) = -b^\top x - c^\top h - h^\top W x$ and $Z = \sum_{x} \sum_{h} e^{-E(x, h)}$ (see the sketch below).
- Learning corresponds to modifying the energy function so that its shape has desirable properties (maximum likelihood estimation).
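A minimal sketch of the harmonium/RBM above, assuming binary visible and hidden units; the binary choice and the sampling helpers are my assumptions, the slide only fixes the energy $E(x, h) = -b^\top x - c^\top h - h^\top W x$:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def energy(x, h, W, b, c):
    # E(x, h) = -b^T x - c^T h - h^T W x  (as on the slide)
    return -b @ x - c @ h - h @ W @ x

def sample_h_given_x(x, W, c):
    # Because of the bipartite structure, hidden units are conditionally
    # independent given x: p(h_j = 1 | x) = sigmoid(c_j + (W x)_j).
    p = sigmoid(c + W @ x)
    return (rng.random(p.shape) < p).astype(float), p

def sample_x_given_h(h, W, b):
    # Symmetrically, p(x_i = 1 | h) = sigmoid(b_i + (W^T h)_i).
    p = sigmoid(b + W.T @ h)
    return (rng.random(p.shape) < p).astype(float), p

# Tiny example with 5 visible and 3 hidden units, matching the figure.
n_vis, n_hid = 5, 3
W = 0.1 * rng.standard_normal((n_hid, n_vis))
b, c = np.zeros(n_vis), np.zeros(n_hid)
x = rng.integers(0, 2, n_vis).astype(float)
h, _ = sample_h_given_x(x, W, c)
print("E(x, h) =", energy(x, h, W, b, c))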
Contrastive Divergence Learning
Average log-likelihood gradient is approximated by k Gibbs steps:
\[
\left\langle \frac{\partial \log p(x)}{\partial \theta} \right\rangle_{\tilde{p}(x)}
\;\approx\;
-\left\langle \frac{\partial}{\partial \theta} E(x, h) \right\rangle_{p^{(k)}(h \mid x)\,\tilde{p}(x)}
+\left\langle \frac{\partial}{\partial \theta} E(x, h) \right\rangle_{p^{(k)}(x, h)},
\]
where $p^{(k)}(x, h)$ is the joint distribution determined by $k$ Gibbs steps.
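A minimal sketch of CD-k for the binary RBM above; again assuming binary units, and the learning rate and the use of hidden probabilities (rather than samples) in the final update are common choices of mine, not prescribed by the slide:

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd_k_step(x_data, W, b, c, k=1, lr=0.01):
    """One CD-k parameter update (k >= 1) from a single binary data vector."""
    # Positive phase: hidden probabilities given the data.
    ph_data = sigmoid(c + W @ x_data)
    # Negative phase: k steps of alternating Gibbs sampling starting at the data.
    h = (rng.random(ph_data.shape) < ph_data).astype(float)
    for _ in range(k):
        px = sigmoid(b + W.T @ h)
        x_model = (rng.random(px.shape) < px).astype(float)
        ph_model = sigmoid(c + W @ x_model)
        h = (rng.random(ph_model.shape) < ph_model).astype(float)
    # Approximate gradient of log p(x): data statistics minus model statistics.
    W += lr * (np.outer(ph_data, x_data) - np.outer(ph_model, x_model))
    b += lr * (x_data - x_model)
    c += lr * (ph_data - ph_model)
    return W, b, c

# Usage on a toy binary vector with 5 visible and 3 hidden units.
n_vis, n_hid = 5, 3
W = 0.1 * rng.standard_normal((n_hid, n_vis))
b, c = np.zeros(n_vis), np.zeros(n_hid)
x = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
W, b, c = cd_k_step(x, W, b, c, k=1)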
Exponential Family Harmonium
(Welling, Rosen-Zvi, Hinton, 2004)
Choose $N_x$ independent distributions $p_i(x_i)$ for the observed variables and $N_h$ independent distributions $p_j(h_j)$ for the hidden variables from the exponential family:
\[
p(x) = \prod_{i=1}^{N_x} \exp\left\{ \sum_{a} \xi_{ia} f_{ia}(x_i) - A_i(\{\xi_{ia}\}) \right\},
\qquad
p(h) = \prod_{j=1}^{N_h} \exp\left\{ \sum_{b} \lambda_{jb} g_{jb}(h_j) - B_j(\{\lambda_{jb}\}) \right\},
\]
where $\{f_{ia}(x_i), g_{jb}(h_j)\}$ are sufficient statistics, $\{\xi_{ia}, \lambda_{jb}\}$ are canonical parameters, and $\{A_i, B_j\}$ are log-partition functions.
Couple the random variables in the log-domain by introducing a quadratic interaction term, leading to the harmonium random field:
\[
p(x, h) \propto \exp\left\{ \sum_{i,a} \xi_{ia} f_{ia}(x_i) + \sum_{j,b} \lambda_{jb} g_{jb}(h_j) + \sum_{i,a,j,b} W_{iajb}\, f_{ia}(x_i)\, g_{jb}(h_j) \right\}.
\]
Applied to information retrieval (fast inference, compared to directed models).
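As a concrete special case (my own illustration, not on the slide): taking a single sufficient statistic per variable, $f_{ia}(x_i) = x_i$ and $g_{jb}(h_j) = h_j$ with binary units, and writing $\xi_i = b_i$, $\lambda_j = c_j$, the harmonium random field reduces to the binary RBM of the previous slides:
\[
p(x, h) \propto \exp\Big\{ \sum_i b_i x_i + \sum_j c_j h_j + \sum_{i,j} W_{ji}\, h_j x_i \Big\} = e^{-E(x, h)},
\qquad E(x, h) = -b^\top x - c^\top h - h^\top W x.
\]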
Multi-Wing Harmonium
Figure: Hidden layer h shared by visible layers x and y
- Xing et al., 2005.
- Take inputs from multiple sources.
- Assume that multi-view inputs reflect the same central theme.
- Consist of multiple harmoniums joined by a shared array of hidden nodes.
Multi-View Harmonium
Figure: View-specific and shared hidden layers over visible layers x and y
- Kang and Choi, 2011.
- Take inputs from multiple sources.
- View-specific hidden nodes and shared hidden nodes.
- Allow for independent + dependent views.
Restricted Deep Belief Networks
Figure: Multi-layer network with view-specific and shared hidden layers over views x and y
- Kang and Choi, 2011.
- Take inputs from multiple sources.
- View-specific hidden nodes and shared hidden nodes.
- Allow for independent + dependent views.
- Multi-layer extension of multi-view harmoniums.
RDBN for Multi-View Learning
Figure: (a) RDBN with view-specific hidden layers $h_x$, $h_y$ and shared hidden layer $h$ over views x and y; (b) annotation by $y^* = \arg\max_{y} p(y \mid x)$ (e.g., tags "snow", "lake")
Image Annotation: Precision-Recall
Figure: Precision-recall curves for 2-layer MWH, 2-layer RDBN, 3-layer RDBN, 4-layer RDBN, Random, SGPLVM, and FOLS-GPLVM
Denoising Autoencoder
Figure: Autoencoder
Figure: Denoising autoencoder (crosses mark corrupted inputs)
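A minimal NumPy sketch of a denoising autoencoder with masking noise; the architecture, noise level, and squared-error loss are illustrative assumptions of mine (Vincent et al. also consider other corruptions and losses):

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def dae_step(x, W1, b1, W2, b2, corrupt_prob=0.3, lr=0.1):
    """One gradient step on the denoising reconstruction error for one example."""
    # Corrupt the input with masking noise, but reconstruct the clean x.
    x_tilde = x * (rng.random(x.shape) >= corrupt_prob)
    h = sigmoid(W1 @ x_tilde + b1)        # encoder
    y = W2 @ h + b2                       # linear decoder
    # Squared-error loss 0.5 * ||y - x||^2 and its backpropagated gradients.
    dy = y - x
    dW2, db2 = np.outer(dy, h), dy
    dh = W2.T @ dy * h * (1.0 - h)
    dW1, db1 = np.outer(dh, x_tilde), dh
    for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= lr * g                       # in-place parameter update
    return 0.5 * np.sum(dy ** 2)

# Usage on a toy example with 20 inputs and 10 hidden units.
n_in, n_hid = 20, 10
W1 = 0.1 * rng.standard_normal((n_hid, n_in)); b1 = np.zeros(n_hid)
W2 = 0.1 * rng.standard_normal((n_in, n_hid)); b2 = np.zeros(n_in)
x = rng.random(n_in)
print("loss:", dae_step(x, W1, b1, W2, b2))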
Stacked Denoising Autoencoder for Face Pose Normalization (Kang and Choi, 2013)
Figure: Pre-training (non-frontal faces as input, frontal faces as target)
Figure: Train another DAE
Figure: 10 examples of corrupted face images from the Georgia Tech face database (left), their pose-normalized versions obtained by the proposed method (middle), and ground-truth frontal face images (right).
Deep Belief Net vs Deep Boltzmann Machine
Figure: DBN (left) with layers v, h1, h2, h3 and factors p(h2, h3), p(h1 | h2), p(v | h1); DBM (right) with the same layers and undirected connections throughout
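The figure's labels can be read as the two joint distributions (a standard summary; the weight symbols $W^{(1)}, W^{(2)}, W^{(3)}$ and the omission of bias terms are my simplifications, not from the slide):
\[
p_{\mathrm{DBN}}(v, h^1, h^2, h^3) = p(h^2, h^3)\, p(h^1 \mid h^2)\, p(v \mid h^1),
\qquad
p_{\mathrm{DBM}}(v, h^1, h^2, h^3) \propto \exp\!\left\{ v^\top W^{(1)} h^1 + (h^1)^\top W^{(2)} h^2 + (h^2)^\top W^{(3)} h^3 \right\}.
\]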
Questions and Discussion