Deep Neural Networks - ASI

Deep Neural Networks
Romain Hérault
Normandie Université - INSA de Rouen - LITIS
April 29 2015
Outline
1. Introduction to supervised learning
2. Introduction to Neural Networks
3. Multi-Layer Perceptron - Feed-forward network
4. Deep Neural Networks
5. Extension to structured output
Supervised learning: Concept
Setup
An input (or feature) space X ⊆ R^m,
An output (or target) space Y.
Objective
Find the link f : X → Y (or the dependencies p(y|x)) between the input and the output spaces.
Supervised learning: general framework
Hypothesis space
f belongs to a hypothesis space H that depends on the chosen method (MLP, SVM, decision trees, ...).
How to choose f within H?
Expected Prediction Error
or generalization error, or generalization risk,
R(f) = E_{X,Y}[L(f(X), Y)] = ∫∫ L(f(x), y) p(x, y) dx dy
where L is a loss function that measures the discrepancy between a prediction f(x) and a target value y.
Supervised learning: different tasks, different losses
Regression
If Y ⊆ R^o, it is a regression task. Standard losses are (y − f(x))² or |y − f(x)|.
Figure: Support Vector Machine regression (1D example).

Classification / Discrimination
If Y is a discrete set, it is a classification or discrimination task. Standard loss is Θ(−y f(x)), where Θ is the step function.
Figure: two-class classification example.
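As a quick illustration of these losses, here is a minimal Python/NumPy sketch; the function and variable names are mine, not from the slides.

import numpy as np

def squared_loss(y, y_hat):
    # standard regression loss (y - f(x))^2
    return (y - y_hat) ** 2

def absolute_loss(y, y_hat):
    # robust regression loss |y - f(x)|
    return np.abs(y - y_hat)

def zero_one_loss(y, f_x):
    # classification loss Theta(-y f(x)) with labels y in {-1, +1}:
    # 1 when the sign of f(x) disagrees with y, 0 otherwise
    return np.where(-y * f_x > 0.0, 1.0, 0.0)

# Example: y = +1, predictions f(x) = 0.8 (correct sign) and -0.3 (wrong sign)
print(zero_one_loss(np.array([1, 1]), np.array([0.8, -0.3])))  # [0. 1.]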
Supervised learning: Experimental setup
Available data
Data consists of a set of n examples (x, y), where x ∈ X and y ∈ Y. It is split into:
A training set that will be used to choose f ,
i.e. to learn the parameters w of the model
A test set to evaluate the chosen f
(A validation set to choose the hyper-parameters of f )
Because of the human cost of labelling data, one may also have a separate unlabelled set,
i.e. examples with only the features x (see semi-supervised learning).
Evaluation: Empirical risk
R_S(f) = (1 / card(S)) ∑_{(x,y)∈S} L(f(x), y)
where S is the train set during learning, the test set during final evaluation.
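A minimal sketch of this empirical risk in Python, assuming `predict` is the learned function f and `S` is a list of (x, y) pairs; the names are illustrative.

import numpy as np

def empirical_risk(predict, S, loss):
    # R_S(f) = (1 / card(S)) * sum over (x, y) in S of L(f(x), y)
    return sum(loss(predict(x), y) for x, y in S) / len(S)

# Example with the squared loss on a toy predictor
S = [(np.array([1.0]), 2.0), (np.array([2.0]), 4.1)]
print(empirical_risk(lambda x: 2.0 * x[0], S, lambda f, y: (f - y) ** 2))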
Supervised learning: Overfitting
Figure: empirical risk on the learning set and on the test set as a function of model complexity (from low to high).
How to fight overfitting?
Adding noise to data or to model parameters (dark age),
Limiting model capacity ⇒ Regularization.
Supervised learning as an optimization problem
Tikhonov regularization scheme
arg min_w ∑_{(x,y)∈S_train} L(f(x; w), y) + λ Ω(w)
where
L is a loss term that measures the accuracy of the model,
Ω is a regularization term that limits the capacity of the model,
λ ∈ [0, ∞[ is the regularization hyper-parameter.
Example: Ridge regression
Linear regression with the sum of squared errors as loss and an L2-norm as regularization:
arg min_{w∈R^d} ||Y − Xw||² + λ ∑_d w_d²
Solution
w(λ) = (XᵀX + λI)⁻¹ XᵀY
Regularization path: {w(λ) | λ ∈ [0, ∞[}
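A sketch of this closed-form solution in NumPy, assuming X is an n×d design matrix and Y an n-vector; variable names are mine.

import numpy as np

def ridge(X, Y, lam):
    # w(lambda) = (X^T X + lambda I)^-1 X^T Y, solved without an explicit inverse
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Regularization path: one solution per value of lambda
X = np.random.randn(50, 3)
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(50)
path = [ridge(X, Y, lam) for lam in np.logspace(-3, 3, 7)]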
Ridge regression: illustration
Figure: Ridge regression in the (w0, w1) plane: contours of the loss term and of the regularization term, and the regularization path from λ = 0 to λ = +∞.
Why do we care about sparsity ?
Sparsity is a very useful property of some machine learning algorithms:
Machine learning is model selection,
Sparse models are cheap to store and transmit,
Sparse coefficients are meaningful; they make more sense,
They are more robust to errors,
They need fewer data to begin with and provide scalable optimization.
In the Big Data era, as datasets become larger, it becomes desirable to process the structured information contained within data, rather than the data itself.
For lectures on sparsity, see Stéphane Canu's website.
Introducing sparsity
Lasso
Linear regression with the sum of squared errors as loss and an L1-norm as regularization:
arg min_{w∈R^d} ||Y − Xw||² + λ ∑_d |w_d|
which is equivalent to
arg min_{w⁺∈R^d, w⁻∈R^d} ||Y − X(w⁺ − w⁻)||² + λ ∑_d (w⁺_d + w⁻_d)
s.t. w⁺_i ≥ 0  ∀i ∈ [1..d]
     w⁻_i ≥ 0  ∀i ∈ [1..d]
Why is it sparse ?
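To see the sparsity in practice, here is a hedged sketch using scikit-learn (an assumption: the library is not used in the slides); on synthetic data with only a few informative features, the L1 penalty drives most coefficients exactly to zero while the L2 penalty does not.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 20)
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]           # only 3 informative features
y = X @ w_true + 0.1 * rng.randn(100)

lasso_coef = Lasso(alpha=0.1).fit(X, y).coef_
ridge_coef = Ridge(alpha=0.1).fit(X, y).coef_
print("zeros (lasso):", np.sum(lasso_coef == 0.0))   # most coefficients are exactly 0
print("zeros (ridge):", np.sum(ridge_coef == 0.0))   # typically none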
Lasso: illustration
Figure: Lasso in the (w0, w1) plane: contours of the loss term and of the regularization term, and the regularization path from λ = 0 to λ = +∞.
Introduction to Neural Networks
History . . .
1940 : Turing machine
1943 : Formal neuron (Mc Culloch & Pitts)
1948 : Automate networks (Von Neuman)
1949 : First learning rules (Hebb)
1957 : Perceptron (Rosenblatt)
1960 : Adaline (Widrow & Hoff)
1969 : Perceptrons (Minsky & Papert)
Limitation of the perceptron
Need for more complex architectures, but then how to learn?
1974 : Gradient back-propagation (Werbos)
no success !?!?
History . . .
1986 : Gradient back-propagation bis (Rumelhart & McClelland, LeCun)
New neural network architectures
New Applications :
Character recognition
Speech recognition and synthesis
Vision (image processing)
1990-2010 : Information society
New fields
Web crawlers
Information extraction
Multimedia (indexing, ...),
Data-mining
Need to combine many models and build adequate features
1992-1995 : Kernel methods
Support Vector Machine (Boser, Guyon and Vapnik)
2005 : Deep networks
Deep Belief Network, DBN (Hinton and Salakhutdinov, 2006)
Deep Neural Network, DNN
Biological neuron
Figure: Scheme of a biological neuron [Wikimedia commons - M. R. Villarreal]
Formal neuron (1)
Origin
Warren McCulloch and Walter Pitts (1943), Frank Rosenblatt (1957),
Mathematical representation of a biological neuron
Schematic
Figure: schematic of a formal neuron: inputs x1, ..., xm with weights w1, ..., wm, a bias b, a summation Σ and an activation producing the output ŷ1.
Formal neuron (2)
Formulation
ŷ = f(⟨w, x⟩ + b)    (1)
where
x, input vector,
ŷ, output estimation,
w, weights linked to each input (model parameter),
b, bias (model parameter),
f, activation function.
Evaluation
Typical losses are
Classification: L(ŷ, y) = −(y log(ŷ) + (1 − y) log(1 − ŷ))
Regression: L(ŷ, y) = ||y − ŷ||²
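A minimal sketch of a formal neuron and of the two losses above, assuming a sigmoid activation; names are illustrative.

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def neuron(x, w, b, f=sigmoid):
    # y_hat = f(<w, x> + b)
    return f(np.dot(w, x) + b)

def cross_entropy(y_hat, y):
    # classification loss: -(y log(y_hat) + (1 - y) log(1 - y_hat))
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

def squared_error(y_hat, y):
    # regression loss: ||y - y_hat||^2
    return np.sum((y - y_hat) ** 2)

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
print(cross_entropy(neuron(x, w, b=0.05), y=1.0))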
Formal neuron (3)
Activation functions are typically the step function, the sigmoid function (range [0, 1]) or the hyperbolic tangent (range [−1, 1]).
Figure: Sigmoid, f(x) = sigm(x)
Figure: Hyperbolic tangent, f(x) = tanh(x)
If loss and activation function are differentiable, parameters w and b can be learned by
gradient descent.
A perceptron
Figure: a perceptron with inputs x0 = 1, x1, x2, x3, weights w_ji, summations giving S1 and S2, activation f and outputs y1, y2.
Let x_i be input number i and y_j output number j:
S_j = ∑_i W_ji x_i
y_j = f(S_j)
with W_j0 = b_j and x_0 = 1.
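A sketch of this forward computation for a whole perceptron layer, folding the bias into W_j0 with x_0 = 1 as above; names are illustrative.

import numpy as np

def perceptron_forward(x, W, f=np.tanh):
    # W has one row per output j; column 0 holds the bias W_j0, so x_0 = 1
    x_ext = np.concatenate(([1.0], x))
    S = W @ x_ext              # S_j = sum_i W_ji x_i
    return f(S)                # y_j = f(S_j)

W = np.array([[0.1, 0.5, -0.3, 0.2],    # weights of output y1 (bias first)
              [0.0, -0.1, 0.4, 0.7]])   # weights of output y2
print(perceptron_forward(np.array([1.0, 2.0, -1.0]), W))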
A perceptron
(Same perceptron as above.)
As the loss is differentiable, we can compute
∂L/∂w_ji = (∂L/∂y_j) (∂y_j/∂S_j) (∂S_j/∂w_ji)
         = (∂L/∂y_j) f'(S_j) x_i
Gradient descent : general algorithm
Input: Integer Nb: batch number
Input: Boolean Sto: stochastic gradient?
Input: (Xtrain, Ytrain): training set
W ← random initialization
(Xsplit, Ysplit) ← split((Xtrain, Ytrain), Nb)
while stopping criterion not reached do
  if Sto then
    (Xsplit, Ysplit) ← randperm((Xsplit, Ysplit))
  end if
  for (Xbloc, Ybloc) ∈ (Xsplit, Ysplit) do
    ∆W ← 0
    for (x, y) ∈ (Xbloc, Ybloc) do
      ∆Wi ← ∆Wi + ∂L(x, W, y)/∂Wi    ∀i
    end for
    ∆W ← ∆W / card(Xbloc)
    W ← W − η ∆W
  end for
end while
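A minimal Python transcription of this algorithm, under the assumption that grad(x, W, y) returns the gradient of the loss for one example; the names and the fixed-epoch stopping rule are mine.

import numpy as np

def gradient_descent(grad, W, X, Y, n_batches, eta=0.01, n_epochs=100, stochastic=True):
    rng = np.random.RandomState(0)
    idx = np.array_split(np.arange(len(X)), n_batches)      # split into Nb blocks
    for _ in range(n_epochs):                                # stopping criterion: fixed number of epochs
        if stochastic:
            rng.shuffle(idx)                                 # random permutation of the blocks
        for block in idx:
            dW = np.zeros_like(W)
            for x, y in zip(X[block], Y[block]):
                dW += grad(x, W, y)                          # accumulate per-example gradients
            dW /= len(block)                                 # average over the block
            W = W - eta * dW                                 # gradient step
    return W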
Neural network
A perceptron can only solve linearly separable problems
Neural network
To solve more complex problems, we need to build a network of perceptrons
Principles
The network is an oriented graph; each node represents a formal neuron,
Information flows along the graph edges,
Computation is distributed over the nodes.
Multi-Layer Perceptron - Feed-forward network
Figure: Feed-forward network, with two layers and one hidden representation
Neurons are layered.
Computation always flows in one direction.
Recurrent network
At least one feedback (retroactive) loop
Hysteresis effect
Figure: Recurrent network
Recurrent network
Figure: NARX recurrent network: the current output ŷ1,t is computed from the inputs x1,t, x2,t, x3,t and the past outputs ŷ1,t−1, ŷ1,t−2, ŷ1,t−3.
Multi-Layer Perceptron - Feed-forward network
Scheme of a Multi Layer Perceptron
Figure: Example of feed-forward network: a 2-layer perceptron with inputs x1, ..., x4, one hidden representation and outputs ŷ1, ŷ2.
Formalism:
Layer, computational element,
Representation, data element
This MLP has:
an input layer and an output layer (2 layers),
an input, a hidden and an output representation (3 representations).
Forward path: estimation of ŷ
Figure: layer (l), with inputs I_i^(l) (I_0^(l) = 1), weighted sums S_j^(l), activation f^(l) and outputs O_j^(l).
If we look at layer (l), let I_i^(l) be input number i and O_j^(l) output number j:
S_j^(l) = ∑_i W_ji^(l) I_i^(l)
O_j^(l) = f^(l)(S_j^(l)) = I_j^(l+1)
The forward path starts with I^(0) = x and finishes with O^(last) = ŷ.
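A sketch of the full forward path for a list of weight matrices, following the convention above that the bias sits in column 0 and I_0 = 1; names are illustrative.

import numpy as np

def mlp_forward(x, weights, activations):
    # weights[l] is the matrix W^(l) (bias in column 0); activations[l] is f^(l)
    I = x
    for W, f in zip(weights, activations):
        I_ext = np.concatenate(([1.0], I))   # prepend I_0^(l) = 1
        S = W @ I_ext                        # S_j^(l) = sum_i W_ji^(l) I_i^(l)
        I = f(S)                             # O^(l) = f^(l)(S^(l)) = I^(l+1)
    return I                                 # O^(last) = y_hat

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
weights = [np.random.randn(3, 5), np.random.randn(2, 4)]   # 4 inputs -> 3 hidden -> 2 outputs
print(mlp_forward(np.random.randn(4), weights, [np.tanh, sigmoid]))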
How to learn parameters ? Gradient back-propagation
We assume we know ∂L/∂O_j^(l). Then
∂L/∂w_ji^(l) = (∂L/∂O_j^(l)) (∂O_j^(l)/∂S_j^(l)) (∂S_j^(l)/∂w_ji^(l))
             = (∂L/∂O_j^(l)) f'^(l)(S_j^(l)) I_i^(l)
Now we compute ∂L/∂I_i^(l):
∂L/∂I_i^(l) = ∑_j (∂L/∂O_j^(l)) (∂O_j^(l)/∂I_i^(l))
            = ∑_j (∂L/∂O_j^(l)) (∂O_j^(l)/∂S_j^(l)) (∂S_j^(l)/∂I_i^(l))
            = ∑_j (∂L/∂O_j^(l)) f'^(l)(S_j^(l)) w_ji^(l)
Start:
∂L/∂O_j^(last) = ∂L/∂ŷ_j
Backward recurrence:
∂L/∂w_ji^(l) = (∂L/∂O_j^(l)) f'^(l)(S_j^(l)) I_i^(l)
∂L/∂I_i^(l)  = ∑_j (∂L/∂O_j^(l)) f'^(l)(S_j^(l)) w_ji^(l)
∂L/∂O_j^(l−1) = ∂L/∂I_j^(l)    (since O^(l−1) = I^(l))
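A sketch of this backward recurrence for one example, matching the forward-pass sketch above; it assumes tanh activations (so f' = 1 − f²) and a squared-error loss, and all names are mine.

import numpy as np

def mlp_backprop(x, y, weights):
    # Gradients of L = ||y - y_hat||^2 for an MLP with tanh activations, bias in column 0
    inputs, sums = [], []
    I = x
    for W in weights:                        # forward pass, keeping I^(l) and S^(l)
        I_ext = np.concatenate(([1.0], I))
        S = W @ I_ext
        inputs.append(I_ext)
        sums.append(S)
        I = np.tanh(S)
    dL_dO = 2.0 * (I - y)                    # start: dL/dO^(last) = dL/dy_hat
    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):  # backward recurrence
        f_prime = 1.0 - np.tanh(sums[l]) ** 2        # f'^(l)(S^(l))
        delta = dL_dO * f_prime                      # dL/dO_j^(l) f'^(l)(S_j^(l))
        grads[l] = np.outer(delta, inputs[l])        # dL/dw_ji^(l) = delta_j I_i^(l)
        dL_dO = weights[l][:, 1:].T @ delta          # dL/dI_i^(l) = dL/dO_i^(l-1)
    return grads

weights = [np.random.randn(3, 5), np.random.randn(2, 4)]
grads = mlp_backprop(np.random.randn(4), np.array([0.1, -0.2]), weights)
print([g.shape for g in grads])   # [(3, 5), (2, 4)]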
Deep Neural Networks
Deep architecture
Figure: a deep feed-forward network with inputs x1, ..., x5, several hidden representations and output ŷ1.
Why ?
Some problems need an exponential number of neurons in the hidden representation,
Build / extract features inside the NN in order not to rely on handmade extraction (human prior).
The vanishing gradient problem
Figure: hyperbolic tangent, f(x) = tanh(x)
∂L/∂I_i^(l) = ∑_j (∂L/∂O_j^(l)) f'^(l)(S_j^(l)) w_ji^(l)
When neurons at higher layers are saturated, the gradient decreases toward zero.
Solution
Better topology, better initialization of the weights,
Regularization !
Convolutional network
A unit of representation (l) is connected to a sub-slice of o units of representation (l − 1). All the weights between units are tied, leading to only o weights. Warning: biases are not tied.
If representation (l − 1) is in R^m and (l) is in R^n, the number of parameters goes from (m + 1) · n to (o + 1) · n.
Figure: 1D convolutional network with a shared filter (w1, w2, w3).
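A sketch of such a 1D convolutional layer with a shared 3-tap filter and untied biases; it assumes "valid" positions only, and the names are mine.

import numpy as np

def conv1d_layer(x, w, b, f=np.tanh):
    # w is shared over all output units (o = len(w) weights); b has one bias per output unit
    o = len(w)
    n = len(x) - o + 1                       # number of output units ("valid" convolution)
    S = np.array([np.dot(w, x[i:i + o]) for i in range(n)]) + b
    return f(S)

x = np.random.randn(8)
w = np.array([0.2, -0.1, 0.5])               # the tied weights w1, w2, w3
b = np.random.randn(len(x) - len(w) + 1)     # biases are not tied
print(conv1d_layer(x, w, b))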
Convolutional network : 2D example
Figure: [LeCun 2010]
LeCun, Y. (1989). Generalization and network design strategies. Connections in Perspective. North-Holland, Amsterdam, 143-55.
LeCun, Y., Kavukcuoglu, K., & Farabet, C. (2010, May). Convolutional networks and applications in vision. In Circuits and Systems (ISCAS),
Proceedings of 2010 IEEE International Symposium on (pp. 253-256). IEEE.
Better initialization through unsupervised learning
The learning is split into two steps:
Pre-training
An unsupervised pre-training of the input layers with auto-encoders. Intuition: learning the manifold where the input data resides.
It can take into account an unlabelled dataset.
Finetuning
A finetuning of the whole network with supervised back-propagation.
Hinton, G. E., Osindero, S. and Teh, Y. (2006) A fast learning algorithm for deep belief nets. Neural Computation, 18, pp 1527-1554
Hinton, G. E. and Salakhutdinov, R. R. (2006) Reducing the dimensionality of data with neural networks. Science, Vol. 313. no. 5786, pp. 504 - 507,
28 July 2006.
Diabolo network, Autoencoders
Autoencoders are neural networks where the input and output representations have the same number of units. The learned target is the input itself.
Figure: Diabolo network: inputs x1, ..., x5 are encoded into hidden units h1, h2 and decoded into reconstructions x̂1, ..., x̂5.
With 2 layers:
The input layer is called the encoder,
The output layer, the decoder.
Tied weights (Wdec = Wencᵀ): convergence? PCA?
Undercomplete, size(h) < size(x)
Overcomplete, size(x) < size(h).
Building from auto-encoders
Autoencoders are stacked and learned layer by layer. When a layer is pre-trained, its
weights are fixed until the finetuning.
Figure: step 1: an autoencoder x → x̂ is trained on the raw input x1, ..., x5.
Figure: step 2: the first hidden representation h1 is computed from x, and an autoencoder h1 → ĥ1 is trained on it.
Figure: step 3: an autoencoder h2 → ĥ2 is trained on the second hidden representation.
Figure: step 4: the pre-trained layers are stacked into the full network x → ŷ1 and finetuned with supervision.
Simplified stacked AE Algorithm
Input: X, a training feature set of size Nbexamples × Nbfeatures
Input: Y, a corresponding training label set of size Nbexamples × Nblabels
Input: Ninput, the number of input layers to be pre-trained
Input: Noutput, the number of output layers to be pre-trained
Input: N, the number of layers in the IODA, Ninput + Noutput < N
Output: [w1, w2, ..., wN], the parameters for all the layers
Randomly initialize [w1, w2, ..., wN]
Input pre-training
R ← X
for i ← 1..Ninput do
  {Train an AE on R and keep its encoding parameters}
  [wi, wdummy] ← MLPTRAIN([wi, wiᵀ], R, R)
  Drop wdummy
  R ← MLPFORWARD([wi], R)
end for
Final supervised learning
[w1, w2, ..., wN] ← MLPTRAIN([w1, w2, ..., wN], X, Y)
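A schematic Python version of this input pre-training loop; `mlp_train` and `mlp_forward` stand for the MLPTRAIN / MLPFORWARD routines of the pseudocode and are assumed to exist, so this is a sketch of the control flow only, not of their internals.

def stacked_ae_pretrain(X, Y, weights, n_input, mlp_train, mlp_forward):
    # Input pre-training: greedily train one autoencoder per input layer
    R = X
    for i in range(n_input):
        # train an AE [w_i, w_i^T] on R -> R and keep only the encoding weights
        weights[i], _ = mlp_train([weights[i], weights[i].T], R, R)
        # propagate the data through the frozen encoder for the next layer
        R = mlp_forward([weights[i]], R)
    # Final supervised learning (finetuning) of the whole stack
    return mlp_train(weights, X, Y)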
Improve optimization by adding noise 1/3
Denoising (undercomplete) auto-encoders
The auto-encoder is learned from x̃, a disturbed version of x; the target is still the clean x.
Figure: denoising autoencoder: the input x is disturbed into x̃, encoded into h1, h2, and the reconstruction x̂ is compared against the clean x.
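A sketch of the input corruption used by a denoising autoencoder: the network sees x̃ but the loss compares the reconstruction to the clean x; the corruption type and rate here are illustrative assumptions.

import numpy as np

def corrupt(x, rate=0.3, rng=np.random):
    # masking noise: randomly set a fraction of the inputs to zero
    mask = rng.rand(*x.shape) > rate
    return x * mask

x = np.random.rand(5)
x_tilde = corrupt(x)                      # fed to the encoder
# ... x_hat = decoder(encoder(x_tilde)) ...
# loss = np.sum((x - x_hat) ** 2)         # target is still the clean x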
Improve optimization by adding noise 2/3
Prevent co-adaptation in (overcomplete) autoencoders
During training, randomly disconnect hidden units.
Figure: an overcomplete autoencoder (x1, x2, x3 → h1, ..., h5 → x̂1, x̂2, x̂3) with some hidden units randomly disconnected.
Figure: MNIST [Hinton 2012]
Improve optimization by adding noise 3/3
Dropout
During training, at each iteration, randomly disconnect weights with probability p.
At testing, multiply the weights by (# actual disconnections / # iterations) (≠ p).
Figure: an MLP (x1, ..., x5 → ŷ1) with some connections dropped.
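A sketch of dropout on one hidden representation: a Bernoulli mask disconnects units during training, and a rescaling is applied at test time. For simplicity the usual 1 − p factor is used here; the slide's exact rule would instead track the empirical disconnection rate.

import numpy as np

def dropout(h, p, training, rng=np.random):
    if training:
        mask = (rng.rand(*h.shape) >= p).astype(h.dtype)   # drop each unit with probability p
        return h * mask
    return h * (1.0 - p)   # test time: simple rescaling by the expected keep rate

h = np.random.rand(5)
print(dropout(h, p=0.5, training=True))
print(dropout(h, p=0.5, training=False))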
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of
feature detectors. arXiv:1207.0580.
Figure: Reuters dataset
Tikhonov regularization scheme
Noise and early stopping are connected to regularization.
So why not use the Tikhonov regularization scheme?
J = ∑_i L(y_i, f(x_i; w)) + λ Ω(w)
Notation
2-layer MLP: ŷ = fMLP(x; win, wout) = fout(bout + wout · fin(bin + win · x))
AE: x̂ = fAE(x; wenc, wdec) = fdec(bdec + wdec · fenc(benc + wenc · x))
Tied weights: win ↔ wenc, wdec ↔ wencᵀ
Bishop, C. M. (1995). Training with noise is equivalent to Tikhonov regularization. Neural computation, 7(1), 108-116
Collobert, R. and Bengio, S. (2004). Links between perceptrons, MLPs and SVMs. In ICML’2004
Regularization on weights
J = ∑_i L(y_i, fMLP(x_i; w)) + λ Ω(wout)
It is enough to regularize output-layer weights.
L2 (Gaussian prior): Ω(wout) = ∑_d ||w_d||²
L1 (Laplace prior): Ω(wout) = ∑_d |w_d|
t-Student: Ω(wout) = ∑_d log(1 + w_d²)
With infinite units,
L1 : boosting
L2 : SVM
Bengio, Y., Roux, N. L., Vincent, P., Delalleau, O., & Marcotte, P. (2005). Convex neural networks. In Advances in neural information processing
systems (pp. 123-130)
Contractive autoencoder 1/2
Figure: Input manifold
AE must be sensitive to [blue] direction to reconstruct well
It can be insensitive to [orange] direction.
Contractive autoencoder 2/2
The autoencoder should:
reconstruct correctly the x that lie on the input manifold:
∑_i L(x_i, fAE(x_i; wenc))
be insensitive to small changes of x outside the manifold (i.e. project onto the manifold)
⇒ penalize by the Jacobian:
||Jfenc(x; wenc)||²_F = ∑_ij (∂f_j(x; wenc) / ∂x_i)²
Objective function
J = ∑_i L(x_i, fAE(x_i; wenc)) + λ ||Jfenc(x_i; wenc)||²_F
Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011). Contractive auto-encoders: Explicit invariance during feature extraction. In
Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 833–840
Regularization brought by multi-task learning / embedding
Combine multiple tasks in the same optimization problem. Tasks share parameters.
J = λ_L ∑_{i∈L} L(y_i, fMLP(x_i; wout, win))
  + λ_U ∑_{i∈L∪U} L(x_i, fAE(x_i; win))
  + λ_Ω Ω(wout)
and possibly further regularizers such as + λ_J ||Jfin(x; win)||²_F + ...
Mix supervised and unsupervised data (L: labelled set, U: unlabelled set).
Weston, J., Ratle, F., and Collobert, R. . Deep learning via semi-supervised embedding. ICML, pages 1168–1175, 2008
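A sketch of how this combined objective could be accumulated, assuming f_mlp and f_ae share the input weights and that L_set / U_set are the labelled and unlabelled sets; every name here is illustrative, not from the paper or the slides.

import numpy as np

def multitask_objective(L_set, U_set, f_mlp, f_ae, omega,
                        lam_L=1.0, lam_U=0.1, lam_O=0.01):
    # L_set: list of (x, y) labelled pairs; U_set: list of unlabelled x
    sup = sum(np.sum((y - f_mlp(x)) ** 2) for x, y in L_set)        # supervised term
    all_x = [x for x, _ in L_set] + list(U_set)
    unsup = sum(np.sum((x - f_ae(x)) ** 2) for x in all_x)          # AE term on L ∪ U
    return lam_L * sup + lam_U * unsup + lam_O * omega()            # Ω(w_out) term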
Extension to structured output
Structured output
Ad-hoc definition
Data that consists of several parts, and not only the parts themselves contain
information, but also the way in which the parts belong together. Christoph Lampert
Automatic transcription
Automatic translation
Point matching
Image labeling (semantic image segmentation)
Landmark detection
Input/Output Deep Architecture (IODA)
Learn output dependencies the same way a DNN learns input dependencies.
R. Hérault, B. Labbé & C. Chatelain. Learning Deep Neural Networks for High Dimensional Output Problems. In IEEE International Conference on Machine Learning and Applications, 2009 (pp. 63-68).
J. Lerouge, R. Hérault, C. Chatelain, F. Jardin, R. Modzelewski. IODA: An input/output deep architecture for image labeling. Pattern Recognition, available online 27 March 2015, ISSN 0031-3203.
The image labeling problem
Figure: datasets (Toy and Sarcopenia), each with an input image and a target label image.
Input/Output Deep Architecture (IODA) for Image Labeling
Figure: The IODA architecture. It directly links the pixel matrix to the label matrix. The input layers
(left, light) are pre-trained to provide a high level representation of the image pixels, while the
output layers (right, dark) are pre-trained to learn the a priori knowledge of the problem.
Simplified IODA Algorithm 1/2
Input: X, a training feature set of size Nbexamples × Nbfeatures
Input: Y, a corresponding training label set of size Nbexamples × Nblabels
Input: Ninput, the number of input layers to be pre-trained
Input: Noutput, the number of output layers to be pre-trained
Input: N, the number of layers in the IODA, Ninput + Noutput < N
Output: [w1, w2, ..., wN], the parameters for all the layers
Randomly initialize [w1, w2, ..., wN]
Input pre-training
R ← X
for i ← 1..Ninput do
  {Train an AE on R and keep its encoding parameters}
  [wi, wdummy] ← MLPTRAIN([wi, wiᵀ], R, R)
  Drop wdummy
  R ← MLPFORWARD([wi], R)
end for
Simplified IODA Algorithm 2/2
Output pre-training
R ← Y
for i ← N..N − Noutput + 1 step −1 do
  {Train an AE on R and keep its decoding parameters}
  [u, wi] ← MLPTRAIN([wiᵀ, wi], R, R)
  R ← MLPFORWARD([u], R)
  Drop u
end for
Final supervised learning
[w1, w2, ..., wN] ← MLPTRAIN([w1, w2, ..., wN], X, Y)
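The output pre-training mirrors the input pre-training but keeps the decoding weights. A schematic Python version under the same assumed `mlp_train` / `mlp_forward` helpers as in the earlier sketch:

def ioda_output_pretrain(Y, weights, n_output, mlp_train, mlp_forward):
    # Output pre-training: train AEs on the labels and keep the decoding parameters
    R = Y
    N = len(weights)
    for i in range(N - 1, N - n_output - 1, -1):
        # train an AE [w_i^T, w_i] on R -> R and keep only the decoding weights w_i
        u, weights[i] = mlp_train([weights[i].T, weights[i]], R, R)
        # propagate through the encoder u to build the next (deeper) output target
        R = mlp_forward([u], R)
    return weights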
Qualitative results 1/3
Panels: iterations 10, 100, 200 and 300 for the (NDA), (IDA) and (IODA) strategies.
Figure: Evolution of the output image of the architecture according to the number of batch
gradient descent iterations for the three learning strategies, using the validation example #10.
Qualitative results 2/3
Figure: non-sarcopenic patient: (a) CT image, (b) ground truth, (c) Chung, (d) IODA.
Qualitative results 3/3
Figure: sarcopenic patient: (a) CT image, (b) ground truth, (c) Chung, (d) IODA.
Quantitative results
Table: Toy dataset, 3-layer MLP: ten architectures with hidden representation sizes r1, r2 ∈ {1024, 2048}, an input X and an output Ŷ of size 128², compared with input pre-training, output pre-training or no pre-training. Train errors range from 2.64e-02 to 1.03e-01 and test errors from 3.48e-02 to 1.06e-01.
Method   Diff. (%)   Jaccard (%)
Chung    -10.6       60.3
NDA      0.12        85.88
IDA      0.15        85.91
IODA     3.37        88.47
Table: Sarcopenia.
Why not use multi-tasking + Tikhonov schemes ?
Notation
3-layer MLP: ŷ = fMLP(x; win, wlink, wout) = fout(bout + wout · flink(blink + wlink · fin(bin + win · x)))
Input AE: x̂ = fAEi(x; win) = fdec(bdec + winᵀ · fenc(benc + win · x))
Output AE: ŷ = fAEo(y; wout) = fdec(b'dec + wout · fenc(b'enc + woutᵀ · y))
Objective function
J = λ_L ∑_{i∈L} L(y_i, fMLP(x_i; win, wlink, wout))
  + λ_U ∑_{i∈L∪U} L(x_i, fAEi(x_i; win))
  + λ_L' ∑_{i∈L} L(y_i, fAEo(y_i; wout))
  + λ_Ω Ω(wlink)
Submitted to ECML: Input/Output Deep Architecture for Structured Output Problems, Soufiane Belharbi, Clément Chatelain, Romain Hérault and Sébastien Adam, arXiv:1504.07550.
Facial landmark detection problem
Competition i-bug: http://ibug.doc.ic.ac.uk/resources/300-W_IMAVIS
Images from:
Zhanpeng Zhang, Ping Luo, Chen Change Loy, Xiaoou Tang. Learning and Transferring Multi-task Deep Representation for Face Alignment.
Technical report, arXiv:1408.3967, 2014
Facial landmark detection, some results
Figure: Early results on facial landmark detection
Questions ?
?
References
Y. Bengio, A. Courville, P. Vincent, ”Representation Learning: A Review and New Perspectives,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 35, no. 8, pp. 1798-1828, Aug., 2013 (arXiv:1206.5538)
Bengio, Y. (2012). Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade (pp.
437-478). Springer Berlin Heidelberg.
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of
feature detectors. (arXiv:1207.0580).
LeCun, Y., Kavukcuoglu, K., & Farabet, C. (2010, May). Convolutional networks and applications in vision. In Circuits and Systems (ISCAS),
Proceedings of 2010 IEEE International Symposium on (pp. 253-256). IEEE.
J. Lerouge, R. Herault, C. Chatelain, F. Jardin, R. Modzelewski, IODA: An input/output deep architecture for image labeling, Pattern Recognition,
Available online 27 March 2015, ISSN 0031-3203, http://dx.doi.org/10.1016/j.patcog.2015.03.017.
Hugo Larochelle lectures:
http://info.usherbrooke.ca/hlarochelle/cours/ift725_A2013/contenu.html