Deep Neural Networks
Romain Hérault
Normandie Université - INSA de Rouen - LITIS
April 29, 2015

Outline
1 Introduction to supervised learning
2 Introduction to Neural Networks
3 Multi-Layer Perceptron - Feed-forward network
4 Deep Neural Networks
5 Extension to structured output

Introduction to supervised learning

Supervised learning: concept
Setup:
- an input (or feature) space X ⊆ R^m,
- an output (or target) space Y.
Objective: find the link f : X → Y (or the dependencies p(y|x)) between the input and the output spaces.

Supervised learning: general framework
Hypothesis space: f belongs to a hypothesis space H that depends on the chosen method (MLP, SVM, decision trees, ...).
How do we choose f within H? Through the expected prediction error, also called generalization error or generalization risk,
R(f) = E_{X,Y}[L(f(X), Y)] = ∫∫ L(f(x), y) p(x, y) dx dy
where L is a loss function that measures the accuracy of a prediction f(x) with respect to a target value y.

Supervised learning: different tasks, different losses
Regression: if Y ⊆ R^o, it is a regression task. Standard losses are (y − f(x))^2 and |y − f(x)|.
[Figure: Support Vector Machine regression on a 1D example]
Classification / discrimination: if Y is a discrete set, it is a classification (or discrimination) task. The standard loss is Θ(−y f(x)), where Θ is the step function.
[Figure: binary classification of 2D points]

Supervised learning: experimental setup
Available data: a set of n examples (x, y) where x ∈ X and y ∈ Y. It is split into:
- a training set used to choose f, i.e. to learn the parameters w of the model,
- a test set used to evaluate the chosen f,
- (a validation set used to choose the hyper-parameters of f).
Because of the human cost of labelling data, one may also have a separate unlabelled set, i.e. examples with only the features x (see semi-supervised learning).
Evaluation: the empirical risk
R_S(f) = (1 / card(S)) Σ_{(x,y) ∈ S} L(f(x), y)
where S is the training set during learning and the test set during the final evaluation.

Supervised learning: overfitting
[Figure: empirical risk on the learning set and on the test set as a function of model complexity]
Remedies:
- adding noise to the data or to the model parameters (the "dark age" approach),
- limiting the model capacity ⇒ regularization.

Supervised learning as an optimization problem
Tikhonov regularization scheme:
arg min_w Σ_{(x,y) ∈ S_train} L(f(x; w), y) + λ Ω(w)
where
- L is a loss term that measures the accuracy of the model,
- Ω is a regularization term that limits the capacity of the model,
- λ ∈ [0, ∞[ is the regularization hyper-parameter.
Example: ridge regression, i.e. linear regression with the sum of squared errors as loss and the L2 norm as regularization:
arg min_{w ∈ R^d} ||Y − Xw||^2 + λ Σ_d w_d^2
Solution: w(λ) = (X^T X + λ I)^{-1} X^T Y
Regularization path: {w(λ) | λ ∈ [0, ∞[}
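A minimal numpy sketch of the closed-form ridge estimator above; the toy data and the λ values are illustrative assumptions, not taken from the slides.

import numpy as np

def ridge(X, Y, lam):
    """Closed-form ridge estimator: w(lambda) = (X^T X + lambda I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Toy data: 100 examples, 5 features (placeholder values)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=100)

# A few points of the regularization path w(lambda)
for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, np.round(ridge(X, Y, lam), 3))

As λ grows, the coefficients shrink smoothly toward zero, tracing the regularization path of the slide above.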
Ridge regression: illustration
[Figure: ridge regression in the (w0, w1) plane, showing the loss term, the regularization term and the regularization path from λ = 0 to λ = +∞]

Why do we care about sparsity?
Sparsity is a very useful property of some machine learning algorithms:
- machine learning is model selection,
- sparse models are cheap to store and transmit,
- sparse coefficients are meaningful, they make more sense,
- they are more robust to errors,
- they need fewer data to begin with,
- they allow scalable optimization.
In the Big Data era, as datasets become larger, it becomes desirable to process the structured information contained within the data rather than the data itself.
For lectures on sparsity, see Stéphane Canu's website.

Introducing sparsity: the Lasso
Linear regression with the sum of squared errors as loss and the L1 norm as regularization:
arg min_{w ∈ R^d} ||Y − Xw||^2 + λ Σ_d |w_d|
which is equivalent to
arg min_{w+, w− ∈ R^d} ||Y − X(w+ − w−)||^2 + λ Σ_d (w+_d + w−_d)
s.t. w+_i ≥ 0 and w−_i ≥ 0 for all i ∈ [1..d].
Why is it sparse?

Lasso: illustration
[Figure: Lasso in the (w0, w1) plane, showing the loss term, the regularization term and the regularization path from λ = 0 to λ = +∞]
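To see the sparsity in practice, here is a minimal sketch (not from the slides) that minimizes the Lasso objective above by iterative soft-thresholding (ISTA); the step size, data and λ are illustrative assumptions.

import numpy as np

def lasso_ista(X, Y, lam, n_iter=500):
    """Minimize ||Y - Xw||^2 + lam * ||w||_1 by iterative soft-thresholding."""
    d = X.shape[1]
    w = np.zeros(d)
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ w - Y)               # gradient of the quadratic loss
        z = w - step * grad                        # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding (prox of the L1 term)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.array([3.0, 0, 0, -2.0, 0, 0, 0, 1.0, 0, 0])   # sparse ground truth
Y = X @ w_true + 0.1 * rng.normal(size=100)
print(np.round(lasso_ista(X, Y, lam=10.0), 3))   # many coefficients come out exactly zero

The soft-thresholding step is what drives small coefficients exactly to zero, the "corner" effect illustrated in the Lasso figure.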
Introduction to Neural Networks

History (1)
- 1940: Turing machine
- 1943: formal neuron (McCulloch & Pitts)
- 1948: automata networks (von Neumann)
- 1949: first learning rules (Hebb)
- 1957: perceptron (Rosenblatt)
- 1960: Adaline (Widrow & Hoff)
- 1969: Perceptrons (Minsky & Papert): limitation of the perceptron; more complex architectures are needed, but then how to learn them?
- 1974: gradient back-propagation (Werbos), with no success at the time!

History (2)
- 1986: gradient back-propagation again (Rumelhart & McClelland, LeCun): new neural network architectures; new applications: character recognition, speech recognition and synthesis, vision (image processing).
- 1990-2010: information society: new fields (web crawlers, information extraction, multimedia indexing, data mining) require combining many models and building adequate features.
- 1992-1995: kernel methods: Support Vector Machine (Boser, Guyon and Vapnik).
- 2005: deep networks: Deep Belief Network, DBN (Hinton et al., 2006); Deep Neural Network, DNN.

Biological neuron
Figure: scheme of a biological neuron [Wikimedia Commons - M. R. Villarreal]

Formal neuron (1)
Origin: Warren McCulloch and Walter Pitts (1943), Frank Rosenblatt (1957). It is a mathematical representation of a biological neuron.
[Figure: formal neuron, inputs x1, ..., xm weighted by w1, ..., wm, summed together with a bias b and passed through an activation function to produce ŷ]

Formal neuron (2)
Formulation:
ŷ = f(⟨w, x⟩ + b)    (1)
where
- x is the input vector,
- ŷ is the output estimation,
- w are the weights linked to each input (model parameters),
- b is the bias (model parameter),
- f is the activation function.
Evaluation: typical losses are
- classification: L(ŷ, y) = −(y log(ŷ) + (1 − y) log(1 − ŷ)),
- regression: L(ŷ, y) = ||y − ŷ||^2.

Formal neuron (3)
Activation functions are typically the step function, the sigmoid function (range [0, 1]) or the hyperbolic tangent (range [−1, 1]).
[Figure: sigmoid f(x) = sigm(x)]
[Figure: hyperbolic tangent f(x) = tanh(x)]
If the loss and the activation function are differentiable, the parameters w and b can be learned by gradient descent.

A perceptron
[Figure: single-layer perceptron with inputs x0 = 1, x1, x2, x3, weights w_ji, sums S1, S2, activation f and outputs y1, y2]
Let x_i be input number i and y_j output number j:
S_j = Σ_i W_ji x_i
y_j = f(S_j)
with W_j0 = b_j and x_0 = 1.
As the loss is differentiable, we can compute
∂L/∂w_ji = (∂L/∂y_j)(∂y_j/∂S_j)(∂S_j/∂w_ji) = (∂L/∂y_j) f'(S_j) x_i

Gradient descent: general algorithm
Input: integer Nb, the number of batches
Input: boolean Sto, stochastic gradient?
Input: (Xtrain, Ytrain), the training set
W ← random initialization
(Xsplit, Ysplit) ← split((Xtrain, Ytrain), Nb)
while stopping criterion not reached do
  if Sto then
    (Xsplit, Ysplit) ← randperm((Xsplit, Ysplit))
  end if
  for (Xbloc, Ybloc) ∈ (Xsplit, Ysplit) do
    ΔW ← 0
    for (x, y) ∈ (Xbloc, Ybloc) do
      ΔW_i ← ΔW_i + ∂L(x, W, y)/∂W_i   ∀i
    end for
    ΔW ← ΔW / card(Xbloc)
    W ← W − η ΔW
  end for
end while
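A minimal numpy sketch of the mini-batch loop above, applied to a single sigmoid neuron with the squared loss; the learning rate, number of epochs, batch count and toy data are illustrative assumptions.

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_neuron(X, Y, n_epochs=200, n_batches=10, eta=0.5):
    """Mini-batch gradient descent on a single sigmoid neuron with squared loss."""
    rng = np.random.default_rng(0)
    n, m = X.shape
    w, b = rng.normal(scale=0.1, size=m), 0.0            # random initialization
    for _ in range(n_epochs):                            # "while stopping criterion not reached"
        order = rng.permutation(n)                       # stochastic variant: shuffle before splitting
        for batch in np.array_split(order, n_batches):
            y_hat = sigmoid(X[batch] @ w + b)
            # dL/dS for L = (y_hat - y)^2 and y_hat = sigmoid(S)
            delta = 2 * (y_hat - Y[batch]) * y_hat * (1 - y_hat)
            grad_w = X[batch].T @ delta / len(batch)     # average over the mini-batch
            grad_b = delta.mean()
            w, b = w - eta * grad_w, b - eta * grad_b    # W <- W - eta * deltaW
    return w, b

# Toy, linearly separable data (placeholder)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
Y = (X @ np.array([1.0, -1.0, 2.0]) > 0).astype(float)
w, b = train_neuron(X, Y)
print(((sigmoid(X @ w + b) > 0.5) == Y).mean())          # training accuracy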
Neural network
A perceptron can only solve linearly separable problems. To solve more complex problems, we need to build a network of perceptrons.
Principles:
- the network is an oriented graph, each node represents a formal neuron,
- information follows the graph edges,
- computation is distributed over the nodes.

Multi-Layer Perceptron - Feed-forward network
[Figure: feed-forward network with two layers and one hidden representation]
Neurons are layered and computation always flows in one direction.

Recurrent network
At least one feedback loop, hence a hysteresis effect.
[Figure: recurrent network]
[Figure: NARX recurrent network, with inputs x_{1,t}, x_{2,t}, x_{3,t}, output ŷ_{1,t} and the delayed outputs ŷ_{1,t−1}, ŷ_{1,t−2}, ŷ_{1,t−3} fed back as inputs]

Multi-Layer Perceptron - Feed-forward network

Scheme of a Multi-Layer Perceptron
[Figure: example of feed-forward network, a 2-layer perceptron]
Formalism: a layer is a computational element, a representation is a data element.
This MLP has an input layer and an output layer (2 layers), and an input, a hidden and an output representation (3 representations).

Forward path: estimation of ŷ
If we look at layer (l), let I_i^(l) be its input number i and O_j^(l) its output number j:
S_j^(l) = Σ_i W_ji^(l) I_i^(l)
O_j^(l) = f^(l)(S_j^(l)) = I_j^(l+1)
The forward path starts with I^(0) = x and finishes with O^(last) = ŷ.

How to learn the parameters? Gradient back-propagation
Assume that ∂L/∂O_j^(l) is known. Then
∂L/∂w_ji^(l) = (∂L/∂O_j^(l)) (∂O_j^(l)/∂S_j^(l)) (∂S_j^(l)/∂w_ji^(l)) = (∂L/∂O_j^(l)) f'^(l)(S_j^(l)) I_i^(l)
Now we compute ∂L/∂I_i^(l):
∂L/∂I_i^(l) = Σ_j (∂L/∂O_j^(l)) (∂O_j^(l)/∂I_i^(l)) = Σ_j (∂L/∂O_j^(l)) (∂O_j^(l)/∂S_j^(l)) (∂S_j^(l)/∂I_i^(l)) = Σ_j (∂L/∂O_j^(l)) f'^(l)(S_j^(l)) w_ji^(l)
Start of the recurrence:
∂L/∂O_j^(last) = ∂L/∂ŷ_j
Backward recurrence:
∂L/∂w_ji^(l) = (∂L/∂O_j^(l)) f'^(l)(S_j^(l)) I_i^(l)
∂L/∂I_i^(l) = Σ_j (∂L/∂O_j^(l)) f'^(l)(S_j^(l)) w_ji^(l)
∂L/∂O_j^(l−1) = ∂L/∂I_j^(l)
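A minimal numpy sketch of the forward path and the backward recurrence above, for a 2-layer MLP with a tanh hidden representation, a linear output and the squared loss; the sizes and learning rate are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
m, h, o = 4, 8, 2                                  # input, hidden and output sizes (assumption)
W1, b1 = rng.normal(scale=0.1, size=(h, m)), np.zeros(h)
W2, b2 = rng.normal(scale=0.1, size=(o, h)), np.zeros(o)

def forward_backward(x, y, eta=0.01):
    # Forward path: I^(0) = x, O^(last) = y_hat
    O1 = np.tanh(W1 @ x + b1)                      # hidden representation
    y_hat = W2 @ O1 + b2                           # linear output layer
    # Start of the recurrence: dL/dO^(last) = dL/dy_hat for L = ||y_hat - y||^2
    dO2 = 2 * (y_hat - y)
    # Output layer (f' = 1 for a linear activation)
    dW2, db2 = np.outer(dO2, O1), dO2
    dO1 = W2.T @ dO2                               # dL/dI of the output layer = dL/dO of the hidden layer
    # Hidden layer (f' = 1 - tanh^2)
    dS1 = dO1 * (1 - O1 ** 2)
    dW1, db1 = np.outer(dS1, x), dS1
    # Gradient step on every parameter (in place)
    for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= eta * g
    return np.sum((y_hat - y) ** 2)

x, y = rng.normal(size=m), rng.normal(size=o)
print([round(forward_backward(x, y), 4) for _ in range(5)])   # loss decreases over the steps

Each layer only needs the gradient with respect to its own outputs, the local derivative f' and its stored inputs, which is exactly the backward recurrence above.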
Deep Neural Networks

Deep architecture
[Figure: deep feed-forward network with several hidden representations]
Why?
- Some problems need an exponential number of neurons in the hidden representation when the network stays shallow.
- Build / extract features inside the network, in order not to rely on handmade extraction (human prior).

The vanishing gradient problem
[Figure: hyperbolic tangent]
∂L/∂I_i^(l) = Σ_j (∂L/∂O_j^(l)) f'^(l)(S_j^(l)) w_ji^(l)
When neurons at higher layers are saturated, the gradient decreases toward zero.
Solutions: better topology, better initialization of the weights, regularization!

Convolutional network
A unit of representation (l) is connected to a sub-slice of o units of representation (l − 1). All the weights between units are tied, leading to only o weights. Warning: the biases are not tied.
If representation (l − 1) is in R^m and representation (l) is in R^n, the number of parameters drops from (m + 1) · n for a fully connected layer to (o + 1) · n with local connectivity alone, and to o + n once the weights are tied (o shared weights plus one untied bias per unit).
[Figure: 1D convolutional network with shared weights w1, w2, w3]

Convolutional network: 2D example
[Figure: 2D convolutional network, from LeCun 2010]
LeCun, Y. (1989). Generalization and network design strategies. Connections in Perspective. North-Holland, Amsterdam, 143-155.
LeCun, Y., Kavukcuoglu, K., & Farabet, C. (2010). Convolutional networks and applications in vision. In Proceedings of the 2010 IEEE International Symposium on Circuits and Systems (ISCAS) (pp. 253-256). IEEE.

Better initialization through unsupervised learning
The learning is split into two steps:
- Pre-training: an unsupervised pre-training of the input layers with auto-encoders. Intuition: learn the manifold where the input data resides. It can take advantage of an unlabelled dataset.
- Finetuning: a finetuning of the whole network with supervised back-propagation.
Hinton, G. E., Osindero, S. and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, pp. 1527-1554.
Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, Vol. 313, no. 5786, pp. 504-507.

Diabolo network, autoencoders
Autoencoders are neural networks where the input and output representations have the same number of units. The learned target is the input itself.
[Figure: Diabolo network]
When there are 2 layers, the input layer is called the encoder and the output layer the decoder.
Tied weights: Wdec = Wenc^T (convergence? PCA?).
An autoencoder is undercomplete when size(h) < size(x) and overcomplete when size(x) < size(h).
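A minimal numpy sketch of one gradient step on a tied-weight, undercomplete autoencoder (sigmoid encoder, linear decoder, squared reconstruction loss); the sizes and learning rate are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
m, k = 20, 5                                    # input size, code size: undercomplete
W = rng.normal(scale=0.1, size=(k, m))          # tied weights: encoder W, decoder W.T
b_enc, b_dec = np.zeros(k), np.zeros(m)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def ae_step(x, eta=0.1):
    """One step on L = ||x - x_hat||^2 with h = sigmoid(W x + b_enc), x_hat = W.T h + b_dec."""
    global W, b_enc, b_dec
    h = sigmoid(W @ x + b_enc)                  # encoder
    x_hat = W.T @ h + b_dec                     # decoder (tied weights)
    err = 2 * (x_hat - x)                       # dL/dx_hat
    dh = (W @ err) * h * (1 - h)                # back-propagate through the decoder and the sigmoid
    dW = np.outer(dh, x) + np.outer(h, err)     # W appears in encoder and decoder: two terms
    W -= eta * dW
    b_enc -= eta * dh
    b_dec -= eta * err
    return np.sum((x_hat - x) ** 2)

x = rng.normal(size=m)
print([round(ae_step(x), 3) for _ in range(5)])  # reconstruction error decreases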
Building from auto-encoders
Autoencoders are stacked and learned layer by layer: a first autoencoder is trained on the input x, a second one on the resulting hidden representation, and so on, before the supervised output layer is added. When a layer is pre-trained, its weights are fixed until the finetuning.
[Figure: the four stages of stacking, from the first autoencoder on x to the final supervised network x → ŷ]

Simplified stacked AE algorithm
Input: X, a training feature set of size Nbexamples × Nbfeatures
Input: Y, a corresponding training label set of size Nbexamples × Nblabels
Input: Ninput, the number of input layers to be pre-trained
Input: Noutput, the number of output layers to be pre-trained
Input: N, the number of layers in the network, Ninput + Noutput < N
Output: [w1, w2, ..., wN], the parameters of all the layers
Randomly initialize [w1, w2, ..., wN]
Input pre-training:
R ← X
for i ← 1..Ninput do
  {Train an AE on R and keep its encoding parameters}
  [wi, wdummy] ← MLPTRAIN([wi, wi^T], R, R)
  Drop wdummy
  R ← MLPFORWARD([wi], R)
end for
Final supervised learning:
[w1, w2, ..., wN] ← MLPTRAIN([w1, w2, ..., wN], X, Y)
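A runnable numpy sketch of the greedy input pre-training loop above; the layer sizes, learning rate and data are illustrative assumptions, and MLPTRAIN / MLPFORWARD are replaced by a small tied-weight autoencoder trainer and a sigmoid forward pass.

import numpy as np

rng = np.random.default_rng(0)
sig = lambda s: 1.0 / (1.0 + np.exp(-s))

def train_tied_ae(R, k, n_iter=200, eta=0.1):
    """Train a tied-weight AE on representation R (n x d); return its encoding parameters."""
    n, d = R.shape
    W, be, bd = rng.normal(scale=0.1, size=(k, d)), np.zeros(k), np.zeros(d)
    for _ in range(n_iter):
        H = sig(R @ W.T + be)                    # encode
        Rhat = H @ W + bd                        # decode with tied weights
        err = 2 * (Rhat - R) / n
        dH = (err @ W.T) * H * (1 - H)
        W -= eta * (dH.T @ R + H.T @ err)        # encoder and decoder contributions
        be -= eta * dH.sum(0)
        bd -= eta * err.sum(0)
    return W, be

# Greedy input pre-training, as in the listing above
X = rng.normal(size=(200, 20))                   # placeholder training features
sizes = [10, 5]                                  # hidden representation sizes (assumption)
weights, R = [], X
for k in sizes:
    W, be = train_tied_ae(R, k)                  # train an AE on R, keep its encoding parameters
    weights.append((W, be))
    R = sig(R @ W.T + be)                        # forward R through the new layer
# 'weights' now initializes the input layers; the final step of the listing would
# finetune the whole network on (X, Y) with supervised back-propagation.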
Improve optimization by adding noise (1/3): denoising (undercomplete) auto-encoders
The auto-encoder is learned from x̃, a disturbed version of x; the target is still x.
[Figure: denoising autoencoder, the input x is disturbed into x̃ before encoding while the reconstruction target remains x]

Improve optimization by adding noise (2/3): prevent co-adaptation in (overcomplete) autoencoders
During training, randomly disconnect hidden units.
[Figure: MNIST results, Hinton 2012]

Improve optimization by adding noise (3/3): dropout
During training, at each iteration, randomly disconnect weights with probability p. At test time, multiply each weight by the fraction of iterations during which it was actually kept (close to, but not exactly, 1 − p).
[Figure: results on the Reuters dataset]
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580.

Tikhonov regularization scheme
Noise and early stopping are connected to regularization, so why not use the Tikhonov regularization scheme directly?
J = Σ_i L(y_i, f(x_i; w)) + λ Ω(w)
Notation:
2-layer MLP: ŷ = f_MLP(x; w_in, w_out) = f_out(b_out + w_out · f_in(b_in + w_in · x))
AE: x̂ = f_AE(x; w_enc, w_dec) = f_dec(b_dec + w_dec · f_enc(b_enc + w_enc · x))
Tied weights: w_in ↔ w_enc, w_dec ↔ w_enc^T
Bishop, C. M. (1995). Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1), 108-116.
Collobert, R. and Bengio, S. (2004). Links between perceptrons, MLPs and SVMs. In ICML 2004.

Regularization on the weights
J = Σ_i L(y_i, f_MLP(x_i; w)) + λ Ω(w_out)
It is enough to regularize the output-layer weights.
- L2 (Gaussian prior): Ω(w_out) = Σ_d w_d^2
- L1 (Laplace prior): Ω(w_out) = Σ_d |w_d|
- t-Student: Ω(w_out) = Σ_d log(1 + w_d^2)
With infinitely many hidden units, L1 leads to boosting and L2 to the SVM.
Bengio, Y., Le Roux, N., Vincent, P., Delalleau, O., & Marcotte, P. (2005). Convex neural networks. In Advances in Neural Information Processing Systems (pp. 123-130).

Contractive autoencoder (1/2)
[Figure: input manifold]
The AE must be sensitive to the direction along the manifold to reconstruct well; it can be insensitive to the direction orthogonal to it.

Contractive autoencoder (2/2)
The autoencoder should:
- reconstruct correctly the points x that lie on the input manifold: Σ_i L(x_i, f_AE(x_i; w_enc)),
- be insensitive to small changes of x outside the manifold (i.e. project onto the manifold), hence penalize the Jacobian of the encoder:
||J_fenc(x; w_enc)||^2_F = Σ_ij (∂f_j(x; w_enc) / ∂x_i)^2
Objective function:
J = Σ_i L(x_i, f_AE(x_i; w_enc)) + λ ||J_fenc(x_i; w_enc)||^2_F
Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011). Contractive auto-encoders: explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 833-840.

Regularization brought by multi-task learning / embedding
Combine multiple tasks in the same optimization problem; the tasks share parameters, and supervised and unsupervised data are mixed:
J = λ_L Σ_{i ∈ L} L(y_i, f_MLP(x_i; w_out, w_in)) + λ_U Σ_{i ∈ L∪U} L(x_i, f_AE(x_i; w_in)) + λ_Ω Ω(w_out)
and possibly further terms, e.g. + λ_J ||J_fin(x; w_in)||^2_F + ...
Weston, J., Ratle, F., and Collobert, R. Deep learning via semi-supervised embedding. ICML, pages 1168-1175, 2008.
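A minimal numpy sketch (an illustration, not the authors' code) of the combined objective above, for a 2-layer MLP whose input weights w_in are shared with a tied-weight autoencoder; the λ values, sizes and data are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
sig = lambda s: 1.0 / (1.0 + np.exp(-s))

m, k, o = 10, 6, 1
w_in, b_in = rng.normal(scale=0.1, size=(k, m)), np.zeros(k)     # shared by f_MLP and f_AE
w_out, b_out = rng.normal(scale=0.1, size=(o, k)), np.zeros(o)
b_dec = np.zeros(m)

def f_mlp(x):                          # supervised branch
    return w_out @ sig(w_in @ x + b_in) + b_out

def f_ae(x):                           # unsupervised branch, decoder tied to w_in
    return w_in.T @ sig(w_in @ x + b_in) + b_dec

def multitask_objective(XL, YL, XU, lam_L=1.0, lam_U=0.1, lam_O=0.01):
    sup = sum(np.sum((f_mlp(x) - y) ** 2) for x, y in zip(XL, YL))        # labelled term (L)
    unsup = sum(np.sum((f_ae(x) - x) ** 2) for x in np.vstack([XL, XU]))  # labelled + unlabelled (L u U)
    reg = np.sum(w_out ** 2)                                              # Omega(w_out), L2
    return lam_L * sup + lam_U * unsup + lam_O * reg

XL, YL = rng.normal(size=(20, m)), rng.normal(size=(20, o))   # small labelled set L
XU = rng.normal(size=(200, m))                                # larger unlabelled set U
print(multitask_objective(XL, YL, XU))    # the scalar J that gradient descent would minimize

Because w_in appears in both terms, the reconstructions on unlabelled data regularize the supervised task, which is the embedding effect described by Weston et al.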
Extension to structured output

Structured output
Ad-hoc definition: "data that consists of several parts, and not only the parts themselves contain information, but also the way in which the parts belong together" (Christoph Lampert).
Examples: automatic transcription, automatic translation, point matching, image labeling (semantic image segmentation), landmark detection.
Input/Output Deep Architecture (IODA): learn output dependencies the same way a DNN learns input dependencies.
B. Labbé, R. Hérault & C. Chatelain. Learning Deep Neural Networks for High Dimensional Output Problems. In IEEE International Conference on Machine Learning and Applications, 2009 (pp. 63-68).
J. Lerouge, R. Hérault, C. Chatelain, F. Jardin, R. Modzelewski. IODA: An input/output deep architecture for image labeling. Pattern Recognition, available online 27 March 2015, ISSN 0031-3203.

The image labeling problem
[Figure: input and target images for the two datasets, Toy and Sarcopenia]

Input/Output Deep Architecture (IODA) for image labeling
Figure: the IODA architecture. It directly links the pixel matrix to the label matrix. The input layers (left, light) are pre-trained to provide a high-level representation of the image pixels, while the output layers (right, dark) are pre-trained to learn the a priori knowledge of the problem.

Simplified IODA algorithm (1/2)
Input: X, a training feature set of size Nbexamples × Nbfeatures
Input: Y, a corresponding training label set of size Nbexamples × Nblabels
Input: Ninput, the number of input layers to be pre-trained
Input: Noutput, the number of output layers to be pre-trained
Input: N, the number of layers in the IODA, Ninput + Noutput < N
Output: [w1, w2, ..., wN], the parameters of all the layers
Randomly initialize [w1, w2, ..., wN]
Input pre-training:
R ← X
for i ← 1..Ninput do
  {Train an AE on R and keep its encoding parameters}
  [wi, wdummy] ← MLPTRAIN([wi, wi^T], R, R)
  Drop wdummy
  R ← MLPFORWARD([wi], R)
end for

Simplified IODA algorithm (2/2)
Output pre-training:
R ← Y
for i ← N..N − Noutput + 1 step −1 do
  {Train an AE on R and keep its decoding parameters}
  [u, wi] ← MLPTRAIN([wi^T, wi], R, R)
  R ← MLPFORWARD([u], R)
  Drop u
end for
Final supervised learning:
[w1, w2, ..., wN] ← MLPTRAIN([w1, w2, ..., wN], X, Y)
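A runnable numpy sketch of the output pre-training loop above, with a small tied-weight autoencoder (the same kind as in the stacked-autoencoder sketch) standing in for MLPTRAIN; the toy label matrix, layer sizes and learning rate are illustrative assumptions. The key difference with input pre-training is that the autoencoder is trained on the targets Y and only its decoding parameters are kept.

import numpy as np

rng = np.random.default_rng(0)
sig = lambda s: 1.0 / (1.0 + np.exp(-s))

def train_tied_ae(R, k, n_iter=200, eta=0.1):
    """Tied-weight AE on R (n x d); returns (encoder weights W, b_enc, b_dec)."""
    n, d = R.shape
    W, be, bd = rng.normal(scale=0.1, size=(k, d)), np.zeros(k), np.zeros(d)
    for _ in range(n_iter):
        H = sig(R @ W.T + be)
        Rhat = H @ W + bd
        err = 2 * (Rhat - R) / n
        dH = (err @ W.T) * H * (1 - H)
        W -= eta * (dH.T @ R + H.T @ err)
        be -= eta * dH.sum(0)
        bd -= eta * err.sum(0)
    return W, be, bd

# Output pre-training: walk from the labels Y toward the middle of the network,
# keeping the decoding parameters for the output layers.
Y = (rng.normal(size=(200, 30)) > 0).astype(float)   # toy label matrix (assumption)
sizes = [12, 6]                                      # output-side representation sizes (assumption)
output_layers, R = [], Y
for k in sizes:
    W, be, bd = train_tied_ae(R, k)
    output_layers.insert(0, (W, bd))                 # the decoder (code @ W + bd) becomes an output layer
    R = sig(R @ W.T + be)                            # the code becomes the next target representation
# output_layers now initializes the last layers of the network (the layer producing Y last);
# the whole network is then finetuned with supervised back-propagation on (X, Y).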
Qualitative results (1/3)
Figure: evolution of the output image of the architecture according to the number of batch gradient descent iterations (10, 100, 200, 300) for the three learning strategies (NDA, IDA, IODA), using validation example #10.

Qualitative results (2/3)
Figure: non-sarcopenic patient, (a) CT image, (b) ground truth, (c) Chung, (d) IODA.

Qualitative results (3/3)
Figure: sarcopenic patient, (a) CT image, (b) ground truth, (c) Chung, (d) IODA.

Quantitative results
Table: Toy dataset, 3-layer MLP with architecture X - r1 - r2 - Ŷ (X and Ŷ are 128² images); in the original slide, colours additionally mark which representations receive input pre-training, output pre-training, or no pre-training.
X     r1    r2    Ŷ     Train error  Test error
128²  2048  2048  128²  2.64e-02     3.48e-02
128²  1024  1024  128²  3.11e-02     3.91e-02
128²  2048  2048  128²  3.86e-02     4.59e-02
128²  1024  1024  128²  4.44e-02     5.13e-02
128²  2048  2048  128²  5.20e-02     5.75e-02
128²  1024  1024  128²  6.29e-02     6.77e-02
128²  2048  2048  128²  6.30e-02     6.79e-02
128²  1024  1024  128²  7.09e-02     7.55e-02
128²  2048  2048  128²  9.03e-02     9.40e-02
128²  1024  1024  128²  1.03e-01     1.06e-01

Table: Sarcopenia.
Method  Diff. (%)  Jaccard (%)
Chung   -10.6      60.3
NDA     0.12       85.88
IDA     0.15       85.91
IODA    3.37       88.47

Why not use multi-tasking + Tikhonov schemes?
Notation:
3-layer MLP: ŷ = f_MLP(x; w_in, w_link, w_out) = f_out(b_out + w_out · f_link(b_link + w_link · f_in(b_in + w_in · x)))
Input AE: x̂ = f_AEi(x; w_in) = f_dec(b_dec + w_in^T · f_enc(b_enc + w_in · x))
Output AE: ŷ = f_AEo(y; w_out) = f_dec(b'_dec + w_out · f_enc(b'_enc + w_out^T · y))
Objective function:
J = λ_L Σ_{i ∈ L} L(y_i, f_MLP(x_i; w_in, w_link, w_out)) + λ_U Σ_{i ∈ L∪U} L(x_i, f_AEi(x_i; w_in)) + λ_L' Σ_{i ∈ L} L(y_i, f_AEo(y_i; w_out)) + λ_Ω Ω(w_link)
Submitted to ECML: Input/Output Deep Architecture for Structured Output Problems, Soufiane Belharbi, Clément Chatelain, Romain Hérault and Sébastien Adam, arXiv:1504.07550.

Facial landmark detection problem
Competition i-bug: http://ibug.doc.ic.ac.uk/resources/300-W_IMAVIS
Images from: Zhanpeng Zhang, Ping Luo, Chen Change Loy, Xiaoou Tang. Learning and Transferring Multi-task Deep Representation for Face Alignment. Technical report, arXiv:1408.3967, 2014.

Facial landmark detection, some results
Figure: early results on facial landmark detection.

Questions?

References
Y. Bengio, A. Courville, P. Vincent. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798-1828, Aug. 2013 (arXiv:1206.5538).
Bengio, Y. (2012). Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade (pp. 437-478). Springer Berlin Heidelberg.
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580.
LeCun, Y., Kavukcuoglu, K., & Farabet, C. (2010). Convolutional networks and applications in vision. In Proceedings of the 2010 IEEE International Symposium on Circuits and Systems (ISCAS) (pp. 253-256). IEEE.
J. Lerouge, R. Hérault, C. Chatelain, F. Jardin, R. Modzelewski. IODA: An input/output deep architecture for image labeling. Pattern Recognition, available online 27 March 2015, ISSN 0031-3203, http://dx.doi.org/10.1016/j.patcog.2015.03.017.
Hugo Larochelle's lectures: http://info.usherbrooke.ca/hlarochelle/cours/ift725_A2013/contenu.html