Deep Learning: A Quick Overview

Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 790-784, Korea
[email protected]
http://mlg.postech.ac.kr/~seungjin

April 24, 2015

Flat vs Deep

Figure: Flat model (e.g., linear models and RBM)
Figure: Deep model (e.g., deep belief network and deep Boltzmann machine)

Representation learning: embedding, features, clusters, topics.

Deep Learning Everywhere: Acoustic Modeling in Speech Recognition

G. Hinton et al. (2012), "Deep Neural Networks for Acoustic Modeling in Speech Recognition"

Deep Learning Everywhere: ImageNet Classification

Deep CNN: classify 1.2 million images into 1000 classes.

A. Krizhevsky, I. Sutskever, G. Hinton (2012), "ImageNet Classification with Deep Convolutional Neural Networks"

- SIFT BoW: 256K-dimensional features, MAP 35%
- Deep learning: 1K-dimensional features, MAP 38%

Deep Learning Everywhere: Approximate Similarity Search

Y. Kang, S. Kim, S. Choi (2012), "Deep learning to hash with multiple representations"

Figure: Query and retrieved images

Success and Issues

- Why successful?
  - Pre-training: restricted Boltzmann machine (RBM), auto-encoder, nonnegative matrix factorization (NMF)
  - Training: dropout
  - Rectified units: no vanishing gradient, sparse activation (see the sketch after this list)
- Issues
  - Distributions: exponential family harmonium
  - Multi-modal extensions: dual-wing harmonium, restricted deep belief nets, multi-modal stacked auto-encoders, and multi-modal deep Boltzmann machines
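To make the dropout and rectified-unit points concrete, here is a minimal NumPy sketch. It is not taken from the slides: the layer shape, the scale of the pre-activations, and the 0.5 dropout rate are illustrative assumptions. It contrasts the saturating sigmoid gradient with the rectified unit's gradient, shows the sparsity of rectified activations, and applies inverted dropout to a hidden layer.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(scale=3.0, size=(4, 8))   # pre-activations of a hidden layer (illustrative)

# Sigmoid saturates: its derivative s * (1 - s) is at most 0.25 and vanishes for large |z|.
s = 1.0 / (1.0 + np.exp(-z))
sigmoid_grad = s * (1.0 - s)

# Rectified unit: gradient is exactly 1 wherever the unit is active (no vanishing gradient),
# and the activation itself is sparse (many exact zeros).
relu = np.maximum(z, 0.0)
relu_grad = (z > 0).astype(z.dtype)
sparsity = np.mean(relu == 0.0)

# Inverted dropout at training time: randomly zero units and rescale the survivors,
# so no rescaling is needed at test time (the 0.5 keep probability is an assumption).
keep_prob = 0.5
mask = rng.random(relu.shape) < keep_prob
dropped = relu * mask / keep_prob

print(f"max sigmoid grad: {sigmoid_grad.max():.3f}, max ReLU grad: {relu_grad.max():.1f}, "
      f"ReLU sparsity: {sparsity:.2f}, active after dropout: {np.mean(dropped > 0):.2f}")
```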
Deep Learning

- Single view
  - Multiplicative up-prop (Ahn, Oh, Choi, 2004)
  - Deep belief networks (Hinton et al., 2006)
  - Deep auto-encoders (Hinton and Salakhutdinov, 2006)
  - Stacked denoising auto-encoders (Vincent et al., 2010)
  - Deep Boltzmann machines (Salakhutdinov and Hinton, 2009)
  - Deep convolutional neural networks (Krizhevsky et al., 2012)
- Multiple views
  - Restricted deep belief networks (Kang and Choi, 2011)
  - Multi-modal stacked auto-encoders (Ngiam et al., 2011)
  - Multi-modal deep Boltzmann machines (Srivastava and Salakhutdinov, 2014)

Nonnegative Matrix Factorization (Seung and Lee, 1999)

Parts-based representation.

Least Squares NMF

Given a nonnegative data matrix \(V\) (\(V \geq 0\)), find nonnegative factors such that \(V \approx W H\) by solving

\[
\arg\min_{W, H} \|V - W H\|^2 \quad \text{subject to } W \geq 0,\ H \geq 0.
\]

Multiplicative updates (see the NumPy sketch below), where the products and ratios are element-wise, \(R = V\), and \(N = W H\):

\[
W \leftarrow W \odot \frac{V H^\top}{W H H^\top} = W \odot \frac{R H^\top}{N H^\top},
\qquad
H \leftarrow H \odot \frac{W^\top V}{W^\top W H} = H \odot \frac{W^\top R}{W^\top N}.
\]

Multiplicative Up-Prop (Ahn, Oh, and Choi, ICML-2004)

\[
W^{(l)}_{ia} \leftarrow W^{(l)}_{ia}
\left( \frac{\big[R^{(l)} H^{(l)\top}\big]_{ia}}{\big[N^{(l)} H^{(l)\top}\big]_{ia}} \right)^{\!\eta},
\qquad
H^{(L)}_{a\mu} \leftarrow H^{(L)}_{a\mu}
\left( \frac{\big[W^{(L)\top} R^{(L)}\big]_{a\mu}}{\big[W^{(L)\top} N^{(L)}\big]_{a\mu}} \right)^{\!\eta},
\]

where \(0 < \eta \leq 1\) is the learning rate. \(R^{(l+1)}\) and \(N^{(l+1)}\) are up-propagated:

\[
R^{(l+1)} = W^{(l)\top}\big[ R^{(l)} \odot g'\big(W^{(l)} H^{(l)}\big) \big],
\qquad
N^{(l+1)} = W^{(l)\top}\big[ N^{(l)} \odot g'\big(W^{(l)} H^{(l)}\big) \big],
\]

where \(l = 1, 2, \ldots, L-1\) and

\[
R^{(1)}_{i\mu} = V_{i\mu}\, g'\big((W^{(1)} H^{(1)})_{i\mu}\big),
\qquad
N^{(1)}_{i\mu} = (W^{(1)} H^{(1)})_{i\mu}\, g'\big((W^{(1)} H^{(1)})_{i\mu}\big).
\]

Hierarchical Representation: From Bottom to Top
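As a concrete companion to the least-squares NMF multiplicative updates above, here is a minimal NumPy sketch. It is not code from the slides or the cited papers: the matrix sizes, rank, iteration count, and the small epsilon added to avoid division by zero are illustrative assumptions.

```python
import numpy as np

def nmf_multiplicative(V, rank, n_iters=200, eps=1e-9, seed=0):
    """Least-squares NMF, V (m x n, nonnegative) ~ W H, via multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, rank))
    H = rng.random((rank, n))
    for _ in range(n_iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # W <- W * (V H^T) / (W H H^T), element-wise; here R = V and N = W H
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Usage on a random nonnegative matrix (illustrative data, not the slides' examples).
V = np.abs(np.random.default_rng(1).normal(size=(50, 40)))
W, H = nmf_multiplicative(V, rank=5)
print("reconstruction error:", np.linalg.norm(V - W @ H))
```

The same numerator/denominator split, with R in the numerator and N in the denominator, is what the multiplicative up-prop updates propagate through the deeper layers.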
Harmonium (or Restricted Boltzmann Machine)

- Harmonium (Smolensky, 1986) or RBM (Hinton and Sejnowski, 1986) is an undirected model which allows only inter-layer connections (a complete bipartite graph).
- An energy-based probabilistic model defines a probability distribution through an energy function, associating an energy with each configuration of the variables of interest:
\[
p(x) = \sum_{h} p(x, h) = \frac{1}{Z} \sum_{h} e^{-E(x, h)},
\]
where \(E(x, h) = -b^\top x - c^\top h - h^\top W x\) and \(Z = \sum_{x} \sum_{h} e^{-E(x, h)}\).
- Learning corresponds to modifying the energy function so that its shape has desirable properties (maximum likelihood estimation).

Contrastive Divergence Learning

The average log-likelihood gradient is approximated using k Gibbs steps:
\[
\left\langle \frac{\partial \log p(x)}{\partial \theta} \right\rangle_{\tilde{p}(x)}
\approx
- \left\langle \frac{\partial E(x, h)}{\partial \theta} \right\rangle_{p(h \mid x)\, \tilde{p}(x)}
+ \left\langle \frac{\partial E(x, h)}{\partial \theta} \right\rangle_{p^{(k)}(x, h)},
\]
where \(\tilde{p}(x)\) is the empirical data distribution and \(p^{(k)}(x, h)\) is the joint distribution determined by k Gibbs steps.

Exponential Family Harmonium (Welling, Rosen-Zvi, Hinton, 2004)

Choose \(N_x\) independent distributions \(p_i(x_i)\) for the observed variables and \(N_h\) independent distributions \(p_j(h_j)\) for the hidden variables from the exponential family:
\[
p(x) = \prod_{i=1}^{N_x} \exp\Big\{ \sum_{a} \xi_{ia} f_{ia}(x_i) - A_i(\{\xi_{ia}\}) \Big\},
\qquad
p(h) = \prod_{j=1}^{N_h} \exp\Big\{ \sum_{b} \lambda_{jb} g_{jb}(h_j) - B_j(\{\lambda_{jb}\}) \Big\},
\]
where \(\{f_{ia}(x_i), g_{jb}(h_j)\}\) are sufficient statistics, \(\{\xi_{ia}, \lambda_{jb}\}\) are canonical parameters, and \(\{A_i, B_j\}\) are log-partition functions. Couple the random variables in the log-domain through a quadratic interaction term, leading to the harmonium random field:
\[
p(x, h) \propto \exp\Big\{ \sum_{i,a} \xi_{ia} f_{ia}(x_i) + \sum_{j,b} \lambda_{jb} g_{jb}(h_j) + \sum_{i,a,j,b} W_{ia}^{jb} f_{ia}(x_i)\, g_{jb}(h_j) \Big\}.
\]
Applied to information retrieval (fast inference compared to directed models).

Multi-Wing Harmonium

- Xing et al., 2005.
- Takes inputs from multiple sources.
- Assumes that multi-view inputs reflect the same central theme.
- Consists of multiple harmoniums joined by a shared array of hidden nodes.

Multi-View Harmonium

- Kang and Choi, 2011.
- Takes inputs from multiple sources.
- View-specific hidden nodes and shared hidden nodes.
- Allows for independent as well as dependent views.

Restricted Deep Belief Networks

- Kang and Choi, 2011.
- Takes inputs from multiple sources.
- View-specific hidden nodes and shared hidden nodes.
- Allows for independent as well as dependent views.
- Multi-layer extension of multi-view harmoniums.

RDBN for Multi-View Learning

Figure: Image annotation with an RDBN; given an image x (tags such as "snow" and "lake"), the predicted annotation is \(y^* = \arg\max_{y} p(y \mid x)\).

Image Annotation: Precision-Recall

Figure: Precision-recall curves comparing 2-layer MWH, 2-layer RDBN, 3-layer RDBN, 4-layer RDBN, random, SGPLVM, and FOLS-GPLVM.

Denoising Autoencoder

Figure: Autoencoder
Figure: Denoising autoencoder (corrupted inputs, marked X, are encoded and the clean input is reconstructed)

Stacked Denoising Autoencoder for Face Pose Normalization (Kang and Choi, 2013)

Figure: Pre-training (frontal and non-frontal faces)
Figure: Training another DAE

Figure: 10 examples of corrupted face images from the Georgia Tech face database (left), their pose-normalized versions obtained by the proposed method (middle), and ground-truth frontal face images (right).
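To make the denoising autoencoder training loop concrete, here is a minimal NumPy sketch of a single-layer DAE. It is not the slides' model: the layer sizes, masking-noise level, learning rate, sigmoid units, and squared-error loss are illustrative assumptions. Each step corrupts the input, encodes and decodes it, and updates the weights by gradient descent on the reconstruction error measured against the clean input.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative sizes: 64-dimensional inputs, 16 hidden units, batch of 32.
d, k, batch = 64, 16, 32
W1, b1 = 0.1 * rng.normal(size=(d, k)), np.zeros(k)   # encoder
W2, b2 = 0.1 * rng.normal(size=(k, d)), np.zeros(d)   # decoder
lr, corruption = 0.1, 0.3

x = rng.random((batch, d))                  # clean inputs (stand-in for image patches)

for step in range(100):
    # Corrupt the input with masking noise, but reconstruct the CLEAN x.
    x_tilde = x * (rng.random(x.shape) > corruption)
    h = sigmoid(x_tilde @ W1 + b1)          # encode the corrupted input
    x_hat = sigmoid(h @ W2 + b2)            # decode / reconstruct
    # Squared-error loss; backpropagate through the two sigmoid layers.
    delta_out = (x_hat - x) * x_hat * (1 - x_hat)
    delta_hid = (delta_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ delta_out / batch;       b2 -= lr * delta_out.mean(axis=0)
    W1 -= lr * x_tilde.T @ delta_hid / batch; b1 -= lr * delta_hid.mean(axis=0)

print("final reconstruction error:", np.mean((x_hat - x) ** 2))
```

Stacking such layers, each trained on the codes of the layer below, and then fine-tuning the whole network is the usual recipe behind stacked denoising autoencoders such as the face pose normalization model above.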
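Since deep belief networks and deep Boltzmann machines (compared below) are built from RBM layers, a minimal CD-1 sketch for a binary RBM may also be useful; it follows the energy \(E(x, h) = -b^\top x - c^\top h - h^\top W x\) and the contrastive divergence approximation given earlier. The layer sizes, learning rate, random binary data, and the choice k = 1 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Binary RBM: d visible units, k hidden units (sizes are illustrative).
d, k, lr = 20, 8, 0.05
W = 0.01 * rng.normal(size=(k, d))   # coupling term h^T W x
b = np.zeros(d)                      # visible bias
c = np.zeros(k)                      # hidden bias

X = (rng.random((500, d)) > 0.5).astype(float)   # stand-in binary data

for epoch in range(10):
    for x in X:
        # Positive phase: p(h | x) under the data.
        ph_data = sigmoid(c + W @ x)
        h = (rng.random(k) < ph_data).astype(float)
        # Negative phase: one Gibbs step (CD-1) gives a reconstruction.
        px_recon = sigmoid(b + W.T @ h)
        x_recon = (rng.random(d) < px_recon).astype(float)
        ph_recon = sigmoid(c + W @ x_recon)
        # Gradient estimate: data statistics minus k-step Gibbs statistics.
        W += lr * (np.outer(ph_data, x) - np.outer(ph_recon, x_recon))
        b += lr * (x - x_recon)
        c += lr * (ph_data - ph_recon)

recon = sigmoid(b + sigmoid(c + X @ W.T) @ W)
print("mean reconstruction error:", np.mean((X - recon) ** 2))
```

Greedy layer-wise stacking of RBMs trained this way, followed by fine-tuning, is the standard pre-training route to deep belief networks.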
Deep Belief Net vs Deep Boltzmann Machine

Figure: A deep belief network, whose top two layers form an undirected model \(p(h^2, h^3)\) while the lower layers are directed, \(p(h^1 \mid h^2)\) and \(p(v \mid h^1)\), compared with a deep Boltzmann machine, in which all layers are connected by undirected edges.

Question and Discussion