“Optimal Deep Learning” and the Information Bottleneck method
ICRI-CI retreat, Haifa, May 2015
Naftali Tishby and Noga Zaslavsky
School of Engineering and Computer Science & The Edmond & Lily Safra Center for Brain Sciences, Hebrew University, Jerusalem, Israel

Outline
• Deep Neural Networks and Deep Learning
  – What are Deep Neural Networks (DNNs)?
  – The incredible success of DNNs
  – Theoretical challenges
• The Information Bottleneck method
  – Finding (approximate) minimal sufficient statistics
  – DPI and centroid consistency
  – The IB complexity-accuracy tradeoff
  – The nature of the optimal solutions – IB bifurcations
• Bifurcation Theory of Deep Neural Networks
  – Statistical characterization of neural nets
  – Learning optimality and sample complexity bounds
  – The connection between NN layers and IB phase transitions
  – Design principles for optimal DNNs

Deep Learning: Neural Nets strike back

We desperately need a theory…
• Why do DNNs work so well?
• How can they be improved?
  – Optimality bounds
    • What is “an optimal DNN”?
    • Sample and computational complexity bounds
  – Design principles
    • What determines the number and width of the layers?
    • What determines the connectivity and inter-layer connections?
  – Interpretability
    • What do the layers/neurons capture/represent?
  – Better learning algorithms
    • Is stochastic gradient descent the best we can do?

Deep Neural Nets and Information Theory??
From causal to predictive systems…

The Information Bottleneck Method (Tishby, Pereira & Bialek, 1999)
(1) Approximate minimal sufficient statistics, with the Markov chain $Y \to X \to S(X) \to \hat{X}$:
$$\hat{X} = \arg\min_{\{S(X):\, I(S(X);Y)=I(X;Y)\}} I(S(X);X)$$
Relaxation, given $p(X,Y)$:
$$\hat{X} = \arg\min_{p(\hat{x}\mid x)} \big[\, I(\hat{X};X) - \beta\, I(\hat{X};Y) \,\big],\qquad \beta > 0$$
(Shamir, Sabato & Tishby, TCS 2010)
(2) A rate-distortion problem with KL-divergence distortion:
$$d_{IB}(x,\hat{x}) = D\big[\, p(y\mid x) \,\|\, p(y\mid\hat{x}) \,\big]$$
(Bachrach, Navot & Tishby, COLT 2006)
(3) The ONLY distributional quantization measure which satisfies both the DPI (f-divergences) and statistical consistency (Bregman divergences) (Harremoës & Tishby, ISIT 2008)

The Information Bottleneck Method (Tishby, Pereira & Bialek, 1999)
The IB optimality/stationarity equations:
$$\min_{p(\hat{x}\mid x):\, Y \to X \to \hat{X}} \; I(\hat{X};X) - \beta\, I(\hat{X};Y),\qquad \beta > 0$$
$$p(\hat{x}\mid x) = \frac{p(\hat{x})}{Z(x,\beta)} \exp\!\big(-\beta\, D[p(y\mid x)\,\|\,p(y\mid\hat{x})]\big),\qquad
Z(x,\beta) = \sum_{\hat{x}} p(\hat{x}) \exp\!\big(-\beta\, D[p(y\mid x)\,\|\,p(y\mid\hat{x})]\big)$$
$$p(\hat{x}) = \sum_x p(\hat{x}\mid x)\, p(x),\qquad
p(y\mid\hat{x}) = \sum_x p(y\mid x)\, p(x\mid\hat{x})$$
Solved by Blahut-Arimoto-like iterations, but with possibly sub-optimal solutions (!), similar to K-means distributional clustering with centroid updates.
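A minimal NumPy sketch of such Blahut-Arimoto-like iterations for a discrete joint distribution, alternating the three stationarity equations at fixed β; the function name and default parameters are illustrative assumptions, not code from the slides, and, as noted above, the fixed point it reaches may be sub-optimal:

```python
import numpy as np

def ib_iterations(p_xy, n_clusters, beta, n_iter=200, seed=0, eps=1e-12):
    """Blahut-Arimoto-like IB iterations for a discrete joint p(x, y).

    Alternates, at fixed beta:
      p(xhat|x) ∝ p(xhat) exp(-beta * KL[p(y|x) || p(y|xhat)])
      p(xhat)   = sum_x p(xhat|x) p(x)
      p(y|xhat) = sum_x p(y|x) p(x|xhat)
    Like K-means, it may converge to a sub-optimal fixed point.
    """
    rng = np.random.default_rng(seed)
    n_x, n_y = p_xy.shape
    p_x = p_xy.sum(axis=1)                              # p(x)
    p_y_given_x = p_xy / (p_x[:, None] + eps)           # p(y|x)

    # random soft initialization of the encoder p(xhat|x)
    q_t_given_x = rng.random((n_x, n_clusters))
    q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        q_t = q_t_given_x.T @ p_x                                   # p(xhat)
        # p(x|xhat) by Bayes, then the "centroid" update p(y|xhat)
        q_x_given_t = (q_t_given_x * p_x[:, None]).T / (q_t[:, None] + eps)
        q_y_given_t = q_x_given_t @ p_y_given_x                     # p(y|xhat)
        # KL[p(y|x) || p(y|xhat)] for every pair (x, xhat)
        kl = np.sum(p_y_given_x[:, None, :] *
                    (np.log(p_y_given_x[:, None, :] + eps) -
                     np.log(q_y_given_t[None, :, :] + eps)), axis=2)
        # encoder update in normalized exponential (Gibbs) form
        log_q = np.log(q_t + eps)[None, :] - beta * kl
        log_q -= log_q.max(axis=1, keepdims=True)
        q_t_given_x = np.exp(log_q)
        q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)
    return q_t_given_x, q_y_given_t
```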
[Figure: the IB information curve, I(X̂;Y) versus I(X;X̂), with intermediate representations T1, T2, T3 — the limit is always an RD-like concave envelope with sub-optimal bifurcations.]

[Figure: the IB curve in the information plane — critical points are 2nd-order phase transitions.]

The IB bifurcation (phase-transition) points
The IB bifurcation points can be found as follows. Starting from
$$\ln p(\hat{x}\mid x) = \ln\frac{p(\hat{x})}{Z(x,\beta)} - \beta\, D\big[p(y\mid x)\,\|\,p(y\mid\hat{x})\big],$$
differentiation gives
$$\frac{\partial \ln p(x\mid\hat{x})}{\partial \hat{x}} = \beta \sum_y p(y\mid x)\, \frac{\partial \ln p(y\mid\hat{x})}{\partial \hat{x}},\qquad
\frac{\partial \ln p(y\mid\hat{x})}{\partial \hat{x}} = \frac{1}{p(y\mid\hat{x})} \sum_x p(y\mid x)\, p(x\mid\hat{x})\, \frac{\partial \ln p(x\mid\hat{x})}{\partial \hat{x}}.$$
These equations can be combined into two (non-linear) eigenvalue problems:
$$\big[\, I - \beta\, C_X(\hat{x},\beta) \,\big]\, \frac{\partial \ln p(x\mid\hat{x})}{\partial \hat{x}} = 0,\qquad
\big[\, I - \beta\, C_Y(\hat{x},\beta) \,\big]\, \frac{\partial \ln p(y\mid\hat{x})}{\partial \hat{x}} = 0.$$
These eigenvalue problems have non-trivial solutions (eigenvectors) only at the critical bifurcation points (second-order phase transitions).

IB bifurcation diagram
The critical tradeoff values are set by the second eigenvalues of the two matrices:
$$\beta_c(\hat{x}) = \lambda_2^{-1}\!\big(C_X(\hat{x},\beta_c)\big) = \lambda_2^{-1}\!\big(C_Y(\hat{x},\beta_c)\big)$$
[Figure: the IB bifurcation diagram — cluster splits at the critical β values.]
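The slide locates bifurcations spectrally, through the second eigenvalues of C_X and C_Y. A cruder numerical alternative (purely illustrative, not the spectral method, and relying on the ib_iterations sketch above) is to sweep β and record where the number of distinct decoders p(y|x̂) jumps, i.e. where a cluster splits:

```python
import numpy as np

def effective_clusters(q_y_given_t, q_t, tol=1e-3, thresh=1e-6):
    """Count distinct decoders p(y|xhat) among clusters with non-negligible mass."""
    used = q_y_given_t[q_t > thresh]
    distinct = []
    for row in used:
        if not any(np.abs(row - d).max() < tol for d in distinct):
            distinct.append(row)
    return len(distinct)

def bifurcation_sweep(p_xy, n_clusters, betas):
    """Sweep beta and record where the number of distinct clusters jumps.

    A rough numerical proxy for the spectral condition beta_c = 1 / lambda_2:
    each jump marks an IB bifurcation (cluster split) along the curve.
    Uses ib_iterations() from the previous sketch.
    """
    counts, criticals = [], []
    for beta in betas:
        q_t_given_x, q_y_given_t = ib_iterations(p_xy, n_clusters, beta)
        q_t = q_t_given_x.T @ p_xy.sum(axis=1)
        counts.append(effective_clusters(q_y_given_t, q_t))
        if len(counts) > 1 and counts[-1] > counts[-2]:
            criticals.append(beta)
    return counts, criticals

# Example (synthetic joint distribution):
# p_xy = np.random.dirichlet(np.ones(4), size=8); p_xy /= p_xy.sum()
# counts, criticals = bifurcation_sweep(p_xy, 8, np.linspace(0.5, 20, 80))
```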
DNN’s and the Information Bottleneck
Linearly separable units (positive, stochastic):
$$\ln p(h_i \mid h_{i-1}) = h_{i-1}^T W_i\, h_i - b(h_i),\qquad
\frac{\partial \ln p(h_i \mid h_{i-1})}{\partial h_{i-1}} = W_i\, h_i$$
Near the optimal IB curve the inter-layer mapping is of exponential form:
$$\ln p(h_{i-1} \mid h_i) = h_{i-1}^T W_i\, h_i - a(h_{i-1}),$$
where $W_i$ is the i-th layer connection matrix, and
$$\frac{\partial \ln p(h_i \mid h_{i-1})}{\partial h_{i-1}} \;\leftrightarrow\; \frac{\partial \ln p(x \mid \hat{x})}{\partial \hat{x}},
\quad\text{with } h_i \to x \text{ and } h_{i-1} \to \hat{x}.$$

DNN’s and the Information Bottleneck
But on the optimal IB curve there is a non-trivial derivative only at the IB bifurcation points:
$$\frac{\partial \ln p(h_i \mid h_{i-1})}{\partial h_{i-1}} = W_i\, h_i \;\propto\; v_2(h_i, \beta_c).$$
This provides an equation for the optimal weights $W_i$, where $v_2(x \to h_i, \beta_c)$ is the second eigenvector of the bifurcation matrix:
$$\big[\, I - \beta_c\, C_X(\hat{x}, \beta_c) \,\big]\, v_2(x, \beta_c) = 0.$$

Optimal design principles
[Figure: hidden layers and the output layer.]

Sample complexity bounds
[Figure]

Real DNN’s on the IB plane
[Figure]

Summary
• An information theory of Deep Neural Networks
  – Based on the Information Bottleneck (IB) tradeoff
  – Uniquely and consistently quantifies the hidden layers
• The optimal hidden layers correspond to IB bifurcation points
  – New spectral algorithm for finding the IB bifurcation points
  – Determines the number and width of the optimal DNN layers
  – New spectral learning rule: weights are derived from the 2nd eigenvector
• New design principles and finite sample complexity bounds
  – Network structure is determined from the bifurcation diagram
  – Finite sample bounds from mutual-information estimation bounds
  – Stochastic networks are proved to be optimal (in terms of complexity)
  – Possible implications for real (biological) layered networks
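Placing real DNN layers on the IB plane, and the finite-sample bounds mentioned in the summary, rest on estimating the layer informations I(T;X) and I(T;Y). A minimal plug-in sketch under a simple binning assumption (the function names, binning choice, and estimator are illustrative, not the authors' procedure):

```python
import numpy as np

def plugin_mi(a, b):
    """Plug-in mutual information estimate (nats) between two integer-coded samples."""
    a, b = np.asarray(a), np.asarray(b)
    joint, counts = np.unique(np.stack([a, b], axis=1), axis=0, return_counts=True)
    p_ab = counts / counts.sum()
    p_a = np.bincount(joint[:, 0], weights=p_ab)   # marginal p(a)
    p_b = np.bincount(joint[:, 1], weights=p_ab)   # marginal p(b)
    return float(np.sum(p_ab * np.log(p_ab / (p_a[joint[:, 0]] * p_b[joint[:, 1]]))))

def layer_ib_coordinates(x_ids, y_labels, activations, n_bins=30):
    """Place one hidden layer on the IB plane as (I(T;X), I(T;Y)).

    Discretizes the layer activations T into bins, treats each distinct binned
    pattern as one value of T, and uses plug-in MI estimates.
    x_ids: integer id of each input sample; y_labels: its class label.
    """
    edges = np.linspace(activations.min(), activations.max(), n_bins + 1)
    binned = np.digitize(activations, edges[1:-1])             # (n_samples, width)
    _, t_ids = np.unique(binned, axis=0, return_inverse=True)  # binned pattern -> id
    return plugin_mi(t_ids, x_ids), plugin_mi(t_ids, y_labels)
```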