“Optimal Deep Learning” and the Information Bottleneck method
ICRI-CI retreat, Haifa, May 2015
Naftali Tishby
Noga Zaslavsky
School of Engineering and Computer Science
The Edmond & Lily Safra Center for Brain Sciences
Hebrew University, Jerusalem, Israel
Outline
• Deep Neural Networks and Deep Learning
– What are Deep Neural Networks (DNN)?
– The incredible success of DNN’s
– Theoretical challenges
• The Information Bottleneck method
– Finding (approximate) Minimal sufficient statistics
– DPI & Centroid consistency
– The IB complexity-accuracy tradeoff
– The nature of the optimal solutions – IB bifurcations
• Bifurcation Theory of Deep Neural Networks
– Statistical characterizations of Neural Nets
– Learning optimality and sample complexity bound
– The connection between NN layers and IB phase transitions
– Design principles for optimal DNN’s
Deep Learning: Neural-Nets strike back
We desperately need a Theory…
• Why do DNN’s work so well?
• How can they be improved?
  – Optimality bounds…
    • What is “an optimal DNN”?
    • Sample and computational complexity bounds
  – Design principles
    • What determines the number & width of the layers?
    • What determines the connectivity and inter-layer connections?
  – Interpretability
    • What do the layers/neurons capture/represent?
  – Better learning algorithms
    • Is stochastic gradient descent the best we can do?
Deep Neural Nets and Information Theory ??
From causal to predictive systems…
The Information Bottleneck Method
(Tishby, Pereira, Bialek, 1999)
(1) Approximate minimal sufficient statistics:
Markov chain: $Y \to X \to S(X) \to \hat{X}$
$$\hat{X} = \arg\min_{S(X):\, I(S(X);Y) = I(X;Y)} I(S(X);X)$$
Relaxation, given $p(X,Y)$:
$$\hat{X} = \arg\min_{p(\hat{x}|x)} \; I(\hat{X};X) - \beta\, I(\hat{X};Y), \quad \beta > 0$$
(Shamir, Sabato,T., TCS 2010)
(2) A rate-distortion problem with KL-divergence distortion:
$$d_{IB}(x,\hat{x}) = D\big[p(y|x)\,\|\,p(y|\hat{x})\big]$$
(Bachrach, Navot,T., COLT 2006)
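To make the KL-divergence distortion concrete, here is a minimal numerical sketch (Python/NumPy; the toy joint distribution and the function name d_ib are illustrative choices, not from the talk) that evaluates $d_{IB}(x,\hat{x}) = D[p(y|x)\,\|\,p(y|\hat{x})]$ for a small discrete $p(X,Y)$:

```python
import numpy as np

def d_ib(p_y_given_x, p_y_given_xhat):
    """IB distortion d_IB(x, xhat) = D[p(y|x) || p(y|xhat)] between two conditionals."""
    mask = p_y_given_x > 0                         # 0 * log(0/q) = 0 by convention
    return np.sum(p_y_given_x[mask] * np.log(p_y_given_x[mask] / p_y_given_xhat[mask]))

# Toy joint distribution p(x, y) over 3 x-values and 2 y-values (illustrative only)
p_xy = np.array([[0.30, 0.05],
                 [0.20, 0.10],
                 [0.05, 0.30]])
p_y_x = p_xy / p_xy.sum(axis=1, keepdims=True)     # rows are p(y|x)

# Distortion of representing x = 0 by a cluster xhat whose predictive distribution is that of x = 2
print(d_ib(p_y_x[0], p_y_x[2]))
```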
(3) The ONLY distributional quantization measure that satisfies
both the DPI (f-divergences) and statistical consistency (Bregman divergences)
(Harremoes-T., ISIT 2008)
The Information Bottleneck Method
(Tishby, Pereira, Bialek, 1999)
The IB optimality/stationarity equations:
$$\min_{p(\hat{x}|x):\, Y \to X \to \hat{X}} \; I(\hat{X};X) - \beta\, I(\hat{X};Y), \quad \beta > 0$$
$$p(\hat{x}|x) = \frac{p(\hat{x})}{Z(x,\beta)}\, \exp\!\big(-\beta\, D[p(y|x)\,\|\,p(y|\hat{x})]\big)$$
$$Z(x,\beta) = \sum_{\hat{x}} p(\hat{x})\, \exp\!\big(-\beta\, D[p(y|x)\,\|\,p(y|\hat{x})]\big)$$
$$p(\hat{x}) = \sum_{x} p(\hat{x}|x)\, p(x)$$
$$p(y|\hat{x}) = \sum_{x} p(y|x)\, p(x|\hat{x})$$
Solved by Blahut-Arimoto-like iterations, but with possibly sub-optimal solutions (!),
similar to K-means distributional clustering with centroid updates.
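A minimal self-contained sketch of these self-consistent iterations (Python/NumPy; the random initialization, fixed iteration count, and numerical smoothing constants are assumptions of the sketch, and no deterministic annealing in $\beta$ is attempted):

```python
import numpy as np

def ib_iterations(p_xy, n_clusters, beta, n_iter=200, seed=0):
    """Iterate the IB self-consistent equations for a discrete joint p(x, y).

    Returns p(xhat|x), p(xhat) and p(y|xhat). Each pass applies the four
    stationarity equations above in turn; the fixed point reached may be
    sub-optimal, exactly as noted for the Blahut-Arimoto-like scheme.
    """
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)                   # p(x)
    p_y_x = p_xy / p_x[:, None]              # p(y|x), rows sum to 1

    # Random soft assignments p(xhat|x), shape (n_x, n_clusters)
    q = rng.random((p_x.size, n_clusters))
    q /= q.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        p_xhat = q.T @ p_x                                         # p(xhat) = sum_x p(xhat|x) p(x)
        p_x_xhat = (q * p_x[:, None]) / (p_xhat[None, :] + 1e-12)  # p(x|xhat) by Bayes' rule
        p_y_xhat = p_x_xhat.T @ p_y_x                              # p(y|xhat) = sum_x p(y|x) p(x|xhat)

        # d_IB(x, xhat) = D[p(y|x) || p(y|xhat)] for every pair (x, xhat)
        log_ratio = np.log(p_y_x[:, None, :] + 1e-12) - np.log(p_y_xhat[None, :, :] + 1e-12)
        d = np.sum(p_y_x[:, None, :] * log_ratio, axis=2)

        # p(xhat|x) = p(xhat) exp(-beta * d_IB) / Z(x, beta)
        q = p_xhat[None, :] * np.exp(-beta * d)
        q /= q.sum(axis=1, keepdims=True)

    p_xhat = q.T @ p_x
    p_y_xhat = ((q * p_x[:, None]) / (p_xhat[None, :] + 1e-12)).T @ p_y_x
    return q, p_xhat, p_y_xhat

# Toy example: 4 x-values, 2 y-values, 2 clusters, moderate beta (illustrative only)
p_xy = np.array([[0.20, 0.05], [0.18, 0.07], [0.05, 0.20], [0.07, 0.18]])
q, p_xhat, p_y_xhat = ib_iterations(p_xy, n_clusters=2, beta=5.0)
print(np.round(q, 3))
```

Sweeping $\beta$ and plotting $I(\hat{X};Y)$ against $I(\hat{X};X)$ for the resulting fixed points traces out the (possibly sub-optimal) information curves sketched in the figure below.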
[Figure: the information plane, $I(\hat{X};Y)$ vs. $I(X;\hat{X})$, with a family of sub-optimal information curves. The limit is always an RD-like concave envelope with sub-optimal bifurcations.]
Critical points are 2nd order phase transitions
The IB bifurcation (phase-transition) points
The IB bifurcation points can be found as follows:
$$\ln p(\hat{x}|x) = \ln\frac{p(\hat{x})}{Z(x,\beta)} - \beta\, D\big[p(y|x)\,\|\,p(y|\hat{x})\big]$$
then:
$$\frac{\partial \ln p(x|\hat{x})}{\partial \hat{x}} = \beta \sum_{y} p(y|x)\, \frac{\partial \ln p(y|\hat{x})}{\partial \hat{x}}$$
$$\frac{\partial \ln p(y|\hat{x})}{\partial \hat{x}} = \frac{1}{p(y|\hat{x})} \sum_{x} p(y|x)\, p(x|\hat{x})\, \frac{\partial \ln p(x|\hat{x})}{\partial \hat{x}}$$
These equations can be combined into two (non-linear) eigenvalue problems:
$$\big[I - \beta\, C_X(\hat{x},\beta)\big]\, \frac{\partial \ln p(x|\hat{x})}{\partial \hat{x}} = 0$$
$$\big[I - \beta\, C_Y(\hat{x},\beta)\big]\, \frac{\partial \ln p(y|\hat{x})}{\partial \hat{x}} = 0$$
These eigenvalue problems have non-trivial solutions (eigenvectors) only at the
critical bifurcation points (second order phase transitions).
$$\beta_c(\hat{x}) = \lambda_2\big(C_X(\hat{x},\beta_c)\big)^{-1} = \lambda_2\big(C_Y(\hat{x},\beta_c)\big)^{-1}$$
IB bifurcation diagram
[Figure: the IB bifurcation diagram, with critical points at $\beta_c(\hat{x}) = \lambda_2\big(C_X(\hat{x},\beta_c)\big)^{-1} = \lambda_2\big(C_Y(\hat{x},\beta_c)\big)^{-1}$.]
DNN’s and the Information Bottleneck
Linearly separable units (positive, stochastic):
$$\ln p(h_i \mid h_{i-1}) = h_{i-1}^T W_i\, h_i - b(h_i)$$
$$\frac{\partial \ln p(h_i \mid h_{i-1})}{\partial h_{i-1}} = W_i\, h_i$$
The inter-layer mapping is of exponential form:
$$\ln p(h_{i-1} \mid h_i) = h_{i-1}^T W_i\, h_i - a(h_{i-1})$$
where $W_i$ is the i-th layer connection matrix.
Near the optimal IB curve:
$$\frac{\partial \ln p(h_i \mid h_{i-1})}{\partial h_{i-1}} \approx \frac{\partial \ln p(x|\hat{x})}{\partial \hat{x}}, \quad \text{with } h_i \mapsto x \text{ and } h_{i-1} \mapsto \hat{x}.$$
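As an illustration of this exponential (log-bilinear) inter-layer form, here is a minimal sketch (Python/NumPy) of a stochastic layer of binary units whose log-conditional is bilinear in $h_{i-1}$ and $h_i$ up to the bias/normalization term; the choice of binary sigmoid units and the sampling routine are assumptions of the sketch, the slides only specify the exponential form:

```python
import numpy as np

def sample_layer(h_prev, W, rng):
    """Sample a binary stochastic layer h_i given h_{i-1}.

    With factorized binary units the log-conditional is bilinear,
    ln p(h_i | h_prev) = h_prev^T W h_i + (terms not involving h_i),
    so each unit fires independently with probability sigmoid((h_prev^T W)_j).
    """
    activation = h_prev @ W                        # shape (n_units_i,)
    p_on = 1.0 / (1.0 + np.exp(-activation))       # per-unit firing probabilities
    return (rng.random(p_on.shape) < p_on).astype(float)

rng = np.random.default_rng(0)
h0 = rng.integers(0, 2, size=8).astype(float)      # toy input layer (binary, illustrative)
W1 = 0.5 * rng.standard_normal((8, 5))             # toy layer-1 connection matrix
h1 = sample_layer(h0, W1, rng)
print(h1)
```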
DNN’s and the Information Bottleneck
Near the optimal IB curve:
$$\frac{\partial \ln p(h_i \mid h_{i-1})}{\partial h_{i-1}} \approx \frac{\partial \ln p(x|\hat{x})}{\partial \hat{x}}, \quad \text{with } h_i \mapsto x \text{ and } h_{i-1} \mapsto \hat{x}.$$
But on the optimal IB curve there is a non-trivial derivative only at the IB bifurcation points:
$$\frac{\partial \ln p(h_i \mid h_{i-1})}{\partial h_{i-1}} = W_i\, h_i = v_2(h_i, \beta_c),$$
where $v_2(x = h_i, \beta_c)$ is the second eigenvector of the bifurcation matrix $\big[I - \beta\, C_X(\hat{x},\beta)\big]$:
$$\big[I - \beta_c\, C_X(\hat{x},\beta_c)\big]\, v_2(x,\beta_c) = 0.$$
This provides an equation for the optimal weights:
$$W_i = \frac{v_2(h_i,\beta_c)}{h_i}, \quad h_i \neq 0.$$
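The slides do not spell out how the weight equation is solved; purely as an illustration, the sketch below (Python/NumPy) uses the minimum-norm rank-1 choice $W_i = v_2 h_i^T / \lVert h_i\rVert^2$, which satisfies $W_i h_i = v_2$ for one given activation vector $h_i$. This particular construction, and the toy vectors, are assumptions of the sketch, not the spectral learning rule of the talk:

```python
import numpy as np

def rank_one_weights(v2, h):
    """Minimum-norm W satisfying W @ h == v2 for a single activation vector h (h != 0)."""
    return np.outer(v2, h) / np.dot(h, h)

rng = np.random.default_rng(1)
h_i = rng.standard_normal(6)          # toy activation vector of layer i (illustrative)
v_2 = rng.standard_normal(4)          # toy second eigenvector of [I - beta_c C_X] (illustrative)
W_i = rank_one_weights(v_2, h_i)
print(np.allclose(W_i @ h_i, v_2))    # True: W_i h_i = v_2 holds for this h_i
```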
Optimal design principles
[Figure: network diagram showing the hidden layers and the output layer.]
Sample complexity bounds
Real DNN’s on the IB plane
Summary
• An Information Theory of Deep Neural Networks
– Based on the Information Bottleneck (IB) tradeoff
– Uniquely and consistently quantifies the hidden layers
• The optimal hidden layers correspond to IB bifurcation points
– New spectral algorithm for finding the IB bifurcation points
– Determines number and width of the optimal DNN layers
– New spectral learning rule: weights are derived from the 2nd eigenvector
• New design principles and finite sample complexity bounds
– Network structure is determined from the bifurcation diagram
– Finite sample bounds from mutual-information estimation bounds
– Stochastic networks are proved to be optimal (in terms of complexity)
– Possible implications for real (biological) layered networks
– ….