A New Propagator for Two-Layer Neural
Networks in Empirical Model Learning
Michele Lombardi (University of Bologna)
Stefano Gualandi (University of Pavia)
Context
Sometimes, real-world optimization problems are defined over complex domains...
...and complex domains are difficult to model
Context
Example #1
§ Given a budget
§ Place traffic lights
§ Optimize a traffic metric
§ How do you compute this metric?
§ How is it affected by the decisions?
Context
Example #2
§ Given a fixed budget
§ Design an incentive plan
§ Renewable energy production goals
§ How will people react to the plan?
Context
Example #3
§ Many-core platform
§ Dispatch workload
§ On-line scheduling
§ Avoid loss of efficiency due to thermal controllers
§ How does the temperature behave?
§ How is it affected by the scheduler?
Empirical Model Learning
Empirical Model Learning is a technique to enable optimal decision making over complex systems
Three main steps:
1. Obtain input/output tuples
2. Extract an approximate model via Machine Learning
3. Encode it using a Combinatorial Optimization technique
How? Simple evaluation? More?
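A purely illustrative sketch of steps 1 and 2 (not the authors' code): it assumes a hypothetical simulate() function standing in for the complex system and uses scikit-learn to fit a small two-layer network; step 3 is what the rest of this talk addresses.

import numpy as np
from sklearn.neural_network import MLPRegressor

# Step 1: obtain input/output tuples from the complex system
# (simulate() is a stand-in for a simulator or for real measurements)
def simulate(x):
    return np.tanh(x[0] + x[1]) + np.tanh(-x[0] + x[1])

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 2))
y = np.array([simulate(x) for x in X])

# Step 2: extract an approximate model via Machine Learning
# (a network with one hidden layer of tanh units)
net = MLPRegressor(hidden_layer_sizes=(2,), activation='tanh', max_iter=5000)
net.fit(X, y)

# Step 3: encode the trained network inside the optimization model
# (e.g. via CP, as discussed next)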
Neuron Constraints
Specific case: Neural Networks

[Figure: a feed-forward network with inputs in0, in1, in2 and output out; zooming in on one neuron: its inputs x0, x1, x2 enter a weighted sum producing y, and the activation f produces z]

Each neuron computes:
y = b + Σ_i w_i · x_i
z = f(y)

where f is:
§ monotone
§ non-decreasing
(e.g. a sigmoid)

Idea: encode each neuron using CP!
Neuron Constraints
Specific case: Neural Networks

[Figure: the same neuron seen as a constraint network: the inputs x0, x1, x2, the weighted sum y and the output z become decision variables; the weighted sum and the activation become constraints, forming a Neuron Constraint]

Every NN can be encoded via Neuron Constraints
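A minimal sketch of the propagation a single Neuron Constraint can perform, in plain Python rather than CP (names are illustrative; tanh is assumed as the activation): because f is monotone and non-decreasing, bounds on the weighted sum map directly onto bounds on the output.

import math

def neuron_bounds(w, b, x_bounds, f=math.tanh):
    """Bound z = f(b + sum_i w[i]*x[i]) given box bounds x_bounds[i] = (lo, hi)."""
    y_lo, y_hi = b, b
    for wi, (lo, hi) in zip(w, x_bounds):
        # a positive weight reaches the max with the input's upper bound,
        # a negative weight with its lower bound (and vice versa for the min)
        y_lo += wi * (lo if wi >= 0 else hi)
        y_hi += wi * (hi if wi >= 0 else lo)
    # monotonicity of f: the bounds on y translate directly into bounds on z
    return f(y_lo), f(y_hi)

# e.g. a neuron with weights (1, 1), no bias, inputs in [-1, 1]
print(neuron_bounds([1.0, 1.0], 0.0, [(-1, 1), (-1, 1)]))   # ~(-0.96, 0.96)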
Drawbacks
However, consider this:

[Figure: a network with a hidden layer (n0, n1) and an output layer (n2); inputs x0 ∈ [-1,1] and x1 ∈ [-1,1]; every arc weight is 1, except the arc from x0 to n1, which has weight -1]

§ Weights on the arcs
§ No bias
§ n0, n1: sigmoid. n2: linear.
Drawbacks
However, consider this:

[Figure: the same network, with x0 ∈ [-1,1] and x1 ∈ [-1,1]]

Propagation:
§ For n0: with x0 ∈ [-1,1], x1 ∈ [-1,1], the weighted sum reaches [..., 2] (at x0 = 1, x1 = 1), so the output bound is [..., 0.96]
§ For n1: with x0 ∈ [-1,1], x1 ∈ [-1,1], the weighted sum reaches [..., 2] (at x0 = -1, x1 = 1), so the output bound is [..., 0.96]
§ For n2: adding the two 0.96 bounds gives [..., 1.93]
Drawbacks
The real network maximum is 1.51
§ Instead of 1.93 :-(
§ The gap comes from propagating each neuron in isolation: the bound for n0 needs x0 = 1, while the bound for n1 needs x0 = -1 (a numerical check is sketched below)
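A quick numerical check of this gap (illustrative code, assuming tanh as the "sigmoid"; the exact figures depend on the activation actually used):

import math, itertools

f = math.tanh

def net(x0, x1):
    # hidden layer: n0 sees x0 + x1, n1 sees -x0 + x1; the output n2 is linear
    return f(x0 + x1) + f(-x0 + x1)

# per-neuron propagation: n0 and n1 are bounded independently,
# so n0 implicitly uses x0 = 1 while n1 uses x0 = -1
naive_ub = f(2.0) + f(2.0)
print(round(naive_ub, 2))    # ~1.93

# true maximum over the joint domain (coarse grid search, for illustration)
grid = [i / 100 for i in range(-100, 101)]
true_max = max(net(a, b) for a, b in itertools.product(grid, grid))
print(round(true_max, 2))    # ~1.52, far below the propagated 1.93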
What to do?
Build a global constraint
Let's start from a common case (2 layers):

twolayerANN([x_i], z, [b_j], [w_{j,i}], b̂, [ŵ_j])
The Bounding Problem
§ Given bounds on the network input
§ How do we compute bounds on the network output?

General structure:

[Figure: the two-layer structure: inputs x0, x1 feed the hidden neurons via weights w_{j,i}; hidden neuron j computes y_j = b_j + Σ_i w_{j,i} x_i and applies f; the output neuron combines the f(y_j) with weights ŵ_j and bias b̂ into z]

where f is:
§ monotone
§ non-decreasing
The Bounding Problem
§ Given bounds on the network input
§ How do we compute bounds on the network output?

Upper bound = solution of:

max  z = b̂ + Σ_{j=0..m-1} ŵ_j f(y_j)                    ← "all its fault!"
s.t. y_j = b_j + Σ_{i=0..n-1} w_{j,i} x_i                ∀j = 0..m-1
     x_i ∈ [x̲_i, x̄_i]                                    ∀i = 0..n-1

BUT: this is non-linear AND non-convex
A way out: Lagrangian relaxation
The Bounding Problem
§ Given bounds on the network input
§ How do we compute bounds on the network output?

Upper bound = solution of:

max  z = b̂ + Σ_{j=0..m-1} ŵ_j f(y_j)
s.t. y_j = b_j + Σ_{i=0..n-1} w_{j,i} x_i                ∀j = 0..m-1     ← relax this
     x_i ∈ [x̲_i, x̄_i]                                    ∀i = 0..n-1

BUT: this is non-linear AND non-convex
A way out: Lagrangian relaxation
The Bounding Problem
§ Given bounds on the network input
§ How do we compute bounds on the network output?

Upper bound = solution of:

max  z(λ) = b̂ + Σ_{j=0..m-1} ŵ_j f(y_j) +
           + Σ_{j=0..m-1} λ_j ( b_j + Σ_{i=0..n-1} w_{j,i} x_i − y_j )     ← λ_j: Lagrangian multipliers
s.t. x_i ∈ [x̲_i, x̄_i]      ∀i = 0..n-1
     y_j ∈ [y̲_j, ȳ_j]      ∀j = 0..m-1
The Bounding Problem
For every value of the multipliers we get a bound
(over)simplified example:

z = f(y)
y = 2·x + 1
x ∈ [x̲, x̄]

[Figure: the feasible space in the (x, y) plane and our z over it]
The Bounding Problem
For every value of the multipliers we get a bound
(over)simplified example:

z = f(y)
y = 2·x + 1
x ∈ [x̲, x̄]
y ∈ [y̲, ȳ]

[Figure: the feasible space, now also boxed by the bounds on y]
The Bounding Problem
For every value of the multipliers we get a bound
(over)simplified example:

z = f(y) + λ · (2·x + 1 − y)     (the constraint y = 2·x + 1 is dropped and penalized in the objective)
x ∈ [x̲, x̄]
y ∈ [y̲, ȳ]

[Figures: the relaxed objective over the box, for two different values of the multiplier λ; each value yields a valid bound]
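A small numeric check of this toy example (illustrative only), assuming f = tanh, x ∈ [-1, 1] and hence y ∈ [-1, 3]: whatever value the multiplier λ takes, maximizing the relaxed objective over the box never falls below the true maximum of f(2·x + 1).

import math

f = math.tanh
x_lo, x_hi = -1.0, 1.0
y_lo, y_hi = 2 * x_lo + 1, 2 * x_hi + 1     # bounds on y implied by y = 2x + 1

def relaxed_bound(lam, steps=400):
    # x and y are now independent: maximize each part over its own interval
    xs = [x_lo + (x_hi - x_lo) * k / steps for k in range(steps + 1)]
    ys = [y_lo + (y_hi - y_lo) * k / steps for k in range(steps + 1)]
    return max(lam * (2 * x + 1) for x in xs) + max(f(y) - lam * y for y in ys)

true_max = max(f(2 * x + 1) for x in [x_lo + 2 * k / 400 for k in range(401)])
for lam in [0.0, 0.2, 0.5, 1.0]:
    print(lam, round(relaxed_bound(lam), 3), ">=", round(true_max, 3))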
Solving the Relaxation
§ Given bounds on the network input
§ How do we compute bounds on the network output?

The problem is separable!

Upper bound = solution of:

max  z(λ) = b̂ + Σ_{j=0..m-1} ŵ_j f(y_j) +
           + Σ_{j=0..m-1} λ_j ( b_j + Σ_{i=0..n-1} w_{j,i} x_i − y_j )
s.t. x_i ∈ [x̲_i, x̄_i]      ∀i = 0..n-1
     y_j ∈ [y̲_j, ȳ_j]      ∀j = 0..m-1

§ y-part: the terms in y_j, i.e. ŵ_j f(y_j) − λ_j y_j
§ x-part: the terms in x_i, i.e. λ_j w_{j,i} x_i
Solving the Relaxation
The x-part:

max_x  z_x(λ) = Σ_{i=0..n-1} ( Σ_{j=0..m-1} λ_j w_{j,i} ) x_i
s.t.   x_i ∈ [x̲_i, x̄_i]      ∀i = 0..n-1

§ If the coefficient of x_i is ≥ 0: maximize x_i; if < 0: minimize x_i
§ Linear problem with box constraints
§ O(nm) to compute the weights
§ O(n) to solve the problem
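A minimal sketch of this step in plain Python (illustrative names, not the authors' or-tools implementation):

def solve_x_part(lam, w, x_bounds):
    """Maximize sum_i (sum_j lam[j]*w[j][i]) * x_i over box bounds on the x_i.
    lam: m multipliers; w: m x n hidden-layer weights; x_bounds: list of (lo, hi)."""
    m, n = len(w), len(x_bounds)
    # O(nm): aggregate the multiplier-weighted coefficient of each x_i
    coeff = [sum(lam[j] * w[j][i] for j in range(m)) for i in range(n)]
    # O(n): a linear objective over a box is maximized at a corner:
    # upper bound for non-negative coefficients, lower bound otherwise
    x_star = [hi if c >= 0 else lo for c, (lo, hi) in zip(coeff, x_bounds)]
    return sum(c * x for c, x in zip(coeff, x_star)), x_star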
Solving the Relaxation
The y-part:

max_y  z_y(λ) = Σ_{j=0..m-1} ( ŵ_j f(y_j) − λ_j y_j )
s.t.   y_j ∈ [y̲_j, ȳ_j]      ∀j = 0..m-1

§ Sum of single-variable functions
§ Optimum via classical analytic methods (derivative = 0)
§ Best y_j in constant time
§ Solved in O(m)
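A sketch of the per-neuron maximization, assuming f = tanh so that the stationary-point condition ŵ_j·(1 − tanh²(y)) = λ_j has a closed form; the interval endpoints are always checked as well (names are illustrative):

import math

def best_y(w_hat, lam, y_lo, y_hi):
    """Maximize g(y) = w_hat*tanh(y) - lam*y over [y_lo, y_hi]."""
    candidates = [y_lo, y_hi]
    if w_hat != 0.0:
        t2 = 1.0 - lam / w_hat            # stationary points: tanh(y)^2 = 1 - lam/w_hat
        if 0.0 <= t2 < 1.0:
            y_star = math.atanh(math.sqrt(t2))
            candidates += [y for y in (y_star, -y_star) if y_lo <= y <= y_hi]
    return max(candidates, key=lambda y: w_hat * math.tanh(y) - lam * y)

def solve_y_part(w_hat, lam, y_bounds):
    # O(m): one constant-time maximization per hidden neuron
    ys = [best_y(w_hat[j], lam[j], lo, hi) for j, (lo, hi) in enumerate(y_bounds)]
    return sum(w_hat[j] * math.tanh(y) - lam[j] * y for j, y in enumerate(ys)), ys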
Finding the optimal multipliers
The Lagrangian problem:
Which multipliers provide the best upper bound?

min_λ  z(λ)
s.t.   λ ∈ R^m

§ Solved via subgradient method + deflection
Deflection: the update direction is a composition of the current subgradient and the last direction
Finding the optimal multipliers

[Figure: the bound z*(λ) over the multiplier space; panel A: subgradient iterates without deflection; panel B: with deflection; the first iteration and the minimum bound are marked in both panels]
Finding the optimal multipliers

[Figure: the same comparison of the iterates without and with deflection]

The best bound is 1.52!
§ Real: 1.51, Prev: 1.93
Finding the optimal multipliers
The Lagrangian problem:
Which multipliers provide the best upper bound?

min_λ  z(λ)
s.t.   λ ∈ R^m

§ Solved via subgradient + deflection
Deflection: the update direction is a composition of the current subgradient and the last direction
§ 100 iterations at the root node
§ During search: 3 iterations per x-update
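A minimal sketch of the multiplier update with deflection, reusing the hypothetical solve_x_part / solve_y_part helpers sketched above; the step size and deflection coefficient are illustrative, not the authors' exact scheme:

def lagrangian_bound(lam, w, b, w_hat, b_hat, x_bounds, y_bounds):
    # evaluate z(lambda) through the separable x-part and y-part
    zx, x_star = solve_x_part(lam, w, x_bounds)
    zy, y_star = solve_y_part(w_hat, lam, y_bounds)
    const = sum(lam[j] * b[j] for j in range(len(b)))
    # subgradient component j: violation of the relaxed constraint at the maximizer
    sub = [b[j] + sum(w[j][i] * x_star[i] for i in range(len(x_star))) - y_star[j]
           for j in range(len(b))]
    return b_hat + const + zx + zy, sub

def minimize_bound(lam, w, b, w_hat, b_hat, x_bounds, y_bounds,
                   iters=100, step=0.1, theta=0.7):
    best, direction = float('inf'), [0.0] * len(lam)
    for _ in range(iters):
        z, sub = lagrangian_bound(lam, w, b, w_hat, b_hat, x_bounds, y_bounds)
        best = min(best, z)
        # deflection: blend the current subgradient with the previous direction
        direction = [theta * d + (1 - theta) * g for d, g in zip(direction, sub)]
        # step against the (deflected) subgradient to decrease the upper bound
        lam = [l - step * d for l, d in zip(lam, direction)]
    return best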
Experimental Setup
Implemented in Google or-tools
Test on a workload dispatching problem:
§ 16-20 tasks, 4 cores
§ 6 platforms
§ Thermal controller (lowers efficiency)
§ Efficiency threshold, find feasible assignment
§ Time Limit
Static var/val choice heuristic: the structure of the search tree does not depend on the propagation
Branches for 16 Tasks
[Figure: number of branches with (LAG) and without (NO LAG) the Lagrangian propagator, on platforms ptf1, ptf2, ptf3]

Time for 16 Tasks
[Figure: solution time with (LAG) and without (NO LAG) the Lagrangian propagator, on platforms ptf1, ptf2, ptf3]

Branches for 20 Tasks
[Figure: number of branches with (LAG) and without (NO LAG) the Lagrangian propagator, on platforms ptf1, ptf2, ptf3]

Time for 20 Tasks
[Figure: solution time with (LAG) and without (NO LAG) the Lagrangian propagator, on platforms ptf1, ptf2, ptf3]
Conclusions
A new propagator for an important Neural Network class:
§ Impressive reduction of the #branches (in some cases)
§ The propagation time is an issue
Roadmap:
§ A more efficient way to update the multipliers
§ Prune the input variables
§ Comparison with MINLP solvers
§ Comparison with local search based solvers
§ Apply EML to different machine learning technologies
The End
Thanks!
Questions?