
A Probabilistic Graphical Model-based Approach for
Minimizing Energy under Performance Constraints
Nikita Mishra, Huazhe Zhang, John Lafferty and Hank Hoffmann
University of Chicago
[Figure: histogram (fraction of time vs. CPU utilization) showing the average CPU utilization of more than 5,000 servers during a six-month period [1]]

[1] Barroso, Luiz André, and Urs Hölzle. "The case for energy-proportional computing." IEEE Computer 40.12 (2007): 33-37.
Example of a configuration space

[Diagram: a machine's configuration space spans clock speed (e.g., 2.26 GHz), the number of cores, and two memory controllers]
Adaptive systems

Automatically tune configurations for different utilizations to reach the most energy-efficient state.

This requires the power and performance profile of the application.
Why is it a difficult problem?

• The configuration space can be quite large; brute-force search may take a long time.
• Each application behaves differently on different machines.
• An application's behavior can even vary with different inputs (e.g., the x264 video encoder).
Example: streamcluster

[Figure: contour plot of performance rate (in iter/s) over cores and clock speed for the streamcluster benchmark at different configurations; the surface has multiple local solutions]
Example: kmeans

[Figure: Pareto frontier of performance rate (in iter/s) vs. system power (in Watts) at different configurations, with the optimal-configuration frontier marked]
LEO (Learning for Energy Optimization)

[Diagram: LEO combines historical data with measurements of the target application]

Incorporate performance profiles of previously seen applications.
Example: kmeans

[Figure: performance rate (in iter/s) vs. configuration index; estimated Pareto-optimal frontiers compared with the true frontier found by exhaustive search]
Outline

• Motivation/Overview
• Statistical modelling
  • Graphical models
  • Hierarchical Bayesian model
  • Expectation-maximization algorithm
• Evaluation
• Summary
Graphical Models

[Diagram: a hidden node zi for each application i = 1, ..., M, each generating an observed node yi]

yi: vector of performance rates achieved by the ith application across configurations.
Hierarchical Bayesian Model

[Diagram: hidden nodes z1, ..., zM sit above the data. Applications 1 through M-1 are fully observed (y1, ..., yM-1); the target application yM is only partially observed, and zM is the true value of the target application.]

• The hierarchical prior couples each of the applications.
• It penalizes large variations within an application's profile.

yi: vector of performance rates achieved by the ith application across configurations.
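The model equations on the original slide did not survive extraction. A common hierarchical Gaussian form matching the description above (an assumed instantiation, not necessarily the paper's exact model) is:

```latex
z_i \sim \mathcal{N}(\mu, \Sigma), \qquad y_i \mid z_i \sim \mathcal{N}(z_i, \sigma^2 I), \qquad i = 1,\dots,M
```

Here the shared (μ, Σ) couples the applications and penalizes implausibly large variation within a profile; for the target application, only a subset of the entries of yM is observed.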
Expectation Maximization Algorithm

[Diagram: EM loop over the model parameters θ and the latent variables.
1. Initialize the model parameters and latent variables.
2. E-step: from the observed data and the current parameters, create the expected log-likelihood function.
3. M-step: maximize the expected log-likelihood to obtain θnew.
4. Repeat until convergence.]
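A minimal EM sketch under the Gaussian model assumed earlier (illustrative, not LEO's exact estimator): previously seen applications supply full profiles, and the target application is measured at only a few configurations.

```python
# EM sketch for the assumed hierarchical Gaussian model.
import numpy as np

def em_estimate(Y_known, y_obs, obs_idx, n_iters=10, noise_var=1.0):
    """Estimate the target application's full performance vector.

    Y_known: (M-1, C) array of historical performance profiles.
    y_obs:   (K,) measurements of the target application.
    obs_idx: the K configuration indices that were measured.
    """
    C = Y_known.shape[1]
    mu = Y_known.mean(axis=0)                        # initialize from history
    Sigma = np.cov(Y_known, rowvar=False) + 1e-6 * np.eye(C)
    z = mu.copy()                                    # latent target profile
    for _ in range(n_iters):
        # E-step: posterior mean of z given the measured entries
        # (Gaussian conditioning on the observed configurations).
        S_oo = Sigma[np.ix_(obs_idx, obs_idx)] + noise_var * np.eye(len(obs_idx))
        gain = Sigma[:, obs_idx] @ np.linalg.inv(S_oo)
        z = mu + gain @ (y_obs - mu[obs_idx])
        # M-step: refit the prior mean, treating the current estimate of z
        # as one more sample (Sigma is held fixed in this sketch).
        mu = (Y_known.sum(axis=0) + z) / (len(Y_known) + 1)
    return z
```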
Example: kmeans

[Plots: successive iterations of the EM algorithm estimating performance rate (in iter/s) vs. cores, from the initialization with a few observed samples through EM iterations 1-4]
LEO (Learning for Energy Optimization)

[Diagram: runtime loop. One LEO instance takes yM = observed power and returns p = estimated power; another takes yM = observed performance and returns r = estimated performance. A controller uses p and r to select the next configuration, and new measurements feed back into both estimators.]
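A minimal sketch of this feedback loop, with hypothetical interfaces: estimate_profile stands in for LEO's estimator (e.g., the EM sketch above), and measure/apply_configuration for the measurement and actuation layers.

```python
# Feedback loop sketch: estimate full profiles from partial observations,
# then let the controller pick the next configuration.
def control_loop(configs, perf_target, measure, apply_configuration,
                 estimate_profile, steps=100):
    perf_obs, power_obs, idx_obs = [], [], []
    current = 0
    for _ in range(steps):
        perf, power = measure()                    # feedback from the system
        perf_obs.append(perf)
        power_obs.append(power)
        idx_obs.append(current)
        r = estimate_profile(perf_obs, idx_obs)    # estimated performance
        p = estimate_profile(power_obs, idx_obs)   # estimated power
        # Controller: cheapest configuration predicted to meet the target;
        # fall back to the fastest one if none is predicted feasible.
        feasible = [i for i in range(len(configs)) if r[i] >= perf_target]
        current = (min(feasible, key=lambda i: p[i]) if feasible
                   else max(range(len(configs)), key=lambda i: r[i]))
        apply_configuration(configs[current])
```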
Outline

• Motivation/Overview
• Statistical modelling
• Evaluation
  • Experimental setup
  • Power and performance estimation
  • Energy savings / phase transitions
• Summary
Experimental Setup

• Platform: dual-socket Linux 3.2.0 system with a SuperMICRO X9DRL-iF motherboard and two Intel Xeon E5-2690 processors
• Configurations (1024 configurations in total; see the sanity check below)
  • Clock speed: set using the cpufrequtils package
    • 15 DVFS settings (from 1.2 to 2.9 GHz) + TurboBoost = 16 settings
  • Memory controller: the numactl library controls memory access
    • 2 memory controllers = 2 settings
  • Cores: two 8-core processors with hyper-threading = 32 settings
• Measurements
  • Power: a WattsUp meter provides total system power at 1 s intervals
  • Performance: each application reports its heart rate, which is application specific
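A quick sanity check of the configuration count quoted above (16 clock settings × 2 memory-controller settings × 32 core settings). The exact DVFS step values are an assumption; the slide gives only the 1.2-2.9 GHz range plus the TurboBoost setting.

```python
# Enumerate the configuration space and verify it has 1024 points.
from itertools import product

clocks = [round(1.2 + i * (2.9 - 1.2) / 14, 2) for i in range(15)] + ["turbo"]
mem_controllers = [1, 2]          # memory-access routing via numactl
cores = list(range(1, 33))        # 2 x 8 cores with hyper-threading

configs = list(product(clocks, mem_controllers, cores))
assert len(configs) == 16 * 2 * 32 == 1024
```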
Experimental Setup

• Benchmarks
  • 25 benchmarks from three suites (PARSEC, Minebench, Rodinia) and some others
• Baseline heuristics
  • Online: polynomial multivariate regression over configuration values, fit to the observed dataset (sketched below)
  • Offline: average over the rest of the applications to estimate the power and performance of the given application
  • Race-to-idle: allocate all resources to the application; once it finishes, the system goes idle
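A minimal sketch of the online baseline described above. The polynomial degree and feature construction are assumptions; the slides do not specify them.

```python
# Online baseline (sketch): fit a polynomial multivariate regression from
# configuration values to the observed samples, then predict all configurations.
import numpy as np
from itertools import product

def online_baseline(X_obs, y_obs, X_all, degree=2):
    def features(X):
        cols = [np.ones(len(X))]                  # constant term
        for powers in product(range(degree + 1), repeat=X.shape[1]):
            if 0 < sum(powers) <= degree:         # monomials up to `degree`
                cols.append(np.prod(X ** np.array(powers), axis=1))
        return np.column_stack(cols)
    # Least-squares fit; with too few samples the design matrix becomes
    # rank deficient (see the sensitivity discussion later).
    beta, *_ = np.linalg.lstsq(features(X_obs), y_obs, rcond=None)
    return features(X_all) @ beta
```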
Power and performance estimation

[Figures: estimated performance rate (in iter/s) and system power (in Watts) vs. configuration index]
Power and performance estimation

[Figures: estimates for Swish (web-search server) and x264 (video encoder)]
Summary: Performance estimation

Estimation accuracy (0 to 1, higher is better):

            LEO    Online   Offline
Kmeans      0.99   0.33     0.25
Jacobi      0.94   0.87     0.15
Overall     0.97   0.87     0.68
Summary: System-power estimation

Estimation accuracy (0 to 1, higher is better):

            LEO    Online   Offline
Overall     0.98   0.85     0.89
Summary: Energy savings

• Average energy relative to the optimal (over different utilizations and all benchmarks):
  • LEO: +6%
  • Online: +24%
  • Offline: +29%
  • Race-to-idle: +90%
Phase transitions

[Figure: performance and power for fluidanimate across phases with different computational demands]
Multiple Applications

Estimation accuracy of performance (in iter/s) and system power (in Watts) for different algorithms over mixtures of applications:

            Performance (iter/s)          System power (W)
            Mix 1   Mix 2   Overall       Mix 1   Mix 2   Overall
LEO         0.88    0.91    0.90          0.87    0.86    0.87
Online      0.87    0.82    0.85          0.82    0.80    0.81
Offline     0.39    0.86    0.63          0.73    0.73    0.73
Summary

[Figure: sensitivity analysis of LEO vs. Online as the number of samples varies]

LEO quickly reaches near-optimal estimates, whereas the online regression baseline cannot operate below 15 samples because its design matrix would be rank deficient (see the sketch below).
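A two-line illustration of the rank-deficiency point: with fewer observations than regression features, the least-squares design matrix cannot have full column rank. The 15-feature count here is illustrative.

```python
# Fewer samples than features -> rank-deficient design matrix.
import numpy as np

X = np.random.rand(10, 15)             # 10 samples, 15 polynomial features
print(np.linalg.matrix_rank(X))        # at most 10 < 15: underdetermined fit
```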
Related Work

• Offline optimization techniques (e.g., [59, 35, 33, 10, 2])
  • Limited by their reliance on a robust training phase.
• Online optimization techniques [44]
  • For example, Flicker is a configurable architecture and optimization framework that uses only online models to maximize performance under a power limit.
• ParallelismDial
  • Uses online adaptation to tailor parallelism to the application workload.