A Probabilistic Graphical Model-based Approach for Minimizing Energy under Performance Constraints
Nikita Mishra, Huazhe Zhang, John Lafferty and Hank Hoffmann
University of Chicago

[Figure: average CPU utilization of more than 5,000 servers during a 6-month period; fraction of time vs. CPU utilization [1]]
[1] Barroso, Luiz André, and Urs Hölzle. "The Case for Energy-Proportional Computing." IEEE Computer 40.12 (2007): 33-37.

Example of a configuration space
[Figure: a machine exposing a 2.26 GHz clock speed, multiple cores, and three memory controllers]

Adaptive systems
• Automatically tune the configuration for different utilizations to reach the most energy-efficient state.
• This requires the power and performance profile of the application.

Why is it a difficult problem?
• The configuration space can be quite large; exploring it by brute force may take a long time.
• Each application behaves differently on different machines.
• An application's behavior can even vary with its input, e.g.
the video streaming application x264.

Example: streamcluster
[Figure: contour plot of the performance rate (in iter/s) for the streamcluster benchmark across configurations (cores vs. clock speed); the surface has multiple local optima]

Example: kmeans
[Figure: Pareto frontier of performance rate (in iter/s) vs. system power (in Watts) across configurations; the optimal configurations lie on this frontier]

LEO (Learning for Energy Optimization)
• Estimates the target application's profile by incorporating performance profiles of previously seen applications (historical data).

Example: kmeans
[Figure: estimated Pareto-optimal frontiers of performance rate (in iter/s) vs. configuration index, compared against the true frontier found by exhaustive search]

Outline
• Motivation/Overview
• Statistical modelling: graphical models, hierarchical Bayesian model, expectation-maximization algorithm
• Evaluation
• Summary

Graphical Models
[Figure: hidden nodes z1, z2, ..., zM-1, zM above observed nodes y1, y2, ..., yM-1, yM]
yi: vector of performance rates of the ith application across configurations.
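The Pareto-frontier idea in the kmeans example above can be sketched in a few lines. This is only an illustration with made-up numbers, not the paper's optimizer: `pareto_frontier` keeps every configuration that no other configuration beats on both performance and power, and `pick_config` mimics a controller choosing the least-power configuration that still meets a performance target.

```python
import numpy as np

def pareto_frontier(rate, power):
    """Indices of Pareto-optimal configurations: no other configuration
    is both faster at no extra power, or as fast at strictly less power."""
    keep = []
    for i in range(len(rate)):
        dominated = (np.any((rate > rate[i]) & (power <= power[i])) or
                     np.any((rate >= rate[i]) & (power < power[i])))
        if not dominated:
            keep.append(i)
    return keep

def pick_config(rate, power, target_rate):
    """Least-power configuration meeting the performance target."""
    feasible = [i for i in range(len(rate)) if rate[i] >= target_rate]
    return min(feasible, key=lambda i: power[i]) if feasible else None

# Made-up measurements for four configurations.
rate = np.array([1.0, 2.0, 3.0, 2.5])       # iter/s
power = np.array([50.0, 60.0, 90.0, 95.0])  # Watts
frontier = pareto_frontier(rate, power)  # configs 0, 1, 2; config 3 is dominated
best = pick_config(rate, power, 1.5)     # config 1: cheapest meeting 1.5 iter/s
```

Config 3 (2.5 iter/s at 95 W) is dominated by config 2, which is faster at lower power; the other three form the frontier.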
Hierarchical Bayesian Model
[Figure: the same graphical model; hidden nodes z1, ..., zM couple the applications. y1, ..., yM-1 are fully observed data from previously seen applications, while yM, the target application, is only partially observed]
• The hierarchical prior couples each of the applications.
• It also penalizes large variations within an application.
• zM is the true (unobserved) value of the target application's profile.
yi: vector of performance rates of the ith application across configurations.
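One way to read the hierarchical model above: because the hidden profiles zi share a common Gaussian prior, a target application's full profile can be estimated from just a few observed configurations by Gaussian conditioning. The sketch below is illustrative only; the function name, the toy prior, and the noise variance are assumptions, not the paper's exact estimator.

```python
import numpy as np

def estimate_profile(mu, Sigma, obs_idx, y_obs, noise_var=1e-3):
    """Posterior mean of a target application's full profile, obtained by
    conditioning a shared Gaussian prior on a few observed configurations."""
    O = np.asarray(obs_idx)
    S_oo = Sigma[np.ix_(O, O)] + noise_var * np.eye(len(O))  # observed block + noise
    S_ao = Sigma[:, O]                                       # cross-covariance
    return mu + S_ao @ np.linalg.solve(S_oo, y_obs - mu[O])

# Toy prior over 3 configurations, as if learned from historical applications.
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[1.0, 0.8, 0.6],
                  [0.8, 1.0, 0.8],
                  [0.6, 0.8, 1.0]])

# Observe the target application at configuration 0 only; the correlations
# in Sigma propagate that single sample to the unobserved configurations.
z_hat = estimate_profile(mu, Sigma, [0], np.array([2.0]))
```

The observed configuration's estimate stays near its measurement (2.0), while the unobserved entries are pulled up from the prior mean in proportion to their correlation with it.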
Expectation-Maximization Algorithm
• Initialize the model parameters Θ and the latent variables.
• E-step: from the observed data and current parameters, create the expected log-likelihood function.
• M-step: maximize the expected log-likelihood to obtain Θnew.
• Repeat the E- and M-steps until convergence.

Example: kmeans
[Figure: successive EM iterations (initialization through iteration 4) of the estimated performance rate (in iter/s) vs. cores, fit to a handful of observed samples]

LEO (Learning for Energy Optimization)
• Set yM = observed power; LEO returns p = estimated power.
• Set yM = observed performance; LEO returns r = estimated performance.
• Feedback loop: the controller selects a configuration, and the new observations update LEO's estimates.
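The E/M loop above can be made concrete for one simple case: estimating the mean and covariance of a Gaussian over configuration profiles when some measurements are missing. The E-step completes the missing entries with their conditional expectations (tracking the conditional covariance), and the M-step re-estimates the parameters. This is a generic textbook sketch, not the paper's exact likelihood.

```python
import numpy as np

def em_gaussian(Y, n_iter=20, reg=1e-6):
    """EM for the mean and covariance of a Gaussian when some entries of
    each row of Y (one row per application) are missing, marked np.nan."""
    M, d = Y.shape
    mu, Sigma = np.nanmean(Y, axis=0), np.eye(d)
    for _ in range(n_iter):
        Z = np.zeros_like(Y)      # E-step: expected completions of Y
        C = np.zeros((d, d))      # accumulated conditional covariance
        for i in range(M):
            o = ~np.isnan(Y[i])   # observed entries of this row
            m = ~o                # missing entries
            Z[i, o] = Y[i, o]
            if m.any():
                S_oo = Sigma[np.ix_(o, o)] + reg * np.eye(o.sum())
                S_mo = Sigma[np.ix_(m, o)]
                K = S_mo @ np.linalg.inv(S_oo)
                Z[i, m] = mu[m] + K @ (Y[i, o] - mu[o])      # conditional mean
                C[np.ix_(m, m)] += Sigma[np.ix_(m, m)] - K @ S_mo.T
        mu = Z.mean(axis=0)       # M-step: re-estimate parameters
        D = Z - mu
        Sigma = (D.T @ D + C) / M + reg * np.eye(d)
    return mu, Sigma, Z

# Toy data: 4 applications, 2 configurations, two missing measurements.
Y = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [np.nan, 4.0],
              [3.0, 6.0]])
mu, Sigma, Z = em_gaussian(Y)   # imputes the nans from the learned correlation
```

On this toy set the second configuration's rate is twice the first, so the missing entries are imputed consistently with that relationship.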
Outline
• Motivation/Overview
• Statistical modelling
• Evaluation: experimental setup; power and performance estimation; energy savings/phase transitions
• Summary

Experimental Setup
• Platform: dual-socket Linux 3.2.0 system with a SuperMICRO X9DRL-iF motherboard and two Intel Xeon E5-2690 processors.
• Configurations (1024 in total):
• Clock speed: set using the cpufrequtils package; 15 DVFS settings (1.2 to 2.9 GHz) plus TurboBoost = 16 settings.
• Memory controller: access is controlled with the numactl library;
2 memory controllers = 2 settings.
• Cores: two 8-core processors with hyper-threading = 32 settings.
• Measurements:
• Power: a WattsUp meter reports total system power at 1 s intervals.
• Performance: each application reports its heart rate, an application-specific performance metric.

Experimental Setup
• Benchmarks: 25 benchmarks drawn from the PARSEC, Minebench and Rodinia suites, plus some others.
• Baseline heuristics:
• Online algorithm: polynomial multivariate regression over configuration values, fit to the observed samples.
• Offline algorithm: the average over the remaining applications estimates the power and performance of the given application.
• Race-to-idle: allocate all resources to the application; once it finishes, the system goes idle.

Outline
• Motivation/Overview
• Statistical modelling
• Evaluation: experimental setup; power and performance estimation; energy savings/phase transitions
• Summary

Power and performance estimation
[Figure: performance rate (in iter/s) and system power (in Watts) vs. configuration index]
[Figure: estimates for Swish (a search web server) and x264 (a video encoder)]

Summary: performance estimation (accuracy)
benchmark   LEO    Online   Offline
kmeans      0.99   0.33     0.25
jacobi      0.94   0.87     0.15
Overall     0.97   0.87     0.68

Summary: system-power estimation (accuracy)
            LEO    Online   Offline
Overall     0.98   0.85     0.89

Summary: energy savings
Average energy compared with the optimal (over different utilizations and all benchmarks):
• LEO: +6%
• Online: +24%
• Offline: +29%
• Race-to-idle: +90%

Phase transitions
[Figure: performance and power for fluidanimate across phases with different computational demands]
Multiple Applications
Comparison of performance (in iter/s) and system-power (in Watts) estimation accuracy for the different algorithms on mixtures of applications:

           Performance                  System power
           Mixture 1  Mixture 2  Overall   Mixture 1  Mixture 2  Overall
LEO        0.88       0.91       0.90      0.87       0.86       0.87
Online     0.87       0.82       0.85      0.82       0.80       0.81
Offline    0.39       0.86       0.63      0.73       0.73       0.73

Summary
Sensitivity analysis of LEO vs. Online: LEO quickly reaches near-optimal estimates, while the online-regression baseline cannot produce any estimate from fewer than 15 samples, because with so few observations the regression's design matrix is rank-deficient.

Related Work
• Offline optimization techniques (e.g., [59, 35, 33, 10, 2]) are limited by their reliance on a robust training phase.
• Online optimization techniques [44]: for example, Flicker is a configurable architecture and optimization framework that uses only online models to maximize performance under a power limitation.
• ParallelismDial uses online adaptation to tailor parallelism to the application workload.
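The rank-deficiency point in the sensitivity analysis can be made concrete. Below is a sketch of the online baseline: a quadratic multivariate regression over two configuration knobs, fit by least squares. All sample values are invented; the design matrix has full column rank only once enough well-spread samples are observed, which is why an online regression cannot produce estimates from very few samples.

```python
import numpy as np

def poly_design(cores, clock):
    """Quadratic multivariate design matrix over two configuration knobs."""
    c = np.asarray(cores, dtype=float)
    f = np.asarray(clock, dtype=float)
    return np.column_stack([np.ones_like(c), c, f, c * f, c**2, f**2])

# Invented profiling samples: a 3x3 grid of (cores, clock GHz) settings.
cores = [1, 1, 1, 2, 2, 2, 4, 4, 4]
clock = [1.2, 2.0, 2.9] * 3
rate = [1.0, 1.6, 2.2, 1.9, 3.1, 4.2, 3.5, 5.9, 8.1]  # measured iter/s (made up)

X = poly_design(cores, clock)
beta, *_ = np.linalg.lstsq(X, np.array(rate), rcond=None)  # fit the model

# Enough well-spread samples -> full column rank; only a few samples ->
# a rank-deficient design, so the regression is underdetermined.
full_rank = np.linalg.matrix_rank(X)       # 6 = number of coefficients
few_rank = np.linalg.matrix_rank(X[:3])    # 3 < 6: cannot fit yet
```

With three samples the design has at most rank 3 against 6 unknown coefficients, so infinitely many polynomials fit the data equally well; the hierarchical model avoids this by borrowing statistical strength from previously seen applications.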
© Copyright 2024