Characterizing, Optimizing, and Auto-Tuning Applications for Energy Efficiency


PhD Dissertation Proposal: Characterizing, Optimizing, and Auto-Tuning Applications for Energy Efficiency
Wei Wang
Committee: Dr. John Cavazos (Chair), Dr. Guang R. Gao, Dr. James Clause, Dr. Allan Porterfield
January 28, 2015

HPC Optimization Challenge
Table: Performance (TFLOP/s), power (kW), and energy efficiency (GFLOPS/W) of Top500/Green500 systems (MilkyWay-2, Titan, L-CSC) and a target exascale system.
Exascale computing requires more than a 20x improvement in GFLOPS/W.

Outline
1. Auto-Tuning and Optimization in Polyhedral Optimization Space
2. Optimization with Program Characterization and CPU Clock Modulation
3. Proposed Future Work

Auto-Tuning and Optimization in Polyhedral Optimization Space

Motivation
- Polyhedral optimization is effective for optimizing computational kernels.
- An accurate predictive performance model has already been derived.
- Question: does an effective predictive performance model also give an effective predictive energy model?

Adapting AutoTuning Framework for Program Characterization
Figure: Auto-Tuning framework for energy. Components: loop patterns, control flow graph, optimization sequences, source-to-source compiler, profiling counters, machine learning algorithms (SVM). (Refs: Park et al. CGO '11, CGO '12, IJPP '13, ICPP '14)


Energy Measurement using RCRTool
Sandy Bridge:
- Monitors the MSR energy counter
- 1000Hz+ update frequency
- Measures energy, computes power
Knights Corner:
- Built-in power measurement used (/sys/class/micras/power)
- 20Hz update frequency
- Measures power, computes energy
Figure: Simplified view of RCRtool energy monitoring
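The two measurement paths above differ in what is sampled and what is derived. A minimal sketch of both conversions follows. The function names are hypothetical, the trapezoidal rule is an assumption rather than RCRtool's documented method, and counter wraparound is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Integrate evenly spaced power samples (watts) into energy (joules)
 * with the trapezoidal rule. rate_hz is the sampling rate, e.g. 20 Hz
 * for a power file that updates 20 times per second. */
double energy_from_power(const double *watts, size_t n, double rate_hz)
{
    assert(rate_hz > 0.0);
    double joules = 0.0;
    for (size_t i = 1; i < n; i++)
        joules += 0.5 * (watts[i - 1] + watts[i]) / rate_hz;
    return joules;
}

/* Average power (watts) between two readings of a cumulative energy
 * counter (joules) taken dt_s seconds apart. */
double power_from_energy(double joules_before, double joules_after, double dt_s)
{
    assert(dt_s > 0.0);
    return (joules_after - joules_before) / dt_s;
}
```

Three samples of 100 W at 20 Hz span two 0.05 s intervals, so the first function returns 10 J for that input.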

Energy Measurement APIs
Figure: Original program
Figure: Program with energy profiling APIs added

Polyhedral Compilers
- Generate code variants of programs containing Static Control Parts (SCoPs) using PoCC (Polyhedral Compiler Collection): loop transformations and auto-parallelization (PLUTO)
- Tested applications: Polybench (existing); 2D Cardiac Wave Propagation Simulation and LULESH (new)

Exposing SCoP
Figure: Simplified version of the original and the transformed loop nest
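As a rough illustration of an original loop nest and one transformed variant, the sketch below pairs an affine matrix-multiply nest with a tiled version of the same computation. It is not the code from the figure; the kernel, `N`, and `TILE` are hypothetical choices, and a polyhedral compiler may emit quite different code.

```c
#include <assert.h>

#define N 8     /* hypothetical problem size */
#define TILE 4  /* hypothetical tile size; must divide N in this sketch */

/* Original nest: loop bounds and array subscripts are affine in the
 * loop counters, which is what makes the region a SCoP. */
void matmul_orig(int a[N][N], int b[N][N], int c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            c[i][j] = 0;
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
}

/* One possible variant: the same computation with all three loops tiled. */
void matmul_tiled(int a[N][N], int b[N][N], int c[N][N])
{
    assert(N % TILE == 0);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = 0;
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int kk = 0; kk < N; kk += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int j = jj; j < jj + TILE; j++)
                        for (int k = kk; k < kk + TILE; k++)
                            c[i][j] += a[i][k] * b[k][j];
}
```

Both functions produce the same result; only the iteration order, and therefore the cache behavior, differs.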

Profiling of Different Program Optimizations
Figure: Workflow of energy-aware polyhedral framework

Experimental Setup
Hardware:
- Intel Xeon E (dual socket, 8-core processor with 20MB cache)
- Xeon Phi coprocessor (61 cores, 1.09GHz, 512KB cache each)
Software:
- Polyhedral compilers: PoCC v1.2 and PolyOpt v0.2.
- Applications: Polybench v3.2 and LULESH v1.0 (OpenMP)
- Back-end compilers: GCC v4.4.6 and ICC v14.0.0

Energy Consumption and Execution Time Correlation (Polybench)
Figure: Energy (joules) and execution time (seconds) across program variants for (left) Polybench covariance and (right) Polybench 2mm.
The best optimizations for time are also the best for energy savings for these two Polybench applications.

Energy Consumption and Execution Time Correlation (Stencil Seidel2D Polybench and LULESH)
Figure: Energy (joules) and execution time (seconds) across program variants for (left) Seidel2D and (right) LULESH.
Jumps in Seidel2D energy usage (and drops in execution time) result from turning parallelization on.

Polyhedral Optimizations on a Realistic Application: 2D Cardiac Wave Propagation Simulation
Figure: Energy/performance improvement on the Sandy Bridge system: speedups and normalized energy savings by problem size.
Baseline: manual OpenMP implementation

Results on Xeon Phi for Cardiac Simulation
Figure: Speedups and normalized energy savings by problem size for the manual and PolyOpt versions.
Conclusion: The polyhedral approach is effective in optimizing the 2D Cardiac Wave Propagation Simulation.

Energy Consumption and Execution Time Correlation (2D Cardiac Wave Propagation Simulation)
Figure: Energy (joules) and execution time (seconds) across program variants. Left: Sandy Bridge. Right: Xeon Phi.
Conclusion: Reducing energy consumption is consistent with improving performance on both processors.

Conclusion
- Tuning for time can be used as a proxy for tuning for energy.
- Polyhedral optimizations are feasible for realistic applications.
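One simple way to test the time-as-proxy claim on a set of measured variants is to check whether the fastest variant is also the one that uses the least energy. The sketch below does only that; the function names are hypothetical and the check is deliberately weaker than a full correlation analysis.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the smallest element of v[0..n-1]; n must be at least 1. */
size_t argmin(const double *v, size_t n)
{
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (v[i] < v[best])
            best = i;
    return best;
}

/* Returns 1 if the fastest variant is also the lowest-energy variant,
 * 0 otherwise. seconds[i] and joules[i] describe the same variant. */
int fastest_is_most_efficient(const double *seconds, const double *joules,
                              size_t n)
{
    return argmin(seconds, n) == argmin(joules, n);
}
```

A return value of 0 for some application would flag a case where tuning for time alone picks a variant that is not the most energy-efficient one.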

Optimization with Program Characterization and CPU Clock Modulation

Motivation
- HPC energy optimizations focus on DVFS.
- DVFS is applied only at coarse granularity.
- Fine-grained energy control requires faster frequency transition techniques.

CPU Clock Modulation
- Write a specific value to the IA32_CLOCK_MODULATION MSR (0x19A)
- Modify /dev/cpu/{0..15}/msr with root privilege
- Invoke wrmsr inline assembly from applications using an added system call
Figure: CPU Clock Modulation. Sample modulation with 25% duty cycle. (Source: IA-32 Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide)
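A minimal sketch of the device-file route, assuming the Linux msr driver is loaded and the process has root privilege. The helper names `msr_path` and `write_msr` are hypothetical. This is not the added system call the slides describe, only the `/dev/cpu/<n>/msr` alternative, where the file offset selects the MSR address (0x19A for IA32_CLOCK_MODULATION in Intel's manuals).

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define IA32_CLOCK_MODULATION 0x19A

/* Build the msr device path for one logical CPU.
 * Returns the length reported by snprintf. */
int msr_path(int cpu, char *buf, size_t len)
{
    assert(cpu >= 0);
    return snprintf(buf, len, "/dev/cpu/%d/msr", cpu);
}

/* Write a 64-bit value to MSR `reg` on logical CPU `cpu`.
 * Returns 0 on success and -1 on any failure, for example when the
 * msr module is not loaded or the caller lacks privilege. */
int write_msr(int cpu, uint32_t reg, uint64_t value)
{
    char path[64];
    msr_path(cpu, path, sizeof path);
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    /* The msr driver uses the file offset as the MSR address. */
    ssize_t n = pwrite(fd, &value, sizeof value, (off_t)reg);
    close(fd);
    return n == (ssize_t)sizeof value ? 0 : -1;
}
```

A call such as `write_msr(0, IA32_CLOCK_MODULATION, 0x18)` would request a 50% duty cycle on CPU 0, and writing 0 restores full speed. Each open and write costs system-call time, which matters when the region being controlled is short.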

Available Frequencies

Duty Cycle Level   Binary    Decimal   Hexadecimal   Effective Frequency
1                  10001B    17        11H           6.25%
2                  10010B    18        12H           12.5%
3                  10011B    19        13H           18.75%
4                  10100B    20        14H           25%
5                  10101B    21        15H           31.25%
6                  10110B    22        16H           37.5%
7                  10111B    23        17H           43.75%
8                  11000B    24        18H           50%
9                  11001B    25        19H           56.25%
10                 11010B    26        1AH           62.5%
11                 11011B    27        1BH           68.75%
12                 11100B    28        1CH           75%
13                 11101B    29        1DH           81.25%
14                 11110B    30        1EH           87.5%
15                 11111B    31        1FH           93.75%

Writing 0 (00H) runs the core at 100% (full speed).
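The register values in the table follow a simple pattern: bit 4 is set and the low four bits hold the duty-cycle level, each level adding 6.25% of full speed. A small sketch of that encoding follows; the function names are hypothetical, and treating "level 16" as the full-speed case is an assumption made here for convenience.

```c
#include <assert.h>
#include <stdint.h>

/* Register value for a duty-cycle level.
 * Levels 1..15 set bit 4 and put the level in bits 3:0.
 * Level 16 (an assumption of this sketch) means full speed: write 0. */
uint64_t duty_level_to_msr(int level)
{
    assert(level >= 1 && level <= 16);
    if (level == 16)
        return 0;
    return (UINT64_C(1) << 4) | (uint64_t)level;
}

/* Effective frequency as a percentage of full speed: level / 16. */
double duty_level_to_percent(int level)
{
    assert(level >= 1 && level <= 16);
    return level * 6.25;
}
```

For example, level 8 encodes to 24 (18H) and corresponds to 50% of full speed.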

Benchmarks and Experimental Setup
Benchmarks:
- LULESH: hydrodynamics
- miniFE from the Mantevo Project: implicit finite-element application
- brdr2d: 2D Cardiac Wave Propagation Simulation
- Polybench: 30 computational kernels
Hardware/Software Setup:
- Intel Xeon E (dual socket, 8-core processor with 20MB LLC, 2.7GHz)
- Linux with ACPI and MSR modules
- Intel ICC v14.0.2 with -O3
- RCRdaemon taskset to core 6

Loops with High Memory Access
Figure: Normalized metrics (including EDP) versus duty cycle (clock skipping), from 100% down to 6.25%. (a) LULESH with DCM. (b) jacobi-2d Polybench.
Energy is reduced and EDP lowered with very low performance impact.
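EDP in these plots is read here as the standard energy-delay product, energy multiplied by execution time, normalized to the full-speed run. A minimal sketch under that assumption, with hypothetical function names:

```c
#include <assert.h>

/* Energy-delay product: energy (joules) times execution time (seconds). */
double edp(double joules, double seconds)
{
    return joules * seconds;
}

/* Metric relative to a baseline such as the 100% duty-cycle run. */
double normalized(double value, double baseline)
{
    assert(baseline != 0.0);
    return value / baseline;
}
```

With made-up numbers, a run that uses 80 J in 2.2 s against a baseline of 100 J in 2.0 s has a normalized EDP of 176 / 200 = 0.88, so a modest slowdown is outweighed by the energy saving.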

Loops with Low Memory Access (More Computation)
Figure: Normalized metrics (including EDP) versus duty cycle (clock skipping), from 100% down to 6.25%. (a) LULESH. (b) miniFE.
Energy is reduced, but with high performance impact, resulting in higher EDP.

Loops with Balanced Memory Access and Computation
Figure: Normalized metrics (including EDP) versus duty cycle (clock skipping), from 100% down to 6.25%. (a) LULESH. (b) miniFE.
Energy is reduced and EDP lowered with relatively low performance impact.

Polybench Loops
Figure: Normalized metrics (including EDP and frequency) for Polybench programs running at the best non-full speed setting. Benchmarks: durbin, adi, jacobi-2d-imper, fdtd-2d, lu, jacobi-1d-imper, cholesky, mvt, floyd-warshall, gemver, gesummv, covariance, gemm, gramschmidt, ludcmp, dynprog, 2mm, 3mm, syrk, trmm, correlation, doitgen, fdtd-apml, trisolv, atax, bicg, seidel-2d, reg-detect, syr2k, symm.
Loops have different energy characteristics in response to frequency changes.

Memory Access Density vs. Three Types of Loops
Figure: Normalized EDP and Memory Access Density (MAD) value for each of the 30 Polybench benchmarks.
Memory Access Density could be used as a loop type indicator.

Loop Characterization Summary

Loop Type    Characteristics
Mem-Int.     Power reduced and EDP significantly lowered.
Comp-Int.    Full frequency required.
Balanced     Saves energy with relatively low performance degradation.
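If memory access density is used as the loop type indicator, the three classes in the summary could be assigned by comparing a loop's MAD value against two cut-offs. The sketch below is purely illustrative: the slides do not give thresholds, so the caller supplies both `low_cut` and `high_cut`, and all names are hypothetical.

```c
#include <assert.h>

typedef enum {
    LOOP_COMPUTE_INTENSIVE, /* full frequency required */
    LOOP_BALANCED,          /* mild clock modulation may pay off */
    LOOP_MEMORY_INTENSIVE   /* aggressive clock modulation may pay off */
} loop_type;

/* Classify a loop by its memory access density.
 * low_cut and high_cut are caller-chosen thresholds, not values from
 * the source; they would have to be calibrated on measured loops. */
loop_type classify_loop(double mad, double low_cut, double high_cut)
{
    assert(low_cut <= high_cut);
    if (mad >= high_cut)
        return LOOP_MEMORY_INTENSIVE;
    if (mad <= low_cut)
        return LOOP_COMPUTE_INTENSIVE;
    return LOOP_BALANCED;
}
```

Keeping the thresholds as parameters makes it possible to recalibrate them per machine without changing the classifier.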

Hybrid Execution of Multi-loop Applications
- Add energy control APIs around loops.
- Fine-grain loop regions require fast machine power-state transitions to avoid overhead.

    while (condition) {
        ...
        setlowfrequency();
        for (i=0; i<n; i++) {... }
        resetfrequency();
        for (j=0; j<n; j++) {... }
    }

LULESH Results
Table: Comparison of execution time, energy consumption, and EDP for LULESH

Version     Duty Cycle Level
minT        100%
minE        56.25%
minEDP      81.25%
Hybrid 1    100% & 50%
Hybrid 2    100% & 62.5%
Hybrid 3    100% & 68.75%

miniFE Results
Table: Comparison of execution time, energy consumption, and EDP for miniFE

Version     Duty Cycle Level
minT        100%
minE        62.5%
minEDP      81.25%
Hybrid 1    100% & 81.25%
Hybrid 2    100% & 87.5%
Hybrid 3    100% & 93.75%

Fine-grained DVFS vs. Fine-grained DCM (Entire Application)
Figure: Normalized metrics (including EDP) for the hybrid versions H-DCM-50%, H-DCM-62.5%, H-DCM-68.75%, H-DVFS-2.6GHz, H-DVFS-2.5GHz, H-DVFS-2.4GHz, H-DVFS-2.3GHz, H-DVFS-2.2GHz, and H-DVFS-2.1GHz.

DVFS vs. DCM: Another Comparison
Measure and compare the execution time and power of Loop1 and Loop2 with and without energy control APIs.

    while (condition) {
        ...
        // Loop1
        MemLoop();
        // Loop2
        CompLoop();
        OtherLoops();
    }

versus

    while (condition) {
        ...
        setfrequency();
        MemLoop();
        resetfrequency();
        CompLoop();
        OtherLoops();
    }

Power Transition Overhead
Figure: Normalized metrics (including power) for Loop1 (MemLoop) and Loop2 (CompLoop) under DVFS-2.4GHz, DVFS-2.1GHz, DVFS-1.8GHz, DCM-68.75%, DCM-62.5%, and DCM-50%.
DVFS energy control is not synchronized with fine-grain loops, but DCM is.

Fine-Grain vs. Coarse-Grain DVFS/DCM Control of Loop1 (MemLoop)
Figure: Normalized fine-grain (FG) and coarse-grain (CG) metrics, including power, for the control settings DCM-50%, DCM-62.5%, DCM-68.75%, DVFS-1.8GHz, DVFS-2.1GHz, and DVFS-2.4GHz.
- DCM fine-grain energy control is almost identical to coarse-grain control.
- DVFS overhead is larger than DCM overhead.

Clock Skipping with Concurrency Throttling
- Concurrency throttling mitigates resource contention.
- Clock modulation reduces idle-state power.
Figure: fdtd-2d Polybench with concurrency throttling and clock skipping, plotted over number of threads and clock-skipping level. (c) Minimum occurs at (75%, 4). (d) Minimum occurs at (100%, 6).
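The minima quoted for these plots come from sweeping two knobs, thread count and duty cycle, and picking the best cell. A minimal exhaustive-search sketch over a measured grid follows. The `config` type, the function name, and the row-major layout are assumptions of this sketch, and the metric can be energy, time, or EDP.

```c
#include <assert.h>
#include <stddef.h>

/* Position of the best cell in a (threads x duty-cycle) sweep. */
typedef struct {
    size_t thread_idx; /* row: index into the list of thread counts */
    size_t duty_idx;   /* column: index into the list of duty cycles */
} config;

/* Find the cell with the smallest metric value.
 * metric is row-major: metric[t * n_duty + d]. */
config best_config(const double *metric, size_t n_threads, size_t n_duty)
{
    assert(n_threads > 0 && n_duty > 0);
    config best = {0, 0};
    double best_val = metric[0];
    for (size_t t = 0; t < n_threads; t++)
        for (size_t d = 0; d < n_duty; d++) {
            double v = metric[t * n_duty + d];
            if (v < best_val) {
                best_val = v;
                best.thread_idx = t;
                best.duty_idx = d;
            }
        }
    return best;
}
```

Exhaustive search is affordable here because the grid is small: a handful of thread counts times at most sixteen duty-cycle settings.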

LULESH
Figure: LULESH results over number of threads and clock-skipping level. (a) Minimum occurs at (75%, 6). (b) Minimum occurs at (100%, 8).

Version    # of Threads    Duty Cycle Level
Default    16              100%
CT         6               100%
CT+CS      6               75%

miniFE
Figure: miniFE results over number of threads and clock-skipping level. (a) Minimum occurs at (75%, 8). (b) Minimum occurs at (100%, 4).

Version    # of Threads    Duty Cycle Level
Default    16              100%
CT         10              100%
CT+CS

When Concurrency Throttling is Not Beneficial
Figure: brdr2d results when applying both concurrency throttling and clock skipping. (a) Minimum occurs at (75%, 6). (b) Minimum occurs at (100%, 6).

Concurrency Throttling and Memory Access Density
Figure: Normalized energy/EDP and number of threads for each of the 30 Polybench benchmarks.
Loops with a high Mem-value tend to benefit from concurrency throttling.

Conclusion
1. Hybrid execution of OpenMP loops with clock modulation can achieve better energy efficiency.
2. Concurrency throttling can be combined with clock modulation to save more energy.

Proposed Future Work

Future Work
1. Build predictive energy/performance model (May 2015)
2. Enhance results with frequency/threads configuration (Aug. 2015)

Future Work
- Power Impacts of Polyhedral Optimizations (Dec. 2015)
  Figure: Covariance Polybench program variants with and without loop tiling, showing power (watts) and execution time (seconds).
- Extending DCM Optimization Technique (Dec. 2015)
  - MPI/OpenMP programs
  - Runtime control
  - Automating frequency/number-of-threads configuration
