Characterizing, Optimizing, and Auto-Tuning Applications for Energy Efficiency
1 PhD Dissertation Proposal: Characterizing, Optimizing, and Auto-Tuning Applications for Energy Efficiency. Wei Wang. The Committee: Chair: Dr. John Cavazos; Member: Dr. Guang R. Gao; Member: Dr. James Clause; Member: Dr. Allan Porterfield. January 28, 2015
2 HPC Optimization Challenge. Table: Performance, power, and energy efficiency of Top500/Green500 and exascale systems. Columns: System Name, Performance (TFLOP/s), Power (KW), GFLOPS/W. Rows: Exascale System, MilkyWay-2, Titan, L-CSC (the numeric cells are garbled in this transcription). Exascale computing requires more than a 20x improvement in GFLOPS/Watt.
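The efficiency column of the table is performance divided by power. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the example inputs in the note below (1,000,000 TFLOP/s within 20,000 KW, the commonly cited exascale target) are an assumption, not values recovered from the slide.

```c
#include <assert.h>

/* Energy efficiency in GFLOPS per watt, from performance in TFLOP/s
 * and power in kilowatts. */
double gflops_per_watt(double tflops, double kilowatts)
{
    assert(kilowatts > 0.0);
    /* 1 TFLOP/s = 1000 GFLOP/s and 1 KW = 1000 W, so the factors cancel. */
    return (tflops * 1000.0) / (kilowatts * 1000.0);
}

/* Factor by which a system's efficiency must grow to reach a target. */
double required_improvement(double current_gflops_w, double target_gflops_w)
{
    assert(current_gflops_w > 0.0);
    return target_gflops_w / current_gflops_w;
}
```

Under the assumed target, gflops_per_watt(1000000.0, 20000.0) evaluates to 50 GFLOPS/W, and required_improvement() gives the gap for any measured system.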
3 Outline: 1. Auto-Tuning and Optimization in Polyhedral Optimization Space; 2. Optimization with Program Characterization and CPU Clock Modulation; 3. Proposed Future Work
5 Motivation: Polyhedral optimization is effective for optimizing computational kernels. Accurate predictive performance models have been derived. Does an effective predictive performance model imply an effective predictive energy model?
6 Adapting AutoTuning Framework for Energy. Diagram components: Program Characterization (Loop Pattern, Control Flow Graph), Optimization Sequences, Src-to-Src Compiler, Profiling Counters, Machine Learning Algorithms (SVM). Figure: Auto-Tuning framework for energy. (Refs: Park et al. CGO 11, CGO 12, IJPP 13, ICPP 14)
7 Energy Measurement using RCRTool. Sandy Bridge: monitors MSR counter; 1000Hz+ update frequency; measures energy, computes power. Knights Corner: built-in power measurement used (/sys/class/micras/power); 20Hz update frequency; measures power, computes energy. Figure: Simplified view of RCRtool energy monitoring
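The two platforms derive the missing quantity in opposite directions. The sketch below shows both conversions under simplifying assumptions: readings are already in joules and watts, sampling is uniform, and counter wraparound is ignored. The function names are hypothetical and are not RCRTool APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Sandy Bridge style: the counter reports accumulated energy, so the
 * average power over an interval is the energy difference divided by
 * the elapsed time. */
double power_from_energy(double joules_start, double joules_end, double seconds)
{
    assert(seconds > 0.0);
    return (joules_end - joules_start) / seconds;
}

/* Knights Corner style: the sensor reports instantaneous power, so energy
 * is the sum of the power samples times the (assumed uniform) sampling
 * period. */
double energy_from_power(const double *watts, size_t n, double period_seconds)
{
    double joules = 0.0;
    for (size_t i = 0; i < n; i++)
        joules += watts[i] * period_seconds;
    return joules;
}
```

A lower update frequency means a longer sampling period, so the energy estimate computed from power samples is coarser than one read directly from an energy counter.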
8 Energy Measurement APIs. Figure: Original program. Figure: Program with energy profiling APIs added.
9 Polyhedral Compilers. Generate code variants of programs containing Static Control Parts (SCoP) using PoCC (Polyhedral Compiler Collection): loop transformations; auto parallelization (PLUTO). Tested applications: existing: Polybench; new: 2D Cardiac Wave Propagation Simulation, LULESH.
10 Exposing SCoP. Figure: Simplified version of the original and the transformed loop nest
11 Profiling of Different Program Optimizations. Figure: Workflow of energy-aware polyhedral framework
12 Experimental Setup. Hardware: Intel Xeon E (dual socket 8-core processor with 20MB cache); Xeon Phi coprocessor (61 cores, 1.09GHz, 512KB cache each). Software: Polyhedral compilers: PoCC v1.2 and Polyopt v0.2. Application: Polybench v3.2 and LULESH v1.0 (OpenMP). Back-end compilers: GCC v4.4.6 and ICC v14.0.0
13 Energy Consumption and Execution Time Correlation (Polybench). Figure: energy (joules) and execution time (seconds) across program variants for Covariance Polybench and 2mm Polybench. The best optimizations for time are also the best for energy savings for these two Polybench applications.
14 Energy Consumption and Execution Time Correlation (Stencil Seidel2D Polybench and LULESH). Figure: energy (joules) and execution time (seconds) across program variants for Seidel2D and LULESH. Jumps in Seidel2D energy usage (and decreased execution time) are results of turning parallelization on.
15 Polyhedral Optimizations on a Realistic Application: 2D Cardiac Wave Propagation Simulation. Energy/performance improvement on the Sandy Bridge system. Figure: speedups and normalized energy savings versus problem size. Baseline: manual OpenMP implementation
16 Results on Xeon Phi for Cardiac Simulation. Figure: speedups (Manual, Polyopt) and normalized energy savings versus problem size. Conclusion: Polyhedral approach is effective in optimizing the 2D Cardiac Wave Propagation Simulation.
17 Energy Consumption and Execution Time Correlation (2D Cardiac Wave Propagation Simulation). Figure: energy (joules) and execution time (seconds) across program variants. Left: Sandy Bridge. Right: Xeon Phi. Conclusion: Saving energy consumption is consistent with improving performance on both processors
18 Conclusion: Tuning for time can be used as a proxy for tuning for energy. Polyhedral optimizations for realistic applications are possible.
19 Outline: 1. Auto-Tuning and Optimization in Polyhedral Optimization Space; 2. Optimization with Program Characterization and CPU Clock Modulation; 3. Proposed Future Work
20 Motivation: HPC energy optimizations focus on DVFS. DVFS is only applied in coarse-grain cases. Fine-grained energy control requires faster frequency transition techniques.
21 CPU Clock Modulation. Write a specific value to the IA32_CLOCK_MODULATION (0x19A) MSR: modify /dev/cpu/{0:15}/msr with root privilege, or invoke wrmsr inline assembly from applications using an added system call. Figure: CPU Clock Modulation. Sample Modulation with 25% Duty Cycle. (Source: IA-32 Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide)
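The first route named on the slide, writing the MSR through the Linux msr device file, can be sketched as follows. This assumes the standard Linux msr driver, which exposes one file per core at /dev/cpu/N/msr and uses the file offset as the register address. The helper names are hypothetical and this is not the author's implementation.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define IA32_CLOCK_MODULATION 0x19A  /* MSR address named on the slide */

/* Open the msr device of one core (needs root and the msr module).
 * Returns a file descriptor, or -1 on failure. */
int msr_open(int cpu)
{
    char path[64];
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    return open(path, O_RDWR);
}

/* Write a 64-bit value to register `reg`; the register address is passed
 * as the file offset. Returns 0 on success, -1 on failure. */
int msr_write(int fd, uint32_t reg, uint64_t value)
{
    ssize_t n = pwrite(fd, &value, sizeof value, (off_t)reg);
    return n == (ssize_t)sizeof value ? 0 : -1;
}

/* Read a 64-bit value from register `reg`. Returns 0 on success. */
int msr_read(int fd, uint32_t reg, uint64_t *value)
{
    assert(value != NULL);
    ssize_t n = pread(fd, value, sizeof *value, (off_t)reg);
    return n == (ssize_t)sizeof *value ? 0 : -1;
}
```

A caller would open each core it wants to throttle with msr_open(), write the chosen duty-cycle encoding with msr_write(fd, IA32_CLOCK_MODULATION, value), and later write 0 to return to full speed. Each core has its own register, so the write is repeated per core.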
22 Available Frequencies
Duty Cycle Level  Binary  Decimal  Hexadecimal  Effective Frequency
1                 10001B  17       11H          6.25%
2                 10010B  18       12H          12.5%
3                 10011B  19       13H          18.75%
4                 10100B  20       14H          25%
5                 10101B  21       15H          31.25%
6                 10110B  22       16H          37.5%
7                 10111B  23       17H          43.75%
8                 11000B  24       18H          50%
9                 11001B  25       19H          56.25%
10                11010B  26       1AH          62.5%
11                11011B  27       1BH          68.75%
12                11100B  28       1CH          75%
13                11101B  29       1DH          81.25%
14                11110B  30       1EH          87.5%
15                11111B  31       1FH          93.75%
--                00000B  0        00H          100%
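The table follows a simple pattern: each register value is the duty-cycle level in the low four bits with bit 4 set as the enable flag, and the effective frequency is level/16 of full speed. A small sketch of that mapping, with hypothetical function names and level 0 used here to mean modulation off:

```c
#include <assert.h>
#include <stdint.h>

#define CLOCK_MOD_ENABLE 0x10u  /* bit 4: on-demand clock modulation enable */

/* Map a duty-cycle level (1..15) to the register value in the table.
 * Level 0 is used in this sketch to mean "modulation off" (value 0). */
uint64_t duty_level_to_msr(unsigned level)
{
    assert(level <= 15u);
    return level == 0u ? 0u : (uint64_t)(CLOCK_MOD_ENABLE | level);
}

/* Effective frequency as a percentage of full speed: level/16. */
double duty_level_to_percent(unsigned level)
{
    assert(level <= 15u);
    return level == 0u ? 100.0 : 100.0 * (double)level / 16.0;
}
```

For example, level 8 maps to 0x18 (decimal 24) and 50%, matching the corresponding table row.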
23 Benchmarks and Experimental Setup. 1. Benchmarks: LULESH: hydrodynamics; minife from Mantevo Project: implicit finite-element application; brdr2d: 2D Cardiac Wave Propagation Simulation; Polybench: 30 computational kernels. 2. Hardware/Software Setup: Intel Xeon E (dual socket, 8-core processor with 20MB LLC, 2.7GHz); Linux with ACPI and MSR modules; Intel ICC v14.0.2 with -O3; RCRdaemon taskset to core 6
24 Loops with High Memory Access. Figure: normalized metrics (including EDP) versus duty cycle (clock skipping) from 6.25% to 100%: (a) LULESH with DCM, (b) jacobi-2d Polybench. Energy reduced and EDP lowered with very low performance impact.
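The plots report metrics normalized to the full-speed run. The slides do not spell out the EDP formula, so the sketch below assumes the usual definition, energy multiplied by execution time. The function names and the numbers in the note are illustrative, not measurements from the deck.

```c
#include <assert.h>

/* Energy-delay product: energy (joules) times execution time (seconds). */
double edp(double joules, double seconds)
{
    return joules * seconds;
}

/* Normalize a metric against the full-speed (100% duty cycle) run. */
double normalized(double value, double baseline)
{
    assert(baseline > 0.0);
    return value / baseline;
}
```

With illustrative values, a throttled run using 80 J in 11 s against a full-speed run using 100 J in 10 s has a normalized EDP below 1, which is the pattern the slide describes for memory-bound loops: the energy saving outweighs the slowdown.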
26 Loops with Low Memory Access (More Computation). Figure: normalized metrics (including EDP) versus duty cycle (clock skipping) from 6.25% to 100%: (a) LULESH, (b) minife. Energy reduced with high performance impact, resulting in higher EDP.
28 Loops with Balanced Memory Access and Computation. Figure: normalized metrics (including EDP) versus duty cycle (clock skipping) from 6.25% to 100%: (a) LULESH, (b) minife. Energy reduced and EDP lowered with relatively low performance impact.
30 Polybench Loops. Figure: normalized metrics (including EDP) and frequency for Polybench programs running at the best non-full speed setting. Benchmarks: durbin, adi, jacobi-2d-imper, fdtd-2d, lu, jacobi-1d-imper, cholesky, mvt, floyd-warshall, gemver, gesummv, covariance, gemm, gramschmidt, ludcmp, dynprog, 2mm, 3mm, syrk, trmm, correlation, doitgen, fdtd-apml, trisolv, atax, bicg, seidel-2d, reg-detect, syr2k, symm. Loops have different energy characteristics responding to frequency changes.
32 Memory Access Density vs. Three Types of Loops. Figure: normalized EDP and Memory Access Density (MAD) value for each of the 30 Polybench benchmarks. Memory Access Density could be used as a loop type indicator.
34 Loop Characterization Summary
Loop Type   Characteristics
Mem-Int.    Power reduced and EDP significantly lowered.
Comp-Int.   Full frequency required.
Balanced    Saves energy with relatively low performance degradation.
35 Hybrid Execution of Multi-loop Applications. Adding energy control APIs around loops. Fine-grain loop regions require fast machine power-state transition to avoid overhead.
    while (condition) {
        ...
        setlowfrequency();
        for (i=0; i<n; i++) { ... }
        resetfrequency();
        for (j=0; j<n; j++) { ... }
    }
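The pattern on this slide can be exercised without hardware access by replacing the two control calls with stubs. In the sketch below, setlowfrequency() and resetfrequency() keep the slide's names but only record state; a real version would write the clock-modulation register. The loop bodies, the 50% setting, and the counters are stand-ins chosen for illustration.

```c
#include <assert.h>

int current_duty_percent = 100;  /* stand-in for the hardware state */
int transitions = 0;             /* number of power-state changes made */

/* Stub: a real version would program a reduced duty cycle. */
void setlowfrequency(void)
{
    current_duty_percent = 50;
    transitions++;
}

/* Stub: a real version would restore full speed. */
void resetfrequency(void)
{
    current_duty_percent = 100;
    transitions++;
}

/* Outer loop holding one throttled loop and one full-speed loop.
 * Returns a checksum so the work cannot be optimized away. */
double run_hybrid(int steps, int n)
{
    double sum = 0.0;
    for (int s = 0; s < steps; s++) {
        setlowfrequency();
        for (int i = 0; i < n; i++)      /* stand-in for a memory-bound loop */
            sum += (double)i;
        resetfrequency();
        for (int j = 0; j < n; j++)      /* stand-in for a compute-bound loop */
            sum += (double)j * 0.5;
    }
    return sum;
}
```

Each outer iteration costs two transitions, which is why the slide stresses fast power-state changes: the transition cost is paid once per loop region per iteration.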
36 LULESH Results. Table: Comparison of execution time, energy consumption, and EDP for LULESH (the time, energy, and EDP values are missing from this transcription)
Version    Duty Cycle Level
mint       100%
mine       56.25%
minedp     81.25%
Hybrid 1   100% & 50%
Hybrid 2   100% & 62.5%
Hybrid 3   100% & 68.75%
37 minife Results. Table: Comparison of execution time, energy consumption, and EDP for minife (the time, energy, and EDP values are missing from this transcription)
Version    Duty Cycle Level
mint       100%
mine       62.5%
minedp     81.25%
Hybrid 1   100% & 81.25%
Hybrid 2   100% & 87.5%
Hybrid 3   100% & 93.75%
38 Fine-grained DVFS vs. Fine-grained DCM (Entire Application). Figure: normalized metrics (including EDP) for the hybrid versions H-DCM-50%, H-DCM-62.5%, H-DCM-68.75%, H-DVFS-2.6GHz, H-DVFS-2.5GHz, H-DVFS-2.4GHz, H-DVFS-2.3GHz, H-DVFS-2.2GHz, and H-DVFS-2.1GHz.
39 DVFS vs. DCM: Another Comparison. Measure and compare the execution time and power of Loop1 and Loop2 with and without energy control APIs.
    while (condition) {
        ...
        // Loop1
        MemLoop();
        // Loop2
        CompLoop();
        OtherLoops();
    }
VS.
    while (condition) {
        ...
        setfrequency();
        MemLoop();
        resetfrequency();
        CompLoop();
        OtherLoops();
    }
40 Power Transition Overhead. Figure: normalized metrics (including power) for Loop1 (MemLoop) and Loop2 (CompLoop) under DVFS-2.4GHz, DVFS-2.1GHz, DVFS-1.8GHz, DCM-68.75%, DCM-62.5%, and DCM-50%. DVFS energy control is not synchronized with fine-grain loops, but DCM is.
42 Fine-Grain vs. Coarse-Grain DVFS/DCM Control of Loop1 (MemLoop). Figure: normalized fine-grain (FG) and coarse-grain (CG) metrics, including FG-Power and CG-Power, for the control settings DCM-50%, DCM-62.5%, DCM-68.75%, DVFS-1.8GHz, DVFS-2.1GHz, and DVFS-2.4GHz. DCM fine-grain energy control is almost identical to coarse-grain. DVFS overhead is larger than DCM.
44 Clock Skipping with Concurrency Throttling. Concurrency throttling mitigates resource contention. Clock modulation reduces idle state power. Figure: fdtd-2d Polybench, number of threads versus clock skipping (duty cycle), with concurrency throttling and clock skipping both applied: (c) minimum occurs at (75%, 4); (d) minimum occurs at (100%, 6).
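Each panel reports the (duty cycle, thread count) pair at which a measured metric is lowest. One straightforward way to locate such a minimum, assumed here for illustration and not stated in the slides, is an exhaustive search over the measured grid. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Indices of the best cell: row = duty-cycle setting, column = thread count. */
struct config {
    size_t duty_index;
    size_t thread_index;
};

/* Exhaustively search a row-major grid of measured values
 * (n_duty rows by n_threads columns) for the smallest entry. */
struct config best_config(const double *grid, size_t n_duty, size_t n_threads)
{
    assert(grid != NULL && n_duty > 0 && n_threads > 0);
    struct config best = {0, 0};
    double best_value = grid[0];
    for (size_t d = 0; d < n_duty; d++) {
        for (size_t t = 0; t < n_threads; t++) {
            double v = grid[d * n_threads + t];
            if (v < best_value) {
                best_value = v;
                best.duty_index = d;
                best.thread_index = t;
            }
        }
    }
    return best;
}
```

The caller maps the returned indices back to its own lists of duty-cycle settings and thread counts. Running the search once per metric gives one minimum per panel, which may fall at different configurations.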
45 LULESH. Figure: number of threads versus clock skipping (duty cycle): (a) minimum occurs at (75%, 6); (b) minimum occurs at (100%, 8). Table (Version, # of Threads, Duty Cycle Level; remaining columns including EDP are missing from this transcription): Default: 6 threads, 100%; CT: 6 threads, 100%; CT+CS: 6 threads, 75%.
46 minife. Figure: number of threads versus clock skipping (duty cycle): (a) minimum occurs at (75%, 8); (b) minimum occurs at (100%, 4). Table (Version, # of Threads, Duty Cycle Level; remaining columns including EDP are missing from this transcription): Default: 6 threads, 100%; CT: 10 threads, 100%; CT+CS: values missing.
47 When Concurrency Throttling is Not Beneficial. Figure: brdr2d results when applying both concurrency throttling and clock skipping, number of threads versus clock skipping (duty cycle): (a) minimum occurs at (75%, 6); (b) minimum occurs at (100%, 6).
48 Concurrency Throttling and Memory Access Density. Figure: normalized energy/EDP and number of threads (NumThreads) for each of the 30 Polybench benchmarks. Loops with high Mem-value tend to benefit from concurrency throttling.
50 Conclusion: 1. Hybrid execution of OpenMP loops with Clock Modulation can achieve better energy efficiency. 2. Concurrency throttling can be combined with Clock Modulation to save more energy.
51 Outline: 1. Auto-Tuning and Optimization in Polyhedral Optimization Space; 2. Optimization with Program Characterization and CPU Clock Modulation; 3. Proposed Future Work
52 Future Work: 1. Build predictive energy/performance model (May 2015). 2. Enhance results with frequency/threads configuration (Aug. 2015).
53 Future Work. Power Impacts of Polyhedral Optimizations (Dec. 2015). Figure: Covariance Polybench program variants with and without loop tiling, showing power (Watts) and execution time (seconds) across program variants. Extending DCM Optimization Technique (Dec. 2015): MPI/OpenMP programs; runtime control; automating frequency/number-of-threads configuration.
More informationComputational Simulations of The World s Biggest Eye on GPUs
Computational Simulations of The World s Biggest Eye on GPUs Hatem Ltaief Extreme Computing Research Center King Abdullah University of Science and Technology, Saudi Arabia NVIDIA GTC at San Jose, CA April
More informationDynamic hardware management of the H264/AVC encoder control structure using a framework for system scenarios
Dynamic hardware management of the H264/AVC encoder control structure using a framework for system scenarios Yahya H. Yassin, Per Gunnar Kjeldsberg, Andrew Perkis Department of Electronics and Telecommunications
More informationVampir Getting Started. Holger Brunst March 4th 2008
Vampir Getting Started Holger Brunst holger.brunst@tu-dresden.de March 4th 2008 What is Vampir? Program Monitoring, Visualization, and Analysis 1. Step: VampirTrace monitors your program s runtime behavior
More informationLS-DYNA Performance Enhancement of Fan Blade Off Simulation on Cray XC40
LS-DYNA Performance Enhancement of Fan Blade Off Simulation on Cray XC40 Ting-Ting Zhu, Cray Inc. Jason Wang, LSTC Brian Wainscott, LSTC Abstract This work uses LS-DYNA to enhance the performance of engine
More informationPilot: Device-free Indoor Localization Using Channel State Information
ICDCS 2013 Pilot: Device-free Indoor Localization Using Channel State Information Jiang Xiao, Kaishun Wu, Youwen Yi, Lu Wang, Lionel M. Ni Department of Computer Science and Engineering Hong Kong University
More informationRecent Advances in Simulation Techniques and Tools
Recent Advances in Simulation Techniques and Tools Yuyang Li, li.yuyang(at)wustl.edu (A paper written under the guidance of Prof. Raj Jain) Download Abstract: Simulation refers to using specified kind
More informationTrinity Center of Excellence
Trinity Center of Excellence I can t promise to solve all your problems, but I can promise you won t face them alone Hai Ah Nam Computational Physics & Methods (CCS-2) Presented to: Salishan Conference
More informationMonte Carlo integration and event generation on GPU and their application to particle physics
Monte Carlo integration and event generation on GPU and their application to particle physics Junichi Kanzaki (KEK) GPU2016 @ Rome, Italy Sep. 26, 2016 Motivation Increase of amount of LHC data (raw &
More informationThe Ghost in the Machine Observing the Effects of Kernel Operation on Parallel Application Performance
The Ghost in the Machine Observing the Effects of Kernel Operation on Parallel Application Performance Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckman l {anataraj, amorris, malony,
More informationEE382V-ICS: System-on-a-Chip (SoC) Design
EE38V-CS: System-on-a-Chip (SoC) Design Hardware Synthesis and Architectures Source: D. Gajski, S. Abdi, A. Gerstlauer, G. Schirner, Embedded System Design: Modeling, Synthesis, Verification, Chapter 6:
More informationMulti-core Platforms for
20 JUNE 2011 Multi-core Platforms for Immersive-Audio Applications Course: Advanced Computer Architectures Teacher: Prof. Cristina Silvano Student: Silvio La Blasca 771338 Introduction on Immersive-Audio
More informationMohit Arora. The Art of Hardware Architecture. Design Methods and Techniques. for Digital Circuits. Springer
Mohit Arora The Art of Hardware Architecture Design Methods and Techniques for Digital Circuits Springer Contents 1 The World of Metastability 1 1.1 Introduction 1 1.2 Theory of Metastability 1 1.3 Metastability
More informationANALOG-TO-DIGITAL CONVERTER FOR INPUT VOLTAGE MEASUREMENTS IN LOW- POWER DIGITALLY CONTROLLED SWITCH-MODE POWER SUPPLY CONVERTERS
ANALOG-TO-DIGITAL CONVERTER FOR INPUT VOLTAGE MEASUREMENTS IN LOW- POWER DIGITALLY CONTROLLED SWITCH-MODE POWER SUPPLY CONVERTERS Aleksandar Radić, S. M. Ahsanuzzaman, Amir Parayandeh, and Aleksandar Prodić
More informationA 32 Gbps 2048-bit 10GBASE-T Ethernet Energy Efficient LDPC Decoder with Split-Row Threshold Decoding Method
A 32 Gbps 248-bit GBASE-T Ethernet Energy Efficient LDPC Decoder with Split-Row Threshold Decoding Method Tinoosh Mohsenin and Bevan M. Baas VLSI Computation Lab, ECE Department University of California,
More informationMeasuring and Evaluating Computer System Performance
Measuring and Evaluating Computer System Performance Performance Marches On... But what is performance? The bottom line: Performance Car Time to Bay Area Speed Passengers Throughput (pmph) Ferrari 3.1
More informationPerformance Metrics, Amdahl s Law
ecture 26 Computer Science 61C Spring 2017 March 20th, 2017 Performance Metrics, Amdahl s Law 1 New-School Machine Structures (It s a bit more complicated!) Software Hardware Parallel Requests Assigned
More informationCAMEO: Continuous Analytics for Massively Multiplayer Online Games
CAMEO: Continuous Analytics for Massively Multiplayer Online Games Alexandru Iosup Parallel and Distributed Systems Group Delft University of Technology 1 MMOGs are a Popular, Growing Market 25,000,000
More informationOverview. 1 Trends in Microprocessor Architecture. Computer architecture. Computer architecture
Overview 1 Trends in Microprocessor Architecture R05 Robert Mullins Computer architecture Scaling performance and CMOS Where have performance gains come from? Modern superscalar processors The limits of
More informationABSTRACT. GHOLKAR, NEHA. On the Management of Power Constraints for High Performance Systems. (Under the direction of Frank Mueller).
ABSTRACT GHOLKAR, NEHA. On the Management of Power Constraints for High Performance Systems. (Under the direction of Frank Mueller). The supercomputing community is targeting exascale computing by 2023.
More informationFILA: Fine-grained Indoor Localization
IEEE 2012 INFOCOM FILA: Fine-grained Indoor Localization Kaishun Wu, Jiang Xiao, Youwen Yi, Min Gao, Lionel M. Ni Hong Kong University of Science and Technology March 29 th, 2012 Outline Introduction Motivation
More informationReal-time Concurrent Collection on Stock Multiprocessors
RETROSPECTIVE: Real-time Concurrent Collection on Stock Multiprocessors Andrew W. Appel Princeton University appel@cs.princeton.edu 1. INTRODUCTION In 1987, Kai Li of Princeton University was working with
More informationIF ID EX MEM WB 400 ps 225 ps 350 ps 450 ps 300 ps
CSE 30321 Computer Architecture I Fall 2011 Homework 06 Pipelined Processors 75 points Assigned: November 1, 2011 Due: November 8, 2011 PLEASE DO THE ASSIGNMENT ON THIS HANDOUT!!! Problem 1: (15 points)
More informationDocument downloaded from:
Document downloaded from: http://hdl.handle.net/1251/64738 This paper must be cited as: Reaño González, C.; Pérez López, F.; Silla Jiménez, F. (215). On the design of a demo for exhibiting rcuda. 15th
More informationTrace Based Switching For A Tightly Coupled Heterogeneous Core
Trace Based Switching For A Tightly Coupled Heterogeneous Core Shru% Padmanabha, Andrew Lukefahr, Reetuparna Das, Sco@ Mahlke Micro- 46 December 2013 University of Michigan Electrical Engineering and Computer
More informationHARDWARE ACCELERATION OF THE GIPPS MODEL
HARDWARE ACCELERATION OF THE GIPPS MODEL FOR REAL-TIME TRAFFIC SIMULATION Salim Farah 1 and Magdy Bayoumi 2 The Center for Advanced Computer Studies, University of Louisiana at Lafayette, USA 1 snf3346@cacs.louisiana.edu
More informationFTSP Power Characterization
1. Introduction FTSP Power Characterization Chris Trezzo Tyler Netherland Over the last few decades, advancements in technology have allowed for small lowpowered devices that can accomplish a multitude
More informationTransient Temperature Analysis. Rajit Chandra, Ph.D. Gradient Design Automation
Transient Temperature Analysis Rajit Chandra, Ph.D. Gradient Design Automation Trends in mixed signal designs More designs with switching high power drivers (smart power chips, automotive, high-speed communications,
More information*Engineering and Industrial Services, TATA Consultancy Services Limited **Professor Emeritus, IIT Bombay
System Identification and Model Predictive Control of SI Engine in Idling Mode using Mathworks Tools Shivaram Kamat*, KP Madhavan**, Tejashree Saraf* *Engineering and Industrial Services, TATA Consultancy
More informationCSE502: Computer Architecture Welcome to CSE 502
Welcome to CSE 502 Introduction & Review Today s Lecture Course Overview Course Topics Grading Logistics Academic Integrity Policy Homework Quiz Key basic concepts for Computer Architecture Course Overview
More informationH8238/MCM MODBUS POINT MAP
H8238/MCM MODBUS POINT MAP F O R M A T Int Float R/W NV Description 1 257/258 R/W NV Energy Consumption, kwh, Low-word integer 2 259/260 R/W NV Energy Consumption, kwh, High-word integer Both 257/258 and
More informationIncreasing Performance Requirements and Tightening Cost Constraints
Maxim > Design Support > Technical Documents > Application Notes > Power-Supply Circuits > APP 3767 Keywords: Intel, AMD, CPU, current balancing, voltage positioning APPLICATION NOTE 3767 Meeting the Challenges
More informationIntroduction to Real-Time Systems
Introduction to Real-Time Systems Real-Time Systems, Lecture 1 Martina Maggio and Karl-Erik Årzén 16 January 2018 Lund University, Department of Automatic Control Content [Real-Time Control System: Chapter
More informationMicroarchitectural Attacks and Defenses in JavaScript
Microarchitectural Attacks and Defenses in JavaScript Michael Schwarz, Daniel Gruss, Moritz Lipp 25.01.2018 www.iaik.tugraz.at 1 Michael Schwarz, Daniel Gruss, Moritz Lipp www.iaik.tugraz.at Microarchitecture
More informationProject 5: Optimizer Jason Ansel
Project 5: Optimizer Jason Ansel Overview Project guidelines Benchmarking Library OoO CPUs Project Guidelines Use optimizations from lectures as your arsenal If you decide to implement one, look at Whale
More informationEnergy-Performance Trade-offs on Energy-Constrained Devices with Multi-Component DVFS
Energy-Performance Trade-offs on Energy-Constrained Devices with Multi-Component DVFS Rizwana Begum, David Werner and Mark Hempstead Drexel University {rb639,daw77,mhempstead}@drexel.edu Guru Prasad, Jerry
More information