Dynamic MIPS Rate Stabilization in Out-of-Order Processors


Dynamic MIPS Rate Stabilization in Out-of-Order Processors
Jinho Suh and Michel Dubois
Ming Hsieh Dept. of EE, University of Southern California

Outline
  Motivation
  Performance Variability of an Out-of-Order Processor
  Dynamic MIPS Rate Stabilization
  Stabilization Results
  Stabilization Framework Robustness
  Conclusion
ISCA'09

Motivation
  Timing predictability of a task: top priority
    Dynamic timing analysis, WCET (Worst-Case Execution Time) analysis
    WCET analysis: best for hard RT
    But challenging due to advances in microarchitecture [Hergenhan'00]
  Power/energy savings
    Throttling frequency with scheduling based on WCET analysis [Hughes'01] [Rotenberg'01] [Zhu'00]
    Exploiting ILP (Instruction Level Parallelism) over frequency [Childers'00]
  Real-time systems avoid Out-of-Order processors with caches
  Stabilization framework: novel methodology to improve predictability
    Fine control of throughput
    T(maximum instruction count, deadline)
    Power and energy savings without task overrun

Performance Variability of an OoO Processor
  Workload: 17 MiBench [Guthaus'01] programs, 388 slices (40MI each); each slice is a task
  OoO processor @ 1GHz, 3-way
  Caches: 16KB IL1/DL1, 512KB UL2
  ISA: PISA
  Statistics (MIPS): Mean: 1307.9, STDEV: 437.6
  Timing predictability is a difficult problem due to high variability in OoO processors with caches

Dynamic MIPS Rate Stabilization
  Fine controllability of MIPS rate
    Just-in-time completion without overrun + optimal power/energy savings
  Stabilization framework
    Profiling (code analysis) -> target MIPS
    PID feedback control
    Dynamic volt/freq scaling -> processor
    Result: target MIPS rate controllability

Dynamic MIPS Rate Stabilization
  System: OoO processor with caches
    Very dynamic and mathematically difficult to model
    Very different from the classical control problem (fixed plant model)
    Still controllable by a PID controller
  Profiling (code analysis)
    Goal: the target MIPS rate is calculated for a task
    By code analysis
    Worst-case (MAX) instruction count over all dynamic traces
  PID feedback control (controller)
    P-, I-, and D- parameters should be configured: loop-tuning
    Assumption: current and next phase behave the same
  Mapper (to volt/freq)
    Frequency (continuous) from the system input -> frequency (discrete)
    Proper V/F scale
  [Figure: control loop. R: reference (target MIPS) -> (+/-) -> E: error -> Throughput-2-Frequency Controller -> U: system input (next MHz) -> Mapper -> Op-Point (Freq, Vdd) -> Plant (CPU, w/ resync penalty) -> Y: output. Retired instructions per second are measured and fed back.]

Dynamic MIPS Rate Stabilization: Framework Setup
  Task instruction count: 40 million instructions
  Control window: 50k instructions
  Changing volt/freq requires a 20us PLL resynchronization pause

  PID parameter settings
    Setting               P     I     D
    Setting 1: Slowest    1     10    0.1
    Setting 2             10    50    0.1
    Setting 3             50    50    0.1
    Setting 4             75    50    0.1
    Setting 5             100   50    0.1
    Setting 6: Fastest    100   100   0.1

  Mapper: next frequency to operating point (from Intel [Mistry'04])
    Next frequency        Freq (MHz)   Vdd (V)
    870MHz ~              1000         0.825
    605MHz ~ 870MHz       800          0.772
    370MHz ~ 605MHz       500          0.694
    ~ 370MHz              300          0.641
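The mapper table on this slide can be read as a simple range lookup. The sketch below is only an illustration of that reading: the names `OPERATING_POINTS`, `FASTEST_POINT`, and `map_to_operating_point` are hypothetical, and the handling of a request that falls exactly on a range boundary is an assumption, since the slide does not specify it.

```python
# Hypothetical sketch of the frequency-to-operating-point mapper.
# Ranges and (MHz, Vdd) pairs are taken from the slide's table.
# Assumption: a request exactly on a boundary maps to the lower range.
OPERATING_POINTS = [
    (370.0, (300, 0.641)),   # requests up to 370 MHz
    (605.0, (500, 0.694)),   # 370 MHz to 605 MHz
    (870.0, (800, 0.772)),   # 605 MHz to 870 MHz
]
FASTEST_POINT = (1000, 0.825)  # requests above 870 MHz


def map_to_operating_point(requested_mhz):
    """Return the discrete (frequency in MHz, Vdd in V) pair for a request."""
    for upper_bound_mhz, point in OPERATING_POINTS:
        if requested_mhz <= upper_bound_mhz:
            return point
    return FASTEST_POINT


if __name__ == "__main__":
    for request in (200.0, 500.0, 677.57, 900.0):
        print(request, "->", map_to_operating_point(request))
```

Note that, read this way, a request between 605 MHz and 800 MHz is rounded up to 800 MHz, while one between 800 MHz and 870 MHz is rounded down.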

Dynamic MIPS Rate Stabilization: Worked Example
  Target: 650 MIPS
  Last control window: 656 MIPS @ (800MHz, 0.772V)
  Current control window: @ (300MHz, 0.641V), 50k instructions committed
    Measure current MIPS: 642
    Calculate error: 650 - 642 = 8
    Calculate next MIPS = 75*(8 - (-6)) + 50*8 = 1450
    Next frequency = 300MHz * 1450/642 = 677.57MHz
    Next volt/freq = (800MHz, 0.772V)
  Next control window:
    PLL resynchronization pause of 20usec
    Start running @ (800MHz, 0.772V)
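The arithmetic on this slide can be reproduced with a few lines of code. The sketch below mirrors only what the slide shows: a P gain of 75 applied to the change in error and an I gain of 50 applied to the current error (the values of Setting 4). The function names are hypothetical, and the D term does not appear in the slide's calculation, so it is left out here as well.

```python
def next_mips(error, prev_error, p_gain, i_gain):
    """Controller output as written on the slide:
    P gain on the change in error, I gain on the current error."""
    return p_gain * (error - prev_error) + i_gain * error


def next_frequency_mhz(current_mhz, measured_mips, requested_mips):
    """Scale the current frequency by requested/measured MIPS, which
    assumes the next window behaves like the current one."""
    return current_mhz * requested_mips / measured_mips


target = 650.0
prev_error = target - 656.0   # last window measured 656 MIPS, error is -6
error = target - 642.0        # current window measured 642 MIPS, error is 8

requested = next_mips(error, prev_error, p_gain=75, i_gain=50)
freq = next_frequency_mhz(300.0, 642.0, requested)

print(requested, round(freq, 2))  # 1450.0 677.57
```

The resulting 677.57 MHz request falls in the 605MHz ~ 870MHz range of the mapper table, which is why the next operating point is (800MHz, 0.772V).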

Target MIPS Rate and Task Overrun
  [Figure: cumulative MIPS observed without stabilization on the WC trace, plotted against time or retired instructions up to the task end, compared with the required target according to the deadline for three cases (a), (b), (c).]
  (a) Target is not achievable: task overrun
  (b) Target might be achievable: possible overruns
  (c) Target is achievable: no overrun

Stabilization Results
  Target of 650 MIPS: 62ms deadline for a 40MI task
  Statistics (MIPS): Mean: 650.37, STDEV: 1.51
  Savings against baseline
    Average power: 46.64%
    Average energy: 72.06%
  1. Predictability is much IMPROVED in OoO processors with caches
  2. Power and energy savings due to just-in-time completion without task overrun
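As a quick check of how the 650 MIPS target relates to the 62ms deadline, the sketch below computes the minimum rate a 40MI task needs and the completion time at 650 MIPS. It also compares the relative spread (STDEV divided by mean) of the unstabilized and stabilized statistics reported in the slides. All function names are hypothetical, and the relative-spread comparison is an added illustration rather than a figure from the talk.

```python
def required_mips(instructions, deadline_s):
    """Minimum average rate (in MIPS) needed to retire all instructions by the deadline."""
    return instructions / deadline_s / 1e6


def completion_time_ms(instructions, mips):
    """Time (in ms) to retire all instructions at a constant rate."""
    return instructions / (mips * 1e6) * 1e3


def relative_spread(stdev, mean):
    """Standard deviation as a fraction of the mean."""
    return stdev / mean


TASK_INSTRUCTIONS = 40e6   # 40MI task
DEADLINE_S = 62e-3         # 62ms deadline

needed = required_mips(TASK_INSTRUCTIONS, DEADLINE_S)
finish_ms = completion_time_ms(TASK_INSTRUCTIONS, 650.0)
baseline_spread = relative_spread(437.6, 1307.9)    # unstabilized statistics
stabilized_spread = relative_spread(1.51, 650.37)   # stabilized statistics

print(f"required rate: {needed:.2f} MIPS")
print(f"completion at 650 MIPS: {finish_ms:.2f} ms")
print(f"relative spread: {baseline_spread:.1%} -> {stabilized_spread:.2%}")
```

A 650 MIPS target therefore sits slightly above the roughly 645 MIPS that the deadline strictly requires.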

Stabilization Quality, Safety Margin and Target
  [Figure: task overrun under different safety margins. Two panels (slower controller, faster controller) plot the resultant rate against the target and the target with safety margin over retired instructions (millions); the slower controller shows an undershoot.]

  Target w/ margin
    Target - margin    Value
    650 - 5%           617.5
    650 - 1%           643.5
    650 - 0.5%         646.75
    650 - 0.4%         647.4
    650 - 0.3%         648.05
    650 - 0.1%         649.35

  1. Greedy setting of target: possible OVERRUN
  2. Target with reasonable safety margin (1%): VERY FAST CONVERGENCE
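The margin values in the table follow directly from subtracting a percentage from the 650 target. A minimal sketch, with the hypothetical helper `target_with_margin`, reproduces them:

```python
def target_with_margin(target_mips, margin_percent):
    """Target reduced by a safety margin given in percent."""
    return target_mips * (1.0 - margin_percent / 100.0)


MARGINS_PERCENT = (5, 1, 0.5, 0.4, 0.3, 0.1)

for margin in MARGINS_PERCENT:
    value = target_with_margin(650.0, margin)
    print(f"650 - {margin}% -> {value:.2f}")
```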

Stabilization Framework Robustness: Different PID Parameters
  PID parameter settings, evaluated at margins of 5%, 1%, 0.5%, 0.4%, 0.3%, 0.1%
    Setting               P     I     D
    Setting 1: Slowest    1     10    0.1
    Setting 2             10    50    0.1
    Setting 3             50    50    0.1
    Setting 4             75    50    0.1
    Setting 5             100   50    0.1
    Setting 6: Fastest    100   100   0.1
  Safe loop-tuning for PID controller parameters
    No need to fine-tune parameters
  From bitcount, qsort, fft_fwr, fft_inv
    Convergence (settings 2~6): STDEV difference < 1.5
    Power/energy savings: difference < 10%
  PID controller is ROBUST: stabilization works well with different PID parameters

Stabilization Framework Robustness: Different Cache Configurations
  Same PID controller
  With 4KB-4KB-128KB caches
    IPC: 62.15%
    IL1 misses: 311.11%
    DL1 misses: 491.86%
    UL2 misses: 609.45%
  Statistics: Mean: 450.55, STDEV: 1.04
  From qsort, patricia
  Same observation with 8KB-8KB-256KB caches
  PID controller is ROBUST: stabilization works well with different cache configurations

Comparison with In-Order (IO) Processor
  Same Quality-of-Service: > 650 MIPS
  IO processor @ 1.4GHz
  Statistics: Mean: 793.14, STDEV: 65.5
  Relative to stabilized OoO: Power: 151.46%, EPI: 110.49%
  Relative to baseline OoO: Power: 86.19%, EPI: 102.64%
  From basicmath,, patricia, adpcm_p2a, adpcm_a2p
  1. Stabilized OoO is better than IO for power/energy consumption
  2. IO can be stabilized as well

Conclusion
  Fine-grain controllability of processor instruction throughput
    Makes execution time highly predictable
    Optimizes power/energy consumption by meeting deadlines right on time
    Applicable to many different kinds of (single) processors, including OoO processors with caches, for RT applicability
    A stabilized OoO processor can be better than an IO processor
  Robustness
    Over PID parameters, over different cache configurations
  Future work
    Extension of the framework to Chip Multiprocessors

References
[Cazorla'04] Cazorla, F. J., Knijnenburg, P. M., Sakellariou, R., Fernandez, E., Ramirez, A., and Valero, M. Predictable performance in SMT processors. In Proceedings of the 1st Conference on Computing Frontiers (CF '04), Ischia, Italy, April 14-16, 2004.
[Childers'00] Childers, B. R., Tang, H., and Melhem, R. Adapting Processor Supply Voltage to Instruction-Level Parallelism. Koolchips 2000, during the 33rd Int'l. Symp. on Microarchitecture (MICRO-33), Monterey, CA, December 10, 2000.
[Burger'97] Burger, D. and Austin, T. M. The SimpleScalar Tool Set Version 2.0. Technical Report 1342, Computer Sciences Department, University of Wisconsin-Madison, May 1997.
[Guthaus'01] Guthaus, M. R., Ringenberg, J. S., Ernst, D., Austin, T. M., Mudge, T., and Brown, R. B. MiBench: A free, commercially representative embedded benchmark suite. In Proceedings of the IEEE International Workshop on Workload Characterization (WWC-4), December 2, 2001.
[Hamers'07] Hamers, J. and Eeckhout, L. Resource prediction for media stream decoding. In Proceedings of the Conference on Design, Automation and Test in Europe, Nice, France, April 16-20, 2007. EDA Consortium, San Jose, CA, 594-599.
[Hergenhan'00] Hergenhan, A. and Rosenstiel, W. Static timing analysis of embedded software on advanced processor architectures. In Proceedings of Design, Automation and Test in Europe (DATE '00), pages 552-559, Paris, March 2000.
[Hughes'01] Hughes, C. J., Srinivasan, J., and Adve, S. V. Saving Energy with Architectural and Frequency Adaptations for Multimedia Applications. In Proceedings of the 34th Annual International Symposium on Microarchitecture (MICRO-34), Dec. 2001.
[Mistry'04] Mistry, K., Armstrong, M., Auth, C., Cea, S., Coan, T., Ghani, T., Hoffmann, T., Murthy, A., Sandford, J., Shaheed, R., Zawadzki, K., Zhang, K., Thompson, S., and Bohr, M. Delaying forever: Uniaxial strained silicon transistors in a 90nm CMOS technology. Symposium on VLSI Technology, p. 50, 2004.
[Rotenberg'01] Rotenberg, E. Using Variable-MHz Microprocessors to Efficiently Handle Uncertainty in Real-Time Systems. 34th International Symposium on Microarchitecture, December 2001.
[Xu'05] Xu, C., Le, T. M., and Lay, T. T. H.264/AVC CODEC: Instruction Level Complexity Analysis. Ninth IASTED International Conference on Internet and Multimedia Systems and Applications, Honolulu, HI, USA, 15-17 Aug. 2005.
[Zhu'00] Zhu, Y. and Mueller, F. Feedback EDF Scheduling Exploiting Dynamic Voltage Scaling. In Proceedings of the 10th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS'04), May 25-28, 2004.