Multiple Clock and Voltage Domains for Chip Multi Processors

Efraim Rotem, Intel Corporation, Israel
Avi Mendelson, Microsoft R&D, Israel
Ran Ginosar, Technion, Israel Institute of Technology
Uri Weiser, Technion, Israel Institute of Technology
Presented by: Michael Moeng, University of Pittsburgh

Outline
- Multiple Voltage Domains
- Power Model
- Performance Model
- Power Management Policies
- Results

Multiple Voltage Domains
Multiprocessors can distribute power in several ways:
- Single clock domain (also implies a single voltage domain)
  - All cores operate at the same frequency and voltage
- Multiple clock domains, which communicate through FIFO buffers with minor overhead
  - Multiple voltage domains: cores independently scale frequency and voltage
  - Single voltage domain: individual cores use only frequency scaling; the single voltage for all cores is determined by the highest frequency
- Clustered topologies: a hybrid approach between the two extremes
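To make the difference between a single voltage domain and per-core voltage domains concrete, here is a minimal sketch. It assumes the standard CMOS dynamic-power relation (power proportional to f * V^2), ignores leakage, and uses a hypothetical linear voltage-frequency curve with a floor at the minimum operating voltage. The names `V_MIN`, `F_NOM`, `required_voltage`, and `chip_power` are illustrative and do not come from the slides.

```python
# Illustrative only: normalized units, hypothetical linear V-f curve.
V_MIN = 1.0   # assumed minimum operating voltage (normalized)
F_NOM = 1.0   # assumed nominal frequency reached at V_MIN (normalized)


def required_voltage(freq):
    """Voltage needed to run at `freq` (assumption: linear above nominal,
    never below the minimum operating voltage)."""
    return max(V_MIN, V_MIN * freq / F_NOM)


def dynamic_power(freq, voltage):
    """Dynamic power with switched capacitance normalized to 1."""
    return freq * voltage ** 2


def chip_power(freqs, per_core_voltage):
    """Total dynamic power for the given per-core frequencies.

    per_core_voltage=True  -> each core gets only the voltage it needs.
    per_core_voltage=False -> one shared voltage set by the fastest core.
    """
    if per_core_voltage:
        volts = [required_voltage(f) for f in freqs]
    else:
        shared = required_voltage(max(freqs))
        volts = [shared] * len(freqs)
    return sum(dynamic_power(f, v) for f, v in zip(freqs, volts))


freqs = [1.5, 1.0, 1.0, 1.0]  # one core boosted, three at nominal
print("multiple voltage domains:", chip_power(freqs, per_core_voltage=True))
print("single voltage domain:   ", chip_power(freqs, per_core_voltage=False))
```

Under these assumptions, the shared-voltage case pays the boosted core's voltage on every core, which is the cost the slide attributes to a single voltage domain.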

Multiple Voltage Domains - Power Delivery
- Previous works assume no overhead for extra voltage regulators.
- A voltage regulator must be designed for a nominal current.
- Additional voltage regulators have consequences for:
  - Current sharing
  - Power delivery network resistance

Current Sharing
- A regulator will realistically be designed for a maximum current of 130% to 250% of its nominal current.
- Compare chip power delivery systems, where X is the chip's nominal current:
  - Single voltage regulator: X to 2.5X amps
  - Two voltage regulators: 0.5X to 1.25X amps each
  - N voltage regulators: X/N to 2.5X/N amps each
- Maximum power to a single core can be much higher with fewer regulators.
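The per-regulator ranges above follow from dividing the chip's nominal current evenly and applying the headroom factor. A small sketch of that arithmetic, where the function name `per_regulator_current` and the 100 A figure are hypothetical and chosen only for illustration:

```python
def per_regulator_current(total_nominal, n_regulators, headroom):
    """Return (nominal, maximum) current per regulator when the chip's
    nominal current is split evenly across `n_regulators`.

    `headroom` is the ratio of maximum to nominal current (the slides
    quote 1.3 to 2.5 as the realistic range).
    """
    nominal = total_nominal / n_regulators
    return nominal, nominal * headroom


X = 100.0  # hypothetical chip nominal current in amps
for n in (1, 2, 16):
    nominal, peak = per_regulator_current(X, n, headroom=2.5)
    print(f"{n:2d} regulator(s): nominal {nominal} A, maximum {peak} A each")
```

The most current any single core can draw is bounded by the maximum of the regulator feeding it, so the bound shrinks as the regulator count grows.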

Resistance in Power Delivery Network
- Splitting the power delivery network N ways results in N times higher resistance per branch.
- For symmetric workloads, each regulator also supplies 1/N of the current, so there is no penalty.
- When assigning power asymmetrically, the higher resistance results in a voltage drop and therefore wasted power.
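The symmetric and asymmetric cases can be checked with a simple resistive-loss calculation. This is a sketch under the assumption that each branch behaves as an ideal resistor of N times the unsplit network's resistance; the function name `pdn_loss` and the numeric values are hypothetical.

```python
def pdn_loss(core_currents, shared_resistance):
    """Resistive loss (I^2 * R) in a power delivery network split evenly
    into one branch per core. Each branch is assumed to have
    len(core_currents) times the resistance of the unsplit network."""
    n = len(core_currents)
    branch_resistance = n * shared_resistance
    return sum(i * i * branch_resistance for i in core_currents)


R = 0.001        # hypothetical unsplit network resistance in ohms
TOTAL = 100.0    # hypothetical total chip current in amps

unsplit = TOTAL ** 2 * R                          # one network carries everything
symmetric = pdn_loss([25.0, 25.0, 25.0, 25.0], R)  # even split across 4 branches
asymmetric = pdn_loss([70.0, 10.0, 10.0, 10.0], R) # same total, one hot core

print("unsplit   :", unsplit, "W")
print("symmetric :", symmetric, "W")
print("asymmetric:", asymmetric, "W")
```

With the same total current, the even split loses the same power as the unsplit network, while concentrating current in one branch raises the loss.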

Power Model
Assumption: Future high-power CMPs will be designed with nominal frequency and power at the minimum operating voltage allowed by a process.

Benchmarks

Quick Check
If we run 16 copies of ammp at nominal frequency, how much power do we have left?

Performance Model

Performance Model Frequency

Performance Model Minimum Operating Frequency

Benchmarks

Power Management Policies
- Goal: Maximize performance given a power constraint.
- Assume benchmarks have already been profiled (we know the frequency scaling).
- Policies assume it is better to give a core with better scalability a higher frequency, and each policy provides a function that maps scalability to frequency.

Quick Check 2
The polynomial policy scales frequency inversely with the frequency-power dependency. What is this function?

Power Management: Following Constraints
After each core's desired power level is determined:
- If the desired current exceeds the current capacity, scale frequency down to the maximum allowed.
- All values are normalized so total power meets the power constraint.
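The two constraint steps can be sketched as a clamp followed by a normalization. This is an illustration rather than the authors' implementation: it works directly in power units instead of converting between current and frequency, and it assumes normalization only scales values down, proportionally. The name `apply_constraints` and its parameters are hypothetical.

```python
def apply_constraints(desired_power, per_core_cap, total_budget):
    """Fit per-core desired power levels to the delivery constraints.

    Step 1: clamp each core to what its regulator can supply
            (`per_core_cap`, assumed equal for all cores).
    Step 2: if the clamped total still exceeds `total_budget`,
            scale every core down by the same factor (assumption).
    """
    clamped = [min(p, per_core_cap) for p in desired_power]
    total = sum(clamped)
    if total > total_budget:
        scale = total_budget / total
        clamped = [p * scale for p in clamped]
    return clamped


# Hypothetical numbers: three cores, 25 W regulator cap, 40 W chip budget.
result = apply_constraints([10.0, 30.0, 20.0], per_core_cap=25.0, total_budget=40.0)
print(result, "total:", sum(result))
```

Because the second step only ever scales down, a core clamped in the first step stays at or below its regulator cap.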

Evaluation
- Simulation and real-machine execution are used to determine parameters for each benchmark.
- The "Oracle" is simulated using a gradient descent algorithm.
- Monte Carlo modeling is used for workload generation.
- Workloads with 2, 4, 8, 12, 14, and 16 threads are evaluated to show performance with idle cores.
- The baseline is a single clock domain and single voltage domain.
- 10-30% improvement over no DVFS.
- Quick Check 3: How does this improve performance?

Oracle policy
- For about half the workloads, it's best to use the same frequency for all cores.
- Loss comes from asynchronous FIFO buffers.

Best policies for each configuration
- Shows loss vs. oracle (lower is better).
- Knowledge of frequency scalability is crucial.

Limiting threads
- Multiple voltage domains are heavily dependent on high headroom for voltage regulators.

Clustered Topologies
- Matches the performance of a single voltage domain with few threads.
- Matches the performance of multiple voltage domains with many threads.