SCALCORE: DESIGNING A CORE


SCALCORE: DESIGNING A CORE FOR VOLTAGE SCALABILITY
Bhargava Gopireddy, Choungki Song, Josep Torrellas, Nam Sung Kim, Aditya Agrawal, Asit Mishra
University of Illinois, University of Wisconsin, Nvidia, Intel

Motivation
- Application characteristics vary dynamically
- Goal: a single core design that attains
  - High performance at nominal voltage (~0.9 V)
  - High energy efficiency at low voltage (~0.5 V)
- A voltage-scalable core

Observations
- SRAM vs. logic delay scaling
- Small increase in V → large improvement in delay
[Figure: normalized delay (0 to 12) of SRAM and logic versus V_dd, from 0.9 V down to 0.45 V, with V_min marked and a ~2x annotation]

ScalCore Idea
- Design a voltage-scalable core based on the two observations
[Figure: SRAM and logic frequency (MHz, log scale from 100 to 10000) versus V_dd, from 0.9 V down to 0.45 V. Annotated frequencies: f_nom ~ 3500, f_op ~ 1200, f_min ~ 900, f_logic ~ 600. Annotated voltages: V_nom, V_op, V_min, V_logic.]

ScalCore: A Voltage-Scalable Core
- Decouple V_dd of logic and storage structures in the pipeline
  - Improves energy efficiency
- Raise V_dd of storage structures a little
- Reconfigure the pipeline to take advantage of faster storage structures
  - Further improves performance and reduces leakage energy
- Evaluated the proposed design under various scenarios
  - Reduces execution time by 31%, energy by 48%, and energy-delay (ED) by 60% over the state of the art
- A truly voltage-scalable core

ScalCore Design
- Goal: maximize energy efficiency at low voltage (EEMode)
- Constraint: no impact on performance or energy at nominal voltage (HPMode)
- Approach: in EEMode,
  - Provide different V_dd values for storage and logic stages
    - Storage stages are ~2x faster than logic stages
  - Reconfigure the pipeline in one of two ways
    - Fuse storage stages in the pipeline (e.g., accessing the register file)
    - Increase storage structure sizes

Fusing Two Pipeline Stages into One
[Figure: a four-stage pipeline segment (Logic Stage 1, Storage Stage 2a, Storage Stage 2b, Logic Stage 3) driven by CLK, shown in both modes.
 HPMode: all four stages at V_nom; Enable Flow-through = 0, so 2a and 2b remain separate stages.
 EEMode: Logic Stages 1 and 3 at V_logic, Storage Stages 2a and 2b at V_op; Enable Flow-through = 1, so 2a and 2b are fused into one stage.]
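The timing argument behind stage fusion can be checked with a little arithmetic. The sketch below assumes the approximate frequencies quoted on the ScalCore Idea slide (f_logic ~ 600 MHz for logic at V_logic, f_op ~ 1200 MHz for storage at V_op). The function names are illustrative and not from the slides, and latch and level-conversion overheads are ignored.

```python
# Approximate operating frequencies taken from the slides (MHz).
F_LOGIC_MHZ = 600    # logic stages at V_logic
F_OP_MHZ = 1200      # storage stages at V_op


def stage_delay_ns(freq_mhz: float) -> float:
    """Delay of one pipeline stage that just meets timing at freq_mhz."""
    return 1000.0 / freq_mhz


def fused_stages_fit(n_storage_stages: int) -> bool:
    """True if n back-to-back storage stages fit in one logic-clock cycle.

    Assumption: the EEMode clock period is set by the logic stages.
    """
    cycle_ns = stage_delay_ns(F_LOGIC_MHZ)
    fused_ns = n_storage_stages * stage_delay_ns(F_OP_MHZ)
    return fused_ns <= cycle_ns + 1e-9  # small tolerance for float rounding


if __name__ == "__main__":
    print(f"logic cycle:              {stage_delay_ns(F_LOGIC_MHZ):.3f} ns")
    print(f"two fused storage stages: {2 * stage_delay_ns(F_OP_MHZ):.3f} ns")
    print("2 stages fit:", fused_stages_fit(2))
    print("3 stages fit:", fused_stages_fit(3))
```

With these numbers two storage stages exactly fill one logic cycle, which is consistent with the "~2x faster" observation; a real design would need margin for the flow-through latch.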

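Increasing Size of Structures
[Figure: three configurations of a storage structure, each with a decoder, array(s), sense amp(s), and a data-select stage.
 Original: Array 0 at V_nom with one sense amp.
 HPMode: Array 0 at V_nom, Array 1 disabled; Sense Amp 0 and Sense Amp 1 feed the data select.
 EEMode: Array 0 and Array 1 both at V_op; Sense Amp 0 and Sense Amp 1 feed the data select.]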
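One way to picture the resizable structure is as two arrays behind a shared decoder, with the second array usable only in EEMode. The sketch below is a behavioral model under stated assumptions: the entry counts (128 and 64, giving a 1.5X total) are hypothetical, and the linear index split between Array 0 and Array 1 is an assumption, since the slide does not show how the decoder maps indices.

```python
from enum import Enum


class Mode(Enum):
    HP = "HPMode"
    EE = "EEMode"


class ResizableArray:
    """Behavioral model of a structure with an optional second array."""

    def __init__(self, base_entries: int, extra_entries: int):
        self.base_entries = base_entries    # Array 0, always enabled
        self.extra_entries = extra_entries  # Array 1, enabled only in EEMode

    def capacity(self, mode: Mode) -> int:
        """Number of usable entries in the given mode."""
        if mode is Mode.HP:
            return self.base_entries
        return self.base_entries + self.extra_entries

    def select(self, index: int, mode: Mode) -> tuple[int, int]:
        """Map an entry index to (array id, row within that array)."""
        if not 0 <= index < self.capacity(mode):
            raise IndexError(f"entry {index} is not available in {mode.value}")
        if index < self.base_entries:
            return (0, index)
        return (1, index - self.base_entries)


if __name__ == "__main__":
    rob = ResizableArray(base_entries=128, extra_entries=64)  # hypothetical sizes
    print(rob.capacity(Mode.HP), rob.capacity(Mode.EE))
    print(rob.select(130, Mode.EE))
```

Keeping Array 1 unreachable in HPMode mirrors the stated constraint that the nominal-voltage configuration should not be affected by the extra capacity.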

ScalCore Pipeline
[Figure: pipeline diagram.
 Stages: Fetch, Decode, Rename, Dispatch, Wakeup, Select, Data Read, Source Drive, Ex, Ex, Write back.
 Blocks shown under the stages, with the main storage structures marked: L1-I, Branch Pred., Next PC, Decode, Allocate, Register File, EX, LSU, L1-D, ROB/Completion Table.]

Pipeline Structure Changes

Structure          | Fuse Stages                               | Increase Size
-------------------+-------------------------------------------+---------------------------------------
Register File      | Array access + Source drive               | 1.5X PRF
Allocation         | Rename + Dispatch                         | ---
Load Store Unit    | Addr. generation + Memory disambiguation  | 1.5X Load/Store Queue and Store buffer
Reorder Buffer     | ---                                       | 1.5X ROB
Branch Prediction  | ---                                       | ---

Controlpath Changes
- Programmable counters for variable latencies
  - Execution schedules
  - Wakeup logic
- Programmable counters for variable sizes
  - List of free registers
  - ROB, Load/Store queue sizes
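The control-path changes amount to making latencies and sizes mode-dependent parameters rather than fixed constants. A minimal sketch of that idea follows; the baseline sizes are hypothetical placeholders (the slides give only the 1.5X factor), the stage-pair names are illustrative, and the fused pairs follow the SC configuration (register file and allocation).

```python
# Cycles occupied by each stage pair; fused pairs take one cycle in EEMode.
HP_LATENCY = {"register_read": 2, "allocate": 2}
EE_LATENCY = {"register_read": 1, "allocate": 1}

# Hypothetical baseline entry counts; not taken from the slides.
BASE_SIZES = {"prf": 128, "rob": 128, "lsq": 64}
EE_SIZE_SCALE = 1.5  # the 1.5X factor from the structure-changes table


def program_counters(mode: str) -> dict:
    """Return the latency and size settings the control path would load."""
    if mode == "HPMode":
        latency, scale = HP_LATENCY, 1.0
    elif mode == "EEMode":
        latency, scale = EE_LATENCY, EE_SIZE_SCALE
    else:
        raise ValueError(f"unknown mode: {mode}")
    sizes = {name: int(entries * scale) for name, entries in BASE_SIZES.items()}
    return {"latency": dict(latency), "sizes": sizes}


if __name__ == "__main__":
    for mode in ("HPMode", "EEMode"):
        print(mode, program_counters(mode))
```

Wakeup logic and the free-register list would read these values on a mode switch, so dependents are scheduled with the right latency and allocation stops at the right capacity.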

Circuit Issues
[Figure: pipeline segment with Logic Stage 1 (with level conversion), Storage Stage 2a, Storage Stage 2b, and Logic Stage 3. Supply labels: V_nom in HPMode; V_logic and V_op in EEMode. Inset: a level-converting flip-flop with signals d, ck, clk, and q (inv), operating between V_logic and V_op.]
Reference: F. Ishihara, F. Sheikh, and B. Nikolic, "Level Conversion for Dual-Supply Systems," IEEE Transactions on VLSI Systems, February 2004.

Evaluation Methodology
- 16 OOO cores
- 64 KB I-L1/D-L1, 1 MB L2
- Voltages and frequencies
  - HPMode
    - V_nom = 0.90 V, f = 3.5 GHz
  - EEMode
    - V_logic = 0.50 V, f = 600 MHz
    - V_op = 0.65 V, f = 600 MHz
  - V_min = 0.6 V, f = 900 MHz
[Figure: frequency (MHz, log scale) versus V_dd with f_nom ~ 3500, f_op ~ 1200, f_min ~ 900, and f_logic ~ 600 marked at V_nom, V_op, V_min, and V_logic]
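To get a rough feel for these operating points, the standard first-order relation for dynamic power, P proportional to C * V^2 * f, can be applied to the listed voltages and frequencies. This is a textbook approximation and not the power model used in the evaluation: it assumes equal switched capacitance and ignores leakage. The dictionary labels are illustrative.

```python
def relative_dynamic_power(v: float, f_mhz: float,
                           v_ref: float = 0.90,
                           f_ref_mhz: float = 3500.0) -> float:
    """Dynamic power relative to the reference point, using P ~ C * V^2 * f.

    Assumes equal switched capacitance and ignores leakage power.
    """
    return (v / v_ref) ** 2 * (f_mhz / f_ref_mhz)


# (voltage in V, frequency in MHz) as listed on the slide.
OPERATING_POINTS = {
    "HPMode (V_nom)": (0.90, 3500.0),
    "DVFS (V_min)": (0.60, 900.0),
    "EEMode logic (V_logic)": (0.50, 600.0),
    "EEMode storage (V_op)": (0.65, 600.0),
}

if __name__ == "__main__":
    for name, (v, f) in OPERATING_POINTS.items():
        print(f"{name:24s} {relative_dynamic_power(v, f):.3f}")
```

The ratio says nothing about energy per task, since lower frequency also lengthens execution time; it only illustrates how strongly the V^2 term rewards lowering the logic supply.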

Configurations
- Baseline:
  - HPRef: conventional HP (3.5 GHz, 0.9 V)
  - DVFS: most aggressive (900 MHz, 0.6 V)
- ScalCore:
  - HPMode: HPRef with penalty (3.3 GHz, 0.9 V)
  - EEMode (600 MHz, 0.5 V, 0.65 V)
    - Pipe2Vdd: separate voltages only
    - SC: fuse the two stages of RF and Allocate; increase the size of the ROB and LSQ structures

Iso-Power Comparison for Parallel Programs
[Figure: bar charts of average execution time and average energy-delay for the evaluated configurations]
- Reduced execution time by 31% and ED by 60% compared to DVFS

Dynamic Workloads
[Figure: execution time, energy, and energy-delay product for Dyn-Baseline, Static-16-SC, and Dyn-SC]
- Dyn-SC reduces execution time by 31%, energy by 22%, and ED by 46% compared to conventional DVFS
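The three dynamic-workload numbers are related, because the energy-delay product is energy multiplied by execution time. The short check below shows how a 31% time reduction and a 22% energy reduction combine; the function name is illustrative.

```python
def ed_reduction(time_reduction: float, energy_reduction: float) -> float:
    """Fractional reduction in energy-delay product (ED = energy * delay)."""
    remaining_time = 1.0 - time_reduction
    remaining_energy = 1.0 - energy_reduction
    return 1.0 - remaining_time * remaining_energy


if __name__ == "__main__":
    # Dyn-SC versus conventional DVFS, using the figures from the slide.
    print(f"ED reduction: {ed_reduction(0.31, 0.22):.1%}")
```

The result, 1 - 0.69 * 0.78 = 0.4618, rounds to the 46% quoted on the slide.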

Also in the Paper
- ScalCore complexities and overheads
- Comparison with Intel Claremont
- Impact on different classes of applications
- More results
  - Unconstrained power budget
  - Intermediate design points of ScalCore

Conclusion
- Presented ScalCore, a core designed for voltage scalability
- Designed a voltage-scalable core by
  - Decoupling V_dd of logic and storage structures in the pipeline
  - Raising V_dd of storage structures a little
  - Reconfiguring the pipeline to take advantage of faster storage structures
- Improved performance and energy


Why not Big/Little?
- Heterogeneous cores on a chip, ideally suited for different tasks
- Fixed partitioning of cores
- A fraction of the chip is unused
- Migration overhead

Configurations
- Baseline:
  - HPRef: conventional HP (3.5 GHz, 0.9 V)
  - DVFS: most aggressive (900 MHz, 0.6 V)
- ScalCore:
  - HPMode: HPRef with penalty (3.3 GHz, 0.9 V)
  - EEMode (600 MHz, 0.5 V, 0.65 V)
    - Pipe2Vdd: separate voltages only
    - SCspeed: fuse the two stages of RF, LSU, and Allocate
    - SC: fuse the two stages of RF and Allocate; increase the size of the ROB and LSQ structures