
SCALCORE: DESIGNING A CORE FOR VOLTAGE SCALABILITY
Bhargava Gopireddy, Choungki Song, Josep Torrellas, Nam Sung Kim, Aditya Agrawal, Asit Mishra
University of Illinois, University of Wisconsin, Nvidia, Intel

Motivation
- Application characteristics vary dynamically
- Goal: a single core design that attains
    - High performance at nominal voltage (~0.9 V)
    - High energy efficiency at low voltage (~0.5 V)
- A voltage-scalable core

Observations
- SRAM vs. logic delay scaling
- A small increase in Vdd gives a large improvement in delay
[Figure: normalized delay (0 to 12) of SRAM and logic vs. Vdd (0.9 V down to 0.45 V), with Vmin marked and a ~2x annotation]

ScalCore Idea
- Design a voltage-scalable core based on the two observations
[Figure: SRAM and logic frequency (MHz, log scale, 100 to 10000) vs. Vdd (0.9 V down to 0.45 V). Marked frequencies: fnom ~ 3500, fop ~ 1200, fmin ~ 900, flogic ~ 600. Marked voltages: Vnom, Vop, Vmin, Vlogic]

ScalCore: A Voltage-Scalable Core
- Decouple Vdd of logic and storage structures in the pipeline
    - Improve energy efficiency
- Raise Vdd of storage structures a little
- Reconfigure the pipeline to take advantage of faster storage structures
    - Further improve performance and reduce leakage energy
- Evaluated the proposed design under various scenarios
    - Reduce execution time by 31%, energy by 48% and ED by 60% over the state of the art
- Truly voltage-scalable core

ScalCore Design
- Goal: maximize energy efficiency at low voltage (EEMode)
- Constraint: no impact on performance or energy at nominal voltage (HPMode)
- Approach: in EEMode,
    - Provide different Vdds for storage and logic stages
        - Storage stage is ~2x faster than logic stage
    - Reconfigure the pipeline in one of two ways
        - Fuse storage stages in the pipeline (e.g., accessing the register file)
        - Increase storage structure sizes

Fusing Two Pipeline Stages into One
[Figure: pipeline segment with Logic Stage 1, Storage Stage 2a, Storage Stage 2b and Logic Stage 3, all clocked by CLK]
- HPMode: all four stages at Vnom; flow-through Enable = 0, so 2a and 2b remain separate stages
- EEMode: Logic Stages 1 and 3 at Vlogic; Storage Stages 2a and 2b at Vop; flow-through Enable = 1, so 2a and 2b act as one stage
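
The timing argument behind fusing can be checked with a short sketch. It assumes the approximate frequencies marked on the ScalCore Idea plot (about 1200 MHz for storage at Vop, about 600 MHz for logic at Vlogic) and ignores latch and level-conversion overheads. The function names are illustrative and do not come from the slides.

```python
# Approximate frequencies (MHz) as marked on the frequency-vs-Vdd plot.
F_LOGIC_MHZ = 600   # logic stages at Vlogic
F_OP_MHZ = 1200     # storage stages at Vop


def stage_delay_ns(freq_mhz):
    """Delay of one pipeline stage that just meets timing at freq_mhz."""
    return 1000.0 / freq_mhz


def fused_stages_fit(n_storage_stages, f_storage_mhz, f_clock_mhz):
    """True if n back-to-back storage stages fit in one clock period."""
    total_ns = n_storage_stages * stage_delay_ns(f_storage_mhz)
    # Small tolerance so an exact fit is not rejected by rounding error.
    return total_ns <= stage_delay_ns(f_clock_mhz) + 1e-9


if __name__ == "__main__":
    print(fused_stages_fit(2, F_OP_MHZ, F_LOGIC_MHZ))
```

Under these assumptions two storage stages fit in one 600 MHz cycle, while a third would not, which matches the "~2x faster" observation.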

Increasing Size of Structures
[Figure: three array organizations, each with a decoder, sense amplifiers and data select]
- Original: Array 0 at Vnom
- HPMode: Array 0 at Vnom; Array 1 disabled
- EEMode: Array 0 and Array 1 both at Vop
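
A minimal sketch of the two-array organization, assuming a simple split where low indices map to Array 0 and the remainder to Array 1. The class, its index mapping and the entry counts in the usage lines are hypothetical; the slides only state that Array 1 is disabled in HPMode and enabled in EEMode.

```python
class BankedArray:
    """Storage structure with a base bank (Array 0) and an extra bank (Array 1)."""

    def __init__(self, base_entries, extra_entries):
        self.base_entries = base_entries
        self.extra_entries = extra_entries
        self.ee_mode = False  # HPMode by default: Array 1 is disabled

    def set_ee_mode(self, enabled):
        """Switch between HPMode (False) and EEMode (True)."""
        self.ee_mode = enabled

    def capacity(self):
        """Usable entries in the current mode."""
        extra = self.extra_entries if self.ee_mode else 0
        return self.base_entries + extra

    def bank_of(self, index):
        """Return 0 or 1: the array that holds the given entry."""
        if not 0 <= index < self.capacity():
            raise IndexError("entry not available in this mode")
        return 0 if index < self.base_entries else 1


if __name__ == "__main__":
    rob = BankedArray(base_entries=128, extra_entries=64)  # hypothetical sizes
    rob.set_ee_mode(True)
    print(rob.capacity())
```

With a hypothetical 128-entry base and 64 extra entries, EEMode exposes 1.5x the HPMode capacity.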

ScalCore Pipeline
[Figure: pipeline diagram]
- Stages: Fetch, Decode, Rename, Dispatch, Wakeup, Select, Data Read, Source Drive, Ex, Ex, Write back
- Blocks shown under "Main storage structures": L1-I, Branch Pred., Decode, Allocate, Register File, EX, LSU, L1-D, Next PC, ROB/Completion Table

Pipeline Structure Changes

Structure          | Fuse Stages                                | Increase Size
Register File      | Array access + Source drive                | 1.5X PRF
Allocation         | Rename + Dispatch                          | ---
Load Store Unit    | Addr. generation + Memory disambiguation   | 1.5X Load/Store Queue and Store buffer
Reorder Buffer     | ---                                        | 1.5X ROB
Branch Prediction  | ---                                        | ---

Controlpath Changes
- Programmable counters for variable latencies
    - Execution schedules
    - Wakeup logic
- Programmable counters for variable sizes
    - List of free registers
    - ROB, Load/Store queue sizes
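
One way to picture the programmable counters is as a small per-mode record that the control path consults. The sketch below is an assumption-laden illustration: the field names and base sizes are hypothetical, while the 2-to-1 cycle change for fused stages and the 1.5X sizes follow the Pipeline Structure Changes table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeCounters:
    """Values loaded into the programmable counters for one mode."""
    rf_read_cycles: int   # array access + source drive
    alloc_cycles: int     # rename + dispatch
    rob_entries: int
    lsq_entries: int


BASE_ROB = 128  # hypothetical base size, not from the slides
BASE_LSQ = 64   # hypothetical base size, not from the slides

# HPMode: unfused stages, base sizes.
HP_MODE = ModeCounters(2, 2, BASE_ROB, BASE_LSQ)
# EEMode: fused stages take one cycle, structures grow to 1.5X.
EE_MODE = ModeCounters(1, 1, BASE_ROB * 3 // 2, BASE_LSQ * 3 // 2)


def rob_has_space(occupied, counters):
    """Dispatch may proceed only while the ROB is below its programmed size."""
    return occupied < counters.rob_entries


if __name__ == "__main__":
    print(rob_has_space(150, HP_MODE), rob_has_space(150, EE_MODE))
```

Keeping the limits in one record means a mode switch only reloads counter values; the wakeup and allocation logic itself stays unchanged.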

Circuit Issues
[Figure: pipeline segment (Logic Stage 1 with level conversion, Storage Stage 2a, Storage Stage 2b, Logic Stage 3) with mode and supply labels (HPMode, EEMode; Vnom, Vop, Vlogic), and a level-converting flip-flop schematic with signals d, ck, q (inv), clk and supplies Vlogic and Vop]
F. Ishihara, F. Sheikh, and B. Nikolic, "Level Conversion for Dual-Supply Systems," IEEE Transactions on VLSI Systems, February 2004

Evaluation Methodology
- 16 OOO cores; 64 KB I-L1/D-L1, 1 MB L2
- Voltages and frequencies
    - HPMode: Vnom = 0.90 V, f = 3.5 GHz
    - EEMode: Vlogic = 0.50 V, f = 600 MHz; Vop = 0.65 V, f = 600 MHz
    - Vmin = 0.6 V, f = 900 MHz
[Figure: frequency (MHz, log scale, 100 to 10000) vs. Vdd (0.9 V down to 0.45 V). Marked frequencies: fnom ~ 3500, fop ~ 1200, fmin ~ 900, flogic ~ 600. Marked voltages: Vnom, Vop, Vmin, Vlogic]

Configurations
- Baseline
    - HPRef: conventional HP (3.5 GHz, 0.9 V)
    - DVFS: most aggressive (900 MHz, 0.6 V)
- ScalCore
    - HPMode: HPRef with penalty (3.3 GHz, 0.9 V)
    - EEMode (600 MHz, 0.5 V, 0.65 V)
        - Pipe2Vdd: separate voltages only
        - SC: fuse the two stages of RF, Allocate; increase size of ROB, LSQ structures

Iso-Power Comparison for Parallel Programs
[Figure: two bar charts, Average Execution Time and Average Energy-Delay, y-axes 0 to 1.4; off-scale bar labels 2.6, 2.7, 7.0, 7.5 and 1.9; annotations 31%, 60%, 15% and 23%]
- Reduced execution time by 31% and ED by 60% compared to DVFS

Dynamic Workloads
[Figure: bar chart of Execution Time, Energy and Energy-Delay Product (y-axis 0 to 1.8) for Dyn-Baseline, Static-16-SC and Dyn-SC; annotations 31%, 28%, 19%, 22%, 46% and 15%]
- Dyn-SC reduces execution time by 31%, energy by 22% and ED by 46% compared to conventional DVFS
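
The three Dyn-SC figures are mutually consistent if ED is taken as the product of energy and execution time. The sketch below checks that arithmetic; the function name is illustrative, and averaged results need not compose this exactly in general.

```python
def ed_reduction(time_reduction, energy_reduction):
    """Fractional reduction in energy-delay product, with ED = energy * delay."""
    remaining = (1.0 - time_reduction) * (1.0 - energy_reduction)
    return 1.0 - remaining


if __name__ == "__main__":
    # 31% less time and 22% less energy: 0.69 * 0.78 = 0.5382 of the baseline ED.
    print(round(ed_reduction(0.31, 0.22) * 100))  # 46
```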

Also in the Paper
- ScalCore complexities and overheads
- Comparison with Intel Claremont
- Impact on different classes of applications
- More results
    - Unconstrained power budget
    - Intermediate design points of ScalCore

Conclusion
- Presented ScalCore, a core designed for voltage scalability
- Designed a voltage-scalable core by
    - Decoupling Vdd of logic and storage structures in the pipeline
    - Raising Vdd of storage structures a little
    - Reconfiguring the pipeline to take advantage of faster storage structures
- Improved performance and energy

Why not Big/Little?
- Heterogeneous cores on a chip, ideally suited for different tasks
- Fixed partitioning of cores
- A fraction of the chip is unused
- Migration overhead

Configurations
- Baseline
    - HPRef: conventional HP (3.5 GHz, 0.9 V)
    - DVFS: most aggressive (900 MHz, 0.6 V)
- ScalCore
    - HPMode: HPRef with penalty (3.3 GHz, 0.9 V)
    - EEMode (600 MHz, 0.5 V, 0.65 V)
        - Pipe2Vdd: separate voltages only
        - SCspeed: fuse the two stages of RF, LSU, Allocate
        - SC: fuse the two stages of RF, Allocate; increase size of ROB, LSQ structures