SCALCORE: DESIGNING A CORE FOR VOLTAGE SCALABILITY

1 SCALCORE: DESIGNING A CORE FOR VOLTAGE SCALABILITY
Bhargava Gopireddy, Choungki Song, Josep Torrellas, Nam Sung Kim, Aditya Agrawal, Asit Mishra
University of Illinois, University of Wisconsin, Nvidia, Intel

2 Motivation
Application characteristics vary dynamically.
Goal: a single core design that attains
- High performance at nominal voltage (~0.9 V)
- High energy efficiency at low voltage (~0.5 V)
A voltage-scalable core

3 Observations
- SRAM vs. logic delay scaling
- Small increase in V -> large improvement in delay
[Figure: normalized delay vs. V dd (V) for SRAM delay and logic delay, with V min marked and a ~2x annotation]

4 ScalCore Idea
Design a voltage-scalable core based on the two observations.
[Figure: frequency (MHz) vs. V dd (V) for SRAM-Freq and Logic-Freq. Labeled points: f nom ~ 3500, f op ~ 1200, f min ~ 900, f logic ~ 600; voltages V nom, V op, V min, V logic]

5 ScalCore: A Voltage-Scalable Core
- Decouple V dd of logic and storage structures in the pipeline
- Improve energy efficiency
- Raise V dd of storage structures a little
- Reconfigure the pipeline to take advantage of faster storage structures
- Further improve performance and reduce leakage energy
- Evaluated the proposed design under various scenarios
- Reduce execution time by 31%, energy by 48%, and ED by 60% over the state of the art
A truly voltage-scalable core

6 ScalCore Design
Goal: Maximize energy efficiency at low voltage (EEMode)
Constraint: No impact on performance or energy at nominal voltage (HPMode)
Approach: In EEMode,
- Provide different V dd values for storage and logic stages
  - Storage stage is ~2x faster than logic stage
- Reconfigure the pipeline in one of two ways
  - Fuse storage stages in the pipeline (e.g., accessing the register file)
  - Increase storage structure sizes

7 Fusing Two Pipeline Stages into One
[Figure: HPMode pipeline. Logic Stage 1, Storage Stage 2a, Storage Stage 2b, and Logic Stage 3 all run at V nom from CLK; the flow-through enable between 2a and 2b is 0]

8 Fusing Two Pipeline Stages into One
[Figure, HPMode: Logic Stage 1, Storage Stage 2a, Storage Stage 2b, and Logic Stage 3 all run at V nom from CLK; flow-through enable is 0]
[Figure, EEMode: Logic Stage 1 and Logic Stage 3 run at V logic, Storage Stages 2a and 2b run at V op; flow-through enable is 1, so 2a and 2b act as one stage]
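The timing argument behind fusing can be checked with a few lines of arithmetic. The sketch below assumes the approximate frequencies labeled on the ScalCore Idea slide (about 600 MHz for logic at V logic, about 1200 MHz for storage at V op) and, as a simplification, treats a stage's delay as one clock period at its maximum frequency. The function names are illustrative and do not come from the slides.

```python
# Approximate maximum frequencies (MHz) taken from the ScalCore Idea slide.
F_LOGIC_MHZ = 600.0   # logic stages at V_logic
F_OP_MHZ = 1200.0     # storage stages at V_op


def stage_delay_ns(f_max_mhz: float) -> float:
    """Assumed stage delay: one period at the stage's maximum frequency."""
    return 1000.0 / f_max_mhz


def can_fuse(n_storage_stages: int, clock_mhz: float, f_storage_mhz: float) -> bool:
    """True if n back-to-back storage stages fit within one clock period."""
    period_ns = 1000.0 / clock_mhz
    total_ns = n_storage_stages * stage_delay_ns(f_storage_mhz)
    return total_ns <= period_ns + 1e-9  # small tolerance for float rounding


# EEMode: the clock is set by the slower logic stages.
print(can_fuse(2, F_LOGIC_MHZ, F_OP_MHZ))
# HPMode: storage is no faster than logic, so two stages do not fit.
print(can_fuse(2, 3500.0, 3500.0))
```

Under these assumptions two storage stages fit in one EEMode cycle because storage is about twice as fast as logic, which is the condition the flow-through enable relies on.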

9 Increasing Size of Structures
[Figure: three versions of a storage structure, each with a decoder and data select.
 Original: Array 0 at V nom with one sense amp.
 HPMode: Array 0 at V nom, Array 1 disabled.
 EEMode: Array 0 and Array 1 both at V op, with Sense Amp 0 and Sense Amp 1]
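The resizing scheme can be modeled as a structure with one array that is always on and a second array that is enabled only in EEMode. This is a minimal sketch; the class name and the entry counts are hypothetical, chosen only so that EEMode capacity is 1.5x the HPMode capacity, the ratio listed on the Pipeline Structure Changes slide.

```python
from dataclasses import dataclass


@dataclass
class BankedStructure:
    """Storage structure split into a base array and an extra array."""
    array0_entries: int  # always enabled
    array1_entries: int  # enabled only in EEMode

    def usable_entries(self, mode: str) -> int:
        """Entries visible to the pipeline in the given mode."""
        if mode == "HPMode":
            return self.array0_entries
        if mode == "EEMode":
            return self.array0_entries + self.array1_entries
        raise ValueError(f"unknown mode: {mode}")


# Hypothetical sizes: 128 base entries plus 64 extra gives 1.5x in EEMode.
prf = BankedStructure(array0_entries=128, array1_entries=64)
print(prf.usable_entries("HPMode"), prf.usable_entries("EEMode"))
```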

10 ScalCore Pipeline
[Figure: pipeline diagram.
 Stages: Fetch, Decode, Rename, Dispatch, Wakeup, Select, Data Read, Source Drive, Ex, Ex, Write back.
 Main storage structures and blocks shown: L1-I, Branch Pred., Next PC, Decode, Allocate, Register File, EX, LSU, L1-D, ROB/Completion Table]

11 Pipeline Structure Changes
Structure          Fuse Stages                                 Increase Size
Register File      Array access + Source drive                 1.5X PRF
Allocation         Rename + Dispatch                           ---
Load Store Unit    Addr. generation + Memory disambiguation    1.5X Load/Store Queue and Store buffer
Reorder Buffer                                                 X ROB
Branch Prediction

12 Control Path Changes
- Programmable counters for variable latencies
  - Execution schedules
  - Wakeup logic
- Programmable counters for variable sizes
  - List of free registers
  - ROB, Load/Store queue sizes
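A programmable latency counter can be pictured as a register that the mode switch rewrites and that the wakeup logic reads when scheduling dependents. The sketch below is illustrative only: the class, the function, and the cycle counts (two cycles for a register read in HPMode, one when array access and source drive are fused in EEMode) are assumptions, not details taken from the slides.

```python
class ProgrammableLatency:
    """Latency register reprogrammed on a mode switch (hypothetical model)."""

    def __init__(self, hp_cycles: int, ee_cycles: int) -> None:
        self._table = {"HPMode": hp_cycles, "EEMode": ee_cycles}
        self.cycles = hp_cycles  # assume the core starts in HPMode

    def set_mode(self, mode: str) -> None:
        """Load the latency for the new mode; raises KeyError if unknown."""
        self.cycles = self._table[mode]


def wakeup_cycle(issue_cycle: int, latency: ProgrammableLatency) -> int:
    """Cycle at which dependents of an issued instruction may wake up."""
    return issue_cycle + latency.cycles


# Assumed register-read latency: 2 stages in HPMode, 1 fused stage in EEMode.
rf_read = ProgrammableLatency(hp_cycles=2, ee_cycles=1)
print(wakeup_cycle(10, rf_read))
rf_read.set_mode("EEMode")
print(wakeup_cycle(10, rf_read))
```

Keeping the latency in one programmable register means the wakeup and scheduling logic itself does not change between modes; only the value it reads does.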

13 Circuit Issues
[Figure: pipeline of Logic Stage 1 w/ level conversion, Storage Stage 2a, Storage Stage 2b, and Logic Stage 3; supplies labeled V nom for HPMode and V logic, V op for EEMode]
F. Ishihara, F. Sheikh, and B. Nikolic, "Level Conversion for Dual-Supply Systems," IEEE Transactions on VLSI Systems, February 2004

14 Circuit Issues
[Figure: same pipeline as the previous slide, plus a level-converting flip-flop circuit with signals d, ck, q, and clk and supplies V logic and V op]
F. Ishihara, F. Sheikh, and B. Nikolic, "Level Conversion for Dual-Supply Systems," IEEE Transactions on VLSI Systems, February 2004

15 Evaluation Methodology
OOO cores
64K I-L1/D-L1, 1MB L2
Voltages and frequencies
HPMode
- V nom = 0.90 V, f = 3.5 GHz
EEMode
- V logic = 0.50 V, f = 600 MHz
- V op = 0.65 V, f = 600 MHz
- V min = 0.6 V, f = 900 MHz
[Figure: frequency (MHz) vs. V dd (V), with f nom ~ 3500, f op ~ 1200, f min ~ 900, f logic ~ 600 and V nom, V op, V min, V logic marked]

16 Configurations
Baseline:
- HPRef: Conventional HP (3.5 GHz, 0.9 V)
- DVFS: Most aggressive (900 MHz, 0.6 V)
ScalCore:
- HPMode: HPRef with penalty (3.3 GHz, 0.9 V)
- EEMode (600 MHz, 0.5 V, 0.65 V)
  - Pipe2Vdd: Separate voltages only
  - SC: Fuse the two stages of RF, Allocate; increase size of ROB, LSQ structures
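The configurations above can be written down as data, which makes it easy to see the HPMode frequency penalty relative to HPRef. This is a sketch: the `Config` type and field names are illustrative, and the split of the EEMode voltages into 0.5 V for logic and 0.65 V for storage follows the V logic and V op values on the Evaluation Methodology slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    name: str
    freq_mhz: float
    vdd_logic: float    # volts
    vdd_storage: float  # volts


CONFIGS = [
    Config("HPRef", 3500.0, 0.90, 0.90),
    Config("DVFS", 900.0, 0.60, 0.60),
    Config("ScalCore-HPMode", 3300.0, 0.90, 0.90),
    Config("ScalCore-EEMode", 600.0, 0.50, 0.65),
]


def relative_frequency(cfg: Config, ref: Config) -> float:
    """Clock frequency of cfg as a fraction of the reference frequency."""
    return cfg.freq_mhz / ref.freq_mhz


hpref = CONFIGS[0]
for cfg in CONFIGS:
    print(f"{cfg.name:16s} {relative_frequency(cfg, hpref):.3f}")
```

With these numbers, ScalCore's HPMode runs at 3300/3500 of the HPRef clock, a frequency penalty of a little under 6%.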

17 Iso-Power Comparison for Parallel Programs
[Figure: average execution time and average energy-delay product]
Reduced execution time by 31% and ED by 60% compared to DVFS.

18 Dynamic Workloads
[Figure: execution time, energy, and energy-delay product for Dyn-Baseline, Static-16-SC, and Dyn-SC]
Dyn-SC reduces execution time by 31%, energy by 22%, and ED by 46% compared to conventional DVFS.
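The three Dyn-SC figures are mutually consistent if energy-delay product is taken as energy times execution time. The sketch below is only a sanity check on the reported averages: an average of per-workload products need not equal the product of the averages, so exact agreement is not guaranteed in general.

```python
def ed_reduction(time_reduction: float, energy_reduction: float) -> float:
    """ED reduction implied by fractional reductions in time and energy."""
    remaining = (1.0 - time_reduction) * (1.0 - energy_reduction)
    return 1.0 - remaining


# Dyn-SC vs. conventional DVFS: 31% less time, 22% less energy.
print(round(ed_reduction(0.31, 0.22) * 100))  # 46
```

Here 0.69 x 0.78 = 0.5382 of the baseline ED remains, a reduction of about 46%, which matches the figure on the slide.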

19 Also in the Paper
- ScalCore complexities and overheads
- Comparison with Intel Claremont
- Impact on different classes of applications
- More results
  - Unconstrained power budget
  - Intermediate design points of ScalCore

20 Conclusion
Presented ScalCore, a core designed for voltage scalability.
Designed a voltage-scalable core by
- Decoupling V dd of logic and storage structures in the pipeline
- Raising V dd of storage structures a little
- Reconfiguring the pipeline to take advantage of faster storage structures
Improved performance and energy.

22 Why not Big/Little?
Heterogeneous cores on a chip, ideally suited for different tasks
- Fixed partitioning of cores
- A fraction of chip unused
- Migration overhead

23 Configurations
Baseline:
- HPRef: Conventional HP (3.5 GHz, 0.9 V)
- DVFS: Most aggressive (900 MHz, 0.6 V)
ScalCore:
- HPMode: HPRef with penalty (3.3 GHz, 0.9 V)
- EEMode (600 MHz, 0.5 V, 0.65 V)
  - Pipe2Vdd: Separate voltages only
  - SCspeed: Fuse the two stages of RF, LSU, Allocate
  - SC: Fuse the two stages of RF, Allocate; increase size of ROB, LSQ structures
