SCALCORE: DESIGNING A CORE

Size: px
Start display at page:

Download "SCALCORE: DESIGNING A CORE"

Transcription

1 SCALCORE: DESIGNING A CORE FOR VOLTAGE SCALABILITY Bhargava Gopireddy, Choungki Song, Josep Torrellas, Nam Sung Kim, Aditya Agrawal, Asit Mishra University of Illinois, University of Wisconsin, Nvidia, Intel

2 Motivation 2 Application characteristics vary dynamically Goal: Single core design that attains High performance at nominal voltage (~0.9V) High energy efficiency at low voltage (~0.5V) A Voltage-Scalable Core

3 Observations 3 SRAM vs Logic delay scaling Small increase in V à large improvement in delay Normalized Delay SRAMDelay LogicDelay V min V dd (V) ~2x

4 ScalCore Idea 4 Design a Voltage-Scalable core based on the two observations SRAM-Freq Logic-Freq f nom ~ 3500 Frequency (MHz) 1000 f op ~ 1200 f min ~ 900 f logic ~ 600 V nom V op V min V logic V dd (V)

5 ScalCore: A Voltage-Scalable Core 5 Decouple V dd of logic and storage structures in the pipeline Improve energy efficiency Raise V dd of storage structures a little Reconfigure the pipeline to take advantage of faster storage structures Further improve performance and reduce leakage energy!! Evaluated the proposed design under various scenarios Reduce execution time by 31%, energy by 48% and ED by 60% over the state of the art Truly Voltage Scalable Core!!

6 ScalCore Design 6 Goal: Maximize the energy-efficiency at low-voltage (EEMode) Constraint: No impact on performance or energy at nominal voltage (HPMode) Approach: In EEMode, Provide different V dd s for storage and logic stages n Storage stage ~2x faster than logic stage Reconfigure pipeline in one of the two ways n Fuse storage stages in the pipeline (e.g. Accessing Register File) n Increase storage structure sizes

7 Fusing Two Pipeline Stages into One 7 Logic Stage 1 Storage Stage 2a Storage Stage 2b Logic Stage 3 HPMode V nom V nom V nom V nom CLK

8 Fusing Two Pipeline Stages into One 8 Logic Stage 1 Storage Stage 2a Storage Stage 2b Logic Stage 3 HPMode CLK V nom V nom V nom V nom 0 Enable Flow-through Logic Stage 1 2a 2b Logic Stage 3 EEMode CLK V logic V op Enable 1 Flow-through V op V logic

9 Increasing Size of Structures 9 Original Decoder HPMode Decoder EEMode Decoder Array 0 V nom Array 0 V nom Array 1 Disabled Array 0 V op Array 1 V op Sense Amp Sense Amp 1 Sense Amp 0 Sense Amp 0 Sense Amp 1 Data Select Data Select Data Select

10 ScalCore Pipeline 10 Fetch Decode Rename Dispatch Wakeup Select Data Read Source Drive Ex Ex Write back Main storage structures L1-I Branch Pred. Decode Allocate Register File Register File EX EX EX LSU L1-D Next PC ROB/ Completion Table

11 Pipeline Structure Changes 11 Fuse Stages Increase Size Register File Array access + Source drive 1.5X PRF Allocation Rename + Dispatch --- Load Store Unit Addr. generation + Memory disambiguation 1.5X Load/Store Queue and Store buffer Reorder Buffer X ROB Branch Prediction

12 Controlpath Changes 12 Programmable counters for variable latencies Execution schedules Wakeup logic Programmable counters for variable sizes List of free registers ROB, Load/Store queue sizes

13 Circuit Issues 13 HPMode V op V nom EEMode V logic Logic Stage 1 w/ Level Conv. Storage Stage 2a Storage Stage 2b Logic Stage 3 F.Ishihara, F.Sheikh, and B.Nikolic, Level Conversion for Dual-Supply Systems, IEEE Transactions on VLSI Systems, February 2004

14 Circuit Issues 14 HPMode V op V nom EEMode V logic Logic Stage 1 w/ Level Conv. Storage Stage 2a Storage Stage 2b Logic Stage 3 d ck q (inv) clk ck V logic V op F.Ishihara, F.Sheikh, and B.Nikolic, Level Conversion for Dual-Supply Systems, IEEE Transactions on VLSI Systems, February 2004

15 Evaluation Methodology OOO cores 64K I-L1/D-L1, 1MB L2 Voltages and frequencies HPMode n V nom = 0.90 V, f=3.5ghz f nom ~ 3500 EEMode n V logic = 0.50 V, f=600mhz n V op = 0.65 V, f=600mhz Frequency (MHz) 1000 f op ~ 1200 f min ~ 900 f logic ~ 600 n V min = 0.6V, f=900mhz V nom V op V min V logic V dd (V)

16 Configurations 16 Baseline: HPRef: Conventional HP (3.5GHz,0.9V) DVFS: Most aggressive (900MHz,0.6V) ScalCore: HPMode: HPRef with penalty (3.3GHz,0.9V) EEMode (600MHz,0.5V,0.65V) n Pipe2Vdd: Separate voltages only n SC: Fuse the two stages of RF, Allocate ; Increase size ROB, LSQ structures

17 Iso-Power Comparison for Parallel Programs Average Execution Time Average Energy-Delay % 60% 15% % Reduced execution time by 31%; ED by 60% compared to DVFS

18 Dynamic Workloads % 28% 19% 22% Execution Time Energy Energy-Delay Product 46% 15% Dyn-Baseline Static-16-SC Dyn-SC Dyn-SC reduces execution time by 31%, energy by 22% and ED by 46% compared to conventional DVFS

19 Also in the Paper 19 ScalCore complexities and overheads Comparison with Intel Claremont Impact on different classes of applications More results Unconstrained power budget Intermediate design points of ScalCore

20 Conclusion 20 Presented ScalCore, a core designed for voltage scalability Designed a voltage-scalable core by Decoupling V dd of logic and storage structures in the pipeline Raising V dd of storage structures a little Reconfiguring the pipeline to take advantage of faster storage structures and improved performance, energy

21 SCALCORE: DESIGNING A CORE FOR VOLTAGE SCALABILITY Bhargava Gopireddy, Choungki Song, Josep Torrellas, Nam Sung Kim, Aditya Agrawal, Asit Mishra University of Illinois, University of Wisconsin, Nvidia, Intel

22 Why not Big/Little? 22 Heterogeneous cores on a chip ideally suited for different tasks Fixed partitioning of cores A fraction of chip unused Migration overhead

23 Configurations 23 Baseline: HPRef: Conventional HP (3.5GHz,0.9V) DVFS: Most aggressive (900MHz,0.6V) ScalCore: HPMode: HPRef with penalty (3.3GHz,0.9V) EEMode (600MHz,0.5V,0.65V) n Pipe2Vdd: Separate voltages only n SCspeed: Fuse the two stages of RF, LSU, Allocate n SC: Fuse the two stages of RF, Allocate Increase size ROB, LSQ structures

Project 5: Optimizer Jason Ansel

Project 5: Optimizer Jason Ansel Project 5: Optimizer Jason Ansel Overview Project guidelines Benchmarking Library OoO CPUs Project Guidelines Use optimizations from lectures as your arsenal If you decide to implement one, look at Whale

More information

Architectural Core Salvaging in a Multi-Core Processor for Hard-Error Tolerance

Architectural Core Salvaging in a Multi-Core Processor for Hard-Error Tolerance Architectural Core Salvaging in a Multi-Core Processor for Hard-Error Tolerance Michael D. Powell, Arijit Biswas, Shantanu Gupta, and Shubu Mukherjee SPEARS Group, Intel Massachusetts EECS, University

More information

Dynamic Scheduling II

Dynamic Scheduling II so far: dynamic scheduling (out-of-order execution) Scoreboard omasulo s algorithm register renaming: removing artificial dependences (WAR/WAW) now: out-of-order execution + precise state advanced topic:

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Out-of-Order Schedulers Data-Capture Scheduler Dispatch: read available operands from ARF/ROB, store in scheduler Commit: Missing operands filled in from bypass Issue: When

More information

Multiple Clock and Voltage Domains for Chip Multi Processors

Multiple Clock and Voltage Domains for Chip Multi Processors Multiple Clock and Voltage Domains for Chip Multi Processors Efraim Rotem- Intel Corporation Israel Avi Mendelson- Microsoft R&D Israel Ran Ginosar- Technion Israel institute of Technology Uri Weiser-

More information

On the Rules of Low-Power Design

On the Rules of Low-Power Design On the Rules of Low-Power Design (and Why You Should Break Them) Prof. Todd Austin University of Michigan austin@umich.edu A long time ago, in a not so far away place The Rules of Low-Power Design P =

More information

RISC Central Processing Unit

RISC Central Processing Unit RISC Central Processing Unit Lan-Da Van ( 范倫達 ), Ph. D. Department of Computer Science National Chiao Tung University Taiwan, R.O.C. Spring, 2014 ldvan@cs.nctu.edu.tw http://www.cs.nctu.edu.tw/~ldvan/

More information

Exploring Heterogeneity within a Core for Improved Power Efficiency

Exploring Heterogeneity within a Core for Improved Power Efficiency Computer Engineering Exploring Heterogeneity within a Core for Improved Power Efficiency Sudarshan Srinivasan Nithesh Kurella Israel Koren Sandip Kundu May 2, 215 CE Tech Report # 6 Available at http://www.eng.biu.ac.il/segalla/computer-engineering-tech-reports/

More information

7/11/2012. Single Cycle (Review) CSE 2021: Computer Organization. Multi-Cycle Implementation. Single Cycle with Jump. Pipelining Analogy

7/11/2012. Single Cycle (Review) CSE 2021: Computer Organization. Multi-Cycle Implementation. Single Cycle with Jump. Pipelining Analogy CSE 2021: Computer Organization Single Cycle (Review) Lecture-10 CPU Design : Pipelining-1 Overview, Datapath and control Shakil M. Khan CSE-2021 July-12-2012 2 Single Cycle with Jump Multi-Cycle Implementation

More information

Clock-Powered CMOS: A Hybrid Adiabatic Logic Style for Energy-Efficient Computing

Clock-Powered CMOS: A Hybrid Adiabatic Logic Style for Energy-Efficient Computing Clock-Powered CMOS: A Hybrid Adiabatic Logic Style for Energy-Efficient Computing Nestoras Tzartzanis and Bill Athas nestoras@isiedu, athas@isiedu http://wwwisiedu/acmos Information Sciences Institute

More information

Dynamic Scheduling I

Dynamic Scheduling I basic pipeline started with single, in-order issue, single-cycle operations have extended this basic pipeline with multi-cycle operations multiple issue (superscalar) now: dynamic scheduling (out-of-order

More information

Low-Power Digital CMOS Design: A Survey

Low-Power Digital CMOS Design: A Survey Low-Power Digital CMOS Design: A Survey Krister Landernäs June 4, 2005 Department of Computer Science and Electronics, Mälardalen University Abstract The aim of this document is to provide the reader with

More information

Chapter 4. Pipelining Analogy. The Processor. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop:

Chapter 4. Pipelining Analogy. The Processor. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Chapter 4 The Processor Part II Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup p = 2n/(0.5n + 1.5) 4 =

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Out-of-Order Execution and Register Rename In Search of Parallelism rivial Parallelism is limited What is trivial parallelism? In-order: sequential instructions do not have

More information

DYNAMIC VOLTAGE FREQUENCY SCALING (DVFS) FOR MICROPROCESSORS POWER AND ENERGY REDUCTION

DYNAMIC VOLTAGE FREQUENCY SCALING (DVFS) FOR MICROPROCESSORS POWER AND ENERGY REDUCTION DYNAMIC VOLTAGE FREQUENCY SCALING (DVFS) FOR MICROPROCESSORS POWER AND ENERGY REDUCTION Diary R. Suleiman Muhammed A. Ibrahim Ibrahim I. Hamarash e-mail: diariy@engineer.com e-mail: ibrahimm@itu.edu.tr

More information

EECS 427 Lecture 13: Leakage Power Reduction Readings: 6.4.2, CBF Ch.3. EECS 427 F09 Lecture Reminders

EECS 427 Lecture 13: Leakage Power Reduction Readings: 6.4.2, CBF Ch.3. EECS 427 F09 Lecture Reminders EECS 427 Lecture 13: Leakage Power Reduction Readings: 6.4.2, CBF Ch.3 [Partly adapted from Irwin and Narayanan, and Nikolic] 1 Reminders CAD assignments Please submit CAD5 by tomorrow noon CAD6 is due

More information

EECS 427 Lecture 22: Low and Multiple-Vdd Design

EECS 427 Lecture 22: Low and Multiple-Vdd Design EECS 427 Lecture 22: Low and Multiple-Vdd Design Reading: 11.7.1 EECS 427 W07 Lecture 22 1 Last Time Low power ALUs Glitch power Clock gating Bus recoding The low power design space Dynamic vs static EECS

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Out-of-Order Execution and Register Rename In Search of Parallelism rivial Parallelism is limited What is trivial parallelism? In-order: sequential instructions do not have

More information

Ramon Canal NCD Master MIRI. NCD Master MIRI 1

Ramon Canal NCD Master MIRI. NCD Master MIRI 1 Wattch, Hotspot, Hotleakage, McPAT http://www.eecs.harvard.edu/~dbrooks/wattch-form.html http://lava.cs.virginia.edu/hotspot http://lava.cs.virginia.edu/hotleakage http://www.hpl.hp.com/research/mcpat/

More information

Mitigating Parameter Variation with Dynamic Fine-Grain Body Biasing *

Mitigating Parameter Variation with Dynamic Fine-Grain Body Biasing * Mitigating Parameter Variation with Dynamic Fine-Grain Body Biasing * Radu Teodorescu, Jun Nakano, Abhishek Tiwari and Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu

More information

EECS 470. Tomasulo s Algorithm. Lecture 4 Winter 2018

EECS 470. Tomasulo s Algorithm. Lecture 4 Winter 2018 omasulo s Algorithm Winter 2018 Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, yson, Vijaykumar, and Wenisch of Carnegie Mellon University,

More information

U. Wisconsin CS/ECE 752 Advanced Computer Architecture I

U. Wisconsin CS/ECE 752 Advanced Computer Architecture I U. Wisconsin CS/ECE 752 Advanced Computer Architecture I Prof. Karu Sankaralingam Unit 5: Dynamic Scheduling I Slides developed by Amir Roth of University of Pennsylvania with sources that included University

More information

EECS 470. Lecture 9. MIPS R10000 Case Study. Fall 2018 Jon Beaumont

EECS 470. Lecture 9. MIPS R10000 Case Study. Fall 2018 Jon Beaumont MIPS R10000 Case Study Fall 2018 Jon Beaumont http://www.eecs.umich.edu/courses/eecs470 Multiprocessor SGI Origin Using MIPS R10K Many thanks to Prof. Martin and Roth of University of Pennsylvania for

More information

Low-Power VLSI. Seong-Ook Jung VLSI SYSTEM LAB, YONSEI University School of Electrical & Electronic Engineering

Low-Power VLSI. Seong-Ook Jung VLSI SYSTEM LAB, YONSEI University School of Electrical & Electronic Engineering Low-Power VLSI Seong-Ook Jung 2013. 5. 27. sjung@yonsei.ac.kr VLSI SYSTEM LAB, YONSEI University School of Electrical & Electronic Engineering Contents 1. Introduction 2. Power classification & Power performance

More information

Optimization of Overdrive Signoff

Optimization of Overdrive Signoff Optimization of Overdrive Signoff Tuck-Boon Chan, Andrew B. Kahng, Jiajia Li and Siddhartha Nath VLSI CAD LABORATORY, UC San Diego UC San Diego / VLSI CAD Laboratory -1- Outline Motivation Design Cone

More information

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 10, OCTOBER

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 10, OCTOBER IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 10, OCTOER 2013 1769 Enhancing the Efficiency of Energy-Constrained DVFS Designs Andrew. Kahng, Fellow, IEEE, Seokhyeong Kang,

More information

A GALS Many-Core Heterogeneous DSP Platform with Source-Synchronous On-Chip Interconnection Network

A GALS Many-Core Heterogeneous DSP Platform with Source-Synchronous On-Chip Interconnection Network A GALS Many-Core Heterogeneous DSP Platform with Source-Synchronous On-Chip Interconnection Network Anh Tran, Dean Truong and Bevan Baas University of California, Davis NOCS 09 May 13, 009 Outline Motivation

More information

EECS 470 Lecture 8. P6 µarchitecture. Fall 2018 Jon Beaumont Core 2 Microarchitecture

EECS 470 Lecture 8. P6 µarchitecture. Fall 2018 Jon Beaumont   Core 2 Microarchitecture P6 µarchitecture Fall 2018 Jon Beaumont http://www.eecs.umich.edu/courses/eecs470 Core 2 Microarchitecture Many thanks to Prof. Martin and Roth of University of Pennsylvania for most of these slides. Portions

More information

Supporting x86-64 Address Translation for 100s of GPU Lanes. Jason Power, Mark D. Hill, David A. Wood

Supporting x86-64 Address Translation for 100s of GPU Lanes. Jason Power, Mark D. Hill, David A. Wood Supporting x86-64 Address Translation for 100s of GPU s Jason Power, Mark D. Hill, David A. Wood Summary Challenges: CPU&GPUs physically integrated, but logically separate; This reduces theoretical bandwidth,

More information

EECS 470 Lecture 5. Intro to Dynamic Scheduling (Scoreboarding) Fall 2018 Jon Beaumont

EECS 470 Lecture 5. Intro to Dynamic Scheduling (Scoreboarding) Fall 2018 Jon Beaumont Intro to Dynamic Scheduling (Scoreboarding) Fall 2018 Jon Beaumont http://www.eecs.umich.edu/courses/eecs470 Many thanks to Prof. Martin and Roth of University of Pennsylvania for most of these slides.

More information

A Static Power Model for Architects

A Static Power Model for Architects A Static Power Model for Architects J. Adam Butts and Guri Sohi University of Wisconsin-Madison {butts,sohi}@cs.wisc.edu 33rd International Symposium on Microarchitecture Monterey, California December,

More information

Topics. Low Power Techniques. Based on Penn State CSE477 Lecture Notes 2002 M.J. Irwin and adapted from Digital Integrated Circuits 2002 J.

Topics. Low Power Techniques. Based on Penn State CSE477 Lecture Notes 2002 M.J. Irwin and adapted from Digital Integrated Circuits 2002 J. Topics Low Power Techniques Based on Penn State CSE477 Lecture Notes 2002 M.J. Irwin and adapted from Digital Integrated Circuits 2002 J. Rabaey Review: Energy & Power Equations E = C L V 2 DD P 0 1 +

More information

NTE74S188 Integrated Circuit 256 Bit Open Collector PROM 16 Lead DIP Type Package

NTE74S188 Integrated Circuit 256 Bit Open Collector PROM 16 Lead DIP Type Package NTE74S188 Integrated Circuit 256 Bit Open Collector PROM 16 Lead DIP Type Package Description: The NTE74S188 Schottky PROM memory is organized in the popular 32 words by 8 bits configuration. A memory

More information

CS 110 Computer Architecture Lecture 11: Pipelining

CS 110 Computer Architecture Lecture 11: Pipelining CS 110 Computer Architecture Lecture 11: Pipelining Instructor: Sören Schwertfeger http://shtech.org/courses/ca/ School of Information Science and Technology SIST ShanghaiTech University Slides based on

More information

Power Modeling and Characterization of Computing Devices: A Survey. Contents

Power Modeling and Characterization of Computing Devices: A Survey. Contents Foundations and Trends R in Electronic Design Automation Vol. 6, No. 2 (2012) 121 216 c 2012 S. Reda and A. N. Nowroz DOI: 10.1561/1000000022 Power Modeling and Characterization of Computing Devices: A

More information

Mitigating Parameter Variation with Dynamic Fine-Grain Body Biasing

Mitigating Parameter Variation with Dynamic Fine-Grain Body Biasing Mitigating Parameter Variation with Dynamic Fine-Grain Body Biasing Radu Teodorescu, Jun Nakano, Abhishek Tiwari and Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu

More information

A Novel Latch design for Low Power Applications

A Novel Latch design for Low Power Applications A Novel Latch design for Low Power Applications Abhilasha Deptt. of Electronics and Communication Engg., FET-MITS Lakshmangarh, Rajasthan (India) K. G. Sharma Suresh Gyan Vihar University, Jagatpura, Jaipur,

More information

Chapter 16 - Instruction-Level Parallelism and Superscalar Processors

Chapter 16 - Instruction-Level Parallelism and Superscalar Processors Chapter 16 - Instruction-Level Parallelism and Superscalar Processors Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ L. Tarrataca Chapter 16 - Superscalar Processors 1 / 78 Table of Contents I 1 Overview

More information

Asanovic/Devadas Spring Pipeline Hazards. Krste Asanovic Laboratory for Computer Science M.I.T.

Asanovic/Devadas Spring Pipeline Hazards. Krste Asanovic Laboratory for Computer Science M.I.T. Pipeline Hazards Krste Asanovic Laboratory for Computer Science M.I.T. Pipelined DLX Datapath without interlocks and jumps 31 0x4 RegDst RegWrite inst Inst rs1 rs2 rd1 ws wd rd2 GPRs Imm Ext A B OpSel

More information

Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes

Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes Rachata Ausavarungnirun Joshua Landgraf Vance Miller Saugata Ghose Jayneel Gandhi Christopher J. Rossbach Onur

More information

Design and Implement of Low Power Consumption SRAM Based on Single Port Sense Amplifier in 65 nm

Design and Implement of Low Power Consumption SRAM Based on Single Port Sense Amplifier in 65 nm Journal of Computer and Communications, 2015, 3, 164-168 Published Online November 2015 in SciRes. http://www.scirp.org/journal/jcc http://dx.doi.org/10.4236/jcc.2015.311026 Design and Implement of Low

More information

Pipelined Processor Design

Pipelined Processor Design Pipelined Processor Design COE 38 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Pipelining versus Serial

More information

Energy-Recovery CMOS Design

Energy-Recovery CMOS Design Energy-Recovery CMOS Design Jay Moon, Bill Athas * Univ of Southern California * Apple Computer, Inc. jsmoon@usc.edu / athas@apple.com March 05, 2001 UCLA EE215B jsmoon@usc.edu / athas@apple.com 1 Outline

More information

Instruction Level Parallelism III: Dynamic Scheduling

Instruction Level Parallelism III: Dynamic Scheduling Instruction Level Parallelism III: Dynamic Scheduling Reading: Appendix A (A-67) H&P Chapter 2 Instruction Level Parallelism III: Dynamic Scheduling 1 his Unit: Dynamic Scheduling Application OS Compiler

More information

Low Power Techniques for SoC Design: basic concepts and techniques

Low Power Techniques for SoC Design: basic concepts and techniques Low Power Techniques for SoC Design: basic concepts and techniques Estagiário de Docência M.Sc. Vinícius dos Santos Livramento Prof. Dr. Luiz Cláudio Villar dos Santos Embedded Systems - INE 5439 Federal

More information

Computer Science 246. Advanced Computer Architecture. Spring 2010 Harvard University. Instructor: Prof. David Brooks

Computer Science 246. Advanced Computer Architecture. Spring 2010 Harvard University. Instructor: Prof. David Brooks Advanced Computer Architecture Spring 2010 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture Outline Instruction-Level Parallelism Scoreboarding (A.8) Instruction Level Parallelism

More information

HetCore: TFET-CMOS Hetero-Device Architecture for CPUs and GPUs

HetCore: TFET-CMOS Hetero-Device Architecture for CPUs and GPUs HetCore: -CMOS Hetero-Device Architecture for CPUs and GPUs Bhargava Gopireddy, Dimitrios Skarlatos, Wenjuan Zhu, and Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu

More information

PROBE: Prediction-based Optical Bandwidth Scaling for Energy-efficient NoCs

PROBE: Prediction-based Optical Bandwidth Scaling for Energy-efficient NoCs PROBE: Prediction-based Optical Bandwidth Scaling for Energy-efficient NoCs Li Zhou and Avinash Kodi Technologies for Emerging Computer Architecture Laboratory (TEAL) School of Electrical Engineering and

More information

Lecture Topics. Announcements. Today: Pipelined Processors (P&H ) Next: continued. Milestone #4 (due 2/23) Milestone #5 (due 3/2)

Lecture Topics. Announcements. Today: Pipelined Processors (P&H ) Next: continued. Milestone #4 (due 2/23) Milestone #5 (due 3/2) Lecture Topics Today: Pipelined Processors (P&H 4.5-4.10) Next: continued 1 Announcements Milestone #4 (due 2/23) Milestone #5 (due 3/2) 2 1 ISA Implementations Three different strategies: single-cycle

More information

SOFTWARE IMPLEMENTATION OF THE

SOFTWARE IMPLEMENTATION OF THE SOFTWARE IMPLEMENTATION OF THE IEEE 802.11A/P PHYSICAL LAYER SDR`12 WInnComm Europe 27 29 June, 2012 Brussels, Belgium T. Cupaiuolo, D. Lo Iacono, M. Siti and M. Odoni Advanced System Technologies STMicroelectronics,

More information

EE241 - Spring 2004 Advanced Digital Integrated Circuits. Announcements. Borivoje Nikolic. Lecture 15 Low-Power Design: Supply Voltage Scaling

EE241 - Spring 2004 Advanced Digital Integrated Circuits. Announcements. Borivoje Nikolic. Lecture 15 Low-Power Design: Supply Voltage Scaling EE241 - Spring 2004 Advanced Digital Integrated Circuits Borivoje Nikolic Lecture 15 Low-Power Design: Supply Voltage Scaling Announcements Homework #2 due today Midterm project reports due next Thursday

More information

Probabilistic and Variation- Tolerant Design: Key to Continued Moore's Law. Tanay Karnik, Shekhar Borkar, Vivek De Circuit Research, Intel Labs

Probabilistic and Variation- Tolerant Design: Key to Continued Moore's Law. Tanay Karnik, Shekhar Borkar, Vivek De Circuit Research, Intel Labs Probabilistic and Variation- Tolerant Design: Key to Continued Moore's Law Tanay Karnik, Shekhar Borkar, Vivek De Circuit Research, Intel Labs 1 Outline Variations Process, supply voltage, and temperature

More information

Combined Circuit and Microarchitecture Techniques for Effective Soft Error Robustness in SMT Processors

Combined Circuit and Microarchitecture Techniques for Effective Soft Error Robustness in SMT Processors Combined Circuit and Microarchitecture Techniques for Effective Soft Error Robustness in SMT Processors Xin Fu, Tao Li and José Fortes Department of ECE, University of Florida xinfu@ufl.edu, taoli@ece.ufl.edu,

More information

Out-of-Order Execution. Register Renaming. Nima Honarmand

Out-of-Order Execution. Register Renaming. Nima Honarmand Out-of-Order Execution & Register Renaming Nima Honarmand Out-of-Order (OOO) Execution (1) Essence of OOO execution is Dynamic Scheduling Dynamic scheduling: processor hardware determines instruction execution

More information

Static Energy Reduction Techniques in Microprocessor Caches

Static Energy Reduction Techniques in Microprocessor Caches Static Energy Reduction Techniques in Microprocessor Caches Heather Hanson, Stephen W. Keckler, Doug Burger Computer Architecture and Technology Laboratory Department of Computer Sciences Tech Report TR2001-18

More information

SRAM SYSTEM DESIGN FOR MEMORY BASED COMPUTING

SRAM SYSTEM DESIGN FOR MEMORY BASED COMPUTING SRAM SYSTEM DESIGN FOR MEMORY BASED COMPUTING A Thesis Presented to The Academic Faculty by Muneeb Zia In Partial Fulfillment of the Requirements for the Degree Masters in the School of Electrical and

More information

Low Power VLSI Circuit Synthesis: Introduction and Course Outline

Low Power VLSI Circuit Synthesis: Introduction and Course Outline Low Power VLSI Circuit Synthesis: Introduction and Course Outline Ajit Pal Professor Department of Computer Science and Engineering Indian Institute of Technology Kharagpur INDIA -721302 Agenda Why Low

More information

RANA: Towards Efficient Neural Acceleration with Refresh-Optimized Embedded DRAM

RANA: Towards Efficient Neural Acceleration with Refresh-Optimized Embedded DRAM RANA: Towards Efficient Neural Acceleration with Refresh-Optimized Embedded DRAM Fengbin Tu, Weiwei Wu, Shouyi Yin, Leibo Liu, Shaojun Wei Institute of Microelectronics Tsinghua University The 45th International

More information

CS521 CSE IITG 11/23/2012

CS521 CSE IITG 11/23/2012 Parallel Decoding and issue Parallel execution Preserving the sequential consistency of execution and exception processing 1 slide 2 Decode/issue data Issue bound fetch Dispatch bound fetch RS RS RS RS

More information

Power Optimization of FPGA Interconnect Via Circuit and CAD Techniques

Power Optimization of FPGA Interconnect Via Circuit and CAD Techniques Power Optimization of FPGA Interconnect Via Circuit and CAD Techniques Safeen Huda and Jason Anderson International Symposium on Physical Design Santa Rosa, CA, April 6, 2016 1 Motivation FPGA power increasingly

More information

ECOM 4311 Digital System Design using VHDL. Chapter 9 Sequential Circuit Design: Practice

ECOM 4311 Digital System Design using VHDL. Chapter 9 Sequential Circuit Design: Practice ECOM 4311 Digital System Design using VHDL Chapter 9 Sequential Circuit Design: Practice Outline 1. Poor design practice and remedy 2. More counters 3. Register as fast temporary storage 4. Pipelined circuit

More information

Overview. 1 Trends in Microprocessor Architecture. Computer architecture. Computer architecture

Overview. 1 Trends in Microprocessor Architecture. Computer architecture. Computer architecture Overview 1 Trends in Microprocessor Architecture R05 Robert Mullins Computer architecture Scaling performance and CMOS Where have performance gains come from? Modern superscalar processors The limits of

More information

Instruction Level Parallelism. Data Dependence Static Scheduling

Instruction Level Parallelism. Data Dependence Static Scheduling Instruction Level Parallelism Data Dependence Static Scheduling Basic Block A straight line code sequence with no branches in except to the entry and no branches out except at the exit Loop: L.D ADD.D

More information

Announcements. Advanced Digital Integrated Circuits. Midterm feedback mailed back Homework #3 posted over the break due April 8

Announcements. Advanced Digital Integrated Circuits. Midterm feedback mailed back Homework #3 posted over the break due April 8 EE241 - Spring 21 Advanced Digital Integrated Circuits Lecture 18: Dynamic Voltage Scaling Announcements Midterm feedback mailed back Homework #3 posted over the break due April 8 Reading: Chapter 5, 6,

More information

CMP 301B Computer Architecture. Appendix C

CMP 301B Computer Architecture. Appendix C CMP 301B Computer Architecture Appendix C Dealing with Exceptions What should be done when an exception arises and many instructions are in the pipeline??!! Force a trap instruction in the next IF stage

More information

METHODS FOR TRUE ENERGY- PERFORMANCE OPTIMIZATION. Naga Harika Chinta

METHODS FOR TRUE ENERGY- PERFORMANCE OPTIMIZATION. Naga Harika Chinta METHODS FOR TRUE ENERGY- PERFORMANCE OPTIMIZATION Naga Harika Chinta OVERVIEW Introduction Optimization Methods A. Gate size B. Supply voltage C. Threshold voltage Circuit level optimization A. Technology

More information

Auto refresh and self refresh refresh cycles / 64ms. Part No. Clock Frequency Power Organization Interface Package. Normal. 4Banks x 1Mbits x16

Auto refresh and self refresh refresh cycles / 64ms. Part No. Clock Frequency Power Organization Interface Package. Normal. 4Banks x 1Mbits x16 4 Banks x 1M x 16Bit Synchronous DRAM DESCRIPTION The Hynix HY57V641620HG is a 67,108,864-bit CMOS Synchronous DRAM, ideally suited for the main memory applications which require large memory density and

More information

SIGNED PIPELINED MULTIPLIER USING HIGH SPEED COMPRESSORS

SIGNED PIPELINED MULTIPLIER USING HIGH SPEED COMPRESSORS INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS ISSN 2320-7345 SIGNED PIPELINED MULTIPLIER USING HIGH SPEED COMPRESSORS 1 T.Thomas Leonid, 2 M.Mary Grace Neela, and 3 Jose Anand

More information

Reference. Wayne Wolf, FPGA-Based System Design Pearson Education, N Krishna Prakash,, Amrita School of Engineering

Reference. Wayne Wolf, FPGA-Based System Design Pearson Education, N Krishna Prakash,, Amrita School of Engineering FPGA Fabrics Reference Wayne Wolf, FPGA-Based System Design Pearson Education, 2004 CPLD / FPGA CPLD Interconnection of several PLD blocks with Programmable interconnect on a single chip Logic blocks executes

More information

Booster: Reactive Core Acceleration for Mitigating the Effects of Process Variation and Application Imbalance in Low-Voltage Chips

Booster: Reactive Core Acceleration for Mitigating the Effects of Process Variation and Application Imbalance in Low-Voltage Chips Booster: Reactive Core Acceleration for Mitigating the Effects of Process Variation and Application Imbalance in Low-Voltage Chips Timothy N. Miller, Xiang Pan, Renji Thomas, Naser Sedaghati, Radu Teodorescu

More information

Energy Efficient Soft Real-Time Computing through Cross-Layer Predictive Control

Energy Efficient Soft Real-Time Computing through Cross-Layer Predictive Control Energy Efficient Soft Real-Time Computing through Cross-Layer Predictive Control Guangyi Cao and Arun Ravindran Department of Electrical and Computer Engineering University of North Carolina at Charlotte

More information

Issue. Execute. Finish

Issue. Execute. Finish Specula1on & Precise Interrupts Fall 2017 Prof. Ron Dreslinski h6p://www.eecs.umich.edu/courses/eecs470 In Order Out of Order In Order Issue Execute Finish Fetch Decode Dispatch Complete Retire Instruction/Decode

More information

A VCO-Based ADC Employing a Multi- Phase Noise-Shaping Beat Frequency Quantizer for Direct Sampling of Sub-1mV Input Signals

A VCO-Based ADC Employing a Multi- Phase Noise-Shaping Beat Frequency Quantizer for Direct Sampling of Sub-1mV Input Signals A VCO-Based ADC Employing a Multi- Phase Noise-Shaping Beat Frequency Quantizer for Direct Sampling of Sub-1mV Input Signals Bongjin Kim, Somnath Kundu, Seokkyun Ko and Chris H. Kim University of Minnesota,

More information

Low Power Design Part I Introduction and VHDL design. Ricardo Santos LSCAD/FACOM/UFMS

Low Power Design Part I Introduction and VHDL design. Ricardo Santos LSCAD/FACOM/UFMS Low Power Design Part I Introduction and VHDL design Ricardo Santos ricardo@facom.ufms.br LSCAD/FACOM/UFMS Motivation for Low Power Design Low power design is important from three different reasons Device

More information

Tomasulo s Algorithm. Tomasulo s Algorithm

Tomasulo s Algorithm. Tomasulo s Algorithm Tomasulo s Algorithm Load and store buffers Contain data and addresses, act like reservation stations Branch Prediction Top-level design: 56 Tomasulo s Algorithm Three Steps: Issue Get next instruction

More information

Decoupling Capacitance

Decoupling Capacitance Decoupling Capacitance Nitin Bhardwaj ECE492 Department of Electrical and Computer Engineering Agenda Background On-Chip Algorithms for decap sizing and placement Based on noise estimation Decap modeling

More information

IJMIE Volume 2, Issue 3 ISSN:

IJMIE Volume 2, Issue 3 ISSN: IJMIE Volume 2, Issue 3 ISSN: 2249-0558 VLSI DESIGN OF LOW POWER HIGH SPEED DOMINO LOGIC Ms. Rakhi R. Agrawal* Dr. S. A. Ladhake** Abstract: Simple to implement, low cost designs in CMOS Domino logic are

More information

Computer Hardware. Pipeline

Computer Hardware. Pipeline Computer Hardware Pipeline Conventional Datapath 2.4 ns is required to perform a single operation (i.e. 416.7 MHz). Register file MUX B 0.6 ns Clock 0.6 ns 0.2 ns Function unit 0.8 ns MUX D 0.2 ns c. Production

More information

Final Report: DBmbench

Final Report: DBmbench 18-741 Final Report: DBmbench Yan Ke (yke@cs.cmu.edu) Justin Weisz (jweisz@cs.cmu.edu) Dec. 8, 2006 1 Introduction Conventional database benchmarks, such as the TPC-C and TPC-H, are extremely computationally

More information

Reducing Transistor Variability For High Performance Low Power Chips

Reducing Transistor Variability For High Performance Low Power Chips Reducing Transistor Variability For High Performance Low Power Chips HOT Chips 24 Dr Robert Rogenmoser Senior Vice President Product Development & Engineering 1 HotChips 2012 Copyright 2011 SuVolta, Inc.

More information

Server Operational Cost Optimization for Cloud Computing Service Providers over

Server Operational Cost Optimization for Cloud Computing Service Providers over Server Operational Cost Optimization for Cloud Computing Service Providers over a Time Horizon Haiyang(Ocean)Qian and Deep Medhi Networking and Telecommunication Research Lab (NeTReL) University of Missouri-Kansas

More information

Leakage Power Minimization in Deep-Submicron CMOS circuits

Leakage Power Minimization in Deep-Submicron CMOS circuits Outline Leakage Power Minimization in Deep-Submicron circuits Politecnico di Torino Dip. di Automatica e Informatica 1019 Torino, Italy enrico.macii@polito.it Introduction. Design for low leakage: Basics.

More information

Improving GPU Performance via Large Warps and Two-Level Warp Scheduling

Improving GPU Performance via Large Warps and Two-Level Warp Scheduling Improving GPU Performance via Large Warps and Two-Level Warp Scheduling Veynu Narasiman The University of Texas at Austin Michael Shebanow NVIDIA Chang Joo Lee Intel Rustam Miftakhutdinov The University

More information

On the Off-chip Memory Latency of Real-Time Systems: Is DDR DRAM Really the Best Option? Mohamed Hassan

On the Off-chip Memory Latency of Real-Time Systems: Is DDR DRAM Really the Best Option? Mohamed Hassan On the Off-chip Memory Latency of eal-time Systems: Is DD DAM eally the Best Option? Mohamed Hassan Motivation 2 PEDICTABILITY DAMs 3 LDAM 4 esults 5 Outline Historically, SAMs have been the option for

More information

Tomasolu s s Algorithm

Tomasolu s s Algorithm omasolu s s Algorithm Fall 2007 Prof. homas Wenisch http://www.eecs.umich.edu/courses/eecs4 70 Floating Point Buffers (FLB) ag ag ag Storage Bus Floating Point 4 3 Buffers FLB 6 5 5 4 Control 2 1 1 Result

More information

Revision History Revision 0.0 (October, 2003) Target spec release Revision 1.0 (November, 2003) Revision 1.0 spec release Revision 1.1 (December, 2003

Revision History Revision 0.0 (October, 2003) Target spec release Revision 1.0 (November, 2003) Revision 1.0 spec release Revision 1.1 (December, 2003 16Mb H-die SDRAM Specification 50 TSOP-II with Pb-Free (RoHS compliant) Revision 1.4 August 2004 Samsung Electronics reserves the right to change products or specification without notice. Revision History

More information

Trace Based Switching For A Tightly Coupled Heterogeneous Core

Trace Based Switching For A Tightly Coupled Heterogeneous Core Trace Based Switching For A Tightly Coupled Heterogeneous Core Shru% Padmanabha, Andrew Lukefahr, Reetuparna Das, Sco@ Mahlke Micro- 46 December 2013 University of Michigan Electrical Engineering and Computer

More information

Minimum Energy CMOS Design with Dual Subthreshold Supply and Multiple Logic-Level Gates

Minimum Energy CMOS Design with Dual Subthreshold Supply and Multiple Logic-Level Gates Minimum Energy CMOS Design with Dual Subthreshold Supply and Multiple Logic-Level Gates Kyungseok Kim and Vishwani D. Agrawal Department of ECE, Auburn University, Auburn, AL 36849, USA kyungkim@auburn.edu,

More information

Technology Timeline. Transistors ICs (General) SRAMs & DRAMs Microprocessors SPLDs CPLDs ASICs. FPGAs. The Design Warrior s Guide to.

Technology Timeline. Transistors ICs (General) SRAMs & DRAMs Microprocessors SPLDs CPLDs ASICs. FPGAs. The Design Warrior s Guide to. FPGAs 1 CMPE 415 Technology Timeline 1945 1950 1955 1960 1965 1970 1975 1980 1985 1990 1995 2000 Transistors ICs (General) SRAMs & DRAMs Microprocessors SPLDs CPLDs ASICs FPGAs The Design Warrior s Guide

More information

Revision No. History Draft Date Remark. 0.1 Initial Draft Jan Preliminary. 1.0 Final Version Apr. 2007

Revision No. History Draft Date Remark. 0.1 Initial Draft Jan Preliminary. 1.0 Final Version Apr. 2007 64Mb Synchronous DRAM based on 1M x 4Bank x16 I/O Document Title 4Bank x 1M x 16bits Synchronous DRAM Revision History Revision No. History Draft Date Remark 0.1 Initial Draft Jan. 2007 Preliminary 1.0

More information

Low-Power CMOS VLSI Design

Low-Power CMOS VLSI Design Low-Power CMOS VLSI Design ( 范倫達 ), Ph. D. Department of Computer Science, National Chiao Tung University, Taiwan, R.O.C. Fall, 2017 ldvan@cs.nctu.edu.tw http://www.cs.nctu.tw/~ldvan/ Outline Introduction

More information

HY5V56D(L/S)FP. Revision History. No. History Draft Date Remark. 0.1 Defined Target Spec. May Rev. 0.1 / Jan

HY5V56D(L/S)FP. Revision History. No. History Draft Date Remark. 0.1 Defined Target Spec. May Rev. 0.1 / Jan Revision History No. History Draft Date Remark 0.1 Defined Target Spec. May 2003 Rev. 0.1 / Jan. 2005 1 Series 4 Banks x 4M x 16bits Synchronous DRAM DESCRIPTION The HY5V56D(L/S)FP is a 268,435,456bit

More information

A Reconfigurable Source-Synchronous On-Chip Network for GALS Many-Core Platforms

A Reconfigurable Source-Synchronous On-Chip Network for GALS Many-Core Platforms IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 29, NO. 6, JUNE 200 897 A Reconfigurable Source-Synchronous On-Chip Network for GALS Many- Platforms Anh T. Tran, Dean

More information

Revision No. History Draft Date Remark. 1.0 First Version Release Dec Corrected PIN ASSIGNMENT A12 to NC Jan. 2005

Revision No. History Draft Date Remark. 1.0 First Version Release Dec Corrected PIN ASSIGNMENT A12 to NC Jan. 2005 128Mb Synchronous DRAM based on 2M x 4Bank x16 I/O Document Title 4Bank x 2M x 16bits Synchronous DRAM Revision History Revision No. History Draft Date Remark 1.0 First Version Release Dec. 2004 1.1 1.

More information

A B C D. Ann, Brian, Cathy, & Dave each have one load of clothes to wash, dry, and fold. Time

A B C D. Ann, Brian, Cathy, & Dave each have one load of clothes to wash, dry, and fold. Time Pipelining Readings: 4.5-4.8 Example: Doing the laundry A B C D Ann, Brian, Cathy, & Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes Dryer takes 40 minutes Folder takes

More information

A Employing Circadian Rhythms to Enhance Power and Reliability

A Employing Circadian Rhythms to Enhance Power and Reliability A Employing Circadian Rhythms to Enhance Power and Reliability Saket Gupta, Broadcom Corporation Sachin S. Sapatnekar, University of Minnesota, Twin Cities This paper presents a novel scheme for saving

More information

Pipelining A B C D. Readings: Example: Doing the laundry. Ann, Brian, Cathy, & Dave. each have one load of clothes to wash, dry, and fold

Pipelining A B C D. Readings: Example: Doing the laundry. Ann, Brian, Cathy, & Dave. each have one load of clothes to wash, dry, and fold Pipelining Readings: 4.5-4.8 Example: Doing the laundry Ann, Brian, Cathy, & Dave A B C D each have one load of clothes to wash, dry, and fold Washer takes 30 minutes Dryer takes 40 minutes Folder takes

More information

Auto refresh and self refresh refresh cycles / 64ms. Part No. Clock Frequency Power Organization Interface Package. Normal. 4Banks x 2Mbits x16

Auto refresh and self refresh refresh cycles / 64ms. Part No. Clock Frequency Power Organization Interface Package. Normal. 4Banks x 2Mbits x16 4 Banks x 2M x 16bits Synchronous DRAM DESCRIPTION The Hynix HY57V281620A is a 134,217,728bit CMOS Synchronous DRAM, ideally suited for the Mobile applications which require low power consumption and extended

More information

CMOS Process Variations: A Critical Operation Point Hypothesis

CMOS Process Variations: A Critical Operation Point Hypothesis CMOS Process Variations: A Critical Operation Point Hypothesis Janak H. Patel Department of Electrical and Computer Engineering University of Illinois at Urbana-Champaign jhpatel@uiuc.edu Computer Systems

More information

Revision No. History Draft Date Remark. 0.1 Initial Draft Jul Preliminary. 1.0 Release Aug. 2009

Revision No. History Draft Date Remark. 0.1 Initial Draft Jul Preliminary. 1.0 Release Aug. 2009 128Mb Synchronous DRAM based on 2M x 4Bank x16 I/O Document Title 4Bank x 2M x 16bits Synchronous DRAM Revision History Revision No. History Draft Date Remark 0.1 Initial Draft Jul. 2009 Preliminary 1.0

More information