

Trends in Microprocessor Architecture
R05, Robert Mullins

Overview
  Computer architecture
  Scaling performance and CMOS
  Where have performance gains come from?
  Modern superscalar processors
  The limits of superscalar processors
  This course

Computer architecture
  "Computer architecture is the interface between what technology can provide and what the marketplace demands"
  "Computer architecture is the science and art of selecting and interconnecting hardware components to create computers that meet functional, performance and cost goals" (Mark Hill)
  "Computer architecture is a science of trade-offs" (Yale Patt)
  "Computer architecture forms the bridge between application need and the capabilities of the underlying technology" (Tilak Agerwala and Siddhartha Chatterjee)

Computer architecture
  We cannot architect a new computer without defining performance, power and cost goals.
  The design process is all about understanding and making trade-offs.
  What is our target market and what applications will we be running?
  The best architecture is a moving target:
    The needs of the marketplace change
    Shifting fabrication technology characteristics
    New technologies: memory, packaging, compilers, languages, ...

Computer architecture
  "Computer architects often err by preparing for yesterday's computations" (Bill Dally)
  (Easy to make the same error during a PhD!)
  Tomorrow's applications and technologies are not easy to predict!

  Burger's "end of the road" paper suggested performance growth would be limited to 12.5%/annum
    Predicted, 1997-2014: 7.4x
    Actual: ~36x
    If at historical rate: 1720x
  Reproduced from Computer Architecture: A Quantitative Approach, Hennessy/Patterson
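The three figures above are all instances of compound annual growth over the same 17-year window. A minimal sketch of that arithmetic follows; the function names are illustrative and not taken from the slides, and the two implied rates are derived here from the quoted totals rather than stated in the source.

```python
def compound_gain(annual_rate: float, years: int) -> float:
    """Total speedup after `years` of growth at `annual_rate` (0.125 means 12.5%/year)."""
    return (1.0 + annual_rate) ** years


def implied_annual_rate(total_gain: float, years: int) -> float:
    """Constant annual growth rate that would produce `total_gain` over `years`."""
    return total_gain ** (1.0 / years) - 1.0


YEARS = 2014 - 1997  # 17 years, the window quoted on the slide

predicted = compound_gain(0.125, YEARS)               # the slide quotes 7.4x
actual_rate = implied_annual_rate(36.0, YEARS)        # rate implied by ~36x
historical_rate = implied_annual_rate(1720.0, YEARS)  # rate implied by 1720x

print(f"predicted gain at 12.5%/year: {predicted:.1f}x")
print(f"annual rate implied by ~36x:  {actual_rate:.1%}")
print(f"annual rate implied by 1720x: {historical_rate:.1%}")
```

Under these assumptions the actual outcome corresponds to growth in the low twenties of percent per year, roughly double the predicted rate but well below the historical one.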

Microprocessor trends
  Microprocessor performance increased at a rate of ~52%/year between 1986 and 2002
  ~800X improvement over 16 years
  https://github.com/karlrupp/microprocessor-trend-data

How was such an improvement in performance achieved?
  Is this a reasonable rate of performance growth given the advances in fabrication technology?
  Exe. time = Instr. count x CPI x Clock period

Technology scaling
  7 process generations
  Scaling provides ~1.4x transistor performance improvement per generation
  10.5X (careful, this doesn't automatically translate directly into performance gains)
  Reproduced with kind permission of Mark Horowitz

Gates per clock
  Less logic between pipeline registers
  Reduction from ~100 to 10 gate delays: 10X
  How?
    Pipelining: 5 to 20 stages (~4X)
    Circuit-level advances, e.g. new logic families (~2.5X)
  Reproduced with kind permission of Mark Horowitz
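The execution-time equation and the scaling figures above can be combined into a rough accounting of where the ~800X came from. The sketch below only multiplies out the numbers quoted on the slides; the variable names and the sample workload passed to `execution_time` are hypothetical.

```python
def execution_time(instr_count: float, cpi: float, clock_period_s: float) -> float:
    """Exe. time = Instr. count x CPI x Clock period (result in seconds)."""
    return instr_count * cpi * clock_period_s


# Hypothetical workload: 1e9 instructions, CPI of 1.5, 1 ns clock period.
example_time = execution_time(1e9, 1.5, 1e-9)

# Clock-period improvements quoted on the slides.
scaling_gain = 1.4 ** 7           # 7 process generations at ~1.4x each
gates_per_clock_gain = 100 / 10   # ~100 down to ~10 gate delays per cycle
clock_gain = scaling_gain * gates_per_clock_gain

# Overall improvement: ~52%/year sustained for 16 years.
total_gain = 1.52 ** 16

# Whatever the clock does not explain must come from CPI and instruction count.
remaining_gain = total_gain / clock_gain

print(f"scaling: {scaling_gain:.1f}x, clock overall: {clock_gain:.0f}x")
print(f"total: {total_gain:.0f}x, left for CPI and instr. count: {remaining_gain:.1f}x")
```

The split is multiplicative because the three terms of the execution-time equation are multiplied, so a gain in clock period and a gain in CPI compound rather than add.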

Clock frequency
  ~105X (10.5X from technology scaling x 10X from fewer gates per clock)
  Reproduced from CMOS VLSI Design, Weste/Harris (2005)

IPC & instr. count
  ~5-8X improvement in SPECint/MHz
  This is despite clock frequency improvements
  Includes advances in compiler technology and impact of increased bus widths
  Improvement in SPECint95/MHz over time (reproduced with kind permission of Mark Horowitz)

How was it possible to maintain and even decrease CPI (improve IPC)?
  Moore's law!
  How were the additional transistors exploited?

Intel 386 to Pentium 4
  386: 275K transistors (die size = 43 mm²)
  P4: 42M transistors (die size = 217 mm²)
  5X from increased die size
  27X from technology scaling
  Today's (2017) largest chips contain > 10 billion transistors
  Reproduced from CMOS VLSI Design, Weste and Harris (2005)
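The 386 to Pentium 4 comparison splits the growth in transistor count into a die-size factor and a density factor. The sketch below recomputes that split from the raw figures on the slide; the variable names are illustrative. The density factor it yields is close to, but not identical to, the 27X quoted, which presumably reflects rounding or slightly different inputs on the original slide.

```python
# Figures quoted on the slide.
TRANSISTORS_386 = 275e3
TRANSISTORS_P4 = 42e6
DIE_AREA_386_MM2 = 43.0
DIE_AREA_P4_MM2 = 217.0

total_ratio = TRANSISTORS_P4 / TRANSISTORS_386   # overall growth in transistor count
area_ratio = DIE_AREA_P4_MM2 / DIE_AREA_386_MM2  # contribution of the larger die
density_ratio = total_ratio / area_ratio         # contribution of technology scaling

print(f"transistor count: {total_ratio:.0f}x")
print(f"die area:         {area_ratio:.1f}x")
print(f"density:          {density_ratio:.0f}x")
```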

The future of Moore's Law: 2D to 3D
  Beyond 2021 it won't be economically desirable to shrink transistor dimensions
  Recently introduced vertical transistors (e.g. dual-gate and tri-gate)
  Monolithic 3D predicted by 2024
  Roadmap to consider applications in future (more of an end-to-end view vs. bottom-up)
  The latest ITRS Roadmap (2015) predicts that physical gate length will not shrink beyond 2021. Earlier predictions (2013) were more optimistic.

Modern superscalar processors
  Revision (see Hennessy/Patterson)
  Significant hardware support for Instruction Level Parallelism (ILP) in most commercial microprocessors
  Multiple-issue architectures
  Deep pipelines, branch prediction, speculative execution
  Large on-chip caches (L1/L2/L3)
  Out-of-order execution, register renaming
  Dynamic memory address disambiguation
  SIMD instructions
  ...

Cost and complexity of extracting ILP
  Diminishing returns
  Increased complexity limits ability to optimise design
  The underlying fabrication technology characteristics are becoming more challenging too
  Increases verification complexity and time
  Increases time-to-market

Pipeline depth limits
  Interruptions to the pipeline (branches)
  Performance of the memory system
  Clocking overheads (registers/clock skew)
  Need to balance stages and maintain the atomicity of some operations
  Limited ILP
  Power cost
  (See also Optimal Pipeline Depth link on Seminar 1 wiki page)

Interconnect versus transistor scaling
  Smaller transistors = faster/lower power
  Wires don't scale in the same way
  Centralised structures don't scale well
  Pressure to decentralise
  Consider bypass network between FUs
  Clustered implementations
  "Coming challenges in microarchitecture and architecture", Ronen et al, 2001

Voltage scaling and power limits
  Voltage scaling has slowed
    5V to 1V gave us 25X power savings
    1V to 0.7V (limit at end of CMOS around 2020)
    Only 2X power savings left from voltage scaling!
  Sensible power limits already reached
  Pressure to reduce power consumption
  Process variation complications
  Fault tolerance requirements in the longer term

Accept we can make little progress with single-thread performance
  Look towards thread-level parallelism
  Achieve our performance gains in a new way:
    Rapidly increase the number of cores, 2X-3X per generation
    Don't scale the clock frequency
    Create simpler, more power-efficient cores instead

Pawlowski (Intel) 2007
  It is now 2018... The number of cores has scaled less aggressively than this.
  In 2017 @ 14nm, high-end server part: 28-core Xeon (Skylake), 56 threads
  Clock frequency 2.5GHz (max turbo freq. 3.8GHz)
  TDP (power) = 205 W

... is simple?
  Replicate existing processor designs to ease design process
  Many applications already exist where thread-level parallelism is plentiful
  We've had 30+ years of experience writing parallel programs
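The 25X and 2X voltage-scaling figures above follow from the standard approximation that dynamic power scales with the square of the supply voltage (P = a x C x V² x f). The slides quote only the resulting savings, so the sketch below assumes that model, holds capacitance and frequency fixed, and ignores static (leakage) power; the function names are illustrative.

```python
def dynamic_power(c_eff: float, vdd: float, freq_hz: float, activity: float = 1.0) -> float:
    """Dynamic switching power, P = activity * C * Vdd^2 * f (watts)."""
    return activity * c_eff * vdd ** 2 * freq_hz


def voltage_scaling_saving(v_old: float, v_new: float) -> float:
    """Factor by which dynamic power drops when only Vdd changes."""
    return (v_old / v_new) ** 2


past_saving = voltage_scaling_saving(5.0, 1.0)       # 5V down to 1V
remaining_saving = voltage_scaling_saving(1.0, 0.7)  # 1V down to 0.7V

print(f"5V -> 1V:   {past_saving:.0f}x")
print(f"1V -> 0.7V: {remaining_saving:.1f}x")
```

Because the saving is quadratic in the voltage ratio, the shrinking headroom between 1V and 0.7V leaves only about a factor of two, which is the point the slide makes.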

Many new challenges:
  Power is a first-order design constraint
  On-chip and off-chip communication
  Simpler cores and Amdahl's law
  Support for the shared-memory paradigm?
  Synchronization and thread-scheduling support?
  Everyone must now write scalable and correct parallel programs!

Power constrained design
  Power consumption is already at a sensible limit (for many applications we would like to reduce it)
  We are going to increase the number of cores by 2-3X per generation
  Power savings?
    Core shrink (<1.4X)
    Simpler cores (1.4-2X?)
    Some VDD savings
  Need to add uncore logic too!
  Techniques for adaptive EPI?

Future of multicore? Beyond homogeneous multicore
  Power consumption is a limiting factor in the design of multicore processors
  For many designs this has prompted the integration of many specialized accelerators
  An ASIC implementation of an algorithm may be 10-1000X more energy efficient than a software implementation
  e.g. Apple A8 SoC: ~50% custom accelerators, ~25% CPUs (2), ~25% GPU

NAVIGO [Hempstead, Wei and Brooks, 2011]
  Examined throughput-orientated workload
  Suggest gains limited to 35% per year due to power constraints
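The power-constrained design budget above can be checked with a few multiplications: if the core count grows by 2-3X per generation, per-core power must fall by about the same factor for chip power to stay flat. The sketch below is a simplified model using only the savings quoted on the slide; it assumes the savings multiply, and it ignores VDD savings and the added uncore logic. The function names are illustrative.

```python
def per_core_power_reduction(shrink: float, simpler_core: float, vdd: float = 1.0) -> float:
    """Combined per-core power reduction, assuming the individual savings multiply."""
    return shrink * simpler_core * vdd


def chip_power_growth(core_multiplier: float, reduction: float) -> float:
    """Relative chip power after one generation (1.0 means unchanged)."""
    return core_multiplier / reduction


# 2x cores with the lower-bound savings: 1.4x shrink, 1.4x simpler cores.
two_x = chip_power_growth(2.0, per_core_power_reduction(1.4, 1.4))

# 3x cores with the optimistic savings: 1.4x shrink, 2x simpler cores.
three_x_optimistic = chip_power_growth(3.0, per_core_power_reduction(1.4, 2.0))

# 3x cores with only the lower-bound savings.
three_x_pessimistic = chip_power_growth(3.0, per_core_power_reduction(1.4, 1.4))

print(f"2x cores:               {two_x:.2f}x chip power")
print(f"3x cores (optimistic):  {three_x_optimistic:.2f}x chip power")
print(f"3x cores (pessimistic): {three_x_pessimistic:.2f}x chip power")
```

Under these assumptions doubling the core count roughly breaks even, while tripling it only does so at the optimistic end of the quoted ranges, before any uncore power is counted.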

Future gains
  Need for applications to be approximation/fault tolerant
  New applications
  IRDS Roadmap (2016)
    Node: 2nm/1.5nm
    Vertical Gate-All-Around (GAA) devices
    Monolithic 3D (stacking of devices)
    VDD = 0.4V

This course
  Introduction to the challenges of building and programming chip multiprocessors
  Lots to learn from traditional parallel computers, but many problems and trade-offs are new
  The trade-offs on-chip are very different to those when designing physically larger parallel machines
    Power and energy constraints
    Parallel programming for the masses

This course
  1. Trends in microprocessor architecture
  2. Introduction to parallel computing
  3. Parallel algorithms
  4. Chip Multiprocessors (I)
  5. Chip Multiprocessors (II)
  6. Transactional memory
  7. On-chip interconnection networks
  8. Manycore research issues

  2017: Rune Holm, ARM (Machine Learning Group)
  2016: Gavin Stark, Netronome (CTO)
  2014: David Moloney, Movidius (CTO)
  2012: Matt Horsnell, ARM
  2011: Eben Upton, Broadcom
  ...