Parallel Computing 2020: Preparing for the Post-Moore Era. Marc Snir


THE (CMOS) WORLD IS ENDING NEXT DECADE
So says the International Technology Roadmap for Semiconductors (ITRS)

End of CMOS?
- IN THE LONG TERM (~2017 THROUGH 2024): "While power consumption is an urgent challenge, its leakage or static component will become a major industry crisis in the long term, threatening the survival of CMOS technology itself, just as bipolar technology was threatened and eventually disposed of decades ago." [ITRS 2009/2010]
- Unlike the situation at the end of the bipolar era, no technology is waiting in the wings.

Technology Barriers
- New materials ... such as III-V or germanium thin channels on silicon, or even semiconductor nanowires, carbon nanotubes, graphene or others may be needed.
- New structures: three-dimensional architecture, such as vertically stackable cell arrays in monolithic integration, with acceptable yield and performance.
- These are huge industry challenges to simply imagine and define
- Note: Predicted feature size in 2024 (7.5 nm) = ~32 silicon atoms (Si-Si lattice distance is 0.235 nm)
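The atom count in the note can be checked directly. This is only a restatement of the slide's own arithmetic, using the two figures it quotes; the variable names are illustrative.

```python
# Both figures are taken from the slide.
feature_size_nm = 7.5       # predicted feature size in 2024
si_si_distance_nm = 0.235   # Si-Si lattice distance

# Number of silicon atoms that fit across one feature.
atoms_across = feature_size_nm / si_si_distance_nm
print(f"{atoms_across:.1f} silicon atoms across one feature")
```

The quotient is just under 32, which matches the "~32 silicon atoms" on the slide.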

Economic Barriers
- ROI challenges: achieving constant/improved ratio of cost to throughput might be an insoluble dilemma.
- Rock's Law: Cost of semiconductor chip fabrication plant doubles every four years
  - Current cost is $7-$9B
  - Intel's yearly revenue is $35B
  - Semiconductor industry grows < 20% annually
  - Opportunities for consolidation are limited
- Will take longer to amortize future technology investments
- Progress stops when manufacturing a twice as dense chip is twice as expensive
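A rough sketch of what Rock's Law implies, assuming the doubling simply continues at the stated rate. The cost range and doubling period come from the slide; the eight-year horizon and the function name `fab_cost` are illustrative choices, not part of the talk.

```python
DOUBLING_PERIOD_YEARS = 4  # Rock's Law, as stated on the slide

# Equivalent compound annual growth rate of fab cost.
annual_cost_growth = 2 ** (1 / DOUBLING_PERIOD_YEARS) - 1
print(f"Fab cost grows about {annual_cost_growth:.1%} per year")

def fab_cost(cost_now_billion, years):
    """Projected fab cost in $B, assuming Rock's Law keeps holding."""
    return cost_now_billion * 2 ** (years / DOUBLING_PERIOD_YEARS)

# Slide's current cost range of $7-$9B, projected eight years out
# (the horizon is a hypothetical choice for illustration).
for cost_now in (7, 9):
    print(f"${cost_now}B today -> ${fab_cost(cost_now, 8):.0f}B in 8 years")
```

Doubling every four years is roughly 19% per year, about the same as the upper bound the slide gives for industry growth. Under these assumptions a single fab would cost roughly as much as the $35B yearly revenue quoted for Intel within two doubling periods.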

Scaling is Plateauing
- Simple scaling (proportional decrease in all parameters) ended years ago
- Single thread performance is not improving
- Rate of increase in chip density will slow down in the next few years, for technological and economic reasons
- Silicon will plateau at 10x-100x current performance
- No alternative technology is ready for prime time

IT as Scaling Slows
- End of Moore's Law is not the end of the Computer Industry
- It need not be the end of IT growth
- Mass market (mobile, home): Increasing emphasis on function (or fashion)
- Big iron: Increasing emphasis on compute efficiency: Get more results from a given energy and transistor budget.

Compute Efficiency
- Progressively more efficient use of a fixed set of resources (similar to fuel efficiency)
  - More computations per joule
  - More computations per transistor
- A clear understanding of where performance is wasted and continuous progress to reduce waste
- A clear distinction between overheads (computational friction) and the essential work
- (We are still very far from any fundamental limit)

HPC: The Canary in the Mine
- HPC is already heavily constrained by low compute efficiency
  - High power consumption, high levels of parallelism
- Exascale research is not only research for the next turn of the crank in supercomputing, but research on how to sustain performance growth in the face of semiconductor technology slow-down
- Essential for continued progress in science, national competitiveness and national security

PETASCALE IN A YEAR
Blue Waters

Blue Waters

System Attribute                   Blue Waters
Vendor                             IBM
Processor                          IBM Power7
Peak Performance (PF)              ~10
Sustained Performance (PF)         ~1
Number of Cores/Chip               8
Number of Cores                    >300,000
Amount of Memory (PB)              ~1
Amount of Disk Storage (PB)        ~18
Amount of Archival Storage (PB)    >500
External Bandwidth (Gbps)          100-400
Water Cooled                       >10 MW

Exascale in 2015 with 20MW [Kogge's Report]
- Aggressive scaling of Blue Gene Technology (32nm)
  - 67 MW
  - 223K nodes, 160M cores
  - 3.6 PB memory (1 Byte/1000 flops capacity, 1 Word/50 flops bandwidth)
  - 40 mins MTTI
- A more detailed and realistic study by Kogge indicates power consumption is ~500 MW
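To put the power figures on a common scale, the sketch below converts each one into energy efficiency (GFLOPS per watt). It assumes that each power figure is for a machine delivering exactly one exaflop/s; the helper name `gflops_per_watt` is illustrative.

```python
EXAFLOP_PER_S = 1e18  # assumed delivered performance for every scenario

def gflops_per_watt(flops_per_s, megawatts):
    """Energy efficiency in GFLOPS/W for a given performance and power draw."""
    watts = megawatts * 1e6
    return flops_per_s / watts / 1e9

target = gflops_per_watt(EXAFLOP_PER_S, 20)              # 20 MW goal
scaled_blue_gene = gflops_per_watt(EXAFLOP_PER_S, 67)    # aggressive scaling
realistic_study = gflops_per_watt(EXAFLOP_PER_S, 500)    # ~500 MW estimate

print(f"20 MW target:        {target:.1f} GFLOPS/W")
print(f"Scaled Blue Gene:    {scaled_blue_gene:.1f} GFLOPS/W")
print(f"Realistic estimate:  {realistic_study:.1f} GFLOPS/W")

# Cores per node implied by the scaled Blue Gene figures on the slide.
cores_per_node = 160e6 / 223e3
print(f"About {cores_per_node:.0f} cores per node")
```

Under this assumption the 20 MW target needs about 25 times the energy efficiency of the ~500 MW estimate.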

Kogge, Spectrum
- "[A] practical exaflops-scale supercomputer might not be possible anytime in the foreseeable future"
- "Building exascale computers... would require engineers to rethink entirely how they construct number crunchers"
- "Don't expect to see an [exascale] supercomputer any time soon. But don't give up hope, either."

Exascale Research: Some Fundamental Questions
- Power Complexity
- Communication-optimal computations
- Low entropy computations
- Jitter-resilient computation
- Steady-state computations
- Friction-less architecture
- Self-organizing computations
- Resiliency

Power Complexity
- There is a huge gap between theories on the (quantum) physical constraints of computation and the practice of current computing devices
- Can we develop power complexity models of computations that are relevant to computer engineers?

Communication-Efficient Algorithms: Theory
- Communication in time (registers, memory) and space (buses, links) is, by far, the major source of energy consumption
- Need to stop counting operations and start counting communications
- Need a theory of communication-efficient algorithms (beyond FFT and dense linear algebra)
  - Communication-efficient PDE solvers (understand relation between properties of PDE and communication needs)
- Need to measure correctly inherent communication costs at the algorithm level
  - Temporal/spatial/processor locality: second order statistics on data & control dependencies
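A small sketch of what "counting communications instead of operations" can look like for dense matrix multiplication, one of the cases the slide says is already understood. Two loop orders perform exactly the same number of arithmetic operations, but move very different amounts of data through a small fast memory. The cache model (fully associative, LRU, one word per line) and all sizes are simplifying assumptions made for this illustration, not something taken from the talk.

```python
from collections import OrderedDict

class LRUCache:
    """Fully associative LRU cache with one-word lines; counts misses only."""
    def __init__(self, capacity_words):
        self.capacity = capacity_words
        self.lines = OrderedDict()
        self.misses = 0

    def touch(self, address):
        if address in self.lines:
            self.lines.move_to_end(address)      # mark as most recently used
        else:
            self.misses += 1                     # word must be fetched
            self.lines[address] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)   # evict least recently used

def naive_traffic(n, cache_words):
    """Words fetched by the plain i-j-k loop order for C = A * B."""
    cache = LRUCache(cache_words)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                cache.touch(("A", i, k))
                cache.touch(("B", k, j))
                cache.touch(("C", i, j))
    return cache.misses

def blocked_traffic(n, block, cache_words):
    """Words fetched when the same loops are tiled into block x block tiles."""
    cache = LRUCache(cache_words)
    for ii in range(0, n, block):
        for jj in range(0, n, block):
            for kk in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    for j in range(jj, min(jj + block, n)):
                        for k in range(kk, min(kk + block, n)):
                            cache.touch(("A", i, k))
                            cache.touch(("B", k, j))
                            cache.touch(("C", i, j))
    return cache.misses

n, cache_words = 16, 64
operations = 2 * n ** 3                      # identical for both variants
naive = naive_traffic(n, cache_words)
blocked = blocked_traffic(n, 4, cache_words)
print(f"operations: {operations}, naive fetches: {naive}, blocked fetches: {blocked}")
```

An operation count cannot distinguish the two variants; the fetch count can. The block size of 4 is chosen so that three tiles (48 words) fit in the 64-word cache.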

Communication-Efficient Computations: Practice
- Need better benchmarks to sample multivariate distributions (apply Optimal Sampling Theory?)
- Need communication-focused programming models & environments
  - User can analyze and control cost of communications incurred during program execution (volume, locality)
- Need productivity environments for performance-oriented programmers

Low-Entropy Communication
- Communication can be much cheaper if known in advance
  - Memory access overheads, latency hiding, reduced arbitration cost, bulk transfers (e.g., optical switches)
  - Bulk mail vs. express mail
- Current HW/SW architectures take little advantage of such knowledge
  - Need architecture/software/algorithm research
- CS is lacking a good algorithmic theory of entropy
  - Need theory, benchmarks, metrics
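The "bulk mail vs. express mail" point can be illustrated with the common latency-plus-bandwidth cost model, in which every message pays a fixed start-up cost. This model is a standard simplification brought in for illustration, not something the slide specifies, and the latency and bandwidth values below are hypothetical.

```python
def transfer_time(total_bytes, message_bytes, latency_s, bandwidth_bytes_per_s):
    """Time to move total_bytes in messages of message_bytes each.

    Each message pays a fixed start-up latency; the payload streams at
    the link bandwidth.
    """
    messages = -(-total_bytes // message_bytes)   # ceiling division
    return messages * latency_s + total_bytes / bandwidth_bytes_per_s

# Hypothetical link parameters, chosen only for illustration.
latency_s = 1e-6
bandwidth = 1e10
total_bytes = 1 << 20   # 1 MiB of data known in advance

express = transfer_time(total_bytes, 64, latency_s, bandwidth)        # many small messages
bulk = transfer_time(total_bytes, total_bytes, latency_s, bandwidth)  # one large transfer

print(f"64-byte messages: {express * 1e3:.3f} ms")
print(f"single bulk transfer: {bulk * 1e3:.3f} ms")
```

When the communication pattern is known ahead of time, the sender can aggregate and pay the start-up cost once, which is where the saving in this model comes from.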

Jitter-Resilient Computation
- Expect increased variance in the compute speed of different components in a large machine
  - Power management
  - Error correction
  - Asynchronous system activities
  - Variance in application
- Need variance-tolerant applications
  - Bad: frequent barriers, frequent reductions
  - Good: 2-phase collectives, double-buffering
- Need theory and metrics
- Need new variance-tolerant algorithms
- Need automatic transformations for increased variance tolerance
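Why frequent barriers are "bad" under jitter can be seen in a toy simulation. Each worker takes a random time per step; a barrier forces everyone to wait for the slowest worker so far. The uniform step-time distribution, the worker count, and the step count are all assumptions made for this sketch.

```python
import random

def simulate(num_workers, num_steps, barrier_every, seed=0):
    """Finish time when workers synchronize at a barrier every `barrier_every` steps."""
    rng = random.Random(seed)
    clocks = [0.0] * num_workers
    for step in range(1, num_steps + 1):
        for worker in range(num_workers):
            # Nominal step time is 1.0; jitter is modeled as uniform noise.
            clocks[worker] += rng.uniform(0.5, 1.5)
        if step % barrier_every == 0:
            slowest = max(clocks)
            clocks = [slowest] * num_workers   # everyone waits for the slowest
    return max(clocks)

# Same seed, so both runs see exactly the same per-step delays.
frequent = simulate(num_workers=64, num_steps=1000, barrier_every=1)
rare = simulate(num_workers=64, num_steps=1000, barrier_every=50)

print(f"barrier every step:     {frequent:.0f} time units")
print(f"barrier every 50 steps: {rare:.0f} time units")
```

With a barrier after every step, the run pays for the slowest worker at each step. With rare barriers, fast and slow steps average out within each worker before anyone has to wait.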

Steady-State Computation
- Each subsystem of a large system (CPU, memory, interconnect, disk) has low average utilization during a long computation
- Each subsystem is the performance bottleneck during part of the computation
- Utilization is not steady-state, hence the need to over-provision each subsystem
- Proposed solution A: power management, to reduce subsystem consumption when not on critical path
  - Hard (in theory and in practice)
- Proposed solution B: Techniques for steady-state computation
  - E.g., communication/computation overlap
- Need research in software (programming models, compilers, run-time) and architecture
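A minimal timing model of communication/computation overlap, assuming one link, one compute unit, and two buffers, so the transfer for the next chunk proceeds while the current chunk is being processed. The chunk times are hypothetical, and the model ignores contention and start-up effects.

```python
def serial_time(compute, communicate):
    """Fetch each chunk, then compute on it, with no overlap."""
    return sum(compute) + sum(communicate)

def overlapped_time(compute, communicate):
    """Double-buffered pipeline over the same chunks.

    Transfer i may start once transfer i-1 has finished and the buffer used
    by chunk i-2 has been released; compute i starts once its data has
    arrived and compute i-1 has finished.
    """
    comm_done, comp_done = [], []
    for i, (comp, comm) in enumerate(zip(compute, communicate)):
        link_free = comm_done[i - 1] if i >= 1 else 0.0
        buffer_free = comp_done[i - 2] if i >= 2 else 0.0
        comm_done.append(max(link_free, buffer_free) + comm)
        cpu_free = comp_done[i - 1] if i >= 1 else 0.0
        comp_done.append(max(comm_done[i], cpu_free) + comp)
    return comp_done[-1] if comp_done else 0.0

# Eight chunks with balanced compute and transfer times (hypothetical values).
compute = [1.0] * 8
communicate = [1.0] * 8
print("serial:    ", serial_time(compute, communicate))
print("overlapped:", overlapped_time(compute, communicate))
```

In the serial schedule the link and the compute unit are each idle half the time. In the overlapped schedule both stay busy after the first transfer, which is the steady-state behavior the slide asks for.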

Friction-less Software Layering
- Current HW/SW architectures have developed multiple, rigid levels of abstraction (ISA, VM, APIs, languages, ...)
  - Facilitates SW development but energy is lost at layer matching
- Flexible specialization makes it possible to regain lost performance
  - Inlining, on-line compilation, code morphing
- Similar techniques are needed for OS layers

Self-Organizing Computations
- Hardware continuously changes (failures, power management)
- Algorithms have more dynamic behavior (multigrid, multiscale; adapt to evolution of simulated system)
- Mapping of computation to HW needs to be continuously adjusted
- Too hard to do in a centralized manner -> Need distributed, hill climbing algorithms
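One way to picture a distributed hill-climbing scheme is local load balancing on a ring: each node looks only at one neighbour and hands over a unit of work when that reduces their imbalance. This is a hypothetical illustration of the idea, not a method described in the talk; the ring topology, unit-sized work items, and the sequential sweep standing in for concurrent nodes are all assumptions.

```python
def local_sweep(loads):
    """One pass in which each node compares itself with its right-hand neighbour.

    A unit of work moves only if the two loads differ by at least 2, so
    every move strictly reduces the local imbalance. Returns True if any
    work moved during the pass.
    """
    n = len(loads)
    moved = False
    for node in range(n):
        neighbour = (node + 1) % n
        if loads[node] - loads[neighbour] >= 2:
            loads[node] -= 1
            loads[neighbour] += 1
            moved = True
        elif loads[neighbour] - loads[node] >= 2:
            loads[neighbour] -= 1
            loads[node] += 1
            moved = True
    return moved

def rebalance(initial_loads, max_sweeps=10_000):
    """Repeat local sweeps until no node can improve; no global view is used."""
    loads = list(initial_loads)
    sweeps = 0
    while sweeps < max_sweeps and local_sweep(loads):
        sweeps += 1
    return loads, sweeps

# All work starts on one node, e.g. after a remapping triggered by a failure.
balanced, sweeps = rebalance([40, 0, 0, 0, 0, 0, 0, 0])
print(f"after {sweeps} sweeps: {balanced}")
```

Like any hill climber, this stops at a local optimum: neighbours end up within one unit of each other, but the result is not guaranteed to be the globally best mapping.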

Resiliency
- HW for fault correction (and possibly fault detection) may be too expensive (consumes too much power) and is a source of jitter
- Current global checkpoint/restart algorithms cannot cope with an MTBF of a few hours or less
- Need SW (language, compiler, runtime) support for error compartmentalization
  - Catch error before it propagates
- May need fault-tolerant algorithms
- Need new complexity theory
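The pressure on global checkpoint/restart can be estimated with Young's first-order approximation for the checkpoint interval, sqrt(2 * C * MTBF), where C is the time to write one checkpoint. This approximation is brought in here for illustration and is not part of the talk; the 20-minute checkpoint cost is a hypothetical value, and the formula is only meaningful while C is much smaller than the MTBF.

```python
import math

def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order estimate of the best time between checkpoints."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

def wasted_fraction(checkpoint_cost_s, mtbf_s):
    """Approximate fraction of machine time lost to checkpoints and rework.

    First-order model: checkpoint overhead C/T plus an expected rework of
    half an interval per failure, T/(2 * MTBF), at the interval from
    young_interval. Capped at 1.0, where the model has broken down.
    """
    interval = young_interval(checkpoint_cost_s, mtbf_s)
    waste = checkpoint_cost_s / interval + interval / (2 * mtbf_s)
    return min(1.0, waste)

checkpoint_cost_s = 20 * 60   # hypothetical: 20 minutes per global checkpoint

for label, mtbf_s in (("1 day", 86_400), ("4 hours", 4 * 3_600), ("40 minutes", 40 * 60)):
    waste = wasted_fraction(checkpoint_cost_s, mtbf_s)
    print(f"MTBF {label:>10}: about {waste:.0%} of machine time wasted")
```

Under these assumptions the wasted fraction climbs steeply as the MTBF approaches the checkpoint cost. At a 40-minute interval between failures the estimate reaches 100%, which means no useful work gets done.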

Summary
- The end of Moore's era will fundamentally change the IT industry and CS research
  - A much stronger emphasis on compute efficiency
  - A more systematic and rigorous study of sources of inefficiencies
- The quest for exascale at a reasonable power budget is the first move into this new domain
