Instructor: Dr. Mainak Chaudhuri. Instructor: Dr. S. K. Aggarwal. Instructor: Dr. Rajat Moona

NPTEL Online - IIT Kanpur
Course Name: Program Optimization for Multi-core Architecture
Department: Computer Science and Engineering, IIT Kanpur
Instructors: Dr. Mainak Chaudhuri, Dr. S. K. Aggarwal, Dr. Rajat Moona

The Lecture Contains:
- Mind-boggling Trends in Chip Industry
- Agenda
- Unpipelined Microprocessors
- Pipelining
- Pipelining Hazards
- Control Dependence
- Data Dependence
- Structural Hazard
- Out-of-order Execution
- Multiple Issue
- Out-of-Order Multiple Issue
- Moore's Law

Mind-boggling Trends in Chip Industry
- Long history since 1971
- Introduction of the Intel 4004: http://www.intel4004.com/
- Today we talk about more than one billion transistors on a chip
- Intel Montecito (in market since July '06) has 1.7B transistors
- Die size has increased steadily (what is a die?)
- Intel Prescott: 112 mm^2, Intel Pentium 4EE: 237 mm^2, Intel Montecito: 596 mm^2
- Minimum feature size has shrunk from 10 micron in 1971 to 0.045 micron today

Agenda
- Unpipelined microprocessors
- Pipelining: simplest form of ILP
- Out-of-order execution: more ILP
- Multiple issue: drink more ILP
- Scaling issues and Moore's Law
- Why multi-core
- TLP and de-centralized design
- Tiled CMP and shared cache
- Implications on software
- Research directions

Unpipelined Microprocessors
- Typically an instruction enjoys five phases in its life
  - Instruction fetch from memory
  - Instruction decode and operand register read
  - Execute
  - Data memory access
  - Register write
- Unpipelined execution would take a long single cycle or multiple short cycles
- Only one instruction inside processor at any point in time

Pipelining
- One simple observation
  - Exactly one piece of hardware is active at any point in time
- Why not fetch a new instruction every cycle?
  - Five instructions in five different phases
  - Throughput increases five times (ideally)
- Bottom line: if consecutive instructions are independent, they can be processed in parallel
  - The first form of instruction-level parallelism (ILP)
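A rough cycle count makes the "five times (ideally)" claim concrete. The sketch below assumes that each of the five phases takes exactly one cycle and that no hazards occur; the function names are illustrative and do not come from the lecture.

```python
STAGES = 5  # fetch, decode/register read, execute, memory access, register write


def unpipelined_cycles(n_instructions, stages=STAGES):
    """One instruction at a time: each one occupies the processor for all phases."""
    return n_instructions * stages


def pipelined_cycles(n_instructions, stages=STAGES):
    """A new instruction enters every cycle; the first one needs `stages` cycles
    to drain, and every later one finishes one cycle after its predecessor."""
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1)


def speedup(n_instructions, stages=STAGES):
    """Ratio of unpipelined to pipelined cycle counts (assumes equal cycle time)."""
    return unpipelined_cycles(n_instructions, stages) / pipelined_cycles(n_instructions, stages)


if __name__ == "__main__":
    for n in (5, 100, 1_000_000):
        print(n, unpipelined_cycles(n), pipelined_cycles(n), round(speedup(n), 3))
```

For 5 instructions this gives 25 cycles against 9; the speedup only approaches 5 as the instruction count grows, which is why the slide qualifies the figure with "ideally".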

Pipelining Hazards
- Instruction dependence limits achievable parallelism
  - Control and data dependence (aka hazards)
- Finite amount of hardware limits achievable parallelism
  - Structural hazards
- Control dependence
  - On average, every fifth instruction is a branch (coming from if-else, for, do-while, ...)
  - Branches execute in the third phase
  - Introduces bubbles unless you are smart

Control Dependence
- What do you fetch in X and Y slots? (the two fetch slots right after a branch, before it executes in the third phase)
- Options: nothing, fall-through, learn past history and predict (today best predictors achieve on average 97% accuracy for SPEC2000)

Data Dependence

- Take three bubbles?
  - Back-to-back dependence is too frequent
- Solution: hardware bypass paths
  - Allow the ALU to bypass the produced value in time: not always possible

Data Dependence
- Need a live bypass! (requires some negative time travel: not yet feasible in real world)
- No option but to take one bubble
- Bigger problems: load latency is often high; you may not find the data in cache

Structural Hazard
- Usual solution is to put more resources
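The bubble counts on the hazard slides above can be turned into rough cycles-per-instruction (CPI) estimates. This is a minimal sketch under stated assumptions, not the lecture's own model: a branch that is handled wrongly costs two bubbles (the X and Y slots, since branches execute in the third phase), a dependent pair without bypass costs the three bubbles the slide mentions, and the one unavoidable bubble is assumed to be the usual case of a load followed immediately by its consumer, which the lost figure presumably showed. All function and parameter names are hypothetical.

```python
def branch_cpi(branch_fraction=0.2, bubbles_per_branch=2, accuracy=0.0):
    """CPI of an otherwise ideal pipeline where only branches cost bubbles.

    branch_fraction: share of instructions that are branches
                     (0.2 = "every fifth instruction").
    bubbles_per_branch: fetch slots lost per wrongly handled branch
                        (assumed 2: the X and Y slots).
    accuracy: fraction of branches for which the right instructions were
              fetched; 0.0 models the "fetch nothing" option.
    """
    return 1.0 + branch_fraction * (1.0 - accuracy) * bubbles_per_branch


def dependence_bubbles(producer_is_load, bypass):
    """Bubbles between a producer and a consumer that immediately follows it."""
    if not bypass:
        return 3  # slide: "Take three bubbles?"
    # With bypass paths an ALU result arrives in time; a loaded value
    # (assumption, see lead-in) arrives one phase too late.
    return 1 if producer_is_load else 0


if __name__ == "__main__":
    print("fetch nothing:", branch_cpi())
    print("97% accurate predictor:", branch_cpi(accuracy=0.97))
    print("ALU -> use, no bypass:", dependence_bubbles(False, bypass=False))
    print("ALU -> use, bypass:", dependence_bubbles(False, bypass=True))
    print("load -> use, bypass:", dependence_bubbles(True, bypass=True))
```

Under these assumptions, fetching nothing after every branch gives a CPI of about 1.4, while a 97% accurate predictor brings it down to about 1.012. The fall-through option can be modelled by setting `accuracy` to the fraction of branches that are not taken, a number the slides do not give.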

Out-of-order Execution
- Results must become visible in-order

Multiple Issue
- Results must become visible in-order
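The requirement that results become visible in order can be illustrated with a small sketch. It assumes each instruction has a known cycle at which it finishes executing and that any number of instructions may become visible in the same cycle; both are simplifications, and the function name is hypothetical rather than taken from the lecture.

```python
def visible_times(completion_times):
    """Given completion cycles listed in program order, return the cycle at
    which each result may become visible.

    An instruction's result is held back until every earlier instruction in
    program order has also completed, so visibility times never decrease.
    """
    visible = []
    latest = 0
    for done in completion_times:
        latest = max(latest, done)  # wait for all older instructions
        visible.append(latest)
    return visible


if __name__ == "__main__":
    # The second instruction is slow (say, a cache miss); the third and
    # fourth finish early but must wait for it.
    print(visible_times([3, 10, 4, 5]))
```

In the example, the instructions completing at cycles 4 and 5 still become visible only at cycle 10, behind the slow instruction that precedes them in program order.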

Out-of-order Multiple Issue
- Some hardware nightmares
  - Complex issue logic to discover independent instructions
  - Increased pressure on cache
    - Impact of a cache miss is much bigger now in terms of lost opportunity
    - Various speculative techniques are in place to ignore the slow and stupid memory
  - Increased impact of control dependence
    - Must feed the processor with multiple correct instructions every cycle
    - One cycle of bubble means lost opportunity of multiple instructions
  - Complex logic to verify

Moore's Law
- Number of transistors on-chip doubles every 18 months
  - So much of innovation was possible only because we had transistors
  - Phenomenal 58% performance growth every year
- Moore's Law is facing a danger today
  - Power consumption is too high when clocked at multi-GHz frequency and it is proportional to the number of switching transistors
  - Wire delay doesn't decrease with transistor size
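The two growth figures on the Moore's Law slide can be related with a line of arithmetic: a quantity that doubles every 18 months grows by a factor of 2^(12/18) per year. The sketch below only checks that arithmetic. The slide states the 18-month doubling for transistor counts and the 58% figure for performance, so treating them as the same curve is an assumption made here for illustration; the function names are hypothetical.

```python
def annual_growth(doubling_months=18.0):
    """Fractional growth per year for a quantity that doubles every
    `doubling_months` months."""
    return 2.0 ** (12.0 / doubling_months) - 1.0


def growth_factor(years, doubling_months=18.0):
    """Total multiplication factor after `years` years."""
    return 2.0 ** (12.0 * years / doubling_months)


if __name__ == "__main__":
    print(f"annual growth: {annual_growth():.1%}")
    print(f"factor after 3 years: {growth_factor(3):.1f}")
    print(f"factor after 15 years: {growth_factor(15):.0f}")
```

Doubling every 18 months works out to roughly 58.7% growth per year, which is close to the 58% quoted on the slide, and to a factor of 4 every three years.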