EECE 321: Computer Organization


EECE 321: Computer Organization
Mohammad M. Mansour
Dept. of Electrical and Computer Engineering, American University of Beirut
Lecture 21: Pipelining

Processor Pipelining

The same principles can be applied to processors, where we pipeline instruction execution. For MIPS, the 5 stages are pipelined. Designate the 5 stages as follows:
- Instruction fetch (IF)
- Instruction decode and operand fetch: Reg
- ALU operation execution: ALU
- Access an operand in data memory: Data access
- Write the result into a register: Reg

An instruction executes by doing the appropriate work in each stage. Example: a load passes through Instruction Fetch, Reg read, ALU, Data access, and Reg write. An R-format instruction doesn't use the data access stage: Instruction Fetch, Reg read, ALU, Reg write.

Convention: all stages are balanced, and register writes occur during the first half of a stage while register reads occur during the second half.

Processor Pipelining

Example: Assume we pipeline the 5 steps of executing instructions in MIPS (lw, sw, add, sub, and, or, slt, beq). Assume access to every functional unit takes 2 ns, except the register file, which takes 1 ns.

Pipelining Performance

A single-cycle, non-pipelined implementation requires 8 ns to execute an instruction, so the time between the first and fourth instructions is 3 x 8 = 24 ns. In the pipelined processor, the clock cycle must be long enough to accommodate the slowest operation (i.e., 2 ns), so the time between the first and fourth instructions is 3 x 2 = 6 ns.

Pipelining speedup: if all pipeline stages are perfectly balanced, then the ideal time between instructions on the pipelined machine is

    time between instructions (pipelined) = time between instructions (non-pipelined) / number of stages

This speedup cannot always be attained due to pipelining limitations and overhead. In our example, the speedup is not reflected in the total execution time: T_nonpipelined = 24 ns and T_pipelined = 14 ns, so T_nonpipelined / T_pipelined = 1.71. What happens if we increase the number of instructions from 3 to 1003? T_nonpipelined = 1003 x 8 = 8024 ns, T_pipelined = 1000 x 2 + 14 = 2014 ns, so T_nonpipelined / T_pipelined = 3.98.

Pipelining improves performance by increasing throughput, as opposed to decreasing the execution time of an individual instruction. Instruction throughput is the important metric because real programs execute billions of instructions.
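The timing arithmetic above can be reproduced with a short sketch. The helper names below are illustrative (they do not come from the slides); the formula for the pipelined case assumes the first instruction fills all 5 stages and each later instruction completes one cycle after the previous one.

```python
# Reproduce the lecture's timing example: 5-stage pipeline, 8 ns
# single-cycle instruction time, 2 ns pipelined clock (slowest stage).

def nonpipelined_time(n_instructions, instr_time_ns=8):
    """Total time when each instruction runs start to finish alone."""
    return n_instructions * instr_time_ns

def pipelined_time(n_instructions, cycle_ns=2, n_stages=5):
    """First instruction takes n_stages cycles; each later one adds a cycle."""
    return (n_stages + n_instructions - 1) * cycle_ns

for n in (3, 1003):
    t_np = nonpipelined_time(n)
    t_p = pipelined_time(n)
    print(n, t_np, t_p, round(t_np / t_p, 2))
# 3 instructions:    24 ns vs 14 ns   -> speedup 1.71
# 1003 instructions: 8024 ns vs 2014  -> speedup 3.98
```

As the instruction count grows, the speedup approaches the ideal factor of 8/2 = 4, matching the discussion above.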

Pipelining

Hence pipelining improves performance by increasing instruction throughput. The ideal speedup is the number of stages in the pipeline: ideal speedup = k. Do we achieve this?

What makes pipelining easy?
- All instructions have the same length (makes instruction fetching easy).
- There are just a few instruction formats (instruction decode is simple, and we can start to fetch operands before knowing the instruction type).
- Memory operands appear only in loads and stores (we can use the execute stage to compute the address, then access memory in the following stage).

The MIPS instruction set was intentionally designed for pipelined execution.

What makes pipelining hard?

There are situations where the next instruction cannot be executed. These are called pipeline hazards. There are 3 types of pipeline hazards:
1) Structural hazards: suppose we had only one memory.
2) Control hazards: do we always fetch instructions in sequence? We need to worry about branch instructions.
3) Data hazards: an instruction depends on a previous instruction.

1. Structural Hazards

A structural hazard means that the hardware cannot support the combination of instructions that we want to execute in the same clock cycle. Example: consider execution of the following instructions on a pipelined processor with a single memory unit.

lw $1, 100($0)
lw $2, 104($0)
lw $3, 108($0)
lw $4, 112($0)

With a single memory, the instruction fetch of one load conflicts with the data-memory access of an earlier load in the same clock cycle.
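A quick sketch (my own toy model, not from the slides) makes the conflict concrete: with no stalls, instruction i occupies stage s in cycle i + s, so IF of instruction i+3 lands in the same cycle as MEM of instruction i, and both need the single memory.

```python
# Toy cycle calculator for an ideal (no-stall) 5-stage pipeline.
# Stage indices: IF=0, ID=1, EX=2, MEM=3, WB=4.

IF_STAGE, MEM_STAGE = 0, 3

def stage_cycle(instr_index, stage):
    """Cycle in which instruction `instr_index` occupies `stage` (no stalls)."""
    return instr_index + stage

# Instruction 0 reaches MEM in the same cycle instruction 3 is in IF,
# so with a single shared memory the two accesses collide:
print(stage_cycle(0, MEM_STAGE), stage_cycle(3, IF_STAGE))  # 3 3
```

Separate instruction and data memories (or caches) remove this particular conflict, which is why the MIPS pipeline assumes them.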

2. Control Hazards (Branch Hazards)

This hazard arises from the need to make a decision based on the results of one instruction while others are executing. It is typical of branch instructions. There are three solutions to control hazards.

Solution 1 - Stall: let the pipeline pause before continuing execution of other instructions until the branch decision is resolved. Assume we add extra hardware so that the branch decision is known in the second stage. The next instruction then cannot start immediately after the branch, but only 1 clock cycle later. This is indicated by a bubble in the pipeline; a NOP is inserted in MIPS code.

2. Control Hazards

Stalling the pipeline slows down execution, especially if we cannot resolve the branch decision in the 2nd stage (in our MIPS datapath it is resolved in the 3rd stage).

Solution 2 - Predict the decision of the branch (branch prediction): one simple scheme is to predict that the branch fails (is not taken). This does not slow down the pipeline when branches fail. However, when a branch is taken, the pipeline stalls, and the actions of the instruction already fetched after the branch (e.g., a lw) must be disabled once the branch decision is known.
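The cost of predict-not-taken can be sketched with a simple CPI model (an assumed model, not from the slides): only branches that are actually taken pay the misprediction penalty.

```python
# Effective CPI under predict-not-taken, assuming a base CPI of 1 and a
# fixed penalty (in cycles) paid only by taken branches.

def effective_cpi(branch_frac, taken_frac, penalty_cycles=1, base_cpi=1.0):
    """Only the taken branches pay the penalty under predict-not-taken."""
    return base_cpi + branch_frac * taken_frac * penalty_cycles

# e.g. 20% of instructions are branches, half of them taken, 1-cycle penalty:
print(effective_cpi(0.20, 0.5))  # 1.1
```

If the branch is resolved later in the pipeline, the penalty grows and so does the CPI loss, which motivates the remaining solutions.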

2. Control Hazards

Solution 3 - Delayed branches: this solution is actually used in MIPS. Place an instruction that is not affected by the branch (e.g., an instruction appearing before the branch) immediately after it, and delay taking the branch by 1 clock cycle (i.e., delay loading of the PC one more clock cycle). Example: add $4,$5,$6 doesn't affect the branch, so it can be moved into the delayed branch slot.

Summary of Control Hazard Solutions

- Stall the pipeline
- Do branch prediction
- Use delayed branches

3. Data Hazards

A data hazard occurs when the next instruction depends on the result generated by the current instruction:

add $s0,$t0,$t1  # producer of $s0
sub $t2,$s0,$t3  # consumer of $s0

The add instruction doesn't write its result until the 5th stage, so we would need to add two bubbles to the pipeline (2 NOPs in MIPS).

Solution: observe that we don't have to wait for the add instruction to complete to resolve the data hazard. As soon as add finishes its third stage, the sum is ready and can be forwarded to sub. Getting the missing item early from the internal resources is called forwarding or bypassing.

3. Data Hazards - Example 1

For the two instructions below, show which pipeline stages can be connected by forwarding. Use the figure below to represent the datapath during the 5 stages.

add $s0,$t0,$t1
sub $t2,$s0,$t3

Solution: forward the value of $s0 directly from ALUout.

3. Data Hazards - Example 2

Repeat for the following pair of instructions. Does forwarding remove all stalls?

lw $s0,20($t1)
sub $t2,$s0,$t3

Solution: since the result of the load is available only after the 4th stage (in the MDR), and sub needs it in its 3rd stage, we still need to insert a pipeline bubble.
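The two examples above can be summarized in a small stall calculator (a toy sketch under the slides' assumptions, not a full pipeline model): without forwarding a back-to-back consumer waits two cycles for write-back; with forwarding an ALU result costs no bubbles, while a load-use pair still costs one.

```python
# Bubbles between a producer and an immediately following consumer in
# the classic 5-stage MIPS pipeline (illustrative model).

def stalls(producer, forwarding):
    """Return the number of pipeline bubbles the consumer must wait."""
    if not forwarding:
        return 2          # wait for the result to be written back (slide example)
    if producer == "lw":
        return 1          # load-use: data ready only after the MEM stage
    return 0              # ALU result forwarded straight from ALUout

print(stalls("add", forwarding=False))  # 2
print(stalls("add", forwarding=True))   # 0
print(stalls("lw",  forwarding=True))   # 1
```

This is exactly why forwarding removes all stalls in Example 1 but not in Example 2.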

3. Data Hazards - Reordering Code to Avoid Pipeline Stalls

Can the assembler or compiler rearrange the code to eliminate stalls?

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)

Solution:

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t0, 4($t1)
sw $t2, 0($t1)
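The compiler's job here can be sketched as a load-use hazard check over the instruction sequence. The representation below (opcode, destination, source list) and the function name are my own simplifications for illustration.

```python
# Detect load-use hazards: an instruction that reads a register loaded
# by the instruction immediately before it needs a bubble.

def load_use_hazards(instrs):
    """Return indices of instructions that consume a just-loaded register."""
    hazards = []
    for i in range(1, len(instrs)):
        op_prev, dest_prev, _ = instrs[i - 1]
        _, _, srcs = instrs[i]
        if op_prev == "lw" and dest_prev in srcs:
            hazards.append(i)
    return hazards

original = [
    ("lw", "$t0", ["$t1"]),
    ("lw", "$t2", ["$t1"]),
    ("sw", None,  ["$t2", "$t1"]),   # stores $t2 right after loading it
    ("sw", None,  ["$t0", "$t1"]),
]
reordered = [original[0], original[1], original[3], original[2]]

print(load_use_hazards(original))   # [2]: sw of $t2 follows lw of $t2
print(load_use_hazards(reordered))  # []: swapping the stores removes the stall
```

Swapping the two stores is safe here because they touch different addresses, and it eliminates the only load-use pair.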

Reordering Code to Support Structural & Data Hazards

[Figure: two pipeline timing diagrams. In (1), four independent loads (lw $1,100($0) through lw $4,112($0)) are separated by 4 NOPs. In (2), the loads are dependent (lw $2,104($1), lw $3,108($2), lw $4,112($3)) and are padded with instructions from before the sequence, requiring the same number of NOPs.]

Reorganized Single-Cycle Datapath

Pipelined Execution

Pipelined Datapath: Adding Pipeline Registers

Forward the information needed in later execution stages through pipeline registers. Give them names: IF/ID, ID/EX, EX/MEM, MEM/WB.
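A minimal sketch of how the four pipeline registers behave: each clock cycle, every instruction's state advances one register to the right, and whatever leaves MEM/WB has completed write-back. (This is an abstraction for intuition only; the real registers carry control signals and data, not opcode strings.)

```python
# Model the four pipeline registers as a shift list:
# index 0 = IF/ID, 1 = ID/EX, 2 = EX/MEM, 3 = MEM/WB.

def step(pipeline, fetched):
    """Advance every instruction one register and latch a new fetch."""
    retired = pipeline[-1]                   # leaves MEM/WB (write-back done)
    return [fetched] + pipeline[:-1], retired

pipeline = [None] * 4
for instr in ["lw", "sw", "add", "sub", "or"]:
    pipeline, retired = step(pipeline, instr)

print(pipeline)  # ['or', 'sub', 'add', 'sw'] -- lw has already retired
print(retired)   # lw
```

The key point the slides make next is what each register must carry forward so that a stage never reaches back for information from an earlier cycle.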

Execution of lw on the Pipelined Datapath: IF (1/5)

Execution of lw on the Pipelined Datapath: ID (2/5)

Execution of lw on the Pipelined Datapath: EX (3/5)

Execution of lw on the Pipelined Datapath: MEM (4/5)

Execution of lw on the Pipelined Datapath: WB (5/5)

Where is the loaded value written? The datapath above has a problem.

Corrected Pipelined Datapath to Properly Handle lw