CMSC 611: Advanced Computer Architecture


CMSC 611: Advanced Computer Architecture. Pipelining. Some material adapted from Mohamed Younis, UMBC CMSC 611 Spring 2003 course slides; some material adapted from Hennessy & Patterson, 2003 Elsevier Science.

Sequential Laundry. [Timeline figure, 6 PM to midnight: loads A, B, C, D each run wash (30), dry (40), fold (20) back to back.] The washer takes 30 min, the dryer takes 40 min, and folding takes 20 min. Sequential laundry takes 6 hours for 4 loads. If they learned pipelining, how long would laundry take? Slide: Dave Patterson

Pipelined Laundry. [Timeline figure, 6 PM onward: load A starts washing at 6 PM; each subsequent load's wash begins as soon as the washer is free, so drying overlaps washing.] Pipelining means starting work as soon as possible. Pipelined laundry takes 3.5 hours for 4 loads. Slide: Dave Patterson

Pipelining Lessons. [Timeline figure, 6 PM to 9:30 PM: pipelined loads A through D.] Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload. Pipeline rate is limited by the slowest pipeline stage. Multiple tasks operate simultaneously using different resources. Potential speedup = number of pipe stages. Unbalanced lengths of pipe stages reduce speedup. Time to fill the pipeline and time to drain it reduce speedup. Stall for dependencies. Slide: Dave Patterson
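The laundry numbers above are easy to sanity-check with a short Python sketch (a hypothetical calculation, assuming each load starts a stage as soon as both the load and the machine are free, and that the 40-minute dryer is the bottleneck stage):

```python
# Laundry stages in minutes: washer, dryer, fold
STAGES = [30, 40, 20]
LOADS = 4

# Sequential: each load finishes all stages before the next begins
sequential = LOADS * sum(STAGES)  # 4 * 90 = 360 min

# Pipelined: after the first load fills the pipeline, one load completes
# per bottleneck interval (the 40-minute dryer)
bottleneck = max(STAGES)
pipelined = sum(STAGES) + (LOADS - 1) * bottleneck  # 90 + 3*40 = 210 min

print(sequential / 60, "hours sequential")  # 6.0
print(pipelined / 60, "hours pipelined")    # 3.5
```

The 6.0 and 3.5 hours match the two slides, and the bottleneck term illustrates the "slowest stage" lesson directly.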

MIPS Instruction Set. RISC is characterized by the following features that simplify implementation: all operations apply only to registers; memory is affected only by load and store; instructions follow very few formats and typically are of the same size. The three formats (bits 31..0):
R-type: op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
I-type: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
J-type: op (6 bits) | target address (26 bits)

MIPS Instruction Formats: R-type (register). Used by most operations, e.g. add $t1, $s3, $s4 # $t1 = $s3 + $s4. rd, rs, rt are all registers; op is always 0, and funct gives the actual function. Fields (bits 31..0): op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)

MIPS Instruction Formats: I-type (immediate). One immediate operand: addi $t1, $s2, 32 # $t1 = $s2 + 32. Load and store within ±2^15 of a register: lw $t0, 32($s2) # $t0 = $s2[32] or *(32+$s2). Load immediate values: lui $t0, 255 # $t0 = (255<<16); li $t0, 255. Fields (bits 31..0): op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)

MIPS Instruction Formats: I-type (immediate). PC-relative conditional branch, ±2^15 from the PC after the instruction: beq $s1, $s2, L1 # goto L1 if ($s1 == $s2); bne $s1, $s2, L1 # goto L1 if ($s1 != $s2). Fields (bits 31..0): op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)

MIPS Instruction Formats: J-type (jump). Unconditional jump: j L1 # goto L1. The address is concatenated to the top bits of the PC; fixed addressing within 2^26. Fields (bits 31..0): op (6 bits) | target address (26 bits)
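The three field layouts above can be checked mechanically. Here is a minimal Python sketch of the packing; the function names (encode_r, encode_i, encode_j) are made up for illustration, and the register numbers follow the standard MIPS convention ($t1 = 9, $s3 = 19, $s4 = 20):

```python
def encode_r(op, rs, rt, rd, shamt, funct):
    # R-type: op(6) | rs(5) | rt(5) | rd(5) | shamt(5) | funct(6)
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

def encode_i(op, rs, rt, imm):
    # I-type: op(6) | rs(5) | rt(5) | immediate(16), immediate masked to 16 bits
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

def encode_j(op, target):
    # J-type: op(6) | target(26), target masked to 26 bits
    return (op << 26) | (target & 0x3FFFFFF)

# add $t1, $s3, $s4  ->  op = 0, funct = 0x20 (add)
word = encode_r(0, 19, 20, 9, 0, 0x20)
print(hex(word))  # 0x2744820
```

The result 0x02744820 is the machine word a MIPS assembler produces for add $t1, $s3, $s4.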

Single-cycle Execution. [Datapath figure.] Figure: Dave Patterson

Multi-Cycle Implementation of MIPS
1. Instruction fetch cycle (IF): IR ← Mem[PC]; NPC ← PC + 4
2. Instruction decode/register fetch cycle (ID): A ← Regs[IR 6..10]; B ← Regs[IR 11..15]; Imm ← ((IR 16)^16 ## IR 16..31)
3. Execution/effective address cycle (EX):
   Memory ref: ALUOutput ← A + Imm
   Reg-Reg: ALUOutput ← A func B
   Reg-Imm: ALUOutput ← A op Imm
   Branch: ALUOutput ← NPC + Imm; Cond ← (A op 0)
4. Memory access/branch completion cycle (MEM):
   Memory ref: LMD ← Mem[ALUOutput] or Mem[ALUOutput] ← B
   Branch: if (Cond) PC ← ALUOutput
5. Write-back cycle (WB):
   Reg-Reg: Regs[IR 16..20] ← ALUOutput
   Reg-Imm: Regs[IR 11..15] ← ALUOutput
   Load: Regs[IR 11..15] ← LMD

Multi-cycle Execution. [Datapath figure showing the five numbered cycles.] Figure: Dave Patterson

Stages of Instruction Execution. [Figure: a load occupies cycles 1-5 as Ifetch, Reg/Dec, Exec, Mem, WB.] The load instruction is the longest. All instructions follow at most the following five steps: Ifetch: instruction fetch; fetch the instruction from the instruction memory and update the PC. Reg/Dec: register fetch and instruction decode. Exec: calculate the memory address. Mem: read the data from the data memory. WB: write the data back to the register file. Slide: Dave Patterson

Instruction Pipelining. Start handling the next instruction while the current instruction is in progress. This is feasible when different devices are used at different stages. [Figure: six instructions in program order, each flowing through IFetch, Dec, Exec, Mem, WB, offset one stage apart in time.] Time between instructions (pipelined) = Time between instructions (non-pipelined) / Number of pipe stages. Pipelining improves performance by increasing instruction throughput.

Example of Instruction Pipelining. [Figure: three loads in program order, lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), shown non-pipelined (8 ns per instruction: instruction fetch, register read, ALU, data access, register write) and pipelined (one 2 ns stage per clock).] Non-pipelined, the time between the first and fourth instructions is 3 × 8 = 24 ns. Pipelined, the time between the first and fourth instructions is 3 × 2 = 6 ns. The ideal and upper bound for speedup is the number of stages in the pipeline.
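Those 24 ns and 6 ns figures follow directly from the stage times; here is a quick Python check (the 8 ns instruction time and 2 ns stage time come from the slide):

```python
nonpipelined_instr_time = 8  # ns: one lw through the whole single-cycle datapath
pipelined_cycle = 2          # ns: clock period set by the slowest pipeline stage

# Gap between the first and the fourth instruction in each design
gap_nonpipelined = 3 * nonpipelined_instr_time  # 24 ns
gap_pipelined = 3 * pipelined_cycle             # 6 ns

speedup = gap_nonpipelined / gap_pipelined      # 4.0, below the 5-stage upper bound
```

Note the measured speedup (4x) is less than the 5-stage ideal because the 2 ns stages do not perfectly balance the 8 ns datapath (5 × 2 ns = 10 ns > 8 ns).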

Single Cycle. [Timing figure: Clk; cycle 1 holds Load, cycle 2 holds Store plus wasted time.] Cycle time is long enough for the longest instruction. Shorter instructions waste time. No overlap. Figure: Dave Patterson

Multiple Cycle. [Timing figure, cycles 1-10: Load takes Ifetch, Reg, Exec, Mem, Wr; Store takes Ifetch, Reg, Exec, Mem; the R-type begins its Ifetch.] Cycle time is long enough for the longest stage. Shorter stages waste time. Shorter instructions can take fewer cycles. No overlap. Figure: Dave Patterson

Pipeline. [Timing figure, cycles 1-10: Load, Store, and R-type each flow through Ifetch, Reg, Exec, Mem, Wr, overlapped one cycle apart.] Cycle time is long enough for the longest stage. Shorter stages waste time. There is no additional benefit from shorter instructions. Instruction execution is overlapped. Figure: Dave Patterson

Pipeline Performance. Pipelining increases instruction throughput, not the execution time of an individual instruction. An individual instruction can even be slower, due to additional pipeline control and imbalance among pipeline stages. Suppose we execute 100 instructions:
Single-cycle machine: 45 ns/cycle × 1 CPI × 100 inst = 4500 ns
Multi-cycle machine: 10 ns/cycle × 4.2 CPI (due to inst mix) × 100 inst = 4200 ns
Ideal 5-stage pipelined machine: 10 ns/cycle × (1 CPI × 100 inst + 4 cycle drain) = 1040 ns
We lose some performance to pipeline fill and drain.
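The three totals above can be reproduced in a few lines of Python (all numbers are taken from the slide):

```python
instrs = 100

single_cycle = 45 * 1 * instrs     # 45 ns/cycle, CPI 1        -> 4500 ns
multi_cycle = 10 * 4.2 * instrs    # 10 ns/cycle, CPI 4.2      -> 4200 ns
pipelined = 10 * (1 * instrs + 4)  # 10 ns/cycle, CPI 1 + 4 fill/drain cycles -> 1040 ns

print(single_cycle, multi_cycle, pipelined)
```

The "+ 4" term is the fill/drain overhead of a 5-stage pipeline; as the instruction count grows, its relative cost shrinks and the pipelined time approaches 10 ns per instruction.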

Pipeline Datapath. Every stage must be completed in one clock cycle to avoid stalls. Values must be latched to ensure correct execution of instructions. The PC multiplexer has moved to the IF stage to prevent two instructions from updating the PC simultaneously (in the case of a branch instruction). This is a data-stationary design: each instruction's values travel through the pipeline latches along with it.

Pipeline Stage Interface (events at each pipe stage, by instruction type):
IF, any instruction:
  IF/ID.IR ← Mem[PC];
  IF/ID.NPC, PC ← (if ((EX/MEM.opcode == branch) & EX/MEM.cond) {EX/MEM.ALUOutput} else {PC + 4})
ID, any instruction:
  ID/EX.A ← Regs[IF/ID.IR 6..10]; ID/EX.B ← Regs[IF/ID.IR 11..15];
  ID/EX.NPC ← IF/ID.NPC; ID/EX.IR ← IF/ID.IR;
  ID/EX.Imm ← (IF/ID.IR 16)^16 ## IF/ID.IR 16..31
EX, ALU instruction:
  EX/MEM.IR ← ID/EX.IR;
  EX/MEM.ALUOutput ← ID/EX.A func ID/EX.B, or EX/MEM.ALUOutput ← ID/EX.A op ID/EX.Imm;
  EX/MEM.cond ← 0
EX, load or store:
  EX/MEM.IR ← ID/EX.IR; EX/MEM.ALUOutput ← ID/EX.A + ID/EX.Imm;
  EX/MEM.cond ← 0; EX/MEM.B ← ID/EX.B
EX, branch:
  EX/MEM.ALUOutput ← ID/EX.NPC + ID/EX.Imm; EX/MEM.cond ← (ID/EX.A op 0)
MEM, ALU instruction:
  MEM/WB.IR ← EX/MEM.IR; MEM/WB.ALUOutput ← EX/MEM.ALUOutput
MEM, load or store:
  MEM/WB.IR ← EX/MEM.IR;
  MEM/WB.LMD ← Mem[EX/MEM.ALUOutput], or Mem[EX/MEM.ALUOutput] ← EX/MEM.B
WB, ALU instruction:
  Regs[MEM/WB.IR 16..20] ← MEM/WB.ALUOutput, or Regs[MEM/WB.IR 11..15] ← MEM/WB.ALUOutput
WB, load only:
  Regs[MEM/WB.IR 11..15] ← MEM/WB.LMD

Pipeline Hazards. Hazards are cases that affect instruction execution semantics and thus need to be detected and corrected. Hazard types: Structural hazard: an attempt to use a resource in two different ways at the same time, e.g., a single memory for instructions and data. Data hazard: an attempt to use an item before it is ready; an instruction depends on the result of a prior instruction still in the pipeline. Control hazard: an attempt to make a decision before a condition is evaluated, as with branch instructions. Hazards can always be resolved by waiting.

Visualizing Pipelining. [Figure, clock cycles 1-7: four instructions in program order, each flowing through Ifetch, Reg, ALU, DMem, Reg, offset one cycle apart.] Slide: David Culler

Example: One Memory Port / Structural Hazard. [Figure, clock cycles 1-7: Load followed by Instr 1-4; with a single memory port, Load's DMem access coincides with a later instruction's Ifetch in the same cycle: a structural hazard.] Slide: David Culler

Resolving Structural Hazards. 1. Wait: must detect the hazard (easier with a uniform ISA) and must have a mechanism to stall (easier with a uniform pipeline organization). 2. Throw more hardware at the problem: use instruction and data caches rather than direct access to a single memory.

Detecting and Resolving Structural Hazard. [Figure, clock cycles 1-7: Load followed by Instr 1-2; Instr 3 is stalled for one cycle (a bubble advances through the pipeline), so its Ifetch no longer competes with Load's DMem access.] Slide: David Culler

Stalls & Pipeline Performance.
Pipelining speedup = (Average instruction time unpipelined) / (Average instruction time pipelined)
                   = (CPI unpipelined / CPI pipelined) × (Clock cycle unpipelined / Clock cycle pipelined)
The ideal pipelined CPI is 1, so:
CPI pipelined = Ideal CPI + pipeline stall cycles per instruction = 1 + pipeline stall cycles per instruction
Speedup = (CPI unpipelined / (1 + pipeline stall cycles per instruction)) × (Clock cycle unpipelined / Clock cycle pipelined)
Assuming all pipeline stages are balanced:
Speedup = Pipeline depth / (1 + pipeline stall cycles per instruction)
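Under the balanced-stage assumption, the final formula is easy to experiment with; here is a small Python helper (the function name and sample stall rates are illustrative, not from the slides):

```python
def pipeline_speedup(depth, stall_cycles_per_instr):
    """Speedup of a balanced pipeline with an ideal CPI of 1:
    depth / (1 + stall cycles per instruction)."""
    return depth / (1 + stall_cycles_per_instr)

print(pipeline_speedup(5, 0))    # 5.0: no stalls, full depth-of-5 speedup
print(pipeline_speedup(5, 0.5))  # ~3.33: half a stall cycle per instruction
```

This makes the cost of hazards concrete: even a modest average of 0.5 stall cycles per instruction gives away a third of the ideal 5x speedup.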