Pipelining. Example: Doing the laundry. Ann, Brian, Cathy, & Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold

Pipelining
Readings: 4.5-4.8
Example: Doing the laundry
Ann, Brian, Cathy, & Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
Folder takes 20 minutes

Sequential Laundry
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, each taking 30 + 40 + 20 minutes, one after another]
Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start work ASAP
[Figure: timeline starting at 6 PM; loads A, B, C, D overlapped in task order, with intervals of 30, 40, 40, 40, 40, 20 minutes]
Pipelined laundry takes 3.5 hours for 4 loads
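The 6-hour and 3.5-hour figures can be checked with a short Python sketch. The function names are illustrative, and the model assumes a load may wait between stages until the next machine is free.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load runs all stages before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Each stage starts a load as soon as both the load and the stage are free."""
    stage_free_at = [0] * len(stage_minutes)  # when each machine is next available
    for _ in range(loads):
        ready = 0  # when this load is ready for its next stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, stage_free_at[s])
            stage_free_at[s] = start + minutes
            ready = stage_free_at[s]
    return stage_free_at[-1]


STAGES = [30, 40, 20]  # washer, dryer, folder
seq = sequential_minutes(STAGES, 4)
pipe = pipelined_minutes(STAGES, 4)
print(seq / 60, pipe / 60)  # 6.0 3.5
```

The pipelined total is 30 + 4 x 40 + 20 = 210 minutes: after the first wash, the 40-minute dryer sets the pace.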

Pipelining Lessons
[Figure: pipelined laundry timeline from 6 PM; loads A, B, C, D in task order, with intervals of 30, 40, 40, 40, 40, 20 minutes]
Pipelining doesn't help latency of a single task; it helps throughput of the entire workload
Pipeline rate is limited by the slowest pipeline stage
Multiple tasks operate simultaneously using different resources
Potential speedup = number of pipe stages
Unbalanced lengths of pipe stages reduce speedup
Time to fill the pipeline and time to drain it reduce speedup
Stall for dependences

Pipelined Execution
[Figure: six instructions in program-flow order, each passing through IFetch, Dcd, Exec, Mem, WB, staggered along the time axis]
Now we just have to make it work

Single Cycle vs. Pipeline
[Figure: clock timing diagrams. Single Cycle Implementation: Load in Cycle 1, Store in Cycle 2, with part of the cycle marked Waste. Pipeline Implementation: Cycles 1-10, with a Load, a Store, and a third instruction each passing through Ifetch, Reg, Exec, Mem, Wr, overlapped]

Why Pipeline?
Suppose we execute 100 instructions
Single Cycle Machine: 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
Ideal pipelined machine: 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
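As a quick check of the two execution-time expressions on this slide, here is a small Python sketch. The function names are illustrative, and the 4-cycle drain is modeled as (stages - 1) for a 5-stage pipeline.

```python
def single_cycle_time_ns(cycle_ns, n_inst, cpi=1):
    """Total time when every instruction takes one long cycle."""
    return cycle_ns * cpi * n_inst


def pipelined_time_ns(cycle_ns, n_inst, stages=5, cpi=1):
    """Total time for an ideal pipeline, including the cycles needed to drain it."""
    drain_cycles = stages - 1
    return cycle_ns * (cpi * n_inst + drain_cycles)


single = single_cycle_time_ns(45, 100)
pipelined = pipelined_time_ns(10, 100)
print(single, pipelined, round(single / pipelined, 2))  # 4500 1040 4.33
```

The speedup (about 4.3x) falls short of the 5x stage count because the pipelined clock is 10 ns rather than 45/5 = 9 ns, and because of the drain cycles.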

CPI for Pipelined Processors
Ideal pipelined machine: 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
CPI in a pipelined processor is the issue rate. Ignore fill/drain, ignore latency.
Example: A processor wastes 2 cycles after every branch, and 1 after every load, during which it cannot issue a new instruction. If a program has 10% branches and 30% loads, what is the CPI on this program?
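One way to work the example: start from the ideal issue rate of 1 cycle per instruction and add the wasted cycles, weighted by how often each instruction class occurs. The sketch below uses a hypothetical helper, `pipelined_cpi`, and assumes the penalties simply add.

```python
def pipelined_cpi(penalties, base_cpi=1.0):
    """penalties maps an instruction class to (fraction of instructions, wasted cycles)."""
    return base_cpi + sum(frac * wasted for frac, wasted in penalties.values())


cpi = pipelined_cpi({"branch": (0.10, 2), "load": (0.30, 1)})
print(round(cpi, 2))  # 1.5
```

That is, CPI = 1 + 0.10 x 2 + 0.30 x 1 = 1.5.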

Pipelined Datapath
Divide datapath into multiple pipeline stages
IF: Instruction Fetch
RF: Register Fetch
EX: Execute
MEM: Data Memory
WB: Writeback
[Figure: datapath components under the stages: PC, Instr. Memory, Register File, Data Memory, Register File]

Pipelined Control
The Main Control generates the control signals during Reg/Dec
Control signals for Exec (ALUOp, ALUSrc, ...) are used 1 cycle later
Control signals for Mem (MemWE, Mem2Reg, ...) are used 2 cycles later
Control signals for Wr (RegWE, ...) are used 3 cycles later
[Figure: Main Control in the Reg/Dec stage; the signals ALUSrc, ALUOp, MemWE, Mem2Reg, and RegWE are carried forward through the IF/ID, ID/Ex, Ex/Mem, and Mem/Wr pipeline registers to the Exec, Mem, and Wr stages]

Can pipelining get us into trouble?
Yes: Pipeline Hazards
structural hazards: attempt to use the same resource two different ways at the same time
  E.g., combined washer/dryer would be a structural hazard, or folder busy doing something else (watching TV)
data hazards: attempt to use item before it is ready
  E.g., one sock of pair in dryer and one in washer; can't fold until get sock from washer through dryer
  instruction depends on result of prior instruction still in the pipeline
control hazards: attempt to make decision before condition evaluated
  E.g., washing football uniforms and need to get proper detergent level; need to see after dryer before next load in
  branch instructions
Can always resolve hazards by waiting
  pipeline control must detect the hazard
  take action (or delay action) to resolve hazards

Pipelining the Load Instruction
The five independent functional units in the pipeline datapath are:
Instruction Memory for the Ifetch stage
Register File's Read ports (bus A and bus B) for the Reg/Dec stage
ALU for the Exec stage
Data Memory for the Mem stage
Register File's Write port (bus W) for the Wr stage
[Figure: clock Cycles 1-7 with the 1st, 2nd, and 3rd LDUR overlapped in the pipeline]

The Four Stages of an ALU Instruction
Ifetch: Fetch the instruction from the Instruction Memory
Reg/Dec: Register Fetch and Instruction Decode
Exec: ALU operates on the two register operands
Wr: Write the ALU output back to the register file
[Figure: Cycles 1-4: Ifetch, Reg/Dec, Exec, Wr]

Structural Hazard
Interaction between ALU instructions and loads causes a structural hazard on writeback
[Figure: clock Cycles 1-9 showing overlapped four-stage instructions (Ifetch, Reg/Dec, Exec, Wr) and a Load]

Important Observation
Each functional unit can only be used once per instruction
Each functional unit must be used at the same stage for all instructions:
Load uses Register File's Write Port during its 5th stage
ALU instructions use Register File's Write Port during their 4th stage
[Figure: Load with stages 1-5; ALU instruction with stages 1-4 (Ifetch, Reg/Dec, Exec, Wr)]
Solution: Delay the ALU instruction's register write by one cycle:
Now ALU instructions also use Register File's write port at Stage 5
Mem stage is a NOOP stage: nothing is being done.
[Figure: ALU instruction with stages 1-5]

Pipelining the ALU Instruction
[Figure: clock Cycles 1-9 with instructions, including a Load, overlapped in the pipeline]

The Four Stages of Store
Ifetch: Fetch the instruction from the Instruction Memory
Reg/Dec: Register Fetch and Instruction Decode
Exec: Calculate the memory address
Mem: Write the data into the Data Memory
Wr: NOOP
Compatible with Load & ALU instructions
[Figure: Store: Ifetch, Reg/Dec, Exec, Mem, Wr]

The Stages of Conditional Branch
Ifetch: Fetch the instruction from the Instruction Memory
Reg/Dec: Register Fetch and Instruction Decode, compute branch target
Exec: Test condition & update the PC
Mem: NOOP
Wr: NOOP
[Figure: Beq: Ifetch, Reg/Dec, Exec, Mem, Wr]

Control Hazard
Branch updates the PC at the end of the Exec stage.
[Figure: clock Cycles 1-9; a CBZ followed by later instructions, including a load, in the pipeline]

Accelerate Branches
When can we compute branch target address?
When can we compute the CBZ condition?
[Figure: pipelined datapath with stages IF (Instruction Fetch), RF (Register Fetch), EX (Execute), MEM (Data Memory), WB (Writeback); components PC, Instr. Memory, Register File, Data Memory, Register File]

Control Hazard 2
Branch updates the PC at the end of the Reg/Dec stage.
[Figure: clock Cycles 1-9; a CBZ followed by later instructions, including a load, in the pipeline. Beq: Ifetch, Reg/Dec, Exec, Mem, Wr]

Solution #1: Stall
Delay loading next instruction, load no-op instead
[Figure: clock Cycles 1-9; CBZ followed by a stall, with the resulting bubble moving through the later pipeline stages]
CPI if all other instructions take 1 cycle, and branches are 20% of instructions?

Solution #2: Branch Prediction
Guess all branches not taken, squash if wrong
[Figure: clock Cycles 1-9; CBZ followed by later instructions, including a load, in the pipeline]
CPI if 50% of branches actually not taken, and branch frequency 20%?
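The CPI questions for stalling and for predict-not-taken can be worked with the same formula. The sketch below assumes a one-cycle branch penalty, which corresponds to the branch updating the PC at the end of Reg/Dec; the helper name `branch_cpi` is illustrative.

```python
def branch_cpi(branch_freq, penalty_cycles, fraction_penalized):
    """CPI when only branches add cycles on top of an ideal CPI of 1."""
    return 1.0 + branch_freq * fraction_penalized * penalty_cycles


# Stall: every branch pays the penalty.
stall_cpi = branch_cpi(0.20, penalty_cycles=1, fraction_penalized=1.0)

# Predict not taken: only the 50% of branches that are taken get squashed.
predict_cpi = branch_cpi(0.20, penalty_cycles=1, fraction_penalized=0.5)

print(round(stall_cpi, 2), round(predict_cpi, 2))  # 1.2 1.1
```

If the branch were instead resolved at the end of Exec, the penalty would be two cycles and the same formula would give 1.4 and 1.2.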

Solution #3: Branch Delay Slot
Redefine branches: Instruction directly after branch always executed
Instruction after branch is the delay slot
Compiler/assembler fills the delay slot
[Code examples; original slide layout not recoverable] ADD X1, X0, X4; CBZ X2, FOO; SUB X2, X0, X3; ADD X1, X0, X4; CBZ X1, FOO; ADD X1, X0, X4; CBZ X1, FOO; ADD X1, X3, X3; FOO: ADD X1, X2, X0; ADD X1, X0, X4; CBZ X1, FOO

Data Hazards
Consider the following code:
ADD X0, X1, X2
SUB X3, X0, X4
AND X5, X0, X6
ORR X7, X0, X8
EOR X9, X0, X10
[Figure: clock Cycles 1-9; ADD, SUB, AND, ORR, EOR each pass through Ifetch, Reg/Dec, Exec, Mem, Wr, staggered by one cycle]

Design Register File Carefully
What if reads see value after write during the same cycle?
ADD X0, X1, X2
SUB X3, X0, X4
AND X5, X0, X6
ORR X7, X0, X8
EOR X9, X0, X10
[Figure: clock Cycles 1-9; ADD, SUB, AND, ORR, EOR each pass through Ifetch, Reg/Dec, Exec, Mem, Wr, staggered by one cycle]
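To see which instructions in this sequence read X0 too early, the timing can be tabulated in Python. This sketch assumes no forwarding, one instruction issued per cycle, register reads in Reg/Dec (stage 2), and register writes in Wr (stage 5); the tuple encoding of instructions is illustrative.

```python
PROGRAM = [
    ("ADD", "X0", ("X1", "X2")),
    ("SUB", "X3", ("X0", "X4")),
    ("AND", "X5", ("X0", "X6")),
    ("ORR", "X7", ("X0", "X8")),
    ("EOR", "X9", ("X0", "X10")),
]

REG_DEC, WR = 2, 5  # 1-based stage positions in Ifetch, Reg/Dec, Exec, Mem, Wr


def stale_readers(program, write_before_read):
    """Return ops that read a register before its producer has written it."""
    hazards = []
    for i, (_, dest, _) in enumerate(program):
        write_cycle = i + WR  # instruction i enters Ifetch in cycle i + 1
        for j in range(i + 1, len(program)):
            op, later_dest, srcs = program[j]
            read_cycle = j + REG_DEC
            if dest in srcs:
                same_cycle = read_cycle == write_cycle
                if read_cycle < write_cycle or (same_cycle and not write_before_read):
                    hazards.append(op)
            if later_dest == dest:
                break  # register redefined; later readers depend on the new producer
    return hazards


print(stale_readers(PROGRAM, write_before_read=False))  # ['SUB', 'AND', 'ORR']
print(stale_readers(PROGRAM, write_before_read=True))   # ['SUB', 'AND']
```

Letting a read see a value written in the same cycle removes the hazard for ORR, which reads X0 in cycle 5, the same cycle in which ADD writes it.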

Forwarding
Add logic to pass last two values from ALU output to ALU input(s) as needed
Forward the ALU output to later instructions
ADD X0, X1, X2
SUB X3, X0, X4
AND X5, X0, X6
ORR X7, X0, X8
EOR X9, X0, X10
[Figure: clock Cycles 1-9; ADD, SUB, AND, ORR, EOR each pass through Ifetch, Reg/Dec, Exec, Mem, Wr, staggered by one cycle]

Forwarding (cont.)
Requires values from last two ALU operations.
Remember destination register for operation.
Compare sources of current instruction to destinations of previous 2.
[Figure: pipelined datapath with stages IF (Instruction Fetch), RF (Register Fetch), EX (Execute), MEM (Data Memory), WB (Writeback); components PC, Instr. Memory, Register File, Data Memory, Register File]
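The comparison described above can be sketched in Python. The selector names (`FWD_PREV1`, `FWD_PREV2`, `REGFILE`) and the tuple encoding are assumptions for illustration, not signal names from the slides.

```python
PROGRAM = [
    ("ADD", "X0", ("X1", "X2")),
    ("SUB", "X3", ("X0", "X4")),
    ("AND", "X5", ("X0", "X6")),
    ("ORR", "X7", ("X0", "X8")),
    ("EOR", "X9", ("X0", "X10")),
]


def forward_select(src, prev1_dest, prev2_dest):
    """Pick the source of one ALU input, preferring the most recent producer."""
    if src == prev1_dest:
        return "FWD_PREV1"  # result of the instruction just ahead
    if src == prev2_dest:
        return "FWD_PREV2"  # result of the instruction two ahead
    return "REGFILE"        # value is already in (or being written to) the register file


def forwarding_plan(program):
    """For each instruction, list where each of its source operands comes from."""
    plan = []
    for i, (op, _dest, srcs) in enumerate(program):
        prev1 = program[i - 1][1] if i >= 1 else None
        prev2 = program[i - 2][1] if i >= 2 else None
        plan.append((op, tuple(forward_select(s, prev1, prev2) for s in srcs)))
    return plan


for op, selects in forwarding_plan(PROGRAM):
    print(op, selects)
```

For this sequence, SUB takes X0 from the previous ALU result, AND takes it from the result two back, and ORR and EOR read the register file, which relies on the register file letting a read see a same-cycle write.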

Data Hazards on Loads
LDUR X0, [X31, 0]
SUB X3, X0, X4
AND X5, X0, X6
ORR X7, X0, X8
EOR X9, X0, X10
[Figure: clock Cycles 1-9; LDUR, SUB, AND, ORR, EOR each pass through Ifetch, Reg/Dec, Exec, Mem, Wr, staggered by one cycle]

Data Hazards on Loads (cont.)
Solution:
Use same forwarding hardware & register file for hazards 2+ cycles later
Force compiler to not allow register reads within a cycle of load
Fill delay slot, or insert no-op.
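The compiler-side rule can be illustrated with a small Python pass that inserts a no-op whenever the instruction right after a load reads the loaded register. This is a minimal sketch; the tuple encoding and the `NOP` placeholder are assumptions, and a real compiler would first try to move a useful, independent instruction into the slot.

```python
NOP = ("NOP", None, ())

PROGRAM = [
    ("LDUR", "X0", ("X31",)),
    ("SUB", "X3", ("X0", "X4")),
    ("AND", "X5", ("X0", "X6")),
    ("ORR", "X7", ("X0", "X8")),
    ("EOR", "X9", ("X0", "X10")),
]


def insert_load_nops(program):
    """Insert a NOP when an instruction reads a register loaded by the one just before it."""
    scheduled = []
    for instr in program:
        _op, _dest, srcs = instr
        if scheduled:
            prev_op, prev_dest, _ = scheduled[-1]
            if prev_op == "LDUR" and prev_dest in srcs:
                scheduled.append(NOP)
        scheduled.append(instr)
    return scheduled


print([op for op, _, _ in insert_load_nops(PROGRAM)])
# ['LDUR', 'NOP', 'SUB', 'AND', 'ORR', 'EOR']
```

Only SUB needs the extra cycle; AND and later instructions are two or more cycles behind the load, which the forwarding hardware and register file already cover.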

Pipelined CPI, cycle time
CPI, assuming compiler can fill 50% of delay slots

Instruction Type   Type Cycles   Type Frequency   Cycles * Freq
ALU                              50%
Load                             20%
Store                            10%
Branch                           20%
CPI:

Pipelined: cycle time = 1ns. Delay for 1M instr:
Single cycle: CPI = 1.0, cycle time = 4.5ns. Delay for 1M instr:
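One way to fill in the table is sketched below in Python. It rests on assumptions the slide does not state explicitly: loads and branches each have one delay slot, an unfilled slot costs one extra cycle, and with 50% of slots filled each such instruction averages 1.5 cycles, while ALU and store instructions take 1 cycle.

```python
# instruction type -> (average cycles, frequency); the cycle counts are assumptions
MIX = {
    "ALU": (1.0, 0.50),
    "Load": (1.5, 0.20),    # 1 cycle + 50% chance of an unfilled delay slot
    "Store": (1.0, 0.10),
    "Branch": (1.5, 0.20),  # 1 cycle + 50% chance of an unfilled delay slot
}

N_INSTR = 1_000_000

cpi = sum(cycles * freq for cycles, freq in MIX.values())
pipelined_ms = N_INSTR * cpi * 1.0 / 1e6     # 1 ns cycle time
single_cycle_ms = N_INSTR * 1.0 * 4.5 / 1e6  # CPI = 1.0, 4.5 ns cycle time

print(round(cpi, 2), round(pipelined_ms, 2), round(single_cycle_ms, 2))
print(round(single_cycle_ms / pipelined_ms, 2))
```

Under these assumptions the pipelined CPI is 1.2, giving about 1.2 ms for 1M instructions against 4.5 ms for the single-cycle machine, a speedup of roughly 3.75x.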

Pipelined CPU Summary