CS420/520 Computer Architecture I


Designing a Pipeline Processor (CA4: Appendix A)
Dr. Xiaobo Zhou, Department of Computer Science
UC. Colorado Springs. Adapted from UCB97 & UCB03.

Recap: Single Cycle Processor
[Figure: single-cycle datapath with the Instruction Fetch Unit, Main Control, ALU Control, register file, Extender, ALU, and Data Memory, and the control signals Branch, Jump, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, and MemWr.]

Recap: Drawbacks of this Single Cycle Processor
Long cycle time: the cycle time must be long enough for the load instruction:
- PC's Clock-to-Q +
- Instruction Memory Access Time +
- Register File Access Time +
- ALU Delay (address calculation) +
- Data Memory Access Time +
- Register File Setup & Writing Time +
- Clock Skew
Cycle time is much longer than needed for all other instructions. Examples:
- R-type instructions do not require data memory access
- Jump requires neither an ALU operation nor data memory access

Recap: Overview of a Multiple Cycle Implementation
The root of the single cycle processor's problems: the cycle time has to be long enough for the slowest instruction.
Solution:
- Break the instruction into smaller steps
- Execute each step (instead of the entire instruction) in one cycle
  - Cycle time: the time it takes to execute the longest step
  - Keep all the steps at a similar length
This is the essence of the multiple cycle processor.
The advantages of the multiple cycle processor:
- Cycle time is much shorter
- Different instructions take different numbers of cycles to complete
  - Load takes five cycles
  - Jump only takes three cycles
- Allows a functional unit to be used more than once per instruction

Recap: Multiple Cycle Processor
MCP: a functional unit can be used more than once per instruction.
[Figure: multiple-cycle datapath with the PC, Ideal Memory, Instruction Reg, Reg File, Extend unit, ALU, ALU Control, and Target register, and the control signals PCWr, PCWrCond, PCSrc, IorD, BrWr, MemWr, IRWr, RegDst, RegWr, ALUSelA, ALUSelB, ALUOp, and ExtOp.]

Outline of Today's Lecture: Pipelining
- Introduction to the Concept of Pipelined Processor
- Pipelined Datapath and Pipelined Control
- How to Avoid Race Condition in a Pipeline Design?
- Pipeline Example: Instructions Interaction

Preview: The Five Stages of Load
Cycle:  1        2         3      4     5
Load:   Ifetch   Reg/Dec   Exec   Mem   Wr
- Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec: Calculate the memory address
- Mem: Read the data from the Data Memory
- Wr: Write the data back to the register file

Pipelining: It's Natural!
Laundry Example
- Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold
- Washer takes 30 minutes
- Dryer takes 40 minutes
- Folder takes 20 minutes

Recap: Multiple Cycle Datapath (base for pipelining)
[Figure: multiple-cycle datapath, annotated with Beqz.]
Allowing a functional unit to be used more than once per instruction is NOT good for pipelining:
- Adder + ALU; Instruction mem + Data mem

Sequential Laundry
[Figure: timeline from 6 PM to midnight; loads A, B, C, D run one after another in task order, each taking 30 + 40 + 20 minutes.]
- Sequential laundry takes 6 hours for 4 loads
- If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start work ASAP
[Figure: timeline starting at 6 PM; loads A, B, C, D overlap in task order, giving time slots of 30, 40, 40, 40, 40, and 20 minutes.]
- Pipelined laundry takes 3.5 hours for 4 loads

Pipelining Lessons
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously (overlapped in execution, invisible to programmers)
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to fill the pipeline and time to drain it reduce speedup
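The laundry numbers can be checked with a few lines of Python. This is a minimal sketch, assuming four identical loads, one machine per stage, and the 30/40/20-minute stage times from the example; the function names are made up for illustration.

```python
# Stage times from the laundry example, in minutes.
STAGES = {"wash": 30, "dry": 40, "fold": 20}

def sequential_minutes(stage_minutes, loads):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stage_minutes)

def pipelined_minutes(stage_minutes, loads):
    """Identical loads overlapped, one machine per stage.

    The first load pays the full latency; every later load finishes one
    slowest-stage interval after the previous one.
    """
    return sum(stage_minutes) + (loads - 1) * max(stage_minutes)

times = list(STAGES.values())
seq = sequential_minutes(times, 4)
pipe = pipelined_minutes(times, 4)
print(f"sequential: {seq / 60} h, pipelined: {pipe / 60} h, speedup: {seq / pipe:.2f}")
```

The speedup stays well below the 3 stages because the stages are unbalanced (the 40-minute dryer sets the rate) and because of the fill and drain time.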

Key Ideas Behind Pipelining
Grading the mid-term exams:
- 5 problems, five people grading the exam
- Each person ONLY grades one problem
- Pass the exam to the next person as soon as one finishes his part
- Assume each problem takes 0.5 hour to grade
  - Each individual exam still takes 2.5 hours to grade
  - But with 5 people, all exams can be graded much quicker
The load instruction has 5 stages:
- Five independent functional units work on each stage
  - Each functional unit is used only once
- The 2nd load can start as soon as the 1st finishes its Ifetch stage
- Each load still takes five cycles to complete
- The throughput, however, is much higher

The Five Stages of Load
Cycle:  1        2         3      4     5
Load:   Ifetch   Reg/Dec   Exec   Mem   Wr
- Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec: Calculate the memory address
- Mem: Read the data from the Data Memory
- Wr: Write the data back to the register file

Pipelining the Load Instruction
[Figure: clock cycles 1 to 7; the 1st lw, 2nd lw, and 3rd lw each start one cycle after the previous one.]
The five independent functional units in the pipeline datapath are:
- Instruction Memory for the Ifetch stage
- Register File's Read ports (busA and busB) for the Reg/Dec stage
- ALU for the Exec stage
- Data Memory for the Mem stage
- Register File's Write port (busW) for the Wr stage
One instruction enters the pipeline every cycle:
- One instruction comes out of the pipeline (completes) every cycle
- The effective Cycles per Instruction (CPI) is 1

Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: timing comparison of the three implementations.]
- Single Cycle Implementation: Load in Cycle 1, Store in Cycle 2; part of the Store's cycle is wasted
- Multiple Cycle Implementation: Load (Ifetch, Reg, Exec, Mem, Wr), then Store (Ifetch, Reg, Exec, Mem), then the next instruction's Ifetch, over Cycles 1 to 10
- Pipeline Implementation: Load (Ifetch, Reg, Exec, Mem, Wr), Store (Ifetch, Reg, Exec, Mem, Wr), and the next instruction (Ifetch, Reg, Exec, Mem, Wr), each starting one cycle after the previous one
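The overlap of back-to-back loads can be tabulated programmatically. The sketch below assumes an ideal five-stage pipeline with no hazards or stalls; `schedule`, `total_cycles`, and `effective_cpi` are illustrative names, not part of the lecture material.

```python
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]

def schedule(num_instructions, stages=STAGES):
    """Return {instruction index: {cycle: stage}}, with cycles numbered from 1."""
    return {
        i: {i + 1 + s: name for s, name in enumerate(stages)}
        for i in range(num_instructions)
    }

def total_cycles(num_instructions, stages=STAGES):
    """Cycles until the last instruction leaves the pipeline."""
    return num_instructions + len(stages) - 1

def effective_cpi(num_instructions, stages=STAGES):
    """Average cycles per instruction, including fill and drain time."""
    return total_cycles(num_instructions, stages) / num_instructions

plan = schedule(3)
for i, row in plan.items():
    cells = [row.get(c, "-") for c in range(1, total_cycles(3) + 1)]
    print(f"lw #{i + 1}: " + " | ".join(f"{cell:7}" for cell in cells))
```

For three loads the table spans 7 cycles; as the instruction count grows, `effective_cpi` approaches 1 because the 4 fill/drain cycles are amortized.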

Why Pipeline?
Suppose we execute 100 instructions:
- Single Cycle Machine: 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
- Multicycle Machine: 10 ns/cycle x a CPI of about 4 (due to inst mix) x 100 inst = about 4000 ns
- Ideal pipelined machine: 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
Compared to the multi-cycle implementation, pipelining reduces the CPI!
Compared to the single-cycle implementation, pipelining reduces the clock cycle time!

The Four Stages of R-type
Cycle:   1        2         3      4
R-type:  Ifetch   Reg/Dec   Exec   Wr
- Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec: ALU operates on the two register operands; update PC
- Wr: Write the ALU output back to the register file
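The "Why Pipeline?" comparison is plain arithmetic, sketched below. The 100-instruction count, the 10 ns multicycle/pipelined clock, and the helper names are assumptions made for this illustration; only the 45 ns single-cycle clock and the 4-cycle drain are stated unambiguously in the transcript.

```python
def single_cycle_ns(cycle_ns, instructions):
    """One long cycle per instruction, so the CPI is 1."""
    return cycle_ns * instructions

def multicycle_ns(cycle_ns, avg_cpi, instructions):
    """Short cycle, but several cycles per instruction on average."""
    return cycle_ns * avg_cpi * instructions

def ideal_pipeline_ns(cycle_ns, instructions, drain_cycles=4):
    """Short cycle and CPI of 1, plus the cycles needed to drain the pipeline."""
    return cycle_ns * (instructions + drain_cycles)

N = 100                                  # assumed instruction count
single = single_cycle_ns(45, N)          # 45 ns cycle
pipe = ideal_pipeline_ns(10, N)          # assumed 10 ns cycle
print(single, pipe, round(single / pipe, 2))
```

`multicycle_ns(10, cpi, N)` gives the multicycle figure once an average CPI for the instruction mix is chosen; the pipeline wins over it on CPI and over the single-cycle machine on cycle time.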

Pipelining the R-type and Load Instruction
[Figure: clock cycles 1 to 9; four-stage R-type instructions (Ifetch, Reg/Dec, Exec, Wr) and a five-stage Load enter the pipeline in successive cycles. Oops! We have a problem!]
We have a problem:
- Two instructions try to write to the register file at the same time!
- Only one write port

Important Observation
- Each functional unit can only be used once per instruction (pipelining vs. multiple cycle)
- Each functional unit must be used at the same stage for all instructions:
  - Load uses the Register File's Write Port during its 5th stage: Ifetch (1), Reg/Dec (2), Exec (3), Mem (4), Wr (5)
  - R-type uses the Register File's Write Port during its 4th stage: Ifetch (1), Reg/Dec (2), Exec (3), Wr (4)

Solution 1: Insert "Bubble" into the Pipeline
[Figure: clock cycles 1 to 9; a pipeline bubble is inserted after the Load so that the following R-type instruction reaches Wr one cycle later.]
- Insert a "bubble" into the pipeline to prevent 2 writes in the same cycle
  - The control logic can be complex
- No instruction is completed during Cycle 5: the effective CPI for load is 2

Solution 2: Delay R-type's Write by One Cycle
- Delay R-type's register write by one cycle:
  - Now R-type instructions also use Reg File's write port at Stage 5
  - Mem stage is a NOOP stage: nothing is being done
Stage:   1        2         3      4     5
R-type:  Ifetch   Reg/Dec   Exec   Mem   Wr
[Figure: clock cycles 1 to 9; R-type and Load instructions, now all five stages long, flow through the pipeline.]
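The write-port conflict, and why Solution 2 removes it, can be checked with a small script. This is a sketch only: the five-instruction sequence and the names `WR_STAGE_UNEQUAL`, `WR_STAGE_UNIFORM`, and `write_port_conflicts` are assumptions for illustration, and one instruction is assumed to issue per cycle with no stalls.

```python
from collections import defaultdict

# Stage number (1-based) in which each instruction kind writes the register file.
WR_STAGE_UNEQUAL = {"R-type": 4, "Load": 5}   # original four-stage R-type
WR_STAGE_UNIFORM = {"R-type": 5, "Load": 5}   # Solution 2: R-type gets a no-op Mem stage

def write_port_conflicts(program, wr_stage):
    """Issue one instruction per cycle (cycle 1, 2, ...) and return
    {cycle: [instruction indexes]} for cycles with more than one register write."""
    writers = defaultdict(list)
    for index, kind in enumerate(program):
        issue_cycle = index + 1
        writers[issue_cycle + wr_stage[kind] - 1].append(index)
    return {cycle: who for cycle, who in writers.items() if len(who) > 1}

# Assumed instruction sequence: a Load surrounded by R-type instructions.
program = ["R-type", "R-type", "Load", "R-type", "R-type"]
print(write_port_conflicts(program, WR_STAGE_UNEQUAL))   # Load and the next R-type collide
print(write_port_conflicts(program, WR_STAGE_UNIFORM))   # no collisions
```

With every instruction writing in stage 5, the write cycle is always the issue cycle plus 4, so two instructions issued in different cycles can never need the single write port at the same time.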

The Four Stages of Store
Cycle:  1        2         3      4
Store:  Ifetch   Reg/Dec   Exec   Mem   (Wr: nothing to do)
- Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec: Calculate the memory address
- Mem: Write the data into the Data Memory

The Four Stages of Beq
Cycle:  1        2         3      4
Beq:    Ifetch   Reg/Dec   Exec   Mem   (Wr: nothing to do)
- Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec:
  - ALU compares the two register operands
  - Adder calculates the branch target address
- Mem: If the registers we compared in the Exec stage are the same, write the branch target address into the PC

Pipelined Datapath
[Figure: pipelined datapath. The IUnit (with the PC), RFile, Exec Unit, and Data Mem are separated by the IF/ID, ID/Ex, Ex/Mem, and Mem/Wr pipeline registers; control signals RegWr, ExtOp, ALUOp, Branch, RegDst, ALUSrc, and MemWr.]
Why not move to ID/RF? OK, but complicated control.

The Instruction Fetch Stage
Location 20: lw $1, 0x100($2)    $1 <- Mem[($2) + 0x100]
You are here: Ifetch
[Figure: pipelined datapath with PC = 24; the IF/ID register holds lw $1, 0x100($2).]

Detail View of the Instruction Unit
Location 20: lw $1, 0x100($2)
You are here: Ifetch
[Figure: instruction unit. The PC addresses the Instruction Memory while an Adder adds 4 to the PC, giving PC = 24; the fetched instruction, lw $1, 0x100($2), goes into the IF/ID register.]

The Decode / Register Fetch Stage
Location 20: lw $1, 0x100($2)    $1 <- Mem[($2) + 0x100]
You are here: Reg/Dec
[Figure: pipelined datapath; the ID/Ex register now holds Reg. 2 and 0x100.]

Load's Address Calculation Stage
Location 20: lw $1, 0x100($2)    $1 <- Mem[($2) + 0x100]
You are here: Exec
[Figure: pipelined datapath; the Ex/Mem register now holds the load's address. Signals annotated at this stage: ALUOp = Add, ExtOp, RegDst, ALUSrc.]

View of the Execution Unit (like in Single Cycle)
You are here: Exec
[Figure: execution unit between the ID/Ex and Ex/Mem registers, with the imm16 Extender, a shift-left-by-2 and Adder producing the branch Target, and the ALU with its ALU Control (ALUctr) producing ALUout; Ex/Mem receives the load's memory address. Signals annotated: ALUOp = Add, ExtOp, ALUSrc. Notes on the figure: "Move to stage 2?" and "32-bits imm in ID/Ex".]

Detail View of the Execution Unit (Integrated)
You are here: Exec
[Figure: integrated execution unit between the ID/Ex and Ex/Mem registers, with the imm16 Extender, shift-left-by-2, Adder, Target, ALU, and ALU Control (ALUctr, ALUout); Ex/Mem receives the load's memory address. Signals annotated: ALUOp = Add, ExtOp, ALUSrc. Notes on the figure: "If Beq? Beqz?"]

Load's Memory Access Stage
Location 20: lw $1, 0x100($2)    $1 <- Mem[($2) + 0x100]
You are here: Mem
[Figure: pipelined datapath; the Mem/Wr register now holds the load's data. Signals annotated at this stage: Branch, MemWr.]

Load's Write Back Stage
Location 20: lw $1, 0x100($2)    $1 <- Mem[($2) + 0x100]
You are somewhere out there! (the next instructions are already in Ifetch, Reg/Dec, Exec, and Mem; the load is in Wr)
[Figure: pipelined datapath; signals annotated at this stage: RegWr.]

How About Control Signals?
Key Observation: Control Signals at Stage N = Func(Instr. at Stage N)
- N = Exec, Mem, or Wr
Example: Control Signals at Exec Stage = Func(Load's Exec)
[Figure: pipelined datapath with the load in Exec; Ex/Mem receives the load's address. Signals annotated: ALUOp = Add, ExtOp, RegDst, ALUSrc.]
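The key observation can be illustrated by decoding each instruction once and letting its control bundle travel with it through the pipeline registers. This is a minimal sketch: the signal names follow the slides, but the signal values are the usual textbook MIPS settings and are an assumption here, as are the names `CONTROL`, `decode`, `view`, and `simulate`.

```python
# Hypothetical control bundles: values are assumed textbook MIPS settings.
CONTROL = {
    "lw":  {"Exec": {"ALUOp": "Add", "ALUSrc": 1}, "Mem": {"MemWr": 0, "Branch": 0}, "Wr": {"RegWr": 1}},
    "sw":  {"Exec": {"ALUOp": "Add", "ALUSrc": 1}, "Mem": {"MemWr": 1, "Branch": 0}, "Wr": {"RegWr": 0}},
    "beq": {"Exec": {"ALUOp": "Sub", "ALUSrc": 0}, "Mem": {"MemWr": 0, "Branch": 1}, "Wr": {"RegWr": 0}},
}

def decode(instr):
    """Main Control: runs once per instruction, during Reg/Dec."""
    return None if instr is None else {"name": instr, **CONTROL[instr]}

def view(bundle, stage):
    """Signals a stage sees: only its own slice of the bundle it holds."""
    return None if bundle is None else (bundle["name"], bundle[stage])

def simulate(program):
    """Return {cycle: {stage: (instruction, signals) or None}}."""
    id_ex = ex_mem = mem_wr = None            # control fields of the pipeline registers
    trace = {}
    stream = list(program) + [None] * 3       # trailing bubbles drain the pipeline
    for n, in_decode in enumerate(stream):
        cycle = n + 2                         # instruction 0 is in Reg/Dec during cycle 2
        trace[cycle] = {
            "Exec": view(id_ex, "Exec"),
            "Mem": view(ex_mem, "Mem"),
            "Wr": view(mem_wr, "Wr"),
        }
        # Clock edge: each bundle moves one pipeline register to the right.
        mem_wr, ex_mem, id_ex = ex_mem, id_ex, decode(in_decode)
    return trace

trace = simulate(["lw", "sw", "beq"])
for cycle, stages in trace.items():
    print(cycle, stages)
```

In cycle 5 of this run, Wr sees the lw's signals, Mem sees the sw's, and Exec sees the beq's: each stage's control depends only on the instruction currently in that stage.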

Pipeline Control
- The Main Control generates the control signals during Reg/Dec
  - Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later
  - Control signals for Mem (MemWr, Branch) are used 2 cycles later
  - Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later
[Figure: the Main Control reads the IF/ID register during Reg/Dec. The ID/Ex register carries ExtOp, ALUSrc, ALUOp, RegDst, MemWr, Branch, and RegWr; the Ex/Mem register carries MemWr, Branch, and RegWr; the Mem/Wr register carries RegWr.]

More Extensive Pipelining Example
Instruction sequence over clock cycles 1 to 8: 0: Load; 4: R-type; 8: Store; 12: Beq (to a branch target)
- End of Cycle 4: Load's Mem, R-type's Exec, Store's Reg, Beq's Ifetch
- End of Cycle 5: Load's Wr, R-type's Mem, Store's Exec, Beq's Reg
- End of Cycle 6: R-type's Wr, Store's Mem, Beq's Exec
- End of Cycle 7: Store's Wr, Beq's Mem
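The end-of-cycle listing for the Load, R-type, Store, Beq sequence follows directly from the issue order. A short sketch, assuming one instruction issues per cycle with no stalls and that every instruction passes through all five stages; `occupancy` is an illustrative name.

```python
STAGES = ["Ifetch", "Reg", "Exec", "Mem", "Wr"]

def occupancy(program, cycle, stages=STAGES):
    """Return {label: stage} for every instruction that is in the pipeline
    during `cycle`, assuming one instruction issues per cycle with no stalls."""
    active = {}
    for index, label in enumerate(program):
        stage_index = cycle - (index + 1)   # instruction `index` is in Ifetch in cycle index + 1
        if 0 <= stage_index < len(stages):
            active[label] = stages[stage_index]
    return active

program = ["Load", "R-type", "Store", "Beq"]
for cycle in range(4, 8):
    print(f"End of Cycle {cycle}:", occupancy(program, cycle))
```

The Load drops out of the listing after Cycle 5 and the R-type after Cycle 6 because they have already completed their Wr stage.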

Pipelining Example: End of Cycle 4
- 0: Load's Mem
- 4: R-type's Exec
- 8: Store's Reg
- 12: Beq's Ifetch
[Figure: pipelined datapath at the end of Cycle 4. PC = 16; IF/ID: Beq instruction; ID/Ex: Store's A & B; Ex/Mem: R-type's result; Mem/Wr: Load's Dout. Signals annotated: ExtOp = x.]

Pipelining Example: End of Cycle 5
- 0: Load's Wr
- 4: R-type's Mem
- 8: Store's Exec
- 12: Beq's Reg
- 16: R-type's Ifetch
[Figure: pipelined datapath at the end of Cycle 5. PC = 20; IF/ID: instruction @ 16; ID/Ex: Beq's A & B; Ex/Mem: Store's address; Mem/Wr: R-type's result. Signals annotated: ALUOp = Add, RegDst = x.]

Pipelining Example: End of Cycle 6
- 4: R-type's Wr
- 8: Store's Mem
- 12: Beq's Exec
- 16: R-type's Reg
- 20: R-type's Ifetch
[Figure: pipelined datapath at the end of Cycle 6. PC = 24; IF/ID: instruction @ 20; ID/Ex: R-type's A & B; Ex/Mem: Beq's results; Mem/Wr: nothing for Store. Signals annotated: ALUOp = Sub, RegDst = x.]

Pipelining Example: End of Cycle 7
- 8: Store's Wr
- 12: Beq's Mem
- 16: R-type's Exec
- 20: R-type's Reg
- 24: R-type's Ifetch
[Figure: pipelined datapath at the end of Cycle 7. IF/ID: instruction @ 24; ID/Ex: R-type's A & B; Ex/Mem: R-type's results; Mem/Wr: nothing for Beq.]

Basic Performance Issues in Pipelining
- Pipelining increases the CPU instruction throughput, but does not reduce the execution time of an individual instruction
  - In fact, it slightly increases the execution time of an instruction
- Pipelining performance limitations
  - Pipelining latency due to hazards
  - Imbalance limits: the clock cannot run faster than the time needed for the slowest pipeline stage; hardware also limits the stage partitioning
  - Pipeline overhead
    - Pipeline register setup and latency (separating instructions at different stages so that they do not interfere with each other)
    - Clock skew: the maximum delay between the clock's arrival at any two registers (delay in signal arrival times)
- When is pipelining useless? Once the clock cycle is as small as the sum of the clock skew and the pipeline register (latch) latency, since no time is left for useful work!

Pipelining Performance Example
An un-pipelined (multi-cycle) processor has a 1 ns clock cycle, and it uses 4 cycles for ALU operations and branches and 5 cycles for memory operations. The relative frequencies of the three operations are 40%, 20%, and 40%. Due to clock skew and setup, pipelining the processor adds 0.2 ns to the clock cycle. Suppose there are no pipelining hazards, so that the pipelined CPI is 1. How much speedup will we gain from a pipeline?
Answer:
- For the un-pipelined processor: Ave. instruction exec. time (IET) = clock cycle time * average CPI = 1 ns * (40% * 4 + 20% * 4 + 40% * 5) = 4.4 ns
- For the pipelined processor: Ave. instruction exec. time = (1 + 0.2) ns * 1 = 1.2 ns
- Speedup = IET_w/o pipelining / IET_w/ pipelining = 4.4 ns / 1.2 ns = 3.7
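The speedup calculation in this example can be reproduced in a few lines. The sketch below assumes the figures of the worked example (1 ns clock, a 40%/20%/40% mix at 4, 4, and 5 cycles, 0.2 ns of pipelining overhead, no hazards); the names `average_cpi` and `MIX` are illustrative.

```python
def average_cpi(mix):
    """Weighted average of cycles per instruction over (frequency, cycles) pairs."""
    return sum(freq * cycles for freq, cycles in mix)

# (relative frequency, cycles) for ALU operations, branches, and memory operations.
MIX = [(0.40, 4), (0.20, 4), (0.40, 5)]

clock_ns = 1.0       # un-pipelined clock cycle
overhead_ns = 0.2    # added by clock skew and register setup when pipelining

unpipelined_iet = clock_ns * average_cpi(MIX)     # average instruction exec. time
pipelined_iet = (clock_ns + overhead_ns) * 1      # CPI of 1 with no hazards
speedup = unpipelined_iet / pipelined_iet

print(f"{unpipelined_iet:.1f} ns / {pipelined_iet:.1f} ns = {speedup:.1f}x")
```

The overhead term is what keeps the speedup below the un-pipelined CPI of 4.4: as the overhead grows relative to the useful work per stage, the gain shrinks.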

Summary
- Disadvantages of the Single Cycle Processor
  - Long cycle time
  - Cycle time is too long for all instructions except the Load
- Multiple Clock Cycle Processor:
  - Divide the instructions into smaller steps
  - Execute each step (instead of the entire instruction) in one cycle
- Pipeline Processor:
  - Natural enhancement of the multiple clock cycle processor
  - Each functional unit can only be used once per instruction
  - If an instruction is going to use a functional unit, it must use it at the same stage as all other instructions
  - Pipeline Control: each stage's control signal depends ONLY on the instruction that is currently in that stage

Where to get more information?
- Appendix A of the CA4 (or CA3) textbook: Chapter A.1 and A.3
- CO2: Chapter 6.1-6.3
- CO3: Chapter 6.1-6.3
- David Patterson and John Hennessy, Computer Organization & Design: The Hardware / Software Interface, Morgan Kaufmann Publishers; CO2 (2nd edition) and CO3 (3rd edition)