CS420/520 Computer Architecture I
Designing a Pipeline Processor (CA4: Appendix A)
Dr. Xiaobo Zhou, Department of Computer Science
UC. Colorado Springs. Adapted from UCB97 & UCB03

Recap: Single Cycle Processor
[Figure: single-cycle datapath. The Instruction Fetch Unit (with Branch and Jump inputs) delivers the instruction; the op field goes to Main Control and the func field to ALU Control (ALUop, ALUctr). The datapath contains the register file (Rw, Ra, Rb, busA, busB, busW; RegDst, RegWr), the imm16 Extender (ExtOp), the ALUSrc mux, the ALU, and the Data Memory (WrEn, Adr, Data In; MemWr).]
Recap: Drawbacks of this Single Cycle Processor
Long cycle time: the cycle time must be long enough for the load instruction:
  - PC's Clock-to-Q +
  - Instruction Memory Access Time +
  - Register File Access Time +
  - ALU Delay (address calculation) +
  - Data Memory Access Time +
  - Register File Setup & Writing Time +
  - Clock Skew
Cycle time is much longer than needed for all other instructions. Examples:
  - R-type instructions do not require data memory access
  - Jump requires neither an ALU operation nor data memory access

Recap: Overview of a Multiple Cycle Implementation
The root of the single cycle processor's problems:
  - The cycle time has to be long enough for the slowest instruction
Solution:
  - Break the instruction into smaller steps
  - Execute each step (instead of the entire instruction) in one cycle
      - Cycle time: time it takes to execute the longest step
      - Keep all the steps to have similar length
  - This is the essence of the multiple cycle processor
The advantages of the multiple cycle processor:
  - Cycle time is much shorter
  - Different instructions take different numbers of cycles to complete
      - Load takes five cycles
      - Jump only takes three cycles
  - Allows a functional unit to be used more than once per instruction
Recap: Multiple Cycle Processor
MCP: a functional unit can be used more than once per instruction
[Figure: multiple-cycle datapath. PC, Ideal Memory (RAdr, WrAdr, Din, Dout), Instruction Reg, Reg File (Ra, Rb, Rw, busW), Extend, << 2, Target register, muxes, ALU, and ALU Control. Control signals: PCWr, PCWrCond, PCSrc, BrWr, IorD, MemWr, IRWr, RegDst, RegWr, ALUSelA, ALUSelB, ALUOp, ExtOp.]

Outline of Today's Lecture: Pipelining
  - Introduction to the Concept of Pipelined Processor
  - Pipelined Datapath and Pipelined Control
  - How to Avoid Race Condition in a Pipeline Design?
  - Pipeline Example: Instructions Interaction
Preview: The Five Stages of Load
[Figure: Load occupies Cycle 1 through Cycle 5: Ifetch, Reg/Dec, Exec, Mem, Wr.]
  - Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory
  - Reg/Dec: Registers Fetch and Instruction Decode
  - Exec: Calculate the memory address
  - Mem: Read the data from the Data Memory
  - Wr: Write the data back to the register file

Pipelining: It's Natural!
Laundry Example
  - Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold
  - Washer takes 30 minutes
  - Dryer takes 40 minutes
  - "Folder" takes 20 minutes
Recap: Multiple Cycle Datapath (base for pipelining)
[Figure: multiple-cycle datapath, including the Beqz path.]
Allowing a functional unit to be used more than once per instruction is NOT good for pipelining:
  - The pipelined datapath needs a separate Adder alongside the ALU, and a separate Instruction mem alongside the Data mem

Sequential Laundry
[Figure: timeline from 6 PM to midnight. Loads A, B, C, D run one after another in task order, each taking 30 + 40 + 20 minutes.]
  - Sequential laundry takes 6 hours for 4 loads
  - If they learned pipelining, how long would laundry take?
Pipelined Laundry: Start work ASAP
[Figure: timeline starting at 6 PM. Loads A, B, C, D overlap in task order; the time axis is divided into 30, 40, 40, 40, 40, 20 minutes.]
  - Pipelined laundry takes 3.5 hours for 4 loads

Pipelining Lessons
  - Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
  - Pipeline rate is limited by the slowest pipeline stage
  - Multiple tasks operate simultaneously (overlapped in execution, invisible to programmers)
  - Potential speedup = Number of pipe stages
  - Unbalanced lengths of pipe stages reduce speedup
  - Time to "fill" the pipeline and time to "drain" it reduce speedup
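The laundry figures above (6 hours sequential, 3.5 hours pipelined) can be reproduced with a short calculation. This is a minimal sketch, not part of the original slides; the function names are hypothetical, and it assumes a finished load can wait between machines until the next machine is free.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Each load enters a stage as soon as it is ready and the stage is free."""
    finish = [0] * len(stage_minutes)  # when each stage finished its latest load
    for _ in range(loads):
        ready = 0  # when this load left the previous stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, finish[s])
            finish[s] = start + minutes
            ready = finish[s]
    return finish[-1]


STAGES = [30, 40, 20]  # washer, dryer, folder (minutes)
seq = sequential_minutes(STAGES, 4)
pipe = pipelined_minutes(STAGES, 4)
print(seq / 60, pipe / 60, round(seq / pipe, 2))
```

The speedup is well below the 3 suggested by the number of stages, because the stages are unbalanced and the pipeline must fill and drain.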
Key Ideas Behind Pipelining
Grading the midterm exams:
  - 5 problems, five people grading the exam
  - Each person ONLY grades one problem
  - Pass the exam to the next person as soon as one finishes his part
  - Assume each problem takes 0.5 hour to grade
      - Each individual exam still takes 2.5 hours to grade
      - But with 5 people, all exams can be graded much quicker
The load instruction has 5 stages:
  - Five independent functional units work on each stage
      - Each functional unit is used only once
  - The 2nd load can start as soon as the 1st finishes its Ifetch stage
  - Each load still takes five cycles to complete
  - The throughput, however, is much higher

The Five Stages of Load
[Figure: Load occupies Cycle 1 through Cycle 5: Ifetch, Reg/Dec, Exec, Mem, Wr.]
  - Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory
  - Reg/Dec: Registers Fetch and Instruction Decode
  - Exec: Calculate the memory address
  - Mem: Read the data from the Data Memory
  - Wr: Write the data back to the register file
Pipelining the Load Instruction
[Figure: three lw instructions (1st, 2nd, 3rd) overlapped across Cycle 1 to Cycle 7, each starting one cycle after the previous one.]
The five independent functional units in the pipeline datapath are:
  - Instruction Memory for the Ifetch stage
  - Register File's Read ports (busA and busB) for the Reg/Dec stage
  - ALU for the Exec stage
  - Data Memory for the Mem stage
  - Register File's Write port (busW) for the Wr stage
One instruction enters the pipeline every cycle
  - One instruction comes out of the pipeline (completes) every cycle
  - The "Effective" Cycles per Instruction (CPI) is 1

Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: the same instruction sequence on three implementations.
 Single Cycle Implementation (Cycle 1, Cycle 2): Load, then Store; part of the Store cycle is wasted.
 Multiple Cycle Implementation (Cycle 1 to Cycle 10): Load takes Ifetch, Reg, Exec, Mem, Wr; Store takes Ifetch, Reg, Exec, Mem; the following R-type then starts its Ifetch.
 Pipeline Implementation: Load, Store, and R-type each pass through Ifetch, Reg, Exec, Mem, Wr, starting one cycle apart.]
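The claim that the effective CPI is 1 holds in the limit of a long instruction stream. A minimal sketch of the arithmetic, assuming an ideal five-stage pipeline with no stalls (the function names are hypothetical):

```python
def pipelined_cycles(num_instructions, depth=5):
    """Cycles until the last instruction leaves an ideal pipeline."""
    return depth + num_instructions - 1


def effective_cpi(num_instructions, depth=5):
    """Total cycles divided by instructions completed."""
    return pipelined_cycles(num_instructions, depth) / num_instructions


# Three loads finish in 7 cycles, as in the "Pipelining the Load Instruction" figure.
print(pipelined_cycles(3))
# The fill/drain cost is amortized as the instruction count grows.
for n in (3, 100, 10_000):
    print(n, effective_cpi(n))
```

For short sequences the fill and drain cycles dominate; for long sequences the CPI approaches 1.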
Why Pipeline?
Suppose we execute a fixed number of instructions on each machine:
  - Single Cycle Machine: total time = 45 ns/cycle x 1 CPI x number of instructions
  - Multicycle Machine: total time = (shorter cycle time) x about 4 CPI (due to inst mix) x number of instructions
  - Ideal pipelined machine: total time = (the same shorter cycle time) x (1 CPI x number of instructions + 4 cycle drain)
Compared to the Multi-cycle implementation, pipelining reduces the CPI!
Compared to the Single-cycle implementation, pipelining reduces the clock cycle time!

The Four Stages of R-type
[Figure: R-type occupies Cycle 1 through Cycle 4: Ifetch, Reg/Dec, Exec, Wr.]
  - Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory
  - Reg/Dec: Registers Fetch and Instruction Decode
  - Exec: ALU operates on the two register operands; update PC
  - Wr: Write the ALU output back to the register file
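The "Why Pipeline?" comparison can be turned into a small calculation. The sketch below is illustrative only: the 45 ns single-cycle time and the 4-cycle drain come from the slide, while the instruction count (100), the shorter cycle time (10 ns), and the multicycle CPI (4.0) are assumed values chosen for the example.

```python
def single_cycle_ns(instructions, cycle_ns=45.0):
    """One long cycle per instruction (CPI = 1)."""
    return cycle_ns * 1 * instructions


def multicycle_ns(instructions, cycle_ns, cpi):
    """Short cycle, but several cycles per instruction."""
    return cycle_ns * cpi * instructions


def pipelined_ns(instructions, cycle_ns, drain_cycles=4):
    """Short cycle, CPI = 1, plus the cycles needed to drain the pipeline."""
    return cycle_ns * (1 * instructions + drain_cycles)


N = 100                # assumed instruction count
SHORT_CYCLE_NS = 10.0  # assumed cycle time of the multicycle and pipelined machines
MULTICYCLE_CPI = 4.0   # assumed average CPI for the instruction mix

print(single_cycle_ns(N))
print(multicycle_ns(N, SHORT_CYCLE_NS, MULTICYCLE_CPI))
print(pipelined_ns(N, SHORT_CYCLE_NS))
```

Under these assumptions the pipelined machine wins on both fronts: it keeps the short cycle of the multicycle design and the CPI of the single-cycle design.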
Pipelining the R-type and Load Instruction
[Figure: R-type, R-type, Load, R-type, R-type issued in consecutive cycles across Cycle 1 to Cycle 9. R-type uses Ifetch, Reg/Dec, Exec, Wr; Load uses Ifetch, Reg/Dec, Exec, Mem, Wr. Callout: "Oops! We have a problem!"]
We have a problem:
  - Two instructions try to write to the register file at the same time!
  - Only one write port

Important Observation
  - Each functional unit can only be used once per instruction (pipelining vs. multiple cycle)
  - Each functional unit must be used at the same stage for all instructions:
      - Load uses Register File's Write Port during its 5th stage
        Load: 1 Ifetch, 2 Reg/Dec, 3 Exec, 4 Mem, 5 Wr
      - R-type uses Register File's Write Port during its 4th stage
        R-type: 1 Ifetch, 2 Reg/Dec, 3 Exec, 4 Wr
Solution 1: Insert "Bubble" into the Pipeline
[Figure: R-type, R-type, Load, R-type, R-type across Cycle 1 to Cycle 9; a pipeline bubble delays the R-type that follows the Load so that its Wr no longer collides with the Load's Wr.]
  - Insert a "bubble" into the pipeline to prevent 2 writes at the same cycle
      - The control logic can be complex
  - No instruction is completed during Cycle 5:
      - The "Effective" CPI for load is 2

Solution 2: Delay R-type's Write by One Cycle
  - Delay R-type's register write by one cycle:
      - Now R-type instructions also use Reg File's write port at Stage 5
      - Mem stage is a NOOP stage: nothing is being done
        R-type: 1 Ifetch, 2 Reg/Dec, 3 Exec, 4 Mem, 5 Wr
[Figure: R-type, R-type, Load, R-type, R-type across Cycle 1 to Cycle 9, every instruction now passing through Ifetch, Reg/Dec, Exec, Mem, Wr.]
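The write-port collision, and why Solution 2 removes it, can be checked mechanically. The following is a minimal sketch under stated assumptions: one instruction is issued per cycle starting at Cycle 1, the issue order R-type, R-type, Load, R-type, R-type is taken from the figure, and the function name is hypothetical.

```python
from collections import defaultdict


def write_port_conflicts(program, wr_stage):
    """Return {cycle: [issue positions]} for cycles with more than one register write.

    program  -- instruction kinds in issue order, one issued per cycle from cycle 1
    wr_stage -- for each kind, the 1-based stage in which it writes the register file
    """
    writes = defaultdict(list)
    for issue_cycle, kind in enumerate(program, start=1):
        write_cycle = issue_cycle + wr_stage[kind] - 1
        writes[write_cycle].append(issue_cycle)
    return {cycle: who for cycle, who in writes.items() if len(who) > 1}


program = ["R-type", "R-type", "Load", "R-type", "R-type"]

# R-type writes in its 4th stage, Load in its 5th: two writes land in one cycle.
print(write_port_conflicts(program, {"R-type": 4, "Load": 5}))

# Solution 2: R-type also writes in stage 5, so every write gets its own cycle.
print(write_port_conflicts(program, {"R-type": 5, "Load": 5}))
```

Once every instruction uses the write port in the same stage, the write cycle depends only on the issue cycle, so two instructions can never collide.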
The Four Stages of Store
[Figure: Store occupies Cycle 1 through Cycle 4: Ifetch, Reg/Dec, Exec, Mem; the Wr slot is unused.]
  - Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory
  - Reg/Dec: Registers Fetch and Instruction Decode
  - Exec: Calculate the memory address
  - Mem: Write the data into the Data Memory

The Four Stages of Beq
[Figure: Beq occupies Cycle 1 through Cycle 4: Ifetch, Reg/Dec, Exec, Mem; the Wr slot is unused.]
  - Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory
  - Reg/Dec: Registers Fetch and Instruction Decode
  - Exec:
      - ALU compares the two register operands
      - Adder calculates the branch target address
  - Mem: If the registers we compared in the Exec stage are the same,
      - write the branch target address into the PC
Pipelined Datapath
[Figure: pipelined datapath. IUnit (PC) -> IF/ID Register -> RFile (Ra, Rb, Rw, Di) -> ID/Ex Register -> Exec Unit -> Ex/Mem Register -> Data Mem (RA, Do, WA, Di) -> Mem/Wr Register. Control signals: RegWr, ExtOp, ALUOp, Branch, RegDst, ALUSrc, MemWr. Annotation in the figure: "Why not move to ID/RF? OK, but complicated control."]

The Instruction Fetch Stage
Location 20: lw $rt, imm16($2)     $rt <- Mem[($2) + imm16]
[Figure: pipelined datapath with the marker "You are here!" at the Ifetch stage. PC = 24; the IF/ID register holds the lw instruction.]
A Detail View of the Instruction Unit
Location 20: lw $rt, imm16($2)
[Figure: instruction unit with the marker "You are here!" at the Ifetch stage. The PC supplies the Address to the Instruction Memory; an Adder adds 4 to the PC, giving PC = 24; the fetched Instruction (the lw) is latched into the IF/ID register.]

The Decode / Register Fetch Stage
Location 20: lw $rt, imm16($2)     $rt <- Mem[($2) + imm16]
[Figure: pipelined datapath with the marker "You are here!" at the Reg/Dec stage. The ID/Ex register receives the contents of Reg. 2 and the immediate.]
Load's Address Calculation Stage
Location 20: lw $rt, imm16($2)     $rt <- Mem[($2) + imm16]
[Figure: pipelined datapath with the marker "You are here!" at the Exec stage. ALUOp=Add; ExtOp, ALUSrc, and RegDst are set for the load. The Ex/Mem register receives the Load's Address.]

A View of the Execution Unit (like in Single Cycle)
[Figure: execution unit with the marker "You are here!" at the Exec stage. The ID/Ex register supplies busA, busB, and imm16. The Extender (SignExt, controlled by ExtOp) widens imm16 to 32 bits; << 2 and an Adder produce the branch Target; the ALUSrc mux feeds the ALU, which is driven by ALU Control (ALUOp=Add, ALUctr) and produces ALUout. The Ex/Mem register receives the Load's Memory Address. Annotations: "Move to stage 2?" and "32-bit imm in ID/Ex".]
A Detail View of the Execution Unit (Integrated)
[Figure: integrated execution unit with the marker "You are here!" at the Exec stage. Same elements as the previous view (ID/Ex register, imm16 Extender, << 2, Adder, Target, ALUSrc mux, ALU, ALU Control, ALUout, Ex/Mem register holding the Load's Memory Address), with ExtOp, ALUSrc, and ALUOp=Add shown. Annotation: "If Beq? Beqz?"]

Load's Memory Access Stage
Location 20: lw $rt, imm16($2)     $rt <- Mem[($2) + imm16]
[Figure: pipelined datapath with the marker "You are here!" at the Mem stage. Branch and MemWr are not asserted for the load. The Mem/Wr register receives the Load's Data.]
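The two computations performed by the execution unit (the load address in the ALU and the branch target in the separate adder) can be sketched as follows. The function names and sample values are hypothetical, and adding the shifted immediate to PC + 4 is an assumption based on the usual MIPS convention; the slides do not state which PC value feeds the adder.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(imm16):
    """Extender: widen a 16-bit immediate to a signed Python integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def load_address(base, imm16):
    """ALU with ALUOp=Add: base register contents plus the extended immediate."""
    return (base + sign_extend16(imm16)) & MASK32


def branch_target(pc_plus_4, imm16):
    """Adder: extended immediate shifted left by 2, added to PC + 4 (assumed)."""
    return (pc_plus_4 + (sign_extend16(imm16) << 2)) & MASK32


# Hypothetical register contents and immediates, for illustration only.
print(hex(load_address(0x1000, 0x0100)))
print(hex(load_address(0x1000, 0xFFFC)))  # negative offset of -4
print(branch_target(24, 0xFFFF))          # offset of -1 instruction
```

Keeping the adder separate from the ALU is what lets Beq compare registers and compute its target in the same Exec cycle.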
Load's Write Back Stage
Location 20: lw $rt, imm16($2)     $rt <- Mem[($2) + imm16]
[Figure: pipelined datapath; the marker now reads "You are somewhere out there!" because the load has reached the Wr stage. RegWr is set so that the loaded data is written into the register file.]

How About Control Signals?
Key Observation: Control Signals at Stage N = Func (Instr. at Stage N)
  - N = Exec, Mem, or Wr
Example: Control Signals at Exec Stage = Func(Load's Exec)
[Figure: pipelined datapath with the load in the Exec stage. ALUOp=Add; ExtOp, ALUSrc, and RegDst are set for the load. The Ex/Mem register receives the Load's Address.]
Pipeline Control
The Main Control generates the control signals during Reg/Dec
  - Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later
  - Control signals for Mem (MemWr, Branch) are used 2 cycles later
  - Control signals for Wr (RegWr) are used 3 cycles later
[Figure: the Main Control sits after the IF/ID Register and produces ExtOp, ALUSrc, ALUOp, RegDst, MemWr, Branch, RegWr. The ID/Ex Register carries all seven signals; the Ex/Mem Register carries MemWr, Branch, RegWr; the Mem/Wr Register carries RegWr.]

Let's Try a More Extensive Pipelining Example
[Figure: clock, Cycle 1 to Cycle 8. Instructions 0: Load, 4: R-type, 8: Store, 12: Beq (with a branch target) are issued in consecutive cycles; the ends of Cycles 4, 5, 6, and 7 are marked.]
  - End of Cycle 4: Load's Mem, R-type's Exec, Store's Reg, Beq's Ifetch
  - End of Cycle 5: Load's Wr, R-type's Mem, Store's Exec, Beq's Reg
  - End of Cycle 6: R-type's Wr, Store's Mem, Beq's Exec
  - End of Cycle 7: Store's Wr, Beq's Mem
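The idea that control signals are generated once in Reg/Dec and then travel with the instruction through the pipeline registers can be sketched as below. This is a minimal illustration: the grouping of signals by stage follows the Pipeline Control slide, but the concrete signal values in `MAIN_CONTROL` are assumed for the example, and all names are hypothetical.

```python
EXEC_SIGNALS = ("ExtOp", "ALUSrc", "ALUOp", "RegDst")
MEM_SIGNALS = ("MemWr", "Branch")
WR_SIGNALS = ("RegWr",)

# Assumed control values for two instructions (None means "don't care").
MAIN_CONTROL = {
    "Load": {"ExtOp": 1, "ALUSrc": 1, "ALUOp": "Add", "RegDst": 0,
             "MemWr": 0, "Branch": 0, "RegWr": 1},
    "Store": {"ExtOp": 1, "ALUSrc": 1, "ALUOp": "Add", "RegDst": None,
              "MemWr": 1, "Branch": 0, "RegWr": 0},
}

EMPTY = {"ID/Ex": {}, "Ex/Mem": {}, "Mem/Wr": {}}


def keep(signals, names):
    """Pass on only the signals that later stages still need."""
    return {name: signals[name] for name in names if name in signals}


def clock(regs, instr_in_decode):
    """Advance the control fields of the pipeline registers by one cycle."""
    decoded = MAIN_CONTROL[instr_in_decode] if instr_in_decode else {}
    return {
        "ID/Ex": dict(decoded),
        "Ex/Mem": keep(regs["ID/Ex"], MEM_SIGNALS + WR_SIGNALS),
        "Mem/Wr": keep(regs["Ex/Mem"], WR_SIGNALS),
    }


regs = clock(EMPTY, "Load")   # Load decoded; its Exec signals are used next cycle
regs = clock(regs, "Store")   # Load's Mem signals now sit in Ex/Mem
regs = clock(regs, None)      # Load's RegWr now sits in Mem/Wr
print(regs)
```

Each stage reads only the fields of the register in front of it, which is why a stage's control depends only on the instruction currently in that stage.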
Pipelining Example: End of Cycle 4
  - 0: Load's Mem
  - 4: R-type's Exec
  - 8: Store's Reg
  - 12: Beq's Ifetch
[Figure: pipelined datapath snapshot. PC = 16; IF/ID: Beq Instruction; ID/Ex: Store's bus A & B; Ex/Mem: R-type's Result; Mem/Wr: Load's Dout. The values of RegWr, ALUOp, ExtOp, Branch, RegDst, ALUSrc, and MemWr are shown at each stage.]

Pipelining Example: End of Cycle 5
  - 0: Lw's Wr
  - 4: R's Mem
  - 8: Store's Exec
  - 12: Beq's Reg
  - 16: R's Ifetch
[Figure: pipelined datapath snapshot. PC = 20; IF/ID: Instruction @ 16; ID/Ex: Beq's bus A & B; Ex/Mem: Store's Address; Mem/Wr: R-type's Result. ALUOp=Add for the Store in Exec; the other control values are shown at each stage.]
Pipelining Example: End of Cycle 6
  - 4: R's Wr
  - 8: Store's Mem
  - 12: Beq's Exec
  - 16: R's Reg
  - 20: R's Ifetch
[Figure: pipelined datapath snapshot. PC = 24; IF/ID: Instruction @ 20; ID/Ex: R-type's bus A & B; Ex/Mem: Beq's Results; Mem/Wr: Nothing for Store. ALUOp=Sub for the Beq in Exec; the other control values are shown at each stage.]

Pipelining Example: End of Cycle 7
  - 8: Store's Wr
  - 12: Beq's Mem
  - 16: R's Exec
  - 20: R's Reg
  - 24: R's Ifetch
[Figure: pipelined datapath snapshot. PC = Beq's branch target; IF/ID: Instruction @ 24; ID/Ex: R-type's bus A & B; Ex/Mem: R-type's Results; Mem/Wr: Nothing for Beq. ALUOp=R-type for the instruction in Exec; the other control values are shown at each stage.]
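The four snapshots above follow a simple rule: an instruction fetched in cycle k is in stage (c - k + 1) during cycle c. A minimal sketch that reproduces the stage lists, assuming one instruction is fetched per cycle starting at Cycle 1, every instruction passes through all five stages, and the effect of the taken branch on later fetches is ignored (the names are hypothetical):

```python
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]
PROGRAM = [0, 4, 8, 12, 16, 20, 24]  # instruction addresses in fetch order


def snapshot(cycle, addresses=PROGRAM, stages=STAGES):
    """Map each in-flight instruction address to the stage it occupies in `cycle`."""
    result = {}
    for index, address in enumerate(addresses):
        stage_index = cycle - 1 - index  # instruction `index` is fetched in cycle index + 1
        if 0 <= stage_index < len(stages):
            result[address] = stages[stage_index]
    return result


for cycle in (4, 5, 6, 7):
    print(cycle, snapshot(cycle))
```

The instruction being fetched in a given cycle sits 4 bytes below the PC value shown in the corresponding figure (for example, 12 is fetched in Cycle 4 and PC = 16 at the end of that cycle).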
Basic Performance Issues in Pipelining
Pipelining increases the CPU instruction throughput, but does not reduce the execution time of an individual instruction
  - In fact, it slightly increases the execution time of an instruction
Pipelining performance limitations:
  - Pipelining latency due to hazards
  - Imbalance limits
      - The clock cannot run faster than the time needed for the slowest pipeline stage; hardware also limits the stage partitioning
  - Pipeline overhead
      - Pipeline register setup and latency (the registers separate instructions at different stages so that they do not interfere with each other)
      - Clock skew: the maximum delay between the clock's arrival at any two registers (delay in signal arrival times)
When is pipelining useless?
  - Once the clock cycle is as small as the sum of the clock skew and the pipeline register (latch) latency, since no time is left for useful work!

Pipelining Performance Example
An un-pipelined (multi-cycle) processor has a 1 ns clock cycle, and it uses 4 cycles for ALU operations and branches and 5 cycles for memory operations. The relative frequencies of these three operations are 40%, 20%, and 40%. Due to clock skew and setup, pipelining the processor adds 0.2 ns to the clock cycle. Suppose there are no pipeline hazards, so that the pipelined CPI is 1. How much speedup will we gain from a pipeline?

Answer:
For the un-pipelined processor:
  Ave. instruction exec. time (IET) = clock cycle time * average CPI
                                    = 1 ns * (40% * 4 + 20% * 4 + 40% * 5) = 4.4 ns
For the pipelined processor:
  Ave. instruction exec. time = (1 + 0.2) ns * 1 = 1.2 ns
Speedup = IET without pipelining / IET with pipelining = 4.4 ns / 1.2 ns = 3.7
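The worked example can be checked with a few lines of code. This is a minimal sketch using the numbers from the example; the function names are hypothetical.

```python
def average_cpi(mix):
    """mix: list of (relative frequency, cycles) pairs."""
    return sum(frequency * cycles for frequency, cycles in mix)


def pipeline_speedup(clock_ns, mix, overhead_ns, pipelined_cpi=1.0):
    """Ratio of average instruction time without and with pipelining."""
    unpipelined = clock_ns * average_cpi(mix)
    pipelined = (clock_ns + overhead_ns) * pipelined_cpi
    return unpipelined / pipelined


# ALU operations, branches, memory operations
MIX = [(0.40, 4), (0.20, 4), (0.40, 5)]
print(round(average_cpi(MIX), 2))
print(round(pipeline_speedup(1.0, MIX, 0.2), 2))
```

The speedup is below the pipeline depth because the 0.2 ns overhead lengthens every cycle; raising `overhead_ns` shows how quickly the benefit erodes.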
Summary
Disadvantages of the Single Cycle Processor:
  - Long cycle time
  - Cycle time is too long for all instructions except the Load
Multiple Clock Cycle Processor:
  - Divide the instructions into smaller steps
  - Execute each step (instead of the entire instruction) in one cycle
Pipeline Processor:
  - Natural enhancement of the multiple clock cycle processor
  - Each functional unit can only be used once per instruction
  - If an instruction is going to use a functional unit:
      - it must use it at the same stage as all other instructions
  - Pipeline Control:
      - Each stage's control signal depends ONLY on the instruction that is currently in that stage

Where to get more information?
  - Appendix A of the CA4 (or CA3) textbook: Chapter A.1 and A.3
  - CO2: Chapter 6.1 to 6.3
  - CO3: Chapter 6.1 to 6.3
  - David Patterson and John Hennessy, Computer Organization & Design: The Hardware / Software Interface, Morgan Kaufmann Publishers; CO2 (2nd edition) and CO3 (3rd edition)