EECE 321: Computer Organization
Mohammad M. Mansour
Dept. of Electrical and Computer Engineering
American University of Beirut
Lecture 21: Pipelining
Processor Pipelining
The same principles can be applied to processors, where we pipeline instruction execution. For MIPS, the 5 stages of execution are pipelined. Designate the 5 stages as follows:
1) Instruction fetch: IF
2) Instruction decode and operand fetch: Reg
3) ALU operation execution: ALU
4) Access an operand in data memory: Data access
5) Write the result into a register: Reg
An instruction executes by doing the appropriate work in each stage.
Example, load: Instruction Fetch | Reg read | ALU | Data Access | Reg write
R-format instruction (doesn't use the data access stage): Instruction Fetch | Reg read | ALU | Data Access (unused) | Reg write
Convention: All stages are balanced. Register writes occur during the first half of a stage, and reads during the second half.
Processor Pipelining
Example: Assume we pipeline the 5 steps of executing instructions in MIPS (lw, sw, add, sub, and, or, slt, beq). Assume that access to every functional unit takes 2 ns, except the register file, which takes 1 ns.
Pipelining Performance
A single-cycle non-pipelined implementation requires 8 ns to execute an instruction (the slowest instruction, lw, needs 2 + 1 + 2 + 2 + 1 = 8 ns). The time between the first and fourth instructions is 3 x 8 = 24 ns.
In the pipelined processor, the clock cycle must be long enough to accommodate the slowest operation (i.e., 2 ns). The time between the first and fourth instructions is 3 x 2 = 6 ns.
Pipelining speedup: If all pipeline stages are perfectly balanced, then the ideal time between instructions in the pipelined machine is equal to
  Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of pipeline stages
This speedup cannot always be attained due to pipelining limitations and overhead.
In our example, the speedup is not reflected in the total execution time of three instructions: T_nonpipelined = 24 ns, T_pipelined = 14 ns, T_nonpipelined / T_pipelined = 1.71.
What happens if we increase the number of instructions from 3 to 1003? T_nonpipelined = 8024 ns, T_pipelined = 1000 x 2 + 14 = 2014 ns, T_nonpipelined / T_pipelined = 3.98.
Pipelining improves performance by increasing throughput, as opposed to decreasing the execution time of an individual instruction. Instruction throughput is the important metric because real programs execute billions of instructions.
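The timing figures on this slide can be reproduced with a short calculation. The sketch below assumes the stage times from the example (2 ns for each functional unit, 1 ns for the register file); the helper names t_nonpipelined and t_pipelined are hypothetical and chosen only for illustration.

```python
# Stage times in ns, taken from the example: IF, Reg read, ALU, Data access, Reg write.
STAGE_NS = [2, 1, 2, 2, 1]


def t_nonpipelined(n):
    """Single-cycle machine: each instruction takes the sum of all stage times."""
    return n * sum(STAGE_NS)


def t_pipelined(n):
    """Pipelined machine: the clock equals the slowest stage. The first
    instruction needs one cycle per stage; each later one adds one cycle."""
    cycle = max(STAGE_NS)
    return (len(STAGE_NS) + (n - 1)) * cycle


for n in (3, 1003):
    ratio = t_nonpipelined(n) / t_pipelined(n)
    print(n, t_nonpipelined(n), t_pipelined(n), round(ratio, 2))
```

As n grows, the ratio approaches 8 / 2 = 4 rather than 5, because the stages in this example are not perfectly balanced.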
Pipelining
Hence pipelining improves performance by increasing instruction throughput. The ideal speedup is the number of stages in the pipeline: Ideal speedup = k, where k is the number of stages. Do we achieve this?
What makes pipelining easy?
1) All instructions have the same length (makes instruction fetching easy).
2) There are just a few instruction formats (this means instruction decode is simple, and we can start to fetch operands before knowing the instruction type).
3) Memory operands appear only in loads and stores (this means we can use the execute stage to compute the address, then access memory in the following stage).
The MIPS instruction set was intentionally designed for pipelined execution.
What makes pipelining hard?
There are situations where the next instruction can't be executed in the following clock cycle. These are called pipeline hazards. There are 3 types of pipeline hazards:
1) Structural hazards: Suppose we had only one memory.
2) Control hazards: Do we always fetch instructions in sequence? We need to worry about branch instructions.
3) Data hazards: An instruction depends on a previous instruction.
1. Structural Hazards
This means that the hardware cannot support the combination of instructions that we want to execute in the same clock cycle.
Example: Consider execution of the following instructions on a pipelined processor with a single memory unit.
Program execution order (time runs left to right in the figure):
  lw $1, 100($0)
  lw $2, 104($0)
  lw $3, 108($0)
  lw $4, 112($0)
Conflict over memory access (assuming a single memory): the first lw accesses data memory in the same clock cycle in which the fourth lw is fetched.
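The conflict in this example can be located by listing which stage each instruction occupies in each clock cycle. The following is a minimal sketch, assuming one instruction is fetched per cycle with no stalls and that instruction fetch and data access share one memory; the function name memory_conflicts is hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def memory_conflicts(uses_data_memory):
    """Return (cycle, fetching_instr, data_access_instr) triples, all 1-indexed,
    for every cycle in which a single shared memory is needed twice.
    uses_data_memory[i] is True if instruction i is a load or store."""
    n = len(uses_data_memory)
    conflicts = []
    for i in range(n):
        if not uses_data_memory[i]:
            continue
        mem_cycle = i + STAGES.index("MEM")  # instruction i is fetched in cycle i
        fetched = mem_cycle                  # instruction fetched in that same cycle
        if fetched < n:
            conflicts.append((mem_cycle + 1, fetched + 1, i + 1))
    return conflicts


# Four consecutive lw instructions, as on the slide.
print(memory_conflicts([True, True, True, True]))  # [(4, 4, 1)]
```

The single reported triple says that in cycle 4 the fourth lw is being fetched while the first lw performs its data access.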
2. Control Hazards (Branch Hazards)
This hazard arises from the need to make a decision based on the results of one instruction while others are executing. This hazard is typical with branch instructions. There are three solutions to control hazards.
Solution 1, Stall: Let the pipeline pause before continuing execution of other instructions until the branch decision is resolved.
Assume we add extra hardware so that the branch decision is known in the second stage. The next instruction then cannot start immediately after the branch, but after 1 clock cycle. This is indicated by a bubble in the pipeline. A NOP is inserted in the MIPS code.
[Figure: pipeline diagram showing the bubble after the branch.]
2. Control Hazards
Stalling the pipeline slows down execution, especially if we can't resolve the branch decision in the 2nd stage (in our MIPS datapath it is resolved in the 3rd stage).
Solution 2, Predict the decision of the branch (Branch Prediction): One simple solution is to predict that the branch is not taken. This solution doesn't slow down the pipeline when branches are not taken. However, when branches are taken, the pipeline stalls.
[Figure: two pipeline diagrams. Branch is NOT taken: execution continues once the branch decision is known. Branch is taken: the actions of the already-fetched lw are disabled.]
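The difference between stalling on every branch and predicting not-taken can be put in numbers. This is a rough sketch, assuming a 5-stage pipeline, a 1-cycle penalty (branch decision known in the second stage), and no other hazards; the function name total_cycles, its parameters, and the instruction counts in the usage lines are made up for illustration.

```python
def total_cycles(n_instructions, branch_outcomes, policy, stages=5, penalty=1):
    """Cycles to run n_instructions, of which len(branch_outcomes) are branches.
    branch_outcomes: one boolean per branch, True if the branch is taken.
    policy: "stall" or "predict_not_taken"."""
    base = stages + (n_instructions - 1)  # hazard-free pipeline
    if policy == "stall":
        bubbles = penalty * len(branch_outcomes)   # every branch costs a bubble
    elif policy == "predict_not_taken":
        bubbles = penalty * sum(branch_outcomes)   # only taken branches cost a bubble
    else:
        raise ValueError(f"unknown policy: {policy}")
    return base + bubbles


# Illustrative numbers only: 100 instructions, 20 branches, 12 of them taken.
outcomes = [True] * 12 + [False] * 8
print(total_cycles(100, outcomes, "stall"))              # 124
print(total_cycles(100, outcomes, "predict_not_taken"))  # 116
```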
2. Control Hazards
Solution 3, Delayed branches: This solution is actually used in MIPS. Place an instruction that is not affected by the branch (e.g., an instruction appearing before the branch) immediately after it. Delay taking the branch by 1 clock cycle (i.e., delay loading of the PC by one more clock cycle).
Example: add $4,$5,$6 doesn't affect the branch, so it can be moved into the delayed branch slot.
Summary of Control Hazard Solutions
1) Stall the pipeline
2) Do branch prediction
3) Use delayed branches
3. Data Hazards
This occurs when the next instruction depends on the result generated by the current instruction:
  add $s0,$t0,$t1    # producer of $s0
  sub $t2,$s0,$t3    # consumer of $s0
The add instruction doesn't write the result until the 5th stage, so we need to add two bubbles to the pipeline (2 NOPs in MIPS).
[Figure: both instructions pass through Instruction Fetch | Reg | ALU | Data Access | Reg.]
Solution: Observe that we don't have to wait for the first add instruction to complete to resolve the data hazard. As soon as the first add finishes its third stage, the sum is ready and can be forwarded to sub. Getting the missing item early from the internal resources is called register forwarding or bypassing.
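The count of two bubbles follows from the stage numbers and the register file convention used in this lecture (writes in the first half of a stage, reads in the second half). The sketch below assumes no forwarding and that the producer is fetched in cycle 1; the function name bubbles_without_forwarding is hypothetical.

```python
def bubbles_without_forwarding(distance=1, write_stage=5, read_stage=2):
    """Bubbles needed between a producer and a consumer that is `distance`
    instructions later, without forwarding.
    Because the register file writes in the first half of a cycle and reads in
    the second half, the consumer may read in the same cycle the producer writes."""
    producer_write_cycle = write_stage           # producer fetched in cycle 1
    consumer_read_cycle = distance + read_stage  # consumer fetched in cycle 1 + distance
    return max(0, producer_write_cycle - consumer_read_cycle)


print(bubbles_without_forwarding(distance=1))  # 2, the add/sub pair above
print(bubbles_without_forwarding(distance=3))  # 0, far enough apart
```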
3. Data Hazards
Example 1: For the two instructions below, show what pipeline stages can be connected by forwarding. Use the figure below to represent the datapath during the 5 stages.
  add $s0,$t0,$t1
  sub $t2,$s0,$t3
Solution: Forward the value of $s0 directly from ALUout of add (available at the end of its 3rd stage) to the ALU input of sub.
3. Data Hazards
Example 2: Repeat for the following pair of instructions. Does forwarding remove all stalls?
  lw $s0,20($t1)
  sub $t2,$s0,$t3
Solution: Here, since the result of the load is available only after the 4th stage (in the MDR), and sub needs it in its 3rd stage, we still need to insert a pipeline bubble.
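The two forwarding examples differ only in the stage after which the producer's result exists. A minimal sketch of that comparison, assuming forwarding paths into the consumer's ALU stage and the producer fetched in cycle 1; the names RESULT_READY_AFTER and bubbles_with_forwarding are hypothetical.

```python
# Stage at the END of which the producer's result first exists.
RESULT_READY_AFTER = {"alu": 3, "load": 4}  # ALUout after stage 3, MDR after stage 4
ALU_STAGE = 3  # the consumer needs its operand at the start of its ALU stage


def bubbles_with_forwarding(producer_kind, distance=1):
    """Bubbles still needed with forwarding, for a consumer `distance`
    instructions after the producer."""
    ready_cycle = RESULT_READY_AFTER[producer_kind]  # result exists at end of this cycle
    need_cycle = distance + ALU_STAGE                # cycle of the consumer's ALU stage
    # A value produced at the end of ready_cycle is usable from ready_cycle + 1.
    return max(0, ready_cycle + 1 - need_cycle)


print(bubbles_with_forwarding("alu"))   # 0, add followed by sub
print(bubbles_with_forwarding("load"))  # 1, lw followed by sub
```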
3. Data Hazards
Reordering Code to Avoid Pipeline Stalls
Can the assembler or compiler rearrange the code to eliminate stalls?
  lw $t0, 0($t1)
  lw $t2, 4($t1)
  sw $t2, 0($t1)
  sw $t0, 4($t1)
Solution: Swap the two stores, so that $t2 is no longer used by the instruction immediately after its load.
  lw $t0, 0($t1)
  lw $t2, 4($t1)
  sw $t0, 4($t1)
  sw $t2, 0($t1)
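The effect of the reordering can be checked by counting load-use pairs in each version. This is a simplified sketch that handles only lw and sw and assumes, as the slide does, that any use of a loaded register by the very next instruction costs one bubble; parse and load_use_stalls are hypothetical helper names.

```python
import re


def parse(instr):
    """Return (opcode, destination register or None, list of source registers)."""
    op = instr.split()[0]
    regs = re.findall(r"\$\w+", instr)
    if op == "lw":
        return op, regs[0], regs[1:]   # lw writes its first register
    if op == "sw":
        return op, None, regs          # sw only reads registers
    raise ValueError(f"unsupported instruction: {instr}")


def load_use_stalls(program):
    """Count instructions that read a register loaded by the previous lw."""
    stalls = 0
    prev_op, prev_dest = None, None
    for instr in program:
        op, dest, sources = parse(instr)
        if prev_op == "lw" and prev_dest in sources:
            stalls += 1
        prev_op, prev_dest = op, dest
    return stalls


original = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t2, 0($t1)", "sw $t0, 4($t1)"]
reordered = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t0, 4($t1)", "sw $t2, 0($t1)"]
print(load_use_stalls(original), load_use_stalls(reordered))  # 1 0
```

Both orderings store the same values to the same addresses, so the swap changes only the timing.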
Reordering Code to Support Structural & Data Hazards
[Figure: two pipeline timing diagrams, with time on the horizontal axis and program execution order on the vertical axis.]
(1) Independent loads:
  lw $1, 100($0)
  lw $2, 104($0)
  lw $3, 108($0)
  lw $4, 112($0)
(2) Dependent loads, each using the previous result as its base register:
  lw $1, 100($0)
  lw $2, 104($1)
  lw $3, 108($2)
  lw $4, 112($3)
Figure annotations: 4 NOPs; instructions from before; same # of NOPs.
Reorganized Single-Cycle Datapath
Pipelined Execution
Pipelined Datapath: Adding Pipeline Registers
Forward the information needed in later execution stages using pipeline registers. Give them names: IF/ID, ID/EX, EX/MEM, MEM/WB.
Execution of lw on Pipelined Datapath: IF (1/5)
Execution of lw on Pipelined Datapath: ID (2/5)
Execution of lw on Pipelined Datapath: EX (3/5)
Execution of lw on Pipelined Datapath: MEM (4/5)
Execution of lw on Pipelined Datapath: WB (5/5)
Where is the loaded value written? The above datapath has a problem.
Corrected Pipelined Datapath to Properly Handle lw
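The figures for these slides are not reproduced here, so the following is a sketch under a stated assumption about what the correction is: in the standard 5-stage MIPS pipeline, the problem with lw write-back is that the write register number is taken from the IF/ID register, which by then holds a later instruction, and the fix is to carry that number along through ID/EX, EX/MEM and MEM/WB. The function name write_register_seen_at_wb is hypothetical.

```python
def write_register_seen_at_wb(dest_regs, instr_index, carry_through_pipeline):
    """Register number presented to the register file's write port while
    instruction `instr_index` is in WB.
    dest_regs[i] is the destination register of instruction i, which is
    fetched in cycle i (0-indexed, one instruction per cycle, no stalls)."""
    wb_cycle = instr_index + 4  # IF +0, ID +1, EX +2, MEM +3, WB +4
    if carry_through_pipeline:
        # The number travelled with the instruction in ID/EX, EX/MEM, MEM/WB.
        return dest_regs[instr_index]
    # Otherwise it comes from IF/ID, which during wb_cycle holds the
    # instruction fetched one cycle earlier.
    in_if_id = wb_cycle - 1
    return dest_regs[in_if_id] if in_if_id < len(dest_regs) else None


# Destinations of lw $1, lw $2, lw $3, lw $4 issued back to back.
dests = [1, 2, 3, 4]
print(write_register_seen_at_wb(dests, 0, carry_through_pipeline=False))  # 4
print(write_register_seen_at_wb(dests, 0, carry_through_pipeline=True))   # 1
```

Without carrying the number along, the value loaded for $1 would be written into $4.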