Instruction Level Parallelism. Data Dependence Static Scheduling


Basic Block
A straight-line code sequence with no branches in except to the entry and no branches out except at the exit.

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDI  R1, R1, #-8
          BNE    R1, R2, Loop

Dependence

    for (i=0; i<=999; i=i+1)
        x[i] = x[i] + a;

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDI  R1, R1, #-8
          BNE    R1, R2, Loop

Data Dependence
Name Dependence
  Antidependence, output dependence
  Register renaming
Hazard

    ADD.D  F4, F0, F2
    ADD.D  F4, F6, F8

  Overlap during execution could change the order of access to the operand involved in the dependence.
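The dependence kinds named on this slide can be illustrated with a minimal sketch. It assumes each instruction is reduced to one destination register and a tuple of source registers; the `Instr` type and the `dependences` helper are hypothetical names introduced here, and the check only compares a pair of instructions without looking at what lies between them.

```python
from typing import NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    text: str
    dest: Optional[str]      # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]    # registers read

def dependences(first: Instr, second: Instr) -> set:
    """Dependences of `second` on an earlier instruction `first`."""
    kinds = set()
    if first.dest is not None and first.dest in second.srcs:
        kinds.add("data (RAW)")      # second reads what first writes
    if second.dest is not None and second.dest in first.srcs:
        kinds.add("anti (WAR)")      # second overwrites what first reads
    if first.dest is not None and first.dest == second.dest:
        kinds.add("output (WAW)")    # both write the same register
    return kinds

ld    = Instr("L.D F0, 0(R1)", "F0", ("R1",))
add1  = Instr("ADD.D F4, F0, F2", "F4", ("F0", "F2"))
add2  = Instr("ADD.D F4, F6, F8", "F4", ("F6", "F8"))
sd    = Instr("S.D F4, 0(R1)", None, ("F4", "R1"))
daddi = Instr("DADDI R1, R1, #-8", "R1", ("R1",))

print(dependences(ld, add1))    # true data dependence on F0
print(dependences(add1, add2))  # output dependence on F4
print(dependences(sd, daddi))   # antidependence on R1
```

The two name dependences are the ones register renaming can remove; the data dependence cannot be renamed away.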

Hazards
Program Order
  ILP preserves program order only where it affects the outcome of the program
Structural Hazards
  Resource conflicts
Data Hazards
  RAW, WAW, WAR
Control Hazards
  Whether or not an instruction should be executed depends on a control decision made by an earlier instruction

Structural Hazard
[Figure: pipeline timing diagram, clock cycles 1 to 9, instructions i1 to i5; the cycle in which overlapping instructions need the same resource is marked HAZARD]

Cost of a Load Structural Hazard
Data references constitute 40% of the instruction mix.
Ideal CPI = 1 (with no structural hazards).
Assume that the processor with the structural hazard has a clock rate that is 1.1 times higher than the clock rate of the processor without the hazard.
Which processor is faster, and by how much?
$$\text{Avg. Instruction Time} = \text{CPI} \times \text{Clock cycle time}$$
$$\text{Avg. Instruction Time}_{\text{ideal}} = \text{CPI} \times \text{Clock cycle time}_{\text{ideal}}$$

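Cost of a Load Structural Hazard
$$\text{Avg. Instruction Time} = \text{CPI} \times \text{Clock cycle time}$$
$$\text{Avg. Instruction Time} = (1 + 0.4 \times 1) \times \frac{\text{Clock cycle time}_{\text{ideal}}}{1.1}$$
$$\text{Avg. Instruction Time} = 1.27 \times \text{Clock cycle time}_{\text{ideal}}$$
The processor without the structural hazard is therefore about 1.27 times faster, despite its lower clock rate.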
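The arithmetic on this slide can be checked with a few lines. This is only a sketch: the variable names are made up for illustration, and the ideal clock cycle time is normalized to 1.

```python
def avg_instruction_time(cpi: float, clock_cycle_time: float) -> float:
    """Average instruction time = CPI x clock cycle time."""
    return cpi * clock_cycle_time

ideal_cycle_time = 1.0        # normalized (assumption for illustration)
load_fraction = 0.4           # data references: 40% of the instruction mix
stall_cycles_per_load = 1     # one stall per load on the hazard machine
clock_rate_ratio = 1.1        # hazard machine clocks 1.1 times faster

t_ideal = avg_instruction_time(1.0, ideal_cycle_time)
t_hazard = avg_instruction_time(
    1.0 + load_fraction * stall_cycles_per_load,
    ideal_cycle_time / clock_rate_ratio,
)

ratio = t_hazard / t_ideal
print(f"hazard machine takes {ratio:.2f}x the ideal instruction time")
```

A ratio above 1 means the machine without the hazard wins, even with the slower clock.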

Resolving Structural Hazards
  Scheduling hazardous instructions away from each other
  Stalling one of the instructions
  Duplicating hardware units

Data Hazards
[Figure: pipelined datapath (PC, instruction memory, register file, sign extend, ALU, data memory) executing the two instructions below]

    R1 ← R2 + R3
    R4 ← R1 + R5

Data Hazards
R1 is updated in the WB stage.
[Figure: pipelined datapath with R1 ← R2 + R3 ahead of R4 ← R1 + R5]
How to overcome this hazard?

Stalling (Interlocking)
[Figure: pipelined datapath; a stall condition selects a NOP into the pipeline while R4 ← R1 + R5 waits for R1 ← R2 + R3]

Stalled Stages and Pipeline Bubbles

    R1 ← R2 + R3
    R4 ← R1 + R5

Resource usage over time (clock cycles), one column per cycle:

    IF  I1  I2  I3  I3  I3  I3  I4  I5
    ID      I1  I2  I2  I2  I2  I3  I4  I5
    EX          I1  nop nop nop I2  I3  I4  I5
    MA              I1  nop nop nop I2  I3  I4  I5
    WB                  I1  nop nop nop I2  I3  I4  I5

Stalled stages: I2 is held in ID and I3 in IF. The nops are pipeline bubbles.
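The stage-occupancy pattern on this slide can be regenerated with a small sketch. It assumes a five-stage IF/ID/EX/MA/WB pipeline in which an instruction may be held in ID for a given number of extra cycles; `stage_occupancy` and its arguments are hypothetical names, not part of the lecture.

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]

def stage_occupancy(n_instr, id_stalls):
    """id_stalls maps a 1-based instruction number to extra cycles held in ID.
    Returns {stage: list of occupants per cycle}; cycle 0 is the first fetch."""
    if_start, id_start, ex_start = {}, {}, {}
    for i in range(1, n_instr + 1):
        if i == 1:
            if_start[i], id_start[i] = 0, 1
        else:
            if_start[i] = id_start[i - 1]   # IF frees when predecessor enters ID
            id_start[i] = ex_start[i - 1]   # ID frees when predecessor enters EX
        ex_start[i] = id_start[i] + 1 + id_stalls.get(i, 0)

    total = ex_start[n_instr] + 3           # last WB happens at ex_start + 2
    rows = {s: [""] * total for s in STAGES}
    for i in range(1, n_instr + 1):
        name = f"I{i}"
        for c in range(if_start[i], id_start[i]):
            rows["IF"][c] = name
        for c in range(id_start[i], ex_start[i]):
            rows["ID"][c] = name
        for k, s in enumerate(["EX", "MA", "WB"]):
            rows[s][ex_start[i] + k] = name

    # Empty slots between the first and last instruction in EX/MA/WB are bubbles.
    for k, s in enumerate(["EX", "MA", "WB"]):
        for c in range(ex_start[1] + k, ex_start[n_instr] + k + 1):
            if rows[s][c] == "":
                rows[s][c] = "nop"
    return rows

rows = stage_occupancy(5, {2: 3})   # I2 waits three extra cycles in ID
for stage in STAGES:
    print(stage, " ".join(f"{x:3}" for x in rows[stage]))
```

Holding I2 for three cycles reproduces the three bubbles that follow I1 through EX, MA and WB.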

Stall Control Logic
[Figure: pipelined datapath with a stall control block C_stall that takes rs1 and rs2 from the decode stage and rd from the later stages, and selects a NOP when it asserts stall]
Compare the source registers of the instruction in the decode stage with the destination registers of the uncommitted instructions.

Stall Control Logic
[Figure: the same datapath, with C_stall comparing rs and rt of the decode-stage instruction against rd of the instructions further down the pipeline]

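Stall Condition
C_stall (+ denotes logical OR; the subscripts D, E, M, W name the pipeline stage holding the instruction):
$$\text{stall} = \big((rs_D = rd_E) + (rs_D = rd_M) + (rs_D = rd_W)\big) + \big((rt_D = rd_E) + (rt_D = rd_M) + (rt_D = rd_W)\big)$$
Should the pipeline stall for all instructions?
Are rs, rt and rd valid for all instructions?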

MIPS I Sources & Destinations

    R-type:  op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
    I-type:  op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
    J-type:  op (6 bits) | offset added to PC (26 bits)

    Instr   Operation                                      Source(s)   Destination
    ALU     rd ← (rs) func (rt)                            rs, rt      rd
    ALUI    rt ← (rs) func Immediate                       rs          rt
    LW      rt ← Mem[(rs) + Immediate]                     rs          rt
    SW      Mem[(rs) + Immediate] ← rt                     rs, rt
    BZ      Cond (rs)? PC = PC + Offset : PC = PC + 4      rs
    J       PC = PC + Offset
    JAL     R31 ← PC; PC = PC + Offset                                 R31
    JR      PC ← (rs)                                      rs
    JALR    R31 ← PC; PC ← (rs)                            rs          R31

Stall Control Logic
[Figure: pipelined datapath with the stall control logic drawn in. The blocks C_dest (producing ws and we), C_re (producing re1 and re2) and C_stall (producing stall) sit alongside the PC, instruction memory, GPR file, ALU and data memory.]

Deriving the Stall Signal

    C_dest
      ws = case opcode
             ALU          ⇒ rd
             ALUi, LW     ⇒ rt
             JAL, JALR    ⇒ R31
      we = case opcode
             ALU, ALUi, LW  ⇒ (ws ≠ 0)
             JAL, JALR      ⇒ on
             ...            ⇒ off

    C_re
      re1 = case opcode
              ALU, ALUi, LW, SW, BZ, JR, JALR  ⇒ on
              J, JAL                           ⇒ off
      re2 = case opcode
              ALU, SW  ⇒ on
              ...      ⇒ off

C_stall (· denotes AND, + denotes OR):
$$\text{stall} = \big((rs_D = ws_E) \cdot we_E + (rs_D = ws_M) \cdot we_M + (rs_D = ws_W) \cdot we_W\big) \cdot re1_D + \big((rt_D = ws_E) \cdot we_E + (rt_D = ws_M) \cdot we_M + (rt_D = ws_W) \cdot we_W\big) \cdot re2_D$$
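The derivation on this slide translates fairly directly into code. The following is a minimal sketch, assuming instructions are tagged with the opcode classes used on the slide (ALU, ALUi, LW, SW, BZ, J, JAL, JR, JALR); the `Instr` record and the function names are hypothetical, chosen to mirror the signal names ws, we, re1 and re2.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    opcode: str   # one of the opcode classes from the slide
    rs: int = 0
    rt: int = 0
    rd: int = 0

def ws(i):
    """Write-select: which register the instruction writes (C_dest)."""
    if i.opcode == "ALU":
        return i.rd
    if i.opcode in ("ALUi", "LW"):
        return i.rt
    if i.opcode in ("JAL", "JALR"):
        return 31
    return 0

def we(i):
    """Write-enable: does the instruction write a register at all (C_dest)?"""
    if i.opcode in ("ALU", "ALUi", "LW"):
        return ws(i) != 0          # writes to R0 are ignored
    return i.opcode in ("JAL", "JALR")

def re1(i):
    """Read-enable for rs (C_re)."""
    return i.opcode in ("ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR")

def re2(i):
    """Read-enable for rt (C_re)."""
    return i.opcode in ("ALU", "SW")

def stall(decode, later_stages):
    """later_stages: instructions in E, M, W (None for an empty stage)."""
    writers = [s for s in later_stages if s is not None and we(s)]
    rs_hit = any(decode.rs == ws(s) for s in writers)
    rt_hit = any(decode.rt == ws(s) for s in writers)
    return (rs_hit and re1(decode)) or (rt_hit and re2(decode))

producer = Instr("ALU", rs=2, rt=3, rd=1)    # R1 <- R2 + R3
consumer = Instr("ALU", rs=1, rt=5, rd=4)    # R4 <- R1 + R5
print(stall(consumer, [producer, None, None]))
```

Qualifying each comparison with we and re1/re2 is what keeps the pipeline from stalling on register fields that an instruction does not actually read or write.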

The Stall Control Signal
$$\text{stall} = \big((rs_D = ws_E) \cdot we_E + (rs_D = ws_M) \cdot we_M + (rs_D = ws_W) \cdot we_W\big) \cdot re1_D + \big((rt_D = ws_E) \cdot we_E + (rt_D = ws_M) \cdot we_M + (rt_D = ws_W) \cdot we_W\big) \cdot re2_D$$
Is that all?
Results of all instructions ready by E stage?

    M[(R1) + 7] ← R2
    R4 ← M[(R3) + 13]

Is there a possible data hazard here? What if the addresses are equal, (R1) + 7 == (R3) + 13?
Careful design of the memory system required.

Resolving Data Hazards
  Scheduling hazardous instructions away from each other
  Stalling one of the instructions
  Data Forwarding (Bypassing)

Forwarding

    DADD  R1, R2, R3
    DSUB  R4, R1, R5
    AND   R6, R1, R7
    OR    R8, R1, R9
    XOR   R10, R1, R11

[Figure: pipeline timing diagram over time (clock cycles); each instruction passes through IM, REG, ALU, DM, REG]

Forwarding
Before Bypassing
  [Figure: timing diagram over time (clock cycles); R4 ← R1 + R5 sits in stalled stages until R1 ← R2 + R3 has written back]
  CPI > 1
After Bypassing
  [Figure: timing diagram over time (clock cycles); R4 ← R1 + R5 follows R1 ← R2 + R3 without stalled stages]
  CPI = 1

The Pipeline without Bypassing
[Figure: pipelined datapath with the stall control logic (C_dest, C_re, C_stall) and no bypass paths]

The Pipeline with Bypassing
[Figure: the same pipelined datapath with bypass paths added]

Cost of Forwarding
  In longer pipelines?
  In multiple issue pipelines?

No Stalls in the Pipeline?
What about this instruction sequence?

    LD   R1, 4(R2)
    ADD  R3, R1, R4

When, at the latest, is the value of R1 needed by ADD?
When, at the earliest, does the value of R1 enter the pipeline?

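Stall Logic
Without bypassing:
$$\text{stall} = \big((rs_D = ws_E) \cdot we_E + (rs_D = ws_M) \cdot we_M + (rs_D = ws_W) \cdot we_W\big) \cdot re1_D + \big((rt_D = ws_E) \cdot we_E + (rt_D = ws_M) \cdot we_M + (rt_D = ws_W) \cdot we_W\big) \cdot re2_D$$
With bypassing, only a load in the E stage still forces a stall:
$$\text{stall} = \big((rs_D = ws_E) \cdot (\text{opcode}_E = LW_E) \cdot (ws_E \neq 0)\big) \cdot re1_D + \big((rt_D = ws_E) \cdot (\text{opcode}_E = LW_E) \cdot (ws_E \neq 0)\big) \cdot re2_D$$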
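The reduced stall condition for a bypassed pipeline can be sketched as below. The sketch assumes the opcode classes and register roles used earlier in the deck (a load writes rt; re1 and re2 say whether rs and rt are really read); `Instr`, `RE1`, `RE2` and `load_use_stall` are hypothetical names.

```python
from typing import NamedTuple, Optional

class Instr(NamedTuple):
    opcode: str   # "ALU", "ALUi", "LW", "SW", ...
    rs: int = 0
    rt: int = 0
    rd: int = 0

RE1 = {"ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR"}   # opcodes that read rs
RE2 = {"ALU", "SW"}                                     # opcodes that read rt

def load_use_stall(decode: Instr, execute: Optional[Instr]) -> bool:
    """Stall only when the E-stage instruction is a load whose result
    the decode-stage instruction is about to read."""
    if execute is None or execute.opcode != "LW":
        return False
    ws_e = execute.rt              # a load writes rt
    if ws_e == 0:                  # writes to R0 are ignored
        return False
    rs_hit = decode.rs == ws_e and decode.opcode in RE1
    rt_hit = decode.rt == ws_e and decode.opcode in RE2
    return rs_hit or rt_hit

ld = Instr("LW", rs=2, rt=1)             # LD  R1, 4(R2)
add = Instr("ALU", rs=1, rt=4, rd=3)     # ADD R3, R1, R4
print(load_use_stall(add, ld))
```

An ALU producer in E no longer appears in the condition because its result can be bypassed to the next instruction; a load's data is not available until after the memory access.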

Pipeline Scheduling
Reorder the instructions of the program so that dependent instructions are far enough apart.
  Done by the compiler, before the program runs: Static Instruction Scheduling
  Done by the hardware, when the program is running: Dynamic Instruction Scheduling

Pipeline Scheduling

Original Program

    LW    R3, 0(R1)
    (stall)
    ADDI  R5, R3, 1
    ADD   R2, R2, R3
    LW    R13, 0(R11)
    (stall)
    ADD   R12, R13, R3

Total Execution Cycles: 7

Scheduled Code

    LW    R3, 0(R1)
    LW    R13, 0(R11)
    ADDI  R5, R3, 1
    ADD   R2, R2, R3
    ADD   R12, R13, R3

Total Execution Cycles: 5
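The two cycle totals on this slide can be reproduced with a rough count. The sketch assumes one issue cycle per instruction plus one stall whenever an instruction reads the register loaded by the instruction immediately before it; that counting rule is an assumption chosen to match the slide, and `Instr` and `execution_cycles` are hypothetical names.

```python
from typing import NamedTuple, Tuple

class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]

def execution_cycles(program):
    """One cycle per instruction, plus one stall per back-to-back load-use pair."""
    stalls = sum(
        1
        for prev, cur in zip(program, program[1:])
        if prev.op == "LW" and prev.dest in cur.srcs
    )
    return len(program) + stalls

original = [
    Instr("LW", "R3", ("R1",)),            # LW   R3, 0(R1)
    Instr("ADDI", "R5", ("R3",)),          # ADDI R5, R3, 1
    Instr("ADD", "R2", ("R2", "R3")),      # ADD  R2, R2, R3
    Instr("LW", "R13", ("R11",)),          # LW   R13, 0(R11)
    Instr("ADD", "R12", ("R13", "R3")),    # ADD  R12, R13, R3
]
# Same instructions, with the second load hoisted above the first use of R3.
scheduled = [original[0], original[3], original[1], original[2], original[4]]

print(execution_cycles(original), execution_cycles(scheduled))
```

Moving the second load up fills the slot after the first load and also separates that second load from its own consumer, so both stalls disappear.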

Loop-level Parallelism

Original Loop:

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDI  R1, R1, #-8
          BNE    R1, R2, Loop

Unrolled Loop:

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          L.D    F6, -8(R1)
          ADD.D  F8, F2, F6
          S.D    F8, -8(R1)
          L.D    F10, -16(R1)
          ADD.D  F12, F2, F10
          S.D    F12, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F16, F2, F14
          S.D    F16, -24(R1)
          DADDI  R1, R1, #-32
          BNE    R1, R2, Loop

Loop Unrolling

    Instr producing result   Instr using result   Latency to avoid a stall
    FP ALU op                Another FP ALU op    3
    FP ALU op                Store Double         2
    Load Double              FP ALU op            1
    Load Double              Store Double         0

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          L.D    F6, -8(R1)
          ADD.D  F8, F2, F6
          S.D    F8, -8(R1)
          L.D    F10, -16(R1)
          ADD.D  F12, F2, F10
          S.D    F12, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F16, F2, F14
          S.D    F16, -24(R1)
          DADDI  R1, R1, #-32
          BNE    R1, R2, Loop

Total Cycles: 27 cycles

Loop Unrolling

    Instr producing result   Instr using result   Latency to avoid a stall
    FP ALU op                Another FP ALU op    3
    FP ALU op                Store Double         2
    Load Double              FP ALU op            1
    Load Double              Store Double         0

    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F2, F6
          ADD.D  F12, F2, F10
          ADD.D  F16, F2, F14
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          DADDI  R1, R1, #-32
          S.D    F12, 16(R1)
          S.D    F16, 8(R1)
          BNE    R1, R2, Loop

Total Cycles: 14 cycles

Costs of unrolling:
  Code size
  Register pressure
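The 27-cycle and 14-cycle totals can be reproduced with a small in-order issue model. This is a sketch under stated assumptions: one instruction issues per cycle, a consumer waits for the latency in the table above, and an integer op feeding a branch costs one extra cycle. That last entry is not in the slide's table; it is an assumption added here because the 27-cycle total needs it. `instr`, `LATENCY` and `total_cycles` are hypothetical names.

```python
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "branch"): 1,    # assumption, not listed on the slide
}

def instr(kind, dest, *srcs):
    return (kind, dest, srcs)

def total_cycles(program):
    """Issue cycle of the last instruction under in-order, single issue."""
    issue = []          # issue cycle of each instruction, starting at 1
    last_writer = {}    # register -> index of its most recent producer
    for idx, (kind, dest, srcs) in enumerate(program):
        cycle = issue[-1] + 1 if issue else 1
        for reg in srcs:
            if reg in last_writer:
                p = last_writer[reg]
                gap = LATENCY.get((program[p][0], kind), 0)
                cycle = max(cycle, issue[p] + 1 + gap)
        issue.append(cycle)
        if dest is not None:
            last_writer[dest] = idx
    return issue[-1] if issue else 0

pairs = [("F0", "F4"), ("F6", "F8"), ("F10", "F12"), ("F14", "F16")]
tail = [instr("int", "R1", "R1"), instr("branch", None, "R1", "R2")]

unrolled = []
for loaded, result in pairs:
    unrolled += [
        instr("load", loaded, "R1"),
        instr("fp_alu", result, loaded, "F2"),
        instr("store", None, result, "R1"),
    ]
unrolled += tail

scheduled = (
    [instr("load", loaded, "R1") for loaded, _ in pairs]
    + [instr("fp_alu", result, loaded, "F2") for loaded, result in pairs]
    + [instr("store", None, "F4", "R1"), instr("store", None, "F8", "R1")]
    + [tail[0]]
    + [instr("store", None, "F12", "R1"), instr("store", None, "F16", "R1")]
    + [tail[1]]
)

print(total_cycles(unrolled), total_cycles(scheduled))   # 27 14
```

Both versions execute the same 14 instructions; the scheduled order simply places enough independent work between each producer and its consumer that no stall cycle remains.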