Instruction Level Parallelism. Data Dependence Static Scheduling

Baldric Holmes
5 years ago
1 Instruction Level Parallelism Data Dependence Static Scheduling

2 Basic Block
A straight-line code sequence with no branches in except at the entry and no branches out except at the exit.
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDI  R1, R1, #-8
          BNE    R1, R2, Loop

3 Dependence
    for (i=0; i<=999; i=i+1)
        x[i] = x[i] + a;
Data dependence
Name dependence
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDI  R1, R1, #-8
          BNE    R1, R2, Loop
Name dependence: antidependence, output dependence. Removed by register renaming.
    ADD.D  F4, F0, F2
    ADD.D  F4, F6, F8
Hazard: overlap during execution could change the order of access to the operand involved in the dependence.

4 Hazards
Program order: ILP preserves program order only where it affects the outcome of the program.
Structural hazards: resource conflicts.
Data hazards: RAW, WAW, WAR.
Control hazards: whether or not an instruction should be executed depends on a control decision made by an earlier instruction.
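The three data hazard classes can be told apart by comparing the registers two instructions read and write. The sketch below is a minimal illustration, not part of the original slides; the `(dest, srcs)` tuple encoding and the function name `classify` are assumptions made for the example.

```python
def classify(first, second):
    """Return the data hazard classes between two instructions in program order.

    Each instruction is a hypothetical (dest, srcs) pair: dest is the register
    written (or None), srcs is a tuple of the registers read.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    kinds = set()
    if dest1 is not None and dest1 in srcs2:
        kinds.add("RAW")  # second reads what first writes (true data dependence)
    if dest2 is not None and dest2 in srcs1:
        kinds.add("WAR")  # second writes what first reads (antidependence)
    if dest1 is not None and dest1 == dest2:
        kinds.add("WAW")  # both write the same register (output dependence)
    return kinds


load = ("F0", ("R1",))          # L.D   F0, 0(R1)
add = ("F4", ("F0", "F2"))      # ADD.D F4, F0, F2
add2 = ("F4", ("F6", "F8"))     # ADD.D F4, F6, F8
store = (None, ("F4", "R1"))    # S.D   F4, 0(R1)
daddi = ("R1", ("R1",))         # DADDI R1, R1, #-8

print(classify(load, add))      # {'RAW'}
print(classify(add, add2))      # {'WAW'}
print(classify(store, daddi))   # {'WAR'}
```

Only register operands are compared here; dependences through memory are not modelled.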

5 Structural Hazard
[Figure: pipeline timing diagram for instructions i1 to i5, with the cycle in which the structural hazard occurs marked "HAZARD!!!"]

6 Cost of a Load Structural Hazard
Data references constitute 40% of the instruction mix. Ideal CPI = 1 (with no structural hazards). Assume that the processor with the structural hazard has a clock rate that is 1.1 times higher than the clock rate of the processor without the hazard. Which processor is faster, and by how much?
    Avg. Instruction Time       = CPI × Clock cycle time
    Avg. Instruction Time_ideal = CPI × Clock cycle time_ideal

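7 Cost of a Load Structural Hazard
    Avg. Instruction Time = CPI × Clock cycle time
    Avg. Instruction Time = (1 + 0.4 × 1) × (Clock cycle time_ideal / 1.1)
    Avg. Instruction Time = 1.27 × Clock cycle time_ideal
Each data reference costs one stall cycle, so the processor without the structural hazard is about 1.27 times faster despite its slower clock.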
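The arithmetic of this example can be checked with a few lines of Python. This is a sketch under the slide's stated numbers (40% data references, clock rate 1.1 times higher) plus the assumption that each data reference costs exactly one stall cycle; the variable names are illustrative.

```python
def avg_instruction_time(cpi, clock_cycle_time):
    """Average instruction time = CPI x clock cycle time."""
    return cpi * clock_cycle_time


ideal_cycle = 1.0                      # clock cycle time of the hazard-free processor
ideal_time = avg_instruction_time(1.0, ideal_cycle)

data_ref_fraction = 0.4                # 40% of instructions are data references
stall_per_data_ref = 1                 # ASSUMPTION: one stall cycle per data reference
hazard_cpi = 1.0 + data_ref_fraction * stall_per_data_ref
hazard_cycle = ideal_cycle / 1.1       # clock rate is 1.1 times higher
hazard_time = avg_instruction_time(hazard_cpi, hazard_cycle)

slowdown = hazard_time / ideal_time
print(f"{slowdown:.2f}")               # 1.27
```

The faster clock does not compensate for the extra 0.4 cycles per instruction, so the hazard-free design wins.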

8 Resolving Structural Hazards
Scheduling hazardous instructions away from each other
Stalling one of the instructions
Duplicating hardware units

9 Data Hazards
[Figure: pipelined datapath]
    R1 ← R2 + R3
    R4 ← R1 + R5

10 Data Hazards
[Figure: pipelined datapath]
    R1 ← R2 + R3
    R4 ← R1 + R5
R1 is updated in the WB stage. How to overcome this hazard?

11 Stalling (Interlocking)
[Figure: pipelined datapath in which a NOP is inserted when the stall condition holds]
    R1 ← R2 + R3
    R4 ← R1 + R5

12 Stalled Stages and Pipeline Bubbles
[Figure: timing diagram (time in clock cycles) for R1 ← R2 + R3 followed by R4 ← R1 + R5, with the stalled stages marked]
    Time  t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11
    IF    I1  I2  I3  I3  I3  I3  I4  I5
    ID        I1  I2  I2  I2  I2  I3  I4  I5
    EX            I1  nop nop nop I2  I3  I4  I5
    MA                I1  nop nop nop I2  I3  I4  I5
    WB                    I1  nop nop nop I2  I3  I4  I5

13 Stall Control Logic
[Figure: pipelined datapath with a stall control block C_stall that takes rs1, rs2 and rd as inputs and inserts a NOP]
Compare the source registers of the instruction in the decode stage with the destination registers of the uncommitted instructions.

14 Stall Control Logic
[Figure: pipelined datapath with a stall control block C_stall that takes rs, rt and rd as inputs and inserts a NOP]

15 Stall Condition
C_stall:
    stall = ( (rs_D = rd_E) + (rs_D = rd_M) + (rs_D = rd_W) )
          + ( (rt_D = rd_E) + (rt_D = rd_M) + (rt_D = rd_W) )
(Subscripts D, E, M and W name the pipeline stage holding the instruction; + is logical OR.)
Should the pipeline stall for all instructions? Are rs, rt and rd valid for all instructions?

16 MIPS I Sources & Destinations
    R-type:  op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
    I-type:  op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
    J-type:  op (6 bits) | offset added to PC (26 bits)

    Instruction  Operation                                   Source(s)  Destination
    ALU          rd ← (rs) func (rt)                         rs, rt     rd
    ALUi         rt ← (rs) func Immediate                    rs         rt
    LW           rt ← Mem[(rs) + Immediate]                  rs         rt
    SW           Mem[(rs) + Immediate] ← rt                  rs, rt     -
    BZ           Cond (rs)? PC = PC + Offset : PC = PC + 4   rs         -
    J            PC = PC + Offset                            -          -
    JAL          R31 ← PC; PC = PC + Offset                  -          R31
    JR           PC ← (rs)                                   rs         -
    JALR         R31 ← PC; PC ← (rs)                         rs         R31

17 Stall Control Logic
[Figure: pipelined datapath with control blocks C_dest, C_re and C_stall; signals shown include ws, we, re1, re2 and stall]

18 Deriving the Stall Signal
C_dest:
    ws = case opcode
           ALU            ⇒ rd
           ALUi, LW       ⇒ rt
           JAL, JALR      ⇒ R31
    we = case opcode
           ALU, ALUi, LW  ⇒ (ws ≠ 0)
           JAL, JALR      ⇒ on
           ...            ⇒ off
C_re:
    re1 = case opcode
           ALU, ALUi, LW, SW, BZ, JR, JALR  ⇒ on
           J, JAL                           ⇒ off
    re2 = case opcode
           ALU, SW        ⇒ on
           ...            ⇒ off
C_stall:
    stall = ( (rs_D = ws_E).we_E + (rs_D = ws_M).we_M + (rs_D = ws_W).we_W ).re1_D
          + ( (rt_D = ws_E).we_E + (rt_D = ws_M).we_M + (rt_D = ws_W).we_W ).re2_D
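The case tables and the stall equation on this slide translate almost directly into code. The following is a minimal Python sketch, assuming a hypothetical `Instr` record with the opcode classes used on the slides (ALU, ALUi, LW, SW, BZ, J, JAL, JR, JALR); it models only the control decision, not the datapath.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical decoded instruction; unused register fields default to 0."""
    opcode: str
    rs: int = 0
    rt: int = 0
    rd: int = 0


def dest(instr):
    """C_dest: return (ws, we), the write specifier and write enable."""
    if instr.opcode == "ALU":
        return instr.rd, instr.rd != 0      # writes to R0 are ignored
    if instr.opcode in ("ALUi", "LW"):
        return instr.rt, instr.rt != 0
    if instr.opcode in ("JAL", "JALR"):
        return 31, True
    return 0, False


def read_enables(instr):
    """C_re: return (re1, re2), whether rs and rt are really read."""
    re1 = instr.opcode in {"ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR"}
    re2 = instr.opcode in {"ALU", "SW"}
    return re1, re2


def stall(decode, in_flight):
    """C_stall: decode must wait if it reads a register that an older
    instruction in the E, M or W stage has yet to write (None = bubble)."""
    re1, re2 = read_enables(decode)
    for older in in_flight:
        if older is None:
            continue
        ws, we = dest(older)
        if we and ((re1 and decode.rs == ws) or (re2 and decode.rt == ws)):
            return True
    return False


producer = Instr("ALU", rs=2, rt=3, rd=1)   # R1 <- R2 + R3
consumer = Instr("ALU", rs=1, rt=5, rd=4)   # R4 <- R1 + R5
print(stall(consumer, [producer, None, None]))  # True
print(stall(consumer, [None, None, None]))      # False
```

Gating each comparison with `we` and `re1`/`re2` is what avoids needless stalls: a jump whose rs field happens to match a pending destination, or a write to R0, no longer triggers the interlock.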

19 The Stall Control Signal
    stall = ( (rs_D = ws_E).we_E + (rs_D = ws_M).we_M + (rs_D = ws_W).we_W ).re1_D
          + ( (rt_D = ws_E).we_E + (rt_D = ws_M).we_M + (rt_D = ws_W).we_W ).re2_D
Is that all? Are the results of all instructions ready by the E stage?
    M[(R1) + 7] ← (R2)
    R4 ← M[(R3) + 13]
Is there a possible data hazard here? What if the addresses (R1 + 7) == (R3 + 13)? Careful design of the memory system is required.

20 Resolving Data Hazards
Scheduling hazardous instructions away from each other
Stalling one of the instructions
Data forwarding (bypassing)

21 Forwarding
    DADD  R1, R2, R3
    DSUB  R4, R1, R5
    AND   R6, R1, R7
    OR    R8, R1, R9
    XOR   R10, R1, R11
[Figure: pipeline timing diagram (time in clock cycles) for DADD, DSUB and AND]

22 Forwarding
Before bypassing: [Figure: timing diagram (time in clock cycles) for R1 ← R2 + R3 followed by R4 ← R1 + R5, with stalled stages] CPI > 1
After bypassing: [Figure: the same sequence with no stalled stages] CPI = 1

23 The Pipeline without Bypassing
[Figure: pipelined datapath with stall control logic (C_stall, C_re) and no bypass paths]

24 The Pipeline with Bypassing
[Figure: pipelined datapath with stall control logic (C_stall, C_re) and bypass paths]

27 Cost of Forwarding In longer pipelines? In multiple issue pipelines?

28 No Stalls in the Pipeline?
What about this instruction sequence?
    LD   R1, 4(R2)
    ADD  R3, R1, R4
When, at the latest, is the value of R1 needed by ADD? When, at the earliest, does the value of R1 enter the pipeline?

29 Stall Logic
Stall signal before bypassing:
    stall = ( (rs_D = ws_E).we_E + (rs_D = ws_M).we_M + (rs_D = ws_W).we_W ).re1_D
          + ( (rt_D = ws_E).we_E + (rt_D = ws_M).we_M + (rt_D = ws_W).we_W ).re2_D
Stall signal with bypassing (only a load in the E stage still forces a stall):
    stall = ( (rs_D = ws_E).(opcode_E = LW_E).(ws_E ≠ 0) ).re1_D
          + ( (rt_D = ws_E).(opcode_E = LW_E).(ws_E ≠ 0) ).re2_D
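The reduced stall condition can be sketched as a single predicate. This is an illustrative Python version, assuming the opcode classes from the earlier slides and plain integers for register numbers; the function name `load_use_stall` is made up for the example.

```python
READS_RS = {"ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR"}  # re1 is on
READS_RT = {"ALU", "SW"}                                    # re2 is on


def load_use_stall(opcode_d, rs_d, rt_d, opcode_e, ws_e):
    """With full bypassing, stall only when the instruction in the E stage
    is a load whose destination is read by the instruction in decode."""
    load_pending = opcode_e == "LW" and ws_e != 0
    re1 = opcode_d in READS_RS
    re2 = opcode_d in READS_RT
    return load_pending and ((re1 and rs_d == ws_e) or (re2 and rt_d == ws_e))


# LD R1, 4(R2) in E; ADD R3, R1, R4 in decode: the loaded value is not ready yet.
print(load_use_stall("ALU", 1, 4, "LW", 1))    # True
# ALU producer in E: its result can be bypassed, so no stall.
print(load_use_stall("ALU", 1, 4, "ALU", 1))   # False
```

Compared with the pre-bypassing signal, the M and W stage comparisons disappear entirely because their results can always be forwarded in time.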

30 Pipeline Scheduling
Reorder the instructions of the program so that dependent instructions are far enough apart.
Done by the compiler, before the program runs: Static Instruction Scheduling
Done by the hardware, when the program is running: Dynamic Instruction Scheduling

31 Pipeline Scheduling
Original program:
    LW    R3, 0(R1)
    (stall)
    ADDI  R5, R3, 1
    ADD   R2, R2, R3
    LW    R13, 0(R11)
    (stall)
    ADD   R12, R13, R3
Total execution cycles: 7
Scheduled code:
    LW    R3, 0(R1)
    LW    R13, 0(R11)
    ADDI  R5, R3, 1
    ADD   R2, R2, R3
    ADD   R12, R13, R3
Total execution cycles: 5
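The two cycle counts on this slide can be reproduced with a very small model: one cycle per instruction, plus one stall whenever an instruction uses the result of the load immediately before it. The sketch below assumes exactly that model and a made-up `(opcode, dest, srcs)` tuple encoding; it is not a general pipeline simulator.

```python
def count_cycles(program):
    """Issue cycles for an in-order pipeline with a one-cycle load-use stall."""
    cycles = 0
    previous = None
    for opcode, dest, srcs in program:
        # Stall one cycle if the previous instruction is a load we depend on.
        if previous is not None and previous[0] == "LW" and previous[1] in srcs:
            cycles += 1
        cycles += 1
        previous = (opcode, dest, srcs)
    return cycles


original = [
    ("LW", "R3", ("R1",)),
    ("ADDI", "R5", ("R3",)),
    ("ADD", "R2", ("R2", "R3")),
    ("LW", "R13", ("R11",)),
    ("ADD", "R12", ("R13", "R3")),
]
scheduled = [original[0], original[3], original[1], original[2], original[4]]

print(count_cycles(original), count_cycles(scheduled))  # 7 5
```

Moving the second load up fills both load delay slots with useful work, which is the whole effect of the static schedule here.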

32 Loop-level Parallelism
Original loop:
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDI  R1, R1, #-8
          BNE    R1, R2, Loop
Unrolled loop:
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          L.D    F6, -8(R1)
          ADD.D  F8, F2, F6
          S.D    F8, -8(R1)
          L.D    F10, -16(R1)
          ADD.D  F12, F2, F10
          S.D    F12, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F16, F2, F14
          S.D    F16, -24(R1)
          DADDI  R1, R1, #-32
          BNE    R1, R2, Loop

33 Loop Unrolling
    Instr producing result   Instr using result   Latency to avoid a stall
    FP ALU op                Another FP ALU op    3
    FP ALU op                Store double         2
    Load double              FP ALU op            1
    Load double              Store double         0

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          L.D    F6, -8(R1)
          ADD.D  F8, F2, F6
          S.D    F8, -8(R1)
          L.D    F10, -16(R1)
          ADD.D  F12, F2, F10
          S.D    F12, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F16, F2, F14
          S.D    F16, -24(R1)
          DADDI  R1, R1, #-32
          BNE    R1, R2, Loop
Total cycles: 27

34 Loop Unrolling
    Instr producing result   Instr using result   Latency to avoid a stall
    FP ALU op                Another FP ALU op    3
    FP ALU op                Store double         2
    Load double              FP ALU op            1
    Load double              Store double         0

    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F2, F6
          ADD.D  F12, F2, F10
          ADD.D  F16, F2, F14
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          DADDI  R1, R1, #-32
          S.D    F12, 16(R1)
          S.D    F16, 8(R1)
          BNE    R1, R2, Loop
Total cycles: 14
Costs: code size, register pressure
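Both cycle totals on the loop unrolling slides (27 unscheduled, 14 scheduled) can be recomputed from the latency table with a simple in-order, single-issue model. The Python sketch below is an illustration only. It assumes one extra rule that is not in the slide's table, namely a one-cycle gap between an integer ALU op and a dependent branch, and it ignores memory offsets and the next loop iteration; the tuple encoding and helper names are invented for the example.

```python
# Extra cycles that must separate a producer from a dependent consumer.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int_alu", "branch"): 1,  # ASSUMPTION: not listed in the slide's table
}


def issue_cycles(program):
    """Cycle in which the last instruction issues (in-order, single issue)."""
    produced = {}  # register -> (issue cycle of its producer, producer kind)
    cycle = 0
    for kind, dest, srcs in program:
        earliest = cycle + 1
        for reg in srcs:
            if reg in produced:
                issued, producer_kind = produced[reg]
                gap = LATENCY.get((producer_kind, kind), 0)
                earliest = max(earliest, issued + 1 + gap)
        cycle = earliest
        if dest is not None:
            produced[dest] = (cycle, kind)
    return cycle


def ld(fd):
    return ("load", fd, ("R1",))


def add(fd, fa, fb):
    return ("fp_alu", fd, (fa, fb))


def sd(fs):
    return ("store", None, (fs, "R1"))


DADDI = ("int_alu", "R1", ("R1",))
BNE = ("branch", None, ("R1", "R2"))

unrolled = [
    ld("F0"), add("F4", "F0", "F2"), sd("F4"),
    ld("F6"), add("F8", "F2", "F6"), sd("F8"),
    ld("F10"), add("F12", "F2", "F10"), sd("F12"),
    ld("F14"), add("F16", "F2", "F14"), sd("F16"),
    DADDI, BNE,
]
scheduled = [
    ld("F0"), ld("F6"), ld("F10"), ld("F14"),
    add("F4", "F0", "F2"), add("F8", "F2", "F6"),
    add("F12", "F2", "F10"), add("F16", "F2", "F14"),
    sd("F4"), sd("F8"), DADDI, sd("F12"), sd("F16"), BNE,
]

print(issue_cycles(unrolled), issue_cycles(scheduled))  # 27 14
```

In the scheduled version every dependent pair is already separated by at least its required latency, so the 14 instructions issue in 14 consecutive cycles.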

Asanovic/Devadas Spring Pipeline Hazards. Krste Asanovic Laboratory for Computer Science M.I.T.
