Parallel architectures Electronic Computers LM

Dale Rice
6 years ago


2 Architecture
Architecture: the functional behaviour of a computer, for instance a processor that executes DLX code.
Implementation: a logical network that implements the architecture, also called the microarchitecture. The same architecture can have many implementations (example: the x86 family).
Synthesis: a physical realisation of an implementation. The same implementation can have many syntheses (for instance in different technologies).
The architecture is defined by the machine language, that is the instruction set (assembly language): Instruction Set Architecture -> ISA.
The ISA changes slowly while implementations change rapidly (see for instance IA8, IA16, IA32). The longer an ISA survives, the more programs are written for it, so compatibility becomes the main issue.

3 Sequential: a single instruction executed at a time
Instruction level parallelism
Pipelined: multiple instructions executed simultaneously
Superpipelined: multiple stages for each operation (EX, MEM etc.) in order to increase the clock frequency (e.g. Pentium IV)
Scalar: a single pipeline
Superscalar: multiple pipelines; many instructions started at the same time. Possible out-of-order execution (run-time decision)
Very Long Instruction Word: multiple pipelines; many instructions started at the same time. Instruction order decided at compile time
Superscalar superpipelined (e.g. Pentium IV, i5, i7 etc.)

4 architectures
Multicore (core level parallelism): many processors in the same chip (e.g. Core Duo, Nehalem, Sandy Bridge...)
Multithread (thread level parallelism): the pipelines of the same processor are used by different processes at the same time (time sharing), as if it were a multicore (e.g. Pentium IV, Nehalem, Sandy Bridge etc.)
Memory level parallelism: a memory able to provide multiple data at different addresses at the same time (outstanding requests: DDR2, DDR3 etc.)

5 Deep Pipeline (Superpipeline)
[Figure: a five-stage pipeline (Fetch, Decode, Execute, Memory, Writeback) compared with a deeper one, with the branch penalty marked on each]
Each stage is subdivided into three substages.
Higher clock frequency, but higher branch penalty and higher power consumption.

6 Parallel pipelines
Sequential
Time parallelism: pipeline
Space parallelism: VLIW
Space-time parallelism: (e.g. i5, i7)

7 Diversified pipelines - 1
Dedicated pipelines. The instruction sequence is defined at compile time. Careful compilation is fundamental in order to avoid underexploiting the pipelines.
[Figure: IF, ID and RD stages feeding the EX pipelines (ALU; MEM1, MEM2; FP1, FP2, FP3, where F => Floating; BR), followed by WB]
Problems: different execution times; interdependency between instructions.
A multi-instruction buffer avoids blocking the pipelines.

8 Diversified pipelines - 2
[Figure: IF, ID and RD stages (in-order execution), a Dispatch Buffer, the EX pipelines (ALU; MEM1, MEM2; FP1, FP2, FP3; BR) with «Out Of Order» execution, a Reorder Buffer, and WB (in-order retirement)]


9 Floating Point DLX
F instructions. Functional units: Integer, FP Multiply, FP Add, FP/Int. Divide.
[Figure: IF and ID feed four execution paths that rejoin in MEM and WB: Integer (EX); FP Multiply (M1 to M7, pipelined multicycle stages); FP Add (A1 to A4, pipelined multicycle stages); FP/Int. Divide (e.g. 24 clock cycles, one instruction executed at a time)]

10 DLX revisited
Very important structural changes: more intermediate registers and a more complex ID stage that sends each instruction to the appropriate execution stage.
Hazard problems: instructions do not end in the same order in which they were issued.
Example (no interdependency between the instructions in this sequence):
FMUL F1, F2, F2     IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB
FADD F3, F4, F5     IF ID A1 A2 A3 A4 MEM WB
FLD  F6, 10(R8)     IF ID EX MEM WB
FST  40(R10), F9    IF ID EX MEM (WB) nop
In the original figure violet marks the stages where operands are needed, green the stages where results are produced, and red squares the execution stages. Figure annotations: "Data required for computing the address", "Data written".
Since the divider is normally a single functional unit, stalls of up to 40 clocks may occur in that case.
Multiple instructions can be in the same stage at the same time (in particular in WB).
Write After Write (WAW) hazards: instructions are not completed in order, so two writes to the same destination register can happen in the wrong sequence. Example: a FADD F6, F4, F5 (four EX cycles) directly preceding a FLD F6, 10(R8) (one EX cycle), although in this case the compiler would have dropped the FADD as useless.
Because of the different instruction execution times, Read After Write hazards (RAW, already present in DLX) are more frequent.

11 DLX revisited
To cope with simultaneous writes to different registers, either the number of RF input ports is increased (expensive) or stalls are introduced (normally in the MEM or WB stage, so as to choose which instructions to stall).
More complex pipelines: RAW hazards are solved through forwarding.
For WAW hazards and multiple RF write operations consider the following example:
FMUL F0, F4, F6    IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB
....               IF ID EX MEM WB
....               IF ID EX MEM WB
FADD F2, F4, F6    IF ID A1 A2 A3 A4 MEM WB
....               IF ID EX MEM WB
....               IF ID EX MEM WB
FLD  F2, 0(R2)     IF ID EX MEM WB
If FADD had been started one clock later, a Write After Write hazard would have taken place!
Hazards normally occur among homogeneous registers (FP or integer), except for FLOAD and FSTOR, which use integer registers for address computation.
Hazards are normally detected in the ID stage by considering the preceding and following instructions, so that the required stalls can be introduced (in this case FLD would have been stalled one clock).

12 DLX revisited
In the previous case FLD F2, 0(R2) must be stalled until FADD F2, F4, F6 has reached the MEM stage. It must however be assumed that between the two instructions there is at least one that uses, through forwarding, the result of FADD F2, F4, F6; otherwise the compiler would have dropped the FADD!
The situation would have been even worse if FLD had completed before FADD.
In any case it is always possible that instructions complete in an order different from that of their issue.
How can we guarantee that the final result is the one the program specifies?

13 Compiler
Let's consider these high-level language statements
X = Y + Z
A = B * C
to be executed in a processor with the following pipeline: Fetch (F), Decode (D), Issue (I), Execute (E, three stages), Writeback (W).
In-order emission: the addition (multiplication) can be issued only AFTER the execution of the previous instruction that computes R2 (R5), that is after its last EX stage, possibly with forwarding.
[Figure: pipeline timing diagram of the in-order schedule. Annotations: addition result not yet ready (RAW); busy decoder (RAW); multiply waits for results; busy decoder; issue possible here since the data for R1 and R2 have already been produced; stalls, data not available; D freed by the addition]

14 Compiler
Pipeline: Fetch (F), Decode (D), Issue (I), Execute (E, three stages), Writeback (W).
But the emission order can be modified without modifying the result.
[Figure: pipeline timing diagrams before and after reordering. Annotations: emission possible since R1 and R2 are already ready; waiting for R5; busy decoder; waiting for R6]
16 cycles instead of 22!

15 Multicycle hazards
I1  F1 = F2 + F3
I2  F2 = F4 x F5
I3  F3 = F3 + F4
I4  F6 = F6 x F6
I5  F1 = F3 + F5
I6  F2 = F3 + F4
[Figure: dependency graph among I1 to I6 with edges labelled WAR (F2), WAR (F3), RAW (F3), WAW (F1), RAW (F3), WAW (F2)]
NB: in this graph the hazards are potential, since only the registers are considered, no matter how many cycles the executions require.
Let's suppose we have an FP adder (1 cycle, in red) and a multiplier (3 cycles, in green).
[Figure: timing chart of I1 to I6 over the clock periods T1 to T10]
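The potential hazards of this slide can be enumerated mechanically. The following is a minimal sketch, assuming each instruction is described only by its destination and source registers (the tuple layout and the name `potential_hazards` are illustrative, not part of the slides). It compares every ordered pair of instructions and, like the graph, ignores execution times; unlike a refined analysis, it does not prune pairs that are shadowed by an intermediate write.

```python
from itertools import combinations

# (name, destination, sources) for I1..I6, copied from the slide.
PROGRAM = [
    ("I1", "F1", ("F2", "F3")),
    ("I2", "F2", ("F4", "F5")),
    ("I3", "F3", ("F3", "F4")),
    ("I4", "F6", ("F6", "F6")),
    ("I5", "F1", ("F3", "F5")),
    ("I6", "F2", ("F3", "F4")),
]

def potential_hazards(program):
    """Return (kind, register, earlier, later) for every ordered instruction pair."""
    found = []
    # combinations() keeps program order: the first element is always the earlier one.
    for (n1, d1, s1), (n2, d2, s2) in combinations(program, 2):
        if d1 in s2:      # the later instruction reads what the earlier one writes
            found.append(("RAW", d1, n1, n2))
        if d2 in s1:      # the later instruction overwrites what the earlier one reads
            found.append(("WAR", d2, n1, n2))
        if d1 == d2:      # both write the same register
            found.append(("WAW", d1, n1, n2))
    return found

hazards = potential_hazards(PROGRAM)
for hazard in hazards:
    print(hazard)
```

Whether any of these pairs becomes a real hazard depends on the latencies of the adder and the multiplier, which this sketch deliberately leaves out.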

16 Dynamic instruction scheduling
Temporal dependencies (hazards) are not known at compile time.
Dynamic scheduling allows the same code to run on different pipelines and on superscalar processors with no implications for the compiler.
It allows instructions to be executed ahead of their position (in the following case FSUB F12,F8,F14) if conditions allow it:
FDIV F0,F2,F4
FADD F10,F0,F8    (RAW: must wait for F0)
FSUB F12,F8,F14   (can be executed anyway)
Systems with out-of-order execution but commitment always in order.

17 Scoreboard
Consider the following sequence:
FDIV F0, F2, F4
FADD F10, F0, F8    (Read After Write, RAW, on F0)
FSUB F8, F8, F14    (Write After Read, WAR, on F8)
FADD and FSUB must read the same value of F8. There is an antidependency (WAR hazard) between FADD and FSUB: should FSUB end before FADD has read F8, an error would occur (F8 already updated).
A Write After Write (WAW) hazard would be possible if FSUB used F10 instead of F8 as destination (in case FSUB ended before FADD).
Scoreboard technique: aim at completing one instruction per clock by executing each instruction as soon as possible.

18 Scoreboard
[Figure: registers connected to the functional units (FP MUL, FP MUL, FP DIV, FP ADD, INTEG), all controlled by the scoreboard]
The scoreboard is roughly equivalent to the ID stage (just after the fetch) and determines when an instruction can read its operands and start its execution.
The scoreboard tracks all system state changes and decides when the first instruction in the FIFO queue (as produced by the compiler) can be started.

19 Scoreboard
The four stages, equivalent to ID, EX and WB in DLX, are:
1. Emission: the instruction is issued if a functional unit for it is available (free) and no other functional unit already holds an instruction that must write the same destination register (therefore no WAW hazards). Otherwise the instruction is stalled, which blocks the emission of all the following instructions in the prefetch queue, even when all other conditions for them are met!
2. Operand read: the instruction has been emitted. If the operands are available and no already executing instruction must write them, they are read; otherwise the instruction stalls in the functional unit.
3. Execution: when the result has been computed the scoreboard is informed, so as to unblock a possibly waiting instruction.
4. Write result: in case of a possible WAR the instruction is stalled and does not write its result while a previous instruction has not yet read its operands and one of them is the destination register of the considered instruction. Once the operand has been read, the result can be written.
With this organisation forwarding is avoided, since results are written as soon as they are produced (except for the WAR wait of point 4).
Some stalls can obviously be induced because the number of buses available for transfers is small.
The scoreboard technique allows instructions to be transferred directly from the EX to the WB stage (reducing the RAW risks).
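The stall conditions of stages 1, 2 and 4 can be written as three small predicates. This is a minimal sketch rather than a full scoreboard: the dictionary layout, the helper `fu` and the function names are assumptions made for illustration, and the flags `rj`/`rk` follow the usual textbook convention (true means "operand ready and not yet read"). The sample state is the one of cycle 17 of the example that follows in these slides, where ADDD has finished executing but DIVD has not yet read F6.

```python
def fu(busy=False, fi=None, fj=None, fk=None, rj=False, rk=False):
    """Hypothetical record for one functional unit (only the fields used below)."""
    return {"busy": busy, "fi": fi, "fj": fj, "fk": fk, "rj": rj, "rk": rk}

def can_issue(unit, dest, units, pending_writes):
    """Stage 1: the unit is free and no active instruction writes dest (no WAW)."""
    return not units[unit]["busy"] and dest not in pending_writes

def can_read_operands(unit, units):
    """Stage 2: both source operands are ready in the register file."""
    return units[unit]["rj"] and units[unit]["rk"]

def can_write_result(unit, units):
    """Stage 4: no other busy unit still has to read our destination (no WAR)."""
    dest = units[unit]["fi"]
    for name, other in units.items():
        if name == unit or not other["busy"]:
            continue
        if (other["fj"] == dest and other["rj"]) or (other["fk"] == dest and other["rk"]):
            return False
    return True

# State at cycle 17: MULTD and ADDD have read their operands, DIVD waits for F0.
units = {
    "Integer": fu(),
    "Mult1": fu(True, "F0", "F2", "F4"),
    "Mult2": fu(),
    "Add": fu(True, "F6", "F8", "F2"),
    "Divide": fu(True, "F10", "F0", "F6", rj=False, rk=True),
}
pending_writes = {"F0", "F6", "F10"}

print(can_write_result("Add", units))                     # ADDD blocked by DIVD (WAR on F6)
print(can_read_operands("Divide", units))                 # DIVD still waits for F0
print(can_issue("Integer", "F6", units, pending_writes))  # a new write to F6 would be a WAW
```

Once DIVD has read its operands (its `rk` flag is cleared), `can_write_result("Add", units)` becomes true, which corresponds to ADDD writing F6 in cycle 22 of the example.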

20 An example
LD   F6, 34(R2)    (integer unit)
LD   F2, 45(R3)
FMUL F0, F2, F4    (MULTD)
FSUB F8, F6, F2    (SUBD)
FDIV F10, F0, F6   (DIVD)
FADD F6, F8, F2    (ADDD)
Hazards marked in the figure: RAW (twice) and WAR (on the F6 destination of FADD).
Do you find more hypothetical hazards? For instance, what about F0?
Hypothetical timing for the different instructions (including operand read and execution): FLD 1 cycle; FADD, FSUB 2 cycles; FMUL 10 cycles; FDIV 40 cycles.

21 Scoreboard entities
Instruction stages: emission, operand read, execution and writeback.
Status of the functional units (FU), 9 parameters:
Busy     unit busy
Op       operation code presently executed
Fi       instruction destination (result) register
Fj, Fk   operand source registers
Qj, Qk   functional units producing the required operands (if not yet ready) for the registers Fj and Fk
Rj, Rk   flags (yes) indicating whether Fj, Fk have already been updated
Register result status: indicates which functional unit will write each register. Empty when no functional unit has to do with the specific register.
N.B. In case of a possible WAW the instruction emission is stalled (point 1 of the rules).
N.B. In the following example we suppose that two multiplication/division units are available.
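The nine parameters map naturally onto a record, and the register result status onto a dictionary. The sketch below is only an illustration of the bookkeeping done at emission: the class name `FUStatus` and the function `issue` are assumptions, not names from the slides. It replays the two emissions that produce the state shown for cycle 6 of the example (LD F2 on the integer unit, then MULTD on Mult1).

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class FUStatus:
    busy: bool = False
    op: Optional[str] = None   # operation presently executed
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # first source register
    fk: Optional[str] = None   # second source register
    qj: Optional[str] = None   # unit producing Fj, if still pending
    qk: Optional[str] = None   # unit producing Fk, if still pending
    rj: bool = False           # Fj ready to be read
    rk: bool = False           # Fk ready to be read

def issue(unit, op, dest, src1, src2, units, register_result):
    """Fill the nine fields of `unit` and record it as the writer of `dest`."""
    qj = register_result.get(src1)   # None when no unit is going to write src1
    qk = register_result.get(src2)
    units[unit] = FUStatus(True, op, dest, src1, src2, qj, qk, qj is None, qk is None)
    register_result[dest] = unit

units: Dict[str, FUStatus] = {
    name: FUStatus() for name in ("Integer", "Mult1", "Mult2", "Add", "Divide")
}
register_result: Dict[str, str] = {}

# A load has a single register operand; the unused slot is simply left as None here.
issue("Integer", "Load", "F2", None, "R3", units, register_result)
issue("Mult1", "Mult", "F0", "F2", "F4", units, register_result)

print(units["Mult1"])
print(register_result)
```

After the second call Mult1 holds `qj="Integer"`, `rj=False`, `rk=True`, that is MULTD waits for F2 from the integer unit while F4 is already readable.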

22 Example (here we assume that F0 is a normal register and not always 0)
NB: LD = FLD, MULTD = FMUL, SUBD = FSUB, DIVD = FDIV, ADDD = FADD. FU = Functional Unit.
Instruction status (instruction states; the entries are the clock numbers of the progression)
Instruction      j    k     Issue   Read Op   Execution complete   Write Result
LD     F6        34   R2
LD     F2        45   R3
MULTD  F0        F2   F4
SUBD   F8        F6   F2
DIVD   F10       F0   F6
ADDD   F6        F8   F2
Functional unit status: 1 integer unit, 2 multiplication units, 1 add/sub unit, 1 division unit
Columns: Time, Name, Busy, Op, Fi (dest), Fj (source 1), Fk (source 2), Qj (FU for j), Qk (FU for k), Rj (Fj ready?), Rk (Fk ready?)
Rows: Integer, Mult1, Mult2, Add, Divide
Time: number of execution cycles yet to elapse (FLD 1 cycle; FADD, FSUB 2 cycles; FMUL 10 cycles; FDIV 40 cycles).
Rj and Rk indicate whether the data can be read (possibly in the next cycle, if just produced) from the operand source registers of the executed instruction.
Qj and Qk are the functional units which produce them (if not yet ready).
Fj and Fk are the registers where the data produced by Qj and Qk are stored (or will be stored in the next cycle; the data are available if the corresponding R flag is yes) to be used by the executed instruction.
Register result status
Columns: Clock, F0, F2, F4, F6, F8, F10, F12, ..., F31 (floating point result registers)
Row FU: the functional unit producing the result for each floating point register Fx.

23 Cycle 1
LD F6, 34(R2) is issued (in the original figure brown marks a state change).
Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / - / - / -
LD F2, 45(R3); MULTD F0, F2, F4; SUBD F8, F6, F2; DIVD F10, F0, F6; ADDD F6, F8, F2: not yet issued
Functional unit status: Integer busy (Load, destination F6, source R2, ready: yes); Mult1, Mult2, Add, Divide free.
Register result status (clock 1): F6 <- Integer (the functional unit used for producing the result in F6).
At clock 1 R2 is supposed to be already available, so it can be used in the next clock. LD uses the integer unit.

24 Cycle 2
Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / - / -
LD F2, 45(R3); MULTD F0, F2, F4; SUBD F8, F6, F2; DIVD F10, F0, F6; ADDD F6, F8, F2: not yet issued
Data ready in R2: the instruction can proceed.
NB: the second LD cannot be emitted because the only integer unit is busy, and the same applies to MULTD because instructions must be emitted in order.
Functional unit status: Integer busy (Load, destination F6, source R2); Mult1, Mult2, Add, Divide free.
Register result status (clock 2): F6 <- Integer.

25 Cycle 3
Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / -
LD F2, 45(R3); MULTD F0, F2, F4; SUBD F8, F6, F2; DIVD F10, F0, F6; ADDD F6, F8, F2: not yet issued
Timing: FLD 1 cycle; FADD, FSUB 2 cycles; FMUL 10 cycles; FDIV 40 cycles.
Functional unit status: Integer busy (Load, destination F6, source R2); Mult1, Mult2, Add, Divide free.
Register result status (clock 3): F6 <- Integer.

26 Cycle 4
Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3); MULTD F0, F2, F4; SUBD F8, F6, F2; DIVD F10, F0, F6; ADDD F6, F8, F2: not yet issued
Functional unit status: Integer busy (Load, destination F6, source R2), freed at the end of the period; Mult1, Mult2, Add, Divide free.
Register result status (clock 4): F6 <- Integer; the register has been written at the end of the period.
The status of the FUs changes at the positive clock edge that concludes the current cycle (future status). For instance, the integer functional unit is freed at the end of cycle 4, together with the result writeback. LD F6, 34(R2) disappears completely from the scoreboard at the positive clock edge concluding cycle 4.

27 Cycle 5
Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3): 5 / - / - / -
MULTD F0, F2, F4; SUBD F8, F6, F2; DIVD F10, F0, F6; ADDD F6, F8, F2: not yet issued
Functional unit status: Integer busy (Load, destination F2, source R3, ready: yes); Mult1, Mult2, Add, Divide free.
Register result status (clock 5): F2 <- Integer (the integer functional unit must produce a new value for F2).
At the beginning of cycle 5 the integer unit is already free, so LD F2, 45(R3) can start. R3 is supposed to be already available, as in the previous case.

28 Cycle 6
Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3): 5 / 6 / - / -
MULTD F0, F2, F4: 6 / - / - / -
SUBD F8, F6, F2; DIVD F10, F0, F6; ADDD F6, F8, F2: not yet issued
Functional unit status:
Integer busy (Load, destination F2, source R3)
Mult1 busy (Mult, Fi F0, Fj F2, Fk F4, Qj Integer, Rj No, Rk Yes)
Mult2, Add, Divide free
Register result status (clock 6): F0 <- Mult1, F2 <- Integer.
MULTD F0, F2, F4 can be issued because its FU is free and no other instruction has F0 as destination register. F4 is supposed to be already present. MULTD waits for F2 from the integer unit!

29 Cycle 7
Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3): 5 / 6 / 7 / -
MULTD F0, F2, F4: 6 / - / - / -
SUBD F8, F6, F2: 7 / - / - / -
DIVD F10, F0, F6; ADDD F6, F8, F2: not yet issued
Functional unit status:
Integer busy (Load, destination F2, source R3)
Mult1 busy (Mult, Fi F0, Fj F2, Fk F4, Qj Integer, Rj No, Rk Yes)
Add busy (Sub, Fi F8, Fj F6, Fk F2, Qk Integer, Rj Yes, Rk No)
Mult2, Divide free
Register result status (clock 7): F0 <- Mult1, F2 <- Integer, F8 <- Add.
SUBD F8, F6, F2 can be issued because the FP add/subtract unit is free (NB: the FP adder executes FP subtractions too). MULTD is stalled in its execution unit because F2 is not yet ready. The same holds for SUBD.

30 Cycle 8
Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3): 5 / 6 / 7 / 8
MULTD F0, F2, F4: 6 / - / - / -
SUBD F8, F6, F2: 7 / - / - / -
DIVD F10, F0, F6: 8 / - / - / -
ADDD F6, F8, F2: not yet issued
Functional unit status:
Integer busy (Load, destination F2, source R3), freed at the end of the cycle
Mult1 busy (Mult, Fi F0, Fj F2, Fk F4, Rj Yes, Rk Yes)
Add busy (Sub, Fi F8, Fj F6, Fk F2, Rj Yes, Rk Yes)
Divide busy (Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj No, Rk Yes)
Mult2 free
Register result status (clock 8, updated at the end of the cycle): F0 <- Mult1, F8 <- Add, F10 <- Divide.
DIVD F10, F0, F6 can be issued because the FP divide FU is free; F0 is not yet available.
F2 is written, therefore the integer unit is freed. F2 being available allows MULTD and SUBD to read their operands during the next cycle.

31 Cycle 9-10
Instruction status (Issue / Read Op / Execution complete / Write Result):
LD F6, 34(R2): 1 / 2 / 3 / 4
LD F2, 45(R3): 5 / 6 / 7 / 8
MULTD F0, F2, F4: 6 / 9 / - / -
SUBD F8, F6, F2: 7 / 9 / - / -
DIVD F10, F0, F6: 8 / - / - / -
ADDD F6, F8, F2: not yet issued
Functional unit status:
Integer free
Mult1 busy (Mult, Fi F0, Fj F2, Fk F4), execution time 10 clocks
Add busy (Sub, Fi F8, Fj F6, Fk F2), execution time 2 clocks
Divide busy (Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj No, Rk Yes), execution time 40 clocks
Mult2 free
Register result status (clock 9-10): F0 <- Mult1, F8 <- Add, F10 <- Divide.
N.B.: MULTD and SUBD can read their operands because F2 is available (see cycle 8). DIVD is still stalled because of F0. ADDD cannot be issued because SUBD is using the adder FU.

32 Cycle 11 Instruction status Read EX Write Instruction j k Issue Op/Ex complete Result LD F6 34 R LD F2 45 R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F2 Nota: FU Add requires 2 cycles for the SUBD and therefore nothing happens in cycle 10 while MULTD still processes its data NB: ADDD will use the result of the SUBD but is not yet started because of SUBD Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 8 clocks more Mult1 Yes Mult F0 F2 F4 Mult2 No 0 Add Yes Sub F8 F6 F2 Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock 11 F0 F2 F4 F6 F8 F10 F12... F31 FU Mult1 Add Divide 32

33 Cycle 12 Instruction status Read EX Write Instruction j k IssueOp/Ex completeresult LD F6 34 R LD F2 45 R MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 FLD 1 cycle FADD and FSUB 2c ycles FMUL 10 cycles FDIV 40 cycles SUBD ends freeing the FU. In the next period ADDD can start Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name BusyOp Fi Fj Fk Qj Qk Rj Rk Integer No 7 clocks more Mult1 Yes Mult F0 F2 F4 Mult2 No Add No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock 12 F0 F2 F4 F6 F8 F10 F12... F31 FU Mult1 Divide F8 is written and the ADD/SUB FU is freed 33

34 Cycle 13 Instruction status Fead EX Write Instruction j k IssueOp/Excomplete Fesult LD F6 34 R LD F2 45 R MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F Now ADDD can start because SUBD has finished its execution and has freed the FU Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name BusyOp Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 No Add Divide Yes Add Yes Div F6 F10 F8 F0 F2 F6 Mult1 Yes No Yes Yes Register result status Clock 13 F0 F2 F4 F6 F8 F10 F12... F31 FU Mult1 Add Divide 6 Clocks more FLD 1 cycle FADD FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles 34

35 Cycle 14 Instruction status Read EX Write Instruction j k IssueOp/Ex complete Result LD F6 34 R LD F2 45 R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F FLD 1 cycle FADD FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name BusyOp Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 No Add Yes Add F6 F8 F2 Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock 14 F0 F2 F4 F6 F8 F10 F12... F31 FU Mult1 Add Divide 5 clocks more 2 Clocks more 35

36 Cycle 15 Instruction status Read EX Write Instruction j k IssueOp/Excomplete Result LD F6 34 R LD F2 45 R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F FLD 1 cycle FADD FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles ADDD requires two cycles and therefore no system status change Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name BusyOp Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 No Add Yes Add F6 F8 F2 Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock 15 F0 F2 F4 F6 F8 F10 F12... F31 FU Mult1 Add Divide 4 Clocks more 1 Clock more 36

37 Cycle 16 Instruction status Read EX Write Instruction j k IssueOp/Ex complete Result LD F6 34 R LD F2 45 R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F FLD 1 cycle FADD FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles ADDD ended its EX stage while MULTD and DIVD keep executing Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 3 clocks more Mult1 Yes Mult F0 F2 F4 Mult2 No Add Yes Add F6 F8 F2 Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock 16 F0 F2 F4 F6 F8 F10 F12... F31 FU Mult1 Add Divide 37

38 Cycle 17 Instruction status Read EX Write Instruction j k IssueOp/Ex complete Result LD F6 34 R LD F2 45 R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F FLD 1 cycle FADD FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles NB!!! ADDD stalled (cannot write) because of a WAR with DIVD on F6. DIVD does not read F6 because it waits for F0 produced by MULTD (operands are read in parallel). MULT and DIVD keep executing Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 2 Clocks more Mult1 Yes Mult F0 F2 F4 Mult2 No Stalled because WAR F6 Add Yes Add F6 F8 F2 Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock 17 F0 F2 F4 F6 F8 F10 F12... F31 FU Mult1 Add Divide 38

39 Cycle 18 Instruction status Read EX Write Instruction j k IssueOp/Excomplete Result LD F6 34 R LD F2 45 R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F FLD 1 cycle FADD FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles MULT still executing DIVD still stalled Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 1 clock more Mult1 Yes Mult F0 F2 F4 Mult2 No Add Yes Add F6 F8 F2 Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock 18 F0 F2 F4 F6 F8 F10 F12... F31 FU Mult1 Add Divide 39

40 Cycle 19 Instruction status Read EX Write Instruction j k Issue Op/Ex complete Result LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F FLD 1 cycle FADD FSUB 2 cycles FMUL 10 cycles FDIV 40 cycles MULT ends its execution, will write in cycle 20 (after 10 cycles) which will unblock DIVD and then ADDD Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 No Add Yes Add F6 F8 F2 Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status Clock 19 F0 F2 F4 F6 F8 F10 F12... F31 FU Mult1 Add Divide 40

41 Cycle 20 Instruction status Read EX Write FLD 1 cycle Instruction j k IssueOp/Excomplete Result FADD FSUB 2 cycles LD F6 34 R FMUL 10 cycles LD F2 45 R FDIV 40 cycles MULTD F0 F2 F MULTD writes F0 SUBD F8 F6 F unblocking DIVD DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add Yes Add F6 F8 F2 Divide Yes Div F10 F0 F6 Yes Yes Register result status Clock 20 F0 F2 F4 F6 F8 F10 F12... F31 FU Add Divide 41

42 Cycle 21 Instruction status Read EX Write Instruction j k IssueOp/Excomplete Result LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add Yes Add F6 F8 F2 Divide Yes Div F10 F0 F6 Register result status Clock 21 F0 F2 F4 F6 F8 F10 F12... F31 FU Add Divide DIVD reads both F0 and F6 (which could not be written by ADDD because of WAR) unblocking ADDD which can write F6 in the next cycle 42

43 Cycle 22 Instruction status Read EX Write Instruction j k IssueOp/Excomplete Result LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add No Divide Yes Div F10 F0 F6 Register result status Clock 22 F0 F2 F4 F6 F8 F10 F12... F31 FU Divide Now ADDD can write F6 after the WAR hazards with DIVD disappeared. For 6 cycles ADDD couldn t write F6 although its result was available 43

44 Cycle 61 Instruction status Read EX Write Instruction j k Issue Op/ExcompleteResult LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add No Divide Yes Div F10 F0 F6 Register result status Clock 61 F0 F2 F4 F6 F8 F10 F12... F31 FU Divide DIVD execution ends after 40 cycles 44

45 Cycle 62 Instruction status Read EX Write Instruction j k IssueOp/Excomplete Result LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F All executions ended Functional unit status dest S1 S2 FUj FUk Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add No 0 Divide No Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 62 FU 45

46 Scoreboard limits
Register values must in any case be read in parallel and only from the register file (which means that they must already have been stored in the registers: no RAW problem).
An instruction can be emitted only if all previous instructions have been emitted.
FDIV  F0, F2, F4
FADD  F6, F0, F8
FSTOR F6, 0(R1)
FSUB  F8, F10, F14
FMUL  F6, F10, F8
Hazards marked in the figure: WAR, RAW and WAW.
N.B. The hazards of the sequence are only potential: their occurrence depends on the instruction execution times.

47 Renaming - Tomasulo Algorithm
«Renaming» indicates a location different from the RF where a requested datum is produced/stored and can be obtained. The name «renaming» is used because it is as if the source registers of an instruction were renamed.
Tomasulo algorithm: renaming is based on the concept of reservation stations, which are buffers of the functional units where instructions can be «parked» while waiting for the availability of the requested FU and of the needed data.
A reservation station is a place in a FU where an instruction emitted from the instruction queue waits until the FU is free and the needed data arrive, as soon as they are produced (N.B. before being written to the RF). For each of its operands EITHER the source register data OR the reservation station producing it is indicated (whence renaming). The renaming occurs at run time.
A reservation station captures a required operand exactly when and where it is produced, without waiting until it is written and avoiding the register file access. This is similar to forwarding.
When multiple writes to the same register occur (WAW, possible only if multiple buses between the FUs and the RF are available), only the most recently produced data are written (for each register a TAG indicates the FU which has the right to write).
The following benefits occur:
Hazard detection and execution control are distributed (not grouped as in the scoreboard): only the information stored in the reservation stations of each functional unit determines whether an instruction can execute in the FU, since the source (where the datum is being produced) and NOT the RF is indicated in any case. RAW hazards are no longer possible, since the requested data are provided as soon as they are produced. The same holds for WAR.
Results are transferred over the common data buses directly to the reservation stations of the waiting FUs, without the need to read the RF (multiple reservation stations, in addition to the RF registers, can be accessed at the same time when multiple buses are available).

48 Tomasulo Algorithm
Tomasulo eliminates not only WAWs but also WARs.
Original sequence (possible WAW and possible WAR on F6):
FLD  F6, 32(R2)
FLD  F2, 44(R3)
FMUL F0, F2, F4
FSUB F8, F2, F6
FDIV F10, F0, F6
FADD F6, F8, F2
After renaming (T is the functional unit producing the datum):
FLD  [T/F6], 32(R2)
FLD  F2, 44(R3)
FMUL F0, F2, F4
FSUB F8, F2, [T]
FDIV F10, F0, [T]
FADD F6, F8, F2
When an instruction is inserted in an RS, it is checked whether one or more of its operands are being produced elsewhere by other RSs: if so, they are renamed.
For FADD a potential WAR with FDIV could occur if FADD ended before FDIV had read its operands (F8 from FSUB and F2 from FLD were both immediately available to FADD); but since FDIV points for F6 to the RS of FLD F6, 32(R2) and not to the RF, the problem does not occur. The same holds for FSUB.
As far as the WAW between FLD and FADD on F6 is concerned, the mechanism guarantees that only the most recent instruction in the RSs using a destination register can write that register.
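The renaming step can be sketched as a single pass over the instruction queue. The code below is an illustration under simplifying assumptions that are not in the slides: every instruction is emitted before any of them completes, the address operands of the loads are omitted, and the station names (`Load1`, `Mult1`, `Add1`, ...) and the function `rename` are hypothetical. Each source register is replaced by the tag of the reservation station that will produce it, when one exists.

```python
# (opcode, destination, floating point sources) for the sequence of this slide.
PROGRAM = [
    ("FLD", "F6", ()),
    ("FLD", "F2", ()),
    ("FMUL", "F0", ("F2", "F4")),
    ("FSUB", "F8", ("F2", "F6")),
    ("FDIV", "F10", ("F0", "F6")),
    ("FADD", "F6", ("F8", "F2")),
]

# Assumed mapping from opcode to the kind of reservation station it occupies.
STATION_KIND = {"FLD": "Load", "FMUL": "Mult", "FDIV": "Mult", "FSUB": "Add", "FADD": "Add"}

def rename(program):
    """Return the renamed instructions and the final register -> producer table."""
    producer = {}   # register -> tag of the station that will write it
    counters = {}   # station kind -> number of stations handed out so far
    renamed = []
    for op, dest, sources in program:
        kind = STATION_KIND[op]
        counters[kind] = counters.get(kind, 0) + 1
        tag = f"{kind}{counters[kind]}"
        # A pending source is replaced by its producer's tag; otherwise it is read from the RF.
        operands = tuple(producer.get(src, src) for src in sources)
        renamed.append((tag, op, dest, operands))
        producer[dest] = tag   # the most recent writer owns the register
    return renamed, producer

renamed, producer = rename(PROGRAM)
for entry in renamed:
    print(entry)
print(producer["F6"])
```

In the output FDIV and FSUB refer to `Load1` instead of F6, so the later write of FADD to F6 cannot disturb them (no WAR), and the final owner of F6 is the station of FADD, so only its result reaches the register (no WAW).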

49 Tomasulo Algorithm
Very high performance without special compilers.
Differences with the scoreboard:
Buffers and controls are distributed in the FUs (there is no centralized control); the buffers are called reservation stations.
Source register names are replaced by pointers to the reservation station buffers (if the requested data are being produced there). Renaming: a direct pointer to the source and not to the register.
One or more Common Data Buses send results to all the FUs requiring them.
Loads and stores are considered FUs (a STORE can also be a source for an RS executing a LOAD).

50 Tomasulo Algorithm
In this example it is assumed that the MUL unit also executes the DIVs and that the ADD unit also executes the SUBs. LOADs and STOREs are handled like other instructions.
In this example: 3 RS for add/sub, 2 RS for mult/div, 5 RS for store, 5 RS for load.
In this example there is only one Common Data Bus for the data produced by the FUs. Please notice that the same Common Data Bus is also used by the RSs waiting for data.
Each RS (more than one for each FU) stores an emitted instruction and, for each operand, one of two elements: either the operand value (e.g. read from the RF) or the name of the RS which is producing it (renaming).

51 Tomasulo Algorithm
Load buffers store the load addresses. Store buffers contain the computed addresses and the data to be written to memory.
Loads and stores must be executed in sequence if they refer to the same addresses. In the other cases it is possible to anticipate the LOADs (never the STOREs).
In the figure there are 3 phases (each of which can last several clocks):
Emission: instructions are extracted in order from the general instruction queue when there is a free RS for the requested FU (the only condition); otherwise the instruction queue stalls. Operands are taken from the RF or from the producing FU, as indicated. In case of WAW it must be determined which instruction must provide the data.
Execution: if one or more operands are not yet available, the CDB(s) must be monitored (data must be transferred over a bus anyway) in order to catch them (and their sources) as soon as they are available. RAW hazards are therefore avoided (we are sure not to read stale data from the RF).
Writeback: as soon as a datum is produced, it is transferred over one CDB (when more than one are available) to the RF and to the RSs waiting for it.
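The writeback phase amounts to a broadcast on the Common Data Bus. The following is a minimal sketch with a single CDB; the dictionary layout, the function names `broadcast` and `ready`, and all the numeric values are made up for illustration. Every reservation station waiting for the broadcast tag captures the value, and the register file is updated only if the tag still owns the register.

```python
def broadcast(tag, value, stations, register_status, register_file):
    """Deliver (tag, value) over the CDB to the waiting stations and to the RF."""
    for rs in stations.values():
        if rs["Qj"] == tag:
            rs["Vj"], rs["Qj"] = value, None
        if rs["Qk"] == tag:
            rs["Vk"], rs["Qk"] = value, None
    # Only the register whose tag matches is written; a register that has since
    # been claimed by a younger instruction keeps waiting for that instruction.
    for reg, owner in list(register_status.items()):
        if owner == tag:
            register_file[reg] = value
            del register_status[reg]

def ready(rs):
    """A station can start executing when no operand is pending."""
    return rs["Qj"] is None and rs["Qk"] is None

# MULTD F0, F2, F4 and SUBD F8, F6, F2 both wait for F2 from Load2 (values are invented).
stations = {
    "Mult1": {"Op": "Mult", "Vj": None, "Vk": 2.5, "Qj": "Load2", "Qk": None},
    "Add1": {"Op": "Sub", "Vj": 1.0, "Vk": None, "Qj": None, "Qk": "Load2"},
}
register_status = {"F2": "Load2", "F0": "Mult1", "F8": "Add1"}
register_file = {"F4": 2.5, "F6": 1.0}

broadcast("Load2", 4.0, stations, register_status, register_file)
print(ready(stations["Mult1"]), ready(stations["Add1"]))
print(register_file["F2"], register_status)
```

The same bus transaction serves both waiting stations and the register file, which is why no station ever needs to re-read the RF for a pending operand.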

52 Tomasulo Algorithm Let's see the scoreboard example in a Tomasulo architecture. Let's suppose that the execution times are the same as for the scoreboard (FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles). NB: +1 for the writeback. Code: LD F6, 34(R2); LD F2, 45(R3); FMUL F0, F2, F4; FSUB F8, F6, F2; FDIV F10, F0, F6; FADD F6, F8, F2 52

53 Reservation Station Op: opcode of the instruction to be executed. Vj, Vk: values of the source operands, once available (read from the RF or captured from the FUs producing them). Qj, Qk: functional units producing the source operands. A blank indicates that the source operands are already in Vj or Vk or that they are not required. Busy: indicates a busy FU. Register File Status: indicates which FU will write each register (if any). A blank means that no pending instruction must write the register and therefore its value can be used directly. N.B. One instruction per clock is emitted from the general instruction queue when an RS of the FU for that instruction is available; otherwise the queue stalls. In our example we assume only one CDB. 53

54 Cycle 0
Instruction status (Issue, Execution and Write are still empty for every instruction):
Instruction  j    k
LD     F6    34   R2
LD     F2    45   R3
MULTD  F0    F2   F4
SUBD   F8    F6   F2
DIVD   F10   F0   F6
ADDD   F6    F8   F2
Reservation Stations (S1 = Vj, S2 = Vk, RS for j = Qj, RS for k = Qk):
Time  Name   Busy  Op  Vj  Vk  Qj  Qk
0     Add1   No
0     Add2   No
      Add3   No
0     Mult1  No
0     Mult2  No
Vj, Vk: operand registers. If blank, the datum is produced in the corresponding Q FU. Qj, Qk: producing FU; if blank, the datum is in the RF. Load/store are not indicated in the status table. NB: for LD (ST is not used here) there is a limited number of RS. Their BUSY status is displayed differently from the FUs (see next slide). For the sake of simplicity Rj and Rk (ready/not ready) are not displayed, since their values are implicit in the status of Qj and Qk. (FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB: +1 for the writeback.
Register result status (the FU producing the new value), clock 0: F0, F2, F4, F6, F8, F10, F12 ... F31 all blank. 54

55 Cycle 1
Instruction status:
Instruction  j    k    Issue  Execution  Write
LD     F6    34   R2   1
LD     F2    45   R3
MULTD  F0    F2   F4
SUBD   F8    F6   F2
DIVD   F10   F0   F6
ADDD   F6    F8   F2
Load RS (5 RS for the LOAD): Load1, Busy: Yes, Address: 34+R2. NB: here it is assumed that R2 and R3 are already available.
Reservation Stations (3 RS for adder/sub, 2 RS for mul/div):
Time  Name   Busy  Op  Vj  Vk  Qj  Qk
0     Add1   No
0     Add2   No
      Add3   No
0     Mult1  No
0     Mult2  No
(FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB: +1 for the writeback.
Register result status, clock 1: F6 = Load1; all other registers blank. 55

56 Cycle 2 5 RS for LOAD Instruction status Instruction j k Issue Execution Write Busy Address LD F6 34 R Load1 Yes 34+R2 LD F2 45 R3 2 The second LD is emitted. Load2 Yes 45+R3 One instruction per clock is MULTD F0 F2 F4 emitted (when possible) SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 Add2 Add3 Mult1 Mult2 No No No No No Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 2 FU Load2 Load1 NB: Load -> 2 cycles: the first one for computing the address and the second for reading the datum N.B. A second LOAD has been emitted (not possible with the scoreboard) and parked in the RS. R3 value already available in the RF FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback 56

57 Cycle 3 Instruction status Instruction j k Issue Execution Write Busy LD F6 34 R Load1 Yes LD F2 45 R Load2 Yes MULTD F0 F2 F4 3 MULTD emitted (free RS ) SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 Add2 No No Datum supposed already in the RF Address 34+R2 45+R3 LD two cycles MULTD can be emitted although F2 NOT yet available. F2-> renaming FLD 1+1 cycles, Add3 No FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, Yet10 cycles Mult1 Yes Mult F4 Load2 FDIV 40+1 cycles) Mult2 No NB. +1 for the writeback Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 3 FU Mult1 Load2 Load1 57

58 Cycle 4 Instruction status Instruction j k Issue Execution Write Busy LD F6 34 R LD F2 45 R Load2 Yes MULTD F0 F2 F4 3 SUBD F8 F6 F2 4 SUBD is emitted (RS free) F6 available in RF at the DIVD F10 F0 F6 end of the cycle ADDD F6 F8 F2 Reservation Stations S1 S2 RS for j RS for k Address 45+R3 The datum read from memory LD F6 34(R2) is written both in the RF and in the RS of SUBD which is waiting for it FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback Time Name Busy Op Vj Vk Qj Qk Yet 3 cycles Add1 Yes Sub F6 (captured on the fly) Load2 Add2 No Add3 No Yet 10 cycles Mult1 Yes Mult F4 Load2 The FUs execute both sums and subtractions Mult2 No Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 4 FU Mult1 Load2 Add1 FU freed 58

59 Cycle 5 Instruction status Instruction j k Issue Execution Write LD F6 34 R LD F2 45 R MULTD F0 F2 F4 3 SUBD F8 F6 F2 4 DIVD F10 F0 F6 5 ADDD F6 F8 F2 Reservation Stations S1 S2 RS for j RS for k Cycles yet to be executed for completing the execution Time Name Busy Op Vj Vk Qj Qk 3 Add1 Yes Sub F6 (capt.) F2 (capt) 0 Add2 No Add3 No DIVD is emitted (RS free) 10 Mult1 Yes Mult F2 (capt) F4 The datum read from memory with LD F2 45(R3) is written both in register F2 and in the RS of SUBD and MULTD which are waiting for it FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback 0 Mult2 Register result status Yes Div F6 Mult1 Wait for F0 Clock F0 F2 F4 F6 F8 F10 F12... F31 5 FU Mult1 Add1 Mult2 FU freed 59

60 Cycle 6 Instruction status Instruction j k Issue Execution Write LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations S1 S2 RS for j RS for k Cycles yet to be executed for completing the execution Time Name Busy Op Vj Vk Qj Qk 2 Add1 Yes Sub F6 F2 Add2 Yes Add F2 Add1 Add3 No ADDD is emitted (RS free) Wait for F8 FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback 9 Mult1 Yes Mult F2 F4 Now MULTD can execute (F2 and F4 available) Yet 40 cycles Mult2 Yes Div F6 Mult1 Wait for F0 Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 6 FU Mult1 Add2 Add1 Mult2 60

61 Cycle 7 Instruction status Instruction j k Issue Execution Write LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations S1 S2 RS for j RS for k Yet 40 cycles Time Name Busy Op Vj Vk Qj Qk 1 Add1 Yes Sub F6 F2 Add2 Yes Add F2 Add1 Add3 No 8 Mult1 Yes Mult F2 F4 Mult2 Yes Div F6 Mult1 Datum in F6 will be overwritten by ADDD but it was already read and is present in the RS of DIVD SUBD (as ADDD) two cycles FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback ADDD stalled waiting for SUBD (F8) Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 7 FU Mult1 Add2 Add1 Mult2 61

62 Cycle 8 Instruction status Instruction j k Issue Execution Write LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk 0 Add1 No 2 Add2 Yes Add F8 F2 Add3 No 7 Mult1 Yes Mult F2 F4 Yet 40 Mult2 Yes Div F6 Mult1 NB: SUBD ends before MULTD and allows ADDD (which captures the result of F8) to start executing FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 8 FU Mult1 Add2 Mult2 FU freed 62

63 Cycle 9 Instruction status Instruction j k Issue Execution Write LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 No 2 Add2 Yes Add F8 F2 Add3 No ADDD executing 6 Mult1 Yes Mult F2 F4 Yet 40 Mult2 Yes Div F6 Mult1 FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 9 FU Mult1 Add2 Mult2 63

64 Cycle 10 Instruction status Instruction j k Issue Execution Write LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Yet 40 Add1 No 1 Add2 Yes Add F8 F2 Add3 No 5 Mult1 Yes Mult F2 F4 Mult2 Yes Div F6 Mult1 Two execution cycles FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 10 FU Mult1 Add2 Mult2 64

65 Cycle 11 Instruction status Instruction j k Issue Execution Write LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations S1 S2 RS for j RS for k Cycles yet to be executed for completing the execution Time Name Busy Op Vj Vk Qj Qk 0 Add1 Add2 Add3 No No No 4 Mult1 Yes Mult F2 F4 40 Mult2 Yes Div F6 Mult1 ADDD too ends before MULTD and DIVD FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 11 FU Mult1 Mult2 FU freed 65

66 Cycle 12 Instruction status Instruction j k Issue Execution Write LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations S1 S2 RS for j RS for k Cycles yet to be executed for completing the execution Time Name Busy Op Vj Vk Qj Qk Add1 Add2 Add3 No No No 3 Mult1 Yes Mult F2 F4 40 Mult2 Yes Div F6 Mult1 (FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback Waiting for the datum produced by MULTD Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 12 FU Mult1 Mult2 66

67 Cycle 15 Instruction status Instruction j k Issue Execution Write LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 Add2 Add3 No No No 1 Mult1 Yes Mult F2 F4 Yet 40 Mult2 Yes Div F6 Mult1 (FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback Waiting for the datum produced by MULTD Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 15 FU Mult1 Mult2 67

68 Cycle 16 Instruction status Instruction j k Issue Execution Write LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations S1 S2 RS for j RS for k Now DIVD can execute Time Name Busy Op Vj Vk Qj Qk 0 0 Add1 Add2 Add3 Mult1 No No No No 40 Mult2 Yes Div F0 F6 FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 16 FU Mult2 FU freed 68

69 Cycle 56 Instruction status Instruction j k Issue Execution Write LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 Add2 Add3 Mult1 No No No No 0 Mult2 Yes Div F0 F6 Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 56 FU Mult2 69

70 Cycle 57 Instruction status Execution Write Instruction j k Issue complete Result LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 Add2 Add3 Mult1 Mult2 No No No No No Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 57 FU 70

71 A demo can be found at 71

72 Limits of Tomasulo Algorithm Very complex. Each CDB must be connected to each RS: complex cabling. Reducing the number of CDBs reduces efficiency. If a single CDB is present, only one instruction per cycle can end. Out-of-order instruction completion: NOT precise interrupts. 73
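The effect of the number of CDBs on writeback can be illustrated with a small scheduling sketch. It is a toy model under stated assumptions: the function name `schedule_writebacks` is hypothetical, each result needs one CDB for one cycle, writeback can start in the cycle after execution ends, and ties are broken alphabetically by unit name (an arbitrary choice, real arbitration policies differ).

```python
def schedule_writebacks(finish_cycles, n_cdb):
    """Map each unit to its writeback cycle, given the number of CDBs.

    finish_cycles: {unit name: cycle in which its execution ends}.
    """
    used = {}    # cycle -> number of CDBs already taken in that cycle
    result = {}
    # Earlier finishers first; ties broken by name (arbitrary assumption).
    for name, done in sorted(finish_cycles.items(), key=lambda kv: (kv[1], kv[0])):
        cycle = done + 1
        while used.get(cycle, 0) >= n_cdb:
            cycle += 1           # all buses busy: the result must wait
        used[cycle] = used.get(cycle, 0) + 1
        result[name] = cycle
    return result


finishing = {"Add1": 7, "Mult1": 7, "Load1": 7}
print(schedule_writebacks(finishing, n_cdb=1))
print(schedule_writebacks(finishing, n_cdb=2))
```

With one CDB the three results are serialized over three consecutive cycles; with two CDBs only one of them has to wait.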

73 Exceptions Exception/interrupt: non-programmed control transfer. The return address and all other information necessary to restore the interrupted situation must be saved. A «response» subroutine (handler) must be executed. Two exception types: interrupt and trap. Interrupts: external causes. The user program is interrupted and then restored. Asynchronous to the current process. Acknowledged at the end of the current instruction (if interrupts are enabled). The handler is the responsibility of the user program. Traps: internal causes. Exceptional conditions (overflow, division by zero etc.). Errors (i.e. parity). Page fault (or, see later, segment fault): data not available in memory. Synchronous to the current process. Operating system handler. An instruction can be interrupted during its execution (i.e. page fault) and therefore must be «restartable». The executing program is normally temporarily aborted. 74

74 Examples Instruction Restart 75

75 Precise exceptions/interrupts Exceptions must be precise, that is, their behaviour must be the same as would occur in a non-pipelined architecture. Precise: the machine status is saved as if the code had been executed up to the exception: all preceding instructions must be terminated; all instructions following the instruction which provoked the exception must be handled as if they had never started. The same code must execute identically on different architectures. Complex problem with pipelines, OOO execution (see later) etc. Scoreboard and Tomasulo have: in-order emission, out-of-order execution (and therefore out-of-order termination). Precise exceptions (interrupts): instruction commitment in order. 76

76 Reorder Buffer (ROB) FIFO queue. Stores pointers to all instructions in FIFO order as they are emitted. For the sake of simplicity we say that the instruction is virtually inserted in the ROB. When instructions terminate, the results are stored in the ROB (instead of the RF), which also provides the operands to other instructions which require them (renaming!). Commitment: the result of the instruction in the top slot of the FIFO is transferred to the architectural registers. Easy undo of speculated instructions (see later), of erroneously predicted branches, or of exceptions. Automatic WAW avoidance. [Figure: FP Op Queue, ROB, FP Regs, reservation stations, FP adders] 78
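The in-order commitment of out-of-order results can be sketched with a FIFO. This is a minimal illustration, assuming a software queue in place of the hardware buffer; the class `ReorderBuffer`, its method names and the numeric values are hypothetical. Results may complete in any order, but the register file is updated only from the top of the queue.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # oldest instruction on the left (top of the FIFO)

    def emit(self, dest):
        """Reserve a slot, in program order, for an instruction writing `dest`."""
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        """Park the result in the ROB instead of the register file."""
        entry["value"], entry["done"] = value, True

    def commit(self, reg_file):
        """Retire terminated instructions from the top, in program order."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            e = self.entries.popleft()
            reg_file[e["dest"]] = e["value"]
            committed.append(e["dest"])
        return committed

    def flush_after(self, entry):
        """Drop every instruction younger than `entry` (e.g. a mispredicted branch)."""
        while self.entries and self.entries[-1] is not entry:
            self.entries.pop()


rob = ReorderBuffer()
regs = {}
a, b, c = rob.emit("F0"), rob.emit("F10"), rob.emit("F2")
rob.complete(c, 3.0)        # the youngest instruction ends first
first = rob.commit(regs)    # nothing retires: the top (F0) is not terminated
rob.complete(a, 1.0)
second = rob.commit(regs)   # only F0 retires, F10 still blocks F2
rob.complete(b, 2.0)
third = rob.commit(regs)    # F10 and F2 retire together, in order
print(first, second, third)
```

Because the architectural registers change only at commitment, an exception or a wrong branch prediction can be undone by discarding the younger entries, and two writes to the same register always land in program order.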

77 Tomasulo again 79

78 Tomasulo in 4 steps Emission: an instruction is emitted from the instruction queue when an RS and a ROB slot are available. The RS records the operand sources and the ROB slot where the instruction will be parked after its execution (this phase is called «dispatch»). The results are NOT written in the RF until the commitment phase. NB: the lack of either of the two conditions blocks the emission of the following instructions. Execution: the operation is performed on the operands. Operands not yet in the RF can be in the ROB (in this case the operand value computed by the nearest previous instruction is used) or still being computed in an FU. This phase is also called «issue». Result writeback: execution ends. The result is transmitted on the CDB to the RS waiting for it and to the ROB. Commitment: the register (or memory) is updated with the result stored in the ROB when the instruction is at the top of the ROB FIFO. In case of an erroneously predicted branch the ROB results are just dropped. This phase is also called «graduation». EMISSION IN ORDER, COMMITMENT IN ORDER. N.B. Sometimes several instructions can be committed simultaneously. If the destination is the same (unlikely, otherwise the compiler would have dropped the first one) the result of the most recent instruction is used. 80
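The three possible places where a source operand can be found at emission (register file, ROB, or an FU still computing it) can be sketched as a lookup. This is an illustrative model only: the function `read_operand`, the `rename_table` mapping and the values are hypothetical. The example mirrors FADD F10, F4, F0 emitted while LD F0, 10(R2) occupies slot ROB1.

```python
def read_operand(reg, reg_file, rename_table, rob):
    """Return ("value", v) or ("wait", rob_slot) for the source register `reg`."""
    slot = rename_table.get(reg)          # most recent ROB slot that will write `reg`
    if slot is None:
        return ("value", reg_file[reg])   # no pending writer: architectural value
    entry = rob[slot]
    if entry["done"]:
        return ("value", entry["value"])  # result parked in the ROB, not yet committed
    return ("wait", slot)                 # still being computed: monitor the CDB


# Hypothetical state: LD F0, 10(R2) sits in ROB1 and has not terminated yet.
reg_file = {"F0": 0.0, "F4": 4.0}
rob = {"ROB1": {"dest": "F0", "value": None, "done": False}}
rename_table = {"F0": "ROB1"}

src_f4 = read_operand("F4", reg_file, rename_table, rob)
src_f0 = read_operand("F0", reg_file, rename_table, rob)
print(src_f4, src_f0)

# Once the load terminates, the value is taken from the ROB, not from the RF.
rob["ROB1"]["value"], rob["ROB1"]["done"] = 3.0, True
src_f0_later = read_operand("F0", reg_file, rename_table, rob)
print(src_f0_later)
```

The RS of the FADD therefore stores the value of F4 and the pointer ROB1 in place of F0, and the stale F0 in the register file is never read.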

79 HW with ROB Each ROB entry holds: Destination Register, Result, Exception?, Valid (terminated), Program Counter. The ROB is a circular queue. [Figure: FP Op Queue, comparison network, Reorder Buffer, FP Regs, reservation stations, FP adders] 81
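A circular queue with a fixed number of slots can be sketched as follows. This is a simplified software model, with the hypothetical class name `CircularROB` and hypothetical field names loosely following the entry fields listed above; it shows why emission stalls when the buffer is full and how a freed slot is reused after commitment.

```python
class CircularROB:
    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0    # oldest entry (next to be committed)
        self.tail = 0    # next free slot
        self.count = 0

    def allocate(self, dest):
        """Reserve the next slot; returns None when the ROB is full (emission stalls)."""
        if self.count == len(self.slots):
            return None
        idx = self.tail
        self.slots[idx] = {"dest": dest, "result": None, "valid": False, "exception": False}
        self.tail = (self.tail + 1) % len(self.slots)   # wrap around
        self.count += 1
        return idx

    def release_head(self):
        """Free the oldest slot, but only if its instruction has terminated."""
        if self.count == 0 or not self.slots[self.head]["valid"]:
            return None
        entry = self.slots[self.head]
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)   # wrap around
        self.count -= 1
        return entry


rob = CircularROB(size=2)
s0 = rob.allocate("F0")
s1 = rob.allocate("F10")
stalled = rob.allocate("F2")     # no free slot
rob.slots[s0]["valid"] = True    # the oldest instruction terminates
retired = rob.release_head()
s2 = rob.allocate("F2")          # reuses slot 0
print(s0, s1, stalled, retired["dest"], s2)
```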

80 Example LD F0, 10(R2) FADD F10, F4, F0 FDIV F2, F10, F6 BRNE F2, +100 LD F4, 0(R3) FADD F0, F4, F6 ST 0(R3), F4 82

81 Tomasulo with ROB, cycle 1. FP Op queue: LD F0, 10(R2); FADD F10, F4, F0; FDIV F2, F10, F6; BRNE F2, +100; LD F4, 0(R3); FADD F0, F4, F6; ST 0(R3), F4. ROB (slots ROB1 to ROB7, with ROB end and ROB top markers): ROB1: Dest. F0, Instruction LD F0, 10(R2), Completed? N. Memory unit: Dest 1, 10+R2, M1. Dest. => destination position in ROB. [Figure: FP registers, reservation stations, FP adders, FP multipliers, to memory, from memory] 83

82 Tomasulo with ROB, cycle 2. FP Op queue: LD F0, 10(R2); FADD F10, F4, F0; FDIV F2, F10, F6; BRNE F2, +100; LD F4, 0(R3); FADD F0, F4, F6; ST 0(R3), F4. ROB (slots ROB1 to ROB7, with End and Top markers): ROB2: Dest. F10, Source ROB1, Instruction FADD F10, F4, F0, Completed? N. ROB1: Dest. F0, Instruction LD F0, 10(R2), Completed? Ex. Renaming!! Reservation station of the FP adders: Dest 2, FADD F10, F4, ROB1. Memory unit (three slots for memory operations): Dest 1, 10+R2, M1 (memory takes 2 clocks). There can also be two ROB sources. [Figure: FP registers, reservation stations, FP adders, FP multipliers, to memory, from memory] 84
