Parallel architectures Electronic Computers LM


2 Architecture
Architecture: the functional behaviour of a computer, for instance a processor which executes DLX code.
Implementation: a logic network implementing the architecture, also called microarchitecture. There are many implementations of the same architecture (example: the x86 family).
Synthesis: a physical implementation. There are many possible syntheses of the same implementation (for instance in different technologies).
The architecture is defined by the machine language, that is the instruction set (assembly language): Instruction Set Architecture -> ISA.
The ISA varies slowly while the implementations change rapidly (see for instance IA8, IA16, IA32). The longer an ISA survives, the more programs are written for it, and compatibility therefore becomes the main issue.

3 Sequential: a single instruction executed at a time
Instruction level parallelism
Pipelined: multiple instructions executed simultaneously
Superpipelined: multiple stages for each operation (EX, MEM etc.) in order to increase the clock frequency (e.g. Pentium IV)
Scalar: a single pipeline
Superscalar: multiple pipelines; many instructions started at the same time. Possible out-of-order execution (run-time decision)
Very Long Instruction Word (VLIW): multiple pipelines; many instructions started at the same time. Instruction order decided at compile time
Superscalar superpipelined (e.g. Pentium IV, I5, I7 etc.)

4 Architectures
Multicore (core level parallelism): many processors in the same chip (e.g. Core Duo, Nehalem, Sandy Bridge)
Multithread (thread level parallelism): the pipelines of the same processor are used by different processes at the same time (time sharing), as if it were a multicore (e.g. Pentium IV, Nehalem, Sandy Bridge etc.)
Memory level parallelism: a memory able to provide multiple data at different addresses at the same time (outstanding requests: DDR2, DDR3 etc.)

5 Deep Pipeline (Superpipeline)
[Figure: a pipeline with the stages Fetch, Decode, Execute, Memory, Writeback compared with a deep pipeline in which each stage is subdivided in three substages; the branch penalty is marked on both.]
Each stage is subdivided in three substages: higher clock frequency, but higher branch penalty and higher power consumption.

6 Parallel pipelines
Sequential
Time parallelism: pipeline
Space parallelism: VLIW
Space-time parallelism (e.g. I5, I7)

7 Diversified pipelines - 1
Dedicated pipelines. The instruction sequence is defined at compile time. Careful compilation is fundamental in order to avoid an under-exploitation of the pipelines.
[Figure: common stages IF, ID, RD, then the parallel EX pipelines ALU; MEM1, MEM2; FP1, FP2, FP3 (F => floating point); BR, then the common WB stage.]
Problems: different execution times; instruction interdependency.
A multi-instruction buffer is used to avoid blocking the pipelines.

8 Diversified pipelines - 2
[Figure: IF, ID and RD execute in order; a dispatch buffer feeds the EX pipelines (ALU; MEM1, MEM2; FP1, FP2, FP3; BR), which execute «out of order»; a reorder buffer precedes WB, where the instructions retire in order.]

9 Floating Point DLX
[Figure: DLX pipeline for the F (floating point) instructions. After IF and ID an instruction goes to one of four execution units and then to MEM and WB: the integer EX unit; the pipelined FP multiplier with multicycle stages M1 to M7; the pipelined FP adder with stages A1 to A4; the FP/integer divider (e.g. 24 clock cycles, one instruction executed at a time).]

10 DLX revisited
Very important structural changes: more intermediate registers and a more complex ID stage to send each instruction to the appropriate execution stage.
Hazard problems: the instructions do not end in the same order as their issue.
Example (no interdependency between the instructions in this sequence):
FMUL F1, F2, F2     IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB
FADD F3, F4, F5        IF ID A1 A2 A3 A4 MEM WB
FLD F6, 10(R8)            IF ID EX MEM WB
FST 40(R10), F9              IF ID EX MEM (WB) nop
[Figure legend: violet marks the stages where the operands are needed (for FLD and FST, the data required for computing the address), green the stages where results are produced or data are written, red squares the execution stages.]
Since the divider is normally a single functional unit, stalls of up to 40 clocks may occur in this case.
Multiple instructions can be in the same stage at the same time (in particular in WB).
Write After Write (WAW) hazards: e.g. if a FADD F6, F4, F5 (four EX cycles) directly preceded a FLD F6, 10(R8) (one EX cycle), although in this case the FADD would have been dropped by the compiler since it is useless. The instructions are not completed in order and have the same destination register: write sequence error.
Because of the different instruction execution times, Read After Write hazards (RAW, as in the DLX) are more frequent.

11 DLX revisited
To cope with multiple simultaneous write operations on different registers, either the number of input ports of the RF is increased (expensive) or stalls are introduced (normally in the MEM or WB stage, so as to choose the instructions to be stalled).
More complex pipelines: RAW hazards are solved through forwarding.
For WAW hazards consider the following example (multiple RF write operations):
FMUL F0, F4, F6   IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB
....                 IF ID EX MEM WB
....                    IF ID EX MEM WB
FADD F2, F4, F6            IF ID A1 A2 A3 A4 MEM WB
....                          IF ID EX MEM WB
....                             IF ID EX MEM WB
FLD F2, 0(R2)                       IF ID EX MEM WB
If FADD had been started one clock later, a Write After Write hazard would have taken place!
Hazards normally occur among homogeneous registers (FP or integer), except for FLOAD and FSTOR, which use integer registers for address computation.
Normally the hazards are detected in the ID stage, considering the preceding and following instructions, so as to introduce the required stalls (in this case FLD would have been stalled one clock).
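
The write-back clocks of the example above can be checked with a few lines of Python. This is only a sketch under an assumed timing model read off the diagram (IF and ID one clock each, then the EX stages of the unit, then MEM and WB, no stalls); the names `wb_cycle`, `write_schedule` and `waw_violations` are illustrative and not part of the original slides.

```python
# Assumed timing model (from the diagram): IF, ID, n EX stages, MEM, WB, no stalls.
EX_STAGES = {"FMUL": 7, "FADD": 4, "FLD": 1}

def wb_cycle(fetch_cycle, op):
    """Clock in which an instruction fetched in fetch_cycle is in WB."""
    return fetch_cycle + 2 + EX_STAGES[op] + 1

def write_schedule(program):
    """program: list of (fetch_cycle, op, dest) in program order."""
    return [(op, dest, wb_cycle(fetch, op)) for fetch, op, dest in program]

def waw_violations(program):
    """Pairs (older, younger, reg) where the older instruction writes reg last."""
    sched = write_schedule(program)
    bad = []
    for i, (op1, d1, w1) in enumerate(sched):
        for op2, d2, w2 in sched[i + 1:]:
            if d1 == d2 and w1 > w2:
                bad.append((op1, op2, d1))
    return bad

# The three named instructions of the slide (the "...." ones are omitted).
slide = [(1, "FMUL", "F0"), (4, "FADD", "F2"), (7, "FLD", "F2")]
# Hypothetical variant: FADD fetched one clock later.
delayed = [(1, "FMUL", "F0"), (5, "FADD", "F2"), (7, "FLD", "F2")]

print(write_schedule(slide))    # all three instructions reach WB in the same clock
print(waw_violations(delayed))  # FADD would overwrite the F2 loaded by FLD
```

Under this model the three instructions of the slide all reach WB in clock 11 (multiple RF writes in the same clock), while in the delayed variant FADD writes F2 after the younger FLD, which is the WAW case mentioned above.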

12 DLX revisited
In the previous case FLD F2, 0(R2) must be stalled until FADD F2, F4, F6 has reached the MEM stage. It must however be assumed that between the two instructions there is at least one which uses, through forwarding, the result of FADD F2, F4, F6; otherwise the compiler would have dropped the FADD. The situation would have been even worse if FLD had been completed before the FADD.
In any case it is always possible that instructions are completed in an order different from that of their issue. How can we guarantee that the final result is that of the program?

13 Compiler
Let's consider these high level language statements
X = Y + Z
A = B * C
to be executed in a processor with the following pipeline: Fetch (F), Decode (D), Issue (I), Execute (E, three stages), Writeback (W).
In-order emission: the issue of the addition (multiplication) is possible only AFTER the execution of the previous instruction that computes R2 (R5), that is after its last EX stage, possibly with forwarding.
[Figure: timing diagram of the in-order sequence. Annotations: addition result not yet ready (RAW); decoder busy; the multiplication waits for its results; the issue is possible here since the data for R1 and R2 have already been produced; stalls because data are not available; D freed by the addition.]

14 Compiler
Pipeline: Fetch (F), Decode (D), Issue (I), Execute (E, three stages), Writeback (W).
But we can modify the emission order without modifying the result.
[Figure: timing diagrams before and after the reordering. Annotations: emission possible since R1 and R2 are already available; waiting for R5; waiting for R6; decoder busy.]
16 cycles instead of 22!

15 Multicycle hazards
I1 F1 = F2 + F3
I2 F2 = F4 x F5
I3 F3 = F3 + F4
I4 F6 = F6 x F6
I5 F1 = F3 + F5
I6 F2 = F3 + F4
[Figure: dependency graph with the edges WAR (F2) I1-I2, WAR (F3) I1-I3, RAW (F3) I3-I5, WAW (F1) I1-I5, RAW (F3) I3-I6, WAW (F2) I2-I6.]
NB: in this graph the hazards are potential, since only the registers are considered, no matter how many cycles the executions require.
Let's suppose we have an FP adder (1 cycle, in red) and a multiplier (3 cycles, in green).
[Figure: timing chart of I1 to I6 over the clock periods T1 to T10.]
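
The potential hazards of this sequence can also be enumerated mechanically by comparing only the register names, exactly as the NB above says. The following is a minimal sketch; the tuple layout and the function name `potential_hazards` are illustrative choices, not part of the slides.

```python
from itertools import combinations

# (name, destination register, source registers), in program order
PROGRAM = [
    ("I1", "F1", ("F2", "F3")),
    ("I2", "F2", ("F4", "F5")),
    ("I3", "F3", ("F3", "F4")),
    ("I4", "F6", ("F6", "F6")),
    ("I5", "F1", ("F3", "F5")),
    ("I6", "F2", ("F3", "F4")),
]

def potential_hazards(program):
    """Return (kind, register, older, younger) for every pair of instructions."""
    hazards = []
    # combinations() keeps program order: the first element is the older one
    for (n1, d1, s1), (n2, d2, s2) in combinations(program, 2):
        if d1 in s2:   # younger reads what older writes
            hazards.append(("RAW", d1, n1, n2))
        if d2 in s1:   # younger writes what older reads
            hazards.append(("WAR", d2, n1, n2))
        if d1 == d2:   # both write the same register
            hazards.append(("WAW", d1, n1, n2))
    return hazards

for h in potential_hazards(PROGRAM):
    print(h)
```

Because every pair is compared, the sketch also reports a WAR on F2 between I1 and I6 in addition to the six hazards labelled in the graph. Execution times are ignored on purpose, so all reported hazards are only potential.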

16 Dynamic instruction scheduling
Temporal dependencies (hazards) are not known at compile time.
Dynamic scheduling allows the execution of the code on different pipelines and on superscalar processors with no implications for the compiler.
It allows the execution of instructions ahead of their position in the program (in the following case FSUB F12, F8, F14) if the conditions allow it:
FDIV F0, F2, F4
FADD F10, F0, F8 (RAW: must wait for F0)
FSUB F12, F8, F14 (can be executed anyway)
Systems with out-of-order execution, but commitment always in order.

17 Scoreboard
Consider the following sequence:
FDIV F0, F2, F4
FADD F10, F0, F8 (Read After Write, RAW, on F0)
FSUB F8, F8, F14 (Write After Read, WAR, on F8)
FADD and FSUB must read the same value of F8. There is an antidependency (WAR hazard) between FADD and FSUB: should FSUB end before FADD has read F8, an error would occur (F8 already updated).
A Write After Write (WAW) hazard would be possible if FSUB used F10 instead of F8 as destination (in case FSUB ended before FADD).
Scoreboard technique: the aim is to complete one instruction per clock by executing each instruction as soon as possible.

18 Scoreboard
[Figure: the registers connected to the functional units (two FP multipliers, FP divider, FP adder, integer unit), all controlled by the scoreboard.]
The scoreboard is roughly equivalent to the ID stage (just after the fetch) and determines when an instruction can read its operands and start its execution. The scoreboard considers all system state changes and decides when the first instruction in the FIFO queue (as produced by the compiler) can be started.

19 Scoreboard
The four stages, equivalent to ID, EX and WB in the DLX, are:
1. Emission: the instruction is issued if a functional unit for it is available (free) and no other functional unit already holds an instruction which must write into the same destination register (therefore no WAW hazards). Otherwise the instruction is stalled, which blocks the emission of all the following instructions in the prefetch queue, even when all other conditions for them are met! The availability of the operands is checked in the next stage.
2. Operand read: the instruction has been emitted. If the operands are available and no already executing instruction must write them, they are read; otherwise the instruction stalls in the functional unit.
3. Execution: when the result has been computed, the scoreboard is informed so as to unblock a possibly waiting instruction.
4. Write result: in case of a possible WAR the instruction is stalled and does not write its result as long as a previous instruction has not yet read its operands and one of them is the destination register of the considered instruction. Once the operand has been read, the result can be written.
Note that with this organisation forwarding is avoided, since the results are written as soon as they are produced (except for the WAR wait of point 4).
Some stalls can be induced because the number of buses available for transfers is small.
The scoreboard technique allows instructions to move directly from the EX to the WB stage (reducing the RAW risks).

20 An example
LD F6, 34(R2)
LD F2, 45(R3)
FMUL F0, F2, F4 (MULTD)
FSUB F8, F6, F2 (SUBD)
FDIV F10, F0, F6 (DIVD)
FADD F6, F8, F2 (ADDD)
[Figure annotations: the two LDs use the integer unit; RAW hazards are marked among the first instructions and a WAR hazard on the F6 written by FADD.]
Do you find more hypothetical hazards? For instance, what about F0?
Hypothetical timing for the different instructions (which includes the operand read and the execution): FLD 1 cycle; FADD, FSUB 2 cycles; FMUL 10 cycles; FDIV 40 cycles.

21 Scoreboard entities
Instruction stages: emission, operand read, execution and writeback.
Status of the functional units (FU), 9 parameters:
Busy: unit busy
Op: operation code presently executed
Fi: destination (result) register of the instruction
Fj, Fk: source registers of the operands
Qj, Qk: functional units producing the required operands (if not yet ready) for the registers Fj and Fk
Rj, Rk: flags (yes) indicating whether Fj, Fk have already been updated
Register result status: indicates which functional unit will write each register. Void when no functional unit has to do with the specific register.
N.B. It must be remembered that in case of a possible WAW the emission of the instruction is stalled (point 1 of the rules).
N.B. In the following example we suppose that two multiplication/division units are available.
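
As an illustration, the nine parameters and the checks of the scoreboard stages can be sketched in Python. This is not the author's implementation: the class and function names are hypothetical, and it is assumed (following the usual textbook convention) that Rj and Rk are cleared once the operands have been read.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FunctionalUnit:
    name: str
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source register 1
    fk: Optional[str] = None   # source register 2
    qj: Optional[str] = None   # FU producing Fj (if not yet ready)
    qk: Optional[str] = None   # FU producing Fk (if not yet ready)
    rj: bool = False           # Fj ready and not yet read (assumed convention)
    rk: bool = False           # Fk ready and not yet read (assumed convention)

def can_issue(unit, dest, register_result):
    """Emission: FU free and no pending writer of dest (no WAW)."""
    return not unit.busy and register_result.get(dest) is None

def can_read_operands(unit):
    """Operand read: every operand that is used must be ready."""
    return (unit.fj is None or unit.rj) and (unit.fk is None or unit.rk)

def can_write(unit, all_units):
    """Write result: wait while another busy unit still has to read unit.fi (WAR)."""
    for other in all_units:
        if other is unit or not other.busy:
            continue
        if (other.fj == unit.fi and other.rj) or (other.fk == unit.fi and other.rk):
            return False
    return True

# State assumed for an adder that has finished ADDD F6, F8, F2 while
# DIVD F10, F0, F6 still waits for F0 and has not yet read F6.
add = FunctionalUnit("Add", True, "Add", "F6", "F8", "F2")
div = FunctionalUnit("Divide", True, "Div", "F10", "F0", "F6", qj="Mult1", rj=False, rk=True)
print(can_write(add, [add, div]))   # False: WAR on F6
```

Clearing `div.rk` (the divider has read F6) makes `can_write(add, ...)` return True, which corresponds to the release of a result held back by a WAR.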

22 Example (here we assume that F0 is a normal register and not always 0)
NB: LD = FLD, MULTD = FMUL, SUBD = FSUB, DIVD = FDIV, ADDD = FADD. FU = Functional Unit.
Instruction status table: for each instruction (operands j, k) the clock of Issue, Read operands, Execution complete and Write result, showing the progression of the instruction states.
Instructions: LD F6, 34(R2); LD F2, 45(R3); MULTD F0, F2, F4; SUBD F8, F6, F2; DIVD F10, F0, F6; ADDD F6, F8, F2.
Functional unit status table: one row for each unit (1 integer unit, 2 multiplication units Mult1 and Mult2, 1 add/sub unit, 1 division unit) with the columns Time (number of execution cycles yet to elapse), Name, Busy, Op, Fi (destination), Fj, Fk (source 1, source 2), Qj, Qk (FU for j, FU for k), Rj, Rk (Fj ready? Fk ready?).
Execution times: FLD 1 cycle; FADD, FSUB 2 cycles; FMUL 10 cycles; FDIV 40 cycles.
Rj and Rk indicate whether the data can be read from the operand source registers of the executed instruction (possibly in the next cycle if just produced). Qj and Qk are the functional units which produce them (if not yet ready). Fj and Fk are the registers where the data produced by Qj and Qk are stored (or will be stored in the next cycle; the data are available if the corresponding R flag is yes) to be used by the executed instruction.
Register result status table: for each floating point result register (F0, F2, F4, F6, F8, F10, F12, ..., F31) the functional unit producing its result.

23 Cycle 1
Instruction status (issue/read operands/execution complete/write result, - = not yet): LD F6 1/-/-/-; the other instructions are not yet issued. (In the slides state changes are shown in brown.)
Functional unit status: Integer busy (Load, Fi=F6, Fk=R2, Rk=Yes); Mult1, Mult2, Add, Divide free.
Register result status (clock 1): F6 <- Integer (the functional unit used for producing the result in F6).
LD F6, 34(R2) is issued. At clock 1 R2 is supposed to be already available and can therefore be used in the next clock. LD uses the integer unit.

24 Cycle 2
Instruction status (issue/read operands/execution complete/write result, - = not yet): LD F6 1/2/-/-; the other instructions are not yet issued.
Functional unit status: Integer busy (Load, Fi=F6, Fk=R2); Mult1, Mult2, Add, Divide free.
Register result status (clock 2): F6 <- Integer.
Data ready in R2: the instruction can proceed.
NB: the second LD cannot be emitted because the only integer unit is busy, and the same applies to MULTD because instructions must be emitted in order.

25 Cycle 3
Instruction status (issue/read operands/execution complete/write result, - = not yet): LD F6 1/2/3/-; the other instructions are not yet issued.
Functional unit status: Integer busy (Load, Fi=F6, Fk=R2); Mult1, Mult2, Add, Divide free.
Register result status (clock 3): F6 <- Integer.
Execution times: FLD 1 cycle; FADD, FSUB 2 cycles; FMUL 10 cycles; FDIV 40 cycles.

26 Cycle 4
Instruction status (issue/read operands/execution complete/write result, - = not yet): LD F6 1/2/3/4; the other instructions are not yet issued.
Functional unit status: Integer busy (Load, Fi=F6, Fk=R2), freed at the end of the period; Mult1, Mult2, Add, Divide free.
Register result status (clock 4): F6 <- Integer; the register has been written at the end of the period.
The status of the FUs changes at the positive clock edge concluding the current cycle (future status). For instance the integer functional unit is freed at the end of cycle 4 together with the result writeback. LD F6, 34(R2) disappears totally from the scoreboard at the positive clock edge concluding cycle 4.

27 Cycle 5
Instruction status (issue/read operands/execution complete/write result, - = not yet): LD F6 1/2/3/4; LD F2 5/-/-/-; the other instructions are not yet issued.
Functional unit status: Integer busy (Load, Fi=F2, Fk=R3, Rk=Yes); Mult1, Mult2, Add, Divide free.
Register result status (clock 5): F2 <- Integer (the integer functional unit must produce a new value for F2).
At the beginning of cycle 5 the integer unit is already free, so LD F2, 45(R3) can start. R3 is supposed to be already ready, as in the previous case.

28 Cycle 6
Instruction status (issue/read operands/execution complete/write result, - = not yet): LD F6 1/2/3/4; LD F2 5/6/-/-; MULTD 6/-/-/-; SUBD, DIVD, ADDD not yet issued.
Functional unit status: Integer busy (Load, Fi=F2, Fk=R3); Mult1 busy (Mult, Fi=F0, Fj=F2, Fk=F4, Qj=Integer, Rj=No, Rk=Yes); Mult2, Add, Divide free.
Register result status (clock 6): F0 <- Mult1, F2 <- Integer.
MULTD F0, F2, F4 can be issued because its FU is free and its destination register F0 is not awaited from any other unit. F4 is supposed to be already present. MULTD waits for F2 from the integer unit.

29 Cycle 7
Instruction status (issue/read operands/execution complete/write result, - = not yet): LD F6 1/2/3/4; LD F2 5/6/7/-; MULTD 6/-/-/-; SUBD 7/-/-/-; DIVD, ADDD not yet issued.
Functional unit status: Integer busy (Load, Fi=F2, Fk=R3); Mult1 busy (Mult, Fi=F0, Fj=F2, Fk=F4, Qj=Integer, Rj=No, Rk=Yes); Mult2 free; Add busy (Sub, Fi=F8, Fj=F6, Fk=F2, Qk=Integer, Rj=Yes, Rk=No); Divide free.
Register result status (clock 7): F0 <- Mult1, F2 <- Integer, F8 <- Add.
SUBD F8, F6, F2 can be issued because the FP add/subtract unit is free (NB: the FP adder executes FP subtractions too). MULTD is stalled in its execution unit because F2 is not yet ready. The same holds for SUBD.

30 Cycle 8
Instruction status (issue/read operands/execution complete/write result, - = not yet): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/-/-/-; SUBD 7/-/-/-; DIVD 8/-/-/-; ADDD not yet issued.
Functional unit status: Integer busy (Load, Fi=F2, Fk=R3), freed at the end of the cycle; Mult1 busy (Mult, Fi=F0, Fj=F2, Fk=F4, Rj=Yes, Rk=Yes); Mult2 free; Add busy (Sub, Fi=F8, Fj=F6, Fk=F2, Rj=Yes, Rk=Yes); Divide busy (Div, Fi=F10, Fj=F0, Fk=F6, Qj=Mult1, Rj=No, Rk=Yes).
Register result status (clock 8, updated at the end of the cycle): F0 <- Mult1, F8 <- Add, F10 <- Divide.
DIVD F10, F0, F6 can be issued because the FP divide FU is free; F0 is not yet available. F2 is written and therefore the integer unit is freed. F2 being written allows MULTD and SUBD to read their operands during the next cycle.

31 Cycle 9-10
Instruction status (issue/read operands/execution complete/write result, - = not yet): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/9/-/-; SUBD 7/9/-/-; DIVD 8/-/-/-; ADDD not yet issued.
Functional unit status: Integer free; Mult1 busy (Mult, Fi=F0, Fj=F2, Fk=F4), 10 clocks; Mult2 free; Add busy (Sub, Fi=F8, Fj=F6, Fk=F2), 2 clocks; Divide busy (Div, Fi=F10, Fj=F0, Fk=F6, Qj=Mult1, Rj=No, Rk=Yes), 40 clocks.
Register result status (clock 9-10): F0 <- Mult1, F8 <- Add, F10 <- Divide.
N.B.: MULTD and SUBD can read their operands because F2 is available (see cycle 8). DIVD is still stalled because of F0. ADDD cannot start because SUBD uses the adder FU.

32 Cycle 11
Instruction status (issue/read operands/execution complete/write result, - = not yet): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/9/-/-; SUBD 7/9/11/-; DIVD 8/-/-/-; ADDD not yet issued.
Functional unit status: Integer free; Mult1 busy (Mult, Fi=F0, Fj=F2, Fk=F4), 8 clocks more; Mult2 free; Add busy (Sub, Fi=F8, Fj=F6, Fk=F2), 0 clocks more; Divide busy (Div, Fi=F10, Fj=F0, Fk=F6, Qj=Mult1, Rj=No, Rk=Yes).
Register result status (clock 11): F0 <- Mult1, F8 <- Add, F10 <- Divide.
Note: the Add FU requires 2 cycles for the SUBD and therefore nothing happens in cycle 10, while MULTD still processes its data.
NB: ADDD will use the result of the SUBD but has not yet started because SUBD occupies the adder.

33 Cycle 12
Instruction status (issue/read operands/execution complete/write result, - = not yet): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/9/-/-; SUBD 7/9/11/12; DIVD 8/-/-/-; ADDD not yet issued.
Functional unit status: Integer free; Mult1 busy (Mult, Fi=F0, Fj=F2, Fk=F4), 7 clocks more; Mult2 free; Add free; Divide busy (Div, Fi=F10, Fj=F0, Fk=F6, Qj=Mult1, Rj=No, Rk=Yes).
Register result status (clock 12): F0 <- Mult1, F10 <- Divide.
SUBD ends, freeing the FU: F8 is written and the ADD/SUB FU is freed. In the next period ADDD can start.

34 Cycle 13
Instruction status (issue/read operands/execution complete/write result, - = not yet): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/9/-/-; SUBD 7/9/11/12; DIVD 8/-/-/-; ADDD 13/-/-/-.
Functional unit status: Integer free; Mult1 busy (Mult, Fi=F0, Fj=F2, Fk=F4), 6 clocks more; Mult2 free; Add busy (Add, Fi=F6, Fj=F8, Fk=F2, Rj=Yes, Rk=Yes); Divide busy (Div, Fi=F10, Fj=F0, Fk=F6, Qj=Mult1, Rj=No, Rk=Yes).
Register result status (clock 13): F0 <- Mult1, F6 <- Add, F10 <- Divide.
Now ADDD can start because SUBD has finished its execution and has freed the FU.

35 Cycle 14
Instruction status (issue/read operands/execution complete/write result, - = not yet): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/9/-/-; SUBD 7/9/11/12; DIVD 8/-/-/-; ADDD 13/14/-/-.
Functional unit status: Integer free; Mult1 busy (Mult, Fi=F0, Fj=F2, Fk=F4), 5 clocks more; Mult2 free; Add busy (Add, Fi=F6, Fj=F8, Fk=F2), 2 clocks more; Divide busy (Div, Fi=F10, Fj=F0, Fk=F6, Qj=Mult1, Rj=No, Rk=Yes).
Register result status (clock 14): F0 <- Mult1, F6 <- Add, F10 <- Divide.

36 Cycle 15
Instruction status (issue/read operands/execution complete/write result, - = not yet): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/9/-/-; SUBD 7/9/11/12; DIVD 8/-/-/-; ADDD 13/14/-/-.
Functional unit status: Integer free; Mult1 busy (Mult, Fi=F0, Fj=F2, Fk=F4), 4 clocks more; Mult2 free; Add busy (Add, Fi=F6, Fj=F8, Fk=F2), 1 clock more; Divide busy (Div, Fi=F10, Fj=F0, Fk=F6, Qj=Mult1, Rj=No, Rk=Yes).
Register result status (clock 15): F0 <- Mult1, F6 <- Add, F10 <- Divide.
ADDD requires two cycles and therefore there is no system status change.

37 Cycle 16
Instruction status (issue/read operands/execution complete/write result, - = not yet): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/9/-/-; SUBD 7/9/11/12; DIVD 8/-/-/-; ADDD 13/14/16/-.
Functional unit status: Integer free; Mult1 busy (Mult, Fi=F0, Fj=F2, Fk=F4), 3 clocks more; Mult2 free; Add busy (Add, Fi=F6, Fj=F8, Fk=F2); Divide busy (Div, Fi=F10, Fj=F0, Fk=F6, Qj=Mult1, Rj=No, Rk=Yes).
Register result status (clock 16): F0 <- Mult1, F6 <- Add, F10 <- Divide.
ADDD has ended its EX stage, while MULTD and DIVD keep executing.

38 Cycle 17
Instruction status (issue/read operands/execution complete/write result, - = not yet): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/9/-/-; SUBD 7/9/11/12; DIVD 8/-/-/-; ADDD 13/14/16/-.
Functional unit status: Integer free; Mult1 busy (Mult, Fi=F0, Fj=F2, Fk=F4), 2 clocks more; Mult2 free; Add busy (Add, Fi=F6, Fj=F8, Fk=F2), stalled because of the WAR on F6; Divide busy (Div, Fi=F10, Fj=F0, Fk=F6, Qj=Mult1, Rj=No, Rk=Yes).
Register result status (clock 17): F0 <- Mult1, F6 <- Add, F10 <- Divide.
NB: ADDD is stalled (cannot write) because of a WAR with DIVD on F6. DIVD does not read F6 because it waits for F0, produced by MULTD (operands are read in parallel). MULTD and DIVD keep executing.

39 Cycle 18
Instruction status (issue/read operands/execution complete/write result, - = not yet): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/9/-/-; SUBD 7/9/11/12; DIVD 8/-/-/-; ADDD 13/14/16/-.
Functional unit status: Integer free; Mult1 busy (Mult, Fi=F0, Fj=F2, Fk=F4), 1 clock more; Mult2 free; Add busy (Add, Fi=F6, Fj=F8, Fk=F2); Divide busy (Div, Fi=F10, Fj=F0, Fk=F6, Qj=Mult1, Rj=No, Rk=Yes).
Register result status (clock 18): F0 <- Mult1, F6 <- Add, F10 <- Divide.
MULTD is still executing; DIVD is still stalled.

40 Cycle 19
Instruction status (issue/read operands/execution complete/write result, - = not yet): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/9/19/-; SUBD 7/9/11/12; DIVD 8/-/-/-; ADDD 13/14/16/-.
Functional unit status: Integer free; Mult1 busy (Mult, Fi=F0, Fj=F2, Fk=F4); Mult2 free; Add busy (Add, Fi=F6, Fj=F8, Fk=F2); Divide busy (Div, Fi=F10, Fj=F0, Fk=F6, Qj=Mult1, Rj=No, Rk=Yes).
Register result status (clock 19): F0 <- Mult1, F6 <- Add, F10 <- Divide.
MULTD ends its execution (after 10 cycles) and will write in cycle 20, which will unblock DIVD and then ADDD.

41 Cycle 20
Instruction status (issue/read operands/execution complete/write result, - = not yet): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/9/19/20; SUBD 7/9/11/12; DIVD 8/-/-/-; ADDD 13/14/16/-.
Functional unit status: Integer, Mult1, Mult2 free; Add busy (Add, Fi=F6, Fj=F8, Fk=F2); Divide busy (Div, Fi=F10, Fj=F0, Fk=F6, Rj=Yes, Rk=Yes).
Register result status (clock 20): F6 <- Add, F10 <- Divide.
MULTD writes F0, unblocking DIVD.

42 Cycle 21
Instruction status (issue/read operands/execution complete/write result, - = not yet): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/9/19/20; SUBD 7/9/11/12; DIVD 8/21/-/-; ADDD 13/14/16/-.
Functional unit status: Integer, Mult1, Mult2 free; Add busy (Add, Fi=F6, Fj=F8, Fk=F2); Divide busy (Div, Fi=F10, Fj=F0, Fk=F6).
Register result status (clock 21): F6 <- Add, F10 <- Divide.
DIVD reads both F0 and F6 (which could not be written by ADDD because of the WAR), unblocking ADDD, which can write F6 in the next cycle.

43 Cycle 22
Instruction status (issue/read operands/execution complete/write result, - = not yet): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/9/19/20; SUBD 7/9/11/12; DIVD 8/21/-/-; ADDD 13/14/16/22.
Functional unit status: Integer, Mult1, Mult2, Add free; Divide busy (Div, Fi=F10, Fj=F0, Fk=F6).
Register result status (clock 22): F10 <- Divide.
Now ADDD can write F6, since the WAR hazard with DIVD has disappeared. For 6 cycles ADDD couldn't write F6 although its result was available.

44 Cycle 61
Instruction status (issue/read operands/execution complete/write result, - = not yet): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/9/19/20; SUBD 7/9/11/12; DIVD 8/21/61/-; ADDD 13/14/16/22.
Functional unit status: Integer, Mult1, Mult2, Add free; Divide busy (Div, Fi=F10, Fj=F0, Fk=F6).
Register result status (clock 61): F10 <- Divide.
DIVD execution ends after 40 cycles.

45 Cycle 62
Instruction status (issue/read operands/execution complete/write result): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/9/19/20; SUBD 7/9/11/12; DIVD 8/21/61/62; ADDD 13/14/16/22.
Functional unit status: Integer, Mult1, Mult2, Add, Divide free.
Register result status (clock 62): no pending results.
All executions ended.
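
The execution-complete clocks of the walkthrough can be cross-checked with a small calculation: each instruction completes its execution a number of clocks equal to its hypothetical execution time after the clock in which it reads its operands. The sketch below takes the operand-read clocks from the cycle-by-cycle slides (2, 6, 9, 9, 21, 14); the dictionary and function names are illustrative only.

```python
# Hypothetical execution times given for the example (clocks)
LATENCY = {"LD": 1, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}

# (label, operation, clock of the operand read) as read off the walkthrough
READS = [
    ("LD F6", "LD", 2),
    ("LD F2", "LD", 6),
    ("MULTD", "MULTD", 9),
    ("SUBD", "SUBD", 9),
    ("DIVD", "DIVD", 21),
    ("ADDD", "ADDD", 14),
]

def execution_complete(read_clock, op):
    """Clock in which the execution ends, given the operand-read clock."""
    return read_clock + LATENCY[op]

completed = {label: execution_complete(clock, op) for label, op, clock in READS}
print(completed)
```

The result is written one clock after the execution completes, except for ADDD, whose write is held back by the WAR on F6 until cycle 22. For DIVD this gives 21 + 40 = 61 for the end of the execution and 62 for the write, as in the last two slides.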

46 Scoreboard limits
Register values must in any case be read, in parallel, only from the register file (which means that they must have been already stored in the registers: no RAW problem).
An instruction can be emitted only if all previous instructions have been emitted.
FDIV F0, F2, F4
FADD F6, F0, F8
FSTOR F6, 0(R1)
FSUB F8, F10, F14
FMUL F6, F10, F8
[Figure annotations: RAW (e.g. on F0 between FDIV and FADD), WAR (on F8 between FADD and FSUB) and WAW (on F6 between FADD and FMUL).]
N.B. The hazards of the sequence are only potential: their occurrence depends on the instruction execution times.

47 Renaming: Tomasulo Algorithm
«Renaming» indicates a location different from the RF where a requested datum is produced/stored and from which it can be obtained. The name «renaming» is used because it is as if the source registers of an instruction were renamed.
Tomasulo algorithm: renaming is based on the concept of reservation stations, which are buffers of the functional units where instructions can be «parked» waiting for the availability of the requested FU and of the needed data.
A reservation station is a slot of an FU where an instruction emitted from the instruction queue waits until the FU is free and the needed data arrive, as soon as they are produced (N.B. before being written in the RF). For each operand EITHER the source register data OR the reservation station producing it is indicated (whence renaming). The renaming occurs at run time.
A reservation station captures a required operand exactly when and where it is produced (without waiting until it is written, thus avoiding the register file access), similarly to forwarding.
When multiple writes to the same register occur (WAW, possible only if multiple buses between FUs and RF are available), only the most recently produced data are written (for each register a TAG indicates the FU which has the right to write).
The following benefits occur:
Hazard detection and execution control are distributed (not centralised as in the scoreboard): only the information stored in the reservation stations of each functional unit determines whether an instruction can execute in the FU, since the source (where the datum is being produced) and NOT the RF is in any case indicated. RAW hazards are no longer possible, since the requested data are provided as soon as they are produced. The same holds for WAR.
Results are transferred directly to the waiting reservation stations of the FUs over the common data buses, without the need to read the RF (multiple reservation stations, in addition to the RF registers, can be accessed at the same time when multiple buses are available).

48 Tomasulo Algorithm
Tomasulo eliminates not only WAWs but also WARs.
Original sequence (possible WAW on F6 between the first FLD and FADD; possible WAR on F6 between FDIV and FADD):
FLD F6, 32(R2)
FLD F2, 44(R3)
FMUL F0, F2, F4
FSUB F8, F2, F6
FDIV F10, F0, F6
FADD F6, F8, F2
After renaming (T = functional unit producing the datum):
FLD [T/F6], 32(R2)
FLD F2, 44(R3)
FMUL F0, F2, F4
FSUB F8, F2, [T]
FDIV F10, F0, [T]
FADD F6, F8, F2
When an instruction is inserted in an RS it is checked whether one or more of its operands are being produced elsewhere by other RSs: if so, they are renamed.
For the FADD a potential WAR with the FDIV could occur if FADD ended before FDIV has read its operands (the F8 of FSUB and the F2 of FLD being both immediately available for FADD), but since FDIV points for F6 to the RS of FLD F6, 32(R2) and not to the RF, the problem does not occur. The same holds for FSUB.
As far as the WAW on F6 between FLD and FADD is concerned, the mechanism guarantees that only the most recent instruction in the RSs using a destination register can write that register.
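
The renaming of the sequence above can be sketched as follows. It is a simplification, not the author's implementation: it assumes that all six instructions are issued before any of them completes, it ignores the integer address registers of the loads, and the station names (Load1, Load2, Mult1, Mult2, Add1, Add2) are hypothetical tags.

```python
def rename_sources(program):
    """program: list of (station tag, op, dest, sources) in issue order.

    Each source register is replaced by the tag of the station that is
    producing it, if any; otherwise the register name is kept (read from RF).
    """
    producer = {}          # register -> tag of the latest station writing it
    renamed = []
    for tag, op, dest, sources in program:
        renamed.append((tag, op, dest, tuple(producer.get(s, s) for s in sources)))
        producer[dest] = tag   # only the most recent writer keeps the right to write
    return renamed, producer

PROGRAM = [
    ("Load1", "FLD", "F6", ()),
    ("Load2", "FLD", "F2", ()),
    ("Mult1", "FMUL", "F0", ("F2", "F4")),
    ("Add1", "FSUB", "F8", ("F2", "F6")),
    ("Mult2", "FDIV", "F10", ("F0", "F6")),
    ("Add2", "FADD", "F6", ("F8", "F2")),
]

renamed, producer = rename_sources(PROGRAM)
for row in renamed:
    print(row)
print(producer["F6"])
```

In the output FDIV and FSUB point to Load1 for F6 and not to the register, so FADD can write F6 whenever it likes (no WAR), and the final tag of F6 is Add2, so a late result of Load1 cannot overwrite it in the RF (no WAW).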

49 Tomasulo Algorithm
Very high performance without special compilers.
Differences from the scoreboard:
Buffers and controls are distributed in the FUs (there is no centralised control): the buffers are called reservation stations.
Source register names are replaced by pointers to the buffers of the reservation stations (if the requested data are being produced there). Renaming: a direct pointer to the source and not to the register.
One or more Common Data Buses send the results to all the FUs requiring them.
Loads and stores are considered as FUs (a STORE can also be a source for an RS executing a LOAD).

50 Tomasulo Algorithm
In this example it is assumed that the MUL unit executes the DIVs too and that the ADD unit executes the SUBs too. LOADs and STOREs are handled like the other instructions.
In this example: 3 RS for add/sub, 2 RS for mult/div, 5 RS for store, 5 RS for load.
In this example there is only one Common Data Bus for the data produced by the FUs. Please notice that the same Common Data Bus is also used by the RSs waiting for data.
Each RS (more than one for each FU) stores an emitted instruction and, for each operand, one of two elements: either the operand value (e.g. read from the RF) or the name of the RS which is producing it (renaming).

51 Tomasulo Algorithm
Load buffers are used to store the load addresses. Store buffers contain the computed addresses and the data to be written in memory. Loads and stores must be executed in sequence if they refer to the same address. In the other cases it is possible to anticipate the LOADs (never the STOREs).
In the figure there are 3 phases (each of which can last several clocks):
Emission: the instructions are extracted in order from the general instruction queue when there is a free RS for the requested FU (the only condition); otherwise the instruction queue stalls. Operands are taken from the RF or from the producing FU, as indicated. In case of WAW it must be determined which instruction must provide the data.
Execution: if one or more operands are not yet available, the CDB(s) must be monitored (data must be transferred over a bus anyway) in order to catch them (and their sources) as soon as they are available: RAW hazards are therefore avoided (we are sure not to read stale data in the RF).
Writeback: as soon as a datum is produced, it is transferred over one CDB (when more than one is available) to the RF and to the RSs waiting for it.
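
The writeback phase over a single CDB can be sketched as a broadcast of a (tag, value) pair. This is only an illustration: the class `ReservationStation`, the function `broadcast` and the numeric values are hypothetical, and the station state mimics a moment in which FDIV waits for both the multiplier and the first load while a younger FADD is the designated writer of F6.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str
    op: str
    vj: Optional[float] = None   # operand values, once available
    vk: Optional[float] = None
    qj: Optional[str] = None     # tags of the stations producing the operands
    qk: Optional[str] = None

    def ready(self):
        return self.qj is None and self.qk is None

def broadcast(tag, value, stations, register_status, register_file):
    """Put (tag, value) on the CDB: waiting stations and the RF capture it."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    # Only registers whose tag still names this producer are written (no WAW)
    for reg in [r for r, t in register_status.items() if t == tag]:
        register_file[reg] = value
        del register_status[reg]

div = ReservationStation("Mult2", "DIV", qj="Mult1", qk="Load1")
register_status = {"F0": "Mult1", "F6": "Add2", "F10": "Mult2"}
register_file = {"F0": 0.0, "F6": 0.0}

broadcast("Load1", 3.5, [div], register_status, register_file)   # hypothetical value
print(div.vk, div.ready(), register_file["F6"])
broadcast("Mult1", 2.0, [div], register_status, register_file)   # hypothetical value
print(div.ready(), register_file["F0"])
```

After the first broadcast the FDIV station holds the loaded value while F6 in the RF is left untouched, because its tag names Add2; after the second broadcast the station has both operands and F0 is written.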

52 Tomasulo Algorithm
Let's see the scoreboard example in a Tomasulo architecture. Let's suppose that the execution times are the same as in the scoreboard example (FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles). NB: +1 for the writeback.
  LD   F6, 34(R2)
  LD   F2, 45(R3)
  FMUL F0, F2, F4
  FSUB F8, F6, F2
  FDIV F10, F0, F6
  FADD F6, F8, F2

53 Reservation Station
Op: opcode of the instruction to be executed.
Vj, Vk: the source operands (read from the RF or captured from the FUs producing them).
Qj, Qk: the functional units producing the operands. A blank indicates that the source operands are already in Vj or Vk, or that they are not required.
Busy: the reservation station is occupied.
Register File Status: indicates which FU will write each register (if any). A blank means that no pending instruction must write the register, and therefore its value can be used directly.
N.B. One instruction per clock is emitted from the general instruction queue when an RS of the required FU is available; otherwise the queue stalls. In our example we assume only one CDB.
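The fields above can be sketched as a small data structure, together with the emission step (which fills either V or Q for each operand) and a CDB broadcast. This is a minimal sketch, not the slides' implementation: the function names `issue`, `read_operand` and `broadcast`, and the numeric values 4.0 and 3.0, are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str                    # e.g. "Mult1"; also used as the renaming tag
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # operand values, valid when the matching Q is None
    vk: Optional[float] = None
    qj: Optional[str] = None     # names of the units still producing the operands
    qk: Optional[str] = None

    def ready(self) -> bool:
        return self.busy and self.qj is None and self.qk is None


def read_operand(reg, regfile, reg_status):
    """Return (value, None) if the register is up to date, else (None, producer)."""
    producer = reg_status.get(reg)
    if producer is None:
        return regfile[reg], None
    return None, producer


def issue(rs, op, dest, src_j, src_k, regfile, reg_status):
    """Emission: park the instruction in rs and rename its destination."""
    rs.busy, rs.op = True, op
    rs.vj, rs.qj = read_operand(src_j, regfile, reg_status)
    rs.vk, rs.qk = read_operand(src_k, regfile, reg_status)
    reg_status[dest] = rs.name   # later readers of dest will wait for this RS


def broadcast(tag, value, stations, regfile, reg_status):
    """Writeback: send (tag, value) on the CDB to the waiting RS and to the RF."""
    for rs in stations:
        if rs.busy and rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.busy and rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg in [r for r, producer in reg_status.items() if producer == tag]:
        regfile[reg] = value
        del reg_status[reg]


# Situation of cycle 3 in the example: F2 is still being loaded by Load2.
regfile = {"F0": 0.0, "F2": 0.0, "F4": 4.0}   # hypothetical register contents
reg_status = {"F2": "Load2"}
mult1 = ReservationStation("Mult1")
issue(mult1, "MULTD", "F0", "F2", "F4", regfile, reg_status)
waiting = (mult1.vj, mult1.qj, mult1.vk, mult1.qk)
broadcast("Load2", 3.0, [mult1], regfile, reg_status)   # hypothetical loaded value
```

After the emission, `Mult1` holds F4 by value and F2 by tag; after the broadcast both operands are values and the multiplication can start.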

54 Instruction status Instruction j k Issue LD F6 34 R2 LD F2 45 R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Cycle 0 Execution Write Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk 0 Add1 No 0 Add2 No Add3 No 0 Mult1 No 0 Mult2 No Operands register. If blank the datum is produced in the corresponding Q FU Load/store not indicated in the status table NB. For LD (ST here not used) there is a limited number of RS. Their BUSY status is here displayed differently from the FU (see next slide) Producing FU if blank it means that the datum is in RF For sake of simplicity R j e R k (ready/notready) are not displayed since their values are implicit in the status of Q j and Q k FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 0 FU The FU producing the new value 54

55 Cycle 1 5 RS for the LOAD Instruction status Instruction j k Issue Execution Write Busy Address LD F6 34 R2 1 Load1 Yes 34+R2 LD F2 45 R3 NB: Here it is assumed that R2 and R3 are already available MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations S1 S2 RS for j RS for k 3 RS for adder/sub 2 RS for mul/div Time Name Busy Op Vj Vk Qj Qk 0 Add1 No 0 Add2 No Add3 No 0 Mult1 No 0 Mult2 No FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 1 FU Load1 55

56 Cycle 2 5 RS for LOAD Instruction status Instruction j k Issue Execution Write Busy Address LD F6 34 R Load1 Yes 34+R2 LD F2 45 R3 2 The second LD is emitted. Load2 Yes 45+R3 One instruction per clock is MULTD F0 F2 F4 emitted (when possible) SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 Add2 Add3 Mult1 Mult2 No No No No No Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 2 FU Load2 Load1 NB: Load -> 2 cycles: the first one for computing the address and the second for reading the datum N.B. A second LOAD has been emitted (not possible with the scoreboard) and parked in the RS. R3 value already available in the RF FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback 56

57 Cycle 3 Instruction status Instruction j k Issue Execution Write Busy LD F6 34 R Load1 Yes LD F2 45 R Load2 Yes MULTD F0 F2 F4 3 MULTD emitted (free RS ) SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 Add2 No No Datum supposed already in the RF Address 34+R2 45+R3 LD two cycles MULTD can be emitted although F2 NOT yet available. F2-> renaming FLD 1+1 cycles, Add3 No FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, Yet10 cycles Mult1 Yes Mult F4 Load2 FDIV 40+1 cycles) Mult2 No NB. +1 for the writeback Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 3 FU Mult1 Load2 Load1 57

58 Cycle 4 Instruction status Instruction j k Issue Execution Write Busy LD F6 34 R LD F2 45 R Load2 Yes MULTD F0 F2 F4 3 SUBD F8 F6 F2 4 SUBD is emitted (RS free) F6 available in RF at the DIVD F10 F0 F6 end of the cycle ADDD F6 F8 F2 Reservation Stations S1 S2 RS for j RS for k Address 45+R3 The datum read from memory LD F6 34(R2) is written both in the RF and in the RS of SUBD which is waiting for it FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback Time Name Busy Op Vj Vk Qj Qk Yet 3 cycles Add1 Yes Sub F6 (captured on the fly) Load2 Add2 No Add3 No Yet 10 cycles Mult1 Yes Mult F4 Load2 The FUs execute both sums and subtractions Mult2 No Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 4 FU Mult1 Load2 Add1 FU freed 58

59 Cycle 5 Instruction status Instruction j k Issue Execution Write LD F6 34 R LD F2 45 R MULTD F0 F2 F4 3 SUBD F8 F6 F2 4 DIVD F10 F0 F6 5 ADDD F6 F8 F2 Reservation Stations S1 S2 RS for j RS for k Cycles yet to be executed for completing the execution Time Name Busy Op Vj Vk Qj Qk 3 Add1 Yes Sub F6 (capt.) F2 (capt) 0 Add2 No Add3 No DIVD is emitted (RS free) 10 Mult1 Yes Mult F2 (capt) F4 The datum read from memory with LD F2 45(R3) is written both in register F2 and in the RS of SUBD and MULTD which are waiting for it FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback 0 Mult2 Register result status Yes Div F6 Mult1 Wait for F0 Clock F0 F2 F4 F6 F8 F10 F12... F31 5 FU Mult1 Add1 Mult2 FU freed 59

60 Cycle 6 Instruction status Instruction j k Issue Execution Write LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations S1 S2 RS for j RS for k Cycles yet to be executed for completing the execution Time Name Busy Op Vj Vk Qj Qk 2 Add1 Yes Sub F6 F2 Add2 Yes Add F2 Add1 Add3 No ADDD is emitted (RS free) Wait for F8 FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback 9 Mult1 Yes Mult F2 F4 Now MULTD can execute (F2 and F4 available) Yet 40 cycles Mult2 Yes Div F6 Mult1 Wait for F0 Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 6 FU Mult1 Add2 Add1 Mult2 60

61 Cycle 7 Instruction status Instruction j k Issue Execution Write LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations S1 S2 RS for j RS for k Yet 40 cycles Time Name Busy Op Vj Vk Qj Qk 1 Add1 Yes Sub F6 F2 Add2 Yes Add F2 Add1 Add3 No 8 Mult1 Yes Mult F2 F4 Mult2 Yes Div F6 Mult1 Datum in F6 will be overwritten by ADDD but it was already read and is present in the RS of DIVD SUBD (as ADDD) two cycles FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback ADDD stalled waiting for SUBD (F8) Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 7 FU Mult1 Add2 Add1 Mult2 61

62 Cycle 8 Instruction status Instruction j k Issue Execution Write LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk 0 Add1 No 2 Add2 Yes Add F8 F2 Add3 No 7 Mult1 Yes Mult F2 F4 Yet 40 Mult2 Yes Div F6 Mult1 NB: SUBD ends before MULTD and allows ADDD (which captures the result of F8) to start executing FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 8 FU Mult1 Add2 Mult2 FU freed 62

63 Cycle 9 Instruction status Instruction j k Issue Execution Write LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 No 2 Add2 Yes Add F8 F2 Add3 No ADDD executing 6 Mult1 Yes Mult F2 F4 Yet 40 Mult2 Yes Div F6 Mult1 FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 9 FU Mult1 Add2 Mult2 63

64 Cycle 10 Instruction status Instruction j k Issue Execution Write LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Yet 40 Add1 No 1 Add2 Yes Add F8 F2 Add3 No 5 Mult1 Yes Mult F2 F4 Mult2 Yes Div F6 Mult1 Two execution cycles FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 10 FU Mult1 Add2 Mult2 64

65 Cycle 11 Instruction status Instruction j k Issue Execution Write LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations S1 S2 RS for j RS for k Cycles yet to be executed for completing the execution Time Name Busy Op Vj Vk Qj Qk 0 Add1 Add2 Add3 No No No 4 Mult1 Yes Mult F2 F4 40 Mult2 Yes Div F6 Mult1 ADDD too ends before MULTD and DIVD FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 11 FU Mult1 Mult2 FU freed 65

66 Instruction status Cycle 12 Instruction j k Issue Execution Write LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations S1 S2 RS for j RS for k Cycles yet to be executed for completing the execution Time Name Busy Op Vj Vk Qj Qk Add1 Add2 Add3 No No No 3 Mult1 Yes Mult F2 F4 40 Mult2 Yes Div F6 Mult1 FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback Waiting for the datum producedby MULTD Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 12 FU Mult1 Mult2 66

67 Cycle 15 Instruction status Instruction j k Issue Execution Write LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 Add2 Add3 No No No 1 Mult1 Yes Mult F2 F4 Yet 40 Mult2 Yes Div F6 Mult1 FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback Waiting for the datum producedby MULTD Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 15 FU Mult1 Mult2 67

68 Cycle 16 Instruction status Instruction j k Issue Execution Write LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations S1 S2 RS for j RS for k Now DIVD can execute Time Name Busy Op Vj Vk Qj Qk 0 0 Add1 Add2 Add3 Mult1 No No No No 40 Mult2 Yes Div F0 F6 FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles) NB. +1 for the writeback Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 16 FU Mult2 FU freed 68

69 Cycle 56 Instruction status Instruction j k Issue Execution Write LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 Add2 Add3 Mult1 No No No No 0 Mult2 Yes Div F0 F6 Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 56 FU Mult2 69

70 Cycle 57 Instruction status Execution Write Instruction j k Issue complete Result LD F6 34 R LD F2 45 R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Reservation Stations S1 S2 RS for j RS for k Time Name Busy Op Vj Vk Qj Qk Add1 Add2 Add3 Mult1 Mult2 No No No No No Register result status Clock F0 F2 F4 F6 F8 F10 F12... F31 57 FU 70
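The timing of the whole example can be reproduced with a simplified schedule calculation. This is a sketch under stated assumptions, not a full simulator: one emission per clock in order, execution starting the cycle after both the emission and the last operand broadcast, a single CDB carrying one result per cycle, and an RS reusable the cycle after its writeback. The names `schedule`, `LATENCY`, `UNIT` and `STATIONS` are illustrative.

```python
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}
UNIT = {"LD": "Load", "ADDD": "Add", "SUBD": "Add", "MULTD": "Mult", "DIVD": "Mult"}
STATIONS = {"Load": 5, "Add": 3, "Mult": 2}

# (opcode, destination, floating-point sources); R2 and R3 are assumed available.
PROGRAM = [
    ("LD", "F6", []),
    ("LD", "F2", []),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]


def schedule(program):
    """Return (op, dest, issue, exec_complete, write) for each instruction."""
    last_write = {}                        # register -> writeback cycle of its latest producer
    in_use = {unit: [] for unit in STATIONS}   # writeback cycles of occupied RS
    cdb_taken = set()                      # cycles in which the single CDB is used
    rows = []
    cycle = 0
    for op, dest, sources in program:
        unit = UNIT[op]
        cycle += 1                         # at most one emission per clock, in order
        while sum(1 for w in in_use[unit] if w >= cycle) >= STATIONS[unit]:
            cycle += 1                     # stall: no free RS for this unit
        # Sources are looked up at emission time, which models renaming.
        ready = max([cycle] + [last_write.get(reg, 0) for reg in sources])
        done = ready + LATENCY[op]         # last execution cycle
        write = done + 1
        while write in cdb_taken:          # one result per cycle on the CDB
            write += 1
        cdb_taken.add(write)
        in_use[unit].append(write)
        last_write[dest] = write
        rows.append((op, dest, cycle, done, write))
    return rows


for row in schedule(PROGRAM):
    print(row)
```

Under these assumptions the writebacks fall at cycles 4, 5, 16, 8, 57 and 11, which agrees with the cycle-by-cycle slides: SUBD ends at cycle 8, ADDD at cycle 11, MULTD at cycle 16 and DIVD at cycle 57.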

71 A demo can be found at 71

72 Limits of Tomasulo Algorithm
Very complex.
Each CDB must be connected to each RS: complex wiring.
Reducing the number of CDBs reduces efficiency: if a single CDB is present, only one instruction per cycle can complete.
Out-of-order instruction completion: interrupts are NOT precise.

73 Exceptions
Exception/interrupt: a non-programmed control transfer.
  The return address and all other information necessary to restore the interrupted situation must be saved.
  A «response» subroutine (handler) must be executed.
Two exception types: interrupts and traps.
Interrupts: external causes.
  The user program is interrupted and then restored.
  Asynchronous to the current process.
  Acknowledged at the end of the current instruction (if interrupts are enabled).
  The handler is the responsibility of the user program.
Traps: internal causes.
  Exceptional conditions (overflow, division by zero, etc.).
  Errors (e.g. parity).
  Page fault (or segment fault, see later): data not available in memory.
  Synchronous to the current process.
  Handled by the operating system.
  An instruction can be interrupted during its execution (e.g. by a page fault) and must therefore be «restartable». The executing program is normally temporarily aborted.

74 Examples Instruction Restart 75

75 Precise exceptions/interrupts
Exceptions must be precise, that is, their behaviour must be the same as would occur in a non-pipelined architecture.
Precise: the machine status is saved as if the code had been executed up to the exception:
  All preceding instructions must be terminated.
  All instructions following the one which caused the exception must be handled as if they had never started.
The same code must execute identically on different architectures.
This is a complex problem with pipelines, OOO execution (see later), etc.
Scoreboard and Tomasulo have in-order emission but out-of-order execution (and therefore out-of-order termination).
Precise exceptions (interrupts) require in-order instruction commitment.

76 Reorder Buffer (ROB)
FIFO queue.
Stores pointers to all instructions in FIFO order as they are emitted. For the sake of simplicity we say that the instruction is virtually inserted in the ROB.
When instructions terminate, their results are stored in the ROB (instead of the RF), which also provides the operands to the other instructions that require them (renaming!).
Commitment: the result of the instruction in the top slot of the FIFO is transferred to the architectural registers.
Easy undo of speculated instructions (see later), of erroneously predicted branches, or of exceptions.
Automatic WAW avoidance.
[Figure: FP Op Queue, ROB, FP Regs, reservation stations, FP adders]
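The FIFO behaviour described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `ReorderBuffer` class with `allocate`, `complete` and `commit` methods; the register values are made up.

```python
from collections import deque


class ReorderBuffer:
    """FIFO of in-flight instructions; results reach the RF only at commitment."""

    def __init__(self, size):
        self.size = size
        self.entries = deque()
        self.next_tag = 1

    def allocate(self, dest):
        """Called at emission; returns the ROB tag, or None if the ROB is full."""
        if len(self.entries) >= self.size:
            return None
        tag = self.next_tag
        self.next_tag += 1
        self.entries.append({"tag": tag, "dest": dest, "value": None, "done": False})
        return tag

    def complete(self, tag, value):
        """Called at writeback; instructions may terminate out of order."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"], entry["done"] = value, True
                return
        raise KeyError(tag)

    def commit(self, regfile):
        """Retire terminated instructions from the top of the FIFO only."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            regfile[entry["dest"]] = entry["value"]
            retired.append(entry["tag"])
        return retired


rob = ReorderBuffer(size=4)
regs = {}
first = rob.allocate("F6")     # older instruction writing F6
second = rob.allocate("F6")    # younger instruction writing F6 (WAW)
rob.complete(second, 2.0)      # the younger one terminates first
early = rob.commit(regs)       # nothing retires: the top slot is not done
rob.complete(first, 1.0)
late = rob.commit(regs)        # both retire, in program order
```

Because the entries retire in program order, the younger value of F6 is the one left in the register file, which is how the WAW hazard is avoided without extra checks.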

77 Tomasulo again 79

78 Tomasulo in 4 steps
Emission: an instruction is emitted from the instruction queue when an RS and a ROB slot are available. The RS records the sources of the operands and the ROB slot where the instruction will be parked after its execution (this phase is called «dispatch»). The results are NOT written to the RF until the commitment phase. NB: the lack of either of the two conditions blocks the emission of the following instructions.
Execution: the operation is performed on the operands. If they are not yet ready, they can be in the ROB (in this case the operand value computed by the nearest previous instruction is used) or still being computed in an FU. This phase is called «issue».
Result writeback: execution ends. The result is transmitted on the CDB to the RS waiting for it and to the ROB.
Commitment: the register (or memory) is updated with the result stored in the ROB when the instruction is at the top of the ROB FIFO. In case of an erroneously predicted branch the ROB results are simply dropped. This phase is also called «graduation».
EMISSION IN ORDER, COMMITMENT IN ORDER
N.B. Sometimes several instructions can be committed simultaneously. If the destination is the same (unlikely, otherwise the compiler would have dropped the first one), the result of the most recent instruction is used.
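The commitment step, including the dropping of results after an erroneously predicted branch, can be sketched as follows. This is an illustration under assumptions: the ROB is modelled as a plain list in program order, the helper names `entry` and `commit` are hypothetical, and the register values are made up. The instruction sequence is the one used in the example that follows.

```python
def entry(text, dest, value, done=True, mispredicted=False):
    """Build one ROB entry (dest is None for branches and stores)."""
    return {"text": text, "dest": dest, "value": value,
            "done": done, "mispredicted": mispredicted}


def commit(rob, regfile):
    """Retire terminated entries from the top of the ROB, in program order.

    Returns "flush" if a mispredicted branch reached the top (all younger
    entries are dropped), otherwise "ok".
    """
    while rob and rob[0]["done"]:
        head = rob.pop(0)
        if head["mispredicted"]:
            rob.clear()            # speculative results never reach the RF
            return "flush"
        if head["dest"] is not None:
            regfile[head["dest"]] = head["value"]
    return "ok"


rob = [
    entry("LD F0, 10(R2)", "F0", 1.0),
    entry("FADD F10, F4, F0", "F10", 2.0),
    entry("FDIV F2, F10, F6", "F2", 3.0),
    entry("BRNE F2, +100", None, None, mispredicted=True),
    entry("LD F4, 0(R3)", "F4", 9.0),      # speculative
    entry("FADD F0, F4, F6", "F0", 9.5),   # speculative
]
regfile = {}
outcome = commit(rob, regfile)
```

Only the three instructions before the branch update the register file; the speculative LD and FADD leave no trace, which is also what keeps exceptions precise.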

79 HW with ROB
ROB entry fields: destination register, result, exception?, valid (terminated), program counter.
The ROB is a circular queue.
[Figure: FP Op Queue, comparison network, Reorder Buffer, FP Regs, reservation stations, FP adders]

80 Example
  LD   F0, 10(R2)
  FADD F10, F4, F0
  FDIV F2, F10, F6
  BRNE F2, +100
  LD   F4, 0(R3)
  FADD F0, F4, F6
  ST   0(R3), F4

81 Tomasulo with ROB, cycle 1
FP Op queue: LD F0, 10(R2); FADD F10, F4, F0; FDIV F2, F10, F6; BRNE F2, +100; LD F4, 0(R3); FADD F0, F4, F6; ST 0(R3), F4
ROB (slots ROB1 to ROB7, with top and end pointers):
  ROB1: Dest. F0, instruction LD F0, 10(R2), completed? N
Dest. => destination position in the ROB.
Memory unit: Dest 1, address 10+R2 (M1)
[Figure: FP Op queue, ROB, FP registers, reservation stations, FP adders, FP multipliers, to/from memory]

82 Tomasulo with ROB, cycle 2
FP Op queue: LD F0, 10(R2); FADD F10, F4, F0; FDIV F2, F10, F6; BRNE F2, +100; LD F4, 0(R3); FADD F0, F4, F6; ST 0(R3), F4
ROB (slots ROB1 to ROB7, with top and end pointers):
  ROB2: Dest. F10, source ROB1, instruction FADD F10, F4, F0, completed? N
  ROB1: Dest. F0, instruction LD F0, 10(R2), completed? Ex
Renaming!! There can also be two ROB sources.
Reservation stations (FP adders): Dest 2, FADD F10, F4, ROB1
Memory unit (three slots for memory operations; memory takes 2 clocks): Dest 1, address 10+R2 (M1)
[Figure: FP Op queue, ROB, FP registers, reservation stations, FP adders, FP multipliers, to/from memory]


More information

6.S084 Tutorial Problems L19 Control Hazards in Pipelined Processors

6.S084 Tutorial Problems L19 Control Hazards in Pipelined Processors 6.S084 Tutorial Problems L19 Control Hazards in Pipelined Processors Options for dealing with data and control hazards: stall, bypass, speculate 6.S084 Worksheet - 1 of 10 - L19 Control Hazards in Pipelined

More information

SATSim: A Superscalar Architecture Trace Simulator Using Interactive Animation

SATSim: A Superscalar Architecture Trace Simulator Using Interactive Animation SATSim: A Superscalar Architecture Trace Simulator Using Interactive Animation Mark Wolff Linda Wills School of Electrical and Computer Engineering Georgia Institute of Technology {wolff,linda.wills}@ece.gatech.edu

More information

CS61c: Introduction to Synchronous Digital Systems

CS61c: Introduction to Synchronous Digital Systems CS61c: Introduction to Synchronous Digital Systems J. Wawrzynek March 4, 2006 Optional Reading: P&H, Appendix B 1 Instruction Set Architecture Among the topics we studied thus far this semester, was the

More information

Lecture 4: Introduction to Pipelining

Lecture 4: Introduction to Pipelining Lecture 4: Introduction to Pipelining Pipelining Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes A B C D Dryer takes 40 minutes Folder

More information

EE382V-ICS: System-on-a-Chip (SoC) Design

EE382V-ICS: System-on-a-Chip (SoC) Design EE38V-CS: System-on-a-Chip (SoC) Design Hardware Synthesis and Architectures Source: D. Gajski, S. Abdi, A. Gerstlauer, G. Schirner, Embedded System Design: Modeling, Synthesis, Verification, Chapter 6:

More information

Compiler Optimisation

Compiler Optimisation Compiler Optimisation 6 Instruction Scheduling Hugh Leather IF 1.18a hleather@inf.ed.ac.uk Institute for Computing Systems Architecture School of Informatics University of Edinburgh 2018 Introduction This

More information

Digital Integrated CircuitDesign

Digital Integrated CircuitDesign Digital Integrated CircuitDesign Lecture 13 Building Blocks (Multipliers) Register Adder Shift Register Adib Abrishamifar EE Department IUST Acknowledgement This lecture note has been summarized and categorized

More information

FMP For More Practice

FMP For More Practice FP 6.-6 For ore Practice Labeling Pipeline Diagrams with 6.5 [2] < 6.3> To understand how pipeline works, let s consider these five instructions going through the pipeline: lw $, 2($) sub $, $2, $3 and

More information

IF ID EX MEM WB 400 ps 225 ps 350 ps 450 ps 300 ps

IF ID EX MEM WB 400 ps 225 ps 350 ps 450 ps 300 ps CSE 30321 Computer Architecture I Fall 2010 Homework 06 Pipelined Processors 85 points Assigned: November 2, 2010 Due: November 9, 2010 PLEASE DO THE ASSIGNMENT ON THIS HANDOUT!!! Problem 1: (25 points)

More information

IF ID EX MEM WB 400 ps 225 ps 350 ps 450 ps 300 ps

IF ID EX MEM WB 400 ps 225 ps 350 ps 450 ps 300 ps CSE 30321 Computer Architecture I Fall 2011 Homework 06 Pipelined Processors 75 points Assigned: November 1, 2011 Due: November 8, 2011 PLEASE DO THE ASSIGNMENT ON THIS HANDOUT!!! Problem 1: (15 points)

More information

ECE473 Computer Architecture and Organization. Pipeline: Introduction

ECE473 Computer Architecture and Organization. Pipeline: Introduction Computer Architecture and Organization Pipeline: Introduction Lecturer: Prof. Yifeng Zhu Fall, 2015 Portions of these slides are derived from: Dave Patterson UCB Lec 11.1 The Laundry Analogy Student A,

More information

Computer Elements and Datapath. Microarchitecture Implementation of an ISA

Computer Elements and Datapath. Microarchitecture Implementation of an ISA 6.823, L5--1 Computer Elements and atapath Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 status lines Microarchitecture Implementation of an ISA ler control points 6.823, L5--2

More information

Evolution of DSP Processors. Kartik Kariya EE, IIT Bombay

Evolution of DSP Processors. Kartik Kariya EE, IIT Bombay Evolution of DSP Processors Kartik Kariya EE, IIT Bombay Agenda Expected features of DSPs Brief overview of early DSPs Multi-issue DSPs Case Study: VLIW based Processor (SPXK5) for Mobile Applications

More information

Computer Architecture

Computer Architecture Computer Architecture An Introduction Virendra Singh Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay http://www.ee.iitb.ac.in/~viren/

More information

On the Rules of Low-Power Design

On the Rules of Low-Power Design On the Rules of Low-Power Design (and Why You Should Break Them) Prof. Todd Austin University of Michigan austin@umich.edu A long time ago, in a not so far away place The Rules of Low-Power Design P =

More information

ECE 2300 Digital Logic & Computer Organization. More Pipelined Microprocessor

ECE 2300 Digital Logic & Computer Organization. More Pipelined Microprocessor ECE 2300 Digital ogic & Computer Organization Spring 2018 ore Pipelined icroprocessor ecture 18: 1 nnouncements No instructor office hour today Rescheduled to onday pril 16, 4:00-5:30pm Prelim 2 review

More information

CZ3001 ADVANCED COMPUTER ARCHITECTURE

CZ3001 ADVANCED COMPUTER ARCHITECTURE CZ3001 ADVANCED COMPUTER ARCHITECTURE Lab 3 Report Abstract Pipelining is a process in which successive steps of an instruction sequence are executed in turn by a sequence of modules able to operate concurrently,

More information

Instructor: Dr. Mainak Chaudhuri. Instructor: Dr. S. K. Aggarwal. Instructor: Dr. Rajat Moona

Instructor: Dr. Mainak Chaudhuri. Instructor: Dr. S. K. Aggarwal. Instructor: Dr. Rajat Moona NPTEL Online - IIT Kanpur Instructor: Dr. Mainak Chaudhuri Instructor: Dr. S. K. Aggarwal Course Name: Department: Program Optimization for Multi-core Architecture Computer Science and Engineering IIT

More information

Class Project: Low power Design of Electronic Circuits (ELEC 6970) 1

Class Project: Low power Design of Electronic Circuits (ELEC 6970) 1 Power Minimization using Voltage reduction and Parallel Processing Sudheer Vemula Dept. of Electrical and Computer Engineering Auburn University, Auburn, AL. Goal of the project:- To reduce the power consumed

More information

Architectural Core Salvaging in a Multi-Core Processor for Hard-Error Tolerance

Architectural Core Salvaging in a Multi-Core Processor for Hard-Error Tolerance Architectural Core Salvaging in a Multi-Core Processor for Hard-Error Tolerance Michael D. Powell, Arijit Biswas, Shantanu Gupta, and Shubu Mukherjee SPEARS Group, Intel Massachusetts EECS, University

More information

A LOW POWER DESIGN FOR ARITHMETIC AND LOGIC UNIT

A LOW POWER DESIGN FOR ARITHMETIC AND LOGIC UNIT A LOW POWER DESIGN FOR ARITHMETIC AND LOGIC UNIT NG KAR SIN (B.Tech. (Hons.), NUS) A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF ENGINEERING DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING NATIONAL

More information

Overview. 1 Trends in Microprocessor Architecture. Computer architecture. Computer architecture

Overview. 1 Trends in Microprocessor Architecture. Computer architecture. Computer architecture Overview 1 Trends in Microprocessor Architecture R05 Robert Mullins Computer architecture Scaling performance and CMOS Where have performance gains come from? Modern superscalar processors The limits of

More information

CS4617 Computer Architecture

CS4617 Computer Architecture 1/26 CS4617 Computer Architecture Lecture 2 Dr J Vaughan September 10, 2014 2/26 Amdahl s Law Speedup = Execution time for entire task without using enhancement Execution time for entire task using enhancement

More information

CS429: Computer Organization and Architecture

CS429: Computer Organization and Architecture CS429: Computer Organization and Architecture Dr. Bill Young Department of Computer Sciences University of Texas at Austin Last updated: November 8, 2017 at 09:27 CS429 Slideset 14: 1 Overview What s wrong

More information

Computer Architecture ( L), Fall 2017 HW 3: Branch handling and GPU SOLUTIONS

Computer Architecture ( L), Fall 2017 HW 3: Branch handling and GPU SOLUTIONS Computer Architecture (263-2210-00L), Fall 2017 HW 3: Branch handling and GPU SOLUTIONS Instructor: Prof. Onur Mutlu TAs: Hasan Hassan, Arash Tavakkol, Mohammad Sadr, Lois Orosa, Juan Gomez Luna Assigned:

More information

DIGITAL DESIGN WITH SM CHARTS

DIGITAL DESIGN WITH SM CHARTS DIGITAL DESIGN WITH SM CHARTS By: Dr K S Gurumurthy, UVCE, Bangalore e-notes for the lectures VTU EDUSAT Programme Dr. K S Gurumurthy, UVCE, Blore Page 1 19/04/2005 DIGITAL DESIGN WITH SM CHARTS The utility

More information

Low-Power Design for Embedded Processors

Low-Power Design for Embedded Processors Low-Power Design for Embedded Processors BILL MOYER, MEMBER, IEEE Invited Paper Minimization of power consumption in portable and batterypowered embedded systems has become an important aspect of processor

More information

Designing with STM32F3x

Designing with STM32F3x Designing with STM32F3x Course Description Designing with STM32F3x is a 3 days ST official course. The course provides all necessary theoretical and practical know-how for start developing platforms based

More information

Department Computer Science and Engineering IIT Kanpur

Department Computer Science and Engineering IIT Kanpur NPTEL Online - IIT Bombay Course Name Parallel Computer Architecture Department Computer Science and Engineering IIT Kanpur Instructor Dr. Mainak Chaudhuri file:///e /parallel_com_arch/lecture1/main.html[6/13/2012

More information

SCALCORE: DESIGNING A CORE

SCALCORE: DESIGNING A CORE SCALCORE: DESIGNING A CORE FOR VOLTAGE SCALABILITY Bhargava Gopireddy, Choungki Song, Josep Torrellas, Nam Sung Kim, Aditya Agrawal, Asit Mishra University of Illinois, University of Wisconsin, Nvidia,

More information

ADVANCED EMBEDDED MONITORING SYSTEM FOR ELECTROMAGNETIC RADIATION

ADVANCED EMBEDDED MONITORING SYSTEM FOR ELECTROMAGNETIC RADIATION 98 Chapter-5 ADVANCED EMBEDDED MONITORING SYSTEM FOR ELECTROMAGNETIC RADIATION 99 CHAPTER-5 Chapter 5: ADVANCED EMBEDDED MONITORING SYSTEM FOR ELECTROMAGNETIC RADIATION S.No Name of the Sub-Title Page

More information

DELD MODEL ANSWER DEC 2018

DELD MODEL ANSWER DEC 2018 2018 DELD MODEL ANSWER DEC 2018 Q 1. a ) How will you implement Full adder using half-adder? Explain the circuit diagram. [6] An adder is a digital logic circuit in electronics that implements addition

More information

ECOM 4311 Digital System Design using VHDL. Chapter 9 Sequential Circuit Design: Practice

ECOM 4311 Digital System Design using VHDL. Chapter 9 Sequential Circuit Design: Practice ECOM 4311 Digital System Design using VHDL Chapter 9 Sequential Circuit Design: Practice Outline 1. Poor design practice and remedy 2. More counters 3. Register as fast temporary storage 4. Pipelined circuit

More information

Vector Arithmetic Logic Unit Amit Kumar Dutta JIS College of Engineering, Kalyani, WB, India

Vector Arithmetic Logic Unit Amit Kumar Dutta JIS College of Engineering, Kalyani, WB, India Vol. 2 Issue 2, December -23, pp: (75-8), Available online at: www.erpublications.com Vector Arithmetic Logic Unit Amit Kumar Dutta JIS College of Engineering, Kalyani, WB, India Abstract: Real time operation

More information

Pre-Silicon Validation of Hyper-Threading Technology

Pre-Silicon Validation of Hyper-Threading Technology Pre-Silicon Validation of Hyper-Threading Technology David Burns, Desktop Platforms Group, Intel Corp. Index words: microprocessor, validation, bugs, verification ABSTRACT Hyper-Threading Technology delivers

More information

Computer Architecture and Organization:

Computer Architecture and Organization: Computer Architecture and Organization: L03: Register transfer and System Bus By: A. H. Abdul Hafez Abdul.hafez@hku.edu.tr, ah.abdulhafez@gmail.com 1 CAO, by Dr. A.H. Abdul Hafez, CE Dept. HKU Outlines

More information

Low Power Design Part I Introduction and VHDL design. Ricardo Santos LSCAD/FACOM/UFMS

Low Power Design Part I Introduction and VHDL design. Ricardo Santos LSCAD/FACOM/UFMS Low Power Design Part I Introduction and VHDL design Ricardo Santos ricardo@facom.ufms.br LSCAD/FACOM/UFMS Motivation for Low Power Design Low power design is important from three different reasons Device

More information

Reading Material + Announcements

Reading Material + Announcements Reading Material + Announcements Reminder HW 1» Before asking questions: 1) Read all threads on piazza, 2) Think a bit Ÿ Then, post question Ÿ talk to Animesh if you are stuck Today s class» Wrap up Control

More information

Pipelined Architecture (2A) Young Won Lim 4/7/18

Pipelined Architecture (2A) Young Won Lim 4/7/18 Pipelined Architecture (2A) Copyright (c) 2014-2018 Young W. Lim. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2

More information

Pipelined Architecture (2A) Young Won Lim 4/10/18

Pipelined Architecture (2A) Young Won Lim 4/10/18 Pipelined Architecture (2A) Copyright (c) 2014-2018 Young W. Lim. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2

More information

Fall 2015 COMP Operating Systems. Lab #7

Fall 2015 COMP Operating Systems. Lab #7 Fall 2015 COMP 3511 Operating Systems Lab #7 Outline Review and examples on virtual memory Motivation of Virtual Memory Demand Paging Page Replacement Q. 1 What is required to support dynamic memory allocation

More information

Computer Arithmetic (2)

Computer Arithmetic (2) Computer Arithmetic () Arithmetic Units How do we carry out,,, in FPGA? How do we perform sin, cos, e, etc? ELEC816/ELEC61 Spring 1 Hayden Kwok-Hay So H. So, Sp1 Lecture 7 - ELEC816/61 Addition Two ve

More information

Topics. Low Power Techniques. Based on Penn State CSE477 Lecture Notes 2002 M.J. Irwin and adapted from Digital Integrated Circuits 2002 J.

Topics. Low Power Techniques. Based on Penn State CSE477 Lecture Notes 2002 M.J. Irwin and adapted from Digital Integrated Circuits 2002 J. Topics Low Power Techniques Based on Penn State CSE477 Lecture Notes 2002 M.J. Irwin and adapted from Digital Integrated Circuits 2002 J. Rabaey Review: Energy & Power Equations E = C L V 2 DD P 0 1 +

More information

1 Q' 3. You are given a sequential circuit that has the following circuit to compute the next state:

1 Q' 3. You are given a sequential circuit that has the following circuit to compute the next state: UNIVERSITY OF CALIFORNIA Department of Electrical Engineering and Computer Sciences C50 Fall 2001 Prof. Subramanian Homework #3 Due: Friday, September 28, 2001 1. Show how to implement a T flip-flop starting

More information

A New High Speed Low Power Performance of 8- Bit Parallel Multiplier-Accumulator Using Modified Radix-2 Booth Encoded Algorithm

A New High Speed Low Power Performance of 8- Bit Parallel Multiplier-Accumulator Using Modified Radix-2 Booth Encoded Algorithm A New High Speed Low Power Performance of 8- Bit Parallel Multiplier-Accumulator Using Modified Radix-2 Booth Encoded Algorithm V.Sandeep Kumar Assistant Professor, Indur Institute Of Engineering & Technology,Siddipet

More information

Basic electronics Prof. T.S. Natarajan Department of Physics Indian Institute of Technology, Madras Lecture- 24

Basic electronics Prof. T.S. Natarajan Department of Physics Indian Institute of Technology, Madras Lecture- 24 Basic electronics Prof. T.S. Natarajan Department of Physics Indian Institute of Technology, Madras Lecture- 24 Mathematical operations (Summing Amplifier, The Averager, D/A Converter..) Hello everybody!

More information

Digital Hearing Aids Specific μdsp Chip Design by Verilog HDL

Digital Hearing Aids Specific μdsp Chip Design by Verilog HDL Digital Hearing Aids Specific μdsp Chip Design by Verilog HDL Soon-Suck Jarng*, Lingfen Chen *, You-Jung Kwon * * Department of Information Control & Instrumentation, Chosun University, Gwang-Ju, Korea

More information

Understanding Engineers #2

Understanding Engineers #2 Understanding Engineers #! The graduate with a Science degree asks, "Why does it work?"! The graduate with an Engineering degree asks, "How does it work?"! The graduate with an Accounting degree asks,

More information

High performance Radix-16 Booth Partial Product Generator for 64-bit Binary Multipliers

High performance Radix-16 Booth Partial Product Generator for 64-bit Binary Multipliers High performance Radix-16 Booth Partial Product Generator for 64-bit Binary Multipliers Dharmapuri Ranga Rajini 1 M.Ramana Reddy 2 rangarajini.d@gmail.com 1 ramanareddy055@gmail.com 2 1 PG Scholar, Dept

More information

Design of ALU and Cache Memory for an 8 bit ALU

Design of ALU and Cache Memory for an 8 bit ALU Clemson University TigerPrints All Theses Theses 12-2007 Design of ALU and Cache Memory for an 8 bit ALU Pravin chander Chandran Clemson University, pravinc@clemson.edu Follow this and additional works

More information

Ramon Canal NCD Master MIRI. NCD Master MIRI 1

Ramon Canal NCD Master MIRI. NCD Master MIRI 1 Wattch, Hotspot, Hotleakage, McPAT http://www.eecs.harvard.edu/~dbrooks/wattch-form.html http://lava.cs.virginia.edu/hotspot http://lava.cs.virginia.edu/hotleakage http://www.hpl.hp.com/research/mcpat/

More information