Parallel architectures Electronic Computers LM
- Dale Rice
- 6 years ago
2 Architecture
Architecture: the functional behaviour of a computer, for instance a processor which executes DLX code.
Implementation: a logic network implementing the architecture, also called the microarchitecture. There are many implementations of the same architecture (example: the x86 family).
Synthesis: a physical implementation. There are many possible syntheses of the same implementation (for instance in different technologies).
The architecture is defined by the machine language, that is, the instruction set (assembly language): Instruction Set Architecture -> ISA.
The ISA varies slowly while the implementations change rapidly (see for instance IA8, IA16, IA32). The longer an ISA survives, the more programs are written for it, and therefore compatibility becomes the main issue.
3 Instruction level parallelism
Sequential: a single instruction executed at a time.
Pipelined: multiple instructions executed simultaneously.
Superpipelined: multiple stages for each operation (EX, MEM etc.) in order to increase the clock frequency (e.g. Pentium IV).
Scalar: a single pipeline.
Superscalar: multiple pipelines; many instructions started at the same time. Possible out-of-order execution (run-time decision).
Very Long Instruction Word: multiple pipelines; many instructions started at the same time. Instruction order decided at compile time.
Superscalar superpipelined (e.g. Pentium IV, i5, i7 etc.).
4 architectures
Multicore (core level parallelism): many processors in the same chip (e.g. Core Duo, Nehalem, Sandy Bridge).
Multithread (thread level parallelism): the pipelines of the same processor are used by different processes at the same time (time sharing), as if it were a multicore (e.g. Pentium IV, Nehalem, Sandy Bridge etc.).
Memory level parallelism: a memory able to provide multiple data at different addresses at the same time (outstanding requests: DDR2, DDR3 etc.).
5 Deep Pipeline (Superpipeline)
[Figure: a five-stage pipeline (Fetch, Decode, Execute, Memory, Writeback) compared with a deep pipeline, with the branch penalty marked on each.]
Each stage is subdivided into three substages.
Higher clock frequency but higher branch penalty.
Higher power consumption!
6 Parallel pipelines
Sequential
Time parallelism: pipeline
Space parallelism: VLIW
Space-time parallelism (e.g. i5, i7)
7 Diversified pipelines - 1
Dedicated pipelines. The instruction sequence is defined at compile time. Careful compilation is fundamental in order to avoid underexploitation of the pipelines.
[Figure: the common stages IF, ID, RD feed dedicated EX pipelines (ALU; MEM1, MEM2; FP1, FP2, FP3, where F => Floating; BR), followed by WB.]
Problem of different execution times.
Problem of instruction interdependency.
A multi-instruction buffer avoids blocking the pipelines.
8 Diversified pipelines - 2
[Figure: IF, ID, RD (in-order execution); dispatch buffer; EX pipelines ALU, MEM1-MEM2, FP1-FP2-FP3, BR («out-of-order» execution); reorder buffer; WB (in-order retirement).]
9 Floating Point DLX
F instructions: multicycle stages.
[Figure: IF and ID feed four execution units, followed by MEM and WB: the integer EX unit; the pipelined FP multiplier (stages M1 to M7); the pipelined FP adder (stages A1 to A4); the FP/integer divider.]
FP/integer divide: e.g. 24 clock cycles, one instruction executed at a time.
10 DLX revisited
Very important structural change (more intermediate registers, a more complex ID stage to send each instruction to the appropriate execution stage).
Hazard problems: the instructions do not complete in the order of their issue.
Example (no interdependency between the instructions in this sequence):
FMUL F1, F2, F2    IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB
FADD F3, F4, F5       IF ID A1 A2 A3 A4 MEM WB
FLD  F6, 10(R8)          IF ID EX MEM WB
FST  40(R10), F9            IF ID EX MEM (WB) nop
Legend of the original figure: violet marks the stages where operands are needed, green the stages where results are produced, red squares the execution stages. Further annotations: data required for computing the address; data written.
Since the divider is normally a single functional unit, stalls of up to 40 clocks may occur in that case.
Multiple instructions can be in the same stage at the same time (in particular in WB).
Write After Write (WAW) hazards: e.g. if a FADD F6, F4, F5 (four EX cycles) directly preceded a FLD F6, 10(R8) (one EX cycle), the instructions would not complete in order and, having the same destination register, would cause a write-sequence error (although in this case the compiler would have dropped the FADD since it is useless).
Because of the different instruction execution times, Read After Write (RAW) hazards are more frequent than in the basic DLX.
11 DLX revisited
To cope with simultaneous writes to different registers, either the number of input ports of the RF is increased (expensive) or stalls are introduced (normally in the MEM or WB stage, so as to choose which instructions to stall). More complex pipelines.
RAW hazards are solved through forwarding.
For WAW hazards and multiple RF write operations consider the following example, where each «....» line stands for an unrelated single-EX instruction:
FMUL F0, F4, F6    IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB
....                  IF ID EX MEM WB
....                     IF ID EX MEM WB
FADD F2, F4, F6             IF ID A1 A2 A3 A4 MEM WB
....                           IF ID EX MEM WB
....                              IF ID EX MEM WB
FLD  F2, 0(R2)                       IF ID EX MEM WB
If FADD had been started one clock later, a Write After Write hazard would have taken place!
Hazards normally occur among homogeneous registers (FP or integer), except for FLOAD and FSTOR, which use integer registers for address computation.
Normally the hazards are detected in the ID stage, considering the preceding and following instructions, so as to introduce the required stalls (in this case FLD would have been stalled one clock).
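The write-back collisions of this example can be checked with a few lines of arithmetic. The following is only a minimal sketch, assuming no stalls and the stage counts drawn on the slide (seven M stages, four A stages, one EX stage for a load); the names `EX_STAGES`, `wb_cycle` and `analyse` are hypothetical and not part of the original material.

```python
# Number of execution stages per instruction class, as drawn on the slide
# (M1..M7 for FMUL, A1..A4 for FADD, a single EX stage for FLD). Assumed values.
EX_STAGES = {"FMUL": 7, "FADD": 4, "FLD": 1}

def wb_cycle(if_cycle, op):
    """Cycle of the WB stage if IF happens in if_cycle and nothing stalls."""
    # IF, then ID (+1), the EX stages, MEM (+1) and finally WB (+1)
    return if_cycle + 1 + EX_STAGES[op] + 1 + 1

def analyse(program):
    """program: list of (if_cycle, op, dest) tuples in issue order."""
    writes = [(wb_cycle(c, op), op, dest) for c, op, dest in program]
    # Several write-backs in the same cycle need several RF write ports (or stalls).
    per_cycle = {}
    for cycle, op, _ in writes:
        per_cycle.setdefault(cycle, []).append(op)
    port_conflicts = {c: ops for c, ops in per_cycle.items() if len(ops) > 1}
    # WAW: an earlier instruction writes the same register after a later one.
    waw = [(writes[i][1], writes[j][1], writes[i][2])
           for i in range(len(writes)) for j in range(i + 1, len(writes))
           if writes[i][2] == writes[j][2] and writes[i][0] > writes[j][0]]
    return port_conflicts, waw

# The sequence of the slide: FMUL fetched in cycle 1, FADD in cycle 4, FLD in cycle 7.
slide = [(1, "FMUL", "F0"), (4, "FADD", "F2"), (7, "FLD", "F2")]
# Same sequence with FADD started one clock later.
late_fadd = [(1, "FMUL", "F0"), (5, "FADD", "F2"), (7, "FLD", "F2")]

print(analyse(slide))
print(analyse(late_fadd))
```

Under these assumptions all three instructions of the slide reach WB in the same cycle, while delaying FADD by one clock makes it write F2 after the later FLD, which is the WAW case mentioned above.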
12 DLX revisited
In the previous case FLD F2, 0(R2) must be stalled until FADD F2, F4, F6 has reached the MEM stage.
It must however be assumed that between the two instructions there is at least one that uses, through forwarding, the result of FADD F2, F4, F6; otherwise the compiler would have dropped the instruction!
The situation would have been even worse if FLD had completed before FADD.
In any case it is always possible that different instructions complete in an order different from that of their issue.
How can we guarantee that the final result is the one intended by the program?
13 Compiler
Let's consider these high-level language statements
X = Y + Z
A = B * C
to be executed on a processor with the following pipeline: Fetch (F), Decode (D), Issue (I), Execute (E, three stages), Writeback (W).
In-order emission.
The issue of the addition (multiplication) is possible only AFTER the execution of the previous instruction which computes R2 (R5), that is, after its last EX stage, possibly with forwarding.
[Figure: pipeline timing diagram of the compiled sequence. Annotations: addition result not yet ready (RAW); decoder busy; the multiplication waits for its operands; the issue is possible here since the data for R1 and R2 have already been produced; stalls, data not available; D freed by the addition.]
14 Compiler
Pipeline: Fetch (F), Decode (D), Issue (I), Execute (E, three stages), Writeback (W).
But we can modify the emission order without modifying the result.
[Figure: instruction order before and after the rescheduling. Annotations: emission possible since R1 and R2 are already available; waiting for R5; decoder busy; waiting for R6.]
16 cycles instead of 22!
15 Multicycle hazards
I1  F1 = F2 + F3
I2  F2 = F4 x F5
I3  F3 = F3 + F4
I4  F6 = F6 x F6
I5  F1 = F3 + F5
I6  F2 = F3 + F4
[Figure: dependency graph among I1, I2, I3, I5 and I6, with edges labelled WAR (F2), WAR (F3), RAW (F3), WAW (F1), RAW (F3) and WAW (F2).]
NB: in this graph the hazards are potential, since only the registers are considered, no matter how many cycles the executions require.
Let's suppose we have an FP adder (1 cycle, in red) and a multiplier (3 cycles, in green).
[Figure: timing chart of I1 to I6 over the cycles T1 to T10.]
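The potential hazards of this sequence can be enumerated mechanically by comparing destination and source registers of every ordered pair of instructions. The sketch below is only an illustration under that register-only assumption (execution times are ignored, as the NB above states); `PROGRAM` and `potential_hazards` are hypothetical names. A plain pairwise comparison may list more pairs than the graph of the slide draws.

```python
# Instruction number -> (destination register, source registers), from the slide.
PROGRAM = {
    1: ("F1", ("F2", "F3")),
    2: ("F2", ("F4", "F5")),
    3: ("F3", ("F3", "F4")),
    4: ("F6", ("F6", "F6")),
    5: ("F1", ("F3", "F5")),
    6: ("F2", ("F3", "F4")),
}

def potential_hazards(program):
    """Return a set of (kind, register, earlier, later) for every ordered pair."""
    hazards = set()
    ids = sorted(program)
    for a in ids:
        for b in ids:
            if a >= b:
                continue  # only consider a earlier than b
            dst_a, src_a = program[a]
            dst_b, src_b = program[b]
            if dst_a in src_b:      # b reads what a writes
                hazards.add(("RAW", dst_a, a, b))
            if dst_b in src_a:      # b writes what a reads
                hazards.add(("WAR", dst_b, a, b))
            if dst_a == dst_b:      # both write the same register
                hazards.add(("WAW", dst_a, a, b))
    return hazards

for hazard in sorted(potential_hazards(PROGRAM), key=lambda h: (h[2], h[3])):
    print(hazard)
```

Whether any of these potential hazards actually occurs depends on the execution times of the adder and the multiplier, which this sketch deliberately leaves out.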
16 Dynamic instruction scheduling
Temporal dependencies (hazards) are not known at compile time.
Dynamic scheduling allows the code to be executed on different pipelines and on superscalar processors with no implications for the compiler.
It allows an instruction to be executed ahead of its position (in the following case FSUB F12, F8, F14) if the conditions allow it:
FDIV F0, F2, F4
FADD F10, F0, F8    (RAW: must wait for F0)
FSUB F12, F8, F14   (can be executed anyway)
Systems with out-of-order execution but commitment always in order.
17 Scoreboard
Write After Read (WAR). Consider the following sequence:
FDIV F0, F2, F4
FADD F10, F0, F8    (Read After Write, RAW, on F0)
FSUB F8, F8, F14
FADD and FSUB must read the same value of F8. There is an antidependency (WAR hazard) between FADD and FSUB: should FSUB complete before FADD has read F8, an error would occur (F8 already updated).
A Write After Write (WAW) hazard would be possible if FSUB used F10 instead of F8 as its destination (in case FSUB completed before FADD).
Scoreboard technique: it aims to complete one instruction per clock by executing each instruction as soon as possible.
18 Scoreboard
[Figure: the registers connected to the functional units (FP MUL, FP MUL, FP DIV, FP ADD, INTEG), all controlled by the scoreboard.]
The scoreboard is roughly equivalent to the ID stage (just after the fetch) and determines when an instruction can read its operands and start its execution.
The scoreboard considers all system state changes and decides when the first instruction in the FIFO queue (as produced by the compiler) can be started.
19 Scoreboard
The four stages, equivalent to ID, EX and WB in DLX, are:
1. Emission: if a functional unit for the instruction is available (free) and the required operands are available in the RF with updated values, the instruction is issued, unless another functional unit already holds an instruction which must write into the same destination register. Therefore no WAW hazards can occur. In this latter case the instruction is stalled, which blocks the emission of all the following instructions in the prefetch queue even when all other conditions for them are met!
2. Operand read: the instruction has been emitted. If an operand is available and no already executing instruction must write it, the operand is read; otherwise the instruction stalls in the functional unit.
3. Execution: when the result has been computed and stored, the scoreboard is informed so as to unblock a possibly waiting instruction.
4. Writeback: in case of a possible WAR the instruction is stalled and does not write its result if there is a previous instruction which has not yet read its operands and one of them is the destination register of the considered instruction. Once the operand has been read, the result can be written.
Note that with this organisation forwarding is avoided, since the results are written as soon as they are produced (except for the WAR wait of point 4).
Obviously some stalls can be induced because the number of buses available for transfers is small.
The scoreboard technique allows instructions to be transferred directly from the EX to the WB stage (reducing the RAW risks).
20 An example
LD   F6, 34(R2)     (integer unit)
LD   F2, 45(R3)
FMUL F0, F2, F4     (MULTD)
FSUB F8, F6, F2     (SUBD)
FDIV F10, F0, F6    (DIVD)
FADD F6, F8, F2     (ADDD)
The original figure marks two RAW hazards and one WAR hazard (on the FADD) with arrows.
Can you find more hypothetical hazards? For instance, what about F0?
Hypothetical timing for the different instructions (including operand read and execution):
FLD 1 cycle
FADD, FSUB 2 cycles
FMUL 10 cycles
FDIV 40 cycles
21 Scoreboard entities
Instruction stages: emission, operand read, execution and writeback.
Status of the functional units (FU), 9 parameters:
Busy: unit busy
Op: operation code presently executed
Fi: destination (result) register of the instruction
Fj, Fk: source registers of the operands
Qj, Qk: functional units producing the required operands (if not yet ready) for the registers Fj and Fk
Rj, Rk: flags (yes) indicating whether Fj, Fk have already been updated
Result status register: indicates which functional unit will write each register. Blank when no functional unit is going to write the specific register.
N.B. Remember that in case of a possible WAW the emission of instructions is stalled (point 1 of the rules).
N.B. In the following example we suppose that two multiplication/division units are available.
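The status fields listed above are enough to express the scoreboard checks used in the worked example that follows. The code below is only a sketch under assumptions: the `Unit` class and the three helper functions are hypothetical names, the issue check uses the two conditions applied in the example (free unit, no pending write to the same destination), and the sample state is the one shown in the cycle 17 slide.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Unit:
    """Status of one functional unit (fields named as on the slide)."""
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source register 1
    fk: Optional[str] = None   # source register 2
    qj: Optional[str] = None   # unit producing fj, if not ready
    qk: Optional[str] = None   # unit producing fk, if not ready
    rj: bool = False           # fj ready and not yet read
    rk: bool = False           # fk ready and not yet read

def can_issue(unit_name, dest, units, result_status):
    # Structural check (unit free) and WAW check (nobody else will write dest).
    return not units[unit_name].busy and dest not in result_status

def can_read_operands(unit):
    # Both operands must be available before they are read together.
    return (unit.fj is None or unit.rj) and (unit.fk is None or unit.rk)

def can_write(unit_name, units):
    # WAR check: no other busy unit still has to read our destination register.
    me = units[unit_name]
    for name, other in units.items():
        if name == unit_name or not other.busy:
            continue
        if (other.fj == me.fi and other.rj) or (other.fk == me.fi and other.rk):
            return False
    return True

# State of the cycle 17 slide: ADDD has finished executing, DIVD still waits for F0.
units = {
    "Integer": Unit(),
    "Mult1": Unit(True, "Mult", "F0", "F2", "F4"),
    "Mult2": Unit(),
    "Add": Unit(True, "Add", "F6", "F8", "F2"),
    "Divide": Unit(True, "Div", "F10", "F0", "F6", qj="Mult1", rj=False, rk=True),
}
result_status = {"F0": "Mult1", "F6": "Add", "F10": "Divide"}

print(can_write("Add", units))               # ADDD must not overwrite F6 yet
print(can_read_operands(units["Divide"]))    # DIVD still waits for F0
print(can_issue("Mult2", "F6", units, result_status))
```

In this state the write of ADDD is held back because DIVD has not read F6 yet; once DIVD has read its operands (Rk cleared), the same check lets ADDD write.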
22 Example
(Here we assume that F0 is a normal register and not always 0.)
NB, mnemonics: LD = FLD, MULTD = FMUL, SUBD = FSUB, DIVD = FDIV, ADDD = FADD. FU = functional unit.
Instruction status (the states of the instructions as the clock progresses)
Instruction         Issue  Read op.  Exec. complete  Write result
LD    F6, 34(R2)
LD    F2, 45(R3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2
Functional unit status (1 integer unit, 2 multiplier units, 1 add/sub unit, 1 division unit)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer
-     Mult1
-     Mult2
-     Add
-     Divide
Column key: Fi = destination, Fj = source 1, Fk = source 2, Qj = FU for j, Qk = FU for k, Rj = Fj ready?, Rk = Fk ready?
Time is the number of execution cycles yet to elapse (LD 1 cycle; FADD, FSUB 2 cycles; FMUL 10 cycles; FDIV 40 cycles).
Rj and Rk indicate whether the data can be read (possibly in the next cycle, if just produced) from the operand source registers of the executed instruction. Qj and Qk are the functional units which produce them (if not yet ready). Fj and Fk are the registers where the data produced by Qj and Qk are stored (or will be stored; the data are available in the next cycle if the corresponding R flag is yes) to be used in the executed instruction.
Register result status
Clock  F0  F2  F4  F6  F8  F10  F12  ...  F31
FU
For each floating point result register Fx, the row gives the functional unit producing its result (Qj, Qk).
23 Cycle 1
Instruction status
Instruction         Issue  Read op.  Exec. complete  Write result
LD    F6, 34(R2)    1
LD    F2, 45(R3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2
LD F6, 34(R2) is issued. (In the original slides a brown colour marks the state changes.)
Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F6   -    R2   -        -        -    Yes
-     Mult1    No
-     Mult2    No
-     Add      No
-     Divide   No
Register result status (clock 1): F6 = Integer (the functional unit used for producing the result in F6).
At clock 1 R2 is assumed to be already available, so it can be read in the next clock. LD uses the integer unit.
24 Cycle 2
Instruction status
Instruction         Issue  Read op.  Exec. complete  Write result
LD    F6, 34(R2)    1      2
LD    F2, 45(R3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2
The data are ready in R2: the instruction can proceed.
NB: the second LD cannot be emitted because the only integer unit is busy, and the same applies to MULTD because instructions must be emitted in order.
Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F6   -    R2
-     Mult1    No
-     Mult2    No
-     Add      No
-     Divide   No
Register result status (clock 2): F6 = Integer.
25 Cycle 3
Instruction status
Instruction         Issue  Read op.  Exec. complete  Write result
LD    F6, 34(R2)    1      2         3
LD    F2, 45(R3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2
Execution times: FLD 1 cycle; FADD, FSUB 2 cycles; FMUL 10 cycles; FDIV 40 cycles.
Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F6   -    R2
-     Mult1    No
-     Mult2    No
-     Add      No
-     Divide   No
Register result status (clock 3): F6 = Integer.
26 Cycle 4
Instruction status
Instruction         Issue  Read op.  Exec. complete  Write result
LD    F6, 34(R2)    1      2         3               4
LD    F2, 45(R3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2
Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F6   -    R2
-     Mult1    No
-     Mult2    No
-     Add      No
-     Divide   No
Register result status (clock 4): F6 = Integer.
The register has been written at the end of the period, and the functional unit is freed at the end of the period.
The status of the FUs changes at the positive clock edge which ends the current cycle (future status). For instance, the integer functional unit is freed at the end of cycle 4 together with the result writeback. LD F6, 34(R2) disappears completely from the scoreboard at the positive clock edge ending cycle 4.
27 Cycle 5
Instruction status
Instruction         Issue  Read op.  Exec. complete  Write result
LD    F6, 34(R2)    1      2         3               4
LD    F2, 45(R3)    5
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2
Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F2   -    R3   -        -        -    Yes
-     Mult1    No
-     Mult2    No
-     Add      No
-     Divide   No
Register result status (clock 5): F2 = Integer (the integer functional unit must produce a new value for F2).
At the beginning of cycle 5 the integer unit is already free, so LD F2, 45(R3) can start. R3 is assumed to be already available, as in the previous case.
28 Cycle 6
Instruction status
Instruction         Issue  Read op.  Exec. complete  Write result
LD    F6, 34(R2)    1      2         3               4
LD    F2, 45(R3)    5      6
MULTD F0, F2, F4    6
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2
MULTD waits for F2 from the integer unit!
Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F2   -    R3
-     Mult1    Yes   Mult  F0   F2   F4   Integer  -        No   Yes
-     Mult2    No
-     Add      No
-     Divide   No
Register result status (clock 6): F0 = Mult1, F2 = Integer.
MULTD F0, F2, F4 can start because its FU is free and no active instruction has F0 as its destination register. F4 is assumed to be already present.
29 Cycle 7
Instruction status
Instruction         Issue  Read op.  Exec. complete  Write result
LD    F6, 34(R2)    1      2         3               4
LD    F2, 45(R3)    5      6         7
MULTD F0, F2, F4    6
SUBD  F8, F6, F2    7
DIVD  F10, F0, F6
ADDD  F6, F8, F2
Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F2   -    R3
-     Mult1    Yes   Mult  F0   F2   F4   Integer  -        No   Yes
-     Mult2    No
-     Add      Yes   Sub   F8   F6   F2   -        Integer  Yes  No
-     Divide   No
Register result status (clock 7): F0 = Mult1, F2 = Integer, F8 = Add.
MULTD is stalled in the execution unit because F2 is not yet ready. The same holds for SUBD.
SUBD F8, F6, F2 can start because the FP add/subtract unit is free. (NB: the FP adder executes FP subtractions too.)
30 Cycle 8
Instruction status
Instruction         Issue  Read op.  Exec. complete  Write result
LD    F6, 34(R2)    1      2         3               4
LD    F2, 45(R3)    5      6         7               8
MULTD F0, F2, F4    6
SUBD  F8, F6, F2    7
DIVD  F10, F0, F6   8
ADDD  F6, F8, F2
DIVD F10, F0, F6 can start because the FP divide FU is free. F2 is available; F0 is not yet available.
Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F2   -    R3
-     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
-     Mult2    No
-     Add      Yes   Sub   F8   F6   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status (clock 8, updated at the end of the cycle): F0 = Mult1, F8 = Add, F10 = Divide.
F2 is written and therefore the integer unit becomes free. With F2 written, MULTD and SUBD can read their operands during the next cycle.
31 Cycle 9-10
Instruction status
Instruction         Issue  Read op.  Exec. complete  Write result
LD    F6, 34(R2)    1      2         3               4
LD    F2, 45(R3)    5      6         7               8
MULTD F0, F2, F4    6      9
SUBD  F8, F6, F2    7      9
DIVD  F10, F0, F6   8
ADDD  F6, F8, F2
N.B.: MULTD and SUBD can read their operands because F2 is available (see cycle 8). DIVD is still stalled because of F0. ADDD cannot start because SUBD uses the adder FU.
Functional unit status (Time = execution clocks)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
10    Mult1    Yes   Mult  F0   F2   F4
-     Mult2    No
2     Add      Yes   Sub   F8   F6   F2
40    Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status (clock 9-10): F0 = Mult1, F8 = Add, F10 = Divide.
32 Cycle 11
Instruction status
Instruction         Issue  Read op.  Exec. complete  Write result
LD    F6, 34(R2)    1      2         3               4
LD    F2, 45(R3)    5      6         7               8
MULTD F0, F2, F4    6      9
SUBD  F8, F6, F2    7      9         11
DIVD  F10, F0, F6   8
ADDD  F6, F8, F2
Note: the Add FU requires 2 cycles for SUBD, therefore nothing happens in cycle 10 while MULTD is still processing its data.
NB: ADDD will use the result of SUBD but has not yet started because SUBD occupies the FU.
Functional unit status (Time = clocks still to elapse)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
8     Mult1    Yes   Mult  F0   F2   F4
-     Mult2    No
0     Add      Yes   Sub   F8   F6   F2
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status (clock 11): F0 = Mult1, F8 = Add, F10 = Divide.
33 Cycle 12
Instruction status
Instruction         Issue  Read op.  Exec. complete  Write result
LD    F6, 34(R2)    1      2         3               4
LD    F2, 45(R3)    5      6         7               8
MULTD F0, F2, F4    6      9
SUBD  F8, F6, F2    7      9         11              12
DIVD  F10, F0, F6   8
ADDD  F6, F8, F2
SUBD ends, freeing the FU. In the next period ADDD can start.
Functional unit status (Time = clocks still to elapse)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
7     Mult1    Yes   Mult  F0   F2   F4
-     Mult2    No
-     Add      No
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status (clock 12): F0 = Mult1, F10 = Divide.
F8 is written and the ADD/SUB FU is freed.
34 Cycle 13
Instruction status
Instruction         Issue  Read op.  Exec. complete  Write result
LD    F6, 34(R2)    1      2         3               4
LD    F2, 45(R3)    5      6         7               8
MULTD F0, F2, F4    6      9
SUBD  F8, F6, F2    7      9         11              12
DIVD  F10, F0, F6   8
ADDD  F6, F8, F2    13
Now ADDD can start because SUBD has finished its execution and has freed the FU.
Functional unit status (Time = clocks still to elapse)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
6     Mult1    Yes   Mult  F0   F2   F4
-     Mult2    No
-     Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status (clock 13): F0 = Mult1, F6 = Add, F10 = Divide.
35 Cycle 14
Instruction status
Instruction         Issue  Read op.  Exec. complete  Write result
LD    F6, 34(R2)    1      2         3               4
LD    F2, 45(R3)    5      6         7               8
MULTD F0, F2, F4    6      9
SUBD  F8, F6, F2    7      9         11              12
DIVD  F10, F0, F6   8
ADDD  F6, F8, F2    13     14
Functional unit status (Time = clocks still to elapse)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
5     Mult1    Yes   Mult  F0   F2   F4
-     Mult2    No
2     Add      Yes   Add   F6   F8   F2
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status (clock 14): F0 = Mult1, F6 = Add, F10 = Divide.
36 Cycle 15
Instruction status
Instruction         Issue  Read op.  Exec. complete  Write result
LD    F6, 34(R2)    1      2         3               4
LD    F2, 45(R3)    5      6         7               8
MULTD F0, F2, F4    6      9
SUBD  F8, F6, F2    7      9         11              12
DIVD  F10, F0, F6   8
ADDD  F6, F8, F2    13     14
ADDD requires two cycles, therefore the system status does not change.
Functional unit status (Time = clocks still to elapse)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
4     Mult1    Yes   Mult  F0   F2   F4
-     Mult2    No
1     Add      Yes   Add   F6   F8   F2
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status (clock 15): F0 = Mult1, F6 = Add, F10 = Divide.
37 Cycle 16
Instruction status
Instruction         Issue  Read op.  Exec. complete  Write result
LD    F6, 34(R2)    1      2         3               4
LD    F2, 45(R3)    5      6         7               8
MULTD F0, F2, F4    6      9
SUBD  F8, F6, F2    7      9         11              12
DIVD  F10, F0, F6   8
ADDD  F6, F8, F2    13     14        16
ADDD has ended its EX stage while MULTD and DIVD keep executing.
Functional unit status (Time = clocks still to elapse)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
3     Mult1    Yes   Mult  F0   F2   F4
-     Mult2    No
-     Add      Yes   Add   F6   F8   F2
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status (clock 16): F0 = Mult1, F6 = Add, F10 = Divide.
38 Cycle 17
Instruction status
Instruction         Issue  Read op.  Exec. complete  Write result
LD    F6, 34(R2)    1      2         3               4
LD    F2, 45(R3)    5      6         7               8
MULTD F0, F2, F4    6      9
SUBD  F8, F6, F2    7      9         11              12
DIVD  F10, F0, F6   8
ADDD  F6, F8, F2    13     14        16
NB! ADDD is stalled (it cannot write) because of a WAR with DIVD on F6. DIVD does not read F6 because it waits for F0, produced by MULTD (operands are read in parallel). MULTD and DIVD keep executing.
Functional unit status (Time = clocks still to elapse)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
2     Mult1    Yes   Mult  F0   F2   F4
-     Mult2    No
-     Add      Yes   Add   F6   F8   F2   (stalled because of the WAR on F6)
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status (clock 17): F0 = Mult1, F6 = Add, F10 = Divide.
39 Cycle 18
Instruction status
Instruction         Issue  Read op.  Exec. complete  Write result
LD    F6, 34(R2)    1      2         3               4
LD    F2, 45(R3)    5      6         7               8
MULTD F0, F2, F4    6      9
SUBD  F8, F6, F2    7      9         11              12
DIVD  F10, F0, F6   8
ADDD  F6, F8, F2    13     14        16
MULTD is still executing; DIVD is still stalled.
Functional unit status (Time = clocks still to elapse)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
1     Mult1    Yes   Mult  F0   F2   F4
-     Mult2    No
-     Add      Yes   Add   F6   F8   F2
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status (clock 18): F0 = Mult1, F6 = Add, F10 = Divide.
40 Cycle 19
Instruction status
Instruction         Issue  Read op.  Exec. complete  Write result
LD    F6, 34(R2)    1      2         3               4
LD    F2, 45(R3)    5      6         7               8
MULTD F0, F2, F4    6      9         19
SUBD  F8, F6, F2    7      9         11              12
DIVD  F10, F0, F6   8
ADDD  F6, F8, F2    13     14        16
MULTD ends its execution (after 10 cycles) and will write in cycle 20, which will unblock DIVD and then ADDD.
Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
-     Mult1    Yes   Mult  F0   F2   F4
-     Mult2    No
-     Add      Yes   Add   F6   F8   F2
-     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status (clock 19): F0 = Mult1, F6 = Add, F10 = Divide.
41 Cycle 20
Instruction status
Instruction         Issue  Read op.  Exec. complete  Write result
LD    F6, 34(R2)    1      2         3               4
LD    F2, 45(R3)    5      6         7               8
MULTD F0, F2, F4    6      9         19              20
SUBD  F8, F6, F2    7      9         11              12
DIVD  F10, F0, F6   8
ADDD  F6, F8, F2    13     14        16
MULTD writes F0, unblocking DIVD.
Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
-     Mult1    No
-     Mult2    No
-     Add      Yes   Add   F6   F8   F2
-     Divide   Yes   Div   F10  F0   F6   -        -        Yes  Yes
Register result status (clock 20): F6 = Add, F10 = Divide.
42 Cycle 21
Instruction status
Instruction         Issue  Read op.  Exec. complete  Write result
LD    F6, 34(R2)    1      2         3               4
LD    F2, 45(R3)    5      6         7               8
MULTD F0, F2, F4    6      9         19              20
SUBD  F8, F6, F2    7      9         11              12
DIVD  F10, F0, F6   8      21
ADDD  F6, F8, F2    13     14        16
Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
-     Mult1    No
-     Mult2    No
-     Add      Yes   Add   F6   F8   F2
-     Divide   Yes   Div   F10  F0   F6
Register result status (clock 21): F6 = Add, F10 = Divide.
DIVD reads both F0 and F6 (which could not be written by ADDD because of the WAR), unblocking ADDD, which can write F6 in the next cycle.
43 Cycle 22
Instruction status
Instruction         Issue  Read op.  Exec. complete  Write result
LD    F6, 34(R2)    1      2         3               4
LD    F2, 45(R3)    5      6         7               8
MULTD F0, F2, F4    6      9         19              20
SUBD  F8, F6, F2    7      9         11              12
DIVD  F10, F0, F6   8      21
ADDD  F6, F8, F2    13     14        16              22
Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
-     Mult1    No
-     Mult2    No
-     Add      No
-     Divide   Yes   Div   F10  F0   F6
Register result status (clock 22): F10 = Divide.
Now ADDD can write F6, since the WAR hazard with DIVD has disappeared. For 6 cycles ADDD could not write F6 although its result was available.
44 Cycle 61
Instruction status
Instruction         Issue  Read op.  Exec. complete  Write result
LD    F6, 34(R2)    1      2         3               4
LD    F2, 45(R3)    5      6         7               8
MULTD F0, F2, F4    6      9         19              20
SUBD  F8, F6, F2    7      9         11              12
DIVD  F10, F0, F6   8      21        61
ADDD  F6, F8, F2    13     14        16              22
Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
-     Mult1    No
-     Mult2    No
-     Add      No
-     Divide   Yes   Div   F10  F0   F6
Register result status (clock 61): F10 = Divide.
The execution of DIVD ends after 40 cycles.
45 Cycle 62
Instruction status
Instruction         Issue  Read op.  Exec. complete  Write result
LD    F6, 34(R2)    1      2         3               4
LD    F2, 45(R3)    5      6         7               8
MULTD F0, F2, F4    6      9         19              20
SUBD  F8, F6, F2    7      9         11              12
DIVD  F10, F0, F6   8      21        61              62
ADDD  F6, F8, F2    13     14        16              22
All executions have ended.
Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
-     Integer  No
-     Mult1    No
-     Mult2    No
-     Add      No
-     Divide   No
Register result status (clock 62): all registers blank.
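The cycle numbers of the walkthrough can be cross-checked against the assumed execution times. The sketch below is only a sanity check, assuming the cycles stated in the slide notes (for instance DIVD reading its operands in cycle 21 and ending in cycle 61) and the rule that execution completes a number of cycles after the operand read equal to the instruction latency; `LATENCY`, `TIMELINE` and `check` are hypothetical names.

```python
# Execution times assumed in the example (cycles).
LATENCY = {"LD": 1, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}

# (instruction, issue, read operands, execution complete, write result),
# in program order, as read off the walkthrough.
TIMELINE = [
    ("LD", 1, 2, 3, 4),
    ("LD", 5, 6, 7, 8),
    ("MULTD", 6, 9, 19, 20),
    ("SUBD", 7, 9, 11, 12),
    ("DIVD", 8, 21, 61, 62),
    ("ADDD", 13, 14, 16, 22),
]

def check(timeline):
    """Verify the basic invariants and return the cycle of the last write."""
    for op, issue, read, done, write in timeline:
        # The four stages of one instruction happen in this order.
        assert issue < read <= done < write
        # Execution ends LATENCY cycles after the operands were read.
        assert done == read + LATENCY[op]
    # Instructions are emitted in program order.
    issues = [row[1] for row in timeline]
    assert issues == sorted(issues)
    return max(row[4] for row in timeline)

print(check(TIMELINE))
```

The only entry whose write does not directly follow the end of execution is ADDD, which is the WAR stall on F6 discussed in cycles 17 to 22.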
46 Scoreboard limits
Register values must in any case be read in parallel, and only from the register file (which means that they must already have been stored in the registers: no RAW problem).
An instruction can be emitted only if all previous instructions have been emitted.
FDIV  F0, F2, F4
FADD  F6, F0, F8
FSTOR F6, 0(R1)
FSUB  F8, F10, F14
FMUL  F6, F10, F8
[Figure: arrows between the instructions mark WAR, RAW and WAW hazards.]
N.B. The hazards of the sequence are only potential: their occurrence depends on the execution times of the instructions.
47 Renaming: Tomasulo Algorithm
«Renaming» indicates a location different from the RF where a requested datum is produced or stored and can be obtained. The name «renaming» is used because it is as if the source registers of an instruction were renamed.
Tomasulo algorithm: renaming is based on the concept of reservation stations, which are functional unit buffers where instructions can be «parked» while waiting for the requested FU and the needed data.
A reservation station is a slot of an FU where an instruction emitted from the instruction queue waits until the FU is free and the needed data arrive, as soon as they are produced (N.B. before being written into the RF). For each operand EITHER the source register data OR the reservation station producing it is indicated (whence renaming). The renaming occurs at run time.
A reservation station captures a required operand exactly when and where it is produced, without waiting until it is written and so avoiding the register file access. This is similar to forwarding.
When multiple writes to the same register occur (WAW, possible only if multiple buses between FUs and RF are available), only the most recently produced data are written (for each register a TAG indicates the FU which has the right to write).
Benefits:
Hazard detection and execution control are distributed (not centralised as in the scoreboard): only the information stored in the reservation stations of each functional unit determines whether an instruction can execute in the FU, since the source (where the datum is being produced) and NOT the RF is indicated in every case. RAW hazards are no longer possible since the requested data are provided as soon as they are produced. The same holds for WAR.
Results are transferred directly over the common data buses to the reservation stations of the waiting FUs, without the need to read the RF (multiple reservation stations, in addition to the RF registers, can be accessed at the same time when multiple buses are available).
48 Tomasulo Algorithm
Tomasulo eliminates not only WAWs but also WARs.
Original sequence (possible WAW and possible WAR on F6):
FLD  F6, 32(R2)
FLD  F2, 44(R3)
FMUL F0, F2, F4
FSUB F8, F2, F6
FDIV F10, F0, F6
FADD F6, F8, F2
After renaming (T is the functional unit producing the datum):
FLD  [T/F6], 32(R2)
FLD  F2, 44(R3)
FMUL F0, F2, F4
FSUB F8, F2, [T]
FDIV F10, F0, [T]
FADD F6, F8, F2
When an instruction is inserted into an RS, it is checked whether one or more of its operands are being produced elsewhere by other RSs: if so, they are renamed.
For the FADD a potential WAR with the FDIV could occur if FADD ended before FDIV had read its operands (F8 of FSUB and F2 of FLD being both immediately available to FADD), but since for F6 the FDIV points to the RS of FLD F6, 32(R2) and not to the RF, the problem does not occur. The same holds for FSUB.
As far as the WAW between FLD and FADD on F6 is concerned, the mechanism guarantees that only the most recent instruction in the RSs which uses a destination register can write that register.
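The renaming step can be sketched as a table lookup performed at emission. The code below is a simplified illustration, assuming that no instruction completes while the six instructions are emitted and using made-up reservation station names (`Load1`, `Mult1` and so on) in place of the tag T of the slide; `issue_all` is a hypothetical helper.

```python
def issue_all(program):
    """Rename the sources of each instruction at emission time.

    program: list of (rs_tag, op, dest, sources) in program order.
    Returns the renamed instructions and the final register status table.
    """
    reg_status = {}   # register -> tag of the RS that will produce its value
    renamed = []
    for tag, op, dest, sources in program:
        # A pending source is replaced by the tag of its producer,
        # otherwise the register name stands for the value read from the RF.
        operands = tuple(reg_status.get(src, src) for src in sources)
        renamed.append((tag, op, dest, operands))
        # From now on this RS is the most recent writer of dest.
        reg_status[dest] = tag
    return renamed, reg_status

PROGRAM = [
    ("Load1", "FLD", "F6", ("R2",)),
    ("Load2", "FLD", "F2", ("R3",)),
    ("Mult1", "FMUL", "F0", ("F2", "F4")),
    ("Add1", "FSUB", "F8", ("F2", "F6")),
    ("Mult2", "FDIV", "F10", ("F0", "F6")),
    ("Add2", "FADD", "F6", ("F8", "F2")),
]

renamed, reg_status = issue_all(PROGRAM)
for row in renamed:
    print(row)
print(reg_status["F6"])
```

FDIV ends up pointing at the load that produces F6 rather than at the register itself, so the later FADD may overwrite F6 without disturbing it, and the register status table names the FADD station as the only one allowed to write F6.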
49 Tomasulo Algorithm
Very high performance without special compilers.
Differences from the scoreboard:
Buffers and controls are distributed directly in the FUs (there is no centralised control): the buffers are called reservation stations.
Source register names are replaced by pointers to the buffers of the reservation stations (if the requested data are being produced there).
Renaming: a direct pointer to the sources and not to the register.
One or more Common Data Buses send the results to all the FUs requiring them.
Loads and stores are considered as FUs (a STORE can also be a source for an RS executing a LOAD).
50 Tomasulo Algorithm
In this example it is assumed that the MUL unit executes the DIVs too and that the ADD unit executes the SUBs too. LOADs and STOREs are handled like other instructions.
In this example: 3 RS for add/sub, 2 RS for mult/div, 5 RS for store, 5 RS for load.
In this example there is only one data bus for the data produced by the FUs. Please notice that the same Common Data Bus is also used by the RSs waiting for data.
Each RS (more than one for each FU) stores an emitted instruction and, for each operand, one of two elements: either the operand value (i.e. read from the RF) or the name of the RS which is producing it (renaming).
51 Tomasulo Algorithm
Load buffers are used to store the load addresses.
Store buffers contain the computed addresses and the data to be written to memory.
Loads and stores must be executed in sequence if they refer to the same addresses. In the other cases it is possible to anticipate the LOADs (never the STOREs).
In the figure there are 3 phases (each of which can last several clocks):
Emission: the instructions are extracted in order from the general instruction queue when there is a free RS for the requested FU (the only condition); otherwise the instruction queue stalls. Operands are extracted from the RF or from the producing FU, as indicated. In case of WAW it must be determined which instruction must provide the data.
Execution: if one or more operands are not yet available, the CDB(s) must be monitored (data must be transferred over a bus anyway) in order to catch them (and their sources) as soon as they are available: RAW hazards are therefore avoided (we are sure not to read stale data from the RF).
Writeback: as soon as a datum is produced, it is transferred over one CDB (when more than one is available) to the RF and to the RSs waiting for it.
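The writeback phase can be pictured as a broadcast on the common data bus. The following is a minimal sketch with hypothetical names (`RS`, `broadcast`) and made-up operand values; it assumes a single CDB, a register status table that maps each pending register to the tag of its most recent producer, and a state loosely based on the FDIV and FADD of the previous example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RS:
    """One reservation station entry (fields named as in the slides)."""
    op: str
    vj: Optional[float] = None   # value of operand j, once known
    vk: Optional[float] = None   # value of operand k, once known
    qj: Optional[str] = None     # tag of the RS producing operand j
    qk: Optional[str] = None     # tag of the RS producing operand k

    def ready(self):
        # Execution can start when no operand is still being produced.
        return self.qj is None and self.qk is None

def broadcast(tag, value, stations, reg_file, reg_status):
    """Put (tag, value) on the common data bus."""
    # Every waiting station snoops the bus and captures the value it needs.
    for rs in stations.values():
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    # The RF is written only where tag is still the most recent producer.
    for reg, producer in list(reg_status.items()):
        if producer == tag:
            reg_file[reg] = value
            del reg_status[reg]

stations = {
    "Mult2": RS("DIV", qj="Mult1", qk="Load1"),   # FDIV F10, F0, F6
    "Add2": RS("ADD", qj="Add1", vk=2.0),         # FADD F6, F8, F2
}
reg_file = {"F0": 0.0, "F6": 0.0}
reg_status = {"F0": "Mult1", "F6": "Add2"}

broadcast("Load1", 1.5, stations, reg_file, reg_status)
broadcast("Mult1", 3.0, stations, reg_file, reg_status)
print(stations["Mult2"], reg_file, reg_status)
```

The load result reaches the waiting FDIV but is not written to F6, because the register status table already names the later FADD as the writer of F6; this is how the tag resolves the WAW case.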
52 Tomasulo Algorithm Let's see the scoreboard example in a Tomasulo architecture. Let's suppose that the execution times are the same as in the scoreboard example (FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles). NB: +1 for the writeback.
LD F6, 34(R2)
LD F2, 45(R3)
FMUL F0, F2, F4
FSUB F8, F6, F2
FDIV F10, F0, F6
FADD F6, F8, F2
52
53 Reservation Station. Op: opcode of the instruction to be executed. Vj, Vk: the source operands (read from the RF or captured from the FUs producing them). Qj, Qk: the functional units producing the source operands; a blank indicates that the operand is already in Vj or Vk, or that it is not required. Busy: the RS (and its FU) is in use. Register File Status: indicates which FU will write each register (if any); a blank means that no instruction has to write the register and therefore its value can be used directly. N.B. One instruction per clock is emitted from the general instruction queue when an RS of the FU required by that instruction is available; otherwise the queue stalls. In our example we assume only one CDB. 53
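As a rough illustration of the fields listed above, the following Python sketch models one reservation station and the renaming done at emission. The class name, the `issue` helper and the numeric register values are assumptions made for this example; Vj/Vk are modelled as operand values, following the description of the RS contents given earlier.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None  # operand value, when already available
    vk: Optional[float] = None
    qj: Optional[str] = None    # name of the producing RS, when pending
    qk: Optional[str] = None


def issue(rs: ReservationStation, op: str, dest: str, src_j: str, src_k: str,
          regs: Dict[str, float], reg_status: Dict[str, str]) -> None:
    """Emit one instruction into a free RS (hypothetical helper)."""
    rs.busy, rs.op = True, op
    rs.vj = rs.vk = rs.qj = rs.qk = None
    # For each source: take the value from the RF, or the producer's name.
    if src_j in reg_status:
        rs.qj = reg_status[src_j]
    else:
        rs.vj = regs[src_j]
    if src_k in reg_status:
        rs.qk = reg_status[src_k]
    else:
        rs.vk = regs[src_k]
    # The destination register is now renamed to this RS.
    reg_status[dest] = rs.name


# Cycle 3 of the example: MULTD F0, F2, F4 while both loads are pending.
regs = {"F0": 0.0, "F2": 0.0, "F4": 4.0, "F6": 0.0}
reg_status = {"F2": "Load2", "F6": "Load1"}
mult1 = ReservationStation("Mult1")
issue(mult1, "MULT", "F0", "F2", "F4", regs, reg_status)
print(mult1, reg_status)
```

The station ends up with Qj = Load2 and Vk holding the value of F4, and F0 is marked as produced by Mult1, which mirrors the Cycle 3 table.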
54 Cycle 0. Instruction status (columns: Instruction, j, k, Issue, Execution, Write): LD F6, 34(R2); LD F2, 45(R3); MULTD F0, F2, F4; SUBD F8, F6, F2; DIVD F10, F0, F6; ADDD F6, F8, F2; no instruction has been issued yet. Reservation stations (columns: Time, Name, Busy, Op, Vj, Vk, Qj, Qk): Add1, Add2, Add3, Mult1 and Mult2 are all free. Vj, Vk (S1, S2): operand registers; if blank, the datum is being produced by the FU named in the corresponding Q field. Qj, Qk (RS for j, RS for k): producing FU; if blank, the datum is in the RF. Load/store are not indicated in the reservation station table. NB: for LD (ST is not used here) there is a limited number of RS; their Busy status is displayed differently from that of the FUs (see next slide). For the sake of simplicity Rj and Rk (ready/not ready) are not displayed, since their values are implicit in the status of Qj and Qk. Register result status (columns: F0, F2, F4, F6, F8, F10, F12, ..., F31): the FU producing the new value of each register; empty at clock 0. Timings: FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles (NB: +1 for the writeback). 54
55 Cycle 1. LD F6, 34(R2) is issued at cycle 1. Load buffers (5 RS for the LOADs): Load1 busy, address 34+R2. NB: here it is assumed that R2 and R3 are already available. Reservation stations (3 RS for add/sub, 2 RS for mul/div): Add1, Add2, Add3, Mult1 and Mult2 are free. Register result status: F6: Load1. 55
56 Cycle 2. The second LD (LD F2, 45(R3)) is issued at cycle 2: one instruction per clock is emitted (when possible). N.B. Emitting a second LOAD was not possible with the scoreboard; here it is parked in an RS. The value of R3 is already available in the RF. Load buffers (5 RS for the LOADs): Load1 busy, address 34+R2; Load2 busy, address 45+R3. NB: a load takes 2 cycles, the first one for computing the address and the second for reading the datum. Reservation stations: Add1, Add2, Add3, Mult1 and Mult2 are free. Register result status: F2: Load2; F6: Load1. 56
57 Cycle 3. MULTD is issued at cycle 3 (an RS is free). MULTD can be emitted although F2 is NOT yet available: F2 is renamed to Load2. F4 is supposed to be already in the RF. Load buffers: Load1 busy, address 34+R2; Load2 busy, address 45+R3 (an LD takes two cycles). Reservation stations: Mult1 busy, Op = Mult, Vk = F4, Qj = Load2, still 10 cycles to execute; Add1, Add2, Add3 and Mult2 are free. Register result status: F0: Mult1; F2: Load2; F6: Load1. 57
58 Cycle 4. SUBD is issued at cycle 4 (an RS is free). F6 is available in the RF at the end of the cycle: the datum read from memory by LD F6, 34(R2) is written both in the RF and in the RS of SUBD, which is waiting for it. Load buffers: Load1 is freed; Load2 busy, address 45+R3. Reservation stations: Add1 busy, Op = Sub, Vj = F6 (captured on the fly), Qk = Load2, still 3 cycles; Mult1 busy, Op = Mult, Vk = F4, Qj = Load2, still 10 cycles; Add2, Add3 and Mult2 are free. The add FUs execute both sums and subtractions. Register result status: F0: Mult1; F2: Load2; F8: Add1; F6 has no pending producer any more (FU freed). 58
59 Cycle 5. DIVD is issued at cycle 5 (an RS is free). The datum read from memory by LD F2, 45(R3) is written both in register F2 and in the RS of SUBD and MULTD, which are waiting for it. Reservation stations (Time = cycles yet to be executed for completing the execution): Add1 busy, Op = Sub, Vj = F6 (captured), Vk = F2 (captured), Time 3; Mult1 busy, Op = Mult, Vj = F2 (captured), Vk = F4, Time 10; Mult2 busy, Op = Div, Vk = F6, Qj = Mult1 (waiting for F0); Add2 and Add3 are free. Register result status: F0: Mult1; F8: Add1; F10: Mult2; F2 has no pending producer any more (FU freed). 59
60 Cycle 6. ADDD is issued at cycle 6 (an RS is free) and waits for F8. Now MULTD can execute (F2 and F4 available). Reservation stations (Time = cycles yet to be executed for completing the execution): Add1 busy, Op = Sub, Vj = F6, Vk = F2, Time 2; Add2 busy, Op = Add, Vk = F2, Qj = Add1 (waiting for F8); Mult1 busy, Op = Mult, Vj = F2, Vk = F4, Time 9; Mult2 busy, Op = Div, Vk = F6, Qj = Mult1 (waiting for F0), still 40 cycles; Add3 is free. Register result status: F0: Mult1; F6: Add2; F8: Add1; F10: Mult2. 60
61 Cycle 7. All six instructions have been issued. ADDD is stalled waiting for SUBD (F8); SUBD (like ADDD) takes two execution cycles. The datum in F6 will be overwritten by ADDD, but it was already read and is present in the RS of DIVD. Reservation stations: Add1 busy, Op = Sub, Vj = F6, Vk = F2, Time 1; Add2 busy, Op = Add, Vk = F2, Qj = Add1; Mult1 busy, Op = Mult, Vj = F2, Vk = F4, Time 8; Mult2 busy, Op = Div, Vk = F6, Qj = Mult1, still 40 cycles; Add3 is free. Register result status: F0: Mult1; F6: Add2; F8: Add1; F10: Mult2. 61
62 Cycle 8. NB: SUBD ends before MULTD and allows ADDD (which captures the result F8) to start executing. Reservation stations: Add1 is freed (Time 0); Add2 busy, Op = Add, Vj = F8, Vk = F2, Time 2; Mult1 busy, Op = Mult, Vj = F2, Vk = F4, Time 7; Mult2 busy, Op = Div, Vk = F6, Qj = Mult1, still 40 cycles; Add3 is free. Register result status: F0: Mult1; F6: Add2; F10: Mult2; F8 has no pending producer any more (FU freed). 62
63 Cycle 9. ADDD is executing. Reservation stations: Add2 busy, Op = Add, Vj = F8, Vk = F2, Time 2; Mult1 busy, Op = Mult, Vj = F2, Vk = F4, Time 6; Mult2 busy, Op = Div, Vk = F6, Qj = Mult1, still 40 cycles; Add1 and Add3 are free. Register result status: F0: Mult1; F6: Add2; F10: Mult2. 63
64 Cycle 10. ADDD takes two execution cycles. Reservation stations: Add2 busy, Op = Add, Vj = F8, Vk = F2, Time 1; Mult1 busy, Op = Mult, Vj = F2, Vk = F4, Time 5; Mult2 busy, Op = Div, Vk = F6, Qj = Mult1, still 40 cycles; Add1 and Add3 are free. Register result status: F0: Mult1; F6: Add2; F10: Mult2. 64
65 Cycle 11. ADDD too ends before MULTD and DIVD. Reservation stations (Time = cycles yet to be executed for completing the execution): Add1, Add2 and Add3 are free; Mult1 busy, Op = Mult, Vj = F2, Vk = F4, Time 4; Mult2 busy, Op = Div, Vk = F6, Qj = Mult1, Time 40. Register result status: F0: Mult1; F10: Mult2; F6 has no pending producer any more (FU freed). 65
66 Cycle 12. Reservation stations (Time = cycles yet to be executed for completing the execution): Add1, Add2 and Add3 are free; Mult1 busy, Op = Mult, Vj = F2, Vk = F4, Time 3; Mult2 busy, Op = Div, Vk = F6, Qj = Mult1, Time 40, waiting for the datum produced by MULTD. Register result status: F0: Mult1; F10: Mult2. 66
67 Cycle 15. Reservation stations: Add1, Add2 and Add3 are free; Mult1 busy, Op = Mult, Vj = F2, Vk = F4, Time 1; Mult2 busy, Op = Div, Vk = F6, Qj = Mult1, still 40 cycles, waiting for the datum produced by MULTD. Register result status: F0: Mult1; F10: Mult2. 67
68 Cycle 16. MULTD writes its result; now DIVD can execute. Reservation stations: Add1, Add2, Add3 and Mult1 are free; Mult2 busy, Op = Div, Vj = F0, Vk = F6, Time 40. Register result status: F10: Mult2; F0 has no pending producer any more (FU freed). 68
69 Cycle 56. Reservation stations: Add1, Add2, Add3 and Mult1 are free; Mult2 busy, Op = Div, Vj = F0, Vk = F6, Time 0. Register result status: F10: Mult2. 69
70 Cycle 57. Instruction status (columns: Instruction, j, k, Issue, Execution complete, Write Result): all six instructions have written their results. Reservation stations: Add1, Add2, Add3, Mult1 and Mult2 are free. Register result status: no register has a pending producer. 70
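The cycle-by-cycle trace above can be cross-checked with a small timing model. The Python sketch below is a simplification under stated assumptions, not a full Tomasulo simulator: it assumes that an RS is always free, that the single CDB is never contended, that an instruction starts executing the cycle after its last operand appears, and that a load takes 2 execution cycles (address computation plus memory read, as noted on the Cycle 2 slide). The names `LATENCY`, `PROGRAM` and `schedule` are introduced here for illustration.

```python
# Execution latencies in cycles; write-back takes one extra cycle.
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}

# (opcode, destination register, FP source registers)
PROGRAM = [
    ("LD", "F6", []),
    ("LD", "F2", []),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]


def schedule(program):
    """Return (op, dest, issue, exec_start, exec_end, write) per instruction."""
    pending_write = {}  # register -> write-back cycle of its latest producer
    rows = []
    for index, (op, dest, sources) in enumerate(program):
        issue = index + 1  # in-order emission, one instruction per clock
        # Sources are bound at emission time (renaming), so a later writer
        # of the same register cannot disturb this instruction.
        ready = max([issue] + [pending_write.get(src, 0) for src in sources])
        exec_start = ready + 1
        exec_end = exec_start + LATENCY[op] - 1
        write = exec_end + 1
        pending_write[dest] = write
        rows.append((op, dest, issue, exec_start, exec_end, write))
    return rows


for row in schedule(PROGRAM):
    print(row)
```

Under these assumptions the write-back cycles come out as 4, 5, 16, 8, 57 and 11 in program order, in agreement with the slides: F6 and F2 arrive at cycles 4 and 5, SUBD and ADDD finish at 8 and 11, MULTD at 16 and DIVD at 57. Note that DIVD reads the F6 produced by the first load even though ADDD later overwrites F6.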
71 A demo can be found at 71
72 Limits of Tomasulo Algorithm Very complex: each CDB must be connected to each RS (complex wiring). Reducing the number of CDBs means reduced efficiency: if a single CDB is present, only one instruction per cycle can complete. Out-of-order instruction completion: NO precise interrupts. 73
73 Exceptions Exception/interrupt: a non-programmed control transfer. The return address and all other information necessary to restore the interrupted situation must be saved. A «response» subroutine (handler) must be executed. Two exception types: interrupts and traps. Interrupts: external causes. The user program is interrupted and then restored. Asynchronous to the current process. Acknowledged at the end of the current instruction (if interrupts are enabled). The handler is the responsibility of the user program. Traps: internal causes. Exceptional conditions (overflow, division by zero etc.). Errors (e.g. parity). Page fault (or segment fault, see later): data not available in memory. Synchronous to the current process. Handled by the operating system. An instruction can be interrupted during its execution (e.g. page fault) and therefore must be «restartable». The executing program is normally temporarily aborted. 74
74 Examples Instruction Restart 75
75 Precise exceptions/interrupts Exceptions must be precise, that is, their behaviour must be the same that would occur in a non-pipelined architecture. Precise: the machine status is saved as if the code had been executed up to the exception: all preceding instructions must be terminated; all instructions following the instruction which caused the exception must be handled as if they had never started. The same code must execute identically on different architectures. This is a complex problem with pipelines, OOO execution (see later) etc. Scoreboard and Tomasulo have in-order emission but out-of-order execution (and therefore out-of-order termination). Precise exceptions (interrupts) require in-order instruction commitment. 76
76 Reorder Buffer (ROB) A FIFO queue that stores pointers to all instructions in FIFO order as they are emitted. For the sake of simplicity we say that the instruction is virtually inserted in the ROB. When instructions terminate, their results are stored in the ROB (instead of the RF), which also provides the operands to the other instructions which require them (renaming!). Commitment: the result of the instruction in the top slot of the FIFO is transferred to the architectural registers. Easy undo of speculated instructions (see later), of erroneously predicted branches or of exceptions. Automatic WAW avoidance. [Figure: FP Op Queue, ROB, FP Regs, Res Stations, FP Adders] 78
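A minimal Python sketch of the FIFO behaviour described above is given below. The class and method names are assumptions made for illustration, and the `commit` method drains every finished entry at the head in one call, which is a simplification (the text describes committing the top slot).

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: entries are allocated and committed in program order."""

    def __init__(self, size):
        self.size = size
        self.entries = deque()

    def allocate(self, dest):
        """Reserve a slot at emission; return None if the ROB is full."""
        if len(self.entries) == self.size:
            return None
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        """Store a result in the ROB (not in the RF) when execution ends."""
        entry["value"], entry["done"] = value, True

    def commit(self, regs):
        """Move finished results from the head of the FIFO to the RF."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            head = self.entries.popleft()
            regs[head["dest"]] = head["value"]
            committed.append(head["dest"])
        return committed


regs = {"F0": 0.0, "F8": 0.0}
rob = ReorderBuffer(size=4)
mul = rob.allocate("F0")   # older, slow instruction
sub = rob.allocate("F8")   # younger, fast instruction
rob.complete(sub, 3.5)     # the younger one finishes first
first = rob.commit(regs)   # nothing commits: the head is not done yet
rob.complete(mul, 8.0)
second = rob.commit(regs)  # both commit, in program order
print(first, second, regs)
```

The younger result waits in the ROB until the older instruction has finished, so the architectural registers are always updated in program order even though execution ends out of order.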
77 Tomasulo again 79
78 Tomasulo in 4 steps. Emission: an instruction is emitted from the instruction queue when an RS and a ROB slot are available. The RS records the sources of the operands and the ROB slot where the instruction will be parked after its execution (this phase is called «dispatch»). The results are NOT written in the RF until the commitment phase. NB: the lack of one of the two conditions blocks the emission of the following instructions. Execution: the operation is performed on the operands. If they are not yet ready they can be in the ROB (in this case the operand value computed by the nearest previous instruction is used) or still being computed in an FU. This phase is indicated as «issue». Result writeback: execution ends. The result is transmitted on the CDB to the RS waiting for it and to the ROB. Commitment («graduation»): the register (or memory) is updated with the result stored in the ROB when the instruction is on the top of the ROB FIFO. In case of an erroneously predicted branch the ROB results are just dropped. EMISSION IN ORDER, COMMITMENT IN ORDER. N.B. Sometimes more instructions can be committed simultaneously. If the destination is the same (unlikely, otherwise the compiler would have dropped the first one) the result of the most recent instruction is used. 80
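The commitment step, including the case of an erroneously predicted branch, can be sketched as follows. This is an illustrative Python model only: the entry layout, the `mispredicted` flag and the function name are assumptions, and the numeric results are made up.

```python
from collections import deque


def commit_or_squash(rob, regs):
    """Commit finished entries from the ROB head, in order (hypothetical helper).

    When a mispredicted branch reaches the head, every younger entry is
    dropped, so its result never reaches the architectural registers.
    """
    while rob and rob[0]["done"]:
        head = rob.popleft()
        if head["op"] == "BRANCH":
            if head.get("mispredicted", False):
                rob.clear()
                return "squashed"
            continue
        regs[head["dest"]] = head["value"]
    return "ok"


regs = {"F0": 0.0, "F2": 0.0, "F4": 0.0}
rob = deque([
    {"op": "FDIV", "dest": "F2", "value": 1.5, "done": True},
    {"op": "BRANCH", "dest": None, "value": None, "done": True,
     "mispredicted": True},
    # Speculated instructions after the branch: already executed, not committed.
    {"op": "LD", "dest": "F4", "value": 9.0, "done": True},
    {"op": "FADD", "dest": "F0", "value": 3.0, "done": True},
])
outcome = commit_or_squash(rob, regs)
print(outcome, regs, len(rob))
```

Only the instruction older than the branch updates the register file; the two speculated results are discarded with the ROB contents, which is why undoing a wrong prediction is cheap.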
79 HW with ROB. Fields of a ROB entry: Destination Register, Result, Exception?, Valid (terminated), Program Counter. The ROB is a circular queue. [Figure: FP Op Queue, comparison network, Reorder Buffer, FP Regs, Res Stations, FP Adders] 81
80 Example
LD F0, 10(R2)
FADD F10, F4, F0
FDIV F2, F10, F6
BRNE F2, +100
LD F4, 0(R3)
FADD F0, F4, F6
ST 0(R3), F4
82
81 Tomasulo with ROB, cycle 1. FP Op queue: LD F0, 10(R2); FADD F10, F4, F0; FDIV F2, F10, F6; BRNE F2, +100; LD F4, 0(R3); FADD F0, F4, F6; ST 0(R3), F4. ROB (slots ROB1 to ROB7; columns: Dest., Source, Instruction, Completed?): ROB1 (ROB top and ROB end): Dest. F0, instruction LD F0, 10(R2), Completed? N. Dest. => destination position in the ROB. Memory unit: Dest 1, address 10+R2 (M1). [Figure: FP Op queue, ROB, FP registers, reservation stations, FP adders, FP multipliers, to memory / from memory] 83
82 Tomasulo with ROB, cycle 2. FP Op queue: LD F0, 10(R2); FADD F10, F4, F0; FDIV F2, F10, F6; BRNE F2, +100; LD F4, 0(R3); FADD F0, F4, F6; ST 0(R3), F4. ROB (slots ROB1 to ROB7; columns: Dest., Source, Instruction, Completed?): ROB1 (top): Dest. F0, instruction LD F0, 10(R2), Completed? Ex; ROB2 (end): Dest. F10, Source ROB1, instruction FADD F10, F4, F0, Completed? N. Renaming!! The FP adder reservation station holds FADD F10, F4, ROB1 with Dest 2. There can also be two ROB sources. Memory unit: three slots for memory operations; Dest 1, address 10+R2 (M1); memory takes 2 clocks. [Figure: FP Op queue, ROB, FP registers, reservation stations, FP adders, FP multipliers, to memory / from memory] 84
More informationComputer Architecture
Computer Architecture An Introduction Virendra Singh Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay http://www.ee.iitb.ac.in/~viren/
More informationOn the Rules of Low-Power Design
On the Rules of Low-Power Design (and Why You Should Break Them) Prof. Todd Austin University of Michigan austin@umich.edu A long time ago, in a not so far away place The Rules of Low-Power Design P =
More informationECE 2300 Digital Logic & Computer Organization. More Pipelined Microprocessor
ECE 2300 Digital ogic & Computer Organization Spring 2018 ore Pipelined icroprocessor ecture 18: 1 nnouncements No instructor office hour today Rescheduled to onday pril 16, 4:00-5:30pm Prelim 2 review
More informationCZ3001 ADVANCED COMPUTER ARCHITECTURE
CZ3001 ADVANCED COMPUTER ARCHITECTURE Lab 3 Report Abstract Pipelining is a process in which successive steps of an instruction sequence are executed in turn by a sequence of modules able to operate concurrently,
More informationInstructor: Dr. Mainak Chaudhuri. Instructor: Dr. S. K. Aggarwal. Instructor: Dr. Rajat Moona
NPTEL Online - IIT Kanpur Instructor: Dr. Mainak Chaudhuri Instructor: Dr. S. K. Aggarwal Course Name: Department: Program Optimization for Multi-core Architecture Computer Science and Engineering IIT
More informationClass Project: Low power Design of Electronic Circuits (ELEC 6970) 1
Power Minimization using Voltage reduction and Parallel Processing Sudheer Vemula Dept. of Electrical and Computer Engineering Auburn University, Auburn, AL. Goal of the project:- To reduce the power consumed
More informationArchitectural Core Salvaging in a Multi-Core Processor for Hard-Error Tolerance
Architectural Core Salvaging in a Multi-Core Processor for Hard-Error Tolerance Michael D. Powell, Arijit Biswas, Shantanu Gupta, and Shubu Mukherjee SPEARS Group, Intel Massachusetts EECS, University
More informationA LOW POWER DESIGN FOR ARITHMETIC AND LOGIC UNIT
A LOW POWER DESIGN FOR ARITHMETIC AND LOGIC UNIT NG KAR SIN (B.Tech. (Hons.), NUS) A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF ENGINEERING DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING NATIONAL
More informationOverview. 1 Trends in Microprocessor Architecture. Computer architecture. Computer architecture
Overview 1 Trends in Microprocessor Architecture R05 Robert Mullins Computer architecture Scaling performance and CMOS Where have performance gains come from? Modern superscalar processors The limits of
More informationCS4617 Computer Architecture
1/26 CS4617 Computer Architecture Lecture 2 Dr J Vaughan September 10, 2014 2/26 Amdahl s Law Speedup = Execution time for entire task without using enhancement Execution time for entire task using enhancement
More informationCS429: Computer Organization and Architecture
CS429: Computer Organization and Architecture Dr. Bill Young Department of Computer Sciences University of Texas at Austin Last updated: November 8, 2017 at 09:27 CS429 Slideset 14: 1 Overview What s wrong
More informationComputer Architecture ( L), Fall 2017 HW 3: Branch handling and GPU SOLUTIONS
Computer Architecture (263-2210-00L), Fall 2017 HW 3: Branch handling and GPU SOLUTIONS Instructor: Prof. Onur Mutlu TAs: Hasan Hassan, Arash Tavakkol, Mohammad Sadr, Lois Orosa, Juan Gomez Luna Assigned:
More informationDIGITAL DESIGN WITH SM CHARTS
DIGITAL DESIGN WITH SM CHARTS By: Dr K S Gurumurthy, UVCE, Bangalore e-notes for the lectures VTU EDUSAT Programme Dr. K S Gurumurthy, UVCE, Blore Page 1 19/04/2005 DIGITAL DESIGN WITH SM CHARTS The utility
More informationLow-Power Design for Embedded Processors
Low-Power Design for Embedded Processors BILL MOYER, MEMBER, IEEE Invited Paper Minimization of power consumption in portable and batterypowered embedded systems has become an important aspect of processor
More informationDesigning with STM32F3x
Designing with STM32F3x Course Description Designing with STM32F3x is a 3 days ST official course. The course provides all necessary theoretical and practical know-how for start developing platforms based
More informationDepartment Computer Science and Engineering IIT Kanpur
NPTEL Online - IIT Bombay Course Name Parallel Computer Architecture Department Computer Science and Engineering IIT Kanpur Instructor Dr. Mainak Chaudhuri file:///e /parallel_com_arch/lecture1/main.html[6/13/2012
More informationSCALCORE: DESIGNING A CORE
SCALCORE: DESIGNING A CORE FOR VOLTAGE SCALABILITY Bhargava Gopireddy, Choungki Song, Josep Torrellas, Nam Sung Kim, Aditya Agrawal, Asit Mishra University of Illinois, University of Wisconsin, Nvidia,
More informationADVANCED EMBEDDED MONITORING SYSTEM FOR ELECTROMAGNETIC RADIATION
98 Chapter-5 ADVANCED EMBEDDED MONITORING SYSTEM FOR ELECTROMAGNETIC RADIATION 99 CHAPTER-5 Chapter 5: ADVANCED EMBEDDED MONITORING SYSTEM FOR ELECTROMAGNETIC RADIATION S.No Name of the Sub-Title Page
More informationDELD MODEL ANSWER DEC 2018
2018 DELD MODEL ANSWER DEC 2018 Q 1. a ) How will you implement Full adder using half-adder? Explain the circuit diagram. [6] An adder is a digital logic circuit in electronics that implements addition
More informationECOM 4311 Digital System Design using VHDL. Chapter 9 Sequential Circuit Design: Practice
ECOM 4311 Digital System Design using VHDL Chapter 9 Sequential Circuit Design: Practice Outline 1. Poor design practice and remedy 2. More counters 3. Register as fast temporary storage 4. Pipelined circuit
More informationVector Arithmetic Logic Unit Amit Kumar Dutta JIS College of Engineering, Kalyani, WB, India
Vol. 2 Issue 2, December -23, pp: (75-8), Available online at: www.erpublications.com Vector Arithmetic Logic Unit Amit Kumar Dutta JIS College of Engineering, Kalyani, WB, India Abstract: Real time operation
More informationPre-Silicon Validation of Hyper-Threading Technology
Pre-Silicon Validation of Hyper-Threading Technology David Burns, Desktop Platforms Group, Intel Corp. Index words: microprocessor, validation, bugs, verification ABSTRACT Hyper-Threading Technology delivers
More informationComputer Architecture and Organization:
Computer Architecture and Organization: L03: Register transfer and System Bus By: A. H. Abdul Hafez Abdul.hafez@hku.edu.tr, ah.abdulhafez@gmail.com 1 CAO, by Dr. A.H. Abdul Hafez, CE Dept. HKU Outlines
More informationLow Power Design Part I Introduction and VHDL design. Ricardo Santos LSCAD/FACOM/UFMS
Low Power Design Part I Introduction and VHDL design Ricardo Santos ricardo@facom.ufms.br LSCAD/FACOM/UFMS Motivation for Low Power Design Low power design is important from three different reasons Device
More informationReading Material + Announcements
Reading Material + Announcements Reminder HW 1» Before asking questions: 1) Read all threads on piazza, 2) Think a bit Ÿ Then, post question Ÿ talk to Animesh if you are stuck Today s class» Wrap up Control
More informationPipelined Architecture (2A) Young Won Lim 4/7/18
Pipelined Architecture (2A) Copyright (c) 2014-2018 Young W. Lim. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2
More informationPipelined Architecture (2A) Young Won Lim 4/10/18
Pipelined Architecture (2A) Copyright (c) 2014-2018 Young W. Lim. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2
More informationFall 2015 COMP Operating Systems. Lab #7
Fall 2015 COMP 3511 Operating Systems Lab #7 Outline Review and examples on virtual memory Motivation of Virtual Memory Demand Paging Page Replacement Q. 1 What is required to support dynamic memory allocation
More informationComputer Arithmetic (2)
Computer Arithmetic () Arithmetic Units How do we carry out,,, in FPGA? How do we perform sin, cos, e, etc? ELEC816/ELEC61 Spring 1 Hayden Kwok-Hay So H. So, Sp1 Lecture 7 - ELEC816/61 Addition Two ve
More informationTopics. Low Power Techniques. Based on Penn State CSE477 Lecture Notes 2002 M.J. Irwin and adapted from Digital Integrated Circuits 2002 J.
Topics Low Power Techniques Based on Penn State CSE477 Lecture Notes 2002 M.J. Irwin and adapted from Digital Integrated Circuits 2002 J. Rabaey Review: Energy & Power Equations E = C L V 2 DD P 0 1 +
More information1 Q' 3. You are given a sequential circuit that has the following circuit to compute the next state:
UNIVERSITY OF CALIFORNIA Department of Electrical Engineering and Computer Sciences C50 Fall 2001 Prof. Subramanian Homework #3 Due: Friday, September 28, 2001 1. Show how to implement a T flip-flop starting
More informationA New High Speed Low Power Performance of 8- Bit Parallel Multiplier-Accumulator Using Modified Radix-2 Booth Encoded Algorithm
A New High Speed Low Power Performance of 8- Bit Parallel Multiplier-Accumulator Using Modified Radix-2 Booth Encoded Algorithm V.Sandeep Kumar Assistant Professor, Indur Institute Of Engineering & Technology,Siddipet
More informationBasic electronics Prof. T.S. Natarajan Department of Physics Indian Institute of Technology, Madras Lecture- 24
Basic electronics Prof. T.S. Natarajan Department of Physics Indian Institute of Technology, Madras Lecture- 24 Mathematical operations (Summing Amplifier, The Averager, D/A Converter..) Hello everybody!
More informationDigital Hearing Aids Specific μdsp Chip Design by Verilog HDL
Digital Hearing Aids Specific μdsp Chip Design by Verilog HDL Soon-Suck Jarng*, Lingfen Chen *, You-Jung Kwon * * Department of Information Control & Instrumentation, Chosun University, Gwang-Ju, Korea
More informationUnderstanding Engineers #2
Understanding Engineers #! The graduate with a Science degree asks, "Why does it work?"! The graduate with an Engineering degree asks, "How does it work?"! The graduate with an Accounting degree asks,
More informationHigh performance Radix-16 Booth Partial Product Generator for 64-bit Binary Multipliers
High performance Radix-16 Booth Partial Product Generator for 64-bit Binary Multipliers Dharmapuri Ranga Rajini 1 M.Ramana Reddy 2 rangarajini.d@gmail.com 1 ramanareddy055@gmail.com 2 1 PG Scholar, Dept
More informationDesign of ALU and Cache Memory for an 8 bit ALU
Clemson University TigerPrints All Theses Theses 12-2007 Design of ALU and Cache Memory for an 8 bit ALU Pravin chander Chandran Clemson University, pravinc@clemson.edu Follow this and additional works
More informationRamon Canal NCD Master MIRI. NCD Master MIRI 1
Wattch, Hotspot, Hotleakage, McPAT http://www.eecs.harvard.edu/~dbrooks/wattch-form.html http://lava.cs.virginia.edu/hotspot http://lava.cs.virginia.edu/hotleakage http://www.hpl.hp.com/research/mcpat/
More information