Parallel decoding and issue
Parallel execution
Preserving the sequential consistency of execution and exception processing

[Figure: operand fetch policies, issue-bound fetch vs. dispatch-bound fetch; a decode/issue unit feeds four reservation stations (RS)]

Parallel decoding and issue
Parallel execution
  - Renaming: remove WAR and WAW data dependencies
  - Out-of-order scheduling: scoreboarding and the Tomasulo approach
  - Reordering: maintaining the order of execution
Preserving the sequential consistency of execution and exception processing
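Data dependencies before renaming:
  DIV F0, F2, F4
  ADD F6, F0, F8
  ST  F6, 0(R1)
  SUB F8, F10, F24
  MUL F6, F10, F8
  - RAW: ADD reads F0 written by DIV
  - RAW: ST reads F6 written by ADD
  - WAR: SUB writes F8, which ADD reads
  - WAW: MUL writes F6, which ADD also writes
  - WAR: MUL writes F6, which ST reads

After renaming (S and T are new registers), the WAR and WAW dependencies are gone:
  DIV F0, F2, F4
  ADD S, F0, F8
  ST  S, 0(R1)
  SUB T, F10, F24
  MUL F6, F10, T

Renaming: the register address is obtained from a mapping into a physical register file (larger than the architectural register file).
  - Compiler: done statically; limited by the registers visible to the compiler
  - Hardware: done dynamically; limited by the registers available to the hardware

Scoreboarding: introduced with the CDC 6600.
[Figure: register file in which each register (0, 1, 2, ...) holds data plus a one-bit status field]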
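The effect of renaming on the five-instruction sequence above can be checked with a small sketch. This is only an illustration under stated assumptions: the tuple encoding of instructions, the helper names `hazards` and `rename`, and the physical register names `P0`, `P1`, ... are hypothetical and not from the slides. Unlike the slide, which renames only the conflicting registers (S and T), the sketch gives every destination a fresh register, which is closer to what a hardware mapping table does.

```python
from itertools import count

# (opcode, destination or None, [source registers]); ST writes no register.
program = [
    ("DIV", "F0", ["F2", "F4"]),
    ("ADD", "F6", ["F0", "F8"]),
    ("ST", None, ["F6", "R1"]),
    ("SUB", "F8", ["F10", "F24"]),
    ("MUL", "F6", ["F10", "F8"]),
]


def hazards(prog):
    """Return (kind, earlier index, later index, register) for each pair.

    Deliberately naive: every pair is compared and intervening
    redefinitions are ignored, which is enough for this short sequence.
    """
    found = []
    for i, (_, d1, s1) in enumerate(prog):
        for j in range(i + 1, len(prog)):
            _, d2, s2 = prog[j]
            if d1 is not None and d1 in s2:
                found.append(("RAW", i, j, d1))
            if d2 is not None and d2 in s1:
                found.append(("WAR", i, j, d2))
            if d1 is not None and d1 == d2:
                found.append(("WAW", i, j, d1))
    return found


def rename(prog):
    """Map every destination to a fresh physical register (P0, P1, ...)."""
    mapping = {}  # architectural register -> most recent physical register
    fresh = (f"P{n}" for n in count())
    out = []
    for op, dest, srcs in prog:
        # Sources follow the latest mapping, so RAW dependences are kept.
        new_srcs = [mapping.get(s, s) for s in srcs]
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)  # a fresh name removes WAR and WAW
            mapping[dest] = new_dest
        out.append((op, new_dest, new_srcs))
    return out


renamed = rename(program)
print(sorted({kind for kind, *_ in hazards(program)}))  # ['RAW', 'WAR', 'WAW']
print(sorted({kind for kind, *_ in hazards(renamed)}))  # ['RAW']
```

Only the true (RAW) dependences survive, which is why renaming is a prerequisite for aggressive out-of-order scheduling.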
[Figure: scoreboard datapath. A decoded instruction (OC, Rs1, Rs2, Rd) checks the V bits of its sources and resets the V bit of Rd; the reservation station holds OC (opcode), Os1 and Os2 (operand values) and Rd; the result updates Rd in the register file and sets its V bit.]

Scoreboard example:
  I1: DIV R3, R1, R2
  I2: MUL R5, R3, R4
  I3: ADD R4, R6, R7
[Figure: DIV, MUL and ADD units attached to a bus]
Precedence information handles the hazards, but the scoreboard stalls for all RAW, WAR and WAW hazards.

Scoreboard:
              R1  R2  R3   R4   R5   R6   R7
  Read        Φ   Φ   MUL  MUL  Φ    ADD  ADD
  Write       Φ   Φ   DIV  ADD  MUL  Φ    Φ
  Precedence  Φ   Φ   W    R    W    Φ    Φ

1. Issue I1 to DIV: [R1], [R2] go to DIV; begin DIV.
2. Issue I2 to MUL (dependency), to the RS:
   - TAG[DIV] (from R3) goes to MUL
   - TAG[MUL] is placed in the [R4] read scoreboard entry
   - TAG[MUL] is placed in the [R5] write scoreboard entry
3. Decode I3 to ADD (dependency), to the RS:
   - TAG[ADD] for the [R6] read scoreboard entry
   - TAG[ADD] for the [R7] read scoreboard entry
   - TAG[ADD] for the [R4] write scoreboard entry

Functional unit status fields:
  Op      Operation to perform in the unit
  Qj, Qk  Functional units from which the source operands will come
  Rj, Rk  Ready status of the source registers for the operation; if both Rj and Rk are Yes, the operation can be scheduled
  Busy    Indicates whether the reservation station or FU is busy

Example instructions:
  LF F6, 34(R2)
  LF F2, 45(R3)

Slide 17 (all units idle):
  No  Name
  1   INT
  2   MUL1
  3   MUL2
  4   ADD
  5   DIV
  Register result status (FU No): no entries

Slide 18:
  No  Name  Busy  Op
  1   INT   Y     LF
  2   MUL1  Y     MUL
  4   ADD   Y     SUB
  5   DIV   Y     DIV
  Register result status (FU No): no entries
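Before following the snapshots further, the scheduling rule from the field definitions above (an operation can be scheduled once both Rj and Rk are Yes) can be sketched in a few lines. This is a minimal sketch, not the CDC 6600 logic: the `FUStatus` class, the `issue` and `can_read_operands` helpers, and the dictionary used for the register result status are all hypothetical names, cycle timing is ignored, and the first load (LF F6) is assumed to have completed already.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FUStatus:
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # first source register
    fk: Optional[str] = None   # second source register
    qj: Optional[int] = None   # FU number that will produce fj, if pending
    qk: Optional[int] = None   # FU number that will produce fk, if pending
    rj: bool = False           # fj ready?
    rk: bool = False           # fk ready?


def issue(fus, reg_result, fu_no, op, fi, fj, fk):
    """Issue to FU fu_no; return False on a structural or WAW stall."""
    if fus[fu_no].busy or fi in reg_result:
        return False
    qj = reg_result.get(fj)  # pending producer of each source, if any
    qk = reg_result.get(fk)
    fus[fu_no] = FUStatus(True, op, fi, fj, fk, qj, qk, qj is None, qk is None)
    reg_result[fi] = fu_no   # this FU will write the destination register
    return True


def can_read_operands(fu):
    """RAW check: both source operands must be ready."""
    return fu.busy and fu.rj and fu.rk


# FU numbers as on the slides: 1 INT, 2 MUL1, 3 MUL2, 4 ADD, 5 DIV.
fus = {no: FUStatus() for no in (1, 2, 3, 4, 5)}
reg_result = {}  # register -> FU number that will write it

issue(fus, reg_result, 1, "LF", "F2", "R3", None)
issue(fus, reg_result, 2, "MUL", "F0", "F2", "F4")
issue(fus, reg_result, 4, "SUB", "F8", "F6", "F2")
issue(fus, reg_result, 5, "DIV", "F10", "F0", "F6")

# The ADD unit is still busy with SUB, so a following ADD cannot issue yet.
stalled = not issue(fus, reg_result, 4, "ADD", "F6", "F8", "F2")
```

In this sketch MUL waits on FU 1 for F2 (Rj is No, Rk is Yes) and SUB waits on FU 1 for F2 as well (Rj is Yes, Rk is No), while the register result status records which unit will write F0, F2, F8 and F10.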
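Functional unit status snapshots for the scoreboard example (FU numbers: 1 INT, 2 MUL1, 3 MUL2, 4 ADD, 5 DIV). A dash marks an empty field.
[The row for FU 5 (DIV) is not recoverable for slides 20 to 24.]

Slide 19:
  No  Name  Busy  Op   Fi   Fj, Fk
  1   INT   Y     LF   F2   R3
  2   MUL1  Y     MUL  F0   F2, F4
  4   ADD   Y     SUB  F8   F6, F2
  5   DIV   Y     DIV  F10  F0, F6
  Register result status (FU No): no entries

Slide 20:
  No  Name  Busy  Op   Fi   Fj, Fk  Qj  Qk  Rj  Rk
  1   INT   Y     LF   F2   R3      -   -   Y   Y
  2   MUL1  Y     MUL  F0   F2, F4  1   -   N   Y
  4   ADD   Y     SUB  F8   F6, F2  -   1   Y   N
  Register result status (FU No): F0 = 2, F2 = 1, F8 = 4, F10 = 5

Slide 21:
  No  Name  Busy  Op   Fi   Fj, Fk  Qj  Qk  Rj  Rk
  1   INT   Y     LF   F2   R3      -   -   N   N
  2   MUL1  Y     MUL  F0   F2, F4  1   -   N   Y
  4   ADD   Y     SUB  F8   F6, F2  -   1   Y   N
  Register result status (FU No): F0 = 2, F2 = 1, F8 = 4, F10 = 5

Slide 22:
  No  Name  Busy  Op   Fi   Fj, Fk  Qj  Qk  Rj  Rk
  2   MUL1  Y     MUL  F0   F2, F4  -   -   Y   Y
  4   ADD   Y     SUB  F8   F6, F2  -   -   Y   Y
  Register result status (FU No): F0 = 2, F8 = 4, F10 = 5

Slide 23:
  No  Name  Busy  Op   Fi   Fj, Fk  Qj  Qk  Rj  Rk
  2   MUL1  Y     MUL  F0   F2, F4  -   -   N   N
  4   ADD   N
  Register result status (FU No): F0 = 2, F10 = 5

Slide 24:
  No  Name  Busy  Op   Fi   Fj, Fk  Qj  Qk  Rj  Rk
  2   MUL1  Y     MUL  F0   F2, F4  -   -   N   N
  4   ADD   Y     ADD  F6   F8, F2  -   -   Y   Y
  Register result status (FU No): F0 = 2, F6 = 4, F10 = 5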
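Functional unit status snapshots for the scoreboard example, final part (FU numbers: 2 MUL1, 4 ADD, 5 DIV). A dash marks an empty field.
[The row for FU 5 (DIV) is not recoverable for slides 25 and 26.]

Slides 25 and 26:
  No  Name  Busy  Op   Fi   Fj, Fk  Qj  Qk  Rj  Rk
  2   MUL1  Y     MUL  F0   F2, F4  -   -   N   N
  4   ADD   Y     ADD  F6   F8, F2  -   -   N   N
  Register result status (FU No): F0 = 2, F6 = 4, F10 = 5

Slide 27:
  No  Name  Busy  Op   Fi   Fj, Fk  Qj  Qk  Rj  Rk
  2   MUL1  N
  4   ADD   Y     ADD  F6   F8, F2  -   -   N   N
  5   DIV   Y     DIV  F10  F0, F6  -   -   Y   Y
  Register result status (FU No): F6 = 4, F10 = 5

Slide 28:
  No  Name  Busy  Op   Fi   Fj, Fk  Qj  Qk  Rj  Rk
  2   MUL1  N
  4   ADD   Y     ADD  F6   F8, F2  -   -   N   N
  5   DIV   Y     DIV  F10  F0, F6  -   -   N   N
  Register result status (FU No): F6 = 4, F10 = 5

Slide 29:
  No  Name  Busy  Op   Fi   Fj, Fk  Qj  Qk  Rj  Rk
  2   MUL1  N
  4   ADD   N
  5   DIV   Y     DIV  F10  F0, F6  -   -   N   N
  Register result status (FU No): F10 = 5

Out-of-order scheduling: the Tomasulo approach
  - No explicit renaming is required; it can be done in hardware by using special tagging
  - Developed at IBM, but extensively used by Intel and AMD
  - Tomasulo received the Eckert-Mauchly Award in 1997
  - Almost all modern processors use this method: Pentium Pro, Core 2 Duo, Core i3/5/7, AMD Opteron/Phenom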
Tomasulo's approach is essential reading for the ACA course. Many demos are available online.
Ref:
  http://www.ecs.umass.edu/ece/koren/architecture/tomasulo/applettomasulo.html
  http://www.dcs.ed.ac.uk/home/hase/webhase/demo/tomasulo.html
  http://www.ecs.umass.edu/ece/koren/architecture/tomasulo1/tomasulo.htm

Three stages:
1. Issue: get an instruction from the FP op queue. If a reservation station is free (no structural hazard), control issues the instruction and sends the operands (renames registers).
2. Execution: operate on operands (EX). When both operands are ready, execute; if not ready, watch the common data bus for the result.
3. Write result: finish execution (WB). Write on the common data bus to all awaiting units; mark the reservation station available.

In short:
  - Issue: build dependences for the new instruction
  - Writeback: wake up dependent instructions

Issue in detail, for each instruction:
  - Renames its two source registers (source renaming)
  - Assigns it to a free RS
  - Updates the renaming table (destination renaming)
  - Also decodes the instruction and reads register values in parallel

Selection for execution:
  - Only ready instructions can join the competition
  - Select logic chooses instructions for FU execution
  - Some policy may be used, e.g. age-based
  - Non-ready instructions can be woken up during writeback of their parent instruction

Common data bus:
  - Normal data bus: data + destination ("go to" bus)
  - Common data bus: data + source ("come from" bus)
  - 64 bits of data + 4 bits of source index (tag)
  - Broadcasts to every instruction in flight
  - How it works: child instructions do tag matching and update their ready bits and value fields (if the tag matches theirs)
Adapted from UCB CS252 S98, Copyright 1998 UCB
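The tag matching on the common data bus and the age-based select policy described above can be sketched as follows. This is a minimal sketch under stated assumptions: the `Station` class, the `broadcast` and `select` helpers, the `age` field and the numeric operand values are hypothetical, and a single select across all stations is a simplification (real select logic works per functional unit).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Station:
    name: str
    age: int                    # smaller means issued earlier
    op: str
    vj: Optional[float] = None  # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None    # tags of the producers still awaited
    qk: Optional[str] = None

    def ready(self):
        return self.qj is None and self.qk is None


def broadcast(stations, tag, value):
    """Put (tag, value) on the bus; every waiting station compares tags."""
    woken = []
    for rs in stations:
        was_ready = rs.ready()
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
        if rs.ready() and not was_ready:
            woken.append(rs.name)
    return woken


def select(stations):
    """Age-based policy: the oldest ready station wins, None if none is ready."""
    ready = [rs for rs in stations if rs.ready()]
    return min(ready, key=lambda rs: rs.age, default=None)


# Three waiting instructions; the operand values are made up for illustration.
stations = [
    Station("MUL1", age=0, op="MUL", vk=4.0, qj="LD2"),
    Station("ADD1", age=1, op="SUB", vj=6.0, qk="LD2"),
    Station("ADD2", age=2, op="ADD", qj="ADD1", qk="LD2"),
]

before = select(stations)                # nothing is ready yet
woken = broadcast(stations, "LD2", 2.0)  # the load result arrives on the bus
chosen = select(stations)                # the oldest ready station
```

After the broadcast, MUL1 and ADD1 become ready, while ADD2 has captured one operand and still waits for the tag ADD1.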
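IBM 360/91: Tomasulo's scheme
  - Issue-bound fetch
  - FUs: LOAD, STORE, 3 x ADD/SUB, 2 x MUL/DIV
  - Group RSs with 1 slot per FU
  - In-order issue, out-of-order execution

[Figure: Tomasulo datapath. A decoded instruction (OC, Rs1, Rs2, Rd) checks Vs1 and Vs2 and resets the V bit of Rd; each reservation station entry holds OC, Os1/Is1, Vs1, Os2/Is2, Vs2 and Rd; Os1 and Os2 are operand values read from the register file; results travel with Rd on the common data bus, update Rd and set its V bit in the register file, and associatively update Is1 and Is2 with Rd, setting the Vs bits.]

Reservation station fields:
  Op      Operation to perform in the unit
  Vj, Vk  Values of the source operands; store buffers have a V field holding the result to be stored
  Qj, Qk  Reservation stations producing the source registers (value to be written); store buffers only have Qi, for the RS producing the result
  Busy    Indicates whether the reservation station or FU is busy

Example instructions:
  LF F6, 34(R2)
  LF F2, 45(R3)

Reservation station snapshots. A dash marks an empty field; values in parentheses are operand values, bare names are tags.
[The MUL2 row is not recoverable for slide 42.]

Slide 40 (all stations empty):
  Name: ADD1, ADD2, ADD3, MUL1, MUL2
  Register result status Qi: no entries

Slide 41:
  Name  Busy  Op
  ADD1  Y     SUB
  ADD2  Y     ADD
  MUL1  Y     MUL
  MUL2  Y     DIV
  Register result status Qi: no entries

Slide 42:
  Name  Busy  Op   Vj     Vk    Qj    Qk
  ADD1  Y     SUB  (LD1)  -     -     LD2
  ADD2  Y     ADD  -      -     ADD1  LD2
  MUL1  Y     MUL  -      (F4)  LD2   -
  Register result status Qi: F0 = MUL1, F2 = LD2, F6 = ADD2, F8 = ADD1, F10 = MUL2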
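Reservation station snapshots for the Tomasulo example, continued. A dash marks an empty field; values in parentheses are operand values, bare names are tags.
[The MUL2 row is not recoverable for slides 43 to 46.]

Slide 43:
  Name  Busy  Op   Vj     Vk     Qj    Qk
  ADD1  Y     SUB  (LD1)  (LD2)  -     -
  ADD2  Y     ADD  -      (LD2)  ADD1  -
  MUL1  Y     MUL  (LD2)  (F4)   -     -
  Register result status Qi: F0 = MUL1, F6 = ADD2, F8 = ADD1, F10 = MUL2

Slides 44 and 45:
  Name  Busy  Op   Vj      Vk     Qj  Qk
  ADD1  N
  ADD2  Y     ADD  (ADD1)  (LD2)  -   -
  MUL1  Y     MUL  (LD2)   (F4)   -   -
  Register result status Qi: F0 = MUL1, F6 = ADD2, F10 = MUL2

Slide 46:
  Name  Busy  Op   Vj     Vk    Qj  Qk
  ADD1  N
  ADD2  N
  MUL1  Y     MUL  (LD2)  (F4)  -   -
  Register result status Qi: F0 = MUL1, F10 = MUL2

Slide 47:
  Name  Busy  Op   Vj      Vk     Qj  Qk
  ADD1  N
  ADD2  N
  MUL1  N
  MUL2  Y     DIV  (MUL1)  (LD1)  -   -
  Register result status Qi: F10 = MUL2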
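The issue step of Tomasulo's scheme (source renaming by tag lookup, destination renaming by updating Qi) can be sketched so that it rebuilds the reservation station contents of the walkthrough at the point where the second load, LD2, is still pending. This is a sketch under stated assumptions: the `issue` helper and the dictionary layout are hypothetical, operand values are kept as symbolic strings such as "(LD1)" in the style of the slides, and the assignment of instructions to stations (SUB to ADD1, ADD to ADD2, MUL to MUL1, DIV to MUL2) and the operand registers are read from the tables in this walkthrough.

```python
def issue(stations, reg_status, reg_values, name, op, dest, src_j, src_k):
    """Issue one instruction to station `name`; False on a structural hazard."""
    if stations[name] is not None:
        return False
    entry = {"op": op, "vj": None, "vk": None, "qj": None, "qk": None}
    for src, v_field, q_field in ((src_j, "vj", "qj"), (src_k, "vk", "qk")):
        if src in reg_status:
            # Source renaming: wait for the producing station's tag.
            entry[q_field] = reg_status[src]
        else:
            # The value is available now, so copy it into the station.
            entry[v_field] = reg_values[src]
    stations[name] = entry
    # Destination renaming: later readers of dest wait on this station.
    reg_status[dest] = name
    return True


stations = dict.fromkeys(["ADD1", "ADD2", "ADD3", "MUL1", "MUL2"])
reg_status = {"F2": "LD2"}                      # Qi: F2 awaits the second load
reg_values = {"F4": "(F4)", "F6": "(LD1)"}      # F6 already holds the first load

# Issue in program order.
issue(stations, reg_status, reg_values, "MUL1", "MUL", "F0", "F2", "F4")
issue(stations, reg_status, reg_values, "ADD1", "SUB", "F8", "F6", "F2")
issue(stations, reg_status, reg_values, "MUL2", "DIV", "F10", "F0", "F6")
issue(stations, reg_status, reg_values, "ADD2", "ADD", "F6", "F8", "F2")

for name, entry in stations.items():
    print(name, entry)
print(reg_status)
```

Because DIV copies the value of F6 into MUL2 when it issues, the later ADD that overwrites F6 cannot disturb it. This is how the scheme avoids the WAR stall that the scoreboard has to take for the same pair of instructions.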