EECS 470. Lecture 9. MIPS R10000 Case Study. Fall 2018 Jon Beaumont

MIPS R10000 Case Study Fall 2018 Jon Beaumont http://www.eecs.umich.edu/courses/eecs470 Multiprocessor SGI Origin Using MIPS R10K Many thanks to Prof. Martin and Roth of University of Pennsylvania for most of these slides. Portions developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, yson, Vijaykumar, and Wenisch of Carnegie Mellon University, Purdue University, University of Michigan, and University of Wisconsin. Slide 1

Slide 2

Announcements Final project proposal due tonight Email to staff (eecs470f18staff@umich.edu) by midnight One page: what do you want to do, when do you want to do it Has everyone signed up for proposal meetings?? HW # 3 due 10/19 (next Friday) No class 10/15 (Fall Break) Midterm on Monday 10/22 6pm (two weeks from today) Email me ASAP if you can t make it Covers everything in lecture and lab Slide 3

Readings For oday: H & P 3.11 Yeager MIPS R10K Slide 4

How to ensure precise state? Last ime Started P6 case study Slide 5

Finish up P6 case study MIPS R10K case study oday Alternate way of implementing precise-state in out-of-order machine Slide 6

Roadmap Speedup Programs Reduce Instruction Latency Parallelize Reduce number of instructions Instruction Level Parallelism hread, Process, etc. Level Parallelism Pipelining Dynamic Scheduling Superscalar Execution Scoreboarding Register Renaming Programmability omasulo s Algorithm Precise State P6 R10K Slide 7

CDB. CDB.V P6 Data Structures Map able + Regfile value R value Head Retire Dispatch op RS 1 2 V1 FU V2 ail Dispatch Insn fields and status bits ags Values Slide 8

Precise State in P6 Point of is maintaining precise state How does that work? Easy as 1,2,3 1. Wait until last good insn retires, first bad insn at head 2. Clear contents of, RS, and Map able 3. Start over Works because zero (0) means the right thing 0 in /RS entry is empty ag == 0 in Map able register is in regfile and because regfile and D$ writes take place at R Example: page fault in first stf Slide 9

P6: Cycle 9 (with precise state) ht # Insn R V S X C 1 ldf X(r1),f1 f1 [f1] c2 c3 c4 2 mulf f0,f1,f2 f2 [f2] c4 c5+ c8 h 3 stf f2,z(r1) c8 c9 4 addi r1,4,r1 r1 [r1] c5 c6 c7 5 ldf X(r1),f1 f1 [f1] c7 c8 c9 t 6 mulf f0,f1,f2 f2 c9 7 stf f2,z(r1) Map able Reg + f0 f1 f2 r1 #5+ #6 #4+ CDB PAGE FAUL V #5 [f1] Reservation Stations # FU busy op 1 2 V1 V2 1 ALU no 2 LD no 3 S yes stf #7 #6 #4.V 4 FP1 yes mulf #6 #5 [f0] CDB.V 5 FP2 no Slide 10

P6: Cycle 10 (with precise state) ht # Insn R V S X C 1 ldf X(r1),f1 f1 [f1] c2 c3 c4 2 mulf f0,f1,f2 f2 [f2] c4 c5+ c8 3 stf f2,z(r1) 4 addi r1,4,r1 5 ldf X(r1),f1 6 mulf f0,f1,f2 7 stf f2,z(r1) Map able Reg + f0 f1 f2 r1 CDB V faulting insn at head? CLEAR EVERYHING Reservation Stations # FU busy op 1 2 V1 V2 1 ALU no 2 LD no 3 S no 4 FP1 no 5 FP2 no Slide 11

P6: Cycle 11 (with precise state) ht # Insn R V S X C 1 ldf X(r1),f1 f1 [f1] c2 c3 c4 2 mulf f0,f1,f2 f2 [f2] c4 c5+ c8 ht 3 stf f2,z(r1) 4 addi r1,4,r1 5 ldf X(r1),f1 6 mulf f0,f1,f2 7 stf f2,z(r1) Map able Reg + f0 f1 f2 r1 CDB V SAR OVER (after OS fixes page fault) Reservation Stations # FU busy op 1 2 V1 V2 1 ALU no 2 LD no 3 S yes stf #3 [f4] [r1] 4 FP1 no 5 FP2 no Slide 12

P6: Cycle 12 (with precise state) ht # Insn R V S X C 1 ldf X(r1),f1 f1 [f1] c2 c3 c4 2 mulf f0,f1,f2 f2 [f2] c4 c5+ c8 h 3 stf f2,z(r1) c12 t 4 addi r1,4,r1 r1 5 ldf X(r1),f1 6 mulf f0,f1,f2 7 stf f2,z(r1) Map able Reg + f0 f1 f2 r1 #4 CDB V Reservation Stations # FU busy op 1 2 V1 V2 1 ALU yes addi #4 [r1] 2 LD no 3 S yes stf #3 [f4] [r1] 4 FP1 no 5 FP2 no Slide 13

P6 Performance In other words: what is the cost of precise state? + In general: same performance as plain omasulo is not a performance device Maybe a little better (RS freed earlier fewer struct hazards) Unless is too small In which case struct hazards become a problem Rules of thumb for size At least N (width) * number of pipe stages between D and R At least N * t hit-l2 Can add a factor of 2 to both if you want What is the rationale behind these? Slide 14

P6 (omasulo+) Redux Popular design for a while (Relatively) easy to implement correctly Anything goes wrong (mispredicted branch, fault, interrupt)? Just clear everything and start again Examples: Intel PentiumPro, IBM/Motorola PowerPC, AMD K6 Actually making a comeback Examples: Intel PentiumM But went away for a while, why? Slide 15

CDB. CDB.V he Problem with P6 Map able + Regfile value R value Head Retire Dispatch op RS 1 2 V1 FU V2 ail Dispatch Problem for high performance implementations oo much value movement (regfile/ RS regfile) Multi-input muxes, long buses complicate routing and slow clock Slide 16

CDB. MIPS R10K: Alternative Implementation Map able + R old Head Retire value Dispatch op RS 1+ 2+ Arch. Map Free List ail Dispatch FU One big physical register file holds all data - no copies + Register file close to FUs small fast data path and RS on the side used only for control and tags Slide 17

Register Renaming in R10K Architectural register file? Gone Physical register file holds all values #physical registers = #architectural registers + # entries Map architectural registers to physical registers Removes WAW, WAR hazards (physical registers replace RS copies) Fundamental change to map table Mappings cannot be 0 (there is no architectural register file) Free list keeps track of unallocated physical registers is responsible for returning physical registers to free list Conceptually, this is true register renaming Have already seen an example Slide 18

Register Renaming Example Parameters Names: r1,r2,r3 Locations: p1,p2,p3,p4,p5,p6,p7 Original mapping: r1 p1, r2 p2, r3 p3, p4 p7 are free Mapable FreeList Raw insns Renamed insns r1 r2 r3 p1 p2 p3 p4,p5,p6,p7 add r2,r3,r1 add p2,p3,p4 p4 p2 p3 p5,p6,p7 sub r2,r1,r3 sub p2,p4,p5 p4 p2 p5 p6,p7 mul r2,r3,r1 mul p2,p5,p6 p6 p2 p5 p7 div r1,r3,r2 div p6,p5,p7 Question: how is the insn after div renamed? We are out of free locations (physical registers) Real question: how/when are physical registers freed? Slide 19

Freeing Registers in P6 and R10K P6 No need to free storage for speculative ( in-flight ) values explicitly emporary storage comes with entry R: copy speculative value from to register file, free entry R10K Can t free physical register when insn retires No architectural register to copy value to But Can free physical register previously mapped to same logical register Why? All insns that will ever read its value have retired Slide 20

Freeing Registers in R10K Mapable FreeList Raw insns Renamed insns r1 r2 r3 p1 p2 p3 p4,p5,p6,p7 add r2,r3,r1 add p2,p3,p4 p4 p2 p3 p5,p6,p7 sub r2,r1,r3 sub p2,p4,p5 p4 p2 p5 p6,p7 mul r2,r3,r1 mul p2,p5,p6 p6 p2 p5 p7 div r1,r3,r2 div p6,p5,p7 When add retires, free p1 When sub retires, free p3 When mul retires, free? When div retires, free? See the pattern? Slide 21

R10K Data Structures New tags (again) P6: # R10K: PR# : physical register corresponding to insn s logical output old: physical register previously mapped to insn s logical output RS, 1, 2: output, input physical registers Map able +: PR# (never empty) + ready bit Architectural Map able : PR# (never empty) Free List : PR# No values in, RS, or on CDB Slide 22

R10K Data Structures ht # Insn old S X C 1 ldf X(r1),f1 2 mulf f0,f1,f2 3 stf f2,z(r1) 4 addi r1,4,r1 5 ldf X(r1),f1 6 mulf f0,f1,f2 7 stf f2,z(r1) Reservation Stations # FU busy op 1 2 1 ALU no 2 LD no 3 S no 4 FP1 no 5 FP2 no Map able Reg + f0 f1 f2 r1 PR#1+ PR#2+ PR#3+ PR#4+ Free List PR#5,PR#6, PR#7,PR#8 Arch. Map Reg + f0 f1 f2 r1 CDB Notice I: no values anywhere PR#1 PR#2 PR#3 PR#4 Notice II: Mapable is never empty Slide 23

R10K Pipeline R10K pipeline structure: F, D, S, X, C, R D (dispatch) Structural hazard (RS,, LSQ, physical registers)? stall Allocate RS,, LSQ entries and new physical register () Record previously mapped physical register (old) C (complete) Write destination physical register R (retire) head not complete? Stall Handle any exceptions Store write LSQ head to D$ Free, LSQ entries Free previous physical register (old) Record committed physical register () Slide 24

CDB. R10K Dispatch (D) Map able + R old Head Retire value Dispatch op RS 1+ 2+ Arch. Map Free List ail Dispatch FU Read preg (physical register) tags for input registers, store in RS Read preg tag for output register, store in (old) Allocate new preg (free list) for output register, store in RS,, Map able Slide 25

CDB. R10K Complete (C) Map able + R old Head Retire value Dispatch op RS 1+ 2+ Arch. Map Free List ail Dispatch FU Set insn s output register ready bit in map table Set ready bits for matching input tags in RS Slide 26

CDB. R10K Retire (R) Map able + R old Head Retire value Dispatch op RS 1+ 2+ Arch. Map Free List ail Dispatch FU Return old of head to free list Record of head in architectural map table Slide 27

R10K: Cycle 1 ht # Insn old S X C ht 1 ldf X(r1),f1 2 mulf f0,f1,f2 3 stf f2,z(r1) 4 addi r1,4,r1 5 ldf X(r1),f1 6 mulf f0,f1,f2 7 stf f2,z(r1) PR#5 PR#2 Reservation Stations # FU busy op 1 2 1 ALU no 2 LD yes ldf PR#5 PR#4+ 3 S no 4 FP1 no 5 FP2 no Map able Reg + f0 f1 f2 r1 PR#1+ PR#5 PR#3+ PR#4+ Free List PR#5,PR#6, PR#7,PR#8 Arch. Map Reg + f0 f1 f2 r1 PR#1 PR#2 PR#3 PR#4 CDB Allocate new preg (PR#5) to f1 Remember old preg mapped to f1 (PR#2) in Slide 28

R10K: Cycle 2 ht # Insn old S X C h 1 ldf X(r1),f1 PR#5 PR#2 c2 t 2 mulf f0,f1,f2 PR#6 PR#3 3 stf f2,z(r1) 4 addi r1,4,r1 5 ldf X(r1),f1 6 mulf f0,f1,f2 7 stf f2,z(r1) Reservation Stations # FU busy op 1 2 1 ALU no 2 LD yes ldf PR#5 PR#4+ 3 S no 4 FP1 yes mulf PR#6 PR#1+ PR#5 5 FP2 no Map able Reg + f0 f1 f2 r1 PR#1+ PR#5 PR#6 PR#4+ Free List PR#6,PR#7, PR#8 Arch. Map Reg + f0 f1 f2 r1 PR#1 PR#2 PR#3 PR#4 CDB Allocate new preg (PR#6) to f2 Remember old preg mapped to f3 (PR#3) in Slide 29

R10K: Cycle 3 ht # Insn old S X C h 1 ldf X(r1),f1 PR#5 PR#2 c2 c3 t 2 mulf f0,f1,f2 PR#6 PR#3 3 stf f2,z(r1) 4 addi r1,4,r1 5 ldf X(r1),f1 6 mulf f0,f1,f2 7 stf f2,z(r1) Reservation Stations # FU busy op 1 2 1 ALU no 2 LD no 3 S yes stf PR#6 PR#4+ 4 FP1 yes mulf PR#6 PR#1+ PR#5 5 FP2 no Map able Reg + f0 f1 f2 r1 PR#1+ PR#5 PR#6 PR#4+ Free List PR#7,PR#8, PR#9 Stores are not allocated pregs Free Arch. Map Reg + f0 f1 f2 r1 PR#1 PR#2 PR#3 PR#4 CDB Slide 30

R10K: Cycle 4 ht # Insn old S X C h 1 ldf X(r1),f1 PR#5 PR#2 c2 c3 c4 2 mulf f0,f1,f2 PR#6 PR#3 c4 3 stf f2,z(r1) t 4 addi r1,4,r1 PR#7 PR#4 5 ldf X(r1),f1 6 mulf f0,f1,f2 7 stf f2,z(r1) Reservation Stations # FU busy op 1 2 1 ALU yes addi PR#7 PR#4+ 2 LD no 3 S yes stf PR#6 PR#4+ 4 FP1 yes mulf PR#6 PR#1+ PR#5+ 5 FP2 no Map able Reg + f0 f1 f2 r1 PR#1+ PR#5+ PR#6 PR#7 Free List PR#7,PR#8, PR#9 Arch. Map Reg + f0 f1 f2 r1 PR#1 PR#2 PR#3 PR#4 ldf completes set Mapable ready bit CDB PR#5 Match PR#5 tag from CDB & issue Slide 31

R10K: Cycle 5 ht # Insn old S X C 1 ldf X(r1),f1 PR#5 PR#2 c2 c3 c4 h 2 mulf f0,f1,f2 PR#6 PR#3 c4 c5 3 stf f2,z(r1) 4 addi r1,4,r1 PR#7 PR#4 c5 t 5 ldf X(r1),f1 PR#8 PR#5 6 mulf f0,f1,f2 7 stf f2,z(r1) Reservation Stations # FU busy op 1 2 1 ALU yes addi PR#7 PR#4+ 2 LD yes ldf PR#8 PR#7 3 S yes stf PR#6 PR#4+ 4 FP1 no 5 FP2 no Free Map able Reg + f0 f1 f2 r1 PR#1+ PR#8 PR#6 PR#7 Free List PR#8,PR#2, PR#9 Arch. Map Reg + f0 f1 f2 r1 PR#1 PR#5 PR#3 PR#4 CDB ldf retires Return PR#2 to free list Record PR#5 in Arch map Slide 32

Precise State in R10K Problem with R10K design? Precise state has more overhead Keep second (non-speculative) map table (architectural map table) which is only updated on retirement On exception or mispredict, copy architectural map table into map table Also need architectural free list? Alternatively, serially rollback using, old fields ± Slow, but less hardware Slide 33

R10K: Cycle 5 (with precise state) ht # Insn old S X C 1 ldf X(r1),f1 PR#5 PR#2 c2 c3 c4 h 2 mulf f0,f1,f2 PR#6 PR#3 c4 c5 3 stf f2,z(r1) 4 addi r1,4,r1 PR#7 PR#4 c5 t 5 ldf X(r1),f1 PR#8 PR#5 6 mulf f0,f1,f2 7 stf f2,z(r1) Reservation Stations # FU busy op 1 2 1 ALU yes addi PR#7 PR#4+ 2 LD yes ldf PR#8 PR#7 3 S yes stf PR#6 PR#4+ 4 FP1 no 5 FP2 no Map able Reg + f0 f1 f2 r1 PR#1+ PR#8 PR#6 PR#7 Free List PR#8,PR#2, PR#9 CDB undo insns 3-5 (doesn t matter why) use serial rollback Slide 34

R10K: Cycle 6 (with precise state) ht # Insn old S X C 1 ldf X(r1),f1 PR#5 PR#2 c2 c3 c4 h 2 mulf f0,f1,f2 PR#6 PR#3 c4 c5 3 stf f2,z(r1) t 4 addi r1,4,r1 PR#7 PR#4 c5 5 ldf X(r1),f1 PR#8 PR#5 6 mulf f0,f1,f2 7 stf f2,z(r1) Reservation Stations # FU busy op 1 2 1 ALU yes addi PR#7 PR#4+ 2 LD no 3 S yes stf PR#6 PR#4+ 4 FP1 no 5 FP2 no Map able Reg + f0 f1 f2 r1 PR#1+ PR#5+ PR#6 PR#7 Free List PR#2,PR#8 PR#9 CDB undo ldf (#5) 1. free RS 2. free (PR#8), return to FreeList 3. restore M[f1] to old (PR#5) 4. free #5 insns may execute during rollback (not shown) Slide 35

R10K: Cycle 7 (with precise state) ht # Insn old S X C 1 ldf X(r1),f1 PR#5 PR#2 c2 c3 c4 h 2 mulf f0,f1,f2 PR#6 PR#3 c4 c5 t 3 stf f2,z(r1) 4 addi r1,4,r1 PR#7 PR#4 c5 5 ldf X(r1),f1 6 mulf f0,f1,f2 7 stf f2,z(r1) Reservation Stations # FU busy op 1 2 1 ALU no 2 LD no 3 S yes stf PR#6 PR#4+ 4 FP1 no 5 FP2 no Map able Reg + f0 f1 f2 r1 PR#1+ PR#5+ PR#6 PR#4+ Free List PR#2,PR#8, PR#7, PR#9 CDB undo addi (#4) 1. free RS 2. free (PR#7), return to FreeList 3. restore M[r1] to old (PR#4) 4. free #4 Slide 36

R10K: Cycle 8 (with precise state) ht # Insn old S X C 1 ldf X(r1),f1 PR#5 PR#2 c2 c3 c4 ht 2 mulf f0,f1,f2 PR#6 PR#3 c4 c5 3 stf f2,z(r1) 4 addi r1,4,r1 5 ldf X(r1),f1 6 mulf f0,f1,f2 7 stf f2,z(r1) Reservation Stations # FU busy op 1 2 1 ALU no 2 LD no 3 S no 4 FP1 no 5 FP2 no Map able Reg + f0 f1 f2 r1 PR#1+ PR#5+ PR#6 PR#4+ Free List PR#2,PR#8, PR#7, PR#9 CDB undo stf (#3) 1. free RS 2. free #3 3. no registers to restore/free 4. how is D$ write undone? Slide 37

Can we do better? Early Branch Resolution Recover from branch mispredicts before retirement Maintain a stack of map-table-checkpoints for each branch, or branch stack Keeps track of architectural state before branch executes New structural hazard if checkpoint space runs out Discuss more in a few weeks Branch Stack Recovery PC &LSQ tail BP repair + Free list Slide 38

P6 vs. R10K (Renaming) Feature P6 R10K Value storage ARF,,RS PRF Register read @D: ARF/ RS @S: PRF FU Register write @R: ARF @C: FU PRF Speculative value free @R: automatic () @R: overwriting insn Data paths ARF/ RS PRF FU RS FU FU PRF R10K-style became popular in late 90 s, early 00 s E.g., MIPS R10K (duh), DEC Alpha 21264, Intel Pentium4 P6-style is perhaps making a comeback Why? Frequency (power) is on the retreat, simplicity is important FU ARF Precise state Simple: clear everything Complex: serial/checkpoint Slide 39

Summary Modern dynamic scheduling must support precise state A software sanity issue, not a performance issue Strategy: Writeback Complete (OoO) + Retire (io) As an added benefit, we can do speculative execution with same mechanism wo basic designs P6: omasulo + re-order buffer, copy based register renaming ± Precise state is simple, but fast implementations are difficult R10K: implements true register renaming ± Easier fast implementations, but precise state is more complex Slide 40

Dynamic Scheduling Summary Out-of-order execution: a performance technique Easier/more effective in hardware than software (isn t everything?) Idea: make scheduling transparent to software Feature I: Dynamic scheduling (io OoO) Performance piece: re-arrange insns into high-performance order Decode (io) dispatch (io) + issue (OoO) wo algorithms: Scoreboard, omasulo Feature II: Precise state (OoO io) Correctness piece: put insns back into program order Writeback (OoO) complete (OoO) + retire (io) wo designs: P6, R10K Next: memory scheduling Slide 41