EECS 470. Lecture 9. MIPS R10000 Case Study. Fall 2018 Jon Beaumont


MIPS R10000 Case Study
Fall 2018, Jon Beaumont
http://www.eecs.umich.edu/courses/eecs470
[Image: Multiprocessor SGI Origin using MIPS R10K]
Many thanks to Profs. Martin and Roth of the University of Pennsylvania for most of these slides. Portions developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, Vijaykumar, and Wenisch of Carnegie Mellon University, Purdue University, University of Michigan, and University of Wisconsin.
Slide 1

Slide 2

Announcements
  Final project proposal due tonight
    Email to staff (eecs470f18staff@umich.edu) by midnight
    One page: what do you want to do, when do you want to do it
    Has everyone signed up for proposal meetings?
  HW #3 due 10/19 (next Friday)
  No class 10/15 (Fall Break)
  Midterm on Monday 10/22, 6pm (two weeks from today)
    Email me ASAP if you can't make it
    Covers everything in lecture and lab
Slide 3

Readings
  For Today:
    H & P 3.11
    Yeager, MIPS R10K
Slide 4

Last Time
  How to ensure precise state?
  Started P6 case study
Slide 5

Today
  Finish up P6 case study
  MIPS R10K case study
    Alternate way of implementing precise state in an out-of-order machine
Slide 6

Roadmap
[Diagram; node labels in reading order: Speedup Programs; Reduce Instruction Latency; Parallelize; Reduce number of instructions; Instruction Level Parallelism; Thread, Process, etc. Level Parallelism; Pipelining; Dynamic Scheduling; Superscalar Execution; Scoreboarding; Register Renaming; Programmability; Tomasulo's Algorithm; Precise State; P6; R10K]
Slide 7

P6 Data Structures
[Diagram; components: Map Table (T+); Regfile (value); ROB (R, value; Head = Retire, Tail = Dispatch); RS (op, T, T1, T2, V1, V2); FU; CDB (CDB.T, CDB.V). Legend: insn fields and status bits, tags, values]
Slide 8

Precise State in P6
  Point of the ROB is maintaining precise state
  How does that work? Easy as 1, 2, 3
    1. Wait until last good insn retires, first bad insn at ROB head
    2. Clear contents of ROB, RS, and Map Table
    3. Start over
  Works because zero (0) means the right thing
    0 in ROB/RS: entry is empty
    Tag == 0 in Map Table: register is in regfile
  and because regfile and D$ writes take place at R
  Example: page fault in first stf
Slide 9
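The three recovery steps above can be sketched in a few lines of Python. This is a minimal illustration, not the P6 hardware: the function name `retire_or_recover`, the dictionary layout of a ROB entry, and the use of 0 for an empty RS entry are all assumptions made for the example.

```python
def retire_or_recover(rob, rs, map_table, regfile):
    """Retire completed insns in order; if the head has faulted, clear everything.

    rob: list of dicts, oldest first, with keys num, reg, value, done, fault
         (a hypothetical layout chosen for this sketch).
    rs: dict mapping an RS name to an insn number, or 0 when the entry is empty.
    map_table: dict mapping a register to a ROB tag, or 0 when the value is in regfile.
    Returns the number of the insn to restart from, or None if nothing faulted.
    """
    # Step 1: retire the good insns ahead of the fault (regfile is written at R).
    while rob and rob[0]["done"] and not rob[0]["fault"]:
        head = rob.pop(0)
        if head["reg"] is not None:
            regfile[head["reg"]] = head["value"]
    if not (rob and rob[0]["fault"]):
        return None
    # Step 2: the faulting insn is at the head, so clear ROB, RS, and map table.
    restart = rob[0]["num"]
    rob.clear()
    for name in rs:
        rs[name] = 0
    for reg in map_table:
        map_table[reg] = 0
    # Step 3: the caller restarts fetch at `restart`.
    return restart
```

Because speculative results never reach the regfile before R, zeroing the three structures is enough: every map table entry then points back at the regfile, which holds exactly the state of the retired insns.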

P6: Cycle 9 (with precise state)

ROB
ht  #  Insn            R   V     S    X    C
    1  ldf X(r1),f1    f1  [f1]  c2   c3   c4
    2  mulf f0,f1,f2   f2  [f2]  c4   c5+  c8
h   3  stf f2,z(r1)              c8   c9
    4  addi r1,4,r1    r1  [r1]  c5   c6   c7
    5  ldf X(r1),f1    f1  [f1]  c7   c8   c9
t   6  mulf f0,f1,f2   f2        c9
    7  stf f2,z(r1)

Map Table
Reg  T+
f0
f1   ROB#5+
f2   ROB#6
r1   ROB#4+

CDB
T      V
ROB#5  [f1]

Reservation Stations
#  FU   busy  op    T      T1     T2     V1    V2
1  ALU  no
2  LD   no
3  ST   yes   stf   ROB#7  ROB#6               ROB#4.V
4  FP1  yes   mulf  ROB#6         ROB#5  [f0]  CDB.V
5  FP2  no

Note: PAGE FAULT
Slide 10

P6: Cycle 10 (with precise state)

ROB
ht  #  Insn            R   V     S    X    C
    1  ldf X(r1),f1    f1  [f1]  c2   c3   c4
    2  mulf f0,f1,f2   f2  [f2]  c4   c5+  c8
    3  stf f2,z(r1)
    4  addi r1,4,r1
    5  ldf X(r1),f1
    6  mulf f0,f1,f2
    7  stf f2,z(r1)

Map Table
Reg  T+
f0
f1
f2
r1

CDB
T  V
(empty)

Reservation Stations
#  FU   busy  op  T  T1  T2  V1  V2
1  ALU  no
2  LD   no
3  ST   no
4  FP1  no
5  FP2  no

Note: faulting insn at ROB head? CLEAR EVERYTHING
Slide 11

P6: Cycle 11 (with precise state)

ROB
ht  #  Insn            R   V     S    X    C
    1  ldf X(r1),f1    f1  [f1]  c2   c3   c4
    2  mulf f0,f1,f2   f2  [f2]  c4   c5+  c8
ht  3  stf f2,z(r1)
    4  addi r1,4,r1
    5  ldf X(r1),f1
    6  mulf f0,f1,f2
    7  stf f2,z(r1)

Map Table
Reg  T+
f0
f1
f2
r1

CDB
T  V
(empty)

Reservation Stations
#  FU   busy  op   T      T1  T2  V1    V2
1  ALU  no
2  LD   no
3  ST   yes   stf  ROB#3          [f2]  [r1]
4  FP1  no
5  FP2  no

Note: START OVER (after OS fixes page fault)
Slide 12

P6: Cycle 12 (with precise state)

ROB
ht  #  Insn            R   V     S    X    C
    1  ldf X(r1),f1    f1  [f1]  c2   c3   c4
    2  mulf f0,f1,f2   f2  [f2]  c4   c5+  c8
h   3  stf f2,z(r1)              c12
t   4  addi r1,4,r1    r1
    5  ldf X(r1),f1
    6  mulf f0,f1,f2
    7  stf f2,z(r1)

Map Table
Reg  T+
f0
f1
f2
r1   ROB#4

CDB
T  V
(empty)

Reservation Stations
#  FU   busy  op    T      T1  T2  V1    V2
1  ALU  yes   addi  ROB#4          [r1]
2  LD   no
3  ST   yes   stf   ROB#3          [f2]  [r1]
4  FP1  no
5  FP2  no
Slide 13

P6 Performance
  In other words: what is the cost of precise state?
  + In general: same performance as plain Tomasulo
      ROB is not a performance device
      Maybe a little better (RS freed earlier -> fewer struct hazards)
  Unless ROB is too small
      In which case ROB struct hazards become a problem
  Rules of thumb for ROB size
    At least N (width) * number of pipe stages between D and R
    At least N * t_hit-L2
    Can add a factor of 2 to both if you want
  What is the rationale behind these?
Slide 14
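The two sizing rules reduce to simple arithmetic. The sketch below takes the larger of the two bounds; the function name, parameter names, and the sample numbers are hypothetical and not taken from any real design. One plausible reading of the rationale: a ROB entry is held from D to R, so a machine that dispatches N insns per cycle has N * (stages between D and R) insns in flight even with no stalls, and it needs roughly N * t_hit-L2 entries to keep dispatching while the head waits on a load that hits in L2.

```python
def min_rob_entries(width, stages_d_to_r, l2_hit_cycles, safety_factor=1):
    """Lower bound on ROB size from the two rules of thumb (illustrative only)."""
    by_pipeline = width * stages_d_to_r   # insns in flight between D and R
    by_l2_hit = width * l2_hit_cycles     # insns dispatched during one L2 hit
    return safety_factor * max(by_pipeline, by_l2_hit)


# Hypothetical 4-wide machine, 6 stages from D to R, 10-cycle L2 hit.
baseline = min_rob_entries(4, 6, 10)        # 40 entries
padded = min_rob_entries(4, 6, 10, 2)       # 80 entries with the factor of 2
```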

P6 (Tomasulo+ROB) Redux
  Popular design for a while
    (Relatively) easy to implement correctly
    Anything goes wrong (mispredicted branch, fault, interrupt)?
      Just clear everything and start again
    Examples: Intel PentiumPro, IBM/Motorola PowerPC, AMD K6
  Actually making a comeback
    Examples: Intel PentiumM
  But went away for a while, why?
Slide 15

The Problem with P6
[Diagram; components: Map Table (T+); Regfile (value); ROB (R, value; Head = Retire, Tail = Dispatch); RS (op, T, T1, T2, V1, V2); FU; CDB (CDB.T, CDB.V)]
  Problem for high performance implementations
    Too much value movement (regfile/ROB -> RS -> ROB -> regfile)
    Multi-input muxes, long buses complicate routing and slow clock
Slide 16

MIPS R10K: Alternative Implementation
[Diagram; components: Map Table (T+); register file (R, value); ROB (T, Told; Head = Retire, Tail = Dispatch); RS (op, T, T1+, T2+); Arch. Map; Free List; FU; CDB (CDB.T)]
  One big physical register file holds all data, no copies
  + Register file close to FUs -> small fast data path
  ROB and RS on the side, used only for control and tags
Slide 17

Register Renaming in R10K
  Architectural register file? Gone
  Physical register file holds all values
    #physical registers = #architectural registers + #ROB entries
    Map architectural registers to physical registers
    Removes WAW, WAR hazards (physical registers replace RS copies)
  Fundamental change to map table
    Mappings cannot be 0 (there is no architectural register file)
  Free list keeps track of unallocated physical registers
    ROB is responsible for returning physical registers to free list
  Conceptually, this is true register renaming
    Have already seen an example
Slide 18

Register Renaming Example
  Parameters
    Names: r1, r2, r3
    Locations: p1, p2, p3, p4, p5, p6, p7
    Original mapping: r1 -> p1, r2 -> p2, r3 -> p3; p4 to p7 are free

MapTable        FreeList      Raw insns     Renamed insns
r1  r2  r3
p1  p2  p3      p4,p5,p6,p7   add r2,r3,r1  add p2,p3,p4
p4  p2  p3      p5,p6,p7      sub r2,r1,r3  sub p2,p4,p5
p4  p2  p5      p6,p7         mul r2,r3,r1  mul p2,p5,p6
p6  p2  p5      p7            div r1,r3,r2  div p6,p5,p7

  Question: how is the insn after div renamed?
    We are out of free locations (physical registers)
    Real question: how/when are physical registers freed?
Slide 19
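The renaming table above can be reproduced with a short function. This is a sketch under simple assumptions: insns are `(op, src1, src2, dst)` tuples, the free list is consumed from the front, and running out of physical registers is reported as an exception rather than a pipeline stall. The name `rename` is chosen for the example.

```python
def rename(insns, map_table, free_list):
    """Rename (op, src1, src2, dst) insns; returns renamed insns and final state."""
    map_table, free_list = dict(map_table), list(free_list)
    renamed = []
    for op, src1, src2, dst in insns:
        # Look up the sources first: an insn may read the register it overwrites.
        p1, p2 = map_table[src1], map_table[src2]
        if not free_list:
            raise RuntimeError("structural hazard: no free physical register")
        pd = free_list.pop(0)      # allocate a new location for the output
        map_table[dst] = pd        # later readers of dst now see pd
        renamed.append((op, p1, p2, pd))
    return renamed, map_table, free_list


insns = [("add", "r2", "r3", "r1"), ("sub", "r2", "r1", "r3"),
         ("mul", "r2", "r3", "r1"), ("div", "r1", "r3", "r2")]
renamed, final_map, final_free = rename(
    insns, {"r1": "p1", "r2": "p2", "r3": "p3"}, ["p4", "p5", "p6", "p7"])
```

After the four insns the free list is empty, so renaming a fifth insn with a register output raises the error, which is the situation the slide's question points at.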

Freeing Registers in P6 and R10K
  P6
    No need to free storage for speculative (in-flight) values explicitly
    Temporary storage comes with ROB entry
    R: copy speculative value from ROB to register file, free ROB entry
  R10K
    Can't free physical register when insn retires
      No architectural register to copy value to
    But...
    Can free physical register previously mapped to same logical register
      Why? All insns that will ever read its value have retired
Slide 20

Freeing Registers in R10K

MapTable        FreeList      Raw insns     Renamed insns
r1  r2  r3
p1  p2  p3      p4,p5,p6,p7   add r2,r3,r1  add p2,p3,p4
p4  p2  p3      p5,p6,p7      sub r2,r1,r3  sub p2,p4,p5
p4  p2  p5      p6,p7         mul r2,r3,r1  mul p2,p5,p6
p6  p2  p5      p7            div r1,r3,r2  div p6,p5,p7

  When add retires, free p1
  When sub retires, free p3
  When mul retires, free?
  When div retires, free?
  See the pattern?
Slide 21
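The pattern can be checked mechanically: at rename time, record the physical register previously mapped to the insn's logical output, and return that register to the free list when the insn retires. The sketch below assumes the same `(op, src1, src2, dst)` tuples as the table; the function names and the ROB entry layout are hypothetical.

```python
def rename_with_told(insns, map_table, free_list):
    """Rename insns, recording for each the old mapping of its output register."""
    map_table, free_list = dict(map_table), list(free_list)
    rob = []
    for op, _src1, _src2, dst in insns:
        told = map_table[dst]          # register previously mapped to dst
        t = free_list.pop(0)           # newly allocated register for dst
        map_table[dst] = t
        rob.append({"op": op, "T": t, "Told": told})
    return rob, free_list


def retire_in_order(rob, free_list):
    """Retire every ROB entry in program order, freeing Told each time."""
    freed = []
    for entry in rob:
        free_list.append(entry["Told"])
        freed.append((entry["op"], entry["Told"]))
    return freed


insns = [("add", "r2", "r3", "r1"), ("sub", "r2", "r1", "r3"),
         ("mul", "r2", "r3", "r1"), ("div", "r1", "r3", "r2")]
rob, free_list = rename_with_told(
    insns, {"r1": "p1", "r2": "p2", "r3": "p3"}, ["p4", "p5", "p6", "p7"])
freed = retire_in_order(rob, free_list)
```

Under these assumptions, retiring mul frees p4 (the register that add allocated for r1) and retiring div frees p2: each insn frees the location that its own output made obsolete.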

R10K Data Structures
  New tags (again)
    P6: ROB# -> R10K: PR#
  ROB
    T: physical register corresponding to insn's logical output
    Told: physical register previously mapped to insn's logical output
  RS
    T, T1, T2: output, input physical registers
  Map Table
    T+: PR# (never empty) + ready bit
  Architectural Map Table
    T: PR# (never empty)
  Free List
    T: PR#
  No values in ROB, RS, or on CDB
Slide 22

R10K Data Structures

ROB
ht  #  Insn            T     Told  S   X   C
    1  ldf X(r1),f1
    2  mulf f0,f1,f2
    3  stf f2,z(r1)
    4  addi r1,4,r1
    5  ldf X(r1),f1
    6  mulf f0,f1,f2
    7  stf f2,z(r1)

Reservation Stations
#  FU   busy  op  T  T1  T2
1  ALU  no
2  LD   no
3  ST   no
4  FP1  no
5  FP2  no

Reg  Map Table T+  Arch. Map T+
f0   PR#1+         PR#1
f1   PR#2+         PR#2
f2   PR#3+         PR#3
r1   PR#4+         PR#4

Free List: PR#5, PR#6, PR#7, PR#8
CDB T: (empty)

  Notice I: no values anywhere
  Notice II: MapTable is never empty
Slide 23

R10K Pipeline
  R10K pipeline structure: F, D, S, X, C, R
  D (dispatch)
    Structural hazard (RS, ROB, LSQ, physical registers)? Stall
    Allocate RS, ROB, LSQ entries and new physical register (T)
    Record previously mapped physical register (Told)
  C (complete)
    Write destination physical register
  R (retire)
    ROB head not complete? Stall
    Handle any exceptions
    Store: write LSQ head to D$
    Free ROB, LSQ entries
    Free previous physical register (Told)
    Record committed physical register (T)
Slide 24

R10K Dispatch (D)
[Diagram; components: Map Table (T+); register file (R, value); ROB (T, Told; Head = Retire, Tail = Dispatch); RS (op, T, T1+, T2+); Arch. Map; Free List; FU; CDB (CDB.T)]
  Read preg (physical register) tags for input registers, store in RS
  Read preg tag for output register, store in ROB (Told)
  Allocate new preg (free list) for output register, store in RS, ROB, Map Table
Slide 25

R10K Complete (C)
[Diagram; same components as on the Dispatch slide]
  Set insn's output register ready bit in map table
  Set ready bits for matching input tags in RS
Slide 26

R10K Retire (R)
[Diagram; same components as on the Dispatch slide]
  Return Told of ROB head to free list
  Record T of ROB head in architectural map table
Slide 27
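The bookkeeping done at D, C, and R can be sketched as a small Python class. It is a simplified model built on assumptions: it tracks only tags (map table, ready bits, ROB, free list, architectural map), leaves out the RS, LSQ, and exceptions, and signals a dispatch stall with an exception. The class name `R10K`, its method names, and the entry layout are invented for the example, and physical registers are numbered from 1 to mirror the PR# notation.

```python
from collections import deque


class R10K:
    """Tag-only model of R10K-style renaming (illustrative, not the real design)."""

    def __init__(self, arch_regs, num_pregs):
        # Logical register i starts out mapped to PR#(i+1); all values are ready.
        self.map_table = {r: i + 1 for i, r in enumerate(arch_regs)}
        self.ready = {p: True for p in range(1, num_pregs + 1)}
        self.arch_map = dict(self.map_table)
        self.free_list = deque(range(len(arch_regs) + 1, num_pregs + 1))
        self.rob = deque()  # head (oldest insn) on the left

    def dispatch(self, srcs, dst):
        """D: read input tags, remember Told, allocate T from the free list."""
        if dst is not None and not self.free_list:
            raise RuntimeError("no free physical register: stall dispatch")
        src_tags = [(self.map_table[s], self.ready[self.map_table[s]]) for s in srcs]
        entry = {"dst": dst, "T": None, "Told": None, "done": False}
        if dst is not None:
            entry["Told"] = self.map_table[dst]
            entry["T"] = self.free_list.popleft()
            self.map_table[dst] = entry["T"]
            self.ready[entry["T"]] = False
        self.rob.append(entry)
        return entry, src_tags

    def complete(self, entry):
        """C: mark the insn done and set the ready bit of its output register."""
        entry["done"] = True
        if entry["T"] is not None:
            self.ready[entry["T"]] = True

    def retire(self):
        """R: if the head is complete, free Told and record T as committed."""
        if not self.rob or not self.rob[0]["done"]:
            return None  # stall
        head = self.rob.popleft()
        if head["T"] is not None:
            self.free_list.append(head["Told"])
            self.arch_map[head["dst"]] = head["T"]
        return head
```

With four logical registers and eight physical registers, dispatching `ldf X(r1),f1` in this model allocates PR#5 and records PR#2 as Told; retiring it later returns PR#2 to the free list and records PR#5 in the architectural map.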

R10K: Cycle 1

ROB
ht  #  Insn            T     Told  S   X   C
ht  1  ldf X(r1),f1    PR#5  PR#2
    2  mulf f0,f1,f2
    3  stf f2,z(r1)
    4  addi r1,4,r1
    5  ldf X(r1),f1
    6  mulf f0,f1,f2
    7  stf f2,z(r1)

Reservation Stations
#  FU   busy  op   T     T1  T2
1  ALU  no
2  LD   yes   ldf  PR#5      PR#4+
3  ST   no
4  FP1  no
5  FP2  no

Reg  Map Table T+  Arch. Map T+
f0   PR#1+         PR#1
f1   PR#5          PR#2
f2   PR#3+         PR#3
r1   PR#4+         PR#4

Free List: PR#5, PR#6, PR#7, PR#8
CDB T: (empty)

  Allocate new preg (PR#5) to f1
  Remember old preg mapped to f1 (PR#2) in ROB
Slide 28

R10K: Cycle 2

ROB
ht  #  Insn            T     Told  S   X   C
h   1  ldf X(r1),f1    PR#5  PR#2  c2
t   2  mulf f0,f1,f2   PR#6  PR#3
    3  stf f2,z(r1)
    4  addi r1,4,r1
    5  ldf X(r1),f1
    6  mulf f0,f1,f2
    7  stf f2,z(r1)

Reservation Stations
#  FU   busy  op    T     T1     T2
1  ALU  no
2  LD   yes   ldf   PR#5         PR#4+
3  ST   no
4  FP1  yes   mulf  PR#6  PR#1+  PR#5
5  FP2  no

Reg  Map Table T+  Arch. Map T+
f0   PR#1+         PR#1
f1   PR#5          PR#2
f2   PR#6          PR#3
r1   PR#4+         PR#4

Free List: PR#6, PR#7, PR#8
CDB T: (empty)

  Allocate new preg (PR#6) to f2
  Remember old preg mapped to f2 (PR#3) in ROB
Slide 29

R10K: Cycle 3

ROB
ht  #  Insn            T     Told  S   X   C
h   1  ldf X(r1),f1    PR#5  PR#2  c2  c3
t   2  mulf f0,f1,f2   PR#6  PR#3
    3  stf f2,z(r1)
    4  addi r1,4,r1
    5  ldf X(r1),f1
    6  mulf f0,f1,f2
    7  stf f2,z(r1)

Reservation Stations
#  FU   busy  op    T     T1     T2
1  ALU  no
2  LD   no
3  ST   yes   stf         PR#6   PR#4+
4  FP1  yes   mulf  PR#6  PR#1+  PR#5
5  FP2  no

Reg  Map Table T+  Arch. Map T+
f0   PR#1+         PR#1
f1   PR#5          PR#2
f2   PR#6          PR#3
r1   PR#4+         PR#4

Free List: PR#7, PR#8, PR#9
CDB T: (empty)

  Stores are not allocated pregs
  Free: RS entry 2 (LD)
Slide 30

R10K: Cycle 4

ROB
ht  #  Insn            T     Told  S   X   C
h   1  ldf X(r1),f1    PR#5  PR#2  c2  c3  c4
    2  mulf f0,f1,f2   PR#6  PR#3  c4
    3  stf f2,z(r1)
t   4  addi r1,4,r1    PR#7  PR#4
    5  ldf X(r1),f1
    6  mulf f0,f1,f2
    7  stf f2,z(r1)

Reservation Stations
#  FU   busy  op    T     T1     T2
1  ALU  yes   addi  PR#7  PR#4+
2  LD   no
3  ST   yes   stf         PR#6   PR#4+
4  FP1  yes   mulf  PR#6  PR#1+  PR#5+
5  FP2  no

Reg  Map Table T+  Arch. Map T+
f0   PR#1+         PR#1
f1   PR#5+         PR#2
f2   PR#6          PR#3
r1   PR#7          PR#4

Free List: PR#7, PR#8, PR#9
CDB T: PR#5

  ldf completes: set MapTable ready bit
  Match PR#5 tag from CDB & issue
Slide 31

R10K: Cycle 5

ROB
ht  #  Insn            T     Told  S   X   C
    1  ldf X(r1),f1    PR#5  PR#2  c2  c3  c4
h   2  mulf f0,f1,f2   PR#6  PR#3  c4  c5
    3  stf f2,z(r1)
    4  addi r1,4,r1    PR#7  PR#4  c5
t   5  ldf X(r1),f1    PR#8  PR#5
    6  mulf f0,f1,f2
    7  stf f2,z(r1)

Reservation Stations
#  FU   busy  op    T     T1     T2
1  ALU  yes   addi  PR#7  PR#4+
2  LD   yes   ldf   PR#8         PR#7
3  ST   yes   stf         PR#6   PR#4+
4  FP1  no
5  FP2  no

Reg  Map Table T+  Arch. Map T+
f0   PR#1+         PR#1
f1   PR#8          PR#5
f2   PR#6          PR#3
r1   PR#7          PR#4

Free List: PR#8, PR#2, PR#9
CDB T: (empty)

  Free: RS entry 4 (FP1)
  ldf retires
    Return PR#2 to free list
    Record PR#5 in Arch map
Slide 32

Precise State in R10K
  Problem with R10K design? Precise state has more overhead
  Keep second (non-speculative) map table (architectural map table), which is only updated on retirement
    On exception or mispredict, copy architectural map table into map table
    Also need architectural free list?
  Alternatively, serially rollback using T, Told fields
    ± Slow, but less hardware
Slide 33
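The first option, copying the architectural map table into the map table, can be sketched as below. How the free list is recovered is left open on the slide; this sketch makes the assumption that every physical register not named in the architectural map is free once all speculative insns have been discarded. The function name and the 1-based register numbering are choices made for the example.

```python
def recover_from_arch_map(arch_map, num_pregs):
    """Rebuild speculative state from committed state (illustrative sketch).

    Assumes all in-flight insns are discarded, so the only live physical
    registers are the ones the architectural map points to.
    """
    map_table = dict(arch_map)                      # copy committed mappings
    live = set(arch_map.values())
    free_list = [p for p in range(1, num_pregs + 1) if p not in live]
    return map_table, free_list


# Committed state after the first ldf has retired (f1 -> PR#5), 9 pregs total.
map_table, free_list = recover_from_arch_map(
    {"f0": 1, "f1": 5, "f2": 3, "r1": 4}, num_pregs=9)
```

Recomputing the free list this way avoids keeping a separate architectural free list, at the cost of a scan over all physical registers during recovery.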

R10K: Cycle 5 (with precise state)

ROB
ht  #  Insn            T     Told  S   X   C
    1  ldf X(r1),f1    PR#5  PR#2  c2  c3  c4
h   2  mulf f0,f1,f2   PR#6  PR#3  c4  c5
    3  stf f2,z(r1)
    4  addi r1,4,r1    PR#7  PR#4  c5
t   5  ldf X(r1),f1    PR#8  PR#5
    6  mulf f0,f1,f2
    7  stf f2,z(r1)

Reservation Stations
#  FU   busy  op    T     T1     T2
1  ALU  yes   addi  PR#7  PR#4+
2  LD   yes   ldf   PR#8         PR#7
3  ST   yes   stf         PR#6   PR#4+
4  FP1  no
5  FP2  no

Map Table
Reg  T+
f0   PR#1+
f1   PR#8
f2   PR#6
r1   PR#7

Free List: PR#8, PR#2, PR#9
CDB T: (empty)

  Undo insns 3-5 (doesn't matter why)
  Use serial rollback
Slide 34

R10K: Cycle 6 (with precise state)

ROB
ht  #  Insn            T     Told  S   X   C
    1  ldf X(r1),f1    PR#5  PR#2  c2  c3  c4
h   2  mulf f0,f1,f2   PR#6  PR#3  c4  c5
    3  stf f2,z(r1)
t   4  addi r1,4,r1    PR#7  PR#4  c5
    5  ldf X(r1),f1    PR#8  PR#5
    6  mulf f0,f1,f2
    7  stf f2,z(r1)

Reservation Stations
#  FU   busy  op    T     T1     T2
1  ALU  yes   addi  PR#7  PR#4+
2  LD   no
3  ST   yes   stf         PR#6   PR#4+
4  FP1  no
5  FP2  no

Map Table
Reg  T+
f0   PR#1+
f1   PR#5+
f2   PR#6
r1   PR#7

Free List: PR#2, PR#8, PR#9
CDB T: (empty)

  Undo ldf (ROB#5)
    1. Free RS
    2. Free T (PR#8), return to FreeList
    3. Restore MT[f1] to Told (PR#5)
    4. Free ROB#5
  Insns may execute during rollback (not shown)
Slide 35

R10K: Cycle 7 (with precise state)

ROB
ht  #  Insn            T     Told  S   X   C
    1  ldf X(r1),f1    PR#5  PR#2  c2  c3  c4
h   2  mulf f0,f1,f2   PR#6  PR#3  c4  c5
t   3  stf f2,z(r1)
    4  addi r1,4,r1    PR#7  PR#4  c5
    5  ldf X(r1),f1
    6  mulf f0,f1,f2
    7  stf f2,z(r1)

Reservation Stations
#  FU   busy  op   T  T1    T2
1  ALU  no
2  LD   no
3  ST   yes   stf     PR#6  PR#4+
4  FP1  no
5  FP2  no

Map Table
Reg  T+
f0   PR#1+
f1   PR#5+
f2   PR#6
r1   PR#4+

Free List: PR#2, PR#8, PR#7, PR#9
CDB T: (empty)

  Undo addi (ROB#4)
    1. Free RS
    2. Free T (PR#7), return to FreeList
    3. Restore MT[r1] to Told (PR#4)
    4. Free ROB#4
Slide 36

R10K: Cycle 8 (with precise state)

ROB
ht  #  Insn            T     Told  S   X   C
    1  ldf X(r1),f1    PR#5  PR#2  c2  c3  c4
ht  2  mulf f0,f1,f2   PR#6  PR#3  c4  c5
    3  stf f2,z(r1)
    4  addi r1,4,r1
    5  ldf X(r1),f1
    6  mulf f0,f1,f2
    7  stf f2,z(r1)

Reservation Stations
#  FU   busy  op  T  T1  T2
1  ALU  no
2  LD   no
3  ST   no
4  FP1  no
5  FP2  no

Map Table
Reg  T+
f0   PR#1+
f1   PR#5+
f2   PR#6
r1   PR#4+

Free List: PR#2, PR#8, PR#7, PR#9
CDB T: (empty)

  Undo stf (ROB#3)
    1. Free RS
    2. Free ROB#3
    3. No registers to restore/free
    4. How is D$ write undone?
Slide 37
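The serial rollback walked through in cycles 6 to 8 can be sketched as a loop that undoes one ROB entry per step, youngest first. This is an illustrative model, not the R10K implementation: it tracks tags only (no RS or ready bits), and the function name, the `keep` parameter, and the entry layout are assumptions of the example. Returned registers are appended to the end of the free list, so their order differs from the slides even though the contents match.

```python
def serial_rollback(rob, map_table, free_list, keep):
    """Undo ROB entries from the tail until only `keep` entries remain."""
    while len(rob) > keep:
        entry = rob.pop()                     # youngest insn still in the ROB
        if entry["T"] is not None:            # stores have no output register
            free_list.append(entry["T"])      # free T, return it to the free list
            map_table[entry["dst"]] = entry["Told"]   # restore the old mapping


# State at cycle 5: insn 1 has retired; insns 2-5 are in the ROB.
rob = [
    {"dst": "f2", "T": 6, "Told": 3},        # 2: mulf f0,f1,f2
    {"dst": None, "T": None, "Told": None},  # 3: stf f2,z(r1)
    {"dst": "r1", "T": 7, "Told": 4},        # 4: addi r1,4,r1
    {"dst": "f1", "T": 8, "Told": 5},        # 5: ldf X(r1),f1
]
map_table = {"f0": 1, "f1": 8, "f2": 6, "r1": 7}
free_list = [2, 9]
serial_rollback(rob, map_table, free_list, keep=1)   # undo insns 3-5
```

One entry is undone per iteration, which corresponds to one cycle per insn in the walkthrough; that is the "slow, but less hardware" trade-off.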

Can we do better?
  Early Branch Resolution
    Recover from branch mispredicts before retirement
  Maintain a stack of map-table checkpoints for each branch, or branch stack
    Keeps track of architectural state before branch executes
    New structural hazard if checkpoint space runs out
  Discuss more in a few weeks
[Diagram; Branch Stack entry fields: Recovery PC, ROB & LSQ tail, BP repair, T+, Free list]
Slide 38
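A branch stack can be sketched as a bounded list of checkpoints, one per in-flight branch. The sketch below is hypothetical: the class and method names are invented, only a few of the entry fields are modeled (recovery PC, ROB tail, map table, free list), and it ignores registers freed by retirement between the checkpoint and the recovery, which a real design would have to account for.

```python
class BranchStack:
    """Bounded stack of map-table checkpoints (illustrative sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []   # oldest branch first

    def push(self, branch_id, recovery_pc, rob_tail, map_table, free_list):
        """Checkpoint state at a branch; False means stall (no checkpoint space)."""
        if len(self.entries) == self.capacity:
            return False
        self.entries.append({
            "branch_id": branch_id,
            "recovery_pc": recovery_pc,
            "rob_tail": rob_tail,
            "map_table": dict(map_table),   # snapshot, not a reference
            "free_list": list(free_list),
        })
        return True

    def resolve(self, branch_id, mispredicted):
        """On a mispredict, return the checkpoint and drop it and all younger ones."""
        idx = next(i for i, e in enumerate(self.entries)
                   if e["branch_id"] == branch_id)
        entry = self.entries[idx]
        if mispredicted:
            del self.entries[idx:]
            return entry
        del self.entries[idx]               # predicted correctly: just release it
        return None
```

On a mispredict the caller would copy the returned `map_table` and `free_list` back into the live structures, reset the ROB tail to `rob_tail`, and redirect fetch to `recovery_pc`, all without waiting for the branch to reach the ROB head.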

P6 vs. R10K (Renaming)

Feature                 P6                       R10K
Value storage           ARF, ROB, RS             PRF
Register read           @D: ARF/ROB -> RS        @S: PRF -> FU
Register write          @R: ROB -> ARF           @C: FU -> PRF
Speculative value free  @R: automatic (ROB)      @R: overwriting insn
Data paths              ARF/ROB -> RS            PRF -> FU
                        RS -> FU                 FU -> PRF
                        FU -> ROB
                        ROB -> ARF
Precise state           Simple: clear everything Complex: serial/checkpoint

  R10K-style became popular in the late 90's, early 00's
    E.g., MIPS R10K (duh), DEC Alpha 21264, Intel Pentium4
  P6-style is perhaps making a comeback
    Why? Frequency (power) is on the retreat, simplicity is important
Slide 39

Summary
  Modern dynamic scheduling must support precise state
    A software sanity issue, not a performance issue
  Strategy: Writeback -> Complete (OoO) + Retire (iO)
    As an added benefit, we can do speculative execution with the same mechanism
  Two basic designs
    P6: Tomasulo + re-order buffer, copy based register renaming
      ± Precise state is simple, but fast implementations are difficult
    R10K: implements true register renaming
      ± Easier fast implementations, but precise state is more complex
Slide 40

Dynamic Scheduling Summary
  Out-of-order execution: a performance technique
    Easier/more effective in hardware than software (isn't everything?)
    Idea: make scheduling transparent to software
  Feature I: Dynamic scheduling (iO -> OoO)
    Performance piece: re-arrange insns into high-performance order
    Decode (iO) -> dispatch (iO) + issue (OoO)
    Two algorithms: Scoreboard, Tomasulo
  Feature II: Precise state (OoO -> iO)
    Correctness piece: put insns back into program order
    Writeback (OoO) -> complete (OoO) + retire (iO)
    Two designs: P6, R10K
  Next: memory scheduling
Slide 41