OOO Execution & Precise State MIPS R10000 (R10K)

OOO Execution & Precise State in MIPS R10000 (R10K) Nima Honarmand

The Problem with P6 (Spring 2018 :: CSE 502)

[Figure: P6 datapath. Map Table (T+), Regfile (value), ROB (R, value; Head = Retire, Tail = Dispatch), RS (op, T, T1, T2, V1, V2), FU, CDB.T, CDB.V]

Problem for high-performance implementations:
- Too much value movement (Regfile/ROB → RS → ROB → Regfile)
- Multi-input muxes, long buses, slow clock

MIPS R10K: Alternative Implementation

[Figure: R10K datapath. Map Table (T+), ROB (T, Told; Head = Retire, Tail = Dispatch), Regfile (value), RS (op, T, T1+, T2+), Free List, FU, CDB.T]

- One big physical register file holds all data: no copies
  + Register file close to FUs → small and fast data path
  + ROB and RS "on the side", used only for control and tags

R10K-Style Register Renaming

- Architectural register file? Gone
- Physical register file holds all values
  - #physical registers = #architectural registers + #ROB entries
  - Map (rename) architectural registers to physical registers
  - No WAW or WAR hazards (physical regs. replace RS values)
- Fundamental change to map table
  - Mappings cannot be 0 (no architectural register file)
- Explicit free list tracks unallocated physical regs.
  - Retire stage returns physical regs. to free list

Physical Register Reclamation

- P6
  - No need to free speculative ("in-flight") values explicitly
  - Temporary storage comes with ROB entry
- R10K
  - Can't free physical regs. when insn. retires
    - Younger insns. likely depend on it
  - But... in Retire stage, can free physical reg. previously mapped to logical destination reg.
    - Why?

Freeing Registers

(Each row shows the Map Table and Free List just before that insn is renamed; the destination register is the last operand.)

MapTable       FreeList      Original insns.   Renamed insns.
r1  r2  r3
p1  p2  p3     p4,p5,p6,p7   add r2,r3,r1      add p2,p3,p4
p4  p2  p3     p5,p6,p7      sub r2,r1,r3      sub p2,p4,p5
p4  p2  p5     p6,p7         mul r2,r3,r3      mul p2,p5,p6
p4  p2  p6     p7            div r1,4,r1       div p4,4,p7
p7  p2  p6     p1            add r1,r3,r2      add p7,p6,p1

- When add retires, free p1
- When sub retires, free p3
- When mul retires, free p5
- When div retires, free p4
- Always OK to free old mapping
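The table above can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: the function name `rename`, the tuple encoding `(op, src1, src2, dst)`, and the policy of retiring the oldest insn whenever the free list runs dry are illustrative choices, not part of the slides.

```python
from collections import deque

def rename(program, map_table, free_list):
    """Rename (op, src1, src2, dst) insns; also report the preg each insn
    frees when it retires (its old destination mapping, Told)."""
    map_table = dict(map_table)
    free_list = deque(free_list)
    in_flight = deque()        # Told of each un-retired insn, oldest first
    renamed, retire_frees = [], []
    for op, src1, src2, dst in program:
        # Sources read the current mapping; immediates pass through unchanged.
        s1 = map_table.get(src1, src1)
        s2 = map_table.get(src2, src2)
        # Assumption: on an empty free list, stall until the oldest insn
        # retires and returns its Told.
        while not free_list:
            free_list.append(in_flight.popleft())
        t_old = map_table[dst]          # mapping being overwritten
        t_new = free_list.popleft()     # fresh physical register
        map_table[dst] = t_new
        in_flight.append(t_old)
        retire_frees.append(t_old)
        renamed.append((op, s1, s2, t_new))
    return renamed, retire_frees

program = [("add", "r2", "r3", "r1"), ("sub", "r2", "r1", "r3"),
           ("mul", "r2", "r3", "r3"), ("div", "r1", "4", "r1"),
           ("add", "r1", "r3", "r2")]
renamed, retire_frees = rename(program,
                               {"r1": "p1", "r2": "p2", "r3": "p3"},
                               ["p4", "p5", "p6", "p7"])
```

With the slide's initial state, `renamed` matches the "Renamed insns." column, and the first four entries of `retire_frees` are p1, p3, p5, p4, the registers the slide says are freed when add, sub, mul and div retire.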

Hardware Data Structures

- New tags (again)
  - P6: ROB# → R10K: PR# (physical register #)
- ROB
  - T: PR# corresponding to insn's logical output
  - Told: PR# previously mapped to insn's logical output
- RS
  - T, T1, T2: output, input physical registers
- Map Table
  - T+: PR# (never empty) + ready bit
- Free List
  - T: PR#
- No values in ROB, RS, or on CDB

Hardware Data Structures

ROB
ht  #  Insn              T      Told   S   X   C
    1  f1 = ldf (r1)
    2  f2 = mulf f0,f1
    3  stf f2,(r1)
    4  r1 = addi r1,4
    5  f1 = ldf (r1)
    6  f2 = mulf f0,f1
    7  stf f2,(r1)

Reservation Stations
#  FU   busy  op    T      T1     T2
1  ALU  no
2  LD   no
3  ST   no
4  FP1  no
5  FP2  no

Map Table
Reg  T+
f0   PR#1+
f1   PR#2+
f2   PR#3+
r1   PR#4+

Free List: PR#5, PR#6, PR#7, PR#8
CDB T: (empty)

Note I: no values anywhere
Note II: Map Table is never empty

R10K Pipeline

- R10K pipeline structure: F, D, S, X, C, R
- D (dispatch)
  - Structural hazard (RS, ROB, physical registers)? Stall
  - Allocate RS, ROB, and new physical register (T)
  - Record previously mapped physical register (Told)
- C (complete)
  - Write destination physical register
- R (retire)
  - ROB head not complete? Stall
  - Handle any exceptions
  - Free ROB entry
  - Free previous physical register (Told)

R10K Dispatch (D)

[Figure: R10K datapath (Map Table, ROB, Regfile, RS, Free List, FU, CDB.T)]

- Read preg (physical register) tags for input registers, store in RS
- Read preg tag for output register, store in ROB (Told)
- Allocate new preg (free list) for output reg, store in RS, ROB, Map Table

R10K Complete (C)

[Figure: R10K datapath (Map Table, ROB, Regfile, RS, Free List, FU, CDB.T)]

- Set insn's output register ready bit in map table
- Set ready bits for matching input tags in RS

R10K Retire (R)

[Figure: R10K datapath (Map Table, ROB, Regfile, RS, Free List, FU, CDB.T)]

- Return Told of ROB head to free list
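The Dispatch, Complete and Retire actions can be sketched as three methods on a small model. Everything named here (`R10K`, `RobEntry`, `RsEntry`, the method signatures) is hypothetical. The sketch tracks tags and ready bits only, leaves out issue, execution and RS deallocation, and does not check for structural hazards.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class RobEntry:
    op: str
    t: Optional[str]       # new preg for the logical output (None for stores)
    t_old: Optional[str]   # preg previously mapped to that output
    done: bool = False

@dataclass
class RsEntry:
    op: str
    t: Optional[str]
    srcs: Dict[str, bool]  # input preg tag -> ready bit

class R10K:
    def __init__(self, map_table, free_list):
        # Map Table entry: [preg, ready bit]; it is never empty.
        self.map = {reg: [preg, True] for reg, preg in map_table.items()}
        self.free = deque(free_list)
        self.rob = deque()
        self.rs = []

    def dispatch(self, op, dst, srcs):
        # Read input tags first, so "r1 = addi r1,4" sees the old r1 mapping.
        tags = {self.map[s][0]: self.map[s][1] for s in srcs}
        t = t_old = None
        if dst is not None:
            t_old = self.map[dst][0]     # goes to ROB.Told
            t = self.free.popleft()      # new preg from the free list
            self.map[dst] = [t, False]   # not ready until Complete
        entry = RobEntry(op, t, t_old)
        self.rob.append(entry)
        self.rs.append(RsEntry(op, t, tags))
        return entry

    def complete(self, entry):
        entry.done = True
        if entry.t is None:
            return
        for mapping in self.map.values():    # set Map Table ready bit
            if mapping[0] == entry.t:
                mapping[1] = True
        for rs in self.rs:                   # wake up matching RS inputs
            if entry.t in rs.srcs:
                rs.srcs[entry.t] = True

    def retire(self):
        if not self.rob or not self.rob[0].done:
            return None                      # ROB head not complete: stall
        head = self.rob.popleft()
        if head.t_old is not None:
            self.free.append(head.t_old)     # return Told to the free list
        return head

cpu = R10K({"f0": "PR#1", "f1": "PR#2", "f2": "PR#3", "r1": "PR#4"},
           ["PR#5", "PR#6", "PR#7", "PR#8"])
ldf = cpu.dispatch("ldf", "f1", ["r1"])
mulf = cpu.dispatch("mulf", "f2", ["f0", "f1"])
stf = cpu.dispatch("stf", None, ["f2", "r1"])
cpu.complete(ldf)
retired = cpu.retire()
```

Starting from the slides' initial Map Table and Free List, the ldf gets T = PR#5 and Told = PR#2, its completion marks PR#5 ready in the Map Table and in the mulf's RS entry, and its retirement returns PR#2 to the free list. A second `cpu.retire()` returns `None` because the mulf at the ROB head has not completed.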

R10K: Cycle 1

ROB
ht  #  Insn              T      Told   S   X   C
ht  1  f1 = ldf (r1)     PR#5   PR#2
    2  f2 = mulf f0,f1
    3  stf f2,(r1)
    4  r1 = addi r1,4
    5  f1 = ldf (r1)
    6  f2 = mulf f0,f1
    7  stf f2,(r1)

Reservation Stations
#  FU   busy  op    T      T1     T2
1  ALU  no
2  LD   yes   ldf   PR#5          PR#4+
3  ST   no
4  FP1  no
5  FP2  no

Map Table
Reg  T+
f0   PR#1+
f1   PR#5
f2   PR#3+
r1   PR#4+

Free List: PR#5, PR#6, PR#7, PR#8
CDB T: (empty)

- Allocate new preg (PR#5) to f1
- Remember old preg mapped to f1 (PR#2) in ROB

R10K: Cycle 2

ROB
ht  #  Insn              T      Told   S   X   C
h   1  f1 = ldf (r1)     PR#5   PR#2   c2
t   2  f2 = mulf f0,f1   PR#6   PR#3
    3  stf f2,(r1)
    4  r1 = addi r1,4
    5  f1 = ldf (r1)
    6  f2 = mulf f0,f1
    7  stf f2,(r1)

Reservation Stations
#  FU   busy  op    T      T1     T2
1  ALU  no
2  LD   yes   ldf   PR#5          PR#4+
3  ST   no
4  FP1  yes   mulf  PR#6   PR#1+  PR#5
5  FP2  no

Map Table
Reg  T+
f0   PR#1+
f1   PR#5
f2   PR#6
r1   PR#4+

Free List: PR#6, PR#7, PR#8
CDB T: (empty)

- Allocate new preg (PR#6) to f2
- Remember old preg mapped to f2 (PR#3) in ROB

R10K: Cycle 3

ROB
ht  #  Insn              T      Told   S   X   C
h   1  f1 = ldf (r1)     PR#5   PR#2   c2  c3
    2  f2 = mulf f0,f1   PR#6   PR#3
t   3  stf f2,(r1)
    4  r1 = addi r1,4
    5  f1 = ldf (r1)
    6  f2 = mulf f0,f1
    7  stf f2,(r1)

Reservation Stations
#  FU   busy  op    T      T1     T2
1  ALU  no
2  LD   no
3  ST   yes   stf          PR#6   PR#4+
4  FP1  yes   mulf  PR#6   PR#1+  PR#5
5  FP2  no

Map Table
Reg  T+
f0   PR#1+
f1   PR#5
f2   PR#6
r1   PR#4+

Free List: PR#7, PR#8, PR#9
CDB T: (empty)

- LD reservation station is free again
- Stores are not allocated pregs

R10K: Cycle 4

ROB
ht  #  Insn              T      Told   S   X   C
h   1  f1 = ldf (r1)     PR#5   PR#2   c2  c3  c4
    2  f2 = mulf f0,f1   PR#6   PR#3   c4
    3  stf f2,(r1)
t   4  r1 = addi r1,4    PR#7   PR#4
    5  f1 = ldf (r1)
    6  f2 = mulf f0,f1
    7  stf f2,(r1)

Reservation Stations
#  FU   busy  op    T      T1     T2
1  ALU  yes   addi  PR#7   PR#4+
2  LD   no
3  ST   yes   stf          PR#6   PR#4+
4  FP1  yes   mulf  PR#6   PR#1+  PR#5+
5  FP2  no

Map Table
Reg  T+
f0   PR#1+
f1   PR#5+
f2   PR#6
r1   PR#7

Free List: PR#7, PR#8, PR#9
CDB T: PR#5

- ldf completes: set Map Table ready bit
- Match PR#5 tag from CDB & issue

R10K: Cycle 5

ROB
ht  #  Insn              T      Told   S   X   C
    1  f1 = ldf (r1)     PR#5   PR#2   c2  c3  c4
h   2  f2 = mulf f0,f1   PR#6   PR#3   c4  c5
    3  stf f2,(r1)
    4  r1 = addi r1,4    PR#7   PR#4   c5
t   5  f1 = ldf (r1)     PR#8   PR#5
    6  f2 = mulf f0,f1
    7  stf f2,(r1)

Reservation Stations
#  FU   busy  op    T      T1     T2
1  ALU  yes   addi  PR#7   PR#4+
2  LD   yes   ldf   PR#8          PR#7
3  ST   yes   stf          PR#6   PR#4+
4  FP1  no
5  FP2  no

Map Table
Reg  T+
f0   PR#1+
f1   PR#8
f2   PR#6
r1   PR#7

Free List: PR#8, PR#2, PR#9
CDB T: (empty)

- FP1 reservation station is free again
- ldf retires: return PR#2 to free list

Precise State in R10K

- Precise state is more difficult in R10K
  - Physical registers are written out-of-order (at C)
  - To recover precise state, roll back the Map Table and Free List: "free" written registers and "restore" old ones
- Two ways of restoring Map Table and Free List
  - Option I: serial rollback using T, Told ROB fields
    ± Slow, but simple
  - Option II: single-cycle restoration from some checkpoint
    ± Fast, but checkpoints are expensive
- Modern processor compromise: make common case fast
  - Checkpoint only for branch prediction (frequent rollbacks)
  - Serial recovery for exceptions and interrupts (rare rollbacks)
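Option II can be illustrated with a small model. It rests on assumptions the slides do not make: the free list is modelled as an append-only FIFO with a head index, so a checkpoint only has to save a copy of the Map Table plus that index, and ready bits are left out. The names `RenameState`, `allocate`, `release` and the branch label `"br0"` are hypothetical.

```python
class RenameState:
    def __init__(self, map_table, free_list):
        self.map = dict(map_table)
        self.free = list(free_list)   # FIFO; entries are appended, never erased
        self.head = 0                 # index of the next preg to allocate
        self.checkpoints = {}

    def allocate(self, dst):
        """Rename logical register dst; return (T, Told)."""
        t_old = self.map[dst]
        self.map[dst] = self.free[self.head]
        self.head += 1
        return self.map[dst], t_old

    def release(self, preg):
        """Retire stage returns a Told to the tail of the free list."""
        self.free.append(preg)

    def available(self):
        return self.free[self.head:]

    def checkpoint(self, branch_id):
        # Snapshot the whole Map Table, but only the free-list head index.
        self.checkpoints[branch_id] = (dict(self.map), self.head)

    def restore(self, branch_id):
        # One step, regardless of how many insns are squashed.
        self.map, self.head = self.checkpoints.pop(branch_id)

state = RenameState({"f0": "PR#1", "f1": "PR#2", "f2": "PR#3", "r1": "PR#4"},
                    ["PR#5", "PR#6", "PR#7", "PR#8"])
state.allocate("f1")        # insn 1: f1 -> PR#5
state.allocate("f2")        # insn 2: f2 -> PR#6
state.checkpoint("br0")     # hypothetical branch after insn 2
state.allocate("r1")        # insn 4: r1 -> PR#7
state.allocate("f1")        # insn 5: f1 -> PR#8
state.release("PR#2")       # insn 1 retires
state.restore("br0")
```

After `restore`, the Map Table is back to f1 = PR#5, f2 = PR#6, r1 = PR#4, and PR#7 and PR#8 are allocatable again. PR#2, freed by a retirement that happened after the checkpoint, is not lost, because retirement only touches the tail. The cost the slide calls "expensive" shows up as the full Map Table copy held per checkpoint.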

R10K: Cycle 5 (with precise state)

ROB
ht  #  Insn              T      Told   S   X   C
    1  f1 = ldf (r1)     PR#5   PR#2   c2  c3  c4
h   2  f2 = mulf f0,f1   PR#6   PR#3   c4  c5
    3  stf f2,(r1)
    4  r1 = addi r1,4    PR#7   PR#4   c5
t   5  f1 = ldf (r1)     PR#8   PR#5
    6  f2 = mulf f0,f1
    7  stf f2,(r1)

Reservation Stations
#  FU   busy  op    T      T1     T2
1  ALU  yes   addi  PR#7   PR#4+
2  LD   yes   ldf   PR#8          PR#7
3  ST   yes   stf          PR#6   PR#4+
4  FP1  no
5  FP2  no

Map Table
Reg  T+
f0   PR#1+
f1   PR#8
f2   PR#6
r1   PR#7

Free List: PR#8, PR#2, PR#9
CDB T: (empty)

- Undo insns 3-5 (doesn't matter why)
- Use serial rollback

R10K: Cycle 6 (with precise state)

ROB
ht  #  Insn              T      Told   S   X   C
    1  f1 = ldf (r1)     PR#5   PR#2   c2  c3  c4
h   2  f2 = mulf f0,f1   PR#6   PR#3   c4  c5
    3  stf f2,(r1)
t   4  r1 = addi r1,4    PR#7   PR#4   c5
    5  f1 = ldf (r1)     PR#8   PR#5
    6  f2 = mulf f0,f1
    7  stf f2,(r1)

Reservation Stations
#  FU   busy  op    T      T1     T2
1  ALU  yes   addi  PR#7   PR#4+
2  LD   no
3  ST   yes   stf          PR#6   PR#4+
4  FP1  no
5  FP2  no

Map Table
Reg  T+
f0   PR#1+
f1   PR#5+
f2   PR#6
r1   PR#7

Free List: PR#2, PR#8, PR#9
CDB T: (empty)

Undo ldf (ROB#5):
1. Free RS
2. Free T (PR#8), return to Free List
3. Restore MT[f1] to Told (PR#5)
4. Free ROB#5

R10K: Cycle 7 (with precise state)

ROB
ht  #  Insn              T      Told   S   X   C
    1  f1 = ldf (r1)     PR#5   PR#2   c2  c3  c4
h   2  f2 = mulf f0,f1   PR#6   PR#3   c4  c5
t   3  stf f2,(r1)
    4  r1 = addi r1,4    PR#7   PR#4   c5
    5  f1 = ldf (r1)
    6  f2 = mulf f0,f1
    7  stf f2,(r1)

Reservation Stations
#  FU   busy  op    T      T1     T2
1  ALU  no
2  LD   no
3  ST   yes   stf          PR#6   PR#4+
4  FP1  no
5  FP2  no

Map Table
Reg  T+
f0   PR#1+
f1   PR#5+
f2   PR#6
r1   PR#4+

Free List: PR#2, PR#8, PR#7, PR#9
CDB T: (empty)

Undo addi (ROB#4):
1. Free RS
2. Free T (PR#7), return to Free List
3. Restore MT[r1] to Told (PR#4)
4. Free ROB#4

R10K: Cycle 8 (with precise state)

ROB
ht  #  Insn              T      Told   S   X   C
    1  f1 = ldf (r1)     PR#5   PR#2   c2  c3  c4
ht  2  f2 = mulf f0,f1   PR#6   PR#3   c4  c5
    3  stf f2,(r1)
    4  r1 = addi r1,4
    5  f1 = ldf (r1)
    6  f2 = mulf f0,f1
    7  stf f2,(r1)

Reservation Stations
#  FU   busy  op    T      T1     T2
1  ALU  no
2  LD   no
3  ST   no
4  FP1  no
5  FP2  no

Map Table
Reg  T+
f0   PR#1+
f1   PR#5+
f2   PR#6
r1   PR#4+

Free List: PR#2, PR#8, PR#7, PR#9
CDB T: (empty)

Undo stf (ROB#3):
1. Free RS
2. Free ROB#3
3. No registers to restore/free
4. How is L1-D write undone?
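The three undo steps of cycles 6 to 8 amount to walking the ROB from the tail. The sketch below assumes a ROB entry encoded as `(insn, dst, T, Told)` and a plain list for the free list. The function name `serial_rollback` and the `keep` parameter are hypothetical, and RS deallocation, ready bits and the store's memory side effect are not modelled.

```python
def serial_rollback(rob, map_table, free_list, keep):
    """Undo every ROB entry after the first `keep`, youngest (tail) first."""
    while len(rob) > keep:
        insn, dst, t, t_old = rob.pop()   # free the tail ROB entry
        if dst is not None:               # stores have no registers to undo
            free_list.append(t)           # free T, return it to the Free List
            map_table[dst] = t_old        # restore the Map Table to Told
    return rob, map_table, free_list

# State at cycle 5, after insn 1 (ldf) has retired and returned PR#2.
rob = [("mulf", "f2", "PR#6", "PR#3"),
       ("stf", None, None, None),
       ("addi", "r1", "PR#7", "PR#4"),
       ("ldf", "f1", "PR#8", "PR#5")]
map_table = {"f0": "PR#1", "f1": "PR#8", "f2": "PR#6", "r1": "PR#7"}
free_list = ["PR#2", "PR#9"]

serial_rollback(rob, map_table, free_list, keep=1)
```

The loop takes one iteration per squashed insn, which is the "slow, but simple" trade-off of Option I. The final Map Table (f1 = PR#5, r1 = PR#4) and the set of free registers (PR#2, PR#7, PR#8, PR#9) agree with the cycle 8 slide, though the order inside the free list differs.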

Renaming & OoO in P6 vs. R10K

Feature                 P6                                 R10K
Value storage           ARF, ROB, RS                       PRF
Register read           @D: ARF/ROB → RS                   @S: PRF → FU
Register write          @R: ROB → ARF                      @C: FU → PRF
Speculative value free  @R: automatic (ROB)                @R: overwriting insn
Data paths              ARF/ROB → RS; RS → FU;             PRF → FU; FU → PRF
                        FU → ROB, RS; ROB → ARF
Precise state           Simple: clear everything           Complex: serial/checkpoint

- R10K-style became popular in late 90's, early 00's
  - E.g., MIPS R10K (duh), DEC Alpha 21264, Intel Pentium 4
- P6-style is making a comeback
  - Why? Frequency (power) is on the retreat, simplicity is important