CSE 502: Computer Architecture

CSE 502: Computer Architecture Out-of-Order Execution and Register Rename

In Search of Parallelism
Trivial parallelism is limited.
What is trivial parallelism?
  In-order: sequential instructions that do not have dependencies.
  In all previous cases, all insns. executed with or after earlier insns.
  Superscalar execution quickly hits a ceiling due to deps.
So what is non-trivial parallelism?

Instruction-Level Parallelism (ILP)
ILP is a measure of inter-dependencies between insns.
Average ILP = num. instructions / num. cycles required

code1: ILP = 1, i.e. must execute serially
  r1 ← r2 + 1
  r3 ← r1 / 17
  r4 ← r0 - r3

code2: ILP = 3, i.e. can execute at the same time
  r1 ← r2 + 1
  r3 ← r9 / 17
  r4 ← r0 - r10
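The average-ILP definition on this slide can be checked with a few lines of Python. This is a minimal sketch, assuming unit latency, unlimited resources, and true (RAW) dependencies only; the function name `average_ilp` and the `(dest, sources)` encoding are illustrative choices, not part of the lecture, and the operand names in `code2` follow the slide as best it can be read.

```python
def average_ilp(insns):
    """Average ILP = number of insns / length of the longest RAW chain.

    insns: list of (dest, sources) tuples in program order.
    Assumes unit latency and infinite resources (the usual ILP idealization).
    """
    ready_cycle = {}  # register -> cycle in which its latest value is produced
    depth = 0
    for dest, sources in insns:
        # An insn runs one cycle after the latest producer of any of its sources;
        # registers never written in this snippet count as ready at cycle 0.
        cycle = 1 + max((ready_cycle.get(s, 0) for s in sources), default=0)
        ready_cycle[dest] = cycle
        depth = max(depth, cycle)
    return len(insns) / depth


# code1: each insn consumes the result of the previous one.
code1 = [("r1", ["r2"]), ("r3", ["r1"]), ("r4", ["r0", "r3"])]
# code2: no insn reads a register written by another insn in the snippet.
code2 = [("r1", ["r2"]), ("r3", ["r9"]), ("r4", ["r0", "r10"])]

print(average_ilp(code1))  # 1.0
print(average_ilp(code2))  # 3.0
```

The sketch ignores WAR and WAW dependencies, which matches the idealized ILP measure; a real pipeline without renaming would also have to respect those.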

The Problem with In-Order Pipelines

Cycle            1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
addf f0,f1,f2    F   D   E+  E+  E+  W
mulf f2,f3,f2        F   D   d*  d*  E*  E*  E*  E*  E*  W
subf f0,f1,f4            F   p*  p*  D   E+  E+  E+  W

What's happening in cycle 4?
  mulf stalls due to a RAW hazard
    OK, this is a fundamental problem
  subf stalls due to a pipeline hazard
    Why? subf can't proceed into D because mulf is there
    That is the only reason, and it isn't a fundamental one
Why can't subf go to D in cycle 4 and E+ in cycle 5?

ILP != IPC
ILP usually assumes:
  Infinite resources
  Perfect fetch
  Unit latency for all instructions
ILP is a property of the program dataflow.
IPC is the real observed metric:
  How many insns. are executed per cycle
ILP is an upper bound on the attainable IPC.
  Specific to a particular program

OoO Execution (1/3)
Dynamic scheduling
  Totally in the hardware
  Also called Out-of-Order execution (OoO)
Fetch many instructions into the instruction window
  Use branch prediction to speculate past branches
Rename regs. to avoid false deps. (WAW and WAR)
Execute insns. as soon as possible
  As soon as deps. (regs and memory) are known
Today's machines: 100+ insn. scheduling window

Out-of-Order Execution (2/3)
Execute insns. in dataflow order
  Often similar to, but not the same as, program order
Register renaming removes false deps.
Scheduler identifies when to run insns.
  Wait for all deps. to be satisfied

Out-of-Order Execution (3/3)
[Figure: Static Program -> (Fetch) -> Dynamic Instruction Stream -> (Rename) -> Renamed Instruction Stream -> (Schedule) -> Dynamically Scheduled Instructions]
Out-of-order = out of the original sequential order

OoO Example (1/2)
A: R1 = R2 + R3
B: R4 = R5 + R6
C: R1 = R1 * R4
D: R7 = LD 0[R1]
E: BEQZ R7, +32
F: R4 = R7 - 3
G: R1 = R1 + 1
H: R4 → ST 0[R1]
J: R1 = R1 - 1
K: R3 → ST 0[R1]

Cycle 1:   A B
      2:   C
      3-5: D (multi-cycle load)
      6:   E F G
      7:   H J
      8:   K
[Figure: dataflow graph of insns. A-K]
IPC = 10/8 = 1.25

OoO Example (2/2)
A: R1 = R2 + R3
B: R4 = R5 + R6
C: R1 = R1 * R4
D: R9 = LD 0[R1]
E: BEQZ R7, +32
F: R4 = R7 - 3
G: R1 = R1 + 1
H: R4 → ST 0[R9]
J: R1 = R9 - 1
K: R3 → ST 0[R1]

[Figure: dataflow graph of insns. A-K with a cycle-by-cycle schedule spanning 7 cycles, ending with H J in cycle 6 and K in cycle 7]
IPC = 10/7 = 1.43

Superscalar != Out-of-Order
A: R1 = Load 16[R2]
B: R3 = R1 + R4
C: R6 = Load 8[R9]
D: R5 = R2 - 4
E: R7 = Load 20[R5]
F: R4 = R4 - 1
G: BEQ R4, #0

[Figure: execution timelines of A-G on four machines, with A taking a cache miss]
1-wide In-Order:       10 cycles
2-wide In-Order:        8 cycles
1-wide Out-of-Order:    7 cycles
2-wide Out-of-Order:    5 cycles

Example Pipeline Terminology
In-order pipeline
  F: Fetch
  D: Decode
  X: Execute
  W: Writeback
[Figure: pipeline datapath with I$, BP, regfile, D$]

Example Pipeline Diagram
Alternative pipeline diagram
  Down: insns
  Across: pipeline stages
  In boxes: cycles
  Basically: stages <-> cycles
  Convenient for out-of-order

Insn             D     X     W
ldf X(r1),f1     c1    c2    c3
mulf f0,f1,f2    c3    c4+   c7
stf f2,z(r1)     c7    c8    c9
addi r1,4,r1     c8    c9    c10
ldf X(r1),f1     c10   c11   c12
mulf f0,f1,f2    c12   c13+  c16
stf f2,z(r1)     c16   c17   c18

Instruction Buffer
[Figure: pipeline with insn buffer, I$, BP, regfile, D$]
Trick: instruction buffer (a.k.a. instruction window)
  A bunch of registers for holding insns.
Split D into two parts
  Accumulate decoded insns. in buffer in-order
  Buffer sends insns. down rest of pipeline out-of-order

Dispatch and Issue
[Figure: pipeline with insn buffer, I$, BP, regfile, D$]
Dispatch (D): first part of decode
  Allocate slot in insn. buffer (if buffer is not full)
  In order: blocks younger insns.
Issue (S): second part of decode
  Send insns. from insn. buffer to execution units
  Out-of-order: doesn't block younger insns.

Dispatch and Issue with Floating-Point
[Figure: pipeline with insn buffer, I$, BP, regfile, F-regfile, D$, and FP units E* (three stages), E+ (two stages), and E/]
Number of pipeline stages per FU can vary

Out-of-Order Topics
Scoreboarding
  First OoO, no register renaming
Tomasulo's algorithm
  OoO with register renaming
Handling precise state and speculation
  P6-style execution (Intel Pentium Pro)
  R10k-style execution (MIPS R10k)
Handling memory dependencies

In-Order Issue, OoO Completion
[Figure: in-order inst. stream -> execution begins in-order -> functional units (INT, Fadd1, Fmul1, Ld/St, Fadd2, Fmul2, Fmul3) -> out-of-order completion]
Issue stage needs to check:
  1. Structural Dependence
  2. RAW Hazard
  3. WAW Hazard
  4. WAR Hazard
Issue = send an instruction to execution

Track with Simple Scoreboarding
Scoreboard: a bit-array, 1 bit for each GPR
  If the bit is not set: the register has valid data
  If the bit is set: the register has stale data,
    i.e., some outstanding instruction is going to change it
Issue in order: RD ← Fn(RS, RT)
  If SB[RS] or SB[RT] is set → RAW, stall
  If SB[RD] is set → WAW, stall
  Else, dispatch to FU (Fn) and set SB[RD]
Complete out-of-order
  Update GPR[RD], clear SB[RD]
Finite number of regs. will force WAR and WAW
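The issue rule on this slide maps almost directly onto code. The following is a minimal sketch of the one-bit-per-register scoreboard, assuming registers are identified by small integers; the class name `Scoreboard`, the method names, and the returned strings are illustrative and not taken from the lecture.

```python
class Scoreboard:
    """One pending bit per GPR: set while an in-flight insn will write it."""

    def __init__(self, num_regs):
        self.pending = [False] * num_regs

    def try_issue(self, rd, rs, rt):
        """In-order issue check for RD <- Fn(RS, RT)."""
        if self.pending[rs] or self.pending[rt]:
            return "RAW stall"   # a source is still being produced
        if self.pending[rd]:
            return "WAW stall"   # an older write to RD is still outstanding
        self.pending[rd] = True  # mark RD as stale until completion
        return "issued"

    def complete(self, rd):
        """Out-of-order completion: the GPR is updated, so clear its bit."""
        self.pending[rd] = False


sb = Scoreboard(num_regs=8)
print(sb.try_issue(rd=1, rs=2, rt=3))  # issued
print(sb.try_issue(rd=4, rs=1, rt=5))  # RAW stall (r1 pending)
print(sb.try_issue(rd=1, rs=6, rt=7))  # WAW stall (r1 pending)
sb.complete(1)
print(sb.try_issue(rd=4, rs=1, rt=5))  # issued
```

Because issue is in order, a stalled instruction would also block everything behind it; the sketch only models the check itself, not the stalled queue.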

Review of Register Dependencies
Read-After-Write (RAW)
  A: R1 = R2 + R3
  B: R4 = R1 * R4
Write-After-Read (WAR)
  A: R1 = R3 / R4
  B: R3 = R2 * R4
Write-After-Write (WAW)
  A: R1 = R2 + R3
  B: R1 = R3 * R4
[Figure: for each case, tables of the values in R1-R4 when A runs before B and when B runs before A]
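The three dependence types can be told apart purely from register names. Here is a small sketch that classifies the hazards between an earlier and a later instruction; the function name `hazards` and the `(dest, sources)` encoding are assumptions made for illustration.

```python
def hazards(first, second):
    """Hazards the later insn `second` has with the earlier insn `first`.

    Each insn is (dest, sources); dest may be None (e.g., a store or branch).
    """
    d1, s1 = first
    d2, s2 = second
    found = set()
    if d1 is not None and d1 in s2:
        found.add("RAW")  # second reads what first writes (true dependence)
    if d2 is not None and d2 in s1:
        found.add("WAR")  # second overwrites what first still has to read
    if d1 is not None and d1 == d2:
        found.add("WAW")  # both write the same register
    return found


# The three pairs from the slide.
print(hazards(("R1", ["R2", "R3"]), ("R4", ["R1", "R4"])))  # {'RAW'}
print(hazards(("R1", ["R3", "R4"]), ("R3", ["R2", "R4"])))  # {'WAR'}
print(hazards(("R1", ["R2", "R3"]), ("R1", ["R3", "R4"])))  # {'WAW'}
# Writing B's result to a different register (R5) removes the WAR hazard.
print(hazards(("R1", ["R3", "R4"]), ("R5", ["R2", "R4"])))  # set()
```

Only RAW reflects actual data flow; WAR and WAW appear or disappear depending on which register names are chosen, which is what renaming exploits.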

Eliminating WAR Dependencies
WAR dependencies are from reusing registers
  Original:  A: R1 = R3 / R4    B: R3 = R2 * R4
  Renamed:   A: R1 = R3 / R4    B: R5 = R2 * R4
[Figure: register-value tables for the original and renamed code under both execution orders]
Can get correct result just by using different reg.

Eliminating WAW Dependencies
WAW dependencies are also from reusing registers
  Original:  A: R1 = R2 + R3    B: R1 = R3 * R4
  Renamed:   A: R5 = R2 + R3    B: R1 = R3 * R4
[Figure: register-value tables for the original and renamed code under both execution orders]
Can get correct result just by using different reg.

Register Renaming
Register renaming (in hardware)
  Change register names to eliminate WAR/WAW hazards
  Arch. registers (r1, f0, ...) are names, not storage locations
  Can have more locations than names
  Can have multiple active versions of same name
How does it work?
  Map-table: maps names to most recent locations
  On a write: allocate new location, note in map-table
  On a read: find location of most recent write via map-table

Register Renaming
Anti (WAR) and output (WAW) deps. are false
  Dep. is on name/location, not on data
  Given infinite registers, WAR/WAW don't arise
  Renaming removes WAR/WAW, but leaves RAW intact
Example
  Names: r1, r2, r3
  Locations: p1, p2, p3, p4, p5, p6, p7
  Original: r1→p1, r2→p2, r3→p3; p4-p7 are free

MapTable (r1 r2 r3)   FreeList       Original insns.   Renamed insns.
p1 p2 p3              p4,p5,p6,p7    add r2,r3,r1      add p2,p3,p4
p4 p2 p3              p5,p6,p7       sub r2,r1,r3      sub p2,p4,p5
p4 p2 p5              p6,p7          mul r2,r3,r3      mul p2,p5,p6
p4 p2 p6              p7             div r1,4,r1       div p4,4,p7
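The map-table and free-list mechanics in this example can be replayed in a few lines. This is a sketch under simplifying assumptions: instructions are `(op, src1, src2, dest)` tuples in the operand order used on the slide, anything not found in the map table (such as the immediate `4`) passes through unchanged, and physical registers are never returned to the free list. The function name `rename` is illustrative.

```python
from collections import deque


def rename(insns, map_table, free_list):
    """Rename insns in program order; returns (renamed, final_map, remaining_free)."""
    table = dict(map_table)   # arch. name -> current physical location
    free = deque(free_list)   # unused physical locations, allocated in FIFO order
    renamed = []
    for op, src1, src2, dest in insns:
        # Reads: look up the most recent location of each source first.
        p1 = table.get(src1, src1)
        p2 = table.get(src2, src2)
        # Write: allocate a fresh location and record it in the map table.
        # (popleft raises IndexError if the free list is empty; real hardware stalls.)
        pd = free.popleft()
        table[dest] = pd
        renamed.append((op, p1, p2, pd))
    return renamed, table, list(free)


program = [
    ("add", "r2", "r3", "r1"),
    ("sub", "r2", "r1", "r3"),
    ("mul", "r2", "r3", "r3"),
    ("div", "r1", "4", "r1"),
]
renamed, final_map, remaining = rename(
    program,
    map_table={"r1": "p1", "r2": "p2", "r3": "p3"},
    free_list=["p4", "p5", "p6", "p7"],
)
for insn in renamed:
    print(insn)
# ('add', 'p2', 'p3', 'p4')
# ('sub', 'p2', 'p4', 'p5')
# ('mul', 'p2', 'p5', 'p6')
# ('div', 'p4', '4', 'p7')
```

Reading the sources before updating the destination mapping matters for instructions such as `div r1,4,r1`, where the same architectural name is both read and written.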

Tomasulo's Algorithm
Reservation Stations (RS): instruction buffer
Common data bus (CDB): broadcasts results to RS
Register renaming: removes WAR/WAW hazards
Bypassing (not shown here to make example simpler)

Tomasulo Data Structures (1/2)
Reservation Stations (RS)
  FU, busy, op, R: destination register name
  T: destination register tag (RS# of this RS)
  T1, T2: source register tags (RS# of RS that will output value)
  V1, V2: source register values
Map Table (a.k.a. RAT)
  T: tag (RS#) that will write this register
Common Data Bus (CDB)
  Broadcasts <RS#, value> of completed insns.
Valid tags indicate the RS# that will produce result

Tomasulo Data Structures (2/2)
[Figure: block diagram with fetched insns., Map Table (T), Regfile (value), Reservation Stations (R, op, T, T1, T2, V1, V2), FU, and the CDB (CDB.T, CDB.V)]

Tomasulo Pipeline
New pipeline structure: F, D, S, X, W
  D (dispatch)
    Structural hazard ? stall : allocate RS entry
  S (issue)
    RAW hazard ? wait (monitor CDB) : go to execute
  W (writeback)
    Write register, free RS entry
    W and RAW-dependent S in same cycle
    W and structural-dependent D in same cycle

Tomasulo Dispatch (D)
[Figure: Tomasulo block diagram]
Allocate RS entry (structural stall if busy)
Input register ready ? read value into RS : read tag into RS
Set register status (i.e., rename) for output register

Tomasulo Issue (S)
[Figure: Tomasulo block diagram]
Wait for RAW hazards
Read register values from RS

Tomasulo Execute (X)
[Figure: Tomasulo block diagram]

Tomasulo Writeback (W)
[Figure: Tomasulo block diagram]
Wait for structural (CDB) hazards
Output Reg tag still matches ? clear, write result to register
CDB broadcast to RS: tag match ? clear tag, copy value
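The dispatch and writeback rules from the last few slides can be sketched together in Python. This is a simplified model, not the lecture's implementation: it assumes a single pool of reservation stations numbered from 0 (the slides use one RS per functional unit, numbered from 1), omits timing and functional units, and the caller supplies the result value at writeback. The names `TomasuloSketch`, `dispatch`, `ready`, and `writeback` are illustrative.

```python
class TomasuloSketch:
    def __init__(self, num_rs, regfile):
        self.rs = [None] * num_rs      # reservation stations; None means free
        self.map_table = {}            # register name -> RS# that will write it
        self.regfile = dict(regfile)   # register name -> value

    def dispatch(self, op, dest, src1, src2=None):
        """D stage: return the allocated RS#, or None on a structural stall."""
        if None not in self.rs:
            return None
        tag = self.rs.index(None)
        entry = {"op": op, "R": dest, "T": [None, None], "V": [None, None]}
        for i, src in enumerate((src1, src2)):
            if src is None:
                continue
            if src in self.map_table:      # value not ready: record producer's tag
                entry["T"][i] = self.map_table[src]
            else:                          # value ready: copy it into the RS
                entry["V"][i] = self.regfile[src]
        self.rs[tag] = entry
        if dest is not None:
            self.map_table[dest] = tag     # rename the output register
        return tag

    def ready(self, tag):
        """S stage check: no outstanding source tags (no RAW hazard left)."""
        return self.rs[tag]["T"] == [None, None]

    def writeback(self, tag, value):
        """W stage: broadcast <tag, value> on the CDB and free the RS."""
        dest = self.rs[tag]["R"]
        # Write the regfile only if this RS is still the latest writer of dest.
        if dest is not None and self.map_table.get(dest) == tag:
            del self.map_table[dest]
            self.regfile[dest] = value
        # Every waiting RS with a matching source tag grabs the value.
        for entry in self.rs:
            if entry is None:
                continue
            for i in (0, 1):
                if entry["T"][i] == tag:
                    entry["T"][i] = None
                    entry["V"][i] = value
        self.rs[tag] = None


m = TomasuloSketch(num_rs=2, regfile={"f0": 2.0, "f1": 0.0, "f2": 0.0, "r1": 100})
ld = m.dispatch("ldf", "f1", "r1")            # RS#0
mul = m.dispatch("mulf", "f2", "f0", "f1")    # RS#1, waits on the tag of RS#0
stalled = m.dispatch("addi", "r1", "r1")      # None: no free RS
m.writeback(ld, 3.0)                          # mulf captures 3.0 from the CDB
mul2 = m.dispatch("mulf", "f2", "f0", "f1")   # reuses RS#0, becomes latest f2 writer
m.writeback(mul, 6.0)                         # tag no longer matches: regfile untouched
```

The last two calls illustrate the WAW case: the older `mulf` still broadcasts its result to any waiting RS, but it does not update `f2` in the register file because the map table already points at the younger writer.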

Where is the register rename?
[Figure: Tomasulo block diagram]
Value copies in RS (V1, V2)
Insn. stores correct input values in its own RS entry
Free list is implicit (allocate/deallocate as part of RS)

Tomasulo Data Structures

Insn Status
Insn             D     S     X     W
ldf X(r1),f1
mulf f0,f1,f2
stf f2,z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,z(r1)

Map Table
Reg   T
f0
f1
f2
r1

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  no
2  LD   no
3  ST   no
4  FP1  no
5  FP2  no

CDB
T     V

Tomasulo: Cycle 1

Insn Status
Insn             D     S     X     W
ldf X(r1),f1     c1
mulf f0,f1,f2
stf f2,z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,z(r1)

Map Table
Reg   T
f0
f1    RS#2
f2
r1

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  no
2  LD   yes   ldf   f1  -     -     -     [r1]
3  ST   no
4  FP1  no
5  FP2  no

CDB
T     V

Notes: RS#2 allocated for ldf.

Tomasulo: Cycle 2

Insn Status
Insn             D     S     X     W
ldf X(r1),f1     c1    c2
mulf f0,f1,f2    c2
stf f2,z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,z(r1)

Map Table
Reg   T
f0
f1    RS#2
f2    RS#4
r1

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  no
2  LD   yes   ldf   f1  -     -     -     [r1]
3  ST   no
4  FP1  yes   mulf  f2  -     RS#2  [f0]  -
5  FP2  no

CDB
T     V

Notes: RS#4 allocated for mulf.

Tomasulo: Cycle 3

Insn Status
Insn             D     S     X     W
ldf X(r1),f1     c1    c2    c3
mulf f0,f1,f2    c2
stf f2,z(r1)     c3
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,z(r1)

Map Table
Reg   T
f0
f1    RS#2
f2    RS#4
r1

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  no
2  LD   yes   ldf   f1  -     -     -     [r1]
3  ST   yes   stf   -   RS#4  -     -     [r1]
4  FP1  yes   mulf  f2  -     RS#2  [f0]  -
5  FP2  no

CDB
T     V

Notes: RS#3 allocated for stf.

Tomasulo: Cycle 4

Insn Status
Insn             D     S     X     W
ldf X(r1),f1     c1    c2    c3    c4
mulf f0,f1,f2    c2    c4
stf f2,z(r1)     c3
addi r1,4,r1     c4
ldf X(r1),f1
mulf f0,f1,f2
stf f2,z(r1)

Map Table
Reg   T
f0
f1    RS#2 (cleared this cycle)
f2    RS#4
r1    RS#1

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  yes   addi  r1  -     -     [r1]  -
2  LD   no
3  ST   yes   stf   -   RS#4  -     -     [r1]
4  FP1  yes   mulf  f2  -     RS#2  [f0]  CDB.V
5  FP2  no

CDB
T     V
RS#2  [f1]

Notes:
  RS#1 allocated for addi; RS#2 freed.
  ldf finished (W): clear f1 RegStatus, CDB broadcast.
  RS#2 ready: mulf (RS#4) grabs the CDB value.

Tomasulo: Cycle 5

Insn Status
Insn             D     S     X     W
ldf X(r1),f1     c1    c2    c3    c4
mulf f0,f1,f2    c2    c4    c5
stf f2,z(r1)     c3
addi r1,4,r1     c4    c5
ldf X(r1),f1     c5
mulf f0,f1,f2
stf f2,z(r1)

Map Table
Reg   T
f0
f1    RS#2
f2    RS#4
r1    RS#1

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  yes   addi  r1  -     -     [r1]  -
2  LD   yes   ldf   f1  -     RS#1  -     -
3  ST   yes   stf   -   RS#4  -     -     [r1]
4  FP1  yes   mulf  f2  -     -     [f0]  [f1]
5  FP2  no

CDB
T     V

Notes: RS#2 allocated for the second ldf.

Tomasulo: Cycle 6

Insn Status
Insn             D     S     X     W
ldf X(r1),f1     c1    c2    c3    c4
mulf f0,f1,f2    c2    c4    c5+
stf f2,z(r1)     c3
addi r1,4,r1     c4    c5    c6
ldf X(r1),f1     c5
mulf f0,f1,f2    c6
stf f2,z(r1)

Map Table
Reg   T
f0
f1    RS#2
f2    RS#5 (was RS#4)
r1    RS#1

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  yes   addi  r1  -     -     [r1]  -
2  LD   yes   ldf   f1  -     RS#1  -     -
3  ST   yes   stf   -   RS#4  -     -     [r1]
4  FP1  yes   mulf  f2  -     -     [f0]  [f1]
5  FP2  yes   mulf  f2  -     RS#2  [f0]  -

CDB
T     V

Notes:
  RS#5 allocated for the second mulf.
  No stall on WAW: the f2 RegStatus is overwritten (RS#4 becomes RS#5);
  anyone who needs the old f2 tag already has it.

Tomasulo: Cycle 7

Insn Status
Insn             D     S     X     W
ldf X(r1),f1     c1    c2    c3    c4
mulf f0,f1,f2    c2    c4    c5+
stf f2,z(r1)     c3
addi r1,4,r1     c4    c5    c6    c7
ldf X(r1),f1     c5    c7
mulf f0,f1,f2    c6
stf f2,z(r1)

Map Table
Reg   T
f0
f1    RS#2
f2    RS#5
r1    RS#1 (cleared this cycle)

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  no
2  LD   yes   ldf   f1  -     RS#1  -     CDB.V
3  ST   yes   stf   -   RS#4  -     -     [r1]
4  FP1  yes   mulf  f2  -     -     [f0]  [f1]
5  FP2  yes   mulf  f2  -     RS#2  [f0]  -

CDB
T     V
RS#1  [r1]

Notes:
  No W wait on WAR: anyone who needs the old r1 already has a copy in its RS.
  D stall on store RS: structural (no space).
  addi finished (W): clear r1 RegStatus, CDB broadcast.
  RS#1 ready: second ldf (RS#2) grabs the CDB value.

Tomasulo: Cycle 8

Insn Status
Insn             D     S     X     W
ldf X(r1),f1     c1    c2    c3    c4
mulf f0,f1,f2    c2    c4    c5+   c8
stf f2,z(r1)     c3    c8
addi r1,4,r1     c4    c5    c6    c7
ldf X(r1),f1     c5    c7    c8
mulf f0,f1,f2    c6
stf f2,z(r1)

Map Table
Reg   T
f0
f1    RS#2
f2    RS#5
r1

Reservation Stations
T  FU   busy  op    R   T1    T2    V1     V2
1  ALU  no
2  LD   yes   ldf   f1  -     -     -      [r1]
3  ST   yes   stf   -   RS#4  -     CDB.V  [r1]
4  FP1  no
5  FP2  yes   mulf  f2  -     RS#2  [f0]   -

CDB
T     V
RS#4  [f2]

Notes:
  mulf finished (W); f2 RegStatus already overwritten by the second mulf (RS#5).
  CDB broadcast. RS#4 ready: stf (RS#3) grabs the CDB value.

Tomasulo: Cycle 9

Insn Status
Insn             D     S     X     W
ldf X(r1),f1     c1    c2    c3    c4
mulf f0,f1,f2    c2    c4    c5+   c8
stf f2,z(r1)     c3    c8    c9
addi r1,4,r1     c4    c5    c6    c7
ldf X(r1),f1     c5    c7    c8    c9
mulf f0,f1,f2    c6    c9
stf f2,z(r1)

Map Table
Reg   T
f0
f1    RS#2 (cleared this cycle)
f2    RS#5
r1

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  no
2  LD   no
3  ST   yes   stf   -   -     -     [f2]  [r1]
4  FP1  no
5  FP2  yes   mulf  f2  -     RS#2  [f0]  CDB.V

CDB
T     V
RS#2  [f1]

Notes:
  Second ldf finished (W): clear f1 RegStatus, CDB broadcast.
  RS#2 ready: second mulf (RS#5) grabs the CDB value.

Tomasulo: Cycle 10

Insn Status
Insn             D     S     X     W
ldf X(r1),f1     c1    c2    c3    c4
mulf f0,f1,f2    c2    c4    c5+   c8
stf f2,z(r1)     c3    c8    c9    c10
addi r1,4,r1     c4    c5    c6    c7
ldf X(r1),f1     c5    c7    c8    c9
mulf f0,f1,f2    c6    c9    c10
stf f2,z(r1)     c10

Map Table
Reg   T
f0
f1
f2    RS#5
r1

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  no
2  LD   no
3  ST   yes   stf   -   RS#5  -     -     [r1]
4  FP1  no
5  FP2  yes   mulf  f2  -     -     [f0]  [f1]

CDB
T     V

Notes:
  stf finished (W): no output register, so no CDB broadcast.
  RS#3 freed, then allocated for the second stf.

Scoreboard vs. Tomasulo

                 Scoreboard                Tomasulo
Insn             D     S     X     W       D     S     X     W
ldf X(r1),f1     c1    c2    c3    c4      c1    c2    c3    c4
mulf f0,f1,f2    c2    c4    c5+   c8      c2    c4    c5+   c8
stf f2,z(r1)     c3    c8    c9    c10     c3    c8    c9    c10
addi r1,4,r1     c4    c5    c6    c9      c4    c5    c6    c7
ldf X(r1),f1     c5    c9    c10   c11     c5    c7    c8    c9
mulf f0,f1,f2    c8    c11   c12+  c15     c6    c9    c10+  c13
stf f2,z(r1)     c10   c15   c16   c17     c10   c13   c14   c15

Hazard        Scoreboard    Tomasulo
Insn buffer   stall in D    stall in D
FU            wait in S     wait in S
RAW           wait in S     wait in S
WAR           wait in W     none
WAW           stall in D    none

Can We Add Superscalar?
Dynamic scheduling and multi-issue are orthogonal
  N: superscalar width (number of parallel operations)
  W: window size (number of reservation stations)
What is needed for an N-by-W Tomasulo?
  RS: N tag/value write (D), N value read (S), 2N tag cmp (W)
  Select logic: W→N priority encoder (S)
  MT: 2N read (D), N write (D)
  RF: 2N read (D), N write (W)
  CDB: N (W)