EECS 470. Tomasulo's Algorithm. Lecture 4, Winter 2018


Tomasulo's Algorithm
Winter 2018
Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, and Wenisch of Carnegie Mellon University, Purdue University, University of Michigan, University of Pennsylvania, and University of Wisconsin.
Slide 1

Announcements
Programming assignment #2 will be posted by the end of the day. Be aware that synthesis could take a while.
Homework #2 is posted and due 2 weeks from today. You can do some of it now, and should be able to do all of it by the end of this week.
Slide 2

Readings H & P Chapter 3.4-3.5 Slide 3

Basic Anatomy of an OoO Scheduler Slide 4

New Pipeline Terminology
[Figure: in-order pipeline with I$, BP, regfile, D$]
In-order pipeline
  Often written as F, D, X, W (multi-cycle X includes M)
  Example pipeline: 1-cycle int (including mem), 3-cycle pipelined FP
Slide 5

New Pipeline Diagram
  Insn            D     X     W
  ldf X(r1),f1    c1    c2    c3
  mulf f0,f1,f2   c3    c4+   c7
  stf f2,Z(r1)    c7    c8    c9
  addi r1,4,r1    c8    c9    c10
  ldf X(r1),f1    c10   c11   c12
  mulf f0,f1,f2   c12   c13+  c16
  stf f2,Z(r1)    c16   c17   c18
Alternative pipeline diagram (we will see two approaches in class)
  Down: instructions executing over time
  Across: pipeline stages
  In boxes: the specific cycle of activity, for that instruction
  Basically: stages and cycles trade places relative to the usual diagram
  Convenient for out-of-order
Slide 6

Anatomy of OoO: Instruction Buffer
[Figure: pipeline with I$, BP, D1, insn buffer, D2, regfile, D$]
Insn buffer (many names for this buffer)
  Basically: a bunch of latches for holding insns
  Candidate pool of instructions
Split D into two pieces
  Accumulate decoded insns in buffer in-order
  Buffer sends insns down rest of pipeline out-of-order
Slide 7

Anatomy of OoO: Dispatch and Issue
[Figure: pipeline with I$, BP, D, insn buffer, S, regfile, D$]
Dispatch (D): first part of decode
  Allocate slot in insn buffer
  New kind of structural hazard (insn buffer is full)
  In order: stall back-propagates to younger insns
Issue (S): second part of decode
  Send insns from insn buffer to execution units
  + Out-of-order: wait doesn't back-propagate to younger insns
Slide 8

Dispatch and Issue with Floating-Point
[Figure: pipeline with I$, BP, D, insn buffer, S, regfile, D$, plus FP units E*, E*, E*, E+, E+, E/ and F-regfile]
Slide 9

Dynamic Scheduling Algorithms
Register scheduler: scheduler driven by register dependences
Book covers two register scheduling algorithms
  Scoreboard: no register renaming -> limited scheduling flexibility
  Tomasulo: register renaming -> more flexibility, better performance
We focus on Tomasulo's algorithm in the lecture
  No test questions on scoreboarding
  Do note that it is used in certain GPUs
Big simplification in this lecture: memory scheduling
  Pretend register algorithm magically knows memory dependences
  A little more realism later in the term
Slide 10

Key OoO Design Feature: Issue Policy and Issue Logic
Issue
  If multiple instructions are ready, which one to choose? Issue policy
    Oldest first? Safe
    Longest latency first? May yield better performance
  Select logic: implements issue policy
    Most projects use random.
Slide 11
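The two candidate policies differ only in how ready instructions are ranked. A minimal Python sketch, assuming each ready instruction carries a hypothetical `age` field (dispatch sequence number, smaller is older) and a hypothetical `latency` field (cycles spent in the execution unit); neither name comes from the slides:

```python
def select(ready, policy="oldest"):
    """Pick one instruction to issue from the ready list, or None if it is empty."""
    if not ready:
        return None
    if policy == "oldest":
        return min(ready, key=lambda insn: insn["age"])
    if policy == "longest_latency":
        # Break latency ties by age so the choice is deterministic.
        return max(ready, key=lambda insn: (insn["latency"], -insn["age"]))
    raise ValueError(f"unknown policy: {policy}")


# Illustrative ready pool (made-up ages and latencies).
ready = [
    {"name": "addi", "age": 3, "latency": 1},
    {"name": "mulf", "age": 5, "latency": 3},
]
oldest_pick = select(ready, "oldest")["name"]
latency_pick = select(ready, "longest_latency")["name"]
```

With this pool, oldest-first picks the `addi` and longest-latency-first picks the `mulf`, which shows how the policy alone changes the issue order.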

Eliminating False Dependencies with Register Renaming Slide 12

Dynamic execution: Hazards, Renaming (portions Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth)
True Data Dependencies
True data dependency: RAW (Read after Write)
  R1=R2+R3
  R4=R1+R5
True dependencies prevent reordering
  (Mostly) unavoidable
Slide 13

False Data Dependencies
False or name dependencies
  WAW (Write after Write)
    R1=R2+R3
    R1=R4+R5
  WAR (Write after Read)
    R2=R1+R3
    R1=R4+R5
False dependencies prevent reordering
  Can they be eliminated? (Yes, with renaming!)
Slide 14

Data Dependency Graph: Simple example
  R1=MEM[R2+0]   // A
  R2=R2+4        // B
  R3=R1+R4       // C
  MEM[R2+0]=R3   // D
[Figure: dependency graph; edge legend: RAW, WAW, WAR]
Slide 15

Data Dependency Graph: More complex example
  R1=MEM[R3+4]    // A
  R2=MEM[R3+8]    // B
  R1=R1*R2        // C
  MEM[R3+4]=R1    // D
  MEM[R3+8]=R1    // E
  R1=MEM[R3+12]   // F
  R2=MEM[R3+16]   // G
  R1=R1*R2        // H
  MEM[R3+12]=R1   // I
  MEM[R3+16]=R1   // J
[Figure: dependency graph; edge legend: RAW, WAW, WAR]
Slide 16

Eliminating False Dependencies
  R1=MEM[R3+4]    // A
  R2=MEM[R3+8]    // B
  R1=R1*R2        // C
  MEM[R3+4]=R1    // D
  MEM[R3+8]=R1    // E
  R1=MEM[R3+12]   // F
  R2=MEM[R3+16]   // G
  R1=R1*R2        // H
  MEM[R3+12]=R1   // I
  MEM[R3+16]=R1   // J
Logically, there is no reason for F-J to depend on A-E. So the grouping
  ABFG
  CH
  DEIJ
should be possible.
But that would cause either C or H to have the wrong register inputs.
How do we fix this?
  Remember, the dependency is really on the name of the register
  So change the register names!
Slide 17

Register Renaming Concept
The register names are arbitrary
The register name only needs to be consistent between writes.
  R1 = ...
  ... = R1
  ... = R1
  R1 = ...
The value in R1 is alive from when the value is written until the last read of that value.
Slide 18

So after renaming, what happens to the dependencies?
  P1=MEM[R3+4]    // A
  P2=MEM[R3+8]    // B
  P3=P1*P2        // C
  MEM[R3+4]=P3    // D
  MEM[R3+8]=P3    // E
  P4=MEM[R3+12]   // F
  P5=MEM[R3+16]   // G
  P6=P4*P5        // H
  MEM[R3+12]=P6   // I
  MEM[R3+16]=P6   // J
[Figure: dependency graph; edge legend: RAW, WAW, WAR]
Slide 19
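One way to answer that question is to count register dependences mechanically before and after renaming. The sketch below is illustrative only: the `(label, dest, sources)` encoding and the function name `register_dependences` are assumptions, and memory dependences are ignored, in line with the lecture's simplification.

```python
from collections import Counter


def register_dependences(insns):
    """Return (producer, consumer, kind, register) tuples for straight-line code.

    insns is a list of (label, dest, sources) in program order; dest is None
    for stores. Only register dependences are tracked (assumption).
    """
    last_writer = {}  # register -> label of its most recent writer
    readers = {}      # register -> labels that read it since that write
    deps = []
    for label, dest, sources in insns:
        for reg in sources:
            if reg in last_writer:
                deps.append((last_writer[reg], label, "RAW", reg))
        if dest is not None:
            for reader in readers.get(dest, []):
                deps.append((reader, label, "WAR", dest))
            if dest in last_writer:
                deps.append((last_writer[dest], label, "WAW", dest))
        # Record this instruction's reads and write only after the checks,
        # so an insn such as R1=R1*R2 is not reported as depending on itself.
        for reg in sources:
            readers.setdefault(reg, []).append(label)
        if dest is not None:
            last_writer[dest] = label
            readers[dest] = []
    return deps


original = [
    ("A", "R1", ["R3"]), ("B", "R2", ["R3"]), ("C", "R1", ["R1", "R2"]),
    ("D", None, ["R3", "R1"]), ("E", None, ["R3", "R1"]),
    ("F", "R1", ["R3"]), ("G", "R2", ["R3"]), ("H", "R1", ["R1", "R2"]),
    ("I", None, ["R3", "R1"]), ("J", None, ["R3", "R1"]),
]
renamed = [
    ("A", "P1", ["R3"]), ("B", "P2", ["R3"]), ("C", "P3", ["P1", "P2"]),
    ("D", None, ["R3", "P3"]), ("E", None, ["R3", "P3"]),
    ("F", "P4", ["R3"]), ("G", "P5", ["R3"]), ("H", "P6", ["P4", "P5"]),
    ("I", None, ["R3", "P6"]), ("J", None, ["R3", "P6"]),
]
original_counts = Counter(kind for _, _, kind, _ in register_dependences(original))
renamed_counts = Counter(kind for _, _, kind, _ in register_dependences(renamed))
```

Under these assumptions the original sequence has 8 RAW, 4 WAW, and 3 WAR register dependences, while the renamed sequence keeps the same 8 RAW dependences and has no WAW or WAR ones.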

Register Renaming Approach
Every time an architected register is written we assign it to a physical register
  Until the architected register is written again, we continue to translate it to the physical register number
  Leaves RAW dependencies intact
It is really simple, let's look at an example:
  Names: r1, r2, r3
  Locations: p1, p2, p3, p4, p5, p6, p7
  Original mapping: r1->p1, r2->p2, r3->p3; p4-p7 are free
  MapTable       FreeList       Orig. insns     Renamed insns
  r1  r2  r3
  p1  p2  p3     p4,p5,p6,p7    add r2,r3,r1    add p2,p3,p4
  p4  p2  p3     p5,p6,p7       sub r2,r1,r3    sub p2,p4,p5
  p4  p2  p5     p6,p7          mul r2,r3,r3    mul p2,p5,p6
  p4  p2  p6     p7             div r1,4,r1     div p4,4,p7
Slide 20
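The table can be reproduced with a few lines of Python. This is a minimal sketch, assuming instructions are encoded as `(op, src1, src2, dest)` tuples in the slide's operand order; the function name `rename` is hypothetical, and the sketch never returns physical registers to the free list, which a real design must do.

```python
def rename(insns, map_table, free_list):
    """Rename architected registers to physical ones using a map table and free list."""
    map_table = dict(map_table)   # work on copies so the caller's state is untouched
    free_list = list(free_list)
    renamed = []
    for op, src1, src2, dest in insns:
        # Translate sources first; anything not in the map (e.g. the
        # immediate "4") passes through unchanged.
        new_src1 = map_table.get(src1, src1)
        new_src2 = map_table.get(src2, src2)
        # Every write gets a fresh physical register from the free list.
        new_dest = free_list.pop(0)
        map_table[dest] = new_dest
        renamed.append((op, new_src1, new_src2, new_dest))
    return renamed, map_table, free_list


insns = [
    ("add", "r2", "r3", "r1"),
    ("sub", "r2", "r1", "r3"),
    ("mul", "r2", "r3", "r3"),
    ("div", "r1", "4", "r1"),
]
renamed, final_map, remaining = rename(
    insns,
    {"r1": "p1", "r2": "p2", "r3": "p3"},
    ["p4", "p5", "p6", "p7"],
)
```

Reading the sources before updating the destination mapping is what makes `div r1,4,r1` read the old `p4` and write the new `p7`.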

Original insns:
  R1=MEM[P7+4]    // A
  R2=MEM[R3+8]    // B
  R1=R1*R2        // C
  MEM[R3+4]=R1    // D
  MEM[R3+8]=R1    // E
  R1=MEM[R3+12]   // F
  R2=MEM[R3+16]   // G
  R1=R1*R2        // H
  MEM[R3+12]=R1   // I
  MEM[R3+16]=R1   // J
Rename table:
  Arch  V?  Physical
  1     1
  2     1
  3     1
Renamed insns:
  P1=MEM[R3+4]
  P2=MEM[R3+8]
  P3=P1*P2
  MEM[R3+4]=P3
  MEM[R3+8]=P3
  P4=MEM[R3+12]
  P5=MEM[R3+16]
  P6=P4*P5
  MEM[R3+12]=P6
  MEM[R3+16]=P6
Slide 21

Tomasulo's Scheduling Algorithm Slide 22

Tomasulo's Scheduling Algorithm
Tomasulo's algorithm
  Reservation stations (RS): instruction buffer
  Common data bus (CDB): broadcasts results to RS
  Register renaming: removes WAR/WAW hazards
First implementation: IBM 360/91 [1967]
  Dynamic scheduling for FP units only
  Bypassing
Our example: Simple Tomasulo
  Dynamic scheduling for everything, including load/store
  No bypassing
  5 RS: 1 ALU, 1 load, 1 store, 2 FP (3-cycle, pipelined)
Slide 23

Tomasulo Data Structures
Reservation Stations (RS#)
  FU, busy, op, R: destination register name
  T: destination register tag (RS# of this RS)
  T1, T2: source register tags (RS# of RS that will produce value)
  V1, V2: source register values
Rename Table/Map Table/RAT
  T: tag (RS#) that will write this register
Common Data Bus (CDB)
  Broadcasts <RS#, value> of completed insns
Tags interpreted as ready-bits++
  T==0 -> value is ready somewhere
  T!=0 -> value is not ready, wait until CDB broadcasts T
Slide 24

Simple Tomasulo Data Structures
[Figure: datapath with fetched insns, Map Table (T), Regfile (value), Reservation Stations (R, op, T, T1, T2, V1, V2), FU, and CDB (CDB.T, CDB.V)]
Figure legend: insn fields and status bits; tags; values
Slide 25

Simple Tomasulo Pipeline
New pipeline structure: F, D, S, X, W
D (dispatch)
  Structural hazard? Stall : allocate RS entry
S (issue)
  RAW hazard? Wait (monitor CDB) : go to execute
W (writeback)
  Write register (sometimes), free RS entry
  W and RAW-dependent S in same cycle
  W and structural-dependent D in same cycle
Slide 26

Tomasulo Dispatch (D)
[Figure: datapath with fetched insns, Map Table (T), Regfile (value), Reservation Stations (R, op, T, T1, T2, V1, V2), FU, and CDB (CDB.T, CDB.V)]
Stall for structural (RS) hazards
  Allocate RS entry
  Input register ready? Read value into RS : read tag into RS
  Rename output register to RS# (represents a unique value "name")
Slide 27
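The dispatch steps can be sketched in Python as follows. This is an illustration, not the course's reference implementation: the `RSEntry` class, the `dispatch` and `read_operand` helpers, and the register values are all assumptions, and a tag of 0 stands for "value is ready" as on the data-structures slide.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class RSEntry:
    fu: str
    busy: bool = False
    op: str = ""
    R: Optional[str] = None  # destination register name
    T1: int = 0              # source tags: 0 means the value is already present
    T2: int = 0
    V1: Any = None           # source values
    V2: Any = None


def read_operand(src, map_table, regfile):
    """Return (tag, value) for one source: a tag if pending, else the regfile value."""
    if src is None:
        return 0, None
    tag = map_table.get(src, 0)
    if tag != 0:
        return tag, None
    return 0, regfile[src]


def dispatch(tag, rs, map_table, regfile, op, dest, src1, src2):
    """Try to dispatch one insn into reservation station `tag`; False means stall."""
    if rs[tag].busy:
        return False  # structural (RS) hazard
    # Read sources before renaming the destination (matters for addi r1,4,r1).
    t1, v1 = read_operand(src1, map_table, regfile)
    t2, v2 = read_operand(src2, map_table, regfile)
    rs[tag] = RSEntry(fu=rs[tag].fu, busy=True, op=op, R=dest,
                      T1=t1, T2=t2, V1=v1, V2=v2)
    if dest is not None:
        map_table[dest] = tag  # rename output register to this RS#
    return True


rs = {1: RSEntry("ALU"), 2: RSEntry("LD"), 3: RSEntry("ST"),
      4: RSEntry("FP1"), 5: RSEntry("FP2")}
map_table = {}
regfile = {"f0": 2.0, "f1": 0.0, "f2": 0.0, "r1": 100}  # made-up values
first = dispatch(2, rs, map_table, regfile, "ldf", "f1", None, "r1")
second = dispatch(4, rs, map_table, regfile, "mulf", "f2", "f0", "f1")
stalled = dispatch(4, rs, map_table, regfile, "mulf", "f2", "f0", "f1")
```

After the two successful dispatches, the `mulf` entry holds the value of `f0` and the tag 2 in place of `f1`, and the third call stalls because RS#4 is still busy.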

Tomasulo Issue (S)
[Figure: datapath with fetched insns, Map Table (T), Regfile (value), Reservation Stations (R, op, T, T1, T2, V1, V2), FU, and CDB (CDB.T, CDB.V)]
Wait for RAW hazards
  Read register values from RS
Slide 28

Tomasulo Execute (X)
[Figure: datapath with fetched insns, Map Table (T), Regfile (value), Reservation Stations (R, op, T, T1, T2, V1, V2), FU, and CDB (CDB.T, CDB.V)]
Slide 29

Tomasulo Writeback (W)
[Figure: datapath with fetched insns, Map Table (T), Regfile (value), Reservation Stations (R, op, T, T1, T2, V1, V2), FU, and CDB (CDB.T, CDB.V)]
Wait for structural (CDB) hazards
  If Map Table rename still matches? Clear mapping, write result to regfile
  CDB broadcast to RS: tag match? Clear tag, copy value
  Free RS entry
Slide 30
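The writeback steps can be sketched the same way. The dictionary layout and the names `writeback`, `busy_entry`, and `free_entry` are assumptions for illustration; CDB arbitration (the structural hazard) is left out, and the register values are made up.

```python
def free_entry(fu):
    return {"fu": fu, "busy": False, "op": "", "R": None,
            "T1": 0, "T2": 0, "V1": None, "V2": None}


def busy_entry(fu, op, R, T1=0, T2=0, V1=None, V2=None):
    return {"fu": fu, "busy": True, "op": op, "R": R,
            "T1": T1, "T2": T2, "V1": V1, "V2": V2}


def writeback(tag, value, rs, map_table, regfile):
    """Complete the insn in reservation station `tag` with result `value`."""
    dest = rs[tag]["R"]
    # Write the regfile only if no younger insn has renamed dest since dispatch.
    if dest is not None and map_table.get(dest, 0) == tag:
        map_table[dest] = 0
        regfile[dest] = value
    # CDB broadcast: every waiting RS compares its source tags against `tag`.
    for entry in rs.values():
        if not entry["busy"]:
            continue
        for t, v in (("T1", "V1"), ("T2", "V2")):
            if entry[t] == tag:
                entry[t], entry[v] = 0, value
    rs[tag] = free_entry(rs[tag]["fu"])


# State loosely modelled on the cycle where the first mulf writes back.
rs = {
    3: busy_entry("ST", "stf", None, T1=4, V2=100),
    4: busy_entry("FP1", "mulf", "f2", V1=2.0, V2=3.0),
    5: busy_entry("FP2", "mulf", "f2", T2=2, V1=2.0),
}
map_table = {"f1": 2, "f2": 5}
regfile = {"f2": 0.0}
writeback(4, 6.0, rs, map_table, regfile)
```

Here `f2` has already been renamed to RS#5, so the regfile and the mapping are left alone, but the store waiting on tag 4 still captures the broadcast value.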

Register Renaming for Tomasulo
[Figure: datapath with fetched insns, Map Table (T), Regfile (value), Reservation Stations (R, op, T, T1, T2, V1, V2), FU, and CDB (CDB.T, CDB.V)]
What in Tomasulo implements register renaming?
  Value copies in RS (V1, V2)
  Insn stores correct input values in its own RS entry
  + Future insns can overwrite master copy in regfile, doesn't matter
Slide 31

Value/Copy-Based Register Renaming
Tomasulo-style register renaming
  Called value-based or copy-based
  Names: architectural registers
  Storage locations: register file and reservation stations
    Values can and do exist in both
    Register file holds master (i.e., most recent) values
    + RS copies eliminate WAR hazards
  Storage locations referred to internally by RS# tags
    Register table translates names to tags
    Tag == 0 -> value is in register file
    Tag != 0 -> value is not ready and is being computed by RS#
  CDB broadcasts values with tags attached
    So insns know what value they are looking at
Slide 32

Simple Tomasulo Data Structures
[Figure: datapath with fetched insns, Map Table (T), Regfile (value), Reservation Stations (R, op, T, T1, T2, V1, V2) with tag comparators (==), FU, and CDB (CDB.T, CDB.V)]
RS:
  Status information
    R: destination register
    op: operation (add, etc.)
  Tags
    T1, T2: source operand tags
  Values
    V1, V2: source operand values
Map Table (also RAT: Register Alias Table)
  Maps registers to tags
Regfile (also ARF: Architected Register File)
  Holds value of register if no value in RS
Slide 33

Tomasulo Data Structures (Timing-Free Example)
Instructions
  r0=r1*r2
  r1=r2*r3
  r2=r4+1
  r1=r1+r1
Map Table (blank, to be filled in)
  Reg   r0    r1    r2    r3    r4
  T
CDB (blank): T, V
Reservation Stations (blank)
  T  FU   busy  R   op    T1    T2    V1    V2
  1
  2
  3
  4
  5
ARF (blank)
  Reg   r0    r1    r2    r3    r4
  V
Slide 34

Questions
Where can we get values for a given instruction from?
  A)
  B)
When do we update the ARF? (This is a bit tricky)
  How do we know there isn't anyone else who needs the value we overwrite?
What do we do on a branch mispredict?
Slide 35

Example with timing
This set of slides is here for you to look over outside of class.
I generally prefer to not worry about timing issues too much at this point. They are implementation-specific and get more involved than I think is useful.
That said, a number of students get the general case better if they have a specific case to look at.
Slide 36

Example: Tomasulo with timing
Insn Status
  Insn            D    S    X    W
  ldf X(r1),f1
  mulf f0,f1,f2
  stf f2,Z(r1)
  addi r1,4,r1
  ldf X(r1),f1
  mulf f0,f1,f2
  stf f2,Z(r1)
Map Table
  Reg   f0    f1    f2    r1
  T     -     -     -     -
CDB: T (empty), V (empty)
Reservation Stations
  T  FU   busy  op    R   T1    T2    V1     V2
  1  ALU  no
  2  LD   no
  3  ST   no
  4  FP1  no
  5  FP2  no
Slide 37

Tomasulo: Cycle 1
Insn Status
  Insn            D    S    X    W
  ldf X(r1),f1    c1
  mulf f0,f1,f2
  stf f2,Z(r1)
  addi r1,4,r1
  ldf X(r1),f1
  mulf f0,f1,f2
  stf f2,Z(r1)
Map Table
  Reg   f0    f1    f2    r1
  T     -     RS#2  -     -
CDB: T (empty), V (empty)
Reservation Stations
  T  FU   busy  op    R   T1    T2    V1     V2
  1  ALU  no
  2  LD   yes   ldf   f1  -     -     -      [r1]
  3  ST   no
  4  FP1  no
  5  FP2  no
Annotations: allocate
Slide 38

Tomasulo: Cycle 2
Insn Status
  Insn            D    S    X    W
  ldf X(r1),f1    c1   c2
  mulf f0,f1,f2   c2
  stf f2,Z(r1)
  addi r1,4,r1
  ldf X(r1),f1
  mulf f0,f1,f2
  stf f2,Z(r1)
Map Table
  Reg   f0    f1    f2    r1
  T     -     RS#2  RS#4  -
CDB: T (empty), V (empty)
Reservation Stations
  T  FU   busy  op    R   T1    T2    V1     V2
  1  ALU  no
  2  LD   yes   ldf   f1  -     -     -      [r1]
  3  ST   no
  4  FP1  yes   mulf  f2  -     RS#2  [f0]   -
  5  FP2  no
Annotations: allocate
Slide 39

Tomasulo: Cycle 3
Insn Status
  Insn            D    S    X    W
  ldf X(r1),f1    c1   c2   c3
  mulf f0,f1,f2   c2
  stf f2,Z(r1)    c3
  addi r1,4,r1
  ldf X(r1),f1
  mulf f0,f1,f2
  stf f2,Z(r1)
Map Table
  Reg   f0    f1    f2    r1
  T     -     RS#2  RS#4  -
CDB: T (empty), V (empty)
Reservation Stations
  T  FU   busy  op    R   T1    T2    V1     V2
  1  ALU  no
  2  LD   yes   ldf   f1  -     -     -      [r1]
  3  ST   yes   stf   -   RS#4  -     -      [r1]
  4  FP1  yes   mulf  f2  -     RS#2  [f0]   -
  5  FP2  no
Annotations: allocate
Slide 40

Tomasulo: Cycle 4
Insn Status
  Insn            D    S    X    W
  ldf X(r1),f1    c1   c2   c3   c4
  mulf f0,f1,f2   c2   c4
  stf f2,Z(r1)    c3
  addi r1,4,r1    c4
  ldf X(r1),f1
  mulf f0,f1,f2
  stf f2,Z(r1)
Map Table
  Reg   f0    f1    f2    r1
  T     -     RS#2  RS#4  RS#1
CDB: T = RS#2, V = [f1]
Reservation Stations
  T  FU   busy  op    R   T1    T2    V1     V2
  1  ALU  yes   addi  r1  -     -     [r1]   -
  2  LD   no
  3  ST   yes   stf   -   RS#4  -     -      [r1]
  4  FP1  yes   mulf  f2  -     RS#2  [f0]   CDB.V
  5  FP2  no
Annotations:
  allocate; free
  ldf finished (W): clear f1 RegStatus, CDB broadcast
  RS#2 ready -> grab CDB value
Slide 41

Tomasulo: Cycle 5
Insn Status
  Insn            D    S    X    W
  ldf X(r1),f1    c1   c2   c3   c4
  mulf f0,f1,f2   c2   c4   c5
  stf f2,Z(r1)    c3
  addi r1,4,r1    c4   c5
  ldf X(r1),f1    c5
  mulf f0,f1,f2
  stf f2,Z(r1)
Map Table
  Reg   f0    f1    f2    r1
  T     -     RS#2  RS#4  RS#1
CDB: T (empty), V (empty)
Reservation Stations
  T  FU   busy  op    R   T1    T2    V1     V2
  1  ALU  yes   addi  r1  -     -     [r1]   -
  2  LD   yes   ldf   f1  -     RS#1  -      -
  3  ST   yes   stf   -   RS#4  -     -      [r1]
  4  FP1  yes   mulf  f2  -     -     [f0]   [f1]
  5  FP2  no
Annotations: allocate
Slide 42

Tomasulo: Cycle 6
Insn Status
  Insn            D    S    X    W
  ldf X(r1),f1    c1   c2   c3   c4
  mulf f0,f1,f2   c2   c4   c5+
  stf f2,Z(r1)    c3
  addi r1,4,r1    c4   c5   c6
  ldf X(r1),f1    c5
  mulf f0,f1,f2   c6
  stf f2,Z(r1)
Map Table (f2 changes from RS#4 to RS#5 this cycle; f1 is unchanged from cycle 5)
  Reg   f0    f1    f2    r1
  T     -     RS#2  RS#5  RS#1
CDB: T (empty), V (empty)
Reservation Stations
  T  FU   busy  op    R   T1    T2    V1     V2
  1  ALU  yes   addi  r1  -     -     [r1]   -
  2  LD   yes   ldf   f1  -     RS#1  -      -
  3  ST   yes   stf   -   RS#4  -     -      [r1]
  4  FP1  yes   mulf  f2  -     -     [f0]   [f1]
  5  FP2  yes   mulf  f2  -     RS#2  [f0]   -
Annotations:
  allocate
  No D stall on WAW (a scoreboard would stall): overwrite f2 RegStatus; anyone who needs the old f2 tag has it
Slide 43

Tomasulo: Cycle 7
Insn Status
  Insn            D    S    X    W
  ldf X(r1),f1    c1   c2   c3   c4
  mulf f0,f1,f2   c2   c4   c5+
  stf f2,Z(r1)    c3
  addi r1,4,r1    c4   c5   c6   c7
  ldf X(r1),f1    c5   c7
  mulf f0,f1,f2   c6
  stf f2,Z(r1)
Map Table
  Reg   f0    f1    f2    r1
  T     -     RS#2  RS#5  RS#1
CDB: T = RS#1, V = [r1]
Reservation Stations
  T  FU   busy  op    R   T1    T2    V1     V2
  1  ALU  no
  2  LD   yes   ldf   f1  -     RS#1  -      CDB.V
  3  ST   yes   stf   -   RS#4  -     -      [r1]
  4  FP1  yes   mulf  f2  -     -     [f0]   [f1]
  5  FP2  yes   mulf  f2  -     RS#2  [f0]   -
Annotations:
  No W wait on WAR (a scoreboard would wait): anyone who needs the old r1 has an RS copy
  D stall on store RS: structural
  addi finished (W): clear r1 RegStatus, CDB broadcast
  RS#1 ready -> grab CDB value
Slide 44

Tomasulo: Cycle 8
Insn Status
  Insn            D    S    X    W
  ldf X(r1),f1    c1   c2   c3   c4
  mulf f0,f1,f2   c2   c4   c5+  c8
  stf f2,Z(r1)    c3   c8
  addi r1,4,r1    c4   c5   c6   c7
  ldf X(r1),f1    c5   c7   c8
  mulf f0,f1,f2   c6
  stf f2,Z(r1)
Map Table
  Reg   f0    f1    f2    r1
  T     -     RS#2  RS#5  -
CDB: T = RS#4, V = [f2]
Reservation Stations
  T  FU   busy  op    R   T1    T2    V1     V2
  1  ALU  no
  2  LD   yes   ldf   f1  -     -     -      [r1]
  3  ST   yes   stf   -   RS#4  -     CDB.V  [r1]
  4  FP1  no
  5  FP2  yes   mulf  f2  -     RS#2  [f0]   -
Annotations:
  mulf finished (W): don't clear f2 RegStatus, already overwritten by 2nd mulf (RS#5); CDB broadcast
  RS#4 ready -> grab CDB value
Slide 45

Tomasulo: Cycle 9
Insn Status
  Insn            D    S    X    W
  ldf X(r1),f1    c1   c2   c3   c4
  mulf f0,f1,f2   c2   c4   c5+  c8
  stf f2,Z(r1)    c3   c8   c9
  addi r1,4,r1    c4   c5   c6   c7
  ldf X(r1),f1    c5   c7   c8   c9
  mulf f0,f1,f2   c6   c9
  stf f2,Z(r1)
Map Table
  Reg   f0    f1    f2    r1
  T     -     RS#2  RS#5  -
CDB: T = RS#2, V = [f1]
Reservation Stations
  T  FU   busy  op    R   T1    T2    V1     V2
  1  ALU  no
  2  LD   no
  3  ST   yes   stf   -   -     -     [f2]   [r1]
  4  FP1  no
  5  FP2  yes   mulf  f2  -     RS#2  [f0]   CDB.V
Annotations:
  2nd ldf finished (W): clear f1 RegStatus, CDB broadcast
  RS#2 ready -> grab CDB value
Slide 46

Tomasulo: Cycle 10
Insn Status
  Insn            D    S    X    W
  ldf X(r1),f1    c1   c2   c3   c4
  mulf f0,f1,f2   c2   c4   c5+  c8
  stf f2,Z(r1)    c3   c8   c9   c10
  addi r1,4,r1    c4   c5   c6   c7
  ldf X(r1),f1    c5   c7   c8   c9
  mulf f0,f1,f2   c6   c9   c10
  stf f2,Z(r1)    c10
Map Table
  Reg   f0    f1    f2    r1
  T     -     -     RS#5  -
CDB: T (empty), V (empty)
Reservation Stations
  T  FU   busy  op    R   T1    T2    V1     V2
  1  ALU  no
  2  LD   no
  3  ST   yes   stf   -   RS#5  -     -      [r1]
  4  FP1  no
  5  FP2  yes   mulf  f2  -     -     [f0]   [f1]
Annotations:
  stf finished (W): no output register, no CDB broadcast
  free; allocate
Slide 47

Can We Add Bypassing?
[Figure: datapath with fetched insns, Map Table (T), Regfile (value), Reservation Stations (R, op, T, T1, T2, V1, V2), FU, and CDB (CDB.T, CDB.V)]
Yes, but it's more complicated than you might think
  Scheduler must work in advance of computation
  Requires knowledge of the latency of instructions, not always possible
  Accurate bypass is a key advancement in scheduling in last 10 years
Slide 48

Can We Add Superscalar?
Dynamic scheduling and multiple issue are orthogonal
  E.g., Pentium4: dynamically scheduled 5-way superscalar
  Two dimensions
    N: superscalar width (number of parallel operations)
    W: window size (number of reservation stations)
What do we need for an N-by-W Tomasulo?
  RS: N tag/value w-ports (D), N value r-ports (S), 2N tag CAMs (W)
  Select logic: W->N priority encoder (S)
  MT: 2N r-ports (D), N w-ports (D)
  RF: 2N r-ports (D), N w-ports (W)
  CDB: N (W)
  Which are the expensive pieces?
Slide 49

Superscalar Select Logic
Superscalar select logic: W->N priority encoder
  Somewhat complicated (N^2 log W)
  Can simplify using different RS designs
Split design
  Divide RS into N banks: 1 per FU?
  Implement N separate W/N->1 encoders
  + Simpler: N * log(W/N)
  - Less scheduling flexibility
FIFO design [Palacharla+]
  Can issue only head of each RS bank
  + Simpler: no select logic at all
  - Less scheduling flexibility (but surprisingly not that bad)
Slide 50
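To get a feel for the two complexity estimates, they can be evaluated for a concrete window. The sketch below reads the slide's figures as N^2 * log W for the monolithic encoder and N * log(W/N) for the split design; the base-2 logarithm, the unitless "cost", the function names, and the sample sizes N=4, W=32 are all assumptions.

```python
import math


def monolithic_cost(n, w):
    """Slide's estimate for one W-to-N priority encoder: N^2 * log W."""
    return n * n * math.log2(w)


def split_cost(n, w):
    """Slide's estimate for N separate (W/N)-to-1 encoders: N * log(W/N)."""
    return n * math.log2(w / n)


n, w = 4, 32  # illustrative sizes, not taken from the lecture
mono = monolithic_cost(n, w)
split = split_cost(n, w)
```

The absolute numbers mean little; the point is the ratio, which grows with N and is why banking the reservation stations is attractive despite the lost scheduling flexibility.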

Dynamic Scheduling Summary
Dynamic scheduling: out-of-order execution
  Higher pipeline/FU utilization, improved performance
  Easier and more effective in hardware than software
  + More storage locations than architectural registers
  + Dynamic handling of cache misses
Instruction buffer: multiple F/D latches
  Implements large scheduling scope + passing functionality
  Split decode into in-order dispatch and out-of-order issue
  Stall vs. wait
Dynamic scheduling algorithms
  Scoreboard: no register renaming, limited out-of-order
  Tomasulo: copy-based register renaming, full out-of-order
Slide 51

Are we done?
When can Tomasulo go wrong?
  Lack of instructions to choose from!!
    Need a really really really good branch predictor
  Exceptions!!
    No way to figure out relative order of instructions in RS
Slide 52

And a bit of terminology
Issue can be thought of as a two-stage process: wakeup and select.
  When the RS figures out it has its data and is ready to run, it is said to have woken up, and the process of doing so is called wakeup
  But there may be a structural hazard: no EX unit available for a given RS
    When?
  Thus, in addition to being woken up, an RS needs to be selected before it can go to the execute unit (EX stage).
    This process is called select
Slide 53
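The two stages can be separated cleanly in code. The following is a minimal sketch under stated assumptions: the entry fields `issued`, `age`, and `unit`, the per-unit-class availability map `free_units`, and the oldest-first tie-break are all hypothetical choices made for illustration.

```python
def wakeup(rs):
    """Tags of busy, not-yet-issued entries whose source tags are both 0."""
    return [tag for tag, e in rs.items()
            if e["busy"] and not e["issued"] and e["T1"] == 0 and e["T2"] == 0]


def select(awake, rs, free_units):
    """Grant execution units to awake entries, oldest first (assumed policy)."""
    remaining = dict(free_units)  # unit class -> units free this cycle
    chosen = []
    for tag in sorted(awake, key=lambda t: rs[t]["age"]):
        unit = rs[tag]["unit"]
        if remaining.get(unit, 0) > 0:
            remaining[unit] -= 1
            chosen.append(tag)
        # Otherwise: structural hazard, the entry stays awake and retries later.
    return chosen


def entry(unit, age, T1=0, T2=0):
    return {"busy": True, "issued": False, "unit": unit, "age": age,
            "T1": T1, "T2": T2}


rs = {
    1: entry("ALU", age=2),
    2: entry("LD", age=3, T2=1),  # still waiting on RS#1, so not awake
    4: entry("FP", age=0),
    5: entry("FP", age=1),        # awake, but competes with RS#4 for the FP unit
}
awake = wakeup(rs)
chosen = select(awake, rs, {"ALU": 1, "LD": 1, "FP": 1})
```

In this example RS#5 has woken up but is not selected, because the single FP unit goes to the older RS#4; that is the structural hazard the slide asks about.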