Out-of-Order Execution. Register Renaming. Nima Honarmand

Out-of-Order Execution & Register Renaming Nima Honarmand

Out-of-Order (OOO) Execution (1)
- The essence of OOO execution is dynamic scheduling
  - Dynamic scheduling: processor hardware determines instruction execution order
  - As opposed to static scheduling, where the processor just follows the program order specified by the compiler
- Goal: execute each instruction as quickly as possible while maintaining the true data dependencies and control dependencies in the program

Out-of-Order (OOO) Execution (2)
- Fetch many instructions into the Instruction Window (IW)
  - Use branch prediction to speculate past branches
  - Today's high-end CPUs: 100+ instruction window
- Rename registers to avoid false register dependencies
  - WAW and WAR
- Scheduler identifies when to run each instruction in the IW
  - Wait for all register dependencies to be resolved
  - Make sure memory dependencies and exception behavior are maintained (later)

Out-of-Order Execution (3)
[Figure: Static Program -> (Fetch) -> Dynamic Instruction Stream -> (Rename) -> Renamed Instruction Stream -> (Schedule) -> Dynamically Scheduled Instructions]
- Out-of-order = out of the original sequential order

Superscalar != Out-of-Order
- These are orthogonal concepts
  - All combinations are possible (but not equally common)

Example program:
A: R1 = Load 16[R2]
B: R3 = R1 + R4
C: R6 = Load 8[R9]
D: R5 = R2 - 4
E: R7 = Load 20[R5]
F: R4 = R4 - 1
G: BEQ R4, #0

[Figure: true-dependency graph over A to G, and execution timelines for each configuration; A suffers a cache miss in every case]

Configuration          Total cycles
1-wide in-order        10
2-wide in-order        8
1-wide out-of-order    7
2-wide out-of-order    5
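The true (RAW) register dependencies in the example program can be found mechanically by tracking the last writer of each register. The sketch below is only an illustration: the tuple encoding of instructions and the function name `true_dependencies` are assumptions made here, and only register names are tracked (immediates and memory are ignored).

```python
# Each instruction: (label, destination register or None, source registers).
# This encoding is an illustrative assumption; only register names matter here.
program = [
    ("A", "R1", ["R2"]),        # R1 = Load 16[R2]
    ("B", "R3", ["R1", "R4"]),  # R3 = R1 + R4
    ("C", "R6", ["R9"]),        # R6 = Load 8[R9]
    ("D", "R5", ["R2"]),        # R5 computed from R2
    ("E", "R7", ["R5"]),        # R7 = Load 20[R5]
    ("F", "R4", ["R4"]),        # R4 computed from R4
    ("G", None, ["R4"]),        # BEQ R4, #0 (no destination register)
]

def true_dependencies(insns):
    """Return (producer, consumer) pairs for read-after-write dependencies."""
    last_writer = {}  # register name -> label of the most recent writer
    deps = []
    for label, dst, srcs in insns:
        # Sources are checked before the destination is recorded, so an
        # instruction such as F never depends on itself.
        for src in srcs:
            if src in last_writer:
                deps.append((last_writer[src], label))
        if dst is not None:
            last_writer[dst] = label
    return deps

deps = true_dependencies(program)
print(deps)
```

C has no register dependence on A or B, which is what lets an out-of-order machine run it under A's cache miss.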

Example Pipeline Terminology
- In-order 4-stage pipeline
  - F: Fetch
  - D: Decode and read register file
  - X: Execute and memory access
  - W: Writeback to register file
[Figure: pipeline with I$, BP, regfile, and D$]

Example Pipeline Diagram
- Alternative pipeline diagram
  - Down: instructions
  - Across: pipeline stages
  - In boxes: cycles
  - Convenient to follow out-of-order execution

Insn              D     X      W
f1 = ldf (r1)     c1    c2     c3
f2 = mulf f0,f1   c3    c4+    c7
stf f2,(r1)       c7    c8     c9
r1 = addi r1,4    c8    c9     c10
f1 = ldf (r1)     c10   c11    c12
f2 = mulf f0,f1   c12   c13+   c16
stf f2,(r1)       c16   c17    c18

Instruction Buffer
[Figure: pipeline with an insn buffer added; I$, BP, regfile, and D$ as before]
- Trick: instruction buffer (a.k.a. instruction window)
  - A set of hardware components to hold in-flight instructions
- Split D into two parts
  - Accumulate decoded instructions in the buffer in order
  - Buffer sends instructions down the rest of the pipeline out of order

Dispatch and Issue
[Figure: pipeline with insn buffer; I$, BP, regfile, and D$]
- Dispatch (D): first part of decode
  - Allocate slot in instruction buffer (if buffer is not full)
  - In order: blocks younger instructions (if buffer is full)
- Issue (S): second part of decode
  - Send instructions from instruction buffer to execution units
  - Out of order: doesn't block younger instructions

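Dispatch and Issue in Diversified Pipelines
[Figure: insn buffer feeding several execution pipelines, including E* E* E* stages and, for example, a floating-point pipeline with E+ and E/ stages and its own F-regfile; I$, BP, regfile, and D$ as before]
- Number of pipeline stages per FU can vary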

Register Renaming (1)
- Anti (WAR) and output (WAW) dependencies are false
  - Dependence is on name/location, not on data
  - Given infinite registers, WAR/WAW don't arise
  - Renaming removes WAR/WAW, but leaves RAW intact
- Register renaming (in hardware)
  - Change register names to eliminate WAR/WAW hazards
  - Architectural registers (r1, f0, ...) are names, not storage locations
  - Can have more locations than names
  - Can have multiple active versions of the same name
- How does it work?
  - Map table: maps names to most recent locations
  - On a write: allocate new location (from a free list), note in map table
  - On a read: find location of most recent write via map table

Register Renaming (2)
- Anti (WAR) and output (WAW) dependencies are false
  - Dependence is on name/location, not on data
  - Given infinite registers, WAR/WAW don't arise
  - Renaming removes WAR/WAW, but leaves RAW intact
- Example
  - Names: r1, r2, r3
  - Physical locations: p1-p7
  - Original: r1 -> p1, r2 -> p2, r3 -> p3; p4-p7 are free
  - Operand order: sources first, destination last

MapTable (before)   FreeList (before)   Original insn    Renamed insn
r1  r2  r3
p1  p2  p3          p4,p5,p6,p7         add r2,r3,r1     add p2,p3,p4
p4  p2  p3          p5,p6,p7            sub r2,r1,r3     sub p2,p4,p5
p4  p2  p5          p6,p7               mul r2,r3,r3     mul p2,p5,p6
p4  p2  p6          p7                  div r1,4,r1      div p4,4,p7
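The map-table and free-list mechanics of this example can be sketched in a few lines of Python. This is a minimal illustration, not the hardware design: the function name `rename` and the tuple encoding `(op, src1, src2, dst)` are assumptions made here, and the sketch never returns locations to the free list.

```python
from collections import deque

def rename(insns, map_table, free_list):
    """Rename (op, src1, src2, dst) tuples; sources first, destination last."""
    renamed = []
    for op, src1, src2, dst in insns:
        # On a read: find the location of the most recent write via the map
        # table. Operands that are not register names (the immediate "4")
        # pass through unchanged.
        p1 = map_table.get(src1, src1)
        p2 = map_table.get(src2, src2)
        # On a write: allocate a new location from the free list and note it
        # in the map table. Sources were read first, so "div r1,4,r1" still
        # sees the old mapping of r1.
        pd = free_list.popleft()
        map_table[dst] = pd
        renamed.append((op, p1, p2, pd))
    return renamed

map_table = {"r1": "p1", "r2": "p2", "r3": "p3"}
free_list = deque(["p4", "p5", "p6", "p7"])
program = [
    ("add", "r2", "r3", "r1"),
    ("sub", "r2", "r1", "r3"),
    ("mul", "r2", "r3", "r3"),
    ("div", "r1", "4", "r1"),
]
renamed = rename(program, map_table, free_list)
for insn in renamed:
    print(insn)
```

The WAW pair (sub and mul both write r3) ends up writing p5 and p6, so the two writes no longer conflict, while the RAW chain add -> sub -> mul is preserved through p4 and p5.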


Tomasulo's Algorithm for OOO
- Reservation Stations (RS): buffers to hold instructions
- Common Data Bus (CDB): broadcasts instruction results to RS
- Does two things:
  - Register renaming: removes WAR/WAW hazards
  - Forwarding (not shown for now to keep the example simpler; will discuss later)

Tomasulo Data Structures (1)
[Figure: Tomasulo datapath: fetched insns, Map Table (T), Regfile (value), Reservation Stations (R, op, T, T1, T2, V1, V2), FU, and the CDB (CDB.T, CDB.V)]

Tomasulo Data Structures (2)
- Reservation Stations (RS)
  - FU, busy, op, R (architectural destination register's name)
  - T: destination register tag (RS# of this RS)
  - T1, T2: source register tags (RS# of the RS that will output the value)
  - V1, V2: source register values
- Map Table, a.k.a. Register Alias Table (RAT)
  - Holds mappings from architectural registers to RS#
  - T: tag (RS#) that will write this register
  - Valid tags indicate the RS# that will produce the result
- Common Data Bus (CDB)
  - Completed instructions broadcast their <RS#, value> on the CDB
  - RS and Register File monitor the CDB to learn about completed instructions

Tomasulo Pipeline
- New pipeline structure: F, D, S, X, W
- D (dispatch)
  - Structural hazard ? stall : allocate RS entry
  - In this case, structural hazard means there is no free RS entry
- S (issue)
  - RAW hazard ? wait (monitor CDB) : go to execute
- W (writeback)
  - Write register + free RS entry
  - W and RAW-dependent S in same cycle: instruction(s) waiting for this result to be produced can now issue
  - W and structurally-stalled D in same cycle: instruction waiting for a free RS entry can now be dispatched

Tomasulo Dispatch (D)
[Figure: Tomasulo datapath with the dispatch path highlighted]
- Allocate RS entry (structural stall if no free entry)
- Input register ready ? read RegFile value into RS : read tag into RS
- Set register status (i.e., rename) for output register in Map Table

Tomasulo Issue (S)
[Figure: Tomasulo datapath with the issue path highlighted]
- Wait for RAW hazards
- Read register values from RS

Tomasulo Execute (X)
[Figure: Tomasulo datapath with the execute path highlighted]

Tomasulo Writeback (W)
[Figure: Tomasulo datapath with the writeback path highlighted]
- Broadcast <RS#, value> on CDB
- Map Table entry for R still matches this RS# ? clear Map Table entry, write result to RegFile
- Compare CDB.T with all T1 and T2 fields in RS: tag match ? clear tag, copy value
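The dispatch and writeback rules above can be sketched as a small Python model. It is a simplification under stated assumptions: the class and method names (`RSEntry`, `Tomasulo`, `dispatch`, `ready`, `writeback`) are invented for illustration, the reservation stations form one unified pool rather than being tied to specific FUs, and execution itself is not modeled (the caller supplies the result value at writeback).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RSEntry:
    busy: bool = False
    op: Optional[str] = None
    R: Optional[str] = None     # architectural destination register name
    T1: Optional[int] = None    # RS# producing source 1 (None if value is ready)
    T2: Optional[int] = None    # RS# producing source 2 (None if value is ready)
    V1: Optional[float] = None  # source 1 value
    V2: Optional[float] = None  # source 2 value

class Tomasulo:
    def __init__(self, num_rs, regfile):
        self.rs = {tag: RSEntry() for tag in range(1, num_rs + 1)}
        self.regfile = dict(regfile)
        self.map_table = {}  # architectural register -> RS# that will write it

    def _read(self, reg):
        """Return (tag, value) for a source; exactly one of them is set."""
        if reg is None:                 # instruction has no such operand
            return None, None
        tag = self.map_table.get(reg)
        if tag is not None:             # not ready: read the tag into the RS
            return tag, None
        return None, self.regfile[reg]  # ready: read the RegFile value

    def dispatch(self, op, dst, src1, src2=None):
        """D stage: return the allocated RS#, or None on a structural stall."""
        free = [tag for tag, e in self.rs.items() if not e.busy]
        if not free:
            return None
        tag = free[0]
        t1, v1 = self._read(src1)
        t2, v2 = self._read(src2)
        self.rs[tag] = RSEntry(True, op, dst, t1, t2, v1, v2)
        if dst is not None:
            self.map_table[dst] = tag   # rename the output register
        return tag

    def ready(self, tag):
        """S stage condition: no outstanding RAW hazards."""
        e = self.rs[tag]
        return e.busy and e.T1 is None and e.T2 is None

    def writeback(self, tag, value):
        """W stage: broadcast <RS#, value>, update RegFile, free the entry."""
        e = self.rs[tag]
        if e.R is not None and self.map_table.get(e.R) == tag:
            del self.map_table[e.R]
            self.regfile[e.R] = value
        for other in self.rs.values():  # CDB tag match: clear tag, copy value
            if other.busy and other.T1 == tag:
                other.T1, other.V1 = None, value
            if other.busy and other.T2 == tag:
                other.T2, other.V2 = None, value
        self.rs[tag] = RSEntry()

m = Tomasulo(3, {"f0": 2.0, "f1": 0.0, "f2": 0.0, "r1": 100})
ld = m.dispatch("ldf", "f1", "r1")           # f1 = ldf (r1)
mul = m.dispatch("mulf", "f2", "f0", "f1")   # f2 = mulf f0,f1 waits on ld's tag
waiting = m.ready(mul)
m.writeback(ld, 3.0)                         # the load's value arrives on the CDB
mul2 = m.dispatch("mulf", "f2", "f0", "f1")  # WAW on f2: remaps f2, no stall
m.writeback(mul, 6.0)                        # f2 was remapped, so RegFile is untouched
```

In this run the first mulf's result never reaches the register file because f2 has already been renamed to the second mulf's RS entry, which mirrors the WAW behavior the rules above describe.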

Where is the register rename?
[Figure: Tomasulo datapath]
- Value copies in RS (V1, V2)
- Instruction stores correct input values in its own RS entry
- Free list is implicit (allocate/deallocate as part of RS)

Tomasulo Data Structures

Insn Status
Insn              D    S    X    W
f1 = ldf (r1)
f2 = mulf f0,f1
stf f2,(r1)
r1 = addi r1,4
f1 = ldf (r1)
f2 = mulf f0,f1
stf f2,(r1)

Map Table (Reg -> T): f0 = -, f1 = -, f2 = -, r1 = -

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  no
2  LD   no
3  ST   no
4  FP1  no
5  FP2  no

CDB: T = -, V = -

Tomasulo: Cycle 1

Insn Status
Insn              D    S    X    W
f1 = ldf (r1)     c1
f2 = mulf f0,f1
stf f2,(r1)
r1 = addi r1,4
f1 = ldf (r1)
f2 = mulf f0,f1
stf f2,(r1)

Map Table (Reg -> T): f0 = -, f1 = RS#2, f2 = -, r1 = -

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  no
2  LD   yes   ldf   f1  -     -     -     [r1]
3  ST   no
4  FP1  no
5  FP2  no

CDB: T = -, V = -

Notes:
- RS#2: allocate

Tomasulo: Cycle 2

Insn Status
Insn              D    S    X    W
f1 = ldf (r1)     c1   c2
f2 = mulf f0,f1   c2
stf f2,(r1)
r1 = addi r1,4
f1 = ldf (r1)
f2 = mulf f0,f1
stf f2,(r1)

Map Table (Reg -> T): f0 = -, f1 = RS#2, f2 = RS#4, r1 = -

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  no
2  LD   yes   ldf   f1  -     -     -     [r1]
3  ST   no
4  FP1  yes   mulf  f2  -     RS#2  [f0]  -
5  FP2  no

CDB: T = -, V = -

Notes:
- RS#4: allocate

Tomasulo: Cycle 3

Insn Status
Insn              D    S    X    W
f1 = ldf (r1)     c1   c2   c3
f2 = mulf f0,f1   c2
stf f2,(r1)       c3
r1 = addi r1,4
f1 = ldf (r1)
f2 = mulf f0,f1
stf f2,(r1)

Map Table (Reg -> T): f0 = -, f1 = RS#2, f2 = RS#4, r1 = -

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  no
2  LD   yes   ldf   f1  -     -     -     [r1]
3  ST   yes   stf   -   RS#4  -     -     [r1]
4  FP1  yes   mulf  f2  -     RS#2  [f0]  -
5  FP2  no

CDB: T = -, V = -

Notes:
- RS#3: allocate

Tomasulo: Cycle 4

Insn Status
Insn              D    S    X    W
f1 = ldf (r1)     c1   c2   c3   c4
f2 = mulf f0,f1   c2   c4
stf f2,(r1)       c3
r1 = addi r1,4    c4
f1 = ldf (r1)
f2 = mulf f0,f1
stf f2,(r1)

Map Table (Reg -> T): f0 = -, f1 = RS#2 (cleared at W), f2 = RS#4, r1 = RS#1

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  yes   addi  r1  -     -     [r1]  -
2  LD   no
3  ST   yes   stf   -   RS#4  -     -     [r1]
4  FP1  yes   mulf  f2  -     RS#2  [f0]  CDB.V
5  FP2  no

CDB: T = RS#2, V = [f1]

Notes:
- RS#1: allocate; RS#2: free
- ldf finished (W): clear f1 RegStatus, CDB broadcast
- RS#2 ready: mulf (RS#4) grabs the CDB value

Tomasulo: Cycle 5

Insn Status
Insn              D    S    X    W
f1 = ldf (r1)     c1   c2   c3   c4
f2 = mulf f0,f1   c2   c4   c5
stf f2,(r1)       c3
r1 = addi r1,4    c4   c5
f1 = ldf (r1)     c5
f2 = mulf f0,f1
stf f2,(r1)

Map Table (Reg -> T): f0 = -, f1 = RS#2, f2 = RS#4, r1 = RS#1

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  yes   addi  r1  -     -     [r1]  -
2  LD   yes   ldf   f1  -     RS#1  -     -
3  ST   yes   stf   -   RS#4  -     -     [r1]
4  FP1  yes   mulf  f2  -     -     [f0]  [f1]
5  FP2  no

CDB: T = -, V = -

Notes:
- RS#2: allocate

Tomasulo: Cycle 6

Insn Status
Insn              D    S    X    W
f1 = ldf (r1)     c1   c2   c3   c4
f2 = mulf f0,f1   c2   c4   c5+
stf f2,(r1)       c3
r1 = addi r1,4    c4   c5   c6
f1 = ldf (r1)     c5
f2 = mulf f0,f1   c6
stf f2,(r1)

Map Table (Reg -> T): f0 = -, f1 = RS#2, f2 = RS#5 (overwrites RS#4), r1 = RS#1

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  yes   addi  r1  -     -     [r1]  -
2  LD   yes   ldf   f1  -     RS#1  -     -
3  ST   yes   stf   -   RS#4  -     -     [r1]
4  FP1  yes   mulf  f2  -     -     [f0]  [f1]
5  FP2  yes   mulf  f2  -     RS#2  [f0]  -

CDB: T = -, V = -

Notes:
- RS#5: allocate
- No stall on WAW: overwrite f2 RegStatus; anyone who needs the old f2 tag already has it

Tomasulo: Cycle 7

Insn Status
Insn              D    S    X    W
f1 = ldf (r1)     c1   c2   c3   c4
f2 = mulf f0,f1   c2   c4   c5+
stf f2,(r1)       c3
r1 = addi r1,4    c4   c5   c6   c7
f1 = ldf (r1)     c5   c7
f2 = mulf f0,f1   c6
stf f2,(r1)

Map Table (Reg -> T): f0 = -, f1 = RS#2, f2 = RS#5, r1 = RS#1 (cleared at W)

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  no
2  LD   yes   ldf   f1  -     RS#1  -     CDB.V
3  ST   yes   stf   -   RS#4  -     -     [r1]
4  FP1  yes   mulf  f2  -     -     [f0]  [f1]
5  FP2  yes   mulf  f2  -     RS#2  [f0]  -

CDB: T = RS#1, V = [r1]

Notes:
- No stall on WAR: anyone who needs the old r1 has an RS copy
- D stall on store RS: structural (no space)
- addi finished (W): clear r1 RegStatus, CDB broadcast
- RS#1 ready: ldf (RS#2) grabs the CDB value

Tomasulo: Cycle 8

Insn Status
Insn              D    S    X    W
f1 = ldf (r1)     c1   c2   c3   c4
f2 = mulf f0,f1   c2   c4   c5+  c8
stf f2,(r1)       c3   c8
r1 = addi r1,4    c4   c5   c6   c7
f1 = ldf (r1)     c5   c7   c8
f2 = mulf f0,f1   c6
stf f2,(r1)

Map Table (Reg -> T): f0 = -, f1 = RS#2, f2 = RS#5, r1 = -

Reservation Stations
T  FU   busy  op    R   T1    T2    V1     V2
1  ALU  no
2  LD   yes   ldf   f1  -     -     -      [r1]
3  ST   yes   stf   -   RS#4  -     CDB.V  [r1]
4  FP1  no
5  FP2  yes   mulf  f2  -     RS#2  [f0]   -

CDB: T = RS#4, V = [f2]

Notes:
- mulf finished (W); f2 already overwritten by 2nd mulf (RS#5), so only the CDB broadcast happens
- RS#4 ready: stf (RS#3) grabs the CDB value

Tomasulo: Cycle 9

Insn Status
Insn              D    S    X    W
f1 = ldf (r1)     c1   c2   c3   c4
f2 = mulf f0,f1   c2   c4   c5+  c8
stf f2,(r1)       c3   c8   c9
r1 = addi r1,4    c4   c5   c6   c7
f1 = ldf (r1)     c5   c7   c8   c9
f2 = mulf f0,f1   c6   c9
stf f2,(r1)

Map Table (Reg -> T): f0 = -, f1 = RS#2 (cleared at W), f2 = RS#5, r1 = -

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  no
2  LD   no
3  ST   yes   stf   -   -     -     [f2]  [r1]
4  FP1  no
5  FP2  yes   mulf  f2  -     RS#2  [f0]  CDB.V

CDB: T = RS#2, V = [f1]

Notes:
- 2nd ldf finished (W): clear f1 RegStatus, CDB broadcast
- RS#2 ready: 2nd mulf (RS#5) grabs the CDB value

Tomasulo: Cycle 10

Insn Status
Insn              D    S    X    W
f1 = ldf (r1)     c1   c2   c3   c4
f2 = mulf f0,f1   c2   c4   c5+  c8
stf f2,(r1)       c3   c8   c9   c10
r1 = addi r1,4    c4   c5   c6   c7
f1 = ldf (r1)     c5   c7   c8   c9
f2 = mulf f0,f1   c6   c9   c10
stf f2,(r1)       c10

Map Table (Reg -> T): f0 = -, f1 = -, f2 = RS#5, r1 = -

Reservation Stations
T  FU   busy  op    R   T1    T2    V1    V2
1  ALU  no
2  LD   no
3  ST   yes   stf   -   RS#5  -     -     [r1]
4  FP1  no
5  FP2  yes   mulf  f2  -     -     [f0]  [f1]

CDB: T = -, V = -

Notes:
- stf finished (W): no output register, so no CDB broadcast
- RS#3: free, then allocate for the 2nd stf

Superscalar Tomasulo Pipeline
- Recall: dynamic scheduling and multi-issue are orthogonal
  - N: superscalar width (number of parallel operations), to allow superscalar
  - WS: window size (number of reservation stations), to allow out-of-order
- What is needed for an N-by-WS Tomasulo?
  - RS: N tag/value write (D), N value read (S), 2WS tag cmp (W)
  - Select logic: WS-to-N priority encoder (S)
  - Map Table: 2N read ports (D), N write ports (D)
  - Register File: 2N read ports (D), N write ports (W)
  - CDB: N (W)

Superscalar Select Logic
- Among all the ready instructions in RS, which one(s) to issue next?
- Superscalar select logic: has to choose N instructions out of up to WS
  - WS-to-N priority encoder
  - Somewhat complicated: O(N^2 log_2 WS)
- Can simplify using different RS designs
- Split design
  - Divide RS into N banks: 1 per FU?
  - Implement N separate (WS/N)-to-1 encoders
  + Simpler: N * log_2(WS/N)
  - Less scheduling flexibility
- FIFO design
  - Split RS into N banks, and only issue the instruction at the head of each RS bank
  + Simpler: no select logic at all
  - Less scheduling flexibility (but surprisingly not that bad)
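The difference between the monolithic and split select designs can be illustrated with a small behavioral sketch. The function names `select` and `select_split` are assumptions made for illustration, priority is assumed to follow RS index (lowest index wins), and WS is assumed to be divisible by N; real select logic is combinational hardware, not a loop.

```python
def select(ready, n):
    """Monolithic WS-to-N select: pick up to n ready entries, lowest index first."""
    picked = []
    for idx, is_ready in enumerate(ready):
        if is_ready and len(picked) < n:
            picked.append(idx)
    return picked

def select_split(ready, n):
    """Split design: n banks of WS/n entries, one (WS/n)-to-1 select per bank."""
    bank_size = len(ready) // n  # assumes len(ready) is divisible by n
    picked = []
    for b in range(n):
        bank = range(b * bank_size, (b + 1) * bank_size)
        # Each bank can issue at most one instruction per cycle.
        first = next((i for i in bank if ready[i]), None)
        if first is not None:
            picked.append(first)
    return picked

# WS = 8, N = 2; both ready instructions sit in the first bank.
ready = [False, True, True, False, False, False, False, False]
full = select(ready, 2)
split = select_split(ready, 2)
print(full, split)
```

With both ready instructions in the same bank, the monolithic select issues two instructions while the split design issues only one, which is the loss of scheduling flexibility noted above.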

Can We Add Forwarding?
[Figure: Tomasulo datapath]
- Yes, but it's more complicated than you might think
- In fact: requires a completely new pipeline

Out-of-Order Forwarding Is Hard (1)

                  No Forwarding          Forwarding
Insn              D    S    X    W       D    S    X    W
f1 = ldf (r1)     c1   c2   c3   c4      c1   c2   c3   c4
f2 = mulf f0,f1   c2   c4   c5+  c8      c2   c3   c4+  c7

- Forwarding: ldf X in c3 -> mulf X in c4
- This means mulf should do S in c3
- But how can mulf do S in c3 if ldf does W in c4?
  - Must change pipeline

Out-of-Order Forwarding Is Hard (2)
- Modern OOO schedulers with forwarding
  - Split CDB tag and value, move tag broadcast to S
  - ldf tag broadcast now in S -> mulf S in cycle 3
- How do multi-cycle operations work?
  - Delay tag broadcast according to FU latency
- How about variable-latency operations (e.g., cache misses)?
  - Speculatively broadcast tag assuming best-case delay (e.g., cache hit)
  - If wrong, kill and replay the dependent instructions
    - And their dependent instructions, and their dependents, etc.
- Very complex schedulers used in high-performance processors
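The timing rule behind delayed tag broadcast can be written as simple arithmetic. The sketch below assumes that a producer issued (S) in cycle s with an X latency of L cycles lets its dependent issue in cycle s + L, which matches the slide's case of ldf S in c2 (one-cycle X) and mulf S in c3. The function names and the replay bookkeeping are illustrative assumptions, not a description of any particular scheduler.

```python
def consumer_issue_cycle(producer_issue, latency):
    """Earliest S cycle of a dependent, given the producer's S cycle and X latency."""
    return producer_issue + latency

def schedule_dependent(producer_issue, assumed_latency, actual_latency):
    """Speculatively wake the dependent using the best-case latency.

    Returns the cycle in which the dependent finally issues with a correct
    operand, and whether it had to be killed and replayed first.
    """
    if actual_latency == assumed_latency:
        return consumer_issue_cycle(producer_issue, assumed_latency), False
    # Wrong guess (e.g., a cache miss): the speculative issue is squashed and
    # the dependent reissues once the real latency is known.
    return consumer_issue_cycle(producer_issue, actual_latency), True

hit = schedule_dependent(producer_issue=2, assumed_latency=1, actual_latency=1)
miss = schedule_dependent(producer_issue=2, assumed_latency=1, actual_latency=10)
print(hit, miss)
```

The miss latency of 10 cycles is an arbitrary value chosen for the example; the point is only that a mispredicted latency forces a replay of the dependent.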