
CSE 502: Computer Architecture Out-of-Order Schedulers

Data-Capture Scheduler
Dispatch: read available operands from the ARF/ROB and store them in the scheduler; missing operands are later captured from the bypass network
Issue: when ready, operands are sent directly from the scheduler to the functional units
[Figure: Fetch & Dispatch feeds the data-capture scheduler, which feeds the functional units; results bypass back into the scheduler, update the PRF/ROB, and reach the ARF at commit]

Components of a Scheduler
- Buffer for unexecuted instructions, also called the Scheduler Entries, Issue Queue (IQ), or Reservation Stations (RS)
- Method for tracking state of dependencies (resolved or not)
- Method for notification of dependency resolution
- Method (an arbiter) for choosing between multiple ready instructions competing for the same resource
[Figure: instructions A through G buffered in scheduler entries, with an arbiter choosing among them]

Scheduling Loop or Wakeup-Select Loop
Wake-up part: an executing insn. notifies its dependents; waiting insns. check if all their deps are satisfied and, if so, wake up
Select part: choose which instructions get to execute; more than one insn. can be ready, but the number of functional units and memory ports is limited
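
Below is a minimal sketch of this wakeup-select loop in C. It assumes a toy scheduler where each entry tracks two source tags and a ready bit per source; all names (Entry, wakeup, select_one) are illustrative, not from the slides.

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_ENTRIES 8
#define NO_TAG (-1)

typedef struct {
    bool valid;          /* entry holds an instruction            */
    int  src_tag[2];     /* producer tags (NO_TAG if none needed) */
    bool src_ready[2];   /* set by wakeup on a tag match          */
} Entry;

/* Wakeup: a completing producer broadcasts its tag; every waiting
 * entry compares it against its not-yet-ready sources. */
void wakeup(Entry *rs, int tag) {
    for (int i = 0; i < NUM_ENTRIES; i++)
        for (int s = 0; s < 2; s++)
            if (rs[i].valid && rs[i].src_tag[s] == tag)
                rs[i].src_ready[s] = true;
}

/* Select: among ready entries, pick one (here: lowest index).
 * Returns -1 if nothing is ready. */
int select_one(const Entry *rs) {
    for (int i = 0; i < NUM_ENTRIES; i++)
        if (rs[i].valid && rs[i].src_ready[0] && rs[i].src_ready[1])
            return i;
    return -1;
}

int main(void) {
    Entry rs[NUM_ENTRIES] = {0};
    /* R3 = R1 + R2: waiting on tag 1; second operand already ready */
    rs[0] = (Entry){ .valid = true, .src_tag = {1, NO_TAG},
                     .src_ready = {false, true} };
    printf("before wakeup: %d\n", select_one(rs));  /* -1            */
    wakeup(rs, 1);                  /* producer of tag 1 completes   */
    printf("after wakeup:  %d\n", select_one(rs));  /* 0: now ready  */
    return 0;
}
```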

Scalar Scheduler (Issue Width = 1)
[Figure: each scheduler entry compares its source tags (e.g., T14, T16, T39) against the single tag broadcast bus; select logic picks one ready entry and sends it to the execute logic]

Superscalar Scheduler (detail of one entry)
[Figure: one entry holds Src L / Val L / Rdy L, Src R / Val R / Rdy R, a Dst tag, and an Issued bit; each source tag is compared against every tag broadcast bus, and the tags/ready logic raises a bid to the select logic, which returns a grant]
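
A sketch of one such entry's ready logic: with an issue width of N there are N tag broadcast buses, so each entry needs 2N comparators (one per source per bus, matching the eight "=" comparators in the figure for N=4). The names below (Src, wakeup_entry, bid) are illustrative.

```c
#include <stdbool.h>
#include <stdio.h>

#define ISSUE_WIDTH 4              /* number of tag broadcast buses */

typedef struct {
    int  tag;                      /* Src L or Src R                */
    bool rdy;                      /* Rdy L or Rdy R                */
} Src;

/* One comparator per (source, bus) pair; this 2*N comparator cost
 * per entry is where superscalar wakeup area and power go. */
void wakeup_entry(Src *l, Src *r, const int bus[ISSUE_WIDTH]) {
    for (int b = 0; b < ISSUE_WIDTH; b++) {
        if (!l->rdy && l->tag == bus[b]) l->rdy = true;
        if (!r->rdy && r->tag == bus[b]) r->rdy = true;
    }
}

/* The entry bids to the select logic once both sources are ready. */
bool bid(const Src *l, const Src *r) { return l->rdy && r->rdy; }

int main(void) {
    Src l = { .tag = 39, .rdy = false }, r = { .tag = 8, .rdy = true };
    int buses[ISSUE_WIDTH] = { 6, 39, 17, 42 };   /* tags this cycle */
    wakeup_entry(&l, &r, buses);
    printf("entry bids: %d\n", bid(&l, &r));      /* 1: T39 matched  */
    return 0;
}
```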

Interaction with Execution
[Figure: the select logic indexes a payload RAM that holds each entry's opcode, destination, and captured Val L / Val R; the selected entry's payload is read out and sent to the functional unit]

Again, But Superscalar
[Figure: two entries are selected per cycle, each reading its opcode and Val L / Val R from its own payload RAM read port; the scheduler captures values as they are broadcast]

Issue Width
Max insns. selected each cycle is the issue width
Previous slides showed different issue widths: four, one, and two
Hardware requirements: naively, an issue width of N requires N tag broadcast buses
Can specialize some of the issue slots, e.g., a slot that only executes branches (no outputs)

Simple Scheduler Pipeline
[Timing diagram: in cycle i, A does Select, Payload, and Execute, broadcasting its tag and result; dependent B wakes up and captures on the tag match, then does Select, Payload, and Execute in cycle i+1, waking C in turn]
Very long clock cycle

Deeper Scheduler Pipeline
[Timing diagram over cycles i through i+3: Select, Payload, and Execute each get their own cycle; A's tag broadcast wakes B, which captures A's result broadcast and proceeds one cycle behind, waking C in turn]
Faster, but Capture & Payload on the same cycle

Even Deeper Scheduler Pipeline
[Timing diagram over cycles i through i+4: Wakeup and Capture take separate stages; C's two operands wake on tag matches in different cycles (only after the second match is C ready), with results broadcast and bypassed to dependents]
No simultaneous read/write!
Need a second level of bypassing

Very Deep Scheduler Pipeline
[Timing diagram over cycles i through i+6: A and B are both ready, but only A is selected, so B bids again; C and D wake up, capture, and issue in later cycles; the A-to-C and C-to-D results must be bypassed, while B-to-D is OK without bypass]
Dependent instructions can't execute back-to-back

Pipelining Critical Loops
The Wakeup-Select loop is hard to pipeline: no back-to-back execution of dependent insns.
Worst-case IPC is ½, but the worst case is unusual (the last example had IPC ⅔)
[Figure: regular scheduling vs. no-back-to-back scheduling of dependent insns. A, B, C]
Studies indicate a 10-15% IPC penalty

IPC vs. Frequency
A 10-15% IPC hit is not bad if the frequency can double: splitting a 1000ps stage into two 500ps stages turns 2.0 IPC at 1 GHz (2 BIPS) into 1.7 IPC at 2 GHz (3.4 BIPS)
But frequency doesn't double, due to latch/pipeline overhead and stage imbalance (e.g., 900ps of logic ideally splits 450ps/450ps; an imbalanced 350ps/550ps split plus overhead yields only about 1.5 GHz)

Non-Data-Capture Scheduler
[Figure: compared side-by-side with the data-capture design, the scheduler holds tags only; operands are read from a unified PRF between issue and execute, and the functional units write results back as physical register updates]

Pipeline Timing
[Timing diagram. Data-capture: Select, Payload, Execute, with the dependent's Wakeup overlapped so it can select after one skip cycle. Non-data-capture: Select, Payload, Read Operands from PRF, Execute, with the dependent stretched the same way]
Substantial increase in schedule-to-execute latency

Handling Multi-Cycle Instructions
[Timing diagram: a single-cycle Add R1 = R2 + R3 wakes its dependent Xor R4 = R1 ^ R5 for back-to-back execution; a three-cycle Mul R1 = R2 × R3 must not wake its dependent Add R4 = R1 + R5 the same way]
Instructions can't execute too early

Delayed Tag Broadcast (1/3)
[Timing diagram: the three-cycle Mul R1 = R2 × R3 delays its tag broadcast until its final Exec cycle, so the dependent Add R4 = R1 + R5 wakes up just in time]
Must make sure the broadcast bus is available in the future
Bypass and data-capture get more complex

Delayed Tag Broadcast (2/3)
Assume issue width equals 2
[Timing diagram: Mul R1 = R2 × R3 delays its broadcast by two cycles; meanwhile Sub R7 = R8 - #1 and Xor R9 = R9 ^ R6 are issued later and broadcast on time, so in one cycle three instructions need to broadcast their tags!]

Delayed Tag Broadcast (3/3)
Possible solutions:
1. One select for issuing, another select for tag broadcast (messes up the timing of data-capture)
2. Pre-reserve the bus (complicated select logic: must track future cycles in addition to the current one)
3. Hold the issue slot from initial launch until tag broadcast (issue width effectively reduced by one for three cycles)
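
One way to picture solution 2: the select logic keeps a small reservation window over future cycles and refuses to issue a multi-cycle op whose broadcast slot is already claimed. A minimal sketch, with the window size and the names (try_issue, advance_cycle) as assumptions:

```c
#include <stdbool.h>
#include <stdio.h>

#define WINDOW 8                   /* how far ahead the bus is tracked */

static bool bus_reserved[WINDOW];  /* index = cycles from now          */

/* Issue only if the broadcast bus is free `latency` cycles from now. */
bool try_issue(int latency) {
    if (latency >= WINDOW || bus_reserved[latency])
        return false;              /* slot taken: insn. must re-bid    */
    bus_reserved[latency] = true;  /* pre-reserve the broadcast slot   */
    return true;
}

/* Each cycle, the whole window shifts toward the present. */
void advance_cycle(void) {
    for (int i = 0; i + 1 < WINDOW; i++)
        bus_reserved[i] = bus_reserved[i + 1];
    bus_reserved[WINDOW - 1] = false;
}

int main(void) {
    printf("3-cycle mul issues: %d\n", try_issue(3)); /* 1: slot free  */
    printf("2nd mul same cycle: %d\n", try_issue(3)); /* 0: slot taken */
    advance_cycle();
    printf("2nd mul next cycle: %d\n", try_issue(3)); /* 1: new slot   */
    return 0;
}
```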

Delayed Wakeup
Push the delay to the consumer
[Figure: the tag broadcast for R1 = R2 × R3 arrives at the waiting entry R5 = R1 + R4, but the entry waits three cycles before acknowledging it and declaring itself ready]
Must know the ancestor's latency
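
A sketch of the consumer-side countdown this slide describes: the tag match is noted immediately, but the operand is only declared ready after the producer's known latency elapses. Names are illustrative.

```c
#include <stdio.h>

typedef struct {
    int src_tag;
    int countdown;   /* -1: tag not seen yet; >0: waiting; 0: ready */
} DelayedSrc;

/* On a tag match, start counting down the producer's latency;
 * this is why the consumer must know its ancestor's latency. */
void on_tag_broadcast(DelayedSrc *s, int tag, int producer_latency) {
    if (s->countdown == -1 && s->src_tag == tag)
        s->countdown = producer_latency;
}

void tick(DelayedSrc *s) {
    if (s->countdown > 0) s->countdown--;
}

int main(void) {
    DelayedSrc r1 = { .src_tag = 1, .countdown = -1 };
    on_tag_broadcast(&r1, 1, 3);      /* a 3-cycle mul produces R1 */
    for (int c = 0; c < 3; c++) {
        printf("cycle %d: ready=%d\n", c, r1.countdown == 0);
        tick(&r1);
    }
    printf("after 3 ticks: ready=%d\n", r1.countdown == 0);
    return 0;
}
```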

Non-Deterministic Latencies
The previous approaches assume all latencies are known, but real situations have unknown latencies
Load instructions: latency is one of {L1_lat, L2_lat, L3_lat, DRAM_lat}, and DRAM_lat is not a constant either (queuing delays)
Architecture-specific cases: the PowerPC 603 has an early-out for multiplication, and the Intel Core 2 has an early-out divider as well
This makes delayed broadcast hard, and kills delayed wakeup

The Wait-and-See Approach
Complexity arises only for variable-latency ops; most insns. have known latency
Wait to learn whether the load hits or misses in the cache before broadcasting its tag
[Timing diagram: R1 = 16[$sp] checks the DL1 tags and data during Exec; only once the hit is known can the tag be broadcast, so the dependent R2 = R1 + #4 schedules late]
Load-to-use latency increases by 2 cycles (a 3-cycle load appears as 5)
It may be possible to design the cache s.t. hit/miss is known before the data; the penalty is then reduced to 1 cycle

Load-Hit Speculation
Caches work pretty well: hit rates are high (otherwise we wouldn't use caches), so assume all loads hit in the cache
[Timing diagram: R1 = 16[$sp] broadcasts its tag delayed by the DL1 latency; on a hit, the data is forwarded just in time for the dependent R2 = R1 + #4]
What to do on a cache miss?

Load-Hit Mis-speculation
[Timing diagram: the dependent is scheduled assuming a DL1 hit; when the miss is detected, the value at the cache output is bogus, so the dependent is invalidated (its ALU output ignored) and rescheduled assuming a hit at the L2 cache]
Each mis-scheduling wastes an issue slot: the tag broadcast bus, payload RAM read port, writeback/bypass bus, etc. could have been used for another instruction
There could be a miss at the L2 and again at the L3 cache; a single load can waste multiple issuing opportunities
It's hard, but we want this for performance
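
The control flow in miniature, as a hedged sketch: a dependent is issued speculatively at DL1-hit timing, and the hit/miss outcome decides whether it completes or is invalidated and left to re-bid (this time assuming an L2 hit). The state names are illustrative.

```c
#include <stdbool.h>
#include <stdio.h>

typedef enum { WAITING, ISSUED_SPEC, SQUASHED, DONE } State;

/* Called when the load's DL1 hit/miss outcome becomes known. */
void resolve_load(bool dl1_hit, State *dependent) {
    if (*dependent != ISSUED_SPEC) return;
    if (dl1_hit)
        *dependent = DONE;       /* value was forwarded just in time  */
    else
        *dependent = SQUASHED;   /* ALU output ignored; the issue slot
                                    was wasted and the insn. re-bids,
                                    now assuming an L2 hit             */
}

int main(void) {
    State add = ISSUED_SPEC;     /* issued assuming the load hits DL1 */
    resolve_load(false, &add);   /* the load actually missed          */
    printf("dependent: %s\n", add == SQUASHED ? "squashed" : "done");
    return 0;
}
```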

But wait, there's more!
[Timing diagram: on an L1-D miss, not only the load's children get squashed; grandchildren issued in their shadow must be squashed as well]
All waste issue slots, all must be rescheduled, and all waste power
None may leave the scheduler until the load hit is known

Squashing (1/3)
Squash everything in-flight between schedule and execute: relatively simple (each RS entry remembers that it was issued)
Insns. stay in the scheduler; ensure they are not re-scheduled
Not too bad: dependents are issued in order, and the mis-speculation is known before Exec
[Timing diagram: every insn. between Sched and Exec when the miss is discovered gets squashed]
May squash non-dependent instructions

Squashing (2/3)
Selective squashing with load colors: each load is assigned a unique color, and every dependent inherits its parents' colors
On a load miss, the load broadcasts its color, and anyone in the same color group gets squashed
An instruction may end up with many colors, and tracking colors requires a huge number of comparisons

Squashing (3/3)
Can list colors in unary (bit-vector) form; each insn.'s vector is the bitwise OR of its parents' vectors:
Load R1 = 16[R2]   1000 0000
Add  R3 = R1 + R4  1000 0000
Load R5 = 12[R7]   0100 0000
Load R8 = 0[R1]    1010 0000
Load R7 = 8[R4]    0001 0000
Add  R6 = R8 + R7  1011 0000
Allows squashing just the dependents
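
The bit-vector form maps directly onto machine words, as in this sketch (register names follow the table above; the color bit positions mirror the left-to-right vectors, which is an assumption about the figure's layout):

```c
#include <stdint.h>
#include <stdio.h>

typedef uint8_t Colors;                  /* one bit per in-flight load */

Colors inherit(Colors a, Colors b) { return a | b; } /* OR of parents  */

/* Squash exactly the insns. whose vector contains the miss color. */
int squashed_by(Colors insn, Colors miss) { return (insn & miss) != 0; }

int main(void) {
    Colors ld_r1  = 0x80;                     /* Load R1 = 16[R2]      */
    Colors add_r3 = inherit(ld_r1, 0);        /* Add  R3 = R1 + R4     */
    Colors ld_r8  = inherit(ld_r1, 0x20);     /* Load R8 = 0[R1]       */
    Colors ld_r7  = 0x10;                     /* Load R7 = 8[R4]       */
    Colors add_r6 = inherit(ld_r8, ld_r7);    /* Add  R6 = R8 + R7     */
    /* Load R1 misses: only its color group is squashed. */
    printf("Add R3: %d\n", squashed_by(add_r3, ld_r1));  /* 1          */
    printf("Add R6: %d\n", squashed_by(add_r6, ld_r1));  /* 1 (via R8) */
    printf("Ld  R7: %d\n", squashed_by(ld_r7,  ld_r1));  /* 0: spared  */
    return 0;
}
```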

Scheduler Allocation (1/3)
Allocate in order, deallocate in order: very simple!
But it reduces the effective scheduler size: insns. execute out-of-order, yet RS entries cannot be reused until they reach the head
[Figure: circular buffer with head and tail pointers]
Can be terrible if a load goes to memory
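
A minimal sketch of this in-order allocator: entries are claimed at the tail and can only be freed from the head, so an entry that finished out-of-order stays pinned until everything older has freed. Names are illustrative.

```c
#include <stdbool.h>
#include <stdio.h>

#define SIZE 8

typedef struct { int head, tail, count; bool done[SIZE]; } CircRS;

int alloc(CircRS *rs) {                 /* allocate in order at tail  */
    if (rs->count == SIZE) return -1;   /* scheduler full             */
    int e = rs->tail;
    rs->tail = (rs->tail + 1) % SIZE;
    rs->count++;
    rs->done[e] = false;
    return e;
}

void dealloc_in_order(CircRS *rs) {     /* free only from the head    */
    while (rs->count > 0 && rs->done[rs->head]) {
        rs->head = (rs->head + 1) % SIZE;
        rs->count--;
    }
}

int main(void) {
    CircRS rs = {0};
    int load = alloc(&rs);              /* long-latency load          */
    int add  = alloc(&rs);              /* younger, fast add          */
    rs.done[add] = true;                /* add finishes out-of-order  */
    dealloc_in_order(&rs);              /* ...but cannot be reclaimed */
    (void)load;
    printf("entries still held: %d\n", rs.count);    /* 2, not 1      */
    return 0;
}
```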

Scheduler Allocation (2/3)
Arbitrary placement improves utilization, but requires:
A complex allocator, which must scan the entry-availability bit-vector to find N free entries
Complex write logic to route N insns. to arbitrary entries
[Figure: RS allocator driven by an entry-availability bit-vector]
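
The allocator's scan can be sketched as a walk over the availability bit-vector, picking up to N free slots per cycle (the vector value and names are illustrative):

```c
#include <stdint.h>
#include <stdio.h>

/* Scan a 16-entry availability vector for up to n free slots;
 * returns how many were found, with their indices in out[]. */
int find_free(uint16_t avail, int n, int out[]) {
    int found = 0;
    for (int i = 0; i < 16 && found < n; i++)
        if (avail & (1u << i))        /* bit set = entry is free */
            out[found++] = i;
    return found;
}

int main(void) {
    uint16_t avail = 0x9210;          /* free entries: 4, 9, 12, 15 */
    int slots[4];
    int n = find_free(avail, 4, slots);
    printf("allocated %d entries:", n);
    for (int i = 0; i < n; i++)
        printf(" %d", slots[i]);
    printf("\n");
    return 0;
}
```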

Scheduler Allocation (3/3)
Segment the entries: one entry per segment may be allocated per cycle
Each allocator does a 1-of-4 selection instead of 4-of-16 as before, and the write logic is simplified
Still possible inefficiencies: full segments block allocation, which reduces the dispatch width
[Figure: four segments A through D, each with its own allocator; free RS entries exist, just not in the correct segment]

Select Logic
Goal: minimize DFG height (execution time)
This is the NP-hard Precedence-Constrained Scheduling Problem, and it is even harder here: the entire DFG is not known at scheduling time, and scheduling insns. may affect the scheduling of not-yet-fetched insns.
Today's designs implement heuristics, both for performance and for ease of implementation

Simple Select Logic
Fixed-priority grant equations:
Grant0 = 1
Grant1 = !Bid0
Grant2 = !Bid0 & !Bid1
Grant3 = !Bid0 & !Bid1 & !Bid2
...
Grant(n-1) = !Bid0 & ... & !Bid(n-2)
With x_i = Bid_i, S entries yields O(S) gates and O(log S) gate delay
[Figure: grant logic fanned across the scheduler entries]
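
In software, the same fixed-priority grant falls out of a standard bit trick: bids & -bids isolates the lowest-numbered set bit, which is exactly Grant_i = Bid_i & !Bid0 & ... & !Bid(i-1). A small sketch:

```c
#include <stdint.h>
#include <stdio.h>

/* One-hot grant vector for a fixed-priority (entry-0-first) select. */
uint32_t select_grant(uint32_t bids) {
    return bids & -bids;     /* keeps only the lowest set bid bit */
}

int main(void) {
    uint32_t bids = 0x34;             /* entries 2, 4, and 5 bidding */
    printf("grant = 0x%X\n", select_grant(bids));   /* 0x4: entry 2  */
    return 0;
}
```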

Random Select
Insns. occupy arbitrary scheduler entries, so the first ready entry may be the oldest, the youngest, or one in the middle
A simple static policy therefore yields an effectively random schedule: still correct (no dependencies are violated), but likely to be far from optimal

Oldest-First Select
Newly dispatched insns. have few dependents: no one is waiting for them yet
Old insns. in the scheduler are likely to have the most dependents, since many new insns. have been dispatched since the old insn.'s rename
Selecting the oldest likely satisfies more dependencies; finishing it sooner is likely to make more insns. ready

Implementing Oldest-First Select (1/3)
Write instructions into the scheduler in program order; as instructions issue, compress the remaining entries up so the oldest sit at one end and newly dispatched insns. enter at the other
[Figure: entries A through H compress up as instructions issue and new insns. I through L are dispatched]

Implementing Oldest-First Select (2/3)
Compressing buffers are very complex: gates, wiring, area, power
E.g., in a 4-wide machine, entries may need to shift by up to 4 positions per cycle, moving an entire instruction's worth of data: tags, opcodes, immediates, readiness, etc.

Implementing Oldest-First Select (3/3)
Alternative: store an explicit age with each entry and use age-aware select logic
[Figure: entries G, A, F, D, B, H, C, E with ages 6, 0, 5, 3, 1, 7, 2, 4; the age-aware select logic grants the oldest ready entry]
Must broadcast the grant age to the instructions
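
A sketch of an age-aware 1-of-M select: among ready entries, grant the smallest age (following the next slides' convention that a lower age number means older). Which entries are ready here is an assumption for illustration.

```c
#include <stdbool.h>
#include <stdio.h>

#define M 8

typedef struct { bool ready; int age; } Slot;

/* Returns the index of the oldest ready entry, or -1 if none. */
int select_oldest(const Slot rs[M]) {
    int grant = -1;
    for (int i = 0; i < M; i++)
        if (rs[i].ready && (grant < 0 || rs[i].age < rs[grant].age))
            grant = i;
    return grant;
}

int main(void) {
    /* Entries G,A,F,D,B,H,C,E with ages 6,0,5,3,1,7,2,4 as in the
     * figure; suppose (for illustration) D, C, and E are ready. */
    Slot rs[M] = { {false,6}, {false,0}, {false,5}, {true,3},
                   {false,1}, {false,7}, {true,2},  {true,4} };
    printf("granted entry: %d\n", select_oldest(rs)); /* 6: C, age 2 */
    return 0;
}
```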

Problems in N-of-M Select (1/2)
[Figure: N cascaded age-aware 1-of-M selects over the same entries]
Each 1-of-M select has O(log M) gate delay, and N layers gives O(N log M) delay

Problems in N-of-M Select (2/2)
Select logic must also handle functional-unit constraints: maybe two instructions are ready this cycle, but both need the divider
[Figure: scheduler holding DIV(1), ADD(2), XOR(3), DIV(4), LOAD(5), MUL(6), BR(7), ADD(8), where the number is the age; assume issue width = 4]
A plain N-of-M select picks the four oldest ready instructions; ADD(8) is the 5th-oldest ready instruction, but it should be issued because only one of the ready divides can issue this cycle

Partitioned Select
Partition the scheduler by functional unit, with a separate 1-of-M select per unit (ALU, Mul/Div, Load, Store)
[Figure: with 5 ready insts. and max issue = 4, the selects grant Add(2), Div(1), and Load(5) while the Store select sits idle: actual issue is only 3 insts]
N possible insns. issued per cycle

Multiple Units of the Same Type
Possible to have multiple popular FUs, e.g., two ALUs
[Figure: the same scheduler contents, now with two 1-of-M ALU selects alongside the Mul/Div, Load, and Store selects]

Bid to Both?
Can a ready insn. bid to both ALU selects? No! The two select blocks would see the same inputs and produce the same outputs, granting the same instruction to both ALUs
[Figure: ADD(2) and ADD(8) bidding to the select logic for ALU 1 and for ALU 2]

Chain Select Logics
Feed the bids left ungranted by ALU 1's select into ALU 2's select
[Figure: ADD(2) wins ALU 1's select; ADD(8) falls through to ALU 2's select]
Works, but doubles the select latency

Select Binding (1/2)
During dispatch/alloc, each instruction is bound to one and only one select logic
[Figure: ADD(5), SUB(8), and ADD(4) bound to ALU 1's select; XOR(2) and CMP(7) bound to ALU 2's select]

Select Binding (2/2)
[Figure: same entries as before, with CMP now aged 3]
Not-quite-oldest-first: the ready insns. are aged 2, 3, and 4, but the issued insns. are 2 and 4
Wasted resources: 3 instructions are ready, but only 1 gets to issue while the other select logic sits idle

Make N Match Functional Units?
One select per functional unit (ALU 1-3, M/D, Shift, FAdd, FM/D, SIMD, Load, Store)?
Too big and too slow

Execution Ports (1/2)
Divide the functional units into P groups, called ports
Area is only O(P²M log M), where P << F (the number of functional units)
Logic for tracking bids and grants is less complex (deals with P sets)
[Figure: five ports (Port 0 through Port 4), each grouping a subset of the FUs: ALU 1, ALU 2, ALU 3, M/D, Shift, FAdd, FM/D, SIMD, Load, Store; insns. ADD(3), LOAD(5), ADD(2), MUL(8) bid to their ports]

Execution Ports (2/2)
But more resources can go to waste
Example: SHL(1) is issued on Port 0, so ADD(5), also bound to Port 0, cannot issue even though 3 ALUs are unused
[Figure: ADD(5) and SHL(1) bound to Port 0's select, XOR(2) and CMP(3) to Port 1's, ADD(4) to Port 2's]

Port Binding
The assignment of functional units to execution ports depends on the number/type of FUs and the issue width
[Figure: three ways to bind 8 units (ALU 1, ALU 2, Load, Store, Shift, M/D, FAdd, FM/D) to N=4 ports]
Int/FP separation: only Port 3 needs to access the FP RF and support 64/80 bits
Even distribution of Int/FP units: more likely to keep all N ports busy
Each port need not have the same number of FUs; units should be bound based on frequency of usage

Port Assignment
Insns. get their port assignment at dispatch
For unique resources, assign to the only viable port (e.g., a Store must be assigned to Port 1)
For non-unique resources, an intelligent decision must be made (e.g., an ADD can go to any of Ports 0, 1, or 2)
[Figure: the five-port configuration from before]
Optimal assignment requires knowing the future
Possible heuristics: random, round-robin, load-balance, dependency-based, ...
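
Two of the listed heuristics in sketch form: round-robin rotates the starting port, while load-balance picks the least-subscribed viable port. The port count and viability table are illustrative assumptions.

```c
#include <stdio.h>

#define NUM_PORTS 5

static int rr_next;                    /* round-robin starting point */
static int port_load[NUM_PORTS];       /* insns. bound to each port  */

/* viable[p] is nonzero if port p has an FU this insn. can use. */
int assign_round_robin(const int viable[NUM_PORTS]) {
    for (int t = 0; t < NUM_PORTS; t++) {
        int p = (rr_next + t) % NUM_PORTS;
        if (viable[p]) { rr_next = (p + 1) % NUM_PORTS; return p; }
    }
    return -1;                         /* no viable port             */
}

int assign_load_balance(const int viable[NUM_PORTS]) {
    int best = -1;
    for (int p = 0; p < NUM_PORTS; p++)
        if (viable[p] && (best < 0 || port_load[p] < port_load[best]))
            best = p;
    if (best >= 0) port_load[best]++;
    return best;
}

int main(void) {
    int add_ports[NUM_PORTS] = {1, 1, 1, 0, 0};  /* ALUs on Ports 0-2 */
    printf("ADD -> port %d\n", assign_round_robin(add_ports));  /* 0 */
    printf("ADD -> port %d\n", assign_round_robin(add_ports));  /* 1 */
    printf("ADD -> port %d\n", assign_load_balance(add_ports)); /* 0 */
    return 0;
}
```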

Decentralized RS (1/4)
Area and latency depend on the number of RS entries; decentralize the RS to reduce these effects
[Figure: RS 1 (M_1 entries), RS 2 (M_2 entries), and RS 3 (M_3 entries), each feeding the select logic for its own port(s), Ports 0 through 3]
The select logic block for RS_i only has a gate delay of O(log M_i)

Decentralized RS (2/4)
Natural split: INT vs. FP
[Figure: an Int cluster (INT RF feeding ALU 1, ALU 2, Load, and Store on Ports 0-1, with int-only wakeup) and an FP cluster (FP RF feeding FAdd, FM/D, FP-Ld, and FP-St on Ports 2-3, with FP-only wakeup), sharing the L1 data cache]
Often implies a non-ROB-based physical register file: one unified integer PRF and one unified FP PRF, each managed separately with its own free list

Decentralized RS (3/4)
Fully generalized decentralized RS
[Figure: six ports (Port 0 through Port 5) covering ALU 1, ALU 2, M/D, Shift, Load, Store, FAdd, FM/D, FP-Ld, FP-St, and MOV F/I]
Over-doing it can make the RS and select logic smaller, but the tag broadcast may get out of control
Can combine with the INT/FP split idea

Decentralized RS (4/4)
Each RS cluster is smaller: easier to implement, less area, faster clock speed
But poor utilization leads to IPC loss: the partitioning must match program characteristics
Previous example: an integer program with no FP instructions runs on 2/3 of the issue width (Ports 4 and 5 are unused)