Computer Science 246. Advanced Computer Architecture. Spring 2010 Harvard University. Instructor: Prof. David Brooks


Advanced Computer Architecture, Spring 2010, Harvard University. Instructor: Prof. David Brooks (dbrooks@eecs.harvard.edu)

Lecture Outline
- Instruction-Level Parallelism
- Scoreboarding (A.8)

Instruction Level Parallelism (ILP)
- Quest for ILP drives significant research
  - Dynamic instruction scheduling
    - Scoreboarding
    - Tomasulo's Algorithm
  - Register Renaming (removing artificial dependencies: WAR/WAW)
  - Dynamic Branch Prediction
  - Superscalar/Multiple instruction Issue
  - Hardware Speculation
- All of these things fit well together

Dynamic Scheduling
- Scheduling tries to re-arrange instructions to improve performance
- Previously:
  - We assumed that when ID detects a hazard that cannot be hidden by bypassing/forwarding, the pipeline stalls
  - AND, we assumed the compiler tries to reduce this
- Now:
  - Re-arrange instructions at runtime to reduce stalls
- Why hardware and not compiler?

Goals of Scheduling
- Goal of Static Scheduling
  - Compiler tries to avoid/reduce dependencies
- Goal of Dynamic Scheduling
  - Hardware tries to avoid stalling when dependencies are present
- Why hardware and not compiler?
  - Code portability
  - More information available dynamically (at run time)
  - ISA can limit the register ID space
  - Speculation sometimes needs hardware to work well
  - Not everyone uses "gcc -O3"

Dynamic Scheduling: Basic Idea
- Example
    DIVD F0, F2, F4
    ADDD F10, F0, F8      (dependency: needs F0 from DIVD)
    SUBD F12, F8, F14     (independent)
- Hazard detection during decode stalls the whole pipeline
- No reason to stall in these cases: Out-of-Order Execution
  - Reduces stalls, improves FU utilization, more parallel execution
- To give the appearance of sequential execution: precise interrupts
  - First, we will study without this, then see how to add them

How to implement?
- Previously: Instruction Decode and Operand Fetch are a single cycle
- Now: Split ID into two phases
  - Issue: decode instructions, check for structural hazards
  - Read Operands: wait until no data hazards, then read operands
- Dividing hazard checks into a two-step process
- Out-of-order execution => WAR/WAW hazards

    WAR (on F8):              WAW (on F10):
    DIVD F0, F2, F4           DIVD F0, F2, F4
    ADDD F10, F0, F8          ADDD F10, F0, F8
    SUBD F8, F8, F14          SUBD F10, F8, F14
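
The three hazard classes named on this slide can be checked mechanically for any older/younger instruction pair. The following is a minimal Python sketch, not part of the original lecture; the `Instr` tuple and the `hazards` function are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple, Set, Tuple

class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]

def hazards(older: Instr, younger: Instr) -> Set[str]:
    """Classify the register hazards between an older and a younger instruction."""
    found = set()
    if older.dest in younger.srcs:
        found.add("RAW")   # younger reads what older writes (true dependence)
    if younger.dest in older.srcs:
        found.add("WAR")   # younger overwrites a register older still has to read
    if younger.dest == older.dest:
        found.add("WAW")   # both write the same register
    return found

divd = Instr("DIVD", "F0", ("F2", "F4"))
addd = Instr("ADDD", "F10", ("F0", "F8"))
subd_war = Instr("SUBD", "F8", ("F8", "F14"))
subd_waw = Instr("SUBD", "F10", ("F8", "F14"))

print(hazards(divd, addd))      # {'RAW'}
print(hazards(addd, subd_war))  # {'WAR'}
print(hazards(addd, subd_waw))  # {'WAW'}
```

Only the RAW case is a true data dependence; the WAR and WAW cases become a problem only once instructions may execute out of order.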

Scoreboarding Basics
- Previously: Instruction Decode and Operand Fetch are a single cycle
- Now: To support multiple instructions in the ID stage, we need two things
  - Buffered storage (Instruction Buffer/Window/Queue)
  - Split ID into two phases

    Standard Pipeline:  IF  ID  EX  M  WB
    New Pipeline:       IF  IS  Rd  EX  M  WB    (IS and Rd use the I-Buffer/Scoreboard)

Scoreboarding
- Centralized scheme
- No bypassing
- WAR/WAW hazards are a problem
- Originally proposed in the CDC 6600 (S. Cray, 1964)

Scoreboarding Stages: Issue (or Dispatch)
- Fetch: same as before
- Issue (check structural hazards)
  - If an FU is free and no other active instruction has the same destination register (WAW), then issue the instruction
  - Do not issue until structural hazards are cleared
  - Stalled instructions stay in the I-Buffer
  - Size of the buffer is also a structural hazard
    - May have to stall Fetch if the buffer fills
  - Note: Issue is in-order; a stall stops younger instructions
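
The issue test on this slide can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the CDC 6600 logic: the function name `can_issue` and the two dictionaries (`fu_busy`, `result_status`) are hypothetical names chosen for the example.

```python
def can_issue(fu, dest, fu_busy, result_status):
    """Issue check: the unit must be free (no structural hazard) and no
    active instruction may already be going to write dest (no WAW)."""
    return not fu_busy[fu] and result_status.get(dest) is None

# Assumed state: one load is active, holds the Integer unit, and will write F6.
fu_busy = {"Integer": True, "Mult1": False, "Mult2": False,
           "Add": False, "Divide": False}
result_status = {"F6": "Integer"}   # register -> FU that will write it

second_load_issues = can_issue("Integer", "F2", fu_busy, result_status)
print(second_load_issues)                                   # False: Integer unit is busy
print(can_issue("Mult1", "F0", fu_busy, result_status))     # True
print(can_issue("Add", "F6", fu_busy, result_status))       # False: WAW on F6
```

Because issue is in-order, a `False` here stalls every younger instruction behind the one being tested.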

Scoreboarding Stages: Read Operands (or Issue!)
- Read Operands (check data hazards)
  - Check the scoreboard for whether source operands are available
  - Available?
    - No earlier issued active instruction will write the register
    - No currently active FU is going to write it
  - Dynamically avoids RAW hazards
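
One way to picture the read-operands check is to record, at issue, which functional unit (if any) will produce each source, and to clear that record when the unit writes. The Python sketch below assumes this bookkeeping; the helper names `record_producers`, `ready_to_read`, and `on_write` are hypothetical.

```python
def record_producers(srcs, result_status):
    """At issue: for each source register, note the FU that will write it (or None)."""
    return [result_status.get(reg) for reg in srcs]

def ready_to_read(producers):
    """Operands may be read once no source is still waiting on a producer."""
    return all(p is None for p in producers)

def on_write(producers, fu):
    """When fu writes its result, sources waiting on it become available."""
    return [None if p == fu else p for p in producers]

# MULTD F0, F2, F4 issued while a load on the Integer unit will still write F2.
waiting = record_producers(("F2", "F4"), {"F2": "Integer"})
print(waiting, ready_to_read(waiting))   # ['Integer', None] False

cleared = on_write(waiting, "Integer")   # the load writes F2
print(cleared, ready_to_read(cleared))   # [None, None] True
```

Capturing the producer at issue matters: a younger instruction that later becomes the pending writer of the same register must not delay this read.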

Scoreboarding Stages: Execution / Write Result
- Execution
  - Execute / update scoreboard
- Write Result
  - Scoreboard checks for WAR hazards and stalls the completing instruction, if necessary
    - Before, stalls only occurred at the beginning of instructions; now they can occur at the end as well
  - Can happen if: the completing instruction's destination register matches a source register of an older instruction that has not yet read its source operands
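
The write-result check can be sketched as a scan over the other active functional units. This is a hedged Python illustration: the dictionary layout and the name `can_write` are assumptions, with `Rj`/`Rk` set to `True` meaning "operand ready but not yet read".

```python
def can_write(writer, dest, units):
    """WAR check: stall while another active unit still has to read the
    current value of dest (source matches and its flag says 'not yet read')."""
    for name, u in units.items():
        if name == writer or not u["busy"]:
            continue
        if (u["Fj"] == dest and u["Rj"]) or (u["Fk"] == dest and u["Rk"]):
            return False
    return True

# An add that will write F6 has finished executing, but an older divide
# (F10 = F0 / F6) has not read F6 yet because it is still waiting for F0.
before_divide_reads = {
    "Add":    {"busy": True, "Fj": "F8", "Fk": "F2", "Rj": False, "Rk": False},
    "Divide": {"busy": True, "Fj": "F0", "Fk": "F6", "Rj": False, "Rk": True},
}
after_divide_reads = {
    "Add":    {"busy": True, "Fj": "F8", "Fk": "F2", "Rj": False, "Rk": False},
    "Divide": {"busy": True, "Fj": "F0", "Fk": "F6", "Rj": False, "Rk": False},
}

print(can_write("Add", "F6", before_divide_reads))  # False: WAR stall
print(can_write("Add", "F6", after_divide_reads))   # True
```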

Scoreboarding Control Hardware
- Three main parts
  - Instruction Status Bits
    - Indicate which of the four stages the instruction is in
  - Functional Unit Status Bits
    - Busy (in use or not), Op (operation being performed)
    - Fi: destination register; Fj, Fk: source registers
    - Qj, Qk: FUs producing source registers Fj, Fk
    - Rj, Rk: flags indicating when Fj, Fk are ready but not yet read
  - Register Result Status
    - Which FU will write each register
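
The functional unit status fields listed above map naturally onto a small record type. The Python sketch below is only an illustration of the bookkeeping; the class name `FUStatus` and the helper `register_result_status` are hypothetical. It also shows that, in this simplified view, the register result status can be derived from the `Fi` field of the busy units.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class FUStatus:
    busy: bool = False
    op: Optional[str] = None
    Fi: Optional[str] = None   # destination register
    Fj: Optional[str] = None   # first source register
    Fk: Optional[str] = None   # second source register
    Qj: Optional[str] = None   # FU producing Fj, if still pending
    Qk: Optional[str] = None   # FU producing Fk, if still pending
    Rj: bool = False           # Fj ready but not yet read
    Rk: bool = False           # Fk ready but not yet read

def register_result_status(units: Dict[str, FUStatus]) -> Dict[str, str]:
    """Map each pending destination register to the unit that will write it."""
    return {u.Fi: name for name, u in units.items() if u.busy and u.Fi}

# An assumed snapshot with a load, a multiply, and a subtract in flight.
units = {
    "Integer": FUStatus(True, "Load", "F2", None, "R3"),
    "Mult1":   FUStatus(True, "Mult", "F0", "F2", "F4", Qj="Integer", Rk=True),
    "Mult2":   FUStatus(),
    "Add":     FUStatus(True, "Sub", "F8", "F6", "F2", Qk="Integer", Rj=True),
    "Divide":  FUStatus(),
}
status = register_result_status(units)
print(status)   # {'F2': 'Integer', 'F0': 'Mult1', 'F8': 'Add'}
```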

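Scoreboard Example
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2
  LD    F2  45+ R3
  MULTD F0  F2  F4
  SUBD  F8  F6  F2
  DIVD  F10 F0  F6
  ADDD  F6  F8  F2
Functional unit status:
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  -     Mult1    No
  -     Mult2    No
  -     Add      No
  -     Divide   No
Register result status:
  (no pending writes)
Example courtesy of Prof. Broderson, CS152, UCB, Copyright (C) 2001 UCB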

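Scoreboard Example: Cycle 1
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1
  LD    F2  45+ R3
  MULTD F0  F2  F4
  SUBD  F8  F6  F2
  DIVD  F10 F0  F6
  ADDD  F6  F8  F2
Functional unit status:
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  Yes   Load  F6   -    R2   -        -        -    Yes
  -     Mult1    No
  -     Mult2    No
  -     Add      No
  -     Divide   No
Register result status (clock 1):
  F6: Integer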

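Scoreboard Example: Cycle 2
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1      2
  LD    F2  45+ R3
  MULTD F0  F2  F4
  SUBD  F8  F6  F2
  DIVD  F10 F0  F6
  ADDD  F6  F8  F2
Functional unit status:
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  Yes   Load  F6   -    R2   -        -        -    Yes
  -     Mult1    No
  -     Mult2    No
  -     Add      No
  -     Divide   No
Register result status (clock 2):
  F6: Integer
Issue 2nd LD?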

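Scoreboard Example: Cycle 3
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1      2     3
  LD    F2  45+ R3
  MULTD F0  F2  F4
  SUBD  F8  F6  F2
  DIVD  F10 F0  F6
  ADDD  F6  F8  F2
Functional unit status:
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  Yes   Load  F6   -    R2   -        -        -    No
  -     Mult1    No
  -     Mult2    No
  -     Add      No
  -     Divide   No
Register result status (clock 3):
  F6: Integer
Issue MULT?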

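Scoreboard Example: Cycle 4
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1      2     3     4
  LD    F2  45+ R3
  MULTD F0  F2  F4
  SUBD  F8  F6  F2
  DIVD  F10 F0  F6
  ADDD  F6  F8  F2
Functional unit status:
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  -     Mult1    No
  -     Mult2    No
  -     Add      No
  -     Divide   No
Register result status (clock 4):
  F6: Integer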

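Scoreboard Example: Cycle 5
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1      2     3     4
  LD    F2  45+ R3    5
  MULTD F0  F2  F4
  SUBD  F8  F6  F2
  DIVD  F10 F0  F6
  ADDD  F6  F8  F2
Functional unit status:
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  Yes   Load  F2   -    R3   -        -        -    Yes
  -     Mult1    No
  -     Mult2    No
  -     Add      No
  -     Divide   No
Register result status (clock 5):
  F2: Integer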

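Scoreboard Example: Cycle 6
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1      2     3     4
  LD    F2  45+ R3    5      6
  MULTD F0  F2  F4    6
  SUBD  F8  F6  F2
  DIVD  F10 F0  F6
  ADDD  F6  F8  F2
Functional unit status:
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  Yes   Load  F2   -    R3   -        -        -    Yes
  -     Mult1    Yes   Mult  F0   F2   F4   Integer  -        No   Yes
  -     Mult2    No
  -     Add      No
  -     Divide   No
Register result status (clock 6):
  F0: Mult1, F2: Integer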

Scoreboard Example: Cycle 7
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1      2     3     4
  LD    F2  45+ R3    5      6     7
  MULTD F0  F2  F4    6
  SUBD  F8  F6  F2    7
  DIVD  F10 F0  F6
  ADDD  F6  F8  F2
Functional unit status:
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  Yes   Load  F2   -    R3   -        -        -    No
  -     Mult1    Yes   Mult  F0   F2   F4   Integer  -        No   Yes
  -     Mult2    No
  -     Add      Yes   Sub   F8   F6   F2   -        Integer  Yes  No
  -     Divide   No
Register result status (clock 7):
  F0: Mult1, F2: Integer, F8: Add
Read multiply operands?

Scoreboard Example: Cycle 8a (First half of clock cycle)
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1      2     3     4
  LD    F2  45+ R3    5      6     7
  MULTD F0  F2  F4    6
  SUBD  F8  F6  F2    7
  DIVD  F10 F0  F6    8
  ADDD  F6  F8  F2
Functional unit status:
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  Yes   Load  F2   -    R3   -        -        -    No
  -     Mult1    Yes   Mult  F0   F2   F4   Integer  -        No   Yes
  -     Mult2    No
  -     Add      Yes   Sub   F8   F6   F2   -        Integer  Yes  No
  -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status (clock 8):
  F0: Mult1, F2: Integer, F8: Add, F10: Divide

Scoreboard Example: Cycle 8b (Second half of clock cycle)
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1      2     3     4
  LD    F2  45+ R3    5      6     7     8
  MULTD F0  F2  F4    6
  SUBD  F8  F6  F2    7
  DIVD  F10 F0  F6    8
  ADDD  F6  F8  F2
Functional unit status:
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  -     Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
  -     Mult2    No
  -     Add      Yes   Sub   F8   F6   F2   -        -        Yes  Yes
  -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status (clock 8):
  F0: Mult1, F8: Add, F10: Divide

Scoreboard Example: Cycle 9
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1      2     3     4
  LD    F2  45+ R3    5      6     7     8
  MULTD F0  F2  F4    6      9
  SUBD  F8  F6  F2    7      9
  DIVD  F10 F0  F6    8
  ADDD  F6  F8  F2
Functional unit status (note: the Time column shows remaining execution cycles):
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  10    Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
  -     Mult2    No
  2     Add      Yes   Sub   F8   F6   F2   -        -        Yes  Yes
  -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status (clock 9):
  F0: Mult1, F8: Add, F10: Divide
Read operands for MULT & SUB? Issue ADDD?

Scoreboard Example: Cycle 10
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1      2     3     4
  LD    F2  45+ R3    5      6     7     8
  MULTD F0  F2  F4    6      9
  SUBD  F8  F6  F2    7      9
  DIVD  F10 F0  F6    8
  ADDD  F6  F8  F2
Functional unit status:
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  9     Mult1    Yes   Mult  F0   F2   F4   -        -        No   No
  -     Mult2    No
  1     Add      Yes   Sub   F8   F6   F2   -        -        No   No
  -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status (clock 10):
  F0: Mult1, F8: Add, F10: Divide

Scoreboard Example: Cycle 11
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1      2     3     4
  LD    F2  45+ R3    5      6     7     8
  MULTD F0  F2  F4    6      9
  SUBD  F8  F6  F2    7      9     11
  DIVD  F10 F0  F6    8
  ADDD  F6  F8  F2
Functional unit status:
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  8     Mult1    Yes   Mult  F0   F2   F4   -        -        No   No
  -     Mult2    No
  0     Add      Yes   Sub   F8   F6   F2   -        -        No   No
  -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status (clock 11):
  F0: Mult1, F8: Add, F10: Divide

Scoreboard Example: Cycle 12
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1      2     3     4
  LD    F2  45+ R3    5      6     7     8
  MULTD F0  F2  F4    6      9
  SUBD  F8  F6  F2    7      9     11    12
  DIVD  F10 F0  F6    8
  ADDD  F6  F8  F2
Functional unit status:
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  7     Mult1    Yes   Mult  F0   F2   F4   -        -        No   No
  -     Mult2    No
  -     Add      No
  -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status (clock 12):
  F0: Mult1, F10: Divide
Read operands for DIVD?

Scoreboard Example: Cycle 13
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1      2     3     4
  LD    F2  45+ R3    5      6     7     8
  MULTD F0  F2  F4    6      9
  SUBD  F8  F6  F2    7      9     11    12
  DIVD  F10 F0  F6    8
  ADDD  F6  F8  F2    13
Functional unit status:
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  6     Mult1    Yes   Mult  F0   F2   F4   -        -        No   No
  -     Mult2    No
  -     Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
  -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status (clock 13):
  F0: Mult1, F6: Add, F10: Divide

Scoreboard Example: Cycle 14
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1      2     3     4
  LD    F2  45+ R3    5      6     7     8
  MULTD F0  F2  F4    6      9
  SUBD  F8  F6  F2    7      9     11    12
  DIVD  F10 F0  F6    8
  ADDD  F6  F8  F2    13     14
Functional unit status:
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  5     Mult1    Yes   Mult  F0   F2   F4   -        -        No   No
  -     Mult2    No
  2     Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
  -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status (clock 14):
  F0: Mult1, F6: Add, F10: Divide

Scoreboard Example: Cycle 15
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1      2     3     4
  LD    F2  45+ R3    5      6     7     8
  MULTD F0  F2  F4    6      9
  SUBD  F8  F6  F2    7      9     11    12
  DIVD  F10 F0  F6    8
  ADDD  F6  F8  F2    13     14
Functional unit status:
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  4     Mult1    Yes   Mult  F0   F2   F4   -        -        No   No
  -     Mult2    No
  1     Add      Yes   Add   F6   F8   F2   -        -        No   No
  -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status (clock 15):
  F0: Mult1, F6: Add, F10: Divide

Scoreboard Example: Cycle 16
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1      2     3     4
  LD    F2  45+ R3    5      6     7     8
  MULTD F0  F2  F4    6      9
  SUBD  F8  F6  F2    7      9     11    12
  DIVD  F10 F0  F6    8
  ADDD  F6  F8  F2    13     14    16
Functional unit status:
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  3     Mult1    Yes   Mult  F0   F2   F4   -        -        No   No
  -     Mult2    No
  0     Add      Yes   Add   F6   F8   F2   -        -        No   No
  -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status (clock 16):
  F0: Mult1, F6: Add, F10: Divide

Scoreboard Example: Cycle 17
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1      2     3     4
  LD    F2  45+ R3    5      6     7     8
  MULTD F0  F2  F4    6      9
  SUBD  F8  F6  F2    7      9     11    12
  DIVD  F10 F0  F6    8
  ADDD  F6  F8  F2    13     14    16
Functional unit status:
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  2     Mult1    Yes   Mult  F0   F2   F4   -        -        No   No
  -     Mult2    No
  -     Add      Yes   Add   F6   F8   F2   -        -        No   No
  -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status (clock 17):
  F0: Mult1, F6: Add, F10: Divide
Why not write the result of ADD? WAR hazard!

Scoreboard Example: Cycle 18
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1      2     3     4
  LD    F2  45+ R3    5      6     7     8
  MULTD F0  F2  F4    6      9
  SUBD  F8  F6  F2    7      9     11    12
  DIVD  F10 F0  F6    8
  ADDD  F6  F8  F2    13     14    16
Functional unit status:
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  1     Mult1    Yes   Mult  F0   F2   F4   -        -        No   No
  -     Mult2    No
  -     Add      Yes   Add   F6   F8   F2   -        -        No   No
  -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status (clock 18):
  F0: Mult1, F6: Add, F10: Divide

Scoreboard Example: Cycle 19
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1      2     3     4
  LD    F2  45+ R3    5      6     7     8
  MULTD F0  F2  F4    6      9     19
  SUBD  F8  F6  F2    7      9     11    12
  DIVD  F10 F0  F6    8
  ADDD  F6  F8  F2    13     14    16
Functional unit status:
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  0     Mult1    Yes   Mult  F0   F2   F4   -        -        No   No
  -     Mult2    No
  -     Add      Yes   Add   F6   F8   F2   -        -        No   No
  -     Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes
Register result status (clock 19):
  F0: Mult1, F6: Add, F10: Divide

Scoreboard Example: Cycle 20
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1      2     3     4
  LD    F2  45+ R3    5      6     7     8
  MULTD F0  F2  F4    6      9     19    20
  SUBD  F8  F6  F2    7      9     11    12
  DIVD  F10 F0  F6    8
  ADDD  F6  F8  F2    13     14    16
Functional unit status:
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  -     Mult1    No
  -     Mult2    No
  -     Add      Yes   Add   F6   F8   F2   -        -        No   No
  -     Divide   Yes   Div   F10  F0   F6   -        -        Yes  Yes
Register result status (clock 20):
  F6: Add, F10: Divide

Scoreboard Example: Cycle 21
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1      2     3     4
  LD    F2  45+ R3    5      6     7     8
  MULTD F0  F2  F4    6      9     19    20
  SUBD  F8  F6  F2    7      9     11    12
  DIVD  F10 F0  F6    8      21
  ADDD  F6  F8  F2    13     14    16
Functional unit status:
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  -     Mult1    No
  -     Mult2    No
  -     Add      Yes   Add   F6   F8   F2   -        -        No   No
  -     Divide   Yes   Div   F10  F0   F6   -        -        Yes  Yes
Register result status (clock 21):
  F6: Add, F10: Divide
WAR hazard is now gone...

Scoreboard Example: Cycle 22
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1      2     3     4
  LD    F2  45+ R3    5      6     7     8
  MULTD F0  F2  F4    6      9     19    20
  SUBD  F8  F6  F2    7      9     11    12
  DIVD  F10 F0  F6    8      21
  ADDD  F6  F8  F2    13     14    16    22
Functional unit status:
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  -     Mult1    No
  -     Mult2    No
  -     Add      No
  39    Divide   Yes   Div   F10  F0   F6   -        -        No   No
Register result status (clock 22):
  F10: Divide

Faster than light computation (skip a couple of cycles)

Scoreboard Example: Cycle 61
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1      2     3     4
  LD    F2  45+ R3    5      6     7     8
  MULTD F0  F2  F4    6      9     19    20
  SUBD  F8  F6  F2    7      9     11    12
  DIVD  F10 F0  F6    8      21    61
  ADDD  F6  F8  F2    13     14    16    22
Functional unit status:
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  -     Mult1    No
  -     Mult2    No
  -     Add      No
  0     Divide   Yes   Div   F10  F0   F6   -        -        No   No
Register result status (clock 61):
  F10: Divide

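Scoreboard Example: Cycle 62
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1      2     3     4
  LD    F2  45+ R3    5      6     7     8
  MULTD F0  F2  F4    6      9     19    20
  SUBD  F8  F6  F2    7      9     11    12
  DIVD  F10 F0  F6    8      21    61    62
  ADDD  F6  F8  F2    13     14    16    22
Functional unit status:
  Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
  -     Integer  No
  -     Mult1    No
  -     Mult2    No
  -     Add      No
  -     Divide   No
Register result status (clock 62):
  (no pending writes)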

Review: Scoreboard Example: Cycle 62
Instruction status:
  Instruction         Issue  Read  Exec  Write
  LD    F6  34+ R2    1      2     3     4
  LD    F2  45+ R3    5      6     7     8
  MULTD F0  F2  F4    6      9     19    20
  SUBD  F8  F6  F2    7      9     11    12
  DIVD  F10 F0  F6    8      21    61    62
  ADDD  F6  F8  F2    13     14    16    22
Functional unit status: all units (Integer, Mult1, Mult2, Add, Divide) not busy
Register result status (clock 62): (no pending writes)
In-order issue; out-of-order execute & commit

Scoreboarding Review
  Cycle               1    2    3    4    5    6    7    8    9    10   11   12   13
  LD    F6, 34(R2)    Iss  Rd   Ex   Wb   -    -    -    -    -    -    -    -    -
  LD    F2, 45(R3)    -    -    -    -    Iss  Rd   Ex   Wb   -    -    -    -    -
  MULTD F0, F2, F4    -    -    -    -    -    Iss  Iss  Iss  Rd   M1   M2   M3   M4
  SUBD  F8, F6, F2    -    -    -    -    -    -    Iss  Iss  Rd   A1   A2   Wb   -
  DIVD  F10, F0, F6   -    -    -    -    -    -    -    Iss  Iss  Iss  Iss  Iss  Iss
  ADDD  F6, F8, F2    -    -    -    -    -    -    -    -    -    -    -    -    Iss

Scoreboarding Review
  Cycle               11   12   13   14   15   16   17   18   19   20   21   22   ...  62
  LD    F6, 34(R2)    -    -    -    -    -    -    -    -    -    -    -    -    -    -
  LD    F2, 45(R3)    -    -    -    -    -    -    -    -    -    -    -    -    -    -
  MULTD F0, F2, F4    M2   M3   M4   M5   M6   M7   M8   M9   M10  Wb   -    -    -    -
  SUBD  F8, F6, F2    A2   Wb   -    -    -    -    -    -    -    -    -    -    -    -
  DIVD  F10, F0, F6   Iss  Iss  Iss  Iss  Iss  Iss  Iss  Iss  Iss  Iss  Rd   D1   ...  Wb
  ADDD  F6, F8, F2    -    -    Iss  Rd   A1   A2   A2   A2   A2   A2   A2   Wb   -    -
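
The whole example can be replayed with a small timestamp-based simulation. The Python sketch below is a simplification written for these notes, not the lecture's own code. It assumes the latencies implied by the tables (integer 1, add/subtract 2, multiply 10, divide 40 cycles), assumes every decision in a cycle sees only events from earlier cycles, and uses hypothetical names (`simulate`, `PROGRAM`, `UNITS`, `LATENCY`).

```python
LATENCY = {"LD": 1, "MULTD": 10, "SUBD": 2, "ADDD": 2, "DIVD": 40}
UNITS = {"LD": ["Integer"], "MULTD": ["Mult1", "Mult2"],
         "SUBD": ["Add"], "ADDD": ["Add"], "DIVD": ["Divide"]}

# (opcode, destination register, source registers)
PROGRAM = [
    ("LD", "F6", ("R2",)),
    ("LD", "F2", ("R3",)),
    ("MULTD", "F0", ("F2", "F4")),
    ("SUBD", "F8", ("F6", "F2")),
    ("DIVD", "F10", ("F0", "F6")),
    ("ADDD", "F6", ("F8", "F2")),
]

def simulate(program):
    """Return (issue, read, exec, write) cycle numbers for each instruction."""
    t = [{"issue": None, "read": None, "exec": None, "write": None, "fu": None}
         for _ in program]
    cycle = 0
    while any(s["write"] is None for s in t):
        cycle += 1

        def before(i, stage):
            # True if instruction i completed `stage` in an earlier cycle.
            return t[i][stage] is not None and t[i][stage] < cycle

        for i, (op, dest, srcs) in enumerate(program):
            s, older = t[i], range(i)
            if s["issue"] is None:
                # In-order issue: the previous instruction must already be issued.
                if i > 0 and not before(i - 1, "issue"):
                    continue
                busy = {t[j]["fu"] for j in older if not before(j, "write")}
                free = [u for u in UNITS[op] if u not in busy]
                waw = any(program[j][1] == dest and not before(j, "write")
                          for j in older)
                if free and not waw:
                    s["issue"], s["fu"] = cycle, free[0]
            elif s["read"] is None:
                # RAW: wait until every older writer of a source has written.
                raw = any(program[j][1] in srcs and not before(j, "write")
                          for j in older)
                if not raw:
                    s["read"] = cycle
            elif s["exec"] is None:
                if cycle >= s["read"] + LATENCY[op]:
                    s["exec"] = cycle
            else:
                # WAR: wait until every older reader of dest has read it.
                war = any(dest in program[j][2] and not before(j, "read")
                          for j in older)
                if not war:
                    s["write"] = cycle
    return [(s["issue"], s["read"], s["exec"], s["write"]) for s in t]

timeline = simulate(PROGRAM)
for (op, dest, _), times in zip(PROGRAM, timeline):
    print(op, dest, times)
```

Under these assumptions the returned tuples match the final instruction status table: the ADDD finishes executing in cycle 16 but cannot write until cycle 22, one cycle after the DIVD reads F6.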

Scoreboarding Limitations
- Number and type of functional units
- Number of instruction buffer entries (scoreboard size)
- Amount of application ILP (RAW hazards)
- Presence of antidependencies (WAR) and output dependencies (WAW)
  - In-order issue for WAW/structural hazards limits the scheduler
  - WAR stalls are critical for loops (hardware loop unrolling)

Tomasulo's Approach
- Used in IBM 360/91 machines (late 60s)
- Similar to scoreboarding, but added renaming
- Key concept: Reservation Stations
- Very important topic
  - Scheduling ideas led to Alpha 21264, HP PA-8000, MIPS R10K, Pentium III, Pentium 4, PowerPC 604, etc.

Reservation Stations (RS)
- Distributed (rather than centralized) control scheme
  - Bypassing is allowed via the Common Data Bus (CDB) to the RS
  - Register renaming eliminates WAR/WAW hazards
- Scoreboard/Instruction Buffer => Reservation Stations
  - Fetch and buffer operands as soon as they are available
    - Eliminates the need to always get values from registers at execute
  - Pending instructions designate the reservation stations that will provide their inputs
  - Successive writes to a register cause only the last one to update the register

Register Renaming
- Compiler can eliminate some WAW/WAR false hazards, but not all
  - Not enough registers
  - Hazards across branches (common!)
    - Can eliminate on taken, or on fall through, but not both
  - Hazards with itself: dynamic loops (example later)
- Example (spill code causing false hazards)

    C = A + B        ADD R3, R1, R2
                     SW  R3, 0(R4)
    D = A - B        SUB R3, R1, R2

Register Renaming
- Dynamically change register names to eliminate false dependencies (WAR/WAW hazards)
- Architectural registers: names, not locations
  - Many more locations ("reservation stations" or "physical registers") than names ("logical" or "architectural" registers)
  - Dynamically map names to locations

Register Renaming Example
- Assume temporary registers S and T

    Before renaming:          After renaming:
    DIV F0, F2, F4            DIV F0, F2, F4
    ADD F6, F0, F8            ADD S, F0, F8
    SW  F6, 0(R1)             SW  S, 0(R1)
    SUB F8, F10, F14          SUB T, F10, F14
    MUL F6, F10, F8           MUL F6, F10, T
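
The renaming in this example can be mimicked with a map from architectural names to fresh locations. The Python sketch below is a simplified illustration: it gives every destination a new name (`P0`, `P1`, ...) rather than only introducing S and T, and the function name `rename` and the tuple encoding are hypothetical.

```python
from itertools import count

def rename(program):
    """Give every destination a fresh name; sources follow the latest mapping."""
    mapping = {}                       # architectural name -> current location
    fresh = (f"P{n}" for n in count())
    renamed = []
    for op, dest, srcs in program:
        new_srcs = tuple(mapping.get(r, r) for r in srcs)   # read mapping first
        new_dest = None
        if dest is not None:           # stores have no register destination
            new_dest = mapping[dest] = next(fresh)
        renamed.append((op, new_dest, new_srcs))
    return renamed

program = [
    ("DIV", "F0", ("F2", "F4")),
    ("ADD", "F6", ("F0", "F8")),
    ("SW",  None, ("F6", "R1")),
    ("SUB", "F8", ("F10", "F14")),
    ("MUL", "F6", ("F10", "F8")),
]
renamed = rename(program)
for instr in renamed:
    print(instr)
```

After renaming, SUB writes `P2` instead of F8, so it no longer conflicts with the ADD that reads F8 (WAR), and MUL writes `P3` instead of F6, so it no longer conflicts with the ADD's result (WAW). The true dependences (DIV to ADD, ADD to SW, SUB to MUL) are preserved.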

Register Renaming with Tomasulo
- At instruction issue:
  - Register specifiers for source operands are renamed to the names of the reservation stations
  - Values can exist in a reservation station or in the register file
  - To eliminate WARs, register file values are copied to reservation stations at issue
- Other methods, for example, use pointer-based renaming (map table)
- Technique used in Pentium III, PowerPC 604

Reservation Station Components
- Op: Operation to perform in the unit
- Qj, Qk: Reservation stations producing source registers (value to be written)
  - Note: No ready flags needed as in the scoreboard; Qj, Qk = 0 => ready
  - Store buffers only have Qi for the RS producing the result
- Vj, Vk: Values of source operands
  - Store buffers have a V field, the result to be stored
- Busy: Indicates the reservation station or FU is occupied
- Register Result Status: Indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.

Three Stages of Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
   If a reservation station is free (no structural hazard), control issues the instruction and sends operands (renames registers).
2. Execution: operate on operands (EX)
   When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
3. Write result: finish execution (WB)
   Write on the Common Data Bus to all awaiting units; mark the reservation station available.
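
The three stages can be illustrated with a reservation station record and a broadcast function standing in for the Common Data Bus. This Python sketch is a deliberately reduced model under stated assumptions: station names (`Div1`, `Add1`), register values, and all function names are hypothetical, and execution itself is reduced to a single division.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Station:
    name: str
    busy: bool = False
    op: Optional[str] = None
    Vj: Optional[float] = None   # value of first operand, if available
    Vk: Optional[float] = None   # value of second operand, if available
    Qj: Optional[str] = None     # station that will produce the first operand
    Qk: Optional[str] = None     # station that will produce the second operand

    def ready(self) -> bool:
        return self.busy and self.Qj is None and self.Qk is None

def read_source(reg, regs, reg_status):
    """Return (value, None) if the register is current, else (None, producer)."""
    producer = reg_status.get(reg)
    return (regs[reg], None) if producer is None else (None, producer)

def issue(rs, op, dest, src_j, src_k, regs, reg_status):
    """Stage 1: copy available operands, rename pending ones to station tags."""
    rs.busy, rs.op = True, op
    rs.Vj, rs.Qj = read_source(src_j, regs, reg_status)
    rs.Vk, rs.Qk = read_source(src_k, regs, reg_status)
    reg_status[dest] = rs.name

def broadcast(tag, value, stations, regs, reg_status):
    """Stage 3: put (tag, value) on the bus; waiting stations and registers capture it."""
    for rs in stations:
        if rs.Qj == tag:
            rs.Vj, rs.Qj = value, None
        if rs.Qk == tag:
            rs.Vk, rs.Qk = value, None
        if rs.name == tag:
            rs.busy = False                      # mark the station available
    for reg in [r for r, p in reg_status.items() if p == tag]:
        regs[reg] = value
        del reg_status[reg]

regs = {"F0": 0.0, "F2": 6.0, "F4": 2.0, "F8": 1.0, "F10": 0.0}
reg_status = {}
div1, add1 = Station("Div1"), Station("Add1")
stations = [div1, add1]

issue(div1, "DIVD", "F0", "F2", "F4", regs, reg_status)
issue(add1, "ADDD", "F10", "F0", "F8", regs, reg_status)
waiting = (add1.Qj, add1.Vk, add1.ready())       # ADDD waits on Div1 for F0

# Stage 2 for the divide, then its result goes out on the bus.
broadcast("Div1", div1.Vj / div1.Vk, stations, regs, reg_status)
print(waiting, add1.Vj, add1.ready(), regs["F0"])
```

Note that `add1` captured the value of F8 at issue, so a later write to F8 by a younger instruction could not create a WAR hazard for it.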

For next time
- Tomasulo's Algorithm: Section 3.2/3.3 of H&P
- Branch Prediction: Section 3.4/3.5 of H&P