Instruction Level Parallelism Part II - Scoreboard

Course on: Advanced Computer Architectures
Instruction Level Parallelism Part II - Scoreboard
Prof. Cristina Silvano, Politecnico di Milano
email: cristina.silvano@polimi.it

Basic Assumptions
- We consider single-issue processors.
- The Instruction Fetch stage precedes the Issue stage and may fetch either into an Instruction Register (IR) or into a queue of pending instructions.
- Instructions are then issued from the IR or from the queue.
- The Execution stage may require multiple cycles, depending on the operation type.

Key Idea: Dynamic Scheduling
- Problem: data dependences that cannot be hidden with bypassing or forwarding cause hardware stalls of the pipeline.
- Solution: allow instructions behind a stall to proceed.
- The hardware dynamically rearranges instruction execution to reduce stalls.
- Enables out-of-order execution and completion (commit).
- First implemented in the CDC 6600 (1963).

Dynamic Scheduling
Advantages:
- Enables handling cases of dependence unknown at compile time.
- Simplifies the compiler.
- Allows code compiled for one pipeline to run efficiently on a different pipeline (code portability).
Disadvantages:
- Significant increase in hardware complexity.
- Could generate imprecise exceptions.

Example 1

    DIVD F0,F2,F4
    ADDD F10,F0,F8    # RAW F0
    SUBD F12,F8,F14

- RAW hazard: ADDD stalls on F0, waiting for DIVD to commit.
- Without dynamic scheduling, SUBD would also stall, even though it does not depend on anything in the pipeline.
- Basic idea: enable SUBD to proceed (out-of-order execution).

Example 2: Analysis of dependences and hazards

    LD    F6, 34(R2)
    LD    F2, 45(R3)
    MULTD F0, F2, F4     # RAW F2
    SUBD  F8, F6, F2     # RAW F2, RAW F6
    DIVD  F10, F0, F6    # RAW F0, RAW F6
    ADDD  F6, F8, F2     # WAR F6, RAW F8, RAW F2
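The annotations in Example 2 can be cross-checked mechanically. The following is a minimal sketch, assuming a hypothetical tuple encoding `(opcode, destination, sources)` and a simple pairwise comparison that ignores intervening writes. Because it compares every pair, it reports a superset of what the slide annotates: it also flags the WAR on F6 between SUBD and ADDD, and the output (WAW) dependence on F6 between the first LD and ADDD.

```python
# Each instruction: (opcode, destination, source registers).
# Register names follow Example 2; the tuple encoding is an assumption.
PROGRAM = [
    ("LD",    "F6",  ["R2"]),
    ("LD",    "F2",  ["R3"]),
    ("MULTD", "F0",  ["F2", "F4"]),
    ("SUBD",  "F8",  ["F6", "F2"]),
    ("DIVD",  "F10", ["F0", "F6"]),
    ("ADDD",  "F6",  ["F8", "F2"]),
]


def find_dependences(program):
    """Return (kind, register, earlier_index, later_index) for every pair."""
    found = []
    for later in range(len(program)):
        _, dst_later, srcs_later = program[later]
        for earlier in range(later):
            _, dst_earlier, srcs_earlier = program[earlier]
            if dst_earlier in srcs_later:      # later reads what earlier writes
                found.append(("RAW", dst_earlier, earlier, later))
            if dst_later in srcs_earlier:      # later writes what earlier reads
                found.append(("WAR", dst_later, earlier, later))
            if dst_later == dst_earlier:       # both write the same register
                found.append(("WAW", dst_later, earlier, later))
    return found


if __name__ == "__main__":
    for kind, reg, i, j in find_dependences(PROGRAM):
        print(f"{kind} {reg}: {PROGRAM[i][0]} (#{i}) -> {PROGRAM[j][0]} (#{j})")
```

Whether a reported dependence turns into an actual hazard depends on the timing of the pipeline, which is what the scoreboard tracks at run time.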

Scoreboard Dynamic Scheduling Algorithm

Scoreboard basic scheme
Out-of-order execution divides the ID stage in two:
1. Issue: decode instructions and check for structural hazards.
2. Read operands (RR): wait until there are no data hazards, then read operands.
- Instructions execute whenever they do not depend on previous instructions and there are no hazards.
- The scoreboard allows instructions to execute whenever conditions 1 and 2 hold, without waiting for prior instructions.

Scoreboard basic scheme
- We distinguish between the time an instruction begins execution and the time it completes execution; between the two, the instruction is in execution.
- We assume the pipeline allows multiple instructions to be in execution at the same time, which requires multiple functional units, pipelined functional units, or both.
- CDC 6600: in-order issue, out-of-order execution, out-of-order completion (commit).
- No forwarding.
- Imprecise interrupt/exception model for now.

Scoreboard Architecture
[Figure: block diagram showing the functional units, the scoreboard, and memory]

Scoreboard Pipeline
- The scoreboard replaces the ID, EX, WB stages with four stages.
- The ID stage is split into two parts:
  - Issue (decode and check for structural hazards)
  - Read Operands (wait until no data hazards)
- The scoreboard allows instructions without dependencies to execute.
- In-order issue, BUT out-of-order read operands, and therefore out-of-order execution and completion.
- All instructions pass through the issue stage in order, but they can be stalled or bypass each other in the read operands stage, so they enter execution out of order and with different latencies, which implies out-of-order completion.

Scoreboard Implications
- Out-of-order completion means that WAR and WAW hazards can occur.
- Solutions for WAR:
  - Stall write back until registers have been read.
  - Read registers only during the Read Operands stage.

Scoreboard Implications
- Solution for WAW: detect the hazard and stall the issue of the new instruction until the other instruction completes.
- No register renaming.
- Multiple instructions must be in the execution phase at once, which requires multiple execution units or pipelined execution units.
- The scoreboard keeps track of dependencies and of the state of operations.

Scoreboard Scheme
Hazard detection and resolution are centralized in the scoreboard:
- Every instruction goes through the scoreboard, where a record of data dependences is constructed.
- The scoreboard then determines when the instruction can read its operands and begin execution.
- If the scoreboard decides the instruction cannot execute immediately, it monitors every change and decides when the instruction can execute.
- The scoreboard controls when the instruction can write its result into the destination register.

Exception handling
- Problem with out-of-order completion: the exception behavior must be preserved as in in-order execution.
- Solution: ensure that no instruction can generate an exception until the processor knows that the instruction raising the exception will be executed.

Imprecise exceptions
- An exception is imprecise if the processor state when the exception is raised does not look exactly as if the instructions had been executed in order.
- Imprecise exceptions can occur because:
  - The pipeline may have already completed instructions that are later in program order than the instruction causing the exception.
  - The pipeline may not yet have completed some instructions that are earlier in program order than the instruction causing the exception.
- Imprecise exceptions make it difficult to restart execution after handling.

Four Stages of Scoreboard Control
1. Issue: decode the instruction and check for structural hazards and WAW hazards.
- Instructions are issued in program order (for hazard checking).
- If a functional unit for the instruction is free and no other active instruction has the same destination register (no WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure.
- If a structural hazard or a WAW hazard exists, the instruction issue stalls, and no further instructions will issue until these hazards are cleared.
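The issue rule can be sketched as a small predicate. This is an illustrative sketch only: the function name `try_issue`, the `UNITS_FOR` mapping, and the dictionary encoding of the busy flags and of the register result status are assumptions, with unit names borrowed from the worked example.

```python
# Which functional units can run each opcode (assumed mapping, using the
# unit names that appear in the scoreboard example tables).
UNITS_FOR = {
    "LD": ["Integer"],
    "MULTD": ["Mult1", "Mult2"],
    "SUBD": ["Add"],
    "ADDD": ["Add"],
    "DIVD": ["Divide"],
}


def try_issue(op, dest, busy, result):
    """Return (unit, reason): the unit to issue to, or None plus the stall cause.

    busy   -- dict: unit name -> True if the unit is in use
    result -- dict: register -> unit that will write it (pending writes only)
    """
    free = [unit for unit in UNITS_FOR[op] if not busy.get(unit, False)]
    if not free:
        return None, "structural hazard"   # no suitable unit is free
    if dest in result:
        return None, "WAW hazard"          # an active instruction writes dest
    return free[0], "issue"


if __name__ == "__main__":
    # Cycle 2 of the example: the second load finds the Integer unit busy.
    print(try_issue("LD", "F2", {"Integer": True}, {"F6": "Integer"}))
    # A multiply targeting F6 while an ADDD to F6 is still pending.
    print(try_issue("MULTD", "F6", {"Add": True}, {"F6": "Add"}))
```

Because issue is in order, a stall reported here also blocks every later instruction.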

Four Stages of Scoreboard Control
Note that when the issue stage stalls, the buffer between Instruction Fetch and Issue fills:
- If the buffer has a single entry, IF stalls immediately.
- If the buffer is a queue of multiple instructions, IF stalls when the queue fills.

Four Stages of Scoreboard Control
2. Read Operands: wait until there are no data hazards, then read operands.
- A source operand is available if:
  - no earlier issued active instruction will write it, or
  - a functional unit is writing its value in a register.
- When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution.
- RAW hazards are resolved dynamically in this step, and instructions may be sent into execution out of order.
- There is no forwarding of data in this model.
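As a rough illustration of the availability test, the sketch below records, at issue time, which unit (if any) will produce each source operand and whether the operand is already available. The function names and the dictionary encoding are assumptions. The producers are captured at issue time on purpose: a later instruction that writes the same register (a WAR case) must not be mistaken for the producer.

```python
def capture_operands(sources, result):
    """At issue time, compute the producer (Q) and ready (R) field per source.

    sources -- list of source register names
    result  -- dict: register -> unit with a pending write to that register
    """
    q = [result.get(reg) for reg in sources]   # producing unit, or None
    r = [unit is None for unit in q]           # ready if nobody will write it
    return q, r


def can_read_operands(r):
    """The instruction may read its operands once every source is ready."""
    return all(r)


if __name__ == "__main__":
    # Cycle 7 of the example: SUBD F8, F6, F2 issues while the second load
    # (on the Integer unit) has not yet written F2.
    pending = {"F0": "Mult1", "F2": "Integer"}
    q, r = capture_operands(["F6", "F2"], pending)
    print(q, r, can_read_operands(r))
```

With this state SUBD holds in the read operands stage until the Integer unit writes F2.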

Four Stages of Scoreboard Control
3. Execution: the functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
FUs are characterized by:
- Variable latency (the effective time used to complete one operation).
- Initiation interval (the number of cycles that must elapse between issuing two operations to the same functional unit).
- Load/store latency, which depends on data cache hit/miss.
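For the worked example in this deck, the execution-complete cycle is the read-operands cycle plus the unit latency. The sketch below uses the latencies that can be read off the example tables (1 cycle for a load that hits in the data cache, 10 for multiply, 2 for add/subtract, 40 for divide); the function name is hypothetical.

```python
# Latencies in cycles, as used by the scoreboard example in this document.
LATENCY = {"LD": 1, "MULTD": 10, "SUBD": 2, "ADDD": 2, "DIVD": 40}


def execution_complete_cycle(op, read_operands_cycle):
    """Cycle in which the functional unit reports completion to the scoreboard."""
    return read_operands_cycle + LATENCY[op]


if __name__ == "__main__":
    # MULTD reads its operands in cycle 9, DIVD in cycle 21.
    print(execution_complete_cycle("MULTD", 9))
    print(execution_complete_cycle("DIVD", 21))
```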

Four Stages of Scoreboard Control
4. Write result: check for WAR hazards and finish execution.
- Once the scoreboard is aware that the functional unit has completed execution, it checks for WAR hazards.
- If there are none, it writes the result.
- If there is a WAR hazard, it stalls the completing instruction.
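The WAR check can be sketched as follows. The dictionary keys mirror the Fj/Fk/Rj/Rk fields of the functional unit status, while the function name and the encoding are assumptions. A ready flag of `True` means "operand available but not yet read"; once the operands are read, the flags drop to `False`.

```python
def can_write_result(dest, other_units):
    """True if no other active unit still has to read the old value of dest.

    other_units -- list of dicts with keys "Fj", "Fk", "Rj", "Rk" describing
                   the other busy functional units
    """
    return all(
        (unit["Fj"] != dest or not unit["Rj"])
        and (unit["Fk"] != dest or not unit["Rk"])
        for unit in other_units
    )


if __name__ == "__main__":
    # Cycle 17 of the example: ADDD wants to write F6, but DIVD (waiting for
    # F0 from MULTD) has F6 ready and has not read it yet.
    divide_waiting = {"Fj": "F0", "Fk": "F6", "Rj": False, "Rk": True}
    print(can_write_result("F6", [divide_waiting]))   # stalled by WAR
    # Cycle 22: DIVD read its operands in cycle 21, so both flags are No.
    divide_running = {"Fj": "F0", "Fk": "F6", "Rj": False, "Rk": False}
    print(can_write_result("F6", [divide_running]))
```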

WAR/WAW Example

    DIVD F0,F2,F4
    ADDD F6,F0,F8     # RAW F0
    SUBD F8,F8,F14    # WAR F8
    MULD F6,F10,F8    # WAW F6

The scoreboard would:
- stall SUBD in the WB stage until ADDD has read F0 and F8, and
- stall MULD in the issue stage until ADDD writes F6.
Both stalls can be avoided through register renaming.

Scoreboard Structure
1. Instruction status.
2. Functional unit status. Indicates the state of the functional unit (FU):
   - Busy: indicates whether the unit is busy or not
   - Op: the operation to perform in the unit (+, -, etc.)
   - Fi: destination register
   - Fj, Fk: source register numbers
   - Qj, Qk: functional units producing source registers Fj, Fk
   - Rj, Rk: flags indicating when Fj, Fk are ready. The flags are set to No after the operands are read.
3. Register result status. Indicates which functional unit will write each register. Blank if no pending instruction will write that register.
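One possible in-memory layout for the three tables is sketched below. The class and variable names are assumptions; the field names follow the list above. The values shown are a snapshot of cycle 6 of the worked example, copied from the corresponding table.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FunctionalUnitStatus:
    busy: bool = False
    op: Optional[str] = None   # operation to perform
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # first source register
    fk: Optional[str] = None   # second source register
    qj: Optional[str] = None   # unit producing Fj (None if not pending)
    qk: Optional[str] = None   # unit producing Fk (None if not pending)
    rj: bool = False           # Fj ready and not yet read
    rk: bool = False           # Fk ready and not yet read


# 1. Instruction status: cycle in which each stage was reached.
instruction_status: Dict[str, Dict[str, int]] = {
    "LD F6":    {"issue": 1, "read": 2, "exec": 3, "write": 4},
    "LD F2":    {"issue": 5, "read": 6},
    "MULTD F0": {"issue": 6},
}

# 2. Functional unit status: one record per unit.
fu_status = {name: FunctionalUnitStatus()
             for name in ("Integer", "Mult1", "Mult2", "Add", "Divide")}
fu_status["Integer"] = FunctionalUnitStatus(
    busy=True, op="Load", fi="F2", fk="R3", rk=True)
fu_status["Mult1"] = FunctionalUnitStatus(
    busy=True, op="Mult", fi="F0", fj="F2", fk="F4",
    qj="Integer", rj=False, rk=True)

# 3. Register result status: only registers with a pending write appear.
register_result: Dict[str, str] = {"F0": "Mult1", "F2": "Integer"}
```

Keeping only pending writes in `register_result` makes the WAW test at issue a simple membership check.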

Detailed Scoreboard Pipeline Control

Issue
  Wait until:  not Busy(FU) and not Result(D)
  Bookkeeping: Busy(FU) ← Yes; Op(FU) ← op; Fi(FU) ← D; Fj(FU) ← S1; Fk(FU) ← S2;
               Qj ← Result(S1); Qk ← Result(S2); Rj ← not Qj; Rk ← not Qk; Result(D) ← FU
Read operands
  Wait until:  Rj and Rk
  Bookkeeping: Rj ← No; Rk ← No
Execution complete
  Wait until:  functional unit done
Write result
  Wait until:  ∀f ((Fj(f) ≠ Fi(FU) or Rj(f) = No) & (Fk(f) ≠ Fi(FU) or Rk(f) = No))
  Bookkeeping: ∀f (if Qj(f) = FU then Rj(f) ← Yes); ∀f (if Qk(f) = FU then Rk(f) ← Yes);
               Result(Fi(FU)) ← 0; Busy(FU) ← No
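The write-result bookkeeping is the step that wakes up waiting instructions. A minimal sketch follows, assuming a hypothetical dictionary encoding of the functional unit status. Clearing Qj/Qk when the producer writes is an extra assumption, chosen because the example tables show those fields blank afterwards.

```python
def write_result(fu, units, result):
    """Apply the write-result bookkeeping for functional unit `fu`.

    units  -- dict: unit name -> dict with keys Busy, Fi, Qj, Qk, Rj, Rk
    result -- dict: register -> unit with a pending write to that register
    """
    for unit in units.values():
        if unit.get("Qj") == fu:          # a unit was waiting for Fj from fu
            unit["Qj"], unit["Rj"] = None, True
        if unit.get("Qk") == fu:          # a unit was waiting for Fk from fu
            unit["Qk"], unit["Rk"] = None, True
    result.pop(units[fu]["Fi"], None)     # Result(Fi(FU)) is cleared
    units[fu]["Busy"] = False             # the unit is free again


if __name__ == "__main__":
    # Cycle 8 of the example: the second load writes F2, which MULTD (Fj)
    # and SUBD (Fk) are both waiting for.
    units = {
        "Integer": {"Busy": True, "Fi": "F2", "Qj": None, "Qk": None,
                    "Rj": False, "Rk": False},
        "Mult1": {"Busy": True, "Fi": "F0", "Qj": "Integer", "Qk": None,
                  "Rj": False, "Rk": True},
        "Add": {"Busy": True, "Fi": "F8", "Qj": None, "Qk": "Integer",
                "Rj": True, "Rk": False},
    }
    result = {"F0": "Mult1", "F2": "Integer", "F8": "Add"}
    write_result("Integer", units, result)
    print(units["Mult1"], units["Add"], result)
```

After this call both MULTD and SUBD have all operands ready, which matches the example, where both read their operands in cycle 9.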

Scoreboard Example

Instruction status:
  Instruction  j    k    Issue  Read Oper  Exec Comp  Write Result
  LD    F6     34+  R2
  LD    F2     45+  R3
  MULTD F0     F2   F4
  SUBD  F8     F6   F2
  DIVD  F10    F0   F6
  ADDD  F6     F8   F2

Functional unit status (Fi = dest, Fj = S1, Fk = S2, Qj/Qk = FU producing Fj/Fk, Rj/Rk = Fj/Fk ready?):
  Time  Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj   Rk
        Integer  No
        Mult1    No
        Mult2    No
        Add      No
        Divide   No

Register result status:
  Clock  F0  F2  F4  F6  F8  F10  F12  ...  F30
         (no pending writes)

Scoreboard Example: Cycle 1 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 LD F2 45+ R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F6 R2 Yes Mult1 No Mult2 No Add No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 1 FU Integer 26

Scoreboard Example: Cycle 2

Instruction status:
  Instruction  j    k    Issue  Read Oper  Exec Comp  Write Result
  LD    F6     34+  R2   1      2
  LD    F2     45+  R3
  MULTD F0     F2   F4
  SUBD  F8     F6   F2
  DIVD  F10    F0   F6
  ADDD  F6     F8   F2

Functional unit status:
  Time  Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj   Rk
        Integer  Yes   Load  F6       R2                  Yes
        Mult1    No
        Mult2    No
        Add      No
        Divide   No

Register result status:
  Clock  F0  F2  F4  F6       F8  F10  F12  ...  F30
  2                  Integer

Issue the 2nd load? No: the Integer pipeline is full, so the 2nd load cannot be issued because of a structural hazard on the Integer unit. Issue stalls.

Scoreboard Example: Cycle 3

Instruction status:
  Instruction  j    k    Issue  Read Oper  Exec Comp  Write Result
  LD    F6     34+  R2   1      2          3
  LD    F2     45+  R3
  MULTD F0     F2   F4
  SUBD  F8     F6   F2
  DIVD  F10    F0   F6
  ADDD  F6     F8   F2

Functional unit status:
  Time  Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj   Rk
        Integer  Yes   Load  F6       R2                  Yes
        Mult1    No
        Mult2    No
        Add      No
        Divide   No

Register result status:
  Clock  F0  F2  F4  F6       F8  F10  F12  ...  F30
  3                  Integer

Issue stalls. The load completes execution in one clock cycle (data cache hit).

Scoreboard Example: Cycle 4 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 Mult2 Add Divide No No No No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 4 FU Integer Issue stalls Write F6 29

Scoreboard Example: Cycle 5 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 Yes Mult1 No Mult2 No Add No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 5 FU Integer The 2 nd load is issued 30

Scoreboard Example: Cycle 6 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 MULTD F0 F2 F4 6 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 Yes Mult1 Yes Mult F0 F2 F4 Integer No Yes Mult2 No Add No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 6 FU Mult1 Integer MULT is issued but has to wait for F2 from LOAD (RAW Hazard on F2) 31

Scoreboard Example: Cycle 7 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 MULTD F0 F2 F4 6 SUBD F8 F6 F2 7 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 Yes No Mult1 Yes Mult F0 F2 F4 Integer No Yes Mult2 No Add Yes Sub F8 F6 F2 Integer Yes No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 7 FU Mult1 Integer Add Read multiply operands? Now SUBD can be issued to ADD Functional Unit (then SUBD has to wait for RAW F2 from load) 32

Scoreboard Example: Cycle 8a (First half of clock cycle) Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 MULTD F0 F2 F4 6 SUBD F8 F6 F2 7 DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 No Mult1 Yes Mult F0 F2 F4 Integer No Yes Mult2 No Add Yes Sub F8 F6 F2 Integer Yes No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 8 FU Mult1 Integer Add Divide DIVD is issued but there is another RAW hazard (F0) from MULTD -> DIVD has to wait for F0 33

Scoreboard Example: Cycle 8b (Second half of clock cycle) Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 8 MULTD F0 F2 F4 6 SUBD F8 F6 F2 7 DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Yes Yes Mult2 No Add Yes Sub F8 F6 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 8 FU Mult1 Add Divide Load completes (Writes F2), and operands for MULT an SUBD are ready 34

Scoreboard Example: Cycle 9

Instruction status:
  Instruction  j    k    Issue  Read Oper  Exec Comp  Write Result
  LD    F6     34+  R2   1      2          3          4
  LD    F2     45+  R3   5      6          7          8
  MULTD F0     F2   F4   6      9
  SUBD  F8     F6   F2   7      9
  DIVD  F10    F0   F6   8
  ADDD  F6     F8   F2

Functional unit status (Time = remaining execution cycles):
  Time  Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj   Rk
        Integer  No
  10    Mult1    Yes   Mult  F0   F2  F4             Yes  Yes
        Mult2    No
  2     Add      Yes   Sub   F8   F6  F2             Yes  Yes
        Divide   Yes   Div   F10  F0  F6  Mult1      No   Yes

Register result status:
  Clock  F0     F2  F4  F6  F8   F10     F12  ...  F30
  9      Mult1              Add  Divide

- MULTD and SUBD read their operands in the same cycle, through a multiple-port register file.
- Issue ADDD? No, because of a structural hazard on the Add functional unit.
- MULTD and SUBD are sent into execution in parallel: latency of 10 cycles for MULTD and 2 cycles for SUBD.

Scoreboard Example: Cycle 10 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 9Mult1 Yes Mult F0 F2 F4 Yes Yes Mult2 No 1Add Yes Sub F8 F6 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 10 FU Mult1 Add Divide 36

Scoreboard Example: Cycle 11 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 11 DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 8Mult1 Yes Mult F0 F2 F4 Yes Yes Mult2 No 0Add Yes Sub F8 F6 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 11 FU Mult1 Add Divide SUBD ends execution 37

Scoreboard Example: Cycle 12 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 11 12 DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 7Mult1 Yes Mult F0 F2 F4 Yes Yes Mult2 No Add No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 12 FU Mult1 Divide SUBD writes result in F8 38

Scoreboard Example: Cycle 13 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 11 12 DIVD F10 F0 F6 8 ADDD F6 F8 F2 13 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 6Mult1 Yes Mult F0 F2 F4 Yes Yes Mult2 No Add Yes Add F6 F8 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 13 FU Mult1 Add Divide ADDD can be issued DIVD still waits for operand F0 from MULTD 39

Scoreboard Example: Cycle 14 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 11 12 DIVD F10 F0 F6 8 ADDD F6 F8 F2 13 14 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 5Mult1 Yes Mult F0 F2 F4 Yes Yes Mult2 No 2 Add Yes Add F6 F8 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 14 FU Mult1 Add Divide ADDD reads operands (out-of-order read operands: ADDD reads operands before DIVD) 40

Scoreboard Example: Cycle 15 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 11 12 DIVD F10 F0 F6 8 ADDD F6 F8 F2 13 14 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 4Mult1 Yes Mult F0 F2 F4 Yes Yes Mult2 No 1 Add Yes Add F6 F8 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 15 FU Mult1 Add Divide ADDD starts execution 41

Scoreboard Example: Cycle 16 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 11 12 DIVD F10 F0 F6 8 ADDD F6 F8 F2 13 14 16 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 3Mult1 Yes Mult F0 F2 F4 Yes Yes Mult2 No 0 Add Yes Add F6 F8 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 16 FU Mult1 Add Divide ADDD ends execution 42

Scoreboard Example: Cycle 17

Instruction status:
  Instruction  j    k    Issue  Read Oper  Exec Comp  Write Result
  LD    F6     34+  R2   1      2          3          4
  LD    F2     45+  R3   5      6          7          8
  MULTD F0     F2   F4   6      9
  SUBD  F8     F6   F2   7      9          11         12
  DIVD  F10    F0   F6   8
  ADDD  F6     F8   F2   13     14         16         (WAR F6 hazard!)

Functional unit status:
  Time  Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj   Rk
        Integer  No
  2     Mult1    Yes   Mult  F0   F2  F4             Yes  Yes
        Mult2    No
        Add      Yes   Add   F6   F8  F2             Yes  Yes
        Divide   Yes   Div   F10  F0  F6  Mult1      No   Yes

Register result status:
  Clock  F0     F2  F4  F6   F8  F10     F12  ...  F30
  17     Mult1          Add      Divide

Why not write the result of ADDD?
- The WAR hazard must be detected before ADDD writes its result to F6.
- DIVD must first read F6 (before ADDD writes F6), but DIVD cannot read its operands until MULTD writes F0.

Scoreboard Example: Cycle 18 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 8 MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 11 12 DIVD F10 F0 F6 8 ADDD F6 F8 F2 13 14 16 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 1Mult1 Yes Mult F0 F2 F4 Yes Yes Mult2 No Add Yes Add F6 F8 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 18 FU Mult1 Add Divide 44

Scoreboard Example: Cycle 19 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 8 MULTD F0 F2 F4 6 9 19 SUBD F8 F6 F2 7 9 11 12 DIVD F10 F0 F6 8 ADDD F6 F8 F2 13 14 16 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 0Mult1 Yes Mult F0 F2 F4 Yes Yes Mult2 No Add Yes Add F6 F8 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 19 FU Mult1 Add Divide MULTD ends execution 45

Scoreboard Example: Cycle 20 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 8 MULTD F0 F2 F4 6 9 19 20 SUBD F8 F6 F2 7 9 11 12 DIVD F10 F0 F6 8 ADDD F6 F8 F2 13 14 16 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add Yes Add F6 F8 F2 Yes Yes Divide Yes Div F10 F0 F6 Yes Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 20 FU Add Divide MULTD writes in F0 46

Scoreboard Example: Cycle 21 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 8 MULTD F0 F2 F4 6 9 19 20 SUBD F8 F6 F2 7 9 11 12 DIVD F10 F0 F6 8 21 ADDD F6 F8 F2 13 14 16 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk 40 Integer No Mult1 No Mult2 No Add Yes Add F6 F8 F2 Yes Yes Divide Yes Div F10 F0 F6 Yes Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 21 FU Add Divide DIVD can read operands WAR Hazard is now gone... 47

Scoreboard Example: Cycle 22 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 8 MULTD F0 F2 F4 6 9 19 20 SUBD F8 F6 F2 7 9 11 12 DIVD F10 F0 F6 8 21 ADDD F6 F8 F2 13 14 16 22 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add No 39 Divide Yes Div F10 F0 F6 Yes Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 22 FU Divide DIVD has read its operands in previous cycle ADDD can now write the result in F6 48

(skipping some cycles)

Scoreboard Example: Cycle 61 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 8 MULTD F0 F2 F4 6 9 19 20 SUBD F8 F6 F2 7 9 11 12 DIVD F10 F0 F6 8 21 61 ADDD F6 F8 F2 13 14 16 22 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add No 0 Divide Yes Div F10 F0 F6 Yes Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 61 FU Divide DIVD ends execution 50

Scoreboard Example: Cycle 62 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 3 4 LD F2 45+ R3 5 6 7 8 MULTD F0 F2 F4 6 9 19 20 SUBD F8 F6 F2 7 9 11 12 DIVD F10 F0 F6 8 21 61 62 ADDD F6 F8 F2 13 14 16 22 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 Mult2 Add Divide No No No No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 62 FU DIVD writes in F10 51

Review: Scoreboard Example: Cycle 62

Instruction status:
  Instruction  j    k    Issue  Read Oper  Exec Comp  Write Result
  LD    F6     34+  R2   1      2          3          4
  LD    F2     45+  R3   5      6          7          8
  MULTD F0     F2   F4   6      9          19         20
  SUBD  F8     F6   F2   7      9          11         12
  DIVD  F10    F0   F6   8      21         61         62
  ADDD  F6     F8   F2   13     14         16         22

Functional unit status: Integer, Mult1, Mult2, Add, and Divide are all not busy.
Register result status (clock 62): no pending writes.

In-order issue; out-of-order execute and commit.
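The final timing table can be reproduced with a small simulator. The following is a sketch under stated assumptions, not the CDC 6600 logic: every decision in a cycle is taken from the scoreboard state at the start of that cycle, an instruction advances by at most one stage per cycle, and the latencies are those of the example (load 1, multiply 10, add/subtract 2, divide 40). The names `Unit`, `simulate`, `UNITS`, and `LATENCY` are hypothetical.

```python
from dataclasses import dataclass, field

LATENCY = {"LD": 1, "MULTD": 10, "SUBD": 2, "ADDD": 2, "DIVD": 40}
UNITS = {"LD": ["Integer"], "MULTD": ["Mult1", "Mult2"],
         "SUBD": ["Add"], "ADDD": ["Add"], "DIVD": ["Divide"]}
PROGRAM = [  # (opcode, destination, [Fj, Fk]); None marks an unused source
    ("LD", "F6", [None, "R2"]),
    ("LD", "F2", [None, "R3"]),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]


@dataclass
class Unit:
    busy: bool = False
    instr: int = -1                                           # program index
    fi: str = ""                                              # destination
    src: list = field(default_factory=lambda: [None, None])   # Fj, Fk
    q: list = field(default_factory=lambda: [None, None])     # Qj, Qk
    r: list = field(default_factory=lambda: [False, False])   # Rj, Rk


def simulate(program):
    units = {n: Unit() for n in ("Integer", "Mult1", "Mult2", "Add", "Divide")}
    result = {}                      # register -> unit with a pending write
    t = [dict() for _ in program]    # per-instruction: stage -> cycle
    pc, cycle = 0, 0
    while pc < len(program) or any(u.busy for u in units.values()):
        cycle += 1
        busy = [(n, u) for n, u in units.items() if u.busy]
        # Decide every action from the state at the start of the cycle.
        writes = [
            (n, u) for n, u in busy
            if t[u.instr].get("exec", cycle) < cycle          # finished earlier
            and all(f.src[i] != u.fi or not f.r[i]            # no WAR hazard
                    for _, f in busy if f is not u for i in (0, 1))
        ]
        reads = [(n, u) for n, u in busy
                 if "read" not in t[u.instr] and all(u.r)]
        issue = None
        if pc < len(program):
            op, dest, _ = program[pc]
            free = [n for n in UNITS[op] if not units[n].busy]
            if free and dest not in result:     # no structural, no WAW hazard
                issue = free[0]
        # Apply: issue, read operands, execution complete, write result.
        if issue:
            op, dest, srcs = program[pc]
            q = [result.get(s) for s in srcs]
            units[issue] = Unit(True, pc, dest, list(srcs), q,
                                [x is None for x in q])
            result[dest] = issue
            t[pc]["issue"] = cycle
            pc += 1
        for _, u in reads:
            u.r = [False, False]                # flags drop once operands read
            t[u.instr]["read"] = cycle
        for _, u in busy:
            started = t[u.instr].get("read")
            if started is not None and cycle == started + LATENCY[program[u.instr][0]]:
                t[u.instr]["exec"] = cycle
        for n, u in writes:
            for f in units.values():            # wake up waiting units
                for i in (0, 1):
                    if f.busy and f.q[i] == n:
                        f.q[i], f.r[i] = None, True
            del result[u.fi]
            u.busy = False
            t[u.instr]["write"] = cycle
    return t


if __name__ == "__main__":
    for (op, dest, _), stages in zip(PROGRAM, simulate(PROGRAM)):
        print(f"{op:6}{dest:4}", stages)
```

Applying the issue before the writes within a cycle lets a newly issued instruction still observe a producer that writes in that same cycle, so it is woken up rather than left waiting forever.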

CDC 6600 Scoreboard
- Speedup of 2.5 w.r.t. no dynamic scheduling.
- Speedup of 1.7 by reorganizing instructions in the compiler, BUT slow memory (no cache) limits the benefit.
- Limitations of the 6600 scoreboard:
  - No forwarding hardware
  - Limited to instructions in a basic block (small window)
  - Small number of functional units (structural hazards), especially integer/load-store units
  - Does not issue on structural hazards
  - Waits for WAR hazards
  - Prevents WAW hazards

Summary
- Instruction Level Parallelism (ILP) can be exploited in SW or HW.
- Loop-level parallelism is the easiest to see.
- SW parallelism: dependencies are defined for the program; they become hazards if the HW cannot resolve them.
- SW dependencies and compiler sophistication determine whether the compiler can unroll loops.
  - Memory dependencies are the hardest to determine.
- HW exploiting ILP:
  - Works when dependences cannot be known at compile time.
  - Code for one machine runs well on another.
- Key idea of the scoreboard: allow instructions behind a stall to proceed (Decode is split into Issue and Read Operands).
  - Enables out-of-order execution => out-of-order completion.
  - The ID stage checks for both structural hazards and WAW hazards on destination operands.

End of Part I
References:
- Chapter 2 of the textbook: J. Hennessy, D. Patterson, Computer Architecture: A Quantitative Approach, 4th Edition, Morgan Kaufmann Publishers.