CISC 662 Graduate Computer Architecture. Lecture 9 - Scoreboard

Michela Taufer
http://www.cis.udel.edu/~taufer/teaching/cis662f07

PowerPoint lecture notes from John Hennessy and David Patterson's Computer Architecture, 4th edition.
Additional teaching material from Jelena Mirkovic (U Del) and John Kubiatowicz (UC Berkeley).

Can we use HW to get CPI closer to 1?

Why in HW at run time?
  - Works when real dependences can't be known at compile time
  - Simpler compiler
  - Code compiled for one machine runs well on another

Key idea: allow instructions behind a stall to proceed

  DIVD  F0,F2,F4
  ADDD  F10,F0,F8
  SUBD  F12,F8,F14

ADDD must wait for F0 from DIVD, but SUBD depends on neither instruction and can go ahead.
Out-of-order execution => out-of-order completion.

Problems?
  - How do we prevent WAR and WAW hazards?
  - How do we deal with variable latency?
  - Forwarding for RAW hazards is harder.

Pipeline diagram (clock cycles 1 to 17; each instruction starts one cycle after the previous one):

Instruction          Cycles  Stages
LD    F6,34(R2)      1-5     IF ID EX MEM WB
LD    F2,45(R3)      2-6     IF ID EX MEM WB
MULTD F0,F2,F4       3-17    IF ID stall M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 MEM WB
SUBD  F8,F6,F2       4-9     IF ID A1 A2 MEM WB
DIVD  F10,F0,F6      5-17    IF ID stall stall stall stall stall stall stall stall stall D1 D2
ADDD  F6,F8,F2       6-11    IF ID A1 A2 MEM WB

Hazards marked on the slide: RAW and WAR.
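The hazard types named above can be spotted mechanically from register names. The following is a minimal sketch, assuming a hypothetical `Instr` record and `classify` helper (neither is from the lecture); it reports every register dependence between two instructions in program order, whether or not the pipeline timing would actually expose it as a hazard.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    """Hypothetical record: opcode, destination register, source registers."""
    op: str
    dest: str
    srcs: Tuple[str, ...]


def classify(earlier: Instr, later: Instr) -> Set[str]:
    """Return the dependence kinds that `later` has on `earlier`."""
    kinds = set()
    if earlier.dest in later.srcs:
        kinds.add("RAW")  # later reads what earlier writes
    if later.dest in earlier.srcs:
        kinds.add("WAR")  # later overwrites what earlier still has to read
    if later.dest == earlier.dest:
        kinds.add("WAW")  # both write the same register
    return kinds


# The six-instruction sequence from the slide.
PROGRAM = [
    Instr("LD", "F6", ("R2",)),
    Instr("LD", "F2", ("R3",)),
    Instr("MULTD", "F0", ("F2", "F4")),
    Instr("SUBD", "F8", ("F6", "F2")),
    Instr("DIVD", "F10", ("F0", "F6")),
    Instr("ADDD", "F6", ("F8", "F2")),
]

for i, earlier in enumerate(PROGRAM):
    for later in PROGRAM[i + 1:]:
        kinds = classify(earlier, later)
        if kinds:
            print(f"{earlier.op} {earlier.dest} -> {later.op} {later.dest}: {sorted(kinds)}")
```

In this sketch the DIVD/ADDD pair is reported as WAR on F6 and the MULTD/DIVD pair as RAW on F0.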

Scoreboard: a bookkeeping technique

Out-of-order execution divides the ID stage:
  1. Issue: decode instructions, check for structural hazards
  2. Read operands: wait until no data hazards, then read operands

Scoreboards date to the CDC 6600 in 1963.
Instructions execute whenever they do not depend on previous instructions and there are no hazards.
CDC 6600: in-order issue, out-of-order execution, out-of-order commit (or completion).
No forwarding!
Imprecise interrupt/exception model for now.

Scoreboard Architecture (CDC 6600)

[Figure: block diagram showing the Registers, the Functional Units (FP Mult, FP Mult, FP Divide, FP Add, Integer), the SCOREBOARD, and Memory.]

Scoreboard Implications

Out-of-order completion => WAR, WAW hazards?
Solutions for WAR:
  - Stall writeback until registers have been read
  - Read registers only during the Read Operands stage
Solution for WAW:
  - Detect the hazard and stall issue of the new instruction until the other instruction completes
No register renaming!
Need to have multiple instructions in the execution phase => multiple execution units or pipelined execution units.
Scoreboard keeps track of dependencies between instructions that have already issued.
Scoreboard replaces ID, EX, WB with 4 stages.

Four Stages of Scoreboard Control

1. Issue: decode instructions & check for structural hazards (ID1)
   - Instructions are issued in program order (for hazard checking)
   - Don't issue if there is a structural hazard
   - Don't issue if the instruction is output dependent on any previously issued but uncompleted instruction (no WAW hazards)
2. Read operands: wait until no data hazards, then read operands (ID2)
   - All real dependencies (RAW hazards) are resolved in this stage, since we wait for instructions to write back data
   - No forwarding of data in this model!
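The two wait conditions above reduce to small predicates. This is only a sketch: the function names and the dictionary that maps a register to the unit that will write it are assumptions made for illustration, not part of the lecture.

```python
from typing import Dict, Optional


def can_issue(unit_busy: bool, dest: str,
              register_result: Dict[str, Optional[str]]) -> bool:
    """Issue only if the functional unit is free (no structural hazard)
    and no issued, uncompleted instruction will write `dest` (no WAW)."""
    no_structural_hazard = not unit_busy
    no_waw_hazard = register_result.get(dest) is None
    return no_structural_hazard and no_waw_hazard


def can_read_operands(rj: bool, rk: bool) -> bool:
    """Read operands only once both source operands are ready (no RAW)."""
    return rj and rk


# Pending writers as in cycle 9 of the example: F0, F8 and F10 are in flight.
pending = {"F0": "Mult1", "F8": "Add", "F10": "Divide"}

# ADDD F6,F8,F2 needs the Add unit, which SUBD still occupies.
addd_issues = can_issue(unit_busy=True, dest="F6", register_result=pending)

# Hypothetical instruction that writes F0 while MULTD is still pending.
waw_issues = can_issue(unit_busy=False, dest="F0", register_result=pending)

# MULTD in cycle 7: F2 is still being loaded (Rj = No), F4 is ready (Rk = Yes).
multd_reads = can_read_operands(rj=False, rk=True)

print(addd_issues, waw_issues, multd_reads)
```

All three checks come out false here, for three different reasons: a structural hazard, a WAW hazard, and a RAW hazard.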

Four Stages of Scoreboard Control (continued)

3. Execution: operate on operands (EX)
   - The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
4. Write result: finish execution (WB)
   - Stall until no WAR hazards with previous instructions.
   Example:
     DIVD  F0,F2,F4
     ADDD  F10,F0,F8
     SUBD  F8,F8,F14
   The CDC 6600 scoreboard would stall SUBD until ADDD reads its operands.
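The write-result stall in the DIVD/ADDD/SUBD example can be expressed as a check over the other busy functional units. A minimal sketch, assuming each unit is summarized by a hypothetical `(Fj, Rj, Fk, Rk)` tuple, where a ready flag of `True` means "operand available but not yet read":

```python
from typing import Iterable, Tuple

# (Fj, Rj, Fk, Rk) for one functional unit; the tuple layout is an assumption.
UnitOperands = Tuple[str, bool, str, bool]


def can_write_result(dest: str, other_units: Iterable[UnitOperands]) -> bool:
    """Allow the write unless some other unit still has to read the
    old value of `dest` (a WAR hazard)."""
    return all(
        (fj != dest or not rj) and (fk != dest or not rk)
        for fj, rj, fk, rk in other_units
    )


# ADDD F10,F0,F8 is waiting for F0 from DIVD, so it has not read F8 yet.
addd_waiting = ("F0", False, "F8", True)
stalled = not can_write_result("F8", [addd_waiting])

# Once ADDD has read its operands, both ready flags are reset.
addd_after_read = ("F0", False, "F8", False)
released = can_write_result("F8", [addd_after_read])

print(stalled, released)
```

SUBD's write to F8 is held back in the first case and released in the second.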

Three Parts of the Scoreboard

1. Instruction status: which of the 4 steps the instruction is in.
2. Functional unit status: indicates the state of the functional unit (FU). 9 fields for each functional unit:
   - Busy: indicates whether the unit is busy or not
   - Op: operation to perform in the unit (e.g., + or -)
   - Fi: destination register
   - Fj, Fk: source-register numbers
   - Qj, Qk: functional units producing source registers Fj, Fk
   - Rj, Rk: flags indicating when Fj, Fk are ready
3. Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
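One way to picture the three parts is as three small tables. The sketch below is an illustration only: the class and field names are assumptions, and the sample values are taken from cycle 6 of the example in this lecture.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, Optional


@dataclass
class FunctionalUnitStatus:
    """The 9 per-unit fields listed above."""
    busy: bool = False
    op: Optional[str] = None   # operation to perform
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # first source register
    fk: Optional[str] = None   # second source register
    qj: Optional[str] = None   # unit producing Fj, if pending
    qk: Optional[str] = None   # unit producing Fk, if pending
    rj: bool = False           # Fj ready and not yet read
    rk: bool = False           # Fk ready and not yet read


@dataclass
class Scoreboard:
    # instruction -> {stage name: cycle in which the stage completed}
    instruction_status: Dict[str, Dict[str, int]] = field(default_factory=dict)
    # unit name -> its 9 fields
    functional_unit_status: Dict[str, FunctionalUnitStatus] = field(default_factory=dict)
    # register -> unit that will write it (absent when nothing is pending)
    register_result_status: Dict[str, str] = field(default_factory=dict)


sb = Scoreboard()
sb.instruction_status["LD F2 45+ R3"] = {"issue": 5, "read_operands": 6}
sb.instruction_status["MULTD F0 F2 F4"] = {"issue": 6}
sb.functional_unit_status["Integer"] = FunctionalUnitStatus(
    busy=True, op="Load", fi="F2", fk="R3", rk=True)
sb.functional_unit_status["Mult1"] = FunctionalUnitStatus(
    busy=True, op="Mult", fi="F0", fj="F2", fk="F4", qj="Integer", rj=False, rk=True)
sb.register_result_status = {"F0": "Mult1", "F2": "Integer"}

FIELD_COUNT = len(fields(FunctionalUnitStatus))
print(FIELD_COUNT, sb.register_result_status)
```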

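Scoreboard Example

Instruction status (Issue, Read operands, Execution complete, Write result):
Instruction            Issue  Read  Exec  Write
LD    F6  34+  R2
LD    F2  45+  R3
MULTD F0  F2   F4
SUBD  F8  F6   F2
DIVD  F10 F0   F6
ADDD  F6  F8   F2

Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
Not busy: Integer, Mult1, Mult2, Add, Divide

Register result status (FU that will write each register): empty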

Detailed Scoreboard Pipeline Control

Issue
  Wait until:  not Busy(FU) and not Result(D)
  Bookkeeping: Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← D; Fj(FU) ← S1; Fk(FU) ← S2; Qj ← Result(S1); Qk ← Result(S2); Rj ← not Qj; Rk ← not Qk; Result(D) ← FU

Read operands
  Wait until:  Rj and Rk
  Bookkeeping: Rj ← No; Rk ← No

Execution complete
  Wait until:  functional unit done

Write result
  Wait until:  ∀f((Fj(f) ≠ Fi(FU) or Rj(f) = No) & (Fk(f) ≠ Fi(FU) or Rk(f) = No))
  Bookkeeping: ∀f(if Qj(f) = FU then Rj(f) ← Yes); ∀f(if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No
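The write-result bookkeeping can be sketched as one function over plain dictionaries. The function name and the dictionary layout are assumptions. Clearing Qj/Qk when the operand becomes ready is also an assumption, chosen to mirror the blank Qj/Qk entries in the cycle 8b table rather than the bookkeeping rules themselves. The sample state is the scoreboard in the first half of cycle 8, just before the load writes F2.

```python
def write_result(fu, units, result):
    """Apply the write-result bookkeeping for functional unit `fu`."""
    dest = units[fu]["Fi"]
    for other in units.values():
        if other.get("Qj") == fu:      # operand j was waiting on this unit
            other["Rj"] = True
            other["Qj"] = None         # assumption: shown blank on the slides
        if other.get("Qk") == fu:      # operand k was waiting on this unit
            other["Rk"] = True
            other["Qk"] = None
    result[dest] = None                # no pending writer for dest any more
    units[fu] = {"Busy": False}        # free the unit


units = {
    "Integer": {"Busy": True, "Op": "Load", "Fi": "F2", "Fj": None, "Fk": "R3",
                "Qj": None, "Qk": None, "Rj": False, "Rk": False},
    "Mult1": {"Busy": True, "Op": "Mult", "Fi": "F0", "Fj": "F2", "Fk": "F4",
              "Qj": "Integer", "Qk": None, "Rj": False, "Rk": True},
    "Add": {"Busy": True, "Op": "Sub", "Fi": "F8", "Fj": "F6", "Fk": "F2",
            "Qj": None, "Qk": "Integer", "Rj": True, "Rk": False},
    "Divide": {"Busy": True, "Op": "Div", "Fi": "F10", "Fj": "F0", "Fk": "F6",
               "Qj": "Mult1", "Qk": None, "Rj": False, "Rk": True},
}
result = {"F0": "Mult1", "F2": "Integer", "F8": "Add", "F10": "Divide"}

write_result("Integer", units, result)
print(units["Mult1"]["Rj"], units["Add"]["Rk"], units["Divide"]["Rj"])
```

After the call, Mult1 and Add have both operands ready, while Divide is still waiting on Mult1.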

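Scoreboard Example: Cycle 1

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F6  34+  R2          1
LD    F2  45+  R3
MULTD F0  F2   F4
SUBD  F8  F6   F2
DIVD  F10 F0   F6
ADDD  F6  F8   F2

Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F6   -   R2  -        -        -    Yes
Not busy: Mult1, Mult2, Add, Divide

Register result status (clock 1): F6 = Integer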

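Scoreboard Example: Cycle 2

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F6  34+  R2          1     2
LD    F2  45+  R3
MULTD F0  F2   F4
SUBD  F8  F6   F2
DIVD  F10 F0   F6
ADDD  F6  F8   F2

Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F6   -   R2  -        -        -    Yes
Not busy: Mult1, Mult2, Add, Divide

Register result status (clock 2): F6 = Integer

Issue 2nd LD? No: the Integer unit is still busy (structural hazard).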

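Scoreboard Example: Cycle 3

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F6  34+  R2          1     2     3
LD    F2  45+  R3
MULTD F0  F2   F4
SUBD  F8  F6   F2
DIVD  F10 F0   F6
ADDD  F6  F8   F2

Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F6   -   R2  -        -        -    No
Not busy: Mult1, Mult2, Add, Divide

Register result status (clock 3): F6 = Integer

Issue MULT? No: issue is in order, and the second LD has not issued yet.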

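Scoreboard Example: Cycle 4

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F2  45+  R3
MULTD F0  F2   F4
SUBD  F8  F6   F2
DIVD  F10 F0   F6
ADDD  F6  F8   F2

Functional unit status:
Not busy: Integer, Mult1, Mult2, Add, Divide

Register result status (clock 4): F6 = Integer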

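Scoreboard Example: Cycle 5

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F2  45+  R3          5
MULTD F0  F2   F4
SUBD  F8  F6   F2
DIVD  F10 F0   F6
ADDD  F6  F8   F2

Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F2   -   R3  -        -        -    Yes
Not busy: Mult1, Mult2, Add, Divide

Register result status (clock 5): F2 = Integer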

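Scoreboard Example: Cycle 6

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F2  45+  R3          5     6
MULTD F0  F2   F4          6
SUBD  F8  F6   F2
DIVD  F10 F0   F6
ADDD  F6  F8   F2

Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F2   -   R3  -        -        -    Yes
-     Mult1    Yes   Mult  F0   F2  F4  Integer  -        No   Yes
Not busy: Mult2, Add, Divide

Register result status (clock 6): F0 = Mult1, F2 = Integer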

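Scoreboard Example: Cycle 7

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F2  45+  R3          5     6     7
MULTD F0  F2   F4          6
SUBD  F8  F6   F2          7
DIVD  F10 F0   F6
ADDD  F6  F8   F2

Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F2   -   R3  -        -        -    No
-     Mult1    Yes   Mult  F0   F2  F4  Integer  -        No   Yes
-     Add      Yes   Sub   F8   F6  F2  -        Integer  Yes  No
Not busy: Mult2, Divide

Register result status (clock 7): F0 = Mult1, F2 = Integer, F8 = Add

Read multiply operands? No: F2 has not been written by the load yet.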

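Scoreboard Example: Cycle 8a (first half of clock cycle)

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F2  45+  R3          5     6     7
MULTD F0  F2   F4          6
SUBD  F8  F6   F2          7
DIVD  F10 F0   F6          8
ADDD  F6  F8   F2

Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
-     Integer  Yes   Load  F2   -   R3  -        -        -    No
-     Mult1    Yes   Mult  F0   F2  F4  Integer  -        No   Yes
-     Add      Yes   Sub   F8   F6  F2  -        Integer  Yes  No
-     Divide   Yes   Div   F10  F0  F6  Mult1    -        No   Yes
Not busy: Mult2

Register result status (clock 8): F0 = Mult1, F2 = Integer, F8 = Add, F10 = Divide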

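Scoreboard Example: Cycle 8b (second half of clock cycle)

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F2  45+  R3          5     6     7      8
MULTD F0  F2   F4          6
SUBD  F8  F6   F2          7
DIVD  F10 F0   F6          8
ADDD  F6  F8   F2

Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
-     Mult1    Yes   Mult  F0   F2  F4  -        -        Yes  Yes
-     Add      Yes   Sub   F8   F6  F2  -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0  F6  Mult1    -        No   Yes
Not busy: Integer, Mult2

Register result status (clock 8): F0 = Mult1, F8 = Add, F10 = Divide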

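Scoreboard Example: Cycle 9

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F2  45+  R3          5     6     7      8
MULTD F0  F2   F4          6     9
SUBD  F8  F6   F2          7     9
DIVD  F10 F0   F6          8
ADDD  F6  F8   F2

Functional unit status (note the remaining execution time in the Time column):
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
10    Mult1    Yes   Mult  F0   F2  F4  -        -        Yes  Yes
2     Add      Yes   Sub   F8   F6  F2  -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0  F6  Mult1    -        No   Yes
Not busy: Integer, Mult2

Register result status (clock 9): F0 = Mult1, F8 = Add, F10 = Divide

Read operands for MULT & SUB? Yes: F2 was written in cycle 8, so both read their operands now.
Issue ADDD? No: the Add unit is busy with SUBD (structural hazard).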

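Scoreboard Example: Cycle 10

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F2  45+  R3          5     6     7      8
MULTD F0  F2   F4          6     9
SUBD  F8  F6   F2          7     9
DIVD  F10 F0   F6          8
ADDD  F6  F8   F2

Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
9     Mult1    Yes   Mult  F0   F2  F4  -        -        No   No
1     Add      Yes   Sub   F8   F6  F2  -        -        No   No
-     Divide   Yes   Div   F10  F0  F6  Mult1    -        No   Yes
Not busy: Integer, Mult2

Register result status (clock 10): F0 = Mult1, F8 = Add, F10 = Divide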

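Scoreboard Example: Cycle 11

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F2  45+  R3          5     6     7      8
MULTD F0  F2   F4          6     9
SUBD  F8  F6   F2          7     9    11
DIVD  F10 F0   F6          8
ADDD  F6  F8   F2

Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
8     Mult1    Yes   Mult  F0   F2  F4  -        -        No   No
0     Add      Yes   Sub   F8   F6  F2  -        -        No   No
-     Divide   Yes   Div   F10  F0  F6  Mult1    -        No   Yes
Not busy: Integer, Mult2

Register result status (clock 11): F0 = Mult1, F8 = Add, F10 = Divide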

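Scoreboard Example: Cycle 12

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F2  45+  R3          5     6     7      8
MULTD F0  F2   F4          6     9
SUBD  F8  F6   F2          7     9    11     12
DIVD  F10 F0   F6          8
ADDD  F6  F8   F2

Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
7     Mult1    Yes   Mult  F0   F2  F4  -        -        No   No
-     Divide   Yes   Div   F10  F0  F6  Mult1    -        No   Yes
Not busy: Integer, Mult2, Add

Register result status (clock 12): F0 = Mult1, F10 = Divide

Read operands for DIVD? No: F0 is still pending from Mult1.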

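Scoreboard Example: Cycle 13

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F2  45+  R3          5     6     7      8
MULTD F0  F2   F4          6     9
SUBD  F8  F6   F2          7     9    11     12
DIVD  F10 F0   F6          8
ADDD  F6  F8   F2         13

Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
6     Mult1    Yes   Mult  F0   F2  F4  -        -        No   No
-     Add      Yes   Add   F6   F8  F2  -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0  F6  Mult1    -        No   Yes
Not busy: Integer, Mult2

Register result status (clock 13): F0 = Mult1, F6 = Add, F10 = Divide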

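Scoreboard Example: Cycle 14

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F2  45+  R3          5     6     7      8
MULTD F0  F2   F4          6     9
SUBD  F8  F6   F2          7     9    11     12
DIVD  F10 F0   F6          8
ADDD  F6  F8   F2         13    14

Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
5     Mult1    Yes   Mult  F0   F2  F4  -        -        No   No
2     Add      Yes   Add   F6   F8  F2  -        -        Yes  Yes
-     Divide   Yes   Div   F10  F0  F6  Mult1    -        No   Yes
Not busy: Integer, Mult2

Register result status (clock 14): F0 = Mult1, F6 = Add, F10 = Divide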

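Scoreboard Example: Cycle 15

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F2  45+  R3          5     6     7      8
MULTD F0  F2   F4          6     9
SUBD  F8  F6   F2          7     9    11     12
DIVD  F10 F0   F6          8
ADDD  F6  F8   F2         13    14

Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
4     Mult1    Yes   Mult  F0   F2  F4  -        -        No   No
1     Add      Yes   Add   F6   F8  F2  -        -        No   No
-     Divide   Yes   Div   F10  F0  F6  Mult1    -        No   Yes
Not busy: Integer, Mult2

Register result status (clock 15): F0 = Mult1, F6 = Add, F10 = Divide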

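Scoreboard Example: Cycle 16

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F2  45+  R3          5     6     7      8
MULTD F0  F2   F4          6     9
SUBD  F8  F6   F2          7     9    11     12
DIVD  F10 F0   F6          8
ADDD  F6  F8   F2         13    14    16

Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
3     Mult1    Yes   Mult  F0   F2  F4  -        -        No   No
0     Add      Yes   Add   F6   F8  F2  -        -        No   No
-     Divide   Yes   Div   F10  F0  F6  Mult1    -        No   Yes
Not busy: Integer, Mult2

Register result status (clock 16): F0 = Mult1, F6 = Add, F10 = Divide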

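Scoreboard Example: Cycle 17

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F2  45+  R3          5     6     7      8
MULTD F0  F2   F4          6     9
SUBD  F8  F6   F2          7     9    11     12
DIVD  F10 F0   F6          8
ADDD  F6  F8   F2         13    14    16          <- WAR hazard!

Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
2     Mult1    Yes   Mult  F0   F2  F4  -        -        No   No
-     Add      Yes   Add   F6   F8  F2  -        -        No   No
-     Divide   Yes   Div   F10  F0  F6  Mult1    -        No   Yes
Not busy: Integer, Mult2

Register result status (clock 17): F0 = Mult1, F6 = Add, F10 = Divide

Why not write the result of ADD? DIVD has not read F6 yet, so writing F6 now would be a WAR hazard.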

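Scoreboard Example: Cycle 18

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F2  45+  R3          5     6     7      8
MULTD F0  F2   F4          6     9
SUBD  F8  F6   F2          7     9    11     12
DIVD  F10 F0   F6          8
ADDD  F6  F8   F2         13    14    16

Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
1     Mult1    Yes   Mult  F0   F2  F4  -        -        No   No
-     Add      Yes   Add   F6   F8  F2  -        -        No   No
-     Divide   Yes   Div   F10  F0  F6  Mult1    -        No   Yes
Not busy: Integer, Mult2

Register result status (clock 18): F0 = Mult1, F6 = Add, F10 = Divide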

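Scoreboard Example: Cycle 19

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F2  45+  R3          5     6     7      8
MULTD F0  F2   F4          6     9    19
SUBD  F8  F6   F2          7     9    11     12
DIVD  F10 F0   F6          8
ADDD  F6  F8   F2         13    14    16

Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
0     Mult1    Yes   Mult  F0   F2  F4  -        -        No   No
-     Add      Yes   Add   F6   F8  F2  -        -        No   No
-     Divide   Yes   Div   F10  F0  F6  Mult1    -        No   Yes
Not busy: Integer, Mult2

Register result status (clock 19): F0 = Mult1, F6 = Add, F10 = Divide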

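Scoreboard Example: Cycle 20

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F2  45+  R3          5     6     7      8
MULTD F0  F2   F4          6     9    19     20
SUBD  F8  F6   F2          7     9    11     12
DIVD  F10 F0   F6          8
ADDD  F6  F8   F2         13    14    16

Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
-     Add      Yes   Add   F6   F8  F2  -        -        No   No
-     Divide   Yes   Div   F10  F0  F6  -        -        Yes  Yes
Not busy: Integer, Mult1, Mult2

Register result status (clock 20): F6 = Add, F10 = Divide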

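Scoreboard Example: Cycle 21

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F2  45+  R3          5     6     7      8
MULTD F0  F2   F4          6     9    19     20
SUBD  F8  F6   F2          7     9    11     12
DIVD  F10 F0   F6          8    21
ADDD  F6  F8   F2         13    14    16

Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
-     Add      Yes   Add   F6   F8  F2  -        -        No   No
-     Divide   Yes   Div   F10  F0  F6  -        -        Yes  Yes
Not busy: Integer, Mult1, Mult2

Register result status (clock 21): F6 = Add, F10 = Divide

The WAR hazard is now gone: DIVD reads its operands, including F6, in this cycle.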

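Scoreboard Example: Cycle 22

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F2  45+  R3          5     6     7      8
MULTD F0  F2   F4          6     9    19     20
SUBD  F8  F6   F2          7     9    11     12
DIVD  F10 F0   F6          8    21
ADDD  F6  F8   F2         13    14    16     22

Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
39    Divide   Yes   Div   F10  F0  F6  -        -        No   No
Not busy: Integer, Mult1, Mult2, Add

Register result status (clock 22): F10 = Divide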

Faster than light computation (skip a couple of cycles)

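Scoreboard Example: Cycle 61

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F2  45+  R3          5     6     7      8
MULTD F0  F2   F4          6     9    19     20
SUBD  F8  F6   F2          7     9    11     12
DIVD  F10 F0   F6          8    21    61
ADDD  F6  F8   F2         13    14    16     22

Functional unit status:
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
0     Divide   Yes   Div   F10  F0  F6  -        -        No   No
Not busy: Integer, Mult1, Mult2, Add

Register result status (clock 61): F10 = Divide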

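Scoreboard Example: Cycle 62

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F2  45+  R3          5     6     7      8
MULTD F0  F2   F4          6     9    19     20
SUBD  F8  F6   F2          7     9    11     12
DIVD  F10 F0   F6          8    21    61     62
ADDD  F6  F8   F2         13    14    16     22

Functional unit status:
Not busy: Integer, Mult1, Mult2, Add, Divide

Register result status (clock 62): empty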

Review: Scoreboard Example: Cycle 62

Instruction status:
Instruction            Issue  Read  Exec  Write
LD    F2  45+  R3          5     6     7      8
MULTD F0  F2   F4          6     9    19     20
SUBD  F8  F6   F2          7     9    11     12
DIVD  F10 F0   F6          8    21    61     62
ADDD  F6  F8   F2         13    14    16     22

All functional units are idle and no register has a pending write.

In-order issue; out-of-order execute & commit.
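The whole walk-through can be replayed with a small timestamp-based simulator. This is a sketch under stated assumptions, not the CDC 6600 logic itself: it tracks the cycle in which each stage completes instead of the Qj/Rj flags, and it models the WAR check as "every earlier instruction that reads the destination has already read its operands". The latencies (Integer 1, Add 2, Mult 10, Divide 40) and unit counts are taken from the example tables. `simulate`, `PROGRAM`, `LATENCY` and `UNITS` are names made up for this illustration.

```python
LATENCY = {"Integer": 1, "Add": 2, "Mult": 10, "Divide": 40}
UNITS = {"Integer": 1, "Add": 1, "Mult": 2, "Divide": 1}

# (unit kind, destination register, source registers), in program order.
PROGRAM = [
    ("Integer", "F6", ("R2",)),        # LD    F6, 34(R2)
    ("Integer", "F2", ("R3",)),        # LD    F2, 45(R3)
    ("Mult", "F0", ("F2", "F4")),      # MULTD F0, F2, F4
    ("Add", "F8", ("F6", "F2")),       # SUBD  F8, F6, F2
    ("Divide", "F10", ("F0", "F6")),   # DIVD  F10, F0, F6
    ("Add", "F6", ("F8", "F2")),       # ADDD  F6, F8, F2
]


def simulate(program):
    """Return one (issue, read, exec_complete, write) tuple per instruction."""
    n = len(program)
    issue, read, done, write = ([None] * n for _ in range(4))

    def before(t, cycle):
        # True if the event happened in a strictly earlier cycle.
        return t is not None and t < cycle

    cycle, next_issue = 0, 0
    while any(w is None for w in write):
        cycle += 1
        # Issue: in order, one per cycle, no structural or WAW hazard.
        if next_issue < n:
            unit, dest, _ = program[next_issue]
            earlier = range(next_issue)
            in_use = sum(1 for j in earlier
                         if program[j][0] == unit and not before(write[j], cycle))
            waw = any(program[j][1] == dest and not before(write[j], cycle)
                      for j in earlier)
            if in_use < UNITS[unit] and not waw:
                issue[next_issue] = cycle
                next_issue += 1
        # Each issued instruction advances at most one stage per cycle.
        for i, (unit, dest, srcs) in enumerate(program):
            if issue[i] is None or issue[i] == cycle:
                continue
            if read[i] is None:
                # RAW: every earlier writer of a source must have written.
                if all(before(write[j], cycle)
                       for j in range(i) if program[j][1] in srcs):
                    read[i] = cycle
            elif done[i] is None:
                if cycle == read[i] + LATENCY[unit]:
                    done[i] = cycle
            elif write[i] is None:
                # WAR: every earlier reader of dest must have read.
                if all(before(read[j], cycle)
                       for j in range(i) if dest in program[j][2]):
                    write[i] = cycle
    return list(zip(issue, read, done, write))


for row in simulate(PROGRAM):
    print(row)
```

Under these assumptions the last five rows match the cycle 62 table: MULTD (6, 9, 19, 20), SUBD (7, 9, 11, 12), DIVD (8, 21, 61, 62) and ADDD (13, 14, 16, 22), with ADDD's write held until the cycle after DIVD reads F6.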

CDC 6600 Scoreboard

Speedup of 1.7 from the compiler; 2.5 by hand, BUT slow memory (no cache) limits the benefit.

Limitations of the 6600 scoreboard:
  - No forwarding hardware
  - Limited to instructions in a basic block (small window)
  - Small number of functional units (structural hazards), especially integer/load-store units
  - Do not issue on structural hazards
  - Wait for WAR hazards
  - Prevent WAW hazards