Dynamic Scheduling I


basic pipeline started with single, in-order issue, single-cycle operations
have extended this basic pipeline with
  multi-cycle operations
  multiple issue (superscalar)
now: dynamic scheduling (out-of-order issue)
  Scoreboard: OoO without solving WAW/WAR
  Tomasulo's algorithm: OoO + register renaming to fix WAR/WAW
next half unit: dynamic scheduling II
  dynamic scheduling + precise state + speculation
advanced topic: dynamic load scheduling

Readings
H+P chapter 2
Recent Research Papers (can read these soon)
  Pentium 4
  Complexity-Effective Superscalar
  Checkpoint Processing and Recovery

Dynamic Scheduling: Motivation

  cycle            1   2   3   4   5   6   7   8   9   10
  divf f0,f2,f4    F   D   E/  E/  E/  E/  W
  addf f6,f0,f2        F   D   d*  d*  d*  E+  E+  W
  mulf f8,f2,f4            F   p*  p*  p*  D   E*  E*  W

  (d* = stall on a data hazard, p* = stall on a pipeline hazard)

cycle 4: addf stalls due to RAW hazard
  OK, fundamental problem
also cycle 4: mulf stalls due to pipeline hazard (addf stalls)
  why? mulf can't proceed into ID because addf is there
  but that's the only reason: not good enough!
why can't we decode mulf in cycle 4 and execute it in cycle 5?
  no fundamental reason why we can't do this!
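The claim that mulf has no fundamental reason to wait can be checked mechanically. The following is a minimal sketch, not part of the original slides: the `Insn` tuple and the `hazards` helper are hypothetical names introduced here, and the check only compares register names between an older and a younger instruction.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Insn(NamedTuple):
    """Hypothetical instruction record: opcode, destination, sources."""
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


def hazards(older: Insn, younger: Insn) -> Set[str]:
    """Return the register hazards the younger instruction has on the older one."""
    found = set()
    if older.dest is not None and older.dest in younger.srcs:
        found.add("RAW")  # younger reads what older writes
    if younger.dest is not None and younger.dest in older.srcs:
        found.add("WAR")  # younger overwrites what older still reads
    if older.dest is not None and older.dest == younger.dest:
        found.add("WAW")  # both write the same register
    return found


# The three instructions from the motivation slide.
divf = Insn("divf", "f0", ("f2", "f4"))
addf = Insn("addf", "f6", ("f0", "f2"))
mulf = Insn("mulf", "f8", ("f2", "f4"))

addf_on_divf = hazards(divf, addf)  # addf needs f0 from divf
mulf_on_divf = hazards(divf, mulf)  # mulf is independent of divf
mulf_on_addf = hazards(addf, mulf)  # and of addf
```

Under these assumptions only addf depends on divf; mulf conflicts with neither older instruction, so its stall comes purely from the in-order pipeline structure.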

Dynamic Scheduling
dynamic scheduling (out-of-order execution)
  execute instructions in non-sequential (non-von Neumann) order
  + reduce stalls
  + improve functional unit utilization
  + enable parallel execution (not in-order, so can be in parallel)
make it appear like sequential execution: precise interrupts
  very important but hard
  next unit of this course

Scheduling
scheduling: re-arranging instructions to maximize performance
  requires knowledge about structure of processor
  requires knowledge about latencies and dependences
two options for who should schedule instructions
  static scheduling: by compiler
  dynamic scheduling: by hardware

Before We Start
why build complicated hardware if we can do this in software?
+ performance portability
    don't want to recompile for new machines
+ more information available to hardware
    addresses, branch directions, cache misses unknown to compiler
+ more resources available to hardware
    compiler may not have enough architectural registers to fix WAR/WAW
+ easier to speculate in hardware
    easier to recover from mis-speculation
but compiler can look at more instructions
it's possible to do a combination of both
    compiler does as much as it can, hardware does the rest

The Problem with In-Order Pipelines
[Figure: in-order pipeline with PC, I$, and regfile; stages IF, ID, EX, WB separated by latches F/D, D/X, X/W]
in-order pipeline
  simple 4-stage: IF, ID, EX (multiple cycle, includes M), WB
structural hazard: 1 instruction register (latch) per stage
  1 instruction per stage per cycle (unless pipe is replicated)
  younger instruction can't pass older without killing it
out-of-order pipeline must implement passing functionality

Instruction Buffer
[Figure: pipeline with an instruction buffer between ID1 and ID2; stages IF, ID1, ID2, EX, WB]
trick: instruction buffer (many names for this buffer)
  basically: a bunch of latches for holding instructions
  this is the scope of instructions that the scheduler can see
split ID into two pieces
  accumulate decoded instructions in buffer in-order
  buffer sends instructions down rest of pipe out-of-order

Dispatch and Issue
[Figure: pipeline with instruction buffer; stages IF, DS, IS, EX, WB]
dispatch (DS): first part of ID
  allocate resources in instruction buffer
  new kind of structural hazard (instruction buffer could be full)
  dispatch is in-order, and stall propagates to younger instructions
issue (IS): second part of ID
  send instructions from instruction buffer to execution units
  out-of-order, wait does NOT propagate to younger instructions

Dynamic Scheduling Method #1: Scoreboarding
instruction buffer is called the scoreboard
centralized control scheme
  no bypassing
  no elimination of WAR/WAW hazards
first implementation: CDC 6600 [1964]
  16 separate non-pipelined functional units
  4 FP, 5 memory, 7 integer
our example: Simple Scoreboard
  5 functional units: 1 ALU, 1 load, 1 store, 2 FP (3-cycle, pipelined)
  for simplicity, assume 1-wide pipeline (not superscalar)

Scoreboard Data Structures
instruction status: 1 entry per active instruction
  which stage the instruction is in (presence in scoreboard implies DS)
functional unit (FU) status: 1 entry per FU
  busy: FU is busy; op: current operation
  R1, R2, R: source and destination registers
  T1, T2: tags of FUs producing source registers
  T: tag of FU producing destination register
register status: 1 entry per architectural register
  T: tag of FU (if any) that will write the register
tag fields interpreted as ready bits (conversely, "busy" bits)
  tag == 0: register value is ready (in register file)
  tag != 0: register value is not ready (will be supplied by [tag])
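The FU status and register status tables above can be sketched as plain Python records. This is an illustrative sketch only: the names `FUStatus`, `make_scoreboard`, and `register_ready` are assumptions made here, and `None` stands in for the slides' "tag == 0" (ready) encoding.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# The five functional units of the Simple Scoreboard example.
FU_NAMES = ["ALU", "load", "store", "FP1", "FP2"]


@dataclass
class FUStatus:
    """One FU status entry; field names follow the slide (busy, op, R, R1, R2, T1, T2)."""
    busy: bool = False
    op: Optional[str] = None
    R: Optional[str] = None    # destination register
    R1: Optional[str] = None   # first source register
    R2: Optional[str] = None   # second source register
    T1: Optional[str] = None   # FU producing R1 (None means ready)
    T2: Optional[str] = None   # FU producing R2 (None means ready)


def make_scoreboard(registers: List[str]) -> Tuple[Dict[str, FUStatus], Dict[str, Optional[str]]]:
    """Build empty FU status and register status tables."""
    fu_status = {name: FUStatus() for name in FU_NAMES}
    reg_status: Dict[str, Optional[str]] = {reg: None for reg in registers}
    return fu_status, reg_status


def register_ready(reg_status: Dict[str, Optional[str]], reg: str) -> bool:
    """A register is ready when no FU tag is recorded for it."""
    return reg_status[reg] is None


fu_status, reg_status = make_scoreboard(["f0", "f2", "f4", "r1"])
reg_status["f0"] = "load"  # a load has been dispatched that will write f0
```

Using the FU name itself as the tag keeps the sketch close to the tables in the running example, where register status entries read "load", "FP1", and so on.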

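Simple Scoreboard
[Figure: scoreboard datapath showing fetched insns, inst status (IS, EX, WB), FU status (R, R1, R2, T, T1, T2), reg status (T, value), tag comparators (==), register file (RF), and FUs; legend: instruction fields and status bits, tags, values]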

Scoreboard Pipeline
new pipeline structure: IF, DS, IS, EX, WB
DS (dispatch): from fetch to the scoreboard
  (no scoreboard entry / structural hazard / WAW) ? (stall) : (allocate)
IS (issue): to the functional units
  (RAW hazard) ? (wait) : (read registers, go directly to execute)
EX (execute)
  execute operation, notify scoreboard when done
WB (writeback)
  (WAR hazard) ? (wait) : (write register, free scoreboard entry)
assume
  WB and RAW-dependent IS can take place in same cycle
  WB and structural-dependent DS can take place in same cycle

Scoreboard: Dispatch (DS)
[Figure: scoreboard datapath (reg status, inst status, FU status, RF, FUs)]
stall for WAW and structural hazards, but otherwise:
  allocate scoreboard entry
  copy status for input registers
  set status for output register
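The dispatch rules above can be sketched as a single function. This is a hedged illustration, not the original hardware description: `free_entry` and `dispatch` are hypothetical helpers, tags are FU names with `None` meaning ready, and placing a load's address register in R1 is an assumption made for the sketch.

```python
def free_entry():
    """An empty FU status entry (hypothetical dict layout for this sketch)."""
    return {"busy": False, "op": None, "R": None, "R1": None, "R2": None,
            "T1": None, "T2": None}


def dispatch(fu_status, reg_status, fu, op, dest, src1=None, src2=None):
    """Try to dispatch one instruction to `fu`; return what happened."""
    entry = fu_status[fu]
    if entry["busy"]:
        return "stall: structural"      # no free scoreboard entry for this FU
    if dest is not None and reg_status.get(dest) is not None:
        return "stall: WAW"             # an older instruction will still write dest
    # Allocate the entry and copy the status of the input registers first,
    # so an instruction like add r1,r1,#4 sees the old tag for its source.
    entry.update(busy=True, op=op, R=dest, R1=src1, R2=src2,
                 T1=reg_status.get(src1), T2=reg_status.get(src2))
    # Then set the status of the output register.
    if dest is not None:
        reg_status[dest] = fu
    return "allocated"


fu_status = {name: free_entry() for name in ("ALU", "load", "store", "FP1", "FP2")}
reg_status = {"f0": None, "f2": None, "f4": None, "r1": None}

results = [
    dispatch(fu_status, reg_status, "load", "ldf", "f0", "r1"),
    dispatch(fu_status, reg_status, "FP1", "mulf", "f4", "f0", "f2"),
    dispatch(fu_status, reg_status, "store", "stf", None, "f4", "r1"),
    dispatch(fu_status, reg_status, "FP2", "mulf", "f4", "f0", "f2"),  # second mulf
]
```

In this sketch the second mulf is refused because f4 is still tagged with FP1, which mirrors the WAW dispatch stall in the running example.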

Scoreboard: Issue (IS)
[Figure: scoreboard datapath (reg status, inst status, FU status, RF, FUs)]
wait for RAW hazards (T1 or T2 not empty), but otherwise:
  read registers
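The issue condition is small enough to write out directly. A minimal sketch, assuming an FU entry is a dict with the slide's field names and that `None` marks an empty tag; `try_issue` and the register values are made up for illustration.

```python
def try_issue(entry, regfile):
    """Return the source operand values if the entry can issue, else None."""
    if entry["T1"] is not None or entry["T2"] is not None:
        return None  # RAW hazard: at least one source is still being produced
    # No bypassing in a scoreboard: operands come from the register file.
    return [regfile[reg] for reg in (entry["R1"], entry["R2"]) if reg is not None]


# mulf f4,f0,f2 dispatched while the load that produces f0 is in flight.
mulf = {"op": "mulf", "R": "f4", "R1": "f0", "R2": "f2", "T1": "load", "T2": None}
regfile = {"f0": 3.0, "f2": 2.0}  # illustrative values only

waiting = try_issue(mulf, regfile)    # still waiting on the load
mulf["T1"] = None                     # the load's writeback clears the tag
operands = try_issue(mulf, regfile)   # now both sources can be read
```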

Scoreboard: Execute (EX)
[Figure: scoreboard datapath (reg status, inst status, FU status, RF, FUs)]

Scoreboard: Writeback (WB)
[Figure: scoreboard datapath (reg status, inst status, FU status, RF, FUs)]
wait for WAR hazards, but otherwise:
  write back result
  compare tags with waiting instructions
  on match: clear tag (set input to "ready")
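One way to sketch the writeback step, including the WAR check, is shown below. Everything here is an assumption for illustration: entries are dicts with the slide's field names plus a hypothetical `issued` flag, and a WAR hazard is detected as "some other un-issued entry reads my destination and is not waiting on me for it".

```python
def free_entry():
    """An empty FU status entry (hypothetical dict layout for this sketch)."""
    return {"busy": False, "op": None, "R": None, "R1": None, "R2": None,
            "T1": None, "T2": None, "issued": False}


def war_blocked(fu, fu_status):
    """True if an un-issued instruction still needs the old value of fu's destination."""
    dest = fu_status[fu]["R"]
    if dest is None:
        return False
    for name, entry in fu_status.items():
        if name == fu or not entry["busy"] or entry["issued"]:
            continue
        for src, tag in ((entry["R1"], entry["T1"]), (entry["R2"], entry["T2"])):
            if src == dest and tag is None:
                return True  # reader has not read the register yet
    return False


def writeback(fu, fu_status, reg_status):
    """Write the result if no WAR hazard; clear matching tags and free the entry."""
    if war_blocked(fu, fu_status):
        return False
    dest = fu_status[fu]["R"]
    if dest is not None and reg_status.get(dest) == fu:
        reg_status[dest] = None          # register value is now ready
    for entry in fu_status.values():     # wake up waiting instructions
        if entry["T1"] == fu:
            entry["T1"] = None
        if entry["T2"] == fu:
            entry["T2"] = None
    fu_status[fu] = free_entry()
    return True


# State resembling the running example while add waits to write r1.
fu_status = {name: free_entry() for name in ("ALU", "load", "store", "FP1", "FP2")}
fu_status["ALU"].update(busy=True, op="add", R="r1", R1="r1", issued=True)
fu_status["load"].update(busy=True, op="ldf", R="f0", R1="r1", T1="ALU")
fu_status["store"].update(busy=True, op="stf", R1="f4", R2="r1", T1="FP1")
reg_status = {"f0": "load", "f2": None, "f4": "FP1", "r1": "ALU"}

first_try = writeback("ALU", fu_status, reg_status)   # stf has not read r1 yet
fu_status["store"]["issued"] = True                   # stf issues and reads r1
second_try = writeback("ALU", fu_status, reg_status)  # add may now write r1
```

The younger ldf does not block the add in this sketch because its r1 operand is tagged with ALU: it is waiting for the new value, which is a RAW dependence rather than a WAR hazard.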

Running Example
SAX: simplified SAXPY
  DO I = 1,N
    Z[I] = A*X[I]
assembly code:
  loop: ldf f0,x(r1)     // f0=X[I], assume I in r1
        mulf f4,f0,f2    // assume A in f2
        stf f4,z(r1)     // Z[I]=A*X[I]
        add r1,r1,#4     // I=I+4
        ble r1,r2,loop   // assume 4N in r2
consider two iterations, ignore branch

Scoreboard Data Structures

Instruction Status
  instruction      DS   IS   EX   WB
  ldf f0,x(r1)
  mulf f4,f0,f2
  stf f4,z(r1)
  add r1,r1,#4
  ldf f0,x(r1)
  mulf f4,f0,f2
  stf f4,z(r1)

Register Status
  register   T
  f0
  f2
  f4
  r1

Functional Unit Status (fields in order: T, busy, op, R, R1, R2, T1, T2; empty fields omitted)
  ALU:    No
  load:   No
  store:  No
  FP1:    No
  FP2:    No

Scoreboard Example: Cycle 1

Instruction Status
  instruction      DS   IS   EX   WB
  ldf f0,x(r1)     c1
  mulf f4,f0,f2
  stf f4,z(r1)
  add r1,r1,#4
  ldf f0,x(r1)
  mulf f4,f0,f2
  stf f4,z(r1)

Register Status
  register   T
  f0         load
  f2
  f4
  r1

Functional Unit Status (fields in order: T, busy, op, R, R1, R2, T1, T2; empty fields omitted)
  ALU:    No
  load:   Yes  ldf f0 r1
  store:  No
  FP1:    No
  FP2:    No

notes: load entry: allocate

Scoreboard Example: Cycle 2

Instruction Status
  instruction      DS   IS   EX   WB
  ldf f0,x(r1)     c1   c2
  mulf f4,f0,f2    c2
  stf f4,z(r1)
  add r1,r1,#4
  ldf f0,x(r1)
  mulf f4,f0,f2
  stf f4,z(r1)

Register Status
  register   T
  f0         load
  f2
  f4         FP1
  r1

Functional Unit Status (fields in order: T, busy, op, R, R1, R2, T1, T2; empty fields omitted)
  ALU:    No
  load:   Yes  ldf f0 r1
  store:  No
  FP1:    Yes  mulf f4 f0 f2 load
  FP2:    No

notes: FP1 entry: allocate

Scoreboard Example: Cycle 3

Instruction Status
  instruction      DS   IS   EX   WB
  ldf f0,x(r1)     c1   c2   c3
  mulf f4,f0,f2    c2
  stf f4,z(r1)     c3
  add r1,r1,#4
  ldf f0,x(r1)
  mulf f4,f0,f2
  stf f4,z(r1)

Register Status
  register   T
  f0         load
  f2
  f4         FP1
  r1

Functional Unit Status (fields in order: T, busy, op, R, R1, R2, T1, T2; empty fields omitted)
  ALU:    No
  load:   Yes  ldf f0 r1
  store:  Yes  stf f4 r1 FP1
  FP1:    Yes  mulf f4 f0 f2 load
  FP2:    No

notes: store entry: allocate; mulf stalled on RAW (f0 still tagged load)

Scoreboard Example: Cycle 4

Instruction Status
  instruction      DS   IS   EX   WB
  ldf f0,x(r1)     c1   c2   c3   c4
  mulf f4,f0,f2    c2   c4
  stf f4,z(r1)     c3
  add r1,r1,#4     c4
  ldf f0,x(r1)
  mulf f4,f0,f2
  stf f4,z(r1)

Register Status
  register   T
  f0         load (cleared: result written, clear status)
  f2
  f4         FP1
  r1         ALU

Functional Unit Status (fields in order: T, busy, op, R, R1, R2, T1, T2; empty fields omitted)
  ALU:    Yes  add r1 r1
  load:   No
  store:  Yes  stf f4 r1 FP1
  FP1:    Yes  mulf f4 f0 f2 load (tag cleared: f0 now ready)
  FP2:    No

notes: ALU entry: allocate; load entry: free

Scoreboard Example: Cycle 5

Instruction Status
  instruction      DS   IS   EX   WB
  ldf f0,x(r1)     c1   c2   c3   c4
  mulf f4,f0,f2    c2   c4   c5
  stf f4,z(r1)     c3
  add r1,r1,#4     c4   c5
  ldf f0,x(r1)     c5
  mulf f4,f0,f2
  stf f4,z(r1)

Register Status
  register   T
  f0         load
  f2
  f4         FP1
  r1         ALU

Functional Unit Status (fields in order: T, busy, op, R, R1, R2, T1, T2; empty fields omitted)
  ALU:    Yes  add r1 r1
  load:   Yes  ldf f0 r1 ALU
  store:  Yes  stf f4 r1 FP1
  FP1:    Yes  mulf f4 f0 f2
  FP2:    No

notes: load entry: allocate

Scoreboard Example: Cycle 6

Instruction Status
  instruction      DS   IS   EX   WB
  ldf f0,x(r1)     c1   c2   c3   c4
  mulf f4,f0,f2    c2   c4   c5+
  stf f4,z(r1)     c3
  add r1,r1,#4     c4   c5   c6
  ldf f0,x(r1)     c5
  mulf f4,f0,f2
  stf f4,z(r1)

Register Status
  register   T
  f0         load
  f2
  f4         FP1
  r1         ALU

Functional Unit Status (fields in order: T, busy, op, R, R1, R2, T1, T2; empty fields omitted)
  ALU:    Yes  add r1 r1
  load:   Yes  ldf f0 r1 ALU
  store:  Yes  stf f4 r1 FP1
  FP1:    Yes  mulf f4 f0 f2
  FP2:    No

notes: second mulf: DS stall, WAW hazard w/ first mulf (f4)

Scoreboard Example: Cycle 7

Instruction Status
  instruction      DS   IS   EX   WB
  ldf f0,x(r1)     c1   c2   c3   c4
  mulf f4,f0,f2    c2   c4   c5+
  stf f4,z(r1)     c3
  add r1,r1,#4     c4   c5   c6
  ldf f0,x(r1)     c5
  mulf f4,f0,f2
  stf f4,z(r1)

Register Status
  register   T
  f0         load
  f2
  f4         FP1
  r1         ALU

Functional Unit Status (fields in order: T, busy, op, R, R1, R2, T1, T2; empty fields omitted)
  ALU:    Yes  add r1 r1
  load:   Yes  ldf f0 r1 ALU
  store:  Yes  stf f4 r1 FP1
  FP1:    Yes  mulf f4 f0 f2
  FP2:    No

notes: add: WB stall, WAR hazard w/ stf (r1); second mulf: DS stall, WAW hazard w/ first mulf (f4)

Scoreboard Example: Cycle 8

Instruction Status
  instruction      DS   IS   EX   WB
  ldf f0,x(r1)     c1   c2   c3   c4
  mulf f4,f0,f2    c2   c4   c5+  c8
  stf f4,z(r1)     c3   c8
  add r1,r1,#4     c4   c5   c6
  ldf f0,x(r1)     c5
  mulf f4,f0,f2    c8
  stf f4,z(r1)

Register Status
  register   T
  f0         load
  f2
  f4         FP2 (was FP1)
  r1         ALU

Functional Unit Status (fields in order: T, busy, op, R, R1, R2, T1, T2; empty fields omitted)
  ALU:    Yes  add r1 r1
  load:   Yes  ldf f0 r1 ALU
  store:  Yes  stf f4 r1 FP1 (tag cleared: f4 is ready)
  FP1:    No
  FP2:    Yes  mulf f4 f0 f2 load

notes: first mulf (FP1) is finished; FP1 entry: free; FP2 entry: allocate; add: WB stall due to WAR hazard

Scoreboard Example: Cycle 9

Instruction Status
  instruction      DS   IS   EX   WB
  ldf f0,x(r1)     c1   c2   c3   c4
  mulf f4,f0,f2    c2   c4   c5+  c8
  stf f4,z(r1)     c3   c8   c9
  add r1,r1,#4     c4   c5   c6   c9
  ldf f0,x(r1)     c5   c9
  mulf f4,f0,f2    c8
  stf f4,z(r1)

Register Status
  register   T
  f0         load
  f2
  f4         FP2
  r1         ALU (cleared: add wrote)

Functional Unit Status (fields in order: T, busy, op, R, R1, R2, T1, T2; empty fields omitted)
  ALU:    No
  load:   Yes  ldf f0 r1 ALU (tag cleared: r1 is ready)
  store:  Yes  stf f4 r1
  FP1:    No
  FP2:    Yes  mulf f4 f0 f2 load

notes: ALU entry: free; second stf: DS stall due to structural hazard

Scoreboard Example: Cycle 10

Instruction Status
  instruction      DS   IS   EX   WB
  ldf f0,x(r1)     c1   c2   c3   c4
  mulf f4,f0,f2    c2   c4   c5+  c8
  stf f4,z(r1)     c3   c8   c9   c10
  add r1,r1,#4     c4   c5   c6   c9
  ldf f0,x(r1)     c5   c9   c10
  mulf f4,f0,f2    c8
  stf f4,z(r1)     c10

Register Status
  register   T
  f0         load
  f2
  f4         FP2
  r1

Functional Unit Status (fields in order: T, busy, op, R, R1, R2, T1, T2; empty fields omitted)
  ALU:    No
  load:   Yes  ldf f0 r1
  store:  Yes  stf f4 r1 FP2
  FP1:    No
  FP2:    Yes  mulf f4 f0 f2 load

notes: WB and dependent DS in same cycle; store entry: free, then allocate

Scoreboard Redux
+ cheap hardware
    scoreboard is cheap (~1 FU in area)
pretty good performance
    1.7X for FORTRAN programs
    2.5X for hand-coded assembly (how would a compiler do?)
no bypassing
    RAW dependences handled through registers
limited scheduling scope
    WAW/structural hazards force in-order dispatch
    WAR hazards delay writeback and issue of dependent operations
can solve these problems with register renaming!
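To illustrate why renaming helps, here is a small sketch that renames the two iterations of the running example onto fresh register names. It is not Tomasulo's algorithm and not taken from the slides: the `rename` function, the `p0, p1, ...` names, and the unlimited supply of fresh names are all assumptions made for illustration.

```python
from itertools import count


def rename(program, arch_regs):
    """Give every register write a fresh name; sources read the latest mapping."""
    fresh = count()
    mapping = {reg: f"p{next(fresh)}" for reg in arch_regs}
    renamed = []
    for op, dest, srcs in program:
        new_srcs = tuple(mapping[s] for s in srcs)  # read the mapping before updating it
        new_dest = None
        if dest is not None:
            mapping[dest] = f"p{next(fresh)}"       # each write gets its own name
            new_dest = mapping[dest]
        renamed.append((op, new_dest, new_srcs))
    return renamed


# Two iterations of the running example as (op, dest, sources); branch ignored.
program = [
    ("ldf", "f0", ("r1",)),
    ("mulf", "f4", ("f0", "f2")),
    ("stf", None, ("f4", "r1")),
    ("add", "r1", ("r1",)),
    ("ldf", "f0", ("r1",)),
    ("mulf", "f4", ("f0", "f2")),
    ("stf", None, ("f4", "r1")),
]

renamed = rename(program, ["f0", "f2", "f4", "r1"])
first_mulf, first_stf, add, second_mulf = renamed[1], renamed[2], renamed[3], renamed[5]
```

In the renamed program the two mulf instructions write different names, so the WAW dispatch stall would have nothing to compare against, and the add writes a name that the first stf never reads, so the WAR writeback stall would not arise either. RAW dependences are preserved because sources always read the most recent mapping.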