Dynamic Scheduling II

so far: dynamic scheduling (out-of-order execution)
  scoreboard
  Tomasulo's algorithm
  register renaming: removing artificial dependences (WAR/WAW)
now: out-of-order execution + precise state
advanced topic: dynamic load scheduling
Pentium II vs. Pentium 4
limits of ILP

Readings
  H+P chapter 2
  research papers
    Pentium 4
    Complexity-Effective Superscalar
    Checkpoint Processing and Recovery

Superscalar + Out-of-Order + Speculation
  three concepts that work well (best?) when used together
  CPI >= 1? overcome with superscalar
  superscalar increases hazards? overcome with dynamic scheduling
  RAW dependences still a problem? overcome with a large instruction window
  branches a problem for filling large window? overcome with speculation

Speculation and Precise Interrupts
  Q: why are we discussing these together?
  sequential (von Neumann) semantics for interrupts
    all instructions before interrupt should be complete
    all instructions after interrupt should look as if never started (abort)
    basically, we also want the same thing for a mis-predicted branch
  what makes precise interrupts hard?
    out-of-order completion: must undo post-interrupt writebacks
    in-order pipe: no post-branch writebacks before branch completes
    out-of-order pipe: can happen
  A: with an out-of-order pipe, precise interrupts and mis-speculation recovery are the same problem, with the same solution

Solution: Precise State
  speculative execution requirements
    ability to abort & restart at every branch
  precise synchronous interrupt requirements
    ability to abort & restart at every load, store, FP divide, ??
  precise asynchronous interrupt requirements
    ability to abort & restart at every ??
  just bite the bullet
    implement ability to abort & restart at every instruction
    called precise state

Ways to Implement Precise State
  force in-order completion (WB): stall pipe if necessary
    slow
  precise state in software
    even slower: would require a trap for every misprediction
  precise state in hardware: save recovery info internally
    + everything is better in hardware

The Problem with Precise State
  problem is in the writeback stage (WB)
    mixes two things together that should be separate
    (1) broadcasts values to RS, forwards to other instructions
      OK for this to be out-of-order
    (2) writes values to registers
      would like this to be in-order
  solution to every functionality problem? add a level of indirection
    have already seen this for out-of-order execution
    split ID into in-order DS and out-of-order IS
    separate using instruction buffer (scoreboard, reservation stations)

Re-Order Buffer (ROB)
  [Figure: pipeline IF, DS, IS, EX, CM, R; labels: PC, I$, F/D, D/X, X/C, regfile, ROB, instruction buffer]
  re-order buffer (ROB)
    buffers completed results en route to register file and D$
    may be combined with RS or separate (combined in the picture)
  split writeback (WB) into two stages: Complete and Retire

Complete and Retire
  [Figure: pipeline IF, DS, IS, EX, CM, R; labels: PC, I$, F/D, D/X, X/C, regfile, ROB]
  CM (complete)
    completed values write results to ROB out-of-order
    out-of-order stage
  R (retire, sometimes called commit or graduate)
    ROB writes results to register file in-order
    in-order stage
    hazards result in stalls
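The split between Complete and Retire can be sketched in a few lines of Python. This is a minimal illustration, not the slides' hardware design: the class name `ReorderBuffer`, its methods, and the use of a `deque` are assumptions made for the example.

```python
from collections import deque

class ReorderBuffer:
    """Toy ROB: entries are allocated in program order, completed in any
    order, and retired strictly from the head. (Hypothetical names.)"""

    def __init__(self):
        self.entries = deque()   # oldest instruction sits at the left (head)
        self.regfile = {}        # architectural register state

    def dispatch(self, rob_id, dest_reg):
        self.entries.append({"id": rob_id, "reg": dest_reg,
                             "value": None, "done": False})

    def complete(self, rob_id, value):
        # CM: out-of-order; only the ROB entry is written, not the regfile.
        for entry in self.entries:
            if entry["id"] == rob_id:
                entry["value"], entry["done"] = value, True
                return
        raise KeyError(rob_id)

    def retire(self):
        # R: in-order; stall (return None) unless the head has completed.
        if not self.entries or not self.entries[0]["done"]:
            return None
        head = self.entries.popleft()
        self.regfile[head["reg"]] = head["value"]
        return head["id"]

rob = ReorderBuffer()
rob.dispatch(1, "f0")
rob.dispatch(2, "f4")
rob.complete(2, 3.0)       # the younger instruction finishes first
first_try = rob.retire()   # head (ROB#1) is not complete, so R stalls
rob.complete(1, 1.5)
retired = [rob.retire(), rob.retire()]
```

Even though ROB#2 completes first, the register file is only ever updated in program order, which is what makes the state precise.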

Memory Ordering Buffer (MOB)
  ROB makes register writes in-order, but what about stores?
  same as before (i.e., to D$ in MEM stage)?
    bad idea! imprecise memory is worse than imprecise registers
  must do same trick for stores: Memory Ordering Buffer (MOB)
    a.k.a. store buffer, store queue, load/store queue (LSQ)
    completed (but not retired) stores write to MOB
    to retire store, write head of MOB to D$
    loads look at MOB and D$ in parallel
    forward from MOB if matching store (i.e., to same address)
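A rough Python sketch of the store-buffer behavior described above follows. It assumes every buffered store is older than the load and has a known address, which a real load/store queue cannot take for granted; the class `MemoryOrderingBuffer` and its method names are hypothetical, and a dict stands in for D$.

```python
class MemoryOrderingBuffer:
    """Toy MOB: completed stores wait here until they retire."""

    def __init__(self, dcache):
        self.dcache = dcache   # dict: address -> value (stands in for D$)
        self.stores = []       # completed, unretired stores, oldest first

    def complete_store(self, addr, value):
        # CM for a store: write the MOB, not the D$.
        self.stores.append((addr, value))

    def retire_store(self):
        # R for a store: write the head of the MOB to the D$.
        addr, value = self.stores.pop(0)
        self.dcache[addr] = value

    def load(self, addr):
        # Search the MOB youngest-first and forward from a matching store;
        # otherwise fall back to the D$ (hardware does both in parallel).
        for st_addr, value in reversed(self.stores):
            if st_addr == addr:
                return value
        return self.dcache.get(addr, 0)

dcache = {0x100: 7}
mob = MemoryOrderingBuffer(dcache)
mob.complete_store(0x100, 42)
forwarded = mob.load(0x100)      # value comes from the MOB
before_retire = dcache[0x100]    # D$ still holds the old value
mob.retire_store()
after_retire = dcache[0x100]     # D$ updated only at retirement
```

Until the store retires, memory is unchanged, so aborting the store only requires dropping its MOB entry.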

ROB+MOB
  [Figure: pipeline IF, DS, IS, EX, CM, R; labels: PC, I$, F/D, D/X, X/C, regfile, ROB, MOB, D$, with paths marked loads, stores, and loads/stores]
  modulo some gross simplifications, this picture is almost realistic!

Tomasulo+ROB
  add ROB to Tomasulo's algorithm
  combined ROB and RS are called RUU (or Sohi's method)
    RUU = register update unit
  separate ROB and RS are called P6-style (Intel P6 = Pentium Pro)
  our example: Simple-P6
    separate ROB and RS
    same organization as before: 1 ALU, 1 load, 1 store, 2 3-cycle FP

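P6-style Organization
  [Figure: register status table (T+), register file (RF) values, ROB (R, value) with head (retire) and tail (dispatch) pointers, RS (T, T1, T2, V1, V2) with == tag comparators, CDB.T and CDB.V, FU; legend: instruction fields and ready bits, tags, values]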

P6 Data Structures
  RS are the same as before
  ROB
    head, tail: to keep sequential order
    R: output register of instruction, V: output value of instruction
  tags are different
    was: RS#, now: ROB#
  register status table is different
    T+: tag + ready-in-ROB bit
    tag == 0: result ready in register file
    tag != 0: result not ready
    tag != 0 +: result ready in ROB
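The three register-status cases can be written as a small lookup. This is only a sketch of the encoding described above; the `Status` tuple and the function `operand_source` are names invented for the illustration.

```python
from typing import NamedTuple

class Status(NamedTuple):
    tag: int = 0                # ROB# of the in-flight producer, 0 if none
    ready_in_rob: bool = False  # the "+" bit

def operand_source(status: Status) -> str:
    """Say where a dispatching instruction finds this register's value."""
    if status.tag == 0:
        return "regfile"        # tag == 0: result ready in register file
    if status.ready_in_rob:
        return "rob"            # tag != 0 +: result ready in ROB
    return "wait"               # tag != 0: not ready, keep the tag and wait

src_r1 = operand_source(Status())                            # "regfile"
src_f0 = operand_source(Status(tag=5))                       # "wait"
src_f4 = operand_source(Status(tag=4, ready_in_rob=True))    # "rob"
```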

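P6 Data Structures

ROB + MOB
ht  #  instruction    R   V     addr   IS  EX   CM
-   1  ldf f0,x(r1)   -   -     -      -   -    -
-   2  mulf f4,f0,f2  -   -     -      -   -    -
-   3  stf f4,z(r1)   -   -     -      -   -    -
-   4  add r1,r1,#8   -   -     -      -   -    -
-   5  ldf f0,x(r1)   -   -     -      -   -    -
-   6  mulf f4,f0,f2  -   -     -      -   -    -
-   7  stf f4,z(r1)   -   -     -      -   -    -

Reg. Status
reg  T+
f0   -
f2   -
f4   -
r1   -

CDB
T      V
-      -

RS
#  FU     busy  op    T      V1       V2       T1     T2
1  ALU    No    -     -      -        -        -      -
2  load   No    -     -      -        -        -      -
3  store  No    -     -      -        -        -      -
4  FP1    No    -     -      -        -        -      -
5  FP2    No    -     -      -        -        -      -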

P6 Pipeline
  new pipeline structure: IF, DS, IS, EX, CM, R
  DS (dispatch)
    (RS/ROB/MOB full) ? (stall) :
      {allocate RS/ROB/MOB entries,
       set RS tag to ROB#,
       set register status entry to ROB# with ready-in-ROB bit off,
       read ready registers into RS}
  EX (execute)
    free RS entry
    used to be done at WB
    can be earlier now because RS# are not tags

P6 Pipeline
  CM (complete)
    (CDB not available) ? (wait) :
      {write value into ROB entry indicated by tag,
       mark ROB entry complete,
       mark register status entry ready-in-ROB bit (+)}
  R (retire, commit, graduate)
    (ROB head not complete) ? (stall) :
      {write ROB head result to register file,
       if store, then write MOB head to D$,
       handle any exceptions,
       free ROB/MOB entries}

P6: Dispatch (DS) part I
  [Figure: P6-style organization diagram (register status table, RF, ROB, RS, CDB, FU)]
  stall if RS or ROB or MOB is full
  allocate RS+ROB entries (assign ROB# to RS output tag)
  set register status entry to ROB# and ready-in-ROB bit to 0
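The allocation half of dispatch might look like the following Python sketch. It assumes one RS per functional unit (as in the Simple-P6 example) and leaves out the MOB; the `state` dictionary layout, the `next_rob` counter, and the function name `dispatch` are all assumptions of the sketch.

```python
def dispatch(state, fu, dest_reg):
    """Allocate an RS and a ROB entry; return the new ROB# (the tag),
    or None to signal that DS must stall. (Hypothetical helper.)"""
    if state["rs_busy"][fu] or len(state["rob"]) >= state["rob_size"]:
        return None                       # structural hazard: stall DS
    rob_id = state["next_rob"]            # the ROB# doubles as the tag
    state["next_rob"] += 1
    state["rob"].append(rob_id)
    state["rs_busy"][fu] = True
    if dest_reg is not None:              # stores have no output register
        # register status entry := ROB#, ready-in-ROB bit off
        state["reg_status"][dest_reg] = (rob_id, False)
    return rob_id

state = {
    "rs_busy": {"load": False, "FP1": False, "store": False},
    "rob": [], "rob_size": 8, "next_rob": 1, "reg_status": {},
}
t1 = dispatch(state, "load", "f0")    # ldf  f0,X(r1)
t2 = dispatch(state, "FP1", "f4")     # mulf f4,f0,f2
t3 = dispatch(state, "store", None)   # stf  f4,Z(r1)
t4 = dispatch(state, "store", None)   # second stf: store RS is still busy
```

The last call returns `None` because the single store RS is occupied, the same kind of structural stall the worked example reports as "stall DS, no free store RS".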

P6: Dispatch (DS) part II
  [Figure: P6-style organization diagram (register status table, RF, ROB, RS, CDB, FU)]
  read tags for register inputs from register status table
    if tag == 0: copy value from RF (not shown)
    if tag != 0: copy tag to RS
    if tag != 0 +: copy value from ROB

P6: Complete (CM)
  [Figure: P6-style organization diagram (register status table, RF, ROB, RS, CDB, FU)]
  wait for CDB
  broadcast <result, tag> on CDB
  write result into ROB, set reg. status ready-in-ROB bit (+)
  match tags, write CDB.V into RS of dependent instructions
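A CDB broadcast can be sketched as one function that updates the three structures named above. The dictionary layouts and the function name `complete` are assumptions; clearing a captured tag to 0 is this sketch's convention for "operand ready" and is not taken from the slides.

```python
def complete(tag, value, rob, reg_status, stations):
    """Broadcast <value, tag> on the CDB (hypothetical helper)."""
    # 1. write the result into the ROB entry named by the tag
    rob[tag]["value"] = value
    rob[tag]["done"] = True
    # 2. set the ready-in-ROB bit (+) where the register still maps to tag
    for reg, (t, _) in reg_status.items():
        if t == tag:
            reg_status[reg] = (t, True)
    # 3. match tags in the RS and capture CDB.V for dependent instructions
    for rs in stations:
        for i in (1, 2):
            if rs[f"T{i}"] == tag:
                rs[f"V{i}"] = value
                rs[f"T{i}"] = 0   # sketch convention: 0 = operand ready

# State loosely modeled on the ldf/mulf pair of the worked example.
rob = {1: {"value": None, "done": False}, 2: {"value": None, "done": False}}
reg_status = {"f0": (1, False), "f4": (2, False)}
mulf = {"T": 2, "T1": 1, "T2": 0, "V1": None, "V2": 2.0}
complete(1, 1.5, rob, reg_status, [mulf])   # ldf (ROB#1) finishes
```

After the call, the waiting `mulf` has both operands and could issue.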

P6: Retire (R)
  [Figure: P6-style organization diagram (register status table, RF, ROB, RS, CDB, FU)]
  stall until instruction at ROB head has completed
  write ROB head result to reg-file (D$ if store), clear reg. status entry
  free ROB entry
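The retire step, including the register-status update, could look like this in Python. The sketch makes one assumption explicit: the status entry is cleared only if it still names the retiring instruction, since a younger instruction may have renamed the register in the meantime. All names and data layouts here are hypothetical.

```python
def retire(rob, regfile, reg_status, mob, dcache):
    """Retire the ROB head if it has completed.
    rob is a list, oldest first. Returns True on retire, False on stall."""
    if not rob or not rob[0]["done"]:
        return False                      # stall R
    head = rob.pop(0)                     # free the ROB entry
    if head["is_store"]:
        addr, value = mob.pop(0)          # write MOB head to D$
        dcache[addr] = value
    else:
        regfile[head["reg"]] = head["value"]
        # Assumption: clear the status entry only if no younger
        # instruction has since claimed this register.
        if reg_status.get(head["reg"], (0, False))[0] == head["id"]:
            reg_status[head["reg"]] = (0, False)
    return True

rob = [
    {"id": 1, "reg": "f0", "value": 1.5, "done": True, "is_store": False},
    {"id": 2, "reg": "f4", "value": None, "done": False, "is_store": False},
]
regfile, dcache, mob = {}, {}, []
reg_status = {"f0": (5, False), "f4": (2, False)}   # f0 renamed to ROB#5
first = retire(rob, regfile, reg_status, mob, dcache)    # ldf retires
second = retire(rob, regfile, reg_status, mob, dcache)   # mulf not done
```

Here `f0` keeps its ROB#5 tag after the first load retires, because the second `ldf` has already claimed it.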

P6 Example: Cycle 1

ROB + MOB
ht  #  instruction    R   V     addr   IS  EX   CM
ht  1  ldf f0,x(r1)   f0  -     &X[0]  -   -    -
-   2  mulf f4,f0,f2  -   -     -      -   -    -
-   3  stf f4,z(r1)   -   -     -      -   -    -
-   4  add r1,r1,#8   -   -     -      -   -    -
-   5  ldf f0,x(r1)   -   -     -      -   -    -
-   6  mulf f4,f0,f2  -   -     -      -   -    -
-   7  stf f4,z(r1)   -   -     -      -   -    -

Reg. Status
reg  T+
f0   ROB#1
f2   -
f4   -
r1   -

CDB
T      V
-      -

RS
#  FU     busy  op    T      V1       V2       T1     T2
1  ALU    No    -     -      -        -        -      -
2  load   Yes   ldf   ROB#1  -        REG[r1]  -      -
3  store  No    -     -      -        -        -      -
4  FP1    No    -     -      -        -        -      -
5  FP2    No    -     -      -        -        -      -

Notes
  allocate
  set reg. status (before ROB, this tag was RS#2)

P6 Example: Cycle 2

ROB + MOB
ht  #  instruction    R   V     addr   IS  EX   CM
h   1  ldf f0,x(r1)   f0  -     &X[0]  c2  -    -
t   2  mulf f4,f0,f2  f4  -     -      -   -    -
-   3  stf f4,z(r1)   -   -     -      -   -    -
-   4  add r1,r1,#8   -   -     -      -   -    -
-   5  ldf f0,x(r1)   -   -     -      -   -    -
-   6  mulf f4,f0,f2  -   -     -      -   -    -
-   7  stf f4,z(r1)   -   -     -      -   -    -

Reg. Status
reg  T+
f0   ROB#1
f2   -
f4   ROB#2
r1   -

CDB
T      V
-      -

RS
#  FU     busy  op    T      V1       V2       T1     T2
1  ALU    No    -     -      -        -        -      -
2  load   Yes   ldf   ROB#1  -        REG[r1]  -      -
3  store  No    -     -      -        -        -      -
4  FP1    Yes   mulf  ROB#2  -        REG[f2]  ROB#1  -
5  FP2    No    -     -      -        -        -      -

Notes
  allocate ROB, allocate RS, set reg. status

P6 Example: Cycle 3

ROB + MOB
ht  #  instruction    R   V     addr   IS  EX   CM
h   1  ldf f0,x(r1)   f0  -     &X[0]  c2  c3   -
-   2  mulf f4,f0,f2  f4  -     -      -   -    -
t   3  stf f4,z(r1)   -   -     &Z[0]  -   -    -
-   4  add r1,r1,#8   -   -     -      -   -    -
-   5  ldf f0,x(r1)   -   -     -      -   -    -
-   6  mulf f4,f0,f2  -   -     -      -   -    -
-   7  stf f4,z(r1)   -   -     -      -   -    -

Reg. Status
reg  T+
f0   ROB#1
f2   -
f4   ROB#2
r1   -

CDB
T      V
-      -

RS
#  FU     busy  op    T      V1       V2       T1     T2
1  ALU    No    -     -      -        -        -      -
2  load   No    -     -      -        -        -      -
3  store  Yes   stf   ROB#3  -        REG[r1]  ROB#2  -
4  FP1    Yes   mulf  ROB#2  -        REG[f2]  ROB#1  -
5  FP2    No    -     -      -        -        -      -

Notes
  free (load RS)
  allocate (store RS)

P6 Example: Cycle 4

ROB + MOB
ht  #  instruction    R   V     addr   IS  EX   CM
h   1  ldf f0,x(r1)   f0  [f0]  &X[0]  c2  c3   c4
-   2  mulf f4,f0,f2  f4  -     -      c4  -    -
-   3  stf f4,z(r1)   -   -     &Z[0]  -   -    -
t   4  add r1,r1,#8   r1  -     -      -   -    -
-   5  ldf f0,x(r1)   -   -     -      -   -    -
-   6  mulf f4,f0,f2  -   -     -      -   -    -
-   7  stf f4,z(r1)   -   -     -      -   -    -

Reg. Status
reg  T+
f0   ROB#1+
f2   -
f4   ROB#2
r1   ROB#4

CDB
T      V
ROB#1  [f0]

RS
#  FU     busy  op    T      V1       V2       T1     T2
1  ALU    Yes   add   ROB#4  REG[r1]  -        -      -
2  load   No    -     -      -        -        -      -
3  store  Yes   stf   ROB#3  -        REG[r1]  ROB#2  -
4  FP1    Yes   mulf  ROB#2  CDB.V    REG[f2]  ROB#1  -
5  FP2    No    -     -      -        -        -      -

Notes
  ldf finished: 1. write result to ROB, 2. CDB broadcast, 3. set ready-in-ROB bit
  allocate
  f0 ready: grab from CDB

P6 Example: Cycle 5

ROB + MOB
ht  #  instruction    R   V     addr   IS  EX   CM
-   1  ldf f0,x(r1)   f0  [f0]  &X[0]  c2  c3   c4
h   2  mulf f4,f0,f2  f4  -     -      c4  c5   -
-   3  stf f4,z(r1)   -   -     &Z[0]  -   -    -
-   4  add r1,r1,#8   r1  -     -      c5  -    -
t   5  ldf f0,x(r1)   f0  -     -      -   -    -
-   6  mulf f4,f0,f2  -   -     -      -   -    -
-   7  stf f4,z(r1)   -   -     -      -   -    -

Reg. Status
reg  T+
f0   ROB#5
f2   -
f4   ROB#2
r1   ROB#4

CDB
T      V
-      -

RS
#  FU     busy  op    T      V1       V2       T1     T2
1  ALU    Yes   add   ROB#4  REG[r1]  -        -      -
2  load   Yes   ldf   ROB#5  -        -        -      ROB#4
3  store  Yes   stf   ROB#3  -        REG[r1]  ROB#2  -
4  FP1    No    -     -      -        -        -      -
5  FP2    No    -     -      -        -        -      -

Notes
  retire, write ROB result into regfile
  allocate
  free

P6 Example: Cycle 6

ROB + MOB
ht  #  instruction    R   V     addr   IS  EX   CM
-   1  ldf f0,x(r1)   f0  [f0]  &X[0]  c2  c3   c4
h   2  mulf f4,f0,f2  f4  -     -      c4  c5+  -
-   3  stf f4,z(r1)   -   -     &Z[0]  -   -    -
-   4  add r1,r1,#8   r1  -     -      c5  c6   -
-   5  ldf f0,x(r1)   f0  -     -      -   -    -
t   6  mulf f4,f0,f2  f4  -     -      -   -    -
-   7  stf f4,z(r1)   -   -     -      -   -    -

Reg. Status
reg  T+
f0   ROB#5
f2   -
f4   ROB#6
r1   ROB#4

CDB
T      V
-      -

RS
#  FU     busy  op    T      V1       V2       T1     T2
1  ALU    No    -     -      -        -        -      -
2  load   Yes   ldf   ROB#5  -        -        -      ROB#4
3  store  Yes   stf   ROB#3  -        REG[r1]  ROB#2  -
4  FP1    No    -     -      -        -        -      -
5  FP2    Yes   mulf  ROB#6  -        REG[f2]  ROB#5  -

Notes
  free (ALU RS)
  allocate

P6 Example: Cycle 7

ROB + MOB
ht  #  instruction    R   V     addr   IS  EX   CM
-   1  ldf f0,x(r1)   f0  [f0]  &X[0]  c2  c3   c4
h   2  mulf f4,f0,f2  f4  -     -      c4  c5+  -
-   3  stf f4,z(r1)   -   -     &Z[0]  -   -    -
-   4  add r1,r1,#8   r1  [r1]  -      c5  c6   c7
-   5  ldf f0,x(r1)   f0  -     &X[1]  c7  -    -
t   6  mulf f4,f0,f2  f4  -     -      -   -    -
-   7  stf f4,z(r1)   -   -     -      -   -    -

Reg. Status
reg  T+
f0   ROB#5
f2   -
f4   ROB#6
r1   ROB#4+

CDB
T      V
ROB#4  [r1]

RS
#  FU     busy  op    T      V1       V2       T1     T2
1  ALU    No    -     -      -        -        -      -
2  load   Yes   ldf   ROB#5  -        CDB.V    -      ROB#4
3  store  Yes   stf   ROB#3  -        REG[r1]  ROB#2  -
4  FP1    No    -     -      -        -        -      -
5  FP2    Yes   mulf  ROB#6  -        REG[f2]  ROB#5  -

Notes
  add finished: write result into ROB, CDB
  stall DS: no free store RS
  r1 ready: grab from CDB

P6 Example: Cycle 8

ROB + MOB
ht  #  instruction    R   V     addr   IS  EX   CM
-   1  ldf f0,x(r1)   f0  [f0]  &X[0]  c2  c3   c4
h   2  mulf f4,f0,f2  f4  [f4]  -      c4  c5+  c8
-   3  stf f4,z(r1)   -   -     &Z[0]  c8  -    -
-   4  add r1,r1,#8   r1  [r1]  -      c5  c6   c7
-   5  ldf f0,x(r1)   f0  -     &X[1]  c7  c8   -
t   6  mulf f4,f0,f2  f4  -     -      -   -    -
-   7  stf f4,z(r1)   -   -     -      -   -    -

Reg. Status
reg  T+
f0   ROB#5
f2   -
f4   ROB#6
r1   ROB#4+

CDB
T      V
ROB#2  [f4]

RS
#  FU     busy  op    T      V1       V2       T1     T2
1  ALU    No    -     -      -        -        -      -
2  load   No    -     -      -        -        -      -
3  store  Yes   stf   ROB#3  CDB.V    REG[r1]  ROB#2  -
4  FP1    No    -     -      -        -        -      -
5  FP2    Yes   mulf  ROB#6  -        REG[f2]  ROB#5  -

Notes
  stall R
  stall DS: no free store RS
  free (load RS)
  f4 ready: grab from CDB

P6 Example: Cycle 9

ROB + MOB
ht  #  instruction    R   V     addr   IS  EX   CM
-   1  ldf f0,x(r1)   f0  [f0]  &X[0]  c2  c3   c4
-   2  mulf f4,f0,f2  f4  [f4]  -      c4  c5+  c8
h   3  stf f4,z(r1)   -   -     &Z[0]  c8  c9   -
-   4  add r1,r1,#8   r1  [r1]  -      c5  c6   c7
-   5  ldf f0,x(r1)   f0  -     &X[1]  c7  c8   c9
-   6  mulf f4,f0,f2  f4  -     -      c9  -    -
t   7  stf f4,z(r1)   -   -     &Z[1]  -   -    -

Reg. Status
reg  T+
f0   ROB#5+
f2   -
f4   ROB#6
r1   ROB#4+

CDB
T      V
ROB#5  [f0]

RS
#  FU     busy  op    T      V1       V2       T1     T2
1  ALU    No    -     -      -        -        -      -
2  load   No    -     -      -        -        -      -
3  store  Yes   stf   ROB#7  -        ROB#4.V  ROB#6  -
4  FP1    No    -     -      -        -        -      -
5  FP2    Yes   mulf  ROB#6  CDB.V    REG[f2]  ROB#5  -

Notes
  stall R
  read from ROB, not reg. file (+)
  free (ROB#3), allocate (ROB#7)
  f0 ready: grab from CDB

P6 Example: Cycle 10

ROB + MOB
ht  #  instruction    R   V     addr   IS  EX   CM
-   1  ldf f0,x(r1)   f0  [f0]  &X[0]  c2  c3   c4
-   2  mulf f4,f0,f2  f4  [f4]  -      c4  c5+  c8
h   3  stf f4,z(r1)   -   -     &Z[0]  c8  c9   c10
-   4  add r1,r1,#8   r1  [r1]  -      c5  c6   c7
-   5  ldf f0,x(r1)   f0  -     &X[1]  c7  c8   c9
-   6  mulf f4,f0,f2  f4  -     -      c9  c10  -
t   7  stf f4,z(r1)   -   -     -      -   -    -

Reg. Status
reg  T+
f0   ROB#5+
f2   -
f4   ROB#6
r1   ROB#4+

CDB
T      V
-      -

RS
#  FU     busy  op    T      V1       V2       T1     T2
1  ALU    No    -     -      -        -        -      -
2  load   No    -     -      -        -        -      -
3  store  Yes   stf   ROB#7  -        ROB#4.V  ROB#6  -
4  FP1    No    -     -      -        -        -      -
5  FP2    No    -     -      -        -        -      -

Notes
  stall R
  free (FP2 RS)

P6 Example: Cycle 11

ROB + MOB
ht  #  instruction    R   V     addr   IS  EX   CM
-   1  ldf f0,x(r1)   f0  [f0]  &X[0]  c2  c3   c4
-   2  mulf f4,f0,f2  f4  [f4]  -      c4  c5+  c8
-   3  stf f4,z(r1)   -   -     &Z[0]  c8  c9   c10
h   4  add r1,r1,#8   r1  [r1]  -      c5  c6   c7
-   5  ldf f0,x(r1)   f0  -     &X[1]  c7  c8   c9
-   6  mulf f4,f0,f2  f4  -     -      c9  c10  -
t   7  stf f4,z(r1)   -   -     -      -   -    -

Reg. Status
reg  T+
f0   ROB#5+
f2   -
f4   ROB#6
r1   ROB#4+

CDB
T      V
-      -

RS
#  FU     busy  op    T      V1       V2       T1     T2
1  ALU    No    -     -      -        -        -      -
2  load   No    -     -      -        -        -      -
3  store  Yes   stf   ROB#7  -        ROB#4.V  ROB#6  -
4  FP1    No    -     -      -        -        -      -
5  FP2    No    -     -      -        -        -      -

Notes
  retire stf