Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides. Some material adapted from Hennessy & Patterson / 2003 Elsevier Science.



- Basic MIPS integer pipeline
- Branches with one delay cycle
- Functional units are fully pipelined or replicated (as many times as the pipeline depth)
- An operation of any type can be issued on every clock cycle and there are no structural hazards

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
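As a minimal sketch of how the latency table is used, the Python below treats each latency as the number of intervening clock cycles needed between producer and consumer to avoid a stall. The category names (`fp_alu`, `load_double`, `store_double`) and the `stall_cycles` helper are hypothetical names introduced here for illustration.

```python
# Latencies from the table: (producer, consumer) -> intervening cycles needed.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store_double"): 2,
    ("load_double", "fp_alu"): 1,
    ("load_double", "store_double"): 0,
}


def stall_cycles(producer, consumer, gap=0):
    """Stall cycles needed when `gap` independent instructions
    already sit between the producer and the consumer."""
    return max(0, LATENCY[(producer, consumer)] - gap)


# A load immediately followed by a dependent FP add costs one stall cycle;
# moving one independent instruction in between hides it.
print(stall_cycles("load_double", "fp_alu"))         # 1
print(stall_cycles("load_double", "fp_alu", gap=1))  # 0
```

This is the quantity a static scheduler tries to drive to zero by filling the gap with independent instructions.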

- Determining how one instruction depends on another is critical not only to the scheduling process but also to determining how much parallelism exists
- If two instructions are parallel, they can execute simultaneously in the pipeline without causing stalls (assuming there is no structural hazard)
- Two instructions that are dependent are not parallel and their execution cannot be reordered

- Data dependence (RAW)
  - Transitive: i → j → k implies i → k
  - Easy to determine for registers, hard for memory
    - Does 100(R4) = 20(R6)?
    - From different loop iterations, does 20(R6) = 20(R6)?
- Name dependence (register/memory reuse)
  - Anti-dependence (WAR): instruction j writes a register or memory location that instruction i reads, and instruction i is executed first
  - Output dependence (WAW): instructions i and j write the same register or memory location; instruction ordering must be preserved
- Control dependence, caused by conditional branching
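The register case of these definitions can be sketched in a few lines of Python. This is an illustrative sketch only: the `Instr` record and `dependences` function are hypothetical names, memory operands are ignored (the hard case noted above), and instruction i is assumed to precede instruction j in program order.

```python
from typing import FrozenSet, NamedTuple, Set


class Instr(NamedTuple):
    text: str
    writes: FrozenSet[str]  # registers written
    reads: FrozenSet[str]   # registers read


def dependences(i: Instr, j: Instr) -> Set[str]:
    """Register dependences from earlier instruction i to later instruction j."""
    kinds = set()
    if i.writes & j.reads:
        kinds.add("RAW")   # data dependence: j reads what i writes
    if i.reads & j.writes:
        kinds.add("WAR")   # anti-dependence: j overwrites what i reads
    if i.writes & j.writes:
        kinds.add("WAW")   # output dependence: both write the same register
    return kinds


div = Instr("DIVD F0,F2,F4", frozenset({"F0"}), frozenset({"F2", "F4"}))
add = Instr("ADDD F10,F0,F8", frozenset({"F10"}), frozenset({"F0", "F8"}))
sub = Instr("SUBD F8,F8,F8", frozenset({"F8"}), frozenset({"F8"}))

print(dependences(div, add))  # {'RAW'}
print(dependences(add, sub))  # {'WAR'}
```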

Loop:   LD    F0,x(R1)
        ADDD  F4,F0,F2
        SD    x(R1),F4
        LD    F0,x-8(R1)
        ADDD  F4,F0,F2
        SD    x-8(R1),F4
        LD    F0,x-16(R1)
        ADDD  F4,F0,F2
        SD    x-16(R1),F4
        LD    F0,x-24(R1)
        ADDD  F4,F0,F2
        SD    x-24(R1),F4
        SUBI  R1,R1,#32
        BNEZ  R1,Loop

Register renaming:

Loop:   LD    F0,x(R1)
        ADDD  F4,F0,F2
        SD    x(R1),F4
        LD    F6,x-8(R1)
        ADDD  F8,F6,F2
        SD    x-8(R1),F8
        LD    F10,x-16(R1)
        ADDD  F12,F10,F2
        SD    x-16(R1),F12
        LD    F14,x-24(R1)
        ADDD  F16,F14,F2
        SD    x-24(R1),F16
        SUBI  R1,R1,#32
        BNEZ  R1,Loop

Again, name dependencies are hard for memory accesses:
- Does 100(R4) = 20(R6)?
- From different loop iterations, does 20(R6) = 20(R6)?
- Compiler needs to know that R1 does not change, so 0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1) and thus there are no dependencies between some loads and stores, so they could be moved
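The renaming step can be sketched as follows. This is a simplified illustration, not a compiler's actual algorithm: the `rename` function and the tuple encoding `(op, dest, srcs)` are hypothetical, address offsets are dropped, and the pool of free registers is assumed to be supplied by the caller.

```python
def rename(instrs, free_regs):
    """Give every redefinition of a register a fresh name.

    instrs: list of (op, dest, srcs); dest is None for stores and branches.
    free_regs: registers assumed to be unused elsewhere in the loop body.
    """
    free = list(free_regs)
    current = {}     # original name -> name that now holds its latest value
    written = set()  # original names that have been defined at least once
    out = []
    for op, dest, srcs in instrs:
        # Sources refer to the most recent definition of each register.
        new_srcs = tuple(current.get(s, s) for s in srcs)
        new_dest = dest
        if dest is not None:
            if dest in written:
                new_dest = free.pop(0)  # redefinition: take a fresh register
            written.add(dest)
            current[dest] = new_dest
        out.append((op, new_dest, new_srcs))
    return out


# Two iterations of the unrolled loop body.
body = [
    ("LD", "F0", ("R1",)),
    ("ADDD", "F4", ("F0", "F2")),
    ("SD", None, ("F4", "R1")),
    ("LD", "F0", ("R1",)),
    ("ADDD", "F4", ("F0", "F2")),
    ("SD", None, ("F4", "R1")),
]
renamed = rename(body, ["F6", "F8"])
for op, dest, srcs in renamed:
    print(op, dest, srcs)
```

With the free list `F6, F8`, the second iteration becomes `LD F6`, `ADDD F8,F6,F2`, `SD F8`, which matches the renamed listing and removes the WAR and WAW dependences on F0 and F4.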

- Why in HW at run time?
  - Works when real dependences can't be known at compile time
  - Compiler is simpler
  - Code for one machine runs well on another
- Key idea: allow instructions behind a stall to proceed

        DIVD  F0,F2,F4
        ADDD  F10,F0,F8
        SUBD  F12,F8,F14

- Enables out-of-order execution => out-of-order completion
- ID stage checks for structural and data hazards

- Out-of-order execution divides the ID stage:
  1. Issue: decode instructions, check for structural hazards
  2. Read operands: wait until no data hazards, then read operands
- Scoreboards allow an instruction to execute whenever 1 & 2 hold, not waiting for prior instructions
- CDC 6600: in-order issue, out-of-order execution, out-of-order commit/completion

- Out-of-order completion
  - WAR, WAW hazards
  - Example:

        DIVD  F0,F2,F4
        ADDD  F10,F0,F8
        SUBD  F8,F8,F8

- Solutions for WAR
  - Queue both the operation and copies of its operands
  - Read registers only during the Read Operands stage
- For WAW, must detect the hazard: stall until the other instruction completes
- Scoreboard keeps track of dependencies and the state of operations
- Replace ID, EX, WB with 4 stages

1. Issue: decode instructions & check for structural hazards (ID1)
   - If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure.
   - If a structural or WAW hazard exists, then the instruction issue stalls, and no further instructions will issue until these hazards are cleared.
2. Read operands: wait until no data hazards, then read operands (ID2)
   - A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit.
   - When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution.
   - The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.
3. Execution: operate on operands (EX)
   - The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
4. Write result: finish execution (WB)
   - Once the scoreboard is aware that the functional unit has completed execution, the scoreboard checks for WAR hazards. If there are none, it writes the result; otherwise it stalls.

MIPS Processor with Scoreboard
- Given the small latency of integer operations, it is not worth the scoreboard complexity
- 2 multipliers, 1 divider, 1 adder and 1 integer unit
- Major cost is driven by the data buses
- The scoreboard controls the functional units
- The scoreboard enables out-of-order execution to maximize parallelism

1. Instruction status: which of the 4 steps the instruction is in
2. Functional unit status: indicates the state of the functional unit (FU). 9 fields for each functional unit:
   - Busy: indicates whether the unit is busy or not
   - Op: operation to perform in the unit (e.g., + or -)
   - Fi: destination register
   - Fj, Fk: source-register numbers
   - Qj, Qk: functional units producing source registers Fj, Fk
   - Rj, Rk: flags indicating when Fj, Fk are ready
3. Register result status: indicates which functional unit will write each register, if any. Blank when no pending instructions will write that register
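A minimal Python sketch of these structures and of the issue and write-result checks is given below. The names `FUStatus`, `can_issue` and `can_write_result` are hypothetical, and the sketch assumes the textbook convention that Rj and Rk are reset to No once an instruction has read its operands, so a Yes flag means "ready but not yet read". The sample state is loosely modelled on cycle 17 of the example, where ADDD has finished executing while DIVD still waits for F0.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FUStatus:
    """The nine per-unit fields: Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk."""
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None
    fj: Optional[str] = None
    fk: Optional[str] = None
    qj: Optional[str] = None
    qk: Optional[str] = None
    rj: bool = False
    rk: bool = False


def can_issue(unit: str, dest: str, fu: Dict[str, FUStatus],
              result: Dict[str, str]) -> bool:
    """Issue needs a free unit (structural) and no pending writer of dest (WAW)."""
    return not fu[unit].busy and result.get(dest) is None


def can_write_result(unit: str, fu: Dict[str, FUStatus]) -> bool:
    """WAR check: no other busy unit still has to read the old value of Fi."""
    dest = fu[unit].fi
    return all(
        (f.fj != dest or not f.rj) and (f.fk != dest or not f.rk)
        for name, f in fu.items() if name != unit and f.busy
    )


fu = {
    "Integer": FUStatus(),
    "Mult1": FUStatus(True, "Mult", "F0", "F2", "F4"),   # operands already read
    "Add": FUStatus(True, "Add", "F6", "F8", "F2"),      # operands already read
    "Divide": FUStatus(True, "Div", "F10", "F0", "F6", qj="Mult1", rk=True),
}
result = {"F0": "Mult1", "F6": "Add", "F10": "Divide"}

print(can_write_result("Add", fu))    # False: DIVD has not yet read F6
print(can_write_result("Mult1", fu))  # True: nobody holds an unread old F0
print(can_issue("Add", "F8", fu, result))  # False: Add unit is busy
```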

- Speedup of 1.7 from compiler; 2.5 by hand, BUT slow memory (no cache)
- Limitations of the 6600 scoreboard:
  - no forwarding hardware
  - limited to instructions in a basic block (small window)
  - small number of functional units (causes structural hazards)
  - does not issue on structural hazards
  - waits for WAR hazards and prevents WAW hazards

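Scoreboard example, initial state
Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F6   34+  R2     -    -    -    -
  LD    F2   45+  R3     -    -    -    -
  MULT  F0   F2   F4     -    -    -    -
  SUBD  F8   F6   F2     -    -    -    -
  DIVD  F10  F0   F6     -    -    -    -
  ADDD  F6   F8   F2     -    -    -    -
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
  -   Integer  No
  -   Mult1    No
  -   Add      No
  -   Divide   No
Register result status: no pending writes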

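Scoreboard example, cycle 1
Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F6   34+  R2     1    -    -    -
  LD    F2   45+  R3     -    -    -    -
  MULT  F0   F2   F4     -    -    -    -
  SUBD  F8   F6   F2     -    -    -    -
  DIVD  F10  F0   F6     -    -    -    -
  ADDD  F6   F8   F2     -    -    -    -
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
  -   Integer  Yes  Load  F6   -    R2   -        -        -    Yes
  -   Mult1    No
  -   Add      No
  -   Divide   No
Register result status: F6 = Integer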

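Scoreboard example, cycle 2
Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F6   34+  R2     1    2    -    -
  LD    F2   45+  R3     -    -    -    -
  MULT  F0   F2   F4     -    -    -    -
  SUBD  F8   F6   F2     -    -    -    -
  DIVD  F10  F0   F6     -    -    -    -
  ADDD  F6   F8   F2     -    -    -    -
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
  -   Integer  Yes  Load  F6   -    R2   -        -        -    Yes
  -   Mult1    No
  -   Add      No
  -   Divide   No
Register result status: F6 = Integer
- Issue 2nd LD?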

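Scoreboard example, cycle 3
Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F6   34+  R2     1    2    3    -
  LD    F2   45+  R3     -    -    -    -
  MULT  F0   F2   F4     -    -    -    -
  SUBD  F8   F6   F2     -    -    -    -
  DIVD  F10  F0   F6     -    -    -    -
  ADDD  F6   F8   F2     -    -    -    -
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
  -   Integer  Yes  Load  F6   -    R2   -        -        -    Yes
  -   Mult1    No
  -   Add      No
  -   Divide   No
Register result status: F6 = Integer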

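Scoreboard example, cycle 4
Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F2   45+  R3     -    -    -    -
  MULT  F0   F2   F4     -    -    -    -
  SUBD  F8   F6   F2     -    -    -    -
  DIVD  F10  F0   F6     -    -    -    -
  ADDD  F6   F8   F2     -    -    -    -
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
  -   Integer  Yes  Load  F6   -    R2   -        -        -    Yes
  -   Mult1    No
  -   Add      No
  -   Divide   No
Register result status: F6 = Integer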

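Scoreboard example, cycle 5
Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F2   45+  R3     5    -    -    -
  MULT  F0   F2   F4     -    -    -    -
  SUBD  F8   F6   F2     -    -    -    -
  DIVD  F10  F0   F6     -    -    -    -
  ADDD  F6   F8   F2     -    -    -    -
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
  -   Integer  Yes  Load  F2   -    R3   -        -        -    Yes
  -   Mult1    No
  -   Add      No
  -   Divide   No
Register result status: F2 = Integer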

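Scoreboard example, cycle 6
Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F2   45+  R3     5    6    -    -
  MULT  F0   F2   F4     6    -    -    -
  SUBD  F8   F6   F2     -    -    -    -
  DIVD  F10  F0   F6     -    -    -    -
  ADDD  F6   F8   F2     -    -    -    -
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
  -   Integer  Yes  Load  F2   -    R3   -        -        -    Yes
  -   Mult1    Yes  Mult  F0   F2   F4   Integer  -        No   Yes
  -   Add      No
  -   Divide   No
Register result status: F0 = Mult1, F2 = Integer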

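Scoreboard example, cycle 7
Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F2   45+  R3     5    6    7    -
  MULT  F0   F2   F4     6    -    -    -
  SUBD  F8   F6   F2     7    -    -    -
  DIVD  F10  F0   F6     -    -    -    -
  ADDD  F6   F8   F2     -    -    -    -
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
  -   Integer  Yes  Load  F2   -    R3   -        -        -    Yes
  -   Mult1    Yes  Mult  F0   F2   F4   Integer  -        No   Yes
  -   Add      Yes  Sub   F8   F6   F2   -        Integer  Yes  No
  -   Divide   No
Register result status: F0 = Mult1, F2 = Integer, F8 = Add
- Read multiply operands?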

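Scoreboard example, cycle 8
Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F2   45+  R3     5    6    7    -
  MULT  F0   F2   F4     6    -    -    -
  SUBD  F8   F6   F2     7    -    -    -
  DIVD  F10  F0   F6     8    -    -    -
  ADDD  F6   F8   F2     -    -    -    -
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
  -   Integer  Yes  Load  F2   -    R3   -        -        -    Yes
  -   Mult1    Yes  Mult  F0   F2   F4   Integer  -        No   Yes
  -   Add      Yes  Sub   F8   F6   F2   -        Integer  Yes  No
  -   Divide   Yes  Div   F10  F0   F6   Mult1    -        No   Yes
Register result status: F0 = Mult1, F2 = Integer, F8 = Add, F10 = Divide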

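Scoreboard example, cycle 8 (after LD F2 writes its result)
Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F2   45+  R3     5    6    7    8
  MULT  F0   F2   F4     6    -    -    -
  SUBD  F8   F6   F2     7    -    -    -
  DIVD  F10  F0   F6     8    -    -    -
  ADDD  F6   F8   F2     -    -    -    -
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
  -   Integer  No
  -   Mult1    Yes  Mult  F0   F2   F4   -        -        Yes  Yes
  -   Add      Yes  Sub   F8   F6   F2   -        -        Yes  Yes
  -   Divide   Yes  Div   F10  F0   F6   Mult1    -        No   Yes
Register result status: F0 = Mult1, F8 = Add, F10 = Divide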

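Scoreboard example, cycle 9
Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F2   45+  R3     5    6    7    8
  MULT  F0   F2   F4     6    9    -    -
  SUBD  F8   F6   F2     7    9    -    -
  DIVD  F10  F0   F6     8    -    -    -
  ADDD  F6   F8   F2     -    -    -    -
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
  -   Integer  No
  10  Mult1    Yes  Mult  F0   F2   F4   -        -        Yes  Yes
  2   Add      Yes  Sub   F8   F6   F2   -        -        Yes  Yes
  -   Divide   Yes  Div   F10  F0   F6   Mult1    -        No   Yes
Register result status: F0 = Mult1, F8 = Add, F10 = Divide
- Read operands for MULT & SUBD?
- Issue ADDD?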

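Scoreboard example, cycle 11
Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F2   45+  R3     5    6    7    8
  MULT  F0   F2   F4     6    9    -    -
  SUBD  F8   F6   F2     7    9    11   -
  DIVD  F10  F0   F6     8    -    -    -
  ADDD  F6   F8   F2     -    -    -    -
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
  -   Integer  No
  8   Mult1    Yes  Mult  F0   F2   F4   -        -        Yes  Yes
  0   Add      Yes  Sub   F8   F6   F2   -        -        Yes  Yes
  -   Divide   Yes  Div   F10  F0   F6   Mult1    -        No   Yes
Register result status: F0 = Mult1, F8 = Add, F10 = Divide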

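Scoreboard example, cycle 12
Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F2   45+  R3     5    6    7    8
  MULT  F0   F2   F4     6    9    -    -
  SUBD  F8   F6   F2     7    9    11   12
  DIVD  F10  F0   F6     8    -    -    -
  ADDD  F6   F8   F2     -    -    -    -
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
  -   Integer  No
  7   Mult1    Yes  Mult  F0   F2   F4   -        -        Yes  Yes
  -   Add      No
  -   Divide   Yes  Div   F10  F0   F6   Mult1    -        No   Yes
Register result status: F0 = Mult1, F10 = Divide
- Read operands for DIVD?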

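Scoreboard example, cycle 13
Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F2   45+  R3     5    6    7    8
  MULT  F0   F2   F4     6    9    -    -
  SUBD  F8   F6   F2     7    9    11   12
  DIVD  F10  F0   F6     8    -    -    -
  ADDD  F6   F8   F2     13   -    -    -
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
  -   Integer  No
  6   Mult1    Yes  Mult  F0   F2   F4   -        -        Yes  Yes
  -   Add      Yes  Add   F6   F8   F2   -        -        Yes  Yes
  -   Divide   Yes  Div   F10  F0   F6   Mult1    -        No   Yes
Register result status: F0 = Mult1, F6 = Add, F10 = Divide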

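Scoreboard example, cycle 14
Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F2   45+  R3     5    6    7    8
  MULT  F0   F2   F4     6    9    -    -
  SUBD  F8   F6   F2     7    9    11   12
  DIVD  F10  F0   F6     8    -    -    -
  ADDD  F6   F8   F2     13   14   -    -
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
  -   Integer  No
  5   Mult1    Yes  Mult  F0   F2   F4   -        -        Yes  Yes
  2   Add      Yes  Add   F6   F8   F2   -        -        Yes  Yes
  -   Divide   Yes  Div   F10  F0   F6   Mult1    -        No   Yes
Register result status: F0 = Mult1, F6 = Add, F10 = Divide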

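Scoreboard example, cycle 15
Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F2   45+  R3     5    6    7    8
  MULT  F0   F2   F4     6    9    -    -
  SUBD  F8   F6   F2     7    9    11   12
  DIVD  F10  F0   F6     8    -    -    -
  ADDD  F6   F8   F2     13   14   -    -
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
  -   Integer  No
  4   Mult1    Yes  Mult  F0   F2   F4   -        -        Yes  Yes
  1   Add      Yes  Add   F6   F8   F2   -        -        Yes  Yes
  -   Divide   Yes  Div   F10  F0   F6   Mult1    -        No   Yes
Register result status: F0 = Mult1, F6 = Add, F10 = Divide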

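Scoreboard example, cycle 16
Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F2   45+  R3     5    6    7    8
  MULT  F0   F2   F4     6    9    -    -
  SUBD  F8   F6   F2     7    9    11   12
  DIVD  F10  F0   F6     8    -    -    -
  ADDD  F6   F8   F2     13   14   16   -
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
  -   Integer  No
  3   Mult1    Yes  Mult  F0   F2   F4   -        -        Yes  Yes
  0   Add      Yes  Add   F6   F8   F2   -        -        Yes  Yes
  -   Divide   Yes  Div   F10  F0   F6   Mult1    -        No   Yes
Register result status: F0 = Mult1, F6 = Add, F10 = Divide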

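Scoreboard example, cycle 17
Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F2   45+  R3     5    6    7    8
  MULT  F0   F2   F4     6    9    -    -
  SUBD  F8   F6   F2     7    9    11   12
  DIVD  F10  F0   F6     8    -    -    -
  ADDD  F6   F8   F2     13   14   16   -
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
  -   Integer  No
  2   Mult1    Yes  Mult  F0   F2   F4   -        -        Yes  Yes
  -   Add      Yes  Add   F6   F8   F2   -        -        Yes  Yes
  -   Divide   Yes  Div   F10  F0   F6   Mult1    -        No   Yes
Register result status: F0 = Mult1, F6 = Add, F10 = Divide
- Write result of ADDD?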

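Scoreboard example, cycle 18
Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F2   45+  R3     5    6    7    8
  MULT  F0   F2   F4     6    9    -    -
  SUBD  F8   F6   F2     7    9    11   12
  DIVD  F10  F0   F6     8    -    -    -
  ADDD  F6   F8   F2     13   14   16   -
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
  -   Integer  No
  1   Mult1    Yes  Mult  F0   F2   F4   -        -        Yes  Yes
  -   Add      Yes  Add   F6   F8   F2   -        -        Yes  Yes
  -   Divide   Yes  Div   F10  F0   F6   Mult1    -        No   Yes
Register result status: F0 = Mult1, F6 = Add, F10 = Divide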

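Scoreboard example, cycle 19
Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F2   45+  R3     5    6    7    8
  MULT  F0   F2   F4     6    9    19   -
  SUBD  F8   F6   F2     7    9    11   12
  DIVD  F10  F0   F6     8    -    -    -
  ADDD  F6   F8   F2     13   14   16   -
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
  -   Integer  No
  0   Mult1    Yes  Mult  F0   F2   F4   -        -        Yes  Yes
  -   Add      Yes  Add   F6   F8   F2   -        -        Yes  Yes
  -   Divide   Yes  Div   F10  F0   F6   Mult1    -        No   Yes
Register result status: F0 = Mult1, F6 = Add, F10 = Divide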

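Scoreboard example, cycle 20
Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F2   45+  R3     5    6    7    8
  MULT  F0   F2   F4     6    9    19   20
  SUBD  F8   F6   F2     7    9    11   12
  DIVD  F10  F0   F6     8    -    -    -
  ADDD  F6   F8   F2     13   14   16   -
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
  -   Integer  No
  -   Mult1    No
  -   Add      Yes  Add   F6   F8   F2   -        -        Yes  Yes
  -   Divide   Yes  Div   F10  F0   F6   -        -        Yes  Yes
Register result status: F6 = Add, F10 = Divide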

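Scoreboard example, cycle 21
Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F2   45+  R3     5    6    7    8
  MULT  F0   F2   F4     6    9    19   20
  SUBD  F8   F6   F2     7    9    11   12
  DIVD  F10  F0   F6     8    21   -    -
  ADDD  F6   F8   F2     13   14   16   -
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
  -   Integer  No
  -   Mult1    No
  -   Add      Yes  Add   F6   F8   F2   -        -        Yes  Yes
  -   Divide   Yes  Div   F10  F0   F6   -        -        Yes  Yes
Register result status: F6 = Add, F10 = Divide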

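Scoreboard example, cycle 22
Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F2   45+  R3     5    6    7    8
  MULT  F0   F2   F4     6    9    19   20
  SUBD  F8   F6   F2     7    9    11   12
  DIVD  F10  F0   F6     8    21   -    -
  ADDD  F6   F8   F2     13   14   16   22
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
  -   Integer  No
  -   Mult1    No
  -   Add      No
  40  Divide   Yes  Div   F10  F0   F6   -        -        Yes  Yes
Register result status: F10 = Divide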

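Scoreboard example, cycle 61
Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F2   45+  R3     5    6    7    8
  MULT  F0   F2   F4     6    9    19   20
  SUBD  F8   F6   F2     7    9    11   12
  DIVD  F10  F0   F6     8    21   61   -
  ADDD  F6   F8   F2     13   14   16   22
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
  -   Integer  No
  -   Mult1    No
  -   Add      No
  0   Divide   Yes  Div   F10  F0   F6   -        -        Yes  Yes
Register result status: F10 = Divide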

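Scoreboard example, cycle 62
Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F2   45+  R3     5    6    7    8
  MULT  F0   F2   F4     6    9    19   20
  SUBD  F8   F6   F2     7    9    11   12
  DIVD  F10  F0   F6     8    21   61   62
  ADDD  F6   F8   F2     13   14   16   22
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk):
  -   Integer  No
  -   Mult1    No
  -   Add      No
  0   Divide   No
Register result status: no pending writes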
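The timings in the example can be cross-checked with a small timing model. The Python below is a sketch under stated assumptions, not the CDC 6600 control logic: latencies of 1 (integer), 2 (add), 10 (multiply) and 40 (divide) cycles are inferred from the example, two multipliers are assumed, at most one instruction issues per cycle, and an event only becomes visible to other instructions in the following cycle. The names `Instr`, `happened` and `simulate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed latencies and unit counts, inferred from the example.
LATENCY = {"Integer": 1, "Mult": 10, "Add": 2, "Divide": 40}
UNITS = {"Integer": 1, "Mult": 2, "Add": 1, "Divide": 1}


@dataclass
class Instr:
    text: str
    unit: str
    dest: str
    srcs: Tuple[str, ...]
    issue: Optional[int] = None
    read: Optional[int] = None
    done: Optional[int] = None
    write: Optional[int] = None


def happened(t, cycle):
    """True if an event time is set and lies strictly before this cycle."""
    return t is not None and t < cycle


def simulate(program):
    cycle, next_issue = 0, 0
    while any(i.write is None for i in program):
        cycle += 1
        for k, ins in enumerate(program):
            earlier = program[:k]
            if happened(ins.issue, cycle) and ins.read is None:
                # RAW: every earlier producer of a source must have written.
                if all(happened(e.write, cycle)
                       for e in earlier if e.dest in ins.srcs):
                    ins.read = cycle
                    ins.done = cycle + LATENCY[ins.unit]
            elif happened(ins.done, cycle) and ins.write is None:
                # WAR: every earlier reader of our destination must have read.
                if all(happened(e.read, cycle)
                       for e in earlier if ins.dest in e.srcs):
                    ins.write = cycle
        if next_issue < len(program):
            ins = program[next_issue]
            in_flight = [e for e in program[:next_issue]
                         if not happened(e.write, cycle)]
            # Structural hazard: all units of this type are still occupied.
            unit_busy = sum(e.unit == ins.unit for e in in_flight) >= UNITS[ins.unit]
            # WAW hazard: an active instruction has the same destination.
            waw = any(e.dest == ins.dest for e in in_flight)
            if not unit_busy and not waw:
                ins.issue = cycle
                next_issue += 1
    return [(i.text, i.issue, i.read, i.done, i.write) for i in program]


program = [
    Instr("LD F6", "Integer", "F6", ("R2",)),
    Instr("LD F2", "Integer", "F2", ("R3",)),
    Instr("MULT F0", "Mult", "F0", ("F2", "F4")),
    Instr("SUBD F8", "Add", "F8", ("F6", "F2")),
    Instr("DIVD F10", "Divide", "F10", ("F0", "F6")),
    Instr("ADDD F6", "Add", "F6", ("F8", "F2")),
]
timeline = simulate(program)
for row in timeline:
    print(row)
```

Under these assumptions the model gives 6/9/19/20 for MULT, 7/9/11/12 for SUBD, 8/21/61/62 for DIVD and 13/14/16/22 for ADDD, which agrees with the cycle-62 table. ADDD finishes executing at cycle 16 but cannot write F6 until DIVD has read it at cycle 21, which is the WAR stall the example illustrates.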