Instruction Level Parallelism. Data Dependence Static Scheduling



Basic Block
A straight-line code sequence with no branches in except to the entry and no branches out except at the exit.
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDI  R1, R1, #-8
          BNE    R1, R2, Loop

Dependence
    for (i=0; i<=999; i=i+1) x[i] = x[i] + a;
Data Dependence
Name Dependence
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDI  R1, R1, #-8
          BNE    R1, R2, Loop
Name dependence: antidependence, output dependence
Register renaming
Hazard
    ADD.D  F4, F0, F2
    ADD.D  F4, F6, F8
Overlap during execution could change the order of access to the operand involved in the dependence.

Hazards
Program order: ILP preserves program order only where it affects the outcome of the program.
Structural hazards: resource conflicts.
Data hazards: RAW, WAW, WAR.
Control hazards: whether or not an instruction should be executed depends on a control decision made by an earlier instruction.
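The three data-hazard classes named above can be told apart by comparing the registers two instructions read and write. The following is a minimal Python sketch, assuming each instruction is reduced to a destination register and a list of source registers; the `classify` helper and the tuple layout are illustrative and not part of the slides.

```python
def classify(first, second):
    """Return the data hazards that `second` has with respect to the earlier `first`.

    Each instruction is a (dest, sources) pair; dest is None when nothing is written.
    """
    dest1, sources1 = first
    dest2, sources2 = second
    hazards = set()
    if dest1 is not None and dest1 in sources2:
        hazards.add("RAW")  # true data dependence: second reads what first writes
    if dest2 is not None and dest2 in sources1:
        hazards.add("WAR")  # antidependence: second overwrites what first reads
    if dest1 is not None and dest1 == dest2:
        hazards.add("WAW")  # output dependence: both write the same register
    return hazards


# Instructions taken from the loop and hazard examples in the slides.
load = ("F0", ["R1"])         # L.D   F0, 0(R1)
add1 = ("F4", ["F0", "F2"])   # ADD.D F4, F0, F2
add2 = ("F4", ["F6", "F8"])   # ADD.D F4, F6, F8
store = (None, ["F4", "R1"])  # S.D   F4, 0(R1)
daddi = ("R1", ["R1"])        # DADDI R1, R1, #-8

print(classify(load, add1))    # {'RAW'}
print(classify(add1, add2))    # {'WAW'}
print(classify(store, daddi))  # {'WAR'}
```

The WAR and WAW cases are the name dependences; renaming the second destination register (for example F4 to another free register) would make both sets empty.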

Structural Hazard
[Figure: pipeline timing diagram of instructions i1 to i5 across clock cycles 1 to 9, with the structural hazard marked.]

Cost of a Load Structural Hazard
Data references constitute 40% of the instruction mix. Ideal CPI = 1 (with no structural hazards). Assume that the processor with the structural hazard has a clock rate that is 1.1 times higher than the clock rate of the processor without the hazard. Which processor is faster, and by how much?
$\text{Avg. instruction time} = \text{CPI} \times \text{Clock cycle time}$
$\text{Avg. instruction time}_{\text{ideal}} = \text{CPI}_{\text{ideal}} \times \text{Clock cycle time}_{\text{ideal}} = 1 \times \text{Clock cycle time}_{\text{ideal}}$
With the structural hazard, each data reference costs one stall cycle:
$\text{Avg. instruction time}_{\text{hazard}} = (1 + 0.4 \times 1) \times \dfrac{\text{Clock cycle time}_{\text{ideal}}}{1.1} \approx 1.27 \times \text{Clock cycle time}_{\text{ideal}}$
The processor without the structural hazard is therefore about 1.27 times faster.
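The arithmetic of this example can be checked with a few lines of Python. This is only a sketch of the calculation on the slide; the variable names are illustrative, and the ideal clock cycle time is normalised to 1.

```python
# Figures taken from the slide.
data_ref_fraction = 0.4   # 40% of instructions are data references
stall_per_data_ref = 1    # one stall cycle per data reference
clock_rate_gain = 1.1     # the hazard machine clocks 1.1 times faster

ideal_cycle_time = 1.0    # normalised (assumption made for the sketch)
ideal_time = 1.0 * ideal_cycle_time  # ideal CPI = 1

hazard_cpi = 1 + data_ref_fraction * stall_per_data_ref
hazard_time = hazard_cpi * (ideal_cycle_time / clock_rate_gain)

slowdown = hazard_time / ideal_time
print(f"hazard machine takes {slowdown:.2f}x the ideal instruction time")
```

Because the ratio is above 1, the faster clock does not make up for the stall cycles in this example.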

Resolving Structural Hazards
Scheduling hazardous instructions away from each other
Stalling one of the instructions
Duplicating hardware units

Data Hazards
[Figure: pipelined datapath executing the two instructions below.]
    R1 ← R2 + R3
    R4 ← R1 + R5
R1 is updated in the WB stage. How to overcome this hazard?

Stalling (Interlocking)
[Figure: pipelined datapath with a stall condition that injects a NOP, shown for R1 ← R2 + R3 followed by R4 ← R1 + R5.]

Stalled Stages and Pipeline Bubbles
[Figure: timing diagram over clock cycles for R1 ← R2 + R3 followed by R4 ← R1 + R5, with the stalled stages marked.]
Resource usage in successive clock cycles:
    IF   I1   I2   I3   I3   I3   I3   I4   I5
    ID        I1   I2   I2   I2   I2   I3   I4   I5
    EX             I1   nop  nop  nop  I2   I3   I4   I5
    MA                  I1   nop  nop  nop  I2   I3   I4   I5
    WB                       I1   nop  nop  nop  I2   I3   I4   I5

Stall Control Logic
[Figure: pipelined datapath with a stall control block C_stall that takes rs and rt from the decode stage and rd from the later stages, and inserts a NOP when it asserts stall.]
Compare the source registers of the instruction in the decode stage with the destination registers of the uncommitted instructions.

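Stall Condition
C_stall:
$\text{stall} = \big((\mathit{rs}_D = \mathit{rd}_E) + (\mathit{rs}_D = \mathit{rd}_M) + (\mathit{rs}_D = \mathit{rd}_W)\big) + \big((\mathit{rt}_D = \mathit{rd}_E) + (\mathit{rt}_D = \mathit{rd}_M) + (\mathit{rt}_D = \mathit{rd}_W)\big)$
Should the pipeline stall for all instructions? Are rs, rt and rd valid for all instructions?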

MIPS I Sources & Destinations
Instruction formats:
    R-type   op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
    I-type   op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
    J-type   op (6 bits) | offset added to PC (26 bits)
Sources and destinations:
    Instr   Operation                                      Source(s)   Destination
    ALU     rd ← (rs) func (rt)                            rs, rt      rd
    ALUi    rt ← (rs) func Immediate                       rs          rt
    LW      rt ← Mem[(rs) + Immediate]                     rs          rt
    SW      Mem[(rs) + Immediate] ← (rt)                   rs, rt
    BZ      Cond (rs)? PC = PC + Offset : PC = PC + 4      rs
    J       PC = PC + Offset
    JAL     R31 ← PC; PC = PC + Offset                                 R31
    JR      PC ← (rs)                                      rs
    JALR    R31 ← PC; PC ← (rs)                            rs          R31

Stall Control Logic
[Figure: pipelined datapath in which C_dest produces ws and we for the instructions in the later stages, C_re produces re1 and re2 for the instruction in decode, and C_stall combines them with rs and rt to produce the stall signal.]

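Deriving the Stall Signal
C_dest
    ws = case opcode
        ALU             → rd
        ALUi, LW        → rt
        JAL, JALR       → R31
    we = case opcode
        ALU, ALUi, LW   → (ws ≠ 0)
        JAL, JALR       → on
        ...             → off
C_re
    re1 = case opcode
        ALU, ALUi, LW, SW, BZ, JR, JALR → on
        J, JAL                          → off
    re2 = case opcode
        ALU, SW         → on
        ...             → off
C_stall
$\text{stall} = \big((\mathit{rs}_D = \mathit{ws}_E)\cdot\mathit{we}_E + (\mathit{rs}_D = \mathit{ws}_M)\cdot\mathit{we}_M + (\mathit{rs}_D = \mathit{ws}_W)\cdot\mathit{we}_W\big)\cdot\mathit{re1}_D + \big((\mathit{rt}_D = \mathit{ws}_E)\cdot\mathit{we}_E + (\mathit{rt}_D = \mathit{ws}_M)\cdot\mathit{we}_M + (\mathit{rt}_D = \mathit{ws}_W)\cdot\mathit{we}_W\big)\cdot\mathit{re2}_D$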
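The case tables and the stall equation above translate almost directly into code. The following is a minimal Python sketch of a pipeline without bypassing, assuming a decoded instruction is just an opcode class plus its rs, rt and rd fields; the `Instr` tuple and the function names `c_dest`, `c_re` and `c_stall` are illustrative choices that mirror the slide's signal names.

```python
from collections import namedtuple

# Decoded instruction: opcode class plus register fields (unused fields default to 0).
Instr = namedtuple("Instr", "op rs rt rd", defaults=(0, 0, 0))


def c_dest(instr):
    """Return (ws, we): the register written and whether the write is enabled."""
    if instr.op == "ALU":
        return instr.rd, instr.rd != 0
    if instr.op in ("ALUi", "LW"):
        return instr.rt, instr.rt != 0
    if instr.op in ("JAL", "JALR"):
        return 31, True
    return None, False


def c_re(instr):
    """Return (re1, re2): whether the rs and rt fields are really read."""
    re1 = instr.op in ("ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR")
    re2 = instr.op in ("ALU", "SW")
    return re1, re2


def c_stall(decode, later_stages):
    """Stall if the decode-stage instruction reads a register that an
    instruction in a later stage (EX, MA, WB) has yet to write."""
    re1, re2 = c_re(decode)
    for instr in later_stages:
        if instr is None:  # pipeline bubble
            continue
        ws, we = c_dest(instr)
        if we and ((re1 and decode.rs == ws) or (re2 and decode.rt == ws)):
            return True
    return False


producer = Instr("ALU", rs=2, rt=3, rd=1)  # R1 <- R2 + R3
consumer = Instr("ALU", rs=1, rt=5, rd=4)  # R4 <- R1 + R5
print(c_stall(consumer, [producer, None, None]))  # True
```

The `we` and `re` qualifiers are what keep the logic from stalling on fields that merely happen to match, such as the rt field of an ALUi instruction, which names a destination rather than a source.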

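The Stall Control Signal
$\text{stall} = \big((\mathit{rs}_D = \mathit{ws}_E)\cdot\mathit{we}_E + (\mathit{rs}_D = \mathit{ws}_M)\cdot\mathit{we}_M + (\mathit{rs}_D = \mathit{ws}_W)\cdot\mathit{we}_W\big)\cdot\mathit{re1}_D + \big((\mathit{rt}_D = \mathit{ws}_E)\cdot\mathit{we}_E + (\mathit{rt}_D = \mathit{ws}_M)\cdot\mathit{we}_M + (\mathit{rt}_D = \mathit{ws}_W)\cdot\mathit{we}_W\big)\cdot\mathit{re2}_D$
Is that all? Are the results of all instructions ready by the EX stage?
    Mem[(R1) + 7] ← (R2)
    R4 ← Mem[(R3) + 13]
Is there a possible data hazard here? What if the two addresses are equal, (R1) + 7 = (R3) + 13? Careful design of the memory system is required.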

Resolving Data Hazards
Scheduling hazardous instructions away from each other
Stalling one of the instructions
Data forwarding (bypassing)

Forwarding
    DADD  R1, R2, R3
    DSUB  R4, R1, R5
    AND   R6, R1, R7
    OR    R8, R1, R9
    XOR   R10, R1, R11
[Figure: pipeline timing diagram over clock cycles, each instruction passing through IM, REG, ALU, DM, REG.]

Forwarding
Before bypassing: R4 ← R1 + R5 waits behind R1 ← R2 + R3 (stalled stages), so CPI > 1.
After bypassing: no stalled stages, so CPI = 1.
[Figure: timing diagrams over clock cycles for the two cases.]

The Pipeline without Bypassing
[Figure: pipelined datapath with the stall control logic (C_stall, C_re).]

The Pipeline with Bypassing
[Figure: pipelined datapath with bypassing, shown over three slides.]

Cost of Forwarding
In longer pipelines?
In multiple-issue pipelines?

No Stalls in the Pipeline?
What about this instruction sequence?
    LD   R1, 4(R2)
    ADD  R3, R1, R4
When, at the latest, is the value of R1 needed by ADD? When, at the earliest, does the value of R1 enter the pipeline?

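Stall Logic
Without bypassing:
$\text{stall} = \big((\mathit{rs}_D = \mathit{ws}_E)\cdot\mathit{we}_E + (\mathit{rs}_D = \mathit{ws}_M)\cdot\mathit{we}_M + (\mathit{rs}_D = \mathit{ws}_W)\cdot\mathit{we}_W\big)\cdot\mathit{re1}_D + \big((\mathit{rt}_D = \mathit{ws}_E)\cdot\mathit{we}_E + (\mathit{rt}_D = \mathit{ws}_M)\cdot\mathit{we}_M + (\mathit{rt}_D = \mathit{ws}_W)\cdot\mathit{we}_W\big)\cdot\mathit{re2}_D$
With bypassing, only a load in the EX stage forces a stall:
$\text{stall} = \big((\mathit{rs}_D = \mathit{ws}_E)\cdot(\mathit{opcode}_E = \mathit{LW}_E)\cdot(\mathit{ws} \neq 0)\big)\cdot\mathit{re1}_D + \big((\mathit{rt}_D = \mathit{ws}_E)\cdot(\mathit{opcode}_E = \mathit{LW}_E)\cdot(\mathit{ws} \neq 0)\big)\cdot\mathit{re2}_D$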

Pipeline Scheduling
Reorder the instructions of the program so that dependent instructions are far enough apart.
Done by the compiler, before the program runs: static instruction scheduling.
Done by the hardware, when the program is running: dynamic instruction scheduling.

Pipeline Scheduling
Original program:
    LW    R3, 0(R1)
    (stall)
    ADDI  R5, R3, 1
    ADD   R2, R2, R3
    LW    R13, 0(R11)
    (stall)
    ADD   R12, R13, R3
Total execution cycles: 7
Scheduled code:
    LW    R3, 0(R1)
    LW    R13, 0(R11)
    ADDI  R5, R3, 1
    ADD   R2, R2, R3
    ADD   R12, R13, R3
Total execution cycles: 5
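The two cycle counts on this slide can be reproduced with a small model. This is a minimal Python sketch, assuming a bypassed pipeline in which the only stall is a single bubble when an instruction uses the result of the load immediately before it; the `count_cycles` helper and the tuple encoding are illustrative.

```python
def count_cycles(program):
    """Count issue slots: one per instruction, plus one bubble whenever an
    instruction reads the register loaded by the LW directly before it."""
    cycles = 0
    previous = None
    for opcode, dest, sources in program:
        if previous is not None and previous[0] == "LW" and previous[1] in sources:
            cycles += 1  # load-use stall
        cycles += 1
        previous = (opcode, dest, sources)
    return cycles


original = [
    ("LW", "R3", ["R1"]),           # LW   R3, 0(R1)
    ("ADDI", "R5", ["R3"]),         # ADDI R5, R3, 1
    ("ADD", "R2", ["R2", "R3"]),    # ADD  R2, R2, R3
    ("LW", "R13", ["R11"]),         # LW   R13, 0(R11)
    ("ADD", "R12", ["R13", "R3"]),  # ADD  R12, R13, R3
]
# The scheduled order from the slide: the second load is hoisted to just after the first.
scheduled = [original[0], original[3], original[1], original[2], original[4]]

print(count_cycles(original))   # 7
print(count_cycles(scheduled))  # 5
```

Hoisting the second load is legal because it neither reads nor writes any register touched by the two instructions it moves past.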

Loop-level Parallelism
Original loop:
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDI  R1, R1, #-8
          BNE    R1, R2, Loop
Unrolled loop:
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          L.D    F6, -8(R1)
          ADD.D  F8, F2, F6
          S.D    F8, -8(R1)
          L.D    F10, -16(R1)
          ADD.D  F12, F2, F10
          S.D    F12, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F16, F2, F14
          S.D    F16, -24(R1)
          DADDI  R1, R1, #-32
          BNE    R1, R2, Loop

Loop Unrolling
    Instr producing result   Instr using result   Latency to avoid a stall
    FP ALU op                Another FP ALU op    3
    FP ALU op                Store double         2
    Load double              FP ALU op            1
    Load double              Store double         0
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          L.D    F6, -8(R1)
          ADD.D  F8, F2, F6
          S.D    F8, -8(R1)
          L.D    F10, -16(R1)
          ADD.D  F12, F2, F10
          S.D    F12, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F16, F2, F14
          S.D    F16, -24(R1)
          DADDI  R1, R1, #-32
          BNE    R1, R2, Loop
Total cycles: 27

Loop Unrolling
    Instr producing result   Instr using result   Latency to avoid a stall
    FP ALU op                Another FP ALU op    3
    FP ALU op                Store double         2
    Load double              FP ALU op            1
    Load double              Store double         0
    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F2, F6
          ADD.D  F12, F2, F10
          ADD.D  F16, F2, F14
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          DADDI  R1, R1, #-32
          S.D    F12, 16(R1)
          S.D    F16, 8(R1)
          BNE    R1, R2, Loop
Total cycles: 14
Costs: code size, register pressure
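The 27-cycle and 14-cycle totals follow from the latency table once each instruction is issued as early as its operands allow. Below is a minimal Python sketch of a single-issue, in-order model. It assumes one extra stall between the integer DADDI and the dependent BNE, which is not in the slide's table and is added here so that the unscheduled count comes out at 27; the helper names are illustrative.

```python
# Stall cycles required between a producer and a dependent consumer.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int_alu", "branch"): 1,  # assumption, not listed in the slide's table
}


def total_cycles(program):
    """Return the cycle in which the last instruction issues (in order, one per cycle)."""
    last_write = {}  # register -> (issue cycle, kind) of its latest producer
    cycle = 0
    for kind, dest, sources in program:
        earliest = cycle + 1
        for reg in sources:
            if reg in last_write:
                issued, producer = last_write[reg]
                earliest = max(earliest, issued + LATENCY.get((producer, kind), 0) + 1)
        cycle = earliest
        if dest is not None:
            last_write[dest] = (cycle, kind)
    return cycle


def ld(fd):
    return ("load", fd, ["R1"])          # L.D fd, offset(R1)


def add(fd, fs):
    return ("fp_alu", fd, [fs, "F2"])    # ADD.D fd, fs, F2


def sd(fs):
    return ("store", None, [fs, "R1"])   # S.D fs, offset(R1)


DADDI = ("int_alu", "R1", ["R1"])
BNE = ("branch", None, ["R1", "R2"])

pairs = [("F0", "F4"), ("F6", "F8"), ("F10", "F12"), ("F14", "F16")]
unrolled = [op for s, d in pairs for op in (ld(s), add(d, s), sd(d))] + [DADDI, BNE]
scheduled = ([ld(s) for s, _ in pairs] + [add(d, s) for s, d in pairs]
             + [sd("F4"), sd("F8"), DADDI, sd("F12"), sd("F16"), BNE])

print(total_cycles(unrolled))   # 27
print(total_cycles(scheduled))  # 14
```

In the scheduled order every consumer is already far enough from its producer, so the 14 instructions issue in 14 consecutive cycles.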