Pipelined Beta. Handouts: Lecture Slides. Where are the registers? Spring /10/01. L16 Pipelined Beta 1



Increasing CPU Performance

MIPS = Freq / CPI

    MIPS = Millions of Instructions/Second
    Freq = Clock Frequency, MHz
    CPI = Clocks per Instruction

To Increase MIPS:
1. DECREASE CPI.
   - RISC simplicity reduces CPI to 1.0.
   - CPI below 1.0? Tough... you'll see multiple instruction issue machines in 6.823.
2. INCREASE Freq.
   - Freq limited by delay along longest combinational path; hence
   - PIPELINING is the key to improved performance through fast clocks.
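As a quick check of the relation MIPS = Freq / CPI, here is a minimal sketch. The function name and the sample numbers are illustrative assumptions, not figures from the lecture.

```python
def mips(freq_mhz: float, cpi: float) -> float:
    """Millions of instructions per second, given clock frequency in MHz and clocks per instruction."""
    if cpi <= 0:
        raise ValueError("CPI must be positive")
    return freq_mhz / cpi

# Hypothetical machines: same CPI, faster clock after pipelining.
unpipelined = mips(freq_mhz=25.0, cpi=1.0)
pipelined = mips(freq_mhz=100.0, cpi=1.0)
print(unpipelined, pipelined)
```

With CPI held at 1.0, any gain in clock frequency translates directly into MIPS, which is why the slides turn to pipelining next.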

Beta Timing

[Figure: timing diagram of one clock period, from CLK edge to CLK edge. Labeled blobs: New PC, PC+4, +OFFSET, Fetch Inst., Control Logic, RA2SEL mux, Read Regs, =0?, PCSEL mux, ASEL mux, BSEL mux, ALU, Fetch data, WDSEL mux, followed by PC setup, RF setup, and Mem setup.]

Wanted: longest path

Complications:
- some apparent paths aren't possible
- blobs have variable execution times (eg, ALU)
- time axis is not to scale (eg, t_PD,MEM is very big!)
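The "longest path" search can be sketched as a longest-delay walk over a graph of datapath blobs. Everything below is an assumption made for illustration: the block names, the wiring, and the delays in nanoseconds are hypothetical, and the sketch ignores the first complication above (it treats every apparent path as possible).

```python
from functools import lru_cache

# Hypothetical propagation delays in ns (not values from the lecture).
DELAY_NS = {"PC": 1, "IMEM": 10, "RF_read": 4, "ALU": 6, "DMEM": 10, "WDSEL_mux": 1}

# Hypothetical wiring: each block lists the blocks it feeds.
FEEDS = {
    "PC": ["IMEM"],
    "IMEM": ["RF_read"],
    "RF_read": ["ALU"],
    "ALU": ["DMEM", "WDSEL_mux"],
    "DMEM": ["WDSEL_mux"],
    "WDSEL_mux": [],
}

@lru_cache(maxsize=None)
def longest_delay_from(block: str) -> int:
    """Largest total delay of any path that starts at `block` (graph must be acyclic)."""
    downstream = max((longest_delay_from(b) for b in FEEDS[block]), default=0)
    return DELAY_NS[block] + downstream

# The clock period must be at least as long as the slowest path (setup times ignored here).
min_clock_period_ns = longest_delay_from("PC")
print(min_clock_period_ns)
```

In this made-up example the path through both memories dominates, which matches the slide's remark that memory propagation delay is very big.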

Where are the Bottlenecks?

[Figure: unpipelined Beta datapath: PC, Instruction Memory, Control Logic, Register File, ALU, and Data Memory, connected through the PCSEL, RA2SEL, ASEL, BSEL, WDSEL, and WASEL muxes.]

Pipelining goal: Break LONG combinational paths: memories, ALU in separate stages

Ultimate Goal: 5-Stage Pipeline

GOAL: Maintain (nearly) 1.0 CPI, but increase clock speed to barely include slowest components (mems, regfile, ALU)

APPROACH: structure processor as 5-stage pipeline:

IF   Instruction Fetch stage: Maintains PC, fetches one instruction per cycle and passes it to
RF   Register File stage: Reads source operands from register file, passes them to
ALU  ALU stage: Performs indicated operation, passes result to
MEM  Memory stage: If it's a LD, use ALU result as an address, pass mem data (or ALU result if not LD) to
WB   Write-Back stage: writes result back into register file.
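To make the goal concrete, the sketch below compares clock periods before and after splitting the work into stages. The stage delays and the pipeline register overhead are hypothetical numbers chosen for illustration; the lecture does not give them.

```python
def unpipelined_period_ns(stage_delays_ns):
    """Without pipeline registers, one clock must cover every stage in sequence."""
    return sum(stage_delays_ns)

def pipelined_period_ns(stage_delays_ns, register_overhead_ns=0.0):
    """With pipeline registers, the clock only has to cover the slowest stage plus register overhead."""
    return max(stage_delays_ns) + register_overhead_ns

# Hypothetical delays for IF, RF, ALU, MEM, WB.
delays = [10.0, 4.0, 6.0, 10.0, 2.0]
before = unpipelined_period_ns(delays)
after = pipelined_period_ns(delays, register_overhead_ns=1.0)
print(before, after, before / after)
```

The clock speeds up by less than the number of stages because the stages are unbalanced and the registers add overhead, which is why the slide says the clock can only "barely include" the slowest components.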

Simple 2-Stage Pipeline

[Figure: Beta datapath split into two stages. The IF stage holds the PC, the +4 adder, and the Instruction Memory. Pipeline registers PC EXE and IR EXE feed the EXE stage, which holds the Register File, ALU, Data Memory, and the WDSEL write-back mux.]

2-Stage Pipelined Beta Operation

Consider a sequence of instructions:

    ...
    ADDC(r1, 1, r2)
    SUBC(r1, 1, r3)
    XOR(r1, r5, r1)
    MUL(r2, r6, r0)
    ...

Executed on our 2-stage pipeline:

TIME (cycles)
       i       i+1     i+2     i+3     i+4     i+5     i+6
IF     ADDC    SUBC    XOR     MUL     ...
EXE            ADDC    SUBC    XOR     MUL     ...
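A pipeline diagram like this one can be generated mechanically. The sketch below assumes one instruction enters the first stage per cycle and that nothing stalls; the function name and the table representation are illustrative choices.

```python
def pipeline_diagram(instrs, stages):
    """Map each stage name to a per-cycle list of the instruction it holds ('' when empty).

    Assumes one instruction enters the first stage each cycle and no hazards or stalls.
    """
    n_cycles = len(instrs) + len(stages) - 1
    rows = {}
    for depth, stage in enumerate(stages):
        row = [""] * n_cycles
        for k, instr in enumerate(instrs):
            row[k + depth] = instr  # instruction k reaches this stage `depth` cycles after fetch
        rows[stage] = row
    return rows

two_stage = pipeline_diagram(["ADDC", "SUBC", "XOR", "MUL"], ["IF", "EXE"])
for stage, row in two_stage.items():
    print(f"{stage:<5}" + "".join(f"{cell:<6}" for cell in row))
```

Passing a longer stage list, for example `["IF", "RF", "ALU", "WB"]`, produces the same staircase pattern for a deeper pipeline.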

Pipeline Control Hazards

BUT consider instead:

    LOOP: ADD(r1, r3, r3)
          CMPLEC(r3, 100, r0)
          BT(r0, LOOP)
          XOR(r3, -1, r3)
          MUL(r1, r2, r2)
          ...

       i       i+1     i+2     i+3     i+4     i+5     i+6
IF     ADD     CMPLEC  BT      XOR     ...
EXE            ADD     CMPLEC  BT      ?       ...

The cycle in which BT is in the EXE stage is the cycle where the branch decision is made... but we've already fetched the following instruction, which should be executed only if the branch is not taken!

Branch Delay Slots

PROBLEM: One (or more) following instructions have been pre-fetched by the time a branch is taken.

POSSIBLE SOLUTIONS:
1. Make hardware annul instructions following branches which are taken, e.g., by disabling WERF and WR.
2. Program around it. Either
   a) Follow each BR with a NOP instruction; or
   b) Make compiler clever enough to move USEFUL instructions into the branch delay slots
      i. Always execute instructions in delay slots
      ii. Conditionally execute instructions in delay slots

Branch Alternative 1

Make the hardware annul instructions in the branch delay slots of a taken branch.

    LOOP: ADD(r1, r3, r3)
          CMPLEC(r3, 100, r0)
          BT(r0, LOOP)
          XOR(r3, -1, r3)
          MUL(r1, r2, r2)
          ...

       i       i+1     i+2     i+3     i+4     i+5     i+6
IF     ADD     CMPLEC  BT      XOR     ADD     CMPLEC  BT
EXE            ADD     CMPLEC  BT      NOP     ADD     CMPLEC

Branch taken in cycle i+3: the prefetched XOR is annulled and executes as a NOP in cycle i+4.

Pros: same program runs on both unpipelined and pipelined hardware
Cons: in SPEC benchmarks 14% of instructions are taken branches, so 12% of total cycles are annulled
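The step from 14% taken branches to 12% annulled cycles can be checked with a short calculation. This sketch assumes an otherwise ideal CPI of 1.0 and one annulled cycle per delay slot of each taken branch; the function name is an illustrative choice.

```python
def annulled_cycle_fraction(taken_branch_fraction: float, delay_slots: int = 1) -> float:
    """Fraction of all cycles spent on annulled instructions.

    Per useful instruction there is 1 cycle of work plus, on average,
    taken_branch_fraction * delay_slots annulled cycles.
    """
    annulled_per_instruction = taken_branch_fraction * delay_slots
    return annulled_per_instruction / (1.0 + annulled_per_instruction)

fraction = annulled_cycle_fraction(0.14, delay_slots=1)
print(f"{fraction:.1%}")
```

For 14% taken branches and a single delay slot this gives 0.14 / 1.14, which rounds to the 12% quoted on the slide. Calling it with `delay_slots=2` shows how the penalty grows when the branch decision is made later in the pipeline.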

Branch Annulment Hardware

[Figure: the 2-stage pipeline datapath with one addition: a 2-input mux, controlled by ANNUL, between the Instruction Memory output and the IR EXE pipeline register. It selects either the fetched instruction or a NOP.]

Branch Alternative 2a

Fill branch delay slots with NOP instructions (i.e., the software equivalent of alternative 1)

    LOOP: ADD(r1, r3, r3)
          CMPLEC(r3, 100, r0)
          BT(r0, LOOP)
          NOP()
          XOR(r3, -1, r3)
          MUL(r1, r2, r2)
          ...

       i       i+1     i+2     i+3     i+4     i+5     i+6
IF     ADD     CMPLEC  BT      NOP     ADD     CMPLEC  BT
EXE            ADD     CMPLEC  BT      NOP     ADD     CMPLEC

Branch taken in cycle i+3.

Pros: same as alternative 1
Cons: NOPs make code longer; 12% of cycles spent executing NOPs
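An assembler or compiler pass for this alternative could look like the sketch below. It is an illustration under stated assumptions: instructions are plain strings in the slide's notation, the loop's opcodes are written out as ADD and CMPLEC, and the set of branch opcodes (BEQ, BF, BNE, BT, BR, JMP) is assumed here rather than taken from the slides.

```python
# Assumed set of control-transfer opcodes; adjust for the actual instruction set.
BRANCH_OPCODES = {"BEQ", "BF", "BNE", "BT", "BR", "JMP"}

def opcode(line: str) -> str:
    """Extract the opcode from a line such as 'LOOP: ADD(r1, r3, r3)' or 'BT(r0, LOOP)'."""
    text = line.split(":")[-1].strip()   # drop an optional leading label
    return text.split("(")[0].strip()

def fill_delay_slots_with_nops(lines, slots=1):
    """Return a new program with `slots` NOPs inserted after every branch."""
    out = []
    for line in lines:
        out.append(line)
        if opcode(line) in BRANCH_OPCODES:
            out.extend(["NOP()"] * slots)
    return out

loop = [
    "LOOP: ADD(r1, r3, r3)",
    "CMPLEC(r3, 100, r0)",
    "BT(r0, LOOP)",
    "XOR(r3, -1, r3)",
    "MUL(r1, r2, r2)",
]
padded = fill_delay_slots_with_nops(loop)
print("\n".join(padded))
```

The pass is trivial, which is the appeal of this alternative; the cost is that every branch now occupies two instruction slots whether or not it is taken.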

Branch Alternative 2b(i)

Put USEFUL instructions in the branch delay slots; remember they will be executed whether the branch is taken or not

    LOOP:  ADD(r1,r3,r3)
    LOOPx: CMPLEC(r3,100,r0)
           BT(r0,LOOPx)
           ADD(r1,r3,r3)
           SUB(r1,r3,r3)
           XOR(r3,-1,r3)
           MUL(r1,r2,r2)
           ...

We need to add this silly instruction (the SUB) to UNDO the effects of that last ADD

       i       i+1     i+2     i+3     i+4     i+5     i+6
IF     ADD     CMPLEC  BT      ADD     CMPLEC  BT      ADD
EXE            ADD     CMPLEC  BT      ADD     CMPLEC  BT

Branch taken in cycle i+3.

Pros: only two extra instructions are executed (on last iteration)
Cons: finding useful instructions that are always executed is difficult; clever rewrite may be required. Program executes differently on naïve unpipelined implementation.

Branch Alternative 2b(ii)

Put USEFUL instructions in the branch delay slots; annul them if branch doesn't behave as predicted

    LOOP:  ADD(r1, r3, r3)
    LOOPx: CMPLEC(r3, 100, r0)
           BT.taken(r0, LOOPx)
           ADD(r1, r3, r3)
           XOR(r3, -1, r3)
           MUL(r1, r2, r2)
           ...

       i       i+1     i+2     i+3     i+4     i+5     i+6
IF     ADD     CMPLEC  BT      ADD     CMPLEC  BT      ADD
EXE            ADD     CMPLEC  BT      ADD     CMPLEC  BT

Branch taken in cycle i+3.

Pros: only one instruction is annulled (on last iteration); about 70% of branch delay slots can be filled with useful instructions
Cons: Program executes differently on naïve unpipelined implementation; not really useful with more than one delay slot.

Architectural Issue: Branch Decision Timing

BETA approach: SIMPLE branch condition logic... Test for Reg[Ra] = 0!
ADVANTAGE: early decision, single delay slot

ALTERNATIVES: Compare-and-branch... (eg, if Reg[Ra] > Reg[Rb])
MORE powerful, but LATER decision (hence more delay slots)

"Wow! I guess those guys really were thinking when they made up all those instructions"

[Figure: pipeline stages Instruction Fetch, Register File, ALU, Memory, and Write Back, each holding one instruction, with combinational logic (CL) between pipeline registers. The register file is read in the Register File stage and written in the Write Back stage.]

Suppose decision were made in the ALU stage... then there would be 2 branch delay slots (and instructions to annul!)

4-Stage Beta Pipeline

[Figure: Beta datapath divided into Instruction Fetch, Register File, ALU, and Write Back stages, with pipeline registers between stages carrying the PC, the instruction (IR), and operand and result values. Instruction fields: Ra <20:16>, Rb <15:11>, Rc <25:21>. Branch target: <PC>+C, with C: <15:0> << 2, sign-extended. The register file is drawn twice, once for reading and once for writing (NB: SAME RF AS ABOVE!).]

Treat register file as two separate devices: combinational READ, clocked WRITE at end of pipe.

What other information do we have to pass down pipeline?
- PC (return addresses)
- instruction fields (decoding)

What sort of improvement should we expect in cycle time?

4-Stage Beta Operation

Consider a sequence of instructions:

    ...
    ADDC(r1, 1, r2)
    SUBC(r1, 1, r3)
    XOR(r1, r5, r1)
    MUL(r2, r6, r0)
    ...

Executed on our 4-stage pipeline:

TIME (cycles)
       i       i+1     i+2     i+3     i+4     i+5     i+6
IF     ADDC    SUBC    XOR     MUL     ...
RF             ADDC    SUBC    XOR     MUL     ...
ALU                    ADDC    SUBC    XOR     MUL     ...
WB                             ADDC    SUBC    XOR     MUL

Pipeline Data Hazard

BUT consider instead:

    ADD(r1, r2, r3)
    CMPLEC(r3, 100, r0)
    MULC(r1, 100, r4)
    SUB(r1, r2, r5)

       i       i+1     i+2     i+3     i+4     i+5     i+6
IF     ADD     CMPLEC  MULC    SUB
RF             ADD     CMPLEC  MULC    SUB
ALU                    ADD     CMPLEC  MULC    SUB
WB                             ADD     CMPLEC  MULC    SUB

Oops! CMPLEC is trying to read Reg[R3] during cycle i+2, but ADD doesn't write its result into Reg[R3] until the end of cycle i+3!
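Hazards of this kind can be found by scanning a program for reads that follow a write too closely. The sketch below is illustrative: instructions are hypothetical `(opcode, sources, destination)` tuples, and the window of 2 reflects the 4-stage pipeline in these slides, where a register is read in the RF stage and written at the end of the WB stage two stages later.

```python
def find_raw_hazards(program, window=2):
    """Return (producer_index, consumer_index, register) for each read-after-write hazard.

    `window` is how many following instructions would read the register before the
    producer has written it. r31 is skipped because it always reads as zero.
    """
    hazards = []
    for i, (_, _, dest) in enumerate(program):
        if dest is None or dest == "r31":
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            _, sources_j, dest_j = program[j]
            if dest in sources_j:
                hazards.append((i, j, dest))
            if dest_j == dest:
                break  # a newer write: later readers depend on instruction j instead
    return hazards

original = [
    ("ADD", ("r1", "r2"), "r3"),
    ("CMPLEC", ("r3",), "r0"),
    ("MULC", ("r1",), "r4"),
    ("SUB", ("r1", "r2"), "r5"),
]
print(find_raw_hazards(original))
```

For the sequence on this slide the scan reports a single hazard: instruction 1 reads r3 one slot after instruction 0 writes it.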

Data Hazard Solution 1

Program around it... document weirdo semantics, declare it a software problem.
- Breaks sequential semantics!
- Costs code efficiency.

EXAMPLE: Rewrite

    ADD(r1, r2, r3)
    CMPLEC(r3, 100, r0)
    MULC(r1, 100, r4)
    SUB(r1, r2, r5)

as

    ADD(r1, r2, r3)
    MULC(r1, 100, r4)
    SUB(r1, r2, r5)
    CMPLEC(r3, 100, r0)

How often can we do this?

Data Hazard Solution 2

Stall the pipeline: Freeze IF, RF stages for 2 cycles, inserting NOPs into ALU-stage instruction register

       i       i+1     i+2     i+3     i+4     i+5     i+6
IF     ADD     CMPLEC  MULC    MULC    MULC    SUB
RF             ADD     CMPLEC  CMPLEC  CMPLEC  MULC    SUB
ALU                    ADD     NOP     NOP     CMPLEC  MULC
WB                             ADD     NOP     NOP     CMPLEC

Drawback: NOPs mean wasted cycles
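The number of wasted cycles can be estimated with a small model of the stall logic. This is a sketch under assumptions: instructions are hypothetical `(opcode, sources, destination)` tuples, there is no bypassing, and a result becomes readable in the RF stage 3 cycles after its producer was in the RF stage (RF, then ALU, then WB, with the write at the end of WB).

```python
def count_stall_cycles(program, cycles_until_readable=3):
    """Count cycles the RF stage is frozen waiting for source registers to be written."""
    readable_at = {}   # register -> first cycle in which the RF stage can read the new value
    cycle = 0          # cycle in which the next instruction would reach RF with no stall
    stalls = 0
    for _, sources, dest in program:
        rf_cycle = cycle
        for reg in sources:
            rf_cycle = max(rf_cycle, readable_at.get(reg, 0))
        stalls += rf_cycle - cycle
        if dest is not None and dest != "r31":
            readable_at[dest] = rf_cycle + cycles_until_readable
        cycle = rf_cycle + 1
    return stalls

hazard_sequence = [
    ("ADD", ("r1", "r2"), "r3"),
    ("CMPLEC", ("r3",), "r0"),
    ("MULC", ("r1",), "r4"),
    ("SUB", ("r1", "r2"), "r5"),
]
print(count_stall_cycles(hazard_sequence))
```

For the four-instruction sequence above the model reports 2 stall cycles, matching the two NOPs in the diagram; the reordered sequence from Solution 1 needs none.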

Data Hazard Solution 3

Bypass (aka forwarding) Paths: Add extra data paths & control logic to re-route data in problem cases.

       i       i+1     i+2     i+3     i+4     i+5     i+6
IF     ADD     CMPLEC  MULC    SUB
RF             ADD     CMPLEC  MULC    SUB
ALU                    ADD     CMPLEC  MULC    SUB
WB                             ADD     CMPLEC  MULC    SUB

Idea: the result from the ADD, which will be written into the register file at the end of cycle i+3, is actually available at the output of the ALU during cycle i+2, just in time for it to be used by CMPLEC in the RF stage!

Bypass Paths (I)

[Figure: RF, ALU, and WB stages with bypass muxes on the register file read ports. IR RF holds CMPLEC(r3,100,r0) and IR ALU holds ADD(r1,r2,r3); a bypass path runs from the ALU output back to the bypass muxes.]

SELECT this BYPASS path if
    OpCode RF = reads Ra
    and OpCode ALU = OP, OPC (i.e., instructions which use ALU to compute result)
    and Ra RF = Rc ALU
    and Ra RF != R31
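The select condition above translates directly into a boolean function. This is a minimal sketch: the parameter names are illustrative, and whether the RF-stage instruction reads Ra is passed in as a flag rather than decoded from an opcode.

```python
R31 = 31  # register 31 always reads as zero, so it is never bypassed

def use_alu_bypass_for_ra(reads_ra: bool, ra_rf: int, opclass_alu: str, rc_alu: int) -> bool:
    """True when the Ra operand of the RF-stage instruction should come from the ALU output.

    reads_ra     -- the instruction in the RF stage actually uses Ra
    ra_rf        -- Ra field of the instruction in the RF stage
    opclass_alu  -- class of the instruction in the ALU stage, e.g. "OP", "OPC", "LD"
    rc_alu       -- Rc (destination) field of the instruction in the ALU stage
    """
    return (
        reads_ra
        and opclass_alu in ("OP", "OPC")   # only these produce their result in the ALU
        and ra_rf == rc_alu
        and ra_rf != R31
    )

# CMPLEC(r3,100,r0) in RF, ADD(r1,r2,r3) in ALU, as in the figure.
print(use_alu_bypass_for_ra(True, 3, "OP", 3))
```

A load in the ALU stage fails the class test because its result comes from memory rather than from the ALU output, so this path cannot supply it.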

Bypass Paths (II)

[Figure: RF, ALU, and WB stages with bypass muxes on the register file read ports. IR RF holds ADD(r1,r2,r3), IR ALU holds MULC(r4,17,r5), and IR WB holds XOR(r2,r6,r1); a bypass path runs from the register file write data (WD) back to the bypass muxes.]

SELECT this BYPASS path if
    OpCode RF = reads Ra
    and Ra RF != R31
    and not using ALU bypass
    and WERF = 1
    and Ra RF = WA

But why can't we get it from the register file? It's being written this cycle!
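Putting both conditions together gives a priority-ordered mux select for the Ra operand. The sketch below is an illustration, not the lecture's control logic: the return labels and parameter names are hypothetical, and only the Ra port is modeled.

```python
R31 = 31  # register 31 always reads as zero, so it is never bypassed

def select_ra_source(reads_ra, ra_rf, opclass_alu, rc_alu, werf_wb, wa_wb):
    """Choose where the RF-stage instruction's Ra operand comes from.

    Priority: the ALU-stage result is newer than the WB-stage result,
    so the ALU bypass is checked first.
    """
    if not reads_ra or ra_rf == R31:
        return "REGFILE"
    if opclass_alu in ("OP", "OPC") and ra_rf == rc_alu:
        return "ALU_BYPASS"
    if werf_wb and ra_rf == wa_wb:
        return "WB_BYPASS"
    return "REGFILE"

# ADD(r1,r2,r3) in RF, MULC(r4,17,r5) in ALU, XOR(r2,r6,r1) in WB, as in the figure.
choice = select_ra_source(True, 1, "OPC", 5, True, 1)
print(choice)
```

Checking the ALU bypass before the write-back bypass implements the "not using ALU bypass" clause: when both stages target the same register, the younger instruction's value wins.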

Next Time: More Beta

Bypasses Ahead