Lecture 4: Introduction to Pipelining


Pipelining: Laundry Example
Ann, Brian, Cathy, Dave (loads A, B, C, D) each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
Folder takes 20 minutes

Sequential Laundry
[Figure: timeline from 6 PM to midnight, task order A, B, C, D; each load takes 30, 40, and 20 minutes, one load after another.]
Sequential laundry takes 6 hours for 4 loads

Pipelined Laundry
Start work ASAP
[Figure: timeline from 6 PM to midnight, task order A, B, C, D; segments along the time axis are 30, 40, 40, 40, 40, 20 minutes.]
Pipelined laundry takes 3.5 hours for 4 loads
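
The 6-hour and 3.5-hour totals can be checked with a small simulation. This is only a sketch: the function name `pipeline_finish_time` is made up for illustration, and it assumes a finished load can wait in a basket until the next machine is free.

```python
def pipeline_finish_time(stage_minutes, num_loads):
    """Return the minute at which the last load leaves the last stage."""
    # free_at[s] is the time at which stage s next becomes available.
    free_at = [0] * len(stage_minutes)
    for _ in range(num_loads):
        ready = 0  # time at which this load is ready for its next stage
        for s, duration in enumerate(stage_minutes):
            # A load starts a stage once it has arrived and the stage is free.
            start = max(ready, free_at[s])
            ready = start + duration
            free_at[s] = ready
    return free_at[-1]


stages = [30, 40, 20]  # washer, dryer, folder (minutes, from the slide)
sequential = 4 * sum(stages)               # one load at a time
pipelined = pipeline_finish_time(stages, 4)
print(sequential / 60, pipelined / 60)     # hours
```

The pipelined total works out to 30 + 4 x 40 + 20 minutes: the dryer, the slowest stage, is busy back to back once the first load reaches it.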

Pipelining: Observations
[Figure: pipelined laundry timeline, 6 PM to 9 PM, task order A, B, C, D; segments of 30, 40, 40, 40, 40, 20 minutes.]
Pipelining doesn't help latency of single task, it helps throughput of entire workload
Pipeline rate limited by slowest pipeline stage
Multiple tasks operating simultaneously
Potential speedup = Number pipe stages
Unbalanced lengths of pipe stages reduces speedup
Time to fill pipeline and time to drain it reduces speedup
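
A rough way to see the last three observations is to compute the speedup for different numbers of loads and for balanced versus unbalanced stages. The helper names below are illustrative, and the closed form assumes identical loads with no limit on waiting space between stages.

```python
def pipelined_minutes(stage_minutes, num_loads):
    # Identical loads: the first load takes the full latency, and each later
    # load finishes one slowest-stage time after the previous one.
    return sum(stage_minutes) + (num_loads - 1) * max(stage_minutes)


def speedup(stage_minutes, num_loads):
    sequential = num_loads * sum(stage_minutes)
    return sequential / pipelined_minutes(stage_minutes, num_loads)


unbalanced = [30, 40, 20]  # stage times from the laundry example
balanced = [30, 30, 30]    # hypothetical stages with the same total time

for n in (1, 4, 100):
    print(n, round(speedup(unbalanced, n), 2), round(speedup(balanced, n), 2))
```

With many loads the balanced pipeline approaches a speedup of 3, the number of stages, while the unbalanced one is capped at 90/40 = 2.25 by the dryer. With only a few loads, fill and drain time keeps both well below their limits.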

5 Steps of DLX Datapath Figure 3.1
[Figure: unpipelined DLX datapath in five steps: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Labeled elements: PC, +4 adder, NPC, Inst. Mem., IR, Regs, A, B, Imm., Sign Ext. (16 to 32), Zero?, Cond., ALU, ALU Output, Data Mem., LMD, and multiplexers.]

Pipelined DLX Datapath Figure 3.4
[Figure: the same five stages (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc.; Memory Access; Write Back) separated by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB.]

Visualizing Pipelining Figure 3.3
[Figure: instructions in program order (vertical axis) against time in clock cycles (horizontal axis).]
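
The same kind of diagram can be printed as text. This sketch assumes an ideal five-stage pipeline with no stalls and uses the abbreviations IF, ID, EX, MEM, WB for the five stages; `pipeline_chart` is an illustrative name.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_chart(num_instrs):
    """Return one row per instruction; column c holds the stage occupied in cycle c."""
    total_cycles = num_instrs + len(STAGES) - 1
    rows = []
    for i in range(num_instrs):
        row = [""] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name  # instruction i enters stage s in cycle i + s
        rows.append(row)
    return rows


for i, row in enumerate(pipeline_chart(4)):
    print(f"instr {i}: " + " ".join(f"{cell:>3}" for cell in row))
```

Each row is shifted one cycle to the right of the row above it, so in any one cycle every stage holds a different instruction.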

Limits to Pipelining
Hazards prevent next instruction from executing during its designated clock cycle
  Structural hazards: HW cannot support this combination of instructions
  Data hazards: Instruction depends on result of prior instruction still in the pipeline
  Control hazards: Pipelining of branches & other instructions that change the PC
Common solution is to stall the pipeline until the hazard is resolved, inserting one or more bubbles in the pipeline

One Memory Port/Structural Hazards Figure 3.6
[Figure: time in clock cycles against instruction order: Load, Instr 1, Instr 2, Instr 3, Instr 4.]

One Memory Port/Structural Hazards Figure 3.7
[Figure: instruction order with a bubble inserted: Load, Instr 1, Instr 2, stall, Instr 3.]
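
The stall pattern in these two figures can be reproduced with a short model. It is a sketch under stated assumptions: a five-stage pipeline in which the MEM stage comes three cycles after fetch, a single memory port shared by instruction fetch and data access, and no other source of stalls. `fetch_cycles` is an illustrative name.

```python
def fetch_cycles(is_mem_op):
    """Return the clock cycle in which each instruction is fetched.

    is_mem_op[i] is True when instruction i is a load or store.
    """
    fetch = []
    busy = set()  # cycles in which an earlier instruction's MEM stage uses the port
    cycle = 0
    for mem_op in is_mem_op:
        while cycle in busy:  # port is taken: insert a bubble and retry
            cycle += 1
        fetch.append(cycle)
        if mem_op:
            busy.add(cycle + 3)  # IF, ID, EX, then MEM three cycles after fetch
        cycle += 1
    return fetch


# Load, Instr 1, Instr 2, Instr 3, Instr 4: only the first accesses data memory.
print(fetch_cycles([True, False, False, False, False]))
```

The fetch of Instr 3 would fall in the cycle in which the load is in its MEM stage, so it slips by one cycle, which is the stall shown in Figure 3.7.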

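Speed Up Equation for Pipelining

$$\text{Speedup from pipelining} = \frac{\text{Ave Instr Time}_{\text{unpipelined}}}{\text{Ave Instr Time}_{\text{pipelined}}} = \frac{\text{CPI}_{\text{unpipelined}} \times \text{Clock Cycle}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}} \times \text{Clock Cycle}_{\text{pipelined}}} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

$$\text{Ideal CPI} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{Pipeline depth}}$$

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$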

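Speed Up Equation for Pipelining

$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instr}$$

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

If Ideal CPI = 1:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$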

Example: Dual-port vs. Single-port
Machine A: Dual ported memory
Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
Ideal CPI = 1 for both
Loads are 40% of instructions executed

$$\text{SpeedUp}_A = \frac{\text{Pipeline Depth}}{1 + 0} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{pipe}}} = \text{Pipeline Depth}$$

$$\text{SpeedUp}_B = \frac{\text{Pipeline Depth}}{1 + 0.4 \times 1} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{unpipe}} / 1.05} = \frac{\text{Pipeline Depth}}{1.4} \times 1.05 = 0.75 \times \text{Pipeline Depth}$$

$$\frac{\text{SpeedUp}_A}{\text{SpeedUp}_B} = \frac{\text{Pipeline Depth}}{0.75 \times \text{Pipeline Depth}} = 1.33$$

Machine A is 1.33 times faster
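
The arithmetic in this example can be checked numerically. The function below is a sketch of the speedup formula for an ideal CPI of 1; its name and parameter names are illustrative, and the depth of 5 is an arbitrary choice because the depth cancels in the ratio.

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0):
    """Speedup over the unpipelined machine, assuming an ideal CPI of 1.

    clock_ratio is the unpipelined cycle time divided by the pipelined cycle time.
    """
    return depth / (1.0 + stall_cpi) * clock_ratio


depth = 5  # arbitrary; cancels when comparing A and B

# Machine A: dual-ported memory, so no structural stalls.
speedup_a = pipeline_speedup(depth, stall_cpi=0.0)

# Machine B: 40% loads, one stall cycle each, but a 1.05 times faster clock.
speedup_b = pipeline_speedup(depth, stall_cpi=0.4 * 1, clock_ratio=1.05)

print(round(speedup_a / speedup_b, 2))
```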

Data Hazard on R1 Figure 3.9

Three Generic Data Hazards
Instr I followed by Instr J
Read After Write (RAW): Instr J tries to read operand before Instr I writes it

Three Generic Data Hazards
Instr I followed by Instr J
Write After Read (WAR): Instr J tries to write operand before Instr I reads it
Can't happen in DLX 5 stage pipeline because:
  All instructions take 5 stages,
  Reads are always in stage 2, and
  Writes are always in stage 5

Three Generic Data Hazards
Instr I followed by Instr J
Write After Write (WAW): Instr J tries to write operand before Instr I writes it
Leaves the wrong result (Instr I's value, not Instr J's)
Can't happen in DLX 5 stage pipeline because:
  All instructions take 5 stages, and
  Writes are always in stage 5
Will see WAR and WAW in later, more complicated pipes
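
The three definitions can be written as register comparisons between an earlier Instr I and a later Instr J. The sketch below only reports which hazard types a pair could produce; whether a stall actually occurs depends on the pipeline, as the slides note for WAR and WAW in DLX. The `(dest, sources)` representation, the function name, and the example instructions are assumptions made for illustration.

```python
def classify_hazards(instr_i, instr_j):
    """Return the hazard types possible between earlier instr_i and later instr_j.

    Each instruction is a (dest, sources) pair: dest is a register name or None,
    sources is a tuple of register names.
    """
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    hazards = set()
    if dest_i is not None and dest_i in srcs_j:
        hazards.add("RAW")  # J reads what I writes
    if dest_j is not None and dest_j in srcs_i:
        hazards.add("WAR")  # J writes what I reads
    if dest_i is not None and dest_i == dest_j:
        hazards.add("WAW")  # J writes what I writes
    return hazards


add = ("R1", ("R2", "R3"))  # ADD R1,R2,R3
sub = ("R4", ("R1", "R5"))  # SUB R4,R1,R5
print(classify_hazards(add, sub))
```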

Forwarding to Avoid Data Hazard Figure 3.10

HW Change for Forwarding Figure 3.20

Data Hazard Even with Forwarding Figure 3.12

Data Hazard Even with Forwarding Figure 3.13

Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f in memory.

Slow code:
    LW   Rb,b
    LW   Rc,c
    ADD  Ra,Rb,Rc
    SW   a,Ra
    LW   Re,e
    LW   Rf,f
    SUB  Rd,Re,Rf
    SW   d,Rd

Fast code:
    LW   Rb,b
    LW   Rc,c
    LW   Re,e
    ADD  Ra,Rb,Rc
    LW   Rf,f
    SW   a,Ra
    SUB  Rd,Re,Rf
    SW   d,Rd
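
To compare the two schedules, one can count load-use stalls. This sketch assumes a pipeline with forwarding in which only the instruction immediately after a load stalls, for one cycle, when it reads the loaded register. The parser handles only the four opcodes used on the slide, and the function names are illustrative.

```python
def parse(line):
    """Split 'OP a,b,c' into (op, dest, sources) using the slide's operand order."""
    op, operands = line.split()
    regs = operands.split(",")
    if op == "LW":   # LW Rdest,var
        return op, regs[0], []
    if op == "SW":   # SW var,Rsrc
        return op, None, [regs[1]]
    return op, regs[0], regs[1:]  # ADD/SUB Rdest,Rsrc1,Rsrc2


def load_use_stalls(program):
    """Count instructions that read a register loaded by the instruction just before."""
    stalls = 0
    prev_op, prev_dest = None, None
    for line in program:
        op, dest, srcs = parse(line)
        if prev_op == "LW" and prev_dest in srcs:
            stalls += 1
        prev_op, prev_dest = op, dest
    return stalls


slow = ["LW Rb,b", "LW Rc,c", "ADD Ra,Rb,Rc", "SW a,Ra",
        "LW Re,e", "LW Rf,f", "SUB Rd,Re,Rf", "SW d,Rd"]
fast = ["LW Rb,b", "LW Rc,c", "LW Re,e", "ADD Ra,Rb,Rc",
        "LW Rf,f", "SW a,Ra", "SUB Rd,Re,Rf", "SW d,Rd"]
print(load_use_stalls(slow), load_use_stalls(fast))
```

In the slow code, ADD and SUB each follow directly after the load of one of their operands. The fast code moves an independent instruction into each of those slots.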

Compiler Avoiding Load Stalls
[Figure: bar chart of % loads stalling pipeline (axis 0% to 80%), scheduled vs. unscheduled, for gcc, spice, and tex. Values shown in the chart: 14%, 25%, 31%, 42%, 54%, 65%.]

Pipelining Summary
Just overlap tasks, and easy if tasks are independent
Speed Up vs Pipeline Depth; if ideal CPI is 1, then:

$$\text{Speedup} = \frac{\text{Pipeline Depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

Hazards limit performance on computers:
  Structural: need more HW resources
  Data: need forwarding, compiler scheduling
  Control: discuss next time