ECE473 Computer Architecture and Organization. Pipeline: Introduction


Computer Architecture and Organization. Pipeline: Introduction.
Lecturer: Prof. Yifeng Zhu, Fall 2015. Portions of these slides are derived from Dave Patterson (UCB).
Lec 11.1

The Laundry Analogy
Students A, B, C, and D each have one load of clothes to wash, dry, and fold.
Washer takes 30 minutes.
Dryer takes 30 minutes.
Folder takes 30 minutes.
Stasher takes 30 minutes to put clothes into drawers.
Lec 11.2

If we do laundry sequentially...
[Figure: task order vs. time from 6 PM to 2 AM; loads A-D run back to back, each taking four 30-minute steps]
Time required: 8 hours for 4 loads.
Lec 11.3

To Pipeline, We Overlap Tasks
[Figure: task order vs. time starting at 6 PM; loads A-D overlap, with a new load starting every 30 minutes]
Time required: 3.5 hours for 4 loads.
Lec 11.4

To Pipeline, We Overlap Tasks
[Same figure as the previous slide]
Latency? Throughput? Potential speedup? How do we determine the clock? What is the influence of unbalanced task lengths? Any assumption about fill and drain?
Time required: 3.5 hours for 4 loads.
Lec 11.5

To Pipeline, We Overlap Tasks
[Same figure as the previous slides]
Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.
The pipeline rate is limited by the slowest pipeline stage.
Multiple tasks operate simultaneously.
Potential speedup = number of pipe stages.
Unbalanced lengths of pipe stages reduce speedup.
Time to fill the pipeline and time to drain it reduce speedup.
Lec 11.6
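
As a quick check of the 8-hour versus 3.5-hour numbers above, here is a minimal sketch in Python; the function names and the closed-form "fill, then one load per stage time" formula are ours, not from the slides.

```python
def sequential_minutes(n_loads, n_stages, stage_minutes):
    # Each load occupies all stages back to back before the next load starts.
    return n_loads * n_stages * stage_minutes

def pipelined_minutes(n_loads, n_stages, stage_minutes):
    # The first load takes n_stages * stage_minutes to fill the pipeline;
    # after that, one load finishes every stage_minutes.
    return (n_stages + n_loads - 1) * stage_minutes

loads, stages, minutes = 4, 4, 30  # wash, dry, fold, stash; 30 minutes each
print(sequential_minutes(loads, stages, minutes) / 60)  # 8.0 hours
print(pipelined_minutes(loads, stages, minutes) / 60)   # 3.5 hours
```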

What is Pipelining? A way of speeding up execution of instructions Key idea: overlap execution of multiple instructions Lec 11.7

Pipelining a Digital System
(1 nanosecond = 10^-9 second; 1 picosecond = 10^-12 second)
Key idea: break a big computation (1 ns of logic) up into pieces, and separate each piece with a pipeline register.
[Figure: a 1 ns block split into five 200 ps stages separated by pipeline registers]
Lec 11.8

Pipelining a Digital System
Why do this? Because it's faster for repeated computations.
Non-pipelined: 1 operation finishes every 1 ns.
Pipelined: 1 operation finishes every 200 ps.
Lec 11.9

Comments about pipelining
Pipelining increases throughput, but not latency: an answer is available every 200 ps, but a single computation still takes 1 ns.
Limitations:
The computation must be divisible into stages.
Pipeline registers add overhead.
Lec 11.10
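
To make the throughput/latency distinction concrete, here is a minimal sketch assuming the slide's numbers: a 1 ns block split into five 200 ps stages. The per-stage register overhead parameter is our addition (set to zero here) to show where the pipeline-register cost would enter.

```python
def non_pipelined_total_ps(n_ops, logic_ps=1000):
    # One operation must finish before the next begins.
    return n_ops * logic_ps

def pipelined_total_ps(n_ops, n_stages=5, stage_ps=200, reg_overhead_ps=0):
    # The latency of one operation is n_stages cycles; after the pipeline
    # fills, one operation completes every cycle.
    cycle_ps = stage_ps + reg_overhead_ps
    return (n_stages + n_ops - 1) * cycle_ps

print(non_pipelined_total_ps(1), pipelined_total_ps(1))        # 1000 vs 1000: latency not improved
print(non_pipelined_total_ps(1000), pipelined_total_ps(1000))  # 1000000 vs 200800: throughput ~5x
```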

Pipelining a Processor
Recall the 5 steps in instruction execution:
1. Instruction Fetch (IF)
2. Instruction Decode and Register Read (ID)
3. Execute operation or calculate address (EX)
4. Memory access (MEM)
5. Write result into register (WB)
Review: Single-Cycle Processor
All 5 steps are done in a single clock cycle, with dedicated hardware required for each step.
Lec 11.11

Review - Single-Cycle Processor What do we need to add to actually split the datapath into stages? Lec 11.12

The Basic Pipeline For MIPS
[Figure: instruction order vs. cycle (cycles 1-7); successive instructions each flow through Ifetch, Reg, ALU, DMem, Reg, offset by one cycle]
What do we need to add to actually split the datapath into stages?
Lec 11.13
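
The staggered diagram on this slide can be reproduced with a minimal sketch, assuming the ideal case with no stalls or hazards, that prints which stage each instruction occupies in each cycle:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(n_instructions):
    # Instruction i enters the pipeline in cycle i and advances one stage per cycle.
    total_cycles = n_instructions + len(STAGES) - 1
    for i in range(n_instructions):
        row = []
        for cycle in range(total_cycles):
            stage = cycle - i
            row.append(STAGES[stage] if 0 <= stage < len(STAGES) else "")
        print(f"instr {i}: " + " ".join(f"{s:>3}" for s in row))

pipeline_diagram(4)
# instr 0:  IF  ID  EX MEM  WB
# instr 1:      IF  ID  EX MEM  WB   ... and so on, one cycle later per instruction
```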

Basic Pipelined Processor Lec 11.14

Pipeline example: lw IF Lec 11.15

Pipeline example: lw ID Lec 11.16

Pipeline example: lw EX Lec 11.17

Pipeline example: lw MEM Lec 11.18

Pipeline example: lw WB Can you find a problem? Lec 11.19

Basic Pipelined Processor (Corrected) Lec 11.20

Single-Cycle vs. Pipelined Execution
[Figure, non-pipelined: each instruction runs Instruction Fetch, REG RD, ALU, MEM, REG WR back to back, taking 800 ps, before the next instruction starts]
[Figure, pipelined: the same stages overlap across instructions, and a new instruction starts every 200 ps]
Lec 11.21
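
Using the 800 ps single-cycle instruction time and 200 ps pipeline cycle shown in the figure, here is a minimal sketch of the total time for a stream of instructions, assuming the same 5 stages and no stalls:

```python
def single_cycle_total_ps(n, instr_ps=800):
    # Each instruction takes the full 800 ps before the next begins.
    return n * instr_ps

def pipelined_total_ps(n, n_stages=5, cycle_ps=200):
    # Fill the pipeline once, then complete one instruction per 200 ps cycle.
    return (n_stages + n - 1) * cycle_ps

for n in (3, 1000):
    print(n, single_cycle_total_ps(n), pipelined_total_ps(n))
# 3 instructions:    2400 ps vs 1400 ps  (~1.7x; fill and drain dominate)
# 1000 instructions: 800000 ps vs 200800 ps (~4x, approaching the 800/200 ratio)
```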

Speedup
Consider the unpipelined multicycle processor introduced previously. Assume that it has a 1 ns clock cycle, that it uses 4 cycles for ALU operations and branches and 5 cycles for memory operations, and that the relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that, due to clock skew and setup time, pipelining the processor adds 0.2 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from a pipeline?
Nonpipelined multicycle processor: clock = 1 ns
Pipelined processor: clock = 1.2 ns
What is the speedup?
Operation   Cycles   Frequency
ALU         4        40%
Branch      4        20%
Memory      5        40%
Lec 11.22

Speedup (solution, continuing the example from the previous slide)
Average instruction execution time (unpipelined) = 1 ns x ((40% + 20%) x 4 + 40% x 5) = 4.4 ns
Speedup from pipelining = average instruction time unpipelined / average instruction time pipelined = 4.4 ns / 1.2 ns ≈ 3.7
Lec 11.23
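
The same computation as a minimal Python sketch; the operation mix and cycle counts are exactly those given on the slide.

```python
# Operation mix: name -> (cycles on the unpipelined machine, relative frequency)
mix = {"ALU": (4, 0.40), "Branch": (4, 0.20), "Memory": (5, 0.40)}

unpipelined_clock_ns = 1.0
pipelined_clock_ns = 1.0 + 0.2   # 0.2 ns of clock skew / setup overhead

# Unpipelined: each instruction takes its full cycle count.
avg_unpipelined_ns = unpipelined_clock_ns * sum(c * f for c, f in mix.values())

# Pipelined (ideal, no stalls): one instruction completes every cycle.
avg_pipelined_ns = pipelined_clock_ns

print(avg_unpipelined_ns)                     # 4.4
print(avg_unpipelined_ns / avg_pipelined_ns)  # ~3.67
```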

Comments about Pipelining
The good news:
Multiple instructions are being processed at the same time.
This works because stages are isolated by registers.
Best-case speedup of N (the number of stages).
The bad news:
Instructions interfere with each other - hazards.
Example: different instructions may need the same piece of hardware (e.g., memory) in the same clock cycle.
Example: an instruction may require a result produced by an earlier instruction that is not yet complete.
Lec 11.24
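
As an illustration of the second kind of interference (a result needed before an earlier instruction has written it back), here is a minimal, hypothetical sketch that flags read-after-write overlaps between nearby instructions. The instruction encoding and the 3-instruction "danger window" are assumptions for illustration only, not part of the lecture.

```python
# Each instruction is (text, destination register, source registers) - a hypothetical encoding.
program = [
    ("lw  r1, 0(r2)",   "r1", ["r2"]),
    ("add r3, r1, r4",  "r3", ["r1", "r4"]),   # needs r1 from the instruction just above
    ("sub r5, r6, r7",  "r5", ["r6", "r7"]),
]

WINDOW = 3  # assumed: a result is not written back until roughly 3 instructions later

for i, (text, _, sources) in enumerate(program):
    for j in range(max(0, i - WINDOW), i):
        prev_text, prev_dest, _ = program[j]
        if prev_dest in sources:
            print(f"hazard: '{text}' reads {prev_dest} written by '{prev_text}'")
```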