Computer Hardware. Pipeline

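Conventional Datapath
A single operation requires 2.4 ns, so the maximum clock frequency is 1 / 2.4 ns = 416.7 MHz.
[Figure: conventional datapath with register file, MUX B, function unit, and MUX D. Delay labels: 0.6 ns, 0.6 ns, 0.2 ns, 0.8 ns, 0.2 ns, which sum to 2.4 ns.]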

Production Line Analogy
Automated car wash: cars are pulled through a series of stations, at each of which a particular step is performed:
1. Wash
2. Rinse
3. Dry
Latency is the time needed to wash, rinse, and dry one car.
Throughput is the rate at which washed cars are delivered.
Based on this analogy → a pipelined datapath with n stages ideally has a processing rate (throughput) for instructions that is n times that of a non-pipelined datapath.

Pipelined Datapath
A pipelined datapath is formed by breaking a conventional datapath into parts and inserting registers, called pipeline platforms, between these parts.
These registers provide temporary storage for data passing through the pipeline.
Data moves synchronously with the clock.
Stage delays: operand fetch (OF) 0.8 ns, execution (EX) 1.0 ns, write-back (WB) 1.0 ns.
The minimum clock period is set by the slowest stage: 1.0 ns. The operating frequency is therefore 1.0 GHz (1000 MHz), 2.4 times that of the non-pipelined datapath.
Even though there are 3 stages, the improvement factor is not quite 3. Why?
[Figure (b), pipelined: stages OF, EX, and WB separated by pipeline platforms 1 to 3; components: register file, MUX B, function unit, MUX D; delay labels 0.6 ns, 0.2 ns, 0.2 ns, 0.8 ns, 0.2 ns, 0.2 ns; Clock 0.6 ns.]
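The clock-period reasoning on this slide can be checked with a short calculation. This is a minimal sketch using only the delays quoted above; the function name `pipeline_clock` and the variable names are illustrative and do not come from the slides.

```python
def pipeline_clock(stage_delays_ns):
    """Return (minimum clock period in ns, frequency in MHz).

    The clock must be long enough for the slowest stage, so the
    minimum period is the maximum of the stage delays.
    """
    period_ns = max(stage_delays_ns)
    return period_ns, 1000.0 / period_ns


# Values quoted on the slides.
conventional_period_ns = 2.4
stage_delays_ns = {"OF": 0.8, "EX": 1.0, "WB": 1.0}

period_ns, freq_mhz = pipeline_clock(stage_delays_ns.values())
speedup = conventional_period_ns / period_ns
total_stage_delay_ns = sum(stage_delays_ns.values())

print(f"min clock period: {period_ns} ns, frequency: {freq_mhz} MHz")
print(f"clock-rate improvement: {speedup:.1f}x")
print(f"sum of stage delays: {total_stage_delay_ns:.1f} ns")
```

The stage delays sum to 2.8 ns rather than 2.4 ns, and the OF stage is shorter than the clock period. The stages are neither free of added delay nor perfectly balanced, which is one way to see why the factor is 2.4 and not 3.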

Pipelined Datapath
OF (operand fetch) consists of reading the register values (A and B) or selecting a constant value (MB). The pipeline platform stores the operand(s) to be used in EX during the next clock cycle.
In EX (execute), a function unit operation occurs, and the result is captured by the 2nd pipeline platform.
WB is the write-back stage: the result from the EX stage, or the value on Data in (selected by MUX D), is written back.
[Figure: pipelined datapath with stages 1 Operand Fetch (OF), 2 Execute (EX), 3 Write-back (WB). Components: register file (drawn twice, same unit), MUX B, function unit, MUX D. Signals: AA, BA, MB, FS, MD, RW, DA, Constant in, Address out, Data out, Data in, A data, B data, D data; status bits V, C, N, Z. Figure label: FUNDAMENTALS, 4e.]

Pipelined Execution Pattern
[Figure: execution pattern of microoperations 1 to 7 over clock cycles 1 to 9. Operands shown per microoperation: (1) R1, R2, R3; (2) R4, sl R6; (3) R7, R7, 1; (4) R1, R0, 2; (5) Data out, R3; (6) R4, Data in; (7) R5, 0.]
What is the total time required by the conventional datapath for execution? → 7 microoperations × 2.4 ns = 16.8 ns
What is the total time required by the pipelined datapath for execution? → 9 cycles × 1 ns = 9 ns
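Both totals can be reproduced from the operation count. The sketch below assumes the usual fill pattern of a stall-free pipeline, where the first operation needs one cycle per stage and each later operation completes one cycle after its predecessor. The function names are illustrative only.

```python
def conventional_time_ns(n_ops, op_time_ns):
    """Each operation runs to completion before the next one starts."""
    return n_ops * op_time_ns


def pipelined_cycles(n_ops, n_stages):
    """Cycles for n_ops operations in a stall-free pipeline.

    The first operation occupies n_stages cycles; every later one
    finishes one cycle after the operation ahead of it.
    """
    return n_stages + n_ops - 1


n_ops = 7          # microoperations in the example
n_stages = 3       # OF, EX, WB
conv_ns = conventional_time_ns(n_ops, 2.4)
cycles = pipelined_cycles(n_ops, n_stages)
pipe_ns = cycles * 1.0   # 1.0 ns clock period

print(f"conventional: {conv_ns:.1f} ns")
print(f"pipelined: {cycles} cycles = {pipe_ns:.1f} ns")
```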

Pipelined Execution Pattern (continued)
The maximum improvement of the pipelined datapath over the conventional one is obtained when the pipeline is fully utilized (all stages are active).
For example, over the 5 clock cycles 3 to 7 (the pipeline is full), 5 operations are completed in 5 ns.
In the same time the conventional datapath can execute 5 ns ÷ 2.4 ns/microoperation = 2.083 microoperations.
→ The pipelined datapath executes 5 ÷ 2.083 = 2.4 times as many microoperations as the conventional one.
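The short-sequence and fully-utilized figures are two points on the same curve. As a sketch, assume $n$ microoperations, a $k$-stage pipeline with clock period $T_{clk}$, and a conventional datapath that takes $T_{conv}$ per microoperation. These symbols are introduced here for illustration and do not appear on the slides.

```latex
\[
S(n) = \frac{n\,T_{conv}}{(k + n - 1)\,T_{clk}},
\qquad
\lim_{n \to \infty} S(n) = \frac{T_{conv}}{T_{clk}}
\]
\[
S(7) = \frac{7 \times 2.4\ \text{ns}}{(3 + 7 - 1) \times 1.0\ \text{ns}}
     = \frac{16.8}{9} \approx 1.87,
\qquad
\frac{T_{conv}}{T_{clk}} = \frac{2.4\ \text{ns}}{1.0\ \text{ns}} = 2.4
\]
```

For the seven-microoperation sequence the overall improvement is about 1.87, because filling and emptying the pipeline costs $k - 1 = 2$ extra cycles. The factor of 2.4 is approached only as the pipeline stays full.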

Pipelined Computer
Registers are added to the pipeline platforms to pass the instruction information through the pipeline.
[Figure: four-stage pipelined computer (Stage 1 to Stage 4) with pipeline platforms labeled DOF, EX, and WB, divided into CONTROL and DATAPATH. Components: PC, instruction memory, IR, instruction decoder, zero fill, register file (drawn twice, same unit), MUX B, function unit, data memory (drawn twice, same unit), MUX D. Control signals carried through the platforms: AA, BA, MB, FS, MW, DA, MD, RW; status bits C, V, N, Z.]

Performance of Pipelined Computer
[Figure: pipeline execution pattern of instructions 1 to 7 over clock cycles 1 to 10.]
Compare the performance of the single-cycle computer with the performance of the pipelined computer, for the situation in which the pipeline is fully utilized.
In 20 ns the pipelined computer completes 4 instructions, versus 20 ns ÷ 17 ns/instruction = 1.18 instructions for the single-cycle computer.
Throughput: pipelined = 3.4 × single-cycle.
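The throughput ratio follows directly from the figures quoted on this slide. In the minimal sketch below, the 20 ns window, the 4 pipelined instructions, and the 17 ns single-cycle instruction time come from the slide; the variable names are illustrative.

```python
# Figures quoted on the slide.
window_ns = 20.0
pipelined_instructions = 4            # completed in the window when the pipeline is full
single_cycle_ns_per_instruction = 17.0

# Instructions the single-cycle computer completes in the same window.
single_cycle_instructions = window_ns / single_cycle_ns_per_instruction

# Throughput ratio, pipelined over single-cycle.
throughput_ratio = pipelined_instructions / single_cycle_instructions

print(f"single-cycle: {single_cycle_instructions:.2f} instructions in {window_ns:.0f} ns")
print(f"pipelined / single-cycle throughput: {throughput_ratio:.1f}x")
```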

Performance Issues
A pipeline with 4 stages would ideally improve performance 4 times, but in practice it does not. Why?
Pipelining hazards cause the pipe to stall because of some conflict in the pipe, which prevents the next instruction in the pipe from executing in its turn.
Types of hazards:
Structural: contention for the same hardware resource.
Data: dependency on an earlier instruction for the correct sequencing of register reads and writes.
Control: branch/jump instructions stall the pipe until the correct target address is loaded into the PC.

Reduction in Throughput
Filling and flushing of the pipeline reduces the throughput below the maximum level.
Data and control hazards are timing problems. They arise because the execution of an operation in a pipeline is delayed by one or more clock cycles from the time at which the instruction containing the operation was fetched.

Data Hazard Problem

A hardware-based solution
Program:
MOVA R1, R5
ADD R2, R1, R6
ADD R3, R1, R2
[Figure: execution over clock cycles 1 to 8. Row order: MOVA R1, R5; (ADD R2, R1, R6); ADD R2, R1, R6; (ADD R3, R1, R2); ADD R3, R1, R2, where parentheses mark a bubble. Annotations: data hazard detected in DOF, pipeline stalled, and bubble launched (once for each ADD); R1 write and reads; R2 write and read.]
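The stall behaviour in this figure can be reproduced with a small scheduling model. This is a sketch under stated assumptions, not the slides' hardware: a four-stage pipeline (fetch, DOF, EX, WB), registers read in DOF and written in WB, a value written in WB readable by a DOF in the same cycle (the figure's "write and read" annotation), and no forwarding. The function `schedule` and the tuple layout are hypothetical.

```python
def schedule(program):
    """Assign DOF and WB cycles to each instruction, stalling on data hazards.

    program: list of (text, destination register, source registers).
    Returns a list of (text, stall cycles, DOF cycle, WB cycle).
    """
    write_cycle = {}   # register -> cycle in which its pending value is written
    rows = []
    prev_dof = 1       # the first instruction is fetched in cycle 1, so its DOF is cycle 2
    for text, dest, sources in program:
        earliest = prev_dof + 1                    # in-order: one DOF per cycle
        # Assumption: DOF may coincide with, but not precede, the producer's WB.
        dof = max([earliest] + [write_cycle.get(r, 0) for r in sources])
        stalls = dof - earliest                    # bubbles launched for this instruction
        wb = dof + 2                               # DOF -> EX -> WB
        write_cycle[dest] = wb
        rows.append((text, stalls, dof, wb))
        prev_dof = dof
    return rows


program = [
    ("MOVA R1, R5", "R1", ["R5"]),
    ("ADD R2, R1, R6", "R2", ["R1", "R6"]),
    ("ADD R3, R1, R2", "R3", ["R1", "R2"]),
]

rows = schedule(program)
for text, stalls, dof, wb in rows:
    print(f"{text:<16} stalls={stalls} DOF=cycle {dof} WB=cycle {wb}")
```

Under these assumptions each ADD waits one cycle for its source register, and the last write-back lands in cycle 8, which matches the eight clock cycles drawn in the figure.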

Control Hazards
Program (address, instruction):
1   BZ R1, 18
2   MOVA R2, R3
3   MOVA R1, R2
20  MOVA R5, R6
[Figure: execution over clock cycles 1 to 7. Annotations: R1 = 0 evaluated; PC set to 20; branch detected and bubbles launched; two rows marked "No change"; instruction MOVA R5, R6 fetched from target address.]
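The cycle count in this figure can be approximated by adding the bubbles to the stall-free count. The sketch below assumes each taken branch costs two bubbles, a value read off the two "No change" rows in the figure and not stated as a rule on the slide. The function name `cycles_with_branches` is hypothetical.

```python
def cycles_with_branches(n_stages, n_useful_instructions, n_taken_branches,
                         bubbles_per_taken_branch=2):
    """Total clock cycles when every taken branch inserts bubbles.

    (n_stages - 1) cycles fill the pipeline, then one instruction
    completes per cycle, and each bubble adds one cycle.
    """
    fill = n_stages - 1
    bubbles = n_taken_branches * bubbles_per_taken_branch
    return fill + n_useful_instructions + bubbles


# Example from the slide: BZ is taken, so only BZ and the target
# instruction MOVA R5, R6 do useful work in the 4-stage pipeline.
total = cycles_with_branches(n_stages=4, n_useful_instructions=2, n_taken_branches=1)
no_branch = cycles_with_branches(n_stages=4, n_useful_instructions=2, n_taken_branches=0)

print(f"with taken branch: {total} cycles")
print(f"without bubbles:   {no_branch} cycles")
```

With these inputs the model gives 7 cycles, the same span as clock cycles 1 to 7 in the figure; without the two bubbles the same two instructions would finish in 5.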