CS429: Computer Organization and Architecture

CS429: Computer Organization and Architecture Dr. Bill Young Department of Computer Sciences University of Texas at Austin Last updated: November 8, 2017 at 09:27 CS429 Slideset 14: 1

Overview What's wrong with the sequential (SEQ) Y86? It's slow! Each piece of hardware is used only a small fraction of the time. We would like to find a way to get more performance with only a little more hardware. General Principles of Pipelining: express the task as a collection of stages; move instructions through the stages; process several instructions at any given moment. CS429 Slideset 14: 2

Overview Creating a Pipelined Y86 Processor: rearrange SEQ, insert pipeline registers, and deal with data and control hazards. Pipeline Correctness Axiom: A pipeline is correct only if the resulting machine satisfies the ISA (nonpipelined) semantics. CS429 Slideset 14: 3

Pipelining: Laundry Example Suppose you have four folks, each with a load of clothes to wash, dry, fold, and stash away. There are four subtasks: wash, dry, fold, stash. Suppose each takes 30 minutes. Time to do a load of laundry from start to finish: 2 hours. (That's the latency.) CS429 Slideset 14: 4

Sequential Laundry Sequential laundry takes 8 hours for 4 loads. If they learned pipelining, how long would laundry take? CS429 Slideset 14: 5

Pipelined Laundry Pipelined laundry takes 3.5 hours for 4 loads! But each load still takes 2 hours. What's the metric that improved? How would you measure the efficiency of the process if you were running a laundry service with loads (inputs) always ready to process? CS429 Slideset 14: 6
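
The 8-hour and 3.5-hour figures follow from simple arithmetic. Here is a minimal Python sketch of that arithmetic (assuming four 30-minute stages, one station of each kind, and counting pipeline fill and drain):

    # Sketch: total time for 4 loads through 4 stages of 30 minutes each,
    # run sequentially vs. pipelined (fill and drain time included).
    LOADS, STAGES, STAGE_HOURS = 4, 4, 0.5
    sequential_hours = LOADS * STAGES * STAGE_HOURS        # 8.0 hours
    pipelined_hours = (STAGES + LOADS - 1) * STAGE_HOURS   # 3.5 hours
    print(sequential_hours, pipelined_hours)               # 8.0 3.5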

Latency vs. Throughput Latency is the time from start to finish for a given task. Throughput is the number of tasks completed in a given time period. Example: suppose that each laundry stage (wash, dry, fold, stash) takes 30 minutes. But you have a laundromat with 4 washers, 4 driers, 4 folding stations, 4 stashing stations. What is the latency? CS429 Slideset 14: 7

Latency vs. Throughput Latency is the time from start to finish for a given task. Throughput is the number of tasks completed in a given time period. Example: suppose that each laundry stage (wash, dry, fold, stash) takes 30 minutes. But you have a laundromat with 4 washers, 4 driers, 4 folding stations, 4 stashing stations. What is the latency? Latency is 2 hours, because it still takes two hours to get any single load through the entire process. What is the highest possible throughput (per hour)? CS429 Slideset 14: 8

Latency vs. Throughput Latency is the time from start to finish for a given task. Throughput is the number of tasks completed in a given time period. Example: suppose that each laundry stage (wash, dry, fold, stash) takes 30 minutes. But you have a laundromat with 4 washers, 4 driers, 4 folding stations, 4 stashing stations. What is the latency? Latency is 2 hours, because it still takes two hours to get any single load through the entire process. What is the highest possible throughput (per hour)? Throughput is (theoretically) 8 loads / hour since you can complete 8 loads every hour in steady state. How? CS429 Slideset 14: 9
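
As a sketch of the throughput answer: the 30-minute stage time and the four-of-each-station laundromat are the slide's setup, while the variable names below are only illustrative.

    # Sketch: latency vs. steady-state throughput for the 4-station laundromat.
    STAGE_MIN = 30           # wash, dry, fold, stash each take 30 minutes
    NUM_STAGES = 4
    STATIONS_PER_STAGE = 4   # 4 washers, 4 driers, 4 folding, 4 stashing stations
    latency_hours = STAGE_MIN * NUM_STAGES / 60             # still 2.0 hours per load
    loads_per_hour = (60 / STAGE_MIN) * STATIONS_PER_STAGE   # 8.0 loads finish per hour
    print(latency_hours, loads_per_hour)                     # 2.0 8.0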

Pipelining Lessons Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload. Multiple tasks operate simultaneously using different resources. Potential speedup = number of stages. Unbalanced lengths of pipe stages reduce speedup. Time to fill the pipeline and time to drain it reduce speedup. May need to stall for dependencies. CS429 Slideset 14: 10

Computational Example [Figure: a 300 ps block of combinational logic followed by a 20 ps register; Delay = 320 ps, Throughput = 3.12 GIPS.] Computation requires a total of 300 picoseconds. It needs an additional 20 picoseconds to save the result in the register. Must have a clock cycle of at least 320 ps. Why? CS429 Slideset 14: 11

3-Way Pipelined Version [Figure: three 100 ps combinational blocks (stages A, B, C), each followed by a 20 ps register; Delay = 360 ps, Throughput = 8.33 GIPS.] Divide the combinational logic into 3 blocks of 100 ps each. Can begin a new operation as soon as the previous one passes through stage A. Begin a new operation every 120 ps. Why? Overall latency increases! It's now 360 ps from start to finish. CS429 Slideset 14: 12
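
The delay and throughput figures on this and the surrounding slides follow from the per-stage delays plus the 20 ps register overhead. Here is a minimal Python sketch of that arithmetic; the helper function and its name are illustrative, not part of the slides.

    # Sketch: clock period, latency, and throughput for a pipeline, given the
    # combinational delay of each stage and the register overhead (all in ps).
    def pipeline_metrics(stage_delays_ps, reg_overhead_ps=20):
        period = max(stage_delays_ps) + reg_overhead_ps    # clock set by the slowest stage
        latency = period * len(stage_delays_ps)            # one operation, start to finish
        throughput_gips = 1000.0 / period                  # one result per cycle
        return period, latency, throughput_gips

    print(pipeline_metrics([300]))             # (320, 320, 3.125)   unpipelined
    print(pipeline_metrics([100, 100, 100]))   # (120, 360, 8.33...) 3-way pipelined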

Pipeline Diagrams [Figure: unpipelined timing diagram, with OP1, OP2, OP3 running one after another.] Unpipelined: cannot start a new operation until the previous one completes. [Figure: 3-way pipelined timing diagram, with OP1, OP2, OP3 overlapping across stages A, B, C.] 3-Way Pipelined: up to 3 operations in process simultaneously. CS429 Slideset 14: 13

Operating a Pipeline [Figure: pipeline diagram of OP1, OP2, OP3 across stages A, B, C with clock edges at 0, 120, 240, 360, 480, and 600 ps, plus a snapshot of the three-stage datapath (100 ps of logic and a 20 ps register per stage) at time 300 ps.] CS429 Slideset 14: 14

Limitations: Non-uniform Delays [Figure: three stages with 50 ps, 150 ps, and 100 ps combinational blocks, each followed by a 20 ps register; Delay = 510 ps, Throughput = 5.88 GIPS; pipeline diagram of OP1, OP2, OP3 with unequal stage widths.] Throughput is limited by the slowest stage. Other stages may sit idle for much of the time. It is challenging to partition the system into balanced stages. CS429 Slideset 14: 15

Limitations: Register Overhead [Figure: six 50 ps combinational blocks, each followed by a 20 ps register; Delay = 420 ps, Throughput = 14.29 GIPS.] As you try to deepen the pipeline, the overhead of loading the pipeline registers becomes more significant. Percentage of the clock cycle spent loading registers: 1-stage pipeline: 6.25%; 3-stage pipeline: 16.67%; 6-stage pipeline: 28.57%. The high clock speeds of modern processor designs are obtained through very deep pipelining. (Some models of x86 have a pipeline of 20-24 stages.) CS429 Slideset 14: 16
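
The overhead percentages come from splitting the same 300 ps of combinational logic into deeper pipelines with a 20 ps register per stage; a quick sketch of that calculation:

    # Sketch: fraction of each clock cycle spent loading pipeline registers when
    # 300 ps of combinational logic is split into 1, 3, or 6 equal stages.
    LOGIC_PS, REG_PS = 300, 20
    for stages in (1, 3, 6):
        period = LOGIC_PS / stages + REG_PS
        print(stages, round(100 * REG_PS / period, 2))   # 6.25, 16.67, 28.57 (%)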

The Performance Equation CPU Time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle). Clock Cycle Time: improves by a factor of almost N for an N-deep pipeline; not quite a factor of N due to pipeline overheads. Cycles Per Instruction (CPI): in an ideal world, CPI would stay the same. An individual instruction takes N cycles, but we have N instructions in flight at a time, so average CPI_pipe = (CPI_nopipe x N) / N = CPI_nopipe. Thus, performance can improve by up to a factor of N. CS429 Slideset 14: 17
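
A small Python sketch of the equation with hypothetical numbers: the one-million-instruction count is made up for illustration, while the 320 ps and 120 ps cycle times come from the earlier example.

    # Sketch: CPU time = instructions x CPI x clock cycle time.
    def cpu_time_seconds(instructions, cpi, cycle_time_ps):
        return instructions * cpi * cycle_time_ps * 1e-12

    # Hypothetical 1M-instruction program with an ideal CPI of 1:
    print(cpu_time_seconds(1_000_000, 1.0, 320))   # ~0.00032 s, unpipelined 320 ps cycle
    print(cpu_time_seconds(1_000_000, 1.0, 120))   # ~0.00012 s, pipelined 120 ps cycle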

Data Dependencies [Figure: a combinational block whose result feeds back as the input to the next operation, with OP1, OP2, OP3 executing one after another.] Sequential System: each operation may depend on the previous one. (It doesn't matter for a sequential system. Why not?) CS429 Slideset 14: 18

Data Hazards [Figure: the same feedback path on the 3-stage pipelined system, with OP1 through OP4 overlapping across stages A, B, C.] Pipelined System: the result does not feed back around in time for the next operation. Pipelining has changed the behavior of the system. Alarm!! CS429 Slideset 14: 19

Data Hazards in Processors
    irmovq $50, %rax
    addq %rax, %rbx
    mrmovq 100(%rbx), %rdx
The result from one instruction is used as an operand for another; this is called a read-after-write (RAW) dependency. It is very common in actual programs. We must make sure that our pipeline handles these properly and gets the right result, and we should minimize the performance impact as much as possible. CS429 Slideset 14: 20
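
To make the RAW idea concrete, here is a toy Python sketch that flags such dependencies; the (text, destinations, sources) tuples are hand-written for this example, not the output of a real Y86 decoder.

    # Toy sketch: flag read-after-write (RAW) dependencies between instructions.
    prog = [
        ("irmovq $50, %rax",       {"%rax"}, set()),
        ("addq %rax, %rbx",        {"%rbx"}, {"%rax", "%rbx"}),
        ("mrmovq 100(%rbx), %rdx", {"%rdx"}, {"%rbx"}),
    ]
    for i, (_, dests, _srcs) in enumerate(prog):
        for text, _dests, srcs in prog[i + 1:]:
            if dests & srcs:
                print(f"RAW: '{prog[i][0]}' writes {dests & srcs} read by '{text}'")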

Control Hazards A control hazard occurs if something interferes with the flow of control through the program, i.e., the PC is not determined quickly enough to allow fetching the next instruction.
    xorq %rbx, %rbx
    je Done
    irmovq $100, %rax
    ret
Done:
    irmovq $200, %rax
    ret
When the je instruction moves from the fetch to the decode stage, what is the next instruction to fetch? When will you know? CS429 Slideset 14: 21
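
One way to see the cost: count how many fetch slots are uncertain while the branch is in flight. The stage at which je resolves is an assumption in this sketch (a full design would predict the branch and cancel wrong-path instructions), so this is only an illustration.

    # Toy sketch: fetch slots lost if je's outcome is only known at the end of execute.
    STAGES = ["fetch", "decode", "execute", "memory", "writeback"]
    RESOLVE_STAGE = "execute"                      # assumption for this illustration
    wasted_fetches = STAGES.index(RESOLVE_STAGE) - STAGES.index("fetch")
    print(wasted_fetches)                          # 2 instructions fetched before we know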

Pipeline Correctness Pipeline Correctness Axiom: A pipeline is correct only if the resulting machine satisfies the ISA (nonpipelined) semantics. That is, the pipeline implementation must deal correctly with potential data and control hazards. Any program that runs correctly on the sequential machine must run on the pipelined version with the exact same results. CS429 Slideset 14: 22

SEQ Hardware Stages occur in sequence, with one operation in process at a time. There is one stage for each logical pipeline operation. Fetch: get the next instruction from memory. Decode: figure out what to do, and get values from the regfile. Execute: compute. Memory: access data memory if needed. Write back: write results to the regfile, if needed. CS429 Slideset 14: 23

SEQ+ Hardware Still a sequential implementation, but the PC stage is reordered to the beginning. PC Stage: its task is to select the PC for the current instruction, based on results computed by the previous instruction. Processor State: the PC is no longer stored in a register, but it can be determined from other stored information. CS429 Slideset 14: 24