Single vs. Multi-cycle MIPS. Single Clock Cycle Length

Suppose we have the following step delays:
  Fetch: 2 ns
  Decode: 2 ns
  Register read: 2 ns
  Register write: 2 ns
  Memory read: 2 ns
  Memory write: 2 ns
  ALU: 2 ns

What is the clock cycle length?

Single Cycle Length

The worst-case propagation delay is that of a Load: fetch, decode/regs, ALU, mem, regs.
Thus, the cycle length is 2 ns + 2 ns + 2 ns + 2 ns + 2 ns = 10 ns.
Decode and register read are done simultaneously.
The clock rate is 1 / 10 ns = 100 MHz.
Single cycle design: ALL instruction types take 10 ns!

[Figure: one clock cycle covers all steps for each instruction type: Load; Arithmetic; Branch, Jump Comp.; Store]
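The cycle-length arithmetic above can be checked with a minimal Python sketch. The step names and the dictionary layout are illustrative assumptions; only the 2 ns delays come from the slides.

```python
# Step delays in nanoseconds. Values are from the slides; the key names
# are illustrative labels, not taken from the original deck.
STEP_NS = {"fetch": 2, "decode_regread": 2, "alu": 2, "mem": 2, "reg_write": 2}

# A load exercises every step, so it sets the single-cycle clock period.
LOAD_STEPS = ["fetch", "decode_regread", "alu", "mem", "reg_write"]

cycle_ns = sum(STEP_NS[step] for step in LOAD_STEPS)

# 1 / (cycle_ns * 1e-9 s) in Hz, divided by 1e6 to get MHz, simplifies to 1e3 / cycle_ns.
clock_mhz = 1e3 / cycle_ns

print(f"cycle length = {cycle_ns} ns, clock rate = {clock_mhz:.0f} MHz")
```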

Single Cycle Length

How long does an Add take? 10 ns: it's a single cycle implementation.
But notice that add doesn't use the memory. It could be done in 8 ns (fetch, decode/read registers, ALU, write registers).
How about a Branch? (done in 6 ns but takes 10 ns)
How about a Jump? (done in 6 ns but takes 10 ns)
How about a Store? (done in 8 ns but takes 10 ns)

One clock cycle:
  Load: 10 ns
  Arithmetic: 8 ns but 10 ns cycle
  Branch, Jump Comp.: 6 ns but 10 ns cycle
  Store: 8 ns but 10 ns cycle
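To make the idle time per instruction type explicit, here is a small sketch under the same assumptions (2 ns per step, 10 ns single-cycle clock). The function name `wasted_ns` and the dictionary are hypothetical helpers for illustration.

```python
STEP_NS = 2    # every step takes 2 ns in this example
CYCLE_NS = 10  # single-cycle clock period, set by the load

# Number of steps each instruction type actually needs (from the slides).
STEPS_NEEDED = {"load": 5, "arithmetic": 4, "store": 4, "branch": 3, "jump": 3}


def wasted_ns(kind):
    """Idle time within one 10 ns cycle for the given instruction type."""
    return CYCLE_NS - STEPS_NEEDED[kind] * STEP_NS


for kind, steps in STEPS_NEEDED.items():
    print(f"{kind}: needs {steps * STEP_NS} ns, wastes {wasted_ns(kind)} ns")
```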

Multi-cycle implementation

Divide the steps into their own shorter (faster) clock cycles:
  Fetch: 1 cycle
  Decode/read registers: 1 cycle
  ALU: 1 cycle
  Memory read/write: 1 cycle
  Write registers: 1 cycle

Load takes 5 cycles, add takes 4 cycles, branch takes 3 cycles (comparison done in the 3rd cycle), jump takes 3 cycles, and store takes 4 cycles.

[Figure: one clock cycle per instruction for Load, Arithmetic, Branch/Jump Comp., and Store, with the wasted time marked]

Single cycle: one clock cycle for all steps.
Multi-cycle: one clock cycle for each step.
  Load: 5 cycles
  Arithmetic: 4 cycles
  Branch, Jump Comp.: 3 cycles
  Store: 4 cycles

What is the clock cycle length for the multi-cycle case?
It is the maximum delay of any one of the steps: clock cycle length = max(time of each step).
In this example, the clock cycle length is 2 ns.

How long does each instruction type take now?
  Load: 5 cycles * 2 ns/cycle = 10 ns
  Add/R-type: 4 cycles * 2 ns/cycle = 8 ns
  Jump, branch: 3 cycles * 2 ns/cycle = 6 ns
  Store: 4 cycles * 2 ns/cycle = 8 ns
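The max rule and the per-instruction latencies can be sketched as follows. The helper names `clock_period_ns` and `latency_ns` are assumptions made for illustration; the delays and cycle counts are the ones given above.

```python
def clock_period_ns(step_delays_ns):
    """Multi-cycle clock period: the delay of the slowest single step."""
    return max(step_delays_ns)


def latency_ns(cycles, period_ns):
    """Time one instruction takes: its cycle count times the clock period."""
    return cycles * period_ns


STEP_DELAYS_NS = [2, 2, 2, 2, 2]  # the five steps, 2 ns each
CYCLES = {"load": 5, "r_type": 4, "store": 4, "branch": 3, "jump": 3}

period = clock_period_ns(STEP_DELAYS_NS)
latencies = {kind: latency_ns(c, period) for kind, c in CYCLES.items()}

print(f"clock period = {period} ns")
for kind, ns in latencies.items():
    print(f"{kind}: {ns} ns")
```

If one step were slower, say a hypothetical 3 ns memory access, `clock_period_ns` would return 3 and every cycle of every instruction would stretch to 3 ns.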

How does this help?

Consider this program:

        .data
    A:  .word 10,20,30,40,50,60,70,80,90,100
    B:  .word 0,0,0,0,0,0,0,0,0,0
        .text
        li   $t0,10          # 1 instruction
        la   $t1,A           # 2 instructions
    loop:
        lw   $t3,0($t1)      # executed 10 times, 10 loads total
        add  $t3,$t3,$t3
        add  $t3,$t3,$t3
        sw   $t3,40($t1)     # executed 10 times, 10 stores
        addi $t1,$t1,4
        addi $t0,$t0,-1      # 4 adds per iteration * 10 = 40 adds
        bne  $t0,$0,loop     # executed 10 times, 10 branches
        li   $v0,10          # 1 instruction
        syscall              # 1 instruction

For this program, we have the counts:
  45 add instructions (40 in the loop plus the 5 instructions outside it, counted here as add-type)
  10 load instructions
  10 store instructions
  10 branch instructions

Total instruction count (IC) = 75 instructions.
Suppose the single cycle implementation has a 10 ns cycle.
CPU time is how long the program takes to execute.
Thus, single cycle CPU time is 75 instr * 10 ns = 750 ns.
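The instruction counts can be tallied mechanically. This sketch assumes, as the slide's total of 45 implies, that the five instructions outside the loop (li, the two-instruction la, li, syscall) are all counted as add-type; the variable names are illustrative.

```python
LOOP_ITERATIONS = 10
SINGLE_CYCLE_NS = 10

# Instructions outside the loop: li (1), la (2), li (1), syscall (1),
# all counted as add-type here (an assumption matching the slide's total).
outside_loop = {"add": 5}

# Instructions executed on each pass through the loop body.
per_iteration = {"load": 1, "add": 4, "store": 1, "branch": 1}

counts = dict(outside_loop)
for kind, n in per_iteration.items():
    counts[kind] = counts.get(kind, 0) + n * LOOP_ITERATIONS

instruction_count = sum(counts.values())
single_cycle_ns = instruction_count * SINGLE_CYCLE_NS

print(counts)
print(f"IC = {instruction_count}, single cycle CPU time = {single_cycle_ns} ns")
```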

How does this help?

What is the CPU time for multi-cycle, i.e., how much time does it take to execute this program on the multi-cycle implementation?
Each instruction type takes a different number of cycles. Thus, in this example:

  CPU time = 10 loads * 5 cycles * 2 ns/cycle
           + 45 adds * 4 cycles * 2 ns/cycle
           + 10 stores * 4 cycles * 2 ns/cycle
           + 10 branches * 3 cycles * 2 ns/cycle
           = 600 ns

Multi-cycle is FASTER than single cycle (600 ns vs. 750 ns).

Consider the ratio of the single cycle and multi-cycle CPU times: 750 ns / 600 ns = 1.25.
The multi-cycle implementation is 1.25 times faster than the single cycle one.
Speedup = Slower CPU time / Faster CPU time
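The CPU time sum and the speedup ratio can be reproduced with a short sketch. The function names are hypothetical; the counts, cycle counts, and clock periods are the ones used in this example.

```python
MULTI_CYCLE_NS = 2
SINGLE_CYCLE_NS = 10

CYCLES_PER_INSTR = {"load": 5, "add": 4, "store": 4, "branch": 3}
COUNTS = {"load": 10, "add": 45, "store": 10, "branch": 10}


def multi_cycle_time_ns(counts):
    """Sum over instruction types of count * cycles * clock period."""
    return sum(n * CYCLES_PER_INSTR[kind] * MULTI_CYCLE_NS for kind, n in counts.items())


def single_cycle_time_ns(counts):
    """Every instruction takes exactly one long cycle."""
    return sum(counts.values()) * SINGLE_CYCLE_NS


multi_ns = multi_cycle_time_ns(COUNTS)
single_ns = single_cycle_time_ns(COUNTS)
speedup = single_ns / multi_ns  # slower CPU time / faster CPU time

print(f"multi-cycle: {multi_ns} ns, single cycle: {single_ns} ns, speedup: {speedup}")
```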

Consider two programs A, B

A and B are executed on the single cycle and multi-cycle MIPS implementations.

A: 800 adds, 200 branches
  CPU time single cycle = (800 + 200) * 1 cycle per instruction * 10 ns = 10,000 ns
  CPU time multi-cycle = 800 adds * 4 cycles * 2 ns + 200 branches * 3 cycles * 2 ns = 7,600 ns
  Speedup = 10,000 ns / 7,600 ns = 1.32x

B: 100 adds, 800 loads, 100 branches
  CPU time single cycle = (100 + 800 + 100) * 1 cycle per instruction * 10 ns = 10,000 ns
  CPU time multi-cycle = 100 adds * 4 cycles * 2 ns + 800 loads * 5 cycles * 2 ns + 100 branches * 3 cycles * 2 ns = 9,400 ns
  Speedup = 10,000 ns / 9,400 ns = 1.06x

Instruction Mix

The speedups differ between A and B (1.32x vs. 1.06x) because of the different instructions executed.
Instruction mix: the percentage of the total instruction count (IC) corresponding to each instruction type.
  A: 80% arithmetic (add), 20% branches
  B: 10% arithmetic, 80% loads, 10% branches
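Because both programs have the same instruction count, the speedup depends only on the mix. The sketch below computes it from the mix fractions through the average number of cycles per instruction; that intermediate quantity and the name `speedup_from_mix` are additions for illustration, not part of the slides.

```python
CYCLES_PER_INSTR = {"add": 4, "load": 5, "store": 4, "branch": 3}


def speedup_from_mix(mix, multi_ns=2, single_ns=10):
    """Speedup of multi-cycle over single cycle for a given instruction mix.

    mix maps instruction type to its fraction of the instruction count.
    """
    avg_cycles = sum(frac * CYCLES_PER_INSTR[kind] for kind, frac in mix.items())
    return single_ns / (avg_cycles * multi_ns)


MIX_A = {"add": 0.8, "branch": 0.2}
MIX_B = {"add": 0.1, "load": 0.8, "branch": 0.1}

speedup_a = speedup_from_mix(MIX_A)
speedup_b = speedup_from_mix(MIX_B)

print(f"A: {speedup_a:.2f}x, B: {speedup_b:.2f}x")
```

A load-heavy mix such as B pushes the average toward 5 cycles per instruction, which is why its speedup stays close to 1.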