Metrics How to improve performance? CPI MIPS Benchmarks CSC3501 S07 CSC3501 S07. Louisiana State University 4- Performance - 1


Performance of Computer Systems
Dr. Arjan Durresi
Louisiana State University, Baton Rouge, LA 70810
Durresi@Csc.LSU.Edu
These slides are available at: http://www.csc.lsu.edu/~durresi/csc3501_07/
Louisiana State University 4- Performance - 1

Overview
-- Metrics
-- How to improve performance?
-- CPI
-- MIPS
-- Benchmarks

The Design Process
"To Design Is To Represent"
Design activity yields description/representation of an object
-- Traditional craftsman does not distinguish between the conceptualization and the artifact
-- Separation comes about because of complexity
-- The concept is captured in one or more representation languages
-- This process IS design
Design Begins With Requirements
-- Functional Capabilities: what it will do
-- Performance Characteristics: Speed, Power, Area, Cost, ...

Design Process (cont.)
Design Finishes As Assembly
-- Design understood in terms of components and how they have been assembled
-- Top-down decomposition of complex functions (behaviors) into more primitive functions
-- Bottom-up composition of primitive building blocks into more complex assemblies
[Figure: design hierarchy showing CPU, Datapath, Control, ALU, Regs, Shifter, Nand Gate]
Design is a "creative process," not a simple method

Design Refinement
Refinement proceeds with an increasing level of detail:
-- Informal System Requirement
-- Initial Specification
-- Intermediate Specification
-- Final Architectural Description
-- Intermediate Specification of Implementation
-- Final Internal Specification
-- Physical Implementation

Design as Search
[Figure: search tree showing Problem A, Strategy 1, Strategy 2, SubProb 1, SubProb 2, SubProb 3, BB1, BB2, BB3, ..., BBn]
Design involves educated guesses and verification
-- Given the goals, how should these be prioritized?
-- Given alternative design pieces, which should be selected?
-- Given design space of components & assemblies, which part will yield the best solution?
Feasible (good) choices vs. Optimal choices

Measurement and Evaluation
Architecture is an iterative process
-- searching the space of possible designs
-- at all levels of computer systems
[Figure: loop of Design, Analysis, and Creativity; Cost / Performance Analysis sorts Good Ideas, Mediocre Ideas, and Bad Ideas]

Performance
-- Measure, Report, and Summarize
-- Make intelligent choices
-- See through the marketing hype
-- Key to understanding underlying organizational motivation
Why is some hardware better than others for different programs?
What factors of system performance are hardware related? (e.g., Do we need a new machine, or a new operating system?)
How does the machine's instruction set affect performance?

Performance Metrics
Purchasing perspective: given a collection of machines, which has the
-- best performance?
-- least cost?
-- best cost/performance?
Design perspective: faced with design options, which has the
-- best performance improvement?
-- least cost?
-- best cost/performance?
Both require
-- a basis for comparison
-- a metric for evaluation
Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors

Which of these airplanes has the best performance?

Airplane            Passengers   Range (mi)   Speed (mph)   Passenger throughput (passengers x mph)
Boeing 737-100      101          630          598           60,398
Boeing 747          470          4150         610           286,700
BAC/Sud Concorde    132          4000         1350          178,200
Douglas DC-8-50     146          8720         544           79,424

How much faster is the Concorde compared to the 747?
How much bigger is the 747 than the Douglas DC-8?

Computer Performance: Basic Metrics
Response Time (latency)
-- How long does it take for my job to run?
-- How long does it take to execute a job?
-- How long must I wait for the database query?
Throughput
-- How many jobs can the machine run at once?
-- What is the average execution rate?
-- How much work is getting done?
Example: Car assembly factory
-- 4 hours to produce a car (response time)
-- 6 cars produced per hour (throughput)
If we upgrade a machine with a new processor, what do we increase?
If we add a new machine to the lab, what do we increase?

Computer Performance: Introduction
The computer user is interested in response time (or execution time): the time between the start and completion of a given task (program).
The manager of a data processing center is interested in throughput: the total amount of work done in a given time.
The computer user wants response time to decrease, while the manager wants throughput to increase.
The main factors influencing the performance of a computer system are:
-- processor and memory,
-- input/output controllers and peripherals,
-- compilers, and
-- operating system.

Execution Time
Elapsed Time
-- counts everything (disk and memory accesses, I/O, etc.)
-- a useful number, but often not good for comparison purposes
CPU time
-- doesn't count I/O or time spent running other programs
-- can be broken up into system time and user time
Our focus: user CPU time
-- time spent executing the lines of code that are "in" our program
CPU time is a true measure of processor/memory performance.
Performance of processor/memory = 1 / CPU_time

Book's Definition of Performance
For some program running on machine X,
Performance_X = 1 / Execution time_X
"X is n times faster than Y" means
Performance_X / Performance_Y = n
Problem:
-- machine A runs a program in 20 seconds
-- machine B runs the same program in 25 seconds
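As a quick check of the definition above, the 20 second versus 25 second problem can be worked in a few lines. This is a minimal sketch; the function names `performance` and `speedup` are illustrative and do not come from the slides.

```python
def performance(execution_time_s: float) -> float:
    """Performance is defined as the reciprocal of execution time."""
    return 1.0 / execution_time_s


def speedup(time_x_s: float, time_y_s: float) -> float:
    """Return n such that 'X is n times faster than Y'."""
    return performance(time_x_s) / performance(time_y_s)


# Machine A runs the program in 20 s, machine B in 25 s (values from the problem).
n = speedup(20.0, 25.0)
print(f"A is {n:.2f} times faster than B")
```

Because performance is a reciprocal, the ratio reduces to 25 / 20, so the slower machine's time goes in the numerator.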

Analysis of CPU Time
CPU time depends on the program being executed, including:
-- the number of instructions executed,
-- the types of instructions executed and their frequency of usage.
Computers are constructed in such a way that events in hardware are synchronized using a clock.
Clock rate is given in Hz (= 1/sec).
A clock rate defines the duration of discrete time intervals called clock cycle times or clock cycle periods:
clock_cycle_time = 1 / clock_rate (in sec)
Thus, when we refer to different instruction types (from a performance point of view), we are referring to instructions that need different numbers of clock cycles to execute.

Clock Cycles
Instead of reporting execution time in seconds, we often use cycles:
\[ \text{CPU time} = \frac{\text{seconds}}{\text{program}} = \frac{\text{cycles}}{\text{program}} \times \frac{\text{seconds}}{\text{cycle}} \]
Clock "ticks" indicate when to start activities (one abstraction).
cycle time = time between ticks = seconds per cycle
clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec)
A 4 GHz clock has a cycle time of
\[ \frac{1}{4 \times 10^{9}} \times 10^{12} = 250 \text{ picoseconds (ps)} \]
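The relation clock_cycle_time = 1 / clock_rate is easy to tabulate. The helper below is a small sketch with a hypothetical name (`cycle_time_ps`), and it reports the period in picoseconds.

```python
def cycle_time_ps(clock_rate_hz: float) -> float:
    """Clock cycle time in picoseconds for a clock rate given in Hz."""
    return 1e12 / clock_rate_hz


# A few clock rates, including the 4 GHz clock discussed above.
for rate_hz in (100e6, 500e6, 1e9, 4e9):
    print(f"{rate_hz / 1e6:7.0f} MHz -> {cycle_time_ps(rate_hz):8.1f} ps")
```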

Clock Cycles (cont.)
Clock rate (MHz, GHz) is the inverse of clock cycle time (clock period): CC = 1 / CR
-- 10 nsec clock cycle => 100 MHz clock rate
-- 5 nsec clock cycle => 200 MHz clock rate
-- 2 nsec clock cycle => 500 MHz clock rate
-- 1 nsec clock cycle => 1 GHz clock rate
-- 500 psec clock cycle => 2 GHz clock rate
-- 250 psec clock cycle => 4 GHz clock rate
-- 200 psec clock cycle => 5 GHz clock rate

How to Improve Performance
\[ \frac{\text{seconds}}{\text{program}} = \frac{\text{cycles}}{\text{program}} \times \frac{\text{seconds}}{\text{cycle}} \]
So, to improve performance (everything else being equal) you can either (increase or decrease?)
-- the # of required cycles for a program, or
-- the clock cycle time or, said another way,
-- the clock rate.

Example
Our favorite program runs in 10 seconds on computer A, which has a 4 GHz clock. We are trying to help a computer designer build a new machine B that will run this program in 6 seconds. The designer can use new (or perhaps more expensive) technology to substantially increase the clock rate, but has informed us that this increase will affect the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A for the same program. What clock rate should we tell the designer to target?
Don't panic: this can easily be worked out from basic principles.

Example
-- A program runs in 10 s on computer A, at 4 GHz
-- How do we build a computer B that runs this program in 6 s?
-- The designer has determined that increasing the clock rate will cause computer B to require 1.2 times as many clock cycles as A
-- What clock rate should be used in computer B?
\[ \text{CPU time}_A = \frac{\text{CPU clock cycles}_A}{\text{Clock rate}_A} \quad\Rightarrow\quad 10\ \text{s} = \frac{\text{CPU clock cycles}_A}{4 \times 10^{9}\ \text{cycles/second}} \]
\[ \text{CPU clock cycles}_A = 40 \times 10^{9}\ \text{cycles} \]
\[ \text{CPU time}_B = \frac{1.2 \times \text{CPU clock cycles}_A}{\text{Clock rate}_B} \quad\Rightarrow\quad 6\ \text{s} = \frac{1.2 \times 40 \times 10^{9}\ \text{cycles}}{\text{Clock rate}_B} \]
\[ \text{Clock rate}_B = 8\ \text{GHz} \]
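The same derivation can be written as a short calculation. This is only a sketch of the arithmetic above; the function and parameter names are made up for illustration.

```python
def required_clock_rate_hz(time_a_s: float, clock_rate_a_hz: float,
                           cycle_factor: float, target_time_b_s: float) -> float:
    """Clock rate machine B needs to hit its target time.

    cycle_factor is how many times as many cycles B needs compared with A.
    """
    cycles_a = time_a_s * clock_rate_a_hz   # cycles = seconds x cycles/second
    cycles_b = cycle_factor * cycles_a      # B needs more cycles for the same program
    return cycles_b / target_time_b_s       # cycles/second needed to finish in time


# Machine A: 10 s at 4 GHz; machine B: 1.2x the cycles, target 6 s.
rate_b = required_clock_rate_hz(10.0, 4e9, 1.2, 6.0)
print(f"Machine B needs a {rate_b / 1e9:.1f} GHz clock")
```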

Measuring Time using Clock Cycles
CPU execution time for a program = Clock Cycles for the program x Clock Cycle Time
One way to define clock cycles:
Clock Cycles for a program = Instructions for the program (called "Instruction Count") x Average Clock cycles Per Instruction (called "CPI")
CPI, the average number of clock cycles per instruction, is an important parameter:
CPI = Clock_cycles_for_a_program / Instruction_count
Instruction_count is the number of instructions executed

Performance Calculation
CPU execution time for a program = Clock Cycles for the program x Clock Cycle Time
Substituting for clock cycles:
CPU execution time for a program = (Instruction Count x CPI) x Clock Cycle Time = Instruction Count x CPI x Clock Cycle Time
\[ \text{CPU time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} = \frac{\text{Seconds}}{\text{Program}} \]

How to Calculate the 3 Components?
Clock Cycle Time: in the specification of the computer (Clock Rate in advertisements)
Instruction Count:
-- Count instructions in the loop of a small program
-- Use a simulator to count instructions
-- Hardware counter in a special register (most CPUs)
CPI = Clock_cycles_for_a_program / Instruction_count
-- Calculate: (Execution Time / Clock cycle time) / Instruction_count
-- Hardware counter in a special register (most CPUs)

How many cycles are required for a program?
Could assume that the number of cycles equals the number of instructions
[Figure: timeline with one cycle each for the 1st, 2nd, 3rd, 4th, 5th, 6th, ... instruction]
This assumption is incorrect: different instructions take different amounts of time on different machines.
Why? Hint: remember that these are machine instructions, not lines of C code

Different numbers of cycles for different instructions
[Figure: timeline with instructions of different lengths]
-- Multiplication takes more time than addition
-- Floating point operations take longer than integer ones
-- Accessing memory takes more time than accessing registers
Important point: changing the cycle time often changes the number of cycles required for various instructions (more later)

Phases in Instruction Execution
We can divide the execution of an instruction into the following five stages:
-- Instruction fetch
-- Instruction decode and register fetch
-- Execution, effective address or branch calculation
-- Memory access (for lw and sw instructions only)
-- Register write back (for ALU and lw instructions)

Sequential Execution of 3 LW Instructions
The following delays are assumed: Memory access = 2 nsec, ALU operation = 2 nsec, Register file access = 1 nsec.
Program execution order:
lw r1, 100(r0)    IF Reg ALU MEM Reg    8 ns
lw r2, 200(r0)    IF Reg ALU MEM Reg    8 ns
lw r2, 200(r0)                          8 ns
Every lw instruction needs 8 nsec to execute.
In this course, we are designing a processor that executes instructions sequentially.

Now that we understand cycles
A given program will require
-- some number of instructions (machine instructions)
-- some number of cycles
-- some number of seconds
We have a vocabulary that relates these quantities:
-- cycle time (seconds per cycle)
-- clock rate (cycles per second)
-- CPI (cycles per instruction): a floating point intensive application might have a higher CPI
-- MIPS (millions of instructions per second): this would be higher for a program using simple instructions

Performance
Performance is determined by execution time
Do any of the other variables equal performance?
-- # of cycles to execute program?
-- # of instructions in program?
-- # of cycles per second?
-- average # of cycles per instruction?
-- average # of instructions per second?
Common pitfall: thinking one of the variables is indicative of performance when it really isn't.

CPI Example
Suppose we have two implementations of the same instruction set architecture (ISA). For some program,
-- Machine A has a clock cycle time of 250 ps and a CPI of 2.0
-- Machine B has a clock cycle time of 500 ps and a CPI of 1.2
Which machine is faster for this program, and by how much?
If two machines have the same ISA, which of our quantities (e.g., clock rate, CPI, execution time, # of instructions, MIPS) will always be identical?
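One way to approach the two-machine CPI comparison is to compute CPU time = instruction count x CPI x clock cycle time for each machine. The sketch below assumes an arbitrary instruction count (`INSTRUCTION_COUNT` is a made-up value); it is the same for both machines here and cancels in the ratio.

```python
def cpu_time_ps(instruction_count: int, cpi: float, cycle_time_ps: float) -> float:
    """CPU time in picoseconds: instructions x cycles/instruction x ps/cycle."""
    return instruction_count * cpi * cycle_time_ps


INSTRUCTION_COUNT = 1_000_000  # hypothetical; any positive value gives the same ratio

time_a = cpu_time_ps(INSTRUCTION_COUNT, cpi=2.0, cycle_time_ps=250)  # machine A
time_b = cpu_time_ps(INSTRUCTION_COUNT, cpi=1.2, cycle_time_ps=500)  # machine B

ratio = time_b / time_a
print(f"time_B / time_A = {ratio:.2f}")  # a ratio above 1 means A is faster
```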

CPI Example
CPU clock cycles_A = I x 2.0 ; CPU clock cycles_B = I x 1.2
CPU time_A = CPU clock cycles_A x Clock cycle time_A = I x 2.0 x 250 ps = I x 500 ps
CPU time_B = CPU clock cycles_B x Clock cycle time_B = I x 1.2 x 500 ps = I x 600 ps
CPU time_B = 1.2 x CPU time_A, so machine A is 1.2 times faster for this program.
\[ \text{CPU time} = \frac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}} \]

CPI
\[ \text{CPU clock cycles} = \sum_{i} \left( \text{CPI}_i \times C_i \right) \]
C_i is the count of the number of instructions of class i,
CPI_i is the average number of cycles per instruction for that class.

Computer Performance
\[ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]

                 Inst Count   CPI   Clock Rate
Program          X
Compiler         X            (X)
Inst. Set        X            X
Organization                  X     X
Technology                          X

# of Instructions Example
A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, and they require one, two, and three cycles (respectively).
-- The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C
-- The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.
Which sequence will be faster? How much? What is the CPI for each sequence?
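The two code sequences can be compared by summing cycles per class and dividing by the instruction count. The following is a minimal sketch; the dictionary layout and function names are illustrative choices, not part of the slides.

```python
# Cycles required by each instruction class (from the example statement).
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}


def total_cycles(counts: dict) -> int:
    """Sum of CPI_i x C_i over all instruction classes."""
    return sum(CYCLES_PER_CLASS[cls] * n for cls, n in counts.items())


def cpi(counts: dict) -> float:
    """Average cycles per instruction for a code sequence."""
    return total_cycles(counts) / sum(counts.values())


seq1 = {"A": 2, "B": 1, "C": 2}  # 5 instructions
seq2 = {"A": 4, "B": 1, "C": 1}  # 6 instructions

for name, seq in (("sequence 1", seq1), ("sequence 2", seq2)):
    print(f"{name}: {total_cycles(seq)} cycles, CPI = {cpi(seq)}")
```

Note that the sequence with more instructions needs fewer cycles, which is why instruction count alone is not a performance measure.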

# of Instructions Example
\[ \text{CPU clock cycles}_1 = \sum_i (\text{CPI}_i \times C_i) = (2 \times 1) + (1 \times 2) + (2 \times 3) = 10\ \text{cycles} \]
\[ \text{CPU clock cycles}_2 = \sum_i (\text{CPI}_i \times C_i) = (4 \times 1) + (1 \times 2) + (1 \times 3) = 9\ \text{cycles} \]
CPI_1 = 10/5 = 2
CPI_2 = 9/6 = 1.5
When comparing, all three factors should be considered: clock rate, number of instructions, and CPI

CPU Time: Example 1
Consider an implementation of the MIPS ISA with a 500 MHz clock, in which
-- each ALU instruction takes 3 clock cycles,
-- each branch/jump instruction takes 2 clock cycles,
-- each sw instruction takes 4 clock cycles,
-- each lw instruction takes 5 clock cycles.
Also, consider a program that during its execution executes:
-- x = 200 million ALU instructions
-- y = 55 million branch/jump instructions
-- z = 25 million sw instructions
-- w = 20 million lw instructions
Find the CPU time.

CPU Time: Example 1 (continued)
Approach 1:
Clock cycles for a program = (x x 3 + y x 2 + z x 4 + w x 5) = 910 x 10^6 clock cycles
CPU_time = Clock cycles for a program / Clock rate = 910 x 10^6 / (500 x 10^6) = 1.82 sec
Approach 2:
CPI = Clock cycles for a program / Instruction count
CPI = (x x 3 + y x 2 + z x 4 + w x 5) / (x + y + z + w) = 3.03 clock cycles/instruction
CPU time = Instruction count x CPI / Clock rate = (x + y + z + w) x 3.03 / (500 x 10^6) = 300 x 10^6 x 3.03 / (500 x 10^6) = 1.82 sec

CPU Time: Example 2
Consider another implementation of the MIPS ISA with a 1 GHz clock, in which
-- each ALU instruction takes 4 clock cycles,
-- each branch/jump instruction takes 3 clock cycles,
-- each sw instruction takes 5 clock cycles,
-- each lw instruction takes 6 clock cycles.
Also, consider the same program as in Example 1. Find the CPI and the CPU time.
CPI = (x x 4 + y x 3 + z x 5 + w x 6) / (x + y + z + w) = 4.03 clock cycles/instruction
CPU time = Instruction count x CPI / Clock rate = (x + y + z + w) x 4.03 / (1000 x 10^6) = 300 x 10^6 x 4.03 / (1000 x 10^6) = 1.21 sec
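Both CPU time examples follow the same recipe, so they can be checked with one helper. This is a sketch under the numbers given in the two examples; the names `PROGRAM`, `IMPL_1`, `IMPL_2`, and `cpi_and_cpu_time` are invented for illustration.

```python
# Dynamic instruction counts of the program (same program in both examples).
PROGRAM = {"alu": 200_000_000, "branch": 55_000_000, "sw": 25_000_000, "lw": 20_000_000}

# Cycles per instruction type for each implementation.
IMPL_1 = {"alu": 3, "branch": 2, "sw": 4, "lw": 5}  # 500 MHz clock
IMPL_2 = {"alu": 4, "branch": 3, "sw": 5, "lw": 6}  # 1 GHz clock


def cpi_and_cpu_time(counts: dict, cycles: dict, clock_rate_hz: float):
    """Return (CPI, CPU time in seconds) for a program on one implementation."""
    total_cycles = sum(counts[kind] * cycles[kind] for kind in counts)
    instruction_count = sum(counts.values())
    return total_cycles / instruction_count, total_cycles / clock_rate_hz


cpi_1, time_1 = cpi_and_cpu_time(PROGRAM, IMPL_1, 500e6)
cpi_2, time_2 = cpi_and_cpu_time(PROGRAM, IMPL_2, 1e9)
print(f"Example 1: CPI = {cpi_1:.2f}, CPU time = {time_1:.2f} s")
print(f"Example 2: CPI = {cpi_2:.2f}, CPU time = {time_2:.2f} s")
```

The second implementation has the higher CPI but the shorter CPU time, because its clock rate doubles.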

Analysis of CPU Performance Equation
CPU time = Instruction count x CPI / Clock rate
How to improve (i.e. decrease) CPU time:
-- Clock rate: hardware technology & organization,
-- CPI: organization, ISA and compiler technology,
-- Instruction count: ISA & compiler technology.
Many potential performance improvement techniques primarily improve one component with small or predictable impact on the other two.

Calculating Components of CPU time
For an existing processor it is easy to obtain the CPU time (i.e. the execution time) by measurement, and the clock rate is known. But it is difficult to figure out the instruction count or CPI.
Newer processors (the MIPS64 processor is one example) include counters for instructions executed and for clock cycles. Those can be helpful to programmers trying to understand and tune the performance of an application.
Also, different simulation techniques and queuing theory could be used to obtain values for the components of the execution (CPU) time.

Attempting to Calculate CPI
The table below indicates the frequency of all instruction types executed in a "typical" program and, from the reference manual, we are provided with the number of cycles per instruction for each type.

Instruction Type      Frequency   Cycles
ALU instruction       50%         4
Load instruction      30%         5
Store instruction     5%          4
Branch instruction    15%         2

CPI = 0.5*4 + 0.3*5 + 0.05*4 + 0.15*2 = 4 cycles/instruction
The calculation may not necessarily be correct, since the given cycles-per-instruction numbers don't account for pipeline effects.

A Simple Example

Op       Freq   CPI_i   Freq x CPI_i
ALU      50%    1
Load     20%    5
Store    10%    3
Branch   20%    2
                        Σ =

How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
How does this compare with using branch prediction to shave a cycle off the branch time?
What if two ALU instructions could be executed at once?
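The weighted-CPI formula and the three what-if questions above can be explored with a short script. This is a sketch, not the slides' own method: it assumes that "two ALU instructions at once" can be modelled as halving the effective ALU CPI to 0.5, and the names `weighted_cpi` and `with_change` are invented for illustration.

```python
def weighted_cpi(mix: dict) -> float:
    """CPI as the sum of frequency x cycles over all instruction types."""
    return sum(freq * cycles for freq, cycles in mix.values())


def with_change(mix: dict, op: str, new_cycles: float) -> dict:
    """Copy of the mix with the cycle count of one instruction type replaced."""
    changed = dict(mix)
    changed[op] = (mix[op][0], new_cycles)
    return changed


# First table: the "typical" program mix, as (frequency, cycles).
typical = {"alu": (0.50, 4), "load": (0.30, 5), "store": (0.05, 4), "branch": (0.15, 2)}
print(f"typical mix CPI = {weighted_cpi(typical):.2f}")

# Second table: the simple example and its three alternatives.
base = {"alu": (0.5, 1), "load": (0.2, 5), "store": (0.1, 3), "branch": (0.2, 2)}
alternatives = {
    "better data cache (load = 2 cycles)": with_change(base, "load", 2),
    "branch prediction (branch = 1 cycle)": with_change(base, "branch", 1),
    "two ALU ops at once (ALU = 0.5 cycle, assumed)": with_change(base, "alu", 0.5),
}

base_cpi = weighted_cpi(base)
for label, mix in alternatives.items():
    new_cpi = weighted_cpi(mix)
    # Instruction count and cycle time are unchanged, so speedup = old CPI / new CPI.
    print(f"{label}: CPI = {new_cpi:.2f}, speedup = {base_cpi / new_cpi:.3f}")
```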

A Simple Example

Op       Freq   CPI_i   Freq x CPI_i   Better data cache   Branch prediction   Two ALU at once
ALU      50%    1       .5             .5                  .5                  .25
Load     20%    5       1.0            .4                  1.0                 1.0
Store    10%    3       .3             .3                  .3                  .3
Branch   20%    2       .4             .4                  .2                  .4
                        Σ = 2.2        1.6                 2.0                 1.95

How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
CPU time new = 1.6 x IC x CC, so 2.2/1.6 means 37.5% faster
How does this compare with using branch prediction to shave a cycle off the branch time?
CPU time new = 2.0 x IC x CC, so 2.2/2.0 means 10% faster
What if two ALU instructions could be executed at once?
CPU time new = 1.95 x IC x CC, so 2.2/1.95 means 12.8% faster

Pipelining: It's Natural!
Laundry Example
Ann, Brian, Cathy, and Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold
-- Washer takes 30 minutes
-- Dryer takes 40 minutes
-- "Folder" takes 20 minutes

Sequential Laundry
[Figure: timeline from 6 PM to midnight; task order A, B, C, D; each load takes 30 + 40 + 20 minutes, one load after another]
Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM; task order A, B, C, D; stage times 30, 40, 40, 40, 40, 20 minutes along the time axis]
Pipelined laundry takes 3.5 hours for 4 loads
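The 6 hour and 3.5 hour figures can be reproduced with a small simulation of the three laundry stages. This is a sketch under one assumption that the slides do not state: a finished load may wait between stages until the next machine is free.

```python
STAGE_MINUTES = [30, 40, 20]  # washer, dryer, folder


def sequential_minutes(loads: int) -> int:
    """Each load finishes all three stages before the next one starts."""
    return loads * sum(STAGE_MINUTES)


def pipelined_minutes(loads: int) -> int:
    """Each load enters a stage as soon as both the load and the stage are ready."""
    stage_free_at = [0] * len(STAGE_MINUTES)  # time at which each stage becomes idle
    finish = 0
    for _ in range(loads):
        ready = 0  # every load is available at time 0
        for i, duration in enumerate(STAGE_MINUTES):
            start = max(ready, stage_free_at[i])
            ready = start + duration
            stage_free_at[i] = ready
        finish = ready
    return finish


print(f"sequential: {sequential_minutes(4) / 60} hours")
print(f"pipelined:  {pipelined_minutes(4) / 60} hours")
```

In the pipelined case the 40 minute dryer sets the pace: after the first wash, a load finishes drying every 40 minutes, followed by one final fold.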

Pipelining Lessons
[Figure: pipelined laundry timeline from 6 PM; task order A, B, C, D; stage times 30, 40, 40, 40, 40, 20 minutes]
-- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
-- Pipeline rate is limited by the slowest pipeline stage
-- Multiple tasks operate simultaneously
-- Potential speedup = Number of pipe stages
-- Unbalanced lengths of pipe stages reduce speedup
-- Time to fill the pipeline and time to drain it reduce speedup

Computer Pipelines
Execute billions of instructions, so throughput is what matters
What is desirable in instruction sets for pipelining?
-- Variable length instructions vs. all instructions the same length?
-- Memory operands part of any operation vs. memory operands only in loads or stores?
-- Register operands in many places in the instruction format vs. registers located in the same place?

Pipeline Executing 3 LW Instructions
Assume the same delays as in the sequential case and a pipelined processor with a clock cycle time of 2 nsec.
lw r1, 100(r0)
lw r2, 200(r0)
lw r2, 200(r0)
Note that registers are written during the first part of a cycle and read during the second part of the same cycle.
Pipelining doesn't help to execute a single instruction faster; it may improve performance by increasing instruction throughput.

MIPS
One alternative to time is the metric MIPS (Million Instructions per Second)
\[ \text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^{6}} \]
-- MIPS does not take into account the capabilities of instructions
-- MIPS varies among programs on the same computer
-- MIPS can vary inversely with performance
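For the three-lw pipeline example above, the total time can be estimated and compared with the 8 nsec per instruction of the sequential case. This is a rough sketch that assumes an ideal five-stage pipeline with no stalls, so that n instructions finish after (stages + n - 1) clock cycles; the function names are illustrative.

```python
# Stage delays in nsec from the sequential example
# (memory access 2, register file access 1, ALU operation 2).
STAGE_NS = {"IF": 2, "Reg read": 1, "ALU": 2, "MEM": 2, "Reg write": 1}


def sequential_ns(instructions: int) -> int:
    """Each instruction runs start to finish before the next one begins."""
    return instructions * sum(STAGE_NS.values())


def pipelined_ns(instructions: int, cycle_ns: int = 2) -> int:
    """Ideal pipeline (assumed: no hazards or stalls): fill once, then one result per cycle."""
    return (len(STAGE_NS) + instructions - 1) * cycle_ns


print(f"sequential: {sequential_ns(3)} ns, pipelined: {pipelined_ns(3)} ns")
print(f"single lw:  {sequential_ns(1)} ns sequential, {pipelined_ns(1)} ns pipelined")
```

A single lw is not faster in the pipeline, since every stage is stretched to the 2 nsec clock; the gain comes from overlapping instructions.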

MIPS example
Two different compilers are being tested for a 4 GHz machine with three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles (respectively). Both compilers are used to produce code for a large piece of software.
The first compiler's code uses 5 billion Class A instructions, 1 billion Class B instructions, and 1 billion Class C instructions.
The second compiler's code uses 10 billion Class A instructions, 1 billion Class B instructions, and 1 billion Class C instructions.
-- Which sequence will be faster according to MIPS?
-- Which sequence will be faster according to execution time?

MIPS example
\[ \text{Execution time} = \frac{\text{CPU clock cycles}}{\text{Clock rate}} \]
\[ \text{CPU clock cycles}_1 = \sum_i (\text{CPI}_i \times C_i) = ((5 \times 1) + (1 \times 2) + (1 \times 3)) \times 10^{9} = 10 \times 10^{9} \]
\[ \text{CPU clock cycles}_2 = \sum_i (\text{CPI}_i \times C_i) = ((10 \times 1) + (1 \times 2) + (1 \times 3)) \times 10^{9} = 15 \times 10^{9} \]
Execution time_1 = 2.5 seconds
Execution time_2 = 3.75 seconds
\[ \text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^{6}} \]
MIPS_1 = 2800
MIPS_2 = 3200
The second compiler has the higher MIPS rating, but the first compiler's code has the shorter execution time.
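The MIPS example can be recomputed directly. The sketch below uses instruction counts in billions, which is what the 10^9 factors and the 2.5 s and 3.75 s execution times in the worked solution imply; the variable and function names are illustrative.

```python
CLOCK_RATE_HZ = 4e9
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class for the code produced by each compiler.
compiler_1 = {"A": 5_000_000_000, "B": 1_000_000_000, "C": 1_000_000_000}
compiler_2 = {"A": 10_000_000_000, "B": 1_000_000_000, "C": 1_000_000_000}


def execution_time_s(counts: dict) -> float:
    """Execution time = CPU clock cycles / clock rate."""
    cycles = sum(CYCLES_PER_CLASS[cls] * n for cls, n in counts.items())
    return cycles / CLOCK_RATE_HZ


def mips(counts: dict) -> float:
    """MIPS = instruction count / (execution time x 10^6)."""
    return sum(counts.values()) / (execution_time_s(counts) * 1e6)


for name, counts in (("compiler 1", compiler_1), ("compiler 2", compiler_2)):
    print(f"{name}: {execution_time_s(counts):.2f} s, {mips(counts):.0f} MIPS")
```

The two metrics disagree here: the code with the higher MIPS rating takes longer to run, which illustrates that MIPS can vary inversely with performance.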

Quantitative Performance Measures
Another popular, misleading and essentially useless measure was peak MIPS. That is a MIPS rating obtained using an instruction mix that minimizes the CPI, even if that instruction mix is totally impractical.
Computer manufacturers still occasionally announce products using peak MIPS as a metric, often neglecting to include the word "peak".
Another popular alternative to execution time was million floating point operations per second (MFLOPS):
\[ \text{MFLOPS} = \frac{\text{Number of floating point operations in a program}}{\text{Execution time} \times 10^{6}} \]
Because it is based on operations in the program rather than on instructions, MFLOPS has a stronger claim than MIPS to being a fair comparison between different machines.
MFLOPS is not applicable outside floating-point performance.

Benchmarks
Performance is best determined by running a real application
-- Use programs typical of the expected workload
-- Or, typical of the expected class of applications, e.g., compilers/editors, scientific applications, graphics, etc.
Small benchmarks
-- nice for architects and designers
-- easy to standardize
-- can be abused
SPEC (System Performance Evaluation Cooperative) was founded in the late 1980s
-- companies have agreed on a set of real programs and inputs
-- valuable indicator of performance (and compiler technology)
-- can still be abused

SPEC Benchmark Suites
The SPEC benchmarks are real programs, modified for portability and to minimize the role of I/O in overall benchmark performance. Example: Optimizer GNU C compiler.
-- SPEC89, introduced in 1989, had 4 integer programs and 6 floating point programs and provided a single "SPECmarks" score.
-- SPEC92 had 5 integer programs and 14 floating point programs, and provided SPECint92 and SPECfp92.
-- SPEC95 provided SPECint_base95 and SPECfp_base95.
-- SPEC CPU2000 has 12 integer benchmarks and 14 floating point benchmarks, and provides CINT2000 and CFP2000.

Benchmark Games
"An embarrassed Intel Corp. acknowledged Friday that a bug in a software program known as a compiler had led the company to overstate the speed of its microprocessor chips on an industry benchmark by 10 percent. However, industry analysts said the coding error ... was a sad commentary on a common industry practice of cheating on standardized performance tests ... The error was pointed out to Intel two days ago by a competitor, Motorola ... came in a test known as SPECint92 ... Intel acknowledged that it had optimized its compiler to improve its test scores. The company had also said that it did not like the practice but felt compelled to make the optimizations because its competitors were doing the same thing ... At the heart of Intel's problem is the practice of tuning compiler programs to recognize certain computing problems in the test and then substituting special handwritten pieces of code ..."
(New York Times, Saturday, January 6, 1996)

SPEC 89: Compiler enhancements and performance

[Figure: bar chart of SPEC performance ratio (0 to 800) for the benchmarks gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, and tomcatv, comparing "Compiler" with "Enhanced compiler".]

SPEC 2000

Does doubling the clock rate double the performance?
Can a machine with a slower clock rate have better performance?

[Figure: CINT2000 and CFP2000 ratings of the Pentium III and Pentium 4 plotted against clock rate in MHz (500 to 3500).]
[Figure: relative SPECINT2000 and SPECFP2000 performance of the Pentium M @ 1.6/0.6 GHz, Pentium 4-M @ 2.4/1.2 GHz, and Pentium III-M @ 1.2/0.8 GHz in three power modes: always on/maximum clock, laptop mode/adaptive clock, and minimum power/minimum clock.]

SPEC 2000

Ratio                   Pentium III   Pentium 4
CINT2000/Clock [MHz]    0.47          0.36
CFP2000/Clock [MHz]     0.34          0.39

For CINT2000, the CPI of the Pentium 4 is 1.3 times that of the Pentium III (0.47/0.36).
How come these numbers are reversed for CFP?
  The Pentium 4 provides a new set of instructions (Streaming SIMD)
  So both CPI and instruction count are different

Performance Example

We are interested in two implementations of two similar but still different ISAs, one with and one without special real number instructions. Both machines have a 1000 MHz clock.

Machine With Floating Point Hardware (MFP) implements real number operations directly, with the following characteristics:
  real number multiply instruction requires 6 clock cycles
  real number add instruction requires 4 clock cycles
  real number divide instruction requires 20 clock cycles
  any other instruction (including integer instructions) requires 2 clock cycles

Performance Example

Machine with No Floating Point Hardware (MNFP) does not support real number instructions, but all its instructions are identical to the non-real-number instructions of MFP. Each MNFP instruction (including integer instructions) takes 2 clock cycles. Thus, MNFP is identical to MFP without real number instructions.

Any real number operation in a program has to be emulated by an appropriate software subroutine (i.e., the compiler has to insert an appropriate sequence of integer instructions for each real number operation). The number of integer instructions needed to implement each real number operation is as follows:
  real number multiply needs 30 integer instructions
  real number add needs 20 integer instructions
  real number divide needs 50 integer instructions

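Performance Example

Consider Program P with the following mix of operations:
  real number multiply   10%
  real number add        15%
  real number divide      5%
  other instructions     70%

A. Find the MIPS rating for both machines.

$$\text{CPI}_{\text{MFP}} = 0.1 \times 6 + 0.15 \times 4 + 0.05 \times 20 + 0.7 \times 2 = 3.6 \text{ clocks/instr}$$
$$\text{CPI}_{\text{MNFP}} = 2$$
$$\text{MIPS}_{\text{MFP}} = \frac{\text{clock rate}}{\text{CPI} \times 10^{6}} = \frac{1000 \times 10^{6}}{3.6 \times 10^{6}} = 277.8$$
$$\text{MIPS}_{\text{MNFP}} = \frac{1000 \times 10^{6}}{2 \times 10^{6}} = 500$$

According to the MIPS rating, MNFP is better than MFP!?

Performance Example

B. If Program P on MFP needs 300,000,000 instructions, find the time to execute this program on each machine.

On MNFP each real number operation expands into integer instructions, so the instruction count becomes
$$300 \times 10^{6} \times (0.1 \times 30 + 0.15 \times 20 + 0.05 \times 50 + 0.7 \times 1) = 2760 \times 10^{6}$$

$$\text{CPU\_time}_{\text{MFP}} = \frac{300 \times 10^{6} \times 3.6}{1000 \times 10^{6}} = 1.08 \text{ sec}$$
$$\text{CPU\_time}_{\text{MNFP}} = \frac{2760 \times 10^{6} \times 2}{1000 \times 10^{6}} = 5.52 \text{ sec}$$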
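Parts A and B can be checked with a few lines of Python. This is only a sketch of the arithmetic on the slides; the dictionary and function names are illustrative choices, not part of the original example.

```python
# Instruction mix of Program P as fractions of the MFP instruction stream.
MIX = {"fp_mul": 0.10, "fp_add": 0.15, "fp_div": 0.05, "other": 0.70}

# Clock cycles per instruction class on MFP.
MFP_CYCLES = {"fp_mul": 6, "fp_add": 4, "fp_div": 20, "other": 2}

# Integer instructions MNFP needs to emulate one operation of each class.
MNFP_EXPANSION = {"fp_mul": 30, "fp_add": 20, "fp_div": 50, "other": 1}

MNFP_CPI = 2              # every MNFP instruction takes 2 cycles
CLOCK_HZ = 1000e6         # both machines run at 1000 MHz
MFP_INSTRUCTIONS = 300e6  # instruction count of Program P on MFP


def mips(clock_hz, cpi):
    """MIPS rating = clock rate / (CPI * 10^6)."""
    return clock_hz / (cpi * 1e6)


def cpu_time(instructions, cpi, clock_hz):
    """CPU time in seconds = instruction count * CPI / clock rate."""
    return instructions * cpi / clock_hz


# Weighted-average CPI on MFP.
cpi_mfp = sum(MIX[k] * MFP_CYCLES[k] for k in MIX)

# Emulation inflates the instruction count on MNFP.
mnfp_instructions = MFP_INSTRUCTIONS * sum(MIX[k] * MNFP_EXPANSION[k] for k in MIX)

mips_mfp = mips(CLOCK_HZ, cpi_mfp)
mips_mnfp = mips(CLOCK_HZ, MNFP_CPI)
time_mfp = cpu_time(MFP_INSTRUCTIONS, cpi_mfp, CLOCK_HZ)
time_mnfp = cpu_time(mnfp_instructions, MNFP_CPI, CLOCK_HZ)

print(f"MFP : CPI={cpi_mfp:.1f}  MIPS={mips_mfp:.1f}  time={time_mfp:.2f} s")
print(f"MNFP: CPI={MNFP_CPI}  MIPS={mips_mnfp:.1f}  time={time_mnfp:.2f} s")
```

The machine with the higher MIPS rating (MNFP, 500 against roughly 277.8) is the one that takes about five times longer to run the program, which is the point of the example.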

Performance Example

C. Calculate MFLOPS for both computers.

$$\text{MFLOPS} = \frac{\text{Number of floating-point operations in a program}}{\text{Execution time} \times 10^{6}}$$

Program P contains $0.3 \times 300 \times 10^{6} = 90 \times 10^{6}$ floating-point operations.

$$\text{MFLOPS}_{\text{MFP}} = \frac{90 \times 10^{6}}{1.08 \times 10^{6}} = 83.3$$
$$\text{MFLOPS}_{\text{MNFP}} = \frac{90 \times 10^{6}}{5.52 \times 10^{6}} = 16.3$$

Experiment

Phone a major computer retailer and tell them you are having trouble deciding between two different computers; specifically, you are confused about the processors' strengths and weaknesses (e.g., Pentium 4 at 2 GHz vs. Celeron M at 1.4 GHz).
What kind of response are you likely to get?
What kind of response could you give a friend with the same question?
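For part C of the performance example, the MFLOPS figures follow directly from the definition. The sketch below assumes the execution times from part B (1.08 s and 5.52 s) and counts the 30% of Program P's 300 million operations that are real number operations; the function name is illustrative.

```python
def mflops(fp_operations, execution_time_s):
    """MFLOPS = floating-point operations / (execution time * 10^6)."""
    return fp_operations / (execution_time_s * 1e6)


# 10% multiply + 15% add + 5% divide of 300 million operations.
fp_operations = (0.10 + 0.15 + 0.05) * 300e6

mflops_mfp = mflops(fp_operations, 1.08)   # MFP runs Program P in 1.08 s
mflops_mnfp = mflops(fp_operations, 5.52)  # MNFP runs it in 5.52 s

print(f"MFLOPS MFP  = {mflops_mfp:.1f}")
print(f"MFLOPS MNFP = {mflops_mnfp:.1f}")
```

Unlike the MIPS ratings, the MFLOPS ratings rank the two machines in the same order as their execution times, because both machines are credited with the same number of operations.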


Summarizing Performance

Geometric Mean

$$\text{Geometric mean} = \sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$$

where $\text{Execution time ratio}_i$ is the execution time, normalized to the reference computer, for the $i$-th program of a total of $n$ in the workload, and $\prod_{i=1}^{n} a_i$ means the product $a_1 \times a_2 \times \cdots \times a_n$.

Geometric Mean

The geometric mean is independent of which data series we use for normalization because it has the property

$$\frac{\text{Geometric mean}(X_i)}{\text{Geometric mean}(Y_i)} = \text{Geometric mean}\left(\frac{X_i}{Y_i}\right)$$

The advantage of the geometric mean is that it is independent of the running times of the individual programs, and it doesn't matter which computer is used for normalization.
The drawback to using geometric means of execution times is that they violate our fundamental principle of performance measurement: they do not predict execution time.
The ideal solution is to measure a real workload and weight the programs according to their frequency of execution.
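The normalization-independence claim can be illustrated numerically. In the sketch below all execution times are made-up values for four hypothetical machines (two candidate references and two machines under comparison); only the geometric mean itself comes from the slides.

```python
import math

# Hypothetical execution times (seconds) on three programs.
TIMES = {
    "ref1": [10.0, 200.0, 30.0],
    "ref2": [5.0, 400.0, 60.0],
    "X": [20.0, 100.0, 60.0],
    "Y": [40.0, 50.0, 90.0],
}


def geometric_mean(values):
    """n-th root of the product of n values."""
    return math.prod(values) ** (1.0 / len(values))


def normalized(machine, reference):
    """Per-program execution time ratios relative to a reference machine."""
    return [t / r for t, r in zip(TIMES[machine], TIMES[reference])]


def x_versus_y(reference):
    """Ratio of the geometric-mean scores of X and Y for a given reference."""
    return (geometric_mean(normalized("X", reference))
            / geometric_mean(normalized("Y", reference)))


# The comparison of X and Y does not depend on the reference machine,
# and equals the geometric mean of the direct ratios X_i / Y_i.
direct = geometric_mean(normalized("X", "Y"))
print(x_versus_y("ref1"), x_versus_y("ref2"), direct)
```

All three printed values agree up to floating-point rounding. An arithmetic mean of the same normalized ratios would not have this property in general.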

Amdahl's Law

$$\text{Execution Time After Improvement} = \text{Execution Time Unaffected} + \frac{\text{Execution Time Affected}}{\text{Amount of Improvement}}$$

Example: "Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?"
How about making it 5 times faster?

Principle: Make the common case fast

Amdahl's Law

$$\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left[ (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right]$$

$$\text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$

Best you could ever hope to do:

$$\text{Speedup}_{\text{maximum}} = \frac{1}{1 - \text{Fraction}_{\text{enhanced}}}$$
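The multiply example can be worked mechanically from the first form of Amdahl's Law. The following is a minimal sketch; the function names are illustrative, and returning `None` for an unreachable target is a choice made here, not something the slides specify.

```python
def required_improvement(total_time, affected_time, target_speedup):
    """Improvement factor needed on the affected part to reach the target
    overall speedup, or None if no finite improvement is enough."""
    target_time = total_time / target_speedup
    unaffected_time = total_time - affected_time
    # Time budget left for the affected part after the improvement.
    budget = target_time - unaffected_time
    if budget <= 0:
        return None
    return affected_time / budget


def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Amdahl's Law in its fraction form."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def maximum_speedup(fraction_enhanced):
    """Limit of the overall speedup as the enhancement becomes infinitely fast."""
    return 1.0 / (1.0 - fraction_enhanced)


# 100 s program, multiply accounts for 80 s.
print(required_improvement(100, 80, 4))  # multiply must become 16x faster
print(required_improvement(100, 80, 5))  # None: 20 s is untouched, so 5x is out of reach
print(maximum_speedup(0.8))              # upper bound on the overall speedup
```

Running 4 times faster means finishing in 25 s; with 20 s unaffected, multiplication must fit in 5 s, a 16x improvement. Running 5 times faster means finishing in 20 s, which is already consumed by the unaffected part, so the target is only approached as the improvement grows without bound.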

Example

Suppose we enhance a machine, making all floating-point instructions run five times faster. If the execution time of some benchmark before the floating-point enhancement is 10 seconds, what will the speedup be if half of the 10 seconds is spent executing floating-point instructions?

We are looking for a benchmark to show off the new floating-point unit described above, and want the overall benchmark to show a speedup of 3. One benchmark we are considering runs for 100 seconds with the old floating-point hardware. How much of the execution time would floating-point instructions have to account for in this program in order to yield our desired speedup on this benchmark?

Remember

Performance is specific to a particular program or set of programs
  Total execution time is a consistent summary of performance
For a given architecture, performance increases come from:
  increases in clock rate (without adverse CPI effects)
  improvements in processor organization that lower CPI
  compiler enhancements that lower CPI and/or instruction count
  algorithm/language choices that affect instruction count
Pitfall: expecting an improvement in one aspect of a machine's performance to improve total performance proportionally
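The two floating-point questions in the Example slide can be worked directly from Amdahl's Law. The derivation below is a sketch; the symbol $F$ for the floating-point fraction of the original execution time is introduced here for convenience.

```latex
% Question 1: half of the 10 s is floating point, which becomes 5x faster.
\[
\text{Speedup}_{\text{overall}}
  = \frac{1}{(1 - 0.5) + \dfrac{0.5}{5}}
  = \frac{1}{0.6}
  \approx 1.67
  \qquad \text{(new time: } 5 + 5/5 = 6 \text{ s)}
\]

% Question 2: find the floating-point fraction F that yields an overall speedup of 3.
\[
\frac{1}{(1 - F) + \dfrac{F}{5}} = 3
\;\Longrightarrow\;
1 - 0.8\,F = \frac{1}{3}
\;\Longrightarrow\;
F = \frac{2/3}{0.8} \approx 0.833
\]
```

So floating-point instructions would have to account for about 83.3 of the benchmark's 100 seconds.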

The Art of Performance Evaluation: The Ratio Game

"If you can't convince them, confuse them." (Truman's Law)

Throughput in Transactions per Second
System   Workload 1   Workload 2
A        20           10
B        10           20

Comparing the Average Throughput
System   Workload 1   Workload 2   Average
A        20           10           15
B        10           20           15

The two systems are equally good.

The Ratio Game 1

Throughput in Transactions per Second
System   Workload 1   Workload 2
A        20           10
B        10           20

Throughput with Respect to System B
System   Workload 1   Workload 2   Average
A        2            0.5          1.25
B        1            1            1

System A is better than system B!

The Ratio Game 2

Throughput in Transactions per Second
System   Workload 1   Workload 2
A        20           10
B        10           20

Throughput with Respect to System A
System   Workload 1   Workload 2   Average
A        1            1            1
B        0.5          2            1.25

System B is better than system A!!
The problem is with taking the average of ratios.

Ratio Game with Percentages

System A
Test    Total   Pass   % Pass
1       300     60     20
2       50      2      4
Total   350     62     17.7

System B
Test    Total   Pass   % Pass
1       32      8      25
2       500     40     8
Total   532     48     9

Two ways to compare: percent of each test passed, or percent of total tests passed.
Which is better, A or B?
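The throughput tables of Ratio Game 1 and 2 are easy to reproduce, which makes the source of the contradiction visible. The sketch below uses the throughput figures from the slides; the helper names are illustrative, and the geometric-mean column is added here only for comparison with the averaged ratios.

```python
from math import prod

# Throughput in transactions per second: [workload 1, workload 2].
THROUGHPUT = {"A": [20, 10], "B": [10, 20]}


def normalized(system, base):
    """Throughput of `system` on each workload relative to `base`."""
    return [s / b for s, b in zip(THROUGHPUT[system], THROUGHPUT[base])]


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    return prod(values) ** (1.0 / len(values))


for base in ("B", "A"):
    print(f"Normalized to system {base}:")
    for system in ("A", "B"):
        ratios = normalized(system, base)
        print(f"  {system}: ratios={ratios} "
              f"arithmetic={arithmetic_mean(ratios):.2f} "
              f"geometric={geometric_mean(ratios):.2f}")
```

The arithmetic mean of the ratios favors whichever system is not the base (1.25 against 1), while the geometric mean gives 1 for both systems under either base, in line with the plain average throughput of 15 for each.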

Ratio Game with Percentages (Cont.)

Both alternatives have the problem of incomparable bases.
In Alternative 1, the base is the total number of times the experiment is repeated on a system, which is different for the two systems.
In Alternative 2, the base is the sum of the repetitions of the two experiments together, which is also different for the two systems.

The Art of Performance Evaluation: Benchmark

"to benchmark v. trans. To subject (a system) to a series of tests in order to obtain prearranged results not available on competitive systems" (S. Kelly-Bootle, The Devil's DP Dictionary)

Benchmarking is the process of comparing two systems using standard, well-known benchmarks.

Misleading by Benchmarking 1

Different configurations may be used to run the same workload on two systems.
  Different amounts of memory, disks
The compilers may be wired to optimize the workload.
  For example, eliminating recognized loops
Test specifications may be written so that they are biased towards one machine.
  For example, if the specifications are written based on an existing environment.
A synchronized job sequence may be used.
  It is possible to manipulate a job sequence so that CPU-bound and I/O-bound steps synchronize to give a better overall performance.

Misleading by Benchmarking 2

The workload may be arbitrarily picked.
  The workload might not be representative of real-world applications.
Very small benchmarks may be used.
  For example, such small benchmarks can give 100% cache hits, thereby ignoring the inefficiency of memory and cache organization.
  They may not show the effect of I/O overhead.
  Few instructions in a loop: by judicious choice of instructions in the loop, the results can be skewed by any amount desired.
Benchmarks may be manually translated to optimize the performance.
  Benchmarks often need to be manually translated for different systems.
  The performance may then depend more on the ability of the translator than on the system under test.

Summary

Instruction complexity is only one variable
  lower instruction count vs. higher CPI / lower clock rate
Design principles:
  simplicity favors regularity
  smaller is faster
  good design demands compromise
  make the common case fast
Instruction set architecture: a very important abstraction indeed!
Performance measurement: more art than science.