CS35101 Computer Architecture Spring 2008 Lecture 04: Understanding Performance

1 CS Computer Architecture Spring 2008
Lecture 04: Understanding Performance
Taken from Mary Jane Irwin and Kevin Schaffer
[Adapted from Computer Organization and Design, Patterson & Hennessy, 2005, UCB]
CS35101 L04 Understanding Performance.1

2 Airplane Example
Airplane    Passenger capacity    Cruising range (mi)    Cruising speed (MPH)    Passenger throughput (passengers x MPH)
Boeing                                                                           ,750
Boeing                                                                           ,700
Concorde                                                                         ,200
DC                                                                               ,424

3 Performance Metrics
Purchasing perspective: given a collection of machines, which has the
  - best performance?
  - least cost?
  - best cost/performance?
Design perspective: faced with design options, which has the
  - best performance improvement?
  - least cost?
  - best cost/performance?
Both require
  - a basis for comparison
  - a metric for evaluation
Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors

4 Defining (Speed) Performance
Normally interested in reducing response time (aka execution time): the time between the start and the completion of a task
  - Important to individual users (desktop computers, user-interactive)
Thus, to maximize performance, need to minimize execution time
\[ \text{Performance}_X = \frac{1}{\text{Execution Time}_X} \]
\[ \text{Performance}_X > \text{Performance}_Y \;\Rightarrow\; \frac{1}{\text{Execution Time}_X} > \frac{1}{\text{Execution Time}_Y} \;\Rightarrow\; \text{Execution Time}_X < \text{Execution Time}_Y \]

5 Relative Performance
Commonly we want to compare the performance of two different computers
If X is n times faster than Y, then
\[ \frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution Time}_Y}{\text{Execution Time}_X} = n \]
Throughput: the total amount of work done in a given time
  - Important to data center managers (non-user-interactive systems, servers)
Decreasing response time almost always improves throughput
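The ratio on this slide is easy to check numerically. The following is a minimal sketch; the function name `relative_performance` and the sample times are illustrative assumptions, not part of the lecture.

```python
def relative_performance(time_x: float, time_y: float) -> float:
    """Return n such that computer X is n times faster than computer Y.

    Performance is defined as 1 / execution time, so
    Performance_X / Performance_Y = time_y / time_x.
    """
    if time_x <= 0 or time_y <= 0:
        raise ValueError("execution times must be positive")
    return time_y / time_x


if __name__ == "__main__":
    # Hypothetical measurements: X finishes the task in 10 s, Y in 15 s.
    n = relative_performance(time_x=10.0, time_y=15.0)
    print(f"X is {n:.2f} times faster than Y")
```

A result below 1 means X is the slower machine.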

6 Measuring Time
Wall-clock time is the total time to complete a task, including OS overhead, I/O time, etc.
CPU time is the time the CPU spends working on the task, excluding time spent waiting for I/O devices
User CPU time is the time the CPU spends executing the program itself, excluding time spent performing OS tasks on its behalf, which is called system CPU time
Time can also be measured in clock cycles
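One way to see the difference between these times is to measure them for the same task. The sketch below uses Python's standard `time.perf_counter()` for wall-clock time and `os.times()` for user and system CPU time; the helper names `measure` and `sleepy_task` are hypothetical, and the sleep merely stands in for waiting on an I/O device.

```python
import os
import time


def measure(task):
    """Run task() once and return (wall, user_cpu, system_cpu) in seconds."""
    cpu_before = os.times()
    wall_before = time.perf_counter()
    task()
    wall_after = time.perf_counter()
    cpu_after = os.times()
    wall = wall_after - wall_before
    user = cpu_after.user - cpu_before.user        # time in the program's own code
    system = cpu_after.system - cpu_before.system  # time the OS spent on its behalf
    return wall, user, system


def sleepy_task():
    # Sleeping adds wall-clock time but almost no CPU time,
    # much like waiting for an I/O device would.
    time.sleep(0.2)
    # A little real computation so that user CPU time is not zero.
    sum(i * i for i in range(100_000))


if __name__ == "__main__":
    wall, user, system = measure(sleepy_task)
    print(f"wall-clock: {wall:.3f} s")
    print(f"user CPU:   {user:.3f} s")
    print(f"system CPU: {system:.3f} s")
```

On a typical system the wall-clock figure is dominated by the sleep, while user plus system CPU time stays far smaller.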

7 Clock Cycles
Clock cycle time (clock period) is the length of one clock cycle in seconds (ps, ns)
Clock rate (clock frequency) is the number of clock cycles per second, measured in Hertz (GHz, MHz)
\[ \text{Clock rate} = \frac{1}{\text{Clock cycle time}} \]
Make sure units match

8 Review: Machine Clock Rate
CC = 1 / CR, where CC is one clock period
  10 nsec clock cycle  => 100 MHz clock rate
   5 nsec clock cycle  => 200 MHz clock rate
   2 nsec clock cycle  => 500 MHz clock rate
   1 nsec clock cycle  =>   1 GHz clock rate
 500 psec clock cycle  =>   2 GHz clock rate
 250 psec clock cycle  =>   4 GHz clock rate
 200 psec clock cycle  =>   5 GHz clock rate
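The conversions on this slide follow directly from clock rate = 1 / clock cycle time once the units match. A small sketch, assuming the cycle time is given in picoseconds (the function names are illustrative):

```python
PS_PER_SECOND = 1e12  # 1 second = 10^12 picoseconds


def clock_rate_hz(period_ps: float) -> float:
    """Clock rate in Hz for a clock cycle time given in picoseconds."""
    if period_ps <= 0:
        raise ValueError("clock period must be positive")
    return PS_PER_SECOND / period_ps


def clock_period_ps(rate_hz: float) -> float:
    """Clock cycle time in picoseconds for a clock rate given in Hz."""
    if rate_hz <= 0:
        raise ValueError("clock rate must be positive")
    return PS_PER_SECOND / rate_hz


if __name__ == "__main__":
    # The cycle times from the slide, expressed in picoseconds.
    for period in (10_000, 5_000, 2_000, 1_000, 500, 250, 200):
        print(f"{period:>6} ps clock cycle => {clock_rate_hz(period) / 1e6:g} MHz")
```

Keeping a single unit (picoseconds here) on the input side avoids the mismatched-unit mistakes the previous slide warns about.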

9 Performance Factors
\[ \text{Time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock Cycle}} \]
A program is a sequence of instructions
Each instruction requires some number of clock cycles to execute

10 Performance Factors continued
Want to distinguish elapsed time and the time spent on our task
CPU execution time (CPU time): the time the CPU spends working on a task
  - Does not include time waiting for I/O or running other programs
\[ \text{CPU execution time for a program} = \text{\# CPU clock cycles for a program} \times \text{clock cycle time} \]
or
\[ \text{CPU execution time for a program} = \frac{\text{\# CPU clock cycles for a program}}{\text{clock rate}} \]
Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program
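The two forms of the CPU execution time formula can be sketched as follows. The function names and the cycle count and clock rate in the example are hypothetical; the point is only that halving the cycle count or halving the clock cycle time has the same effect on CPU time.

```python
def cpu_time_from_cycle_time(clock_cycles: float, cycle_time_s: float) -> float:
    """CPU time = number of CPU clock cycles x clock cycle time (seconds)."""
    return clock_cycles * cycle_time_s


def cpu_time_from_clock_rate(clock_cycles: float, clock_rate_hz: float) -> float:
    """CPU time = number of CPU clock cycles / clock rate (Hz)."""
    return clock_cycles / clock_rate_hz


if __name__ == "__main__":
    # Hypothetical program: 10 billion clock cycles on a 2 GHz machine.
    cycles, rate = 10e9, 2e9
    base = cpu_time_from_clock_rate(cycles, rate)
    fewer_cycles = cpu_time_from_clock_rate(cycles / 2, rate)
    faster_clock = cpu_time_from_clock_rate(cycles, rate * 2)
    print(f"baseline:          {base:.2f} s")
    print(f"half the cycles:   {fewer_cycles:.2f} s")
    print(f"twice the clock:   {faster_clock:.2f} s")
```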

11 Clock Cycles per Instruction
Not all instructions take the same amount of time to execute
One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction
\[ \text{\# CPU clock cycles for a program} = \text{\# Instructions for a program} \times \text{Average clock cycles per instruction} \]
Clock cycles per instruction (CPI): the average number of clock cycles each instruction takes to execute
  - A way to compare two different implementations of the same ISA
         CPI for this instruction class
         A      B      C
CPI

12 Effective CPI
Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging
\[ \text{Overall effective CPI} = \sum_{i=1}^{n} (\text{CPI}_i \times \text{IC}_i) \]
Where
  - IC_i is the count (percentage) of the number of instructions of class i executed
  - CPI_i is the (average) number of clock cycles per instruction for that instruction class
  - n is the number of instruction classes
The overall effective CPI varies by instruction mix: a measure of the dynamic frequency of instructions across one or many programs
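The weighted sum can be sketched in a few lines. This version assumes the instruction counts are raw counts rather than percentages, so it normalizes by the total; the class names, counts, and per-class CPI values in the example are hypothetical.

```python
def effective_cpi(counts: dict, cpi: dict) -> float:
    """Overall effective CPI = sum over classes of CPI_i x (fraction of class i).

    counts maps instruction class -> number of instructions executed.
    cpi maps instruction class -> average clock cycles per instruction.
    """
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("at least one instruction must be executed")
    return sum(cpi[cls] * (n / total) for cls, n in counts.items())


if __name__ == "__main__":
    # Hypothetical mix: three instruction classes with different costs.
    counts = {"A": 5_000_000, "B": 1_000_000, "C": 1_000_000}
    cpi = {"A": 1, "B": 2, "C": 3}
    print(f"effective CPI = {effective_cpi(counts, cpi):.3f}")
```

Because the result depends on the fractions, the same machine shows a different effective CPI for a different instruction mix.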

13 THE Performance Equation
Our basic performance equation is then
\[ \text{CPU time} = \text{Instruction\_count} \times \text{CPI} \times \text{clock\_cycle} \]
or
\[ \text{CPU time} = \frac{\text{Instruction\_count} \times \text{CPI}}{\text{clock\_rate}} \]
These equations separate the three key factors that affect performance
  - Can measure the CPU execution time by running the program
  - The clock rate is usually given
  - Can measure overall instruction count by using profilers/simulators without knowing all of the implementation details
  - CPI varies by instruction type and ISA implementation, for which we must know the implementation details
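Since CPU time, clock rate, and instruction count can all be measured or looked up, the equation can be rearranged to estimate the CPI that is otherwise hard to observe. A minimal sketch, with hypothetical function names and made-up measurements:

```python
def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = Instruction_count x CPI / clock_rate."""
    return instruction_count * cpi / clock_rate_hz


def cpi_from_measurements(cpu_time_s: float, instruction_count: float,
                          clock_rate_hz: float) -> float:
    """Rearranged performance equation: CPI = CPU time x clock_rate / Instruction_count."""
    return cpu_time_s * clock_rate_hz / instruction_count


if __name__ == "__main__":
    # Hypothetical measurements: 2.5 s of CPU time, 5 billion instructions
    # (e.g. reported by a profiler), on a 4 GHz machine.
    cpi = cpi_from_measurements(2.5, 5e9, 4e9)
    print(f"estimated CPI = {cpi:.2f}")
    print(f"check: CPU time = {cpu_time(5e9, cpi, 4e9):.2f} s")
```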

14 Determinants of CPU Performance
\[ \text{CPU time} = \text{Instruction\_count} \times \text{CPI} \times \text{clock\_cycle} \]
                          Instruction_count    CPI    clock_cycle
Algorithm
Programming language
Compiler
ISA
Processor organization
Technology

15 A Simple Example
Op        Freq    CPI_i    Freq x CPI_i
ALU       50%     1
Load      20%     5
Store     10%     3
Branch    20%     2
                           Σ =
How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
How does this compare with using branch prediction to shave a cycle off the branch time?
What if two ALU instructions could be executed at once?
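The three questions can be explored with a short script. This is a sketch under stated assumptions: the branch CPI is taken to be 2 (the table's value did not survive cleanly in this copy), and "two ALU instructions at once" is modelled as halving the ALU CPI to 0.5. With instruction count and clock rate unchanged, the speedup is simply old CPI divided by new CPI.

```python
# Instruction mix: op -> (frequency, CPI). The branch CPI of 2 is an assumption.
BASE_MIX = {
    "ALU": (0.50, 1.0),
    "Load": (0.20, 5.0),
    "Store": (0.10, 3.0),
    "Branch": (0.20, 2.0),
}


def effective_cpi(mix: dict) -> float:
    """Sum of frequency x CPI over all instruction classes."""
    return sum(freq * cpi for freq, cpi in mix.values())


def with_cpi(mix: dict, op: str, new_cpi: float) -> dict:
    """Return a copy of the mix with one class's CPI replaced."""
    changed = dict(mix)
    changed[op] = (mix[op][0], new_cpi)
    return changed


def speedup(old_mix: dict, new_mix: dict) -> float:
    """Same instruction count and clock rate, so speedup = CPI_old / CPI_new."""
    return effective_cpi(old_mix) / effective_cpi(new_mix)


if __name__ == "__main__":
    scenarios = {
        "better data cache (load CPI 2)": with_cpi(BASE_MIX, "Load", 2.0),
        "branch prediction (branch CPI 1)": with_cpi(BASE_MIX, "Branch", 1.0),
        "dual ALU issue (ALU CPI 0.5)": with_cpi(BASE_MIX, "ALU", 0.5),
    }
    print(f"baseline CPI = {effective_cpi(BASE_MIX):.2f}")
    for name, mix in scenarios.items():
        print(f"{name}: CPI = {effective_cpi(mix):.2f}, "
              f"speedup = {speedup(BASE_MIX, mix):.3f}")
```

Under these assumptions the baseline CPI is 2.2, and the three changes bring it to 1.6, 2.0, and 1.95 respectively, so the data cache improvement helps most.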

16 Evaluating Performance
The set of programs run on a computer is a workload
A benchmark is a workload specifically designed to measure a computer's performance
The best benchmarks are made up of real programs
Synthetic benchmarks, on the other hand, try to measure low-level performance by repeating short blocks of code

17 Comparing and Summarizing Performance
How do we summarize the performance for a benchmark set with a single number?
The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM)
\[ \text{AM} = \frac{1}{n} \sum_{i=1}^{n} \text{Time}_i \]
  - Where Time_i is the execution time for the i-th program of a total of n programs in the workload
  - A smaller mean indicates a smaller average execution time and thus improved performance
Guiding principle in reporting performance measurements is reproducibility: list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.))
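A short sketch of the arithmetic mean applied to a workload follows. The two machines and their per-program execution times are hypothetical; the example only illustrates that the mean is the total execution time divided by n, so the machine with the smaller mean also has the smaller total.

```python
def arithmetic_mean(times: list) -> float:
    """AM = (1/n) * sum of Time_i over the n programs in the workload."""
    if not times:
        raise ValueError("workload must contain at least one program")
    return sum(times) / len(times)


if __name__ == "__main__":
    # Hypothetical execution times (seconds) for a three-program workload.
    machine_a = [1.0, 2.0, 6.0]
    machine_b = [2.0, 2.0, 2.0]
    for name, times in (("A", machine_a), ("B", machine_b)):
        print(f"machine {name}: total = {sum(times):.1f} s, "
              f"AM = {arithmetic_mean(times):.1f} s")
```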

18 SPEC
Standard sets of benchmarks for modern computers based on real programs
Covers a number of application areas including graphics, file servers, web servers, etc.
CPU benchmarks measure CPU performance on integer and floating-point programs

19 SPEC Benchmarks
Integer benchmarks
  gzip       compression
  vpr        FPGA place & route
  gcc        GNU C compiler
  mcf        Combinatorial optimization
  crafty     Chess program
  parser     Word processing program
  eon        Computer visualization
  perlbmk    perl application
  gap        Group theory interpreter
  vortex     Object oriented database
  bzip2      compression
  twolf      Circuit place & route
FP benchmarks
  wupwise    Quantum chromodynamics
  swim       Shallow water model
  mgrid      Multigrid solver in 3D fields
  applu      Parabolic/elliptic pde
  mesa       3D graphics library
  galgel     Computational fluid dynamics
  art        Image recognition (NN)
  equake     Seismic wave propagation simulation
  facerec    Facial image recognition
  ammp       Computational chemistry
  lucas      Primality testing
  fma3d      Crash simulation fem
  sixtrack   Nuclear physics accel
  apsi       Pollutant distribution

20 Example SPEC Ratings

21 Pentium M Performance

22 Power Efficiency
Power consumption matters, especially in the embedded market where battery life (and passive cooling) is important
For power-limited applications, the most important metric is energy efficiency
Modern mobile processors implement features to reduce power usage, such as dynamic clock scaling
Goal is to maximize the performance/power ratio

23 Performance/Power

24 Speedup
Speedup tells us how many times faster our system is after making some improvement
That is, a speedup of 2 means the new version is twice as fast as the old one
\[ \text{Speedup} = \frac{\text{Time before improvement}}{\text{Time after improvement}} \]

25 Amdahl's Law
Amdahl's law provides a limit on the improvement in system performance from an improvement in one part of the system
Demonstrates the law of diminishing returns
\[ \text{Speedup} = \frac{1}{\frac{f}{s} + (1 - f)} \]
  - f is the fraction of the computation that is improved
  - s is the speedup of the improvement
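The diminishing returns are easiest to see by holding f fixed and increasing s. A minimal sketch; the function name and the choice of f = 0.8 are illustrative assumptions. However large s becomes, the speedup stays below 1 / (1 - f).

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Overall speedup when a fraction f of the computation is sped up by s."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / (f / s + (1.0 - f))


if __name__ == "__main__":
    f = 0.8  # hypothetical: 80% of the computation benefits from the improvement
    for s in (2, 4, 8, 16, 1000):
        print(f"s = {s:>4}: overall speedup = {amdahl_speedup(f, s):.3f}")
    print(f"upper bound 1 / (1 - f) = {1.0 / (1.0 - f):.3f}")
```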

26 Amdahl's Law (2)
\[ \text{Speedup} = \frac{\text{Time before improvement}}{\dfrac{\text{Time affected}}{\text{Amount of improvement}} + \text{Time unaffected}} \]
Alternate form of Amdahl's law based on actual execution time instead of fractions
This is the form used by the book
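The time-based form can be sketched the same way. The function names and the 100-second example are hypothetical; the example corresponds to f = 0.8 and s = 4 in the fraction-based form.

```python
def time_after_improvement(time_affected: float, time_unaffected: float,
                           improvement: float) -> float:
    """Time after = time affected / amount of improvement + time unaffected."""
    if improvement <= 0:
        raise ValueError("amount of improvement must be positive")
    return time_affected / improvement + time_unaffected


def speedup(time_before: float, time_after: float) -> float:
    """Speedup = time before improvement / time after improvement."""
    return time_before / time_after


if __name__ == "__main__":
    # Hypothetical program: 100 s in total, of which 80 s is in the part
    # being improved by a factor of 4.
    before = 100.0
    after = time_after_improvement(time_affected=80.0, time_unaffected=20.0,
                                   improvement=4.0)
    print(f"time after improvement = {after:.1f} s")
    print(f"speedup = {speedup(before, after):.2f}")
```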

27 Poor Performance Measures
Clock rate
Instructions per second (IPS)
MIPS (million instructions per second)
  - \( \text{MIPS} = \dfrac{\text{Instruction count}}{\text{Execution time} \times 10^{6}} \)
Floating-point operations per second (FLOPS)
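A small numerical sketch shows one reason MIPS can mislead: a program version that executes more, cheaper instructions can post a higher MIPS rating while taking longer to run. All numbers below (the per-class CPI values, the 4 GHz clock, and the two instruction mixes) are hypothetical.

```python
CPI = {"A": 1, "B": 2, "C": 3}   # assumed cycles per instruction for each class
CLOCK_RATE_HZ = 4e9              # assumed clock rate


def execution_time(counts: dict) -> float:
    """Execution time in seconds = total clock cycles / clock rate."""
    cycles = sum(counts[cls] * CPI[cls] for cls in counts)
    return cycles / CLOCK_RATE_HZ


def mips(counts: dict) -> float:
    """MIPS = instruction count / (execution time x 10^6)."""
    return sum(counts.values()) / (execution_time(counts) * 1e6)


if __name__ == "__main__":
    # Two hypothetical compiled versions of the same program.
    version_1 = {"A": 5e9, "B": 1e9, "C": 1e9}
    version_2 = {"A": 10e9, "B": 1e9, "C": 1e9}
    for name, counts in (("version 1", version_1), ("version 2", version_2)):
        print(f"{name}: time = {execution_time(counts):.2f} s, "
              f"MIPS = {mips(counts):.0f}")
```

With these numbers version 2 has the higher MIPS rating yet the longer execution time, which is why execution time remains the metric to trust.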

28 Summary: Evaluating ISAs
Design-time metrics:
  - Can it be implemented, in how long, at what cost?
  - Can it be programmed? Ease of compilation?
Static metrics:
  - How many bytes does the program occupy in memory?
Dynamic metrics:
  - How many instructions are executed? How many bytes does the processor fetch to execute the program?
  - How many clocks are required per instruction?
  - How "lean" a clock is practical?
Best metric: time to execute the program!
  - Depends on the instruction set, the processor organization, and compilation techniques
  - [Diagram labels: Inst. Count, CPI, Cycle Time]

29 Next Lecture and Reminders
Next lecture
  MIPS non-pipelined datapath/control path review
  - Reading assignment: PH, Chapter 5
Reminders
  Choose one of the first 100 listed at the most current list at top500.org
