Improving Energy-Efficiency of Multicores using First-Order Modeling


Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1404

Improving Energy-Efficiency of Multicores using First-Order Modeling

VASILEIOS SPILIOPOULOS

ACTA UNIVERSITATIS UPSALIENSIS, UPPSALA 2016

Dissertation presented at Uppsala University to be publicly examined in ITC/2446, Lägerhyddsvägen 2, Uppsala, Thursday, 29 September 2016 at 13:00 for the degree of Doctor of Philosophy. The examination will be conducted in English. Faculty examiner: Professor Lieven Eeckhout (Ghent University, Belgium).

Abstract

Spiliopoulos, V. 2016. Improving Energy-Efficiency of Multicores using First-Order Modeling. Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1404. Uppsala: Acta Universitatis Upsaliensis.

In recent decades, power consumption has evolved into one of the most critical resources in a computer system. In the form of the electricity bill in data centers, battery life in mobile devices, or thermal constraints in desktops and laptops, power consumption imposes several limitations on today's processors, and improving power and energy efficiency is one of the most urgent research topics in Computer Architecture. Dynamic Voltage and Frequency Scaling (DVFS) and Cache Resizing are among the most popular energy-saving techniques. Previous work, however, has focused on developing heuristics and trial-and-error methods that yield acceptable savings but fail to provide insight into how these techniques affect the power and performance of a computer system.

In contrast, this Thesis proposes the use of first-order modeling to improve the energy efficiency of computer systems. A first-order model needs to be (i) accurate enough to efficiently drive DVFS and Cache Resizing decisions, and (ii) simple enough to eliminate the overhead of collecting the required inputs to the model. We show that such models can be constructed and successfully applied in modern systems.

For DVFS, we propose to scale frequency down to exploit applications' memory slack, i.e., periods that the processor spends waiting for data to be fetched from the main memory. In such cases, the processor frequency can be scaled down to save energy without an inordinate performance penalty. Our DVFS models can detect slack and predict the impact of DVFS on both power and performance with great accuracy. Cache Resizing, on the other hand, relies on the fact that many applications do not benefit from the vast amount of cache that modern processors are equipped with. In such cases, the cache can be resized to save static energy at limited performance cost. Since both techniques are related to the memory behavior of applications, we propose a unified model to manage the two techniques in tandem and maximize energy efficiency through synergistic DVFS and Cache Resizing.

Finally, our experience with DVFS in real systems motivated us to contribute to the integration of DVFS into the gem5 simulator. Unlike other simulators that ignore the role of the OS in DVFS, we extend the gem5 simulator by developing the hardware and software components that allow the existing Linux DVFS infrastructure to be seamlessly integrated in the simulator.

Keywords: Computer Architecture, DVFS, Cache Resizing, Interval modeling, Power modeling

Vasileios Spiliopoulos, Department of Information Technology, Computer Architecture and Computer Communication, Box 337, Uppsala University, SE Uppsala, Sweden.

© Vasileios Spiliopoulos 2016

To my parents and my loving wife.

List of papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I. Georgios Keramidas, Vasileios Spiliopoulos, Stefanos Kaxiras, "Interval-Based Models for Run-Time DVFS Orchestration in SuperScalar Processors", In Proc. International Conference on Computing Frontiers (CF), 2010
   I am the primary author of this paper. Georgios Keramidas contributed in writing the text of the paper.

II. Vasileios Spiliopoulos, Stefanos Kaxiras, Georgios Keramidas, "Green Governors: A framework for Continuously Adaptive DVFS", In Proc. International Green Computing Conference and Workshops (IGCC), 2011
   I am the primary author of this paper.

III. Vasileios Spiliopoulos, Andreas Sembrant, Stefanos Kaxiras, "Power-Sleuth: A Tool for Investigating your Program's Power Behavior", In Proc. International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 2012
   I am the primary author of this paper. Andreas Sembrant provided the phase-detection tool and contributed in discussions.

IV. Vasileios Spiliopoulos, Akash Bagdia, Andreas Hansson, Peter Aldworth, Stefanos Kaxiras, "Introducing DVFS-Management in a Full-System Simulator", In Proc. International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 2013
   I am the primary author of this paper.

V. Vasileios Spiliopoulos, Andreas Sembrant, Georgios Keramidas, Erik Hagersten, Stefanos Kaxiras, "A Unified DVFS-Cache Resizing Framework", Technical Report, Department of Information Technology, Uppsala University, 2016
   I am the primary author of this paper. Andreas Sembrant and Georgios Keramidas were involved in discussions.

Reprints were made with permission from the publishers.

Other publications not included:

Konstantinos Koukos, David Black-Schaffer, Vasileios Spiliopoulos, Stefanos Kaxiras, "Towards more efficient execution: a decoupled access-execute approach", In Proc. International Conference on Supercomputing (ICS), 2013
   I developed the power model used in the paper.

Alexandra Jimborean, Konstantinos Koukos, Vasileios Spiliopoulos, David Black-Schaffer, Stefanos Kaxiras, "Fix the code. Don't tweak the hardware: A new compiler approach to Voltage-Frequency scaling", In Proc. International Symposium on Code Generation and Optimization (CGO), 2014
   I developed the power model used in the paper.

Konstantinos Koukos, Per Ekemark, Georgios Zacharopoulos, Vasileios Spiliopoulos, Stefanos Kaxiras, Alexandra Jimborean, "Multiversioned decoupled access-execute: the key to energy-efficient compilation of general-purpose programs", In Proc. International Conference on Compiler Construction (CC), 2016
   I developed the power model used in the paper.

Kai Lampka, Björn Forsberg, Vasileios Spiliopoulos, "Keep it cool and in time: With runtime monitoring to thermal-aware execution speeds for deadline constrained systems", In Journal of Parallel and Distributed Computing (JPDC), 2016
   I contributed technical details about gem5 and McPAT and provided scripts to estimate power consumption.

Contents

1 Introduction
    Power Consumption in Computer Systems
    Power & Energy Efficiency Metrics
    Power and Performance Modeling for Energy Efficiency
2 DVFS Performance Modeling
    Interval-based DVFS Performance Model
        Stall-based model
        Miss-based model
    Implementing the Models in Real Processors
    Model Accuracy
    Model Extensions
3 DVFS Power Modeling
    Measuring Power Consumption
        From the voltage regulator
        From the motherboard ATX connector
    A Frequency-Independent Power Model
4 Understanding the Power Behavior of Applications
    Phase Detection
    Utilizing Program Phases to Aid Performance Event Monitoring
    Estimating Power and Performance in Different Frequencies
5 Improving Energy Efficiency with Linux Frequency Governors
    Green Governors
    Multicore Management
6 Co-ordinated DVFS and Cache Resizing Management
    Unified DVFS-Cache Resizing Model
    Estimating MLP in Different Cache Configurations
    Estimating Performance
    Frequency-LLC Size Adaptation
7 Introducing DVFS Support in gem5 Simulator
    Hardware Extensions
    Software Integration
    Validation and Use-cases
8 Summary
Summary in Swedish (Svensk Sammanfattning)
    Background (Bakgrund)
    Summary of the Research (Sammanfattning av Forskningen)
Acknowledgements
References

1. Introduction

For many years, improving performance was the main concern of research in Computer Architecture. Maximizing performance at any cost led to increased complexity at different levels of the design process. At the circuit level, the shrinking of the manufacturing process combined with aggressive pipelining led to faster processors through increased clock frequency. At the architectural level, an abundance of sophisticated techniques has contributed to improving performance: out-of-order execution, several cache levels, aggressive prefetching, and branch prediction are only a few of the advances in the field of Computer Architecture in recent years. In the past decade, however, optimizing performance at any cost has no longer been possible, as energy consumption evolved into one of the most important design constraints.

Energy consumption emerged as a crucial limitation in modern computer systems for two main reasons. First, computers hit the power wall. With the increase of frequency, higher power consumption led to increased thermal dissipation, approaching the physical limits of the devices; further increasing energy consumption and thermal dissipation would simply lead to temperatures that chips cannot tolerate. Second, energy itself has become a first-priority resource, in the form of electricity cost for large data centers and battery life for mobile devices. Consequently, in the past 15 years, lowering energy consumption has become an important concern even in high-performance computing.

However, performance is still important for computer systems, hence architects strive for what is known as energy efficiency: improving performance should only come at a reasonable energy cost, or, in other words, the energy overhead should not be more than the performance benefit achieved by a given optimization.
Similarly, there are many different energy-saving techniques; the challenge, however, is to reduce energy consumption with limited performance degradation. In this Thesis, we develop modeling techniques that can be used to improve the energy efficiency of computer systems.

1.1 Power Consumption in Computer Systems

In CMOS circuits, power consumption is broken down into dynamic and static power consumption. Dynamic power is given by the following equation [23]:

$$P_{dynamic} = a f C V^2 \qquad (1.1)$$

Dynamic power is consumed due to the switching activity of the transistors. The activity factor $a$ denotes the percentage of transistors that switch state on every cycle, and depends on the input of the circuit. $C$ is the load capacitance, and $f$ and $V$ are the frequency and voltage respectively. Different power-saving techniques aim to reduce dynamic power consumption by targeting different components of the above equation. For instance, designing more compact and efficient systems leads to smaller load capacitance $C$, which in turn reduces power consumption. Clock-gating aims to reduce the activity factor $a$ by cutting off the clock at idle components, to prevent transistors from switching state when they don't have to. And Dynamic Voltage and Frequency Scaling (DVFS) scales down voltage and frequency to reduce power consumption at the expense of reduced performance. Voltage and frequency should always be treated in combination, since there is a close connection between the two: for a target frequency, there is a minimum voltage that the circuit should be supplied with to ensure proper functionality.

Static power is consumed by transistors due to various leakage currents. At a high level, static power is given by the following equation [23]:

$$P_{static} = V \cdot I_{leak} \qquad (1.2)$$

Equation 1.2 suggests that static power consumption can be reduced by reducing either the supply voltage or the leakage current. Although DVFS can be used to reduce voltage, the voltage scaling range is significantly limited in modern processors [10], hence DVFS cannot aggressively attack the problem of static dissipation. An effective technique to minimize static power is to reduce the leakage current by shutting down parts of the processor that are not used. Recent Intel and AMD processors apply this technique by power-gating idle cores.
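To make the roles of the terms in Equations 1.1 and 1.2 concrete, the following is a minimal Python sketch. The function names and every numeric value (activity factor, capacitance, leakage current, operating point) are hypothetical and chosen only for illustration; the assumption that voltage can be halved together with frequency is an idealization, since the voltage range of real processors is limited, as noted above.

```python
def dynamic_power(a, c_load, freq_hz, vdd):
    """Equation 1.1: P_dynamic = a * f * C * V^2, in watts."""
    return a * freq_hz * c_load * vdd ** 2


def static_power(vdd, i_leak):
    """Equation 1.2: P_static = V * I_leak, in watts."""
    return vdd * i_leak


# Hypothetical operating point, for illustration only.
A = 0.2          # activity factor (fraction of transistors switching per cycle)
C_LOAD = 1e-8    # load capacitance in farads
I_LEAK = 5.0     # leakage current in amperes
F_HI, V_HI = 2.0e9, 1.2   # 2 GHz at 1.2 V

p_dyn_hi = dynamic_power(A, C_LOAD, F_HI, V_HI)
# Idealized assumption: voltage scales down by the same factor as frequency.
p_dyn_lo = dynamic_power(A, C_LOAD, F_HI / 2, V_HI / 2)

# Dynamic power drops by about 8x (linear in f, quadratic in V),
# while static power at the lower voltage drops by only about 2x.
print(p_dyn_hi / p_dyn_lo)
print(static_power(V_HI, I_LEAK) / static_power(V_HI / 2, I_LEAK))
```

Under these assumptions the sketch shows why voltage scaling attacks dynamic power much more strongly than static power.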
Shutting down parts of the caches is another technique that has been extensively studied [45, 46, 47], because caches are responsible for a large portion of the chip's total static power consumption.

1.2 Power & Energy Efficiency Metrics

Nowadays, energy consumption is as important as performance, hence computer architects and system designers often choose to trade performance for energy and power savings. Depending on design requirements, different optimization metrics can be used:

Power: In some cases, system performance is sacrificed to reduce power consumption, even though this does not guarantee that the total energy consumption for a given task will be reduced. Because power consumption directly correlates with thermal dissipation, such strategies are usually applied when there are thermal constraints in the system.

Energy: When low energy consumption is the ultimate design metric, minimizing energy consumption at any performance cost can be acceptable. For instance, microcontrollers used in embedded systems use simple designs that greatly prolong battery life at the expense of performance.

Energy Delay Product (EDP): In many cases, both performance and energy consumption are critical resources in a system. For example, a mobile device (e.g., a smartphone or tablet computer) should have both good performance and long battery life, and a data center should be fast, but not at unreasonable electricity cost. In such cases, the product of execution time and energy consumption has been proposed [23] as a metric that depicts how well a technique trades one for the other. EDP is a lower-is-better metric, giving equal weights to both performance and energy consumption. When one of the two is valued more than the other, higher-order variations of this metric are used (e.g., $ED^2P$, which weights delay more heavily).

1.3 Power and Performance Modeling for Energy Efficiency

This Thesis presents modeling techniques for optimizing energy efficiency in modern Out-of-Order processors. It focuses on two very popular power-saving techniques that have been studied in the past, but mainly through empirical methods: Dynamic Voltage and Frequency Scaling (DVFS) [18, 17, 26, 44, 42] and Cache Resizing [45, 46, 47].

DVFS relies on the fact that applications cannot always take full advantage of a fast processor. This is due to memory slack: in memory-intensive applications, the CPU spends a significant amount of time stalled, waiting for the data to arrive from the main memory. In such cases, the processor's voltage and clock frequency can be scaled down to save energy at limited performance cost. Existing techniques rely on empirical models to determine how the performance of an application is affected by operating the processor at different DVFS levels. In Chapter 2, we show that a simple analytical model can accurately estimate the execution time of applications under frequency scaling and can be used to find the optimal DVFS point.
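The metrics of Section 1.2 reduce to simple arithmetic on average power and execution time. The sketch below is illustrative only: the two configurations and their power and time values are hypothetical, and the helper names are not taken from the thesis. It shows that the preferred configuration can change with the exponent applied to delay.

```python
def energy(power_w, time_s):
    """Energy in joules for a task with the given average power and run time."""
    return power_w * time_s


def edp(power_w, time_s, delay_exponent=1):
    """Energy * delay^n: n=1 gives EDP, n=2 gives ED^2P. Lower is better."""
    return energy(power_w, time_s) * time_s ** delay_exponent


# Hypothetical configurations: (average power in W, execution time in s).
configs = {"fast": (40.0, 10.0), "slow": (20.0, 14.0)}

best_by_metric = {}
for n in (1, 2):
    best_by_metric[n] = min(
        configs, key=lambda name: edp(*configs[name], delay_exponent=n)
    )

# With these made-up numbers, EDP favors the slower configuration,
# while ED^2P, which penalizes delay more heavily, favors the faster one.
print(best_by_metric)
```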
To reason about and optimize energy efficiency, it is important to estimate the power consumption of a processor at runtime. Building power models for real processors and using them to apply different power-saving techniques is a well-studied topic [19, 7, 14, 33, 20]. However, previous studies only demonstrated how to generate models for a single frequency. In Chapter 3, we extend previous modeling methodologies to estimate power consumption across different frequencies.

Then, in Chapter 4, we show that the performance and power models presented in Chapters 2 and 3 can be combined to create a powerful tool that allows us to understand the power behavior of different applications with respect to DVFS. This tool, called Power-Sleuth, only needs to profile an application by running it once, at a single frequency.

With our DVFS power and performance models, the tool can estimate how application behavior varies at any frequency of interest. To build our power model and evaluate its effectiveness, we demonstrate different methodologies for measuring power consumption on real processors.

Apart from providing insights into the power behavior of different applications, our models can efficiently drive runtime optimizations. In Chapter 5, we use our models to implement Linux frequency governors running on real platforms. These governors are highly flexible, enforcing a variety of different energy-efficiency policies: they can accurately estimate the impact of frequency scaling on each of the applications running on a multicore and apply the frequency that optimizes the power, energy and performance requirements set by the user. Although the model inputs are not always obtainable through the existing performance-counter events, we show that using certain approximations leads to near-optimal DVFS decisions.

Dynamic power consumption dominates total power consumption. Static power, however, is not negligible, especially for the vast Last-Level Caches (LLCs) that occupy a significant part of the chip area (~50%). Dynamic Cache Resizing is a popular technique that turns off parts of the cache to save static energy and improve energy efficiency. However, resizing the cache has an impact on the memory behavior of the application, which directly correlates with the application behavior at different DVFS levels. In Chapter 6, we demonstrate a unified model to estimate the impact of DVFS and LLC Resizing at the same time. We then use the model to manage core frequency and LLC size, and show that energy efficiency can be optimized by applying these techniques in a co-ordinated way.
Finally, architectural simulators are powerful tools that allow researchers to fine-tune parameters and investigate research ideas that are not always applicable in existing hardware. Therefore, making simulators as realistic as possible is a significant part of research in Computer Architecture. Most simulators provide only basic DVFS support, taking shortcuts and disregarding aspects that can be important in real hardware. In Chapter 7, we demonstrate how we introduce full DVFS support in one of the most popular full-system simulators, the gem5 simulator. Our extensions allow gem5 users to model different clock and voltage domain topologies. We also provide full Linux support by developing drivers that are compatible with the existing Linux DVFS infrastructure. Hence, default Linux frequency governors, like interactive and ondemand, can be used out-of-the-box in the simulator.

2. DVFS Performance Modeling

Dynamic Voltage and Frequency Scaling (DVFS) is one of the most popular power/energy saving techniques. DVFS relies on the fact that, by reducing the clock frequency, the circuit can tolerate higher latencies and therefore voltage can also be reduced. This has a cubic impact on dynamic power consumption (Equation 1.1), while it also impacts static power consumption (Equation 1.2).

Most DVFS approaches take advantage of system slack or idleness. Kaxiras and Martonosi [23] demonstrate the different levels at which slack can appear. At the system level, DVFS mechanisms take advantage of the processor being idle and scale frequency down to minimize idle periods [41, 13]. At the hardware level, DVFS can be applied at a very fine-grained level, targeting the slack that appears in the hardware operations [9]. Finally, at the program level (or program-phase level), slack can appear due to long-latency memory operations that force the processor to stall while such operations are pending. Regardless of the level that a DVFS technique targets, the heart of every approach is to detect and exploit the slack to save energy without inordinately penalizing system performance.

In Paper I, we focus on program-level DVFS, and we propose a simple and accurate analytical model to quantify how the execution time of an application is affected by frequency scaling. To develop our model, we use a previously proposed analytical model [22, 12] that estimates performance based on different miss events experienced by the processor. Although DVFS has been extensively studied in previous work, most approaches rely on empirical models and trial-and-error methods [17, 18, 26, 42, 28, 44]. Our analytical models, on the other hand, investigate DVFS from a different perspective and open up new opportunities for energy efficiency optimizations. Paper I focuses on developing, evaluating and using our runtime models to improve energy efficiency on a simulator.
However, the simplicity of our models and the fact that DVFS is available in most modern commercial processors motivated us to use our models in real systems. Papers II and III discuss how our models can be ported to commodity processors, and demonstrate how we can use them to understand the behavior of different applications under DVFS and optimize energy efficiency.

2.1 Interval-based DVFS Performance Model

The interval-based analytical model, proposed by Karkhanis and Smith [22] and further enhanced by Eyerman et al. [12], breaks the execution of a program into intervals. During the steady-state intervals, the processor executes instructions at a rate that is only limited by the width of the processor and the workload's Instruction Level Parallelism (ILP). Steady-state intervals are punctuated by miss-intervals, which introduce stall cycles to the machine. Miss-intervals are introduced by various miss events, such as instruction and data cache misses and branch mispredictions. To model performance variation due to DVFS, we need to understand how different intervals are affected by frequency scaling.

Figure 2.1. The baseline analytical interval-based model. Steady-state intervals are shaped by the machine width and the program's ILP. Miss-intervals are due to miss events such as cache misses and branch mispredictions and introduce stall-cycles to the machine. (Axes: instructions executed versus total cycles; marked events: on-chip miss, branch miss, LLC miss.)

Figure 2.1 shows how the interval model represents the different miss events. The x-axis corresponds to time in processor cycles, while the y-axis shows the number of instructions issued per cycle. Assuming that the processor operates in a single voltage and frequency domain, the latency of on-chip events (measured in processor cycles), such as branch mispredictions and on-chip misses, is not affected by frequency scaling. This is because all processor components are fed with the same clock, therefore changing the clock speed does not change the relative speed between different components. Consequently, for a workload exhibiting only on-chip miss events, changing the clock speed does not affect the number of cycles required to execute that application. Of course, the clock period is affected, therefore execution time scales inversely with frequency. Such applications are characterized as compute-intensive. When off-chip misses occur, however, DVFS causes a change in the relative speed between the processor and the off-chip memory, hence the total number of cycles is no longer constant.
Therefore, to understand how processor performance is affected by DVFS, we need to model how the off-chip miss-intervals are affected by frequency scaling. Unlike on-chip miss events, the latency of an off-chip miss is affected by the core frequency, due to the asynchronous communication between the processor and the main memory. For example, given a processor operating at 1 GHz and a main memory with 100 ns access time, the latency of the main memory is

$$100\,\text{ns} \times 10^{9}\,\frac{\text{cycles}}{\text{sec}} = 100 \text{ processor cycles}.$$

If we scale the processor frequency down to 500 MHz, main memory access time remains 100 ns, therefore memory latency now becomes 50 processor cycles. In other words, memory latency (measured in processor cycles) scales proportionally to processor frequency.

Figure 2.2. The miss-interval of an LLC-miss. Due to out-of-order execution, the processor can issue instructions under an LLC miss up to the point that all the remaining instructions depend on that miss. The different areas of the miss-interval are characterized as elastic or inelastic to frequency scaling. (Axes: instructions executed versus total cycles. Marked areas: ROB-fill, an inelastic area that does not scale at all with frequency; full stall, an elastic area and the only quantity that scales proportionally with frequency; IQ drain; ramp-up. The stall cycles as measured by the stall-based model do not scale proportionally with frequency.)

Figure 2.3. Overlapping LLC misses. When the first miss reaches the head of the Reorder Buffer, the processor stalls until the miss is serviced. Then, new instructions can enter the instruction window until the processor stalls again, due to the second miss reaching the head of the Reorder Buffer. When the second miss is also serviced, the processor can reach the steady-state issue-rate again. (Axes: instructions executed versus total cycles; marked quantities: memory latency of LLC miss1 and LLC miss2, ROB-fill, ST1, ST2, x, y.)

Figure 2.2 shows an off-chip miss in more detail. Once such a miss occurs, the processor keeps issuing instructions until the miss reaches the head of the Reorder Buffer (ROB). This area is called ROB-fill. At this point, no more instructions can enter the issue window, hence the issue rate starts to drop. When all the instructions left in the instruction window depend on the pending miss, the processor stalls and waits for the miss to be serviced. Only after the data has arrived from the main memory will the processor be able to execute new instructions and ramp up to the steady-state issue rate again.
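The cycle counts quoted in the 1 GHz and 500 MHz example of this section follow from a one-line unit conversion, sketched here in Python. The function name is hypothetical; the latency and frequency values are the ones used in the text.

```python
def memory_latency_cycles(latency_ns, core_freq_mhz):
    """Express a fixed off-chip latency (ns) in core cycles at a given core frequency (MHz)."""
    # 1 MHz corresponds to 1e-3 cycles per nanosecond.
    return latency_ns * core_freq_mhz / 1000


# The memory access time stays at 100 ns; only the core clock changes.
print(memory_latency_cycles(100, 1000))  # latency in cycles at 1 GHz
print(memory_latency_cycles(100, 500))   # latency in cycles at 500 MHz
```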
As explained above, memory latency scales proportionally to frequency, but the different areas of the miss-interval are affected in different ways. Regarding ROB-fill, it will take the same number of cycles to fill up the ROB regardless of the frequency, hence we say that ROB-fill is inelastic to frequency scaling. Full-stall, on the other hand, is the number of cycles that the processor spends being completely idle. Since the total memory latency scales proportionally to frequency, full-stall also changes with frequency, i.e., it is elastic to frequency scaling. Finally, IQ-drain and ramp-up can also be elastic to frequency scaling if frequency is aggressively scaled down to the extent that full-stall is completely eliminated.

Figure 2.3 shows the case where more than one miss overlap with each other. In particular, during the ROB-fill or IQ-drain areas of LLC miss1, a second miss, LLC miss2, occurs. The processor first stalls because of the first miss, and after this miss is serviced, it starts issuing instructions again before it stalls due to LLC miss2 reaching the head of the ROB. When the second miss is also serviced, the instruction issue rate rises again to meet the steady-state rate.

Based on the observations discussed above, in Paper I we propose two simple interval-based analytical models to estimate how performance changes between different DVFS settings. The first model, called the stall-based model, makes certain simplifications and can be applied in almost every modern processor. The miss-based model, on the other hand, is more accurate, but the input required is not readily available in all processors.

Stall-based model

The stall-based model assumes that ROB-fill is negligible, therefore the stall of an LLC miss scales proportionally with frequency. As shown in Figure 2.2:

$$memory\_latency = ROB\_fill + stall \approx stall \qquad (2.1)$$

For the multiple-misses case of Figure 2.3, we can also approximate that

$$ST_1 + ST_2 = y + memory\_latency - ROB\_fill - x \approx memory\_latency \qquad (2.2)$$

assuming that $y - ROB\_fill - x \approx 0$.
Hence, in both cases of single and overlapping misses, stalls are approximately equal to memory latency, which scales proportionally to frequency. Consequently, the total number of stalls also scales (approximately) proportionally to frequency. Assuming that executing a fixed number of instructions under frequency $f$ takes $c$ cycles, we can estimate the number of cycles required to execute the same instructions under frequency $f/k$. As explained above, non-stall cycles ($c - ST$) remain intact with frequency scaling, while stall cycles scale in the same way that frequency does. Therefore, execution cycles at frequency $f/k$ are approximated as

$$c_{f/k} = c - ST + \frac{ST}{k} \qquad (2.3)$$
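Equation 2.3 translates directly into code. The following Python sketch is illustrative: the function names and the cycle counts in the example are hypothetical, and converting cycles to seconds by dividing by the target frequency is the standard relation between cycle count and clock rate.

```python
def stall_based_cycles(total_cycles, stall_cycles, k):
    """Equation 2.3: estimated cycles at frequency f/k from counts measured at f."""
    non_stall = total_cycles - stall_cycles   # unaffected by frequency scaling
    return non_stall + stall_cycles / k       # stall cycles shrink with the clock


def predicted_time(total_cycles, stall_cycles, freq_hz, k):
    """Estimated execution time in seconds at the target frequency f/k."""
    return stall_based_cycles(total_cycles, stall_cycles, k) / (freq_hz / k)


# Hypothetical interval: 1,000,000 cycles at 2 GHz, 40% of them memory stalls.
c, st, f = 1_000_000, 400_000, 2.0e9
baseline = c / f
halved = predicted_time(c, st, f, k=2)
print(halved / baseline)   # slowdown below 2x because the stalls absorb part of it

# A compute-bound interval (no stalls) slows down by the full factor k.
print(predicted_time(c, 0, f, k=2) / baseline)
```

In this made-up case halving the frequency is predicted to cost a 1.6x slowdown rather than 2x, which is the memory slack that DVFS exploits.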

After estimating the execution cycles at the target frequency, the execution time can be easily calculated by dividing execution cycles by frequency.

Miss-based model

The miss-based model acknowledges that it is the whole miss-interval that scales proportionally to frequency. Notice that this does not imply that ROB-fill scales with frequency. As explained above, ROB-fill is inelastic to frequency scaling, but the full-stall area changes in a way that the whole miss-interval scales proportionally to frequency. Moreover, the inelasticity of ROB-fill has an important implication for the miss-interval of the overlapping misses.

Figure 2.4. A complex case explaining which misses are critical for DVFS, i.e., which are the misses whose miss-intervals scale with frequency. In a group of overlapping misses, only the first miss is important for DVFS performance estimation. Once this miss is serviced, the next miss that occurs indicates the start of a new group of overlapping misses. (Axes: instructions executed versus total cycles; marked: LLC miss1 to LLC miss4, memory latency, ROB-fill, offsets x and y.)

Figure 2.4 shows a complex case for overlapping misses. LLC miss1 occurs and, x cycles later, LLC miss2 overlaps with the first miss. Since LLC miss2 occurred x cycles after the first miss, it will also be serviced x cycles after the first miss is serviced. Moreover, ROB-fill does not scale with frequency, which means that LLC miss2 will always occur and be serviced x cycles after LLC miss1. Therefore, when misses overlap, only the miss-interval of the first miss scales with frequency, whereas the miss-interval(s) of the additional miss(es) remain intact. One might assume that the misses that should be counted for DVFS modeling are the ones that do not overlap with previous misses. This, however, is not true. In Figure 2.4, LLC miss3 occurs after LLC miss1 has been serviced, but it overlaps with LLC miss2, which is still pending.
Although LLC miss3 overlaps with another miss, it initiates a new group of overlapping misses, therefore it should be accounted for in DVFS modeling. This is because, although x will not change with DVFS, the remainder of the LLC miss3 miss-interval (shown as memory latency in Figure 2.4) will scale. LLC miss4 overlaps with LLC miss3 after y cycles, hence it will always be serviced y cycles after LLC miss3 is serviced and its extra miss-interval will not change with DVFS.

From the example above we can determine that a miss is important for DVFS modeling (i.e., its miss-interval scales with frequency) if it initiates a new group of overlapping misses. Once such a miss occurs, the misses overlapping with that miss are not counted for DVFS, until that first miss is serviced. Then, the next miss that occurs indicates a new group of overlapping misses, even if it overlaps with pending misses from the previous group. In the literature, the name leading miss [35] has been proposed for the first miss in a group of overlapping misses. Hence, a miss that does not overlap with a leading miss is a leading miss itself. After counting the number of leading misses, we can calculate the number of cycles that scale with frequency simply by multiplying the number of leading misses by the memory latency ($mem\_lat \times leading\_misses$). Then, when scaling frequency from $f$ to $f/k$, execution cycles can be estimated as:

$$c_{f/k} = c - mem\_lat \times leading\_misses + \frac{mem\_lat \times leading\_misses}{k} \qquad (2.4)$$

2.2 Implementing the Models in Real Processors

The models shown in this chapter were conceived and evaluated using the SimpleScalar [6] simulator. Using a simulator enabled us to investigate concepts that would be otherwise hard to explore in detail. In particular, understanding how misses overlap with each other, and how the stall cycles are affected by frequency scaling depending on whether they were generated by isolated or overlapping misses, would not have been possible without the detailed view of a cycle-accurate simulator. However, our motivation for creating our models was to use them in real hardware, therefore we put great effort into implementing them in commercial processors. This task is particularly challenging due to the limited selection of performance-counter events that are available in commodity processors.
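To illustrate the kind of information the miss-based model requires, the sketch below applies the leading-miss rule to a trace of LLC-miss timestamps and then evaluates Equation 2.4. It is a simplified illustration, not the implementation used in the papers: it assumes a trace of miss issue times is available (as in a simulator), that every miss has the same fixed memory latency, and the function names and example timings are hypothetical.

```python
def count_leading_misses(miss_cycles, mem_lat):
    """Count misses that start a new group of overlapping misses.

    A miss is leading if no earlier leading miss is still pending when it
    occurs; misses overlapping a pending leading miss are skipped.
    """
    leading = 0
    serviced_at = None            # cycle at which the current leading miss returns
    for t in sorted(miss_cycles):
        if serviced_at is None or t >= serviced_at:
            leading += 1
            serviced_at = t + mem_lat
    return leading


def miss_based_cycles(total_cycles, leading_misses, mem_lat, k):
    """Equation 2.4: estimated cycles at frequency f/k."""
    scaling = mem_lat * leading_misses     # cycles that scale with frequency
    return (total_cycles - scaling) + scaling / k


# Timings loosely mirroring Figure 2.4, with a 200-cycle memory latency:
# miss2 overlaps miss1; miss3 arrives after miss1 is serviced (while miss2
# is still pending) and therefore leads a new group; miss4 overlaps miss3.
trace = [0, 50, 220, 260]
n_leading = count_leading_misses(trace, mem_lat=200)
print(n_leading)
print(miss_based_cycles(10_000, n_leading, mem_lat=200, k=2))
```

Real hardware exposes no such trace, which is why the availability of suitable performance-counter events decides whether the model can be implemented.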
Until recently, implementing the miss-based model was infeasible due to the lack of events that monitor the overlapping of last-level cache misses. In recent AMD processors, however, researchers have been able to implement a leading-load estimator [39]. The stall-based model, on the other hand, can be approximated using a combination of events that have been supported by the majority of processors in recent years: the number of stall cycles and the number of LLC misses. Papers II and III discuss the heuristics used to implement the stall-based model on an Intel Nehalem and an AMD Phenom II processor. Although both processors offer a performance-counter event to measure stall cycles, there is no event that counts stalls that are explicitly due to LLC misses. A first-order approximation is to use the total number of stalls and assume that those are mostly due to LLC misses, as these are the longest-latency miss events that a processor can experience. As an extra optimization step, we use the total number of LLC misses along with the average memory latency to estimate the worst-case stalls due to off-chip misses. This prevents us from erroneously classifying other stalls, such as on-chip misses or long-latency operations (e.g., DIV), as LLC-miss stalls in compute-intensive applications. This heuristic, despite its simplicity, achieves good accuracy at predicting performance across different frequencies, leading to near-optimal runtime DVFS decisions (Paper II).

2.3 Model Accuracy

The models presented in this chapter were evaluated both in a simulator (Paper I) and on real hardware (Paper III). Running the SPEC2000 benchmark suite in the SimpleScalar simulator, the stall-based model yields an average error of 2.1% when executing at f_max and predicting for f_max/4 and vice versa. The maximum error, however, can be up to 20%, because this model disregards the existence of the ROB-fill area. The miss-based model, on the other hand, achieves an impressive 0.2% average error for the same frequency range, while its maximum error is less than 5%. In real hardware, implementing the miss-based model was infeasible at the time due to the lack of appropriate performance-counter events. Regarding the stall-based model, we could only implement an approximation of the model due to the lack of an event that counts stalls explicitly due to LLC misses in Intel and AMD processors. However, the approximation discussed in Section 2.2 yields good accuracy for practical purposes. Running the SPEC2006 benchmark suite on an Intel Nehalem processor, we could estimate execution time across the maximum (2.66GHz) and minimum (1.6GHz) frequencies with an average error of less than 5%.

2.4 Model Extensions

Our DVFS models have served as an inspiration for a significant amount of related work and have been extended in various interesting directions.
First of all, two more research groups, working independently but concurrently with us, have proposed models that are similar to our miss-based model [11, 35]. Rountree et al. [35] proposed the term leading loads for the loads that initiate a group of overlapping misses, i.e., the misses that our miss-based model identifies as critical for DVFS performance estimation. Miftakhutdinov et al. [29] extended our model to account for memory systems with non-fixed memory latency. Nath et al. [30] adapted our model to estimate performance variation of GPGPU workloads under DVFS. Finally, Akram et al. [3] extended previous work in the field to model DVFS performance for managed multi-threaded applications.
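The stall-based approximation described in Section 2.2 can be sketched in a few lines of Python. This is one possible reading of the heuristic, not the exact code of Papers II and III: it assumes that the worst-case off-chip stall estimate (LLC misses times average memory latency) is used as an upper bound on the measured stall cycles, and that the resulting memory stalls scale with frequency in the same way as the leading-miss cycles of Equation 2.4. All names and numbers are hypothetical.

```python
def estimate_memory_stalls(total_stall_cycles: float,
                           llc_misses: float,
                           avg_mem_latency_cycles: float) -> float:
    """Approximate the stall cycles caused by LLC misses.

    The measured stalls are capped by the worst-case off-chip stall time,
    so that stalls in compute-intensive code are not attributed to memory.
    """
    worst_case = llc_misses * avg_mem_latency_cycles
    return min(total_stall_cycles, worst_case)


def predict_cycles_stall_based(cycles_at_f: float,
                               memory_stalls: float,
                               k: float) -> float:
    """Estimated cycles at f/k, assuming only memory stalls scale."""
    return cycles_at_f - memory_stalls + memory_stalls / k


# Hypothetical counter readings for one interval.
mem_stalls = estimate_memory_stalls(total_stall_cycles=5000,
                                    llc_misses=10,
                                    avg_mem_latency_cycles=200)
cycles_at_half_freq = predict_cycles_stall_based(10000, mem_stalls, k=2)
```

In this made-up interval only 2000 of the 5000 stall cycles can be explained by off-chip misses, so only those cycles are scaled.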

3. DVFS Power Modeling

Estimating energy and power consumption of a running application is crucial both for (i) understanding the behavior of an application, and (ii) optimizing its energy efficiency through different runtime techniques. Although an abundance of different power models have been proposed, they can be divided into two main categories [40]:

Bottom-up power models [5, 27, 38] use theoretical models to estimate the power consumption of different parts of a system, based on characteristics such as node technology, circuit layout and design parameters. These models tend to be highly configurable, as different parameters are simply fed into the theoretical models, but their accuracy is questionable, with the estimation error often exceeding 20% [34, 43]. The goal of these models is to at least provide reliable relative power estimation, to determine whether certain modifications have a positive or negative impact on energy and power consumption.

Top-down models [21, 19, 7, 14], on the other hand, employ machine learning theory, treating the processor as a black box, and use power measurements obtained from a real system while running a set of test applications to create a regression power-model. Although these models are only useful for the hardware that they were trained on, they are highly accurate and can be used in real systems to drive power and energy optimization techniques. The challenge in this class of models is to identify the processor events that best correlate with processor power consumption, as well as to select a good benchmark training-set to build the regression model.

Top-down models are typically built by running a set of applications on a real processor and measuring the power consumption for each of them.
Moreover, different performance-counter events are monitored to represent the activity of the processor during the execution of the applications, such as the number of instructions executed, the number of accesses and misses in the different caches of the system, and the type of executed instructions. Then, a regression model is built, assuming that power consumption is a function of the selected performance-counter events, and the model parameters are acquired by fitting the model to the observed power and counter measurements. The accuracy of the model depends on the selection of events to be monitored, as well as the diversity of the applications used to train the model.

In Paper III we extend previous work by building a frequency-independent regression power-model that only needs to be trained at a single frequency and can then be used to estimate power consumption at any frequency. Moreover, we show that by combining the power model with the DVFS performance model of Chapter 2, power consumption of an application can be estimated for any frequency, regardless of the frequency that the application was profiled at. To train and evaluate our power model, we first showcase different methods for obtaining power measurements of applications running on real processors.

Figure 3.1. Measuring Intel Nehalem power consumption from the voltage regulator. Using the motherboard schematics, we were able to detect the pins providing the voltage and current that is fed to the processor. Then, by attaching cables and monitoring them with an oscilloscope, we were able to measure and log power consumption of the processor.

3.1 Measuring Power Consumption

Measuring power consumption is an important part of developing a linear-regression power model. Moreover, it is necessary for evaluating the benefits achieved when using our performance and power models at runtime to optimize energy efficiency (Chapter 5). In our work, we have used two different approaches to measure power consumption on real hardware.

From the voltage regulator

Measuring power consumption from the voltage regulator yields the most accurate results, since the voltage regulator is the component closest to the processor at which we can measure power. Other approaches (wall power, power from the motherboard ATX connector) introduce noise due to the power consumption of components other than the processor. The disadvantage, however, is that the voltage regulator is not always easily accessible on the motherboard, and detailed schematics need to be available to determine where to install the measuring probes. Moreover, adding helper electronic devices (e.g., shunt resistors, inductive current sensors) is infeasible, and measuring power relies on the voltage regulator's self-monitoring capabilities. Paper II presents in more detail the methodology for measuring power consumption on Intel Nehalem and AMD Phenom II from the voltage regulator pins.

From the motherboard ATX connector

An alternative, more generic approach to measuring power consumption is to monitor the power that is fed from the power supply to the motherboard through the ATX connectors. To do so, we designed the power-measuring device shown in Figure 3.2a. The device consists of a PCB that is installed between the power supply and the motherboard of the system that we want to measure. Each of the power-supply voltage rails (ATX 3.3V, 5V and 12V, as well as the separate 12V rail supplying the processor) is driven through sense resistors R_s on the PCB. Current flowing through a sense resistor produces a proportional voltage drop across it. This voltage is measured using fine-grained sensors [8], the output of which is sampled using an A/D device. As a result, the voltage read by the A/D device is proportional to the current flowing through each voltage rail. Using test currents for each of the sensors, we were able to calibrate our device and determine the exact relationship between the current flowing through the sensors and the voltage read by the A/D device. Then, our device can be plugged into a system as shown in Figure 3.2b to measure its power consumption. This device was used in Paper III to develop and evaluate a linear-regression power model for the Intel Nehalem processor.

3.2 A Frequency-Independent Power Model

Previous work [21, 19, 7, 14] has focused on how to select a representative set of (i) performance-counter events, and (ii) training benchmarks to build regression power-models.
However, these works have not investigated how such models can be made voltage and frequency independent. Consequently, different model parameters have to be derived for different frequencies by training the power model at each of them. This is because, in most cases, it is assumed that energy consumption is a linear function of different event counts. Equivalently, power consumption is a linear function of different event rates.

\[ Energy = coeff_1 \times event_1 + coeff_2 \times event_2 + \dots \]
\[ Power = coeff_1 \times event\_rate_1 + coeff_2 \times event\_rate_2 + \dots \tag{3.1} \]

where the coefficients are obtained by fitting the equation to the observed values of the events and the measured power samples. This power-model structure, however, does not identify the role of voltage and frequency in the power consumption, hence a different power model has to be built for every V-f configuration.

Figure 3.2. Measuring power consumption from the ATX connector. (a) shows the design of our device, which includes a PCB with current sensors attached for every ATX connector cable. The device is installed between the power supply and the target-machine motherboard. An A/D device is used to read the outputs of the current sensors, and a logging machine collects and reports the samples. (b) shows how the device is attached to a real system.

In Paper III, we propose a more flexible power model that can be built by selecting a more detailed model formulation. As seen in Chapter 1, power consumption is given by the following equation:

\[ Power = f \times C_{eff} \times V^2 + P_{static} \]

In this equation, it is only C_eff (activity_factor × load_capacitance) that is non-deterministic, as f and V are controlled by the user or the operating system, and P_static does not depend on the application running on the processor. Therefore, to estimate the power consumption of an application, we only need to estimate the effective capacitance. The source of dynamic power consumption is the switching activity of the various node capacitances that make up a processor. On every cycle, some of these node capacitances switch state, leading to dynamic power consumption. Hence, it is intuitive to assume that C_eff, which is the average capacitance that switches state on every cycle, correlates with the number of different events that occur on every processor cycle. Therefore, we propose the following power-model formulation:

\[ C_{eff} = \sum_{i=1}^{n} \left( coeff_i \times \frac{event_i}{cycles} \right) + C_{clock} \]
\[ Power = f \times C_{eff} \times V^2 + P_{static}(V) \tag{3.2} \]

Static power consumption P_static is a function of voltage and temperature, and can be measured off-line when the processor is idle (hence dynamic power is 0). To obtain the model coefficients, we run a set of benchmarks at a single voltage and frequency and measure power consumption and a set of events. To determine C_eff, we subtract P_static from the total power consumption and then divide by f × V^2. Then, the events measured for each benchmark are divided by the number of cycles to obtain the average number of events occurring on every processor cycle. Finally, the computed event rates and C_eff values are used to determine the model coefficients through linear regression.
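The training procedure for the model of Equation 3.2 can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the implementation of Paper III: the helper names are hypothetical, the regression is ordinary least squares solved through the normal equations, and the last fitted coefficient plays the role of C_clock. The static power at the target voltage is assumed to be known from an off-line measurement.

```python
from typing import List, Sequence


def effective_capacitance(power_w: float, p_static_w: float,
                          freq_hz: float, volt: float) -> float:
    """C_eff = (Power - P_static) / (f * V^2), from Equation 3.2."""
    return (power_w - p_static_w) / (freq_hz * volt ** 2)


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Gauss-Jordan elimination with partial pivoting (small dense systems)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ceff_model(event_rates: Sequence[Sequence[float]],
                   ceff: Sequence[float]) -> List[float]:
    """Least-squares fit of C_eff = sum_i coeff_i * rate_i + C_clock.

    Returns [coeff_1, ..., coeff_n, C_clock].
    """
    rows = [list(r) + [1.0] for r in event_rates]  # constant term for C_clock
    n = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    atb = [sum(r[i] * y for r, y in zip(rows, ceff)) for i in range(n)]
    return solve_linear_system(ata, atb)


def predict_power(coeffs: Sequence[float], event_rates: Sequence[float],
                  freq_hz: float, volt: float, p_static_w: float) -> float:
    """Equation 3.2 evaluated at an arbitrary voltage and frequency."""
    ceff = sum(c * r for c, r in zip(coeffs[:-1], event_rates)) + coeffs[-1]
    return freq_hz * ceff * volt ** 2 + p_static_w
```

Because the coefficients describe C_eff rather than power, the same fitted list can be passed to predict_power with a different frequency and voltage, provided that the event rates per cycle and the static power for that operating point are supplied.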
In Paper III, we use events that have been previously proposed in the literature (instructions executed, floating-point instructions, L2 accesses/misses, branch mispredictions, resource stalls), but different events can be used with the same model formulation. The advantage of this power model is that it decouples voltage and frequency from the power-model coefficients. The model is only trained once, at a single voltage and frequency, and the model parameters can be used to estimate power at any voltage and frequency of interest. This is because the model correlates event rates with the processor activity (C_eff), which is decoupled from voltage and frequency. In Paper III we build such a model for the Intel Nehalem processor. Our experimental findings show that our model can estimate power consumption across different frequencies with an average error of less than 4% for the SPEC2006 benchmark suite. Recently, Walker et al. [40] proposed a similar power-model formulation by decoupling voltage, frequency and static power and automating the process of curve fitting to create power models for an ARM SoC.
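Returning to the ATX-connector measurements of Section 3.1, the conversion from A/D readings to power can be sketched as below. This is an illustrative sketch only: it assumes a linear calibration (gain and offset) fitted from the test currents, and it multiplies the derived current by the nominal rail voltage, which may differ from how the actual device computes power. Rail names, function names and all numbers are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Calibration = Tuple[float, float]  # (gain in A per V, offset in A)


def fit_calibration(test_currents_a: Sequence[float],
                    adc_volts: Sequence[float]) -> Calibration:
    """Least-squares line: current = gain * adc_voltage + offset."""
    n = len(adc_volts)
    mean_v = sum(adc_volts) / n
    mean_i = sum(test_currents_a) / n
    sxx = sum((v - mean_v) ** 2 for v in adc_volts)
    sxy = sum((v - mean_v) * (i - mean_i)
              for v, i in zip(adc_volts, test_currents_a))
    gain = sxy / sxx
    return gain, mean_i - gain * mean_v


def rail_power(adc_volt: float, calibration: Calibration,
               rail_voltage: float) -> float:
    """Power drawn on one rail: rail voltage times calibrated current."""
    gain, offset = calibration
    return rail_voltage * (gain * adc_volt + offset)


def total_power(samples: Dict[str, float],
                calibrations: Dict[str, Calibration],
                rail_voltages: Dict[str, float]) -> float:
    """Sum the power of all monitored rails for one A/D sample set."""
    return sum(rail_power(samples[r], calibrations[r], rail_voltages[r])
               for r in samples)


# Hypothetical calibration run: 0 A, 1 A and 2 A test currents.
cal = fit_calibration([0.0, 1.0, 2.0], [0.0, 0.5, 1.0])
cpu_rail_w = rail_power(0.75, cal, 12.0)
```

Summing the per-rail results over all ATX rails gives a system-level reading, while the separate 12V processor rail alone approximates the processor's share.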

4. Understanding the Power Behavior of Applications

Although power is a critical design constraint, there is a lack of profiling tools that provide information about the power behavior of applications. Simply measuring average power consumption is not sufficient, since the behavior can differ significantly within an application if the application exhibits different phases throughout its execution. Moreover, changing the voltage and frequency of the processor can have a significant impact on power and performance, which may vary depending on the phase of execution. Figure 4.1 shows the execution time, energy and power consumption of two different phases, X and Y, of the gcc/166 application from the SPEC2006 benchmark suite, when executed at frequencies f_max (2.66GHz) and f_min (1.6GHz). Regarding execution time, whether X or Y is the longest-running phase depends on the frequency. At maximum frequency, X takes longer to execute than Y. When frequency is scaled down, however, the execution time of Y increases substantially, as opposed to X, which only experiences a small overhead. This is because Y is a compute-bound phase, while X involves a significant amount of off-chip communication and is therefore memory-bound. Regarding power consumption, Y is more power-hungry than X at maximum frequency, which is reasonable since compute-bound phases make better use of the processor and therefore consume more power. When it comes to energy, however, due to X's longer execution time, both phases consume approximately the same amount of energy at f_max. Therefore, it is obvious that execution time, power and energy are equally important to characterize the behavior of the different application phases.

Figure 4.1. Execution time, power and energy consumption at maximum (f_max) and minimum (f_min) frequency for two phases (X, Y) of the gcc/166 application from the SPEC2006 benchmark suite.

Paper III discusses how the DVFS models presented in the previous chapters can be utilized to understand and estimate the behavior of an application in different DVFS configurations. Moreover, it employs phase detection to correlate behavior with different application phases, and it proposes a phase-based sampling methodology to monitor performance-counter events when their number exceeds the hardware capabilities. The tool presented in Paper III, called Power-Sleuth, only needs to profile an application once, at a single frequency. Then, it can estimate the power and performance behavior of each individual phase in different V-f configurations.

4.1 Phase Detection

Applications can have very distinct behavior throughout their execution, or, in other words, they can exhibit different phases. To detect such phases online, we use the ScarPhase [36] library. ScarPhase is characterized by low overhead (less than 2%), and utilizes the concept of Basic Block Vectors (BBV) [37] to detect code regions that make up different execution phases. In particular, execution is divided into intervals, during which branch instructions are sampled using Intel Precise Event Based Sampling (PEBS). The branch addresses are hashed into a vector, the entries of which show how many times the corresponding branches were sampled in the execution of the program. Hence, every interval is characterized by a vector. Similar intervals/vectors are clustered together and form a program phase, and they are characterized by similar behavior in terms of different metrics (IPC, cache misses, power, etc.).

4.2 Utilizing Program Phases to Aid Performance Event Monitoring

To better understand the processor behavior, we often need to monitor various performance-counter events, the number of which often exceeds the number of events that can be concurrently monitored by the hardware.
Of course, one can run an application several times and monitor different events each time. This, however, is not applicable to run-time optimizations, and it introduces significant overhead in profiling tools. In Power-Sleuth, phase detection and performance and power modeling require a total of 9 performance-counter events. This number can easily increase if more events are involved in the linear-regression power model to achieve higher accuracy. Phase information, however, can be used to efficiently sample a subset of events at a time, and interpolate the event values for the missing samples.
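The idea of phase-guided event sampling can be illustrated with a short Python sketch. The interpolation scheme actually used in Power-Sleuth is not described here, so the sketch makes a simplifying assumption: an event that was not monitored during an interval is filled in with the average value observed for that event in other intervals of the same phase. The data layout, function name and event names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Interval = Tuple[int, Dict[str, float]]  # (phase id, measured event rates)


def fill_missing_events(intervals: Sequence[Interval],
                        all_events: Sequence[str]) -> List[Dict[str, float]]:
    """Fill unmonitored events with the per-phase average of that event."""
    sums: Dict[Tuple[int, str], float] = defaultdict(float)
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for phase, measured in intervals:
        for event, value in measured.items():
            sums[(phase, event)] += value
            counts[(phase, event)] += 1

    filled = []
    for phase, measured in intervals:
        row = dict(measured)
        for event in all_events:
            # Only interpolate when the phase has at least one real sample.
            if event not in row and counts[(phase, event)] > 0:
                row[event] = sums[(phase, event)] / counts[(phase, event)]
        filled.append(row)
    return filled


# Hypothetical trace: only one event is monitored per interval.
trace = [
    (0, {"inst": 1.0}),
    (0, {"llc_miss": 0.02}),
    (1, {"inst": 2.0}),
    (0, {"inst": 1.2}),
]
completed = fill_missing_events(trace, ["inst", "llc_miss"])
```

In this made-up trace, the second interval never measured instructions, so it inherits the average of the two other phase-0 samples, while the single phase-1 interval keeps a gap for the event that was never observed in that phase.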


Power-Sleuth: A Tool for Investigating your Program's Power Behavior
Vasileios Spiliopoulos, Andreas Sembrant, Stefanos Kaxiras
Uppsala University, Department of Information Technology P.O. Box 337, SE-751