
Foundations and Trends® in Electronic Design Automation
Vol. 6, No. 2 (2012)
© 2012 S. Reda and A. N. Nowroz

Power Modeling and Characterization of Computing Devices: A Survey

By Sherief Reda and Abdullah N. Nowroz

Contents

1 Introduction
1.1 Computing Substrates
1.2 Survey Overview
1.3 Summary

2 Background: Basics of Power Modeling
2.1 Dynamic Power
2.2 Static Power
2.3 Summary

3 Pre-Silicon Power Modeling Techniques
3.1 Power Modeling for General-Purpose Processors
3.2 Power Modeling for SoC-Based Embedded Systems
3.3 Power Modeling for FPGAs
3.4 Summary

4 Post-Silicon Power Characterization
4.1 Power Characterization for Validation and Debugging
4.2 Power Characterization for Adaptive Power-Aware Computing
4.3 Power Characterization for Software Power Analysis
4.4 Summary

5 Future Directions

Acknowledgments

Notations and Acronyms

References

Power Modeling and Characterization of Computing Devices: A Survey

Sherief Reda and Abdullah N. Nowroz
Brown University, 182 Hope St, Providence 02912, USA
sherief_reda@brown.edu, Abdullah_nowroz@brown.edu

Abstract

In this survey we describe the main research directions in pre-silicon power modeling and post-silicon power characterization. We review techniques in power modeling and characterization for three computing substrates: general-purpose processors, system-on-chip-based embedded systems, and field-programmable gate arrays. We describe the basic principles that govern power consumption in digital circuits, and utilize these principles to describe high-level power modeling techniques for designs of the three computing substrates. Once a computing device is fabricated, direct measurements on the actual device reveal a great wealth of information about the device's power consumption under various operating conditions. We describe characterization techniques that integrate infrared imaging with electric current measurements to generate runtime power maps. The power maps can be used to validate design-time power models and to calibrate computer-aided design

tools. We also describe empirical power characterization techniques for software power analysis and for adaptive power-aware computing. Finally, we provide a number of plausible future research directions for power modeling and characterization.

1 Introduction

In the past decade power has emerged as a major challenge to computing advancement. A recent report by the National Research Council (NRC) of the National Academies highlights power as the number one challenge to sustaining historical improvements in computing performance [39]. Power is limiting the performance of both mobile and server computing devices. At one extreme, embedded and portable computing devices operate within power constraints to prolong battery operation. The power budgets of these devices are about tens of milliwatts for some embedded systems (e.g., sensor nodes), 1–2 W for mobile smart phones and tablets, and W for laptop computers. At the other extreme, high-end server processors, where performance is the main objective, are increasingly becoming hot-spot limited [46], where increases in performance are constrained by a maximum junction temperature (typically 85°C). Economic air-based cooling techniques limit the total power consumption of server processors to about W, and it is the spatial and temporal allocation of the power distribution that leads to hot spots in the die that can compromise the reliability of the device. Because server-based systems are typically deployed in data centers, their aggregate performance becomes power

limited [6], where energy costs represent the major portion of the total cost of ownership. The emergence of power as a major constraint has forced designers to carefully evaluate every architectural and design feature with respect to its performance and power trade-offs. This evaluation requires pre-silicon power modeling tools that can navigate the rich design landscape. Furthermore, runtime constraints on power consumption require power management tools that control a number of runtime knobs that trade off performance and power consumption. Power management techniques that seek to meet a power cap, e.g., as in the case of servers in data centers [6, 37], require either direct power measurements when feasible, or alternatively, runtime power modeling techniques that can substitute for direct characterization. In addition, software power characterization can help tune and restructure algorithms to reduce their power consumption.

The last decade has seen a diversification in possible computing substrates that offer different trade-offs in performance, power, and cost for different applications. These substrates include application-specific custom-fabricated circuits, application-specific circuits implemented in field-programmable gate arrays (FPGAs), general-purpose processors whose functionality is determined by software, general-purpose graphical processing units (GP-GPUs), digital signal processors (DSPs), and system-on-chip (SoC) substrates that combine general-purpose cores with heterogeneous application-specific custom circuits. None of these substrates necessarily dominates the others; rather, each offers certain advantages that depend on the target application and the deployment setting of the computing device. For instance, custom-fabricated circuits outperform their FPGA counterparts in performance and power, but they are more expensive.
SoCs offer a higher performance/watt ratio for a range of applications than general-purpose processors; however, general-purpose processors offer higher throughput for scientific applications. GP-GPUs are also emerging as a strong contender to processors and FPGAs; however, the relative advantage of each of these substrates differs by application [39, 54, 76]. Sorting out the exact trade-offs of all these substrates across different application domains is an active area of research [4, 29, 76]. While power modeling and characterization for these substrates share common concepts, each of these substrates

has its own peculiarities. In this survey we will discuss the basic power modeling and characterization concepts that are shared among these substrates as well as the specific techniques that are applicable to each one.

Pre-silicon power modeling and post-silicon power characterization are very challenging tasks. The following factors contribute to these challenges.

(1) Large die areas with billions of transistors and interconnects lead to computational difficulties in modeling.
(2) Input patterns and runtime software applications trigger large variations in power consumption. These variations are computationally impossible to enumerate exhaustively during modeling.
(3) Spatial and temporal thermal variations arising from power consumption trigger large variations in leakage power, which lead to intricate dependencies in power modeling.
(4) Process variabilities that arise during fabrication lead to intra-die and inter-die leakage power variations that are unique to each die. These deviations recast the modeling results as educated guesses, rather than exact estimates.
(5) Practical limitations on the design of power-delivery networks make it difficult to directly characterize the runtime power consumption of individual circuit blocks.

The objective of this survey is to describe modern research directions for pre-silicon power modeling and post-silicon power characterization. Pre-silicon power modeling tools estimate the power consumption of an input design, and they can be used to create a power-aware design exploration framework, where different design choices are evaluated in terms of their power impact in addition to traditional design objectives such as performance and area. Post-silicon power characterization tools are applied to a fabricated design to characterize its power consumption under various workloads and environmental variabilities.
The results of power characterization are useful for power-related debugging issues, calibration of design-time power modeling tools, software-driven power analysis, and adaptive

power-aware computing. Our technical exposition reviews power modeling and characterization techniques of various computing substrates, while emphasizing cross-cutting issues. We also connect the dots between the research results of different research communities, such as circuit designers, computer-aided design (CAD) developers, computer architects, and system designers. Our discussions reveal the shared concepts and the different research angles that have been explored for power modeling and characterization.

1.1 Computing Substrates

1.1.1 General-Purpose Processors

A general-purpose processor is designed to serve a large variety of applications, rather than being highly tailored to one specific application or class of applications. The design of a general-purpose processor must be done carefully to deliver good performance within the processor's thermal design power (TDP) limit under different kinds of workloads. The TDP limit has forced a significant change in the design of processors. At present, designers aim to increase the processor's total throughput rather than improving single-thread performance. This throughput increase is achieved by using more than one processing core per chip. Figure 1.1 gives an example of a quad-core processor based on Intel's Core i7 Nehalem architecture. The 64-bit processor features four cores that share an 8 MB L3 cache. The cores can run at up to 3.46 GHz within a 130 W TDP. Each core has a 16-stage pipeline and includes a 32 KB L1 instruction cache, a 32 KB L1 data cache, and a 256 KB L2 cache. The front-end of the pipeline can fetch up to 16 bytes from the L1 instruction cache. The instructions in the fetched 16 bytes are identified and inserted into an instruction queue. The decoder unit receives its inputs from the instruction queue, and it can decode up to four instructions per cycle into micro-ops.
A branch prediction unit with a branch target buffer enables the core to fetch and process instructions before the outcome of a branch is determined. The back-end of the pipeline allocates resources for the micro-ops and renames their source and destination registers to eliminate hazards and to expose instruction-level

parallelism. The micro-ops are then queued in the re-order buffer until they are ready for execution. The pipeline can dynamically schedule and issue up to six micro-ops per cycle to the execution units as long as the operands and resources are available. The execution units perform loads, stores, scalar integer or floating-point arithmetic, and vector integer or floating-point arithmetic. The results from the execution of micro-ops are stored in the re-order buffer, and results are committed in-order only for correct instruction execution paths.

Fig. 1.1 High-level diagram of the Intel Core i7 processor (Nehalem architecture).

1.1.2 Embedded SoC

SoCs are computational substrates that are targeted at embedded systems and mobile computing platforms for a certain niche of applications. An SoC for a smart phone or a tablet typically consumes less than 1–2 W of power, while delivering the throughput required for

applications that include video and audio playback, internet connectivity, and games. In contrast to a general-purpose processor, an SoC includes, in addition to the general-purpose core(s), application-specific custom hardware (HW) components that can provide the required throughput for the target applications within the power envelope of the embedded system. Because total die area is constrained by cost and yield considerations, the inclusion of application-specific custom HW components must come at the expense of the functionality of the general-purpose core. SoC general-purpose cores are less capable than the ones used in general-purpose processors. They are usually less aggressively pipelined, with limited instruction-level parallelism capabilities and smaller cache sizes. Figure 1.2 gives an example of an SoC based on NVIDIA's Tegra platform that has a total power budget of about 250 mW.

Fig. 1.2 Example of the NVIDIA Tegra SoC.

The SoC features a 32-bit ARM11 general-purpose core that runs at up to 800 MHz. The ARM11 core has an 8-stage pipeline, with single instruction issue and support for out-of-order completion. The L1 data and code cache memory sizes are 32 KB each, and the size of the L2 cache is 256 KB. The performance specifications of the core are clearly inferior to those of the Core i7. To compensate for the lost general-purpose computing performance, the SoC uses a number of application-specific components to deliver the required performance within its

power budget. These include an image signal processor that can provide image processing functions (e.g., de-noising, sharpening, and color correction) for images captured from embedded cameras. The SoC includes a high-definition audio and video processor for image, video, and audio playback, and a GPU to deliver the required graphics performance for 3-D games. The SoC supports an integrated memory controller, an encryption/decryption accelerator component, and components for communication, such as Universal Asynchronous Receiver/Transmitter (UART), Universal Serial Bus (USB), and High-Definition Multimedia Interface (HDMI). All SoC components communicate with each other using an on-chip communication network, which can take a number of forms, including shared and hierarchical busses, point-to-point busses, and meshes.

1.1.3 Field-Programmable Gate Arrays

Soaring costs associated with fabricating computing circuitry at advanced technology nodes have increased the interest in programmable logic devices that can be configured after fabrication to implement user designs. The most versatile programmable logic devices currently available are field-programmable gate arrays (FPGAs). The basic FPGA architecture is an island-style structure, where programmable logic array blocks (LABs) are embedded in a reconfigurable wiring fabric that consists of wires and switch blocks, as illustrated in Figure 1.3. The inputs and outputs of the LABs are connected to the routing fabric through programmable switches. When programmed, these switches determine the exact input and output connections of the LABs. In addition, tens to hundreds of programmable I/O pads are available in the FPGA. FPGAs often also host heterogeneous dedicated computing resources, such as digital signal processors to implement multiplications, memory blocks to store runtime data, and even full lightweight processor cores.
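As background for the logic elements inside these LABs: functionally, a k-input look-up table (LUT) is nothing more than a 2^k-entry truth table addressed by the input bits. The following minimal sketch (helper names are invented for illustration, not vendor tooling) shows why such a table can implement any k-input Boolean function:

```python
def make_lut(func, k):
    """Precompute the 2**k-entry truth table of a k-input Boolean function.
    Entry i holds the function value when input bit b equals bit b of i."""
    return [func(*[(i >> b) & 1 for b in range(k)]) for i in range(2 ** k)]

def eval_lut(table, inputs):
    """Address the table with the input bits, as the configured LUT hardware does."""
    addr = sum(bit << b for b, bit in enumerate(inputs))
    return table[addr]

# A 4-input LUT configured to compute (a AND b) OR (c AND d).
lut = make_lut(lambda a, b, c, d: (a & b) | (c & d), 4)
print(eval_lut(lut, [1, 1, 0, 0]))  # 1
print(eval_lut(lut, [1, 0, 1, 0]))  # 0
```

The 16 table entries are exactly the configuration bits that the FPGA bitstream loads into the LUT's SRAM cells.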
Each LAB is composed of several basic logic elements (BLEs), where a BLE is made up of a 4-, 5-, or 6-input look-up table (LUT) together with an associated flip-flop. A 4-input LUT can be used to implement any 4-input Boolean function. Figure 1.4(a) illustrates the structure

of a BLE, and Figure 1.4(b) illustrates the structure of a LAB.

Fig. 1.3 Island-style FPGA.

Fig. 1.4 Typical design of a logic array block (LAB) and a basic logic element (BLE).

Each BLE can receive its inputs from other BLEs inside its LAB or from other LABs through the reconfigurable wiring fabric. Additional wiring structures in the LAB enable it to propagate arithmetic carry outputs

in a fast and efficient way.

To implement a computing circuit in an FPGA, it is first necessary to synthesize the input circuit by breaking it up into subcircuits, where each subcircuit is mapped to a BLE. These BLEs are then clustered into groups, where the size of each group is determined by the number of BLEs in a LAB. These clusters are then mapped and placed at the LABs. Finally, routing is conducted to determine the exact routes and switches of the routing fabric used by the circuit. The configuration bits for the logic and routing are stored in SRAM or FLASH memory cells. While FPGAs are very attractive to computer-system designers due to their post-silicon flexibility, this flexibility comes at the expense of higher design area and power consumption compared to custom circuits that perform the same computing tasks. For example, Kuon and Rose report almost a 35× overhead for using programmable logic over custom logic [68]. However, for low- to mid-volume fabrication, programmable logic is the only economically feasible technology.

Along with performance and area, power is also an important factor that must be considered during architectural design exploration of FPGAs. FPGA architectural parameters include segment length, switch block topology, cluster size, and BLE/LAB designs. Choices for these parameters lead to different power, performance, and area trade-offs. Thus, proper evaluation of power consumption is required to help designers and users make correct choices for the FPGA's architecture and programmed designs.

1.2 Survey Overview

The basic techniques for circuit-level power modeling are discussed in Section 2. The power consumption of computing circuits can be described by two components: dynamic power and static power. The section includes discussions on how to estimate each of these components when the design's circuit is available.
We will also discuss the various factors that impact these power components, which include circuit design and layout, input patterns, fabrication technology, process variability, and operational temperature. The discussions in Section 2 will form the basis for the techniques discussed in Sections 3 and 4.

In Section 3 we discuss techniques for pre-silicon power modeling. Historically, performance and area were the two main criteria during the design of computing devices. In recent years, power has emerged as a third criterion that has to be considered during design. Every architectural feature has to be judged in terms of its performance, area, and power. A typical design space has an exponential number of possible combinations of settings for the various features. Thus, there is a strong need for power modeling methods that enable designers to efficiently explore the design space and to evaluate the impact of various high-level system architectural choices and optimizations on power consumption. These architectural features and choices vary by the medium of the computing substrate. For multi-core processors, the choices include, for example, pipeline depth, instruction issue width, and cache sizes. For SoC-based embedded systems, the choices include the functionality of the custom blocks and the on-chip communication architecture (e.g., network topology, buffer sizes, and transfer modes). In some embedded systems, the boundary between hardware (HW) and software (SW) is fluid, and the implementation (SW or HW) of every component could be decided based on its impact on performance, power, and area. In embedded design environments, it is necessary to have power co-modeling tools that can effectively explore the possible HW/SW implementation choices of every design component and guide designers to the correct choice. FPGA power modeling is also challenging because the user's design is not known during the design and fabrication of the FPGA. Furthermore, users do not have direct access to the internal circuits of the FPGA. Thus, pre-characterized power models for the different FPGA structures must be estimated during the design of the FPGA and then bundled with the vendor's tools to be used by the end user.
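The multiplicative growth of such design spaces can be sketched with a toy enumeration. The knob values and the closed-form power/performance model below are invented placeholders (not taken from any real estimator); the point is only that even three knobs multiply into many configurations, each of which needs a fast power estimate:

```python
from itertools import product

# Hypothetical architectural knobs; values are illustrative only.
pipeline_depths = [8, 12, 16]
issue_widths = [2, 4]
l2_sizes_kb = [256, 512, 1024]

def toy_model(depth, width, l2_kb):
    """Invented closed-form stand-in for a real power/performance estimator."""
    perf = width * depth ** 0.5 * (1 + l2_kb / 2048)
    power = 0.5 * width * depth + 0.01 * l2_kb  # fabricated scaling, "watts"
    return perf, power

# Exhaustive exploration: only 3 * 2 * 3 = 18 points here, but each added
# feature multiplies the count, so real spaces reach millions of points.
best = max(product(pipeline_depths, issue_widths, l2_sizes_kb),
           key=lambda cfg: toy_model(*cfg)[0] / toy_model(*cfg)[1])
print(best)  # configuration maximizing the toy performance/power ratio
```

With this fabricated model, the shallow pipeline with the small L2 wins on performance per watt; a different (equally hypothetical) model would pick differently, which is precisely why accurate pre-silicon power models matter for exploration.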
Once a design is implemented and a physical prototype is available for direct measurements, new opportunities become possible. In Section 4, we discuss a number of techniques for post-silicon power characterization. We describe techniques that integrate infrared imaging and direct electric current measurements to develop power mapping techniques that reveal the true power consumption of every design structure. These true power maps can be used to validate pre-silicon

design estimates, to calibrate power-modeling CAD tools, and to estimate the impact of variabilities introduced during fabrication. We also discuss power characterization techniques for adaptive power-aware computing, where power models based on lumped power measurements are used by power management systems to cut down operational margins and to enforce runtime power constraints. Another discussed topic is SW power characterization using instruction-level, architectural-level, and algorithmic-level power models. SW power characterization helps software developers and compiler designers cut down the power consumption of their applications.

1.3 Summary

In this section we have highlighted the importance of power modeling and characterization techniques for modern computing devices. Future computing systems will be constrained by power, and the choices for design features and runtime settings have to be guided by their impact on power consumption as well as traditional objectives such as performance and implementation area. Computing substrates can come in a number of forms, including custom circuits with fixed functionality, general-purpose processors whose functionality is determined by software applications, SoCs that combine general-purpose processing cores with application-specific custom circuits, and programmable logic that can be used to implement computing circuits in a cost-effective way. These computing forms share some basic power modeling techniques; however, their unique architectural features enable them to utilize efficient large-scale modeling and characterization methods. Pre-silicon power modeling and post-silicon characterization techniques will be discussed in the remaining sections of this survey. The basic circuit-level power modeling techniques are discussed in Section 2. High-level power modeling techniques for various computing substrates will be discussed in Section 3.
In Section 4 we overview different techniques for post-silicon power characterization through physical measurements on a fabricated device. Finally, a number of future research directions are outlined in Section 5.

2 Background: Basics of Power Modeling

Basic transistor- and circuit-level power modeling is a mature area of research with numerous existing textbooks and surveys [73, 86, 99, 100]. The objective of this section is to introduce sufficient background to understand the focus of this survey, which spans from pre-silicon high-level power modeling techniques to post-silicon power characterization techniques. Power consumption of digital Complementary Metal Oxide Semiconductor (CMOS) circuits is caused by two mechanisms. The first mechanism is dynamic power, which arises when signals transition their values, and the second mechanism is static power, which causes circuits to dissipate power even when no switching activity is occurring. One of the main advantages of using CMOS technology over earlier bipolar technology was that CMOS circuits consumed power only during circuit switching. However, aggressive technology scaling in the past decade has led to a situation where static power is no longer negligible, but rather a significant contributor to total power consumption. This section discusses the basic principles and main challenges of circuit-level power modeling. The techniques discussed in this section provide the basic background required for the next two sections.

2.1 Dynamic Power

Logic gates implemented in CMOS chips use two complementary transistor types, NMOS and PMOS, to build the functionality of each gate. One terminal of PMOS transistors is typically connected to the voltage supply, V_DD, while one terminal of NMOS transistors is connected to the ground voltage, V_GND. Figure 2.1 gives the schematic of an inverter gate that consists of one NMOS transistor and one PMOS transistor.

Fig. 2.1 CMOS inverter.

To understand the operation of the gate, assume that the input voltage is first at logic 0 (i.e., V_GND). In this case the PMOS transistor is in the on state with a very low resistance (ideally 0), while the NMOS transistor is in the off state with a very high resistance (ideally ∞), and a path exists to charge the load capacitance C_L until the output voltage reaches V_DD. The load capacitance, C_L, represents the total capacitance arising from the output diffusion capacitances of the two transistors, the input capacitances of fan-out gates, wiring capacitance, and parasitics. When the input voltage switches to logic 1 (i.e., V_DD), the PMOS transistor is in the off state with a very high resistance (ideally ∞), while the NMOS transistor is in the on state with a very low resistance (ideally 0), and a path exists to discharge the charges on the load capacitance to ground until the output voltage reaches 0. The sum of the energy consumed during the charging and discharging, i.e., the energy per cycle, is

equal to C_L V_{DD}^2. The dynamic power consumed by the gate, which is the switching energy per second, is

P_{dynamic,gate} = s C_L V_{DD}^2,    (2.1)

where s is an activity factor that denotes the number of switching cycles per second. If a circuit has N gates, then the total dynamic power is

P_{dynamic} = \sum_{i=1}^{N} s_i C_{L_i} V_{DD}^2,    (2.2)

where s_i and C_{L_i} are the switching activity and load capacitance of gate i, respectively. Besides directly contributing to dynamic power, the capacitances also determine the exact propagation delays of the signals, which influence the occurrence of signal glitches. Glitches are unnecessary signal transitions arising from unbalanced path delays at gate inputs, as illustrated in Figure 2.2.

Fig. 2.2 Illustration of a glitch arising from a mismatch in arrival times at the inputs of an OR gate.

In synchronous logic circuits, the clock signal has the highest switching activity, f, in the circuit. Thus, the buffer gates along the clock network path will have the highest transition frequency. All other gates in the design can switch at a rate that is at most half of f. Determining the exact switching activity of each gate is a challenging task because (1) it requires knowledge of the exact sequence of input vectors applied to the circuit; and (2) the exact signal timing information is not accurately available until the final circuit layout is determined. The final layout provides the exact wiring and parasitic capacitances, which are needed to estimate the load capacitances and the propagation delays along the wires.

Another power component incurred during switching is short-circuit power. If the transition edge (from 1 to 0 or 0 to 1) of the input

signal is not sharp, there will exist a brief moment of time in which the NMOS and PMOS transistors are both turned on and current will flow from the supply terminal to the ground. Short-circuit power is incurred only when switching activity occurs, which makes it proportional to dynamic power consumption. Its exact value is determined by the slopes, or transition times, of the input and output signals. With proper circuit design, short-circuit power is usually about 10% of the dynamic power [43].

To get a reasonable estimate of switching activity, it is necessary to obtain representative input waveform vectors for the circuit. Two approaches are possible. (1) Designers can generate input vectors that are derived from knowledge of the semantics and intended functionality of the circuit. Finding the most relevant input vectors is a challenge, especially for computing circuits of general-purpose processors, where software applications can trigger a large range of input vector sequences. While it is possible to construct input vectors that are triggered by some standard benchmark applications, there is always a chance that a new application can trigger non-modeled activity behavior. (2) If realistic input vectors are not available, then it is possible to construct pseudo-random input sequences that are generated in a way that mimics realistic input vectors. The first input vector can be constructed by assigning each input bit a logic level that depends on the probability of observing a 0 or a 1 as an input signal. Then each subsequent input vector is generated from the previous vector with some transition probability for each bit. Input signals of real circuits exhibit spatial and temporal correlations, where signal levels of inputs that are structurally close to each other can exhibit spatial correlations, and input transitions can be correlated in time.
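The second approach can be sketched as follows. The probability values and function name are arbitrary placeholders, and this simple version treats bits independently, so it captures the per-bit signal and transition probabilities but none of the spatial correlations:

```python
import random

def generate_vectors(n_bits, n_vectors, p_one=0.5, p_flip=0.2, seed=0):
    """Pseudo-random input sequence: the first vector is drawn from the
    signal probability p_one; each later vector flips each bit
    independently with transition probability p_flip."""
    rng = random.Random(seed)
    vec = [1 if rng.random() < p_one else 0 for _ in range(n_bits)]
    seq = [vec[:]]
    for _ in range(n_vectors - 1):
        vec = [b ^ (rng.random() < p_flip) for b in vec]
        seq.append(vec[:])
    return seq

seq = generate_vectors(n_bits=8, n_vectors=1000, p_one=0.3, p_flip=0.1)
# The empirical per-bit transition density should land near p_flip:
flips = sum(a != b for u, v in zip(seq, seq[1:]) for a, b in zip(u, v))
print(flips / ((len(seq) - 1) * 8))
```

Such a sequence can then drive a logic or circuit simulator to estimate the switching activities s_i of Equation (2.2).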
Thus, it is desirable to account for these spatiotemporal correlations during the generation of pseudo-random input sequences [88]. Accurate circuit-level power estimates are obtained using a circuit-level simulator such as SPICE [94]. Given the input voltage

waveform vectors and the layout capacitances, SPICE can solve the equations of the circuit to compute the voltage and current signals at all circuit nodes. These signals give the exact switching activities and the total current drawn from the voltage supply. Furthermore, leakage estimation (discussed in Section 2.2) can be realized within SPICE by back-annotating the temperatures of transistors. However, SPICE simulations are computationally feasible only for small circuits.

Speeding-up Power Simulation. To speed up dynamic power estimation, it is possible to partition a circuit into a number of blocks and estimate the power consumption of each block using SPICE. The input vectors and the resultant power estimates from the simulations can then be used to build a power macro model for quick estimation of the block's power within the context of the larger circuit. If the block size is relatively small, then it is possible to store the power estimates of every possible pair of consecutive input vectors. For larger circuit blocks, it is necessary to abstract the input vectors into macro parameters, and for each possible combination of these parameters, the average estimated power value is stored in the corresponding entry of the macro model look-up table. Candidate parameters include the Hamming distance between consecutive input vectors, the average input signal probability, the average input transition density, and the average spatial correlation coefficients [44]. In macro modeling, deviations from the average value arise because of interactions between the partitioned blocks and because glitch propagation is not captured. Another speed-up option is to use fast switch-level logic simulators; however, these simulators cannot incorporate exact timing information, and consequently they tend to underestimate the dynamic power incurred from signal glitches.

Average Power Estimation.
It is also possible to speed up dynamic power estimation by directly estimating the average dynamic power. In this case, Equation (2.2) can be approximated by a simpler form:

P_{dynamic} = s C_{total} V_{DD}^2,    (2.3)

where s is the average switching activity per circuit node and C_{total} is the total circuit capacitance including the wiring. In the approximate

form, estimating the average dynamic power breaks into two estimation problems: total capacitance and average switching activity.

Capacitance: the total capacitance is the sum of the gate capacitances and wiring capacitances. If the gate-level implementation is not known, but another representative specification (e.g., Boolean formulae) is known, then it is possible to derive complexity metrics that relate the Boolean formulae to the number of gates [21]. If the final layout is not available, then the wiring capacitance can be estimated using wire-load models and Rent-based statistical techniques [42, 135]. These techniques use parameters such as the number of gates, the number of design partitions, the fan-out distribution, and the wiring complexity to estimate the wire length distribution. However, these estimates quickly lose their accuracy at higher design abstractions [62].

Switching activity: to calculate the average switching activity, the average signal level and transition probabilities are first assigned to the circuit inputs and then propagated analytically through the different gates depending on their functions and their input probabilities [93, 98, 101]. Let s_o denote the average switching activity at output node o, which is driven by a gate with k inputs, and let s_i denote the average switching activity of the i-th input. Assuming that the switching activities of the input signals are uncorrelated, it is possible to express s_o as

    s_o = sum_{i=1}^{k} p(s_o | s_i) · s_i,    (2.4)

where p(s_o | s_i) is the probability that the output transitions due to a transition in the i-th input. To make this analysis more accurate, it is possible to consider the spatiotemporal correlations among the different input and internal signals [88]. Temporal correlations are of particular importance because they determine the transition rates of signals.
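As a toy illustration of this propagation, the uncorrelated-input rules can be sketched in a few lines. For an AND gate, a transition on one input reaches the output only when the other input is 1 (for OR, when the other input is 0); these are the p(s_o | s_i) terms of Equation (2.4). All numeric values below are assumed for illustration, not taken from the text:

```python
# Hedged sketch: propagate average switching activity through two gates
# using the uncorrelated-input approximation of Eq. (2.4), then estimate
# average dynamic power with the approximate form of Eq. (2.3).

def and_gate(p1, s1, p2, s2):
    """Return (signal probability, switching activity) of a 2-input AND."""
    p_out = p1 * p2
    s_out = p2 * s1 + p1 * s2          # Eq. (2.4): sum_i p(s_o | s_i) * s_i
    return p_out, s_out

def or_gate(p1, s1, p2, s2):
    p_out = 1 - (1 - p1) * (1 - p2)
    s_out = (1 - p2) * s1 + (1 - p1) * s2
    return p_out, s_out

# Primary inputs: probability of being 1 and average transitions per cycle.
pa, sa = 0.5, 0.2
pb, sb = 0.5, 0.2
pc, sc = 0.5, 0.1

p_and, s_and = and_gate(pa, sa, pb, sb)       # y = a AND b
p_out, s_out = or_gate(p_and, s_and, pc, sc)  # z = y OR c

# Approximate average dynamic power per Eq. (2.3): P = s * C_total * V_DD^2.
V_DD = 1.0           # supply voltage in volts (assumed)
C_total = 50e-15     # total switched capacitance in farads (assumed)
s = (sa + sb + sc + s_and + s_out) / 5        # average activity per node
P_dynamic = s * C_total * V_DD ** 2
```

With strongly correlated inputs this approximation breaks down, which is why the text points to spatiotemporal correlation models [88].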

In many cases the objective of power estimation is to explore different design choices at higher levels during architectural and system exploration. At these levels, circuit-level implementations might not be known, and it is necessary to estimate power from high-level specifications. The topic of high-level power modeling for various computing substrates will be the main focus of Section 3.

2.2 Static Power

Static power is the power consumed by transistors when they are not switching. When CMOS logic gates do not switch, they have no electrical path between the supply terminal, V_DD, and the ground terminals. Thus, static or leakage current was historically negligible; however, aggressive scaling in sub-100 nm technologies has led to a substantial increase in its magnitude. Note that modern computing devices include analog components (e.g., phase-locked loops, sense amplifiers) that incur static DC power consumption. For digital CMOS, there are three main components of leakage current: (1) subthreshold leakage current between the transistor's source and drain; (2) gate leakage between the transistor's channel and gate; and (3) reverse bias current between the transistor's drain and well [2, 125]. Figure 2.3 illustrates these three leakage components.

Fig. 2.3. Three components of leakage current in a MOS transistor.

With the introduction of high-k dielectrics, the main source of leakage current is the subthreshold leakage. MOS transistors operate by modulating an energy barrier between the source and drain of the transistor. The height of the energy barrier

is called the threshold voltage (V_th). By increasing the potential difference between the gate and the source, V_GS, it is possible to reduce the energy barrier, enabling more flow of electrical carriers between the source and the drain of the transistor. During the off state when V_GS = 0, the average carrier's energy is lower than the barrier's energy; however, the carriers do not have a uniform energy distribution, and there is a probability that some carriers will have higher energy than the height of the barrier, enabling them to leak from the source to the drain [39]. The probability of a carrier having an energy higher than the average energy drops exponentially with a factor that is proportional to temperature. The subthreshold leakage current can be mathematically expressed as

    I_leak = I_o · e^(-q·V_th / (α·k·T)),    (2.5)

where I_o is a constant that depends on the transistor's geometrical dimensions and fabrication technology, α is a number greater than 1, T is the temperature of the transistor, q is the charge of the electrical carrier, and k is the Boltzmann constant [39]. Equation (2.5) reveals a number of sensitivity factors that impact leakage [1].

Process sensitivity: leakage power depends exponentially on V_th; for every 100 mV drop in threshold voltage, the leakage current increases by about a factor of 10. Aggressive scaling in sub-100 nm process technologies has reduced the value of V_th enough to make leakage power a significant contributor to the total power consumption. Furthermore, aggressive technology scaling has led to large levels of manufacturing process variations due to statistical fluctuations inherent in the manufacturing process. Manifestations of these variations include gate length variations, line-edge roughness, and dopant fluctuations [14, 106, 123, 146]. These variations impact the threshold voltages of individual transistors, leading to intra-die and inter-die leakage variations.
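Equation (2.5) can be evaluated numerically to see both sensitivities at once; I_o and α below are assumed illustrative values, not fitted device parameters:

```python
# Hedged sketch: evaluate the subthreshold leakage model of Eq. (2.5),
# I_leak = I_o * exp(-q * V_th / (alpha * k * T)), to see its sensitivity
# to threshold voltage and temperature.
import math

q = 1.602e-19      # electron charge (C)
k = 1.381e-23      # Boltzmann constant (J/K)

def i_leak(v_th, temp, i_0=1e-6, alpha=1.7):
    # i_0 and alpha are assumed, illustrative device constants.
    return i_0 * math.exp(-q * v_th / (alpha * k * temp))

# Sensitivity to V_th: roughly 10x more leakage per 100 mV drop at 300 K.
ratio_vth = i_leak(0.3, 300.0) / i_leak(0.4, 300.0)

# Sensitivity to T: leakage grows super-linearly with temperature.
ratio_temp = i_leak(0.3, 350.0) / i_leak(0.3, 300.0)

print(f"100 mV V_th drop -> {ratio_vth:.0f}x leakage")
print(f"300 K -> 350 K   -> {ratio_temp:.1f}x leakage")
```

The first ratio reproduces the rule of thumb quoted above (about 10x per 100 mV of V_th); the second shows why the on-die thermal gradients discussed next matter so much for leakage estimation.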
In 2002, Intel released a statistical study that revealed about 20× variation in the leakage of one of its production processors [16].

Fig. 2.4. Thermal distributions of a dual-core processor during runtime.

Temperature sensitivity: leakage power also depends exponentially on temperature. Besides changes in the environmental temperature, internal heat generation arising from power consumption increases the temperature. Furthermore, the large spatial variations in power consumption lead to spatial thermal gradients on the chip, which create non-uniform leakage power variations. Using our laboratory's infrared imaging setup, we illustrate in Figure 2.4 a thermal map of a popular dual-core processor during operation. The thermal maps show non-uniform temperature distributions with gradients reaching up to 14 °C. While the leakage current of an individual logic gate also depends on its input vector, the sum of these vector-dependent variations typically averages out for large circuits; furthermore, they are dwarfed by the impact of temperature and V_th variations [1]. The sensitivities of leakage power (i.e., V_DD · I_leak) to the supply voltage, threshold voltage, and temperature introduce variabilities in leakage modeling.

Modeling Leakage Dependency on Process and Temperature Variations. To model inter-die variations, circuit designers have to simulate the design at a number of corners that capture the worst, typical, and best case scenarios for supply voltage, threshold voltage, temperature, and transistor geometrical dimensions (gate length in particular). The results from the simulation give a range for leakage current, from the highest leakage at the high supply voltage, high temperature, and low V_th corner to the lowest leakage at the low supply voltage,

low temperature, and high V_th corner.

Fig. 2.5. Corner simulation results (timing in ns and leakage in mW) for a 64-bit adder in 90 nm technology.

Figure 2.5 gives the results of a typical corner simulation for a design, where the delay of the design's critical path is given on the y-axis, and the total leakage power is given on the x-axis.¹ The results show a large range (almost 25×) of leakage power variations. The results also show the typical inverse correlation between leakage and delay, where leaky transistors tend to be faster, which creates critical paths with smaller delay. To model intra-die leakage variations, it is necessary to account for the intra-die variations arising from manufacturing variability and for the spatial and temporal variations in temperature [22, 121, 136]. Intra-die process variability tends to be spatially correlated; that is, devices that are close to each other will likely have more similar characteristics than those that are farther away. If V_th is considered a random variable, then the statistical field formed from the V_th of all transistors is spatially correlated, where the correlation between two transistors

¹ We simulate a 64-bit ripple carry adder at 90 nm using the PTM SPICE models.

on the die is a function of the distance between them [38, 47]. Using a representative model that captures the spatial correlations in device parameters, it is possible to statistically compute the intra-die variations in leakage power [22, 48].

Fig. 2.6. Temperature-aware leakage power modeling.

Modeling the dependency of intra-die leakage on temperature requires the use of a thermal simulator that can estimate the temperature distributions on the chip and then provide feedback to the leakage power model. Figure 2.6 shows an iterative feedback power modeling technique, where estimates of the leakage and dynamic power components are given as inputs to the thermal simulator. The resultant spatial thermal distribution is fed back to the leakage model to update the leakage power estimates. This feedback loop is repeated a few times until convergence is reached. While temperature and process variations have a strong impact on leakage power, they have a weak impact on dynamic power. In particular, temperature and process variabilities lead to timing variabilities along the various paths of a circuit. These timing variabilities can alter the number of circuit glitches, which impacts dynamic power consumption.

2.3 Summary

In this section we discussed the basics of modeling power consumption in CMOS digital circuits. Power consumption has two components: dynamic power and static power. Dynamic power is incurred due to switching activity, while static power is consumed by

transistors when they do not switch. For dynamic switching power estimation, it is necessary to know the load capacitance and the transition activity of every signal. For dynamic short-circuit power, it is necessary to know the signal transition times for every signal, and for static power, it is necessary to know the threshold voltages and temperatures of the individual transistors.

Power estimation is highly challenging due to many modeling complexities and unknown factors during design time. Dynamic power is highly dependent on input vector sequences, and it is impossible to exhaustively cover all possible sequences, making these estimates vulnerable to deviations arising from unforeseen runtime inputs. Furthermore, process variabilities can impact the signal propagation and transition times along different paths of the circuit, which impacts dynamic power and short-circuit power. Leakage power is highly dependent on temperature and process variabilities. On-die thermal gradients complicate leakage estimation and can lead to deviations between estimates and actual leakage consumption. Furthermore, intra-die and inter-die process variabilities create large variations in leakage power consumption.

The supply voltage is a key parameter in determining the total power consumption. Dynamic power has a quadratic dependency on the supply voltage, while leakage power has a linear dependency. Except for IR drops on the power distribution network, the supply voltage is a relatively easy parameter to estimate. In the past decade, supply and threshold voltages have deviated from ideal scaling, leading to an increase in power consumption. The exponential dependency of leakage power on the threshold voltage has constrained the ability to scale the threshold voltage down with feature size, which in turn stalled supply voltage scaling.
This stalling occurs because transistor current, switching speed, and noise margin are all proportional to the difference between the supply voltage and the threshold voltage. Thus, scaling down the supply voltage at a higher rate than the threshold voltage hurts performance and makes the circuit susceptible to runtime errors.

3 Pre-Silicon Power Modeling Techniques

In this section we discuss pre-silicon power modeling techniques. These techniques are used by designers to evaluate the power consumption of a given design and to explore the device's architectural choices, where different designs are evaluated to assess their performance, power, and area trade-offs. We discuss power modeling techniques for three computing substrates: general-purpose processors, SoC-based embedded systems, and FPGAs. Power modeling for these substrates shares the basic concepts discussed in Section 2; however, their sheer complexity makes direct use of basic circuit power estimation techniques computationally infeasible. We will explore techniques that leverage the unique features of these substrates to customize their high-level power models, resulting in accurate yet computationally feasible models. We discuss power modeling techniques for general-purpose processors in Section 3.1; power modeling techniques for SoC-based embedded systems are discussed in Section 3.2; and power modeling techniques for FPGAs are discussed in Section 3.3.

3.1 Power Modeling for General-Purpose Processors

Power dissipation and thermal issues are among the most important concerns in modern general-purpose processors. As a result, computer architects develop power-aware designs for processors, which require accurate models to estimate power at the early stages of the design flow. Power estimates help to perform architectural optimizations that improve power and performance. A substantial amount of research has addressed early-stage processor power estimation, but accurate and efficient power estimation of general-purpose processors still remains a challenge. As technology advances, transistor miniaturization impacts both dynamic and leakage power, adds functional units to the design, and demands more interconnections between the units. All these factors significantly impact the processor's power consumption and distribution, and as a result, it is hard for power estimation tools to maintain accuracy and efficiency. To analyze and estimate processor power dissipation at an early stage, high-level architectural power modeling combines characterizations from circuit-level simulation data or analytical power models with architectural-level switching activity.

Fig. 3.1. Cycle-by-cycle power estimation framework for a processor.

Figure 3.1 shows a

cycle-by-cycle power estimation framework for processors. The framework has the following parts: a compiler to generate machine code from various benchmark applications, a cycle-accurate simulator to generate resource utilization and circuit switching activity information, and various kinds of power models. There are two main types of simulators for processors: trace-driven and execution-driven. Trace-driven simulators record an execution trace and use the recorded trace as input to the simulator to drive the model. In execution-driven simulators, applications run directly on the simulator host, which maintains both application state and architecture state while recording detailed performance statistics. The current trend in power modeling is to use execution-driven simulation because it captures dynamic properties of the application, unlike trace-driven simulation. The dynamic properties of an application (e.g., speculative execution for branch prediction) are crucial for modern processor power modeling. Examples of execution-driven simulators are SimpleScalar [20] and Asim [36], which are used by many popular power modeling tools such as Wattch [19], SimplePower [148], and CAMP [118]. Some simulators can be configured as either trace-driven or execution-driven (e.g., Turandot [95]) [18, 74]. The cycle-level simulator requires application programs, the hardware configuration, and technology parameters as inputs to model the processor power on a cycle-by-cycle basis. For dynamic power estimation, there are two tasks: one is to estimate the switching activity and the other is to estimate the capacitances, as discussed in Section 2.1. The cycle-accurate simulator is usually used to generate the architectural activity information (switching activity/vector sequences), which is used with the available power/capacitance models to estimate power on a cycle-by-cycle basis.
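The flow just described can be condensed into a toy example: a per-cycle activity trace (as a cycle-accurate simulator would emit) is combined with pre-characterized per-access energies to produce cycle-by-cycle power. The unit names, energies, and trace below are invented for illustration:

```python
# Hedged sketch of the Figure 3.1 flow: per-unit access counts from a
# cycle-accurate simulator are mapped to power through a table of
# pre-characterized per-access energies. All values are assumed.

ENERGY_PER_ACCESS = {   # joules per access (illustrative, not measured)
    "alu": 5e-12,
    "regfile": 8e-12,
    "icache": 20e-12,
    "dcache": 25e-12,
}
CLOCK_PERIOD = 1e-9     # seconds per cycle (1 GHz, assumed)

# Per-cycle activity trace, one dict of access counts per cycle.
trace = [
    {"alu": 2, "regfile": 3, "icache": 1, "dcache": 0},
    {"alu": 1, "regfile": 2, "icache": 1, "dcache": 1},
    {"alu": 0, "regfile": 0, "icache": 1, "dcache": 0},
]

per_cycle_power = [
    sum(n * ENERGY_PER_ACCESS[u] for u, n in cycle.items()) / CLOCK_PERIOD
    for cycle in trace
]
avg_power = sum(per_cycle_power) / len(trace)
print([round(p, 3) for p in per_cycle_power], round(avg_power, 3))
```

A real framework would populate the energy table via the macro modeling and analytical modeling techniques discussed below, rather than with constants.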
Various methods are used for building the models at the architectural level. We will discuss three approaches in this section. First, a macro modeling approach, which involves building the power models by estimating the power of existing circuits through measurement or simulation. This method is suitable for newly designed circuits for which no information about previous implementations is available and for irregularly structured circuits that do not have a specific architectural design template. Second, an analytical modeling approach, which is used for

circuits that have regular organizations. In that case, the circuit can be divided into smaller sub-circuits, and each can have a separate analytical representation. Third, we will discuss regression-based modeling approaches, which are used to speed up power estimation. Estimating the power of a general-purpose processor is usually done by dividing the processor into smaller architectural functional blocks. Each of these functional blocks is modeled separately, and the individual block powers are combined to get the total processor power. As shown in Figure 3.1, the processor is mainly divided into the following categories of functional blocks: datapath units, control unit, memory modules, interconnection/bus wires, and clock distribution networks. Datapath components are adders, multipliers, the register file, pipeline registers, and the arithmetic logic unit (ALU), whereas memory modules are composed of caches, translation lookaside buffers (TLBs), and reorder buffers. Most architectural-level power estimation tools combine both macro modeling and analytical modeling depending on the type of the functional block. In this section we organize our discussion accordingly; the modeling of complex units (e.g., ALU, control unit) will be discussed in the context of macro modeling, while the modeling of more regular structures (e.g., register file, cache) will be discussed in the context of analytical modeling. We will discuss both the dynamic and leakage power of the functional blocks.

3.1.1 Macro Modeling

Macro modeling is a good candidate for estimating the power of functional blocks that do not have a regular structure, such as control units. The first step in macro modeling is choosing appropriate input parameters through which the power models will be queried.
We discuss four different techniques to choose appropriate input parameters to the look-up tables: (1) the input patterns, (2) probabilistic parameters, (3) the instruction traces, and (4) leakage parameters. (1) The first technique is to use the input vectors to the functional blocks as the parameters to access the tables. Because the power of any functional block depends on the input pattern transitions, a natural choice for power macro modeling is to store the estimated switching

energy as a function of consecutive input vectors. In this case, the tables are basically matrices, where the row index is the previous pattern, the column index is the present pattern, and a matrix entry corresponds to the switching energy. An example of such a table is illustrated in Figure 3.2 for a circuit with two inputs.

Fig. 3.2. Example of a look-up table for macro modeling.

The stored value can also be the switching capacitance, which is defined as the effective capacitance value corresponding to a certain switching activity. The problem with this approach is that the size of the look-up table grows exponentially as a function of the input vector length. To address this problem, power estimation tools use the following techniques to compress the size of the tables. The circuit can be partitioned into smaller sub-circuits before building the tables. Power values from the sub-circuits can be combined to find the power for the larger circuit, which helps reduce the size of the look-up table. For example, one 4:1 multiplexer can be partitioned into three 2:1 multiplexers. The table for a 4:1 multiplexer will have 4096 entries, whereas the number of entries for a 2:1 multiplexer table is only 64 [26]. Instead of sampling the input vectors and running a full simulation, it is possible to sample a small subset of internal nodes in a circuit unit and use this subset to estimate the power dissipation of the unit under a larger range of input statistics [12]. In general, partitioning techniques have the

disadvantage that they can introduce errors in power estimation due to their inability to perform glitch propagation. To address this problem, Raghunathan et al. discuss techniques that estimate glitch generation and propagation from the Boolean formulae that specify the control logic [120]. Various clustering algorithms can be used to compress the size of the inputs of these tables while keeping the estimation error within tolerable limits. The values in the look-up tables are distributed into different clusters by collapsing closely related input transition vectors and similar power patterns [90]. For example, a clustering algorithm can compress the table for a 16-bit ripple carry adder with 2^32 entries to a table with only 97 entries, but this clustering reduces estimation accuracy as the number of inputs increases [148]. The dynamic power is directly proportional to the number of transitions of the inputs, and as a result, another choice of input parameter is the Hamming distance between the input patterns [64]. The power dissipation of different blocks can be stored in a look-up table indexed by the Hamming distances between consecutive vectors. If the Hamming distance is used as the input parameter, the table in Figure 3.2 reduces to only 3 entries instead of 16. Though this approach drastically reduces the size of the tables, it is not accurate in cases where certain input bit transitions trigger transitions of other bits. For example, in the case of a 32-bit adder, one single bit transition means the Hamming distance between the previous and current vectors is one, but this transition can trigger many internal transitions, and large power dissipation can occur in the adder due to the propagation of the carry bit.

(2) Probabilistic parameters are also a suitable choice of input parameters for macro modeling instead of using the exact input vectors. Gupta et al.
propose a four-dimensional (4-D) look-up table based on the macro modeling approach [44]. The dimensions of the 4-D model are the average input signal probability, average input transition density,

average output transition density, and average input spatial correlation coefficient. As discussed in Section 2, the inputs of a functional unit can be either temporally or spatially correlated, which makes it essential to include the last parameter as an input to the look-up tables.

(3) A third technique is to use the instruction traces from the cycle-accurate simulator as indices into the look-up table. The control unit in the datapath is complex and accepts hundreds of instructions, so it is impossible to construct one single table that stores all input transitions. Because instructions of the same format have similar power dissipation, different instructions can be categorized according to their format [26]. For each format, it is efficient to extract the corresponding power values using a lower-level power simulator and to access them from the look-up table during runtime power estimation.

(4) The fourth technique is relevant for macro modeling of leakage power. Though it is common to use analytical models for circuit leakage estimation as discussed in Section 2.2, macro modeling is also possible. Pre-characterized leakage power values for different functional blocks obtained from transistor-level simulators can be stored in a look-up table and used during processor power estimation. The inputs to the look-up tables are all the parameters that impact leakage power, which include device width, temperature, supply voltage, and process variation parameters (mean and variance). Do et al. perform characterization of an SRAM cell (shown in Figure 3.3) by connecting a single cell to a pair of bitlines to quantify leakage components via circuit-level simulations [34]. To improve leakage estimation results, a look-up table based leakage current model is used in eCACTI [87]. The look-up tables are built using SPICE simulations of transistors with different widths.
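Technique (1) in its Hamming-distance form can be sketched as a tiny macro model: training pairs of consecutive input vectors with simulated energies are collapsed into a look-up table keyed by Hamming distance alone. The training vectors and energy values below are invented for illustration, standing in for gate-level simulation results:

```python
# Hedged sketch: a Hamming-distance-indexed power macro model.
from collections import defaultdict

def hamming(a, b):
    """Number of bit positions that differ between two input vectors."""
    return bin(a ^ b).count("1")

# (prev_vector, curr_vector, energy_pJ) samples, as a gate-level
# simulator might report them (synthetic numbers).
training = [
    (0b0000, 0b0001, 1.1), (0b0000, 0b0010, 0.9),
    (0b0000, 0b0011, 2.0), (0b0101, 0b0110, 2.2),
    (0b1111, 0b0000, 4.1), (0b0000, 0b1111, 3.9),
]

# Collapse the full vector-pair table into one averaged entry per distance.
buckets = defaultdict(list)
for prev, curr, energy in training:
    buckets[hamming(prev, curr)].append(energy)
macro_model = {d: sum(es) / len(es) for d, es in buckets.items()}

def estimate(prev, curr):
    """Query the compressed macro model with a new vector pair."""
    return macro_model[hamming(prev, curr)]

print(estimate(0b1010, 0b1011))   # a distance-1 transition
print(estimate(0b1100, 0b0011))   # a distance-4 transition
```

The compression is drastic (here 6 samples collapse to 3 entries), but as the text notes for the 32-bit adder, distance alone cannot capture carry-propagation effects, so accuracy degrades for wide arithmetic blocks.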
After choosing the appropriate input parameters using any of the four techniques discussed above, the next critical step in macro modeling is to choose a suitable training set for the models. The training set is a representative subset of all possible values of the chosen input parameters. The training set should be sufficiently large to cover the domains of the parameters. For example, for input vectors, the training set should span all the possible previous and

current input vectors; otherwise, the power values stored in the model will not be accurate. The last step in macro modeling is model characterization through simulation, where the power tools use a lower-level simulator (gate-level or transistor-level) to find the corresponding power values for each representative training set and store the values in the look-up tables.

3.1.2 Analytical Modeling

Analytical models are suitable for functional blocks that are organized in a regular way, such as caches, TLBs, register files, queues, and buffers. Due to the regular structure of these functional blocks, a relationship between the parameters can be established to estimate power. Each functional block can be further divided into smaller structures, where each structure is modeled with a separate analytical equation. Cache memory, which can also be represented as an array, is one of the most power-hungry modules of a processor. The fraction of the total area and the contribution to total power from memory can be as high as 40% [65, 87]. A number of power estimation tools are dedicated purely to estimating cache power [87, 128, 145]. These tools are used in micro-architectural power modeling to determine the best cache configuration. They take a few high-level cache parameters as inputs and output the best cache configuration. High-level input cache parameters include cache size in bytes, block size in bytes, associativity, data output width in bits, and address bus width in bits. Typically, these models estimate the read and write access times as well as the cache power consumption. The basic building block of an array structure (e.g., register file or cache) is the static random-access memory (SRAM) cell, which is shown in Figure 3.3. Power modeling of an SRAM cell is done with analytical equations due to its fixed structure.
The SRAM dynamic power dissipation occurs through two lumped capacitances: the wordline capacitance C_WL and the bitline capacitance C_BL. During each read or write, the wordline is charged through the decoder (shown in Figure 3.4), which results in dynamic power consumption. The bitline power consumption is divided into the read phase and the write phase. During read, the power consumption results from charging and discharging of the bitline

capacitance through the pre-charge circuitry and NMOS access transistors. During write, current discharges through the write circuitry in the cell, which causes power consumption.

Fig. 3.3. Typical 6T SRAM cell structure.

Fig. 3.4. Wordline and bitline schematic in a register file.

The register file has a very regular internal structure, as shown in Figure 3.4, and it can be represented by an array structure. The basic building block of a register file is the SRAM cell. Besides the SRAM cells, a register file also includes decoders, precharge circuitry, and sense amplifiers. Each of these structures has a separate capacitance

equation. Equations (3.1) and (3.2) give the analytical models for the wordline capacitance C_WL and the bitline capacitance C_BL, respectively, in a register file:

    C_WL = C_diff + 2 · C_gate · N_col + C_metal · WL_length,    (3.1)
    C_BL = C_diff + C_diff · N_row + C_metal · BL_length,    (3.2)

where C_diff, C_gate, and C_metal are the diffusion, gate, and metal wire capacitances, respectively; N_row and N_col are the numbers of rows and columns; and WL_length and BL_length are the lengths of the wordline and bitline, respectively. The diffusion capacitance C_diff corresponds to the transistor from the decoder line, the gate capacitance C_gate is estimated from the NMOS transistors used to access the SRAM cells, and the metal wire capacitance C_metal is the load capacitance of the wordline or bitline. The dynamic power of the register file can be estimated using activity information from the cycle simulator combined with the capacitances calculated from the analytical equations. For example, the dynamic power consumption of a bitline can be expressed by

    P_dynamic-bitline = r · C_BL · V_DD · V_swing,    (3.3)

where r is the number of reads or writes per second and V_swing is the voltage swing of the bitline [34, 81]. The leakage current of the SRAM cell can also be determined analytically [87, 149]. Transistors that are in the off state need to be analyzed during the read or write phase. Due to symmetry in the SRAM cell design, the leakage currents of the transistors within each of the pairs (N1, N2), (N3, N4), and (P1, P2) are the same. For example, when WL = 1 and Bit = 1, transistors N1 and P2 are in the off state, and when WL = 0 and Bit = 1, transistors N1, N3, and P2 are in the off state, and as a result they contribute to leakage current. Due to pairwise symmetry, the leakage current for WL = 1 and WL = 0 can be expressed by Equation (3.4) regardless of the data stored in the cell.
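Equations (3.1)–(3.3) can be evaluated directly once per-unit capacitances and array geometry are known; the values below are illustrative assumptions, not characterization data from the text:

```python
# Hedged sketch: register-file wordline/bitline capacitance per
# Eqs. (3.1)-(3.2) and bitline dynamic power per Eq. (3.3).
# All per-unit capacitances and geometry values are assumed.

C_DIFF  = 1.0e-15   # F, diffusion capacitance per transistor
C_GATE  = 0.8e-15   # F, gate capacitance of an access transistor
C_METAL = 0.2e-15   # F per micron of wire

N_ROW, N_COL = 64, 32          # 64 entries x 32 bits
WL_LEN, BL_LEN = 60.0, 120.0   # wire lengths in microns (assumed)

c_wl = C_DIFF + 2 * C_GATE * N_COL + C_METAL * WL_LEN   # Eq. (3.1)
c_bl = C_DIFF + C_DIFF * N_ROW + C_METAL * BL_LEN       # Eq. (3.2)

# Eq. (3.3): bitline dynamic power for r accesses per second.
r = 1e9                         # one read/write per ns
V_DD, V_SWING = 1.0, 0.4        # volts (assumed)
p_bitline = r * c_bl * V_DD * V_SWING

print(f"C_WL = {c_wl*1e15:.1f} fF, C_BL = {c_bl*1e15:.1f} fF")
print(f"P_bitline = {p_bitline*1e6:.1f} uW")
```

In a complete model the activity rate r would come from the cycle-accurate simulator, and analogous closed-form expressions would cover the decoder, precharge circuitry, and sense amplifiers.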
    I_leak = I_leakN1 + I_leakP2                for WL = 1
    I_leak = I_leakN1 + I_leakN3 + I_leakP2     for WL = 0    (3.4)

where I_leakN1, I_leakN3, and I_leakP2 are the leakage currents due to transistors N1, N3, and P2, respectively. The leakage current for each

transistor is obtained from the NMOS and PMOS models as given earlier in Equation (2.5). The leakage power of a register file can be formulated using Equation (3.4). If we assume N_row and N_col are the total numbers of rows (wordlines) and columns (bitlines) in the array structure, then Equation (3.4) can be summed over all the rows and columns as Equation (3.5). During the read phase, only one row has WL = 1 and the remaining (N_row − 1) rows have WL = 0:

    I_leakmemrd = N_col · N_row · (I_leakN1 + I_leakP2) + (N_row − 1) · N_col · I_leakN3,    (3.5)

where I_leakmemrd is the total leakage current during one read of the array [87]. The cache is the main component of processor memory, and it also has a very regular structure similar to the register file. As a result, the power consumption of most memory blocks can be estimated quite accurately using analytical models such as the cache access time estimation model. A typical cache is composed of multiple sub-banks in both the tag and data arrays, as illustrated in Figure 3.5.

Fig. 3.5. Typical cache structure.

The configuration

of the tag array and data array are usually identical. Other components in the cache structure include the bitlines, wordlines, precharge circuitry, row and column multiplexers, address decoder, comparators, and sense amplifiers. Tools like CACTI estimate the capacitances of all these components by representing the circuit as several equivalent RC networks [128]. The gate and diffusion capacitances are estimated from the individual circuit layouts and combined into a total capacitance using the cache configuration parameters. To reduce the impact of cache misses, modern processors employ multi-level cache designs with L1, L2, and L3 caches. Power tools like AccuPower [114] include the option of modeling the cache levels separately. The tool iteratively estimates the capacitances at the switching nodes using the organization specified by the cache configuration parameters and analytical equations.

Regression-Based Modeling

The long simulation times and large number of parameters encountered during cycle-accurate power estimation constrain design studies. To overcome this problem, regression-based modeling is used to predict performance and power for various applications running on processors. Regression analysis is a statistical inference method that establishes the relationship between a dependent variable and one or more independent variables. For power estimation, the dependent variable is the power, which is the observed response, and the independent variables are various architectural parameters. We will discuss two different approaches to regression-based power modeling. Both techniques use offline regression to establish coefficients from a large set of training benchmarks, which can then be used to estimate processor power as shown in Figure 3.6.
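To make the offline fitting step concrete, the sketch below fits a linear power model by ordinary least squares. The two predictors (IPC and cache size) and the training points are invented for illustration; the points happen to be generated from power = 5 + 8·IPC + 0.05·KB, so the fit recovers those coefficients exactly. Real models such as [74] use many more predictors plus spline terms.

```python
# Minimal sketch of offline regression-based power modeling. The predictors
# and sample data are hypothetical; real flows obtain responses from power
# simulators over thousands of sampled architectural configurations.

def fit_linear(X, y):
    """Solve the normal equations (X^T X) b = X^T y by Gaussian elimination."""
    n = len(X[0])
    A = [[sum(X[k][i] * X[k][j] for k in range(len(X))) for j in range(n)]
         for i in range(n)]
    rhs = [sum(X[k][i] * y[k] for k in range(len(X))) for i in range(n)]
    for col in range(n):                      # forward elimination w/ pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        rhs[col], rhs[piv] = rhs[piv], rhs[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            rhs[r] -= f * rhs[col]
    coef = [0.0] * n                          # back substitution
    for i in reversed(range(n)):
        coef[i] = (rhs[i] - sum(A[i][j] * coef[j]
                                for j in range(i + 1, n))) / A[i][i]
    return coef

# Training samples (IPC, cache KB, simulated power in W), generated here
# from power = 5 + 8*IPC + 0.05*KB so the fit is exact.
samples = [(0.8, 32, 13.0), (1.5, 32, 18.6), (0.9, 64, 15.4), (2.0, 128, 27.4)]
X = [[1.0, ipc, kb] for ipc, kb, _ in samples]   # leading 1.0 = intercept
y = [p for _, _, p in samples]
b0, b1, b2 = fit_linear(X, y)
predicted = b0 + b1 * 1.2 + b2 * 64   # estimate power for an unseen config
```

In practice the observed responses come from simulating the training benchmarks, and spline basis functions replace the raw predictors to capture non-linear behavior.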
The first approach is offline trace-driven simulation, where samples are collected to fit a regression model that relates architectural design parameters to processor power. The second approach is to generate activity factors offline and use them, combined with simulated performance counter information collected over interval cycles, to estimate runtime power metrics. Both of these techniques estimate only

the dynamic power of the processor.

Fig. 3.6. Estimating model coefficients using regression analysis: training benchmarks are run through power simulators; the observed power responses and the predictors (architectural parameters, utilization statistics) feed correlation analysis and regression-based modeling to produce the model coefficients.

The first method can be used to evaluate different design choices for processors at the architectural level (e.g., instruction width, cache size, ALU latency, branch latency), whereas the second method is used to estimate software-based power consumption for a given architectural configuration.

In the first approach, Lee et al. [74] use 12 different sets S_1, S_2, ..., S_12, each of which defines a group of specific architectural configurations. The groups include pipeline depth, width, physical registers, reservation stations, cache, main memory, control latency, fixed-point latency, and floating-point latency. Each of these sets describes a specific type of architectural configuration for the processor and has a few parameters, which are varied within a specific range; power is estimated in each case using simulation. For example, set S_2 is the group that describes the widths of different functional blocks in the processor, and it has four architectural parameters: instruction width, reorder queue entries, store queue entries, and functional unit count. Each of these parameters is varied within a range and has a cardinality of 3. For example, the instruction width is varied over 4-bit, 8-bit, and 16-bit values and thus has cardinality 3. Combining all the parameters of S_2, the cardinality of the set is |S_2| = 3 × 3 × 3 × 3 = 81. The full design space S is the product of all the set cardinalities; i.e., |S| = Π_{i=1}^{12} |S_i|, where i corresponds to a specific set.
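The design-space arithmetic can be checked in a few lines. The per-set cardinalities below are invented (the survey reports only |S_2| = 81 and a total near one billion), but they show how quickly the product grows and how sparse a 4000-point sample is:

```python
# Hypothetical per-set cardinalities for the 12 parameter groups; only
# |S_2| = 81 is taken from the text, the rest are illustrative.
from math import prod

cardinalities = [6, 81, 6, 6, 6, 6, 4, 4, 4, 4, 3, 2]
design_space = prod(cardinalities)      # |S| = product of the |S_i|
full_space = 22 * design_space          # 22 benchmarks x |S| design points
sampled = 4000                          # points actually simulated
fraction = sampled / full_space         # a vanishingly small sample
```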

Though the number of sets is only 12, the total cardinality |S| is about one billion, and for a total of 22 benchmarks the whole design space is approximately 22 billion points. Sweeping the design parameter values over all the points in this design space is not possible; only 4000 of the 22 billion samples are considered while building the models, which gives reasonable accuracy with a median error between 4.3% and 22%. The parameters are called the predictor variables, and the observed responses are performance and power. The relationship between the predictors and the observed responses is captured by regression model coefficients, which can be thought of as the expected change in power per unit change in a predictor variable. The observed responses are found via simulations of the benchmarks and are used to formulate the models by regression analysis. Fitting the regression model to the observed responses is commonly done through the method of least squares. Once established, the coefficients can be used to predict power without performing further simulations. Basic linear regression can be too simple for processor power modeling; to improve accuracy, spline functions are used to account for the non-linear relation between the predictors and the responses [74].

The second approach is to use parameters based on general processor utilization statistics. CAMP [118] performs correlation analysis to choose the utilization statistics that correlate best with the architectural activity factors. Examples of parameters that correlate well are instructions per cycle (IPC), speculative IPC (SIPC), load rate (LR), store rate (SR), floating-point load/store rate (FPLd/St), single-instruction multiple-data (SIMD) instruction rate, 64-bit FP instruction rate (FP64), micro-SIPC rate, branch rate (BR), load hit rate, and store hit rate.
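The correlation step can be sketched in a few lines. The counter traces below are synthetic (the activity factors are generated to track IPC exactly), and a real tool would screen many more statistics:

```python
# Sketch of CAMP-style predictor selection: rank candidate utilization
# statistics by their correlation with an observed activity factor.
# All traces are hypothetical per-interval samples.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

counters = {
    "IPC": [0.8, 1.1, 1.6, 2.0, 1.3],
    "LR":  [0.20, 0.25, 0.31, 0.38, 0.27],
    "BR":  [0.12, 0.30, 0.05, 0.22, 0.18],
}
# Observed activity factor of one power macro (here: 0.4*IPC + 0.05).
activity = [0.37, 0.49, 0.69, 0.85, 0.57]

# Keep the statistics that correlate best as regression predictors.
ranked = sorted(counters, key=lambda k: abs(pearson(counters[k], activity)),
                reverse=True)
```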
The processor is divided into power macros, the smallest units of power computation. The switching activity and the capacitance in the power equation are replaced by the architectural activities and the equivalent capacitance of the architectural events. The activity factor (AF) of each macro can be expressed as a function of the statistics as shown in Equation (3.6):

    AF = c_1 + c_2·IPC + c_3·SIPC + c_4·LR + c_5·SR + c_6·FPLd/St + c_7·FP64 + c_8·BR,    (3.6)

where c_1, ..., c_8 are the coefficients of the chosen statistics. By performing regression analysis on several training benchmarks, the corresponding coefficients for each utilization statistic in Equation (3.6) are found for each macro. CAMP requires a set of performance counters that monitor the processor statistics mentioned above. These statistics are collected from the performance counters over certain interval sizes, and activity factors are estimated using the previously established coefficients as in Equation (3.6). These activity factors are combined with the capacitance models in the simulator to estimate power.

3.2 Power Modeling for SoC-Based Embedded Systems

In contrast to general-purpose processors, which are designed for a large range of SW applications, an embedded system is designed for a target application or a range of applications with known characteristics. To create the best design, embedded system designers have to consider the co-optimization of both the SW and HW, where the boundary between the HW and SW domains is fluid and computing functions can be implemented in either domain [8, 31]. An SW/HW power co-estimation framework enables designers to identify the impact of SW/HW partitioning, component selection, and system design choices on the total power consumption [70, 80, 84, 129, 137].

Power-Aware SoC Design Flow

The overall flow of power modeling in SoC-based embedded systems is illustrated in Figure 3.7 [70, 80]. In the first step, a system specification (typically written in SystemC) is partitioned into HW and SW domains. The SW component is compiled with a compiler for the target processor core. Using a library of cores, custom intellectual property (IP) blocks, memories, and on-chip network designs, a fast synthesis tool compiles the HW specifications into representations (e.g., gate, RTL, or architectural design templates) sufficient for power modeling.
The macro-models or analytical power modeling techniques discussed in Section 3.1 can then be applied to these representations. The original SystemC specifications, the outputs of SW compilation, and the outputs of HW synthesis are provided as inputs to the system simulator.

Fig. 3.7. Power modeling flow for SoC designs.

The simulator simulates a discrete-event model of the entire system, synchronizing and transferring the required data between the HW and SW power simulators. The SW power simulator uses an instruction-set simulator, as described in Section 3.1, to estimate the power consumption of the target processor core on a cycle-by-cycle basis. The HW power estimator estimates the power consumption of the individual HW components based on their input vectors, commands, and operation modes. By synchronizing and transferring data between the SW and HW components, the system simulator is able to capture the interplay between them. In embedded systems, the interactions between software transformations and the parameters of the memory subsystem are particularly important [80]. For example, loop unrolling can reduce loop overhead and improve instruction-level parallelism, but it can lead to a situation where the code no longer fits in the cache. System-level simulation enables designers to assess the implications of loop unrolling and other SW optimizations on the total system power. At the most basic level, the power consumption of the HW components can be estimated from the input vector sequences and HW gate-level representations using the techniques described in Section 2. However, these techniques become computationally

infeasible, especially for large SoC designs. Many of the high-level power modeling techniques described in Section 3.1 for processors are applicable to SoC designs as well. However, the embedded, customized nature of SoC designs enables new modeling techniques. In the rest of this section, we describe some of the techniques that have been proposed in the SoC-based embedded systems literature for fast, high-level power modeling.

Fast High-Level SoC Power Modeling

Early approaches to power modeling of embedded systems considered simple power models for the HW components, where each component is characterized by a number of mode states, each with its own power consumption [129]. For example, a processor could be in an active or a stall mode, and a memory could be in an active, an idle, or a refresh mode. The average power values for each of these modes are first identified and then incorporated within a cycle-accurate system-level simulation, as illustrated in Figure 3.7, to estimate the total power. While such an approach is sufficient for simple embedded systems, advanced SoCs in embedded systems can execute a large range of applications with different power characteristics that require more elaborate power models.

A reasonable compromise between the high accuracy of waveform-based estimation and the simplicity of mode-based power estimation is to extract key information from the waveforms of input sequences and use this information as input to the power models. Two techniques are possible. (1) Sequence compaction: it is possible to compact a sequence of input vectors into a smaller sequence that yields almost the same power characteristics. One possible compaction heuristic is to derive the spatiotemporal statistics of the input vectors within a window of cycles and then use these statistics to identify a smaller subset of input vectors that preserves the same spatiotemporal statistics of the original sequence [70].
(2) Sequence abstraction: another possibility is to abstract a sequence of input vectors into commands or fundamental operations that describe the intended function requested

from the component [40, 41]. For example, a UART component can receive a sequence of input vectors that commands it to transmit an input data packet, or a video processing component can receive a sequence of input vectors that commands it to decode, rotate, or transform an input data frame [105]. In this technique, the commands or operations of every HW component are first identified from its functionality. The power consumption of these commands is then estimated using input sequences that correspond to each command together with representative input data. The results of the power simulation are then stored in a look-up table that is accessed whenever a future command is sent to the component. To make this technique more accurate, it is possible to model the impact of data dependency and inter-command dependency on the component's power consumption.

With the use of both techniques, it might be necessary to preserve some key bits in the input waveforms. For example, power management techniques introduce a number of different operational modes for a component, each of which trades off throughput against power consumption. Direct gate- or RTL-level power estimation using waveform sequences naturally estimates the power under the correct operational mode. Because the correct operational mode is encoded in the signal waveforms, sequence compaction or abstraction can lose this key information, potentially yielding wrong power estimates due to mode mismatch. Thus, the waveforms of key input signals must first be analyzed to identify the operational mode of the component [53]. The identified mode is then used to steer power modeling to produce the correct estimates. Sometimes it can also be useful to deploy different techniques for different groups of bits.
For example, it is better to model the least-significant and most-significant bits of data path components (e.g., adders and multipliers) separately: the switching activities of the least-significant bits are modeled using a white noise model with an average value, while those of the most-significant bits are modeled more accurately using their spatiotemporal characteristics [30, 73].
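The motivation for this split can be seen with a short experiment (illustrative only; the sample stream and bit width are invented): for a correlated data stream, the low bits toggle almost like white noise while the high bits toggle rarely.

```python
# Measure per-bit toggle activity of an 8-bit datapath carrying a slowly
# varying, correlated signal (a hypothetical slow ramp with small ripple).

WIDTH = 8

def bit(v, i):
    return (v >> i) & 1

# Synthetic correlated samples: slow ramp (3t/7) plus a small ripple term.
samples = [(3 * t // 7 + (t * 37) % 5) % (1 << WIDTH) for t in range(2000)]

activity = []
for i in range(WIDTH):
    toggles = sum(bit(a, i) != bit(b, i)
                  for a, b in zip(samples, samples[1:]))
    activity.append(toggles / (len(samples) - 1))

lsb_activity = activity[0]          # close to the white-noise value of 0.5
msb_activity = activity[WIDTH - 1]  # far lower: the MSB rarely toggles
```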

Power Modeling of SoC On-Chip Communication

An important SoC component is the system-level on-chip communication subsystem, which studies in the literature have shown can consume a significant amount of power [61, 69]. System-level on-chip communication is typically referred to as a network-on-chip (NoC) or on-chip network (OCN) [113]. On-chip architectures can take a number of topologies, including shared and hierarchical buses, point-to-point buses, rings, and meshes [109, 113]. The building blocks of an NoC include FIFO buffers, arbiters, crossbar switches, and bus links. Nodes in shared-bus systems are designated as masters or slaves and communicate using transactions on the bus. Estimating the NoC power entails estimating and adding up the dynamic and leakage power of all of its building blocks [61, 143]. Modeling each of these blocks can be achieved using the same empirical synthesis and analytical techniques discussed in Section 2 and Section 3.1. Figure 3.8 gives an overview of an empirical synthesis-based flow for bus-based NoC power estimation [69]. NoCs are usually implemented using a number of typical architectural design templates, and thus analytical techniques are particularly attractive. FIFO buffers, for example, resemble SRAM arrays to a great extent, and thus the techniques discussed in Section 3.1 apply well to them. We next describe an analytical power model for a crossbar switch.

Fig. 3.8. Synthesis-based flow for power estimation of a shared-bus NoC [69].

Fig. 3.9. Architectural template for a crossbar matrix design with n input ports and n output ports with 1-bit port width; inputs and outputs are joined by tri-state buffer connectors driven by control signals (not all control signals are labeled).

A crossbar switch is a circuit designed to forward traffic from any of its input ports to any of its output ports depending on the applied control signals. Figure 3.9 illustrates an architectural template for an NoC matrix crossbar switch [143]. The total capacitance of the switch is the sum of three capacitances: the input line, output line, and control line capacitances. The capacitance of each line can be estimated as follows. (1) Input line capacitance = capacitance of an input buffer + capacitance of the wire, which is proportional to the width of the crossbar, + the number of output ports × the input capacitance of a connector. (2) Output line capacitance = capacitance of an output buffer + capacitance of the wire, which is proportional to the length of the crossbar, + the number of input ports × the output capacitance of a connector.

(3) Control line capacitance = capacitance of an average-length control wire + capacitance of a control connector × the port width in bits (assuming the control wires run horizontally). Using these capacitances together with the data port width and the rates of input and output bit switching, it is possible to estimate the dynamic power consumption. To estimate the crossbar leakage power, it is first necessary to estimate the leakage power of the fundamental circuit elements (e.g., individual transistors and gates) under various input states, as discussed in Section 2. Because each element can be in a different state, the leakage has to be analyzed probabilistically: the average leakage of an element is a weighted sum of its leakage in each state, where a state's weight is the probability of being in that state [27]. The total leakage of the crossbar is then the sum of the average leakages of its elements. As discussed in Section 2, leakage power is sensitive to process variations, operating temperature, and voltage. Therefore, it is necessary to re-evaluate the NoC power consumption at different design corners [110].

In Section 3.1, we discussed techniques that use regression macro-models to express the power consumption of a processor as a function of its design parameters. This technique applies well to on-chip communication, where the power consumption of an on-chip communication system can be modeled as a weighted linear combination of a number of high-level parameters, such as the number of masters, the number of slaves, the data path width, and the arbitration policy [13]. The coefficient weights can be found using regression analysis based on results from empirical or analytical power models of the NoC configuration. Executing the regression analysis on every possible NoC configuration is computationally infeasible.
Instead, the regression analysis can be executed on a sample of NoC configurations, and the results are used to build a response surface model for each coefficient. The coefficients for the non-modeled configurations are then interpolated directly from the constructed response surfaces. As in the case of computational nodes, the nature of NoC transactions can lead to large variations in power consumption. For example,

Lahiri and Raghunathan [69] show up to 19% power variations that depend on the spatial distribution of transactions in a memory-mapped shared-bus architecture. Other factors that determine power consumption include the nature of the bus transactions (e.g., sequential versus pipelined) and the encoding of the bus.

Heterogeneous Power Modeling

It is possible to utilize heterogeneous power models, with different accuracies and computational efforts, to model the power consumption of SoC components [5, 53, 75]. Each component in an SoC can be modeled using several different power models (e.g., mode-based, gate-level, or command-based models), which vary in accuracy and efficiency. Designers usually have to trade off power estimation accuracy against the computational runtime of the estimation tools. One possibility is to choose among the power models during simulation depending on the required accuracy or efficiency. For example, if the SoC processing core consumes the largest portion of the total power, then its power models should be more accurate than, say, the model of a UART that consumes only a small portion of the total power. Another possibility is to allocate alternative power models of varying accuracy and efficiency, where the simulation tool customizes the temporal and spatial allocation of computational effort among the power models of the various components. The customization is based on the variation of each component's power consumption and its percentage contribution to the total power.

3.3 Power Modeling for FPGAs

There are two purposes for power modeling of FPGAs: power modeling for FPGA architectural exploration, and power modeling for designs programmed into the FPGA fabric. The first purpose is important for FPGA architects and manufacturers.
An FPGA architecture has many parameters, such as LUT size, number of BLEs clustered in a LAB, distribution of wire segment lengths, switch topology, and distribution of heterogeneous resources. Architects

need to understand the impact of different values of these parameters on the FPGA's final power consumption. These architectural estimates tend to be highly approximate, as detailed designs are not yet available for the FPGA itself. For architectural exploration, what matters is the fidelity with which the relative power impact of different design choices is ranked. Once an architectural design is chosen, implemented, and fabricated, it is possible to obtain highly accurate power estimates that are calibrated against measurements from the actual FPGA device [107]. Power evaluation for FPGA architectural exploration thus reduces to estimating the power consumption of a number of representative user designs for every considered FPGA architectural configuration.

The second purpose is relevant for FPGA users, who need to quantify the impact of their design choices on power consumption. Users might also like to conduct their own high-level power-aware design exploration, where the power consumption of different user design choices is estimated and used to choose the best design. In FPGAs, power modeling is split between two designs: the design of the actual FPGA fabric, which is known to the designers who conceived the FPGA architecture, and the design that is programmed into the FPGA fabric by the users. Users do not have access to the detailed FPGA architecture and schematics, and thus power characterization of the different FPGA structures must be conducted by the FPGA designers, who have access to the internal FPGA circuits. The pre-characterization data has to be bundled with the power estimation tools distributed by the FPGA vendor.

Power-Aware FPGA Design Flow

A typical flow for FPGA power estimation is given in Figure 3.10. At the highest abstraction level, the design can be described in a high-level description language such as C, SystemC, or SystemVerilog.
Behavioral synthesis is used to compile the high-level description into a register-transfer level (RTL) implementation, which is then mapped to the BLE structure of the FPGA. Clustering packs groups of these BLEs into LABs depending on the connectivity between the BLEs and the size of each LAB. The LABs are then placed in the FPGA and the routing is determined. Using the final routed design, the configuration bits of

all the logic and routing blocks are determined, and the design is then ready to be programmed into the FPGA.

Fig. 3.10. Power estimation flow: the FPGA design flow (high-level circuit description, behavioral synthesis, RTL netlist, logic synthesis, place and route) feeds high-level power estimation for design-space exploration and low-level power estimation for FPGA architecture exploration, drawing on switching activity from input stimuli or statistical methods, wire length estimation (Rent's rule), capacitance estimation from the post-layout netlist, and resource utilization.

Power modeling can occur at all of the different stages of the design. High-level power estimates are inherently less accurate than low-level estimates; however, they are very useful in guiding the crucial early stages of the design toward the best choices from the perspective of power and other design considerations. FPGAs can also be programmed with soft processing cores, which are implemented as regular designs in the FPGA's reconfigurable fabric. These soft processors add an extra dimension to FPGA power modeling, as their power consumption depends on their SW applications [108]. We will next discuss the various techniques that can be used to model switching activity, capacitances, and leakage at different design abstractions.

Low-Level Power Modeling

Switching activity analysis for FPGAs can be based on input vector simulation or probabilistic analysis, as discussed in Section 2.1. The

regular structure of FPGAs is particularly attractive for input vector simulation, where it is possible to build a pre-characterized look-up table for a single element, such as a BLE, and then re-use the table when simulating the thousands of BLEs in the FPGA [77, 78]. Simulating thousands of BLEs is far more feasible than directly simulating millions of gates, as in the case of custom designs. To build the look-up table, every possible input pair sequence is applied to a single BLE and a transistor-level SPICE simulation is conducted to estimate the dynamic power of every pair. The exact power consumption cannot be determined during the design of the FPGA itself; thus, the look-up table has to be bundled with the vendor's power estimation tool and used later, when the input vectors and the programmed design are known.

As for custom circuits, the average switching activity of a BLE can be calculated analytically as a function of the switching activities of the BLE's inputs and the Boolean function programmed into the BLE: the output switching activity is the weighted sum of the input switching activities, where the weight of an input is the probability that a transition at that input causes a transition at the output of the BLE [99, 107, 116, 127]. Through an iterative approach, the power estimation tool propagates the probabilistic information associated with the input pins through the FPGA design configuration and generates the activities of every internal signal. The procedure is conceptually similar to the probabilistic technique discussed in Section 2.1 in the context of Equation (2.4), except that in the FPGA case the building blocks of the design are BLEs rather than gate primitives. The regular structure of FPGAs also lends itself to fast capacitance characterization.
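The weighted-sum activity rule can be sketched for a single LUT. The function and numbers below are illustrative, and vendor tools iterate this computation over the whole netlist:

```python
# Sketch of probabilistic activity propagation through one BLE/LUT. The
# weight of input i is the probability, over the static probabilities of
# the other inputs, that toggling input i toggles the LUT output.
from itertools import product

def output_activity(lut, in_probs, in_activities):
    """lut maps input bit-tuples -> 0/1; probs/activities are per input."""
    k = len(in_probs)
    act = 0.0
    for i in range(k):
        w = 0.0  # Pr[output is sensitized to input i]
        for others in product((0, 1), repeat=k - 1):
            a = list(others); a.insert(i, 0)
            b = list(others); b.insert(i, 1)
            if lut[tuple(a)] != lut[tuple(b)]:
                p = 1.0
                for j, val in enumerate(others):
                    idx = j if j < i else j + 1
                    p *= in_probs[idx] if val else (1.0 - in_probs[idx])
                w += p
        act += w * in_activities[i]
    return act

# 2-input AND: the output follows input i only when the other input is 1,
# so each weight is 0.5 for equiprobable inputs.
and_lut = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
sw = output_activity(and_lut, in_probs=[0.5, 0.5], in_activities=[0.2, 0.4])
```

For the AND example, the output activity is 0.5 × 0.2 + 0.5 × 0.4 = 0.3, matching the weighted-sum rule in the text.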
Because the reconfigurable routing resources in FPGAs consume a considerable area, they tend to dominate the FPGA's timing and power [24]. In contrast to custom circuits, which usually have a myriad of wire segments of different sizes, FPGAs have a regular structure in which a few wire segment configurations are replicated across the whole FPGA. Thus, once capacitance estimates are extracted for a single wire segment, the estimates for all instances of that segment are automatically determined. Capacitance estimates can also

be verified and calibrated using characterization measurements from the actual FPGA [107]. These capacitance estimates are also bundled with the vendor's FPGA power estimation tool. The user later uses the placement and routing results of the mapped design to identify the exact wire paths and routing resources used in the FPGA. An FPGA is designed to accommodate a large number of possible designs, which might not fully utilize the available programmable logic components. The average utilization of an FPGA is estimated to be about 75% [140], but this figure can be much lower or higher depending on the design, and the dynamic power varies accordingly. Thus, it is necessary to scale any pre-characterized power estimates for the FPGA structures by the actual resource utilization of the programmed design [127].

Due to its high switching activity, the clock network has a significant impact on the FPGA's power consumption. Estimating the power consumption of the FPGA's clock requires knowledge of its topology, the dimensions of its global and local wire segments, and the number of its buffers. If the final FPGA implementation is available, then good estimates of these network parameters can be extracted. For early architectural exploration, approximate models can be developed based on the clock network topology, the size of the FPGA, and the target fabrication technology [71, 72, 115]. The number of clock-tree repeaters can also be estimated using approximate RC models [115]. Modern FPGAs usually have more than one clock domain, where global clocks are distributed across the whole chip and local clocks are distributed within a region [71, 72]. Multiple clock domains enable greater user design flexibility and finer-grain control over runtime power consumption. Multiple clock domains introduce reconfigurable multiplexers into the clock signal path to select the appropriate domain for each flip-flop.
The additional power consumption of these multiplexers also has to be modeled.

High-Level Power Modeling

During high-level design power exploration and synthesis, the final BLE-based implementation is not available, which complicates power

analysis. High-level synthesis is divided into three tasks: scheduling, allocation, and binding. Each of these tasks needs to be guided by power, area, and delay estimates to find the best possible synthesis solution. To find the switching activity at such a high level of abstraction, the design is first expressed as a two-level control data flow graph (CDFG). At the first level, the graph has nodes corresponding to the basic blocks of the design; at the second level, each block contains a set of operation nodes and edges that represent data dependencies. The switching activity is determined using an iterative, test-vector-based functional simulation: for each set of input vectors, the control and data flow from the inputs through the CDFG until they reach the outputs [24, 25]. The individual CDFG graph nodes are looked up to find their expected numbers of logic elements, DSPs, and I/Os, together with their pre-characterized capacitances. For routing power estimation, statistical Rent-based techniques are used to estimate the wire length distribution as a function of the LAB sizes, the number of their inputs and outputs, and the expected complexity of the programmed design [24, 32]. The statistical wire length distribution is used to calculate capacitance estimates as well as timing information using approximate RC models; the timing information in turn helps estimate the switching activity.

FPGA Dynamic Power Breakup

Combining the estimates of the capacitances, resource utilization, and switching activities, the dynamic power of the FPGA can be calculated. Based on various reported power distributions of real circuits, we give in Figure 3.11 a typical dynamic power distribution for various FPGA structures [127].
In FPGAs, it is expected that the reconfigurable routing fabric consumes about 45-55% of the total power; the logic blocks consume about 20-25%; and the clock and I/O consume about 25-35%. Short-circuit power accounts for about 10% of the dynamic power [115, 127].
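The per-structure composition just described (switching activity times capacitance times V² times clock frequency, summed over utilized resources) can be sketched as follows. All structure names, capacitances, activity factors, and instance counts below are illustrative placeholders, not characterization data from the text.

```python
# Sketch: combining capacitance, utilization, and switching-activity
# estimates into an FPGA dynamic power total (P = a * C * V^2 * f per
# structure, summed over utilized instances). All numbers are
# illustrative placeholders, not characterization data.

V_DD = 1.2        # supply voltage (V), assumed
F_CLK = 100e6     # clock frequency (Hz), assumed

# per-structure: (effective capacitance in F, switching activity,
#                 number of utilized instances)
structures = {
    "logic_blocks": (20e-15, 0.15, 5000),
    "routing":      (80e-15, 0.10, 12000),
    "clock_tree":   (10e-15, 1.00, 3000),   # clock toggles every cycle
}

def dynamic_power(structs, vdd, f):
    """Total dynamic power: sum of a*C*V^2*f over utilized instances."""
    return sum(a * c * vdd**2 * f * n for c, a, n in structs.values())

p_dyn = dynamic_power(structures, V_DD, F_CLK)
print(f"estimated dynamic power: {p_dyn*1e3:.1f} mW")
```

With these placeholder values the routing fabric dominates the total, in line with the breakdown quoted above.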

Fig. 3.11 Typical FPGA power consumption breakdown: interconnect 45-55%, logic 20-25%, clocking 25-35%.

FPGA Leakage Power Analysis and Breakdown

For leakage power, the simplest estimation method is to use SPICE simulation or analytical derivations based on Equation (2.5) of Section 2 to estimate the leakage for the individual NMOS and PMOS transistors that are part of the reconfigurable fabric of the FPGA. The leakages of the transistors are then summed up for all idle transistors [115, 116]. A more elaborate scheme would simulate in SPICE or analytically derive the leakage of the individual structures that make up the reconfigurable fabric (e.g., logic blocks, routing switches, and SRAM blocks) [67, 140]. The total leakage power is equal to the leakage of an individual structure times the number of structure instances, summed over all structures. Since the estimation is conducted on a per-structure basis, it is then possible to explore the impact of different possible input data states and temperature on leakage [67, 140]. Resource utilization can impact leakage power estimates as well, because non-utilized reconfigurable blocks will have a different input state than the utilized blocks [140]. In one study for a 90 nm technology, it was reported that routing structures, logic structures, and SRAM configuration cells contribute 53%, 41%, and 6%, respectively, to the total leakage power [67]. It is assumed that the SRAM configuration cells are manufactured with high V_th. Implementing high V_th for the configuration SRAM cells increases only the FPGA's programming time; it does not impact the runtime performance of user designs.
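A minimal sketch of the per-structure leakage summation described above. The per-instance leakage values and instance counts are invented for illustration, chosen only so that the resulting breakdown loosely resembles the reported routing/logic/SRAM split.

```python
# Sketch: per-structure leakage summation -- simulate (or analytically
# derive) the leakage of each structure type once, then multiply by the
# instance count and sum. Values are illustrative, loosely mirroring the
# reported 90 nm breakdown (routing > logic >> high-Vth SRAM cells).

leakage_per_instance = {   # watts per idle instance, assumed
    "routing_switch": 1.1e-6,
    "logic_block":    4.0e-6,
    "sram_cell":      5.0e-9,  # high-Vth configuration cells leak little
}
instance_counts = {
    "routing_switch": 48000,
    "logic_block":    10000,
    "sram_cell":      1_200_000,
}

def total_leakage(per_instance, counts):
    """Total leakage: per-structure leakage times instance count, summed."""
    return sum(per_instance[s] * counts[s] for s in per_instance)

p_leak = total_leakage(leakage_per_instance, instance_counts)
for s in leakage_per_instance:
    share = leakage_per_instance[s] * instance_counts[s] / p_leak
    print(f"{s:15s} {share:6.1%}")
```

A per-structure table like this is also the natural place to fold in temperature and input-state dependence, as the text notes.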

3.4 Summary

In this section we discussed power modeling techniques for three computing substrates: processors, SoC-based embedded systems, and FPGAs. The sheer complexity of these systems makes power estimation and power-aware architectural exploration a challenging task. We discussed the impact of SW variations on the power consumption of general-purpose processors. To estimate the power consumption of processors, it is necessary to use a cycle-accurate instruction-set simulator that can identify the activity factors of different processor units. The capacitances of the different units can be identified either from direct extraction of gate-level layouts or from analytical derivations using canonical architectural templates. A number of speed-up techniques were discussed, including the use of regression models. We also discussed SoC power modeling techniques. Compared to processors, SoCs pose additional difficulties because the boundaries between the HW and SW domains are fluid, and because they incorporate application-specific custom components. We discussed system-level simulation techniques that can transfer and synchronize data between the SW and HW components to ensure that each component receives the correct input sequences, which are necessary to estimate power consumption. We also discussed speed-up techniques that are specific to SoCs. In particular, the application-specific custom components can be associated with a number of high-level operations or commands. Power modeling for FPGAs is also difficult because it is not possible to know the user's design during the design of the FPGA device itself. FPGA power-aware architectural evaluation involves simulating the power consumption of a number of representative user designs for each possible architectural configuration. We discussed techniques for developing power models of the FPGA's basic building blocks (e.g., logic array blocks and routing switches).
These models are later used by the end users to estimate the power consumption of their designs. In some cases, user designs can be programmable processors or SoCs that may run a large range of applications, making the power estimation process co-dependent on software.

4 Post-Silicon Power Characterization

Once a design is fabricated, the manufactured devices can yield a wealth of power characterization data that can improve various design choices and runtime applications. The produced devices can be directly characterized for their true power consumption under various loading conditions, with no need for the approximate models or simulations used in pre-silicon design. In this section we discuss three main research directions for post-silicon characterization. (1) Detailed power mapping for validation. Design-time power estimates are approximate. By measuring electrical current and infrared emissions from a fabricated device, the true power estimates of the circuit structures are revealed. These post-silicon power estimates can be used to validate design-time power estimates, calibrate various empirical models used by CAD tools, or re-spin the design if necessary. (2) Power management for adaptive power-aware computing. Runtime power management systems adapt the performance and power of computing devices depending on runtime constraints. In case direct power measurements are not available, it is useful to develop mathematical models that can estimate

the total power consumption of a device from its operating characteristics. (3) Software power analysis. Using the fabricated device, it is possible to conduct lumped power characterization for individual instructions and their interactions, and then use the characterization results to develop models for software power analysis. These models enable software developers and compiler designers to tune their code and algorithms to improve their energy efficiency.

4.1 Power Characterization for Validation and Debugging

While computer-aided power analysis tools can provide power consumption estimates for various circuit blocks, these estimates can deviate from the actual power consumption of working silicon chips. There are a number of reasons these pre-silicon estimates can deviate from the actual post-silicon power measurements. (1) Large input vector space: the exponentially vast number of possible input vector sequences and the billions of transistors implemented in current designs make it impossible to determine the power consumption incurred from every possible input vector sequence or software application. (2) Errors in coupling capacitance estimation: coupling capacitance between neighboring wires is determined by the exact waveform activity experienced by the wires. The computational infeasibility of simulating every input vector makes estimating the coupling capacitance during design time a difficult process. Wrong estimates for the coupling capacitance impact switching power in the following ways [43]. They lead to incorrect estimates of incomplete voltage transitions arising from crosstalk noise. They impact signal slew times, which determine short-circuit power. Furthermore, they impact the signal timing delays, which determine the occurrence of glitches.

(3) Spatiotemporal thermal variations: incorrect estimates for dynamic power imply incorrect estimation of thermal variations, which in turn leads to further deviations in leakage and total power estimates. (4) Process variations: intra-die and inter-die process variations are unique for each die. Process variations impact leakage power as discussed in Section 2, and they also impact the signal delays along the circuit paths, which determine glitches. Process variations can introduce serious deviations in power consumption compared to pre-silicon design-time estimates. Because of the aforementioned complexities, pre-silicon power estimates might not be accurate when compared to the real post-silicon characteristics. In many cases the mismatch between pre-silicon and post-silicon characteristics forces major changes in the design and implementation of integrated circuits. In a study from 2005, it was shown that 70% of new designs require at least one design re-spin to fix post-silicon problems and that 20% of these re-spins are due to power and thermal issues [133]. It is likely that these figures are much higher now, as chips are more complex with a larger number of transistors. Post-silicon power characterization of integrated computing devices provides the true spatial and temporal power characteristics under representative loading conditions, such as workloads and dynamic voltage and frequency scaling (DVFS) settings. Figure 4.1 gives an integrated framework, where measurements from infrared imaging equipment together with electrical current measurements are used for power characterization. The electric current measurements can provide the total power consumption. If the chip supports multiple independent power supply networks, then it is possible to increase the pool of current measurements.
The thermal emissions captured from the infrared imaging system, together with the lumped electrical current measurements, can be inverted to yield high-resolution spatial power maps for the individual circuit blocks.

Fig. 4.1 Using post-silicon power characterization for design validation and CAD tool calibration. P & R stands for placement and routing.

The results of post-silicon power characterization can improve the design process during re-spins or for future designs in the following ways. High-level power modeling tools rely on the use of parameters that are estimated from empirical data, as discussed in Section 3 [19, 118]. The post-silicon power characterization results can be used to calibrate and tune high-level pre-silicon power modeling tools. To evaluate "what if?" design re-spin questions, the power characterization results can substitute for the power simulator estimates and directly feed the thermal simulator. For example, if the thermal characterization results are unacceptable, then the layout can be changed to reduce the spatial power densities and hot-spot temperatures. The power characterization results are fed together with the new layouts to the thermal simulator to evaluate the impact of the changes. The post-silicon power characterization results can also force a re-evaluation of the computing device specifications (e.g., operating frequencies and voltages).
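The thermal-to-power inversion at the heart of this framework is formalized later in this section (Equations (4.2) through (4.4)): excite one block at a time to estimate a thermal model matrix R, then invert a measured thermal map into a power map. A toy, noiseless end-to-end sketch in pure Python follows; all thermal responses and power values are illustrative, and a real flow would add the nonnegativity and total-power constraints discussed below rather than a plain linear solve.

```python
# Sketch of the thermal-to-power inversion flow: estimate R column by
# column from single-block excitations (column k = t_k / p_k), then
# recover the power map from a measured thermal map. For this toy
# 3-block example R is square and well conditioned, so a direct linear
# solve stands in for the constrained least-squares formulation; all
# numbers are illustrative.

def solve(A, b):
    """Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# Characterization: thermal response t_k (K above ambient, per pixel)
# when block k alone dissipates p_k watts.
p_k = [0.10, 0.10, 0.10]                       # measured per-block power
t_k = [[2.0, 0.6, 0.1],                        # thermal map, block 0 on
       [0.6, 2.2, 0.5],                        # thermal map, block 1 on
       [0.1, 0.5, 1.9]]                        # thermal map, block 2 on
# Column k of R is t_k / p_k.
R = [[t_k[k][i] / p_k[k] for k in range(3)] for i in range(3)]

# Workload measurement: thermal map produced by an (unknown) power map.
p_true = [0.05, 0.12, 0.02]
t_m = [sum(R[i][k] * p_true[k] for k in range(3)) for i in range(3)]

p_est = solve(R, t_m)
print("estimated power map (W):", [round(p, 4) for p in p_est])
```

In this noiseless toy setting the solve recovers the injected power map exactly; with measurement noise and an ill-conditioned R, the constrained formulation of Equations (4.3) and (4.4) is needed.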

Relationship between Power and Temperature

The relationship between power and temperature is described by the physics of heat transfer. Mainly, heat diffusion governs the relationship between power and temperature in the bulk of the die and the associated metal heat spreader, while heat convection governs the transfer of heat from the boundary of the integrated metal spreader/sink to the surrounding fluid medium, which is either air or liquid. The heat diffusion equation is given by

∇ · (k ∇t(x,y,z)) + p(x,y,z) = ρc ∂t(x,y,z)/∂τ, (4.1)

where t(x,y,z) is the temperature at location (x,y,z), p(x,y,z) is the power density at location (x,y,z), ρ is the material density, c is the specific heat, k is the thermal conductivity, and τ denotes time [112]. k, ρ, and c are all functions of location [112]. The power transferred at the boundary by convection is described by Newton's law of cooling, and it is proportional to the temperature difference between the boundary of the heat sink and the ambient temperature. The constant of proportionality is the heat transfer coefficient, which depends on the geometry of the heat sink, the fluid used for heat removal, and its speed [55].

Basics of Infrared Imaging

Any body above absolute zero emits infrared thermal radiation with an intensity that depends on its temperature, its emissivity, and the radiation wavelength. Transistors and interconnects in integrated circuits operate at elevated temperatures due to resistive heating by charge carriers [117]. Modern computing integrated circuits use flip-chip packaging, where the die is flipped over and soldered to the package substrate. By removing the package heat spreader, one can obtain optical access to every device on the die through the silicon backside. It is possible to use infrared-transparent heat removal techniques that have similar thermal characteristics to the original heat spreader [91, 122].
Footnote 1: To capture the thermal emissions from an operational die, it is necessary to remove the optical obstruction introduced by traditional heat removal mechanisms (e.g., integrated heat spreader, metal heat sink, and fan), and to substitute them with mechanisms that can remove heat while being transparent to infrared emissions. The standard technique is through the use of infrared-transparent oil-based heat removal systems [46, 91, 92, 102, 122].

Silicon is transparent to infrared emissions with photon energies that

are less than its bandgap energy (1.12 eV), which corresponds to wavelengths larger than 1.1 µm. This transparency is ideal from an infrared imaging perspective, as it enables the capture of photonic emissions from the devices, which provides valuable information for thermal and power characterization of computing devices operating under realistic loading conditions. One of the key specifications of an infrared imaging system is its spectral response, which determines the part of the infrared spectrum that the imaging equipment can detect. For the range of temperatures encountered during chip operation, the mid-wave infrared (MWIR) range, which stretches from 3 µm to 5 µm, yields the most sensitive and accurate characterization of thermal emissions. Detecting emissions in the MWIR range requires the use of InSb quantum detectors, which have to be cooled to cryogenic temperatures to ensure sensitivity. As a consequence, high-resolution MWIR imaging systems tend to be fairly expensive. The MWIR imaging system used in our laboratory has a focal plane array of InSb detectors cooled to 77 K (−196 °C) to reduce the noise in temperature measurements to about 20 mK. The imaging system can capture thermal frames at a rate of up to 380 Hz, and depending on the microscopic lens used, the system can resolve temperature down to 5 × 5 µm regions.

Thermal to Power Inversion

The objective of post-silicon power characterization is to identify the true power consumption of different blocks of a computing integrated circuit under real loading conditions. While it is possible to isolate the power consumption of a circuit block during testing by using scan chains, such an approach is not capable of characterizing the wide range of possible power consumption maps under true loading conditions, where
For our experiments, we machined a special sapphire oil-based infrared-transparent heat sink that precisely controls the oil flow over the computing device. Chilled oil is forced into an inlet valve of the sink, flows on top of the die to sweep away the heat, and then exits through an outlet valve. The oil flow is maintained using an external pump, and the temperature of the oil is controlled using a thermoelectric cooler. By controlling the oil flow and its temperature, as well as the sapphire window dimensions, it is possible to obtain thermal characteristics similar to those of the original heat removal package [91, 122].

workloads simultaneously exercise multiple circuit blocks in intricate ways. The post-silicon flow of Figure 4.1 is the most versatile approach to compute the power map from the captured thermal emissions and the external current measurements. In its discretized version, the steady-state form of Equation (4.1) is approximated by the following linear matrix formulation:

Rp + e = t_m, (4.2)

where R is the system's thermal resistance matrix, t_m is a thermal map acquired from the thermal imaging system, e is the measurement noise, and p is the desired power map. The objective of post-silicon power mapping is to invert the thermal map t_m to produce p. Given a computing device, it is necessary to measure the thermal-to-power model matrix R. The matrix R can be estimated on a column-by-column basis as follows. Enabling only one circuit block at a time is mathematically equivalent to setting the vector p to be equal to [0 ... 0 p_k 0 ... 0]^T, where p_k denotes the power consumption of the enabled kth circuit block and T denotes the transpose operation. The power consumption incurred from enabling only one circuit block can be easily measured using the external current sensing meter. If t_k denotes the steady-state thermal emissions captured from enabling the kth circuit block, then column k of matrix R is equal to t_k/p_k. The described estimation method for the matrix R can be implemented using three different approaches, depending on the design [28, 46, 92]. (1) One approach is to use control signals to enable individual circuit blocks, which trigger their power consumption [28]. Software micro-benchmarks that selectively stimulate desired circuit blocks can be considered as soft enable signals. (2) A second approach is to turn off the chip, and scan a laser beam with known power density to deliver the power from the outside to the circuit block of interest.
The scanning of the laser system can be automated by using a pair of galvo-directed mirrors [46]. (3) A third approach is to use design information to conduct finite-element heat convection and diffusion simulations to

estimate the matrix R, by applying power pulses in simulation at the desired locations. The results from the thermal simulations of the finite element model are used to construct the matrix R [46, 63, 92]. Given the matrix R, the next step is to execute the desired workload on the device and to capture the resultant thermal emissions t_m. The power map p can then be estimated as

p = arg min_p ||Rp − t_m||_2^2, (4.3)

where ||·||_2 is the L2 norm, subject to

||p||_1 = Σ_i p_i = p_total and p ≥ 0, (4.4)

where ||·||_1 is the L1 norm and p_total is the total power consumption of the chip, which can be measured externally using a digital multimeter. The constraint given by Equation (4.4) reduces the possibility of getting multiple solutions to objective (4.3) in case the matrix R is not well conditioned, and it ensures that the power mapping from infrared imaging is consistent with the electrical current measurements. In practice, any digital multimeter has a tolerance, tol, in its measurements (the tolerance is typically listed in the multimeter's data sheet), and thus it is better to replace the constraint of Equation (4.4) by the following two constraints:

||p||_1 ≤ p_total + tol and ||p||_1 ≥ p_total − tol.

In chips with multiple independent power networks, there will be one measurement for each network, and thus there will be multiple corresponding constraints. To illustrate the effectiveness of post-silicon power mapping, our group constructed a test chip where the underlying switching activity can be precisely controlled. Our test chip consists of micro-heater blocks organized in a grid fashion. Each block contains nine ring oscillators that can be simultaneously enabled or disabled through a control signal that is stored in an associated flip-flop. When enabled, the

dynamic power consumption of a micro-heater block is equal to 25 mW. The size of each block is about 0.56 mm². To create a desired spatial power map, we program the flip-flops with 100 bits that correspond to the desired power map; thus, the length of the p vector is 100. The design is embedded in a section of a 90 nm Altera Stratix II FPGA.

Fig. 4.2 Examples of using post-silicon power mapping: (a) injected power map; (b) resultant thermal map; (c) estimated power map; (d) estimated rounded power map (av. error = 0% and 2.2%). Estimated power is in mW.

Figure 4.2a shows two usages of the test device where the block heaters are enabled to create dynamic power maps. One map spells the word POWER in dynamic power activity, and the other map resembles the power maps of SoCs. The resultant captured thermal emissions shown in Figure 4.2b clearly show that the process of heat diffusion blurs the underlying power maps. The resultant power estimates from our technique are given in Figure 4.2c, and the rounded power estimates (power rounded to either 0 mW or 25 mW) are given in Figure 4.2d. Compared to the known power maps, the estimated rounded maps show average errors of 0% and 2.2%, which clearly demonstrates the effectiveness of post-silicon power mapping. We also conducted spatial power mapping of the embedded Nios II/f soft processor [103]. The layout of the processor is given in Figure 4.3a, and the captured thermal emissions from executing the Dhrystone II application are illustrated in Figure 4.3b. The estimated spatial power maps in mW are plotted in Figure 4.3c. The estimated spatial maps augment the floorplan with valuable spatial power density

estimates.

Fig. 4.3 Post-silicon power mapping of a Nios II processor: (a) Nios II/f layout; (b) resultant thermal map; (c) estimated power map in mW. Total power: 477 mW.

The revealed detailed power estimation maps can be used to calibrate the estimates from high-level dynamic power modeling tools. Given that the design of the Nios II processor is proprietary, it is not possible for us to match the spatial power consumption estimates to the various functional blocks of the processor. Previous work in the literature reports power mapping results for IBM's PowerPC 970 processor [46] and AMD's Athlon processor [92].

AC-based Thermal to Power Inversion

A major source of error in power mapping from infrared emissions is heat diffusion, which blurs the underlying power map. A recent research direction advocates the use of AC excitation sources instead of the typical DC excitation sources [104]. AC excitation reduces the amount of spatial heat diffusion as the AC excitation frequency increases [17, 89]. In addition, AC excitation has the benefit of reducing flicker noise, which has a 1/f spectrum [51]. Applying a true sinusoidal AC source to excite a digital circuit is impossible. Instead, a square wave is applied, and because a square wave can be represented by a Fourier series whose dominant component is the fundamental frequency, this technique does not alter the results as long as the acquired infrared emissions are filtered to extract only the fundamental frequency [17]. Creating AC square-wave excitations in digital circuits can be implemented by a number of techniques, such as: (1) toggling enable signals of circuit blocks while keeping the operating voltage constant; (2) alternating the voltage supply signal between two values (e.g., 0.9 V and 1 V); or

by (3) executing workloads that alternate between an activity phase and an inactivity phase. AC-based power mapping reduces power estimation error by more than half [104].

4.2 Power Characterization for Adaptive Power-Aware Computing

Another useful reason for post-silicon power characterization is to enable runtime power management. Runtime power management is concerned with adapting the performance and power of a computing device in response to runtime power objectives and constraints. For example, in portable systems, it could be desirable to minimize runtime power consumption under performance constraints that guarantee acceptable operation for a particular application, such as H.264 playback. In the case of server systems, it could be desirable to constrain peak power consumption when there is light demand on the system. In many of these runtime management situations, a computing device can lack the means to directly measure its own power consumption, and thus there is a need to develop lumped power models that can estimate the total power of a device from its operating characteristics. To develop any lumped power-related model, it is necessary to first collect a large volume of power characterization data. There are generally two approaches used to measure the total electrical current consumption of a computing device or system. In the first approach, the power supply lines are intercepted and a shunt resistor (e.g., Figure 4.4a) is inserted in series with the positive supply line.

Fig. 4.4 Power measurement techniques used for lumped current measurements: (a) shunt resistor; (b) clamp meter.

In contrast to regular resistors, shunt resistors have very low resistance (e.g., 1 mΩ) with a high accuracy of about ±0.1%. The low resistance is needed to avoid adding a voltage drop along the supply line. The changes in voltage across the shunt resistor are proportional to the electrical current variations, as dictated by Ohm's law. The second approach uses clamp meters (e.g., Figure 4.4b), which utilize the Hall effect to detect electric current variations in the supply line by measuring the induced magnetic field variations surrounding the supply wire. Clamp meters are less intrusive, but they are less accurate and their measurements tend to be noisy compared to shunt resistors. In both approaches, a digital multimeter or an analog-to-digital device is required to log the measurements of the shunt resistor or clamp meter into the power management system of the computing device.

Power Characterization Using Performance Counters

A popular approach for modeling total power in software is through the use of performance counters [9, 11, 56, 57, 58, 59, 82, 111, 118, 131, 147]. Performance counters are inserted by designers to track the usage of different device blocks. For example, the Intel Core i7 has seven counters per core that can be programmed to monitor tens of different processor events. Examples of such events for processors include the number of retired instructions, the number of cache hits, and the number of correctly predicted branches. Examples of performance counters for GPUs include global/local memory loads and stores, the number of instructions executed, and the number of thread warps with bank conflicts [97]. In GPUs, it is also possible to measure workload signals that track the utilization levels of various GPU units, such as vertex shading units, texture units, and geometric shading units [85].
In an off-line characterization phase, a number of workloads are executed on the device, and the values from the performance counters together with measurements of the total power are recorded and then used to develop and train an empirical mathematical model of the power consumption as a function of the performance counters. During runtime, the values of the performance counters are given as inputs to the model to estimate the total power of any instance of the device.
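This offline-fit/runtime-estimate loop can be sketched with a single-counter linear model. The counter choice (micro-ops retired per second) is one of the strongly correlated counters discussed below, but the training values and power readings are invented data, not real Core i7 measurements.

```python
# Sketch of the off-line characterization step: log a performance
# counter alongside measured total power, fit p = w0 + w1*c by least
# squares, then use the fitted weights at runtime when no power sensor
# is available. Training samples are illustrative, not measured data.

train_counts = [1.0e9, 2.0e9, 3.0e9, 4.0e9]   # micro-ops retired / s
train_power  = [22.0, 29.0, 36.0, 43.0]       # watts (illustrative)

def fit(xs, ys):
    """Closed-form simple linear regression: returns (w0, w1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - w1 * mx, w1

w0, w1 = fit(train_counts, train_power)

def estimate_power(counter_value):
    """Runtime power estimate from a single counter reading."""
    return w0 + w1 * counter_value

print(f"w0 = {w0:.2f} W, w1 = {w1:.2e} W per (micro-op/s)")
print(f"estimate at 2.5e9 uops/s: {estimate_power(2.5e9):.1f} W")
```

A multi-counter model generalizes this to a weighted sum over several counters, fitted the same way with least squares.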

Fig. 4.5 Correlation between performance counters and Intel Core i7 processor power for the PARSEC benchmarks.

In devising empirical power models, there are two key questions. (1) Which performance counters should be used as inputs to the model? It is desirable to use few performance counters to avoid overfitting the model and to avoid complicating the assignment of performance counters to events. Popular examples in the literature include the instructions per cycle and L1 load misses per cycle for the POWER6 processor [58], and the dispatch stalls, the L2 cache misses, and the number of retired micro-ops for the AMD Phenom [131]. For example, using our own measurements of the PARSEC benchmarks [10] on an Intel Core i7, we give in Figure 4.5 the correlation coefficients between a number of performance counters and the processor's power consumption. The results show a very strong correlation between power and the number of micro-ops retired, the L1 data reads, the total instructions executed, and the L1 instruction misses. Figure 4.6 shows the processor power consumption and the number of retired micro-ops over a period of time, clearly illustrating the correlation between the two values.

Fig. 4.6 Processor power consumption and number of retired micro-ops for the x264 application running on a Core i7 (1 thread at 2.67 GHz) for 60 s.

(2) What is the nature of the mathematical model? Most of the proposed models in the literature use linear regression, where the values from the performance counters are weighted and linearly combined. If p(k) denotes the power at time instance k, then in a linear regression model, p(k) is equal to

p(k) = w_0 + Σ_i w_i c_i(k),

where c_i(k) is the value of counter i at time k, w_i is the weight associated with counter i, and w_0 is a constant offset. The weights are learned offline from the collected characterization data using least-squares estimation. Using the performance counters reported in Figure 4.5, we built a power model for the Core i7 processor. Our results show that the average absolute error in power estimation is about 2.13%.

Block-Level Power Characterization

To enable sophisticated power management systems that utilize per-core DVFS or selective clock gating to attain larger reductions in power consumption, it is valuable to estimate the power on a per-block basis [57, 118, 131, 147]. These blocks can be coarse-grain, such as core-level structures, or fine-grain, such as register files, reservation stations,

and memory caches. Structures in GPUs include register units, ALUs, texture caches, and global/local memories [49]. One method for estimating the power of a circuit block is to scale its known maximum power consumption by an activity factor that is computed from the measurements of the performance counters. The maximum power of the structure can be known from design-time estimates or identified to a reasonable degree during runtime by crafting micro-benchmarks, which are simple programs that execute only one type of instruction within infinite loops. These micro-benchmarks can be written for general-purpose processors [57] and GPUs [49]. These benchmarks trigger power activity in a small subset of structures and can identify the maximum power consumption of the desired blocks. Another method to estimate the power of a circuit block is to build a submodel for its power consumption based on the performance counter measurements that are relevant for the block. For example, the power consumption of a core should only utilize its own performance counter measurements. In both methods, estimation errors can be reduced by ensuring that the sum of the estimated powers of all blocks is equal to the total measured power. Performance counters mainly capture the activity of the processor and memory. In some cases, adaptive power-aware computing needs to occur at full server scale [35, 79, 124]. Besides the processor and memory components, major activity and power consumption also occur in the hard disk and network interface components of server systems. The activity of these components is not logged by the processor's performance counters, and thus it is necessary to collect system-wide statistics that measure the utilization of these components. The measured utilizations, together with the performance counters, can then be used within a regression framework to model the power consumption of the full server.
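A sketch of the first block-level method combined with the error-reduction step above: scale each block's known maximum power by a counter-derived activity factor, then rescale so the per-block estimates sum to the externally measured total. The block names, maximum powers, and activity factors are illustrative assumptions.

```python
# Sketch: block-level power estimation by scaling known maximum power
# with a counter-derived activity factor, then renormalizing so the
# blocks sum to the measured total. All values are illustrative.

max_power = {"core": 30.0, "l2_cache": 8.0, "mem_ctrl": 6.0}   # watts
activity  = {"core": 0.60, "l2_cache": 0.25, "mem_ctrl": 0.50} # from counters

def block_powers(max_p, act, measured_total):
    """Per-block estimates whose sum matches the measured total power."""
    raw = {b: max_p[b] * act[b] for b in max_p}
    scale = measured_total / sum(raw.values())   # enforce sum == total
    return {b: p * scale for b, p in raw.items()}

est = block_powers(max_power, activity, measured_total=25.0)
for b, p in est.items():
    print(f"{b:9s} {p:5.2f} W")
```

The renormalization step is what ties the per-block submodels back to the lumped current measurement, as the text recommends.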
4.3 Power Characterization for Software Power Analysis

The power consumption of a computing device is influenced by the SW applications that are executed on it. Once a computing device is in production, its power consumption as a function of the SW applications

can be analyzed and used to improve the energy efficiency of SW applications. This characterization is particularly attractive in embedded systems, where the software application is known a priori. Hence, tuning the algorithm and/or the compiler can lead to consistent power improvements.

Fig. 4.7 Post-silicon power analysis for software applications.

SW power analysis can be tackled at a number of abstraction levels, as shown in Figure 4.7. At the lowest level, the individual instructions of a processor and their interactions can be characterized and modeled. These models can then be used to analyze the energy consumption of SW applications. A higher-level approach is to abstract the impact of the instructions by looking at the architectural parameters (e.g., instructions per cycle, cache miss rate, and resource utilization) that are triggered by the instruction makeup of an application. Such an approach is not fundamentally different from the power characterization techniques developed for adaptive power-aware computing (discussed in Section 4.2). The difference stems from the target setting and goal, i.e., software power analysis versus adaptive power-aware computing. The highest level of abstraction is to characterize power consumption as a function of the algorithmic parameters that are relevant to the application, such as function argument types and variable sizes. We discuss these techniques in the rest of this section.

Instruction-Level Analysis

One of the main hypotheses in instruction-level power analysis is that it is possible to evaluate the energy consumption of a program by composing the energy consumption of its individual instructions, which are characterized from prior measurements [139]. One of the challenges in such evaluation is that the energy consumption of an individual instruction also depends on the state of the processor, which is determined by the preceding instructions. To address this issue, the energy cost of an instruction is decomposed into two components: a base cost and an inter-instruction cost. The base cost of an instruction is determined by repeatedly executing multiple instances of the exact same instruction instantiated within a loop, while the processor's power is externally measured by a digital multimeter. Because the switching activity of a circuit is a function of the changes in its inputs and its state, executing the exact same instruction repeatedly consumes less energy than executing a number of different instructions. Thus, it is necessary to add an inter-instruction energy cost to the base cost. To reduce the intractability of analyzing the impact of all preceding instructions, a number of techniques have been proposed. In the simplest case, a fixed cost is added to account for interdependencies [139]. Three sources of additional costs can be considered: first, the impact of the immediately preceding instruction can be captured by executing different pairs of instructions repeatedly in a loop and characterizing the additional average energy incurred; second, the energy cost of a pipeline stall can be characterized; and third, the costs of a cache hit and a cache miss can be characterized. The pipeline stall and cache miss costs are multiplied by the expected number of stalls and cache misses, respectively.
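The composition just described can be sketched in a few lines. All characterization tables and energy values below are hypothetical placeholders rather than measurements from the cited studies; the sketch simply sums base costs, pairwise inter-instruction costs, and the expected stall and miss penalties:

```python
# Hypothetical characterization tables (energy in nJ), e.g. obtained from
# multimeter measurements of instruction loops and instruction-pair loops.
BASE = {"add": 2.0, "mul": 3.5, "ld": 4.0}           # base cost per instruction
INTER = {("add", "mul"): 0.4, ("mul", "ld"): 0.6}    # pairwise inter-instruction overhead
E_STALL, E_MISS = 1.2, 8.0                           # per-event stall / cache-miss costs

def program_energy(trace, n_stalls, n_misses):
    """Compose program energy from base costs, inter-instruction costs for
    each consecutive pair, and expected stall/miss penalties."""
    energy = sum(BASE[op] for op in trace)
    energy += sum(INTER.get((a, b), 0.0) for a, b in zip(trace, trace[1:]))
    return energy + n_stalls * E_STALL + n_misses * E_MISS

print(program_energy(["add", "mul", "ld"], n_stalls=1, n_misses=1))
```

The `INTER.get(..., 0.0)` default corresponds to the simplest variant in [139], where uncharacterized pairs fall back to no (or a fixed) inter-instruction cost.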
The characterization results of these inter-instruction dependencies are summed up and added to the base cost [139]. This model can be extended to incorporate data-dependent power consumption associated with bus and memory transactions [134]. A more detailed analysis can use statistical techniques, where the energy consumption of an instruction is characterized by

both a mean and a standard deviation [126]. To determine the mean and standard deviation of an instruction, variations of the same instruction can be characterized, including different source and destination registers, different operand values, and different conditions (e.g., cache hit or miss). Using statistical models has the advantage of reporting confidence intervals for estimated program energy. Sometimes it is infeasible to characterize every instruction pair combination. For example, the ARM7 instruction set architecture has a total of 500 basic instruction costs to be measured, and it is practically infeasible to characterize all possible combinations of instruction pairs [130]. Instead, it is possible to first group the instructions into sets based on the resources they utilize, and then measure the inter-set power dependency between instructions from different sets [130]. In a study of the StrongARM SA-1100 processor, four possible sets were identified: instruction, sequential memory access, non-sequential memory access, and internal cycles [132]. SW power consumption also depends on the operating frequency and voltage. For example, we use our measurement setup to measure the average power consumption of the PARSEC applications as a function of the frequency-voltage settings of a quad-core Intel Core i7 processor. Figure 4.8 gives the average power measurements at nine different frequency-voltage settings. The figure shows strong variations in power consumption that depend on the application and the applied frequency-voltage setting. For devices that support multiple frequencies and voltages, it is necessary to characterize the variations in instruction energy consumption as a function of the operating voltage and frequency [132].
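The statistical variant can be sketched as follows. All per-instruction means and standard deviations here are made-up placeholders; the sketch composes a program energy estimate with a confidence interval, under the simplifying assumption that per-instruction energies are independent so their variances add:

```python
import math

# Hypothetical per-instruction energy characterization: (mean nJ, std nJ),
# obtained by measuring variations of each instruction (registers, operand
# values, cache hit/miss conditions).
ENERGY = {"add": (2.0, 0.1), "mul": (3.5, 0.3), "ld": (4.0, 0.8)}

def program_energy_ci(trace, z=1.96):
    """Return (mean, 95% confidence half-width) for an instruction trace,
    assuming independent instruction energies so that variances sum."""
    mean = sum(ENERGY[op][0] for op in trace)
    var = sum(ENERGY[op][1] ** 2 for op in trace)
    return mean, z * math.sqrt(var)

mean, half = program_energy_ci(["add", "mul", "ld"])
print(f"{mean:.2f} nJ +/- {half:.2f} nJ")
```

This is exactly the advantage noted above: besides a point estimate, the model reports how much the program's energy can be expected to vary around it.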
[Fig. 4.8: Average power consumption of the PARSEC applications (blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, freqmine, raytrace, streamcluster, swaptions, vips, and x264) as a function of frequency-voltage settings. Number of threads = 4. Frequencies start at 1.6 GHz and increase in increments of 133 MHz up to 2.67 GHz.]

It is possible to integrate post-silicon power measurements with pre-silicon power modeling techniques to obtain fine-grained power estimates for each circuit component. Varma et al. describe a SystemC power modeling technique in which the parameters of the power models of the SoC components are obtained from measurements. The models are then used to break up the total power consumption of general applications among the different SoC components [142]. With a more elaborate measurement infrastructure, it is possible to measure the energy consumption on a per-cycle basis [23]. A per-cycle measurement setup enables energy measurements for instructions as they advance through the individual pipeline stages. As discussed in Section 3.1, the Hamming distance between consecutive input vectors is a good estimator of the amount of switching activity. It is possible to generalize such an approach to measure the energy consumption of each pipeline stage (e.g., IF, ID, and EX) as a function of the Hamming distance between consecutive instructions. The Hamming distance between instructions can be controlled by changing the operation codes, the source and destination registers, the memory address, and the immediate operands. Thus, by administering stimuli code that is crafted to create desired Hamming distances between instructions, it is possible to measure the energy consumption of each pipeline stage as

a function of the Hamming distance.

[Fig. 4.9: Measuring instruction-level energy consumption on a per-cycle, per-stage basis [23].]

Figure 4.9 illustrates an example where a base instruction is first used to fill the pipeline stages, and then a test instruction with a prescribed Hamming distance is applied. A per-cycle energy measurement setup can measure the energy consumption of each stage as the test instruction progresses through the stages. By using different test instructions, it is possible to construct energy characterization tables that give the energy consumption of each pipeline stage as a function of the Hamming distance. These tables can be used to quickly characterize the software energy consumption of the target processor. Techniques for instruction-level power analysis are also applicable to SW applications executing on soft processors (e.g., MicroBlaze and PicoBlaze from Xilinx, and Nios II from Altera), which are synthesized in FPGAs. Despite their simplicity, these processors also exhibit significant variations in their energy consumption as a function of the executed instructions [108], and the discussed characterization techniques can also be used for their instruction-level power analysis.

Architectural Analysis

Some of the technical difficulties in instruction-level power analysis include (1) the large number of measurements that need to be conducted for different base instructions and their inter-dependencies, and (2) handling the intricacies involved with Very Long Instruction Word (VLIW) processors, which are typically deployed as digital signal processors in embedded systems [7]. To reduce the difficulties


More information

The challenges of low power design Karen Yorav

The challenges of low power design Karen Yorav The challenges of low power design Karen Yorav The challenges of low power design What this tutorial is NOT about: Electrical engineering CMOS technology but also not Hand waving nonsense about trends

More information

A Software Technique to Improve Yield of Processor Chips in Presence of Ultra-Leaky SRAM Cells Caused by Process Variation

A Software Technique to Improve Yield of Processor Chips in Presence of Ultra-Leaky SRAM Cells Caused by Process Variation A Software Technique to Improve Yield of Processor Chips in Presence of Ultra-Leaky SRAM Cells Caused by Process Variation Maziar Goudarzi, Tohru Ishihara, Hiroto Yasuura System LSI Research Center Kyushu

More information

Lecture 04 CSE 40547/60547 Computing at the Nanoscale Interconnect

Lecture 04 CSE 40547/60547 Computing at the Nanoscale Interconnect Lecture 04 CSE 40547/60547 Computing at the Nanoscale Interconnect Introduction - So far, have considered transistor-based logic in the face of technology scaling - Interconnect effects are also of concern

More information

Implementing Logic with the Embedded Array

Implementing Logic with the Embedded Array Implementing Logic with the Embedded Array in FLEX 10K Devices May 2001, ver. 2.1 Product Information Bulletin 21 Introduction Altera s FLEX 10K devices are the first programmable logic devices (PLDs)

More information

CMPEN 411 VLSI Digital Circuits Spring Lecture 24: Peripheral Memory Circuits

CMPEN 411 VLSI Digital Circuits Spring Lecture 24: Peripheral Memory Circuits CMPEN 411 VLSI Digital Circuits Spring 2011 Lecture 24: Peripheral Memory Circuits [Adapted from Rabaey s Digital Integrated Circuits, Second Edition, 2003 J. Rabaey, A. Chandrakasan, B. Nikolic] Sp11

More information

Design and Implementation of Digital CMOS VLSI Circuits Using Dual Sub-Threshold Supply Voltages

Design and Implementation of Digital CMOS VLSI Circuits Using Dual Sub-Threshold Supply Voltages RESEARCH ARTICLE OPEN ACCESS Design and Implementation of Digital CMOS VLSI Circuits Using Dual Sub-Threshold Supply Voltages A. Suvir Vikram *, Mrs. K. Srilakshmi ** And Mrs. Y. Syamala *** * M.Tech,

More information

Lecture #29. Moore s Law

Lecture #29. Moore s Law Lecture #29 ANNOUNCEMENTS HW#15 will be for extra credit Quiz #6 (Thursday 5/8) will include MOSFET C-V No late Projects will be accepted after Thursday 5/8 The last Coffee Hour will be held this Thursday

More information

DESIGNING powerful and versatile computing systems is

DESIGNING powerful and versatile computing systems is 560 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 15, NO. 5, MAY 2007 Variation-Aware Adaptive Voltage Scaling System Mohamed Elgebaly, Member, IEEE, and Manoj Sachdev, Senior

More information

INF3430 Clock and Synchronization

INF3430 Clock and Synchronization INF3430 Clock and Synchronization P.P.Chu Using VHDL Chapter 16.1-6 INF 3430 - H12 : Chapter 16.1-6 1 Outline 1. Why synchronous? 2. Clock distribution network and skew 3. Multiple-clock system 4. Meta-stability

More information

ECE 484 VLSI Digital Circuits Fall Lecture 02: Design Metrics

ECE 484 VLSI Digital Circuits Fall Lecture 02: Design Metrics ECE 484 VLSI Digital Circuits Fall 2016 Lecture 02: Design Metrics Dr. George L. Engel Adapted from slides provided by Mary Jane Irwin (PSU) [Adapted from Rabaey s Digital Integrated Circuits, 2002, J.

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering Volume 3, Issue 8, August 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Novel Implementation

More information

A Low Power Array Multiplier Design using Modified Gate Diffusion Input (GDI)

A Low Power Array Multiplier Design using Modified Gate Diffusion Input (GDI) A Low Power Array Multiplier Design using Modified Gate Diffusion Input (GDI) Mahendra Kumar Lariya 1, D. K. Mishra 2 1 M.Tech, Electronics and instrumentation Engineering, Shri G. S. Institute of Technology

More information

Lecture 13 CMOS Power Dissipation

Lecture 13 CMOS Power Dissipation EE 471: Transport Phenomena in Solid State Devices Spring 2018 Lecture 13 CMOS Power Dissipation Bryan Ackland Department of Electrical and Computer Engineering Stevens Institute of Technology Hoboken,

More information

ESE 570: Digital Integrated Circuits and VLSI Fundamentals

ESE 570: Digital Integrated Circuits and VLSI Fundamentals ESE 570: Digital Integrated Circuits and VLSI Fundamentals Lec 23: April 12, 2016 VLSI Design and Variation Penn ESE 570 Spring 2016 Khanna Lecture Outline! Design Methodologies " Hierarchy, Modularity,

More information

CHAPTER 4 FIELD PROGRAMMABLE GATE ARRAY IMPLEMENTATION OF FIVE LEVEL CASCADED MULTILEVEL INVERTER

CHAPTER 4 FIELD PROGRAMMABLE GATE ARRAY IMPLEMENTATION OF FIVE LEVEL CASCADED MULTILEVEL INVERTER 87 CHAPTER 4 FIELD PROGRAMMABLE GATE ARRAY IMPLEMENTATION OF FIVE LEVEL CASCADED MULTILEVEL INVERTER 4.1 INTRODUCTION The Field Programmable Gate Array (FPGA) is a high performance data processing general

More information

LSI and Circuit Technologies for the SX-8 Supercomputer

LSI and Circuit Technologies for the SX-8 Supercomputer LSI and Circuit Technologies for the SX-8 Supercomputer By Jun INASAKA,* Toshio TANAHASHI,* Hideaki KOBAYASHI,* Toshihiro KATOH,* Mikihiro KAJITA* and Naoya NAKAYAMA This paper describes the LSI and circuit

More information

Electronic Circuits EE359A

Electronic Circuits EE359A Electronic Circuits EE359A Bruce McNair B206 bmcnair@stevens.edu 201-216-5549 1 Memory and Advanced Digital Circuits - 2 Chapter 11 2 Figure 11.1 (a) Basic latch. (b) The latch with the feedback loop opened.

More information

! Review: MOS IV Curves and Switch Model. ! MOS Device Layout. ! Inverter Layout. ! Gate Layout and Stick Diagrams. ! Design Rules. !

! Review: MOS IV Curves and Switch Model. ! MOS Device Layout. ! Inverter Layout. ! Gate Layout and Stick Diagrams. ! Design Rules. ! ESE 570: Digital Integrated Circuits and VLSI Fundamentals Lec 3: January 21, 2016 MOS Fabrication pt. 2: Design Rules and Layout Lecture Outline! Review: MOS IV Curves and Switch Model! MOS Device Layout!

More information

ESE 570: Digital Integrated Circuits and VLSI Fundamentals

ESE 570: Digital Integrated Circuits and VLSI Fundamentals ESE 570: Digital Integrated Circuits and VLSI Fundamentals Lec 3: January 21, 2016 MOS Fabrication pt. 2: Design Rules and Layout Penn ESE 570 Spring 2016 Khanna Adapted from GATech ESE3060 Slides Lecture

More information

Static Energy Reduction Techniques in Microprocessor Caches

Static Energy Reduction Techniques in Microprocessor Caches Static Energy Reduction Techniques in Microprocessor Caches Heather Hanson, Stephen W. Keckler, Doug Burger Computer Architecture and Technology Laboratory Department of Computer Sciences Tech Report TR2001-18

More information

Minimization Of Power Dissipation In Digital Circuits Using Pipelining And A Study Of Clock Gating Technique

Minimization Of Power Dissipation In Digital Circuits Using Pipelining And A Study Of Clock Gating Technique University of Central Florida Electronic Theses and Dissertations Masters Thesis (Open Access) Minimization Of Power Dissipation In Digital Circuits Using Pipelining And A Study Of Clock Gating Technique

More information

EECS150 - Digital Design Lecture 19 CMOS Implementation Technologies. Recap and Outline

EECS150 - Digital Design Lecture 19 CMOS Implementation Technologies. Recap and Outline EECS150 - Digital Design Lecture 19 CMOS Implementation Technologies Oct. 31, 2013 Prof. Ronald Fearing Electrical Engineering and Computer Sciences University of California, Berkeley (slides courtesy

More information

Low Power VLSI Circuits and Systems Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Low Power VLSI Circuits and Systems Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Low Power VLSI Circuits and Systems Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No # 01 Introduction and Course Outline (Refer Slide

More information

PHYSICAL STRUCTURE OF CMOS INTEGRATED CIRCUITS. Dr. Mohammed M. Farag

PHYSICAL STRUCTURE OF CMOS INTEGRATED CIRCUITS. Dr. Mohammed M. Farag PHYSICAL STRUCTURE OF CMOS INTEGRATED CIRCUITS Dr. Mohammed M. Farag Outline Integrated Circuit Layers MOSFETs CMOS Layers Designing FET Arrays EE 432 VLSI Modeling and Design 2 Integrated Circuit Layers

More information

Lecture 9: Cell Design Issues

Lecture 9: Cell Design Issues Lecture 9: Cell Design Issues MAH, AEN EE271 Lecture 9 1 Overview Reading W&E 6.3 to 6.3.6 - FPGA, Gate Array, and Std Cell design W&E 5.3 - Cell design Introduction This lecture will look at some of the

More information

Implementation of dual stack technique for reducing leakage and dynamic power

Implementation of dual stack technique for reducing leakage and dynamic power Implementation of dual stack technique for reducing leakage and dynamic power Citation: Swarna, KSV, Raju Y, David Solomon and S, Prasanna 2014, Implementation of dual stack technique for reducing leakage

More information

Lecture Perspectives. Administrivia

Lecture Perspectives. Administrivia Lecture 29-30 Perspectives Administrivia Final on Friday May 18 12:30-3:30 pm» Location: 251 Hearst Gym Topics all what was covered in class. Review Session Time and Location TBA Lab and hw scores to be

More information