Energy Efficiency Benefits of Reducing the Voltage Guardband on the Kepler GPU Architecture


Jingwen Leng, Yazhou Zu, Vijay Janapa Reddi
The University of Texas at Austin
{jingwen, yazhou.zu}@utexas.edu, vj@ece.utexas.edu

Abstract—Energy efficiency of GPU architectures has emerged as an important design criterion for both NVIDIA and AMD. In this paper, we explore the benefits of scaling a general-purpose GPU (GPGPU) core's supply voltage to the near limits of execution failure. We find that as much as 21% of the NVIDIA GTX 680's core supply voltage guardband can be eliminated to achieve significant energy-efficiency improvement. Measured results indicate that the energy improvements can be as high as 25% without any performance loss. The challenge, however, is to understand what impacts the minimum voltage guardband and how the guardband can be scaled without compromising correctness. We show that GPU microarchitectural activity patterns caused by different program characteristics are the root causes of the large voltage guardband. We also demonstrate how microarchitecture-level parameters, such as clock frequency and the number of cores, impact the guardband. We hope our preliminary analysis lays the groundwork for future research.

I. INTRODUCTION

General-purpose GPU (GPGPU) architectures are increasingly becoming mainstream general-purpose computing counterparts to the CPU. For applications with significant data parallelism, the GPU architecture can offer better performance than the CPU architecture. The GPU's throughput-driven architecture maps well to data-parallel applications, as compared to the CPU's single-thread-performance-focused architecture. The cost of throughput is power consumption. Historically, the power consumption of a general-purpose GPU architecture has remained higher than that of the CPU, although the performance-per-watt efficiency of the GPU may be higher. High-performance GPU architectures have maintained a typical power consumption between 100 W and 250 W, whereas many of the most competitive commodity CPU counterparts plateau at around a 130 W power budget.

With recent GPU architectures, however, we have seen a significant emphasis on lowering the GPU's power consumption. For example, NVIDIA claims that the latest Kepler architecture achieves 3× the performance per watt of their previous-generation architecture (i.e., Fermi) [1]. State-of-the-art GPU power-saving efforts strongly reflect and follow the trend of CPU power optimizations. Typical optimizations include clock and power gating, dynamic voltage and frequency scaling (DVFS), and boosted clock frequencies [1], [2]. Although there has been increasing focus on applying traditional CPU power-saving techniques to GPUs, we need to focus on newer opportunities for energy optimization that push the GPU to the limits of its operation.

In this paper, we demonstrate the energy-efficiency benefits of pushing the GPU architecture to the limits of its operating voltage. To combat the worst-case process, voltage, and temperature variations, traditional design methodology requires an excessive supply voltage guardband, which can be as high as 20% in a production processor [3]. The guardband is predicted to grow due to the increased variations as technology scales [4]. The industry-standard practice of designing for the worst case leads to wasted energy and performance, because the circuit could have operated at a lower supply voltage or a higher clock frequency in the typical case [5], [6].
This tradeoff between performance, power, and reliability has remained largely unexplored by previous works in the case of GPUs. Using NVIDIA's GTX 680 with the Kepler architecture, we show the power benefits of reducing the processor's supply voltage at a fixed frequency to a critical voltage, the point at which the program executes correctly but fails when the voltage is reduced any further. We observe that the critical voltage depends on the workload's characteristics and can vary from 11% up to 21% below the nominal voltage. Based on the critical voltage data of different programs, we demonstrate that the L di/dt effect is the main cause of the GPU's reducible voltage guardband (i.e., the offset between the critical voltage and the nominal supply voltage). We also show that GPU architectural features, such as the number of cores, and GPU program characteristics, for example being memory bounded versus compute bounded, are two important factors that influence the amount of reducible voltage guardband. Understanding such features is crucial to effectively anticipate [7], predict [8], or mitigate [9] the reducible voltage guardband.

Our findings show that there is great potential for improving GPU energy efficiency by controlling its reducible voltage guardband from the architecture and program viewpoint. The key challenge, however, is understanding and identifying the components that determine the reducible magnitude.

The rest of this paper is organized as follows. Sec. II studies the extent to which a GPU's voltage guardband can be pushed, as well as the benefits of exploiting the voltage guardband as a knob for improving a GPU's energy efficiency. Sec. III presents our analysis of the source of the GPU's voltage guardband and how microarchitectural activities and architectural features impact the benefits. Sec. IV concludes the paper with a summary of our important findings.

Fig. 1: Measured critical voltage for 48 programs on the GTX 680.

II. PUSHING THE VOLTAGE GUARDBAND

We use the GTX 680, a high-end modern GPU with NVIDIA's latest Kepler architecture [1], to demonstrate that there is a large variation in the reducible voltage guardband among different programs. Pushing the guardband to a program's limit of correct execution can yield significant energy-efficiency benefits. Measurements show that we can achieve up to 25% energy reduction with this method. For all of our analysis, we use a total set of 48 programs from the NVIDIA CUDA SDK samples [10] and the Rodinia benchmark suite [11].

A. Critical Voltage Exploration

We experimentally reduce the operating voltage of each program to its critical voltage, an operating point at which the program executes correctly but fails when the voltage is reduced any further. The resolution with which we control the GPU's core voltage is 6 mV. As we decrease the GPU's operating voltage from its default 1.18 V at 1.2 GHz, we ensure program correctness at each step by checking whether the GPU driver crashes during program execution and by validating the program output against a reference run at the nominal operating point. When validating program output, we require the output data to match the reference run exactly for integer workloads, and to be within a 0.02% error range for floating-point workloads. We keep the core frequency, memory frequency, memory voltage, and temperature constant during the experiment.

Fig. 1 shows the critical voltage for the set of programs we studied. The critical voltage varies from 0.93 V to 1.05 V, while the nominal operating point is 1.18 V at 1.2 GHz. Our measurements indicate that the critical voltage varies strongly among these programs. Nearly half of the workloads we run have a critical voltage of around 0.97 V, while some programs have a critical voltage that is above 1 V; the largest is about 8% higher than the majority. Other programs have a critical voltage below that of the majority. Overall, the voltage guardband is overprovisioned for the set of programs we evaluated. To quantify the amount of wasted guardband, we use the term reducible voltage guardband to denote the offset between the nominal supply voltage and the benchmark's critical voltage. In the extreme case (benchmark concurrentkernels), the reducible voltage guardband is 0.25 V (21%). For the benchmark with the highest critical voltage, the reducible voltage guardband is 0.13 V (11%). Because the reducible voltage guardbands vary significantly across different programs, we attribute the difference to inherent program characteristics that impact (or determine) the worst-case critical voltage.

B. Energy-Efficiency Benefits

Fig. 2: Energy savings for operating the GPU at the critical voltages.

We measure and quantify the energy-saving benefits of operating the GPU at each program's critical voltage. For GTX 680 power measurement, we adopt the following method: the GPU card is connected to the PCIe slot through a PCIe riser card and is also powered by the ATX power supply. Both the PCIe riser card and the ATX power supply have power pins that deliver power to the GPU. We measure the instantaneous current and voltage to compute the power drawn from each of these sources.
We sense the instantaneous current draw by measuring the voltage drop that occurs across a shunt resistor. We use an NI DAQ 6133 to sample the voltage at a rate of 2 million samples per second.

Fig. 2 shows the energy-saving benefits of operating at the critical voltage.¹ By lowering the core supply voltage without compromising frequency, we can improve energy efficiency. On (geometric) average, the energy savings are about 21%. We achieve the largest energy savings with one program: operating at its critical voltage (0.97 V) instead of the original 1.18 V reduces GPU energy consumption by 24%. The smallest improvement is only 14%. Energy reductions are generally proportional to the reducible voltage guardband, as shown in Fig. 3. However, the relationship is not linear, because the magnitude of the energy savings depends on both the reducible guardband and the program characteristics. For instance, programs that are not compute bounded tend to exercise the memory subsystem heavily. Because we only scale the core voltage, we observe smaller benefits for those memory-bound programs. For compute-bound programs, we achieve larger benefits. Therefore, it is also important to understand program behavior when optimizing the GPU's energy efficiency using the voltage guardband.

¹ Henceforth, we use a subset of the programs mentioned at the beginning of Sec. II because of power or performance counter instrumentation difficulties. Nevertheless, the subset is large enough to faithfully represent our observations.
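The search procedure of Sec. II-A amounts to a simple host-side loop. The sketch below is a minimal illustration, assuming hypothetical set_core_voltage_mV() and run_benchmark() wrappers around the vendor's voltage-control utility and the benchmark binary (neither is an NVIDIA API); the 6 mV step, the 1.18 V nominal point at 1.2 GHz, and the 0.02% floating-point tolerance come from the text above.

```cuda
// Minimal sketch of the critical-voltage search in Sec. II-A (not the authors' tool).
#include <cmath>
#include <cstddef>
#include <vector>

struct RunResult {
    bool driver_crashed;            // did the GPU driver crash during execution?
    bool is_floating_point;         // output type of this workload
    std::vector<double> output;     // output produced at the reduced voltage
    std::vector<double> reference;  // output produced at the nominal operating point
};

// Hypothetical stubs: a real setup would call the voltage-control tool and launch the kernel.
static void set_core_voltage_mV(int /*millivolts*/) {}
static RunResult run_benchmark() { return {false, true, {1.0}, {1.0}}; }

// Validation rule from Sec. II-A: exact match for integer workloads,
// at most 0.02% relative error for floating-point workloads.
static bool output_is_correct(const RunResult& r) {
    for (std::size_t i = 0; i < r.output.size(); ++i) {
        if (r.is_floating_point) {
            if (std::fabs(r.output[i] - r.reference[i]) > 2e-4 * std::fabs(r.reference[i]))
                return false;
        } else if (r.output[i] != r.reference[i]) {
            return false;
        }
    }
    return true;
}

// Lower the core voltage in fixed steps (6 mV on the GTX 680) until the program misbehaves;
// the last voltage that executed correctly is the benchmark's critical voltage.
int find_critical_voltage_mV(int nominal_mV = 1180, int step_mV = 6) {
    int last_good_mV = nominal_mV;
    for (int v = nominal_mV - step_mV; v > 0; v -= step_mV) {
        set_core_voltage_mV(v);
        RunResult r = run_benchmark();
        if (r.driver_crashed || !output_is_correct(r)) break;  // first failing voltage found
        last_good_mV = v;
    }
    set_core_voltage_mV(nominal_mV);  // restore the nominal operating point
    return last_good_mV;
}
```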

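Similarly, the power-measurement setup of Sec. II-B boils down to P = V × I on each supply path, with the current inferred from the voltage drop across a shunt resistor and sampled at 2 MS/s. Below is a minimal sketch of the energy computation, assuming a hypothetical shunt resistance and sample layout (only the sampling rate is taken from the text).

```cuda
#include <algorithm>
#include <cstddef>
#include <vector>

// One simultaneous sample on a single supply path: the rail voltage and the
// voltage drop across that path's shunt resistor (both in volts).
struct PowerSample {
    double v_rail;
    double v_shunt;
};

// Integrate total energy (joules) over a run from the two supply paths.
// r_shunt_ohms is a hypothetical shunt resistance; sample_rate_hz is 2e6 in Sec. II-B.
double measured_energy_joules(const std::vector<PowerSample>& pcie_riser,
                              const std::vector<PowerSample>& atx_connector,
                              double r_shunt_ohms, double sample_rate_hz) {
    const double dt = 1.0 / sample_rate_hz;  // 0.5 microseconds between samples at 2 MS/s
    const std::size_t n = std::min(pcie_riser.size(), atx_connector.size());
    double energy = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double p_pcie = pcie_riser[i].v_rail * (pcie_riser[i].v_shunt / r_shunt_ohms);     // P = V * I
        const double p_atx  = atx_connector[i].v_rail * (atx_connector[i].v_shunt / r_shunt_ohms);
        energy += (p_pcie + p_atx) * dt;  // accumulate instantaneous power over time
    }
    return energy;
}
```

Comparing this integral for a run at the nominal voltage and a run at the critical voltage yields the energy savings reported in Fig. 2.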
Fig. 3: Correlation between energy savings and voltage guardband for different programs.

Fig. 4: No correlation between average power and measured reducible voltage guardband.

Fig. 5: Measured voltage traces of two benchmarks, comparing IR drop and L di/dt droops.

Fig. 6: Critical voltage at an increasing number of active cores.

III. CHARACTERIZING FACTORS THAT IMPACT THE REDUCIBLE VOLTAGE GUARDBAND

It is important to understand what constrains the extent to which the supply voltage of a GPU can be reduced, and how architectural parameters and program characteristics interact with each other and impact a program's critical voltage. We present a measurement-based analysis that serves as a basis for understanding voltage noise in GPUs. We start by showing that the magnitude of the critical voltage is determined by L di/dt noise rather than by the IR drop of a GPU. Next, we demonstrate that the number of active GPU cores impacts the critical voltage. We also show that increasing the clock frequency can detrimentally affect the critical voltage. Finally, we explain how the critical voltage can be associated with memory- versus compute-bound program characteristics.

A. L di/dt Noise

To understand why the programs have different reducible voltage guardbands, we must understand whether the reducible voltage guardband is caused by the IR drop or by the L di/dt effect [4]. The static IR drop is the voltage drop resulting from the resistive component of the power delivery network when the processor consumes high power. The L di/dt droop is a dynamic event, resulting from the inductive and capacitive components when microarchitectural activity causes power fluctuations. Reducing the IR drop requires us to lower the GPU's peak power consumption, and therefore may negatively impact the GPU's performance. Because L di/dt noise is typically a rare transient effect, prior CPU work has shown that optimizing for it can significantly boost performance [12]. Alternatively, it can also be exploited to reduce energy consumption at a fixed frequency.

We find that the majority of the voltage guardband is needed for the inductive voltage noise (i.e., the di/dt voltage droop). If the static IR drop were the main cause, the reducible voltage guardband would have a strong correlation with the average power consumption. Fig. 4 shows the relationship between the reducible voltage guardband and the average power consumption: the two are not correlated. To confirm our analysis, we also measure the GPU processor's voltage trace at the package level using the DAQ. Fig. 5 shows a snapshot of the measured voltage traces for two benchmarks. Both programs have a similar power draw of around 115 W (not shown). However, one has a lower reducible voltage guardband than the other: 0.13 V versus 0.2 V. The top graph in Fig. 5 shows the transient voltage droops of the benchmark with the smaller reducible guardband, while the bottom graph shows the measured trace of the other benchmark, which is more stable. The trends seen with these two benchmarks are representative of the other programs.
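To first order, the droop seen at the die combines a resistive and an inductive term; the standard power-delivery-network approximation below (a textbook decomposition, not a fitted model of the GTX 680's network) makes the distinction explicit:

```latex
V_{\mathrm{die}}(t) \;\approx\; V_{\mathrm{nominal}}
  \;-\; \underbrace{I(t)\,R}_{\text{static IR drop}}
  \;-\; \underbrace{L\,\frac{dI(t)}{dt}}_{\text{inductive droop}}
```

The IR term tracks the average current (and hence average power), whereas the L dI/dt term tracks how quickly the current changes, which is why the lack of correlation in Fig. 4 points to the inductive term.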
Therefore, we conclude that the inductive voltage droop caused by variation in the GPU processor's current draw is the major cause of the lower reducible voltage guardband in some benchmarks. The processor's current draw can vary in accordance with both microarchitectural activity and program characteristics. For instance, microarchitecture stalls can cause voltage droops [12].

B. Number of Cores

Prior work with multicore CPUs demonstrated that the number of active cores can detrimentally impact the reducible voltage guardband due to constructive voltage noise interference [4]. This sort of analysis is yet to be performed for GPUs, which use many simple in-order cores that are significantly less power hungry than traditional out-of-order superscalar processors. We study the effect of active GPU cores on the critical voltage by conducting an experiment with a benchmark that uniformly exercises all SIMD execution lanes in the GPU without introducing complex behavior (e.g., control divergence). Because we cannot directly control the number of active cores in the GPU, we vary the number of CUDA thread blocks used by the program, which lets us indirectly control the number of active cores.

Fig. 6 shows how the critical voltage changes as the number of thread blocks increases. When only one thread block is active, the critical voltage is at its lowest. When 32 thread blocks are used, the critical voltage increases to 0.99 V. This result implies that the required guardband grows as more cores are used. Although this is a relatively simple application compared to other programs with complex control flow, the observation points to a trade-off between the number of active cores, the GPU's critical voltage, and energy efficiency that remains open for exploration.
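As an illustration of the methodology in Sec. III-B, the sketch below launches a simple, divergence-free kernel with 1 to 32 thread blocks, which indirectly controls how many SMX units receive work; the kernel body is a hypothetical stand-in, since the text does not list the benchmark's source.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Keeps every SIMD lane uniformly busy with dependent FMAs (no control divergence),
// mimicking the regular workload used in Sec. III-B.
__global__ void busy_kernel(float* data, int iters) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float v = data[idx];
    for (int i = 0; i < iters; ++i) v = v * 1.000001f + 0.5f;
    data[idx] = v;
}

int main() {
    const int threads_per_block = 256;
    const int max_blocks = 32;
    float* d_data = nullptr;
    cudaMalloc(&d_data, max_blocks * threads_per_block * sizeof(float));
    cudaMemset(d_data, 0, max_blocks * threads_per_block * sizeof(float));

    // Sweep the block count: with one block only a single SMX is active; with 32 blocks
    // every SMX on the GTX 680 (8 SMX units) receives work.
    for (int blocks = 1; blocks <= max_blocks; blocks *= 2) {
        busy_kernel<<<blocks, threads_per_block>>>(d_data, 1 << 20);
        cudaDeviceSynchronize();
        printf("ran with %d thread block(s)\n", blocks);
        // The critical-voltage search from Sec. II-A would be repeated here for each
        // block count to produce the data points in Fig. 6.
    }
    cudaFree(d_data);
    return 0;
}
```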

Fig. 7: Critical voltages when operating at different frequencies.

C. Clock Frequencies

In the pursuit of low power and high performance, GPUs employ dynamic voltage and frequency control to lower the voltage and achieve power savings when performance is not needed. For example, NVIDIA's GPU Boost dynamically increases the GPU's clock frequency to deliver performance until it hits a predetermined temperature [1]. We find an interesting trade-off between the critical voltage and the processor's operating clock frequency: the critical voltage likely needs to be considered when changing frequencies, because a small clock-frequency increase may necessitate a relatively large critical-voltage increase, which could void the benefits of reducing the voltage guardband and/or of the boosted clock frequency.

We measure the critical voltage for the programs under three frequency settings: 1.1 GHz, 1.2 GHz, and 1.3 GHz. Fig. 7 shows our results. From our measurements, we make three important observations. First, programs generally need a higher critical voltage at higher clock frequencies due to the shorter cycle time's impact on L di/dt noise. At a higher clock rate, stalls and their impact on voltage droop become more pronounced because the current increases and the time duration over which the current changes gets smaller. Second, for a fixed increase in clock frequency (e.g., a 100 MHz step), the critical voltage increases superlinearly for nearly all programs. For example, one benchmark's critical voltage increases only slightly above 1 V when the frequency increases from 1.1 GHz to 1.2 GHz, but rises by a larger amount, to 1.15 V, when the frequency is increased further to 1.3 GHz. The trend applies to almost all the programs we consider. Third, the exact magnitude of the increase varies across programs. For some programs, such as scalarprod, the critical voltage increases much more than for the other benchmarks when the frequency changes: when the frequency increases from 1.2 GHz to 1.3 GHz, the critical voltage increases sharply for scalarprod, whereas for other programs the increase is smaller. To understand these differences, we need to examine the programs and understand their inherent workload characteristics.

Fig. 8: Critical voltages for the different program types.

D. Program Characterization

We demonstrate that the characteristics of a program impact its critical voltage. Specifically, we show the extent to which memory- versus compute-bound programs affect the critical voltage. We find that memory-bounded programs typically have a higher critical voltage, which is possibly caused by stalling behavior, even though GPU architectures are aggressively designed to mask memory stalls via massive multithreading. A typical GPU can support over 10,000 threads. We categorize the programs into four different types using the NVIDIA visual profiler [13].
The four types of programs we study are grouped as follows: memory bounded, whose execution time is bounded by memory bandwidth; compute bounded, whose execution time is bounded by the GPU's computational capabilities; latency bounded, which do not have enough threads to run on the GPU hardware and thus have very low utilization of both the compute units and main memory bandwidth; and balanced, the ideal kind of program to run on a GPU because it has a high utilization rate on both the compute units and memory bandwidth.

Fig. 8 shows the critical voltage for the different program types. Memory-bounded programs tend to have a higher critical voltage (i.e., larger voltage droops). Balanced programs show a moderate critical voltage. Compute- and latency-bounded programs tend to have a lower critical voltage (i.e., smaller voltage droops). Prior work in the CPU domain has shown that two conditions are required for large voltage droops to occur: regular microarchitecture stalls, and synchronized stalls among multiple cores. Both of these conditions can explain why the memory-bounded kernels show a large droop. Memory-bounded kernels have stall behavior that is caused by the memory subsystem, and these stalls tend to synchronize because of contention at the memory subsystem level. Although latency-bounded programs also exhibit stalls, the stalls are likely not aligned due to the lack of contention for common resources. Compute-bounded programs have either a stable power draw or unsynchronized stalls.

In the future, it may be worthwhile to explore GPU kernel-level characteristics. It may also be worthwhile to understand how explicit program characteristics, such as barrier synchronization, impact the reducible guardband magnitude.
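To make the categorization concrete, the rule below sketches how a kernel could be binned from profiler-reported compute and memory-bandwidth utilizations; the 60% cut-off is an illustrative assumption, not a threshold reported by the paper or by the NVIDIA visual profiler.

```cuda
#include <string>

// Bin a kernel into the four categories of Sec. III-D based on its utilization of the
// compute units and of memory bandwidth (both expressed as fractions from 0.0 to 1.0).
std::string classify_kernel(double compute_util, double mem_bandwidth_util) {
    const double kHigh = 0.6;  // illustrative threshold (assumption)
    const bool compute_high = compute_util >= kHigh;
    const bool memory_high = mem_bandwidth_util >= kHigh;
    if (compute_high && memory_high) return "balanced";  // both resources kept busy
    if (memory_high) return "memory bounded";             // limited by DRAM bandwidth
    if (compute_high) return "compute bounded";           // limited by the SIMD units
    return "latency bounded";                              // too few threads to keep either busy
}
```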

IV. CONCLUSION

We demonstrated that we can achieve energy-reduction benefits as high as 25% by pushing the Kepler GPU's core supply voltage to its limit. The challenge in leveraging this opportunity lies in understanding what impacts the reducible voltage guardband. We find that the voltage guardband of GPUs is mainly dictated by L di/dt noise, and that the critical voltage depends on workload characteristics. We also show how microarchitecture-level parameters, such as the number of active cores and the core frequency, impact the reducible voltage guardband. We believe that there is large potential in this work, and it encourages us to further investigate the GPU voltage guardband's interactions with architecture-level parameters as well as GPU programs' characteristics.

REFERENCES

[1] NVIDIA Corporation, NVIDIA CUDA Programming Guide, 2011.
[2] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, "GPUWattch: Enabling Energy Optimizations in GPGPUs," in Proceedings of the International Symposium on Computer Architecture (ISCA), 2013.
[3] N. James, P. Restle, J. Friedrich, B. Huott, and B. McCredie, "Comparison of Split- Versus Connected-Core Supplies in the POWER6 Microprocessor," in International Solid-State Circuits Conference (ISSCC), 2007.
[4] V. Reddi, S. Kanev, W. Kim, S. Campanoni, M. Smith, G.-Y. Wei, and D. Brooks, "Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-Guided Thread Scheduling," in Proceedings of the International Symposium on Microarchitecture (MICRO), 2010.
[5] C. R. Lefurgy, A. J. Drake, M. S. Floyd, M. S. Allen-Ware, B. Brock, J. A. Tierno, and J. B. Carter, "Active Management of Timing Guardband to Save Energy in POWER7," in Proceedings of the International Symposium on Microarchitecture (MICRO), 2011.
[6] V. J. Reddi, M. Gupta, G. Holloway, M. D. Smith, G.-Y. Wei, and D. Brooks, "Predicting Voltage Droops Using Recurring Program and Microarchitectural Event Activity," IEEE Micro Top Picks, 2010.
[7] M. D. Powell and T. Vijaykumar, "Pipeline Damping: A Microarchitectural Technique to Reduce Inductive Noise in Supply Voltage," in Proceedings of the International Symposium on Computer Architecture (ISCA), 2003.
[8] V. Reddi, M. Gupta, G. Holloway, G.-Y. Wei, M. Smith, and D. Brooks, "Voltage Emergency Prediction: Using Signatures to Reduce Operating Margins," in Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), 2009.
[9] M. S. Gupta et al., "An Event-Guided Approach to Handling Inductive Noise in Processors," in Proceedings of the Design, Automation and Test in Europe Conference (DATE), 2009.
[10] NVIDIA Corporation, CUDA C/C++ SDK Code Samples, 2011.
[11] S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A Benchmark Suite for Heterogeneous Computing," in Proceedings of the International Symposium on Workload Characterization (IISWC), 2009.
[12] M. S. Gupta, K. K. Rangan, M. D. Smith, G.-Y. Wei, and D. Brooks, "Towards a Software Approach to Mitigate Voltage Emergencies," in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), 2007.
[13] NVIDIA Corporation, NVIDIA Visual Profiler, 2013.