Evaluation of CPU Frequency Transition Latency

Abdelhafid Mazouz, Alexandre Laurent, Benoît Pradelle, William Jalby
Univ. Versailles St-Quentin en Yvelines
E-mail: first.last@uvsq.fr

Abstract

Dynamic Voltage and Frequency Scaling (DVFS) is one of the most widely used techniques to reduce energy consumption in computers. The main idea exploited by DVFS controllers is to save energy by reducing the CPU's frequency whenever the CPU is less intensively used. For example, memory-intensive program phases are good targets for frequency reduction. However, the time needed to switch the frequency varies with the CPU model. Such delays are often optimistically ignored in DVFS controllers, although knowing them could improve the quality of frequency-setting decisions. This document presents FTaLaT (Frequency Transition Latency measurement Tool), a tool able to measure the frequency transition latency directly on the computer. The measurement methodology is also presented, accompanied by evaluations on three recent Intel processors.

Keywords: DVFS, Statistical performance evaluation, Frequency transition latency

1 Introduction

Energy is now considered a key topic in computer science. Embedded platforms, driven by autonomy (battery life) constraints, pioneered this concern, but more recently major concerns have been raised in unexpected domains. This is the case, for instance, in high-performance computing, where the power consumption of supercomputers has reached levels that prevent current technologies from being used in the next generation of supercomputers.

Among the different optimizations applied to reduce energy consumption, Dynamic Voltage and Frequency Scaling (DVFS) has proven to be one of the most successful techniques [Ge et al, 2007; Hsu and Feng, 2005; Wu et al, 2005].
The core idea of DVFS is to dynamically adapt the processor's frequency and input voltage in order to reduce power consumption. Common DVFS controllers exploit the slack time that memory accesses or I/O create in programs and decrease the processor speed during such phases, reducing the total energy consumption without harming the program execution time.

Most existing DVFS controllers treat the latency required to change the frequency as negligible. However, several DVFS controllers are based on frequent periodic polling of resource usage. They must correctly estimate the CPU frequency transition latency in order to determine an efficient polling period. Other approaches rely on theoretical modeling of the hardware and can benefit from a precise estimation of the transition latency [Snowdon, 2010]. The frequency transition latency is therefore important information for many DVFS controllers, among other potential uses.

We propose FTaLaT, a new frequency transition latency estimator. Using precise measurements under different frequency settings, FTaLaT can quickly provide a precise estimation of all the possible frequency transition latencies. FTaLaT is freely distributed as open source software at: http://code.google.com/p/ftalat.

2 How does it work?

The goal of FTaLaT is to measure the delay between the request for a new frequency and the actual frequency transition. To do so, it executes a specifically designed short program, called a micro benchmark or kernel. The benchmark is first run several times to determine its execution time at the initial and target frequencies. Then, starting from the initial frequency, the transition towards the target frequency is triggered and the benchmark is repeatedly run while measuring its execution time. Once the benchmark execution time is close to what is expected for the target frequency, FTaLaT detects a frequency transition and deduces the transition latency.

The benchmark used for the measurement is a simple list of add assembly operations, as presented in Listing 1. With such a structure, the benchmark exhibits a repeatable behavior and, being made only of arithmetic operations, is sensitive to frequency transitions. The benchmark is therefore well suited to detecting actual CPU frequency transitions from execution time variations.

Listing 1 Micro benchmark assembly instructions

    start time counter
    ...
    stop time counter

The main issue with this methodology is to correctly detect a frequency transition from execution times. Background tasks may wake up at any time on the computer, for instance, and perturb the execution of the micro benchmark [Mazouz et al, 2010; Mytkowicz et al, 2009]. As the benchmark runs are extremely short, it may be difficult to distinguish between measurement noise and actual frequency transitions. To solve that issue, FTaLaT relies on statistical tests.

FTaLaT uses confidence intervals [Jain, 1991] to determine whether an execution time can be considered close to what was observed previously at a given frequency. More precisely, a confidence interval defines a range of values that one can expect to contain a good estimate of a population parameter. It is computed from the mean of several measurements, their standard deviation, and a constant called the t distribution value, which depends on the desired confidence level. The resulting confidence intervals are used for two main purposes. First, an execution time can be considered close to others if it belongs to their confidence interval.
Second, a test called the two-sample t-test can determine whether two sets of values are distinct or not. Such usages are well suited to the needs of FTaLaT.

Thanks to these statistical tools, FTaLaT can determine whether the execution times of the benchmark at the initial and target frequencies are distinct or not. It can then ignore the cases where the benchmark execution times at both frequencies are too close to conclude. It also determines whether an execution time is similar to those achieved at the target frequency in order to detect a frequency transition. Finally, FTaLaT uses the t-test to determine whether a detected transition is measurement noise or is confirmed over a longer period. In our implementation, confidence intervals are built for a 95 % confidence level. A more detailed methodology is presented below.

The general FTaLaT algorithm is as follows. First, the micro benchmark is run 10,000 times while the initial frequency is set, and again 10,000 times with the target frequency. Then, both measurement sets are compared using the t-test: if they are not found distinct, the procedure ends, as FTaLaT is not able to distinguish between the execution times. In practice, this never happened during our experiments. In addition, the confidence interval of the mean for the target frequency is built.

Then, the initial frequency is set, the target frequency is requested, and the micro benchmark is repeatedly executed while measuring its execution time. After every execution, if the resulting execution time lies in the confidence interval for the target frequency, a frequency transition is assumed. At this point, the transition is not certain, as the execution time may have changed because of various external events. In order to ensure that the transition actually occurred, the micro benchmark is run again 100 times while measuring its execution time.
Using the t-test, the set consisting of the 100 final execution times is compared to that of the target frequency. If they are found similar, FTaLaT considers that a frequency transition actually happened and records the transition latency. Otherwise, the measurement is started over, as it may have been disturbed by external noise. As a result, either FTaLaT aborts when the stability conditions are not met, or a frequency transition delay is determined.

For more details about the statistical methods employed, or the detailed algorithm, the reader is referred to the scientific paper presenting the system [Mazouz et al, 2013].

3 Experiments

3.1 Experimental setup

Our experiments have been conducted on three distinct machines, presented in Table 1, running a recent Linux system. On each machine, FTaLaT measured the frequency transition latency between every available frequency pair. For statistical significance, each measurement was repeated 31 times, and an effort was made to reduce the sources of performance instability: we use thread affinity for better performance stability, and the time stamp counter, read via the RDTSC instruction, for precise frequency-independent time measurements.
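As an illustration of the detection procedure described in Section 2, the following Python sketch strings the three statistical steps together: the initial distinctness check, the confidence-interval test on each run, and the confirmation over 100 further runs. It is a minimal sketch under stated assumptions, not the FTaLaT implementation: the function names are hypothetical, the t-test is written in its Welch form (the text only says "two-sample t-test"), the constant 1.96 is the normal approximation of the t value at a 95 % confidence level, and the timings are synthetic.

```python
import math
import statistics

# Assumption: 1.96 approximates the Student t value for a 95 % confidence
# level; this is only adequate for large sample sets.
T_95 = 1.96


def mean_confidence_interval(samples, t_value=T_95):
    """Confidence interval of the mean: mean +/- t * s / sqrt(n)."""
    mean = statistics.mean(samples)
    half_width = t_value * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean - half_width, mean + half_width


def distinct(a, b, t_value=T_95):
    """Two-sample t-test (Welch form): True if the two means differ."""
    std_err = math.sqrt(statistics.variance(a) / len(a)
                        + statistics.variance(b) / len(b))
    return abs(statistics.mean(a) - statistics.mean(b)) > t_value * std_err


def detect_transition(run_kernel, initial_samples, target_samples,
                      confirm_runs=100, max_runs=1_000_000):
    """Return the elapsed time until a confirmed transition, or None.

    run_kernel() executes the micro benchmark once and returns its
    execution time; the caller requests the target frequency just before.
    """
    if not distinct(initial_samples, target_samples):
        return None  # the two frequencies cannot be told apart
    low, high = mean_confidence_interval(target_samples)
    elapsed = 0.0
    for _ in range(max_runs):
        duration = run_kernel()
        elapsed += duration
        if low <= duration <= high:  # candidate transition
            confirmation = [run_kernel() for _ in range(confirm_runs)]
            if distinct(confirmation, target_samples):
                return None  # likely noise: the caller should start over
            return elapsed
    return None


# Synthetic timings in microseconds (made up for illustration, not measured).
initial = [0.89, 0.91] * 500
target = [0.41, 0.43] * 500
trace = iter([0.90] * 48 + [10.40] + [0.42] + [0.41, 0.43] * 50)
latency = detect_transition(lambda: next(trace), initial, target)
print(f"estimated transition latency: {latency:.2f} us")
```

In this sketch the elapsed time includes the run in which the change is detected, which is one plausible reading of the description. A real measurement would replace the synthetic trace with a timed execution of the kernel on a pinned thread.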

Table 1 Test machines description

Processor            Xeon X5650            Xeon E3-1240             Core i7-3770
CPU type             Intel Core Westmere   Intel Core SandyBridge   Intel Core IvyBridge
Micro-architecture   Nehalem               SandyBridge              IvyBridge
Cores                2x 6                  1x 4                     1x 4
Hardware threads     2x 6                  1x 4                     1x 8
Min CPU Frequency    1.59 GHz              1.6 GHz                  1.6 GHz
Max CPU Frequency    2.66 GHz              3.3 GHz                  3.4 GHz

3.2 Experimental results

Figures 1, 2, and 3 present the frequency transition latency, in microseconds on the vertical axis, for the three test machines. The horizontal axis in each figure shows the different target frequencies, while the initial frequencies are represented using distinct plotting colors and symbols. For each pair of initial and target frequencies, we report the delay required to achieve an effective frequency transition, as measured by FTaLaT.

On the three machines, we observe that the transition delay is not constant but rather depends on the initial and target frequencies. The transition latency usually increases when the target frequency is higher than the initial frequency. On the other hand, when the frequency decreases, the transition latency stays in a very tight range of small values. One can also notice that the transition latency does not follow a similar trend on all machines: while the transition latency seems to increase linearly with the CPU frequency on the SandyBridge machine, at least three steps appear on the IvyBridge and the Westmere machines. Finally, the range of possible latencies tends to tighten for more recent processor models.

The results obtained by FTaLaT are consistent with the short description found in the manufacturer documentation for similar processors [Intel Corporation, 2011, 2012], which states that the voltage increase induced by a frequency increase is performed in several steps, while a frequency and voltage reduction is described as a one-step operation.
Our tool confirms that a similar behavior can be observed on our experimental platforms.

In order to have a better view of the transition latency, Figure 4 represents the measured benchmark execution times on the IvyBridge machine when switching the CPU frequency from 1.6 GHz to 3.4 GHz. The vertical axis reports the execution time of the successive runs of the kernel, while the horizontal axis represents the successive iterations until the transition is detected and confirmed. Notice that our system immediately detects the transition but runs additional measurements afterwards to confirm it. We can observe two distinct steps in execution times: 1) from iteration 1 to iteration 48 and 2) from iteration 50 to iteration 149. The first step represents the executions at the 1.6 GHz frequency, while the second step occurs when the benchmark runs at 3.4 GHz. Thus, the CPU does not change its operating frequency until iteration 49. The transition latency computed by FTaLaT is then the elapsed time between the request for a new frequency (at iteration 1) and the detection of the change in the kernel's execution time (iteration 50).

Fig. 4 Observed execution times of the assembly kernel for the pair (1.6 GHz, 3.4 GHz) of CPU frequencies on the IvyBridge machine. (Vertical axis: kernel latency in µs; horizontal axis: iteration number.)

Additionally, when executing the kernel at iteration 49, we can observe that the frequency transition provokes a short pause, which significantly increases the kernel's execution time. Thus, aside from the transition latency, the frequency transition also has an overhead. The overhead directly impacts the execution time of the running programs, as the processor can be considered as paused during the actual frequency transition. In the presented measurement, the overhead of the frequency transition can be deduced from the extra time spent in the 49th iteration and represents about 9.5 µs.
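The overhead arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, the trace is synthetic rather than measured, and taking the kernel's execution time at the initial frequency as the baseline is an assumption, since the text only speaks of the "extra time" spent in the iteration.

```python
def transition_overhead(trace_us, baseline_us):
    """Return (iteration, overhead_us) for the slowest run of a trace.

    iteration is 1-based, to match the iteration numbers used in the text.
    baseline_us is the assumed normal kernel time at the initial frequency.
    """
    slowest = max(range(len(trace_us)), key=trace_us.__getitem__)
    return slowest + 1, trace_us[slowest] - baseline_us


# Synthetic trace in microseconds (made up for illustration, not measured):
# 48 runs at the initial frequency, one paused run, then runs at the target.
trace_us = [0.90] * 48 + [10.40] + [0.42] * 100
iteration, overhead_us = transition_overhead(trace_us, baseline_us=0.90)
print(f"pause at iteration {iteration}, overhead about {overhead_us:.1f} us")
```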
Ideally, such overhead should also be taken into account when performing DVFS, for instance by avoiding unnecessary frequency transitions.

4 Conclusion

FTaLaT is able to determine experimentally the frequency transition latency on modern x86-64 platforms for every pair of available frequencies. It uses a rigorous statistical approach to distinguish between measurement noise and actual information. The tool is distributed as open source software for recent Linux systems.

From the experiments, we observed various interesting phenomena, such as the variable cost of a frequency increase compared to the nearly fixed cost of a frequency decrease. FTaLaT is therefore of great help to measure the frequency transition latency and to better understand processors. Aside from its uses in DVFS control, it can also be used to track advances in frequency transition delays or to highlight design issues in processors.

Fig. 1 Observed frequency transition latency on the IvyBridge machine. (Panel title: IvyBridge (4 cores) machine; horizontal axis: target frequency, 1.6 to 3.4 GHz; one series per initial frequency.)

Fig. 2 Observed frequency transition latency on the SandyBridge machine. (Panel title: SandyBridge (4 cores) machine; horizontal axis: target frequency, 1.6 to 3.3 GHz; one series per initial frequency.)

Fig. 3 Observed frequency transition latency on the Westmere machine. (Panel title: Westmere (16 cores) machine; horizontal axis: target frequency, 1.596 to 2.66 GHz; one series per initial frequency.)

References

Ge R, Feng X, Feng Wc, Cameron K (2007) CPU MISER: A performance-directed, run-time system for power-aware clusters. In: ICPP 2007. International Conference on Parallel Processing, p 18, DOI 10.1109/ICPP.2007.29

Hsu Ch, Feng Wc (2005) A power-aware run-time system for high-performance computing. In: Proceedings of the 2005 ACM/IEEE conference on Supercomputing, IEEE Computer Society, Washington, DC, USA, SC '05, pp 1, DOI 10.1109/SC.2005.3, URL http://dx.doi.org/10.1109/SC.2005.3

Intel Corporation (2011) Intel Xeon processor E3-1200 family datasheet

Intel Corporation (2012) Intel Xeon processor E5-1600/E5-2600/E5-4600 product families

Jain R (1991) The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modelling. John Wiley and Sons

Mazouz A, Touati SAA, Barthou D (2010) Study of variations of native program execution times on multi-core architectures. In: CISIS '10: Proc. of the International Conference on Complex, Intelligent and Software Intensive Systems, IEEE Computer Society, Washington, DC, USA, pp 919-924, DOI http://dx.doi.org/10.1109/cisis.2010.96, MuCoCos workshop

Mazouz A, Laurent A, Pradelle B, Jalby W (2013) Evaluation of CPU frequency transition latency. Computer Science - Research and Development pp 1-9, DOI 10.1007/s00450-013-0240-x, URL http://dx.doi.org/10.1007/s00450-013-0240-x

Mytkowicz T, Diwan A, Sweeney PF, Hauswirth M (2009) Producing wrong data without doing anything obviously wrong! In: Architectural Support for Programming Languages and Operating Systems (ASPLOS)

Snowdon D (2010) Operating system directed power management. PhD thesis, University of New South Wales

Wu Q, Martonosi M, Clark DW, Reddi VJ, Connors D, Wu Y, Lee J, Brooks D (2005) A dynamic compilation framework for controlling microprocessor energy and performance. In: MICRO, pp 271-282