Supply-Adaptive Performance Monitoring/Control Employing ILRO Frequency Tuning for Highly Efficient Multicore Processors

EE 241 Project Final Report 2013

Jaeduk Han, Student Member, IEEE, and Angie Wang, Student Member, IEEE

Abstract: RAVEN, a state-of-the-art multicore processor being developed by the Berkeley Wireless Research Center, uses switched-capacitor DC-DC converters to generate dynamically scalable core voltage supplies. A core's energy efficiency at a given throughput can be improved by allowing the local power management and a dedicated clock generator to adapt to the capacitor voltage ripple. Some commercial multicore processors, such as IBM's POWER7, support active management of timing guardband to achieve nearly 25% power reduction without performance degradation in the face of load-induced supply droops and other noise events similar in character to DC-DC ripple. While previous designs used bulky and/or slow-locking DLLs and PLLs for core clock synthesis, our implementation uses a fast-locking, power- and area-efficient injection-locked ring oscillator. Fast supply adaptation to achieve an average 2GHz core frequency (2x the global clock frequency), supported by the aforementioned blocks with feedback from critical path monitors, achieves power savings that justify our scheme for systems needing to tolerate >80mV supply droops.

Index Terms: Critical path monitoring (CPM), DC-DC converters, dynamic frequency and voltage scaling (DFVS), injection-locked ring oscillators (ILRO), multicore processors, timing/voltage guardband

I. INTRODUCTION

To minimize the power consumption of high-performance microprocessors, architects have become less reliant on frequency scaling, favoring instead many-core parallelism to achieve higher data throughput. However, the move to multicore does not guarantee operation at optimal power levels without performance degradation.
To increase yield, previous multicore designs employed static timing/voltage guardbands to prevent timing violations under worst-case conditions, which incurred a significant power/performance overhead. Because power consumption has a square dependence on voltage, efficiency can be increased by adapting VDD to fluctuating workloads and operating conditions, transforming excess performance margin into power reduction. To accomplish this, the supply level is adjusted to maintain an average clock frequency that meets performance/power needs without incurring timing errors.

Even if the global processor uses this adaptive voltage control technique, its overall efficiency is still limited by the minimum supply voltage of the slowest core. This is unfavorable because modern processes suffer from high variability. It is therefore desirable to have per-core supply and clock monitoring/control systems, so that each core's VDD can be individually optimized for higher efficiency, as seen in Fig. 1.

Fig. 1. VDD-frequency curves of multicore systems: (a) single power supply with guardband; (b) local power supply for each core.
Fig. 2. Adapting the operating frequency as the supply changes.

Frequency-adaptive techniques for combating the effects of voltage droops have been successfully implemented in both IBM's POWER7 servers and Intel's Montecito CPUs. In these designs, feedback systems track timing margins back to VDD variation [1][2][5]. Additionally, adaptive voltage scaling, coupled with logic performance monitoring, has been shown to improve power efficiency in TI's mobile processors [9]. The advantage of using frequency scaling is depicted in Fig. 2. Our design supports even higher energy efficiency by applying adaptive voltage scaling techniques to the unregulated DC-DC output of each core and by using low-power injection-locked ring oscillators for local clock generation.

II. SUPPLY AND TIMING MARGIN TRACKING

A. Voltage-to-Frequency Converter Design Considerations

Resonant peaking caused by the interaction of supply parasitics with variable current demand produces VDD fluctuations, as seen in Fig. 3. These fluctuations significantly shorten or lengthen critical path delays relative to their nominal values. Such supply variation can be measured and compensated for in either the voltage or the time domain.

Fig. 3. VDD transients due to supply parasitics and variable current load [4].

Previously, Intel implemented an on-die droop detector relying on a voltage comparator to continuously sense power supply noise [4]. Detecting high-frequency voltage droops quickly and with fine resolution would place stringent requirements on the ADC, adding to design complexity. Additionally, without some feedback mechanism, frequency-adaptive supply tracking alone cannot guarantee that there will be no timing violations.

Another feedback-less technique for active frequency adaptation is implemented in Intel's Nehalem architecture [7]. In this scheme, the noisy digital VDD is used to generate the supply for the PLL's VCO. This enables the core clock frequency to adapt to first-order supply transients, but with only coarse control, as the relationship between supply voltage and optimal frequency is not well defined. Further, there is overhead from the need to synchronize across different core clock domains (e.g., via a FIFO buffer), an issue that must be addressed in our implementation as well.

Lastly, a gear ratio technique is used in the PowerPC 970+, which allows frequency adaptation by selecting among four frequencies generated by a PLL driving a divide-by-2^n counter [6]. The benefits of this technique are minimal because of its low frequency resolution, which leads to step error and long waiting periods between transitions. This wastes power and suggests the need for a better solution.

B. Critical Path Monitoring Schemes for Adaptive Power Management

Rather than relying on voltage sampling or feedback-less techniques, real-time critical path monitoring appears to be an optimal mechanism for adaptive frequency and voltage tuning. In addition to generally occupying little area, it does not create performance overhead, and it provides timing margin measurements every clock cycle for a speedy response to supply fluctuations. This helps to actively minimize guardband for power reduction without timing errors, avoiding the costly error detection/rollback circuitry used in common techniques like Razor [2].

A critical path monitor is employed in the voltage-to-frequency feedback loop of Intel's Montecito CPU. This CPM performs cycle-by-cycle phase comparisons to track the timing margin, and thus the correlated supply variation, associated with worst-case delay lines over a supply range of 0.8V to 1.2V [5]. The feedback is used to adjust the frequency synthesizer output in 1.5% steps. This scheme has a fast 1.5-cycle average response due to its high bandwidth, allowing for higher operating frequencies.
However, complicated state machine logic and a large phase selection multiplexer (64:1) are needed to generate clock edges from the PLL output.

IBM's POWER7 processors use a similar CPM scheme for active management of timing guardband, reducing power by nearly 25% [2]. This technique supports significant undervolting (nearly 150mV below nominal) during low-activity periods, and it is also able to protect cores from timing failures, even during large supply transients and high-activity periods. To verify timing margins, pulses are launched each cycle down worst-case delay paths and captured in the 12-bit edge detector shown in Fig. 4. The worst-case penetration corresponds to the timing margin. The CPM output is used to control the oscillator in a fractional-N DPLL, where the synthesized frequency is increased or decreased in the presence of positive or negative margin, respectively. This allows for frequency adaptation in under 10 cycles. A slower frequency response is generated through the normal DPLL feedback loop, which is capable of supporting 50MHz/ms frequency changes. Additionally, a performance controller continuously monitors and adjusts the supply voltage to achieve a designated long-term average frequency.

Fig. 4. Edge detector for the CPM in [2].

Previous CPM implementations illustrate how timing feedback can be successfully used to minimize guardband for power savings while guaranteeing timing. However, these implementations have been logically quite complex. We explore a lighter and faster scheme, achievable through direct clock pulse generation from critical path edge detection, as detailed below.

III. PROPOSED SOLUTION

We implemented a low-power power management scheme that allows for dynamic timing and voltage guardband tuning on a per-core basis.
This solution has been designed to complement the RAVEN architecture, which attempts to convert the wasted voltage headroom associated with the periodic supply ripple at the output of an unregulated switched-capacitor DC-DC converter into usable energy. The faster the clock synthesis scheme adopted and modified from [10] can detect and compensate for voltage transients, the more we can reduce margins to increase efficiency and performance. This design expands on the error prevention feedback techniques used in [2][5][10] to handle supply transients, substituting an ILRO for bulky clock generation units to achieve low area, low power, and fast response, with the added benefit of frequency multiplication.

Fig. 5. System block diagram.

Our implementation incorporates the blocks shown in Fig. 5 and described below:

1. ILRO with background frequency acquisition, which consumes much less power than conventional clock generation blocks (e.g., PLL, DLL, PI) and supports frequency multiplication
2. Combined CPM edge detector and core clock synthesizer (synchronous to the ILRO outputs) that tracks the timing margin
3. Verilog-AMS model of the block that measures the average frequency offset between the global clock and a core clock and controls the core's supply so that the average performance is at a desired level
4. DC-DC converter model that emulates the switched-capacitor behavior of an unregulated on-chip power supply output
5. Critical path replica that mimics the characteristics (both dynamic and static) of the critical path in real processors

IV. INJECTION-LOCKED RING OSCILLATOR

A. Benefits of Injection-Locked Oscillators over PLLs and DLLs

Table I shows various clock generation schemes for individual cores. Most of the previous solutions rely on PLL- or DLL-based clock generators, which require quiet supplies and consume significant power. Substantial energy and area savings can be obtained by using injection-locked oscillators as local clock receivers in frequency synthesis blocks. An ILRO acts like a first-order PLL, with unconditional stability and a fast settling response [3]. Its high bandwidth helps to reject VCO noise very effectively and provides high supply rejection. Also, unlike DLL-based clock generation, an ILRO can multiply its output frequency, reducing the power overhead of the global clock distribution network. In other words, by routing a medium-frequency global clock on highly capacitive core-to-core interconnects and then multiplying up to obtain core clocks, we can minimize power loss. For this design, we generated 2GHz ILRO clocks from a 1GHz reference.

TABLE I
VARIOUS SCHEMES FOR LOCAL CLOCK GENERATION

Ref. [2]: Prescaler + DPLL
  Advantages: input jitter filtering; no biasing circuits*; frequency multiplication
  Disadvantages: bulky PLL; slow response; poor noise rejection**
Ref. [5]: DLL + PI + fractional divider
  Advantages: fast response; precise resolution
  Disadvantages: analog DLL; bulky interpolators
Ref. [6]: Integer divider
  Advantages: fast response; simple circuitry
  Disadvantages: low resolution
Ref. [11]: DLL + clock synthesizer
  Advantages: very fast response
  Disadvantages: analog DLL; metastabilities
This work: ILRO + clock synthesizer
  Advantages: very fast response; no biasing circuits; frequency multiplication
  Disadvantages: metastabilities

* Assumes the DCO is a true-digital version
** Prescaler degrades DCO noise rejection performance

Fig. 6. Phase distortion diagram of injection locking.
Fig. 7. Timing diagram of background tracking.
Fig. 8. Conceptual diagram of proposed ILRO.

B. ILRO with Background Frequency Acquisition and Alternate Injection

The potential drawback of using an ILRO rather than a DLL/PLL is that the free-running frequency of the ring oscillator cannot be controlled without an external feedback mechanism, resulting in asymmetry between the output clock phases, as shown in Fig. 6. Several approaches have been proposed to suppress this mismatch. In [8], a cascaded ILRO topology was used to generate evenly spaced phases by distributing phase error throughout the second ring oscillator via a multi-stage symmetrical injection. To achieve finer resolution than that attainable with a cascaded ILRO, feedback control, such as the replica feedback control scheme described in [12], is necessary. We extend the idea of feedback control and propose a simple scheme, called background frequency acquisition with alternate injection, to control the ILRO without using a replica oscillator. The validity of this alternate injection scheme can be verified by Fig. 7.
If the injected clock has a 50% duty cycle, as assumed in this configuration, the ring oscillator input is aligned to the injected clock's rising edge and the phase offset is extracted at its falling edge. The free-running frequency of the ring oscillator can then be adjusted using the accumulated phase offset information, as in a PLL. A conceptual diagram of the ILRO, composed of a digitally tunable oscillator and a control block, is depicted in Fig. 8. Sixteen MPCLK phases are generated using 8 differential delay cells running off a clean analog supply. While high-frequency jitter of the ring oscillator is suppressed by injection, the offset between the injected clock frequency and the free-running frequency of the oscillator is captured by the digital tracking loop. One of the key advantages of this scheme is that a simple D flip-flop can capture the free-running frequency offset through phase mismatch detection, because there is no frequency offset between the injected clock and the locked oscillator clock.

Conventional injection fails to maintain phase uniformity across temperature, which affects the free-running frequency of the oscillator. However, as seen in Fig. 9, the phase relationship of the ILRO with external feedback is maintained across temperature. This scheme improves the locking range of the ILRO and its reliability across PVT variations.
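The background frequency acquisition described above can be illustrated with a small behavioral sketch. It is not the circuit or the Verilog-AMS model from this project: the function name, the 5MHz tuning step, and the reduction of the D flip-flop decision to a comparison of frequencies are all assumptions made for illustration. Only the 1GHz injected clock, the 2x multiplication, and the idea of accumulating a 1-bit phase decision come from the text.

```python
def background_acquisition(f_free_hz, f_inj_hz=1_000_000_000, mult=2,
                           step_hz=5_000_000, n_cycles=200):
    """Bang-bang tuning of the oscillator's free-running frequency (integer Hz).

    Hypothetical behavioral model: step_hz and n_cycles are illustrative values.
    """
    target_hz = mult * f_inj_hz  # frequency at which no phase offset accumulates
    trace = []
    for _ in range(n_cycles):
        # Rising edge of the injected clock: the ring oscillator phase is realigned.
        # Falling edge (half an injected period later, 50% duty cycle): a D flip-flop
        # samples whether the oscillator edge has already arrived. In this simplified
        # model the edge is early exactly when the free-running frequency is too high.
        oscillator_is_early = f_free_hz > target_hz
        # Accumulate the 1-bit decision into the digital frequency control word.
        f_free_hz += -step_hz if oscillator_is_early else step_hz
        trace.append(f_free_hz)
    return trace


if __name__ == "__main__":
    trace = background_acquisition(1_800_000_000)
    print("final free-running frequency (Hz):", trace[-1])
```

In this sketch the loop walks the free-running frequency toward 2GHz and then dithers by one step around it, which is the usual behavior of a 1-bit (bang-bang) tracking loop.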

Fig. 9. Phase mismatch simulation result.

V. CRITICAL PATH MONITORING AND CLOCK GENERATION

Our critical path monitoring/clock generation scheme is lighter than previous implementations because we combine the edge detector and clock pulse generation, as shown in Fig. 10. A low clock pulse, synchronous to the ILRO output, is generated upon falling edge detection at the critical path output, enabling fast frequency adaptation to supply transients. For example, Fig. 11 shows the clock generator output in response to DC-DC ripple. Because we do not rely on edge selection for clock synthesis, and because we do not use the edge detector output as an additional control signal (i.e., it is not used to control a PLL DCO), this scheme requires neither complicated control logic nor a large multiplexer.

Because the core clock pulse is synchronous to the ILRO output, the resolution of the frequency adaptation is determined by the number of ILRO phases. To achieve the maximum resolution, edge detection off the 16 ILRO outputs can be performed in parallel, which heavily loads the output of the delay line. Alternatively, the ILRO outputs can be combined to produce a higher frequency clock for serial edge detection (where fewer flip-flops are needed), reducing the delay line load at the expense of higher switching power.

The implemented scheme uses a combination of serial and parallel edge detection. The 16 phases of the 2GHz ILRO clock are edge-combined for a multiplication factor of 8x (to 16GHz), and two edge detectors operate in parallel on opposite phases of the 16GHz clock. Edge detection is implemented via cascaded inverting D flip-flops clocked at the same phase. The outputs of the edge detectors operating in parallel are ORed to pull the synthesized core clock down temporarily upon falling edge detection. The self-resetting dynamic logic then generates a delayed reset that raises the core clock.
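As a rough numerical illustration of the resolution argument above, the sketch below quantizes a critical path delay up to the next detection instant. It assumes that detection instants are spaced by one ILRO period divided by the number of phases (500ps / 16 = 31.25ps for the 2GHz, 16-phase case in the text), and that any logic overhead simply adds to the path delay before quantization. The function names and the example delays are hypothetical.

```python
import math


def detection_resolution_ps(f_ilro_ghz=2.0, n_phases=16):
    """Spacing between edge-detection instants: one ILRO period / phase count."""
    return 1000.0 / f_ilro_ghz / n_phases


def synthesized_period_ps(path_delay_ps, logic_overhead_ps=0.0,
                          f_ilro_ghz=2.0, n_phases=16):
    """Core clock period if the falling edge is caught at the next detection instant.

    Assumption: the overhead of the detection/generation logic adds directly
    to the replica delay before the result is rounded up to the resolution grid.
    """
    step = detection_resolution_ps(f_ilro_ghz, n_phases)
    return math.ceil((path_delay_ps + logic_overhead_ps) / step) * step


if __name__ == "__main__":
    for delay in (440.0, 480.0, 500.0):  # illustrative replica delays in ps
        period = synthesized_period_ps(delay)
        print(f"replica delay {delay:.1f} ps -> core period {period:.2f} ps, "
              f"excess margin {period - delay:.2f} ps")
```

The excess margin printed by the loop is the quantization part of the unwanted guardband; fewer phases (a coarser grid) would make it larger.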
The pulse width of the core clock is set by the reset inverter chain delay. This timing scheme is depicted in Fig. 12. The maximum operating frequency of the gates used in this design limits the functionality of the critical path monitoring/clock generation scheme. To reduce delay at the relatively high operating frequency, TSPC flip-flops are used. Ideally, the falling edge input to the critical path replica would launch immediately after a falling edge is detected at the replica output. However, the minimum logic delay between edge detection and edge generation leads to excess timing guardband, resulting in additional power consumption over the ideal case.

Fig. 10. Critical path monitor and clock generation block diagram.
Fig. 11. Clock generator output waveform.
Fig. 12. Core clock synthesizer timing waveforms.
Fig. 13. Edge combiner operation.

Edge combining is performed as illustrated in Fig. 13. To achieve an 8x multiplication factor, a sum-of-products logical function of the MPCLK phases was implemented (the phase indices are not recoverable from the source text):

    MPCLK[?] · MPCLK[?] + MPCLK[?] · MPCLK[?]

Various combinations of the 16 ILRO outputs can be taken to generate different phases and 2^K multiplication factors (K < 4).
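One common way to build such an edge combiner is sketched below. It is an assumed construction, not necessarily the function used in this design: each product term ANDs one phase with the complement of a later phase to carve out a short pulse, and the pulses are ORed together. Time is discretized into 16 slots per ILRO period, one slot per phase spacing, and all names are hypothetical.

```python
N_PHASES = 16  # number of ILRO output phases (from the text)


def phase(k, slot):
    """Level of phase k during time slot `slot` (slot width = T/16, 50% duty)."""
    return ((slot - k) % N_PHASES) < N_PHASES // 2


def edge_combine(slot, factor, offset=0):
    """OR of (phase_i AND NOT phase_{i+width}) terms giving `factor` pulses per period.

    Assumed sum-of-products form; `offset` rotates the phase selection, so
    offset=width yields the opposite phase of the combined clock.
    """
    width = N_PHASES // (2 * factor)  # pulse width in slots for 50% duty
    return any(
        phase((i + offset) % N_PHASES, slot)
        and not phase((i + offset + width) % N_PHASES, slot)
        for i in range(0, N_PHASES, 2 * width)
    )


def rising_edges_per_period(factor, offset=0):
    """Count rising edges of the combined clock over one ILRO period."""
    wave = [edge_combine(s, factor, offset) for s in range(N_PHASES)]
    # wave[s - 1] wraps around for s = 0, which matches the periodic waveform.
    return sum(1 for s in range(N_PHASES) if wave[s] and not wave[s - 1])


if __name__ == "__main__":
    for k in range(4):  # multiplication factors 2^K with K < 4
        print(f"factor {2 ** k}: {rising_edges_per_period(2 ** k)} rising edges per period")
```

With `factor=8` the combined waveform toggles every slot, corresponding to the 16GHz clock derived from the 2GHz phases, and `offset=1` produces its complement, which is one way to obtain the two opposite phases used by the parallel edge detectors.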

Again, due to the speed of the logic gates, edge sharpness is degraded at high frequencies, and 8x multiplication of a 2GHz clock is close to the upper limit.

VI. PERFORMANCE MONITORING

To maintain the desired data throughput, the average clock frequency should be measured and controlled to remain at 2GHz. Table II shows possible performance tracking schemes. We chose to implement a counter-based scheme because it has the simplest structure and its low bandwidth is acceptable for our purpose. An AMS model of the performance monitoring control system was implemented for functional verification, as illustrated in Fig. 14. The performance monitor measures the frequency offset between the core and reference clocks by counting the number of core clock rising edges over M reference clock cycles and comparing this count to M (the size of the accumulation window). The measured value is then used to generate a control signal that scales the voltage at the output of the power management block: either the minimum voltage threshold of the switched-capacitor DC-DC supply or the voltage at the output of a low-dropout regulator.

Fig. 15 shows the performance monitor's tracking capability. An AMS model of the switched-capacitor DC-DC converter is used to generate a rippled VDD waveform whose minimum voltage is determined by the control signal (VTH). The core clock frequency tracks the DC-DC voltage ripple. As can be seen, after some time, the DC-DC output settles on a VTH that achieves an average 2GHz clock frequency.

TABLE II
POSSIBLE PERFORMANCE MONITORING AND TRACKING SCHEMES

Utilizing phase information from clock generator
  Advantages: no phase detector
  Disadvantages: frequency offset between reference and ILRO should be absolutely zero
Counter-based frequency lock logic
  Advantages: no phase detector
  Disadvantages: slow tracking speed
PLL-like phase tracking logic
  Advantages: faster than counter-based
  Disadvantages: TDC or BB PFD needed

Fig. 14. Performance monitoring block diagram.
Fig. 15. Performance tracking diagram.

VII. IMPLEMENTATION RESULTS

By adapting the supply voltage to achieve a desired performance with minimized guardband, we expected to see an improvement in the power efficiency of each core. To verify the benefits of our design, we compared the power per core with performance tracking to the nominal case (without DFVS) under multiple supply types. Several parameters (including total capacitance and amount of voltage ripple) were set to model real situations. The results are depicted in Fig. 16.

Fig. 16. Power consumption with various system configurations.

As expected, the power consumption of our system does not increase even for significant voltage droops. This is a major advantage of our system over systems without DFVS. While achieving the desired operating frequency, our clock generation scheme, combined with the performance tracking logic, effectively cancels the effect of the voltage droop and reduces the voltage guardband for improved power.

The per-core power consumption is higher than calculated. This is the result of finite edge tracking resolution and the intrinsic additive delay of the logic within the clock generator. These two factors lengthen the clock periods, resulting in unwanted timing margin that is converted to additional voltage headroom. Nevertheless, the power consumption shown justifies the use of our performance tracking scheme for systems operating under rippled supplies or needing to tolerate supply droops larger than 80mV. Additionally, as seen in Fig. 17, our control system is capable of adapting the core clock to large voltage droops across a broad range of frequencies. Owing to the fast response of the clock generation scheme, recovery from high-frequency supply transients occurs in less than one clock cycle.

Fig. 17. Clock generator output waveform.

VIII. CONCLUSIONS

To increase the power efficiency and performance of multicore processors with rippled voltage supplies (in particular, RAVEN), we have looked into various guardband reduction schemes (from IBM and Intel) that rely on per-core dynamic frequency and voltage scaling in conjunction with CPM feedback for added stability and better error prevention. These techniques serve as a basis for our design. However, we used a multi-phase ILRO with background frequency acquisition in place of traditional clock generation circuitry (PLLs, DLLs, etc.) for its power and area benefits and its simplicity of frequency multiplication. Additionally, we combined the critical path edge detection functionality with clock pulse generation to provide a low-complexity solution for fast frequency adaptation to voltage transients. A combination of parallel and serial edge detection, using 16 phases of the 2GHz ILRO output, demonstrates design flexibility in obtaining sufficient accuracy for frequency adaptation. Tradeoffs between increased critical path output load (associated with parallel edge detection) and higher clock frequency (from serial edge detection) can be further explored. In particular, an edge combination scheme was used to realize the frequency multiplication (8x to 16GHz) needed for serial edge detection. Additionally, an external performance monitoring loop was implemented to control the supply to achieve an average throughput/clock frequency of 2GHz. Our implementation demonstrated fast adaptation to voltage droops, with frequency recovery occurring in under one clock cycle.
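For reference, the counter-based performance monitoring loop of Section VI can be sketched behaviorally as follows. This is not the AMS model used in the project. The linear voltage-to-frequency model, the 5mV control step, the window size of 256, and the use of a 2GHz comparison clock are all assumptions chosen so the example is easy to follow, and the DC-DC ripple is ignored.

```python
def core_freq_ghz(vth_mv):
    """Hypothetical linear supply-to-frequency model: 2 GHz at 900 mV, 0 at 400 mV."""
    return 4.0 * (vth_mv - 400) / 1000.0


def track(vth_mv, f_ref_ghz=2.0, window_m=256, step_mv=5, n_windows=100):
    """Counter-based frequency lock: adjust VTH once per accumulation window.

    All parameter values are illustrative assumptions, not design values.
    """
    trace = []
    for _ in range(n_windows):
        # Count core clock rising edges seen during M reference clock cycles.
        core_edges = int(window_m * core_freq_ghz(vth_mv) / f_ref_ghz)
        # Compare the count to M and nudge the supply threshold accordingly.
        if core_edges < window_m:
            vth_mv += step_mv  # core too slow on average: raise the supply
        elif core_edges > window_m:
            vth_mv -= step_mv  # core too fast on average: lower the supply
        trace.append(vth_mv)
    return trace


if __name__ == "__main__":
    trace = track(700)
    print("settled VTH (mV):", trace[-1])
```

The update rate of one step per M reference cycles makes the low bandwidth of the counter-based scheme explicit: a larger window averages the ripple better but tracks more slowly.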
Additionally, we showed that by reducing voltage guardband, the DFVS performance tracking provides sufficient power savings for designs operating under rippled supplies or needing to tolerate >80mV supply droops. However, the power consumption is higher than calculated because the finite tracking resolution and the logic delays within the clock generator create excess timing guardband. Further optimizations may be explored to minimize this excess guardband.

REFERENCES

[1] M. Floyd et al., "Introducing the Adaptive Energy Management Features of the POWER7 Chip," in IEEE Micro, 2011.
[2] C. R. Lefurgy et al., "Active Management of Timing Guardband to Save Energy in POWER7," in Proceedings of International Symposium on Microarchitecture (MICRO), 2011.
[3] L. Zhang et al., "Injection-Locked Clocking: A Low-Power Clock Distribution Scheme for High-Performance Microprocessors," in IEEE Transactions on VLSI Systems, 2008.
[4] A. Muhtaroglu et al., "On-Die Droop Detector for Analog Sensing of Power Supply Noise," in IEEE J. Solid-State Circuits, 2004.
[5] T. Fischer et al., "A 90-nm Variable Frequency Clock System for a Power-Managed Itanium Architecture Processor," in IEEE J. Solid-State Circuits, 2006.
[6] C. Lichtenau et al., "PowerTune: Advanced Frequency and Power Scaling on 64b PowerPC Microprocessor," in IEEE ISSCC Dig. Tech. Papers, 2004.
[7] N. Kurd et al., "Next generation Intel Core Micro-Architecture (Nehalem) Clocking," in IEEE J. Solid-State Circuits, 2009.
[8] J. Pandey et al., "A Sub-100 µW MICS/ISM Band Transmitter Based on Injection-Locking and Frequency Multiplication," in IEEE J. Solid-State Circuits, 2011.
[9] G. Gammie et al., "SmartReflex Power and Performance Management Technologies for 90 nm, 65 nm, and 45 nm Mobile Application Processors," in Proc. IEEE, 2010.
[10] R. Jevtic et al., "Resilient DVFS for Many Core Processor," BWRC Summer 2012 Retreat.
[11] J. Kwak et al., "Cassia: A Self Adjustable Clock System," BWRC Summer 2013 Retreat.
[12] W. Deng et al., "A 0.022mm^2 970µW dual-loop injection-locked PLL with -243dB FOM using synthesizable all-digital PVT calibration circuits," in IEEE ISSCC Dig. Tech. Papers, pp. 248-249, Feb. 2013.