Resonant Clock Design for a Power-efficient, High-volume x86-64 Microprocessor


Visvesh Sathe (1), Srikanth Arekapudi (2), Alexander Ishii (3), Charles Ouyang (2), Marios Papaefthymiou (3, 4), Samuel Naffziger (1)

(1) AMD, Fort Collins, CO; (2) AMD, Sunnyvale, CA; (3) Cyclos Semiconductor Inc., Berkeley, CA; (4) The University of Michigan, Ann Arbor

AMD's 4+ GHz x86-64 core code-named Piledriver employs resonant clocking [1,2,3,4] to reduce clock distribution power by up to 24% while maintaining a low clock-skew target. To support testability and robust operation over the wide range of operating frequencies required of a commercial processor, the clock system operates in two modes: direct-drive (cclk) and resonant (rclk). Leveraging favorable factors such as the availability of two thick top-level metals, high operating frequency, clock-load density, and the existing clock-design methodology [5], the rclk mode was designed to enable both reduced average power dissipation and improved peak-power-constrained performance, with minimal area impact. Fabricated in a 32 nm CMOS process, this work represents the first volume production-enabled implementation of resonant clock technology.

Rclk reduces power by recycling charge through LC resonance, which in turn allows the clock driver strength to be reduced for further savings. Figure 1 shows a simplified schematic of the dual-mode clock system. The mode switch MSw is closed in rclk mode and open in cclk mode. The clock driver features a pulse-drive mode that improves efficiency further through duty-cycle control of the pull-up and pull-down switches. TSw is a throttle switch that reduces voltage overshoot when MSw is turned off during frequency changes.

To operate in both modes, the clock driver must support frequency-dependent drive strength and pulse modulation, both of which are implemented efficiently using a split-buffer topology. In rclk mode, drive strength is modulated through drven settings during P-state transitions. Pulse drive enables a finer trade-off between conduction and switching losses in the driver. During pulse drive (plsen = 1), a local delay line delays only the asserting edge of the pull-up/pull-down stage, whereas the respective de-asserting edges are triggered by the non-delayed clock. Thus, the driver output duty cycle is obtained by programming the local delay to modulate the input duty cycle. This method has three advantages: (1) it enables PLL duty-cycle control of the clock to tune performance; (2) it guarantees robust clock slew and amplitude when operating off the V-f curve; and (3) it reduces the susceptibility of rclk skew to process variation in the local delay chains.

Figure 2 shows the Piledriver global clock construction, in which a set of five horizontal folded clock trees (HCK trees) drive a global clock grid [5]. Each HCK tree has up to 25 inductors interleaved with clock drivers and can be programmed independently in test mode. The clock mode and the frequency-dependent clock parameter settings (inductor connection, drive strength, pulse width) are adjusted during power-up and at each P-state transition, during which the clock mode parameters are initialized from a P-state-indexed, fuse-based table. The power management unit accounts for the power reduction achieved in each P-state. These parameters are loaded by a sequencer in the transmit block, which distributes them to the HCK trees through a source-synchronous bus inside the vertical clock tree module. Once received by the HCK trees, the parameters are broadcast to all clock drivers within each HCK tree. To avoid a circular dependence between the global clock and the logic used to program the clock, all programming logic in the HCK trees is clocked by a broadly distributed intermediate stage of the clock tree. Existing clock-gating mechanisms are leveraged to prevent the exposure of timing elements in the CPU to transitional clocks.

Building inductors with a good quality factor, Q, is critical to rclk efficiency and is constrained by several factors in Piledriver. The inductor windings must share metal resources on the top two metal layers with dense power distribution. Moreover, they must accommodate a substantial number of pre-clock distribution nets and global nets that are routed through, as well as under, the inductor. Figure 3 illustrates inductor design under these constraints. At the frequencies of interest, Q is dominated by winding resistance. The inductor was therefore designed using the M11 and M10 thick-metal levels, with cut-aways allowing for maximal use of both metal layers in the presence of routes and power-supply trunks. Inductor placement was directed to ensure that power-supply trunks pass through the middle of the inductor, minimizing the impact of inductive coupling. The power grid under the inductor was designed to be loop-less to alleviate the Q degradation resulting from eddy currents in power-grid loops while maintaining a robust grid.

Figure 4 shows the structures required to support rclk (MSw, inductor, TankCap), which are tiled across the HCK tree. MSw connects the inductor to the clock grid through the Driver-MSw shorting-bar. An LP formulation was used to determine inductor allocation from a palette of five inductors with values in the 0.6 to 1.3 nH range. Skew was further controlled by interleaved driver/inductor placement. For each inductor, MSw size was tuned to trade off reduced switch resistance against the increased switch parasitics that result from larger switches. For efficient rclk operation, a large, low-ESR TankCap is required within a limited allocated area. To that end, a capacitor structure of approximately six times the average clock load was implemented using both metal and gate structures.

Figure 5 shows measured Cac savings (with Cac defined as $C_{ac} = P_{dynamic} / (V^2 f)$) and efficiency numbers, based on power dissipation in the clock drivers and grid, in cclk and rclk modes. A test pattern with high switching activity was used for the clock power measurement. Efficiency increases up to 3.3 GHz and declines more gradually at higher frequencies. The inherent asymmetry of energy efficiency on either side of the resonant frequency is increased by a voltage-dependent Q (from the series-connected MSw) and by a stringent clock slew criterion that requires stronger drive at lower frequencies.

The impact of rclk on skew was minimal: full-chip simulation analysis showed a 1 ps increase in rclk skew compared to cclk. Figure 6 shows cclk and rclk waveforms from a full-chip clock simulation at 1.2 V and 4.25 GHz. Reducing clock driver strength in rclk enables greater Cac savings at the expense of reduced clock slew rates. These reduced slews result in increased cross-over current in the clock receivers. Measurements, however, show a negligible change in efficiency for high-activity workloads compared to idle workloads, which indicates that this effect is small. Reduced slew also causes a push-out in the 50% arrival time of the clock, potentially affecting both gater-enable paths and cross-clock-domain communication. Static timing analysis with degraded slews was run on the core, and the resulting paths were fixed.

Figure 7 shows the microphotograph of the Piledriver core. Over the frequency range 3.0 GHz to 4.4 GHz, the power savings from rclk enable either a frequency increase of about 100 MHz for the same power, or a power reduction of 5-10% for the same frequency.

Acknowledgments: The authors thank Tom Meneghini, Kyle Viau, Manivannan Bhoopathy, Joohee Kim, Jerry Kao, Fred Brauchler, Alan Arakawa, Syed Obaidulla, Kevin Hurd, Vasant Palisetti, and Denny Renfrow for their valuable contributions to this work.

References:
[1] Drake, A.J., et al. Resonant Clocking Using Distributed Parallel Capacitance, JSSC, Sep 2004.
[2] Sathe, V.S., et al. Resonant-Clock Latch-Based Design, JSSC, Apr 2008.
[3] Chan, S.C., et al. A Resonant Global Clock Distribution for the Cell Broadband Engine Processor, JSSC, Jan 2009.
[4] Ishii, A., et al. A Resonant Clock 200MHz ARM926EJ-S Microcontroller, ESSCIRC 2009.
[5] McIntyre, H., et al. Design of the Two-core x86-64 AMD Bulldozer Module in 32 nm SOI CMOS, JSSC, Jan 2012.

Captions:
Figure 1: Simplified model of AMD's Piledriver dual-mode global clock network.
Figure 2: Global-clock organization and distribution. A folded clock tree (VCK tree) and 5 horizontal folded clock trees (HCK trees) achieve a low-skew, core-wide distribution to clock drivers, which drive the global clock grid. The HCK tree is also used to achieve rclk programmability.
Figure 3: Inductor design on the top two metal layers with cut-aways to accommodate power straps and global signal routes. A loop-less custom grid is implemented under the winding.
Figure 4: Repeated section of the HCK tree showing relative placement of final drivers, inductor, MSw, and TankCap. The Inductor-TankCap Shorting-Bus is used to provide a distributed low-resistance TankCap connection to the inductor.
Figure 5: Measured Cac (pF) savings and clock efficiency vs. frequency. Peak efficiency is observed at 3.3 GHz.
Figure 6: Simulated cclk and rclk waveforms at Vdd = 1.2 V, frequency = 4.25 GHz under different drive-strength configurations. rclk_3/8 refers to an rclk mode where the clock drivers are operating at 3/8 drive strength. Lower drive strengths in rclk allow for more Cac savings at the expense of lower slew rates.
Figure 7: Chip microphotograph of the resonant-clocked 32 nm AMD Piledriver core.

Figure 1: Simplified model of AMD's Piledriver dual-mode global clock network.
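As a rough illustration of the LC resonance that the rclk mode in Figure 1 relies on, the sketch below evaluates the natural frequency of an ideal, lossless LC tank, f0 = 1 / (2π√(LC)), and the capacitance that a given inductor would resonate with at a chosen frequency. The two inductance values are the end points of the 0.6 to 1.3 nH palette quoted in the text. The 3.3 GHz target is only the frequency of peak measured efficiency; treating it as the tank's natural frequency is an assumption, and the function names are illustrative. Losses in MSw and the windings are ignored.

```python
import math


def resonant_frequency_hz(inductance_h, capacitance_f):
    """Natural frequency of an ideal (lossless) LC tank."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))


def capacitance_for_resonance_f(inductance_h, target_hz):
    """Capacitance that resonates with inductance_h at target_hz."""
    omega = 2.0 * math.pi * target_hz
    return 1.0 / (omega ** 2 * inductance_h)


if __name__ == "__main__":
    target_hz = 3.3e9  # assumption: taken from the measured efficiency peak
    for l_nh in (0.6, 1.3):  # palette end points quoted in the text
        c_f = capacitance_for_resonance_f(l_nh * 1e-9, target_hz)
        print(f"L = {l_nh} nH resonates with C = {c_f * 1e12:.2f} pF")
```

A larger inductor resonates with a smaller load at the same frequency, which is one reason a palette of inductor values is useful when the clock load varies across the grid.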

[Figure 2 labels: PLL; Transmit; Receive; source-synchronous bus; source-synchronous receiver; HCK tree0 to HCK tree4; 54 clock drivers per HCK tree; inductor; mode switch; Driver-modeSwitch shorting-bar; global clock grid]
Figure 2: Global-clock organization and distribution. A folded clock tree (VCK tree) and 5 horizontal folded clock trees (HCK trees) achieve a low-skew, core-wide distribution to clock drivers, which drive the global clock grid. The HCK tree is also used to achieve rclk programmability.
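The programming flow described in the text (a P-state-indexed table whose entry is distributed to the HCK trees and then broadcast to every driver in each tree) can be pictured with the minimal sketch below. Only the counts of 5 HCK trees and 54 drivers per tree come from the paper and the Figure 2 labels. The field names, the encoding of drive strength in eighths, the P-state names, and the table contents are all hypothetical placeholders, not the actual fuse format.

```python
from dataclasses import dataclass

NUM_HCK_TREES = 5       # from the text
DRIVERS_PER_TREE = 54   # from the Figure 2 label


@dataclass(frozen=True)
class ClockParams:
    resonant: bool         # True = rclk (MSw closed), False = cclk (MSw open)
    drive_eighths: int     # hypothetical encoding: drive strength in 1/8 steps
    pulse_width_code: int  # hypothetical local-delay code; 0 = pulse drive off


# Hypothetical table contents; the real values are fuse-programmed per part.
FUSE_TABLE = {
    "P0": ClockParams(resonant=True, drive_eighths=3, pulse_width_code=2),
    "P1": ClockParams(resonant=True, drive_eighths=4, pulse_width_code=1),
    "P2": ClockParams(resonant=False, drive_eighths=8, pulse_width_code=0),
}


def program_clock(p_state):
    """Look up the entry for p_state and broadcast it to every driver."""
    params = FUSE_TABLE[p_state]
    # One list per HCK tree; every driver in a tree receives the same setting.
    return [[params] * DRIVERS_PER_TREE for _ in range(NUM_HCK_TREES)]


if __name__ == "__main__":
    trees = program_clock("P0")
    print(len(trees), "trees,", len(trees[0]), "drivers each:", trees[0][0])
```

The sketch applies one setting to all trees; the independent per-tree programming that the paper reserves for test mode is not modeled.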

[Figure 3 labels: M11 inductor winding; M10 inductor winding; M10-M11 winding vias; M11 power straps; M10 power and signals; custom grid (M9 and below); 12 um dimension marker]
Figure 3: Inductor design on the top two metal layers with cut-aways to accommodate power straps and global signal routes. A loop-less custom grid is implemented under the winding.
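To make the dependence of Q on series resistance concrete, the sketch below uses the standard series-loss model Q = ωL / R, with the winding resistance and the mode-switch resistance added in series. The paper states that Q is dominated by winding resistance and that the series-connected MSw affects Q, but it gives no resistance values, so the numbers in the example are placeholders chosen only for illustration, and the function name is hypothetical.

```python
import math


def quality_factor(freq_hz, inductance_h, winding_res_ohm, switch_res_ohm=0.0):
    """Q of an inductor with series loss: Q = omega * L / R_series."""
    series_res_ohm = winding_res_ohm + switch_res_ohm
    return 2.0 * math.pi * freq_hz * inductance_h / series_res_ohm


if __name__ == "__main__":
    # Placeholder values (assumptions, not measurements from the paper).
    freq_hz = 3.3e9
    inductance_h = 1.0e-9
    winding_only = quality_factor(freq_hz, inductance_h, winding_res_ohm=2.0)
    with_switch = quality_factor(freq_hz, inductance_h, 2.0, switch_res_ohm=1.0)
    print(f"Q, winding only: {winding_only:.1f}")
    print(f"Q, with mode switch in series: {with_switch:.1f}")
```

In this model any added series resistance lowers Q, which is consistent with the paper's use of both thick-metal layers for the winding and its tuning of the MSw size.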

Figure 4: Repeated section of the HCK tree showing relative placement of final-drivers, inductor, MSw, and TankCap. The Inductor-TankCap Shorting-Bus is used to provide a distributed low-resistance TankCap connection to the inductor.

[Figure 5 plot: x-axis, clock frequency from 2.3 to 4.1 GHz; left y-axis, Cac savings from 0 to 140 pF; right y-axis, efficiency from 0% to 30%; series: Rclk Cac savings at 25C and 60C, Rclk efficiency at 25C and 60C]
Figure 5: Measured Cac (pF) savings and clock efficiency vs. frequency. Peak efficiency is observed at 3.3 GHz.
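The quantities plotted in Figure 5 can be reproduced from power measurements with the sketch below, which applies the paper's definition Cac = P_dynamic / (V^2 f) to a cclk and an rclk measurement taken at the same voltage and frequency. The paper does not spell out how efficiency is computed; the fractional reduction in Cac relative to cclk is used here as an assumption. The power values in the example are hypothetical and the function names are illustrative.

```python
def switched_capacitance_f(dynamic_power_w, vdd_v, freq_hz):
    """Effective switched capacitance: Cac = P_dynamic / (V^2 * f)."""
    return dynamic_power_w / (vdd_v ** 2 * freq_hz)


def rclk_savings(p_cclk_w, p_rclk_w, vdd_v, freq_hz):
    """Return (Cac savings in farads, fractional efficiency) of rclk vs. cclk."""
    cac_cclk = switched_capacitance_f(p_cclk_w, vdd_v, freq_hz)
    cac_rclk = switched_capacitance_f(p_rclk_w, vdd_v, freq_hz)
    savings = cac_cclk - cac_rclk
    # Assumed definition of efficiency: savings relative to the cclk baseline.
    return savings, savings / cac_cclk


if __name__ == "__main__":
    # Hypothetical clock power numbers, for illustration only.
    savings_f, efficiency = rclk_savings(2.0, 1.6, vdd_v=1.2, freq_hz=3.3e9)
    print(f"Cac savings: {savings_f * 1e12:.1f} pF, efficiency: {efficiency:.1%}")
```

Normalizing power by V^2 f removes the direct voltage and frequency dependence of dynamic power, so savings measured at different operating points can be compared on one plot.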

Figure 6: Simulated cclk and rclk waveforms at Vdd = 1.2 V, frequency = 4.25 GHz under different drive-strength configurations. rclk_3/8 refers to an rclk mode where the clock drivers are operating at 3/8 drive strength. Lower drive strengths in rclk allow for more Cac savings at the expense of lower slew rates.

[Figure 7 labels: HCK trees; inductors; VCK tree]
Figure 7: Chip microphotograph of the resonant-clocked 32 nm AMD Piledriver core.