A Mathematical Solution to Power Optimal Pipeline Design by Utilizing Soft Edge Flip-Flops


Mohammad Ghasemazar, Behnam Amelifard and Massoud Pedram
University of Southern California, Department of Electrical Engineering, Los Angeles, CA 90089 U.S.A.
{ghasemaz,amelifar,pedram}@usc.edu

ABSTRACT
This paper presents a novel technique to minimize the total power consumption of a synchronous linear pipeline circuit by exploiting extra slacks available in some stages of the pipeline. The key idea is to utilize soft-edge flip-flops to enable time borrowing between stages of a linear pipeline in order to provide the timing-critical stages with more time to complete their computations. Time borrowing, in conjunction with keeping the clock frequency unchanged, gives rise to a positive timing slack in each pipeline stage. The slack is subsequently utilized to minimize the circuit power consumption by reducing the supply voltage level. We formulate and solve the problem of optimally selecting the transparency window of the soft-edge flip-flops and choosing the minimum supply voltage level for the pipeline circuit as a quadratic program, thereby minimizing the power consumption of the linear pipeline circuit under a clock frequency constraint. Experimental results prove the efficacy of the problem formulation and solution technique.

Categories and Subject Descriptors
B.8.2 [Performance and Reliability]: Performance Analysis and Design Aids

General Terms
Algorithms, Design.

Keywords
Low-power microprocessor design, Synchronous pipelines, Soft edge flip-flop, Voltage scaling, Quadratic programming.

1. INTRODUCTION
Excessive power dissipation and the resulting temperature rise have become key limiting factors to processor performance and a significant component of its cost. In modern microprocessors, expensive packaging and heat removal solutions are required to achieve acceptable substrate and interconnect temperatures.
Due to their high utilization, pipeline circuits of a high-performance microprocessor are major contributors to the overall power consumption of the processor, and consequently, one of the main sources of heat generation on the chip [1]. Many techniques have been proposed to reduce the power consumption of a microprocessor's pipeline, among which pipeline gating [1], clock gating [2, 3], and voltage scaling [4] have proven to be effective.

In this paper we present a technique to address the problem of reducing the power consumption in a synchronous linear pipeline, i.e., one with the following properties: (i) processing stages are linearly connected, (ii) it performs a fixed function, and (iii) stages are separated by flip-flops which are clocked with the same CLK signal. Our technique is based on the idea of utilizing soft-edge flip-flops (SEFF) for slack passing and voltage scaling in the pipeline stages. Soft-edge flip-flops have a small transparency window which allows time borrowing across pipeline stages. Soft-edge flip-flops have been traditionally used for minimizing the effect of clock skew on static and dynamic circuits [5, 6]. Recently, the authors of [7] proposed an approach to utilize soft-edge flip-flops in sequential circuits in order to minimize the effect of process variation on the yield. They formulated the problem of statistically aware SEFF assignment which maximizes the gain in timing yield as an integer linear program (ILP) and proposed a heuristic algorithm to solve the problem.

This research was sponsored in part by a grant from the National Science Foundation under award number 59564. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ISLPED'08, August 11-13, 2008, Bangalore, India. Copyright 2008 ACM 978-1-60558-109-5/08/08...$5.00.
We describe a unified methodology for optimally selecting the supply voltage level of a linear pipeline and optimizing the transparency window of the SEFFs so as to achieve the minimum power consumption subject to a total computation time (latency) constraint. We formulate this problem as a quadratic program, which is a convex programming problem, and hence can be solved optimally in polynomial time.

The remainder of this paper is organized as follows. In Section 2 we provide some background on pipeline design and soft-edge flip-flops. Section 3 describes our techniques for reducing the power consumption. Section 4 is dedicated to simulation results and Section 5 concludes the paper.

2. BACKGROUND

2.1 Preliminaries
A simple (synchronous) 2-stage linear pipeline circuit is shown in Figure 1. We call the set of flip-flops that separates consecutive stages of the linear pipeline an FF-set; for example, FF0 through FF2 are the FF-sets. Let us assume for now that the FF-sets used in this design are all hard-edge FFs. To guarantee the correct operation of the pipeline, the following timing constraints should be satisfied in all stages of the pipeline:

$$d_i + t_{s,i} + t_{cq,i-1} \le T_{clk}, \quad 1 \le i \le N \tag{1}$$

$$\delta_i + t_{cq,i-1} \ge t_{h,i}, \quad 1 \le i \le N \tag{2}$$

where $d_i$ and $\delta_i$ are the maximum and minimum delays of the combinational logic in stage $i$, $T_{clk}$ denotes the clock cycle time, $t_{s,i}$ and $t_{h,i}$ are the setup and hold times of the flip-flops in the $i$th FF-set, $t_{cq,i-1}$ denotes the clock-to-q propagation delay of the flip-flops in the $(i-1)$st FF-set, and $N$ denotes the number of pipeline stages.

[Figure 1. A simple linear pipeline: FF-sets FF0, FF1, and FF2, clocked by CLK, separate the combinational blocks C1 and C2.]

Equation (1) describes the constraint set on the maximum delay of the pipeline stages to prevent setup time violations. It implies that the signal delay from one stage to the next should be less than a clock cycle by at least a setup time. The total delay is the sum of the clock-to-q delay of the first stage and the longest path delay of the combinational circuit. Equation (2) describes the constraint set on the minimum delay of the pipeline stages to prevent data race hazards. In order not to overwrite the previous data, the new data of a stage must arrive at the next stage only after the hold time of the next-stage FF has elapsed. The earliest time that new data can arrive at the next stage is the clock-to-q delay of the first stage plus the shortest path delay of the combinational logic between the two stages. We have ignored the clock skew in both equations. To account for it, we must add the clock skew, $t_{skew}$, to the left side of inequality (1) and subtract it from the left side of inequality (2).

2.2 Soft-Edge Flip-Flop
The key idea in designing a soft-edge flip-flop [5] is to delay the clock of the master latch so as to create a window during which both the master and slave latches are ON (cf. Figure 2). This window is called the transparency window of the SEFF and allows slack passing between adjacent pipeline stages separated by SEFFs. The delayed clock is obtained by utilizing an inverter chain and appropriately sizing the inverters in the chain to achieve the desired delay.

[Figure 2. Master-slave soft-edge flip-flop.]

Referring back to Figure 1, for the sake of consistency with the input and output environments, and to avoid imposing constraints on the sender or receiver of data for the linear pipeline circuit in question, we require that the first and last FF-sets in the pipeline are composed of hard-edge FFs whereas the intervening FF-sets may be SEFFs. Therefore, in this example, only FF1 can be made a soft-edge FF-set. In a SEFF, the transparency window size is an important parameter in the timing constraints since it changes the characteristics of the flip-flop. More precisely, the setup time, hold time, and clock-to-q delay of a soft-edge flip-flop are all functions of the transparency window width. By defining these timing parameters as functions of the window size, we can rewrite the timing constraints of a linear pipeline which utilizes SEFFs as

$$d_i \le T_{clk} - t_{s,i}(w_i) - t_{cq,i-1}(w_{i-1}), \quad 1 \le i \le N \tag{3}$$

$$\delta_i \ge t_{h,i}(w_i) - t_{cq,i-1}(w_{i-1}), \quad 1 \le i \le N \tag{4}$$

Inequalities (3) and (4) are the SEFF versions of inequalities (1) and (2). Notice that the setup/hold times and the clock-to-q delay are now dependent on the transparency window size of the SEFFs. Intuitively, it is expected that all three critical times of a SEFF, i.e., the setup time, hold time, and clock-to-q delay, are postponed by the size of the transparency window $w$, because the data has more time to arrive. As a result, the setup time is decreased by $w$ while the hold time and clock-to-q delay are increased by $w$. The reason for the linear dependence of the setup and hold times on $w$ is that the input data may be read a time $w$ after the clock edge.

In Section 3, we will show that the optimal window size of a SEFF is equal to the borrowed time in the preceding pipeline stage. In other words, in the optimal linear pipeline design, data arrives at the end of the transparency window of the SEFF, and as a result, the output of the SEFF is valid after a data-to-Q delay with respect to the end of the transparency window, i.e., after $w + t_{dq}$ with respect to the clock edge.
On the other hand, if there is no time borrowing, the output Q becomes valid only a clock-to-Q time, $t_{cq}$, after the clock edge. Based on the above discussion, the setup time, hold time, and clock-to-q delay of a SEFF may be modeled as linear functions of the window size, as follows:

$$t_{s,i}(w_i) = a_1 w_i + a_0, \qquad t_{h,i}(w_i) = b_1 w_i + b_0, \qquad t_{cq,i}(w_i) = c_1 w_i + c_0 \tag{5}$$

where $a_0$ to $c_1$ are technology- and design-specific coefficients.

Power consumption of a SEFF also changes with $w$. This is due to the fact that increasing the window size is performed by increasing the size or the number of inverters in the delayed clock path. Both methods for altering $w$ result in an increase in the power consumption of the SEFF. Power consumption is a monotonically increasing function of the window size, as shown in Figure 3 for the master-slave flip-flops. The discontinuities (jumps) in the curve are due to a change in the number of inverters in the delay path. From this figure, one can conclude that the power dissipation of the SEFF may be approximated as a quadratic function of the transparency window width, i.e.,

$$P_{FF,i} = d_2 w_i^2 + d_1 w_i + d_0 \tag{6}$$

where $d_0$ to $d_2$ are technology- and design-specific coefficients.

[Figure 3. Power consumption of a SEFF as a function of transparency window. Axes: transparency window (ps) vs. power dissipation (uW).]
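The macro-models (5) and (6) and the timing constraints (3) and (4) can be exercised together in a few lines. The following is a minimal Python sketch, not the authors' implementation: the class `SeffModel`, the function `check_timing`, every coefficient value, and the two-stage delays are hypothetical placeholders chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeffModel:
    """SEFF macro-model of eqs. (5) and (6). Times in ps, power in uW.

    Every coefficient below is a hypothetical placeholder, not a value
    extracted from the paper.
    """
    a1: float = -1.0   # setup time shrinks as the window grows
    a0: float = 30.0
    b1: float = 1.0    # hold time grows with the window
    b0: float = 30.0
    c1: float = 1.0    # clock-to-q delay grows with the window
    c0: float = 30.0
    d2: float = 0.01   # quadratic power model coefficients
    d1: float = 0.1
    d0: float = 10.0

    def setup(self, w):
        return self.a1 * w + self.a0

    def hold(self, w):
        return self.b1 * w + self.b0

    def clk_to_q(self, w):
        return self.c1 * w + self.c0

    def power(self, w):
        return self.d2 * w ** 2 + self.d1 * w + self.d0


def check_timing(d_max, d_min, windows, t_clk, ff=SeffModel()):
    """Evaluate constraints (3) and (4) for every stage.

    d_max[k], d_min[k]: longest/shortest logic delay of stage k+1.
    windows: window of FF-sets 0..N (length N+1); 0 models a hard-edge FF-set.
    Returns a list of (setup_ok, hold_ok) tuples, one per stage.
    """
    results = []
    for i, (d, delta) in enumerate(zip(d_max, d_min), start=1):
        w_prev, w_cur = windows[i - 1], windows[i]
        setup_ok = d <= t_clk - ff.setup(w_cur) - ff.clk_to_q(w_prev)
        hold_ok = delta >= ff.hold(w_cur) - ff.clk_to_q(w_prev)
        results.append((setup_ok, hold_ok))
    return results


# Hypothetical two-stage pipeline in the style of Figure 1.
d_max, d_min, t_clk = [300.0, 200.0], [80.0, 80.0], 350.0
hard = check_timing(d_max, d_min, [0.0, 0.0, 0.0], t_clk)
soft = check_timing(d_max, d_min, [0.0, 20.0, 0.0], t_clk)
print(hard)  # [(False, True), (True, True)]: stage 1 misses setup
print(soft)  # [(True, True), (True, True)]: a 20ps window on FF1 fixes it
```

With these placeholder numbers, opening a 20ps window on FF1 lets stage 1 meet its setup constraint at the cost of a higher `SeffModel().power(20.0)` than `power(0.0)`, which is the trade-off the optimization in Section 3 balances.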

3. POWER OPTIMAL PIPELINE DESIGN
The key idea for using SEFFs in a pipeline circuit is that some positive slack may be available in one or more stages of the pipeline. Utilizing SEFFs allows passing this slack to more timing-critical stages of the pipeline to provide them with more freedom in power optimization through voltage scaling. As an example, consider the three-stage pipeline circuit of Figure 4 operating at a supply voltage level of $V_{DD}$. The per-stage maximum logic delays are shown in the figure. Let us assume the setup time, hold time, and clock-to-q delay of all (hard-edge) FFs are 30ps each. Assuming fixed and uniform time allocation across the three pipeline stages, from equation (1), it is easily seen that the minimum clock period is 560ps. If $T_{clk}$ = 560ps, no slack will be available to the first stage of the pipeline, and consequently, the supply voltage of the pipeline circuit cannot be scaled down in order to reduce the power consumption. However, if FF1 is replaced with a SEFF with a transparency window of 50ps, the available slack at the second stage is passed to the first stage, providing the first stage with 50ps of borrowed time. Now, since positive slacks are available in all stages of the pipeline, the circuit can be powered with a smaller supply voltage in order to reduce the power consumption (ideally, $V_{DD}$ may be reduced by approximately 10%, resulting in roughly 19% power saving).

[Figure 4. Example of slack passing: a three-stage pipeline with FF-sets FF0 to FF3, combinational blocks C1 to C3, and maximum stage delays d1 = 500ps, d2 = 400ps, d3 = 450ps.]

3.1 Soft-Edge Flip-Flop Modeling
To optimally select the transparency window of the SEFFs and choose the minimum supply voltage level, we need to accurately account not only for the effect of the transparency window on the setup/hold times and clock-to-q delay, but also for the power consumption of the SEFFs. In Section 2.2 it was shown that for a SEFF, the setup/hold times and clock-to-q delay can be modeled as linear functions of the transparency window size (cf. equation set (5)).
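Returning briefly to the three-stage example of Figure 4, its numbers can be checked with simple arithmetic. The Python sketch below is an illustration under stated simplifications: the helper names `min_clock_period` and `borrow` are hypothetical, the FF setup time and clock-to-q delay are held at 30ps regardless of the window (ignoring the dependence modeled in (5)), delay is assumed inversely proportional to $V_{DD}$, and only dynamic power (proportional to $V_{DD}^2$) is counted.

```python
def min_clock_period(stage_delays_ps, t_setup_ps, t_cq_ps):
    """Smallest T_clk that satisfies constraint (1) in every stage."""
    return max(stage_delays_ps) + t_setup_ps + t_cq_ps


def borrow(stage_delays_ps, stage, window_ps):
    """Let `stage` (0-based) borrow `window_ps` from the stage after it."""
    effective = list(stage_delays_ps)
    effective[stage] -= window_ps       # borrower gets extra time
    effective[stage + 1] += window_ps   # lender gives up the same amount
    return effective


delays = [500, 400, 450]   # d1, d2, d3 of Figure 4, in ps
t_setup = t_cq = 30        # hard-edge FF timing, in ps

t_clk = min_clock_period(delays, t_setup, t_cq)
effective = borrow(delays, stage=0, window_ps=50)
slack = t_clk - min_clock_period(effective, t_setup, t_cq)

# Idealization (an assumption of this sketch): delay ~ 1/V_DD, so the logic
# may slow down by the factor max(delays) / max(effective).
v_scale = max(effective) / max(delays)
power_saving = 1.0 - v_scale ** 2

print(t_clk, effective, slack)   # 560 [450, 450, 450] 50
print(f"V_DD scale = {v_scale:.2f}, power saving = {power_saving:.0%}")
# V_DD scale = 0.90, power saving = 19%
```

Under these idealizations the 50ps window leaves 50ps of slack in every stage, which corresponds to the roughly 10% voltage reduction and 19% power saving quoted in the example.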
If the supply voltage of the flip-flop can also be adjusted to a new voltage level, $v$, then the coefficients of these linear models become voltage-dependent parameters, i.e.,

$$t_{s,i}(w_i, v) = a_1(v)\, w_i + a_0(v), \qquad t_{h,i}(w_i, v) = b_1(v)\, w_i + b_0(v), \qquad t_{cq,i}(w_i, v) = c_1(v)\, w_i + c_0(v) \tag{7}$$

Figure 5 through Figure 7 show SPICE simulations of the setup time, hold time, and clock-to-q delay as functions of the transparency window size and supply voltage level for the SEFF of Figure 2. From these figures one can see that equation set (7) is quite accurate. Similarly, an extension of (6) can be used to model the effect of adjusting the supply voltage level, $v$, on the SEFF power consumption as:

$$P_{FF,i} = d_2(v)\, w_i^2 + d_1(v)\, w_i + d_0(v) \tag{8}$$

where $d_0(v)$ through $d_2(v)$ are voltage-dependent parameters.

[Figure 5. Setup time (ps) as a function of the supply voltage level and the transparency window width (ps); curves for Vdd = 0.9V, 1.0V, 1.1V, and 1.2V.]

[Figure 6. Hold time (ps) as a function of the supply voltage level and the transparency window width (ps); curves for Vdd = 0.9V, 1.0V, 1.1V, and 1.2V.]

[Figure 7. Clock-to-q delay (ps) as a function of the supply voltage level and the transparency window width (ps); curves for Vdd = 0.9V, 1.0V, 1.1V, and 1.2V.]

3.2 Combinational Logic Block Modeling
As a result of voltage scaling, for a fixed clock frequency, the total power consumption of the combinational logic changes according to equation (9). The super-linear dependency of the leakage power on the supply voltage is due to the combined effect of drain-induced barrier lowering and the off-state leakage equation ($V_{DD} \cdot I_{OFF}$); the cubic form of this dependency has been empirically observed from SPICE simulations.

$$P_{Comb,i}(v) = P_{dyn,i}\left(\frac{v}{V_0}\right)^{2} + P_{leak,i}\left(\frac{v}{V_0}\right)^{3} \tag{9}$$

where $P_{dyn,i}$ and $P_{leak,i}$ are the dynamic and leakage power consumption of the combinational logic at the nominal supply voltage $V_0$, and $P_{Comb,i}$ is the total power consumption of the combinational logic at the new supply voltage level $v$. On the other hand, it is known that when the supply voltage of a combinational logic block is changed, its new delay can be obtained from the alpha-power law [8]; therefore,

$$d_i(v) = d_i(V_0)\left(\frac{V_0 - V_t}{v - V_t}\right)^{\alpha} \tag{10}$$

$$\delta_i(v) = \delta_i(V_0)\left(\frac{V_0 - V_t}{v - V_t}\right)^{\alpha} \tag{11}$$

where $\alpha$ is a technology parameter which is around 2 for long channel devices and 1.3 for short channel devices, and $V_t$ denotes the magnitude of the threshold voltage of the transistors.

3.3 Delay Elements
From equation (4) and Figure 6, one can see that increasing the transparency window of the $i$th soft-edge FF-set puts a more rigid constraint on the hold time condition for the $i$th stage of the pipeline. Therefore, if needed, delay elements may be utilized in the minimum-delay path(s) to alleviate hold time constraint violations. Similar to the delayed clock path, this is achieved by utilizing some inverters and appropriately sizing them in a similar fashion to [9], in order to meet the desired delay lower bound while incurring minimum power loss. The power overhead of a delay element is denoted as $P_{DE}(z, v) = k(v) \cdot z$, where $z$ is the desired delay and $k(v)$ is a voltage-dependent parameter.

3.4 Problem Formulation
The problem of power-optimal soft linear pipeline (PSLP) design is defined as finding optimal values of the supply voltage level for the whole design and the transparency windows of the individual soft-edge FF-sets in the design so as to minimize the total power consumption of an $N$-stage pipeline circuit subject to setup and hold time constraints:

$$\begin{aligned}
\text{Min.}\quad & P_{total} = \sum_{i=1}^{N} P_{Comb,i}(v) + \sum_{i=1}^{N} P_{FF,i}(w_i, v) + \sum_{i=1}^{N} P_{DE,i}(z_i, v) \\
\text{s.t.}\quad & \text{(I)}\;\; d_i(v) \le T_{clk} - t_{s,i}(w_i, v) - t_{cq,i-1}(w_{i-1}, v); \quad 1 \le i \le N \\
& \text{(II)}\;\; \delta_i(v) + z_i \ge t_{h,i}(w_i, v) - t_{cq,i-1}(w_{i-1}, v); \quad 1 \le i \le N \\
& \text{(III)}\;\; w_{min} \le w_i \le w_{max}; \quad 1 \le i \le N \\
& \text{(IV)}\;\; v \in \{V_0, V_1, \ldots, V_m\}
\end{aligned} \tag{12}$$

where $P_{Comb,i}$, $P_{FF,i}$, and $P_{DE,i}$ are respectively the power dissipation of the combinational logic, FFs, and delay elements in the $i$th stage of the pipeline. The first and second sets of constraints in (12) are respectively the setup and hold time constraints in the pipeline stages, the third set of constraints imposes a lower bound and an upper bound on the transparency window of the flip-flop ($w_{min}$ and $w_{max}$, with $w_{max} < T_{clk}/2$), and finally, the last constraint in (12) enforces the supply voltage of the pipeline to be from the set of available voltages $\{V_0, V_1, \ldots, V_m\}$, where $V_0$ is the nominal supply voltage and $V_0 > V_1 > \ldots > V_m$.

Note that problem formulation (12) has 2N optimization variables corresponding to the N transparency window sizes, $w_i$, for the N soft-edge FF-sets in the linear pipeline and the N delay element values, $z_i$, for the N stages of the pipeline, plus one supply voltage setting, $v$. To solve (12) efficiently, we enumerate all possible values for $v$, and for each fixed $v$ we solve a quadratic program (i.e., we minimize a quadratic cost function subject to linear inequality constraints), which can be solved optimally in polynomial time. In the fixed supply voltage PSLP problem formulation, the $P_{Comb,i}$ terms drop out of the cost function, constraint (IV) disappears, and all other timing and power parameters become dependent only on the $w$ and $z$ variables. We refer to this version of the problem as PSLP-FV, PSLP with fixed voltage.

Lemma 1: In the optimal solution of the PSLP-FV design problem, the transparency window of the $i$th soft-edge FF-set is exactly equal to the time borrowed by the combinational logic in the $i$th stage of the linear pipeline.

Proof: According to the discussion in Section 2.2 and Figure 3, the power consumption of a SEFF is a monotonically increasing function of the transparency window size while its setup time is a decreasing function of the same.
Now, from condition (I) in the PSLP-FV problem formulation of equation (12), a minimum decrease in the setup time of the $i$th soft-edge FF-set, $t_{s,i}(w_i, v)$, which meets the long-path constraint in the $i$th stage of the linear pipeline, will produce the minimum increase in the power dissipation of the $i$th soft-edge FF-set, $P_{FF,i}(w_i, v)$. Therefore, the optimal solution is achieved by utilizing the smallest possible transparency window sizes which prevent setup time violations.

Lemma 2: In the optimal solution of the PSLP-FV design problem, the delay element inserted in the $i$th stage of the linear pipeline is exactly equal to the minimum extra time needed to meet the hold time constraint at the $i$th soft-edge FF-set.

Proof: According to the discussion in Section 3.3, the power consumption of a delay element is a monotonically increasing function of the target delay value while the hold time of a SEFF is an increasing function of the same. Now, from condition (II) in the PSLP-FV problem formulation, a minimum delay value $z_i$ added to the $i$th stage of the linear pipeline, which meets the short-path constraint for that stage, will produce the minimum increase in the power dissipation of the combinational logic in the $i$th stage, $P_{DE,i}(z_i, v)$. Therefore, the optimal solution is achieved by utilizing the smallest possible delay elements which prevent hold time violations.

Theorem 1: The optimal solution to the PSLP design problem is obtained by solving the PSLP-FV design problem m times, once for each distinct voltage level, and selecting the voltage level $v^*$ and the corresponding $w_i^*$ and $z_i^*$ values that minimize the total power dissipation.

Proof: This easily follows from the observation that the solution of the PSLP-FV problem produces the $w_i$'s and $z_i$'s for each possible $v$, and we enumerate over all $v$'s to get the global optimum solution in an exhaustive manner.

Finally, we point out that a greedy solution to PSLP-FV, whereby each pipeline stage is allocated a total combinational delay equal to the average combinational delay of all stages and the difference between the actual delay of the stage and the allocated delay is corrected for by setting the transparency window size of the corresponding soft-edge FFs, cannot meet the long-path constraints in all stages of the pipeline, since the macro-model equations for the setup/hold times and clock-to-q delays of the soft-edge FFs have different slopes with respect to the $w_i$'s.

4. EXPERIMENTAL RESULTS
To solve the mathematical problem developed in this paper, the MOSEK optimization toolbox [10] has been used. To extract the parameters used in the optimization problem, we performed transistor-level simulations on soft-edge flip-flops by using HSPICE [11]. The technology used in this simulation is a 65nm predictive technology model [12], the nominal supply voltage of this technology is 1.2V, and the die temperature is [...]°C. We synthesized a number of linear pipeline circuits which capture the characteristics of a typical pipeline in a modern processor as a set of benchmark circuits. The SIS [13] optimization package was used to synthesize the set of benchmarks. The minimum and maximum delays of each pipeline stage were computed at the maximum allowed supply voltage (1.2V) and at the low and high temperature corners. The minimum clock cycle time for the pipeline (maximum frequency) and the power dissipation of the linear pipeline were subsequently computed. This data defined the baseline for our comparison. Next, PSLP was run on each circuit under the condition that we maintain the clock frequency, while exploiting time borrowing across different stages to enable voltage scaling, and thus, power saving.

The specifications of these benchmarks are shown in Table 1. The first column in this table gives the name of the benchmark, the second column reports the max and min delays of each stage of the pipeline at the nominal voltage, whereas the last column provides the clock frequency.

Table 1. Specification of the benchmarks
Testbench   Stage delays at nominal voltage (ps), (max, min) per stage   Clock freq.
TB1         (32,4), (332,5), (38,5), (32,7)                              2.GHz
TB2         (32,4), (332,5), (38,5), (28,45), (32,7)                     2.GHz
TB3         (325, 5) (3,55) (29,6)                                       2.GHz
TB4         (275,4), (235,4), (245,6), (275,5), (275,7)                  2.5GHz
TB5         (3,), (245,4), (245,5), (245,6)                              2.5GHz

Experimental results on these benchmarks are provided in Table 2. The first entry in the table is the name of the benchmark and the second entry shows the percentage power reduction achieved by PSLP (compared to the conventional way of using hard-edge FFs in the pipeline). From this table, one can see that PSLP, which combines time borrowing and voltage scaling to reduce the power consumption, produces circuits with much lower power consumption at the same clock frequency. The supply voltage level and soft-edge FF-set transparency window sizes are reported in the last two columns of the table. Notice that for the first entry of the table, the window sizes are such that the first and second stages borrow larger times from their next stages, while the third stage cannot borrow much time; the reason is that since the last stage of the pipeline has a large max delay and ends in a hard-edge FF-set, it can lend very little time to its previous stage.

In another set of experiments, we studied how using SEFFs can improve the performance of a pipeline. In these experiments, the supply voltage of each pipeline was set at the nominal value and PSLP was invoked for different values of $T_{clk}$. A binary search was used to find the minimum $T_{clk}$ for which PSLP has a solution. Table 3 shows that utilizing SEFFs in the FF-sets of pipelines improves the performance by an average of 2.8%.

The area overhead of our technique is very small because it only replaces standard flip-flops with SEFFs when helpful. The circuit structure of the SEFFs is different from that of conventional FFs only in that SEFFs use an additional delay element (e.g., a chain of inverters). The area overhead of this delay element is small compared to the area of the original FF.
In addition, compared to the size of the combinational circuit plus the original FF-sets, the area overhead of the added delay elements inside SEFFs is minuscule. Consequently, in the final physical layout of the circuit, PSLP does not introduce any significant additional area. The runtime of our algorithm for all benchmarks is less than one second on a 2.4GHz Pentium-4 PC with 2GB of memory.

Table 2. Power reduction in PSLP compared to regular FF pipeline
TB    Power reduction (%)   Optimum Vdd (V)   Optimum window size (ps)
TB1   32.                   .                 4, 49, 22
TB2   33.8                  .                 4, 49, 46, 2
TB3   48.                   .95               7, 24
TB4   6.3                   .                 35, 35, 3
TB5   25.4                  .5                37, 36

Table 3. PSLP's performance improvement results
TB    Performance improvement (%)
TB1   4%
TB2   5%
TB3   2%
TB4   5%
TB5   %

4.1 A Case Study
In order to demonstrate the efficacy of the proposed technique and provide insight as to how it operates, in this section we provide details of applying our technique for performance/power optimization of a 34-bit pipelined adder. We used the PSLP design technique to determine the best way of pipelining this adder into four stages in order to achieve the maximum performance and also the minimum power dissipation at that performance level. Assuming a ripple carry adder (RCA) structure for the circuit, splitting the 34-bit adder can be done by including different numbers of cascaded 1-bit full adders in each stage of the pipeline. For example, a possible configuration is to build three stages of eight 1-bit full adders and one stage of ten 1-bit full adders, resulting in the 10-8-8-8 pipeline configuration. If hard-edge FFs are used in the pipeline, the minimum clock period of the 10-8-8-8 pipelined adder is 475ps under a supply voltage of 1.2V (the delay of a single full adder is 38.5ps and the setup time and clock-to-q delay are 35ps and 55ps, respectively). This delay can be reduced to 45ps by utilizing soft-edge flip-flops. The PSLP design technique can choose the minimum power and the fastest design among all possible configurations. Table 4 compares four pipeline structures for the 34-bit adder operating at the same supply voltage.
In ths table, all desgns have three stages of eght -bt full adders, and a stage of ten -bt full adders.
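
As a rough illustration of why the position of the 10-bit stage matters, the following Python sketch compares the four orderings under a deliberately simplified, setup-only timing model. It is not the PSLP formulation: hold constraints, the effect of the window size on setup time and clock-to-q delay, and voltage scaling are all ignored. The constants FA_DELAY_PS and FF_OVERHEAD_PS, the 50 ps window cap, and all function names are illustrative assumptions, not values or code from the experiments reported here, so the printed periods are not expected to match Table 4.

```python
from itertools import permutations

# Illustrative constants only (assumptions, not measured data).
FA_DELAY_PS = 38.5      # assumed delay of one 1-bit full adder
FF_OVERHEAD_PS = 85.0   # assumed setup time + clock-to-q delay per stage


def stage_delays(config):
    """Worst-case delay of each stage, including flip-flop overhead."""
    return [bits * FA_DELAY_PS + FF_OVERHEAD_PS for bits in config]


def hard_edge_period(config):
    """With hard-edge FFs the slowest stage alone sets the clock period."""
    return max(stage_delays(config))


def feasible(config, t_clk, max_window_ps):
    """Setup-only check with soft-edge FFs between stages (simplified)."""
    delays = stage_delays(config)
    late = 0.0  # how far past the clock edge data left the previous FF-set
    for i, delay in enumerate(delays):
        arrival = late + delay
        is_last = i == len(delays) - 1
        # The final FF-set is hard-edge, so it offers no transparency window.
        window = 0.0 if is_last else max_window_ps
        if arrival > t_clk + window:
            return False
        late = max(0.0, arrival - t_clk)  # time borrowed from the next stage
    return True


def min_soft_edge_period(config, max_window_ps, tol=1e-3):
    """Binary search for the smallest feasible clock period (within tol)."""
    lo, hi = 0.0, hard_edge_period(config)  # hi is feasible with no borrowing
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(config, mid, max_window_ps):
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    for config in sorted(set(permutations((10, 8, 8, 8))), reverse=True):
        print(config, hard_edge_period(config),
              round(min_soft_edge_period(config, 50.0), 1))
```

Even in this toy model, every ordering has the same hard-edge period, while the soft-edge period depends on where the long stage sits: a long first stage can spread its excess delay over the stages that follow, whereas a long last stage feeds a hard-edge FF-set and cannot borrow at all.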

Placing the 10-bit stage in the pipeline is critical to the performance and power consumption of the circuit. In the 10-8-8-8 configuration a higher clock frequency can be achieved by means of time borrowing between stages, resulting in lower power consumption. The 8-8-8-10 configuration needs a longer clock period, because time borrowing is not possible for the last stage, and therefore it needs more time. Another pipeline configuration is to have two 9-bit ripple carry adders and two 8-bit ripple carry adders. In this case, the performance is only a little worse than that of the 10-8-8-8 configuration. The PSLP design technique finds the optimal window assignment for each inter-stage flip-flop to satisfy the timing constraints for the given clock period.

Table 4. Comparing performance of pipeline configurations
Configuration    Vdd (V)    Min clock period (ps)    Power consumption (mW)
10-8-8-8         .2         45                       6.42
8-10-8-8         .2         472                      6.5
8-8-10-8         .2         472                      6.5
8-8-8-10         .2         486                      6.55
9-9-8-8          .2         455                      6.42
9-8-9-8          .2         433                      6.5

Assuming a clock frequency of 2GHz, we will have a 500ps clock cycle, which creates positive slack in the stages. This slack allows us to scale down the supply voltage. Reducing the voltage level decreases the power consumption by a noticeable amount due to the quadratic dependency of power on voltage. Moreover, by using the flexibility that the SEFFs add to the pipeline, the voltage can be further reduced to save even more power. The PSLP technique searches for the minimum power consumption by changing the operating voltage and finding the optimum window size assignment for that voltage. Table 5 lists the optimum operating voltage and minimum power consumption of four different configurations. For instance, in the case of the 10-8-8-8 adder, PSLP suggests a window of 47ps for the first stage and 42ps for the next two soft-edge stages to meet the 2GHz constraint under a supply voltage of .5 volts.

Table 5. Minimum power consumption of pipeline configurations
Configuration    Optimum Vdd (V)    Power consumption (mW)    Clock frequency
8 8 8            .5                 4.9                       2GHz
8 8 8            .5                 5.                        2GHz
9-9-8-8          .5                 4.9                       2GHz
9-8-8-9          .                  4.9                       2GHz

5. CONCLUSION

We presented a new technique to minimize the total power consumption of a linear pipeline circuit by utilizing soft-edge flip-flops and choosing the optimal supply voltage level for the pipeline. We formulated the problem as a mathematical program and solved it efficiently. Our experimental results demonstrated that this technique is quite effective in reducing the power consumption of a pipeline circuit under a performance constraint.

A number of extensions to the work presented in this paper are possible. One is to allow different transparency windows for FFs in the same FF-set. The only difference is that in this case the setup and hold time constraints should be satisfied for every I/O conduit of the circuit (see [14] for an exact definition). The maximum number of I/O conduits in any stage of a linear pipeline is the product of the cardinalities of its input FF-set and its output FF-set, so the size of the PSLP design problem for this case still remains manageable. Another extension is to consider the interdependency between setup and hold times. It is known that the independent characterization of setup time, hold time, and clock-to-q delay of FFs results in pessimistic timing analysis [15]. In our problem definition, considering the interdependency between the setup and hold times provides more freedom in the optimization problem and is expected to improve the quality of the results. Yet another extension is to solve the PSLP design problem for nonlinear pipelines, i.e., pipelines that perform variable functions and have multi-stage feed-forward paths or multi-stage feedback paths [16]. The problem setup in this case will be similar to that of Section 3, but the constraints are more complex. Finally, one may combine our technique with clock skew control and retiming methods [17] to achieve higher power savings.

REFERENCES
[1] S. Manne, A. Klauser, and D. Grunwald, "Pipeline gating: speculation control for energy reduction," Proc. of International Symposium on Computer Architecture, 1998, pp. 132-141.
[2] H. M. Jacobson, "Improved clock-gating through transparent pipelining," Proc. of International Symposium on Low Power Electronics and Design, 2004, pp. 26-31.
[3] H. Jacobson, P. Bose, H. Zhigang, et al., "Stretching the limits of clock-gating efficiency in server-class processors," Proc. of High-Performance Computer Architecture, 2005, pp. 238-242.
[4] D. Ernst, N. Kim, S. Das, et al., "Razor: a low-power pipeline based on circuit-level timing speculation," Proc. of International Symposium on Microarchitecture, 2003, pp. 7-18.
[5] H. Partovi, R. Burd, U. Salim, et al., "Flow-through latch and edge-triggered flip-flop hybrid elements," Proc. of International Solid-State Circuits Conference, 1996, pp. 138-139.
[6] D. Harris and M. A. Horowitz, "Skew-tolerant domino circuits," IEEE Journal of Solid-State Circuits, vol. 32, no. 11, Nov. 1997, pp. 1702-1711.
[7] V. Joshi, D. Blaauw, and D. Sylvester, "Soft-edge flip-flops for improved timing yield: design and optimization," Proc. of International Conference on Computer-Aided Design, 2007, pp. 667-673.
[8] T. Sakurai and A. R. Newton, "Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas," IEEE Journal of Solid-State Circuits, vol. 25, no. 2, Apr. 1990, pp. 584-594.
[9] B. Amelifard, F. Fallah, and M. Pedram, "Low-power fanout optimization using MTCMOS and multi-Vt techniques," Proc. of International Symposium on Low Power Electronics and Design, 2006, pp. 334-337.
[10] MOSEK Optimization Software, http://www.mosek.com [online]
[11] HSPICE: The gold standard for accurate circuit simulation, http://www.synopsys.com/products/mixedsignal/hspice/hspice.html [online]
[12] Predictive Technology Model, http://www.eas.asu.edu/~ptm/
[13] E. M. Sentovich, K. J. Singh, L. Lavagno, et al., "SIS: A System for Sequential Circuit Synthesis," University of California, Berkeley, Report M92/41, May 1992.
[14] C.-S. Hwang and M. Pedram, "PMP: Performance-driven multilevel partitioning by aggregating the preferred signal directions of I/O conduits," Proc. of Asia and South Pacific Design Automation Conference, 2005, pp. 428-43.
[15] E. Salman, A. Dasdan, F. Taraporevala, et al., "Exploiting setup-hold-time interdependence in static timing analysis," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 6, Jun. 2007, pp. 1114-1125.
[16] K. Hwang, Advanced Computer Architecture. New York, NY: McGraw Hill, 1993.
[17] J. Monteiro, S. Devadas, and A. Ghosh, "Retiming sequential circuits for low power," in Digest of Technical Papers of the 1993 IEEE International Conference on CAD, Nov. 1993, pp. 398-402.