A Mathematical Solution to Power Optimal Pipeline Design by Utilizing Soft Edge Flip-Flops

Mohammad Ghasemazar, Behnam Amelifard and Massoud Pedram
University of Southern California, Department of Electrical Engineering
Los Angeles, CA 90089, U.S.A.
{ghasemaz,amelifar,pedram}@usc.edu

ABSTRACT
This paper presents a novel technique to minimize the total power consumption of a synchronous linear pipeline circuit by exploiting extra slacks available in some stages of the pipeline. The key idea is to utilize soft-edge flip-flops to enable time borrowing between stages of a linear pipeline in order to provide the timing-critical stages with more time to complete their computations. Time borrowing, in conjunction with keeping the clock frequency unchanged, gives rise to a positive timing slack in each pipeline stage. The slack is subsequently utilized to minimize the circuit power consumption by reducing the supply voltage level. We formulate and solve the problem of optimally selecting the transparency window of the soft-edge flip-flops and choosing the minimum supply voltage level for the pipeline circuit as a quadratic program, thereby minimizing the power consumption of the linear pipeline circuit under a clock frequency constraint. Experimental results demonstrate the efficacy of the problem formulation and solution technique.

Categories and Subject Descriptors
B.8.2 [Performance and Reliability]: Performance Analysis and Design Aids

General Terms
Algorithms, Design.

Keywords
Low-power microprocessor design, Synchronous pipelines, Soft edge flip-flop, Voltage scaling, Quadratic programming.

1. INTRODUCTION
Excessive power dissipation and the resulting temperature rise have become key limiting factors to processor performance and a significant component of its cost. In modern microprocessors, expensive packaging and heat removal solutions are required to achieve acceptable substrate and interconnect temperatures.
Due to their high utilization, pipeline circuits of a high-performance microprocessor are major contributors to the overall power consumption of the processor, and consequently, one of the main sources of heat generation on the chip [1]. Many techniques have been proposed to reduce the power consumption of a microprocessor's pipeline, among which pipeline gating [1], clock gating [2, 3], and voltage scaling [4] have proven to be effective.

This research was sponsored in part by a grant from the National Science Foundation under award number 59564.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ISLPED'08, August 11-13, 2008, Bangalore, India. Copyright 2008 ACM 978-1-60558-109-5/08/08...$5.00.

In this paper we present a technique to address the problem of reducing the power consumption in a synchronous linear pipeline, i.e., one with the following properties: (i) processing stages are linearly connected, (ii) it performs a fixed function, and (iii) stages are separated by flip-flops which are clocked with the same CLK signal. Our technique is based on the idea of utilizing soft-edge flip-flops (SEFF) for slack passing and voltage scaling in the pipeline stages. Soft-edge flip-flops have a small transparency window which allows time borrowing across pipeline stages. They have traditionally been used for minimizing the effect of clock skew on static and dynamic circuits [5, 6]. Recently, the authors of [7] proposed an approach to utilize soft-edge flip-flops in sequential circuits in order to minimize the effect of process variation on the yield. They formulated the problem of statistically aware SEFF assignment, which maximizes the gain in timing yield, as an integer linear program (ILP) and proposed a heuristic algorithm to solve the problem.
We describe a unified methodology for optimally selecting the supply voltage level of a linear pipeline and optimizing the transparency window of the SEFFs so as to achieve the minimum power consumption subject to a total computation time (latency) constraint. We formulate this problem as a quadratic program, which is a convex programming problem, and hence can be solved optimally in polynomial time.

The remainder of this paper is organized as follows. In Section 2 we provide some background on pipeline design and soft-edge flip-flops. Section 3 describes our techniques for reducing the power consumption. Section 4 is dedicated to simulation results and Section 5 concludes the paper.

2. BACKGROUND
2.1 Preliminaries
A simple (synchronous) 2-stage linear pipeline circuit is shown in Figure 1. We call the set of flip-flops that separate consecutive stages of the linear pipeline an FF-set; for example, FF0 to FF2 are the FF-sets. Let us assume for now that the FF-sets used in this design are all hard-edge FFs. To guarantee the correct operation of the pipeline, the following timing constraints should be satisfied in all stages of the pipeline:

$$d_i + t_{s,i} + t_{cq,i-1} \le T_{clk}, \quad 1 \le i \le N \tag{1}$$

$$\delta_i + t_{cq,i-1} \ge t_{h,i}, \quad 1 \le i \le N \tag{2}$$
where $d_i$ and $\delta_i$ are the maximum and minimum delays of the combinational logic in stage $i$, $T_{clk}$ denotes the clock cycle time, $t_{s,i}$ and $t_{h,i}$ are the setup and hold times for the flip-flops in the $i$th FF-set, while $t_{cq,i-1}$ denotes the clock-to-q propagation delay of the flip-flops in the $(i-1)$st FF-set. $N$ denotes the number of pipeline stages.

Figure 1. A simple linear pipeline (combinational blocks C1 and C2 separated by the FF-sets FF0, FF1, and FF2, all clocked by CLK)

Inequality (1) describes the constraint set on the maximum delay of the pipeline stages to prevent setup time violations. It implies that the signal delay from one stage to the next stage should be less than a clock cycle by at least a setup time. The total delay is the sum of the clock-to-q delay of the first stage and the longest path delay of the combinational circuit. Inequality (2) describes the constraint set on the minimum delay of the pipeline stages to prevent a data race hazard. In order not to overwrite the previous data, the new data of a stage must arrive at the next stage only after the hold time of the next stage FF has elapsed. The earliest time that new data can arrive at the next stage is the clock-to-q delay of the first stage plus the shortest path delay of the combinational logic in between the two stages. We have ignored the clock skew in both inequalities. To account for it, the clock skew, $t_{skew}$, must be added to the left side of inequality (1) and subtracted from the left side of inequality (2).

2.2 Soft-Edge Flip-Flop
The key idea in designing a soft-edge flip-flop [5] is to delay the clock of the master latch so as to create a window during which both master and slave latches are ON (cf. Figure 2). This window is called the transparency window of the SEFF and allows slack passing between adjacent pipeline stages separated by SEFFs. The delayed clock is achieved by utilizing an inverter chain and appropriately sizing the inverters in the chain to achieve the desired delay.

Figure 2. Master-slave soft-edge flip-flop

Referring back to Figure 1, for the sake of consistency with the input and output environments and to avoid imposing constraints on the sender or receiver of data for the linear pipeline circuit in question, we require that the first and last FF-sets in the pipeline are composed of hard-edge FFs whereas the intervening FF-sets may be SEFFs. Therefore, in this example, only FF1 can be made a soft-edge FF-set.

In a SEFF, the transparency window size is an important parameter in the timing constraints since it changes the characteristics of the flip-flop. More precisely, the setup time, hold time, and clock-to-q delay of a soft-edge flip-flop are all functions of the transparency window width. By defining these timing parameters as functions of the window size, we can rewrite the timing constraints of a linear pipeline which utilizes SEFFs as

$$d_i \le T_{clk} - t_{s,i}(w_i) - t_{cq,i-1}(w_{i-1}), \quad 1 \le i \le N \tag{3}$$

$$\delta_i \ge t_{h,i}(w_i) - t_{cq,i-1}(w_{i-1}), \quad 1 \le i \le N \tag{4}$$

Inequalities (3) and (4) are the SEFF versions of inequalities (1) and (2). Notice that the setup/hold times and the clock-to-q delay are now dependent on the transparency window size of the SEFFs. Intuitively, it is expected that all three critical times of a SEFF, i.e., the setup time, hold time, and clock-to-q delay, are postponed by the size of the transparency window $w$, because the data has more time to arrive. As a result, the setup time is decreased by $w$ while the hold time and clock-to-q delay are increased by $w$. The reason for the linear dependence of the setup and hold times on $w$ is that the input data may be read a time $w$ after the clock edge. In Section 3, we will show that the optimal window size of a SEFF is equal to the borrowed time in the preceding pipeline stage. In other words, in the optimal linear pipeline design, data arrives at the end of the transparency window of the SEFF, and as a result, the output of the SEFF is valid after a data-to-Q delay with respect to the end of the transparency window, i.e., after $w + t_{dq}$ with respect to the clock edge.
On the other hand, if there is no time borrowing, the output Q becomes valid only a clock-to-Q time, $t_{cq}$, after the clock edge. Based on the above discussion, the setup time, hold time, and clock-to-q delay of a SEFF may be modeled as linear functions of the window size, as follows:

$$t_{s,i}(w_i) = a_1 w_i + a_0, \qquad t_{h,i}(w_i) = b_1 w_i + b_0, \qquad t_{cq,i}(w_i) = c_1 w_i + c_0 \tag{5}$$

where $a_0$ to $c_1$ are technology and design specific coefficients.

Power consumption of a SEFF also changes with $w$. This is due to the fact that increasing the window size is performed by increasing the size or the number of inverters in the delayed clock path. Both methods for altering $w$ result in an increase in the power consumption of the SEFF. Power consumption is a monotonically increasing function of window size, as shown in Figure 3 for the master-slave flip-flops. The discontinuities (jumps) in the curve are due to a change in the number of inverters in the delay path. From this figure, one can conclude that the power dissipation of the SEFF may be approximated as a quadratic function of the transparency window width, i.e.,

$$P_{FF,i} = d_2 w_i^2 + d_1 w_i + d_0 \tag{6}$$

where $d_0$ to $d_2$ are technology and design specific coefficients.

Figure 3. Power consumption of a SEFF as a function of transparency window (x-axis: transparency window (ps); y-axis: power dissipation (uW))
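The macro-models of equation set (5) and equation (6), together with the checks of inequalities (3) and (4), can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names are hypothetical, and the coefficient values are placeholders (unit slopes and 30 ps offsets), not values characterized in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeffModel:
    """Macro-model of one FF-set; times in ps, power in uW (assumed units)."""
    a1: float  # setup-time slope w.r.t. window size (expected negative)
    a0: float  # setup time at w = 0
    b1: float  # hold-time slope (expected positive)
    b0: float  # hold time at w = 0
    c1: float  # clock-to-q slope (expected positive)
    c0: float  # clock-to-q delay at w = 0
    d2: float  # quadratic power coefficient
    d1: float  # linear power coefficient
    d0: float  # power at w = 0

    def t_setup(self, w: float) -> float:
        return self.a1 * w + self.a0

    def t_hold(self, w: float) -> float:
        return self.b1 * w + self.b0

    def t_cq(self, w: float) -> float:
        return self.c1 * w + self.c0

    def power(self, w: float) -> float:
        return self.d2 * w * w + self.d1 * w + self.d0


def stage_slacks(d_max, d_min, t_clk, ff_in, w_in, ff_out, w_out):
    """Return (setup_slack, hold_slack) for one stage; both must be >= 0.

    Setup, inequality (3): d_max <= t_clk - t_s(w_out) - t_cq(w_in)
    Hold,  inequality (4): d_min >= t_h(w_out) - t_cq(w_in)
    """
    setup_slack = t_clk - ff_out.t_setup(w_out) - ff_in.t_cq(w_in) - d_max
    hold_slack = d_min - (ff_out.t_hold(w_out) - ff_in.t_cq(w_in))
    return setup_slack, hold_slack


# Placeholder coefficients (assumption): ideal unit slopes, 30 ps offsets.
IDEAL = SeffModel(a1=-1.0, a0=30.0, b1=1.0, b0=30.0, c1=1.0, c0=30.0,
                  d2=0.001, d1=0.05, d0=10.0)

if __name__ == "__main__":
    # Stage launched by a hard-edge FF-set (w = 0) and captured by a SEFF
    # with a 50 ps window; stage delays and clock period are made up.
    print(stage_slacks(500.0, 60.0, 560.0, IDEAL, 0.0, IDEAL, 50.0))
```

With these placeholder numbers the 50 ps window turns a zero setup slack into 50 ps, while the hold slack shrinks by the same amount, which is the trade-off that the hold constraint (4) captures.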
3. POWER OPTIMAL PIPELINE DESIGN
The key idea for using SEFFs in a pipeline circuit is that some positive slack may be available in one or more stages of the pipeline. Utilizing SEFFs allows passing this slack to more timing-critical stages of the pipeline to provide them with more freedom in power optimization through voltage scaling. As an example, consider the three-stage pipeline circuit of Figure 4 operating at a supply voltage level of $V_{DD}$. The per-stage maximum logic delays are shown in the figure. Let us assume the setup time, hold time, and clock-to-q delay of all (hard-edge) FFs are 30ps each. Assuming fixed and uniform time allocation across the three pipeline stages, from inequality (1), it is easily seen that the minimum clock period is 560ps (the 500ps delay of the first stage plus 30ps of setup time and 30ps of clock-to-q delay). If $T_{clk}$ = 560ps, no slack will be available to the first stage of the pipeline, and consequently, the supply voltage of the pipeline circuit cannot be scaled down in order to reduce the power consumption. However, if FF1 is replaced with a SEFF with a transparency window of 50ps, available slack at the second stage is passed to the first stage, providing the first stage with 50ps of borrowed time. Now, since positive slacks are available in all stages of the pipeline, the circuit can be powered with a smaller supply voltage in order to reduce the power consumption (ideally, $V_{DD}$ may be reduced by approximately 10%, resulting in roughly 19% power saving).

Figure 4. Example of slack passing (three-stage pipeline C1, C2, C3 with FF-sets FF0 to FF3 clocked by CLK; maximum stage delays d1=500ps, d2=400ps, d3=450ps)

3.1 Soft-Edge Flip-Flop Modeling
To optimally select the transparency window of the SEFFs and choose the minimum supply voltage level, we need to accurately account not only for the effect of the transparency window on the setup/hold times and clock-to-q delay, but also for the power consumption of the SEFFs. In Section 2.2 it was shown that for a SEFF, the setup/hold times and clock-to-q delay can be modeled as linear functions of the transparency window size (cf. equation set (5)).
If the supply voltage of the flip-flop can also be adjusted to a new voltage level, $v$, then the coefficients of these linear models become voltage-dependent parameters, i.e.,

$$t_{s,i}(w_i, v) = a_1(v) w_i + a_0(v), \qquad t_{h,i}(w_i, v) = b_1(v) w_i + b_0(v), \qquad t_{cq,i}(w_i, v) = c_1(v) w_i + c_0(v) \tag{7}$$

Figure 5 through Figure 7 show SPICE simulations of the setup time, hold time, and clock-to-q delay as functions of the transparency window size and supply voltage level for the SEFF of Figure 2. From these figures one can see that the equation set (7) is quite accurate. Similarly, an extension of (6) can be used to model the effect of adjusting the supply voltage level, $v$, on the SEFF power consumption as:

$$P_{FF,i} = d_2(v) w_i^2 + d_1(v) w_i + d_0(v) \tag{8}$$

where $d_0(v)$ through $d_2(v)$ are voltage-dependent parameters.

Figure 5. Setup time as a function of the supply voltage level and the transparency window width (x-axis: transparency window (ps); y-axis: setup time (ps); curves for Vdd = 0.9V, 1.0V, 1.1V, 1.2V)

Figure 6. Hold time as a function of the supply voltage level and the transparency window width (x-axis: transparency window (ps); y-axis: hold time (ps); curves for Vdd = 0.9V, 1.0V, 1.1V, 1.2V)

Figure 7. Clock-to-q delay as a function of the supply voltage level and the transparency window width (x-axis: transparency window (ps); y-axis: clock-to-q delay (ps); curves for Vdd = 0.9V, 1.0V, 1.1V, 1.2V)

3.2 Combinational Logic Block Modeling
As a result of voltage scaling, for a fixed clock frequency, the total power consumption of combinational logic changes as follows:

$$P_{Comb,i}(v) = P_{dyn,i} \left(\frac{v}{V_0}\right)^2 + P_{leak,i} \left(\frac{v}{V_0}\right)^3 \tag{9}$$

where $P_{dyn,i}$ and $P_{leak,i}$ are the dynamic and leakage power consumption of the combinational logic at the nominal supply voltage $V_0$, and $P_{Comb,i}(v)$ is the total power consumption of the combinational logic at the new supply voltage level $v$. (The super-linear dependency of leakage power on the supply voltage is due to the combined effect of drain-induced barrier lowering and the off-state leakage equation, $V_{DD} I_{OFF}$. The cubic form of this dependency has been empirically observed from SPICE simulations.)

On the other hand, it is known that when the supply voltage of a combinational logic block is changed, its new delay can be obtained from the alpha-power law [8]; therefore,

$$d_i(v) = d_i(V_0) \, \frac{v}{V_0} \left(\frac{V_0 - V_t}{v - V_t}\right)^{\alpha} \tag{10}$$

$$\delta_i(v) = \delta_i(V_0) \, \frac{v}{V_0} \left(\frac{V_0 - V_t}{v - V_t}\right)^{\alpha} \tag{11}$$

where $\alpha$ is a technology parameter which is around 2 for long channel devices and 1.3 for short channel devices, and $V_t$ denotes the magnitude of the threshold voltage of transistors.

3.3 Delay Elements
From inequality (4) and Figure 6, one can see that increasing the transparency window of the $i$th soft-edge FF-set puts a tighter constraint on the hold time condition for the $i$th stage of the pipeline. Therefore, if needed, delay elements may be utilized in the minimum-delay path(s) to avoid hold time violations. Similar to the delayed clock path, this is achieved by utilizing some inverters and appropriately sizing them in a similar fashion to [9], in order to meet the desired delay lower bound while incurring minimum power loss. The power overhead of a delay element is denoted as $P_{DE}(z, v) = k(v) z$, where $z$ is the desired delay and $k(v)$ is a voltage-dependent parameter.

3.4 Problem Formulation
The problem of power-optimal soft linear pipeline (PSLP) design is defined as finding optimal values of the supply voltage level for the whole design and the transparency windows of the individual soft-edge FF-sets in the design so as to minimize the total power consumption of an N-stage pipeline circuit subject to setup and hold time constraints:

$$
\begin{aligned}
\text{Min.}\quad & P_{total} = \sum_{i=1}^{N} P_{Comb,i}(v) + \sum_{i=1}^{N} P_{FF,i}(w_i, v) + \sum_{i=1}^{N} P_{DE,i}(z_i, v) \\
\text{s.t.}\quad & (I)\;\; d_i(v) \le T_{clk} - t_{s,i}(w_i, v) - t_{cq,i-1}(w_{i-1}, v); \quad 1 \le i \le N \\
& (II)\;\; \delta_i(v) + z_i \ge t_{h,i}(w_i, v) - t_{cq,i-1}(w_{i-1}, v); \quad 1 \le i \le N \\
& (III)\;\; w_{min} \le w_i \le w_{max}; \quad 1 \le i \le N \\
& (IV)\;\; v \in \{V_0, V_1, \ldots, V_m\}
\end{aligned}
\tag{12}
$$

where $P_{Comb,i}$, $P_{FF,i}$, and $P_{DE,i}$ are respectively the power dissipation of the combinational logic, FFs, and delay elements in the $i$th stage of the pipeline. The first and second sets of constraints in (12) are respectively the setup and hold time constraints in the pipeline stages, the third set of constraints imposes an upper bound and a lower bound on the transparency window of the flip-flop ($w_{min}$ and $w_{max} < T_{clk}/2$), and finally, the last constraint in (12) enforces the supply voltage of the pipeline to be from the set of available voltages $\{V_0, V_1, \ldots, V_m\}$, where $V_0$ is the nominal supply voltage and $V_0 > V_1 > \ldots > V_m$.

Note that problem formulation (12) has 2N+1 optimization variables, corresponding to N transparency window sizes, $w_i$, for the N soft-edge FF-sets in the linear pipeline, N delay element values, $z_i$, for the N stages of the pipeline, and one supply voltage setting, $v$. To solve (12) efficiently, we enumerate all possible values for $v$, and for each fixed $v$ we solve a quadratic program (i.e., we minimize a quadratic cost function subject to linear inequality constraints), which can be solved optimally in polynomial time. In the fixed supply voltage PSLP problem formulation, the $P_{Comb,i}$ terms become constants and drop out of the cost function, constraint (IV) disappears, and all other timing and power parameters become dependent only on the $w_i$ and $z_i$ variables. We refer to this version of the problem as PSLP-FV, PSLP with fixed voltage.

Lemma 1: In the optimal solution of the PSLP-FV design problem, the transparency window of the $i$th soft-edge FF-set is exactly equal to the time borrowed by the combinational logic in the $i$th stage of the linear pipeline.
Proof: According to the discussion in Section 2.2 and Figure 3, the power consumption of a SEFF is a monotonically increasing function of the transparency window size while its setup time is a decreasing function of the same.
Now, from condition (I) in the PSLP-FV problem formulation of (12), a minimum decrease in the setup time of the $i$th soft-edge FF-set, $t_{s,i}(w_i, v)$, which meets the long-path constraint in the $i$th stage of the linear pipeline, will produce the minimum increase in the power dissipation of the $i$th soft-edge FF-set, $P_{FF,i}(w_i, v)$. Therefore, the optimal solution is achieved by utilizing the smallest possible transparency window sizes which prevent setup time violations.

Lemma 2: In the optimal solution of the PSLP-FV design problem, the delay element inserted in the $i$th stage of the linear pipeline is exactly equal to the minimum extra time needed to meet the hold time constraint at the $i$th soft-edge FF-set.
Proof: According to the discussion in Section 3.3, the power consumption of a delay element is a monotonically increasing function of the target delay value, while the hold time of a SEFF is an increasing function of its transparency window size. Now, from condition (II) in the PSLP-FV problem formulation, a minimum delay value $z_i$ added to the $i$th stage of the linear pipeline, which meets the short-path constraint for that stage, will produce the minimum increase in the power dissipation of the $i$th stage, $P_{DE,i}(z_i, v)$. Therefore, the optimal solution is achieved by utilizing the smallest possible delay elements which prevent hold time violations.

Theorem 1: The optimal solution to the PSLP design problem is obtained by solving the PSLP-FV design problem once for each distinct voltage level and selecting the voltage level $v^*$ and the corresponding $w_i^*$ and $z_i^*$ values that minimize the total power dissipation.
Proof: This easily follows from the observation that the solution of the PSLP-FV problem produces the optimal $w_i$'s and $z_i$'s for each possible $v$, and we enumerate over all $v$'s to get the global optimum solution in an exhaustive manner.

Finally, we point out that a greedy solution to PSLP-FV, whereby each pipeline stage is allocated a total combinational delay equal to the average combinational delay of all stages and the difference between the actual delay of the stage and the allocated delay is
corrected for by setting the transparency window size of the corresponding soft-edge FFs, cannot meet the long-path constraints in all stages of the pipeline, since the macro-model equations for the setup/hold times and clock-to-q delays of the soft-edge FFs have different slopes with respect to the $w_i$'s.

4. EXPERIMENTAL RESULTS
To solve the mathematical problem developed in this paper, the MOSEK optimization toolbox [10] has been used. To extract the parameters used in the optimization problem, we performed transistor-level simulations on soft-edge flip-flops by using HSPICE [11]. The technology used in this simulation is a 65nm predictive technology model [12], the nominal supply voltage of this technology is 1.2V, and the die temperature is °C. We synthesized a number of linear pipeline circuits which capture the characteristics of a typical pipeline in a modern processor as a set of benchmark circuits. The SIS [13] optimization package was used to synthesize the set of benchmarks. The minimum and maximum delays of each pipeline stage were computed at the maximum allowed supply voltage (1.2V) and at the low and high temperature corners. The minimum clock cycle time for the pipeline (maximum frequency) and the power dissipation of the linear pipeline were subsequently computed. This data defined the baseline for our comparison. Next, PSLP was run on each circuit under the condition that we maintain the clock frequency, while exploiting time borrowing across different stages to enable voltage scaling, and thus, power saving.

The specifications of these benchmarks are shown in Table 1. The first column in this table gives the name of the benchmark, the second column reports the max and min delays of each stage of the pipeline at the nominal voltage, whereas the last column provides the clock frequency.

Table 1. Specification of the benchmarks
Testbench | Stage delays at nominal voltage (ps), as (max, min) per stage | Clock freq.
TB1 | (32,4), (332,5), (38,5), (32,7) | 2.GHz
TB2 | (32,4), (332,5), (38,5), (28,45), (32,7) | 2.GHz
TB3 | (325, 5), (3,55), (29,6) | 2.GHz
TB4 | (275,4), (235,4), (245,6), (275,5), (275,7) | 2.5GHz
TB5 | (3,), (245,4), (245,5), (245,6) | 2.5GHz

Experimental results on these benchmarks are provided in Table 2. The first entry in the table is the name of the benchmark and the second entry shows the percentage power reduction achieved by PSLP (compared to the conventional way of using hard-edge FFs in the pipeline). From this table, one can see that PSLP, which combines time borrowing and voltage scaling to reduce the power consumption, produces circuits with lower power consumption at the same clock frequency. The supply voltage level and soft-edge FF-set transparency window sizes are reported in the last two columns of the table. Notice that for the first entry of the table, the window sizes are such that the first and second stages borrow larger times from their next stages, while the third stage cannot borrow much time; the reason is that since the last stage of the pipeline has a large max delay and ends in a hard-edge FF-set, it can lend very little time to its previous stage.

In another set of experiments, we studied how using SEFFs can improve the performance of a pipeline. In these experiments, the supply voltage of each pipeline was set at the nominal value and PSLP was invoked for different values of $T_{clk}$. A binary search was used to find the minimum $T_{clk}$ for which PSLP has a solution. Table 3 shows that utilizing SEFFs in the FF-sets of pipelines improves the performance by an average of 2.8%.

The area overhead of our technique is very small because it only replaces standard flip-flops with SEFFs when helpful. The circuit structure of the SEFFs is different from that of conventional FFs only in that SEFFs use an additional delay element (e.g., a chain of inverters). The area overhead of this delay element is small compared to the area of the original FF.
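The binary search over $T_{clk}$ used in the performance experiments can be sketched as follows. This is a minimal illustration rather than the implementation used in the paper: `feasible` is a hypothetical stand-in for "PSLP has a solution" that walks the pipeline once and picks the smallest window meeting each setup constraint (in the spirit of Lemma 1). It assumes an idealized SEFF whose setup time decreases and whose clock-to-q delay increases by exactly $w$, it ignores hold constraints and delay elements, and all numbers in the usage example are made up.

```python
def feasible(t_clk, d_max, t_s0, t_cq0, w_max):
    """True if every stage meets its setup constraint at clock period t_clk.

    Idealized model (an assumption): t_s(w) = t_s0 - w, t_cq(w) = t_cq0 + w.
    The first and last FF-sets are hard-edge (w = 0); hold checks are omitted.
    """
    w_prev = 0.0                      # window of the launching FF-set
    last = len(d_max) - 1
    for i, d in enumerate(d_max):
        # Smallest window at the capturing FF-set that satisfies
        # d + (t_s0 - w) + (t_cq0 + w_prev) <= t_clk.
        need = max(0.0, d + t_s0 + t_cq0 + w_prev - t_clk)
        limit = 0.0 if i == last else w_max
        if need > limit:
            return False
        w_prev = need
    return True


def min_clock_period(d_max, t_s0, t_cq0, w_max, tol=1e-3):
    """Binary search for the smallest feasible clock period (within tol)."""
    lo = 0.0                              # infeasible for positive delays
    hi = max(d_max) + t_s0 + t_cq0        # hard-edge period, always feasible
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if feasible(mid, d_max, t_s0, t_cq0, w_max):
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    delays = [500.0, 400.0, 450.0]        # hypothetical stage delays (ps)
    print(min_clock_period(delays, 30.0, 30.0, w_max=0.0))   # hard-edge only
    print(min_clock_period(delays, 30.0, 30.0, w_max=50.0))  # with borrowing
```

The search relies on feasibility being monotone in the clock period: relaxing $T_{clk}$ can only shrink the windows required in the forward pass, so any period above a feasible one is also feasible.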
In addition, compared to the size of the combinational circuit plus the original FF-sets, the area overhead of the added delay elements inside SEFFs is minuscule. Consequently, in the final physical layout of the circuit, PSLP does not introduce any significant additional area. The runtime of our algorithm for all benchmarks is less than one second on a 2.4GHz Pentium-4 PC with 2GB of memory.

Table 2. Power reduction in PSLP compared to regular FF pipeline
TB | Power reduction (%) | Optimum Vdd (V) | Optimum window size (ps)
TB1 | 32. | . | 4, 49, 22
TB2 | 33.8 | . | 4, 49, 46, 2
TB3 | 48. | .95 | 7, 24
TB4 | 6.3 | . | 35, 35, 3
TB5 | 25.4 | .5 | 37, 36

Table 3. PSLP's performance improvement results
TB | Performance improvement (%)
TB1 | 4%
TB2 | 5%
TB3 | 2%
TB4 | 5%
TB5 | %

4.1 A Case Study
In order to demonstrate the efficacy of the proposed technique and provide insight as to how it operates, in this section we provide details of applying our technique for performance/power optimization of a 34-bit pipelined adder. We used the PSLP design technique to determine the best way of pipelining this adder into four stages in order to achieve the maximum performance and also the minimum power dissipation at that performance level. Assuming a ripple carry adder (RCA) structure for the circuit, splitting the 34-bit adder can be done by including a different number of cascaded 1-bit full adders in each stage of the pipeline. For example, a possible configuration is to build three stages of eight 1-bit full adders and one stage of ten 1-bit full adders, resulting in the 10-8-8-8 pipeline configuration. If hard-edge FFs are used in the pipeline, the minimum clock period of the 10-8-8-8 pipelined adder is 475ps under a supply voltage of 1.2V (the delay of a single full adder is 38.5ps and the setup time and clock-to-q delay are 35ps and 5ps, respectively). This delay can be reduced to 45ps by utilizing soft-edge flip-flops. The PSLP design technique can choose the minimum power and the fastest design among all possible configurations. Table 4 compares four pipeline structures for the 34-bit adder operating at the same supply voltage.
In this table, all designs have three stages of eight 1-bit full adders and one stage of ten 1-bit full adders.
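As a rough illustration of why the position of the ten-adder stage only starts to matter once time borrowing is allowed, the sketch below evaluates inequality (1) for every placement with hard-edge FFs. The per-adder delay of 38.5 ps is the figure quoted in the case study; `T_SETUP_PS` and `T_CQ_PS` are placeholder values and the function names are hypothetical, so the absolute periods need not match the numbers reported for the real circuits.

```python
from itertools import permutations

FA_DELAY_PS = 38.5   # single full-adder delay quoted in the case study
T_SETUP_PS = 35.0    # placeholder setup time (assumption)
T_CQ_PS = 50.0       # placeholder clock-to-q delay (assumption)


def hard_edge_min_period(stage_bits):
    """Smallest T_clk satisfying inequality (1) in every stage, hard-edge FFs.

    A ripple-carry stage with b full adders is assumed to have a worst-case
    delay of b * FA_DELAY_PS.
    """
    longest = max(b * FA_DELAY_PS for b in stage_bits)
    return longest + T_SETUP_PS + T_CQ_PS


def placements(stage_bits):
    """All distinct orderings of the given stage widths."""
    return sorted(set(permutations(stage_bits)), reverse=True)


if __name__ == "__main__":
    for cfg in placements((10, 8, 8, 8)):
        print("-".join(map(str, cfg)), hard_edge_min_period(cfg))
```

Under these assumptions every placement yields the same hard-edge period, because only the longest stage enters inequality (1). With soft-edge FF-sets the neighbours of the long stage can lend it time, which breaks this symmetry.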
Placing the 10-bit stage in the pipeline is critical to the performance and power consumption of the circuit. In the 10-8-8-8 configuration a higher clock frequency can be achieved by means of time borrowing between stages, resulting in lower power consumption. The 8-8-8-10 configuration needs a longer clock period, because time borrowing is not possible for the last stage, and therefore it needs more time. Another pipeline configuration is to have two 9-bit ripple carry adders and two 8-bit ripple carry adders. In this case, the performance is only a little worse than that of the 10-8-8-8 configuration. The PSLP design technique finds the optimal window assignment to each inter-stage flip-flop to optimally satisfy the timing constraints for the given clock period.

Table 4. Comparing performance of pipeline configurations
Configuration | Vdd (V) | Min clock period (ps) | Power consumption (mW)
10 8 8 8 | 1.2 | 45 | 6.42
8 10 8 8 | 1.2 | 472 | 6.5
8 8 10 8 | 1.2 | 472 | 6.5
8 8 8 10 | 1.2 | 486 | 6.55
9 9 8 8 | 1.2 | 455 | 6.42
9 8 9 8 | 1.2 | 433 | 6.5

Assuming a clock frequency of 2GHz, we will have a 500ps clock cycle, which creates positive slack in the stages. This slack allows us to scale down the supply voltage. Reducing the voltage level decreases the power consumption by a noticeable amount due to the quadratic dependency of power on voltage. Moreover, by using the flexibility that the SEFFs add to the pipeline, the voltage can be further reduced to save even more power. The PSLP technique searches for the minimum power consumption by changing the operating voltage and finding the optimum window size assignment for that voltage. Table 5 lists the optimum operating voltage and minimum power consumption of four different configurations. For instance, in the case of the 10-8-8-8 adder, PSLP suggests a window of 47ps for the first stage and 42ps for the next two soft-edge stages to meet the 2GHz constraint under a supply voltage of .5 volts.

Table 5. Minimum power consumption of pipeline configurations
Configuration | Optimum Vdd (V) | Power consumption (mW) | Clock frequency
8 8 8 | .5 | 4.9 | 2GHz
8 8 8 | .5 | 5. | 2GHz
9 9 8 8 | .5 | 4.9 | 2GHz
9 8 8 9 | . | 4.9 | 2GHz

5. CONCLUSION
We presented a new technique to minimize the total power consumption of a linear pipeline circuit by utilizing soft-edge flip-flops and choosing the optimal supply voltage level for the pipeline. We formulated the problem as a mathematical program and solved it efficiently. Our experimental results demonstrated that this technique is quite effective in reducing the power consumption of a pipeline circuit under a performance constraint.

A number of extensions to the work presented in this paper are possible. One is to allow different transparency windows for FFs in the same FF-set. The only difference is that in this case the setup and hold time constraints should be satisfied for every I/O conduit of the circuit (see [14] for an exact definition). The maximum number of I/O conduits in any stage of a linear pipeline is the product of the cardinality of its input FF-set and its output FF-set. It is seen that the size of the PSLP design problem for this case still remains manageable. Another extension is to consider the interdependency between setup and hold times. It is known that the independent characterization of the setup time, hold time, and clock-to-q delay of FFs results in pessimistic timing analysis [15]. In our problem definition, considering the interdependency between the setup and hold time provides more freedom in the optimization problem and is expected to improve the quality of the results. Yet another extension is to solve the PSLP design problem for nonlinear pipelines, i.e., pipelines that perform variable functions and have multi-stage feed-forward paths or multi-stage feedback paths [16]. The problem setup in this case will be similar to that of Section 3 but the constraints are more complex. Finally, one may combine our technique with clock skew control and retiming methods [17] to achieve higher power savings.

REFERENCES
[1] S. Manne, A. Klauser, and D. Grunwald, "Pipeline gating: speculation control for energy reduction," Proc. of International Symposium on Computer Architecture, 1998, pp. 132-141.
[2] H. M. Jacobson, "Improved clock-gating through transparent pipelining," Proc. of International Symposium on Low Power Electronics and Design, 2004, pp. 26-31.
[3] H. Jacobson, P. Bose, H. Zhigang, et al., "Stretching the limits of clock-gating efficiency in server-class processors," Proc. of High-Performance Computer Architecture, 2005, pp. 238-242.
[4] D. Ernst, N. Kim, S. Das, et al., "Razor: a low-power pipeline based on circuit-level timing speculation," Proc. of International Symposium on Microarchitecture, 2003, pp. 7-18.
[5] H. Partovi, R. Burd, U. Salim, et al., "Flow-through latch and edge-triggered flip-flop hybrid elements," Proc. of International Solid-State Circuits Conference, 1996, pp. 138-139.
[6] D. Harris and M. A. Horowitz, "Skew-tolerant domino circuits," IEEE Journal of Solid-State Circuits, vol. 32, no. 11, Nov. 1997, pp. 1702-1711.
[7] V. Joshi, D. Blaauw, and D. Sylvester, "Soft-edge flip-flops for improved timing yield: design and optimization," Proc. of International Conference on Computer-Aided Design, 2007, pp. 667-673.
[8] T. Sakurai and A. R. Newton, "Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas," IEEE Journal of Solid-State Circuits, vol. 25, no. 2, Apr. 1990, pp. 584-594.
[9] B. Amelifard, F. Fallah, and M. Pedram, "Low-power fanout optimization using MTCMOS and multi-Vt techniques," Proc. of International Symposium on Low Power Electronics and Design, 2006, pp. 334-337.
[10] MOSEK Optimization Software, http://www.mosek.com [online]
[11] HSPICE: The gold standard for accurate circuit simulation, http://www.synopsys.com/products/mixedsignal/hspice/hspice.html [online]
[12] Predictive Technology Model, http://www.eas.asu.edu/~ptm/
[13] E. M. Sentovich, K. J. Singh, L. Lavagno, et al., "SIS: A System for Sequential Circuit Synthesis," University of California, Berkeley, Report M92/41, May 1992.
[14] C.-S. Hwang and M. Pedram, "PMP: Performance-driven multilevel partitioning by aggregating the preferred signal directions of I/O conduits," Proc. of Asia and South Pacific Design Automation Conference, 2005, pp. 428-431.
[15] E. Salman, A. Dasdan, F. Taraporevala, et al., "Exploiting setup hold-time interdependence in static timing analysis," IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, vol. 26, no. 6, Jun. 2007, pp. 1114-1125.
[16] K. Hwang, Advanced Computer Architecture. New York, NY: McGraw Hill, 1993.
[17] J. Monteiro, S. Devadas, and A. Ghosh, "Retiming sequential circuits for low power," Digest of Technical Papers of the 1993 IEEE International Conference on CAD, Nov. 1993, pp. 398-402.