A Novel On-chip Measurement Hardware for Efficient Speed-Binning

A. Raychowdhury, S. Ghosh, and K. Roy
Department of ECE, Purdue University, IN
{araycho, ghosh3, kaushik}@ecn.purdue.edu

Abstract

With the aggressive scaling of CMOS technology, parametric variation of the transistor threshold voltage causes a significant spread in circuit delay as well as in the leakage spectrum. Consequently, speed binning of high-performance VLSI chips is essential, and it costs a significant amount of test application time. Further, knowledge of the actual delay of the circuit's critical path enables efficient use of typical low-power methodologies, e.g., voltage scaling and adaptive body biasing. In this paper, we propose a novel on-chip, low-overhead and process-tolerant delay measurement circuit that can estimate the critical path delay in a single clock period. This enables efficient on-chip speed binning.

Keywords: Speed binning, delay measurement hardware, process variation.

I. Introduction

Systematic as well as random variations in process parameters pose serious challenges to future high-performance microprocessor design. Variations in channel length and oxide thickness, together with random dopant effects in nano-scaled transistors, result in significant fluctuations in the transistor threshold voltage (V_T). These fluctuations occur from one die to another (inter-die) as well as within a die (intra-die). Consequently, the spread in delay is considerable, and a 30% frequency distribution is typically estimated [1].

This variation in frequency has introduced the concept of frequency binning. On one hand, some chips are faster (low V_T) than nominal and can support higher clock frequencies at the system level; these chips add significantly to the profit margin. On the other hand, some high-V_T chips are slower than nominal but can still be used at lower clock speeds. Efficient speed binning is therefore essential, not only to earn extra profit from the higher-performance chips but also to salvage the slower but non-faulty chips that would be discarded in a go/no-go situation.
Speed binning has thus emerged as an indispensable part of delay fault testing. Traditionally, speed binning is performed by increasing the clock frequency of the critical path section of the circuit (or its replica) until it fails. This process is expensive in terms of test application time and design complexity of the test hardware. In this paper, we propose a novel circuit that speed-bins a high-performance processor in a single clock cycle by directly measuring the delay of its critical path. We have designed a low-overhead, process-tolerant delay measurement hardware (DMH) that detects the bin to which a particular chip belongs.

Conventionally, speed binning as well as adaptive techniques (e.g., body biasing, dynamic voltage scaling) are performed on critical path replicas [2]. Our methodology uses a similar technique; consequently, inserting the DMH does not load the critical path of the circuit. The replica circuit tracks inter-die variations efficiently, and simulation results show that even under high intra-die variations the technique bins the circuit correctly with more than 96% confidence on average. The output of the DMH is a digital word that represents the bin to which the circuit belongs. The novelty lies in the fact that the DMH detects the speed bin in a single clock cycle, thereby saving valuable test application time.

The organization of this paper is as follows. Section II describes the operation of the DMH for efficient speed binning. The individual blocks of the DMH are described in Section III. In Section IV, we demonstrate the need for speed binning due to process variation. The experimental setup for performing speed binning with the proposed DMH is explained in Section V. The effect of parametric variation on speed binning is studied in Section VI. Finally, conclusions are drawn in Section VII.

II. Methodology

Before going into a detailed description of the delay measurement hardware, it is worthwhile to outline the principle of operation of the DMH.
Let us assume that the critical path is a combinational logic block with a state input coming from flip-flop FF1 and its output going to FF2, as shown in Fig. 1. First, we replicate the critical path of the circuit and, instead of a flip-flop, place the DMH at its end. Second, the replicated critical path is sensitized using test patterns applied by the built-in self-test (BIST) logic. The BIST is clocked by the system clock, and the test pattern is launched at T1. Let us assume that node X (the output of the critical path replica) makes a falling transition from logic one to logic zero, and let T_D be the time interval between the clock edge T1 and the falling transition at node X (Fig. 2).

Proceedings of the 11th IEEE International On-Line Testing Symposium (IOLTS'05)
Fig. 1: Speed binning architecture using the proposed DMH (BIST logic, FF1, critical path, FF2, critical path replica with output node X, sawtooth generator, TAH, and a flash ADC built from comparators C1, C2 and C3 with enable V_CE and outputs bin1, bin2 and bin3).

Table 1: Flash ADC outputs and corresponding speed bins
          Fast   Medium   Slow   Slowest
bin1       1       0       0       0
bin2       1       1       0       0
bin3       1       1       1       0

We propose the system illustrated in Fig. 1 for estimating the delay T_D. The sawtooth generator is designed so that the sawtooth waveform is derived from the reference clock itself and has a pulse duration equal to the period T of the reference clock. The output of the sawtooth generator goes into a track-and-hold (TAH) circuit whose sampling switch is controlled by the observation node X. As long as node X is high, the TAH switch is on and the TAH output tracks the sawtooth waveform. When X makes a falling transition, it turns the TAH switch off, and the output capacitor of the TAH holds its value, V_TAH. The greater the delay T_D, the lower V_TAH. The value of V_TAH can therefore be used to estimate the speed bin of the circuit.

Fig. 2: Timing diagram of the DMH (node X, sawtooth, TAH output and CE, with T_D and T marked).

Figure 3: Speed bins and corresponding T_max.

The TAH drives a flash analog-to-digital converter (ADC) consisting of three comparators, C1, C2 and C3. The comparator outputs, i.e., bin1, bin2 and bin3, indicate the speed bin of the circuit. As shown in Fig. 3, we divide the speed range into four bins, namely fast, medium, slow and slowest. Chips belonging to the slowest bin are discarded. Table 1 shows the outputs of the flash ADC corresponding to each speed bin. The three reference voltages V_ref^fast, V_ref^medium and V_ref^slow are the inputs to the flash ADC. Considering a linearly falling sawtooth waveform, the reference voltages can be estimated as

$$V_{ref}^{\alpha} = V_{DD}\left(1 - \frac{T_{\alpha}}{T}\right) + V_{OL}\,\frac{T_{\alpha}}{T} \qquad (1)$$

where T is the clock period and α denotes the bin, i.e., fast, medium or slow. Here, the sawtooth waveform is assumed to swing between V_DD and V_OL, and T_α represents the maximum allowable delay of the α-th bin.
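As a rough numerical illustration of Eq. (1) and of the thermometer code in Table 1, the Python sketch below models the sawtooth as an ideal linear ramp from V_DD to V_OL, computes the three reference voltages, and decodes the comparator word. The supply and ramp-floor values (1.0 V and 0.2 V), the 1 ns mean delay, and all function names are hypothetical. The boundary ratios reuse the N_i values chosen in Section III, and the mapping of bin1, bin2 and bin3 to the fast, medium and slow references is inferred from Table 1 and Fig. 7 rather than stated explicitly.

```python
# Behavioural sketch of Eq. (1) and the Table 1 decode (not a circuit model).

def sawtooth_voltage(t, period, vdd, vol):
    """Ideal linear ramp: vdd at t = 0 and vol at t = period (the form of Eq. (1))."""
    frac = t / period
    return vdd * (1.0 - frac) + vol * frac


def reference_voltages(t_max, period, vdd, vol):
    """V_ref for each bin: the ramp voltage at that bin's maximum delay T_alpha."""
    return {name: sawtooth_voltage(t, period, vdd, vol) for name, t in t_max.items()}


def flash_adc(v_tah, refs):
    """Comparator word (bin1, bin2, bin3); a comparator fires when V_TAH >= its V_ref.

    Assumption: bin1 uses the fast reference, bin2 medium and bin3 slow.
    """
    return tuple(int(v_tah >= refs[name]) for name in ("fast", "medium", "slow"))


# Thermometer code of Table 1.
DECODE = {(1, 1, 1): "fast", (0, 1, 1): "medium", (0, 0, 1): "slow", (0, 0, 0): "slowest"}


def speed_bin(t_d, t_max, period, vdd, vol):
    """Bin of a path whose output X falls at t_d; the TAH holds the ramp value there."""
    v_tah = sawtooth_voltage(t_d, period, vdd, vol)
    return DECODE[flash_adc(v_tah, reference_voltages(t_max, period, vdd, vol))]


if __name__ == "__main__":
    d_m = 1.0e-9                                   # hypothetical mean delay D_m
    t_max = {"fast": 0.75 * d_m, "medium": 1.1 * d_m, "slow": 1.3 * d_m}
    period = 1.6 * d_m                             # test clock period T
    for t_d in (0.5e-9, 1.0e-9, 1.2e-9, 1.5e-9):
        print(t_d, speed_bin(t_d, t_max, period, vdd=1.0, vol=0.2))
```

With these assumed values the references come out at 0.625 V, 0.45 V and 0.35 V, and the four sample delays land in the fast, medium, slow and slowest bins respectively.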
T_α is therefore also the lower delay threshold of the next slower bin. For example, T_medium is the maximum delay a circuit may have and still be placed in the medium frequency bin; it thus represents the boundary between the medium and slow bins. Mathematically,

$$T_{\alpha} = \max\{\text{delay of bin } \alpha\}, \qquad \alpha \in \{\text{fast},\ \text{medium},\ \text{slow}\} \qquad (2)$$

The determination of T_max for a particular bin is illustrated in Fig. 3, where the boundaries between bins are shown with bold lines. The ADC evaluates in the next clock cycle, when the comparator-enable (CE) signal goes high (Fig. 2). For the critical path replica to belong to the α-th bin, we require

$$V_{TAH} \geq V_{ref}^{\alpha} \qquad (3)$$

When (3) holds, the corresponding comparator output goes HIGH (logic 1). Here, we have discussed the case in which X makes a falling transition. Since speed binning is performed in the test cycle, we can decide a priori which input vector excites a high-to-low transition at node X and apply it accordingly. Also note that the proposed speed binning methodology can easily be extended to N bins.

Fig. 4: Schematic diagrams of (a) the sawtooth generator; (b) the track-and-hold (TAH) circuit.

Fig. 5: Schematic diagrams of (a) the comparator; (b) the reference voltage generator.

III. Design of the Individual DMH Blocks

Sawtooth generator: The sawtooth generator is based on the principle of constant-current discharge; its schematic is shown in Fig. 4a. A T flip-flop generates a clock (_2T) whose period is twice the period of the reference system clock. Consider that node OUT in Fig. 4a is precharged to V_DD when _2T is low. When _2T goes high (_2TZ goes low), the NMOS M1 conducts a constant current, which discharges capacitor C linearly as long as M1 is in saturation. During this phase the PMOS M2 remains off, and the output node shows a linearly falling waveform. At the end of the clock period, the _2TZ signal goes high. This creates a low-resistance path across the capacitor through M2 and charges OUT back to V_DD. The gate voltage V_bias of M1 sets the required current in the saturation region. The discharge is linear (ignoring the Early effect) as long as M1 remains in saturation, which requires V_ds ≥ V_bias - V_t. To ensure this, the output node is allowed to discharge only down to V_OL = V_bias - V_t (chosen to be 18mV in this case) within a single clock period.

Track-and-hold network: The track-and-hold network is a complementary pass-transistor switch with a capacitive load (Fig. 4b). The voltage at the observation point X is the input signal S to the TAH. As long as S is high, the switch is on and the capacitor C_HOLD is charged to the sawtooth voltage. The value of this capacitor in our design is 1fF. To discharge the capacitor before the next delay measurement, an NMOS switch triggered by a RESET signal is placed in parallel with C_HOLD.
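The behaviour of the two blocks just described can be approximated in a few lines of Python: a constant current I = C(V_DD - V_OL)/T ramps the output from V_DD to V_OL in exactly one clock period, and the TAH copies the ramp while S is high and freezes it at the first sample where S is low. This is a minimal sketch under stated assumptions (ideal current source with no Early effect, ideal switch, no loading by C_HOLD, and hypothetical values for C, V_DD, V_OL and T); it is not the authors' circuit.

```python
def discharge_current(cap, vdd, vol, period):
    """Constant current that ramps capacitor C from vdd down to vol in one period."""
    return cap * (vdd - vol) / period


def ramp(t, cap, vdd, current):
    """Capacitor voltage under constant-current discharge (M1 assumed saturated)."""
    return vdd - current * t / cap


def in_saturation(v_out, v_bias, v_t):
    """Saturation condition for M1 used in the text: V_ds >= V_bias - V_t."""
    return v_out >= v_bias - v_t


def track_and_hold(samples):
    """samples: iterable of (input_voltage, s_is_high); track while S is high, hold at first low."""
    held = None
    for v_in, s_high in samples:
        if not s_high:
            break          # switch opens: C_HOLD keeps the last tracked value
        held = v_in        # switch closed: output follows the input
    return held


if __name__ == "__main__":
    cap, vdd, vol, period = 10e-15, 1.0, 0.2, 1.6e-9   # hypothetical values
    i_dis = discharge_current(cap, vdd, vol, period)
    dt, t_d = 1e-12, 1.0e-9                              # time step and falling edge of X
    samples = [(ramp(k * dt, cap, vdd, i_dis), k * dt < t_d) for k in range(1601)]
    print(f"I = {i_dis * 1e6:.2f} uA, V_TAH = {track_and_hold(samples):.3f} V")
```

Choosing the current from C, the voltage swing and T, rather than fixing it independently, mirrors the text's requirement that the ramp reach V_OL exactly at the end of one clock period; the calibration step discussed later plays the same role against process shifts.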
The RESET pulse is generated after the comparison between V_ref and V_TAH has been made. It is also worth mentioning that the sampling switch is made complementary to avoid charge injection and clock feed-through [4].

Flash ADC: The flash ADC is the fastest ADC architecture; it compares the input with a set of reference voltages in parallel. In the proposed design, we use a 3-bit ADC comprising three comparators. Each comparator is a latch-based sense amplifier, as illustrated in Fig. 5a. The signal at node TAH_OUT is the comparator input. After the track-and-hold phase, CE goes high and the comparator output is noted in the next clock cycle. The three comparator outputs form the ADC output word, which represents the speed bin of the circuit as given in Table 1. The reference voltages of the flash ADC are calculated using (1) and assuming

$$T_{fast} = N_1 D_m, \quad T_{medium} = N_2 D_m, \quad T_{slow} = N_3 D_m, \quad T = N_4 D_m \qquad (4)$$

where D_m is the mean delay of the critical path of the circuit and the N_i determine the timing boundaries between bins. In our simulations, we have chosen reasonable values of N_i, namely N_1 = 0.75, N_2 = 1.1, N_3 = 1.3 and N_4 = 1.6. In other words, N_1 = 0.75 means that all chips whose maximum critical path delay is less than 75% of the nominal design are called fast chips. Similarly, all chips whose maximum delay lies between 75% (N_1) and 110% (N_2) of the nominal delay are called medium-frequency chips. Finally, if a chip has a delay of more than 1.3 times the nominal delay (N_3 D_m), we discard it as faulty. Note that in the test phase the test clock period is 60% longer than the nominal clock period (N_4 = 1.6). This ensures that chips that are non-faulty but slower than the nominal design are properly binned and can be used.

Generation of reference voltages: In the proposed DMH, the stable reference voltages (V_ref) are designed based on a band-gap reference, as illustrated in Fig. 5b. The principle of operation of the sub-1-V bandgap reference circuit is described in [5]; it has been suitably modified for use in the proposed DMH. The op-amp equalizes the voltages at nodes A and B. Hence all the PMOS transistors at the top have the same V_gs and mirror the same current:

$$I_1 = I_2 = I_3 = I_{ref1} = I_{ref2} = \cdots \qquad (5)$$

Let V_d be the voltage across diode D_1; it has a negative temperature coefficient (NTC). dV is the voltage difference between diode D_1 and diode D_2 (whose area is N times that of D_1), and hence dV = V_T ln(N), where V_T is the thermal voltage, which has a positive temperature coefficient (PTC).
Since the op-amp forces V_B = V_A = V_d, the total current I_2 is

$$I_2 = \frac{V_B}{R_1} + \frac{V_B - V_{d2}}{R} = \frac{V_d}{R_1} + \frac{dV}{R} \qquad (6)$$

where V_{d2} is the voltage across D_2. Due to the current mirror, the same current is pumped into each reference voltage generator arm. The output reference voltage of the i-th arm is thus

$$V_{ref}(i) = R_{ref}(i)\left(\frac{V_d}{R_1} + \frac{V_T \ln N}{R}\right) \qquad (7)$$

N is chosen such that the net temperature coefficient is zero. Note that any voltage can be generated by changing the value of R_ref; several different arms are shown in Fig. 5b. Further, the output voltage does not depend on transistor parameters, so even under die-to-die parameter variations the reference generator delivers a stable reference voltage. A bandgap reference usually forms an integral part of mixed-signal and digital circuits. We can reuse the bandgap reference already present in the circuit and add reference generator arms to it to obtain a wide range of temperature-insensitive, stable voltage references, which reduces the area overhead of generating them. If no bandgap reference is present, a single bandgap reference (as in Fig. 5b) can be shared among all reference voltages.

Impact of process variation: As described above, we generate process-variation-tolerant reference voltages using a modified bandgap reference. The other important DMH block that can be affected by process variation is the sawtooth generator. Process variation changes the discharge rate of capacitor C and hence impacts V_OL and the choice of V_ref. To compensate for this, we propose an initial calibration cycle in which capacitor C is trimmed depending on the process corner, thereby ensuring a linear discharge from V_DD to V_OL across all dies. The TAH circuit is a transmission gate with large-sized transistors; hence, process variation cannot considerably impact the functionality or performance of this block. Finally, the comparator in the ADC is differential in nature, so die-to-die variation cannot impact its functionality.
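Eq. (7) also makes the zero-temperature-coefficient condition explicit: differentiating with respect to temperature gives dV_ref/dT = R_ref((dV_d/dT)/R_1 + (k/q) ln N / R), which vanishes when ln N = -(R/R_1)(dV_d/dT)/(k/q). The Python sketch below evaluates Eq. (7) and solves that condition for N. It assumes a diode voltage that falls linearly with temperature (the -2 mV/K slope is a common textbook figure, used here as an assumption) and hypothetical resistor values; it is a first-order model that ignores op-amp offset and the curvature of V_d.

```python
import math

K_OVER_Q = 8.617333262e-5  # Boltzmann constant / electron charge, in V/K


def thermal_voltage(temp_k):
    """V_T = kT/q."""
    return K_OVER_Q * temp_k


def diode_voltage(temp_k, vd0, dvd_dt, t0=300.0):
    """Assumed linear (CTAT) model of the diode voltage V_d around t0."""
    return vd0 + dvd_dt * (temp_k - t0)


def vref(temp_k, r_ref, r, r1, n, vd0, dvd_dt):
    """Eq. (7): V_ref = R_ref * (V_d / R1 + V_T ln(N) / R)."""
    vd = diode_voltage(temp_k, vd0, dvd_dt)
    return r_ref * (vd / r1 + thermal_voltage(temp_k) * math.log(n) / r)


def zero_tc_area_ratio(r, r1, dvd_dt):
    """Diode area ratio N that cancels the NTC of V_d/R1 with the PTC of V_T ln(N)/R."""
    return math.exp(-dvd_dt * r / (r1 * K_OVER_Q))


if __name__ == "__main__":
    r, r1, vd0, dvd_dt = 10e3, 100e3, 0.65, -2e-3      # hypothetical values
    n = zero_tc_area_ratio(r, r1, dvd_dt)
    for r_ref in (20e3, 40e3):                          # one arm per reference voltage
        levels = [vref(t, r_ref, r, r1, n, vd0, dvd_dt) for t in (250.0, 300.0, 350.0)]
        print(f"N = {n:.2f}, R_ref = {r_ref / 1e3:.0f}k:", [f"{v:.4f}" for v in levels])
```

Under these assumptions N comes out near 10, the mirrored current is constant at 12.5 uA over temperature, and each arm's output scales only with its own R_ref, which is the property the text relies on when adding arms for V_ref^fast, V_ref^medium and V_ref^slow.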
Further, latch-based comparators tolerant to within-die variation have been reported in [6], and they were studied extensively in the design of our proposed DMH.

IV. Variations: Necessity of Speed Binning

As mentioned earlier, process variation manifests itself as chip speed variation in nanometer designs [1]. We have studied a number of benchmark circuits, and they show a significant spread in critical path delay. Fig. 6 shows the critical path delay distribution of some ISCAS'89 benchmarks under process variation. All the benchmarks have been modeled in HSPICE using the BPTM [7] 70nm technology node. In our study we have assumed an inter-die V_T variation with σ = 25% and an intra-die variation with σ = 15%. From Fig. 6 we note that, on average, the standard deviation (σ) of the critical path delay is approximately 27% of the nominal delay. This supports the argument that speed binning is absolutely necessary in nanoscaled designs.

Fig. 6: Spread of ISCAS'89 benchmark circuits w.r.t. number of chips due to parametric variations (histograms of number of chips versus critical path delay; panels (a) s838, (b) s1196, (c) s5378, (d) s13207, (e) s15850, (f) s35932).

V. Experimental Setup

To explain the experimental setup, let us consider one of the benchmark circuits, namely s838. First, we extracted its critical path using the Synopsys PrimeTime tool. Next, the test pattern that sensitizes the critical path is obtained using the Synopsys TetraMAX [8] ATPG tool. The entire DMH and benchmark are then simulated in HSPICE at the 70nm technology node. We apply the test pattern obtained from TetraMAX to both the original circuit and the replicated critical path. The SPICE simulation result is depicted in Fig. 7. The bin information output by the DMH is observed at the end of the test cycle and verified for correctness against the exact delay of the original critical path. There is a falling transition at output node X of the replicated critical path. The delay of the critical path falls into the medium speed category, which is confirmed by the DMH outputs, i.e., bin1, bin2 and bin3.

Fig. 7: SPICE simulation of s838 using DMH and bin determination. DMH outputs (bin1, bin2, bin3) = (0, 1, 1) indicate that the circuit falls under the medium speed category.

VI. Effect of Variation

If there were no intra-die variations, the original critical path and the critical path replica would have identical delays and speed binning would be perfect. However, intra-die variations tend to produce delay skew between the actual critical path and the critical path replica. Therefore, bin misprediction can occur if the replicated critical path and the original critical path suffer severely from intra-die variation.

Table 2: Bin prediction using DMH for 1000 runs
Circuit    Correct bin   Mispredicted bin   Correct prediction (%)
s838           962              38                  96.2
s1196          907              93                  90.7
s5378          970              30                  97.0
s13207         985              15                  98.5
s15850         991               9                  99.1
s35932         986              14                  98.6
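The averages quoted in this paper can be recomputed directly from the counts in Table 2. The short Python check below (a convenience for the reader, not part of the DMH) derives each circuit's correct-prediction percentage from its correct and mispredicted counts and takes the unweighted mean over the six benchmarks.

```python
# (correct, mispredicted) bin predictions out of 1000 runs per circuit, from Table 2
TABLE2 = {
    "s838": (962, 38),
    "s1196": (907, 93),
    "s5378": (970, 30),
    "s13207": (985, 15),
    "s15850": (991, 9),
    "s35932": (986, 14),
}


def accuracy_percent(correct, mispredicted):
    """Percentage of runs whose bin was predicted correctly."""
    return 100.0 * correct / (correct + mispredicted)


per_circuit = {name: accuracy_percent(*counts) for name, counts in TABLE2.items()}
mean_accuracy = sum(per_circuit.values()) / len(per_circuit)

if __name__ == "__main__":
    for name, pct in per_circuit.items():
        print(f"{name:>7}: {pct:5.1f}%")
    print(f"mean   : {mean_accuracy:5.2f}%")
```

The unweighted mean is about 96.7%, consistent with the roughly 96% average correct prediction reported in this paper; s1196 is the outlier at 90.7%.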
Table 3: Bin prediction of s838 for 10 process conditions
Simulation #     1  2  3  4  5  6  7  8  9  10
Actual bin       2  1  3  3  1  1  3  2  3  3
Predicted bin    2  1  3  3  1  1  3  1  3  3
(bin = 1 is slowest, bin = 2 is slow, bin = 3 is medium and bin = 4 is fast.)

When the circuit speed is at the boundary of two neighboring bins, the chance of misprediction increases because process fluctuations vary the circuit delay. To study the effect of process fluctuations on bin prediction, we simulated each benchmark circuit for 1000 different process conditions. The simulation results are shown in Table 2. The rate of correct bin prediction under severe inter- and intra-chip variations is approximately 96% on average. Note that the DMH presents two gate-capacitance loads at the end of the critical path replica, instead of the two diffusion-capacitance loads of the flip-flop at the end of the actual critical path. Further, we provide some extra margin while estimating the reference voltages V_ref for bin boundary determination. Hence, our bin prediction is pessimistic owing to the extra loading of the DMH, and faulty chips (in the slowest category) cannot pass through to consumers. Further, misprediction occurs when the chip under consideration is at the boundary of two bins. This is illustrated in Table 3, where we simulated benchmark s838 for 10 different process conditions and compared the correct and predicted bins. The mispredicted chip (simulation 8) was found to be at the boundary of the slowest and slow bins.

VII. Conclusions

In this paper, we have proposed a novel on-chip, low-overhead and process-tolerant delay measurement circuit that can estimate the critical path delay in a single clock period. This enables efficient on-chip speed binning. Simulation results show an average of 96% correct bin prediction even under severe inter- and intra-chip variations.

References:
1. S. Borkar et al., "Parameter variations and impact on circuits and microarchitecture," DAC, pp. 338-342, 2003.
2. J. W. Tschanz et al., "Adaptive body bias for reducing die-to-die and within-die parameter variations on microprocessor frequency and leakage," IEEE JSSC, pp. 1396-1402, 2002.
3. N. Dragone et al., "An adaptive on-chip voltage regulation technique for low-power applications," ISLPED, pp. 20-24, 2000.
4. B. Razavi, Design of Analog CMOS Integrated Circuits, McGraw-Hill, USA, 2000.
5. Banba et al., "A CMOS bandgap reference circuit with sub-1-V operation," IEEE JSSC, pp. 670-674, 1999.
6. Sarpeshkar et al., "Mismatch sensitivity of a simultaneously latched CMOS sense amplifier," IEEE JSSC, pp. 1413-1422, 1991.
7. Berkeley Predictive Technology Models, http://www-device.eecs.berkeley.edu/~ptm/
8. Synopsys Inc., TetraMAX ATPG, www.synopsys.com/products.