The Design and Verification of A High-Performance Low-Control-Overhead Asynchronous Differential Equation Solver

Kenneth Y. Yun, Member, IEEE, Peter A. Beerel, Member, IEEE, Vida Vakilotojar, Student Member, IEEE, Ayoob E. Dooply, Student Member, IEEE, and Julio Arceo

Abstract

This paper describes the design and verification of a high-performance asynchronous differential equation solver benchmark circuit. The design has low control overhead, which allows its average-case speed (tested at 22°C and 3.3V) to be 48% faster than any comparable synchronous design (designed to operate at 100°C and 3V for the slow process corner). The techniques to reduce completion sensing overhead and hide control overhead at the circuit, architectural, and protocol levels are discussed. In addition, symbolic model checking techniques are described that were used to gain higher confidence in the correctness of the timed distributed control.

I. INTRODUCTION

It is well known that asynchronous circuits can achieve average-case delay while synchronous circuits achieve a fixed delay determined by their clock frequency, which must be preset according to the worst-case conditions. The difference between the average and worst-case conditions stems from both data dependency and environmental factors. In particular, clock frequencies are set assuming the chip is running at 100°C, even though the chip usually operates at a much lower temperature. This difference can lead to a significant average-performance advantage for asynchronous circuits [1]. Unfortunately, most asynchronous circuits have failed to achieve this goal, primarily because the control overhead associated with asynchronous design is typically very high (with the notable exception of zero-overhead self-timed ring techniques, exemplified by Williams' divider [2]).

This paper describes a number of techniques to hide and minimize asynchronous control overhead that require underlying timing assumptions at the circuit, architectural, and protocol levels. These techniques are described in the context of the design of an asynchronous differential equation solver benchmark circuit.
Extensive layout simulations demonstrate that the techniques reduce control overhead to less than 16%. Consequently, when temperature, power supply voltage, and process variations are considered, the design is 48% faster than any comparable synchronous design (designed to operate at 100°C and 3V for the slow process corner). Our design style requires some full-custom design of datapath building blocks; it is thus potentially more difficult to ensure correctness than most traditional asynchronous design methodologies [3], [4], [5]. A brief outline of the design process used is described in Section IV-E.

We consider two sources of control overhead. The first component of control overhead is completion sensing, i.e., dynamically determining when a functional unit has finished computing its result. Our design consists of two dynamic 32-bit ALUs and two dynamic 32-bit array multipliers. For both designs, we describe how timing assumptions and optimizing for the average case can lead to relatively small completion sensing overhead. The second component is sequencing delay, i.e., the delay from the termination of one action (e.g., a multiplication) to the start of a subsequent action (e.g., an addition). This sequencing is implemented by control circuits and usually requires at least two gate delays. The architecture of our DIFFEQ design facilitates hiding this delay by performing useful computation (e.g., multiplexing) in between the two actions. To ensure correctness, the worst-case delay of this computation is made smaller than the control delay. In addition, to reduce control delays, timing assumptions are used in the communication protocol, which is implemented with a set of communicating asynchronous FSMs [6].

K. Y. Yun and A. E. Dooply are with the ECE Dept., University of California, San Diego. P. A. Beerel and V. Vakilotojar are with the EE-Systems Dept., University of Southern California. J. Arceo is with Qualcomm Inc.

The application of various timing assumptions made validating the correctness of the design difficult.
While most timing assumptions could be verified using SPICE, validating the distributed communication protocol was more difficult because it incorporates timing assumptions that depend on a complex interaction of data-dependent delays. To gain more confidence in the design, we used formal verification techniques based on symbolic model checking [7], [8] as a debugging aid. We present our experiences of concurrent verification and design, including the description of the bugs that our verification efforts helped avoid.

The rest of this paper is organized as follows. Section II gives an overview of the differential equation solver chip architecture. Section III describes the adder and multiplier designs. Section IV describes the controller design and validation. Section V provides the simulation and test results and Section VI draws some conclusions.

II. CHIP ARCHITECTURE OVERVIEW

A high-level description of the differential equation solver algorithm is presented below. This is a widely used high-level synthesis benchmark example from [9].

    diffeq {
        read (x, y, u, dx, a);
        while (x < a) {
            xl := x + dx;
            ul := u - 3 * dx * (u * x + y);
            yl := y + u * dx;
            x := xl; u := ul; y := yl;
        }
        write (y);
    }

The algorithm implements the forward Euler method to numerically obtain the values of y that satisfy the differential equation y'' + 3xy' + 3y = 0 as x ranges from x(0) to a with a step size of dx and with initial values of y and y' equal to y(0) and u, respectively. Using operator reduction [9] to convert the multiplication by 3 to a shift-and-add operation and code motion [9]

to pull the shift-and-add operation out of the loop, the data-flow graph depicted in Fig. 1 is obtained.

Fig. 1. Dataflow graph.

IEEE TRANSACTIONS ON VLSI SYSTEMS

There are three threads of operations needed per iteration: computing y and y' (denoted as u in Fig. 1) and incrementing x. Computing y' requires a sequence of four operations (multiply, add, multiply, add) whose input operands are the results of the immediately preceding operations. However, computing y and updating x only require two operations each. Thus the computation of y' is the critical path of the dataflow graph.

The shading of operations in Fig. 1 depicts the binding of operations to two ALUs and two multipliers. We allocated one dynamic ALU and one dynamic multiplier to compute y' (resource sharing). After the multiplier finishes computing u × x1 (x1 is a copy of x), the result is stored in register M1. While the ALU is computing y + M1, the multiplier is precharged for the next operation. After y + M1 is stored in register A, the multiplier starts evaluating 3dx × A and the ALU is precharged. After the multiplication is complete, the result is stored in M1. The ALU then begins evaluating u − M1 and the multiplier is precharged. Thus each functional unit performs the computation in an alternating fashion. We selected this particular binding of resources to operations in order to remove the precharge delays of the functional units from the critical path.

We allocated a second dynamic ALU and a second dynamic multiplier to compute y and x. Here the precharge delay of the ALU could not be hidden because three consecutive ALU operations are required. Consequently, the execution time of these operations is only slightly smaller than the delay required to compute y', although the evaluation delay of the ALU is substantially less than that of the multiplier.

We used a standard MUX-based architecture as depicted in Fig. 2.
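The loop body, with the shift-and-add operation hoisted out of the loop, can be sketched in Python as follows. This is a software illustration only, not the chip's implementation: the function name `diffeq` and the local variable names are hypothetical, the comments map each statement to the unit that the text binds it to, and floating-point arithmetic is an assumption made for readability (the datapath itself is 32 bits wide).

```python
def diffeq(x, y, u, dx, a):
    """Forward Euler iteration for y'' + 3xy' + 3y = 0, where u stands for y'."""
    three_dx = 2 * dx + dx        # shift-and-add, hoisted out of the loop
    while x < a:                  # loop test (second ALU)
        m1 = u * x                # first multiplier: u * x1
        reg_a = y + m1            # first ALU: result kept in register A
        m1 = three_dx * reg_a     # first multiplier: 3dx * A
        u_next = u - m1           # first ALU: new value of u
        m2 = u * dx               # second multiplier: u * dx
        y_next = y + m2           # second ALU: new value of y
        x = x + dx                # second ALU: new value of x
        u, y = u_next, y_next
    return x, y, u
```

The four statements that produce `u_next` form the multiply, add, multiply, add chain on the critical path, while the remaining statements form the shorter thread handled by the second ALU and multiplier.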
In many instances, we reduced the number of controller outputs by carefully scheduling the control signal transitions so that one signal can serve multiple functions. For example, LA is used as a MUX select signal and the opcode signal for ALU1 as well as the load signal for register A. For the registers, we used Svensson-style positive-edge-triggered flip-flops, which were made pseudo-static by adding weak feedback transistors. These flip-flops have less overhead than the often-used special asynchronous latches that require full handshake communications for reading and writing. Thus, structurally, the only difference between our datapath architecture and typical synchronous ones is the presence of the done signals generated from the functional units.

We used an event-driven protocol for control. For example, the external environment initiates the first step of this protocol by raising start (to command both the ALU1 and ALU2 controllers to begin their operations). The ALU1 controller responds by lowering ALU1's precharge signal, which triggers ALU1 to compute 2dx + dx. Similarly, the ALU2 controller responds by triggering ALU2 to compute the loop test (X < a). As soon as each functional unit completes its computation, it raises its done signal. Each ALU controller, upon detecting the done signal from its functional unit, responds by latching the result and sending a communication signal to both multiplier controllers. In addition, each ALU controller prepares for its next operation by setting up the MUX and function select signals and precharging its functional unit. Subsequent multiplication operations begin after the communication signals are asserted by both ALU controllers. The details of the control design are described in Section IV.

III. DATAPATH BUILDING BLOCKS

This section describes the adder and multiplier designs used in the differential equation solver: in particular, it focuses on the techniques used to achieve superior average-case delay over synchronous designs and to minimize the completion sensing overhead.

A. Adder

The 32-bit adder used in our design is self-timed, i.e., the completion of computation is not pre-timed but detected and reported by the circuit itself. Unlike conventional self-timed (ripple carry) adders, however, our design incorporates carry bypass logic to speed up the worst-case computation. Thus our adder design uses both a worst-case delay reduction mechanism (carry bypass) and an average-case delay reduction feature (completion sensing).

A.1 Motivation

A strong justification for self-timed ripple carry adders is found in one of Von Neumann's early papers collected in [10]. It is a well-known fact that the delay of ripple carry addition corresponds to the length of the longest carry chain (carry propagation). Von Neumann proved that for random input statistics the average length of the longest carry chain is log2 n for n-bit addition. Thus, for random inputs, the average delay of 32-bit addition is equal to the worst-case delay of 5-bit addition.

In most applications [11], [12], however, input statistics are not as favorable for self-timed ripple carry adders: i.e., a large fraction of carry chains tends to be long. Arithmetic operations on data (as opposed to address calculations), in particular, tend to generate long carry chains. In order to understand the effects of realistic input statistics, we conducted several experiments using an ARM simulator¹ and standard benchmark traces. In fact, according to ARM benchmarks, the average longest carry chain length for 32-bit data arithmetic operations is 17, as shown in Fig. 3, which is far from the theoretical average longest carry chain length of 5.

¹ We used a simulator for a commonly used RISC processor, ARM-6, from Advanced RISC Machines, in order to obtain the ALU input operand statistics for data arithmetic operations.

However, the worst-case longest carry chain

length can be reduced to considerably less than 17 by employing a simple carry bypass or lookahead feature. For example, carry bypass adders with bypassing done at 4-bit boundaries can achieve the worst-case longest carry chain length of 12, assuming that the time it takes for one stage of 4-bit carry bypass is the same as that for a one-bit carry propagate. Under the same assumption, the average longest carry chain reduces to 7 in the benchmarks. The motivation for our self-timed carry bypass adder design is largely based on this experiment.

Fig. 2. Datapath and control.

Fig. 3. Average longest carry chain lengths for data arithmetic operations from ARM benchmarks (Dhrystone, Espresso, Compiler I, Compiler II), with carry bypass every 4 bits and with no bypass (ripple carry).

In order to detect and report the completion of a computation whose delay varies depending on data operands, a completion sensing circuit is required. However, to make the average delay of a self-timed carry bypass adder circuit as short as possible, the overhead for completion sensing must be minimized. In other words, the difference in time between when the completion is reported and when the last changing sum bit reaches its final value must be minimized. In the benchmark simulation shown in Fig. 3, for the carry bypass case to be faster than the synchronous worst-case, the average completion sensing overhead must be smaller than the delay equivalent to the carry chain length of 5 (worst-case longest carry chain length minus average-case longest carry chain length). The design details of our carry bypass adder are described below.
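The worst-case figure of 12 for bypassing at 4-bit boundaries can be reproduced with a small unit-delay model. The sketch below is an illustration under stated assumptions, not the authors' simulator: every one-bit propagate and every 4-bit bypass stage costs one time unit, generate and kill bits produce their carry-out at time 0, and the function names `carry_arrival_times` and `longest_carry_chain` are hypothetical.

```python
def carry_arrival_times(a, b, n=32, bypass=False, group=4):
    """Return t[0..n], where t[i] is the unit-delay arrival time of carry c_i."""
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    propagate = a ^ b
    t = [0] * (n + 1)                      # c_0 (carry-in) is known at time 0
    for i in range(n):
        if (propagate >> i) & 1:
            t[i + 1] = t[i] + 1            # propagate: wait for c_i, then one stage
        else:
            t[i + 1] = 0                   # generate or kill: produced locally
        if bypass and (i + 1) % group == 0:
            lo = i + 1 - group
            section = ((1 << group) - 1) << lo
            if propagate & section == section:
                # all bits of the section propagate: the bypass path may win
                t[i + 1] = min(t[i + 1], t[lo] + 1)
    return t


def longest_carry_chain(a, b, n=32, bypass=False):
    """Longest carry chain length of a + b under the unit-delay model above."""
    return max(carry_arrival_times(a, b, n, bypass)[1:])
```

For the operand pair 1 and −1, the carry is generated in bit 0, ripples through three bits of the first section, crosses six bypass stages, and ripples through three bits of the last section, which gives 3 + 6 + 3 = 12 in this model against 31 for a plain ripple chain.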
We present a basic precharged Manchester carry chain adder, a dual-rail implementation of the Manchester carry chain adder, the carry bypass circuit, and the completion sensing circuit; in particular, we focus on the technique to minimize the average completion sensing overhead.

A.2 Design

Our 32-bit adder, as shown in Fig. 4, is divided into eight 4-bit sections and a completion sensing circuit. Each section consists of carry-propagate, carry-kill, and carry-generate bit slices (PKG), sum bit slices, a 4-bit Manchester carry chain (MCC), and a carry bypass circuit.

The heart of our carry bypass adder design is the Manchester carry chain. In the ith bit slice of a single-rail static MCC, carry-propagate (p_i), carry-kill (k_i), and carry-generate (g_i) are used to propagate carry-in (c_i) to carry-out (c_{i+1}), reset carry-out, and set carry-out, respectively; the p_i, k_i, and g_i signals are mutually exclusive and defined as:

    p_i = A_i \oplus B_i                          (1)
    k_i = \overline{A_i} \, \overline{B_i}        (2)
    g_i = A_i B_i                                 (3)

where A_i and B_i represent the ith input bits. The sum (S_i) can then be computed as below:

    S_i = p_i \oplus c_i                          (4)

Fig. 4. Self-timed carry bypass adder block diagram.

In order to detect whether the carry-out of each bit slice has been computed, our design uses a dual-rail carry chain defined as follows:

    c_{i+1} = g_i + p_i c_i                       (5)
    \bar{c}_{i+1} = k_i + p_i \bar{c}_i           (6)

where c_i and \bar{c}_i are the true and false rails of the ith carry bit. To detect that all of the carry bits have been computed, the completion sensing circuit implements the following function:

    done = \prod_{i=0}^{31} (c_i + \bar{c}_i)     (7)

The circuit implementation of the dual-rail Manchester carry chain is shown in Fig. 5. Before the computation begins, all the internal nodes (the inverted nodes that drive the c_i's and \bar{c}_i's) are precharged to 1. Thus both true and false rails of the carry signals (c_i and \bar{c}_i) are low. After the evaluation begins, exactly one control signal (p_i, k_i, or g_i) per bit slice is asserted. If g_i is asserted, the internal node of the true rail is discharged, enabling c_{i+1} to rise; since k_i and p_i remain low, the voltage level at the internal node of the false rail remains high, keeping \bar{c}_{i+1} low. On the other hand, if k_i is asserted, the internal node of the false rail is discharged, enabling \bar{c}_{i+1} to rise; since g_i and p_i remain low, c_{i+1} remains low. If p_i is asserted, both g_i and k_i remain low; thus the NMOS transistors controlled by g_i and k_i remain turned off, and the logic values at c_i and \bar{c}_i are propagated to c_{i+1} and \bar{c}_{i+1} respectively. Therefore, in all three cases, for each bit slice exactly one carry rail becomes high when the computation is complete.

The PKG and sum bit slices are implemented in domino gates. The sum logic is designed as a domino gate for two reasons: (1) domino gates are faster than static XOR gates and (2) domino outputs are glitch-free (hence generate less noise).
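The behavior of the dual-rail carry chain and its completion detection, Eqns. (5) to (7), can be mimicked with a short step-by-step model. This is a hedged sketch rather than a circuit simulation: it assumes that every rail update takes one unit step, ignores the carry bypass, and uses a hypothetical function name `dual_rail_add`.

```python
def dual_rail_add(a, b, cin=0, n=32):
    """Return (sum mod 2**n, unit steps until done) for a dual-rail carry chain."""
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    p = [((a ^ b) >> i) & 1 for i in range(n)]     # Eqn. (1)
    g = [((a & b) >> i) & 1 for i in range(n)]     # Eqn. (3)
    k = [1 - p[i] - g[i] for i in range(n)]        # Eqn. (2): neither p nor g
    c_true = [cin] + [0] * n                       # rails start low (precharged)
    c_false = [1 - cin] + [0] * n
    steps = 0
    # Eqn. (7): done once one rail of every carry c_0 .. c_{n-1} is high
    while not all(c_true[i] | c_false[i] for i in range(n)):
        c_true = [c_true[0]] + [g[i] | (p[i] & c_true[i]) for i in range(n)]      # Eqn. (5)
        c_false = [c_false[0]] + [k[i] | (p[i] & c_false[i]) for i in range(n)]   # Eqn. (6)
        steps += 1
    # exactly one rail per carry bit is high when the computation is complete
    assert all(c_true[i] + c_false[i] == 1 for i in range(n))
    total = sum((p[i] ^ c_true[i]) << i for i in range(n))                        # Eqn. (4)
    return total, steps
```

In this model 0 + 0 finishes after a single step, while 1 + (−1) needs 31 steps, which illustrates why the completion time depends on the operands.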
However, unclocked domino sum gates require monotonically rising inputs; thus both carry-in and carry-propagate signals must be in dual-rail form:

    S_i = p_i \bar{c}_i + \bar{p}_i c_i           (8)

The input-output behavior of the jth 4-bit Manchester carry chain can be described as:

    c_{4j+4} = G_{4j} + P_{4j} c_{4j}                     (9)
    \bar{c}_{4j+4} = K_{4j} + P_{4j} \bar{c}_{4j}         (10)

where G_{4j}, K_{4j}, and P_{4j} are group-generate, group-kill, and group-propagate signals for the 4-bit section:

    G_{4j} = g_{4j+3} + p_{4j+3} g_{4j+2} + p_{4j+3} p_{4j+2} g_{4j+1} + p_{4j+3} p_{4j+2} p_{4j+1} g_{4j}     (11)
    K_{4j} = k_{4j+3} + p_{4j+3} k_{4j+2} + p_{4j+3} p_{4j+2} k_{4j+1} + p_{4j+3} p_{4j+2} p_{4j+1} k_{4j}     (12)
    P_{4j} = p_{4j+3} p_{4j+2} p_{4j+1} p_{4j}                                                                (13)

Fig. 5. (a) Dual-rail Manchester carry chain; (b) Carry bypass circuit.

In order to speed up the carry propagation for the worst-case addition, in which the carry is propagated from the least significant bit to the most significant, carry bypass circuits are inserted between successive four-bit sections. If every bit slice in the 4-bit section propagates the carry, i.e., group-propagate, then the carry propagation can be sped up by bypassing the MCC as shown in Fig. 5. The internal nodes driving c_{4j+4} and \bar{c}_{4j+4} are both precharged to 1 before the evaluation begins. If all four propagate signals (p_{4j}, p_{4j+1}, p_{4j+2}, p_{4j+3}) become high after the evaluation begins, the group propagate signal P_{4j} is asserted. This enables c_{4j} and \bar{c}_{4j} to propagate to c_{4j+4} and \bar{c}_{4j+4} respectively. If the group propagate signal is not asserted, the carry signal is either generated or

killed in the corresponding 4-bit section of the Manchester carry chain. This enables c_out or \bar{c}_out to be asserted, which in turn enables c_{4j+4} or \bar{c}_{4j+4}. This feature allows the carry to propagate to the next MCC at the earliest possible opportunity.

The completion sensing circuit in our design is a domino logic implementation of the OR-AND circuit as shown in Fig. 6. Since the order of sum bit generation cannot be known a priori, the circuit must detect that the logic level of every sum bit has reached the final value. In practice, it is sufficient to detect that every carry bit has reached the final value, as long as the delay from a carry to the corresponding sum bit is always smaller than the minimum delay through the completion logic, which in fact is the case in our circuit. Thus the delay from carry to sum generation is effectively hidden in the completion sensing computation. In order to further minimize the completion sensing overhead for the worst-case computation, late arriving signals (for the worst-case computation) are placed closer to the output. According to the SPICE simulation (at 3.3V and 22°C) with realistic output loading, although the completion sensing overhead is as much as 50% of the sum delay when there is no carry propagation, it is only 5% when the computation takes the worst-case delay: i.e., when the carry propagates through the entire carry chain. As a result, the actual delay from the beginning of evaluation to the completion sensing is only 12% longer (5% with realistic output loading) than to the sum generation in the worst case and usually is much shorter than the worst-case sum delay.

Our results indicate that the average computation time including completion sensing delay is 2.8ns for ARM data arithmetic benchmark traces. This outperforms most other published self-timed adder designs [13], [11]. It is somewhat slower (< 15%) than recently published self-timed (Brent and Kung style) carry lookahead adders, which require much greater area (> 4×), using a speculative completion technique [14].
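The group signals of Eqns. (11) to (13) can be checked exhaustively against four ripple steps of the dual-rail recurrences (5) and (6). The sketch below is a logic-level check only and says nothing about circuit timing; the function name `check_group_equations` is hypothetical.

```python
from itertools import product


def check_group_equations():
    """Compare Eqns. (9)-(13) with four ripple steps of Eqns. (5)-(6)."""
    checked = 0
    for a, b, cin in product(range(16), range(16), (0, 1)):
        bits = [((a >> i) & 1, (b >> i) & 1) for i in range(4)]
        p = [x ^ y for x, y in bits]
        g = [x & y for x, y in bits]
        k = [(1 - x) & (1 - y) for x, y in bits]

        # group-generate, group-kill, and group-propagate, Eqns. (11)-(13)
        G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
        K = k[3] | (p[3] & k[2]) | (p[3] & p[2] & k[1]) | (p[3] & p[2] & p[1] & k[0])
        P = p[3] & p[2] & p[1] & p[0]
        assert G + K + P == 1               # the three group signals are mutually exclusive

        # ripple the true and false rails through the four bit slices
        c_true, c_false = cin, 1 - cin
        for i in range(4):
            c_true, c_false = g[i] | (p[i] & c_true), k[i] | (p[i] & c_false)

        assert c_true == (G | (P & cin))            # Eqn. (9)
        assert c_false == (K | (P & (1 - cin)))     # Eqn. (10)
        checked += 1
    return checked
```

Because exactly one of G, K, and P is high for any operand pair, a section either produces its carry-out locally or forwards the incoming rails, which is the property the bypass circuit relies on.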
More importantly, the completion sensing overhead of our design is only 20% of the average-case delay and 12% of the worst-case delay.

Fig. 6. Completion sensing domino logic.

A.3 Results

We implemented our design in a 0.5µm HP CMOS14B triple-metal process. Table I shows the results of test and SPICE simulation with a 3.3V power supply at 22°C and with each sum gate driving a realistic load (two fanouts, each of which is a 19λ/9λ inverter connected through a fully turned-on 9λ/9λ transmission gate), using the process parasitics from the run (N68MAR) in which our chips were fabricated. We list simulation results as well as test results because not all relevant data are available from tests. The evaluation delays from the test measurements are about 20% smaller than those from the simulation runs. We conjecture that this discrepancy is partly due to the conservative wire capacitance extraction from the layout. The test results for different operating conditions are reported in Section V.

Adding 1 and −1, one of the cases that yield the longest carry chain, results in the worst-case eval→done delay of 3.96ns. t_{eval→sum} is omitted from Table I because the sum bits, once precharged, remain 0 for this computation. Adding −2 and 0 is another computation with the longest carry chain. In this case, the false rail of the carry is produced in the least significant bit and propagated to the most significant bit. The test result for t_{eval→done} is 3.67ns. In this case, we were able to simulate and observe the eval→sum and eval→done delays and derive the completion sensing overhead. The overhead of 0.21ns amounts to 5% of the simulated eval→sum delay.

The best-case delay is 1.56ns, which results from the 0 + 0 computation. Again we were unable to deduce the completion sensing overhead for this addition, because the sum bits do not change once precharged. Instead, we use another case (−1 − 1) to deduce the completion sensing overhead for the computation with the carry chain length of 0.
The test result for t_{eval→done} is 1.67ns. The overhead of 0.70ns is 50% of the simulated eval→sum delay of 1.37ns. We observe that the computation with the longest carry chain has the minimum completion sensing overhead and the one with the carry chain length of zero has the maximum overhead, which is exactly the desired characteristic to achieve superior average-case delay and yet maintain a very competitive worst-case delay. We confirmed that the completion sensing overhead for random cases falls between 0.2ns and 0.7ns using both SPICE and software simulations.

TABLE I
ADDER TEST / SIMULATION RESULTS (3.3V, 22°C, MOSIS N68MAR).

                Test              Simulation
    Cases       t_{eval→done}     t_{eval→done}    t_{eval→sum}    Overhead
    1 + (−1)    3.96ns            4.76ns           N/A
    −2 + 0      3.67ns            4.68ns           4.47ns          0.21ns
    [illegible] [illegible]       4.42ns           4.16ns          0.26ns
    −1 − 1      1.67ns            2.07ns           1.37ns          0.70ns
    0 + 0       1.56ns            2.03ns           N/A

Using the extracted delay components from the SPICE simulations, we computed the average-case delay for the ARM data arithmetic operation benchmarks. The simulated average-case delay was 2.8ns (with no loading on outputs).

We also compared our design to a comparable synchronous design: the same design as our self-timed design but excluding the completion sensing circuit. Conventional synchronous designs typically use a single-rail carry chain and static XOR gates for sum generation. However, our synchronous design contains a dual-rail

carry chain and domino sum logic. The cost of adding an additional carry chain is one additional transistor load on the p_i's, but it is more than offset by faster precharged sum gates. We simulated our synchronous design in the same operating environment. The worst-case delay (when 1 and −1 are added) was 3.8ns. Thus we concluded that the average-case delay (including completion sensing overhead) of our self-timed design is 36% faster than comparable synchronous designs, with only 12% overhead (with no loading on outputs) for the worst-case computation. This speedup is based solely on data dependency, not considering variations in environmental factors such as temperature and power supply voltage. The effects of temperature and voltage variations are discussed in Section V.

B. Multiplier

Our 32-bit multiplier is an array multiplier which consists of 16 rows of carry save adders (CSA) and a 32-bit carry propagate adder (CPA) operating in parallel with the CSA array as depicted in Fig. 7. Because the upper 32 bits of the 64-bit product are not needed in this design, only the right half of the CSA array is used. We show below that requiring only the lower 32 bits of the product yields interesting optimization opportunities.

Fig. 7. A fragment of the domino dual-rail array multiplier. The upper 32 bits are not needed and thus not implemented. P_{2i,j} represents the jth bit of partial product P_{2i}; C_{2i} = 1 (C_{2i+1} = 1) if partial product P_{2i} is the two's complement of A (2A) (see Table II).

B.1 Architecture

Our decision to select a non-pipelined array multiplier is based on the nature of this application. The differential equation solver is to solve one equation at a time; thus the most important criterion is latency, not throughput. For 32-bit by 32-bit multiplication, the total number of partial products is normally 32. In our design, we use the radix-4 Booth encoding to reduce the number of partial products to 16.
Given a multiplicand, A, and a multiplier, B, the product, A × B, is:

    A \times B = A \times ( -b_{31} 2^{31} + \sum_{i=0}^{30} b_i 2^i )
               = A \times \sum_{i=0}^{15} ( -b_{2i+1} 2^{2i+1} + b_{2i} 2^{2i} + 2 b_{2i-1} 2^{2i-1} )
               = \sum_{i=0}^{15} A \times \hat{b}_{2i} 2^{2i},                                          (14)

where b_{-1} = 0 and \hat{b}_{2i} = -2 b_{2i+1} + b_{2i} + b_{2i-1}. The partial product generation is summarized in Table II.

TABLE II
PARTIAL PRODUCT GENERATION BASED ON RADIX-4 BOOTH ENCODING.

    \hat{b}_{2i}    Partial product P_{2i}       C_{2i+1}    C_{2i}
    −2              Two's complement of 2A       1           0
    −1              Two's complement of A        0           1
     0              0                            0           0
     1              A                            0           0
     2              2A                           0           0

Each row of the CSA array computes the running sum of the partial products in carry-save form, i.e., carry bits are not propagated in a ripple carry fashion but are forwarded to the next row with twice the weight. Thus each row requires a set of full adders (FA) to add a partial product to the running sum in carry-save form. Because the radix-4 Booth encoding requires that each successive partial product be shifted two bit positions to the left as derived in Eqn. (14), each row of the CSA array generates spillover bits, which form two 32-bit numbers that need to be added, as depicted in Fig. 7. The key observation here is that a self-timed CPA can be employed to perform this addition in parallel with the CSA computation, as soon as the spillover bits are generated from the CSA.

Our design uses a 2-bit CPA for each row of the CSA array to add the spillover bits from the CSA array, as illustrated in Fig. 7. As long as the 2-bit CPA is as fast as the FA in the CSA array (though it is not necessary for it to be faster for the multiplier to function correctly), our multiplier can compute the lower 32 bits of the 64-bit product in the time required for adding 16 partial products in carry-save form. Thus, for applications requiring only the least significant 32 bits, our technique, which is generally applicable to both synchronous and asynchronous designs, leads to a significant reduction in latency.

B.2 Circuit implementation

Our design uses self-timed dual-rail domino logic to reduce the latency of the carry save addition. The self-timed domino allows the signals to propagate (or dominos to "fall") at their natural speed, unencumbered by the evaluation clock adjusted for the worst case.
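The radix-4 Booth recoding of Eqn. (14) can be illustrated with a short software model. This is a sketch under stated assumptions, not the multiplier's circuit: operands are treated as 32-bit two's complement integers, the partial products are summed with ordinary integer arithmetic instead of a carry-save array, and the function names `booth_radix4_digits`, `to_signed`, and `booth_multiply_low` are hypothetical.

```python
def booth_radix4_digits(b, n=32):
    """Return the n/2 radix-4 Booth digits of b, each in {-2, -1, 0, 1, 2}."""
    b &= (1 << n) - 1
    digits = []
    prev = 0                                   # b_{-1} = 0
    for i in range(n // 2):
        b_even = (b >> (2 * i)) & 1            # b_{2i}
        b_odd = (b >> (2 * i + 1)) & 1         # b_{2i+1}
        digits.append(-2 * b_odd + b_even + prev)
        prev = b_odd                           # becomes b_{2i-1} of the next digit
    return digits


def to_signed(v, n=32):
    """Interpret the low n bits of v as a two's complement integer."""
    v &= (1 << n) - 1
    return v - (1 << n) if v >> (n - 1) else v


def booth_multiply_low(a, b, n=32):
    """Lower n bits of a * b, accumulated from n/2 Booth partial products."""
    acc = 0
    for i, digit in enumerate(booth_radix4_digits(b, n)):
        acc += (digit * to_signed(a, n)) << (2 * i)    # partial product, weight 2^(2i)
    return acc & ((1 << n) - 1)
```

Each digit selects 0, ±A, or ±2A, and successive partial products are shifted by two bit positions, so 16 partial products suffice for a 32-bit multiplier operand.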
Since unclocked domino logic requires that all inputs be monotonically rising once the evaluation begins, our design uses dual-rail signaling. Dual-rail domino gates for sum and carry logic are shown in Fig. 8. There are three dual-rail inputs to each domino gate: sum and carry from the preceding row and a partial product bit. The position of the partial product bits in the series stack plays a significant role in avoiding charge sharing problems. In our design, the partial product input (pp) is connected to a and its complement (\overline{pp}) to b. When the evaluation begins, internal nodes 1 and 2 are precharged to V_dd,

because the partial product bits are set up before the evaluation begins. This makes the effective stack size two during evaluation, which makes the charge sharing problem (caused by one of the transistors controlled by a or b turning on while the transistors controlled by c_i and \bar{c}_i are off) negligible.

Fig. 8. Dual-rail domino full adder.

In order to make the 2-bit CPA faster than the FA used in the CSA array, our design uses a 2-bit dual-rail domino carry lookahead adder (CLA) as described below:

    c_1 = c_0 a_0                                                                             (15)
    \bar{c}_1 = \bar{c}_0 + \bar{a}_0                                                         (16)
    c_2 = c_0 a_0 (a_1 + b_1) + a_1 b_1                                                       (17)
    \bar{c}_2 = (\bar{c}_0 + \bar{a}_0)(\bar{a}_1 + \bar{b}_1) + \bar{a}_1 \bar{b}_1          (18)
    p_1 = a_1 \bar{b}_1 + \bar{a}_1 b_1                                                       (19)
    \bar{p}_1 = a_1 b_1 + \bar{a}_1 \bar{b}_1                                                 (20)
    s_0 = a_0 \bar{c}_0 + \bar{a}_0 c_0                                                       (21)
    s_1 = p_1 \bar{c}_1 + \bar{p}_1 c_1                                                       (22)

Because the longest series stack in the c_2 logic of our 2-bit CPA is 3 whereas the stack size of the sum gate in the FA is 4, the c_2 bit of the 2-bit CPA is computed before the outputs of the FAs in the corresponding row of the CSA array. In other words, the carry propagation never impedes the evaluation of a CPA stage; the CPA stage begins the evaluation as soon as the sum and carry signals from the CSA arrive.

Normally, the 32-bit CPA requires a completion sensing circuit implemented as a 32-bit OR-AND as shown in Eqn. (7). Since the generation of carry signals is staggered, a domino chain is the most suitable circuit topology to use. However, in this design, we can further simplify the circuit by exploiting the fact that c_0 ... c_30 are always computed before c_31. Thus our implementation of the done logic is a domino OR gate: c_31 + \bar{c}_31. Although the sum generation takes an additional domino gate delay after the carry bits are computed, the done signal is guaranteed not to rise until all of the sum bits of the CPAs, i.e., the final product bits, have been generated, because the done signal is buffered significantly to drive the controller or the registers.
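The logic of a 2-bit dual-rail carry lookahead stage of this kind can be checked exhaustively in software. The sketch below rests on an assumed reading of Eqns. (15) to (22): bit position 0 adds a single spillover bit a0 to the carry-in c0, and bit position 1 adds two bits a1 and b1. The input names and the function names `cla2` and `check_cla2` are hypothetical, and only logical correctness is checked, not stack depth or timing.

```python
from itertools import product


def cla2(a0, c0, a1, b1):
    """Evaluate an assumed 2-bit dual-rail CLA; '_f' marks a false rail."""
    def inv(v):
        return 1 - v                                   # complement rail of an input

    c1 = c0 & a0
    c1_f = inv(c0) | inv(a0)
    c2 = (c0 & a0 & (a1 | b1)) | (a1 & b1)             # longest series stack: 3
    c2_f = ((inv(c0) | inv(a0)) & (inv(a1) | inv(b1))) | (inv(a1) & inv(b1))
    p1 = (a1 & inv(b1)) | (inv(a1) & b1)
    p1_f = (a1 & b1) | (inv(a1) & inv(b1))
    s0 = (a0 & inv(c0)) | (inv(a0) & c0)
    s1 = (p1 & c1_f) | (p1_f & c1)
    return {"s0": s0, "s1": s1, "c1": c1, "c1_f": c1_f,
            "c2": c2, "c2_f": c2_f, "p1": p1, "p1_f": p1_f}


def check_cla2():
    """Check arithmetic correctness and rail complementarity for all 16 inputs."""
    for a0, c0, a1, b1 in product((0, 1), repeat=4):
        r = cla2(a0, c0, a1, b1)
        assert a0 + c0 + 2 * (a1 + b1) == r["s0"] + 2 * r["s1"] + 4 * r["c2"]
        assert r["c1"] + r["c1_f"] == 1
        assert r["c2"] + r["c2_f"] == 1
        assert r["p1"] + r["p1_f"] == 1
    return True
```

Every true rail and its false rail come out complementary once all inputs are valid, which is what allows a single OR of the two rails of the last carry to serve as the done signal.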
B.3 Results

We implemented our design in a 0.5µm HP CMOS14B triple-metal process and simulated the backannotated design for three test corners as shown in Table III. The average delay from the beginning of evaluation to the assertion of done is 7.7ns, when simulated at 22°C and 3.3V using the process parasitics from the run (N68MAR) in which our chips were fabricated. The CSA delay (delay from evaluation to CSA done) is 7.1ns; thus the computation time beyond the CSA delay is only 0.6ns. The completion sensing overhead is actually zero, if the buffering delay for the done signal is excluded. Finally, the precharge buffer delay is 1.5ns.

TABLE III
MULTIPLIER SIMULATION AND TEST RESULTS.

                       Test       Simulation
    Temperature        22°C       22°C       100°C      0°C
    Power supply       3.3V       3.3V       3.0V       3.6V
    Process            N68MAR     N68MAR     Slow       Fast
    Evaluate delay     6.7ns      7.7ns      12.8ns     5.9ns
    Precharge delay    3.8ns      4.4ns      7.2ns      3.5ns

We measured the multiplier evaluation and precharge delays of the fabricated chips. The test results are also shown in Table III. The actual measured delays are somewhat less than the simulation results: the measured evaluation and precharge delays are 6.7ns and 3.8ns whereas the simulated evaluation and precharge delays are 7.7ns and 4.4ns respectively. Our best guess for this discrepancy is that the wire capacitance extraction from the layout is too conservative.

To the best of our knowledge, our design outperforms, in terms of latency, almost all published 16-bit or 32-bit array multiplier designs (fabricated in similar technology) which generate 32-bit products. The closest competitor in the same application domain is the 500MHz 16-bit pipelined array multiplier used in a DSP core [15], which has 8ns latency. This chip was fabricated in a 0.35µm process and uses a 2.5V power supply.

IV. CONTROLLER DESIGN AND VALIDATION

A. Overview

The basic difference between our asynchronous design and traditional synchronous designs is in the scheduling of control. In synchronous designs, the assertion of the control signals is tied to clock edges, which can often lead to non-trivial dead time between the completion of a computation and the start of the next clock cycle.
However, the event-driven sequencing based on completion sensing in our design significantly reduces the dead time between the completion of a computation and the start of the next computation.

A.1 Distributed control

We implemented the control using distributed burst-mode controllers, a class of asynchronous FSMs described in Section IV-B, to reduce the latency of control. As in synchronous systems, operations must be assigned to state transitions a priori; however, state transitions are event-driven and occur as soon as a specified set of input transitions arrives. To obtain flexible operation sequencing, we used four control units, one for each

datapath unit, as depicted in Fig. 2. A high-level synthesis tool, such as ACK [16], would have been beneficial in partitioning the design and ensuring the correctness of the partitioning, although we did not make use of it.

This distributed control paradigm was also motivated by layout considerations. The controller for each datapath unit is located next to the unit. This minimizes the length of the datapath control signals, thereby reducing the control wire delays in the critical path; however, it makes a few inter-controller communication wires run half the length of the chip. The longer delays on these communication wires do not affect the performance adversely because, by and large, their delays are hidden in the datapath computation, as described in Section IV-C.

A.2 Inter-controller protocol optimization

Communication wires are typically designed in pairs connecting a designated sender to a designated receiver. The sender traditionally initiates the communication by toggling one wire and the receiver acknowledges the communication by toggling the other wire. The process would either repeat using the opposite phase, as in two-phase handshaking, or the signals would reset in preparation for the next communication, as in four-phase handshaking. Typically, designs consist of either all two-phase or all four-phase handshaking. In our design, however, we use a mixture of both. The communication signals M1A and A1M follow a two-phase protocol while the rest of the signals follow a four-phase protocol. Our design is also different in that some acknowledgment signals indicate the existence of new data. In particular, M2A2+ acts to acknowledge A2M+ as well as to indicate to ALU2 that register M2 has been updated with u × dx.

To reduce the number of communication wires, however, we used broadcast communication signals which connect a sender to multiple receivers. For example, A1M connects the ALU1 controller to both multiplier controllers. This is efficient because both multipliers must wait for U to be updated by ALU1 before starting the subsequent loop iteration.
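The difference between the two handshaking styles can be illustrated with a small event-trace sketch. It models a single sender and receiver pair only; the wire names `req` and `ack` and both function names are hypothetical and do not correspond to signals of the chip.

```python
def four_phase(transfers):
    """Return-to-zero handshake: four transitions per transfer."""
    trace = []
    for _ in range(transfers):
        # request, acknowledge, then both wires reset before the next transfer
        trace += ["req+", "ack+", "req-", "ack-"]
    return trace


def two_phase(transfers):
    """Transition signaling: two transitions per transfer, alternating phase."""
    trace = []
    level = 0                                  # both wires start low
    for _ in range(transfers):
        level ^= 1                             # each transfer toggles both wires once
        sign = "+" if level else "-"
        trace += ["req" + sign, "ack" + sign]
    return trace
```

Two transfers take four transitions under the two-phase protocol and eight under the four-phase protocol, which is one reason a design may mix the two styles on different wires.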
Similarly, A2M is used to indicate to both multipliers that the loop test (X < a) passes and a loop iteration should start.

Furthermore, in our design, not all broadcast request signals are explicitly acknowledged by their receivers. For example, a broadcast request signal, A1M, is sent to both MUL1 and MUL2 but acknowledged explicitly only by the MUL1 controller. Typical speed-independent asynchronous protocols do not allow this because it cannot be guaranteed that the MUL2 controller would have a chance to properly react to A1M+ before A1M- occurs (see Fig. 9). In our design, however, we analyzed the relative timings of all signals and verified that A1M+ always occurs when MUL2 is waiting for it and that MUL2 always has a chance to properly react before A1M- occurs. Thus, an explicit acknowledgment would be redundant. This analysis allowed us to keep the controllers simple, which improved their latencies by up to 15%.

B. Controller specification and implementation

This subsection describes the specification and implementation of one of the controllers and the constraints on its environment.

[Fig. 9. MUL2 controller specification and implementation.]

B.1 Extended burst-mode specification

The controllers were specified in extended burst-mode (XBM), an asynchronous FSM specification format. We only describe the details of the XBM specification relevant to the discussions in this paper. The complete details are described in [6].

Fig. 9 describes the controller for MUL2, one of the four extended burst-mode controllers used in our design. Signals ending with + or - are terminating signals; the ones ending with * are directed don't cares. If a state transition is labeled with a directed don't care a*, then the following state transition must be labeled with a* or a+ or a-. A terminating signal a+ denotes a 0 → 1 transition of a if a is initially 0, and no transition at all if a is initially 1. A sequence of state transitions labeled with a* and terminated with a+ represents a single 0 → 1 transition of a at any point in the sequence.
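
The labeling rules for directed don't cares can be made concrete with a small check. The following is a minimal sketch, not part of the XBM tools: it assumes that the input a is sampled once before the first state transition and once after each transition, and the function name and sampling scheme are hypothetical. It accepts a trace only if a never falls and has reached 1 by the time the terminating a+ transition completes.

```python
def legal_rising_sequence(labels, values):
    """Check sampled values of one input `a` against a run of XBM labels.

    labels: labels for `a` on consecutive state transitions,
            zero or more "a*" followed by a terminating "a+".
    values: value of `a` before the first transition, then after each
            transition, so len(values) == len(labels) + 1.
    (Hypothetical helper; sampling once per transition is an assumption.)
    """
    if len(values) != len(labels) + 1:
        raise ValueError("need exactly one more sample than labels")
    if not labels or labels[-1] != "a+" or any(l != "a*" for l in labels[:-1]):
        raise ValueError("expected zero or more 'a*' followed by 'a+'")
    # `a` may rise at any point in the sequence but must never fall.
    for before, after in zip(values, values[1:]):
        if before == 1 and after == 0:
            return False
    # Once the terminating transition has completed, `a` must be 1.
    # If `a` started at 1, this means no transition occurred at all.
    return values[-1] == 1
```

For the label sequence a*, a*, a+, the traces [0, 0, 1, 1] and [0, 0, 0, 1] are accepted (the single rise may happen early or at the terminating transition), while [0, 1, 0, 1] is rejected because a falls in the middle of the sequence.
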
A terminating signal not immediately preceded by a directed don't care represents a compulsory transition. An input burst is a non-empty set of input transitions (terminating or directed don't care), at least one of which must be compulsory. An output burst consists of a possibly empty set of output transitions. In a given state, when all the terminating transitions in the input burst have appeared, the machine generates the corresponding output burst and moves to a new state. Specified transitions in the input burst may appear in arbitrary temporal order; outputs may be generated in any order.

B.2 Controller implementation style

The controller specifications were implemented in an efficient but robust form of multiple-input-change circuits using the synthesis tool 3D-gC [17]. Each output is typically implemented with a single generalized C-element. As an example, the implementation of the MUL2 controller is shown in Fig. 9.

This design style uses an efficient strategy for the insertion of state variables. In particular, state variables change concurrently with an output burst. Consequently, state variables are never in the critical path of the circuit and the latency of the circuit is very low.

The circuits, however, have a fundamental-mode constraint. That is, after the output burst, the circuit requires time to stabilize before any compulsory transition of the next input burst appears. A detailed timing analysis could be performed to identify the specific required settling time for each state transition [18]. Then, SPICE simulations of the controller and its environment could be performed to ensure that the environment is sufficiently slow in all cases. This involves a large number of time-consuming and tedious SPICE simulations. Fortunately, the environment is typically so slow (relative to the controller settling time) that the timing assumptions are satisfied with large safety margins. Therefore, we can approximate the settling time with a constant that is larger than the settling time of each of the individual state transitions without introducing many false negatives. We then can use our formal verification techniques to automatically validate these timing assumptions, as explained below.

C. Controller-datapath interface design and validation

We reduced control circuit overhead by hiding controller delay within datapath computation. This is achieved by simultaneously latching the result of one operation and asserting a communication signal to initiate the next operation. Thus much of the controller delay, if not all, can be overlapped by latch and multiplexor delays, as illustrated in Fig. 10. In this example, when ALU1 completes its evaluation, the controller asserts LA+ to latch the result in register A and A1M to signal MUL1 to start the next operation using the value of A. Using this assumption, however, means that we must be careful to ensure that the inputs to MUL1 have stabilized before evaluation starts. Otherwise, the domino logic may malfunction due to accidental discharge. Simple static timing analysis and SPICE simulations were used to ensure that sufficient timing margins exist.

[Fig. 10. Timing assumptions (fragments of the ALU1 and MUL1 specifications).]

The example in Fig. 10 also illustrates a second timing assumption we needed. The ALU1 controller asserts A1Prech+ and LA+ concurrently after ALU1 has finished evaluating. The assumption here is that the output of ALU1 is loaded into register A before precharging of ALU1 changes the data (i.e., data hold time is not violated). This assumption is safe because the A1Prech signal is heavily loaded. This type of timing assumption was also verified using static timing analysis and SPICE simulation.

D. Inter-controller protocol design and validation

Our formal verification efforts were directed to ensure that the controller designs are correct: i.e., that the interaction of controllers and datapaths generated no illegal transitions which may cause the circuit to fail. Specifically, we attempted to identify the existence of any transition that violates the specification of a controller or violates the fundamental mode constraint (i.e., transitions that occur before the controller is guaranteed to have stabilized).

It is important to note that we used formal verification only as a design aid to check whether certain properties hold. We did not use it to guarantee correctness with respect to some model. Our verification efforts validated that the final controller designs are correct. More importantly, they helped us quickly identify many bugs in preliminary versions of the design. We discuss below some interesting bugs. A more detailed description of our verification technique can be found in [8].

D.1 Verification models: discrete vs. continuous

We adopted a discrete time model and used symbolic model checking to verify our system. Discrete time models were selected over continuous time models for efficiency. Continuous time models, albeit more expressive, require the management of region graphs, which limits the size of the system that can be analyzed [19], [20]. Discrete time models do have the disadvantage that discretization of timer regions leads to large state spaces, referred to as the state explosion problem. However, finite state spaces (even very large ones) resulting from discretization can be represented efficiently using BDDs. In particular, symbolic model checking techniques using BDDs have been successfully applied to many practical designs, some of which have huge state spaces [7]. This motivated the application of symbolic model checking to our discrete time models.

In a discrete time model, one state transition of the system represents the passage of one time step.
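
To illustrate what a time step means for a modeled delay, the sketch below maps a continuous delay bound onto discrete ticks. It is only an illustration under stated assumptions: the bounds of 2.3 ns to 3.1 ns and the function name are hypothetical, the outward rounding follows the rule given later in Section D.4 (the tightest range of discrete time points that encloses the bounds), and the four step sizes are the ones reported there. Integer picoseconds are used to avoid floating-point rounding.

```python
def discretize(lo_ps, hi_ps, step_ps):
    """Return (lo_tick, hi_tick), the tightest tick range enclosing [lo_ps, hi_ps].

    All arguments are integer picoseconds. Hypothetical helper for illustration.
    """
    lo_tick = lo_ps // step_ps        # round the lower bound down
    hi_tick = -(-hi_ps // step_ps)    # round the upper bound up (integer ceiling)
    return lo_tick, hi_tick


# Hypothetical delay bound of 2.3 ns to 3.1 ns at the four step sizes of Section D.4.
for step_ps in (500, 750, 1000, 1250):
    lo_tick, hi_tick = discretize(2300, 3100, step_ps)
    modeled_width_ps = (hi_tick - lo_tick) * step_ps
    print(f"step={step_ps}ps ticks=[{lo_tick}, {hi_tick}] modeled width={modeled_width_ps}ps")
```

With these hypothetical bounds, the 0.5 ns step needs a timer that counts up to 7 ticks and models a 1.5 ns wide interval, while the 1.25 ns step needs only 3 ticks but models a 2.5 ns wide interval around a true width of 0.8 ns. This is the trade-off between state-space size and accuracy in miniature.
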
Smaller step sizes lead to more accurate representations of the design, reducing the chance of false negative verification results. This higher resolution, however, leads to a larger state space that must be analyzed. To keep verification run-times manageable, we abstract the implementation details of each component by modeling its specification rather than its implementation. Moreover, we model the behavior of only those signals that are involved in inter-controller communication.

The discretization of delays, if not done carefully, can introduce false positive verification results [21], [22]. Other researchers have developed guidelines which ensure the absence of false positives in particular classes of systems [22], [23]. We are currently investigating whether this property is satisfied in our systems. Our primary goal, in any case, was to find bugs that are difficult to detect by simulation and not necessarily to guarantee 100% correctness.

D.2 Verification specifications

An XBM specification represents, as any asynchronous specification does, a contract between the environment and the circuit. As detailed in Section IV-B.1, it dictates what the circuit should do and what the environment is allowed to do. While the circuit's conformance to the specification is guaranteed by the synthesis methodology, the environment's compliance to the contract needs to be verified. For every state of an XBM controller, these constraints together define a set of input transitions that are illegal. The complete set of illegal transitions, which are derived automatically from burst-mode specifications, is described in detail in [8].

D.3 Design bugs found using symbolic verification

Since ALU2 is responsible for implementing the loop test, we designed its controller to be the master controller. If the loop test passes, the controller asserts a communication signal that triggers the multiplier controllers to initiate their respective operations. The multiplier controllers, when the multipliers are done, assert communication signals that trigger the next operations, including ALU1 loop operations. Consequently, no direct communication between ALU1 and ALU2 controllers is needed; it is done through the MUL1 controller instead. Since ALU2 is also responsible for calculating y, its controller indicates to the environment when the entire computation is done.

This absence of direct communication between ALU1 and ALU2, however, created a timing problem in the initial design that our verification process uncovered. Initially, the extra C-element shown in Fig. 2 was not present and the ALU2 controller communicated directly with the external environment (using EndP as the done signal). In the initial design, the ALU2 controller may assert done+ after the loop test fails and the final y calculation completes but while ALU1 is still computing u. If the environment is fast, the start signal may be lowered before ALU1 expects it. Moreover, if the ALU2 controller negates done quickly, then start+ may occur before ALU1 re-enters the reset state. This scenario violates the constraints of the ALU1 controller. To fix this error, the C-element was introduced into the design with A1M and EndP as its inputs and done as its output. The C-element ensures that done+ does not fire until the calculation of u is complete, signaled by A1M+.

D.4 Verification run-times

Initially, a simplified model of the chip was used for verification. Controllers were modeled as having a fixed delay of one unit of time, with output signal transitions all occurring simultaneously with the state transition. Bounds for ALU and multiplier precharge and evaluate delays were estimated. Even with this simple model, the verification identified several design problems, including the one described above.

After laying out the design, more precise delay information was extracted from SPICE simulations; subsequently, a more accurate verification model was created. For example, buffers that drive control signals over long wires, which were not in the earlier model, were added to the back-annotated model.
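
Returning briefly to the fix described in Section D.3, the behavior of the added C-element can be sketched as follows. This is a behavioral illustration only: it treats A1M and EndP as plain logic levels and starts the output at 0, both of which are assumptions, and the class and function names are hypothetical. The point it shows is that done rises only after both inputs have risen, and then holds its value until both have fallen.

```python
class CElement:
    """Behavioral Muller C-element: output follows the inputs when they agree,
    and holds its previous value when they differ."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        if a == b:
            self.out = a
        return self.out


def done_trace(samples):
    """Return the `done` level after each (A1M, EndP) sample.

    Treating both inputs as simple levels is an assumption of this sketch.
    """
    c = CElement()
    return [c.update(a1m, endp) for a1m, endp in samples]


# EndP rises first while ALU1 is still computing u (A1M = 0): done stays low
# until A1M also rises, then holds high until both inputs are low again.
print(done_trace([(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]))  # [0, 0, 1, 1, 0]
```
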
For each component/signal transition, we conservatively modeled its delay by non-deterministically setting it to a value that lies within the tightest range of discrete time points that encompass the delay bounds obtained by SPICE. The more detailed design was verified using a number of different time steps (0.5ns, 0.75ns, 1ns, and 1.25ns). For the resolution of 0.5ns, the finest one used, no specifications were violated. Even when coarse resolution models were used, all but three of the specifications were verified correct. Some of these false negatives were removed by locally increasing the accuracy of our models by, for example, forcing a known order of output events.

The run-times for the earlier models were quite small (less than 100 seconds), which facilitated a short design cycle. The run-times of the back-annotated designs were somewhat longer but still acceptable (from 30 minutes to 6.5 hours).

E. Summary of the control design process and future work

Although our design process is tool-assisted, the tools do not constitute a fully integrated design methodology. In fact, our design process required significant manual effort. For example, the datapath and distributed controllers were designed manually without the assistance of a high-level synthesis tool. Hence, ensuring correctness was more difficult. This suggests that achieving the same results for larger designs may be more challenging without the application of more powerful tools. We did, however, synthesize the controllers and derive the verification models automatically from the XBM controller specifications. In the future, it may be possible to derive correctly partitioned structural VHDL models of the system using a high-level synthesis tool, both for synthesis and for verification.

For the inter-controller communication design, we started out with a conservative speed-independent protocol and looked for optimization opportunities based on what we knew about the datapath/environment. After modifying the protocol based on timing assumptions, we verified the new design using SMV.
The optimization and verification were repeated until we were satisfied with the estimated performance of the design.

For this design, our verification efforts were limited to the inter-controller protocol design. Thus a simple verification technique using a monolithic transition relation was sufficient. To verify larger designs, we can employ more sophisticated verification approaches using partitioned transition relations [7]. Recently, Chakraborty et al. [24] confirmed, using a timing analysis tool based on time separation of events, that all the timing assumptions made at the controller-datapath interface do indeed hold true.

V. RESULTS

We completed the design, verification, simulation, and testing of the differential equation solver. The chip (die photo shown in Fig. 11) was fabricated in a 0.5µm HP CMOS14B triple-metal process through MOSIS. The chip area (including pads) is 13058λ × 13058λ (3917.4µm × 3917.4µm); the core area is 10550λ × 10145λ (3165µm × 3043.5µm = 9.63mm²). We conducted a comprehensive set of tests, with some runs exceeding 1,500,000 iterations. Table IV lists the average simulation and test results of an example run.

TABLE IV
EXPERIMENTAL RESULTS OF AN EXAMPLE RUN USING THE INITIAL VALUES OF x = 1, dx = 1, a = 11, u = 3, AND y = 3.

| | Testing | Simulation | Simulation | Simulation |
|---|---|---|---|---|
| Temperature | 22°C | 22°C | 50°C | 100°C |
| Supply voltage | 3.3V | 3.3V | 3.3V | 3.0V |
| Process | N68MAR | N68MAR | Typical | Slow |
| Average (per iteration) | 35.6ns | 41.2ns | 46.2ns | 66.6ns |

The average delay per iteration of the algorithm is 35.6ns, when tested at 22°C with a 3.3V power supply. As discussed earlier, the simulated delay of 41.2ns is somewhat longer than the test result. The simulated average delay per iteration under the worst-case condition is 66.6ns; the worst case over all the data-dependent delays we simulated is 70.7ns. The critical path delay calculated by accumulating only the datapath component delays is 61ns. This delay would correspond to the delay of a comparable synchronous implementation, which has no control overhead, no clock skew, no register setup time, and no cycle-to-cycle jitter, but uses the same class of adders, multipliers, multiplexors, and registers. Since our design's simulated average iteration time under typical operating conditions is 41.2ns, it is at least 48% faster, when operating under typical operating conditions, than comparable synchronous designs (designed for the worst case, i.e., 100°C, 3V, and slow process corner).

[Fig. 11. DIFFEQ die photo.]

[Fig. 12. Iteration time test results with varying operating temperature and power supply voltage. Axes: iteration time (ns), voltage (V), temperature (°C).]

In order to validate the robustness of our design, we conducted a set of tests by varying the operating temperature and the power supply voltage. Fig. 12 illustrates the average iteration time for 24 different operating conditions. We conducted our tests at three different operating temperatures: -30°C, 22°C, and 55°C. We could not perform the worst-case temperature testing at 100°C, because we were unable to raise the operating temperature beyond 55°C with our simple test setup. We varied the power supply voltage from 2.2V to 5.0V in 0.2V increments for each operating temperature.

As expected, we observed an increase in iteration time as the operating temperature was raised. Similarly, an increase in the iteration time was observed as the power supply voltage was reduced. Our chips operated correctly down to 1.85V at 55°C, 1.89V at 22°C, and 2.07V at -30°C. Although the plot does not show it, we observed that our chips operate correctly up to a power supply voltage of 7.0V. Since the HP CMOS14B process is optimized for operating at 3.3V, our design can tolerate a 40% drop or more than a 100% increase in the power supply voltage. This result demonstrates that our design is extremely robust and, more importantly, that designs with safe timing assumptions can lead to higher performance without sacrificing robustness.
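
The headline figures in this section can be re-derived from the numbers quoted above. The short sketch below does so; reading "faster" as the relative reduction measured against the asynchronous iteration time is an assumption, chosen because it reproduces the 48% figure, and the variable names are hypothetical.

```python
# Delays quoted in the text (ns).
sync_ns = 61.0    # datapath-only critical path, taken as the synchronous baseline
async_ns = 41.2   # simulated average iteration time, typical conditions

# Assumed reading of "48% faster": extra time of the baseline relative to ours.
speedup_pct = (sync_ns - async_ns) / async_ns * 100

# Supply voltage margins relative to the 3.3 V nominal of the process.
nominal_v = 3.3
min_working_v = [1.85, 1.89, 2.07]   # lowest working voltages at the three test temperatures
max_working_v = 7.0

drop_pct = [(nominal_v - v) / nominal_v * 100 for v in min_working_v]
rise_pct = (max_working_v - nominal_v) / nominal_v * 100

print(f"speedup: {speedup_pct:.1f}%")
print("tolerated drop: " + ", ".join(f"{d:.1f}%" for d in drop_pct))
print(f"tolerated rise: {rise_pct:.1f}%")
```

The computed drops fall between roughly 37% and 44%, which brackets the rounded 40% quoted in the text, and the tolerated rise exceeds 100% as stated.
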
We use control overhead per iteration to estimate the effectiveness of our overhead reduction techniques:

$$OH_{iter} = \frac{t_{iter} - \hat{t}_{iter}}{\hat{t}_{iter}},$$

where $OH_{iter}$ is the control overhead per iteration, $t_{iter}$ the actual iteration time, and $\hat{t}_{iter}$ the iteration time of an ideal circuit, i.e., the sum of all datapath element delays incurred in the critical path of an iteration. For our chip, the average control overhead is estimated to be 16%.

VI. CONCLUSION

We designed a high-performance (48% faster than comparable synchronous designs), low-control-overhead (16% control overhead) asynchronous differential equation solver. In the course of the design, we developed novel techniques for reducing completion sensing overhead for self-timed datapath building blocks. When safe, we used timing assumptions to hide control overhead. Some of the trickier problems at the protocol level were debugged using a symbolic model checking based formal verification tool.

The design process consisted of an iterative design and verification loop. At the earlier design stages, simpler models of XBM controllers were used, which enabled detection of many bugs quickly. In the final stages of the design, more accurate models which incorporate back-annotated delays were verified in a reasonable amount of time, providing higher confidence in the design.

The fabricated chips perform as simulated, albeit somewhat faster. The test results show that our design is very robust with respect to variations in operating temperature and power supply voltage. We conclude that exploiting design margins using safe timing assumptions can lead to robust, high-performance circuits.

Moreover, we argue that our techniques achieve high performance by reducing control overhead while maintaining robustness and are generally applicable. In synchronous designs, clock periods are set to fit the critical path delay plus some safety margin; however, a considerable amount of extra safety margin exists in non-critical paths.
Our chip, on the other hand, demonstrates that event-driven control can lead to higher performance by reducing excess dead time, i.e., by reducing the safety margin of significantly more paths to the level used in the critical paths. In addition, techniques to hide control overhead with the delay through MUXes are generally applicable to MUX-based architectures and can be extended to bus-based architectures.
