Area-Time Efficient Digit-Serial-Serial Two's Complement Multiplier

Essam Elsayed and Hatem M. El-Boghdadi
Computer Engineering Department, Cairo University, Egypt

Abstract - Multiplication is an important primitive operation used in many applications. Although parallel multipliers produce results fast, they occupy considerable chip area. For applications with lengthy operands such as cryptography, the required area grows further. A Digit-Serial-Serial multiplier receives both inputs serially, one digit per cycle. This reduces area at the expense of the number of cycles required to complete the multiplication. Digit multiplier designs are flexible with respect to the digit width, offering designers the opportunity to select the most suitable compromise between area and cycle count for the application in concern. In this paper, a new Digit-Serial-Serial multiplier is proposed that is efficient in terms of area and area-time product. The proposed multiplier supports one operand of dynamic width while the other operand is fixed-width. In contrast, other multipliers support only fixed, equal-width operands. With a small modification, the multiplier is shown to be able to operate on 2's complement operands. The proposed multiplier also supports bit-level pipelining. That is, the critical path of the multiplier pipeline stage can be reduced down to the delay of a D-FF, an AND gate and two full adders (FAs), independent of the operand width and the digit width. Simulation results show that the proposed multiplier reduces the required area over similar multipliers [1] by up to 2% and reduces the area-time product by up to 32%.

Keywords: Multiplier, Digit-Serial-Serial, Area, Time, 2's complement, dynamic operand

1 Introduction

Multiplication is a core operation in hardware designs. Although many multiplier designs have been proposed in the literature, these designs still have room for improvement [7]. To speed up the multiplication operation, completely parallel implementations have been proposed [3] [9].
Since these designs process the two operands in parallel, they possess a short processing time. However, parallel systems require considerably large silicon area, so area reduction becomes essential. Reduction of multiplier area has been the subject of many papers [], and many approaches were followed to reduce the area. One of the common approaches is to split the multiplication over multiple cycles and re-use a smaller circuit that exploits the similarities in the multiplication operation. This requires splitting one or both operands into a number of digits, each of width d bits, and processing one digit at a time. Basically, multiplication can be carried out one digit at a time, producing one digit of the result per cycle.

This approach has many variations. One variation is Digit-Serial-Parallel multiplication [7], where one operand is input in parallel while the other operand is input one digit per cycle. This has the advantage of reducing the multiplier area but suffers from the fact that one of the operands is still handled in parallel, keeping the required area high. Another variation is Bit-Serial-Serial multiplication, where both operands are input one bit per cycle [8]. This has the advantage of minimal area but suffers from a high cycle count. The third is Digit-Serial-Serial multiplication, where both operands are input one digit at a time, filling the gap between the two former approaches. In this paper, we follow this approach and propose a multiplier that is efficient in area and area-time product.

Many design approaches are used to achieve digit multipliers. Two systematic approaches are folding [] and unfolding [7]. Folding starts with a fully parallel multiplier and truncates the basic execution unit (in the multiplier case, the adder) to the digit width rather than the operand width. On the other hand, unfolding starts with a Bit-Serial-Serial multiplier and replicates it a number of times equal to the digit width. Other systematic approaches are based on radix-$2^n$ arithmetic []. An ad-hoc approach can also be found in [7], producing a Digit-Serial-Parallel multiplier.
Our design adopted the folding approach, starting from a fully parallel multiplier and truncating it down to a Digit-Serial-Serial multiplier. Compared to the large volume of research on Digit-Serial-Parallel multipliers, a much smaller volume is present for Digit-Serial-Serial multipliers. Perhaps one of the first attempts to achieve Digit-Serial-Serial multipliers is the work proposed by Aggoun et al. [1]. They proposed a design that is based on radix-$2^n$ arithmetic and supports bit-level pipelining. This is achieved by using 4-to-2 compressors [] instead of adders. Their proposed multiplier is considered the first Digit-Serial-Serial multiplier in the literature and accordingly is compared to Digit-Serial-Parallel multipliers. Though their multiplier was shown to outperform the compared-to multipliers in area and area-time product measures, it imposes some constraints: no 2's complement support, no odd digit count support and no dynamic-width operand support.

Almiladi [2] extended the work of Aggoun [1] using another design methodology (the Table Methodology) that is more systematic to achieve the same results, and accordingly that multiplier inherits the same constraints. It also does not support bit-level pipelining and introduces extra startup cycles. That is, the result digits are not produced immediately starting from the first cycle as in Aggoun [1] or the multiplier proposed in this paper.

To our knowledge, this paper is the third attempt to address Digit-Serial-Serial multipliers. In this paper, a new Digit-Serial-Serial multiplier is proposed that is efficient in terms of area and area-time product. The area-time product is a balanced measure that considers both the speed and the area factors of a design rather than considering each alone. We call the proposed multiplier the Folded Digit-Serial-Serial Multiplier (FDSSM).

Let the two operands to be multiplied be A and B of width m bits and n bits respectively. In the FDSSM operation, A and B are divided into digits of width d and fed serially, one digit of each operand per cycle. The FDSSM is shown to require (n + m)/d cycles to complete the multiplication operation. The FDSSM supports one operand of dynamic width while the other operand (the smaller of the two) is of fixed width. In contrast, other functionally similar multipliers support only fixed, equal-width operands. With a simple modification, the FDSSM can also multiply 2's complement operands, which is not supported by the other functionally similar multipliers. The FDSSM also supports bit-level pipelining. That is, when the multiplier is pipelined, the pipeline stage delay is independent of the operand width and the digit width. The FDSSM can be pipelined with a pipeline stage delay down to the delay of one D-FF, one AND gate and two FAs.
The main reason the FDSSM improves the area is the elimination of the input buffers and the last-digits buffers; simulation results show that the FDSSM reduces the required area over other functionally similar multipliers by up to 2%. It also reduces the area-time product by up to 32%.

The next section describes the mathematical background of the Digit-Serial-Serial multiplier. In section 3, we describe the FDSSM circuit and its operation. In section 4, we present the simulation results and a comparison to other multiplier designs. Section 5 covers the FDSSM's properties, such as dynamic operand support, how the FDSSM can be modified to support 2's complement, and how bit-level pipelining is applied. Finally, section 6 presents the concluding remarks.

2 Mathematical Background

In this section, we present the mathematical background that is used as the base of the FDSSM. In the following we consider the two multiplied operands to be of the same width. Later, in section 5, we show how one of the operands can be of dynamic width.

Consider the multiplication of two numbers A and B, each of width n bits, where $A = a_{n-1} a_{n-2} \ldots a_0$ and $B = b_{n-1} b_{n-2} \ldots b_0$. The paper-and-pencil technique to calculate $A \times B$ has two main stages. The first stage is the generation of the partial product bits, where each bit of A is multiplied by each bit of B. The second stage is to accumulate the partial product bits of the same significance to form the final result. The partial product bits matrix (PPBM) to be accumulated can be given as follows:

$$
\begin{array}{ccccccc}
2^{2n-2} & \cdots & 2^{n} & 2^{n-1} & \cdots & 2^{1} & 2^{0} \\
 & & & a_{n-1}b_0 & \cdots & a_1 b_0 & a_0 b_0 \\
 & & a_{n-1}b_1 & a_{n-2}b_1 & \cdots & a_0 b_1 & \\
 & & & \vdots & & & \\
a_{n-1}b_{n-1} & \cdots & a_1 b_{n-1} & a_0 b_{n-1} & & &
\end{array}
\qquad (1)
$$

Column i of the matrix is of significance $2^i$, where $0 \le i \le 2n-2$. Accumulating the partial product bits of each column gives the multiplication result bit of the current column's significance and multiple carry bits for the next significant column. The operand A can be appended with zero bits at both the least and most significant ends to make the PPBM a rectangular matrix.
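This extension can be done as follows:

$$a_i = 0 : \; n-1 < i \le 2n-1 \ \text{or} \ i < 0 \qquad (2)$$

The resulting PPBM, with effective partial product bits underlined (non-underlined terms are equal to 0), is:

$$
\begin{array}{ccccc}
a_{2n-1}b_0 & a_{2n-2}b_0 & \cdots & \underline{a_1 b_0} & \underline{a_0 b_0} \\
a_{2n-2}b_1 & a_{2n-3}b_1 & \cdots & \underline{a_0 b_1} & a_{-1} b_1 \\
\vdots & & & & \vdots \\
a_{n}b_{n-1} & \underline{a_{n-1}b_{n-1}} & \cdots & a_{-n+2}b_{n-1} & a_{-n+1}b_{n-1}
\end{array}
\qquad (3)
$$

After generating the PPBM, the second stage is to accumulate these partial products. This can be formulated as:

$$A \times B = \sum_{i=0}^{2n-1} \sum_{j=0}^{n-1} a_{i-j}\, b_j\, 2^{i} \qquad (4)$$

The outer summation represents the columns of the PPBM and the inner one represents the rows. Each cycle, the FDSSM accepts one digit of A and one digit of B. Grouping each d consecutive bits of A into a digit, equation (4) can be written as:

$$A \times B = \sum_{i=0}^{(2n/d)-1} \sum_{j=0}^{n-1} \sum_{l=0}^{d-1} a_{id+l-j}\, b_j\, 2^{id+l} \qquad (5)$$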
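The column-wise accumulation of equation (4), together with the zero extension of equation (2), can be checked numerically. The following is a minimal Python sketch and not part of the proposed hardware; the function names `bits` and `ppbm_product` are chosen here for illustration only.

```python
def bits(value, width):
    """Return the bits of an unsigned integer, least significant first."""
    return [(value >> i) & 1 for i in range(width)]


def ppbm_product(a_val, b_val, n):
    """Accumulate the rectangular PPBM column by column, as in equation (4)."""
    a = bits(a_val, n)
    b = bits(b_val, n)

    def a_ext(i):
        # Zero extension of A per equation (2): bits outside 0..n-1 are 0.
        return a[i] if 0 <= i < n else 0

    total = 0
    for i in range(2 * n):                # columns of the PPBM
        column = sum(a_ext(i - j) * b[j]  # rows of the PPBM
                     for j in range(n))
        total += column << i              # column i has significance 2^i
    return total
```

For example, `ppbm_product(13, 11, 4)` accumulates the eight columns of a 4-bit by 4-bit PPBM and returns the same value as `13 * 11`.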

Grouping each d consecutive bits of B into a digit, the equation becomes:

$$A \times B = \sum_{i=0}^{(2n/d)-1} \sum_{j=0}^{(n/d)-1} \sum_{l=0}^{d-1} \sum_{k=0}^{d-1} a_{(i-j)d+(l-k)}\, b_{jd+k}\, 2^{id+l} \qquad (6)$$

Each iteration of the outer summation represents d consecutive columns (a column-set) of the PPBM of equation (3). Within the FDSSM, each of these iterations is processed in one cycle, starting with the least significant column-set and requiring a total of 2n/d cycles to complete the multiplication. The number of partial product bits generated per cycle is the product of the iteration counts of the three remaining inner summations of equation (6), $(n/d) \times d \times d = nd$ bits. The partial product digit count per cycle is the partial product bit count per cycle divided by the digit width d, that is, n digits. The partial product digits are then accumulated in the same cycle using a tree of 4-to-2 compressors of depth $(\log n) - 1$ levels and one normal adder.

3 The Proposed Multiplier

As mentioned in the previous section, the FDSSM runs over two stages per cycle: partial product digits generation and partial product digits accumulation. In this section, we propose an implementation for the FDSSM.

3.1 Partial Product Digits Generation

Generally speaking, the main idea is to divide multiplication into identical sets of operations that can be processed by a smaller processing unit over multiple cycles. This is in contrast to treating the multiplication as a single set of operations processed by a full-length processing unit in one cycle. This is shown in Figure 1, where the PPBM is divided into column-sets, each of width d. The processing unit in this case (the FDSSM) is responsible for generating and accumulating the partial product digits of one column-set per cycle, starting with the least significant column-set up to the most significant one.

The partial product digits generation per cycle is further divided within the FDSSM over n/d identical basic blocks ($BB_0, BB_1, \ldots, BB_{(n/d)-1}$), where each basic block is responsible for generating d digits. This is shown in Figure 2, where each basic block is responsible for a row-set of width d of the PPBM.
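The cycle-by-cycle behaviour implied by equation (6) can be sketched in software. The Python model below is only a behavioural illustration under simplifying assumptions: it keeps a single integer carry between cycles rather than the carry-save form a hardware accumulator would use, and the names `fdssm_model` and `digits_to_int` are hypothetical.

```python
def fdssm_model(a_val, b_val, n, d):
    """Return the 2n/d result digits, one per cycle, following equation (6)."""
    assert n % d == 0, "the operand width must be a multiple of the digit width"

    def a_bit(i):
        # Zero-extended bit of A (out-of-range indices contribute 0).
        return (a_val >> i) & 1 if 0 <= i < n else 0

    mask = (1 << d) - 1
    carry = 0
    digits = []
    for i in range(2 * n // d):          # one column-set per cycle
        acc = carry
        for j in range(n // d):          # one basic block per row-set
            for l in range(d):           # bit position inside the column-set
                for k in range(d):       # bit position inside the digit of B
                    b_bit = (b_val >> (j * d + k)) & 1
                    acc += (a_bit((i - j) * d + (l - k)) & b_bit) << l
        digits.append(acc & mask)        # result digit produced this cycle
        carry = acc >> d                 # carried into the next column-set
    return digits


def digits_to_int(digits, d):
    """Reassemble result digits (least significant first) into an integer."""
    return sum(digit << (i * d) for i, digit in enumerate(digits))
```

With n = 8 and d = 4 the model runs for 2n/d = 4 cycles, and `digits_to_int(fdssm_model(200, 123, 8, 4), 4)` equals `200 * 123`.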
The intersection of the column-set and the row-set defines the partial product digits generated at the relevant cycle by the relevant basic block. The basic blocks are shown in the upper part of Figure 2, and the lower part shows the detailed design of one basic block. Each basic block has three inputs and two outputs.

Figure 1: Partial Product Bits per Cycle per Basic Block

The first input of each block is the current cycle's digit of the multiplier operand B (indicated by 1 on Figure 2). The control signal of the basic block enables latching this digit within the block such that each block latches only the digit of B belonging to the row-set it is responsible for. The second input is one digit of the multiplicand operand A received from the previous block (indicated by 2 on Figure 2); we call it the current digit of A (with respect to the basic block). The third input is the previous digit of A (with respect to the basic block) received from the next block (indicated by 3 on Figure 2). In summary, digits of A are shifted along the basic blocks such that, at the next cycle, the current digit of A of a basic block becomes the current digit of A of the next basic block and the previous digit of A of the former basic block.

The need for the previous digit of A can be seen in equation (3), where there is a one-bit shift-left per row. This makes each column-set span two digits of A: the current digit of A and the previous digit of A. The feedback of the previous digit of A is needed to complement the shifted versions of the current digit of A. A register of width $(d-1)$ is used to latch and provide the previous digit of A for the last basic block. The outputs of the block are the generated partial product digits (indicated by 4 on Figure 2), fed to the accumulation stage, and the current digit of A (indicated by 5 on Figure 2), fed to both the next and the previous blocks. This way each basic block generates d partial product digits per cycle.

3.2 Partial Product Digits Accumulation

In this section we show how the generated partial product digits are accumulated.
As mentioned in section 3.1, the output of each basic block is d partial product digits. Here we use a tree of 4-to-2 compressors to accumulate the generated partial product digits, as shown in Figure 3.
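The behaviour of one digit-wide 4-to-2 compressor in such a tree can be illustrated with a small Python sketch. It assumes the common construction from two layers of full adders (carry-save adders), which uses 2d full adders for a d-bit digit; the paper's actual compressor circuit may differ in detail, and the names `csa`, `compressor_4_2` and `sum_is_preserved` are hypothetical.

```python
def csa(x, y, z):
    """One layer of bitwise full adders: x + y + z == s + 2 * c."""
    s = x ^ y ^ z
    c = (x & y) | (x & z) | (y & z)
    return s, c


def compressor_4_2(x1, x2, x3, x4, cin1, cin2, d):
    """Compress four d-bit digits and two carry-in bits into two d-bit digits
    and two carry-out bits (assumed two-layer full-adder construction)."""
    mask = (1 << d) - 1
    s1, c1 = csa(x1, x2, x3)            # first full-adder layer
    t = ((c1 << 1) & mask) | cin1       # shifted carries, carry-in at the LSB
    cout1 = c1 >> (d - 1)               # top carry leaves the digit
    s2, c2 = csa(s1, x4, t)             # second full-adder layer
    u = ((c2 << 1) & mask) | cin2
    cout2 = c2 >> (d - 1)
    return s2, u, cout1, cout2


def sum_is_preserved(x1, x2, x3, x4, cin1, cin2, d):
    """Check that the outputs carry the same total value as the inputs."""
    s, u, cout1, cout2 = compressor_4_2(x1, x2, x3, x4, cin1, cin2, d)
    return x1 + x2 + x3 + x4 + cin1 + cin2 == s + u + ((cout1 + cout2) << d)
```

In this sketch the two carry-out bits have the weight of the next digit, so they are the bits that would be latched and fed back as carry-ins in the following cycle.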

Figure 2: Partial Product Digits Generation Unit

Figure 3: Partial Product Digits Accumulation

Each of the 4-to-2 compressors takes 4 partial product digits and 2 carry-in bits as input and compresses them into 2 partial product digits and 2 carry-out bits. A 4-to-2 compressor possesses a delay of $2D_{FA}$, where $D_{FA}$ is one full adder delay. The 2 carry-out bits are latched and routed back as carry-in for the same compressor in the next cycle. The tree root is a normal adder that sums two partial product digits into one digit. The tree consists of $(n/2) - 1$ compressors and a normal adder. Each compressor consists of 2d full adders and the normal adder consists of d full adders, giving a total of $2d\,((n/2) - 1) + d = d\,(n-1)$ full adders. The tree latency can be given by $2D_{FA}\,((\log n) - 1) + D_{ADD}$, where $D_{FA}$ is the full adder delay and $D_{ADD}$ is the normal adder latency.

4 Experimental Results

In this section we compare the FDSSM with the multiplier proposed by Aggoun [1]. As mentioned before, their work is the only one in the literature that proposed a Digit-Serial-Serial multiplier. The work of Almiladi [2] is a slightly modified version of the work of Aggoun [1] that uses a different design methodology. Aggoun [1] proposed two implementations: the first uses one basic block per digit of the operand, and the second uses only half that number of basic blocks, buffers the most significant halves of the operands and reroutes them for processing after the least significant halves are processed. The second implementation is the one we compare to, since it occupies less chip area and possesses a lower area-time product.

We implemented the FDSSM and the compared-to multiplier of Aggoun [1] in VHDL. The implementation was done using Cadence Encounter Digital Implementation RTL Compiler (EDI9._ISR4_s273) on CentOS 6 - x368 using the NCSU-FreePDK45-.4 cell library and targeting a clock period of ps. The comparison was done with respect to area, time, and area-time product measures. All multiplier instances used equal-width operands (a constraint of Aggoun [1]). Simulation results are shown in Figures 4, 5, 6, 7, 8, and 9.

Figure 4: Area Comparison between FDSSM and Agg. [1]
In the experiments, we change the operand width n from 8 to 64 bits while changing the digit width d from 4 to 32 bits. In the figures, 8/4 means that the operand width is n = 8 bits while the digit width is d = 4.

Figure 5: Area Improvement of FDSSM Over Aggoun [1]

Figure 6: Time Comparison between FDSSM and Agg. [1]

Figure 7: Time Improvement of FDSSM Over Aggoun [1]

Figure 8: Achievable Running Frequency

Figure 4 shows how the area changes with different operand widths and digit widths. The area is measured in nanometers. Figure 5 shows the area improvement percentage of the FDSSM over Aggoun [1]. For example, for 32/8, the area is reduced in the FDSSM by about 6%. The excess area in Aggoun [1] results from operand buffering, where n buffers are used to buffer the most significant halves of the operands together with an input MUX of width d. The design of Aggoun [1] also buffers the previous digit of both operands per unit to complement the shifted partial product digits generated. Neither element is present in the FDSSM, which causes the area reduction.

Figure 6 and Figure 7 show the clock cycle time comparison between the FDSSM and Aggoun [1]. Theoretically, the FDSSM outperforms the work in Aggoun [1] in time for n/d ratios less than or equal to 4; that is, when operands are split into 4 digits or fewer. The time required for the FDSSM is given by $T_{FDSSM} = (2\log n - 2 + d)\,T_{FA}$, where $T_{FA}$ is the full adder time, whereas the time for the design of Aggoun [1] is given by $T_{Aggoun} = (2\log d + 2 + d)\,T_{FA}$. The condition $T_{FDSSM} \le T_{Aggoun}$ reduces to $\log(n/d) \le 2$; thus, $n/d$ must be at most 4 for $T_{FDSSM}$ to be less than or equal to $T_{Aggoun}$. This reflects on the time improvement as seen for cases like 32/4, 64/4, and 64/8 in Figure 7. The figure shows that the FDSSM possesses a time improvement of up to 4% for n/d ratios less than or equal to 4. In other cases, where this ratio is more than 4, the work of Aggoun [1] outperforms the FDSSM.
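The crossover at n/d = 4 follows directly from the two delay expressions and can be tabulated with a short Python sketch. It assumes base-2 logarithms and expresses delays in units of one full adder delay; the function names are illustrative only, and the numbers are the theoretical expressions above, not the synthesis results.

```python
from math import log2


def t_fdssm(n, d):
    """Theoretical FDSSM cycle time in full-adder delays: 2*log2(n) - 2 + d."""
    return 2 * log2(n) - 2 + d


def t_aggoun(d):
    """Theoretical cycle time of the compared design: 2*log2(d) + 2 + d."""
    return 2 * log2(d) + 2 + d


def fdssm_not_slower(n, d):
    """True when the FDSSM expression is less than or equal to the other one."""
    return t_fdssm(n, d) <= t_aggoun(d)


if __name__ == "__main__":
    for n, d in [(8, 4), (32, 8), (32, 4), (64, 8), (64, 4)]:
        print(f"{n}/{d}: FDSSM {t_fdssm(n, d):.0f}, "
              f"compared design {t_aggoun(d):.0f}, n/d = {n // d}")
```

For 32/8 both expressions evaluate to 16 full-adder delays, which is the boundary case n/d = 4; for 64/4 the FDSSM expression gives 14 against 10 for the compared design.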

Figure 9: Area-Time Product Improvement

Figure 8 shows the achievable circuit frequency. The FDSSM frequency ranges from about 55 MHz to 96 MHz, whereas the frequency of Aggoun [1] ranges from about 48 MHz to 82 MHz. The frequency graph is the reciprocal of the time graph; Aggoun [1] outperforms at 32/4, 64/4 and 64/8, whereas the FDSSM outperforms at the rest of the values.

Figure 9 shows the area-time product comparison between the FDSSM and Aggoun [1]. The area improvement compensates for the time lag in the 32/4 and 64/8 cases, and diminishes the time lag in the 64/4 case. It should be noted that Aggoun [1] imposes the constraint that the digit count must be even (due to using half the digit count of basic blocks), whereas the FDSSM does not impose this constraint; that is, it supports both even and odd digit counts, giving designers more flexibility in area-speed compromises.

5 Properties of the FDSSM

In this section we introduce two properties that the FDSSM possesses. First we introduce the property of a dynamic-width operand, then we show how the FDSSM can be modified to support two's complement multiplication.

5.1 Dynamic Width Operand

The proposed multiplier latches one operand's digit per cycle and shifts the other operand's digits along the basic blocks. This allows the second operand to be of any digit count (i.e., dynamic). This enables multiplication of a long operand by a fixed-width operand, which is common in cryptography applications. The number of cycles required to complete the multiplication is given by (n + m)/d, where n and m are the operand sizes, m > n. The number of basic blocks required in this case is equal to the digit count of n. That is, the area and time specifications of an FDSSM with one dynamic operand match those of an FDSSM with equal-size operands whose operand size equals that of the smaller operand. On the other hand, in Aggoun [1], both operands are latched one digit per cycle, which constrains the multiplier to fixed, equal-width operands.

Figure 10: Modifying the Multiplier to Support 2's Complement
Accordingly, the number of basic blocks required is equal to half the digit count of m, and n is extended to match m. To support a dynamic-width operand, another input signal is added to the FDSSM flagging the arrival of the last digit of m. This signal tells the FDSSM that, starting from the next cycle, zero digits must be injected instead of digits of m. The last result digit is produced n/d cycles after receiving the last digit of m.

5.2 Two's Complement Support

In this section we show how the FDSSM can be modified to support 2's complement operands. The 2's complement support applies the methodology proposed by Ienne and Viredaz [5]. As mentioned before, one of the operands, the latched operand, is considered the multiplier, and the second operand, the shifted operand, is considered the multiplicand. To support 2's complement operands, we deal with the multiplicand and the multiplier as follows.

For the shifted multiplicand operand, sign extension is performed. In the case of a dynamic-width multiplicand, an extra input is required to indicate the last digit so that sign extension starts in the next cycle. As for the latched multiplier operand, the basic block of the most significant digit is modified by replacing the AND gates connected to the most significant bit of the latched digit with XOR gates (highlighted in Figure 10). This has the effect of inverting the bits of the multiplicand digits if that bit (i.e., the sign bit) is 1. The first effective carry-in of the adder corresponding to the modified partial product digit is also set to 1 rather than 0 if that bit is 1, such that the multiplicand negation is complete (i.e., if the sign bit of the multiplier operand is 1, the multiplicand operand is subtracted rather than added). Figure 10 shows the modification. This is not applicable to Aggoun [1], since there the multiplier sign bit is processed at each of the basic blocks, whereas in the FDSSM it is processed only at the most significant digit's basic block.

5.3 Bit-Level Pipelining Support

A bit-level pipelined multiplier is a multiplier with a pipelined circuit such that the pipeline stage delay is independent of the digit width and the operand width(s). Multipliers with feed-backward lines cannot be bit-level pipelined, since the feedback will be out of sync due to pipelining. Multipliers that use normal digit adders also cannot be bit-level pipelined, since the pipeline stage delay will depend on the adder/digit width. In the FDSSM, bit-level pipelining can be achieved by latching the compressors' outputs from one level to the next. Figure 11 shows the modifications required and Figure 12 shows the 4-to-2 compressor circuit []. The final adder is replaced by a bit-level pipelined adder such as the one described in [7] and shown in Figure 13. This modification makes the pipeline stage delay equal to that of a D-FF, an AND gate and two FAs, independent of the digit width.

Figure 11: Bit-Level Pipelining Support

Figure 12: Structure of 4-To-2 Compressor []

Figure 13: 4-Bit Bit-Level-Pipelined Adder

6 Concluding Remarks

In this paper a new Digit-Serial-Serial multiplier is proposed that is efficient in both area and area-time product measures. The presented multiplier supports one dynamic-width operand while the other, shorter operand is fixed-width. The multiplier was shown to be able to perform 2's complement multiplication and to support bit-level pipelining. Simulation results showed that the proposed multiplier reduces the required area and area-time product significantly. An extension of this work could be the design of a Digit-Serial-Serial divider of similar organization or based on the proposed multiplier.
Another direction is to have a full Digit-Serial-Serial ALU as a building block in processors.

References

[1] Aggoun, A.; Farwan, A.F.; Ibrahim, M.K.; Ashur, A., "Radix-2^n serial-serial multipliers," IEE Proceedings - Circuits, Devices and Systems, vol. 151, no. 6, pp. 503-509, Dec. 2004.
[2] Almiladi, A., "A Novel Methodology for Designing Radix-2^n Serial-Serial Multipliers," Journal of Computer Science, 6 (4): 461-469, 2010.
[3] Dimitrov, V.S.; Jarvinen, K.U.; Adikari, J., "Area-Efficient Multipliers Based on Multiple-Radix Representations," IEEE Transactions on Computers, vol. 60, no. 2, pp. 189-201, Feb. 2011.
[4] Gnanasekaran, R., "On a Bit-Serial Input and Bit-Serial Output Multiplier," IEEE Transactions on Computers, vol. C-32, no. 9, pp. 878-880, Sept. 1983.
[5] Ienne, P.; Viredaz, M.A., "Bit-serial multipliers and squarers," IEEE Transactions on Computers, vol. 43, no. 12, pp. 1445-1450, Dec. 1994.
[6] Lamberti, F.; Andrikos, N.; Antelo, E.; Montuschi, P., "Reducing the Computation Time in (Short Bit-Width) Two's Complement Multipliers," IEEE Transactions on Computers, vol. 60, no. 2, pp. 148-156, Feb. 2011.
[7] Nibouche, C.; Nibouche, M., "On designing digit multipliers," 9th International Conference on Electronics, Circuits and Systems 2002, vol. 3, pp. 951-954, 2002.
[8] Nibouche, O.; Bouridane, A.; Nibouche, M., "New architectures for serial-serial multiplication," The 2001 IEEE International Symposium on Circuits and Systems, ISCAS 2001, vol. 2, pp. 705-708, 6-9 May 2001.
[9] Stelling, P.F.; Martel, C.U.; Oklobdzija, V.G.; Ravi, R., "Optimal circuits for parallel multipliers," IEEE Transactions on Computers, vol. 47, no. 3, pp. 273-285, Mar. 1998.
[10] Wu, C.W. and P.R. Cappello, 1989. "Block multipliers unify bit-level cellular multiplications," Int. J. Comp. Aid. VLSI Des., 1: 113-125.