OPTIMIZATION OF RNS FIR FILTERS FOR 6-INPUTS LUT BASED FPGAS

G.C. Cardarilli, M. Re, A. Salsano
University of Rome Tor Vergata, Department of Electronic Engineering
Via del Politecnico 1 / 00133 / Rome / ITALY
{marco.re, g.cardarilli}@ieee.org, salsano@ing.uniroma2.it

S. Pontarelli
(ASI) Italian Space Agency
Viale Liegi 26, 00198 Rome, ITALY
pontarelli@ing.uniroma2.it

ABSTRACT

In this paper, optimized Residue Number System (RNS) arithmetic blocks that better exploit some of the architectural characteristics of the last generation of FPGAs are presented. The implementation of modulo m adders, constant and general multipliers, and input and output converters is described. These architectures are based on moduli sets chosen in order to optimally use the six-input Look-Up Tables (LUTs) available in the Configurable Logic Blocks (CLBs) of the new generation of FPGAs. Experiments based on the implementation of Finite Impulse Response (FIR) filters characterized by different numbers of taps and wordlengths show that the RNS, used together with suitable moduli sets, optimally fits the six-input LUTs of the last-generation FPGA architectures.

1. INTRODUCTION

The trend in silicon integrated circuits is characterized by a steady reduction in feature size combined with a steady rise in density and speed, as shown in [1]. In the last twenty years FPGAs have evolved rapidly in terms of complexity and architecture, starting from the first FPGA, the Xilinx XC2064 chip with its 1,000 gates of complexity [2], to the newest generations. The major evolution was related to the structure of the interconnect, the topology of the basic cell (LE, i.e. the Logic Element), and the introduction of full custom processing elements such as multipliers, hardware processor cores, MAC units, and very high speed serial I/O blocks. One of the latest innovations in FPGA architecture has been the introduction of 6-input LUTs as the main block for the implementation of combinatorial functions [4], [3].
Moreover, changes in the FPGA architecture require changes in the synthesis algorithms in order to guarantee an optimum mapping on the available resources. In this paper, an RNS representation based on suitable moduli sets is used to optimally implement the basic arithmetic operators by using six-input LUTs. In particular, it is shown that moduli represented by five bits offer the best results in terms of used resources and delay. For this reason the moduli set used for the synthesis experiments is composed of five-bit moduli, and the largest modulus has been chosen as a power of two. In this way dynamic ranges of up to 34 bits are obtained.

The paper is organized as follows: in Section II a background on RNS arithmetic is given. In Section III the architectures and performance of 6-input LUT based implementations of modulo m arithmetic operators such as adders, constant multipliers and general multipliers are discussed. Section IV illustrates the implementation of the input and output converters, while in Section V a set of experiments based on the implementation of FIR filters is shown, discussing the obtained area and speed results. Conclusions are drawn in Section VI.

2. BACKGROUND ON RESIDUE NUMBER SYSTEM

A Residue Number System (RNS) is defined by a set of relatively prime integers $\{m_1, m_2, \ldots, m_P\}$. The dynamic range of the system is given by the product of the moduli $m_i$

$$M = \prod_{i=1}^{P} m_i$$

Any integer $X \in [0, M-1]$ has a unique RNS representation given by

$$X \;\xrightarrow{RNS}\; (\langle X \rangle_{m_1}, \langle X \rangle_{m_2}, \ldots, \langle X \rangle_{m_P}) \qquad (1)$$

where $\langle X \rangle_{m_i} = X \bmod m_i$. A comprehensive description of the RNS theory and its application to computer systems can be found in [6], [7], and [8]. In the RNS representation, operations such as addition and multiplication are executed in parallel on the different moduli

$$Z = X \;\mathrm{op}\; Y \;\xrightarrow{RNS}\; \begin{cases} Z_{m_1} = \langle X_{m_1} \;\mathrm{op}\; Y_{m_1} \rangle_{m_1} \\ \quad \vdots \\ Z_{m_P} = \langle X_{m_P} \;\mathrm{op}\; Y_{m_P} \rangle_{m_P} \end{cases} \qquad (2)$$

where eq. (2) is valid if the final result, prior to the conversion to the two's complement representation (TCS), belongs to the range $[0, M-1]$. The conversion of $Z$ to TCS is accomplished by the Chinese Remainder Theorem (CRT). Clearly, the conversions from the binary representation to RNS, and vice versa, constitute an overhead for systems based on the RNS representation. However, efficient methods to perform those conversions have been presented in [9], [10], and [11]. The input conversion is obtained by the reduction modulo $m_i$ of the input samples $x(n)$, providing the residue digits $x_{m_i}$. The mod $m_i$ RNS filters compute the residues $y_{m_i}$ defined in eq. (4), while the output conversion based on the CRT computes back $y(n)$.

3. MODULO OPERATIONS BASED ON SIX INPUTS LUTS

In FPGAs, the LEs are based on LUTs and, in particular, the last-generation FPGAs are characterized by LEs containing six-input LUTs (useful to implement six-input, one-output combinatorial functions) that can also be configured as double five-input LUTs (useful to implement five-input, double-output combinatorial functions) [4], [5]. Consequently, in this paper the moduli set is chosen such that the moduli belong to the interval [17, 64]. Moreover, the moduli must be coprime, and it is usually convenient to use a power of two (such as $2^n$ with $n = 5$ or $n = 6$) as the largest modulus. In the rest of the paper the following arithmetic blocks are analyzed:

1. Modulo m adders
2. Constant multipliers (constant coefficients FIR filters)
3. General multipliers (variable coefficients FIR filters)

3.1. Modulo m adders

If a modulus m is chosen such that $2^{n-1} < m \le 2^n$, the results generated by operations mod m are in the range $[0, 2^n - 1]$ and therefore n bits are used to represent the results. A fast architecture can be used to implement the addition modulo m, as shown in Fig. 1. It is composed of a two-operand adder, a three-operand adder and a multiplexer.
The two-operand n-bit adder computes $S1 = X + Y$, the three-input n-bit adder computes $S2 = X + Y - m$, and the 2:1 multiplexer selects S1 or S2 depending on the carry out of S2.

Fig. 1. The parallel modulo m adder.

In this paper a different architecture to compute $\langle X + Y \rangle_m$ is presented, obtaining a delay comparable to that of the parallel architecture while using fewer resources. The architecture is shown in Fig. 2, where X and Y are added obtaining S. This value is used to address a ROM (based on six-input LUTs) containing $\langle S \rangle_m$.

Fig. 2. The ROM based modulo m adder.

For a 5-bit modulus, the size of the ROM is $2^6 \times 5$, corresponding to 5 six-input LUTs, while for a 6-bit modulus the size of the ROM is $2^7 \times 6$, corresponding to 12 six-input LUTs. The growth of the ROM size is exponential, but for m up to 64 this structure is still slightly more convenient than the parallel implementation, as shown in Table 1. This table shows the synthesis results in terms of number of LUTs and delay for different values of five- and six-bit moduli, in comparison with the parallel implementation.

        Parallel mod. adder      ROM based mod. adder
  m     delay (ns)    #LUT       delay (ns)    #LUT
  19    1.59          15         1.53          10
  31    1.62          15         1.55          10
  35    1.77          18         1.77          18
  63    1.71          16         1.62          15

Table 1. Area and delay of parallel and ROM based modulo adders implemented on a Xilinx Virtex V FPGA

In the case of five-bit moduli the ROM based implementation gives 33% resource savings while maintaining the same delay, whereas for six-bit moduli the results in terms of used resources and delay are similar and there is no advantage.

3.2. Modulo m Multipliers: variable coefficients, constant coefficients

In this section, constant-coefficient and general multipliers are analyzed.
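Before turning to the multipliers, the two adder structures of Section 3.1 can be sketched in software. The following is only a behavioural model under stated assumptions, not the synthesized architecture: the function names are hypothetical, the carry out of S2 is modelled as "S2 is non-negative", and the LUT count assumes that one six-input LUT stores exactly $2^6$ bits of one ROM output column.

```python
def parallel_mod_add(x, y, m):
    """Parallel scheme: compute S1 and S2 side by side, then select one."""
    s1 = x + y          # two-operand adder
    s2 = x + y - m      # three-operand adder (adds the two's complement of m)
    # In hardware the multiplexer is driven by the carry out of S2;
    # behaviourally this is the same as testing S2 >= 0.
    return s2 if s2 >= 0 else s1


def build_adder_rom(m, n):
    """ROM scheme: table addressed by the (n+1)-bit sum S, holding <S>_m."""
    assert (1 << (n - 1)) < m <= (1 << n), "m must need exactly n bits"
    return [s % m for s in range(1 << (n + 1))]


def rom_mod_add(x, y, rom):
    """Plain binary addition followed by one ROM look-up."""
    return rom[x + y]


def lut6_count(n):
    """Six-input LUTs needed for a 2^(n+1) x n ROM (assumes 64 bits per LUT)."""
    return (2 ** (n + 1) * n) // 64
```

With these assumptions `lut6_count(5)` gives 5 and `lut6_count(6)` gives 12, the ROM sizes quoted in the text. The additional LUTs reported in Table 1 belong to the binary adder that produces S.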

1. Modulo m constant multipliers. They are used to implement RNS FIR filters with constant coefficients. If n is the number of bits used to represent m, $\langle K \cdot X \rangle_m$ requires n output bits. If $n = 6$, it can be implemented by using a $2^6 \times 6$ ROM that, in the case of a Xilinx Virtex V FPGA, is implemented by using 6 six-input LUTs with a critical path of about 0.8 ns.

2. General multipliers. In this case, m being a prime number (6 bit), the isomorphism technique [6] can be used to perform the multiplication. This technique is based on the algebraic properties of the structure composed of the modulo m addition and multiplication and the numbers in the interval $[0, m-1]$. In fact this ring is a finite field and therefore
(a) each element different from zero has a multiplicative inverse;
(b) there exists an element of the field, called $\alpha$, such that $\forall x \in [1, m-1] \ \exists i : \langle \alpha^i \rangle_m = x$ and $\langle \alpha^m \rangle_m = \alpha$.

The modulo m multiplication of two numbers becomes $\langle x \cdot y \rangle_m = \langle \alpha^i \alpha^j \rangle_m = \langle \alpha^{i+j} \rangle_m$. Because $\langle \alpha^m \rangle_m = \alpha$, the addition $i + j$ is performed modulo $m - 1$. The architecture of the isomorphic multiplier is shown in Fig. 3.

Fig. 3. The modulo m multiplier based on the isomorphism technique

The blocks named Log (based on LUTs) perform the association between the values X and Y and the corresponding indexes i and j, while the block $\alpha^k$ performs the inverse association between the result $\langle i + j \rangle_{m-1}$ and the value $\alpha^k$. Some additional logic resolves the case in which one or both of the operands are zero. The modulo $m - 1$ adder can be implemented by either the parallel or the ROM based modulo adder. If the ROM based modulo adder is used, two ROMs are needed, the first performing the operation $\langle \cdot \rangle_{m-1}$ and the second performing the inverse isomorphism, as shown in Fig. 4. The two ROMs can be combined into a single ROM performing both operations.

Fig. 4. The isomorphic based modulo m multiplier with ROM based modulo addition

This implementation requires about 30 LUTs, with a maximum delay of 3.04 ns.
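The isomorphic multiplier with the two ROMs merged into one can be illustrated with a short behavioural sketch. It assumes m is an odd prime; the helper names (`primitive_root`, `build_tables`, `iso_mul`) are hypothetical, the generator $\alpha$ is found by brute-force search, and the Log blocks are modelled as a dictionary instead of LUT-based ROMs.

```python
def primitive_root(m):
    """Smallest alpha whose powers cover every element of [1, m-1]."""
    for g in range(2, m):
        if len({pow(g, k, m) for k in range(m - 1)}) == m - 1:
            return g
    raise ValueError("no generator found; m is assumed to be an odd prime")


def build_tables(m):
    alpha = primitive_root(m)
    # Log block: maps a non-zero residue x to the index i with alpha^i = x.
    log = {pow(alpha, i, m): i for i in range(m - 1)}
    # Combined ROM: addressed by the plain sum i + j (at most 2m - 4),
    # it performs the mod (m - 1) reduction and the inverse isomorphism.
    rom = [pow(alpha, s % (m - 1), m) for s in range(2 * m - 3)]
    return log, rom


def iso_mul(x, y, log, rom):
    """<x * y>_m via index addition; zero operands are handled separately."""
    if x == 0 or y == 0:
        return 0
    return rom[log[x] + log[y]]
```

For example, with `log, rom = build_tables(31)` the call `iso_mul(x, y, log, rom)` agrees with `(x * y) % 31` for every pair of residues.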
By contrast, the implementation based on the parallel modulo adder requires 36 LUTs with a maximum delay of 3.78 ns. Therefore, by embedding the two ROMs in a single ROM the architecture is about 20% faster and shows about 15% resource savings.

4. FIR FILTER IMPLEMENTATION

An N-tap FIR filter is described by

$$y(n) = \sum_{k=0}^{N-1} h_k \, x(n-k) \qquad (3)$$

Its fixed-point implementation, in transposed or direct form, is obtained by using multipliers, adders and registers. In particular, in parallel implementations, the reduction of the used resources is usually accomplished by truncating the multiplier outputs. The number of truncated bits is the result of a fixed-point optimization phase that is based on a trade-off between resource savings and worsening of the signal-to-noise ratio. The implementation of RNS FIR filters is a direct consequence of eq. (2), and eq. (3) becomes

$$\begin{cases} \langle y(n) \rangle_{m_1} = \left\langle \sum_{k=0}^{N-1} \langle h_k \rangle_{m_1} \langle x(n-k) \rangle_{m_1} \right\rangle_{m_1} \\ \quad \vdots \\ \langle y(n) \rangle_{m_P} = \left\langle \sum_{k=0}^{N-1} \langle h_k \rangle_{m_P} \langle x(n-k) \rangle_{m_P} \right\rangle_{m_P} \end{cases} \qquad (4)$$

The filter is implemented in RNS by decomposing it into P FIR filters working in parallel, as sketched in Fig. 5 (P = 3).

4.1. Modulo m_i filters

The architecture of the mod $m_i$ filters (based on eq. (4)) is depicted in Fig. 6, where the shaded area is the basic building block of the filter (the mod $m_i$ tap).
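The decomposition of eq. (4) can be checked with a small software model: input conversion by modulo reduction, P independent modulo filters, and output conversion by the CRT. This is a sketch under stated assumptions, not the hardware converters of [9]-[11]: samples and coefficients are taken as unsigned so that every output stays inside $[0, M-1]$, the moduli set is the FIR1 entry of Table 2, and all function names are hypothetical. The modular inverse uses `pow(a, -1, m)`, which needs Python 3.8 or later.

```python
from math import prod

MODULI = (64, 31, 29, 23)  # FIR1 moduli set from Table 2


def to_rns(x, moduli):
    """Input conversion: one residue per modulus."""
    return [x % m for m in moduli]


def crt(residues, moduli):
    """Output conversion by the Chinese Remainder Theorem."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # pow(..., -1, m): inverse mod m
    return total % M


def fir(h, x, modulus=None):
    """Direct-form FIR, eq. (3); reduced mod `modulus` when one is given."""
    y = []
    for n in range(len(x)):
        acc = sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        y.append(acc if modulus is None else acc % modulus)
    return y


def rns_fir(h, x, moduli):
    """Eq. (4): P modulo filters in parallel, then CRT on every sample."""
    channels = [fir([c % m for c in h], [s % m for s in x], m) for m in moduli]
    return [crt(res, moduli) for res in zip(*channels)]
```

With 8-bit unsigned data and 16 taps the largest possible output is 16 x 255 x 255 = 1,040,400, which is below M = 64 x 31 x 29 x 23 = 1,323,328, so `rns_fir` returns the same samples as the plain integer `fir`.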

Fig. 5. RNS implementation of a FIR filter

Fig. 6. Architecture of a modulo m_i FIR filter

The filter tap computes the following equation

$$s_{out}(j) = \langle x_{m_i} h_j + s_{in} \rangle_{m_i} \qquad (5)$$

where $s_{in} = s_{out}(j-1)$ and $m_i \le 2^n$. Also in this case, the filter tap has been optimized by using a method similar to that used for the general multiplier presented in the previous section. Moreover, in the following, the analysis is restricted to prime moduli in order to make the use of the isomorphism technique possible.

For constant-coefficient filters, the filter tap (Fig. 6) requires a ROM and a modular adder that can be either a parallel or a ROM based adder. Equation (5) can be rewritten as

$$s_{out}(j) = \langle h_j (x_{m_i} + h_j^{-1} s_{in}) \rangle_{m_i} = \langle h_j \, \tilde{s}_{out} \rangle_{m_i} \qquad (6)$$

where $\tilde{s}_{out} = \langle x_{m_i} + h_j^{-1} s_{in} \rangle_{m_i}$ and $h_j^{-1}$ is the multiplicative inverse of $h_j$ mod $m_i$. For consecutive slices the filter coefficients $h_{j+1}^{-1}$ and $h_j$ can be combined as

$$s_{out}(j+1) = \langle h_{j+1} (x_{m_i} + h_{j+1}^{-1} s_{out}(j)) \rangle_{m_i} = \langle h_{j+1} (x_{m_i} + h_{j+1}^{-1} h_j (x_{m_i} + h_j^{-1} s_{out}(j-1))) \rangle_{m_i} = \langle h_{j+1} (x_{m_i} + \tilde{h}_j \, \tilde{s}_{out}(j)) \rangle_{m_i} \qquad (7)$$

where $\tilde{h}_j = \langle h_{j+1}^{-1} h_j \rangle_{m_i}$. In this way, for the intermediate slices the tap can be implemented as depicted in Fig. 7.

Fig. 7. Optimized slice of a modulo m_i FIR filter with constant coefficients

The operation $\langle \tilde{h}_j (x_{m_i} + s_j) \rangle_{m_i}$, where $\tilde{h}_j$ is a constant factor, is implemented by using a ROM based modular adder. The ROM stores the precomputed values of $\langle \tilde{h}_j (x_{m_i} + s_j) \rangle_{m_i}$. For a 5-bit modulus the resource usage is 10 LUTs and the delay is about 1.5 ns, while for a 6-bit modulus the resource usage is around 16 LUTs and the delay is about 1.7 ns.

In Fig. 8 the architecture of a tap in the case of a variable-coefficient filter is shown.
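The constant-coefficient rewriting of eqs. (6) and (7) can be verified numerically. The sketch below is a behavioural model with hypothetical names: `plain_chain` applies eq. (5) slice by slice, while `folded_chain` gives every slice a single ROM addressed by the plain sum of the input residue and the registered value. It assumes a prime modulus and coefficients $h_j$ that are non-zero mod m for $j \ge 1$, so that the inverses exist; otherwise `pow(h, -1, m)` raises `ValueError`. The last slice keeps its own coefficient so that the chain output is unchanged.

```python
def plain_chain(h, x, m):
    """Reference: slice j computes s_j(n) = <h_j x(n) + s_{j-1}(n-1)>_m."""
    regs = [0] * len(h)                 # regs[j] holds s_j(n-1)
    out = []
    for xn in x:
        regs = [(hj * xn + (regs[j - 1] if j > 0 else 0)) % m
                for j, hj in enumerate(h)]
        out.append(regs[-1])
    return out


def folded_chain(h, x, m):
    """Optimized slices: one ROM per slice storing <h~_j (x + s)>_m."""
    N = len(h)
    # h~_j = <h_{j+1}^{-1} h_j>_m for inner slices; last slice keeps h_{N-1}.
    ht = [(pow(h[j + 1], -1, m) * h[j]) % m for j in range(N - 1)]
    ht.append(h[-1] % m)
    # The ROM address is the unreduced sum x + s, which is at most 2m - 2.
    roms = [[(c * a) % m for a in range(2 * m - 1)] for c in ht]
    regs = [0] * N
    out = []
    for xn in x:                        # xn must already be a residue mod m
        regs = [roms[j][xn + (regs[j - 1] if j > 0 else 0)]
                for j in range(N)]
        out.append(regs[-1])
    return out
```

Both chains produce the same output sequence, which shows that the multiplication and the modulo reduction of each tap collapse into a single look-up after the binary adder.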
Fig. 8. Optimized architecture of a slice for a modulo m_i variable-coefficient FIR filter

The Log operators are implemented by $2^n$-entry ROMs, the E operator is a $2^{n+1}$-entry ROM performing modulo reduction and exponentiation, the $\langle \cdot \rangle_{m_i}$ operator is ROM based and the adders are n-bit adders, while the critical path is composed of two adders and three ROMs. The first optimization consists in sharing the Log operator on the input, which is the same for all the slices composing the modulo $m_i$ filter. The second optimization is obtained by balancing the paths of the slices, moving the ROM implementing the $\langle \cdot \rangle_{m_i}$ operator after the delay element. In this way the critical path is reduced to two ROMs and two adders.

5. FIR FILTERS EXPERIMENTS

In this section a set of experiments for the characterization of FIR filters is described. Two cases have been selected, 8 bits and 12 bits for both the coefficients and the input samples, while the number of filter taps varies from 16 to 256. The analysis has been restricted to constant-coefficient filters, but it can be easily extended to variable-coefficient filters. The set of moduli is composed of a power-of-two modulus ($2^n$, with n up to 9), while the remaining moduli are prime numbers that can be represented by 5 bits. In Table 2 the set of synthesized filters is shown. For dynamic ranges up to 23 bits 4 moduli have been used, while for the biggest dynamic range (32 bits) 7 moduli are required.

  FIR     Input/Coeff (bits)   N. taps   M (bits)   Moduli set
  FIR1    8                    16        20         64,31,29,23
  FIR2    8                    32        21         128,31,29,23
  FIR3    8                    64        22         256,31,29,23
  FIR4    8                    128       23         512,31,29,23
  FIR5    8                    256       24         64,31,29,23,19
  FIR6    12                   16        28         64,31,29,23,19,17
  FIR7    12                   32        29         128,31,29,23,19,17
  FIR8    12                   64        30         256,31,29,23,19,17
  FIR9    12                   128       31         512,31,29,23,19,17
  FIR10   12                   256       32         64,31,29,23,19,17,13

Table 2. Description of the set of FIR filters synthesis experiments

In Table 3 the results in terms of resources and speed performance for the set of synthesized filters are listed. The maximum frequency for the 8-bit filters (from FIR1 to FIR5) is bounded by the maximum operating frequency of the filter tap (about 435 MHz), while for the 12-bit filters (from FIR6 to FIR10) the maximum working frequency of the filter is limited by the input converter speed (300 MHz).

  FIR     Max. freq.   Taps      In converter   Out converter   Total resources
          (MHz)        (#LUTs)   (#LUTs)        (#LUTs)         (#LUTs)
  FIR1    400          592       30             182             804
  FIR2    400          1248      30             182             1460
  FIR3    400          2688      30             182             2900
  FIR4    400          6144      30             182             6356
  FIR5    400          12032     40             224             12296
  FIR6    303          912       70             270             1252
  FIR7    303          1888      70             270             2228
  FIR8    303          3968      70             270             4308
  FIR9    303          8704      70             270             9044
  FIR10   303          17152     84             309             17545

Table 3. Resource usage and speed for the experiments

In this table, the resources for the implementation of the input and output converters have also been evaluated, showing that their share becomes less than 10% of the total for N > 64 (see FIR3 and FIR8). Finally, the results of the synthesis of the RNS filters have been compared to a TCS implementation (no truncation). As indicated in Section 4, truncation is usually used in order to limit the resources in TCS filters, but it has been shown in the literature [15] that truncation does not offset the advantages of an RNS implementation. Moreover, the RNS representation is often used to design filters with error detection and correction capabilities ([13], [14]). If truncation is used, error detection techniques cannot be used. The results are presented in Table 4. The resource savings obtained by using RNS are always greater than 30% when the dynamic range of the input data is 12 bits, while in the case of 8 bits the advantage depends on the number of taps. For FIR1 there are no savings but a small increase in resource usage, due to the overhead of the conversion blocks, whereas savings of up to 20% are obtained for the FIR5 and FIR3 experiments.

  Exp. name   Canonical (#LUTs)   RNS (#LUTs)   Saving (%)
  FIR1        788                 804           -2
  FIR2        1800                1460          18
  FIR3        3632                2900          20
  FIR4        6966                6356          8
  FIR5        15203               12296         19
  FIR6        1899                1252          34
  FIR7        3338                2228          33
  FIR8        6555                4308          34
  FIR9        14043               9044          35
  FIR10       29234               17545         40

Table 4. Comparison of RNS and TCS filters

The experimental results show that the presented techniques offer interesting advantages for FIR filters characterized by a high dynamic range and a high number of taps, especially when full custom multipliers are not available in the target FPGA architecture or when they must be used for different purposes.

6. CONCLUSION

The optimization of Residue Number System (RNS) arithmetic to better exploit some of the architectural characteristics of the last generation of FPGAs has been presented. Using an approach based on ROM modular adders, different optimization techniques for the basic modular operations and for the basic blocks of RNS filters have been discussed. The choice of 5-bit moduli makes it possible to implement high speed, low resource occupation RNS filters, as shown in the set of experiments discussed in the paper.

7. REFERENCES

[1] 2007 International Technology Roadmap for Semiconductors, http://public.itrs.net/.
[2] http://www.xilinx.com/company/history.htm#begin
[3] http://www.altera.com/products/devices/stratix3/st3-index.jsp
[4] Virtex-5 Family Overview: LX, LXT, and SXT Platforms.
[5] Logic Array Blocks and Adaptive Logic Modules in Stratix III Devices, chapter in volume 1 of the Stratix III Device Handbook.
[6] I. Vinogradov, An Introduction to the Theory of Numbers. New York: Pergamon Press, 1955.
[7] N. Szabo and R. Tanaka, Residue Arithmetic and its Applications in Computer Technology. New York: McGraw-Hill, 1967.
[8] M. Soderstrand, W. Jenkins, G. A. Jullien, and F. J. Taylor, Residue Number System Arithmetic: Modern Applications in Digital Signal Processing. New York: IEEE Press, 1986.

[9] T. V. Vu, Efficient implementation of the Chinese remainder theorem for sign detection and residue decoding, IEEE Trans. Circuits Systems-I, vol. 45, pp. 667-669, June 1985.
[10] S. Piestrak, A high-speed realization of a residue to binary number system converter, IEEE Trans. Circuits Systems-II: Analog and Digital Signal Processing, vol. 42, pp. 661-663, Oct. 1995.
[11] G. Cardarilli, M. Re, and R. Lojacono, A residue to binary conversion algorithm for signed numbers, European Conference on Circuit Theory and Design (ECCTD97), vol. 3, pp. 1456-1459, 1997.
[12] S. Bandyopadhyay, G. A. Jullien, A. Sengupta, A Systolic Array for Fault
[13] Mark H. Etzel and W. K. Jenkins, Redundant Residue Number Systems for Error Detection and Correction in Digital Filters, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-28, No. 5, pp. 538-544, October 1980.
[14] S. Pontarelli, G. C. Cardarilli, M. Re, A. Salsano, Totally Fault Tolerant RNS based FIR Filters, to be published in IEEE International On-Line Testing Symposium 2008.
[15] A. Nannarelli, M. Re, G. C. Cardarilli, Tradeoffs between Residue Number System and Traditional FIR Filters, IEEE International Symposium on Circuits and Systems, ISCAS 2001, Vol. II, pp. 305-308, Sydney (Australia), May 6-9, 2001.