Design and Implementation of Block Based Transpose Form FIR Filter

O. Venkata Krishna 1, Dr. C. Venkata Narasimhulu 2, Dr. K. Satya Prasad 3
1 (ECE, CVR College of Engineering, Hyderabad, India)
2 (ECE, Geethanjali College of Engineering and Technology, Hyderabad, India)
3 (ECE, KL Deemed to be University, Vaddeshwaram, Guntur, India)

ABSTRACT
The transpose form configuration of the finite impulse response (FIR) filter does not directly support block based processing. In this paper, a block based transpose form FIR filter architecture is optimized and implemented. The basic Data Flow Graph (DFG) of the transpose form FIR filter is converted into a block based DFG, and retiming is applied to the DFG for low power consumption, reduced area and minimal delay. A generalized mathematical formulation is derived for the retimed block based transpose form FIR filter, and it is implemented with a block size of 4 for a filter length of 16 using Verilog Hardware Description Language (HDL). It is then synthesized using the CADENCE RTL Compiler with the TSMC 45 nm CMOS library, and power, area and delay reports are generated. The results are compared with a few existing structures.

Keywords: Digital filters, Data Flow Graph, Retiming, low power, FIR, and HDL.

1. INTRODUCTION
Digital Signal Processing (DSP) systems are being implemented on Field Programmable Gate Arrays (FPGAs) and Application Specific Integrated Circuits (ASICs), due to the reconfiguration and flexibility of FPGAs. The FPGA platform is more suitable for optimizing DSP systems in terms of area, power and delay. Digital filters are widely used in DSP applications [1], such as biomedical applications, communication systems and mobile applications. For these applications, the digital filter must consume little power, occupy a small area and operate at high speed. FIR filters can be implemented in different architectures, such as the direct form structure, the transpose form structure and hybrid structures. Several FIR architectures have been implemented in different styles to meet the specifications.
For example, Mahesh et al [2] implemented an FIR filter using the programmable shift method (PSM) and the constant shift method (CSM) [8][11]. Park [3] also implemented an FIR filter based on a distributed arithmetic structure in direct form and transpose form structures, but without any block based concept in the transpose form structure. Mohanty et al [4] proposed block based structures and filter banks, which are not suitable for higher order filter lengths and are applicable to 2-Dimensional (2D) filters. Mohanty also proposed the reconfigurable block based transpose form filter [5] and the fixed length transpose form FIR filter for DSP applications [6].

The most preferred architectures of FIR filters in signal processing are transpose form structures. The transpose form FIR filter is inherently pipelined. Pipelining in digital filter design shortens the critical path (delay), reduces power consumption and increases the clock speed.
In this paper, the block based transpose form FIR filter is realized and mathematically formulated for reconfiguration applications. The DFG of the transpose form FIR filter is converted and modified to reduce the power consumption, area and delay. Section 2 presents the computational analysis of the transpose form FIR filter using the DFG and the data flow table (DFT). Section 3 presents the mathematical formulation. Section 4 describes the realization of the hardware structure and the implementation approach of the proposed FIR filter. Section 5 presents the synthesized circuits and simulation diagrams, and compares the results with existing structures in terms of Very Large Scale Integration (VLSI) design metrics such as area, power and delay.

2. ANALYSIS OF BLOCK BASED TRANSPOSE FORM FIR FILTER
Digital filters are designed to modify the frequency response properties of a given discrete input signal x(n) to meet some design requirements. A digital filter is characterized by its transfer function, its frequency response or its unit sample response h(n). Equation (1) represents the general difference equation of a digital filter.

$$y(n) = -\sum_{k=1}^{N-1} a_k\, y(n-k) + \sum_{k=0}^{N-1} b_k\, x(n-k) \tag{1}$$

If $a_k = 0$ for $1 \le k \le N-1$, the above equation becomes

$$y(n) = \sum_{k=0}^{N-1} b_k\, x(n-k) \tag{2}$$

Equation (2) is an N tap finite impulse response filter with unit sample response $h(k) = b_k$ for $0 \le k \le N-1$ and $h(k) = 0$ otherwise. In the z-transform domain, the above equation is expressed as

$$H(z) = \frac{Y(z)}{X(z)} = \sum_{k=0}^{N-1} b_k\, z^{-k}$$

For the computational analysis, the DFGs are drawn in the transpose form for the filter length N = 8, as shown in the figures. Figure 1 represents the DFG for the input x(n) and output y(n), and Figure 2 describes the DFG for the input x(n-1) and output y(n-1).

Figure 1: DFG of transpose form FIR filter for the length of 8 for output y(n) to input x(n).
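The transpose form computation behind Equation (2) and the DFG of Figure 1 can be sketched in a few lines of Python. This is only an illustrative software model, not the paper's Verilog design: the function names `fir_direct` and `fir_transpose`, the zero initial state and the integer test data are assumptions made for the example. It shows that broadcasting the current sample to all coefficient multipliers and accumulating the products through a chain of N-1 registers gives the same output as the direct sum.

```python
def fir_direct(b, x):
    """Direct evaluation of y(n) = sum_k b[k] * x(n - k), with x(n) = 0 for n < 0."""
    N = len(b)
    return [sum(b[k] * x[n - k] for k in range(N) if n - k >= 0)
            for n in range(len(x))]


def fir_transpose(b, x):
    """Transpose form: every coefficient multiplies the current sample and the
    products are accumulated through a chain of N-1 registers (assumed zero at start)."""
    N = len(b)
    regs = [0] * (N - 1)              # regs[i] feeds the adder of tap i
    y = []
    for sample in x:
        products = [bk * sample for bk in b]
        # Output adder: tap-0 product plus the content of the first register.
        y.append(products[0] + (regs[0] if N > 1 else 0))
        # Clock edge: register i loads the tap-(i+1) product plus the next register.
        regs = [products[i + 1] + (regs[i + 1] if i + 1 < N - 1 else 0)
                for i in range(N - 1)]
    return y


if __name__ == "__main__":
    b = [1, 2, 3, 4, 5, 6, 7, 8]      # hypothetical coefficients, N = 8
    x = list(range(1, 21))            # hypothetical input samples
    print(fir_transpose(b, x) == fir_direct(b, x))
```

Because each register sits between two adders, the longest combinational path in this model is one multiplier and one adder, which is the inherent pipelining property attributed to the transpose form.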
Figure 2: DFG of transpose form FIR filter for the length of 8 for output y(n-1) to input x(n-1).

For DFG1 and DFG2, the products of the coefficients with the input values and the corresponding accumulation paths are shown in the data flow tables DFT1 and DFT2 of Figure 3. The accumulation paths of the product values are indicated by arrows in DFT1 and DFT2. From DFT1 and DFT2, we observe that five values in each column of the data flow tables are the same.

Figure 3: DFT1 for output y(n) to input x(n) with respect to DFG1, and DFT2 for output y(n-1) to input x(n-1) with respect to DFG2.

This indicates high redundancy in the normal transpose form FIR filter. The redundancy for two consecutive inputs can be reduced by introducing block based inputs. Here, non-overlapped input blocks are used. Two modified data flow tables, DFT3 and DFT4, are presented in Figure 4; they correspond to non-overlapped input blocks and avoid the redundancy of the normal transpose form FIR filter.
Figure 4: The modified data flow tables DFT3 and DFT4 for transpose form FIR filter with block size of 2 and N=6.

DFT3 is the data flow table for the output y(n) and DFT4 for the output y(n-1). There is no redundancy in DFT3 and DFT4, as can be observed from the entries of the tables. The gray cells represent the output y(n) and the other values the output y(n-1). DFG1 is now transformed into a new DFG corresponding to DFT3 and DFT4, referred to as DFG3. DFG3 is the equivalent flow diagram for the computations of DFT3 and DFT4 with non-overlapped blocks of 2 for the filter length of 8, and is shown in Figure 5.

Figure 5: Modified DFG of block based transpose form FIR filter for the length of 8.

DFG3 is further optimized using retiming. Retiming is a method that reduces the power, area and delay of VLSI circuits by changing the positions of delay elements such as flip-flops. This change does not alter the input-output characteristics of the circuit. Retiming is mostly used in synchronous designs for many applications. Retiming reduces the switching activity of the circuit and hence the power consumption; specifically, it reduces the dynamic power dissipation in static CMOS circuits [7]. In this paper, DFG3 is retimed to obtain these advantages, and the result is named DFG4, the block based retimed transpose form FIR filter shown in Figure 6.
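The idea that moving delay elements leaves the circuit function unchanged while reducing their number can be illustrated with a generic retiming move. The sketch below is not the DFG3 to DFG4 transformation of this paper; it is a minimal, assumed example (function names and test data are hypothetical) in which two registers placed on the inputs of an adder are replaced by a single register on its output.

```python
def two_registers(a, b):
    """Each adder input has its own register: out(n) = a(n-1) + b(n-1)."""
    ra = rb = 0                       # both registers start at zero
    out = []
    for av, bv in zip(a, b):
        out.append(ra + rb)           # adder sees the registered values
        ra, rb = av, bv               # both registers load at the clock edge
    return out


def one_register(a, b):
    """Retimed version: the delay is moved across the adder, out(n) = (a + b)(n-1)."""
    r = 0                             # single register, starts at zero
    out = []
    for av, bv in zip(a, b):
        out.append(r)
        r = av + bv                   # the sum is registered instead of the inputs
    return out


if __name__ == "__main__":
    a = [1, 2, 3, 4]
    b = [10, 20, 30, 40]
    print(two_registers(a, b) == one_register(a, b))
```

Both versions produce the same output sequence, but the second uses one register instead of two, which is the kind of saving in delay elements that retiming aims for.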
Figure 6: Retimed DFG of block based transpose form FIR filter for the length of 8.

Comparing DFG3 and DFG4 for block based transpose form FIR filters, both structures contain an equal number of adders and multipliers. Only the number of delay elements (D flip-flops) is lower in the retimed FIR filter structure.

3. MATHEMATICAL FORMULATION OF PROPOSED STRUCTURE
The retimed block based transpose form FIR filter is formulated mathematically in order to design and implement the proposed architecture for a low power, low delay FIR filter with optimized area. For the implementation of the proposed FIR filter, the input block size is assumed to be L = 4 for every cycle. The filter processes L input samples and produces L outputs per cycle [8]. The filter output for the k-th block, $y_k$, is given by

$$y_k = X_k \cdot b \tag{3}$$

where $b$ is the coefficient weight vector, $b = [b(0), b(1), b(2), \ldots, b(N-1)]^T$. The input matrix is

$$X_k = [\,x_k^0 \;\; x_k^1 \;\; x_k^2 \;\; \cdots \;\; x_k^{N-1}\,] \tag{4}$$

where $x_k^i$, the (i+1)-th column of $X_k$, is defined as

$$x_k^i = [\,x(4k-i) \;\; x(4k-i-1) \;\; x(4k-i-2) \;\; x(4k-i-3)\,]^T \tag{5}$$

Substituting (4) in (3),

$$y_k = \sum_{i=0}^{N-1} x_k^i \cdot b(i) \tag{6}$$

Suppose N is a composite number decomposed as N = ML. The index i can then be written as $i = l + 4m$ for $0 \le l \le 3$ and $0 \le m \le 3$. Substituting $i = l + 4m$ in (5), we have

$$x_k^{l+4m} = x_{k-m}^{l} \tag{7}$$

Substituting (7) in (4) gives
$$X_k = [\,x_k^0 \; x_k^1 \; x_k^2 \; x_k^3 \;\; x_{k-1}^0 \; x_{k-1}^1 \; x_{k-1}^2 \; x_{k-1}^3 \;\; x_{k-2}^0 \; x_{k-2}^1 \; x_{k-2}^2 \; x_{k-2}^3 \;\; x_{k-3}^0 \; x_{k-3}^1 \; x_{k-3}^2 \; x_{k-3}^3\,] \tag{8}$$

Substituting (8) in (3),

$$y_k = \sum_{l=0}^{3} \sum_{m=0}^{3} x_{k-m}^{l} \cdot b(l + 4m) \tag{9}$$

The input matrix of (8) has the following features. The data block $x_k^0$ is the current block, while $\{x_{k-1}^0, x_{k-2}^0, x_{k-3}^0\}$ are blocks delayed by 1, 2 and 3 clock cycles. The overlapped blocks $\{x_{k-1}^1, x_{k-2}^1, x_{k-3}^1\}$ are versions of the overlapped block $x_k^1$ delayed by 1, 2 and 3 clock cycles.

The input matrix $X_k$ is decomposed into small matrices $R_k^m$, $0 \le m \le 3$, such that $R_k^0$ contains the 4 blocks $\{x_k^0, x_k^1, x_k^2, x_k^3\}$ and $R_k^1$ contains $\{x_{k-1}^0, x_{k-1}^1, x_{k-1}^2, x_{k-1}^3\}$. The coefficient vector $b$ is decomposed into small vectors $C_m = [\,b(4m), b(4m+1), b(4m+2), b(4m+3)\,]^T$, where $0 \le m \le 3$. Here $R_k^m$ is symmetric and satisfies the identity

$$R_k^m = R_{k-m}^0 \tag{10}$$

From (10), $R_k^m$ is $R_k^0$ delayed by m clock cycles, so equation (9) can be expressed in terms of $R_k^m$ and $C_m$ as

$$y_k = \sum_{m=0}^{3} r_k^m, \qquad r_k^m = R_k^m \cdot C_m \tag{11}$$

The above relation can be expressed in the z-domain as the recurrence relation

$$Y(z) = R^0(z)\left[\, z^{-1}\left( z^{-1}\left( z^{-1} C_3 + C_2 \right) + C_1 \right) + C_0 \,\right] \tag{12}$$

where $R^0(z)$ and $Y(z)$ are the z-domain representations of $R_k^0$ and $y_k$ respectively. This z-transform recurrence relation is equivalent to DFG4, shown in Figure 6. The next section deals with the implementation of the proposed architecture, the internal blocks of the transpose form FIR filter structure for the length N = 16 and block size L = 4, and related design issues.

DOI: https://dx.doi.org/1.2688/rs.ca.i8v1.8

4. IMPLEMENTATION OF PROPOSED ARCHITECTURE
In this section, the block based transpose form FIR filter is implemented with retiming for low power, low delay and small area. Figure 7 shows the architecture of the proposed FIR filter, which consists of a Register Cell (RC), one Coefficient Unit (CU), four Product Units (PUs) and one Pipeline Adder Circuit (PAC).
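The block formulation of Section 3 can be checked numerically. The NumPy sketch below is a minimal illustration under stated assumptions, not part of the original design flow: it assumes the index convention x_k^i = [x(4k-i), ..., x(4k-i-3)]^T, an output ordering y_k = [y(4k), y(4k-1), y(4k-2), y(4k-3)]^T, and zero-valued samples outside the recorded input. The helper names (`sample`, `x_vec`, `R`, `block_output`, `direct_output`) are hypothetical.

```python
import numpy as np

L, M = 4, 4            # block size and number of coefficient groups
N = L * M              # filter length, N = 16


def sample(x, n):
    """x(n), taken as zero outside the recorded range (assumption)."""
    return x[n] if 0 <= n < len(x) else 0.0


def x_vec(x, k, i):
    """Column x_k^i = [x(4k-i), x(4k-i-1), x(4k-i-2), x(4k-i-3)]^T, as in (5)."""
    return np.array([sample(x, L * k - i - j) for j in range(L)])


def R(x, k, m):
    """R_k^m = [x_{k-m}^0, ..., x_{k-m}^3], i.e. R_k^0 delayed by m block cycles."""
    return np.column_stack([x_vec(x, k - m, l) for l in range(L)])


def block_output(x, b, k):
    """y_k = sum_m R_k^m C_m with C_m = [b(4m), ..., b(4m+3)]^T, as in (11)."""
    return sum(R(x, k, m) @ b[L * m: L * (m + 1)] for m in range(M))


def direct_output(x, b, n):
    """Reference: y(n) = sum_i b(i) x(n - i)."""
    return sum(b[i] * sample(x, n - i) for i in range(N))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(40)       # hypothetical input samples
    b = rng.standard_normal(N)        # hypothetical coefficients
    k = 5
    reference = [direct_output(x, b, L * k - j) for j in range(L)]
    print(np.allclose(block_output(x, b, k), reference))
```

Under these assumptions the four entries of `block_output(x, b, k)` agree with the direct FIR sum at the time indices 4k, 4k-1, 4k-2 and 4k-3.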
Figure 7: Architecture of FIR filter for the length N=16 and block size L=4.

A block of 4 input samples is applied to the RC in the k-th cycle. The RC internally consists of delay flip-flops that rearrange the samples according to the algorithm, and it produces the 4 rows of input samples $R_k^0$ in parallel, as shown in Figure 8.

Figure 8: The internal circuitry of Register cell (RC)

These 4 rows of $R_k^0$ are applied to the M (where M = 4) PUs in the structure. The M coefficient weight vectors from the CU are also transmitted to the PUs: $C_0$ to PU4, $C_1$ to PU3, $C_2$ to PU2 and $C_3$ to PU1, as shown in Figure 7. A matrix multiplication then takes place between the input samples $R_k^0$ from the RC and the coefficient weight vectors $C_m$ (where m is 0 to 3). Each PU internally consists of four inner product cells (IPCs), as shown in Figure 9. The 4 rows of input samples from the RC are fed row-wise to the inner product cells. Each IPC multiplies these values by the coefficient vector and contributes to the output $r_k^m$. In this way the PUs produce $r_k^0$, $r_k^1$, $r_k^2$ and $r_k^3$. The 4 PUs work in parallel and produce the 4 result blocks $r_k^m$ (where m is 0 to 3).
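The data flow through the register cell, the product units and the pipeline adder can be modelled behaviourally in Python. The class below is a sketch under stated assumptions rather than a description of the actual Verilog modules: the class and method names are hypothetical, all registers are assumed to start at zero, each input block is given in time order [x(t-3), x(t-2), x(t-1), x(t)] with t the index of the newest sample, and the pipeline adder is modelled as the nested recurrence of (12).

```python
import numpy as np

L, M = 4, 4  # block size and number of product units (N = L * M = 16)


class BlockTransposeFIR:
    def __init__(self, b):
        b = np.asarray(b, dtype=float)
        # Coefficient unit: c[m] = [b(4m), b(4m+1), b(4m+2), b(4m+3)]
        self.c = b.reshape(M, L)
        self.history = np.zeros(L - 1)     # register cell: last 3 samples of the previous block
        self.acc = np.zeros((M - 1, L))    # pipeline registers between the adder stages

    def register_cell(self, block):
        """Build the 4x4 matrix with entries x(t - j - l), t = newest sample index."""
        window = np.concatenate([self.history, block])   # [x(t-6), ..., x(t)]
        self.history = block[-(L - 1):].copy()
        newest = len(window) - 1
        return np.array([[window[newest - j - l] for l in range(L)]
                         for j in range(L)])

    def step(self, block):
        """Process one input block and return [y(t-3), ..., y(t)] in time order."""
        R0 = self.register_cell(np.asarray(block, dtype=float))
        r = [R0 @ self.c[m] for m in range(M)]   # one product unit per coefficient group
        y = r[0] + self.acc[0]                   # output adder
        # Clock edge: stage m-1 loads r[m] plus the previous content of stage m.
        new_acc = np.zeros_like(self.acc)
        for m in range(1, M):
            upstream = self.acc[m] if m < M - 1 else 0.0
            new_acc[m - 1] = r[m] + upstream
        self.acc = new_acc
        return y[::-1]                           # y[j] held y(t - j); reverse to time order


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(32)                  # hypothetical input, 8 blocks of 4
    b = rng.standard_normal(L * M)               # hypothetical coefficients
    fir = BlockTransposeFIR(b)
    y = np.concatenate([fir.step(x[i:i + L]) for i in range(0, len(x), L)])
    print(np.allclose(y, np.convolve(x, b)[:len(x)]))
```

In this model every product unit receives the same matrix from the register cell, and the delays between blocks are realised only by the registers in the accumulation path, which is what distinguishes the transpose form from a direct form block filter.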
Parallel processing means that multiple outputs are computed in parallel within one clock period. Parallel processing and pipelining are used in the PAC architecture to reduce the power consumption and the critical path (delay), which also improves the clock speed. Parallel processing and pipelining techniques are duals of each other [9][10]: a computation that can be pipelined can also be processed in parallel.

Figure 9: Product Unit block with four internal IPCs

The internal circuit of each IPC block is shown in Figure 10. The actual multiplication of the inputs by the filter coefficients is carried out by this cell. It multiplies each input sample by the corresponding coefficient and produces the output denoted by $r(4k)$. Similarly, the other IPCs produce the outputs $r(4k-1)$, $r(4k-2)$ and $r(4k-3)$. The combination of these four outputs, denoted by $r_k^m$, is the output of one PU. Following this procedure, the PUs produce the 4 outputs $r_k^0$, $r_k^1$, $r_k^2$ and $r_k^3$. These four PU outputs are passed to the next block, the PAC.

Figure 10: Internal circuit of Inner product cell (IPC) in the PU.

These four outputs pass through the PAC block, which consists of delay elements and carry save adders, as shown in Figure 11. In this block, the partial product outputs are added by a pipelined addition to produce the L outputs $y_k$, where L is 4. Finally, the transpose form FIR filter architecture produces four blocks of output for four
input block samples. Pipelining is used in the PAC block to optimize the filter.

Figure 11: The internal circuit of the PAC.

5. RESULTS AND PERFORMANCE COMPARISON
The entire FIR filter design is implemented in the Xilinx Synthesis Tool (XST) using Verilog HDL for the Virtex-5 FPGA target device and simulated using the ISE simulator. The top level synthesized module of the FIR filter and the simulation outputs from the ISE Simulator are presented in Figure 12 and Figure 13 respectively.

Figure 12: Top level module of block based transpose form FIR filter for the length N=16 and block size L=4 using XST.
Figure 13: Simulation outputs of proposed FIR filter using ISE Simulator.

Table I gives the complete design summary of the proposed FIR filter using the XST tool for the Virtex-5 FPGA. Here, the design blocks are mapped to the technology blocks in the FPGA. The device utilization percentages, the number of available blocks and the number of used blocks are shown in Table I.

Table I: Device utilization summary for the proposed FIR filter using XST

Logic Utilization                    Used    Available    Utilization
Number of Slice Registers            129     64000        0%
Number of Slice LUTs                 192     64000        0%
Number of fully used LUT-FF pairs    128     193          66%
Number of bonded IOBs                98      640          15%
Number of BUFG/BUFGCTRLs             1       32           3%
Number of DSP48Es                    80      256          31%

The blocks of the FIR filter are coded in Verilog HDL and then synthesized using the Encounter RTL Compiler from CADENCE in TSMC 45 nm and TSMC 180 nm CMOS technology. The RTL Compiler targets nanometer performance goals, reduces the chip area, lowers power and improves timing closure. Reports for power consumption, area and delay are generated from this synthesis. The top module of the FIR filter architecture from the RTL Compiler tool is shown in Figure 14.
Figure 14: The complete structure of proposed FIR filter using TSMC 45 nm CMOS technology by RTL Compiler.

The comparison of the two TSMC technologies using the RTL Compiler is given in Table II. The synthesis results of the FIR filter for the TSMC 45 nm CMOS library and the TSMC 180 nm CMOS technology are tabulated from the reports generated by the RTL Compiler tool. The power consumption of the FIR filter in 180 nm technology is much greater than in 45 nm technology, which means that more power optimization takes place in 45 nm technology due to the power reduction constraints applied to the filter. The number of delay elements is also reduced by retiming, hence the delay is improved in 45 nm technology. The clock speed corresponding to the FIR filter delay in 45 nm is high, i.e. 222 MHz. The area is also reduced in the advanced 45 nm CMOS technology using appropriate area constraints. The number of flip-flops is reduced, and optimized adders and multipliers are used in the design of the FIR filter. According to the RTL Compiler synthesis tool, the leakage power is very small compared to the dynamic power in both 45 nm and 180 nm technologies. The leakage power is 157 nW in 45 nm and 62 nW in 180 nm, which is negligible compared to the dynamic power. The total power of the proposed FIR filter in 45 nm technology is 326 µW, which is the optimized power.

Table II: Synthesis results of 45 nm and 180 nm CMOS technologies using RTL Compiler.

Technology          Power consumption (µW)    Area (µm²)    Delay (ns)    Clock frequency (MHz)
TSMC 45 nm CMOS     326                       2445          4.487         222.866
TSMC 180 nm CMOS    16967                     91925         7.391         135.299
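The clock frequencies in Table II follow from the reported delays, since the maximum clock frequency is the reciprocal of the critical path delay. The short Python check below assumes that relation; the function name `max_clock_mhz` is hypothetical and the two delay values are taken from Table II.

```python
def max_clock_mhz(delay_ns):
    """Maximum clock frequency in MHz for a critical-path delay given in ns."""
    return 1e3 / delay_ns             # 1 / (delay in ns) is in GHz; times 1000 gives MHz


if __name__ == "__main__":
    table_ii_delays_ns = [4.487, 7.391]   # delay column of Table II
    for delay in table_ii_delays_ns:
        print(f"delay {delay} ns -> {max_clock_mhz(delay):.2f} MHz")
```

The computed values agree with the tabulated clock frequencies to within rounding of the last digit.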
Table III presents the comparison between the existing FIR filter structures from the survey and the proposed FIR filter structure. The number of multipliers required for the proposed design is 64, the same as the direct form architecture [4] and the transpose form structure [5], but the number of delay elements (FFs) is lower than in the existing structures. Due to the smaller number of delay elements, the overall delay of the proposed FIR filter is reduced to 4.487 ns. The clock speed achieved by the proposed transpose form FIR filter in 45 nm technology is 222 MHz. The proposed design uses 197 flip-flops, fewer than the existing architectures [2] and [3]. The area occupied by the proposed structure is much smaller than that of the existing FIR filter structures, so greater area optimization is achieved by this FIR filter in 45 nm technology.

Table III: Comparison between different FIR filter parameters.

                         Multipliers    Adders    FF     Power consumption    Area (µm²)
Direct form structure    64             60        12     55.3 mW              112723
Park et al               0              71        296    24.9 mW              56325
Mahesh et al             0              114       36     28.7 mW              67949
Mohanty et al            64             60        312    59.3 mW              123489
Proposed                 64             60        197    326 µW               2445

6. CONCLUSION
The optimized block based transpose form FIR filter is realized using retiming, with fewer delay elements, for low power consumption, small area and high speed. The basic transpose form FIR filter DFG is converted into a modified DFG to avoid redundancy for block based inputs. Retiming is introduced by changing the location of the D flip-flops to optimize the FIR filter. Constraints are applied in the synthesis tool to reduce the delay, area and power consumption of the FIR filter. The entire block based transpose form FIR filter structure is implemented in Verilog HDL and simulated using the ISE simulator. The design is synthesized using the XST tool and again using the RTL Compiler for two different technologies, TSMC 45 nm and TSMC 180 nm CMOS.
From the comparison between these two technologies, the 45 nm technology gives better results in terms of area, delay and power consumption. The area and utilization summary is given by the XST tool, and the power and delay reports are obtained from the RTL synthesis tool.

REFERENCES
[1] A. Umasankar and N. Vasudevan, "Design and Analysis of Various Slice Reduction Algorithm for Low Power and Area Efficient FIR Filter," ICCTET13, IEEE Conf., July 2013.
[2] R. Mahesh and A. P. Vinod, "New Reconfigurable Architectures for Implementing FIR Filters with Low Complexity," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 29, no. 2, pp. 275-288, Feb. 2010.
[3] S. Y. Park and P. K. Meher, "Efficient FPGA and ASIC realizations of a DA-based reconfigurable FIR digital filter," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 61, no. 7, pp. 511-515, Jul. 2014.
[4] B. K. Mohanty and P. K. Meher, "A high-performance energy-efficient architecture for FIR adaptive filter based on new distributed arithmetic formulation of block LMS algorithm," IEEE Trans. Signal Process., vol. 61, no. 4, pp. 921-932, Feb. 2013.
[5] B. K. Mohanty and P. K. Meher, "A high-performance FIR Filter Architecture for Fixed and Reconfigurable Applications," IEEE Trans. on VLSI Systems, vol. 24, issue 2, pp. 444-452, 2016.
[6] A. P. Vinod and E. M. Lai, "Low power and high-speed implementation of FIR filters for software defined radio receivers," IEEE Trans. Wireless Commun., vol. 7, no. 5, pp. 1669-1675, Jul. 2006.
[7] Keshab K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, 1999.
[8] B. K. Mohanty and P. K. Meher, "A high-performance energy-efficient architecture for FIR adaptive filter based on new distributed arithmetic formulation of block LMS algorithm," IEEE Trans. Signal Process., vol. 61, no. 4, pp. 921-932, Feb. 2013.
[9] R. Mahesh and A. P. Vinod, "A new common sub-expression elimination algorithm for realizing low-complexity higher order digital filters," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 27, no. 2, pp. 217-219, Feb. 2008.
[10] J. Park, W. Jeong, H. Mahmoodi-Meimand, Y. Wang, H. Choo, and K. Roy, "Computation sharing programmable FIR filter for low-power and high-performance applications," IEEE J. Solid State Circuits, vol. 39, no. 2, pp. 348-357, Feb. 2004.
[11] K.-H. Chen and T.-D. Chiueh, "A low-power digit-based reconfigurable FIR filter," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 53, no. 8, pp. 617-621, Aug. 2006.