A NOVEL APPROACH FOR AREA-POWER-ENERGY REDUCTION IN LMS ADAPTIVE FILTER

Volume 118 No. 20 2018, 343-350
ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version)
url: http://www.ijpam.eu

1 Vasumathi.A, 2 Joyce Selva Hephzibah.T
1,2 Assistant Professor, Electronics and Communication Engineering, Karpagam College of Engineering, Coimbatore
1 vasu.kce@gmail.com, 2 joyceselva87@gmail.com

Abstract: In this paper, an efficient architecture for the implementation of a delayed least mean square (DLMS) adaptive filter is presented. A partial product generator is used to achieve lower adaptation delay and an area-delay-power efficient implementation. Pipelining is introduced across the time-consuming combinational blocks of the structure. The common sub expression (CSE) and canonical signed digit (CSD) algorithms are used in the multiplier block, which reduces area and power. The proposed design is found to offer a lower area-delay product (ADP) and a lower energy-delay product (EDP) than the best of the existing systolic structures. It requires less area, power and delay than the existing systems, with an area saving of nearly 63%.

General Terms: Least mean square algorithm, Common sub expression algorithm and Canonical signed digit code used in adaptive filter.

Keywords: Adaptive filters, least mean square (LMS) algorithm, CSE algorithm, Error computation block, Multiplier block, Weight update block

1. Introduction

The least mean square (LMS) adaptive filter is the most widely accepted and most commonly used adaptive filter in estimation, equalization and demodulation. It is broadly used because of its robustness and low hardware complexity. It is simple and its convergence rate is high. However, the LMS adaptation scheme imposes a significant limit on its implementation [1]: it is recursive in nature and involves many critical-path computations. Least mean squares (LMS) algorithms are a class of adaptive filters used to mimic a desired filter by finding the filter coefficients that produce the least mean square of the error signal (the difference between the desired and the actual signal) [11]. A drawback of LMS is that it is difficult to choose the scaling rate. In practical applications, many modified LMS algorithms have been proposed. Systolic array structures based on the LMS algorithm are used, but it is difficult to implement a systolic array using the LMS algorithm [9]. The delayed LMS (DLMS) algorithm is well suited to hardware implementation, and the hardware cost it requires is very low. Its drawback is performance degradation. A correction term can be used to avoid this degradation, but adding a correction term increases power consumption [4]. The DLMS algorithm also introduces delay. To avoid the delays produced by the DLMS algorithm, a modified DLMS algorithm has been proposed [8].

2. Proposed Architecture

2.1 Modified delayed LMS algorithm

The conventional delayed LMS filter consists of a filter block and a weight update block. The overall delay is m cycles in the conventional LMS filter. The weights of the LMS adaptive filter during the nth iteration are updated according to $w_{n+1} = w_n + \mu\, e_n\, x_n$, where $w_n$ is the weight coefficient, $x_n$ is the input sample for the nth iteration, $e_n$ is the error, and $\mu$ is the step size.

Figure 1. Structure of conventional filter

The input signal is the sum of a desired signal $d(n)$ and a noise signal $v(n)$, i.e. $x(n) = d(n) + v(n)$. The variable filter has a finite impulse response (FIR) structure. For such structures the impulse response is equal to the filter coefficients. The coefficients for a filter of order P are defined as $\mathbf{w}(n) = [w_n(0), w_n(1), \ldots, w_n(P)]^T$. The error signal is the difference between the desired and estimated signal.
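The weight recursion above can be illustrated with a short floating-point sketch. It is not the paper's hardware design: the function name `lms_identify`, the test system `h`, the step size and the delay parameter `m` are all assumptions made for illustration. With `m = 0` the update uses the current error and input, as in the conventional LMS recursion; with `m > 0` it uses the error and input from m iterations earlier, as in the delayed LMS algorithm mentioned in the Introduction.

```python
import random


def lms_identify(x, d, num_taps, mu, m=0):
    """Adapt FIR weights so that the filter output tracks d.

    m = 0: conventional LMS, w[n+1] = w[n] + mu * e[n] * x[n].
    m > 0: delayed LMS, the update uses the error and input vector
           from m iterations earlier.
    """
    w = [0.0] * num_taps
    history = []  # (error, input vector) pairs, oldest first
    for n in range(len(x)):
        # Input vector [x(n), x(n-1), ..., x(n-N+1)], zero-padded at the start
        xn = [x[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        y = sum(wk * xk for wk, xk in zip(w, xn))  # filter output
        e = d[n] - y                               # error signal
        history.append((e, xn))
        if len(history) > m:
            e_old, x_old = history.pop(0)  # pair from m iterations ago
            w = [wk + mu * e_old * xk for wk, xk in zip(w, x_old)]
    return w


if __name__ == "__main__":
    random.seed(0)
    h = [0.5, -0.3, 0.2, 0.1]  # hypothetical unknown system to identify
    x = [random.uniform(-1.0, 1.0) for _ in range(5000)]
    d = [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
         for n in range(len(x))]
    print(lms_identify(x, d, num_taps=4, mu=0.05))       # conventional LMS
    print(lms_identify(x, d, num_taps=4, mu=0.05, m=4))  # delayed by 4 cycles
```

In this noise-free setting both calls should return weights close to `h`; the delayed variant simply reaches them using older error information.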

$e(n) = d(n) - \hat{d}(n)$. The weight recursion can be written as $w_{n+1} = w_n + \Delta w_n$, where $\Delta w_n$ is the correction factor. The existing modified delayed LMS algorithm applies pipelining to the LMS algorithm. The architecture of the filter block is shown in Figure 2. The filter block consists of two main blocks: the weight updating block and the error computation block. In delayed LMS filters two types of delay are produced: one part is the delay introduced by the pipeline stages in FIR filtering, and the other part is the delay involved in pipelining the weight updating block. The weight-update equation of the DLMS adaptive filter is given by $w_{n+1} = w_n + \mu\, e_{n-m}\, x_{n-m}$. The weight-update equation of the modified DLMS filter is given by $w_{n+1} = w_n + \mu\, e_{n-n_1}\, x_{n-n_1}$, where $e_{n-n_1} = d_{n-n_1} - y_{n-n_1}$ and $y_n = w_{n-n_2}^T x_n$. The delays can be reduced by using the modified structure described in the next sections.

Figure 2. Structure of modified delayed LMS adaptive filter

3. Error Computation Block

The proposed structure for error computation of an N-tap DLMS adaptive filter is shown in Fig. 3. It consists of N partial product generators (PPGs) corresponding to N multipliers and a cluster of L/2 binary adder trees, followed by a single multiplier block. Each sub block is described in detail below.

3.1 Partial Product Generator

The structure of each PPG is shown in Fig. 5. It consists of L/2 2-to-3 decoders and the same number of AND/OR cells (AOC). Each of the 2-to-3 decoders takes a 2-bit digit $(u_1 u_0)$ as input and produces three outputs $b_0 = u_0 \cdot \bar{u}_1$, $b_1 = \bar{u}_0 \cdot u_1$, and $b_2 = u_0 \cdot u_1$, such that $b_0 = 1$ for $(u_1 u_0) = 1$, $b_1 = 1$ for $(u_1 u_0) = 2$, and $b_2 = 1$ for $(u_1 u_0) = 3$. The decoder outputs $b_0$, $b_1$ and $b_2$, along with $w$, $2w$, and $3w$, are fed into an AOC, where $w$, $2w$, and $3w$ are in 2's complement representation and sign-extended to $(W + 2)$ bits each. To account for the sign of the input samples while computing the partial product corresponding to the most significant digit (MSD), i.e. $(u_{L-1} u_{L-2})$ of the input sample, the AOC $(L/2 - 1)$ is fed with $w$, $-2w$, and $-w$ as input, since $(u_{L-1} u_{L-2})$ can have the four possible values 0, 1, $-2$, and $-1$.

3.2 AND/OR Cell

An AOC consists of 3 AND cells and 2 OR cells. Each AND cell takes two inputs: an n-bit input D and a single bit b. The OR cell is fed with inputs b and D. The output of an AOC is $w$, $2w$ or $3w$, corresponding to the decimal values 1, 2, and 3 of the 2-bit input $(u_1 u_0)$, respectively. The decoder and AOC together perform a multiplication of the input operand $w$ with a two-bit digit $(u_1 u_0)$, so the PPG performs L/2 parallel multiplications of the input word with a 2-bit digit to obtain the L/2 partial products of the product word $wu$.

Structure and function of AND cells.

3.3 Adder Tree

The shift-add operation could be performed on the partial products of each PPG to obtain the product value, with all N product values then added to compute the desired inner product. However, the shift-add operation that forms each product value increases the word length, and therefore increases the adder size of the N-1 additions of the product values. To avoid this increase in the word size of the adders, all N partial products of the same place value from all N PPGs are added by one adder tree. Figure 4 shows the adder tree. All the L/2 partial products generated by each of the N PPGs are thus added by L/2 binary adder trees. The outputs of the L/2 adder trees are then added by a shift-add tree according to their place values. Each binary adder tree requires $\log_2 N$ stages of adders to add N partial products, and the shift-add tree requires $\log_2 L - 1$ stages of adders to add the L/2 outputs of the L/2 binary adder trees. The addition scheme for the error-computation block for a 4-tap filter and input word size L = 8 is shown in Fig. 3.6.
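The decoder, AND/OR cell and adder-tree arrangement of Sections 3.1 to 3.3 can be modelled behaviourally on Python integers. This is only a sketch under stated assumptions: the function names (`decode_2to3`, `aoc`, `partial_products`, `inner_product`) are hypothetical, word-length limits such as the (W + 2)-bit sign extension are ignored, and the most significant digit is given the multiples w, -2w and -w because the top 2-bit digit of a two's-complement number takes the values 0, 1, -2 and -1.

```python
def decode_2to3(u1, u0):
    """2-to-3 decoder: returns (b0, b1, b2) for the 2-bit digit (u1 u0)."""
    b0 = u0 & (u1 ^ 1)   # 1 when (u1 u0) = 01
    b1 = (u0 ^ 1) & u1   # 1 when (u1 u0) = 10
    b2 = u0 & u1         # 1 when (u1 u0) = 11
    return b0, b1, b2


def aoc(selects, multiples):
    """AND/OR cell: AND each multiple with its select bit, then OR the results."""
    out = 0
    for b, value in zip(selects, multiples):
        out |= value & -b   # -b is all ones when b = 1 and zero when b = 0
    return out


def partial_products(w, u, L):
    """Return the L/2 partial products of w*u, least significant digit first.

    u is an L-bit two's-complement sample, L even.
    """
    parts = []
    for j in range(L // 2):
        u0 = (u >> (2 * j)) & 1
        u1 = (u >> (2 * j + 1)) & 1
        if j == L // 2 - 1:
            multiples = (w, -2 * w, -w)    # signed most significant digit
        else:
            multiples = (w, 2 * w, 3 * w)
        parts.append(aoc(decode_2to3(u1, u0), multiples))
    return parts


def inner_product(weights, samples, L):
    """Inner product using L/2 adder trees followed by one shift-add tree."""
    pps = [partial_products(w, u, L) for w, u in zip(weights, samples)]
    # Adder tree j sums the j-th partial product of all N PPGs
    tree_sums = [sum(pp[j] for pp in pps) for j in range(L // 2)]
    # The shift-add tree combines the L/2 sums according to their place values
    return sum(s << (2 * j) for j, s in enumerate(tree_sums))


if __name__ == "__main__":
    weights = [3, -5, 7, 2]
    samples = [100, -128, -1, 127]
    print(inner_product(weights, samples, 8))
    print(sum(w * u for w, u in zip(weights, samples)))  # reference value
```

Summing partial products of equal place value first, and shifting only once at the end, is what keeps the word length of the adder trees from growing with each individual product.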
For N = 4 and L = 8, the adder network requires four binary adder trees of two stages each and a two-stage shift-add tree. In this figure all possible locations of pipeline latches are shown by dashed lines, to reduce the critical path to one addition time. If pipeline latches were introduced after every addition, it would require $L(N-1)/2 + L/2 - 1$ latches in $\log_2 N + \log_2 L - 1$ stages, which leads to high adaptation delay and introduces a large overhead of area and power consumption for large values of N and L. On the other hand, some of those pipeline latches are redundant in the sense that they are not required to maintain a critical path of one addition time. The final adder in the shift-add tree contributes the maximum delay to the critical path. Based on that observation, we have identified the pipeline latches that do not contribute significantly to the critical path and can be excluded without any noticeable increase of the critical path. The positions of the pipeline latches for filter lengths N = 8, 16 and 32 and for input size L = 8 are shown in Table 3.1. The pipelining is performed by a feed-forward cut-set retiming of the error-computation block.

3.4 Multiplier block using CSE and CSD

The multiplier block uses two algorithms that reduce area: the CSE algorithm and the CSD algorithm.

3.4.1 Common sub expression

CSE has been used as a very powerful tool in FIR filter design to reduce the number of arithmetic units (adders and shifters). The following example explains the CSE concept. Consider two functions F1 and F2, where F1 = 13X and F2 = 29X. Both F1 and F2 can be represented in the following manner: F1 = X + 4X + 8X = X + (X << 2) + (X << 3) and F2 = X + 4X + 8X + 16X = X + (X << 2) + (X << 3) + (X << 4), where << denotes a bitwise left shift. Both expressions F1 and F2 contain the common term D = X + (X << 2) + (X << 3). Therefore, F1 and F2 can be rewritten as F1 = D and F2 = D + (X << 4). Reusing D in both expressions reduces the computation overhead and the number of adders required to implement both expressions. The corresponding hardware implementation is shown in Fig. 5a and 5b.
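The F1 = 13X and F2 = 29X example can be checked numerically. The sketch below is illustrative only; the function names `unshared` and `shared` are hypothetical, and the adder counts in the comments are simply the number of `+` operations in each expression.

```python
def unshared(x):
    """F1 and F2 computed independently."""
    f1 = x + (x << 2) + (x << 3)               # 13x, 2 additions
    f2 = x + (x << 2) + (x << 3) + (x << 4)    # 29x, 3 additions
    return f1, f2


def shared(x):
    """F1 and F2 computed with the common sub expression D."""
    d = x + (x << 2) + (x << 3)   # D = 13x, 2 additions, computed once
    f1 = d                        # F1 reuses D directly
    f2 = d + (x << 4)             # F2 needs 1 further addition
    return f1, f2


if __name__ == "__main__":
    for x in (1, 7, -12):
        print(x, unshared(x), shared(x))
```

Both versions return the same values; the shared version performs three additions in total instead of five.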
Two important conclusions can be drawn from the earlier example: 1) significant power savings can be achieved by reducing the number of adders using CSE (only three adders in the CSE-based implementation compared with five adders in the unshared case), and 2) CSE might increase the total number of adders in the critical path. For multiplication we use both the CSE and CSD algorithms; the common sub expressions are extracted and shared.

Figure 3. Error Computation Block

To elucidate this point further, let us consider each of the expressions F1 and F2. Without CSE, F1 has two adders in its critical path. Even after applying CSE, F1 is still available after two adder delays. Therefore, there is no delay penalty for F1 in the CSE-based implementation. However, the critical path of F2 is increased from two to three adders due to CSE, resulting in a delay overhead. Therefore, there is a tradeoff between the power consumption and the frequency requirements in the case of a CSE-based implementation.

Figure 4. Structure of Adder Tree

Figure 5. (a) Multiplication without CSE

Figure 5. (b) Multiplication with CSE

Figure 6. Comparison of signed power of two (left) and CSD (right)
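The adder counts and critical-path depths quoted for F1 = 13X and F2 = 29X can be reproduced with a small expression-tree model. This is a sketch under assumptions: adders are modelled as nested tuples, shifts are treated as free wiring, and the unshared F2 is assumed to be built as a balanced tree (which is what gives it a two-adder critical path). The helper names `add`, `depth` and `count_adders` are hypothetical.

```python
def add(a, b):
    """Model a two-input adder as a tuple node; leaves are plain strings."""
    return ("+", a, b)


def depth(node):
    """Number of adders on the longest path from any input to this node."""
    if isinstance(node, str):
        return 0
    return 1 + max(depth(node[1]), depth(node[2]))


def count_adders(outputs):
    """Count distinct adder nodes; a shared node is counted only once."""
    seen = set()

    def visit(node):
        if isinstance(node, str) or id(node) in seen:
            return
        seen.add(id(node))
        visit(node[1])
        visit(node[2])

    for out in outputs:
        visit(out)
    return len(seen)


# Unshared implementation: F1 and F2 each have their own adders
f1_plain = add(add("X", "X<<2"), "X<<3")
f2_plain = add(add("X", "X<<2"), add("X<<3", "X<<4"))  # balanced tree (assumed)

# CSE implementation: D is built once and reused by both outputs
d_shared = add(add("X", "X<<2"), "X<<3")
f1_cse = d_shared
f2_cse = add(d_shared, "X<<4")

if __name__ == "__main__":
    print("adders:", count_adders([f1_plain, f2_plain]), count_adders([f1_cse, f2_cse]))
    print("F1 depth:", depth(f1_plain), depth(f1_cse))
    print("F2 depth:", depth(f2_plain), depth(f2_cse))
```

The model makes the tradeoff concrete: sharing D lowers the adder count, but F2 must wait for D and therefore sits one adder deeper.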

3.4.2 Canonical signed digit

On the FIR design side, the FIR filter is constructed from delay elements, multipliers and adders. To minimize the chip size, we adopt the signed power-of-two (SPT) method to implement the FIR filter without multipliers. Unfortunately, the SPT method cannot guarantee the least number of adders and shifts, because a value can have several SPT expressions. For example, the value 23 can be expressed in three ways, $010111$, $01100\bar{1}$, and $10\bar{1}00\bar{1}$, where $\bar{1}$ denotes the value $-1$. To overcome this, a canonic signed-digit (CSD) representation can be employed to implement the basic shift-and-add algorithm. The CSD recoding rules are $\{11 \to 10\bar{1};\ \bar{1}1 \to 0\bar{1};\ 1\bar{1} \to 01;\ \bar{1}\bar{1} \to \bar{1}01\}$. A value of 27 can then be expressed with the following steps: $11011 \to 1110\bar{1} \to 10\bar{1}10\bar{1} \to 100\bar{1}0\bar{1}$. Fig. 6 shows a comparison of the number of shift-and-add operations between SPT and CSD. CSD clearly requires fewer adders and shifters.

4. Weight Update Block

The weight-update block is shown in Fig. 7. It performs N multiply-accumulate (MAC) operations of the form $(\mu e)\, x_i + w_i$ to update the N filter weights. The step size $\mu$ is taken as a negative power of two, so that the multiplication with the most recently available error is realized by a shift operation only. Each MAC unit therefore multiplies the shifted error value with the delayed input sample $x_i$ and adds the result to the corresponding old weight value $w_i$. The N multiplications for the MAC operations are performed by N PPGs followed by N shift-add trees. Each PPG generates L/2 partial products corresponding to the product of the recent shifted error value $\mu e$ with the L/2 2-bit digits of the input word $x_i$, where the sub expression $3\mu e$ is shared within the multiplier. Since the scaled error $(\mu e)$ is multiplied with all N delayed input values in the weight-update block, this sub expression can be shared across all the multipliers as well. This leads to a substantial reduction of the adder complexity.
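A behavioural sketch of one weight-update iteration may help make the sharing of the scaled error concrete. It rests on several assumptions not stated in the text: all quantities are plain integers, the step size is $\mu = 2^{-\text{shift}}$ applied as a truncating arithmetic right shift, and the inputs are L-bit two's-complement words. The function name `weight_update` and its parameters are hypothetical.

```python
def weight_update(weights, delayed_inputs, error, shift, L=8):
    """One iteration of w_i <- w_i + (mu * e) * x_i with mu = 2**-shift.

    The products are formed from 2-bit digits of x_i, so only the
    multiples mu*e, 2*mu*e and 3*mu*e are ever needed.
    """
    se = error >> shift            # mu * e realised as an arithmetic shift
    se3 = se + (se << 1)           # 3 * mu * e, computed once and shared by all taps
    low = (0, se, se << 1, se3)    # digit values 0, 1, 2, 3
    top = (0, se, -(se << 1), -se) # signed top digit values 0, 1, -2, -1

    updated = []
    for w, x in zip(weights, delayed_inputs):
        acc = w
        for j in range(L // 2):
            digit = (x >> (2 * j)) & 3          # j-th 2-bit digit of x_i
            table = top if j == L // 2 - 1 else low
            acc += table[digit] << (2 * j)      # shift-add by place value
        updated.append(acc)
    return updated


if __name__ == "__main__":
    print(weight_update([10, -20, 30], [127, -128, 5], error=-96, shift=3))
```

Every tap reads from the same two small tables, which is the software analogue of sharing the sub expression across all N multipliers.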
The final outputs of the MAC units constitute the desired updated weights, which are used as inputs to the error-computation block as well as the weight-update block for the next iteration.

Figure 7. Weight update block

5. Simulation Result

The simulation output for the proposed method is given in Figure 8, and the comparison of the existing method and the proposed method is given in Table 1.

Figure 8. Simulated output for CSE algorithm

The area, power and delay are compared with the existing system. The power required for the proposed system is less than that of the existing system, and the area and delay are also reduced, as shown in Table 1.

Table 1. Performance comparison of CSE algorithm with existing system

S.No  Design                                  Area (sq. µm)  Power (mW)  Frequency (MHz)
1     Modified delayed LMS filter algorithm   5880           309         39.638
2     Proposed (CSE algorithm)                2154           196         49.525

6. Conclusion

We proposed an area-delay-power efficient, low adaptation-delay architecture for the LMS adaptive filter. A new PPG is used for efficient implementation of addition and multiplication. Pipelining is used so that the delay in the adder stage can be avoided by reducing the redundant adder stages. The inner product computations are made using common sub expression sharing and canonical signed digit code, which further reduces computation delay and area. The proposed structure involves significantly less adaptation delay and provides a significant saving of ADP and EDP compared to the existing structures. The proposed design gives a 63% area reduction compared to the existing system.

7. Acknowledgments

My heartfelt thankfulness goes to our honorable chairman Thiru. K. PARAMASIVAM B.Sc., for having provided us with all the necessary infrastructure and other facilities, and my special thanks go to our Correspondent Thiru. P. SATHIYAMOORTHY B.E., M.B.A., M.S., for providing the facilities to complete the project successfully. I extend my gratitude to Dr. N. KUPPUSWAMY M.E., Ph.D., F.I.E., Principal, Maharaja Engineering College, Avinashi, for his high degree of encouragement and moral support during the course of this project work.
I am extremely happy to express my heartfelt gratitude to Mr. V. SAMINATHAN M.E., Head, Department of Electronics and Communication Engineering, for extending all possible help during the course of my project work and for his valuable guidance in making this project work a grand success. I thank my Coordinator Mr. GOPALAKRISHNASAMY M.E., Asst. Prof., and also thank my guide Mr. L. VIGNEASH M.E., MBA., Department of Electronics and Communication, who have helped me during the course of my project work. My heartfelt thanks go to all the faculty of the Electronics and Communication Department for their valuable support and encouragement towards the completion of this stage of the project. I am grateful to my family for lighting the candle for pursuing my studies. I wish to express my sincere thanks to all those who helped me in making this project successful.

References

[1] Pramod Kumar Meher and Sang Yoon Park, Area-Delay-Power Efficient Fixed-Point LMS Adaptive Filter with Low Adaptation-Delay, IEEE Trans. Very Large Scale Integr. (VLSI) Syst.

[2] D. P. Agrawal and M. D. Meyer, A modular pipelined implementation of a delayed LMS transversal adaptive filter, in Proc. IEEE Int. Symp. Circuits Syst., May 1990, pp. 1943-1946.

[3] S. Ramanathan and V. Viswanathan, VLSID '95: Proceedings of the 8th International Conference on VLSI Design, IEEE Computer Society, Jan. 1995.

[4] M. Maheshwari and P. K. Meher, A high-speed FIR adaptive filter architecture using a modified delayed LMS algorithm, in Proc. IEEE Int. Symp. Circuits Syst., May 2011, pp. 121-124.

[5] S. Ramanathan and V. Visvanathan, A systolic architecture for LMS adaptive filtering with minimal adaptation delay, in Proc. Int. Conf. Very Large Scale Integr. (VLSI) Design, Jan. 1996, pp. 286-289.

[6] R. Rocher, D. Menard, O. Sentieys, and P. Scalart, Accuracy evaluation of fixed-point LMS algorithm, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2004, pp. 237-240.

[7] Jung Hwan Choi, Nilanjan Banerjee, and Kaushik Roy, Variation-Aware Low-Power Synthesis Methodology for Fixed-Point FIR Filters.

[8] Y. Yi, R. Woods, L.-K. Ting, and C. F. N. Cowan, High speed FPGA-based implementations of delayed-LMS filters, J. Very Large Scale Integr. (VLSI) Signal Process., vol. 39, nos. 1-2, pp. 113-131, Jan. 2005.

[9] L. D. Van and W. S. Feng, An efficient systolic architecture for the DLMS adaptive filter and its applications, IEEE Trans. Circuits Syst. II, Analog Digital Signal Process., vol. 48, no. 4, pp. 359-366, Apr. 2001.

[10] L.-K. Ting, R. Woods, and C. F. N. Cowan, Virtex FPGA implementation of a pipelined adaptive LMS predictor for electronic support measures receivers, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 1, pp. 86-99, Jan. 2005.

[11] P. K. Meher and M. Maheshwari, A high-speed FIR adaptive filter architecture using a modified delayed LMS algorithm, in Proc. IEEE Int. Symp. Circuits Syst., May 2011, pp. 121-124.

[12] M. Rajesh and J. M. Gnanasekar, Constructing Well-Organized Wireless Sensor Networks with Low-Level Identification, World Engineering & Applied Sciences Journal 7.1 (2016).

[13] S. V. Manikanthan and T. Padmapriya, Recent Trends in M2M Communications in 4G Networks and Evolution Towards 5G, International Journal of Pure and Applied Mathematics, ISSN: 1314-3395, vol. 115, issue 8, Sep. 2017.

[14] T. Padmapriya and V. Saminadan, Inter-cell Load Balancing technique for multi-class traffic in MIMO-LTE-A Networks, International Journal of Electrical, Electronics and Data Communication (IJEEDC), ISSN: 2320-2084, vol. 3, no. 8, pp. 22-26, Aug. 2015.

[15] Harikishore Kakarla, Madhavi Latha M and Habibulla Khan, Transition Optimization in Fault Free Memory Application Using Bus-Align Mode, European Journal of Scientific Research, vol. 112, no. 2, pp. 237-245, ISSN: 1450-216x135/1450-202x, Oct. 2013.
