Reconfigurable High Performance Baugh-Wooley Multiplier for DSP Applications

Joshin Mathews Joseph & V. Sarada
Department of Electronics and Communication Engineering, SRM University, Kattankulathur, Chennai, India
E-mail: joshin166@gmail.com, saradasaran@gmail.com

Abstract
This paper presents a power-efficient reconfigurable Baugh-Wooley multiplier that provides six configuration modes: (1) one n × n fixed-width multiplier, (2) two n/2 × n/2 fixed-width multipliers, (3) one n/2 × n/2 fixed-width multiplier, (4) one n/2 × n/2 full-precision multiplier, (5) two n/4 × n/4 full-precision multipliers, and (6) one n/4 × n/4 full-precision multiplier. A conventional multiplier consumes considerable power in a DSP processor. The proposed multiplier architecture supports sub-word parallelism and additional features that enhance performance in DSP applications, while taking slightly less area and delay than conventional multipliers for general-purpose processing. To reduce power, the gated-clock technique and the zero-input technique are applied. A fixed-width multiplier is used to implement the six modes. The power of the proposed reconfigurable structure is 6-7% lower than that of the existing pipelined reconfigurable Baugh-Wooley multiplier. The architecture supports a higher clock frequency than the 2's complement Baugh-Wooley multiplier, which supports 100 MHz. The area of the proposed architecture is 20% smaller than that of the existing 2's complement Baugh-Wooley multiplier.

Keywords: Baugh-Wooley algorithm, reconfigurable, clock gating, two's complement multiplication, hardware description language (HDL).

I. INTRODUCTION
Multiplication is a very important operation in DSP applications. Power-efficient multipliers are essential because of the growing demand for computing and communication operations. Most multiplication algorithms are based on the Baugh-Wooley or Booth algorithms [1][2][3].
These algorithms are widely used in digital filters, fast Fourier transforms, discrete cosine transforms, convolution, wavelet transforms and other DSP-related multimedia applications. DSP applications require flexible operation, low power and high performance, so modifications are needed to meet all these requirements. In most cases fixed-width multipliers are used, which reduces area and power to a large extent. The hardware implementation of a multiplication consists of three stages: generation of partial products, reduction of partial products, and final carry-propagate addition. In a fixed-width multiplier the least significant part of the partial-product array is truncated and only the higher-order bits are processed [6]. Ignoring the least significant partial products leads to two main errors in the multiplication: reduction error and rounding error. A full-width n × n multiplier produces a 2n-bit output as the sum of the partial products. If the final product is truncated to n bits, the least significant columns of the product matrix contribute little to the final result. As more of these columns are eliminated, the area, power consumption and delay of the arithmetic unit are reduced considerably. Different DSP functions require different configurations. Providing a separate multiplier structure for each configuration increases hardware complexity. Reconfiguring an existing structure gives greater flexibility without compromising performance.
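The effect of discarding the least significant columns can be illustrated with a small Python model. This is only a sketch under simplifying assumptions that are not taken from the paper: the operands are unsigned, columns 0 to n-1 of the partial-product array are dropped entirely, and no error compensation is applied. The function names are hypothetical.

```python
def partial_product_bits(x, y, n):
    """Yield (column, bit) for each partial-product bit x_i AND y_j."""
    for i in range(n):
        for j in range(n):
            yield i + j, ((x >> i) & 1) & ((y >> j) & 1)


def full_precision(x, y, n):
    """Sum every partial-product bit: the exact 2n-bit product."""
    return sum(bit << col for col, bit in partial_product_bits(x, y, n))


def fixed_width(x, y, n):
    """Discard columns 0..n-1 before summing, then return the upper n bits."""
    kept = sum(bit << col
               for col, bit in partial_product_bits(x, y, n)
               if col >= n)
    return kept >> n


def truncation_error(x, y, n):
    """Ideal upper n bits minus the uncompensated fixed-width result."""
    return (full_precision(x, y, n) >> n) - fixed_width(x, y, n)


if __name__ == "__main__":
    # Worst case for n = 8: every discarded partial-product bit is 1.
    print(full_precision(255, 255, 8) >> 8)  # 254
    print(fixed_width(255, 255, 8))          # 247
    print(truncation_error(255, 255, 8))     # 7
```

In this uncompensated model the error lies between 0 and n-1 units of the least significant output bit, which is why fixed-width designs add a compensation term for the discarded columns.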
Earlier reconfigurable structures have four modes for various DSP functions [5]. In this paper the number of modes is increased to six by reconfiguring the low-power fixed-width multiplier structure with power reduction techniques. An error compensation technique is also included in the design to reduce the error [7]. The six configuration modes are:
1. One n × n fixed-width multiplier

2. Two n/2 × n/2 fixed-width multipliers
3. One n/2 × n/2 fixed-width multiplier
4. One n/2 × n/2 full-precision multiplier
5. Two n/4 × n/4 full-precision multipliers
6. One n/4 × n/4 full-precision multiplier

This work introduces a pipelined, power-efficient reconfigurable Baugh-Wooley multiplier with six configuration modes that operate on various bit lengths. The paper is organized as follows: Section 2 describes the two's complement parallel array multiplication algorithm, Section 3 presents the design of the reconfigurable fixed-width Baugh-Wooley multiplier, Section 4 discusses the power reduction techniques used in the proposed architecture, and Section 5 presents simulation results. Brief concluding remarks close the paper.

II. TWO'S COMPLEMENT PARALLEL ARRAY MULTIPLICATION ALGORITHM
In high-performance circuits the multiplier occupies most of the area of the arithmetic unit. Two's complement is the most popular method of representing signed integers in computing. It is widely used because the addition and subtraction circuitry does not need to examine the signs of the operands to determine whether to add or subtract. The Baugh-Wooley multiplier is used for both unsigned and signed multiplication. It operates on signed operands in 2's complement representation and ensures that the signs of all partial products are positive. The unsigned multiplication matrix is modified for two's complement operands using the technique of Baugh and Wooley [1]. The n-bit inputs of the multiplier are represented in two's complement as follows (written here with integer bit weights; the fractional form differs only by a scale factor of 2^{-(n-1)} per operand):

$$X = -x_{n-1}2^{n-1} + \sum_{i=0}^{n-2} x_i 2^i \tag{1}$$

$$Y = -y_{n-1}2^{n-1} + \sum_{j=0}^{n-2} y_j 2^j \tag{2}$$

A full-precision product X·Y is given by

$$X \cdot Y = x_{n-1}y_{n-1}2^{2n-2} + \sum_{i=0}^{n-2}\sum_{j=0}^{n-2} x_i y_j 2^{i+j} - 2^{n-1}\sum_{i=0}^{n-2} x_i y_{n-1} 2^i - 2^{n-1}\sum_{j=0}^{n-2} x_{n-1} y_j 2^j \tag{3}$$

The first two terms of the above equation are positive and the last two terms are negative. To calculate the product, instead of subtracting the last two terms it is possible to add their opposite values [1][4]. Since the representation is 2's complement, the opposite is easily calculated by complementing every bit and adding 1 in the least significant column. The final equation is

$$X \cdot Y = x_{n-1}y_{n-1}2^{2n-2} + \sum_{i=0}^{n-2}\sum_{j=0}^{n-2} x_i y_j 2^{i+j} + 2^{n-1}\sum_{i=0}^{n-2} \overline{x_i y_{n-1}}\, 2^i + 2^{n-1}\sum_{j=0}^{n-2} \overline{x_{n-1} y_j}\, 2^j + 2^n - 2^{2n-1} \tag{4}$$

Fig. 1: Partial product array diagram for an n × n Baugh-Wooley multiplier.
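The complemented-partial-product form of the Baugh-Wooley product can be checked with a short Python model. This is a behavioural sketch, not the hardware described in the paper. It assumes integer bit weights and the commonly used form in which the two negative groups of partial products are complemented and the constants 2^n and 2^(2n-1) are added modulo 2^(2n). The function name `baugh_wooley` is hypothetical.

```python
def baugh_wooley(x, y, n):
    """Multiply two n-bit two's complement integers using only positive
    partial products (complemented-partial-product Baugh-Wooley form)."""
    # Python's >> on negative ints is arithmetic, so (v >> i) & 1 yields
    # the two's complement bit i of v.
    xb = [(x >> i) & 1 for i in range(n)]
    yb = [(y >> j) & 1 for j in range(n)]

    # Sign-bit product x_{n-1} y_{n-1} at weight 2^(2n-2).
    acc = (xb[n - 1] & yb[n - 1]) << (2 * n - 2)

    # Ordinary AND partial products of the magnitude bits.
    for i in range(n - 1):
        for j in range(n - 1):
            acc += (xb[i] & yb[j]) << (i + j)

    # Complemented (NAND) partial products that involve one sign bit.
    for i in range(n - 1):
        acc += (1 - (xb[i] & yb[n - 1])) << (n - 1 + i)
    for j in range(n - 1):
        acc += (1 - (xb[n - 1] & yb[j])) << (n - 1 + j)

    # Correction constants: 2^n - 2^(2n-1) is congruent to
    # 2^n + 2^(2n-1) modulo 2^(2n).
    acc += (1 << n) + (1 << (2 * n - 1))

    # Keep 2n bits and reinterpret them as a signed value.
    acc &= (1 << (2 * n)) - 1
    if acc >> (2 * n - 1):
        acc -= 1 << (2 * n)
    return acc


if __name__ == "__main__":
    print(baugh_wooley(-8, -8, 4))   # 64
    print(baugh_wooley(-3, 5, 4))    # -15
```

Every term added to the accumulator is non-negative, which is the property that lets the array be built from plain adders.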

Equation (4) represents the Baugh-Wooley algorithm for two's complement multiplication [1].

III. DESIGN OF RECONFIGURABLE FIXED WIDTH BAUGH WOOLEY MULTIPLIER
This section describes the implementation of six different configuration modes under limited hardware resources. Most applications require only a single-precision product, in which the double-word-length result is rounded to single precision. It is then only necessary to estimate the carries that ripple into the most significant part of the product [8]. The present work reduces the accuracy degradation of fixed-width multipliers by truncation combined with rounding, which achieves accuracy almost equal to that of rounding with little additional circuit complexity. Three modules, denoted MUL1, MUL2 and MUL3, are used to realize the six modes of operation. Configuration parameters are set to select each mode. The detailed structures of MUL1, MUL2 and MUL3 are given in the earlier paper [5]. The prototype of the reconfigurable architecture is shown in Fig. 2.

Fig. 2: Proposed pipelined reconfigurable multiplier.

3.1 CM1: n × n fixed-width multiplier
In CM1, the multiplier receives two n-bit inputs and produces an n-bit product. All three multiplier blocks are used for the calculation. Each partial product is generated independently and the results are summed to obtain the final result. In this mode a compensation vector is used to add the carry at the final stage. To avoid adding the compensation vector twice, a control unit is used in multiplier block MUL1. The partial product array diagram and the configuration parameters are shown in Fig. 3.

Fig. 3: (a) Partial products for fixed-width multiplication, (b) partial products for CM1, (c) configuration parameters.

3.2 CM2: two n/2 × n/2 fixed-width multipliers
The inputs are given as two n/2-bit numbers and the outputs are taken as two n/2-bit numbers. The MUL1 and MUL2 blocks are suitable for the two n/2 × n/2 multiplications. In this mode the configuration parameters CP0, CP1 and CP2 are all set to 1.

Fig. 4: (a) Partial products for CM2, (b) input and output relations for CM2.

3.3 CM3: one n/2 × n/2 fixed-width multiplier
In the previous mode (CM2), two multipliers are used to obtain the final result. Two simultaneous multiplications are not necessary for applications with smaller bit lengths, so only one multiplier is required to obtain the result. Power consumption is reduced by using only the multiplier block MUL1.

Fig. 5: (a) Proposed partial product array diagram for CM3, (b) configuration parameter settings.

3.4 CM4: one n/2 × n/2 full-precision multiplier
In this mode multiplier block MUL3 alone is used. Two n/2-bit numbers are multiplied and an n-bit product is produced. The partial product diagram and the mode settings are given in Fig. 6.

Fig. 6: (a) Partial products for CM4, (b) configuration parameters for CM4.

3.5 CM5: two n/4 × n/4 full-precision multipliers
This configuration mode is widely used in low-resolution operations and performs two n/4 × n/4 full-precision multiplications. With a minimum number of modules and a suitable partial-product configuration, MUL3 is used to carry out this mode. The parameter settings are explained in Fig. 7.

3.6 CM6: one n/4 × n/4 full-precision multiplier
This mode is an extension of CM5 that uses fewer resources to perform the multiplication. It is advantageous for low-power applications, since only a small part of the architecture is used. Only the higher-order bits of MUL3 are used for the calculation, and only the higher-order bits of both inputs are involved.

Using the operating modes above and the reconfigurable architecture, a new architecture is proposed to provide this functionality. The figure gives an overview of the architecture. The entire architecture is divided into three stages. Stage 1 decodes the operating condition for the different modes of operation.
The decoded mode-select bits determine which multiplier function is performed at a particular time. The mode-select bits are determined according to the reconfigurable regions or modules designed. The operation code (op) determines the type of multiplication performed: n × n fixed-width, n/2 × n/2 fixed-width, n/2 × n/2 full-precision, or n/4 × n/4 full-precision. In the second stage each MUL module performs an independent multiplication according to the multiplicand inputs and the decoded control signals from stage 1. The product from each MUL is then sent to stage 3 for final addition. A MUX in the final stage selects the output of the multipliers based on the input control signals.
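The block usage described in Sections 3.1 to 3.6 can be summarised as a lookup table, which is roughly what the stage 1 decoder has to produce. The Python sketch below is illustrative only: the paper does not give the op-code encoding, so modes are keyed by name, and `ModeConfig`, `MODES` and `operand_and_product_bits` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeConfig:
    description: str
    active_blocks: frozenset   # MUL blocks used in this mode
    multiplications: int       # parallel multiplications performed
    operand_divisor: int       # operand width is n / operand_divisor
    full_precision: bool       # True: product has twice the operand width


MODES = {
    "CM1": ModeConfig("one n x n fixed-width",
                      frozenset({"MUL1", "MUL2", "MUL3"}), 1, 1, False),
    "CM2": ModeConfig("two n/2 x n/2 fixed-width",
                      frozenset({"MUL1", "MUL2"}), 2, 2, False),
    "CM3": ModeConfig("one n/2 x n/2 fixed-width",
                      frozenset({"MUL1"}), 1, 2, False),
    "CM4": ModeConfig("one n/2 x n/2 full-precision",
                      frozenset({"MUL3"}), 1, 2, True),
    "CM5": ModeConfig("two n/4 x n/4 full-precision",
                      frozenset({"MUL3"}), 2, 4, True),
    # CM6 uses only the higher-order part of MUL3.
    "CM6": ModeConfig("one n/4 x n/4 full-precision",
                      frozenset({"MUL3"}), 1, 4, True),
}


def operand_and_product_bits(mode, n):
    """Return (operand width, product width) of one multiplication."""
    cfg = MODES[mode]
    operand = n // cfg.operand_divisor
    product = 2 * operand if cfg.full_precision else operand
    return operand, product


if __name__ == "__main__":
    for name, cfg in MODES.items():
        print(name, sorted(cfg.active_blocks),
              operand_and_product_bits(name, 16))
```

For n = 16, for example, CM4 multiplies two 8-bit operands into a 16-bit product using MUL3 only, while CM2 runs two 8-bit fixed-width multiplications on MUL1 and MUL2.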

The hardware overhead is the main disadvantage of this scheme. The duplicated registers can increase the area of the multiplier.

Fig. 7: (a) Partial products for CM5, (b) configuration parameters for CM5.

IV. DESIGN OF RECONFIGURABLE POWER EFFICIENT ARCHITECTURE
Power consumption in Baugh-Wooley multipliers is low compared with other conventional multiplier units. Because it handles both signed and unsigned binary multiplication, Baugh-Wooley multiplication is well suited to the reconfigurable multiplier implementation. The reconfigurable structure invokes all the hardware resources for its operation. Introducing clock gating and the zero-input technique into the proposed structure makes it more power efficient. The control signal n is introduced to achieve modes CM3 and CM6. It has no significance in modes CM1 and CM4. The power-efficient reconfigurable fixed-width multiplier is shown in Fig. 8.

4.1 Clock gating
Clock gating is applied to the registers in the second and third stages of the multiplier. The main aim is to avoid unnecessary transitions during multiplication. Only the registers that are not required are disabled, according to the mode of operation:
1. In mode CM1, MUL1, MUL2 and MUL3 are conditionally disabled based on the zero inputs to the multiplier.
2. In mode CM2, MUL3 is disabled.
3. In mode CM3, MUL2 and MUL3 are disabled.
4. In mode CM4, MUL1 and MUL2 are disabled.
5. In mode CM5, MUL1 and MUL2 are disabled.
6. In mode CM6, MUL1 and MUL2 are disabled and MUL3 is partially disabled by disabling the gated register.

Fig. 8: Proposed power-efficient pipelined reconfigurable fixed-width multiplier.

4.2 Zero input technique
The functional blocks MUL1, MUL2 and MUL3 can be disabled based on the zero inputs they receive. The conditions for zero values are as follows:
1. If x[7:4] is zero, the input registers of MUL1 and MUL3 can be disabled.
2. If x[3:0] is zero, the input register of MUL2 can be disabled.
3. If y[7:4] is zero, the input registers of MUL2 and MUL3 can be disabled.
4. If y[3:0] is zero, the input register of MUL1 can be disabled.

Even when the input operands of a block are zero, its output may not be zero, because some of the partial products in the multiplication are complemented. The actual outputs of MUL3 and MUL2 should be (11110000)_2 and (001111)_2 [5]. The output of MUL1 is not the same in all cases; it depends on the partial product vector. In that case the actual product of MUL1 in the disabled condition is {0100, x3y3 & Km2, (x3y3 & Km2) }. The control unit (CU) sets Km2 = 1 when MUL2 is disabled. Latch L holds the present value when MUL1 is disabled. For operations other than mode CM1, the input registers of ADD1 can be disabled. Based on the conditions above, the input signal is decoded and g_m1, g_m2 and g_m3 are generated, which control the gated registers of MUL1, MUL2 and MUL3 respectively. The gated register at stage 3 is controlled by t[3], which takes the value 1 only in mode CM1.

V. SIMULATION RESULTS
Fig. 9: Simulated power of the reconfigurable 2's complement multiplier.

Fig. 10: Simulated power of the power-efficient reconfigurable 2's complement multiplier.

The simulated results show that the power-efficient reconfigurable multiplier is more efficient than the normal pipelined reconfigurable multiplier. Based on the LUT area used in each structure, the power-efficient reconfigurable 2's complement multiplier consumes less area than the normal pipelined reconfigurable multiplier. Hence the area and the power consumption are reduced, and the performance and the throughput are increased.

VI. CONCLUSION
A pipelined, reconfigurable, power-efficient two's complement multiplier using the Baugh-Wooley algorithm has been implemented. The structure has been modeled in Verilog HDL. Better power efficiency is achieved by clock gating and the zero-input technique. The power-efficient architecture reduces power by 6-7% with respect to the proposed reconfigurable multiplier with six modes. The frequency of operation is doubled compared with other reconfigurable architectures. The same methodology can be used for n = 16, 32 and 64. The average power of the multiplier is reduced with the addition of two more modes.

VII. REFERENCES
[1] C.R. Baugh and B.A. Wooley, "A Two's Complement Parallel Array Multiplication Algorithm," IEEE Trans. Computers, vol. 22, no. 12, pp. 1045-1047, Dec. 1973.
[2] A.D. Booth, "A Signed Binary Multiplication Technique," Quarterly J. Mechanics and Applied Math., vol. 4, pp. 236-240, 1951.
[3] O.L. MacSorley, "High-Speed Arithmetic in Binary Computers," Proc. Conf. Institute of Radio Engineers (IRE 61), vol. 49, pp. 67-91, 1961.
[4] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design. John Wiley, 1979.
[5] J.-H. Tu and L.-D. Van, "Power-Efficient Pipelined Reconfigurable Fixed-Width Baugh-Wooley Multipliers," IEEE Trans. Computers, vol. 58, no. 10, Oct. 2009.
[6] J.M. Jou, S.R. Kuang, and R.D. Chen, "Design of Low-Error Fixed-Width Multiplier for DSP Applications," IEEE Trans. Circuits and Systems, vol. 46, no. 6, pp. 836-842, 1999.
[7] S. Krithivasan and M.J. Schulte, "Multiplier Architectures for Media Processing," Proc. IEEE Asilomar Conf. Signals, Systems, and Computers, vol. 2, pp. 2193-2197, Nov. 2003.
[8] Y.-L. Tsao, W.-H. Chen, M.-H. Tan, M.-C. Lin, and S.-J. Jou, "Low-Power Embedded DSP Core for Communication Systems," EURASIP J. Applied Signal Processing, pp. 1355-1370, Jan. 2003.