Implementation of Parallel MAC Unit in 8×8 Pre-Encoded NR4SD Multipliers

Justin K Joy 1, Deepa N R 2, Nimmy M Philip 3
1 PG Scholar, Department of ECE, FISAT, MG University, Angamaly, Kerala, justinkjoy333@gmail.com
2 Assistant Professor, Department of ECE, FISAT, MG University, Angamaly, Kerala, nrdeeps@gmail.com
3 Assistant Professor, Department of ECE, FISAT, MG University, Angamaly, Kerala, nimmyphilip@fisat.ac.in

Abstract
This paper describes merged multiply-accumulate (MAC) hardware built around a pre-encoded non-redundant radix-4 signed-digit (NR4SD) multiplier. New designs of pre-encoded multipliers are explored by encoding the standard coefficients off-line with an NR4SD encoder and storing them in system memory. The NR4SD encoder, the CSA tree, the NR4SD multiplier, and the accumulator sections are combined to obtain a fast implementation. A fast pipelined 8×8 multiplier implementation with a parallel multiply-accumulate unit is proposed. Extensive experimental analysis verifies the gains of the proposed pre-encoded NR4SD multipliers with MAC unit in terms of area complexity and power consumption compared to the conventional MB multiplier.

Index Terms
NR4SD multiplier, MAC unit, MB multiplier, off-line encoder

I. INTRODUCTION
Multipliers play an important role in today's digital signal processing and various other applications. With advances in technology, many researchers have tried, and are still trying, to design multipliers that offer one or more of the following design targets: high speed, low power consumption, and regularity of layout and hence less area, or even a combination of them in one multiplier, thus making them suitable for various high-speed, low-power and compact VLSI implementations. The common multiplication method is the "add and shift" algorithm. In parallel multipliers, the number of partial products to be added is the main parameter that determines the performance of the multiplier.
To reduce the number of partial products to be added, the Modified Booth algorithm is one of the most popular algorithms. To achieve speed improvements, the Wallace tree algorithm can be used to reduce the number of sequential adding stages. Further, by combining the Modified Booth algorithm and the Wallace tree technique, the advantages of both can be obtained in one multiplier.

The multipliers in the butterfly units of FFT processors use standard coefficients stored in ROMs. In audio and video codecs, fixed coefficients stored in memory are used as multiplication inputs. Since the values of constant coefficients are known in advance, we encode the coefficients off-line based on the MB encoding and store the MB encoded coefficients (i.e., 3 bits per digit) in a ROM. Using this technique, the encoding circuit of the MB multiplier is omitted. We refer to this design as the pre-encoded MB multiplier.

Modified Booth (MB) encoding tackles the limitations of the canonic signed digit (CSD) multiplier and halves the number of partial products, resulting in reduced area, critical delay and power consumption. However, a dedicated encoding circuit is required and the partial products generation is more complex. The Non-Redundant radix-4 Signed-Digit (NR4SD) encoding scheme uses one of the following sets of digit values: {+2, +1, 0, -1} or {+1, 0, -1, -2}. In order to cover the dynamic range of the 2's complement form, all digits of the proposed representation are encoded according to NR4SD except the most significant one, which is MB encoded. Using the proposed encoding formula, standard coefficients are pre-encoded and stored in a ROM in a condensed form (i.e., 2 bits per digit). Compared to the pre-encoded MB multiplier, in which the encoded coefficients need 3 bits per digit, the proposed NR4SD scheme reduces the memory size.
Also, compared to the MB form, which uses five digit values {+2, +1, 0, -1, -2}, the proposed NR4SD encoding uses four digit values. Thus, the NR4SD-based pre-encoded multipliers include a less complex partial products generation circuit. The efficiency of the aforementioned pre-encoded multipliers is explored taking into account the size of the coefficients ROM.

www.ijrcct.org 269

Modern computers may contain a dedicated MAC, consisting of a multiplier implemented in combinational logic followed by an adder and an accumulator register that stores the result. The output of the register is fed back to one input of the adder, so that on each clock cycle the output of the multiplier is added to the register. Combinational multipliers require a large amount of logic, but can compute a product much more quickly than the shift-and-add method typical of earlier computers. The first processors to be equipped with MAC units were digital signal processors, but the technique is now also common in general-purpose processors. Multiply-accumulate operations are common in signal processing algorithms and matrix arithmetic. The multiplier speed is usually the bottleneck that determines the depth of pipelining in the ALU or how fast a processor can run.

The main contributions addressed in this paper are:
1) Different multiplier architectures are compared, and the proposed NR4SD multipliers show the best results.
2) The advantage of pre-encoded multipliers.
3) The merged MAC unit with different multipliers.

This paper is organized as follows. Section II briefly discusses the general architecture of the different multiplication algorithms employed. The factors that optimize each step, such as the encoding scheme and the partial product generation, are discussed and compared to earlier architectures. We explore the efficiency of the aforementioned pre-encoded multipliers taking into account the size of the coefficients ROM. Section III discusses the conventional multiply-accumulate unit and the proposed method, irrespective of the multiplier architecture. Section IV compares the different multiplier architectures on the basis of area and delay. Extensive experimental analysis verifies the gains of the proposed pre-encoded NR4SD multipliers in terms of area complexity and power consumption compared to the conventional MB multiplier.
Section IV also compares the proposed pipelined MAC structure with NR4SD multiplier against the conventional MAC unit, and estimates the delays of the different sections of the proposed pipelined multiplier.

II. MULTIPLIER ARCHITECTURES
The multiplier architectures explained below are the conventional Modified Booth multiplier, the pre-encoded Modified Booth multiplier, and the pre-encoded NR4SD multipliers.

A. Modified Booth Multiplier
The Booth algorithm provides a procedure for multiplying binary integers in signed 2's complement representation. The Booth algorithm that scans strings of three bits is given below:
1) Extend the sign bit by one position if necessary to ensure that n is even.
2) Append a 0 to the right of the LSB of the multiplier.
3) According to the value of each three-bit vector, each partial product is 0, +M, -M, +2M or -2M, where M denotes the multiplicand.

The negative multiples are formed by taking the 2's complement of M, and in this paper fast Carry-Save Adders (CSA) are used for the additions. The multiplication of M by 2 is done by shifting M one bit to the left. Thus, in designing an n-bit parallel multiplier, at most n/2 partial products are produced.

Figure 1 presents the architecture of the system, which comprises the conventional MB multiplier and the ROM with coefficients in 2's complement form. Let us consider the multiplication X * Y. The coefficient $Y = \langle b_{n-1} \ldots b_0 \rangle_{2's}$ consists of n = 2k bits and is driven to the MB encoding blocks from a ROM where it is stored in 2's complement form. It is encoded according to the MB algorithm and multiplied by $X = \langle a_{n-1} \ldots a_0 \rangle_{2's}$, which is in 2's complement representation. The ROM data bus width equals the width of coefficient Y (n bits), and the ROM outputs one coefficient on each clock cycle.

Figure 1: Block diagram of conventional modified booth algorithm

Modified Booth (MB) is a redundant radix-4 encoding technique. Considering the multiplication of the 2's complement numbers X, Y, each one consisting of n = 2k bits, Y can be represented in MB form as:

$$Y = \langle b_{n-1} \ldots b_0 \rangle_{2's} = -b_{2k-1} 2^{2k-1} + \sum_{i=0}^{2k-2} b_i 2^i = \sum_{j=0}^{k-1} b_j^{MB} 2^{2j} \tag{1}$$

$$b_j^{MB} = -2 b_{2j+1} + b_{2j} + b_{2j-1} \tag{2}$$

where $b_{-1} = 0$.

Each MB digit is represented by the bits s, one and two. The bit s shows whether the digit is negative (s = 1) or positive (s = 0). The bit one shows whether the absolute value of the digit equals 1 (one = 1) or not (one = 0). The bit two shows whether the absolute value of the digit equals 2 (two = 1) or not (two = 0).

TABLE I: TRUTH TABLE FOR MB ENCODING SCHEME
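The MB recoding described above can be sketched in a few lines of Python. This is only an illustrative model, not the authors' Verilog: the function names `mb_encode` and `mb_multiply` are hypothetical, the digit rule used is the standard radix-4 Booth formula (digit j equals -2*b[2j+1] + b[2j] + b[2j-1], with b[-1] = 0), and the sign bit s of a zero digit is assumed to be 0, which a hardware encoder may choose differently.

```python
def mb_encode(y, n=8):
    """Recode an n-bit 2's complement integer y into n/2 Modified Booth digits.

    Returns a list of (digit, s, one, two) tuples, least significant digit first.
    """
    if n % 2 != 0:
        raise ValueError("n must be even (extend the sign bit by one position first)")
    if not -(1 << (n - 1)) <= y < (1 << (n - 1)):
        raise ValueError("y does not fit in n bits")
    # Python's >> is arithmetic, so this yields the 2's complement bits b_0 .. b_{n-1}.
    bits = [(y >> i) & 1 for i in range(n)]
    digits = []
    prev = 0  # b_{-1} = 0
    for j in range(n // 2):
        b_lo, b_hi = bits[2 * j], bits[2 * j + 1]
        digit = -2 * b_hi + b_lo + prev      # standard radix-4 Booth digit
        s = 1 if digit < 0 else 0            # assumption: a zero digit is flagged as positive
        one = 1 if abs(digit) == 1 else 0
        two = 1 if abs(digit) == 2 else 0
        digits.append((digit, s, one, two))
        prev = b_hi
    return digits


def mb_multiply(x, y, n=8):
    """Multiply x by y by summing the weighted partial products digit * x * 4**j."""
    return sum(d * x * 4 ** j for j, (d, _, _, _) in enumerate(mb_encode(y, n)))
```

For an 8-bit coefficient this produces four digits, so `mb_multiply` adds four partial products instead of eight, which is the halving that motivates the MB form.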

Figure 2: Block diagram of NR4SD multipliers

The generation of the i-th bit $p_{j,i}$ of the partial product $PP_j$ is illustrated at gate level in Fig. 7. For the computation of the least and most significant bits of $PP_j$, $a_{-1} = 0$ and $a_n = a_{n-1}$. After shaping the partial products, they are added, properly weighted, through a Carry Save Adder (CSA) tree along with the correction term. The carry-save output of the tree is fed to a fast Carry Look Ahead (CLA) adder to form the final result P = X * Y.

In the pre-encoded MB multiplier scheme, the coefficient B (denoted Y above) is encoded off-line according to the conventional MB form (Table I). The resulting encoding signals of B are stored in a ROM. The shaded part of Fig. 1, which contains the ROM with coefficients in 2's complement form and the MB encoding circuit, is now totally replaced by the ROM with coefficients in MB form. The MB encoding blocks of Fig. 1 are omitted. The new ROM is used to store the encoding signals of B and feed them into the partial product generators ($PP_j$ Generators, PPG) on each clock cycle. Since the n-bit coefficient B needs three bits per digit when encoded in MB form, the ROM width requirement is 3n/2 bits per coefficient. Thus, the width and the overall size of the ROM are increased by 50% compared to the ROM of the conventional scheme.

B. NR4SD Multipliers
There are two pre-encoded NR4SD multipliers: (a) NR4SD- and (b) NR4SD+. Both have the same block diagram; they differ only in the encoding section, which is explained later in this section. The system architecture for the pre-encoded NR4SD multipliers is presented in Fig. 2. Two bits per digit are now stored in the ROM, for either the NR4SD- or the NR4SD+ form. In this way, the memory requirement is reduced to n+1 bits per coefficient, while the corresponding memory required for the pre-encoded MB scheme is 3n/2 bits per coefficient.
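Thus, the amount of stored bits is equal to that of the conventional MB design, except for the most significant digit, which needs an extra bit because it is MB encoded.

In MB form, the number of partial products is reduced to half. When encoding the 2's complement number B, the digits take one of four values, {+1, 0, -1, -2} or {+2, +1, 0, -1}, in the NR4SD- or NR4SD+ algorithm, respectively. Only four different values are used, and not five as in the MB algorithm, for the digit positions $0 \le j \le k-2$. As we need to cover the dynamic range of the 2's complement form, the most significant digit is MB encoded. The NR4SD- encoding technique is illustrated below:

Figure 3: Block Diagram of the NR4SD- Encoding Scheme at the (a) Digit and (b) Word Level

Step 1: Consider the initial values j = 0 and $c_0 = 0$.
Step 2: Calculate the carry $c_{2j+1}$ and the sum $n^+_{2j}$ of a Half Adder (HA) with inputs $b_{2j}$ and $c_{2j}$:

$$c_{2j+1} = b_{2j} \wedge c_{2j} \tag{5}$$

$$n^+_{2j} = b_{2j} \oplus c_{2j} \tag{6}$$

Step 3: Calculate the carry $c_{2j+2}$ and the negatively weighted sum $n^-_{2j+1}$ for the inputs $b_{2j+1}$ and $c_{2j+1}$, so that $2c_{2j+2} - n^-_{2j+1} = b_{2j+1} + c_{2j+1}$:

$$c_{2j+2} = b_{2j+1} \vee c_{2j+1} \tag{7}$$

$$n^-_{2j+1} = b_{2j+1} \oplus c_{2j+1} \tag{8}$$

Step 4: Calculate the value of the digit:

$$b_j^{NR-} = -2 n^-_{2j+1} + n^+_{2j} \tag{9}$$

Step 5: Set j = j + 1. If (j < k-1), go to Step 2. If (j = k-1), encode the most significant digit based on the MB algorithm, considering the three consecutive bits to be $b_{2k-1}$, $b_{2k-2}$ and $c_{2k-2}$. If (j = k), stop.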
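The digit recoding just described can be checked with a short Python model covering both digit sets. It is a sketch under stated assumptions rather than the authors' encoder: the name `nr4sd_encode` and the `variant` argument are hypothetical, and the half-adder relations are the ones that follow from the two digit sets (an ordinary half adder gives in1 + in2 = 2*carry + sum, while the negatively weighted one gives in1 + in2 = 2*carry - sum). The most significant digit is MB encoded from the two top bits and the incoming carry, as the text states.

```python
def nr4sd_encode(y, n=8, variant="-"):
    """Recode an n-bit 2's complement integer y into n/2 radix-4 digits (LSD first).

    Digits 0 .. k-2 are NR4SD digits ({-2,-1,0,+1} for "-", {-1,0,+1,+2} for "+");
    digit k-1 is Modified Booth encoded so that the full 2's complement range is covered.
    """
    if n < 2 or n % 2 != 0:
        raise ValueError("n must be a positive even number")
    if variant not in ("-", "+"):
        raise ValueError('variant must be "-" or "+"')
    if not -(1 << (n - 1)) <= y < (1 << (n - 1)):
        raise ValueError("y does not fit in n bits")
    b = [(y >> i) & 1 for i in range(n)]  # 2's complement bits b_0 .. b_{n-1}
    k = n // 2
    digits = []
    c = 0  # c_0 = 0
    for j in range(k - 1):
        lo, hi = b[2 * j], b[2 * j + 1]
        if variant == "-":
            # Ordinary half adder on the low bit: lo + c = 2*c1 + n_plus
            c1, n_plus = lo & c, lo ^ c
            # Negatively weighted sum on the high bit: hi + c1 = 2*c2 - n_minus
            c2, n_minus = hi | c1, hi ^ c1
            digits.append(-2 * n_minus + n_plus)
        else:
            # Negatively weighted sum on the low bit: lo + c = 2*c1 - n_minus
            c1, n_minus = lo | c, lo ^ c
            # Ordinary half adder on the high bit: hi + c1 = 2*c2 + n_plus
            c2, n_plus = hi & c1, hi ^ c1
            digits.append(2 * n_plus - n_minus)
        c = c2
    # Most significant digit: MB rule applied to b_{2k-1}, b_{2k-2} and the carry c_{2k-2}
    digits.append(-2 * b[2 * k - 1] + b[2 * k - 2] + c)
    return digits
```

As a usage example, `nr4sd_encode(7, 8, "-")` returns `[-1, -2, 1, 0]`, and -1 - 2*4 + 1*16 = 7. Each NR4SD digit is described by two stored bits and the MB digit by three, which matches the n+1 bits per coefficient quoted in the text.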

TABLE II: TRUTH TABLE OF NR4SD- ENCODING SCHEME

The equations below show how the NR4SD- encoding signals $one_j^+$, $one_j^-$ and $two_j^-$ of Table II are generated, where $n^-_{2j+1}$ and $n^+_{2j}$ are the negatively and positively weighted sum bits produced by the NR4SD- algorithm for digit j:

$$one_j^+ = \overline{n^-_{2j+1}} \wedge n^+_{2j} \tag{10}$$

$$one_j^- = n^-_{2j+1} \wedge n^+_{2j} \tag{11}$$

$$two_j^- = n^-_{2j+1} \wedge \overline{n^+_{2j}} \tag{12}$$

The NR4SD+ encoding technique is illustrated below:

Figure 4: Block Diagram of the NR4SD+ Encoding Scheme at the (a) Digit and (b) Word Level

Step 1: Consider the initial values j = 0 and $c_0 = 0$.
Step 2: Calculate the carry $c_{2j+1}$ and the negatively weighted sum $n^-_{2j}$ for the inputs $b_{2j}$ and $c_{2j}$, so that $2c_{2j+1} - n^-_{2j} = b_{2j} + c_{2j}$:

$$c_{2j+1} = b_{2j} \vee c_{2j} \tag{13}$$

$$n^-_{2j} = b_{2j} \oplus c_{2j} \tag{14}$$

Step 3: Calculate the carry $c_{2j+2}$ and the sum $n^+_{2j+1}$ of a Half Adder (HA) with inputs $b_{2j+1}$ and $c_{2j+1}$:

$$c_{2j+2} = b_{2j+1} \wedge c_{2j+1} \tag{15}$$

$$n^+_{2j+1} = b_{2j+1} \oplus c_{2j+1} \tag{16}$$

Step 4: Calculate the value of the digit:

$$b_j^{NR+} = 2 n^+_{2j+1} - n^-_{2j} \tag{17}$$

Step 5: Set j = j + 1. If (j < k-1), go to Step 2. If (j = k-1), encode the most significant digit based on the MB algorithm, considering the three consecutive bits to be $b_{2k-1}$, $b_{2k-2}$ and $c_{2k-2}$. If (j = k), stop.

TABLE III: TRUTH TABLE FOR NR4SD+ ENCODING SCHEME

The NR4SD+ encoding signals of Table III are generated as:

$$one_j^+ = n^+_{2j+1} \wedge n^-_{2j} \tag{18}$$

$$one_j^- = \overline{n^+_{2j+1}} \wedge n^-_{2j} \tag{19}$$

$$two_j^+ = n^+_{2j+1} \wedge \overline{n^-_{2j}} \tag{20}$$

Compared to the pre-encoded MB multiplier, where the MB encoding blocks are omitted, the pre-encoded NR4SD multipliers need extra hardware to generate the encoding signals above for the NR4SD- and NR4SD+ form, respectively. Each partial product of the pre-encoded NR4SD- and NR4SD+ multipliers is implemented based on Fig. 7c and 7d, respectively, except for the $PP_{k-1}$ that corresponds to the most significant digit. As this digit is in MB form, the PPG for the Booth algorithm is used. The partial products, properly weighted, are fed into a CSA tree. The carry-save output of the CSA tree is finally summed using a fast CLA adder.

III. MAC ARCHITECTURES
The general structure of a parallel MAC multiplies two numbers X and Y and adds the result to Z. The partial products in the figure can be generated using any of the multiplication algorithms explained above.

Figure 5: General MAC architecture

We explore two MAC architectures: (a) the conventional MAC architecture and (b) the proposed MAC architecture.

A. Conventional MAC architecture
Step 1: The multiplier operand is stored in the ROM in 2's complement form and is encoded using one of the encoding algorithms, such as MBA, NR4SD- or NR4SD+.
Step 2: The partial products are generated using the circuits in Figure 7. This can be achieved using several techniques, such as the modified Booth algorithm (MBA) or the NR4SD encodings. For an n-bit multiplier, the number of summands is at most n/2+1 for MBA and n/2 for the NR4SD encodings.
Step 3: Partial-product addition is accomplished using carry-save techniques and Wallace trees for parallel multipliers.
Step 4: When the partial products have been reduced to sum and carry words, a final adder is required to generate the multiplication result. The number of bits of the final adder is the sum of the number of bits of the multiplier and the multiplicand. 2n-bit CLAs can be used.
Step 5: The final adder produces a double-precision result of 2n bits that must be added to the accumulator content, which is also 2n bits wide. This is delay intensive, since the multiplier and the accumulator each have a delay that is a multiple of the delay of a one-bit full adder.

Figure 6: The major parts of a general parallel multiplier.

Figure 7: Generation of the i-th Bit $p_{j,i}$ of $PP_j$ for a) Conventional, b) Pre-Encoded MB Multipliers, c) NR4SD-, d) NR4SD+ Pre-Encoded Multipliers, and e) NR4SD-, f) NR4SD+ Pre-Encoded Multipliers after reconstruction.

B. Proposed MAC architecture
The steps of the proposed architecture are almost the same as those of the conventional one, except for step 3 and for step 6, which is eliminated. We include the accumulator in the partial-product summation performed by the CSA tree, which eliminates the 2n-bit CLA adder used to add the accumulator and the product to get the final sum. This sometimes adds one extra level to the CSA tree, but that cost is offset by the elimination of one 2n-bit adder.

Figure 8: The major parts of a proposed MAC using parallel multiplier.

IV. IMPLEMENTATION RESULTS
We implemented in Verilog the multiplier designs of Table IV. The PPGs for the NR4SD- and NR4SD+ multipliers (Fig. 7c, 7d, respectively) contain a large number of inverters, since all the A bits are complemented in case of a negative digit. In order to avoid these inverters and thus reduce the area/power/delay of the NR4SD- and NR4SD+ pre-encoded multipliers, the PPGs for these multipliers were designed based on primitive NAND and NOR gates and replaced by Fig. 7e, 7f, respectively.

TABLE IV: MULTIPLIER DESIGNS, ROM WIDTH FOR N*N BIT INPUT
ENCODING                    ROM WIDTH
Conventional MB             n-bit
Pre_encoded MB              (3n/2)-bit
Pre_encoded NR4SD_MINUS     (n+1)-bit
Pre_encoded NR4SD_PLUS      (n+1)-bit

We used the Xilinx ISE 14.7 standard cell library to synthesize the evaluated designs, considering the highest optimization degree and keeping the hierarchy of the designs. The memory compiler of the same library provided the physical ROMs for the coefficients. Since the ROMs required for the pre-encoded multipliers are larger than the one for the conventional MB scheme, access time is increased. However, the pre-encoded designs may achieve lower clock periods than the conventional MB one, because the encoding circuits that are included in the critical path are omitted or less complex. The NR4SD encoding reduces the memory requirement to n+1 bits per coefficient, while the corresponding memory required for the pre-encoded MB scheme is 3n/2 bits per coefficient. Thus, the amount of stored bits is equal to that of the conventional MB design, except for the most significant digit, which needs an extra bit as it is MB encoded. Compared to the conventional MB multiplier, the MB encoding blocks are omitted in the pre-encoded MB multiplier.

TABLE V: COMPARISON OF AREA OF PARTIAL PRODUCT GENERATION IN DIFFERENT ARCHITECTURES
PARTIAL PRODUCT GENERATION    AREA (gate count)
Conventional MB               483
Pre_encoded MB                441
Pre_encoded NR4SD_MINUS       294
Pre_encoded NR4SD_PLUS        267

The above table shows that partial product generation with the proposed NR4SD encodings is less complex, and that the gate count is reduced according to the experimental results.

TABLE VI: COMPARISON OF AREA AND DELAY OF CSA TREE IN DIFFERENT ARCHITECTURES
CSA_TREE            GATE COUNT    DELAY
MB MULTIPLIER       726           9.33
NR4SD MULTIPLIER    546           8.888

The pre-encoded MB scheme delivers losses in area complexity. This was expected, considering that the ROM required by the pre-encoded MB design is 50% larger than the ROM of the conventional MB scheme. However, the proposed pre-encoded NR4SD designs (ROM and multiplier) deliver improvements in area complexity compared to the conventional MB scheme.
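The ROM widths of Table IV follow directly from the bits stored per digit, and the small Python helper below makes that arithmetic explicit. It is only a sketch: the function names and the scheme labels are hypothetical and simply mirror the table rows, and n is assumed to be even so that a coefficient has n/2 radix-4 digits.

```python
def rom_width_bits(n, scheme):
    """ROM word width in bits for one n-bit coefficient under a given encoding."""
    if n < 2 or n % 2 != 0:
        raise ValueError("n must be a positive even number")
    k = n // 2  # number of radix-4 digits
    if scheme == "conventional_mb":
        return n                    # coefficient stored in plain 2's complement
    if scheme == "pre_encoded_mb":
        return 3 * k                # three bits (s, one, two) per MB digit
    if scheme in ("pre_encoded_nr4sd_minus", "pre_encoded_nr4sd_plus"):
        return 2 * (k - 1) + 3      # two bits per NR4SD digit, three for the MB-encoded MSD
    raise ValueError("unknown scheme: " + scheme)


def rom_size_bits(n, scheme, words):
    """Total ROM size in bits for a table of `words` coefficients."""
    return rom_width_bits(n, scheme) * words
```

For the 8×8 case considered in this paper, the widths come out as 8, 12 and 9 bits: the pre-encoded MB ROM is 50% wider than the conventional one, while the NR4SD ROM costs a single extra bit per coefficient.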
TABLE VII: COMPARISON OF AREA AND DELAY IN DIFFERENT ARCHITECTURES
MULTIPLIERS                 AREA (gate count)    LOGIC DELAY    ROUTE     TOTAL
Conventional MB             960                  17.1           11.977    29.077
Pre_encoded MB              918                  11.906         9.895     22.701
Pre_encoded NR4SD_MINUS     780                  10.56          10.56     20.114
Pre_encoded NR4SD_PLUS      762                  9.856          9.458     19.314

We note that the gains of the multipliers of the pre-encoded NR4SD designs over the multiplier of the conventional MB scheme are much higher. This is mainly due to the less complex PPG circuit of the NR4SD designs compared to that of the conventional MB design, considering that the partial products generation largely contributes to the area complexity and power dissipation of a multiplier. Table V verifies the area gains of the PPG of the NR4SD designs over that of the conventional MB scheme.

TABLE VIII: COMPARISON OF AREA AND DELAY OF CONVENTIONAL AND PROPOSED MAC ARCHITECTURES IN NR4SD MULTIPLIERS
NR4SD MULTIPLIERS        AREA (gate count)    LOGIC DELAY    ROUTE     TOTAL
Conventional MAC unit    1050                 17.162         12.443    29.605
Proposed MAC unit        885                  15.443         11.614    27.368

Table VIII shows the comparison of area (gate count) and delay of the conventional and proposed MAC architectures with NR4SD multipliers. The design has been implemented as an 8×8 multiplier with a 16-bit accumulator. The table shows that the proposed structure decreases the area by reducing the number of gates used. It also has an advantage in timing, i.e., delay.

The inferences from the implementation results are listed below:
1) The ROMs required for the pre-encoded multipliers are larger than the one for the conventional MB scheme, so access time is increased. However, the pre-encoded designs may achieve lower clock periods than the conventional MB one, because the encoding circuits that are included in the critical path are omitted or less complex.
2) In the proposed system, the number of partial products is less than in the MB multiplier, and partial product generation is less complex.
3) The CSA tree has fewer levels and a lower gate count in the proposed system.
4) Extensive experimental analysis verifies the gains of the proposed pre-encoded NR4SD multipliers in terms of area complexity and delay reduction compared to the conventional MB multiplier.
5) The proposed pre-encoded MAC NR4SD multiplier designs are more area and delay efficient compared to the conventional MAC designs.

V. CONCLUSIONS
In this paper, new designs of pre-encoded multipliers are explored by encoding the standard coefficients off-line and storing them in system memory. The paper proposes encoding these coefficients in the Non-Redundant radix-4 Signed-Digit (NR4SD) form. All designs were simulated using ModelSim and 20 different sets of ROM words. For the conventional MB multiplier, the 2's complement inputs were randomly generated, with each bit equally likely to be 0 or 1. Using a high-level programming language, we generated the pre-encoded values of B, which we then stored in the ROMs of the pre-encoded designs. The proposed pre-encoded NR4SD multiplier designs are more area and power efficient compared to the conventional and pre-encoded MB designs. Extensive experimental analysis verifies the gains of the proposed pre-encoded NR4SD multipliers in terms of area complexity and power consumption compared to the conventional MB multiplier. The proposed pre-encoded MAC NR4SD multiplier designs are more area and delay efficient compared to the conventional MAC designs.

About Authors:
JUSTIN K JOY received the Bachelor's degree in ECE from Kerala University, College of Engineering, Trivandrum, and is currently pursuing the Master's degree in VLSI and Embedded Systems from MG University at FISAT, Kerala.

DEEPA N R received the Bachelor's degree in Electronics and Communication Engineering in 2005, and the MTech in VLSI & Embedded Systems from M.G. University, Kerala, India, in 2012. She has been working as Assistant Professor at FISAT, Angamaly, Mookkannoor, Ernakulam, Kerala, India.

NIMMY M PHILIP received the Bachelor's degree in Electronics and Communication Engineering in 1999 from Calicut University, Kerala, India, and the MTech in VLSI & Embedded Systems from M.G. University, Kerala, India, in 2012. She has been working as Assistant Professor at FISAT, Angamaly, Mookkannoor, Ernakulam, Kerala, India.