Design and Characterization of 16 Bit Multiplier Accumulator Based on Radix-2 Modified Booth Algorithm

Vijay Dhar Maurya 1, Imran Ullah Khan 2
1 M.Tech Scholar, 2 Associate Professor (J), Department of ECE, Integral University, Lucknow, U.P., India
1 mauryavijay27@gmail.com, 2 iukhan@iul.ac.in

Abstract: This paper presents the implementation of a multiplier-and-accumulator (MAC) for high-speed arithmetic. Performance is improved by combining multiplication with accumulation and devising a hybrid type of carry save adder (CSA). Since the accumulator, which has the largest delay in the MAC, is merged into the CSA, the overall performance is raised. The proposed CSA tree uses the 1's-complement-based radix-2 modified Booth's algorithm (MBA) and has a modified array for the sign extension in order to increase the bit density of the operands. The CSA propagates the carries to the least significant bits of the partial products and generates the least significant bits in advance to decrease the number of input bits of the final adder. Also, the proposed MAC accumulates the intermediate results in the form of sum and carry bits instead of the output of the final adder, which makes it possible to optimize the pipeline scheme and improve the performance. Based on theoretical and experimental estimation, we analyze results such as the amount of hardware resources, the delay, and the pipelining scheme.

Keywords: Booth Multiplier, Carry Save Adder (CSA) Tree, Computer Arithmetic, Digital Signal Processing (DSP), Multiplier-and-Accumulator (MAC).

I. INTRODUCTION

With the recent rapid advances in multimedia and communication systems, real-time signal processing tasks such as audio signal processing, video/image processing, and large-capacity data processing are increasingly in demand. The multiplier and the multiplier-and-accumulator (MAC) [2] are essential elements of digital signal processing operations such as filtering, convolution, and inner products.
Most digital signal processing methods use transforms such as the discrete cosine transform (DCT) [3] or the discrete wavelet transform (DWT) [4]. Because these are basically accomplished by repetitive application of multiplication and addition, the speed of the multiplication and addition arithmetic determines the execution speed and performance of the entire calculation. Because the multiplier has the longest delay among the basic operational blocks in a digital system, the critical path is, in general, determined by the multiplier. For high-speed multiplication, the modified radix-4 Booth's algorithm (MBA) [5] is commonly used. However, this cannot completely solve the problem because of the long critical path for multiplication [6], [7].

In general, a multiplier uses Booth's algorithm [8] and an array of full adders (FAs), or a Wallace tree [9] instead of the array of FAs. That is, the multiplier mainly consists of three parts: a Booth encoder, a tree to compress the partial products such as a Wallace tree, and a final adder [10], [11]. Because the Wallace tree adds the partial products from the encoder as much in parallel as possible, its operation time is proportional to $O(\log_2 N)$, where $N$ is the number of inputs. It uses the fact that counting the number of 1's among the inputs reduces the number of outputs to about $\log_2 N$. In a real implementation, many (3:2) or (7:3) counters are used to reduce the number of outputs in each pipeline step.

The most effective way to increase the speed of a multiplier is to reduce the number of partial products, because multiplication proceeds as a series of additions of the partial products. To reduce the number of calculation steps for the partial products, the MBA algorithm has mostly been applied, while the Wallace tree has taken the role of speeding up the addition of the partial products. To increase the speed of the MBA algorithm, many parallel multiplication architectures have been researched [11], [12].
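The logarithmic depth of a counter tree can be illustrated with a minimal behavioral sketch. It assumes non-negative Python integers as operands and uses only (3:2) counters; the function names `csa_3to2` and `wallace_reduce` are hypothetical and do not come from the paper, and the grouping order is one simple choice among several.

```python
def csa_3to2(a: int, b: int, c: int) -> tuple[int, int]:
    """Bitwise (3:2) counter: three operand words in, sum and carry words out."""
    s = a ^ b ^ c                                # per-bit sum without carry propagation
    carry = ((a & b) | (a & c) | (b & c)) << 1   # per-bit majority, weighted by 2
    return s, carry


def wallace_reduce(operands: list[int]) -> tuple[int, int, int]:
    """Reduce many operands to two words; returns (sum, carry, levels used)."""
    ops = list(operands)
    levels = 0
    while len(ops) > 2:
        nxt = []
        # Compress every complete group of three operands into two.
        for i in range(0, len(ops) - 2, 3):
            nxt.extend(csa_3to2(ops[i], ops[i + 1], ops[i + 2]))
        # Operands that did not fill a group pass through to the next level.
        nxt.extend(ops[len(ops) - len(ops) % 3:])
        ops = nxt
        levels += 1
    while len(ops) < 2:
        ops.append(0)
    return ops[0], ops[1], levels


if __name__ == "__main__":
    partial_products = [3, 5, 7, 11, 13, 17, 19, 23]
    s, c, levels = wallace_reduce(partial_products)
    # Only this last step needs a carry-propagating adder.
    print(s + c, levels)
```

The number of levels grows roughly with the logarithm of the operand count, and a single carry-propagating addition is needed only at the end.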
In this paper, a new architecture for a high-speed MAC is proposed. In this MAC, the computations of multiplication and accumulation are combined, and a hybrid-type CSA structure is proposed to reduce the critical path and improve the output rate. It uses the MBA algorithm based on the 1's complement number system. A modified array structure for the sign bits is used to increase the density of the operands. A carry lookahead adder (CLA) is inserted in the CSA tree to reduce the number of bits in the final adder. In addition, to increase the output rate by optimizing the pipeline efficiency, intermediate calculation results are accumulated in the form of sum and carry instead of the final adder outputs.

Fig. 1. Basic Arithmetic Steps of Multiplication and Accumulation
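The idea of accumulating in sum-and-carry form can be sketched behaviorally as follows. This is a simplified model under stated assumptions, not the proposed hardware: the 40-bit accumulator width is an arbitrary choice, the product itself is computed with Python's `*` rather than by a partial-product tree, and the names `csa`, `to_signed`, and `mac_carry_save` are hypothetical.

```python
WIDTH = 40                      # assumed accumulator width (illustrative only)
MASK = (1 << WIDTH) - 1


def csa(a: int, b: int, c: int) -> tuple[int, int]:
    """One carry-save step on WIDTH-bit words: no carry ripples between bits."""
    s = a ^ b ^ c
    carry = (((a & b) | (a & c) | (b & c)) << 1) & MASK
    return s, carry


def to_signed(v: int) -> int:
    """Interpret a WIDTH-bit word as a two's-complement value."""
    return v - (1 << WIDTH) if v >> (WIDTH - 1) else v


def mac_carry_save(pairs) -> int:
    """Accumulate x*y products while keeping the running total as (sum, carry)."""
    s, c = 0, 0
    for x, y in pairs:
        product = (x * y) & MASK        # two's-complement bit pattern of the product
        s, c = csa(s, c, product)       # fold the product in without a full addition
    # The single carry-propagating addition happens once, after the loop.
    return to_signed((s + c) & MASK)


if __name__ == "__main__":
    data = [(3, -5), (-7, -2), (1200, 31), (-32768, 32767)]
    print(mac_carry_save(data))
```

Because the slow carry-propagating addition is deferred until the final result is needed, each accumulation step costs only one carry-save stage, which is the property that makes the pipeline easier to balance.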

This paper is organized as follows. Section II reviews the present schemes, including the general MAC. Section III describes Booth's recoding algorithm, and Section IV presents the proposed MAC. In Section V, the implementation and results are analyzed and the characteristics of the proposed MAC are shown. Finally, the conclusion is given in Section VI.

II. PRESENT SCHEMES USED

Different methods are present in this domain, such as:
A. Binary Multiplication
B. Array Multiplier
C. Multiplier and Accumulator Unit

These methods have several disadvantages, such as low performance as the number of bits increases and the chance of connection mismatches when performing different multiplications and additions with carry.

A. Binary Multiplication: In the binary number system the digits, called bits, are limited to the set {0, 1}. The result of multiplying any binary number by a single binary bit is either 0 or the original number. This makes forming the intermediate partial products simple and efficient. Summing these partial products is the time-consuming task for binary multipliers. One logical approach is to form the partial products one at a time and sum them as they are generated. Often implemented in software on processors that do not have a hardware multiplier, this technique works but is slow, because at least one machine cycle is required to sum each additional partial product. For applications where this approach does not provide enough performance, multipliers can be implemented directly in hardware. The two main categories of binary multiplication are signed and unsigned. Digit multiplication is a series of bit shifts and bit additions in which the two numbers, the multiplicand and the multiplier, are combined into the result.

Fig. 2. Multiplication Process

Considering the bit representation of the multiplicand x = x(n-1) ... x1 x0 and the multiplier y = y(n-1) ... y1 y0, up to n shifted copies of the multiplicand are added to form the product for unsigned multiplication. The entire process consists of three steps: partial product generation, partial product reduction, and final addition.

B. Array Multiplier: A 4 x 4 array multiplier and the functions of M0, M1, M2, and M4 (the M's are either half adders or full adders) are shown in Fig. 3. X3X2X1X0 is the 4-bit multiplicand and Y3Y2Y1Y0 is the 4-bit multiplier. The full adder is the important component in each cell. Each cell contains an AND gate, which determines whether a multiplicand bit Xj is added to the incoming partial product bit, based on the value of the multiplier bit Yi. If Yi = 0, PPi is unchanged and passed vertically downward; otherwise the row adds the multiplicand (appropriately shifted) to the incoming partial product PPi to generate the outgoing partial product PP(i+1). The worst-case signal propagation delay is the path from the upper right corner of the array to the high-order product bit output at the bottom left corner of the array.

Fig. 3. Array Multiplier

C. Multiplier and Accumulator Unit: The inputs for the MAC are fetched from a memory location and fed to the multiplier block of the MAC, which performs the multiplication and passes the result to the adder, which accumulates the result and then stores it into a memory location. This entire process is to be achieved in a single clock cycle (Weste & Harris, 3rd Ed.). The architecture of the MAC unit designed in this work consists of one 16-bit register, one 16-bit Modified Booth multiplier, and a 32-bit accumulator. To multiply the values of A and B, a Modified Booth multiplier is used instead of a conventional multiplier because it can increase the speed of the MAC unit and reduce the multiplication complexity.
The product Ai x Bi is always fed back into the 32-bit accumulator and then added to the next product Ai x Bi. This MAC unit can therefore multiply and add to the previous result consecutively, over many successive operations.
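The shift-and-add process of Section II-A and the accumulate loop of Section II-C can be sketched together. This is a minimal unsigned model, assuming 16-bit operands and a 32-bit accumulator that wraps on overflow; the function names `shift_add_multiply` and `mac` are hypothetical, and overflow handling in the actual design is not specified by the text.

```python
def shift_add_multiply(x: int, y: int, n: int = 16) -> int:
    """Unsigned n-bit multiply: add one shifted copy of x for every 1 bit of y."""
    product = 0
    for i in range(n):
        if (y >> i) & 1:            # multiplier bit i selects the partial product
            product += x << i       # partial product = multiplicand shifted by i
    return product


def mac(pairs, acc_bits: int = 32) -> int:
    """Multiply each (a, b) pair and accumulate into an acc_bits-wide register."""
    mask = (1 << acc_bits) - 1
    acc = 0
    for a, b in pairs:
        acc = (acc + shift_add_multiply(a, b)) & mask   # wrap like a fixed-width register
    return acc


if __name__ == "__main__":
    print(shift_add_multiply(13, 11))
    print(mac([(2, 3), (4, 5), (100, 200)]))
```

Each loop iteration of `shift_add_multiply` corresponds to one row of the array multiplier, which is why the delay of the simple schemes grows with the operand width.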

Fig. 4. Simple MAC Architecture

III. BOOTH'S RECODING ALGORITHM

Parallel multiplication using the basic Booth's recoding technique is based on the fact that a partial product can be generated for a group of consecutive 0's and 1's; this is called Booth's recoding. Booth's recoding algorithm is used to generate efficient partial products. These partial products always have more bits than the input operands, and their width usually depends on the radix scheme used for recoding. The generated partial products are added by compressors, as explained in Section 3.2. This scheme therefore uses fewer partial products, which lowers power and area. There are two types of algorithm, Radix-2 and Radix-4, for generating efficient partial products for multiplication. We first explain the basic technique of Booth's recoding algorithm and then the Modified Booth's recoding technique for the Radix-2 algorithm.

A. Basic Technique of Booth's Recoding Algorithm for Radix-2: Booth proposed the Radix-2 algorithm for high-speed multiplication, which reduces the partial products for multiplication. To perform a multiplication A*B, where A = an, an-1, ..., a0 is the multiplier and B = bn, bn-1, ..., b0 is the multiplicand, we check every two consecutive bits in A at a time. Suppose A is the multiplier with value -5 and B is the multiplicand with value +2:

B => 0010 (+2)
A => 1011 (-5)

Starting from the LSB of A (paired with an implied 0 to its right) and then moving through the adjacent bit pairs of A, we get the partial products as follows:
For 10 we perform -1.B, i.e., the 2's complement of B, 1110.
For 11 we put all 0's, i.e., 0000.
For 01 we perform 1.B, i.e., the value of B, 0010.
For 10 we again perform -1.B, i.e., 1110.
Here, some bits, called correction bits, are appended to match the width of the partial products.

B. Basic Technique of Modified Booth's Recoding Algorithm Radix-2: The Modified Booth algorithm has been proposed for high-speed multiplication. This type of multiplier operates much faster than an array multiplier for longer operands because its computation time is proportional to the logarithm of the word length of the operands. Booth multiplication is a technique that allows faster multiplication by grouping the multiplier bits. The grouping of multiplier bits and Radix-2 Booth encoding reduce the number of partial products to half. So we take every second column and multiply by ±1, ±2, or 0, instead of shifting and adding for every column of the multiplier term and multiplying by 1 or 0. The advantage of this method is the halving of the number of partial products. For Booth encoding, the multiplier bits are formed into blocks of three, such that each block overlaps the previous block by one bit. Grouping starts from the LSB, and the first block uses only two bits of the multiplier. Fig. 5 shows the grouping of bits from the multiplier term.

Fig. 5. Grouping of Bits from the Multiplier Term

To obtain the correct partial product, each block is decoded. Table 1 shows the encoding of the multiplier Y using the modified Booth algorithm, which generates the following five signed digits: -2, -1, 0, +1, +2. Each encoded digit in the multiplier performs a certain operation on the multiplicand X.

Table 1. Operations on the Multiplicand

IV. PROPOSED MAC

If an operation to multiply two N-bit numbers and accumulate into a 2N-bit number is considered, the critical path is determined by the 2N-bit accumulation operation. If a pipeline scheme is applied for each step in the standard design, the delay of the last accumulator must be reduced in order to improve the performance of the MAC.

Fig. 6. Internal Block Diagram of 16*16 Basic Multiplier

A. Booth Encoding: The first step is radix-2 Booth encoding, in which a row of partial products is generated from the multiplicand (X) and the multiplier (Y). The partial products can be obtained using various techniques such as the Booth algorithm or the modified Booth algorithm. The multiplication result is added to the preceding multiplication result (Z). The MAC process can be written as

$$P = X \times Y + Z \qquad (2.1)$$

where the multiplicand X and the multiplier Y have n bits each and the result P has 2n bits.

Fig. 7. Booth Encoder

B. Partial Product Summation: The second step is the partial product summation, which adds all the partial products and converts them into the form of sum (S) and carry (C). This is done using a carry save adder and a carry lookahead adder (CLA) for serial-parallel multipliers. For a parallel multiplier, the addition is done using carry-save techniques or summand skip.

C. Final Addition: The last step is the final addition, in which the multiplication result is produced by summing the sum (S) and the carries (C). A final adder is required to generate the multiplication result. Fig. 5 shows the basic hardware architecture of the MAC. It produces the final result by multiplying the multiplicand (X) and the multiplier (Y).

V. IMPLEMENTATION AND RESULT

The radix-2 modified Booth MAC performs both multiplication and accumulation. The multiplication result is obtained by multiplying the multiplicand and the multiplier, and this result is accumulated with the previous result. The black box view of the radix-2 modified Booth MAC module is shown in Fig. 8.

Fig. 8. Black Box View of Radix-2 Modified Booth MAC

Table 2. Project Summary

Table 3. Device Utilization Summary
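The recoding of Section III and the MAC relation $P = X \times Y + Z$ of Section IV can be checked with a short behavioral sketch. It assumes the standard textbook digit mappings (pairwise digits y[i-1] - y[i] for basic Booth, and -2*y[2i+1] + y[2i] + y[2i-1] for the three-bit overlapping groups), which may differ in presentation from the paper's lost Table 1. It works on ordinary signed integers rather than the 1's-complement scheme of the proposed design, and the function names are hypothetical.

```python
def booth_radix2_digits(y_bits: int, n: int) -> list[int]:
    """Basic Booth recoding: digit i = y[i-1] - y[i], with an implied y[-1] = 0."""
    digits, prev = [], 0
    for i in range(n):
        bit = (y_bits >> i) & 1
        digits.append(prev - bit)       # pair 10 -> -1, 01 -> +1, 00 and 11 -> 0
        prev = bit
    return digits


def modified_booth_digits(y_bits: int, n: int) -> list[int]:
    """Overlapping 3-bit groups give digits in {-2, -1, 0, +1, +2}; n must be even."""
    digits, prev = [], 0
    for i in range(0, n, 2):
        b0 = (y_bits >> i) & 1
        b1 = (y_bits >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1                       # top bit of this group overlaps the next group
    return digits


def booth_mac(x: int, y: int, z: int, n: int = 16) -> int:
    """P = X * Y + Z, with Y an n-bit two's-complement multiplier."""
    y_bits = y & ((1 << n) - 1)
    # Each digit selects 0, +/-X or +/-2X, weighted by 4**i (half as many rows as bits).
    partial_products = [(d * x) * (1 << (2 * i))
                        for i, d in enumerate(modified_booth_digits(y_bits, n))]
    return sum(partial_products) + z


if __name__ == "__main__":
    # Worked example from Section III-A: A = 1011 (-5), B = 0010 (+2).
    print(booth_radix2_digits(0b1011, 4))
    print(modified_booth_digits(0b1011, 4))
    print(booth_mac(2, -5, 0, n=4))
```

For the multiplier 1011, the basic recoding yields four digits matching the four partial products listed in Section III-A, while the modified recoding yields only two digits, which illustrates the halving of the partial product count.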

Fig. 10. Simulation Result for 16-Bit Multiplier Accumulator Based on Radix-2 Modified Booth Algorithm

VI. CONCLUSION

A 16x16 multiplier-accumulator (MAC) is presented in this work. A Radix-2 Modified Booth multiplier circuit is used for the MAC architecture. Compared to other circuits, the Booth multiplier has the highest operational speed and a lower hardware count. The basic building blocks of the MAC unit are identified, and each block is analyzed for its performance. Power and delay are calculated for the blocks. A 1-bit MAC unit is designed with an enable signal to reduce the total power consumption, based on the block-enable technique. Using this block, the N-bit MAC unit is constructed and its total power consumption is calculated. Power reduction techniques are adopted in this work. The MAC unit designed in this work can be used in filter realizations for high-speed DSP applications.

VII. REFERENCES

[1] Y.-H. Seo and D.-W. Kim, A New VLSI Architecture of Parallel Multiplier Accumulator Based on Radix-2 Modified Booth Algorithm, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 18, no. 2, Feb. 2010.
[2] J. J. F. Cavanagh, Digital Computer Arithmetic. New York: McGraw-Hill, 1984.
[3] Information Technology-Coding of Moving Picture and Associated Audio, MPEG-2 Draft International Standard, ISO/IEC 13818-1, 2, 3, 1994.
[4] JPEG 2000 Part I Final Draft, ISO/IEC JTC1/SC29 WG1.
[5] O. L. MacSorley, High speed arithmetic in binary computers, Proc. IRE, vol. 49, pp. 67-91, Jan. 1961.
[6] S. Waser and M. J. Flynn, Introduction to Arithmetic for Digital Systems Designers. New York: Holt, Rinehart and Winston, 1982.
[7] A. R. Omondi, Computer Arithmetic Systems. Englewood Cliffs, NJ: Prentice-Hall, 1994.
[8] A. D. Booth, A signed binary multiplication technique, Quart. J. Math., vol. IV, pp. 236-240, 1952.
[9] C. S. Wallace, A suggestion for a fast multiplier, IEEE Trans. Electron. Comput., vol. EC-13, no. 1, pp. 14-17, Feb. 1964.
[10] A. R. Cooper, Parallel architecture modified Booth multiplier, Proc. Inst. Electr. Eng. G, vol. 135, pp. 125-128, 1988.
[11] N. R. Shanbag and P. Juneja, Parallel implementation of a 4 x 4-bit multiplier using modified Booth's algorithm, IEEE J. Solid-State Circuits, vol. 23, no. 4, pp. 1010-1013, Aug. 1988.
[12] G. Goto, T. Sato, M. Nakajima, and T. Sukemura, A 54 x 54 regular structured tree multiplier, IEEE J. Solid-State Circuits, vol. 27, no. 9, pp. 1229-1236, Sep. 1992.