International Journal of Emerging Technology and Advanced Engineering Website: (ISSN , Volume 2, Issue 7, July 2012)

Similar documents
Implementation of Parallel Multiplier-Accumulator using Radix- 2 Modified Booth Algorithm and SPST

Mahendra Engineering College, Namakkal, Tamilnadu, India.

A Fixed-Width Modified Baugh-Wooley Multiplier Using Verilog

A New High Speed Low Power Performance of 8- Bit Parallel Multiplier-Accumulator Using Modified Radix-2 Booth Encoded Algorithm

SIGNED PIPELINED MULTIPLIER USING HIGH SPEED COMPRESSORS

S.Nagaraj 1, R.Mallikarjuna Reddy 2

JDT LOW POWER FIR FILTER ARCHITECTURE USING ACCUMULATOR BASED RADIX-2 MULTIPLIER

PERFORMANCE COMPARISON OF HIGHER RADIX BOOTH MULTIPLIER USING 45nm TECHNOLOGY

Modified Booth Multiplier Based Low-Cost FIR Filter Design Shelja Jose, Shereena Mytheen

CHAPTER 4 ANALYSIS OF LOW POWER, AREA EFFICIENT AND HIGH SPEED MULTIPLIER TOPOLOGIES

An Optimized Design for Parallel MAC based on Radix-4 MBA

An Optimized Wallace Tree Multiplier using Parallel Prefix Han-Carlson Adder for DSP Processors

High Speed Vedic Multiplier Designs Using Novel Carry Select Adder

Performance Analysis of Multipliers in VLSI Design

Design and Implementation of Truncated Multipliers for Precision Improvement and Its Application to a Filter Structure

Design and Characterization of 16 Bit Multiplier Accumulator Based on Radix-2 Modified Booth Algorithm

FPGA Implementation of Wallace Tree Multiplier using CSLA / CLA

DESIGN OF AREA EFFICIENT TRUNCATED MULTIPLIER FOR DIGITAL SIGNAL PROCESSING APPLICATIONS

International Journal of Scientific & Engineering Research Volume 3, Issue 12, December ISSN

A Compact Design of 8X8 Bit Vedic Multiplier Using Reversible Logic Based Compressor

High performance Radix-16 Booth Partial Product Generator for 64-bit Binary Multipliers

IJCSIET--International Journal of Computer Science information and Engg., Technologies ISSN

Design of an optimized multiplier based on approximation logic

Multiplier Design and Performance Estimation with Distributed Arithmetic Algorithm

Low-Power Multipliers with Data Wordlength Reduction

ENHANCING SPEED AND REDUCING POWER OF SHIFT AND ADD MULTIPLIER

An Optimized Implementation of CSLA and CLLA for 32-bit Unsigned Multiplier Using Verilog

A New network multiplier using modified high order encoder and optimized hybrid adder in CMOS technology

Low Power Approach for Fir Filter Using Modified Booth Multiprecision Multiplier

ISSN Vol.03,Issue.02, February-2014, Pages:

CHAPTER 1 INTRODUCTION

Modified Booth Encoding Multiplier for both Signed and Unsigned Radix Based Multi-Modulus Multiplier

Modified Partial Product Generator for Redundant Binary Multiplier with High Modularity and Carry-Free Addition

Design and Simulation of Convolution Using Booth Encoded Wallace Tree Multiplier

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Design and Simulation of 16x16 Hybrid Multiplier based on Modified Booth algorithm and Wallace tree Structure

AREA EFFICIENT DISTRIBUTED ARITHMETIC DISCRETE COSINE TRANSFORM USING MODIFIED WALLACE TREE MULTIPLIER

DESIGN OF HIGH PERFORMANCE MODIFIED RADIX8 BOOTH MULTIPLIER

Design A Redundant Binary Multiplier Using Dual Logic Level Technique

INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET) STUDY ON COMPARISON OF VARIOUS MULTIPLIERS

Digital Integrated CircuitDesign

Design of Area and Power Efficient FIR Filter Using Truncated Multiplier Technique

HIGH SPEED FIXED-WIDTH MODIFIED BOOTH MULTIPLIERS

DESIGN OF A HIGH SPEED MULTIPLIER BY USING ANCIENT VEDIC MATHEMATICS APPROACH FOR DIGITAL ARITHMETIC

A High Speed Wallace Tree Multiplier Using Modified Booth Algorithm for Fast Arithmetic Circuits

Review of Booth Algorithm for Design of Multiplier

International Journal of Advanced Research in Computer Science and Software Engineering

ADVANCES in NATURAL and APPLIED SCIENCES

A Survey on A High Performance Approximate Adder And Two High Performance Approximate Multipliers

HIGH PERFORMANCE BAUGH WOOLEY MULTIPLIER USING CARRY SKIP ADDER STRUCTURE

A Survey on Power Reduction Techniques in FIR Filter

Performance Analysis of a 64-bit signed Multiplier with a Carry Select Adder Using VHDL

Design and Implementation of High Radix Booth Multiplier using Koggestone Adder and Carry Select Adder

Implementation of 32-Bit Unsigned Multiplier Using CLAA and CSLA

Design and Analysis of RNS Based FIR Filter Using Verilog Language

A MODIFIED ARCHITECTURE OF MULTIPLIER AND ACCUMULATOR USING SPURIOUS POWER SUPPRESSION TECHNIQUE

A New Architecture for Signed Radix-2 m Pure Array Multipliers

Reconfigurable High Performance Baugh-Wooley Multiplier for DSP Applications

Study on Digital Multiplier Architecture Using Square Law and Divide-Conquer Method

Tirupur, Tamilnadu, India 1 2

Implementation of Parallel MAC Unit in 8*8 Pre- Encoded NR4SD Multipliers

A Parallel Multiplier - Accumulator Based On Radix 4 Modified Booth Algorithms by Using Spurious Power Suppression Technique

Design of Baugh Wooley Multiplier with Adaptive Hold Logic. M.Kavia, V.Meenakshi

Design and Implementation of 64-bit MAC Unit for DSP Applications using verilog HDL

Design of Roba Mutiplier Using Booth Signed Multiplier and Brent Kung Adder

Design of Efficient 64 Bit Mac Unit Using Vedic Multiplier

FOR HIGH SPEED LOW POWER APPLICATIONS USING RADIX-4 MODIFIED BOOTH ENCODER

Low Power R4SDC Pipelined FFT Processor Architecture

A NOVEL IMPLEMENTATION OF HIGH SPEED MULTIPLIER USING BRENT KUNG CARRY SELECT ADDER K. Golda Hepzibha 1 and Subha 2

DESIGNING OF MODIFIED BOOTH ENCODER WITH POWER SUPPRESSION TECHNIQUE

International Journal Of Scientific Research And Education Volume 3 Issue 6 Pages June-2015 ISSN (e): Website:

Comparative Analysis of 16 X 16 Bit Vedic and Booth Multipliers

[Krishna, 2(9): September, 2013] ISSN: Impact Factor: INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY

An Design of Radix-4 Modified Booth Encoded Multiplier and Optimised Carry Select Adder Design for Efficient Area and Delay

Comparative Analysis of Multiplier in Quaternary logic

IMPLEMENTATION OF AREA EFFICIENT MULTIPLIER AND ADDER ARCHITECTURE IN DIGITAL FIR FILTER

Design and Implementation of Complex Multiplier Using Compressors

Wave Pipelined Circuit with Self Tuning for Clock Skew and Clock Period Using BIST Approach

IMPLEMENTATION OF UNSIGNED MULTIPLIER USING MODIFIED CSLA

High Speed Speculative Multiplier Using 3 Step Speculative Carry Save Reduction Tree

Abstract. 1. Introduction. Department of Electronics and Communication Engineering Coimbatore Institute of Engineering and Technology

DESIGN & IMPLEMENTATION OF FIXED WIDTH MODIFIED BOOTH MULTIPLIER

Techniques for Implementing Multipliers in Stratix, Stratix GX & Cyclone Devices

High Speed, Low power and Area Efficient Processor Design Using Square Root Carry Select Adder

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Digital Computer Arithmetic ECE 666

A Review on Different Multiplier Techniques

REALIZATION OF FPGA BASED Q-FORMAT ARITHMETIC LOGIC UNIT FOR POWER ELECTRONIC CONVERTER APPLICATIONS

A Faster Carry save Adder in Radix-8 Booth Encoded Multiplier

Verilog Implementation of 64-bit Redundant Binary Product generator using MBE

PUBLICATIONS OF PROBLEMS & APPLICATION IN ENGINEERING RESEARCH - PAPER CSEA2012 ISSN: ; e-issn:

Design of Digital FIR Filter using Modified MAC Unit

FPGA Implementation of Area-Delay and Power Efficient Carry Select Adder

Sno Projects List IEEE. High - Throughput Finite Field Multipliers Using Redundant Basis For FPGA And ASIC Implementations

International Journal of Modern Trends in Engineering and Research

National Conference on Emerging Trends in Information, Digital & Embedded Systems(NC e-tides-2016)

AN EFFICIENT MAC DESIGN IN DIGITAL FILTERS

High Speed Non Linear Carry Select Adder Used In Wallace Tree Multiplier and In Radix-4 Booth Recorded Multiplier

By Dayadi Lakshmaiah, Dr. M. V. Subramanyam & Dr. K. Satya Prasad Jawaharlal Nehru Technological University, India

VHDL Implementation of Advanced Booth Dadda Multiplier

Comparison of Conventional Multiplier with Bypass Zero Multiplier

Transcription:

Parallel Squarer Design Using Pre-Calculated Sum of Partial Products Manasa S.N 1, S.L.Pinjare 2, Chandra Mohan Umapthy 3 1 Manasa S.N, Student of Dept of E&C &NMIT College 2 S.L Pinjare,HOD of E&C &NMIT College 3 Chandra Mohan Umapthy,Assistant Professor &NMIT College Abstract Power is becoming a precious resource in modern VLSI design, even more than area. With large number of Applications requiring support of functional units like squares, cubes and other higher order units, it becomes imperative that such functions be implemented in hardware. This paper proposes a novel architecture for modular, scalable &reusable hybrid squaring circuit. Comparison is made between different implementationof squaring circuit. The implementation results show a significant improvement in performance in terms of area, power & timing Keywords Squarer,SquaringCircuit,,Low Power etc. I. INTRODUCTION The advances in VLSI technology, more and more functionality complexity has been integrated into digital designs to better support target applications. With many applications requiring support for floating point arithmetic, complex arithmetic modules like multipliers and powering units are now being extensively used in design. With technology scaling, the goal has been to operate designs at the fastest possible frequency to achieve high performance. The problem with these complex arithmetic blocks like multipliers and squaring units is that they require longer cycle times for computation. In order to achieve the frequency requirements, these designs invariably end up being pipelined, which results in increases in area and thus incurs a power penalty for operating at higher clock speeds. In many applications a higher power penalty cannot be tolerated and designers have to budget the power associated with individual resources. designs require large area and consume a considerable amount of power per computation. For powering operations where a general-purpose multiplier is not necessary, this results in power being wasted. We propose to use dedicated powering units which perform a specific function in place of a multiplier which has been designed for general-purpose computation. The advantage with using dedicated Squaring units is that they consume less power compared to general-purpose multipliers. Squaring is a special case of multiplication. By using dedicated resources one can save a considerable amount of power which allows designers to remain inside their power budgets. Recently, lot of research has been conducted in order to develop different methodologies to implement squarer s, giving more importance to improve delay & reducing area constraints. Due to which a new scheme was developed to compromise the above-mentioned trade-offs, which is called Hybrid Squarer s. Greater emphasis is given on Hybrid Squarer s, which Comprises of Memory Elements & Computing Logic. The remainder of this paper is organized as follows. Section II presents a brief description of existing algorithms used in the designs of squaring units for unsigned/signed followed by the designs multiplication of two binary numbers for unsigned/signed. We present a way to use Quarter squaring units to perform multiplication of two binary numbers in section III. Section IV details the implementation and experimental results followed by a conclusion in section V. II. BINARY MULTIPLICATION AND SQUARING A. Binary Squarer s Squares are a special case of multiplication where both inputs are identical. Since the two inputs are identical, many optimizations can be made in the implementation of a dedicated squaring unit[3]. Such a squaring unit requires less area compared to multipliers as nearly half of the partial products can be combined using the equivalence A i A j + A j A i = 2 A i A j which can be represented by adding A i A j to the next column to the left. This reduces the depth, which can be defined as the number of partial products to be added together in a column. With a reduction in depth, the design can operate faster as the number of terms on the critical path reduces. Fig.1 shows a 4-bit unsigned squaring unit. We can observe from Fig.1 that two A 1 A 0 terms in column 2 are reduced to having only one A 1 A 0 term in column 3. Similarly other partial products can be reduced. Also the property that A 0 A 0 = A 0 allows reducing terms in the final partial products. 35

The square of a 4-bit number can be computed by adding the rows at the bottom part of Fig. 1. From Fig. 1 we observe that the depth has also reduced; an initial depth of four for a multiplier configuration was reduced to three for squaring[4]. Fig 3 shows the proposed architecture[8]. Fig1: Partial Product Reduction in Unsigned Squaring Operation Fig3: Block diagram of the Proposed Architecture Proposed Algorithm: The algorithm consists of following steps: The given input is partitioned into two parts, each part is treated as a separate unit processed individually by further units. Find the square of each part. Find twice the product of individual part. Add the above results suitably to get the final result. If X is a five-digit number, who s square has to be computed. X = abcde. Find square of abc = (abc) 2. Find square of de = (de) 2. Find twice the product of abc & de = 2(abc)(de) Find the sum of the above results to get the square of X. Ex: 1. Let X = 12345. a = 123, b = 45. Find square of abc = (123) 2 = 15129. Find square of de = (45) 2 = 2025. Find twice the product of abc & de = 2 * (123) * (45) = 11070. 36 Find the sum of the above results to get the square of X = 152399025. The above theory can be extended for any given number X. Hence, by mathematical inspection; the proposed algorithm is proven for any arbitrary number X B. Binary Multiplication Using Mux The techniques for performing binary multiplication involve three basic steps: namely, Generation of Partial Products, Reduction of Partial Products and Addition of the final two rows of partial products. An M N bit multiplication can be viewed as forming N partial product arrays, each of M bits and adding them together according to their weights. Multiplication is performed either by using a Shift Add algorithm or by using Parallel multiplication techniques. The Shift Add method requires M-cycles to perform M N-bit Multiplication In this method we are using 2-Mux to generate partial product, the select line of Mux are controlled by counter. The output of Mux is given to a, the result of is stored in Register & controlled By clock.when clock Enables the Register we perform the Shift-Add method requires M-cycles to perform M N-bit Multiplication All the recoding bit arrays are then added together according to their weights to obtain the final product. The architecture of designed using Mux shown below: C. : Fig4: designed using mux. A basic LUT-based multiplier is simply a lookup table with the addresses arranged so that part of the address is the multiplicand and the other part is the multiplier. The data width should be set to the sum of the address width to accommodate the product. Implementing a Basic/ : In the case where a four-bit value is multiplied by a fourbit value, you will need a memory block that is eight bits wide and 256 words deep. The first four bits of the address can be configured as the multiplicand and the second four bits can be configured as the multiplier.

The memory will store the appropriate product values. To multiply the upper four bits by the lower four bits, feed both values into the address and clock the memory. The appropriate product value will appear on the RAM output. A diagram of this LUT-based multiplier implementation is shown in Figure 1 on page 2. Since the memory block is synchronous, this configuration will result in a synchronous multiplier, whose clock frequency is only limited by the data access time of the memory. While this approach is more efficient than implementing multipliers in gates, it can consume a large amount of memory. The amount of memory required increases with the square of the bit width. Theexample above demonstrates a 4 x 4 bit multiplier with 256 eight-bit words of storage required. For an 8 x 8 bit multiplier, 65,536 16-bit words must be stored using this technique. Characteristics of Basic : A. Iterative shift add routine B. N clock cycles to complete C. Very compact design D. Serial input can be MSB or LSB first depending on direction of shift in accumulator E. Parallel output Partial Product s One way to mitigate the amount of memory required is to use partial product multiplication. This technique combines the lookup table approach with elements of longhand multiplication. For example, to multiply 24 x 43 = 1,032 using longhand, simplify the problem into the sum of four multiply functions and three add functions (Figure 2). (4 3 + ((2 3) 10)) + (((4 4) + ((2 4) 10)) 10) = 1,032 Using a basic lookup table technique, an eight-bit by eight-bit multiply would require 128 kb of storage. As shown in Figure 3 on page 3, using partial product multipliers, the same procedure can be accomplished using 1 kb of storage. In order to accomplish this in logic, using A as the multiplicand and B as the multiplier, take the lower four bits of A and multiply it by the lower four bits of B using the lookup table technique. Then take the upper four bits of A and multiply it by the lower four bits of B and shift the partial product result to the left by four. Then add the two results together for the first part of the product. For the second part of the product, multiply the lower four bits of A by the upper four bits of B. Then do the same with the upper four bits of both A and B and shift this partial product value to the left by four. Add the two values of the previous calculation and shift the whole result to the left by four. Then add the first part of the product to the second part of the product for the final result. While this technique is not as fast as implementing the entire multiply as a single memory element, it does greatly reduce the amount of memory required at the expense of using more core tiles. III. QUARTER SQUARE TECHNIQUE The squaring units requiring less area and power as compared to multipliers, it is interesting to assess the use of squaring units to perform multiplication. There are various Methods to obtain a multiplication of two numbers using squares instead of using multipliers. One of the most widely used methods in algebra is the quarter square method [5]. In mathematical terms, the quarter square algorithm can be expressed as. A x B=1 4{(A+B) 2 - (A-B) 2 } In this method, to obtain the product of two numbers, we obtain their sum and difference. The obtained sum and difference are squared, and the difference of these two squares when divided by 4 provides the result. As in binary arithmetic, divide by 4 operation can be easily accomplished by shifting right two digits. The quarter square technique is illustrated in Fig.5. 4 Fig 5.Partial Product technique Implementing a Partial Product In logic this same technique can be used to reduce the amount of memory required to perform a multiply function. 37 Fig 6. Quarter Square Technique

From Fig.6 we observe that if we have two 8-bit unsigned numbers, the sum can result in a carry, similarly with two 8-bit signed numbers, the difference can generate an overflow. In order to produce a correct result we need a (8+1) bit adder for computation of sum and difference, and hence one would need at least (n+1) bit squaring units to correctly perform an n-bit squaring operation. IV. EXPERIMENTS AND RESULTS An 8/16/32/64-bit multiplier performing signed/ unsigned operations based on multiplier using Mux Algorithm has been described in Verilog. As multipliers support signed operations, we use squaring units designed for signed operations for all the results and comparisons. We implemented the Quarter square algorithm using the Squaring unit designs to perform signed/unsigned multiplication. We designed 8/16/32/64/bit squaring units to support 8-, 16-, 32-,64 bit multiplication, respectively. The performance of all squaring circuits are evaluated on the same device Spartan xc3s400 & Vertex XC2vp30 with a speed grade of 4 & 7.The results suggest that the proposed architecture is faster than. TABLE I Area Requirement Using Spartan3 Bits Squarer usingmux Quarter 4 4 13 7 4 8 6 19 11 10 16 8 35 17 40 TABLE II Delay (ns) Requirement using Spartan3 Bits squarer usingmux Quarte Multiplie r 4 13.48 19.6 18.93 13.68 8 16.44 21.12 19.91 18.13 16 16.66 21.89 21.25 22.32 Delay=Input delay + Output delay From tables I & II, we can conclude that the proposed scheme is more efficient in terms of area, timing & power. The above results can be further improved by using the Look Up Table (LUT) approach to calculate the intermediate squaring values. TABLE III Area Requirement using spartan3 Bits With lut Without lut 2 2 3 4 4 6 6 7 10 8 20 26 Fig 7. Area Requirements of various designs. Fig 8. Area Requirements of various designs. 38

TABLE IV Delay (ns) Requirement using Spartan3 Bits Squarer using Lut Squarer using Without Lut 2 2.69 3.43 4 2.73 6.52 6 4.71 7.73 8 7.35 10.9 Number of Bits TABLE VI Delay (ns) Requirement using Vertex2p Squarer Using Mux Quarter 4 7 10.39 7.6 10.2 8 8.95 11.8 10.09 11.09 16 9.29 12.24 12.47 12.89 32 11.32 13.10 15.5 17.92 64 16.52 20.92 19.61 24.13 Simulation Result: Delay = Input delay+ output delay. Fig 9. Delay Requirements of various designs. Fig 11. 8-bit Squaring Circuit TABLE V Area Requirement using Vertex2p Bits Squarer Using Mux Quarter 4 3 14 7 3 8 5 19 9 10 16 8 35 11 40 32 13 67 85 157 64 87 167 421 534 Fig 12. 16-bit Squaring Circuit. Fig 10. Area Requirements of various designs. Fig 13. 64-bit Squaring Circuit 39

We compare four designs based on their area or the maximum number of partial products in a column. Table I shows the area requirements for the types multiplier and the squaring units. As seen from Table II, the maximum delay requirement for the & the other unit is more than that of the Squarer. From the table we can prove that area reduces means power automatically reduces. Fig. 7&10 plots the area requirements for various designsunder the same constraints. From the results we can observe that the squaring units require only about 55% of the multiplier area. Designing multipliers with quarter Squarer techniques results in an area penalty of about 20-60% over multiplier. From the area requirements of quarter square multiplier and squaring units, we find that the area overhead of adders in the multiplier design is about 20-30% of the area of the squaring unit. The Delay required for each design is shown in Table II we observe that the squaring units consume about 50% of the Delay consumed by the multiplier to perform squaring. However, when a multiplier is built using the quarter square technique, it consumes more power than the as the design requires the use of two squaring units and three adders for every multiplication. The adder overhead significantly affects the overall power. The Table III & IV shows area & delay requirement using without Lut for & with Lut for Squarer architecture. With Lut consume less Area & Delay compared to Without Lut. The Table V & VI shows the area & delay requirement using Vertex. Fig 11,12,13 shows the simulation result of 8,16,64 bit Squarer. Result remains the same for all other types of multiplier. V. CONCLUSION The paper presents a case for the use of dedicated squaring units in applications where squares are required in large numbers, which otherwise would be implemented using general purpose multipliers. A method of using squaring units to perform multiplications is presented, and the tradeoffs as compared to conventional multipliers are presented. We provide results for area and power requirements in unsigned/signed squaring units and quarter square multiplier for 8/16/32-bits. The low area and power required per computation provide significant advantages when dedicated squaring units are used in a design instead of a general purpose multiplier. The Salient Feature are Modular & Scalable architecture, Easy & simple to implement, Low Power consumption, Less Area & Better Timing can be achieved. REFERENCES [1 ] Risojevic, V.; Avramovic, A.; Babic, Z.; Bulic, P, A simple pipelined squaring circuit for DSP,IEEE 29th International Conference Computer Design (ICCD),2011, Page(s): 162 167. [2 ] Kuan Jen Lin; Yu Chan Chiu; Tzu-Hao Lin A decimal squarer with efficient partial product generation, 18th IEEE/IFIP VLSI System on Chip Conference, 2010, Page(s): 213 218 [3 ] Garofalo. V. Coppola. M. De Caro. Napoli. E. Petra, N.. Strollo, A.G.M. A novel truncated squarer with linear compensation function, IEEE International Symposium on Circuits and Systems (ISCAS), Proceedings, 2010, Page(s): 4157 4160 [4 ] Oberman, Stuart F. and Flynn, Michael J. "Division Algorithms and Implementations." IEEE Transcation on Computers (1997): pp833-854, 2010 [5 ] Datla, S.R, Thornton, M.A, Mutual, D.W., "A Low Power High Performance Radix-4 Approximate Squaring Circuit," Application specific Systems, Architectures and Processors,. 20th IEEE International Conference on, vol., no., pp.91-97, 7-9 July2009 [6 ] Taek-Jun Kwon, Jeff Sondeen, Jeffrey Draper, Floating-Point Division and Square Root using a Taylor-Series Expansion Algorithm, 50th IEEE International Midwest Symposium on Circuits and Systems, August 2007, pp. 305 308 [7 ] Cho, K.-J.; Chung, J.-G. A parallel squarer design using precalculated sum of partial product Electronics letter,2007,vol 43 pp1414-1416 [8 ] Hong, Sun-Ah.Kim, Yong-Eun, Chung. Jin-Gyun. Lee, Sung-Chul, Efficient Squarer Design Using Group Partial Products IEEE Workshop on Signal Processing Systems,2007, Page(s): 146 150 [9 ] Cho, K.-J, Chung.J.-G. Low error fixed-width two's complement squarer design using Booth-folding technique Computers & Techniques, IET, 2007, Page(s): 414 422. [10 ] Shuli Gao, Chabini. N, Al-Khalili. D, Langlois. P Efficient Realization of Large Integer s and Squarers IEEE North- East Workshop on Circuits and Systems. 2006, Page(s): 237-240 [11 ] Chandra Mohan Umapathy High speed squarer 20th IEEE/IFIP VLSI System on Chip Conference, 2010. 40