VHDL Code Generator for Optimized Carry-Save Reduction Strategy in Low Power Computer Arithmetic

DAVID NEUHÄUSER
Friedrich Schiller University
Department of Computer Science
D-07737 Jena
GERMANY
dn@c3e.de

EBERHARD ZEHENDNER
Friedrich Schiller University
Department of Computer Science
D-07737 Jena
GERMANY
nez@uni-jena.de

Abstract: Carry-save arithmetic is frequently used in multiplier design. When reducing an array of partial products by carry-save addition, one cannot be certain which carry-save strategy yields the best results in terms of area, latency, and power consumption. In our contribution, we expose differences between the strategies of Wallace and Dadda, as well as our own carry-save method called short result strategy (SRR), when applied to various arithmetic operations. We provide a software tool for time-efficient analysis and rapid prototyping of carry-save arithmetic using strategies of this kind. We show results gained by employing our tool in terms of the expected area, latency, and power consumption of the resulting circuit, and outline the relevance for low power design.

Key Words: carry-save, tree multiplier, multiply-accumulate, merged arithmetic, Dadda, Wallace tree, low power, VHDL

1 Introduction

Carry-save arithmetic is often proposed for use in multiplier [18], multiply-accumulate [4], digital filter [12], or cryptography circuits [7]. Carry-save arithmetic offers the advantage of fully parallel addition. The problem of calculating a multiplication, a multiply-accumulate operation, or a digital filter computation can be seen as the task of reducing n partial products to only two. These final two partial products can be used as input for further carry-save operations or summed up using a carry-propagate adder (CPA) to obtain a number in the usual binary representation. The task of reducing several partial products can be accomplished by using different elements, in particular (3,2)-counters [12, 14, 13], (4,2)-counters [9], or (5,2)-counters [10].
In addition, for every set of elements used, several strategies can be found. We focus on strategies for (3,2)-counters, using full adders and half adders as elements. Depending on the strategy used, digital circuits composed of (3,2)-counters differ in the area required, the shape of the resulting partial products, and the power consumption of the whole circuit. In this paper, we investigate different strategies and point out the possible advantage of each strategy. For flexibility we implemented the strategies in C++, creating a tool that applies the required strategy to the given task. The tool reports statistics about the area cost and expected latency and creates VHDL code. The area statistics can be used as an approximation of the final circuit's power consumption. Synthesis of the generated source code gives even more accurate information about the expected area, latency, and power consumption.

The following section provides a short review of carry-save arithmetic. In Section 3, we investigate different strategies and point out their application to regular structures like multipliers. Irregular structures as needed by multiply-accumulate arithmetic are discussed in Section 4. The VHDL code generator tool is introduced in Section 5. In Section 6, we present generator-based statistics as well as synthesis results. We conclude with Section 7 and show possible improvements and applications of our tool.

ISBN: 978-1-61804-056-5

2 Carry-Save Arithmetic

Any arithmetic operation that requires adding up more than two partial products can be implemented by using carry-save arithmetic, for example plain multiplication (Equation 1), multiply-accumulate (Equation 2), or digital filter operations, as for instance a FIR filter (Equation 3) [3].

\[ C = A \cdot B \tag{1} \]

\[ C = \sum_{i=1}^{n} A(i) \cdot B(i) \tag{2} \]

\[ C(k) = \sum_{i=1}^{n} A(k-i+1) \cdot B(i) \tag{3} \]

To realize the required arithmetic operation, we reduce the set of partial products by taking two or three bits of the same weight and summing them up with a half adder or a full adder, respectively, depending on the applied strategy. All additions performed during one step operate on mutually disjoint sets of partial product bits and are therefore independent of each other, so they can be conducted in parallel. As the result of such a step, we get a new arrangement consisting of the sum bits and carry bits of the full adders and half adders, as well as some unreduced bits. This structure constitutes the task for the next step. After a certain number of steps we obtain a result with no more than two unreduced bits per bit position, which can be seen as two remaining partial products. To reduce p partial products, we need h steps, where h is the height of the resulting tree [9]. The maximum number p_max of partial products that can be reduced to two final partial products by h steps is [14]

\[ p_{\max}(h) = \left\lfloor p_{\max}(h-1) \cdot \frac{3}{2} \right\rfloor \ \text{for } h > 0, \qquad p_{\max}(0) = 2, \tag{4} \]

where \lfloor \cdot \rfloor is defined as

\[ b = \lfloor a \rfloor \quad \text{for } a \in \mathbb{R},\ b \in \mathbb{N} \text{ and } b \le a < b + 1. \tag{5} \]

Table 1: Reducing 6 × 6 bit multiplier partial products.

          HA   FA   s bits   c bits   area   CPA
Wallace    9   17       12        7     43     8
Dadda      5   15       11       10     35    10
SRR        5   18       11        7     41     7

3 Reduction Strategies

There exist different strategies to decide which bits should be reduced by a half adder and which by a full adder in each step; some partial product bits may also stay unreduced.

Figure 1: Wallace strategy for multipliers.

Figure 2: Dadda strategy for multipliers.

3.1 Wallace tree: In a Wallace tree [19] we reduce the partial products by using a tree structure consisting of carry-save adders. The strategy tends to compress partial product bits as much as possible during a single full adder delay.
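The recurrence of Equation 4 in Section 2, which bounds the number of steps for a Wallace tree and for the other strategies alike, can be tabulated with a few lines of Python. This is only a sketch; the function names `p_max` and `tree_height` are chosen here for illustration and do not come from the tool described in this paper.

```python
def p_max(h: int) -> int:
    """Maximum number of partial products reducible to two in h steps (Equation 4)."""
    p = 2  # p_max(0) = 2
    for _ in range(h):
        p = (3 * p) // 2  # floor(p_max(h-1) * 3/2)
    return p


def tree_height(p: int) -> int:
    """Smallest number of steps h such that p partial products fit under p_max(h)."""
    h = 0
    while p_max(h) < p:
        h += 1
    return h


# First values of the sequence: 2, 3, 4, 6, 9, 13, 19
levels = [p_max(h) for h in range(7)]
print(levels)
print(tree_height(6))  # steps needed for the six partial products of a 6 x 6 bit multiplier
```

For the six partial products of the 6 × 6 bit example, the smallest h with p_max(h) >= 6 is 3, so three reduction steps are needed.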
An example 6 bit × 6 bit unsigned multiplier is shown in Figure 1. The task is to reduce 6 partial products. The least significant bit position is always the rightmost position. The applied steps are also shown in Figure 1.

3.2 Dadda strategy: The Dadda [5] strategy looks for the bit position with the most bits and calculates the smallest reachable level L_min at this position within one step. All necessary steps are shown in Figure 2. Without changing the weight of any bit, the array structure can be rearranged as a pyramid. The least significant bit position is always the rightmost position. Beginning with the least significant bit position, half adders and full adders are placed to reach L_min, but no effort is made to reduce any bits in advance. Note that the Dadda and Wallace strategies yield exactly the same number of steps.

3.3 SRR: We propose a new strategy, called short result strategy (SRR). The goal of this strategy is to produce two final partial products of small width, but with hardware effort comparable to that of the Dadda strategy. As in the other strategies, the maximum number of bits in a column is used to calculate the smallest reachable level within one step. The algorithm reduces a less significant section following the Wallace strategy. More significant bits are only reduced if this is unavoidable to achieve the minimum number of steps, as in the Dadda strategy. The steps are shown in Figure 3.

Figure 3: SRR strategy for multipliers.

The mentioned strategies differ in the number of half adders and full adders used, and in the shape of the resulting partial products; this can be seen in Table 1 for the example task of a 6 bit × 6 bit multiplication. HA and FA give the numbers of half adders and full adders used; s bits and c bits are the numbers of bits of the two resulting partial products, and area is the half adder equivalent of the needed area, with one full adder equaling two half adders. CPA is the number of common bits in both resulting partial products to be added up by the final CPA to obtain a binary number. We will not consider the Wallace strategy any further in this paper, since it appears to be clearly weaker than both the Dadda and the SRR strategy.

4 Irregular Structures

For reasons of efficiency, Equations 2 and 3 can be implemented as instances of merged arithmetic [16], see also [4]. However, this approach leads to far more complex structures to be reduced. As an example, we assume A and B to be 4 bits wide and C or C(k) to be 8 bits wide in carry-save representation, all in two's complement. This yields the task of reducing 6 partial products. Figure 4 shows these partial products when using the Dadda strategy. A Baugh-Wooley multiplication array is used for two's complement multiplication (proposed in [2], reviewed in [17]). Bits in the leftmost column are most significant and signed; they are removed from the original partial products for separate overflow detection and sign correction. This leaves the partial products in Figure 4 to be reduced. Figure 5 shows the steps.

Figure 4: MAC-unit partial products using Dadda.

Figure 5: MAC-unit steps using Dadda.

When applying SRR, we obtain a different structure of partial products, as shown in Figure 6. Again, bits in the leftmost column are most significant and signed; they are removed from the original partial products for separate overflow detection and sign correction. This leaves the partial products in Figure 6 to be reduced. Figure 7 shows the corresponding steps.

Figure 6: MAC-unit partial products using SRR.

Figure 7: MAC-unit steps using SRR.

As before, the strategies differ in the shape of the resulting partial products; this can be seen in Table 2 for the example task of reducing a multiply-accumulate partial product array. Notice, however, that the numbers of half adders and full adders, and thus the total area, agree in this example, in contrast to the results from Section 3. Again, HA and FA give the numbers of half adders and full adders used; s bits and c bits are the numbers of bits of the two resulting partial products, and area is the half adder equivalent of the needed area. Latency is the sum of the tree height and the CPA, the latter being the number of common bits in both resulting partial products to be added up by the final CPA to obtain a binary number.

Table 2: Dadda and SRR in MAC-units.

         HA   FA   s b.   c b.   area   lat.   CPA
Dadda     4    5      8      8     33      6     8
SRR       4    5      8      5     33      3     5

The multiplication and multiply-accumulate examples show only a small part of the variety of possible tasks. The effects of SRR on multiply-accumulate units are shown in detail in [11]. Whether a multiply, multiply-accumulate, or digital filter operation is implemented affects the shape of the structure to be reduced. Choosing unsigned, one's complement, two's complement, or sign-magnitude representation influences the shape, too. Using carry-save or non-carry-save accumulation has an additional effect. The same holds for the bit width of all operands as well as for the different strategies. To make reasonable design decisions, one would have to consider all possible designs, describe them in VHDL, and synthesize them manually. This is a quite time-consuming task. The need for automation is obvious.

5 VHDL Code Generator

For flexibility we implemented these algorithms in C++, creating a generator which produces VHDL source code. This source code describes a partial product tree using one of the discussed partial product strategies. Figure 8 shows the incorporation of the generator into the design flow. The design flow control defines a task and the strategy to be applied. It also provides the VHDL framework, that is, the VHDL source code needed for synthesis that defines all of the arithmetic circuit except the tree. The generator reduces the given bit structure using the required strategy, creating VHDL source code of the tree as well as time and area statistics.

Figure 8: VHDL code generator in design flow.
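As a rough illustration of the statistics part of such a generator, the following Python sketch applies the Dadda strategy of Section 3.2 to a list of column heights and counts the half adders and full adders it places. It is not the C++ tool described here: the function names are hypothetical, the stage targets are taken from the sequence 2, 3, 4, 6, 9, ... of Equation 4, and the column-wise placement rule follows the common textbook formulation of Dadda's method, which is an assumption about the details.

```python
def multiplier_heights(n):
    """Column heights of an unsigned n x n bit partial product array, LSB first."""
    return [min(i, 2 * n - 2 - i) + 1 for i in range(2 * n - 1)]


def dadda_targets(max_height):
    """Stage targets 2, 3, 4, 6, 9, ... below max_height, largest first."""
    targets = [2]
    while (3 * targets[-1]) // 2 < max_height:
        targets.append((3 * targets[-1]) // 2)
    return list(reversed(targets))


def dadda_reduce(heights):
    """Return (half adders, full adders, final column heights)."""
    cols = list(heights) + [0]  # spare column for a carry out of the top position
    ha = fa = 0
    for d in dadda_targets(max(cols)):
        carries = 0  # carries produced in the previous column during this stage
        for i in range(len(cols)):
            h = cols[i] + carries
            carries = 0
            while h > d:
                if h == d + 1:
                    ha += 1  # half adder: two bits in, one sum bit stays
                    h -= 1
                else:
                    fa += 1  # full adder: three bits in, one sum bit stays
                    h -= 2
                carries += 1  # every adder sends one carry to the next column
            cols[i] = h
    while cols and cols[-1] == 0:
        cols.pop()  # drop the spare column if it stayed empty
    return ha, fa, cols


ha, fa, final = dadda_reduce(multiplier_heights(6))
area = ha + 2 * fa  # half adder equivalents, one full adder counted as two half adders
print(ha, fa, area, final)
```

For the 6 × 6 bit multiplier this sketch places 5 half adders and 15 full adders, that is 35 half adder equivalents, which agrees with the area listed for Dadda in Table 1. Emitting VHDL would amount to writing out one component instantiation per counted adder, which is omitted here.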
Both VHDL source codes are synthesized using Synopsys Design Compiler. As a result we obtain the synthesized design and post-synthesis statistics about area, latency, and power consumption of the design. The advantages of this approach are two-fold: On the one hand, we get statistics of the expected circuit complexity and performance before doing time-consuming synthesis. On the other hand, we can easily perform rapid prototyping to compare the effect of different approaches and different tasks through synthesis of the VHDL source code. This enables us to compare different strategies and different bit widths in less time than would be needed to design the different arrays by hand.

6 Results

For a multiplier implementation, a shorter final CPA can be used by applying the SRR strategy. This final CPA has to add up fewer bits and is therefore faster and smaller. Thus one could assume that the SRR strategy requires less area and latency than the Dadda strategy. The generator statistics for a multiplier shown in Table 3, assuming a ripple-carry adder (RCA) for low power results, seem to confirm this assumption. The latency advantage of the SRR-based multiplier could be traded for an area advantage by synthesizing both designs with equal latency constraints. The resulting smaller SRR-based multiplier would yield significantly lower power consumption than the Dadda-based multiplier.

Table 3: VHDL-generator statistics for multipliers.

             area              latency       area * latency
Bits    Dadda     SRR      Dadda    SRR      percentage
8        42.5      43         18     14            78.7
16      232.5     235         36     30            84.2
32      976.5     990         70     62            89.9
64     4000.5    4019        136    126            93.1

This assumption neglects the different latencies needed to calculate each bit of the resulting partial products, and may thus produce misleading conclusions, see for instance [6, 13, 18]. It has been shown in [18] that the Wallace strategy, although leading to a smaller final CPA, is worse than the Dadda strategy in terms of area and latency when using full adders with input-dependent latencies. Similar arguments might hold for the SRR strategy.

As an example, we generated VHDL source code for signed magnitude multipliers (SM-multipliers) and synthesized it with Synopsys Design Compiler and the UMC 80 nm CMOS library. Synthesis results (in percent) for the SRR strategy are shown in Table 4, normalized to the results for the Dadda strategy. To create a low power design, we chose an RCA as the final CPA and minimized the area of the whole design throughout synthesis. The results are as predicted in [18]. The SRR strategy, although using a smaller CPA, performs worse than the Dadda strategy when designing a signed magnitude multiplier.

Table 4: SM-multipliers synthesis results using SRR.

bits     area    latency   area * latency   power
8       98.88     105.41           104.23    0.35
16      99.84     102.52           102.35    0.3
32      99.65     104.23           103.87    00.7
64     100.01     102.89           102.90    00.

For a multiply-accumulate unit, we cannot rely on the same prediction. We now have the choice to either sum up the two resulting partial products and latch a smaller binary number, or latch both resulting partial products before adding them up. By moving the final adder out of the critical path, as in the second approach, we obtain a significantly lower cycle time. Since the purpose of a multiply-accumulate unit is to add a sequence of products, this approach seems to be more efficient. Internally we latch both resulting partial products back to the partial product tree. Therefore we have to expand the partial product array by two additional lines. By latching both partial products, all latched bits get the same arrival time, which depends on the worst bit latency. The Dadda strategy loses its latency advantage and is left with the penalty of a larger and slower final CPA. Furthermore, the fewer result bits produced by the SRR strategy require fewer latches than with the Dadda strategy.
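The second approach, latching both partial products and adding them only once at the end, can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the circuit generated by the tool: operands are unsigned integers of unbounded width, each product is supplied as a finished number instead of being merged into the tree as partial product bits, and the names `carry_save_add`, `mac_carry_save`, and `recode` are hypothetical.

```python
def carry_save_add(a, b, c):
    """Apply a (3,2)-counter to every bit position: three words in, two words out."""
    s = a ^ b ^ c                               # sum bits, same weight as the inputs
    carry = ((a & b) | (a & c) | (b & c)) << 1  # carry bits, one weight higher
    return s, carry


def mac_carry_save(pairs):
    """Accumulate products in carry-save form; no carry propagation inside the loop."""
    acc_s, acc_c = 0, 0  # the two latched partial products
    for a, b in pairs:
        acc_s, acc_c = carry_save_add(acc_s, acc_c, a * b)
    return acc_s, acc_c


def recode(acc_s, acc_c):
    """Final carry-propagate addition, performed once after the last cycle."""
    return acc_s + acc_c


pairs = [(3, 5), (7, 2), (4, 4)]
result = recode(*mac_carry_save(pairs))
print(result)  # 3*5 + 7*2 + 4*4
```

The loop body contains no carry-propagate addition, which mirrors the shorter cycle time discussed above; the price is that two words instead of one have to be latched in every cycle.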
The generator statistics for two's complement multiply-accumulate units (TC-MAC-units) are shown in Table 5, assuming an RCA for low power performance. Again, the latency advantage of the SRR-based design could be traded for an area advantage by synthesizing both designs with equal latency constraints. The resulting smaller SRR-based design would yield significantly lower power consumption than the Dadda-based design.

Table 5: TC-MAC-units statistics using SRR.

          one multiply cycle            one total operation
bits    area   latency   A * L       area   latency   A * L
8       94.2     100.0    94.2       93.5      70.6    66.0
16      98.0     100.0    98.0       97.5      81.8    79.8
32      99.5     100.0    99.5       99.2      87.7    87.0
64      99.9     100.0    99.9       99.8      92.2    92.0

Synthesis results (in percent) for two's complement multiply-accumulate units using the SRR strategy are shown in Table 6, normalized to the results for the Dadda strategy. To create a low power design, we chose an RCA as the final CPA and minimized the area of the whole design throughout synthesis. The computation in Equation 2 corresponds to n multiply-accumulate (MAC) cycles yielding a result in carry-save representation, and one final MAC cycle that is followed by a recoding phase adding up the two remaining partial products. Table 6 shows the results for (left) one of the first n cycles, and (right) the final MAC cycle, including the recoding.

Table 6: TC-MAC-units synthesis using SRR.

         one MAC cycle       final MAC cycle + recoding
bits    A * L    power          A * L    power
8        99.0     99.9           73.6     95.7
16       98.6    100.0           83.4     98.7
32       99.7    100.1           89.2     99.7
64       99.9    100.1           93.0    100.0

In both cases, we gain an advantage by applying the SRR strategy. Therefore, when designing a multiply-accumulate unit with a redundant accumulate part, the SRR strategy is more efficient than the Dadda strategy, in contrast to the design of a multiplier.
7 Conclusion and Future Work

The generator design tool allows us to freely define tasks of summing up partial products, and delivers expected area and latency statistics before synthesis, as well as VHDL source code. Many different design choices for multipliers, multiply-accumulate units, digital filters, and other complex computer arithmetic circuits can be prototyped rapidly using the generator. We have shown that there is no overall optimal strategy. The suitability of a strategy depends on the required arithmetic operation. The different strategies therefore need to be compared for every new arithmetic operation. Applying this design flow method to other complex computer arithmetic structures is our primary goal. In particular, looking into digital filters and deciding on an optimal strategy will be easier using the generator tool. Including other carry-save elements, such as (4,2)-counters and (5,2)-counters, as well as other redundant representations like signed binary [1], would widen the comparable space of possible designs. Incorporating the option of pipelining the tree by inserting latches, as done in [15], is another possible future improvement of the generator tool that would allow a wider variety of designs to be compared.

References:

[1] Avizienis, A.: Signed-digit number representations for fast parallel arithmetic. In: IRE Transactions on Electronic Computers, vol. 10, pp. 389-400. (1961)
[2] Baugh, C.R., Wooley, B.A.: A two's complement parallel array multiplier algorithm. In: IEEE Transactions on Computers, vol. 22, pp. 1045-1047. (1973)
[3] Bellanger, M.: Digital processing of signals. Theory and Practice. 3rd edition. Wiley, 2000.
[4] Chen, J., Xu, R., Fu, Y.: Architecture Design of a High-Performance 32-Bit Fixed-Point DSP. In: ACSAC 2004. LNCS, vol. 3189, pp. 115-125. Springer, Heidelberg (2004)
[5] Dadda, L.: Some schemes for parallel multipliers. In: Alta Frequenza, vol. 34, pp. 349-356. (1965)
[6] Flynn, M.J., Oberman, S.F.: Advanced Computer Arithmetic Design. Wiley, New York, 2001.
[7] Huang, M., Gaj, K., Kwon, S., El-Ghazawi, T.: An Optimized Hardware Architecture for the Montgomery Multiplication Algorithm. In: PKC 2008, LNCS, vol. 4939, pp. 214-228. International Association for Cryptologic Research (2008)
[8] Johannson, K., Gustafsson, O., Wanhammar, L.: Power Estimation for Ripple-Carry Adders with Correlated Input Data. In: PATMOS 2004. LNCS, vol. 3254, pp. 662-674. Springer, Heidelberg (2005)
[9] Kornerup, P.: Reviewing 4-to-2 Adders for Multi-Operand Addition. In: Journal of VLSI Signal Processing, vol. 40, pp. 143-152. Kluwer Academic Publishers (2005)
[10] Kwon, O., Nowka, K., Swartzlander, E.E. Jr.: A 16-Bit by 16-Bit MAC Design Using Fast 5:3 Compressor Cells. In: Journal of VLSI Signal Processing, vol. 31, pp. 77-89. Kluwer Academic Publishers (2002)
[11] Neuhäuser, D., Zehendner, E.: On Carry-Save Strategies for Multiply-Accumulate Arithmetic. European Conference of Computer Science (ECCS '11), Puerto de la Cruz, Spain (2011)
[12] Noll, T.G.: Carry-Save Architectures for High-Speed Digital Signal Processing. In: Journal of VLSI Signal Processing, vol. 3, pp. 121-140. Kluwer Academic Publishers, Boston (1991)
[13] Oklobdzija, V.G., Villeger, D., Liu, S.S.: A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach. In: IEEE Transactions on Computers, vol. 45, no. 3, pp. 294-306. (1996)
[14] Parhami, B.: Computer arithmetic: algorithms and hardware designs. Oxford University Press, New York, Oxford, 2000.
[15] Schuster, Ch., Nagel, J.L., Piguet, Ch., Farine, P.A.: Leakage Reduction at the Architectural Level and Its Application to 16 Bit Multiplier Architectures. In: E. Macii et al. (Eds.): PATMOS 2004, LNCS vol. 3254, pp. 169-178. Springer, Heidelberg (2004)
[16] Swartzlander, E.E. Jr.: Merged Arithmetic. In: IEEE Transactions on Computers, vol. 29, no. 10, pp. 946-950. (1980)
[17] Swartzlander, E.E. Jr.: The Negative Two's Complement Number System. In: Journal of VLSI Signal Processing, vol. 49, pp. 177-183. Springer Science + Business Media, LLC (2007)
[18] Townsend, W.J., Swartzlander, E.E. Jr., Abraham, J.A.: A comparison of Dadda and Wallace multiplier delays. In: Advanced signal processing algorithms, architectures, and implementations. Conference No 13, vol. 5205, pp. 552-560. San Diego CA, USA (2003)
[19] Wallace, C.S.: A suggestion for a fast multiplier. In: IEEE Transactions on Computers, vol. 13, pp. 14-17. (1964)