Sloppy Addition and Multiplication
IMM-Technical Report-2011-14

Alberto Nannarelli
Dept. Informatics and Mathematical Modelling
Technical University of Denmark
Kongens Lyngby, Denmark
Email: an@imm.dtu.dk

Abstract
Sometimes reducing the precision of a numerical processor, by introducing errors, can lead to significant performance (delay, area and power dissipation) improvements without compromising the overall quality of the processing. In this work, we show how to perform the two basic operations, addition and multiplication, in an imprecise manner by simplifying the hardware implementation. With the proposed sloppy operations, we obtain a reduction in delay, area and power dissipation, and the error introduced is still acceptable for applications such as image processing.

1 Introduction
In common language the adjective arithmetical usually indicates something very precise or error-free. However, arithmetic operations also have to be put in context. There are several fields of application of computer arithmetic that can tolerate some imprecision. For example, in audio and image processing or in wireless communication, it might be desirable to get better performance (faster, smaller, less power-hungry systems) at the expense of some quality degradation. Recently, a few papers have addressed this issue of designing imprecise hardware to save power [1, 2, 3, 4].

In this work, we introduce a systematic way of having imprecise arithmetic operations for the two most common operations: addition and multiplication. We liked the term sloppy introduced in [5], and we will use this term in the paper to refer to imprecise arithmetic operations.
2 Sloppy Addition
The idea is very simple. Do we need to propagate the carry for the whole word? Assuming that we are operating on positive integers, and defining position k as the bit of weight $2^k$ in an n-bit word, we can ignore the carry up to position k when implementing the addition. The bit-level algorithm to implement this sloppy adder is the following:

    c = 0   // carry
    for i = 0 to n-1
      if (i < k) then
        s_i = a_i XOR b_i;
      else
        s_i = a_i XOR b_i XOR c;
        c = (a_i AND b_i) OR (a_i AND c) OR (b_i AND c);
      endif
    endfor

For example, addition 103 + 70 (n = 8, k = 4):

             sloppy             exact
    A :   0110 0111 +       0110 0111 +
    B :   0100 0110 +       0100 0110 +
    c :   100- ---- =       1000 110- =
          ---------         ---------
    S :   1010 0001         1010 1101

That is, the sloppy adder computes 161 (the exact value is 173), introducing an error $\epsilon = 12$. By looking at the bits of weight $< 2^k$, we notice that the XOR of two ones produces a zero sum bit ($1 \oplus 1 = 0$). Because the carry is not computed (or propagated), each position $j < k$ where both operand bits are one generates an error of $2^{j+1}$. The error can be halved to $2^j$ by computing the OR of the two bits in place of the XOR. For the example above we have:

          sloppy (OR-ing)
    A :   0110 0111 +
    B :   0100 0110 +
    c :   100- ---- =
          ---------
    S :   1010 0111

and the error is reduced from $\epsilon = 12$ to $\epsilon = 6$ (halved). By simulating all possible combinations of the operands for the 8-bit addition (k = 4), we found that when the sum is obtained by OR-ing the k least-significant bits the average error is $\epsilon_{mean} = 3.75$, while by XOR-ing it is $\epsilon_{mean} = 7.5$.

We show in Figure 1 the comparison of the hardware implementation of the sloppy adder used in the above example (n = 8, k = 4) and an error-free 8-bit carry-propagate adder (CPA). The data on delay, area and power are reported in Table 1.

In a rough evaluation, we considered lowering the supply voltage $V_{DD}$ in the sloppy adder to match the delay of the error-free adder (1.0 ns). In our library, when $V_{DD}$ is lowered from 1.0 V
to 0.7 V the delay doubles. The power dissipation is

$$P_{1.0\,\mathrm{V}} = V_{DD}^2 \, f \sum_{i=1}^{N} a_i C_i \approx 20\ \mu\mathrm{W} = (1.0)^2 \cdot K$$

and we assume that the switching activity does not change when scaling $V_{DD}$. Therefore, K = 20 is constant:

$$P_{0.7\,\mathrm{V}} = (0.7)^2 \cdot 20 \approx 10\ \mu\mathrm{W}$$

That is, with the sloppy adder the power is reduced to about 1/4 (10 µW versus 42 µW for the CPA in Table 1) at the same adder speed.

Figure 1: Implementation of 8-bit error-free (top) and sloppy k = 4 (bottom) adders.

2.1 Example: sloppy adder in image filtering
We use the sloppy adder defined above (k = 4) to process two grayscale images (each pixel is an unsigned 8-bit integer) with the following bidimensional filters:
1. an averaging (low-pass) filter;
2. a sharpening filter;
3. an edge-detection unit.

The visual results are shown in Figure 2. The maximum error (absolute value) $\epsilon_{max}$ and the average error $\epsilon$ are reported in Table 2 for the different types of filtering.

The results show that the degradation is independent of the image (uma is a portrait, while hue has greater detail). Depending on the filter mask, we can change the design of the sloppy adder to obtain larger savings. For example, for edge-detection, a sloppy adder with k = 6 has an average error $\epsilon = 28$.

Table 1: Synthesis data of adders in Figure 1.

                       CPA 8-bit    sloppy    ratio
    max. delay [ps]       999         495      2.00
    Area [µm^2]           191         112      1.70
    Power [µW]             42          20      2.10

Table 2: Error analysis of processed images.

              smoothing            sharpening           edge det.
            eps_max    eps       eps_max    eps       eps_max    eps
    uma       26       7.2         60      18.9         64       9.0
    hue       28       7.8         59      17.5         68       9.2

3 Sloppy Multiplication
Parallel multiplication $p = x \cdot y$ can be divided into three steps:
1. generation of Partial Products (PPs);
2. carry-free reduction from n PPs to 2 operands;
3. carry-propagate two-operand addition.

We use a sloppy approach for step 1 only, as step 2 is quite delay-efficient (no carry propagation) and step 3 has been addressed in the previous section.

We consider radix-4 multiplication because, for $n \times n$-bit operands, n/2 PPs are generated and the unit of step 2 is smaller. In radix-4 multiplication, the radix-4 digits of the multiplier y are recoded into signed-digit representation to avoid multiples of 3 and carry propagation, as explained in [6]. The resulting architecture (for one digit), recoder plus PP generation (rec+ppgen), is sketched in Figure 3 (top).

Similarly to what was done for the addition, we have a sloppy rec+ppgen for the least-significant digits of y. The recoding is performed as shown in Table 3. The resulting implementation is greatly simplified, as shown in Figure 3 (bottom).

Clearly, a competitor of the sloppy multiplier is the truncated multiplier. To compare performance and error introduced, we implemented an $8 \times 8$-bit multiplier (two's complement) in five different schemes.
Figure 2: Visual results of sloppy addition in filtering. [Panels: 1. Smoothing filter, 2. Sharpening filter, 3. Edge-detection; each shown for images uma and hue as original, error-free, sloppy-adder and error map.]
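The sloppy adder of Section 2, which produces the results in Figure 2, can be modelled at the value level in a few lines of Python. This is a minimal sketch, not the hardware description used in the report: the names `sloppy_add` and `mean_error` are hypothetical, operands are assumed to be unsigned n-bit integers, and the carry out of the most significant bit is kept so that the error is simply the exact sum minus the sloppy sum.

```python
def sloppy_add(a: int, b: int, k: int, n: int = 8, use_or: bool = False) -> int:
    """Add two unsigned n-bit integers, ignoring all carries below position k.

    Assumption (not from the report): the carry out of bit n-1 is kept,
    so the result may need n+1 bits.
    """
    word_mask = (1 << n) - 1
    low_mask = (1 << k) - 1
    a &= word_mask
    b &= word_mask
    # Lower k bits: bitwise XOR (or OR); no carry is generated or propagated.
    low = ((a | b) if use_or else (a ^ b)) & low_mask
    # Upper n-k bits: exact addition with carry-in 0 at position k.
    high = (a >> k) + (b >> k)
    return (high << k) | low


def mean_error(k: int, n: int = 8, use_or: bool = False) -> float:
    """Average of (exact - sloppy) over all pairs of n-bit operands."""
    total = 0
    for a in range(1 << n):
        for b in range(1 << n):
            total += (a + b) - sloppy_add(a, b, k, n, use_or)
    return total / (1 << (2 * n))


if __name__ == "__main__":
    print(sloppy_add(103, 70, k=4))               # 161 (exact: 173)
    print(sloppy_add(103, 70, k=4, use_or=True))  # 167
    print(mean_error(4))                          # 7.5
    print(mean_error(4, use_or=True))             # 3.75
```

Under these assumptions the sketch reproduces the worked example 103 + 70 and the average errors quoted in Section 2 for n = 8, k = 4.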
The compared multiplier schemes are:
1. r2-mult: a radix-2 standard multiplier;
2. r4-mult: a radix-4 standard multiplier (with PP generation as in Figure 3-top);
3. r2-trunc: an r2-mult with $k_t$ truncated bits;
4. r4-trunc: an r4-mult with $k_t$ truncated bits;
5. sloppy: a radix-4 multiplier with PP generation as in Figure 3-bottom for k digits.

We estimated a comparable error for k = 2 sloppy digits and $k_t = 8$ truncated bits. The results of the simulation on all $2^{16}$ combinations are reported in Table 4. The data do not include the contribution of the final carry-propagate adder.

Table 3: Sloppy radix-4 recoding.

                                     PP_k
    y_{2k+1}   y_{2k}       std.          sloppy        eps_k
       0         0          0             0             0
       0         1          x 4^k         2x 4^k        x 4^k
       1         0          2x 4^k        2x 4^k        0
       1         1          3x 4^k        2x 4^k        x 4^k

Table 4: Summary of results for $8 \times 8$-bit multipliers.

    unit        delay [ps]   power [µW]   area [µm^2]   error eps   error eps_max
    r2-mult        900           70          2612            0             0
    r4-mult        850           84          1842            0             0
    r2-trunc       870           32          1426          256           897
    r4-trunc       820           26           847          304           640
    sloppy         490           21          1195          145           657

4 Putting Everything Together
Now we combine the sloppy multiplier and adder in a multiply-add (and accumulate) unit (Figure 4) which can be used for the trivial implementation of the Inverse Discrete Cosine Transform (IDCT), which is part of the JPEG decompression algorithm.

We implemented the unit of Figure 4 with regular (R) and sloppy (S) operations as shown in Table 5. The multiplier is $12 \times 12$ bits, the adder is 24 bits. By C simulation, we found a sloppiness limit of $k_m = 3$ digits (6 bits) for the multiplier and $k_a = 8$ bits for the adder.

The results in Table 5 are obtained by implementation in a 90 nm standard cell library (clock rate is 100 MHz). The errors are computed with respect to a floating-point software implementation. The visual results are shown in Figure 5.

The results show that the larger reduction in power is obtained when the sloppy multiplier is used. The contribution of the sloppy adder is little with respect to the power, but it is
significant in delay reduction^1 (about 40% faster) and the slack can be used for low power design. The degradation due to the sloppy adder, in addition to that of the sloppy multiplier, is marginal.

Figure 3: Implementation of error-free (top) and sloppy (bottom) rec+ppgen.

5 Conclusions and Future Work
We have presented simple ways of performing addition and multiplication in an imprecise manner with the aim to get better performance (delay, area and power) at the expense of an increased error which can be tolerated in some applications. This is preliminary work, just the idea, which is going to be further developed.

^1 The synthesis was done with the minimum area constraint. Therefore, the adder is synthesized as a carry-ripple adder.

References
[1] K. He, A. Gerstlauer, and M. Orshansky, "Controlled Timing-Error Acceptance for Low Energy IDCT Design," Proc. of 2011 Design, Automation and Test in Europe Conference
(DATE), Mar. 2011.
[2] A. Lingamneni, J.-L. N. C. Enz, K. Palem, and C. Piguet, "Energy Parsimonious Circuit Design through Probabilistic Pruning," Proc. of 2011 Design, Automation and Test in Europe Conference (DATE), Mar. 2011.
[3] P. Krause and I. Polian, "Adaptive Voltage Over-Scaling for Resilient Applications," Proc. of 2011 Design, Automation and Test in Europe Conference (DATE), Mar. 2011.
[4] D. Mohapatra, V. Chippa, A. Raghunathan, and K. Roy, "Design of Voltage-Scalable Meta-Functions for Approximate Computing," Proc. of 2011 Design, Automation and Test in Europe Conference (DATE), Mar. 2011.
[5] L. Hardesty, "The surprising usefulness of sloppy arithmetic," MIT News Office. [Online]. Available: http://web.mit.edu/newsoffice/2010/fuzzy-logic-0103.html
[6] M. Ercegovac and T. Lang, Digital Arithmetic. Morgan Kaufmann Publishers, 2004.

Table 5: Summary of results for IDCT implementation.

      Unit       delay    area               uma                          hue                power
    MULT  ADD    [ps]    [µm^2]    P_ave [µW]   eps   eps_max   P_ave [µW]   eps   eps_max   ratio
     R     R     3500     5580        128       3.7      9         185       3.8     10       1.00
     S     R     3400     5090        107       5.0     34         155       6.0     39       0.84
     R     S     3090     5440        125       3.8     18         181       5.0     21       0.98
     S     S     2930     4950        106       5.0     35         153       6.6     36       0.83

Figure 4: Scheme of multiply-accumulate used for IDCT. [Blocks: inputs X and Y, MULT, CSA 3:2, ADD, register, output S.]
Figure 5: Original pictures (top) and after decoding by sloppy (S-S) IDCT (bottom), for images uma and hue.
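The sloppy partial-product generation of Table 3, used in the S-S configuration of Figure 5, can be illustrated with a small value-level model. This is only a sketch under stated assumptions: the function name `sloppy_mul` is hypothetical, operands are treated as unsigned (the report's multipliers are two's complement), n is assumed even, the partial products are summed exactly, and the error-free digits use their plain value 0..3 rather than the signed-digit recoding of [6]. It is therefore not expected to reproduce the figures of Table 4.

```python
def sloppy_mul(x: int, y: int, k: int, n: int = 8) -> int:
    """Multiply unsigned x by unsigned n-bit y; the k lowest radix-4 digits of y are sloppy.

    Assumptions (not from the report): unsigned operands, even n,
    exact summation of the partial products.
    """
    product = 0
    for d in range(n // 2):               # one partial product per radix-4 digit
        digit = (y >> (2 * d)) & 0b11     # bits y_{2d+1}, y_{2d}
        if d < k:
            # Sloppy digit (Table 3): the PP is 2x whenever either bit is set.
            pp = 2 * x if digit else 0
        else:
            # Error-free digit: the PP is digit * x, with digit in 0..3.
            pp = digit * x
        product += pp << (2 * d)          # weight 4^d
    return product


if __name__ == "__main__":
    x, y = 10, 7                          # y = 0b0111: digit 0 is 3, digit 1 is 1
    print(x * y)                          # 70 (exact)
    print(sloppy_mul(x, y, k=1))          # 60
    print(sloppy_mul(x, y, k=2))          # 100
```

With k = 0 the function reduces to exact multiplication, which is a convenient sanity check. With k = 1 and the digit pattern 11, the result differs from the exact product by x, the magnitude listed in the last row of Table 3.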