Computer Arithmetic

ELEC816/ELEC61 Spring 1
Hayden Kwok-Hay So
H. So, Sp1 Lecture 7 - ELEC816/61

Arithmetic Units
- How do we carry out +, -, ×, ÷ in an FPGA?
- How do we perform sin, cos, e, etc.?

Addition
- Two positive integers can be added in the same way decimal numbers are added in long addition
[Figure: worked long-addition examples in decimal and binary]
- The same addition can be implemented in hardware (ASIC) and on an FPGA

Ripple Carry Adder
- Mimics the working of a long addition
- Each bit of the addition is handled by one full adder
- Full adder
  - Adds two 1-bit numbers AND a carry-in, i.e. adds THREE 1-bit numbers
  - Produces 1 sum bit and 1 carry bit

Half Adder
- Adds two 1-bit numbers
- Produces 1 sum bit and 1 carry bit

  A B | C S
  0 0 | 0 0
  0 1 | 0 1
  1 0 | 0 1
  1 1 | 1 0

Full Adder
- A full adder handles a carry input as well as the two input data bits
- Altogether there are 3 inputs and 2 outputs
- $S = A \oplus B \oplus C_{in}$
- $C_{out} = AB + C_{in}(A \oplus B)$

  A B C_in | S C_out
  0 0 0    | 0 0
  0 0 1    | 1 0
  0 1 0    | 1 0
  0 1 1    | 0 1
  1 0 0    | 1 0
  1 0 1    | 0 1
  1 1 0    | 0 1
  1 1 1    | 1 1
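The half adder and full adder can be modelled bit by bit in software. The following is a minimal Python sketch, not a hardware description. The function names `half_adder` and `full_adder` are chosen here for illustration, and the full adder follows the sum and carry equations from the Full Adder slide.

```python
def half_adder(a: int, b: int) -> tuple[int, int]:
    """Add two 1-bit values. Returns (sum, carry)."""
    return a ^ b, a & b


def full_adder(a: int, b: int, c_in: int) -> tuple[int, int]:
    """Add three 1-bit values. Returns (sum, carry_out)."""
    p = a ^ b                      # A xor B, shared by both outputs
    s = p ^ c_in                   # S = A xor B xor C_in
    c_out = (a & b) | (c_in & p)   # C_out = AB + C_in (A xor B)
    return s, c_out


if __name__ == "__main__":
    # Print the full-adder truth table.
    print("A B Cin | S Cout")
    for a in (0, 1):
        for b in (0, 1):
            for c_in in (0, 1):
                s, c_out = full_adder(a, b, c_in)
                print(f"{a} {b} {c_in}   | {s} {c_out}")
```

For every input combination, `s + 2 * c_out` equals `a + b + c_in`, which is a quick way to check the logic against ordinary addition.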
Ripple Carry Adder (1)
- A ripple-carry adder is formed by chaining a series of full adders (FAs), 1 for each input bit
- The carry-out from bit i is connected as the carry-in of bit (i+1)
[Figure: 4-bit ripple-carry adder with inputs A3..A0 and B3..B0 and sum outputs S3..S0]

Ripple Carry Adder (2)
- Delay through a ripple-carry adder is proportional to the width of the data input
- O(n) delay, where n is the width of the input
[Figure: 4-bit ripple-carry adder with the carry path through all stages highlighted]

Carry Look Ahead Adder
- In a ripple-carry adder, each bit must wait for the carry from the previous bit before its calculation can start
- A carry look ahead (CLA) adder looks ahead in the input to figure out the carry
- Define two functions:
  - Generate: $G_i = A_i \cdot B_i$
  - Propagate: $P_i = A_i \oplus B_i$
- If $G_i = 1$, then $c_{i+1} = 1$
- If $P_i = 1$, then $c_{i+1} = c_i$: bit i propagates the carry from bit (i-1) to bit (i+1)

CLA Adder
- Both generate and propagate can be calculated in constant time: they depend only on the input bits
- Using the definitions of P and G, the carry bits can be calculated in constant time as well:

$$
\begin{aligned}
c_{i+1} &= G_i + P_i c_i \\
        &= G_i + P_i (G_{i-1} + P_{i-1} c_{i-1}) \\
        &= G_i + P_i G_{i-1} + P_i P_{i-1} (G_{i-2} + P_{i-2} c_{i-2}) \\
        &= G_i + P_i G_{i-1} + P_i P_{i-1} G_{i-2} + P_i P_{i-1} P_{i-2} G_{i-3} + \cdots + P_i P_{i-1} \cdots P_0 c_0
\end{aligned}
$$

- Looking at how a carry is calculated, we can interpret it as: carry bit i+1 is set if (1) a carry is generated at bit i, OR (2) a carry is generated at any of the previous positions AND can be propagated all the way to position i
- How long does it take to calculate the carry? Constant delay! Caveat?
[Figure: 4-bit CLA adder. Each bit produces P_i and G_i and sum S_i; a shared Carry Lookahead Logic block takes P3..P0, G3..G0 and C0 and produces the carries C1..C4]
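To make the difference between the two adders concrete, here is a small Python sketch that models both at the bit level. It assumes bit lists stored LSB first, and it takes propagate as $P_i = A_i \oplus B_i$. The helper names (`to_bits`, `from_bits`, `ripple_carry_add`, `cla_carries`, `cla_add`) are illustrative only. In the ripple-carry model each carry is computed from the previous carry. In the CLA model each carry is computed only from the G and P signals and $c_0$, which in hardware corresponds to a flat two-level expression evaluated for all bits in parallel.

```python
def to_bits(x: int, n: int) -> list[int]:
    """n-bit unsigned value as a list of bits, LSB first."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits: list[int]) -> int:
    """Inverse of to_bits."""
    return sum(b << i for i, b in enumerate(bits))


def ripple_carry_add(a_bits, b_bits, c0=0):
    """Chain of full adders: carry i+1 depends on carry i."""
    carry, sums = c0, []
    for a, b in zip(a_bits, b_bits):
        sums.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return sums, carry


def cla_carries(g, p, c0):
    """Return [c_0, ..., c_n], each from the flattened sum-of-products."""
    carries = [c0]
    for i in range(len(g)):
        # c_{i+1} = G_i + P_i G_{i-1} + ... + (P_i ... P_0) c_0
        c_next, prefix = 0, 1      # prefix = P_i P_{i-1} ... P_{j+1}
        for j in range(i, -1, -1):
            c_next |= prefix & g[j]
            prefix &= p[j]
        c_next |= prefix & c0
        carries.append(c_next)
    return carries


def cla_add(a_bits, b_bits, c0=0):
    """Carry look-ahead addition using generate and propagate signals."""
    g = [a & b for a, b in zip(a_bits, b_bits)]   # generate
    p = [a ^ b for a, b in zip(a_bits, b_bits)]   # propagate
    c = cla_carries(g, p, c0)
    sums = [p[i] ^ c[i] for i in range(len(p))]   # S_i = P_i xor c_i
    return sums, c[-1]


if __name__ == "__main__":
    a, b, n = 11, 6, 4
    s, c_out = cla_add(to_bits(a, n), to_bits(b, n))
    print(from_bits(s), c_out)   # 11 + 6 = 17 -> sum bits 1, carry-out 1
```

The product terms in `cla_carries` grow with the bit position, which hints at why a flat look-ahead becomes expensive for wide adders.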
Adder on FPGAs
- Implement ripple-carry/CLA using the logic fabric directly: LUT, FF, etc.
- Built-in adder
- Other adder architectures
  - FPGA-specific ones?
  - Bit-serial?

Fast Adder on FPGA
- How do we build a fast adder using this?
[Figure: FPGA logic element containing a LUT and an FF]

Fast Adder on FPGA
- $S = A \oplus B \oplus C_{in}$
- $C_{out} = AB + C_{in}(A \oplus B)$
[Figure: logic element with Fast Carry Logic]

Adder performance on FPGA
Which of the following is fastest on an FPGA?
- 16-bit ripple-carry adder implemented using LUTs
- 16-bit carry-lookahead adder implemented using LUTs
- 16-bit adder using fast carry logic
- 32-bit ripple-carry adder implemented using LUTs
- 32-bit carry-lookahead adder implemented using LUTs
- 32-bit adder using fast carry logic

Subtractor
- Subtracting two numbers in 2's complement is relatively easy
- To calculate A - B:
  1. Find -B from B: negate all bits in B, then add 1
  2. Add A and -B
- Can reuse the adder developed earlier
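The subtraction recipe can be sketched in Python as follows. This is an illustrative model for n-bit 2's complement values, not a circuit description. The names `add_n`, `subtract_n` and `to_signed` are hypothetical, and feeding the "+1" in through the adder's carry-in is a common design choice assumed here rather than something the slide states.

```python
def add_n(a: int, b: int, c_in: int, n: int) -> tuple[int, int]:
    """n-bit adder model. Returns (n-bit sum, carry_out)."""
    total = a + b + c_in
    return total & ((1 << n) - 1), (total >> n) & 1


def subtract_n(a: int, b: int, n: int) -> int:
    """Compute A - B as A + (~B) + 1, reusing the adder."""
    mask = (1 << n) - 1
    not_b = ~b & mask                     # negate all bits in B
    diff, _ = add_n(a, not_b, 1, n)       # carry-in of 1 supplies the "+1"
    return diff


def to_signed(x: int, n: int) -> int:
    """Interpret an n-bit pattern as a 2's complement value."""
    return x - (1 << n) if x & (1 << (n - 1)) else x


if __name__ == "__main__":
    print(subtract_n(5, 3, 4))                 # 2
    print(to_signed(subtract_n(3, 5, 4), 4))   # -2
```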
Subtractor
[Figure: 4-bit adder/subtractor with inputs A3..A0 and B3..B0, a Subtract control, and outputs S3..S0]

Multiplication
[Figure: worked binary long-multiplication example]

Multiplication
- Multiplication is a form of repeated addition
- Multiplying two n-bit numbers can be achieved by adding n partial results
- Produces a result of 2n bits

Multiplier - Iterative
- Start from the basic definition of multiplication: shift and conditionally add
- Requires n cycles
[Figure: iterative multiplier with inputs A and B, shift registers, CLK, and output S]

Multiplier - Parallel
- Use n adders to perform all partial-sum additions in parallel
- Requires 1 cycle, but a long cycle

Simple Parallel Multiplier
- Critical path scales with n
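The iterative shift-and-conditional-add multiplier can be sketched in Python as below. This is a behavioural model only, assuming unsigned n-bit operands, and each loop iteration stands for one clock cycle. The function name `shift_add_multiply` is made up for illustration.

```python
def shift_add_multiply(a: int, b: int, n: int) -> int:
    """Multiply two unsigned n-bit numbers in n shift-and-add steps."""
    acc = 0               # accumulates the partial results
    multiplicand = a      # shifted left by one each step
    multiplier = b        # shifted right by one each step
    for _ in range(n):    # one iteration per clock cycle
        if multiplier & 1:           # conditional add
            acc += multiplicand
        multiplicand <<= 1
        multiplier >>= 1
    return acc            # fits in 2n bits


if __name__ == "__main__":
    print(shift_add_multiply(13, 11, 4))   # 143
```

A fully parallel multiplier performs the same n conditional additions, but unrolled in space rather than spread over n cycles.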
Multiplier - Carry Save Adders
- Carry save adder tree
- Critical path scales with n
- Fast adder at the end

Fast Multiplier on FPGA
- Reuse the adders' carry logic for partial result calculation
- Source: xapp15

Dedicated DSP Block in V6

Constant Multiplication
- If one of the inputs to a multiplier is constant, the circuit can be simplified
- If one of the inputs is a power of 2, then multiplication becomes a shift
  - $A \times 2^n$ is equivalent to A << n
- What if the constant is not a power of 2?
  - Number decomposition

Constant Multiplier Decomposition
- When multiplying by a constant in fixed point, recall that the value represented by an n-bit 2's complement string $b_{n-1} \ldots b_0$ with k fractional bits is:

$$
-2^{n-k} b_{n-1} + \sum_{i=0}^{n-1} 2^{i-k} b_i
$$

- Therefore, ALL representable fixed-point numbers can be represented as a sum of powers of 2
- Can decompose the constant multiplier into multiple shifts

Decomposition
- B = kA
[Figure: A is shifted by << n-1, << n-2, ..., << 1, << 0; the shifted copies selected by the bits of k are summed to give B]
- Compared to a standard multiplier, all 0 terms are eliminated
- Can we do better?
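As a rough illustration of the decomposition, the Python sketch below multiplies by a constant using only shifts and adds, one shifted copy of A per nonzero bit of the constant. It assumes a non-negative integer constant for simplicity, and the names `shifts_for_constant` and `multiply_by_constant` are hypothetical.

```python
def shifts_for_constant(k: int) -> list[int]:
    """Positions of the 1 bits in k, i.e. the shift amounts needed."""
    return [i for i in range(k.bit_length()) if (k >> i) & 1]


def multiply_by_constant(a: int, k: int) -> int:
    """Compute k * a as a sum of shifted copies of a."""
    return sum(a << s for s in shifts_for_constant(k))


if __name__ == "__main__":
    # 10 = 0b1010, so 10 * a = (a << 3) + (a << 1)
    print(shifts_for_constant(10))        # [1, 3]
    print(multiply_by_constant(7, 10))    # 70
```

The number of adders needed is one less than the number of 1 bits in the constant, which is why a representation with fewer nonzero digits is attractive.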
Canonic Signed Digit
- Signed digit (SD) representation:
  - Similar to binary representation, except the set {-1, 0, 1} is used for the digits
  - Representation is not unique
  - E.g. in a 4-digit SD number representation, writing $\bar{1}$ for -1:
    $3 = 0011 = 010\bar{1} = 01\bar{1}1 = 1\bar{1}0\bar{1} = 1\bar{1}\bar{1}1$
- Canonic representation has the minimum number of nonzero digits
  - Not unique

Canonic Signed Digit
- Use CSD to minimize the number of nonzero digits
- E.g. $125 = 01111101 = 10000\bar{1}01$
[Figure: constant multiplier B = 125 A built from binary shifts, compared with the CSD version B = (A << 7) - (A << 2) + A]

Division
- Division is substantially more complicated than multiplication
- 2 main methods:
  - Bit-by-bit calculation: calculate each bit, similar to manual division
  - Mathematical approximation: start with an approximation and iteratively refine the solution until the desired precision is reached
- Use as few as possible!

Signal Flow Graph Manipulations

FIR as an Example
[Figure: FIR filter block diagram with coefficients h0, h1, h2]

Signal Flow Graph
- Simplify the block diagram with more efficient notation:
  - An edge labelled k = multiply by k
  - An edge labelled $z^{-1}$ = delay for 1 sample (clock cycle) = FF
[Figure: signal flow graph of the FIR filter with coefficients h0, h1, h2]
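One common way to obtain a signed-digit form with few nonzero digits is the non-adjacent form, in which no two neighbouring digits are both nonzero. The Python sketch below computes it for a positive integer constant and uses it to multiply with shifts, adds and subtracts. Treating the non-adjacent form as "the" CSD is an assumption made here, and the names `to_csd` and `csd_multiply` are illustrative.

```python
def to_csd(k: int) -> list[int]:
    """Signed digits of k in {-1, 0, 1}, LSB first, no two adjacent nonzero."""
    digits = []
    while k != 0:
        if k & 1:
            d = 2 - (k % 4)   # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d            # makes k divisible by 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def csd_multiply(a: int, k: int) -> int:
    """Compute k * a by adding or subtracting shifted copies of a."""
    return sum(d * (a << i) for i, d in enumerate(to_csd(k)))


if __name__ == "__main__":
    # 125 = 128 - 4 + 1: three nonzero digits instead of six in plain binary
    print(to_csd(125))            # [1, 0, -1, 0, 0, 0, 0, 1]
    print(csd_multiply(9, 125))   # 1125
```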
Dataflow system
- Remember: in most digital signal processing systems with a continuous stream of data input, the overall latency usually doesn't matter
- Therefore, it is OK to put extra delay at the I/O without changing the function of the design
[Figure: FIR signal flow graph with taps h0, h1, h2 and extra delays (e.g. $z^{-5}$) inserted at the input and output]

Nodal Delay Transfer
[Figure: nodal delay transfer primitives, panels (a) to (e), showing delays $z^{-1}$ moved across a node and balanced by advances $z^{+1}$]
- But why?

Nodal Delay Transfer
- Remember, $z^{+1}$ is non-causal
  - Not implementable in hardware
- Must eliminate any $z^{+1}$ in the final graph before going to hardware implementation:
  - Pushing delay within the graph
  - Inserting delay at I/O
  - Reorganizing the graph

Cutset
- Separates the SFG into two disjoint graphs
- Example:
[Figure: FIR signal flow graph with taps h0, h1, h2 before and after drawing a cutset]

Cutset Retiming
- Generalization of the nodal delay transfer primitives
- Delay can be added to all incoming edges of a cutset if advances are added to all outgoing edges, and vice versa

Cutset Retiming
[Figure: FIR signal flow graph with taps h0, h1, h2 before and after cutset retiming]
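A quick way to convince yourself that retiming preserves function is to simulate a filter before and after. The Python sketch below compares a direct-form FIR, with its delays on the input line, against a version in which each delay has been moved across a cutset onto the adder chain (often called the transposed or data-broadcast form). Whether this is the exact structure drawn on the slides is an assumption, and the function names are illustrative.

```python
def fir_direct(x: list[int], h: list[int]) -> list[int]:
    """Direct form: delay line on the input, all taps summed in one cycle."""
    delay = [0] * (len(h) - 1)          # holds x[n-1], x[n-2], ...
    y = []
    for sample in x:
        taps = [sample] + delay
        y.append(sum(c * t for c, t in zip(h, taps)))
        delay = taps[:-1]               # shift the delay line by one sample
    return y


def fir_retimed(x: list[int], h: list[int]) -> list[int]:
    """Retimed form: input is broadcast, registers sit between the adders."""
    regs = [0] * (len(h) - 1)           # one register per adder-chain edge
    y = []
    for sample in x:
        products = [c * sample for c in h]
        y.append(products[0] + (regs[0] if regs else 0))
        # Each register captures its own product plus the next register.
        regs = [
            products[i + 1] + (regs[i + 1] if i + 1 < len(regs) else 0)
            for i in range(len(regs))
        ]
    return y


if __name__ == "__main__":
    x, h = [1, 2, 3, 4, 5], [1, 2, 3]
    print(fir_direct(x, h))    # [1, 4, 10, 16, 22]
    print(fir_retimed(x, h))   # [1, 4, 10, 16, 22]
```

Both versions use the same number of registers for a 3-tap filter. The difference is the critical path: in the retimed form each register-to-register path contains one multiplier and one adder, instead of a multiplier followed by the whole adder chain.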
Use of retiming
- Reduce critical path
- Pipelining
- Decrease number of registers
- Reduce power
- Reduce clock rate

In Summary
- Reviewed basic computer arithmetic
- Add/sub is the easiest to implement
  - Highly optimized in FPGAs
- Multipliers are more complex
  - VLSI has many optimized multipliers
  - FPGA designs may use the fast carry logic
  - Dedicated multiplier / DSP blocks
- Dividers are very complex
  - Use IP cores
- Signal flow graphs and retiming help to lay out signal processing systems