An FPGA Logic Cell and Carry Chain Configurable as a 6:2 or 7:2 Compressor


HADI PARANDEH-AFSHAR, PHILIP BRISK, and PAOLO IENNE
École Polytechnique Fédérale de Lausanne (EPFL)

To improve FPGA performance for arithmetic circuits that are dominated by multi-input addition operations, an FPGA logic block is proposed that can be configured as a 6:2 or 7:2 compressor. Compressors have been used successfully in the past to realize parallel multipliers in VLSI technology; however, the peculiar structure of FPGA logic blocks, coupled with the high cost of the routing network relative to ASIC technology, renders compressors ineffective when mapped onto the general logic of an FPGA. On the other hand, current FPGA logic cells have already been enhanced with carry chains to improve arithmetic functionality, for example, to realize fast ternary carry-propagate addition. The contribution of this article is a new FPGA logic cell that is specialized to help realize efficient compressor trees on FPGAs. The new FPGA logic cell has two variants that can respectively be configured as a 6:2 or a 7:2 compressor using additional carry chains that, coupled with lookup tables, provide the necessary functionality. Experiments show that the use of these modified logic cells significantly reduces the delay of compressor trees synthesized on FPGAs compared to state-of-the-art synthesis techniques, with a moderate increase in area and power consumption.

Categories and Subject Descriptors: B.7.1 [Integrated Circuits]: Types and Design Styles - Gate Arrays; G.1.0 [Numerical Analysis]: General - Computer Arithmetic

General Terms: Algorithms, Performance

Additional Key Words and Phrases: FPGA, carry chain, compressor tree, 6:2 compressor, 7:2 compressor

ACM Reference Format: Parandeh-Afshar, H., Brisk, P., and Ienne, P. 2009. An FPGA logic cell and carry chain configurable as a 6:2 or 7:2 compressor. ACM Trans. Reconfig. Techn. Syst. 2, 3, Article 19 (September 2009), 42 pages. DOI = /

P. Brisk is currently affiliated with the Department of Computer Science and Engineering in the Bourns College of Engineering at the University of California, Riverside.

Authors' addresses: H. Parandeh-Afshar, hadi.parandehafshar@epfl.ch; P. Brisk, Philip.brisk@gmail.com; P. Ienne, paolo.ienne@epfl.ch.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY USA, fax +1 (212) , or permissions@acm.org. © 2009 ACM /2009/09-ART19 $10.00 DOI: /

1. INTRODUCTION

Due to their inherent reconfigurability, FPGAs are one feasible hardware platform for low-volume markets, where vendors cannot justify the design, testing, and verification costs of an ASIC. Although an FPGA implementation of a circuit will typically outperform a traditional software implementation, a noticeable performance gap between FPGAs and ASICs remains [Kuon and Rose 2007]. One important area that is ripe for improvement is arithmetic-dominated circuits; in particular, due to the peculiar logic cell structure and carry chains in modern FPGAs, addition- and multiplication-dominated circuits cannot take advantage of the carry-save representation.

One of the fundamental results in computer arithmetic is that addition scales well when the number of inputs increases beyond 2; this was first observed by Wallace [1964] in the context of parallel multiplier design. The key is not to use trees of traditional carry-propagate adders, that is, circuits that produce the sum of two (signed) binary integers; instead, the integers are aggregated together using a circuit called a compressor tree. Numerous methods for compressor tree generation have been published since their introduction in the early 1960s [Wallace 1964; Dadda 1965; Swartzlander 1973; Stenzel et al. 1977; Weinberger 1981; Santoro and Horowitz 1988; Song and De Micheli 1991; Fadavi-Ardekani 1993; Oklobdzija and Villeger 1995; Stelling and Oklobdzija 1996; Stelling et al. 1998; Kwon et al. 2002; Um and Kim 2002; Mora Mora et al. 2006; Verma and Ienne 2007a], mostly in the context of parallel multiplication; more generally, these circuits can also sum k > 2 integers.

The architecture of modern FPGAs is generally not well suited to compressor trees. The logic clusters of the Altera Stratix II-IV and Xilinx Virtex-5 FPGAs can be configured to implement ternary (3-input) addition using fast carry chains [Cherepacha and Lewis 1996; Hauck et al. 2000; Frederick and Somani 2006]. The primary advantage of the carry chains is that the carry bits are propagated directly from one cell to its adjacent neighbor, thereby avoiding the overhead of the routing network. This design point favors the use of ternary adder trees rather than compressor trees.

Parandeh-Afshar et al. [2008b, 2008c] showed that compressor trees can be synthesized on FPGAs using a circuit called a Generalized Parallel Counter (GPC) [Stenzel et al. 1977]. This GPC mapping approach yields compressor trees whose delay is significantly lower than that of ternary adder trees, despite the latter's use of the carry chains; however, there is a noticeable increase in the number of logic cells required.

This article, an extension of prior work by Parandeh-Afshar et al. [2008a], introduces and evaluates a new logic cell, based on the Altera Adaptive Logic Module (ALM), that has an additional carry chain, which allows it to be configured as a 6:2 or 7:2 compressor; this compressor belongs to a well-known class of circuits that have been used for successful synthesis of ASIC multipliers in the past [Weinberger 1981; Song and De Micheli 1991; Oklobdzija and Villeger 1995]. By combining the strengths of GPC mapping with the use of 6:2 or 7:2 compressors, when possible, faster compressor trees can be realized on the FPGA. Additionally, we compare the power consumption of compressor trees mapped onto the proposed logic cells with compressor trees synthesized using ternary adder trees and GPC mapping.

The article is organized as follows. Section 2 begins by introducing a collection of arithmetic primitives (counters, compressors, compressor trees) that are required to understand the remaining sections of the paper. Section 3 summarizes related work in the field of FPGA architecture and mapping, focusing specifically on features designed for enhanced arithmetic performance. Section 4 presents the new logic cell, and Section 5 describes the approach that we used to map circuits onto FPGAs containing the new cell. Our experimental platform, methodology, and results are presented in Sections 6-8. Section 9 concludes the article.

2. ARITHMETIC AND FPGA PRIMITIVES

2.1 Full and Half Adders

At the bit level, a half-adder (HA) is a 2-input, 2-output circuit that computes the sum of two bits and outputs the result as an unsigned binary integer. A full-adder (FA) computes a similar sum for 3 input bits. The lower-order output bit is called a sum, and the higher-order output bit is called a carry. In the case of an FA, one of the inputs is called a carry-in bit and the high-order output is called a carry-out. Many arithmetic circuits, including adders and multipliers, are composed primarily of HAs and FAs.

2.2 Ripple-Carry and Carry-Save Adders

A Carry Propagate Adder (CPA) is a circuit that adds two binary integers; if the integers are signed, two's complement form is assumed. Numerous architectures for carry-propagate adders have been proposed in the past. In modern CMOS technologies, significant differences in critical path delay among the different adder architectures generally do not manifest themselves for small bitwidths, that is, 8 bits or less.

Fig. 1. (a) A ripple-carry adder; (b) a carry-save adder.

The most straightforward CPA architecture is the Ripple-Carry Adder (RCA), which generally has the smallest area but highest delay compared to the alternatives. Figure 1(a) shows a 4-bit RCA constructed from FA cells; the carry-in of the least significant FA is 0, so an HA can be used instead of an FA. As shown in Figure 1(a), an RCA is a 1-dimensional array of FAs, where the carry-out of each FA is connected directly to the carry-in of the next; thus, the worst-case critical path delay is through all of the FAs in the design. If an RCA adds two k-bit numbers, the complexity of the critical path delay is O(k). Many faster, but larger, alternative adders have been designed, most with a critical path delay of O(log k).
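As an illustration of the primitives in Sections 2.1 and 2.2, the following is a minimal bit-level sketch in Python of a half adder, a full adder, and a k-bit ripple-carry adder. The function names are hypothetical and chosen only for this example; the code models behavior, not the gate-level structure of Figure 1(a).

```python
def half_adder(a, b):
    """2-input, 2-output cell: return (sum, carry) for two input bits."""
    return a ^ b, a & b


def full_adder(a, b, cin):
    """3-input, 2-output cell: return (sum, carry_out) for three input bits."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def ripple_carry_add(x, y, k):
    """Add two k-bit unsigned integers with a chain of k full adders.

    The carry produced at position i feeds position i + 1, so the
    loop-carried dependency mirrors the O(k) critical path of an RCA.
    """
    carry = 0  # carry-in of the least significant cell is 0
    result = 0
    for i in range(k):
        a = (x >> i) & 1
        b = (y >> i) & 1
        s, carry = full_adder(a, b, carry)
        result |= s << i
    return result | (carry << k)  # final carry-out becomes bit k
```

For example, `ripple_carry_add(11, 6, 4)` models the 4-bit adder of Figure 1(a) and yields the 5-bit result 17.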

A Carry-Save Adder (CSA), shown in Figure 1(b), breaks the carry chain; in fact, it is a 1-dimensional array of disconnected FAs. CSAs are generally used in conjunction with CPAs in order to perform efficient n-input addition for n > 2.

2.3 Adder and Compressor Trees

Suppose that we want to compute the sum of n > 2 binary integers. One approach is to use an Adder Tree, that is, a tree of CPAs; the alternative is to build a tree of carry-save adders instead, only using a CPA at the end. Figure 2 shows an example where three four-bit binary integers are added. In Figure 2(a), two RCAs are used; in Figure 2(b), a CSA is followed by an RCA.

Fig. 2. Two implementations of a 4-bit ternary adder using (a) an adder tree, i.e., two RCAs; and (b) a compressor tree, i.e., a CSA followed by an RCA. The compressor tree implementation eliminates the delay of one half adder (HA) from the critical path.

Let $d_{FA}$ and $d_{HA}$ be the respective delays of full and half adders. The critical path delay of the circuit in Figure 2(a) is $4d_{FA} + 2d_{HA}$, while the critical path delay of the circuit in Figure 2(b) is $3d_{FA} + 2d_{HA}$, an overall savings of $d_{FA}$ compared to Figure 2(a). This savings occurs because the use of the CSA instead of the RCA permits the elimination of one bit from the RCA in Figure 2(b).

The idea of using carry-save addition for fast accumulation dates back to the work of Wallace [1964] and Dadda [1965], who designed fast parallel multipliers; however, the fundamental ideas generalize quite elegantly to multi-input addition as well. Formally, let $A_1, A_2, \ldots, A_n$ be a set of binary integers to sum. A Compressor Tree is a circuit that produces two values, sum (S) and carry (C), such that:

$$S + C = \sum_{i=1}^{n} A_i. \qquad (1)$$

A CPA then performs the final addition, S + C.
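The relation in Eq. (1) can be checked with a small behavioral sketch. The Python code below is an assumption-laden illustration, not a synthesis method from the article: it reduces a list of non-negative integers with layers of word-level carry-save adders, grouping operands three at a time in list order (an arbitrary choice made only for this example). The names `carry_save_add` and `compressor_tree` are hypothetical.

```python
def carry_save_add(a, b, c):
    """One CSA: three integers in, two out, with a + b + c == s + cy.

    Each bit position is an independent full adder; no carry propagates
    sideways inside this function.
    """
    s = a ^ b ^ c                             # full-adder sum bits, same rank
    cy = ((a & b) | (a & c) | (b & c)) << 1   # full-adder carry bits, rank + 1
    return s, cy


def compressor_tree(operands):
    """Reduce non-negative integers to (S, C) such that S + C == sum(operands)."""
    values = list(operands)
    while len(values) > 2:
        reduced = []
        # Compress complete groups of three operands into two.
        for i in range(0, len(values) - 2, 3):
            s, cy = carry_save_add(values[i], values[i + 1], values[i + 2])
            reduced += [s, cy]
        # Operands left over (zero, one, or two) pass through to the next layer.
        reduced += values[len(values) - len(values) % 3:]
        values = reduced
    while len(values) < 2:
        values.append(0)
    return values[0], values[1]


S, C = compressor_tree([5, 9, 3, 12, 7])
total = S + C  # the single final carry-propagate addition
```

Only the last line performs a carry-propagate addition; every earlier step is carry-free, which is the property that Eq. (1) captures.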

Wallace and Dadda trees are two specific compressor tree architectures; many others have also been proposed [Swartzlander 1973; Stenzel et al. 1977; Weinberger 1981; Santoro and Horowitz 1988; Song and De Micheli 1991; Fadavi-Ardekani 1993; Stelling and Oklobdzija 1996; Stelling et al. 1998; Kwon et al. 2002; Um and Kim 2002; Mora Mora et al. 2006; Verma and Ienne 2007a].

The superiority of compressor trees over adder trees is one of the most fundamental results of digital arithmetic. Intuitively, it may seem that this is because an adder tree pays the penalty of a carry chain at each level; this is, however, a fallacy, as illustrated by Figure 2 in the preceding discussion. In actuality, the benefit of compressor trees arises from their ability to reduce the bitwidth of the final CPA in the case of multi-input addition.

Parallel multipliers in ASIC technology, however, are more complicated. In multi-input addition, the number of bits to sum at each position is the same. This is not true in the case of parallel multiplication: after a partial product generation or Booth encoding stage, the number of bits to sum tends to be greater among the bit positions in the middle. As illustrated conceptually by Figure 3, the lower-order bits of the final CPA are generally not on the critical path, as the bits that arrive at these positions go through fewer layers of logic within the compressor tree. In other words, the arrival time of the bits at the final CPA is nonuniform, unlike the case of multi-input addition.

Fig. 3. Illustration of the critical path delay through a compressor tree of a multiplier, including that of the final CPA. The critical path typically includes the j most significant bits of the final CPA; the portion of the final CPA that computes the m − j least significant bits can be optimized for area rather than for speed, as long as it does not become critical.

Based on this observation, Oklobdzija and Villeger [1995] argued that the final CPA of a multiplier should be implemented as a hybrid adder, which uses a small and slow CPA, such as an RCA, for the low-order bits, and a faster adder, such as a carry-select adder, for the higher-order bits. Carry-select adders are particularly useful when the arrival time of bits is nonuniform, because they can start to add the bits as soon as they arrive. RCAs, in contrast, cannot, as the output bit at position i depends on the carry-out bit computed at position i − 1. That being said, carry-select adders can be constructed from smaller-bitwidth RCAs as building blocks.

The work summarized in this section targets ASIC design methodologies; FPGAs, in contrast, possess fast carry chains, whose usage often dictates the types of adders that perform well on specific device families.
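To make the remark about building carry-select adders from smaller RCAs concrete, here is a hedged Python sketch. The block width of 4 and the function names are assumptions made for illustration; they are not taken from Oklobdzija and Villeger [1995] or from this article. Each block speculatively computes its result for both possible carry-in values, and the actual incoming carry only selects between them.

```python
def rca(x, y, width, cin):
    """Ripple-carry add two width-bit values; return (sum, carry_out)."""
    carry, total = cin, 0
    for i in range(width):
        a, b = (x >> i) & 1, (y >> i) & 1
        total |= (a ^ b ^ carry) << i
        carry = (a & b) | (carry & (a ^ b))
    return total, carry


def carry_select_add(x, y, width, block=4):
    """Carry-select addition assembled from RCA blocks.

    `block` (bits per RCA block) is an illustrative parameter.
    """
    result, carry = 0, 0
    for lo in range(0, width, block):
        w = min(block, width - lo)        # the last block may be narrower
        mask = (1 << w) - 1
        xa, ya = (x >> lo) & mask, (y >> lo) & mask
        s0, c0 = rca(xa, ya, w, 0)        # speculative result for carry-in 0
        s1, c1 = rca(xa, ya, w, 1)        # speculative result for carry-in 1
        # The incoming carry acts only as a multiplexer select signal.
        s, carry = (s1, c1) if carry else (s0, c0)
        result |= s << lo
    return result | (carry << width)
```

In hardware the two speculative sums of every block are computed in parallel, so the carry crosses each block through a multiplexer rather than through a chain of full adders; the sequential loop here reproduces only the arithmetic, not the timing.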

2.4 Parallel Counters

An m:n parallel counter (or single-column counter) is a circuit that takes m input bits, counts the number of input bits that are set to one, and outputs the value as an n-bit binary unsigned integer. The output range is [0, m], so the number of output bits is:

$$n = \lceil \log_2 (m + 1) \rceil. \qquad (2)$$

In the context of compressor trees, HAs and FAs are 2:2 and 3:2 counters respectively. Verma and Ienne [2007a], for example, described an integer linear programming formulation for compressor tree design that uses a library of m:n counters, for $2 \le m \le 8$.

Let $B = b_{k-1} b_{k-2} \ldots b_0$ be a k-bit unsigned binary integer, where $b_{k-1}$ is the most significant bit, and $b_0$ is the least significant bit. Each bit $b_r$ contributes a total value of $b_r 2^r$ to the total value of B, i.e., $b_r$ contributes $2^r$ if it is set, and 0 otherwise. In this context, r is called the rank of $b_r$. When an m:n counter is used to synthesize a compressor tree, all of its inputs have the same rank.

A Generalized Parallel Counter (GPC) is an extension of an m:n counter that can sum bits of multiple ranks [Stenzel et al. 1977]. For example, a (2, 3; 3) GPC can sum up to 2 bits of rank 1 and 3 bits of rank 0; the maximum output value is $2 \cdot 2^1 + 3 \cdot 2^0 = 7$, so 3 output bits are required. The general form of a GPC is $(k_{t-1}, k_{t-2}, \ldots, k_0; s)$, where $k_r$ is the maximum number of bits of rank r that can be summed, and s is the number of output bits. Similar to an m:n counter, a GPC must satisfy the following property:

$$s = \left\lceil \log_2 \left( 1 + \sum_{r=0}^{t-1} k_r 2^r \right) \right\rceil. \qquad (3)$$

In fact, a sufficiently large m:n counter can implement a GPC (although many other implementations also exist). Each GPC input bit of rank r is connected to $2^r$ inputs of the m:n counter; any unused input bits of the m:n counter are then driven to 0. GPCs map efficiently onto FPGAs [Parandeh-Afshar et al. 2008b, 2008c].
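Equations (2) and (3) and the counter-based GPC construction can be exercised with a short behavioral sketch in Python. All function names are hypothetical. The tuples follow the GPC notation, most significant rank first, and the code relies on the identity that $\lceil \log_2(v + 1) \rceil$ equals the bit length of v for any non-negative integer v.

```python
def counter_outputs(m):
    """Output bits n of an m:n counter, Eq. (2): ceil(log2(m + 1))."""
    return m.bit_length()


def gpc_outputs(k):
    """Output bits s of a GPC (k_{t-1}, ..., k_0; s), Eq. (3)."""
    max_value = sum(k_r << r for r, k_r in enumerate(reversed(k)))
    return max_value.bit_length()


def gpc_eval(k, inputs):
    """Weighted sum of the input bits; `inputs` holds one list of bits per
    rank, in the same most-significant-rank-first order as `k`."""
    total = 0
    for r, (limit, bits) in enumerate(zip(reversed(k), reversed(inputs))):
        assert len(bits) <= limit, "too many bits for this rank"
        total += sum(bits) << r
    return total


def gpc_via_counter(inputs):
    """The same value from a plain counter: every rank-r bit is wired to
    2**r counter inputs, and the counter counts the ones."""
    counter_inputs = []
    for r, bits in enumerate(reversed(inputs)):
        for b in bits:
            counter_inputs += [b] * (1 << r)
    return sum(counter_inputs)


# The (2, 3; 3) GPC from the text: two rank-1 bits and three rank-0 bits.
s = gpc_outputs((2, 3))
value = gpc_eval((2, 3), ([1, 1], [1, 1, 1]))
```

With all five inputs set, `value` is 7 and `s` is 3, matching the example in the text; `gpc_via_counter` produces the same value using seven counter inputs.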
If the FPGA has k-input LUTs, then k-input GPCs can be mapped onto the LUTs (one LUT is used per GPC output bit) using one logic level.

2.5 Compressors

Compressors (not to be confused with compressor trees) are arithmetic components, similar in principle to parallel counters, but with two distinct differences: (1) they have explicit carry-in and carry-out bits; and (2) there may be some redundancy among the ranks of the sum and carry-output bits.

The 4:2 compressor (also called a 4:2 CSA), illustrated in Figure 4, was introduced by Weinberger [1981]. At first sight, this name may appear to be somewhat of a misnomer: although it has 4 input bits and produces 2 sum output bits ($out_0$ and $out_1$), it also has a carry-in ($c_{in}$) and a carry-out ($c_{out}$) bit (thus, the total numbers of input and output bits are 5 and 3); however, it is not the same circuit as a 5:3 compressor. All input bits, including $c_{in}$, have rank 0; the two output bits have ranks 0 and 1 respectively, while $c_{out}$ has rank 1 as well. Thus, the output of the 4:2 compressor is a redundant number; for example, $out_1 = 0$ and $c_{out} = 1$ is equivalent to $out_1 = 1$ and $c_{out} = 0$ in all cases. When k 4:2 compressors are connected in a carry chain, a total of 4k input bits are compressed down to 2k output bits plus one additional carry-out bit; the carry-in bit of the first compressor is set to 0. The primary difference between compressors and counters is the presence of carry bits in the former; it is also important to recognize that a compressor tree can be constructed from compressors, counters, or both.

Fig. 4. (a) 4:2 compressor I/O diagram; (b) 4:2 compressor architecture; (c) 4-ary adder built from an array of 4:2 compressors followed by an RCA; (d) illustration of the interconnect between consecutive 4:2 compressors: although the array has the appearance of an RCA in Figure 4(c), the carry chain only goes through two compressors.

Figure 4(a) shows the inputs and outputs of the 4:2 compressor labeled with their ranks; Figure 4(b) shows one 4:2 compressor architecture, which is constructed using two 3:2 counters. Figure 4(c) shows a 4-bit 4-input adder, consisting of four 4:2 compressors in a 1-dimensional array followed by a four-bit RCA. At first glance, the array of 4:2 compressors appears to have the same structure as an RCA, as the $c_{out}$ bit of each 4:2 compressor is connected to the $c_{in}$ bit of the subsequent one; however, this is not actually the case, as shown in Figure 4(d); the fact that there is no direct path from a carry-in to a carry-out prevents the formation of a ripple-carry structure.

The new FPGA logic cell described in this paper has two variants that can respectively be configured as a 6:2 or a 7:2 compressor, which generalize the 4:2 compressor cell whose use is shown in Figure 4. Figure 5(a) and (b) show the basic I/O structure of the 6:2 and 7:2 compressors. Figure 5(c) and (d) show the circuit-level architecture; the only difference is that a 2:2 counter in the 6:2 compressor is upgraded to a 3:2 counter in the 7:2 compressor, and the 7th input is connected to one of the inputs of the aforementioned 3:2 counter.

Fig. 5. (a)/(b) 6:2/7:2 compressor I/O diagram; (c)/(d) 6:2/7:2 compressor architecture; (e) illustration of the interconnection pattern between consecutive 6:2 compressors (it is the same for 7:2 compressors).
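The 4:2 compressor discussed earlier in this section can be sketched behaviorally in Python. The sketch assumes the common two-full-adder construction (the text states only that Figure 4(b) uses two 3:2 counters, so the exact wiring here is an assumption), and the function names are hypothetical. It shows that the carry-out is computed without the carry-in, and that a chain of k compressors preserves the sum of four k-bit operands.

```python
def full_adder(a, b, c):
    """3:2 counter: return (sum, carry) for three input bits."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4_2(x0, x1, x2, x3, cin):
    """4:2 compressor from two 3:2 counters.

    Satisfies x0 + x1 + x2 + x3 + cin == out0 + 2 * (out1 + cout).
    cout is produced by the first counter, which never sees cin.
    """
    s, cout = full_adder(x0, x1, x2)
    out0, out1 = full_adder(s, x3, cin)
    return out0, out1, cout


def add_four(a, b, c, d, k):
    """Sum four k-bit unsigned integers with a chain of k 4:2 compressors."""
    sum_word, carry_word, cin = 0, 0, 0   # carry-in of the first compressor is 0
    for i in range(k):
        bits = [(v >> i) & 1 for v in (a, b, c, d)]
        out0, out1, cin = compressor_4_2(*bits, cin)
        sum_word |= out0 << i             # rank i
        carry_word |= out1 << (i + 1)     # rank i + 1
    # 2k output bits plus one extra carry-out of rank k; the additions
    # below stand in for the final carry-propagate adder.
    return sum_word + carry_word + (cin << k)
```

Because `cout` depends only on `x0`, `x1`, and `x2`, a carry entering a compressor can influence that compressor's `out0` and `out1` but cannot continue into the next one, which is the behavior Figure 4(d) illustrates.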

Figure 5(e) shows the interconnect structure. Consider the i-th compressor in sequence. The rank-1 carry output bit ($c_{out,0}$) connects to carry-input $c_{in,0}$ of the (i + 1)st compressor; also, the rank-2 carry output bit ($c_{out,1}$) connects to carry-input $c_{in,1}$ of the (i + 2)nd compressor. From Figure 5(c) and (d), we can see that there is no direct path from either of the carry-in bits of the 6:2 or 7:2 compressor to one of the carry-out bits; similar in principle to Figure 4(d), this prevents the formation of a ripple-carry chain between compressors.

2.6 Compression Ratio

Let I and O be the numbers of inputs to and outputs from a counter, GPC, or compressor; for compressors, I and O do not include the carry-in and carry-out bits. The compression ratio (CR) is defined as CR = I/O. For example, a 6-input, 3-output GPC has CR = 6/3 = 2, while a 6:2 compressor has CR = 6/2 = 3. The CR tends to be higher for compressors than counters. Figure 6(a) shows compression using 6:3 counters, which produce three output bits per column, while 6:2 compressors, shown in Figure 6(b), produce two output bits per column; the other output bits are propagated down the carry chain.

Fig. 6. (a) Covering a set of columns with 6:3 counters yields 3 bits per column in the output; (b) using 6:2 compressors reduces the number of bits per column to 2. Contiguous columns covered with 6:3 counters can be converted to 6:2 compressors.

3. RELATED WORK

3.1 Compressor Tree Synthesis in ASIC Technology

Compressor trees for partial product accumulation were introduced by Wallace [1964] and Dadda [1965], who built them from CSAs; HAs were used at points where only 2 bits in the same column need to be compressed.
Fadavi-Ardekani [1993] recognized that the bits produced by a compressor tree may arrive at different times at the final adder, and designed a specific adder for this purpose; however, this work assumed that all partial product bits arrive at the compressor tree at the same time. Stelling et al. [1996, 1998] relaxed this assumption, and developed appropriate techniques to build the compressor tree and designed the final adder appropriately. Due to the importance of wire delays in deep submicron technology, Um and Kim [2002] proposed a two-phase layout-aware compressor tree synthesis technique that strives for a much more regular interconnect topology than the compressor trees produced by the 3-greedy algorithm of Stelling et al. [1998].

Verma and Ienne [2007a] developed an integer linear program (ILP) that could optimally synthesize compressor trees from a library of m:n counters. To bound the runtime of the synthesis procedure, they limited m to the range [2, 8]. Previously, m:n counters, like compressor trees, were built from CSAs, or libraries of smaller m:n counters. Through efficient logic synthesis techniques for arithmetic circuits [Verma and Ienne 2007b], they found that better m:n counters could be constructed from basic gates, rather than smaller counters. The availability of a library of highly optimized counters was important to the success of their ILP formulation; another contributing factor was that the ILP could optimize for the delay profile of any final adder.

GPCs have also been used in the past to build efficient compressor trees for parallel multipliers [Stenzel et al. 1977]. Mora Mora et al. [2006] described a multiplier generation approach for ASICs that implemented GPCs using ROMs, with the restriction that all input columns to the GPC have the same number of bits.

The 4:2 compressor [Weinberger 1981] was subsequently used by Santoro and Horowitz [1988] in a parallel multiplier. Over the years, various researchers have proposed the use of larger compressors and counters as well, including Kwon et al. [2002] (5:2, 5:3) and Song and De Micheli [1991] (9:2, 27:5).
A column is a set of bits having the same rank, r, at some level in a compressor tree; all of the inputs to an FA or an HA in a compressor tree belong to the same column. The FA or HA produces two output bits, one of rank r, one of rank r + 1. The delay through the FA or HA to the rank r output is called the vertical propagation delay, as the delay is confined to one column; the delay of the rank r + 1 output is called the horizontal propagation delay, as it passes from one column to the next. The use of compressors in favor of counters shifts some of the vertical propagation delay into horizontal propagation delay. Thus, the critical path through a compressor tree travels in both the horizontal and vertical direction before arriving at the final CPA. The compressor cells can be designed in order to minimize the difference between horizontal and vertical delays.

Interestingly enough, a CPA actually has a higher compression ratio than an m:n counter, a GPC, or a compressor. To take advantage of this fact, Oklobdzija and Villeger [1995] advocate the inclusion of CPAs within compressor trees: the vertical propagation delay will dominate; however, at places where the horizontal propagation delay is noncritical, the use of internal CPAs within the compressor tree maximizes the compression ratio. This technique has some notable ramifications for FPGAs: due to the presence of carry chains within logic clusters (see Sections 3.3 and 3.4), horizontal propagation is naturally faster than vertical propagation, which must use the FPGA routing network. This differentiates compressor tree synthesis on FPGAs from the same problem in VLSI. The challenge is that we cannot take advantage of the fast horizontal propagation on the critical paths of the compressor tree without resorting to CPAs. To address this concern, we design and evaluate new logic blocks that can be configured as 6:2 and 7:2 compressors. These logic blocks have a higher compression ratio than m:n counters and GPCs, and employ new carry chains that can exploit fast horizontal propagation.

3.2 FPGA Architecture

This section describes a number of proposals to improve the arithmetic and logical capabilities of FPGA logic cells. The most enduring idea has been the integration of carry chains into FPGA logic cells along with LUTs. Carry chains include fast connections between adjacent logic cells that are used for carry propagation; this permits the elimination of most of the routing delays that would otherwise be present.

The Altera Stratix II-IV Adaptive Logic Module (ALM) employs a carry chain based on ripple-carry adders (RCAs). The new logic cells proposed in this work feature a new type of carry chain intended to allow a logic cell, such as the ALM, to be configured as a 6:2 or 7:2 compressor; the ALM will be described in greater detail in Section 3.3. The carry chains used in the configurable logic blocks (CLBs) of the Xilinx Virtex-4/5 include programmable multiplexers and XOR gates to send propagate and generate signals to adjacent CLBs to enable parallel-prefix style addition [Parhami 1999].

Hauck et al. [2000] proposed more complicated carry chains that can implement Brent-Kung, carry-select, and carry-lookahead addition. Different logical constructs were needed for different cells in the chain, making them nonuniform. This creates integration challenges because it is difficult to lay out a regular fabric consisting of irregular cells.
This would require a large manual effort to design each individual cell at the transistor level, and would complicate the layout process for the entire chip. Frederick and Somani [2006] proposed a uniform logic block with carry chains that could efficiently implement a carry-skip adder; a similar bidirectional carry-skip chain was earlier proposed by Cherepacha and Lewis [1996, Figure 6].

Kaviani et al. [1998] and Leijten-Nowak and Van Meerbergen [2003] developed ALU-like blocks that support arithmetic functions such as addition, subtraction, and (partial) multiplication.

Distributed Arithmetic (DA) [Mirzaei et al. 2006] is a paradigm for implementing effective hardware for DSP systems that uses LUTs instead of multipliers. Grover et al. [2002] developed a special DA-oriented LUT structure (DALUT) specifically for multiply-accumulate (MAC) operations. In addition to two 4-input LUTs, their DALUT cell included arrays of XOR gates, bit-level adders and shift accumulators, shift registers, and a CPA to add partial summations and carries.

Brisk et al. [2007] reported that DSP/MAC blocks are not good candidates for implementing multioperand addition. The logic cell described here is intended to address this shortcoming.

Most FPGAs are hybrid-reconfigurable, as they embed ASIC components such as multipliers, more complex DSP blocks, and standard I/O interfaces into a reconfigurable fabric [Zuchowski et al. 2002]. Kastner et al. [2002] developed techniques for a compiler to examine a set of applications to identify good candidates for these embedded cores. Their analysis, however, was limited to 2-operation combinations of addition and multiplication, and they did not use compressor trees for multioperand addition.

A K-input macro gate [Cong and Huang 2005] is similar to a LUT, but it cannot implement all $2^{2^K}$ logic functions of K inputs, and therefore has reduced delay and area. Hu et al. [2007] suggested that FPGA cells could benefit from the inclusion of both LUTs and macro gates. Similar to Kastner et al., they developed an automated method to profile a set of applications to find good macro-gate candidates. They did not, however, consider arithmetic-dominated functions or fast carry chains between macro gates.

The Field Programmable Counter Array (FPCA) [Brisk et al. 2007; Cevrero et al. 2008] is a programmable IP used to accelerate multi-input addition in FPGAs. The FPCA is similar to an FPGA, but replaces LUTs with m:n counters. In a hybrid FPGA/FPCA, a compressor tree is mapped onto the FPCA, while all other operations are mapped onto the FPGA. As suggested by Kuon and Rose [2007], the cost of routing data to and from the FPCA may limit its performance benefit. The new FPGA cell proposed here is much less ambitious, and exploits carry chains rather than logical structures for effective local routing; furthermore, the I/O interface to the logic cell does not change.

3.3 The Altera Stratix II-IV Adaptive Logic Module (ALM)

The new logic cell proposed in this article is a modified version of the Adaptive Logic Module (ALM) employed in the Altera Stratix II-IV series of FPGAs. Each ALM contains an Adaptive LUT (ALUT).
An ALUT is composed of two six-input LUTs (6-LUTs) with four shared inputs and shared configuration bits; in other words, they must implement the same logic function. Additionally, the ALM contains a carry chain that performs efficient ripple-carry addition, and bypassable flip-flops that facilitate either combinational or sequential circuits. The two 6-LUTs are also fracturable, meaning that each can be decomposed into two or more smaller LUTs. The ALM also includes a 7th input bit, but can only implement a selected set of 7-input functions.

The ALM has four operating modes, two of which use the carry chains. In Arithmetic Mode, each 6-LUT is decomposed into two independent 4-LUTs, which perform a small amount of pre-adder logic, followed by the carry chains. Arithmetic mode implements effective adders, (sequential) counters, accumulators, parity functions, and comparators.

In Shared Arithmetic Mode, the ALM is configured as a 2-bit ternary adder. The fracturable LUTs are configured as a carry-save adder (CSA), that is, a 3:2 compressor, and the carry chain functions as the final adder. Shared arithmetic mode was designed to efficiently implement soft multipliers (as opposed to using DSP blocks) and correlators.
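The division of labor in shared arithmetic mode, a 3:2 compression stage followed by a ripple-carry chain, can be sketched at the bit level. The Python function below is a behavioral illustration only: its name is hypothetical, and it does not model the actual LUT contents, signal names, or 2-bit slicing of a real ALM.

```python
def ternary_add(x, y, z, width):
    """Add three width-bit unsigned integers, one bit position at a time.

    Per position, a 3:2 compression stage (the role the text assigns to the
    fracturable LUTs) produces a sum bit of rank i and a carry bit of rank
    i + 1; a ripple chain (the role of the carry chain) then adds the sum
    bit, the previous position's compression carry, and the chain carry.
    """
    result = 0
    chain_carry = 0       # carry travelling along the ripple chain
    prev_csa_carry = 0    # compression carry produced one position below
    for i in range(width):
        a, b, c = (x >> i) & 1, (y >> i) & 1, (z >> i) & 1
        csa_sum = a ^ b ^ c
        csa_carry = (a & b) | (a & c) | (b & c)
        total = csa_sum + prev_csa_carry + chain_carry   # full adder on the chain
        result |= (total & 1) << i
        chain_carry = total >> 1
        prev_csa_carry = csa_carry
    # Whatever remains above the top position has rank `width` (value 0..2).
    return result | ((prev_csa_carry + chain_carry) << width)
```

Only one carry bit moves between adjacent positions on the chain, which is why a ternary addition in this style can reuse a ripple-carry structure.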

Fig. 7. The Altera Stratix II/III Adaptive Logic Module (ALM) shown in Shared Arithmetic Mode.

Figure 7 illustrates the ALM configured in shared arithmetic mode. It is important to note that the 6-LUTs in the ALM are decomposed into smaller LUTs of 3 and 4 inputs; only the smaller LUTs are shown in Figure 7.

The modification to the ALM proposed in this article is similar to shared arithmetic mode, but implements a 6:2 or 7:2 compressor. Similar to shared arithmetic mode, the fracturable LUTs are configured as a CSA; but the interconnection of FAs in the carry chain differs from the ripple-carry chain. We chose to provide a second carry chain in addition to the ripple-carry chain; in principle, both carry chains could be merged, but this would introduce multiplexers into the ripple-carry chain. We opted for the second carry chain in order to achieve better performance.

3.4 Synthesizing Compressor Trees on FPGAs

The compressor tree synthesis techniques summarized in Section 3.1 are intended for ASIC design flows. Due to the specific logic and routing architectures of modern high-performance FPGAs, these techniques are not likely to yield favorable results if used in a synthesis flow targeting an FPGA. Since the primary role of carry chains has been to facilitate efficient carry-propagate addition, conventional wisdom held that adder trees would yield better results than compressor trees synthesized on an FPGA. This is not necessarily true.

Poldre and Tammemae [1999] synthesized 4:2 compressors onto the four-input LUTs of the Xilinx Virtex FPGAs, exploiting the carry chains to propagate the carry-in/carry-out bits. Parandeh-Afshar et al. [2008b, 2008c] developed a general compressor tree synthesis method that mapped GPCs with 6 inputs and 3 or 4 outputs onto FPGA logic cells built from 6-LUTs.
Limiting the number of GPC inputs to 6 ensures that at most one layer of LUTs is required to implement each GPC. On an Altera Stratix II, a compressor tree built from GPCs was 27% faster than an adder tree. The GPC mapping, however, increased the ALM count by 47%, on average.

Fig. 8. (a) (0, 6; 3) and (b) (2, 3; 3) GPCs mapped onto two ALMs using shared arithmetic mode.

In principle, a 6-input, k-output GPC could be synthesized on k 6-LUTs, where each 6-LUT computes a single output bit. As the two 6-LUTs in a Stratix II-IV ALM must implement the same function, this would require k ALMs, where only one of the two 6-LUTs available in each ALM is used. Parandeh-Afshar et al. [2009], however, proposed a more efficient mapping that uses LUTs in conjunction with carry chains, reducing the number of ALMs required to k/2. In many cases, it is possible to map these components onto ALMs using either arithmetic or shared arithmetic mode. Figures 8 and 9, for example, show three 6-input, 3-output GPCs mapped onto two ALMs using shared arithmetic mode. In fact, these are the only three 6-input, 3-output GPCs that will be used by the GPC mapping heuristic, which is described in Section 5.3. Specifically, these are the only 6-input, 3-output covering GPCs; the definition of a covering GPC will be formalized in Section 5.1. The GPC mapping heuristic only employs covering GPCs; all other GPCs are either redundant or unreasonable, for reasons that will be discussed in Section 5.1.

Fig. 9. (a) A (1, 5; 3) GPC implemented using full and half adders, and (b) mapped onto two ALMs using shared arithmetic mode. The internal signals S0, C0, S1, C1, and C2 in (a) are computed by LUTs in (b). Signals C2 and D are never 1 at the same time, so the carry output of the adder that produces output bit z2 is always 0.

4. NEW FPGA LOGIC CELL AND CARRY CHAIN

Figure 10(a) shows our proposed new FPGA logic cell, which is presented as an extension of the ALM used in Altera's Stratix II-IV line of high-performance FPGAs. The components required for shared arithmetic mode are also shown in this figure. The left-hand side of Figure 10(a) shows four 3-LUTs, which are part of Altera's fracturable 6-LUT architecture. The carry chain on the right-hand side is the traditional carry chain that is used to implement ternary addition, using the four 3-LUTs configured as a carry-save adder. The novel features of the new logic cell are the carry chain in the center (gray background), which can implement a 6:2 or 7:2 compressor, and the two multiplexers shown in gray on the right-hand side of Figure 10(a), which select between the outputs of the two carry chains. Similar to ternary addition, the new carry chain requires the four 3-LUTs to be configured as a carry-save adder. To implement a 7:2 compressor, three FAs (and a 7th LUT input) are required; to implement a 6:2 compressor, one of the FAs (outlined with a dashed line) becomes a half (two-input) adder, and the 7th input bit is not used.

Fig. 10. (a) Enhanced version of the Shared Arithmetic Mode of the Altera ALM; a new carry chain, shown in gray, allows the ALM to be configured as a 6:2 or 7:2 compressor. Two additional multiplexers are required to select between the two sum outputs of the 6:2 compressor and ternary adder (already present in the ALM); (b) pattern of carry-propagation for the 6:2 and 7:2 compressor.

Three carry-in/carry-out bits are also required; they are labeled X, Y, and Z in Figure 10(a). The carry-out labeled X/Y/Z connects to the corresponding carry-in labeled X/Y/Z of the next compressor in the chain. A detailed picture of the carry chains across several logic cells is shown in Figure 10(b). In principle, the FAs used in the two carry chains could be shared; this design choice was illustrated by Parandeh-Afshar et al. [2008a, Figure 5(a)]. Although sharing could slightly reduce area, it requires that multiplexers be inserted into the carry chain, significantly increasing the critical path delay. As our goal is to increase performance, this design point is nonideal, especially since the area of the multiplexers offsets the area savings from sharing FAs.

There are two primary advantages of providing an FPGA logic cell that can be configured as a compressor compared to synthesizing GPCs on LUTs. The first advantage, which was illustrated in Figure 6, is that a k:2 compressor will have a higher compression ratio than a k-input GPC. In some but certainly not all cases, this can reduce the number of levels of logic in the compressor tree. The second advantage involves area utilization. Each ALM contains two 6-LUTs with dependent inputs. A GPC with six inputs and three outputs, including a 6:3 counter, requires two ALMs, while only one of our proposed logic cells, which is marginally larger than an ALM, is required to realize a 6:2 compressor. Reducing the number of logic cells, moreover, may allow for a tighter placement of logic cells on the device, which, in turn, reduces wirelength and routing delay; our experiments confirm this hypothesis. Using similar reasoning, the use of 7:2 rather than 6:2 compressors further increases the compression ratio, and may also reduce the number of logic cells required since each cell can consume an additional bit.
Consider the ith compressor in the chain. Carry-in bits c_in,0 and c_in,1 are driven by the rank 1 carry-out of the (i - 1)st compressor and the rank 2 carry-out of the (i - 2)nd compressor, respectively; likewise, the rank 1 carry-out of the ith compressor drives carry-in c_in,0 of the (i + 1)st compressor, and the rank 2 carry-out drives carry-in c_in,1 of the (i + 2)nd. When an ALM is configured as a two-bit ternary adder in shared arithmetic mode, six input bits are used, so no modifications to the I/O interface are required to implement a 6:2 compressor. The 7:2 compressor, in contrast, requires an extra input bit. This is not a problem, as the ALM contains eight architecturally visible inputs; either of the two remaining inputs can be used as the seventh input when the ALM is configured as a 7:2 compressor.

5. COMPRESSOR TREE SYNTHESIS ON THE NEW LOGIC CELL

This section describes a mapping heuristic that can synthesize compressor trees targeting the logic cell shown in Figure 10(a). This heuristic is an extension of an earlier one proposed by Parandeh-Afshar et al. [2008b], which targeted the Altera Stratix II FPGA. Compressor trees synthesized using an ASIC design flow produce two outputs that are summed using a CPA. Since ternary CPAs are available in Stratix II for the same delay and area as binary CPAs, the heuristic outputs compressor trees that produce three outputs instead of two. The remainder of the compressor tree is synthesized using GPCs. The number of outputs per GPC was limited to four, ensuring that each GPC can be implemented using at most four 6-LUTs (or fewer, if shared arithmetic mode can be exploited). This section extends the mapping heuristic to include the possibility of configuring the logic cells as 6:2 or 7:2 compressors as well.

5.1 GPC Classification

By convention, we require that a GPC must have at least two input bits. For example, (0, 1; 1) and (1, 0; 2) are not GPCs. Some GPCs are considered unreasonable by the heuristic because they can always be replaced with another more sensible choice. GPCs such as (3, 1; 3) have one rank 0 input bit, which is always passed directly to the least significant output bit, that is, the value of the input bit determines whether the output is odd or even; such a GPC is considered to be unreasonable. Another class of unreasonable GPCs are those for which the number of input bits is less than or equal to the number of output bits, for example, (2, 1; 3); these GPCs are unreasonable because they do not perform any compression. A third class of unreasonable GPCs are those that have no rank 0 input bits, for example, (2, 0; 3). In this case, the rank 1 input bits could be converted to rank 0 input bits of a smaller counter that produces fewer output bits, for example, (0, 2; 2).

A primitive GPC is one that satisfies input/output constraints of M and N and is reasonable. In theory, the number of primitive GPCs is exponential in M and N; limiting M and N to small constant values ensures tractability. With N output bits, the sum of the input bits, where input bits are weighted by rank, cannot exceed 2^N - 1; this ensures that the number of primitive GPCs is finite. A covering GPC is a primitive GPC whose functionality cannot be implemented by another primitive GPC. For example, a (2, 3; 3) GPC can implement a (1, 3; 3) GPC by setting one rank 1 input bit to zero.
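The classification above can be sketched as a short enumeration. This is a minimal illustration under stated assumptions, not the authors' tool: all function names are hypothetical, the number of outputs is fixed rather than bounded, only inputs of rank 0 and rank 1 are enumerated (`n_ranks=2`), and one GPC is taken to implement another when its surplus inputs can be tied to zero, as in the (2, 3; 3) and (1, 3; 3) example.

```python
from itertools import product


def is_reasonable(counts, n_out):
    """counts[r] is the number of input bits of rank r (rank 0 first)."""
    n_in = sum(counts)
    max_sum = sum(k << r for r, k in enumerate(counts))  # rank-weighted sum
    return (
        n_in >= 2                        # convention: at least two input bits
        and max_sum <= (1 << n_out) - 1  # result must fit in n_out bits
        and counts[0] >= 2               # zero or one rank-0 bit is unreasonable
        and n_in > n_out                 # otherwise no compression takes place
    )


def primitive_gpcs(max_in, n_out, n_ranks=2):
    # Assumption: only ranks 0 .. n_ranks - 1 are enumerated.
    return [
        counts
        for counts in product(range(max_in + 1), repeat=n_ranks)
        if sum(counts) <= max_in and is_reasonable(counts, n_out)
    ]


def covers(g, h):
    """g implements h if tying g's surplus inputs to zero yields h."""
    return g != h and all(a >= b for a, b in zip(g, h))


def covering_gpcs(prims):
    return [h for h in prims if not any(covers(g, h) for g in prims)]


def paper_notation(counts, n_out):
    # The text lists the highest rank first, so reverse the tuple.
    ranks = ", ".join(str(k) for k in reversed(counts))
    return f"({ranks}; {n_out})"


prims = primitive_gpcs(max_in=6, n_out=3)
print(sorted(paper_notation(g, 3) for g in covering_gpcs(prims)))
```

With at most six inputs and three outputs, this enumeration leaves (0, 6; 3), (1, 5; 3), and (2, 3; 3) as the only GPCs that no other primitive GPC dominates.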
For example, there are just three covering GPCs having six inputs and three outputs: (0, 6; 3), (1, 5; 3), and (2, 3; 3) (see Figures 8 and 9). All other GPCs satisfying these I/O constraints are either unreasonable, for example, (3, 1; 3), or can be covered by one of the three covering GPCs already listed.

5.2 GPC Library Construction

The mapping heuristic uses a library of GPCs having at most M inputs and N outputs. This library is computed once for each target FPGA and stored in a text file. The library is read from the text file each time a set of compressor trees is synthesized. First, the primitive GPCs are enumerated and added to the library. Second, the set of covering GPCs is identified and marked as such. Third, the primitive GPCs are sorted in nondecreasing order of compression ratio. Each set of primitive GPCs having the same compression ratio is sorted in nondecreasing order of the number of inputs. The total ordering of primitive GPCs favors a high compression ratio as the first criterion and the number of bits consumed as a second.

Parandeh-Afshar et al. [2008b] used M = 6 and N = 4 to target the Altera Stratix II FPGA. Limiting the number of inputs to M = 6 ensures that only one layer of ALMs is required to implement the counter, regardless of whether the GPC is synthesized on LUTs or uses shared arithmetic mode, as in Figures 8 and 9. Limiting the number of outputs to N = 4 ensures that at most four ALMs are required for each GPC, under the worst-case assumption that each output bit is computed using a 6-LUT; fewer ALMs are required when shared arithmetic mode can be used [Parandeh-Afshar et al. 2009].

The mapping heuristic, described in the following section, converts chains of consecutive (0, 6; 3) GPCs (6:3 counters) into 6:2 compressors, whenever possible. Unfortunately, this approach cannot be used for 7:2 compressors, as M = 6 prevents 7-input GPCs from inclusion in the library. To support 7:2 compressors, a (0, 7; 3) GPC is added to the library, but no other 7-input GPCs are included. Chains of (0, 7; 3) GPCs are converted to 7:2 compressors; when a 7-input GPC is not contained in a chain, it is converted to GPCs with at most 6 inputs, as described in the following section.

5.3 Mapping Heuristic

The input to the mapping heuristic is: (1) an ordered array of integers, k_i, where the ith integer is the number of bits of rank i to sum, e.g., k_0 bits of rank 0, k_1 bits of rank 1, etc.; (2) a library of GPCs, as described in the preceding section; and (3) a flag called mode, which takes one of three values: ALM, 6:2, or 7:2. If mode = ALM, then we are targeting an FPGA containing traditional ALMs that cannot be configured as 6:2 or 7:2 compressors; if mode = 6:2 or 7:2, then we are targeting an FPGA whose logic cells can be configured as a 6:2 or 7:2 compressor, for example, Figure 10(a).
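The library ordering of Section 5.2 and the three heuristic inputs listed above can be sketched as plain data. Everything named here is hypothetical: a GPC is represented as a pair `(counts, n_out)` with `counts[r]` the number of rank-r inputs, and the compression ratio is taken as input bits divided by output bits. The sketch places the highest ratio first and, among equal ratios, the GPC with more inputs first; that is one reading of the stated priority under a first-fit scan, and it is an assumption rather than the authors' exact ordering.

```python
from fractions import Fraction


def compression_ratio(gpc):
    counts, n_out = gpc                 # counts[r] = input bits of rank r
    return Fraction(sum(counts), n_out)


def sort_library(gpcs):
    # Highest compression ratio first; among equals, more input bits first.
    return sorted(
        gpcs,
        key=lambda g: (compression_ratio(g), sum(g[0])),
        reverse=True,
    )


library = sort_library([
    ((3, 3), 4),   # (3, 3; 4), ratio 6/4
    ((4, 1), 3),   # (1, 4; 3), ratio 5/3
    ((6, 0), 3),   # (0, 6; 3), ratio 6/3
    ((3, 2), 3),   # (2, 3; 3), ratio 5/3
    ((5, 1), 3),   # (1, 5; 3), ratio 6/3
])

# Heuristic inputs: column heights k_i (list index = rank), the library, a mode.
column_heights = [3, 6, 6, 2]           # illustrative values only
mode = "6:2"                            # one of "ALM", "6:2", "7:2"

for counts, n_out in library:
    print(counts, n_out, compression_ratio((counts, n_out)))
```

Using `Fraction` keeps the ratio comparison exact, so two GPCs with equal ratios are never separated by floating-point rounding.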
The mapping heuristic generates one level of the compressor tree at a time. A subset of the input bits is covered by GPCs and possibly 6:2 or 7:2 compressors. The output bits produced by each GPC are propagated to the next level of the compressor tree, along with the bits from the current level that are not covered. Since the rank of each GPC output bit is known, a new set of columns (array of integers) is generated for each level of the tree. Pseudocode for the mapping heuristic is shown in Figure 11. A new level in the tree is generated until there are at most three rows of bits remaining, that is, each column of the next level has at most three input bits. A ternary CPA completes the tree.

Fig. 11. Pseudocode for GPC mapping heuristic [Parandeh-Afshar et al. 2008a] with extensions to exploit 6:2 and 7:2 compressors, where appropriate.

The remainder of this section focuses on the process of producing one level of the compressor tree, that is, how to cover a set of columns with GPCs. The following process is applied until no remaining (primitive) GPCs can cover any bits in the current level of the tree. The column having the most noncovered input bits in the current level is always selected; ties are broken arbitrarily. Selecting the column with the largest number of bits tends to favor the use of GPCs with higher compression ratios and a large number of input bits. To find the best GPC for the selected column, the set of primitive GPCs is searched. The first GPC that fits the base column and its following or preceding columns is selected. If mode = 7:2 and the column contains at least seven input bits, then a (0, 7; 3) GPC is always used, and a (0, 6; 3) GPC is always used if the column contains six input bits. If mode = 6:2 or ALM and the column contains at least six input bits, then a (0, 6; 3) GPC is always used. Otherwise, the selected column contains fewer bits than the maximum input bandwidth of the largest GPC in the library; in this case, GPCs that cover bits from columns that are immediately adjacent to the selected column can be used as well.

A forward search looks for a GPC under the assumption that the bits in the selected column will have rank 0. If the selected column is c, then the forward search will attempt to include bits from columns c + 1, c + 2, etc. A backward search assumes that the bits in the selected column will be of the highest rank in the GPC that covers them. If the selected column is c, then the backward search will attempt to include bits from columns c - 1, c - 2, etc. In both searches, the first GPC that fits the selected column and the additional columns is selected. Among these two GPCs, the one with the highest priority, according to the sorted order, is selected. The forward and backward searches are particularly useful when the distribution of column heights is asymmetric. This occurs quite frequently for constant multipliers, including FIR filters.

Fig. 12. Illustration of the forward (a) and backward (b) search using GPC mapping.

Figure 12 illustrates the forward and backward search. In Figure 12(a), a forward search finds a (1, 4; 3) GPC while the backward search in Figure 12(b) finds a (4, 3; 4) GPC. Since the compression ratios are 5/3 = 1.67 and 7/4 = 1.75, respectively, the GPC found by the backward search is selected.

After selecting a column and a GPC, the bits that have been selected are removed from the current set of columns. The output bits produced by the GPC are added to the set of columns for the next level in the tree. This process repeats (a column and GPC are selected) until either all bits at the current level have been covered, or no primitive GPC in the list can cover more than a single bit.

Once the current level is completely covered, the heuristic attempts to replace some GPCs with 6:2 or 7:2 compressors. If mode = 6:2, each contiguous sequence of (0, 6; 3) GPCs is replaced with a contiguous sequence of 6:2 compressors, similar in principle to Figure 6. Note that this transformation reduces the number of bits in the following level; aggregated over several levels, the use of compressors rather than counters can reduce the total number of logic levels in the compressor tree.
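Returning to the forward and backward searches illustrated in Figure 12, the fit test and the final choice between the two candidates can be sketched as follows. The function names and the column heights are hypothetical (the heights are chosen only so that the two GPCs named in the text fit), a GPC is written as `(counts, n_out)` with `counts[r]` the number of rank-r inputs, and "fits" is assumed to mean that every column offers at least as many bits as the GPC has inputs of the matching rank.

```python
from fractions import Fraction


def fits_forward(heights, c, counts):
    """Column c supplies the rank-0 bits; rank r is taken from column c + r."""
    return all(
        k == 0 or (c + r < len(heights) and heights[c + r] >= k)
        for r, k in enumerate(counts)
    )


def fits_backward(heights, c, counts):
    """Column c supplies the highest-rank bits; lower ranks come from c - 1, c - 2, ...

    Assumes the last entry of counts (the highest rank) is nonzero.
    """
    base = c - (len(counts) - 1)        # column that supplies the rank-0 bits
    return all(
        k == 0 or (base + r >= 0 and heights[base + r] >= k)
        for r, k in enumerate(counts)
    )


def ratio(gpc):
    counts, n_out = gpc
    return Fraction(sum(counts), n_out)


heights = [3, 4, 1]       # hypothetical column heights, list index = rank
c = 1                     # the tallest column is selected
forward = ((4, 1), 3)     # (1, 4; 3): four bits from column 1, one from column 2
backward = ((3, 4), 4)    # (4, 3; 4): three bits from column 0, four from column 1

assert fits_forward(heights, c, forward[0])
assert fits_backward(heights, c, backward[0])

best = max(forward, backward, key=ratio)
print(best, ratio(best))  # ((3, 4), 4) 7/4
```

The comparison reproduces the choice described in the text: 7/4 exceeds 5/3, so the GPC found by the backward search wins.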
If mode = 7:2, then each contiguous sequence of (0, 7; 3) GPCs is replaced with a sequence of 7:2 compressors, similar to what was done for 6:2 compressors. Each remaining (0, 7; 3) GPC is replaced by a (0, 6; 3) GPC and one unmapped bit that is propagated to the next level of the tree. The reason for doing this is that (0, 7; 3) GPCs do not map efficiently onto ALMs, so we replace them with a more favorable component.

Next, the current level of the compressor tree is mapped onto logic cells. GPCs are mapped onto ALMs, while 6:2 and 7:2 compressors require the logic cell to be configured to use the carry chain shown in Figure 10(a). Additionally, the outputs of the GPCs and compressors from the preceding level of the compressor tree are connected to the inputs of the GPCs and compressors in the current level. The last step is to generate the columns for the next level of the compressor tree.
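The replacement step for mode = 7:2 can be sketched as a grouping of column indices. The names below are hypothetical, and two simplifications are assumed: each listed column holds a single (0, 7; 3) GPC, and a chain needs at least two adjacent GPCs (`min_chain=2`), which is an interpretation of "contiguous sequence" rather than something the text states.

```python
def contiguous_runs(columns):
    """Group column indices into runs of consecutive integers."""
    runs = []
    for col in sorted(columns):
        if runs and col == runs[-1][-1] + 1:
            runs[-1].append(col)
        else:
            runs.append([col])
    return runs


def replace_with_compressors(gpc_columns, min_chain=2):
    """Split columns holding (0, 7; 3) GPCs into 7:2 chains and leftovers.

    min_chain is an assumption: shorter runs are not treated as chains.
    """
    chains, singles = [], []
    for run in contiguous_runs(gpc_columns):
        (chains if len(run) >= min_chain else singles).append(run)
    # Each leftover (0, 7; 3) becomes a (0, 6; 3) plus one unmapped bit
    # that is passed on to the next level in the same column.
    unmapped_bits = {col: 1 for run in singles for col in run}
    return chains, unmapped_bits


chains, unmapped = replace_with_compressors([2, 3, 4, 7, 9, 10])
print(chains)    # [[2, 3, 4], [9, 10]]
print(unmapped)  # {7: 1}
```

The same run detection applies to mode = 6:2 with (0, 6; 3) GPCs, except that isolated counters are simply left in place.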
