VLSI DESIGN OF DIGIT-SERIAL FPGA ARCHITECTURE


Journal of Circuits, Systems, and Computers, Vol. 3, No. (24) 7 52
© World Scientific Publishing Company

HANHO LEE
School of Information and Communication Engineering, Inha University, Incheon 42-75, Korea

GERALD E. SOBELMAN
Department of Electrical and Computer Engineering, University of Minnesota, 2 Union Street SE, Minneapolis, MN 55455, USA

Received May 2
Revised 28 October 22

This paper presents a novel application-specific field-programmable gate array (FPGA) architecture that supports efficient implementation of digit-serial DSP architectures on a digit-wide basis. Digit-serial DSP designs have been an effective implementation method for FPGAs. To efficiently realize a digit-serial DSP design on FPGAs, one must create an FPGA architecture optimized for those types of systems. We examine the various circuits used in digit-serial DSP designs to extract the key features that should be reflected in the new FPGA architecture. We explain the design methodology, layout and implementation of the new digit-serial FPGA architecture. Digit-serial DSP designs using the digit-serial FPGA (DS-FPGA) are compared to those implemented on Xilinx FPGAs. We have estimated that the DS-FPGA is about times more efficient in area and faster than the equivalent digit-serial DSP architectures implemented using Xilinx FPGAs.

Keywords: VLSI; FPGA; DSP; digit-serial; architecture.

1. Introduction

Field-Programmable Gate Arrays (FPGAs) are of interest for use in digital signal processing (DSP) systems due to their ability to implement custom hardware solutions while still maintaining flexibility through device reprogramming. FPGAs provide a configurable structure through an array of adjustable logic modules interconnected by programmable routing resources and surrounded by programmable input/output (I/O) blocks. The main constraints on FPGA architectures are limited

routing resources, limited I/O resources and large routing delays. Under these circumstances, a digit-serial approach has been shown to be an effective implementation style for FPGAs. 5 In practical DSP applications, it may be desirable to combine the area-efficiency of a bit-serial architecture with the time-efficiency of a corresponding bit-parallel architecture into a single area/time efficient digit-serial architecture. 6 9 The implementation methods of digit-serial architectures have been proposed in Refs Moreover, the digit-level pipelined and bit-level pipelined digit-serial multipliers that can be used to further increase the performance of digit-serial architectures have been proposed in Refs. 7. It was demonstrated in Refs. 3 that the area-time efficiency and performance of digit-serial architectures on FPGAs are considerably better than those of bit-serial and bit-parallel architectures. Several digit-serial arithmetic circuits and DSP architectures using FPGAs were presented in Refs. 5.

This paper shows that by focusing on a specific class of digit-serial DSP applications, we may increase the area and speed efficiency of FPGAs significantly. General-purpose FPGAs, such as those described in Refs. 4 and 5, do not offer an area-efficient realization for a certain class of digit-serial DSP applications, and are better suited for state machines and a wide range of logic functions. To efficiently realize digit-serial DSP designs on FPGAs, one must create an FPGA architecture targeted to digit-serial DSP architectures that can compensate for the weaknesses of general-purpose FPGAs and accelerate the performance substantially. 6 We examine the various circuits used in digit-serial DSP designs to extract the key features that should be reflected in the new FPGA architecture. Key to the suitability of the FPGA for these applications is the fact that each of its basic blocks is capable of processing a digit size of up to 4 bits.
The targeted digit-serial DSP systems may contain several digit-level and bit-level pipelined digit-serial datapaths of various digit sizes. They may also have irregular control logic in some portions. Thus, our digit-serial FPGA (DS-FPGA) architecture must contain some bit-level programmability, yet take advantage of the high degree of regularity that exists in digit-serial datapaths. The DS-FPGA architecture makes possible a more efficient realization of those digit-serial architectures, and static RAM programming technology is used to provide in-circuit reprogrammability.

This paper is organized as follows. Section 2 gives an overview of the digit-serial approach. The DS-FPGA architecture and its major components are presented in Sec. 3. In Sec. 4, we discuss the various circuit design issues that must be considered. Routing architecture design issues are addressed in Sec. 5. The layout style, performance and fabrication results for the DS-FPGA architecture are presented in Sec. 6. Example digit-serial DSP implementations using the proposed DS-FPGA are described in Sec. 7. Section 8 presents a performance comparison between the DS-FPGA and Xilinx FPGAs. Our conclusions are summarized in Sec. 9.

2. Digit-serial Approach

Previous architectures have primarily focused on two approaches: bit-serial and bit-parallel implementations. Bit-serial designs process one input bit of a word (or sample) at a time. The advantages of these systems include fewer interconnections, fewer pin-outs, less internal hardware, faster clock speed, and lower power consumption. Their main disadvantage is that they are slow: for a word length of W bits, a bit-serial architecture requires W clock cycles to compute one word or sample. They are therefore primarily suited for low to medium speed applications. Bit-parallel systems process all input bits of a word in one clock cycle and are the most common implementation style. Their main advantage is that they can compute one word in one clock cycle, so they provide high performance and are ideal for high-speed applications. Their disadvantages include larger chip area, more interconnections and pin-outs, and higher power consumption.

To avoid the disadvantages of bit-serial and bit-parallel computation, the concept of digit-serial implementation has been proposed in recent years. 6 3,7 The digit-serial approach offers a flexible trade-off between the bit-serial and bit-parallel approaches, and between data throughput and the size of arithmetic operators. A system based on this approach can combine the high throughput of parallel computation with the small operator size of serial computation. Bit-serial systems, which process one bit of the input sample in one clock cycle, have very localized routing and are area-efficient in FPGAs. 8,9 On the other hand, bit-parallel systems, which process one whole word of the input sample in one clock cycle, require many modules, so the routing resources are often insufficient, which results in large routing delays.
However, in applications which require moderate sample rates, both of these systems may be ineffective: the bit-serial systems may be too slow, and the bit-parallel systems may be faster than necessary and occupy a considerable amount of area. Digit-serial systems are therefore best suited for implementing digital signal processing systems which require moderate sampling rates. In a digit-serial arithmetic implementation, the W bits of a data word are processed in units of the digit size, N bits, in W/N clock cycles. The digits are processed serially, one digit at a time, with the least significant digit first. This leads to arithmetic operators that have a smaller area than equivalent bit-parallel arithmetic designs and a larger throughput than equivalent bit-serial arithmetic designs. Architectures based on the digit-serial approach may offer the best overall trade-off between speed, efficient area utilization, throughput, I/O pin limitations and power consumption. By considering a range of values for the digit size, one can search the design space to find the optimum implementation for a given application.

The implementation methods of digit-serial architectures have been proposed. 6 3,7 The first approach is to start with a bit-parallel structure and then use folding to obtain the digit-serial architecture. 6 8 The second approach is to start with a bit-serial architecture and then use unfolding to obtain the digit-serial

architecture. 7 The major drawback of the architectures based on these approaches is that they cannot be pipelined at the bit level, which severely limits their throughput. This could be a major obstacle for high-speed applications. The main reason these structures cannot be pipelined is the existence of carry feedback loops, which are impossible to pipeline. Recently, digit-serial architectures that can be pipelined at the bit level have been reported. 2 The use of carry feed-forward removes the major bottleneck caused by the carry feedback loops of conventional digit-serial designs. The possibility of a high degree of pipelining offered by the structure in Refs. 2 increases the throughput rate of digit-serial architectures.

Fig. 1. (a) Digit-serial adder, (b) digit-serial subtractor, (c) digit-serial adder/subtractor, and (d) digit-serial comparator with N = 4 bits.

Fig. 2. (a) Bit-level pipelined digit-serial adder and (b) bit-level pipelined digit-serial subtractor with N = 4 bits.

A basic element in a digit-serial DSP implementation is the digit-serial adder shown in Fig. 1(a). A digit-serial adder with a digit size (N) of 4 bits is a circuit that adds four pairs of bits along with a previous carry bit and produces a sum digit and a new carry bit. The two operands, A and B, are fed one digit at a time into the digit-serial adder. The addition is done N bits at a time, with the carry rippling from one full adder to the next. The carry-out from the digit-serial adder is fed back into the first full adder during the next clock cycle, when the next pair of input digits has arrived. Several examples of digit-serial arithmetic circuits are shown in Figs. 1-3.
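The behavior of the digit-serial adder described above can be illustrated with a short behavioral model. This is only a sketch under stated assumptions: the function names `to_digits` and `digit_serial_add`, the unsigned operands and the default word length of 16 bits are illustrative choices, not part of the paper. Each loop iteration stands for one clock cycle, and the variable `carry` plays the role of the carry that is fed back into the first full adder.

```python
def to_digits(value, word_bits, digit_bits):
    """Split an unsigned word into digits, least significant digit first."""
    if word_bits % digit_bits != 0:
        raise ValueError("word length must be a multiple of the digit size")
    mask = (1 << digit_bits) - 1
    return [(value >> shift) & mask for shift in range(0, word_bits, digit_bits)]


def digit_serial_add(a, b, word_bits=16, digit_bits=4):
    """Add two unsigned words one digit per clock cycle.

    Returns (sum modulo 2**word_bits, final carry, number of cycles).
    """
    mask = (1 << digit_bits) - 1
    carry = 0  # models the fed-back carry, cleared at the start of a word
    sum_digits = []
    for digit_a, digit_b in zip(to_digits(a, word_bits, digit_bits),
                                to_digits(b, word_bits, digit_bits)):
        total = digit_a + digit_b + carry   # N-bit addition within one cycle
        sum_digits.append(total & mask)     # sum digit produced this cycle
        carry = total >> digit_bits         # carry stored for the next cycle
    result = sum(d << (i * digit_bits) for i, d in enumerate(sum_digits))
    return result, carry, len(sum_digits)


if __name__ == "__main__":
    for n in (1, 4, 16):  # bit-serial, digit-serial (N = 4), bit-parallel
        result, carry, cycles = digit_serial_add(0xBEEF, 0x1234, 16, n)
        print(f"N={n:2d}: sum=0x{result:04X} carry={carry} cycles={cycles}")
```

With W = 16, the model needs 16 cycles for N = 1, 4 cycles for N = 4 and a single cycle for N = 16, which mirrors the W/N cycle count given in the text.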

Fig. 3. (a) Unsigned digit-level pipelined N = 2 digit-serial multiplier and (b) unsigned digit-level pipelined N = 4 digit-serial multiplier.

A digit-level pipelined digit-serial multiplier, shown in Fig. 3, can be implemented by unfolding the structure of a bit-serial multiplier. 7 One input of this multiplier is parallel while the other is digit-serial with the least significant digit presented first. The output is also digit-serial with the least significant digit first. Chang has presented a digit-serial multiplier design that can be pipelined at the bit level, which results in higher processing speeds. The bit-level pipelined digit-serial multiplier contains digit-cells, a digit-serial 3:2 compressor and a digit-serial adder. Each digit-cell consists of a partial product generator module and a carry-save adder (CSA) module. A simple digit-serial 3:2 compressor is used to reduce the carry-save adder array outputs to two digits. A digit-serial adder is then used to add these two digits to generate the final digit-serial outputs.

3. Digit-serial FPGA Architecture

A digit-serial FPGA (DS-FPGA) architecture is composed of digit-serial logic blocks (DLBs), a programmable interconnect architecture and I/O blocks, as shown in Fig. 4. The overall structure of the DLB and the routing architecture are explained in this section.

Fig. 4. DS-FPGA architecture.

3.1. Digit-serial logic block

The DLB is a logic-based core cell which is simple and straightforward. Figure 5(a) shows a simplified schematic of the DLB, consisting of four main parts: digit-serial logic array, fast-carry logic, carry-type select logic, and register array. The detailed DLB structure is shown in Fig. 5(b). The DLB is formed in a digit-serial structure and operates on N = 4 bit operands. This leads to direct implementation of basic mathematical functions, such as digit-serial adders, subtractors and multipliers, in a DLB. The DLBs can be connected together through the interconnection resources to implement digit-serial adders, subtractors, and multipliers of any digit size. The DLB can also realize several bit-wise logical operations, including AND, OR, XOR, etc. It has 26 inputs and 9 outputs plus clock (CK), set/reset (SR) and clock enable (CE) inputs. The programming configuration bits in Table 1 are shared across identically programmed datapath slices in a DLB. This sharing reduces the total number of SRAM cells, resulting in higher density and faster reconfiguration. The low number of SRAM bits also simplifies the programming of the DLB, which is desirable for reconfigurable computing. The DLB supports efficient implementation of N = 2, 3 and 4 digit-serial DSP applications as well as random control circuits. The DLB architecture uniquely combines both fine and coarse logic granularity for optimum logic utilization and high performance. High logic utilization is provided by the fine-grained logic modules, which can implement random logic functions without wasting device resources.

Fig. 5. (a) Digit-serial logic block (DLB) diagram and (b) detail of the DLB of the DS-FPGA.

Table 1. The configuration settings for the DLB.
Bit            Meaning
SIR            High if D(3:0) is the direct input to the D-FFs
SRi            High if Ri(3:0) is the direct input to the D-FFs
SRA            High if the D-FFs with outputs SO(3:0) are used
SRA            High if the D-FFs with outputs CO(2:0) are used
SCO3           High if the D-FF with output CO3 is used
SbitM3(1:0)    Low-Low if N = 2 digit-serial circuits are mapped onto the DLB
               High-Low if N = 3 digit-serial circuits are mapped onto the DLB
The coarse-grained structure of the four fully interconnected logic modules provides fast operation and efficient routing with minimal signal skew for digit-serial DSP architectures.

Digit-serial logic array

The digit-serial logic array is composed of four small logic modules (LMs), which are the smallest units of logic in the DLB structure; each is a logic-based core cell. The logic-based core cell method uses fewer transistors than the look-up table (LUT) method. Figure 6 depicts the structure of the LM, which has five data inputs, four data outputs and a configuration bit. The propagate (P) and generate (G) outputs are used as inputs to the fast-carry logic in the DLB. A large number of logic functions can be implemented by using an appropriate subset of the inputs and tying the remaining inputs of an LM high or low, as shown in Table 2. Each logic module can implement arithmetic functions such as a full-adder, subtractor and

some combinational functions. Table 3 shows the programming table of the configurable SRAM cells for mapping the arithmetic functions. A single logic array can implement N = 2, 3 and 4 digit-serial adders/subtractors, unsigned digit-serial multiplier modules with partial product, or two's complement digit-serial multiplier modules. Based on our observation, these operations are widely used in digit-serial DSP applications. The LM can also realize several bit-wise logical operations, including AND, OR, XOR, etc.

Fig. 6. The structure of logic module (LM).

Table 2. Logic functions which can be implemented by a logic module.
Columns: function name; function at SO; function at CO; input pattern (A B C E); configuration bit (Sbit).
Functions listed: Full-Adder, Subtractor, Half-Adder, Multiplier cell, INV, AND2, NAND2, OR2, NOR2, XOR2, XNOR2, MUX 2:1, MUXB 2:1, AND-XOR.

Fast-carry and carry-type select logic

The DLB provides fast-carry logic that bypasses the ripple-carry interconnect structure for N = 4 digit-serial circuits by using a carry look-ahead method. Because the LMs already produce the propagate (P) and generate (G) signals, the fast-carry logic in the DLB is very simple. The fast-carry logic greatly increases the efficiency and performance of digit-serial adder, subtractor and multiplier building blocks with N = 4.
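Two ideas from this part of the section, deriving simple gates by tying inputs high or low and exporting propagate/generate signals for the fast-carry logic, can be illustrated with a generic full-adder cell. The sketch below is not the paper's LM circuit: the function `full_adder_cell`, its pin order and the helper `tied` are assumptions made for illustration, and the actual input patterns and configuration bits of Table 2 are not reproduced.

```python
from itertools import product


def full_adder_cell(a, b, c):
    """Generic full-adder cell with sum, carry, propagate and generate outputs."""
    p = a ^ b                  # propagate: a carry-in is passed through
    g = a & b                  # generate: a carry is produced regardless of c
    so = p ^ c                 # sum output
    co = g | (c & (a | b))     # carry output, X*Y + Z*(X+Y)
    return so, co, p, g


def tied(c_value, output_index):
    """Truth table over (x, y) with the third input tied to a constant."""
    return [full_adder_cell(x, y, c_value)[output_index]
            for x, y in product((0, 1), repeat=2)]


# Tying the carry input low or high turns the cell into two-input gates.
XOR2 = tied(0, 0)    # sum output with c = 0
AND2 = tied(0, 1)    # carry output with c = 0
XNOR2 = tied(1, 0)   # sum output with c = 1
OR2 = tied(1, 1)     # carry output with c = 1

if __name__ == "__main__":
    for name, table in (("XOR2", XOR2), ("AND2", AND2),
                        ("XNOR2", XNOR2), ("OR2", OR2)):
        print(name, table)
```

The same cell therefore covers an arithmetic function and four two-input gates without a look-up table, which is the kind of reuse the logic-based core cell relies on.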

Table 3. Programming of configurable SRAM cells (DS = digit-serial).
Configuration bit columns: Sbit(:), SbitM3(1:0), SIR, SRI, SRA, SRA, SCO3.
Arithmetic functions listed: INV, AND2, NAND2, OR2, NOR2, XOR2, XNOR2, MUX 2:1, MUXB 2:1, AND-XOR, Full-Adder, D flip-flop, N = 4 DS adder, N = 4 DS subtractor, N = 4 DS multiplier cell.

As shown in Fig. 5(b), three 3:1 multiplexers placed after the LMs select one of three possible carry options via two configuration bits. This structure forms the carry-type select logic, which is used to select the ripple-carry chains, the fast-carry logic or the carry-save array. The ripple-carry chains sequentially connect all the LMs in a DLB, supporting N = 2 digit-serial circuits. The carry from a lower-order bit moves forward to the higher-order bit via the carry chain.

Register array

Our DS-FPGA is a register-rich architecture, which makes possible a high degree of pipelining, leading to increased performance. The register array contains a multiplexer to select the output, an edge-triggered D flip-flop and output drivers. In total, there are nine D flip-flops in each DLB that can be combined to form an 8-bit register. The eight multiplexers in front of the D flip-flops select either carry-save operation or direct inputs via configuration bits. The D flip-flops share a common clock (CK), clock enable (CE), and set/reset (SR). Alternatively, the inputs (Ri(3:0)) can be used as direct inputs to the registers, which are frequently used to implement shift registers. The final outputs SO(3:0) and CO(3:0) can be either the direct outputs from the multiplexers or the outputs from the D flip-flops.

Routing architecture

The routing framework contains predefined segmented wires in the vertical and horizontal directions, as shown in Fig. 7. The DS-FPGA has two groups of routing resources: internal routing resources and external routing resources.

Fig. 7. DS-FPGA routing architecture.

Internal routing is the routing between logic block pins, and it provides the rich, direct routing resources needed in digit-serial circuits. These lines provide very fast signal transmission with short delay. The connections between adjacent logic blocks, which occur frequently in digit-serial circuits, are implemented via direct lines without consuming any slow switch block. Feedback interconnect provides paths from the DLB's outputs (CO(3:0)) to its inputs (Ci) without consuming any external routing resources. A signal routed to one of the DLB pins is buffered at the input or output. The buffers at the DLB pins effectively isolate the drain capacitance of the pass-transistors from the routing segments, so the routing delay through the routing segments is greatly reduced.

External routing employs connection blocks and switch blocks to permit the interconnection of individual DLBs. Switch blocks are connected to single-length, double-length, quad-length and long-length line segments in four directions. Switch blocks provide connectivity with the routing segments using 2 programmable switches. Connection blocks provide connectivity between DLBs and routing segments using programmable switches. Pass-transistor switches add series resistance to DS-FPGA routing paths, resulting in long delays for long paths. Longer segmentation of interconnect lines has been used to address this issue. At very large device sizes, lines are heavily loaded, and the wire resistance slows down signals on the line. Signal propagation delay depends on the number of switches that the signal passes through.

4. Circuit-Level Design Issues for DLB

In this section we discuss the circuit-level design of the DLB. Throughout the discussion, various trade-offs among supply voltage, logic style and performance are evaluated.
The most important parameter controlling power consumption is the supply voltage, due to the squared term in the power consumption equation. 2 Thus, supply voltage reduction is the most effective way to reduce the power consumption. However, this method presents a tough challenge in the design of FPGAs, since most of these structures make extensive use of pass-transistor logic. The problem of supply voltage reduction is further exacerbated in processes that do not have low-threshold devices. In these processes, lowering the supply voltage below 2.5 V results in a dramatic loss of performance and even causes some circuits to malfunction. Therefore, such a supply voltage reduction requires new design methods for low-voltage and low-power integrated circuits. Circuit topologies that help reduce the supply voltage are discussed.

4.1. Impact of logic style

The logic style used in logic gates influences the speed, size, power dissipation, and wiring complexity of a circuit. The circuit delay is determined by the number of transistors in series, the transistor sizes (i.e., channel widths), and the intra- and inter-cell wiring capacitances. Circuit size depends on the number of transistors and their sizes and on the wiring complexity. Power dissipation is determined

by the switching activity and the node capacitances. All these characteristics may vary considerably from one logic style to another, which makes the proper choice of logic style crucial for circuit performance. Various investigations of logic styles with respect to low-power dissipation have recently been carried out and reported in the literature. 2,2 In these publications, CPL and related pass-transistor logic styles are promoted as low-power logic styles, because CPL gates have fewer transistors, smaller transistors and smaller capacitances, and are faster than gates in complementary CMOS. However, these circuits have a limited drive capability at a low supply voltage. Although the degraded signal level can still drive other circuits correctly at a high supply voltage, it cannot guarantee proper operation at a low supply voltage. Therefore, the threshold voltage loss must be alleviated, and a full voltage swing is needed to obtain a correct signal level at a low supply voltage.

Logic module circuit

XOR and MUX constitute the critical part of the logic module (LM) in the DLB. A 7-transistor XOR circuit 22 has been used to implement the LM circuit. Performance comparisons of several XOR circuits are presented in Ref. 23. The results show that the 7-transistor XOR performs much better than the CPL and complementary static CMOS XOR circuits. A double pass-transistor logic (DPL) MUX is used to improve circuit performance at reduced supply voltages. Because of the presence of both NMOS and PMOS devices, the output of the DPL MUX circuit has a full voltage swing and there is no static short-circuit current problem. The investigation results presented in Ref.
2 show that for all simple and complex logic gates such as two-input NAND (NAND2), two-input NOR (NOR2) and three-input and-or-invert (AOI), complementary static CMOS outperforms CPL and other pass-transistor logic styles with respect to circuit delay, power dissipation, power-delay product, and layout size. CMOS also shows the highest robustness and the smallest sensitivity to transistor and voltage scaling. This makes complementary CMOS the logic style of choice for low-power, low-voltage implementation of the LM circuit. However, other logic styles, such as pass-transistor XOR and MUX, are still viable candidates for low-power, high-speed implementation of the LM circuit. A transistor-level schematic diagram of the proposed LM is depicted in Fig. 8. The addition of the complementary transistor allows the circuit to operate at Vdd = 3.3 V without the threshold loss that plagued the NMOS-only pass-transistor design. The LM circuit has 25% less delay and 37% less power consumption than an LM circuit using static CMOS circuits. The proposed LM has good signal output levels and low power consumption at a low supply voltage (Vdd = 2 V). Table 4 shows the circuit delays of the LM obtained by HSPICE simulation.

Fast-carry logic circuit

We have designed the fast-carry logic circuit using compact static CMOS carry look-ahead (CLA) circuits. This fast-carry look-ahead circuit is based on transistor

sharing in multiple-output static CMOS complex gates to reduce the transistor count and improve the operation speed of the whole circuit. 24

Fig. 8. Transistor-level schematic of logic module.

Table 4. Path delays of LM circuit.
Voltage    A to P    A to G    A to SO    A to Carry
2. V       .7 ns     .53 ns    .58 ns     2.39 ns
3.3 V      .54 ns    .7 ns     .67 ns     .8 ns

We define $C_i$ as the carry of the $i$th stage, and $A_i$ and $B_i$ as the $i$th bits of the input data, with $C_{-1} = C_{in}$; then $C_i$ is expressed as

$$C_i = G_i + P_i C_{i-1}, \qquad G_i = A_i B_i, \qquad P_i = A_i \oplus B_i.$$

Expanding this yields

$$C_i = G_i + P_i G_{i-1} + P_i P_{i-1} G_{i-2} + \cdots + P_i P_{i-1} \cdots P_0 C_{in}.$$
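The expanded carry expression can be checked against the stage-by-stage recurrence with a small exhaustive test. The sketch below is a purely logical model and says nothing about the transistor-level circuit; the function names `lookahead_carries` and `ripple_carries` and the list-of-bits representation (index 0 is the least significant bit) are assumptions made for illustration.

```python
from itertools import product


def ripple_carries(a_bits, b_bits, c_in):
    """Carries from the recurrence C_i = G_i + P_i * C_(i-1), with C_(-1) = C_in."""
    carries = []
    carry = c_in
    for a, b in zip(a_bits, b_bits):
        carry = (a & b) | ((a ^ b) & carry)
        carries.append(carry)
    return carries


def lookahead_carries(a_bits, b_bits, c_in):
    """Carries from the expanded form, each computed directly from P, G and C_in."""
    g = [a & b for a, b in zip(a_bits, b_bits)]
    p = [a ^ b for a, b in zip(a_bits, b_bits)]
    carries = []
    for i in range(len(g)):
        term = g[i]            # G_i
        prefix = p[i]          # running product P_i * P_(i-1) * ...
        for j in range(i - 1, -1, -1):
            term |= prefix & g[j]
            prefix &= p[j]
        term |= prefix & c_in  # P_i * ... * P_0 * C_in
        carries.append(term)
    return carries


def forms_agree(width=4):
    """Compare both forms for every operand pair and carry-in of the given width."""
    for bits in product((0, 1), repeat=2 * width + 1):
        a_bits, b_bits, c_in = bits[:width], bits[width:2 * width], bits[-1]
        if ripple_carries(a_bits, b_bits, c_in) != lookahead_carries(a_bits, b_bits, c_in):
            return False
    return True


if __name__ == "__main__":
    print("expanded form matches recurrence:", forms_agree(4))
```

For a width of four, the check covers all 512 combinations of the two 4-bit operands and the carry-in.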

Practically, the number of look-ahead stages is limited to four, and the term $C_3$ is expressed as

$$C_3 = G_3 + P_3 \{ G_2 + P_2 [ G_1 + P_1 ( G_0 + P_0 C_{in} ) ] \}.$$

Figure 9(a) shows the carry chain of the 4-bit CLA block, which yields $C_0$, $C_1$ and $C_2$ as well as $C_3$. By inserting a PMOS transistor with gate input $P_i$ into the pull-up part, we can isolate the pull-up part of carry $C_i$ from the pull-up part of every carry $C_j$ ($j > i$), according to the logic redundancy method. In the pull-down part, however, $P_i$ and $G_i$ cannot both be high, because $P_i$ is the XOR and $G_i$ is the AND function of the input operands $A_i$ and $B_i$ for i = 1, 2 and 3; therefore, there is no discharging path to ground other than the original pull-down part of carry $C_i$, which makes the addition of redundant transistors unnecessary. The fast-carry logic circuit is composed of chains of MOSFETs serially connected between a power supply rail and the output of the subcircuit. These serially connected MOSFETs are a major source of delay and power dissipation; therefore, optimal sizing of these transistors is important in reducing the delay and power dissipation of these circuit structures. The channel width tapering method 25 has been used to reduce the delay and power dissipation of the serially connected MOSFET chains in the fast-carry logic circuit.

Glitch-free TSPC D flip-flop

The D flip-flop (D-FF) is used heavily throughout the DS-FPGA to make possible a high degree of pipelining, leading to increased performance. The D-FFs share a common clock (CK), clock enable (CE), and set/reset (SR), and they can be set/reset locally or globally. Enhancing the speed of the D-FFs can lead to a higher clock rate. Power dissipated in the clock distribution network has usually been a substantial part of the total system power consumption. Therefore, it is important to minimize the number of global clocks, as well as the gate capacitance associated with the clock nets.
To accomplish this, the true single-phase clocking (TSPC) methodology has been proposed, with the basic register shown in Ref. 26. The TSPC scheme has the inherent advantage that clock skew problems are restricted to the proper distribution of only one clock phase. It is shown in Ref. 27 that, although the proportion of power consumption due to glitching varies significantly with the particular circuit (from 9% to 38%), the hazard/glitch power consumption cannot be neglected in static CMOS circuits. Therefore, the glitch-free TSPC D-FF presented in Ref. 28 is used to implement the D-FF with clock, set/reset and clock enable in the DLB, as shown in Fig. 9(b). To ensure that the output does not discharge when y is high at evaluation, an NMOS transistor MN4, controlled by y_b, is inserted into the output stage of the conventional TSPC D-FF. Transistor size optimization also improves circuit speed by a factor of.5.8.

Fig. 9. (a) Fast-carry logic circuit and (b) glitch-free TSPC D flip-flop with clock, set/reset and clock enable.

5. Routing Architecture Design Issues

A key aspect in the design of an FPGA is its routing architecture, which comprises the resources that are used to interconnect the device's logic blocks. A large number of different routing architecture issues were investigated in Refs. 29 and 3. Several architectural and circuit design choices affect the amount of capacitance that will be charged and discharged within an FPGA design. The electrical design of the FPGA interconnect circuit was investigated in Ref. 3. The number of metal layers available in the selected process technology influences the final area and power of an FPGA array. In addition, the sizing and number of switches within the interconnect fabric can also seriously affect power. Lastly, capacitance is affected by the connection of the basic logic cell to the surrounding interconnect and to the neighboring cells.

Given that interconnect wiring is a crucial resource in an FPGA, reducing the intrinsic wiring capacitance by using upper layers of metal can help lower net capacitances. However, the necessity of reaching the active layers to insert switches leads to the addition of several contacts and vias to lower levels. Another important design consideration that affects the resulting FPGA capacitance is the size and number of switches present on each interconnect segment. The interconnect network consists of programmable switches that are organized in connection blocks and switch blocks. The performance of FPGAs is mainly limited by the delay through the interconnect's programmable switches. This delay increases quadratically with the number of series switches and linearly with the number of switches loading each node. It is especially a problem when the programmable switches are implemented using MOS transistors, since these have an appreciable resistance and capacitance.
An FPGA has higher signal delay because of (a) the channel resistance of the pass-transistors connecting segments of wire, (b) the parasitic capacitance of the off transistors, and (c) branches of extra wire that are not on the source-to-sink path. An approximate discrete analysis of a line modeled as n cascaded RC sections (of equal R and C) yields

$$t_n = \frac{RC\,n(n+1)}{2}.$$

As can be seen from this equation, the total delay depends quadratically on the number of sections and linearly on the resistance and the capacitance of each section. The accumulation of quadratic delay can be limited by inserting repeaters that consist of pairs of unidirectional tristate buffers. A tradeoff for the switch size must be reached in order to obtain a good result. Given the projected delay and the prospective load to be driven, a tradeoff can be found between the buffer strength and the CMOS switch size in terms of required speed and area. This tradeoff tries to minimize not only the total delay from the beginning of the interconnection to the end of the multi-switch line, but also the local delays from the beginning of the line to each intermediate point after an interconnection element. As a consequence, the average delay is reduced and a significant area reduction is achieved.

This section discusses the tradeoff of the NMOS pass-transistor switch size used for programmable interconnections as a function of the buffer strength and the desired delay for a given load. The repeater interval and the optimum buffer size in the interconnection network are determined for high speed.

5.1. NMOS pass-transistor switch sizing

SRAM-based FPGAs normally use NMOS pass-transistors to implement routing switches, and this kind of switch has significant series resistance and parasitic capacitance. The sizing and implementation of switches throughout the FPGA interconnect has a large impact on the FPGA's power consumption and speed. As the switch size increases, there is a slow decrease in delay and a nearly linear increase in energy. To account for the wiring capacitance component, a worst-case estimate of wiring parasitics was made and added to each node. A series of simulations was performed to assess more accurately the tradeoffs in switch size and implementation using the interconnection network shown in Fig. 10. Figure 11 shows the measured delay and energy-delay product from the input port to the node point after 2 switching sections. Figure 11(a) shows that increasing the NMOS pass-transistor switch width from 1.2 to 3.3 µm significantly reduces delay, but there are diminishing returns beyond that point. The delay flattens out as transistor size increases because, although resistance drops as the transistor size is increased, parasitic capacitance increases.
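The two trends described above, quadratic growth of delay with the number of series sections and the flattening of delay as the switch is widened, can be illustrated with a rough first-order model. The sketch below is not the HSPICE setup used in the paper: the function names, the per-unit-width resistance `r_unit`, the per-unit-width parasitic capacitance `c_par_unit` and the wire capacitance `c_wire` are all hypothetical values chosen only to show the shape of the curves.

```python
def ladder_delay(n, r, c):
    """Delay of n cascaded RC sections with equal R and C: t_n = R*C*n*(n+1)/2."""
    return r * c * n * (n + 1) / 2.0


def section_rc(width_um, r_unit=10e3, c_par_unit=2e-15, c_wire=6e-15):
    """Hypothetical per-section R (ohm) and C (farad) for a given switch width.

    Assumed first-order scaling: resistance falls as 1/W, the switch's
    parasitic capacitance grows as W, and the wire capacitance is fixed.
    """
    r = r_unit / width_um
    c = c_wire + c_par_unit * width_um
    return r, c


if __name__ == "__main__":
    # Quadratic growth with the number of series sections (fixed width).
    r, c = section_rc(3.3)
    for n in (1, 2, 4, 8):
        print(f"n={n:2d}  t_n={ladder_delay(n, r, c) * 1e9:.3f} ns")

    # Diminishing returns as the switch is widened (fixed number of sections).
    for w in (0.9, 1.8, 3.3, 6.6, 13.2):
        r, c = section_rc(w)
        print(f"W={w:5.1f} um  t_6={ladder_delay(6, r, c) * 1e9:.3f} ns")
```

In this model the per-section product RC tends to the constant `r_unit * c_par_unit` as the width grows, which is one simple way to see why widening the switch eventually stops paying off.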
Assuming a wiring capacitance of 6 fF, a switch size of approximately 3.3 µm is optimal for pass-transistor interconnect with a high fanout.

5.2. Repeater interval and buffer optimum size

It is well known that an NMOS pass-transistor can transmit a logic 0 completely, but it performs poorly when transmitting a logic 1. In the latter case, one incurs a voltage drop of $V_{tn}$, where $V_{tn}$ is the threshold voltage of the NMOS transistor. The NMOS pass-transistor is effective at pulling low, so the inverter's PMOS transistor is fully turned on, giving a solid low-to-high transition. The maximum voltage that can be passed through the NMOS transistor is $V_{dd} - V_{tn}$. Since this degraded $V_{dd} - V_{tn}$ level cannot fully turn on the NMOS transistor in the inverter, the fall time is longer than the rise time. To achieve the same rise time and fall time at the buffer output, we need to increase the size of the NMOS transistor in the first stage of the buffer. Figure 12(a) shows the tristate buffer optimized for equal rise and fall times.

Figure 10 shows an interconnection network with tristate buffer repeaters. The repeater interval k is defined as the number of switches between two nodes that contain repeaters. Figure 12(b) compares the simulated propagation delay through 36 switch sections for different values of the repeater interval k. The curve shows the average propagation delay for a chain that uses the tristate buffer repeater, which was optimized for minimum delay with equal rising and falling delay. Increasing the repeater interval from 2 to 6 significantly reduces delay, but there are diminishing returns beyond that point. Above a certain repeater interval, the signal is no longer transmitted from the input to the node point after 36 switch sections. Therefore, we can obtain high speed in the interconnect network with a repeater interval of 6.

Fig. 10. Interconnect network using pairs of unidirectional tristate buffers.

Fig. 11. (a) Delay (ns) versus switch size and (b) energy-delay product versus switch size (width in µm, length = 0.6 µm) through 2 switches (repeater interval = 6).

Fig. 12. (a) Tristate buffer for repeater and (b) delay (ns) through 36 loaded switch sections versus repeater interval k.

Figure 13 shows the results of an experiment to measure the effect that varying the channel width of the switch and buffer has on the speed of the interconnection network. The results show that increasing the switch width up to 3.3 µm significantly reduces delay, but increasing the buffer width yields no large delay reduction. Figure 14 shows the delay models of the DS-FPGA routing structure using the optimum-size pass-transistor switch and buffers.

Fig. 13. Delay (ns) as a function of the switch width (µm) and buffer width (µm) (length = 0.6 µm).

We have tried to identify the tradeoff of the NMOS pass-transistor switch size used for programmable interconnections as a function of the buffer strength and the desired delay for a given load. A switch size of approximately 3.3 µm is optimal for pass-transistor interconnect with a high fanout. The repeater interval in the interconnection network was determined for high speed. A tradeoff can be reached that minimizes the area penalty and the average delay without practically increasing the overall delay.
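The existence of a best repeater interval can be made plausible with a simple calculation: each repeater adds a fixed buffer delay, while each unbuffered run of k switches contributes a delay that grows as k(k+1)/2. The following Python sketch sweeps k for a chain of 36 sections under these assumptions. The names `segment_delay`, `chain_delay` and `best_interval` and the values of R, C and the buffer delay are hypothetical. The model also ignores the threshold-voltage loss that prevents the signal from propagating at large k, so it is only a rough aid to intuition and not a substitute for the HSPICE results.

```python
def segment_delay(k, r, c):
    """RC-ladder delay of k series switch sections between two repeaters."""
    return r * c * k * (k + 1) / 2.0


def chain_delay(total_sections, k, r, c, t_buf):
    """Delay of a chain with one repeater (delay t_buf) after every k sections."""
    full, rest = divmod(total_sections, k)
    delay = full * (segment_delay(k, r, c) + t_buf)
    if rest:
        # A shorter final segment when k does not divide the chain length.
        delay += segment_delay(rest, r, c) + t_buf
    return delay


def best_interval(total_sections, r, c, t_buf, candidates):
    """Candidate repeater interval with the smallest modeled chain delay."""
    return min(candidates,
               key=lambda k: chain_delay(total_sections, k, r, c, t_buf))


if __name__ == "__main__":
    # Hypothetical per-section R, C and buffer delay (not taken from the paper).
    R, C, T_BUF = 3e3, 12e-15, 0.8e-9
    for k in (1, 2, 3, 4, 6, 9, 12, 18, 36):
        print(f"k={k:2d}  delay={chain_delay(36, k, R, C, T_BUF) * 1e9:.2f} ns")
```

Small intervals are dominated by the accumulated buffer delays and large intervals by the quadratic RC term, so the modeled delay has a minimum at an intermediate k.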

Fig. 14. Delay model of DS-FPGA routing structure.

6. Physical Design and Fabrication

6.1. Area and speed

The proposed DS-FPGA cell has been implemented as a full-custom VLSI design in order to extract its physical characteristics. The custom layout of the DS-FPGA cell was done in a 0.5 µm Hewlett-Packard (HP) CMOS process with three metal layers. Figure 15(a) shows the floorplan of the DS-FPGA tile and its major building blocks.

Fig. 15. (a) Floorplan of DS-FPGA tile and (b) layout of DS-FPGA prototype chip.

The major components include the LB, connection block and switch block. The area of the LB core is 230 µm × 210 µm and the area of each tile (which contains the LB, connection block and switch block) is 600 µm × 420 µm = 252,000 µm². The LB core makes up 19% of the total area, while the routing resources make up the remaining 81%. In particular, the routing resources such as the connection block and switch block take up significant area.

We used HSPICE to measure the circuit and routing track delays for the proposed DS-FPGA cell and to verify the functionality of our layout. Table 5 shows the circuit and routing track delays for the proposed DS-FPGA cell at $V_{dd}$ = 3.3 V. The routing track delays have been estimated from HSPICE simulation. These results are used to determine the speed of digit-serial arithmetic circuits implemented using the proposed DS-FPGA cells. The critical path delay between the input and output pins of an LB, including direct-connection, is 6. ns.

Table 5. Circuit delays of DS-FPGA cell measured from HSPICE simulation.

Path                                                                        Delay (ns)
Combinational logic
  LM inputs → LB outputs (SO(3:0)) (without D-FF)                           2.9
  LM inputs → LB outputs (SO(3:0)) (with D-FF)                              3.86
  LM inputs → LM output (cm)                                                .85
  LM inputs → LM output (P)                                                 .54
  LM inputs → 2 th LM outputs (cm) (ripple-carry mode)                      3.53
LB fast-carry logic
  P() and Ci → fast-carry logic output (C)                                  .42
  P() and Ci → fast-carry logic output (C2)                                 .44
  P() and Ci → fast-carry logic output (C3)                                 .47
Sequential delays
  Fast-carry logic output → D-FF output → LB output                         2.57
Routing track delays
  Two connection blocks and four switch blocks
  including buffers (spanning 4 LBs)                                        2.76
  Four switch blocks including buffered switch (spanning 4 LBs)             2.38
Path delays
  N = 4 path delay (A → CO3)                                                4.33
  N = 2 critical path delay (using ripple-carry logic) (A → SO)             6.
  N = 3 critical path delay (using fast-carry logic) (A → SO2)              6.6
  N = 4 critical path delay (using fast-carry logic) (A → SO3)              6.

6.2. Fabrication results

A prototype DS-FPGA chip has been fabricated using the 0.5 µm HP CMOS process with three metal layers, and the total layout is shown in Fig. 15(b). It contains only four tiles, which were enough to build the digit-serial adder and digit-serial multiplier. This chip has 40 pads, including four power and ground pads, 28 signal pins into the DS-FPGA core, and eight programming pins. From this chip, we were able to make delay measurements that included one part of the logic block. Based on these measurements, the pad-to-pad delay through part of the critical path of the LB and two switch blocks was 9 ns, giving a delay of about 7.5 ns when the pad delay is eliminated. Consideration of the propagation delays for the DS-FPGA suggests that digit-level pipelined digit-serial multipliers with throughput as fast as 5 MHz may be achieved.

7. Digit-serial Datapath Circuit Implementations on DS-FPGA

In this section, we give an overview of the methodology used for technology mapping of digit-serial circuits into the DS-FPGA. The DS-FPGA provides substantial support for implementing digit-serial arithmetic building blocks for digital systems such as FIR filters, DCT circuits and similar computationally intensive structures. Figure 16 shows how the LB can be used in various ways. Digit-serial arithmetic modules can be implemented using LBs in the DS-FPGA. Each N = 4 digit-serial arithmetic module is typically implemented using 1 to 2 LBs with a logic depth of 1 or 2 LBs, which leads to high clock frequency operation. Each digit-serial arithmetic module consists of a single cell, as in adders and subtractors, or multiple cells proportional to the word length, as in digit-serial multipliers and registers.

A single LB can be used to implement an unsigned N = 4 digit-serial multiplier module or a two's complement N = 4 digit-serial multiplier module, as shown in Figs. 16(a) and 16(b). The outputs can be registered for pipelining; otherwise the D-FFs are available for independent use, bypassing the logic modules. X(3:0) and Ybit(1:0) are the 4-bit and 2-bit data inputs of the digit-serial multiplier modules. PI(3:0) and PO(3:0) are the input and output partial products. As shown in Figs. 16(a) and 16(b), the four AND gates and the 4-bit adder that are required for an N = 4 digit-serial multiplier module are contained in one LB. We can use the fast-carry logic for N = 4 digit-serial circuits to increase speed.
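As a rough behavioral illustration of the N = 4 multiplier module described above (four AND gates feeding a 4-bit adder), the following Python sketch models one module at the bit-vector level. It is a simplification and not the authors' circuit. The function names `mult_module` and `stacked_module`, the use of a single Y bit to gate all four AND gates, and the explicit carry-in and carry-out arguments are assumptions made for illustration. The D-FFs, the second Ybit input and two's complement sign handling are not modeled.

```python
def mult_module(x, ybit, pi, ci=0, width=4):
    """Behavioral model of one digit-serial multiplier module (assumed structure).

    x    : `width`-bit multiplicand digit
    ybit : single multiplier bit that gates the AND gates
    pi   : `width`-bit incoming partial product
    ci   : carry input
    Returns (po, co): the `width`-bit outgoing partial product and the carry output.
    """
    mask = (1 << width) - 1
    # AND gates: every bit of x is ANDed with ybit.
    pp = (x & mask) if (ybit & 1) else 0
    # Adder: gated multiplicand + incoming partial product + carry input.
    total = pp + (pi & mask) + (ci & 1)
    return total & mask, total >> width


def stacked_module(x, ybit, pi, width=4, stages=2):
    """Chain several modules through their carry ports to widen the operands."""
    mask = (1 << width) - 1
    po, carry = 0, 0
    for s in range(stages):
        xs = (x >> (s * width)) & mask
        ps = (pi >> (s * width)) & mask
        digit, carry = mult_module(xs, ybit, ps, carry, width)
        po |= digit << (s * width)
    return po, carry
```

For instance, `mult_module(0b1011, 1, 0b0110)` adds 11 and 6, so the 4-bit output is 1 with a carry out of 1. Passing the carry of one call into the next, as `stacked_module` does, mimics connecting the carry output of one block to the carry input of another.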
The digit-serial multiplier modules can easily be stacked together to form deeper and/or wider multipliers. For example, to implement an N = 8 digit-serial multiplier module, the carry output (CO3) of one LB is connected to the carry input (Ci) of the next LB, thus requiring two LBs. A single LB can be used to implement two N = 2 digit-serial multiplier modules using the ripple-carry chain of the LB, as shown in Fig. 16(c), or four independent full adders/subtractors, as in a row of a carry-save array, as shown in Fig. 16(d). Each logic module in an LB can also be configured to implement random logic without wasting device resources. We can configure each logic module to implement the following random logic gates: AND2, NAND2, OR2, NOR2, XOR2, XNOR2, MUX2, MUX, etc., as shown in Table 2. If an LB is used to implement bit-serial circuits, one LM can be used to implement the bit-serial circuit and the remaining LMs can be used for random logic gates.

Fig. 16. (a) Unsigned digit-level pipelined N = 4, (b) two's complement N = 4, (c) unsigned N = 2 and (d) bit-level pipelined N = 4 digit-serial multiplier module implementations onto LB.

We have designed some digit-serial multipliers using LBs in the DS-FPGA, as shown in Fig. 17. The digit-serial multipliers have been designed so that routing is kept regular and well-organized. An unsigned digit-level pipelined digit-serial multiplier implementation using LBs is shown in Fig. 17(a). Each block is replaced by the digit-serial multiplier module shown in Fig. 16(a). In order to increase the throughput of the digit-serial multiplier, the architecture is pipelined at the digit level. In the example shown, the pipelining limits the propagation to a 4-bit adder in the N = 4 digit-serial multiplier. The partial products presented to the shifting accumulator are generated by the logical AND of the input serial bit with each bit of the parallel input. However, the critical path delay of the unsigned digit-level pipelined N = 4 digit-serial multiplier is (AND + 4 × full adder + D-FF) delay. Reduction of the critical path delay below this value is not possible because of the presence of feedback loops.

Fig. 17. (a) Implementation of unsigned digit-level pipelined digit-serial multiplier and (b) digit-cell for unsigned bit-level pipelined N = 4 digit-serial multipliers using LB.

The critical path of this N = 4 digit-serial multiplier using the DS-FPGA is found to be $(T_{LM} + T_{Fast} + T_{FF} + T_r)$. Here, $T_{LM}$ represents the propagation delay of the LM within the LBs, and $T_{Fast}$ and $T_{FF}$ represent, respectively, the propagation delays of the fast-carry logic and the flip-flop within the LBs. $T_r$ is the delay incurred in the routing between LBs. Since the proposed DS-FPGA can implement significant digit-serial arithmetic functions within a single LB, routing is only required to support the implementation of wide operand structures. Finally, the critical path for the overall digit-serial multiplier has been determined in terms of the worst critical path associated with the constituent multiplier module. Therefore, the maximum possible sampling frequency of the N = 4 digit-serial multiplier is

$$f = \frac{4}{W\,(T_{LM} + T_{Fast} + T_{FF} + T_r)},$$

where W denotes the word length, so that each sample takes W/4 clock cycles.

The unsigned bit-level pipelined digit-serial multiplier contains digit-cells, a digit-serial 3:2 compressor adder and a digit-serial adder.
A digit-cell can be configured using six LBs, as shown in Fig. 17(b), and a simple digit-serial 3:2 compressor adder can first be used to reduce the digit-cell output digits to two digits. A digit-serial adder is then used to add these two digits to generate the final digit-serial outputs. However, when the design is mapped onto the DS-FPGA, the critical path cannot be reduced below that of an N = 4 digit-serial adder because of the feedback loop in the final digit-serial adder. The resulting critical path of the bit-level pipelined digit-serial multipliers would be $(2T_{LM} + T_{FF} + T_r)$.

8. Results

To evaluate the advantages of the DS-FPGA, we need to compare the area and speed efficiency of the DS-FPGA architecture with that of general-purpose FPGAs. To determine the number of FPGA logic blocks needed to implement a circuit, we have mapped several digit-serial DSP architectures onto the DS-FPGA using direct hand-mapping, which ensures the most efficient logic usage, and then estimated the silicon area in each case. The area cost of an FPGA is estimated by the number of logic blocks required to implement a digit-serial DSP architecture.
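For speed comparisons of this kind, the critical-path expressions of Sec. 7 can be turned into a quick throughput estimate. The sketch below evaluates the clock period of the digit-level and bit-level pipelined multipliers and the sampling frequency that results when a W-bit word is processed N bits per clock cycle. The function names are illustrative, and the delay values and word length in the usage example are placeholders, not figures from Table 5.

```python
def digit_level_period(t_lm, t_fast, t_ff, t_r):
    """Critical path of the digit-level pipelined multiplier: T_LM + T_Fast + T_FF + T_r."""
    return t_lm + t_fast + t_ff + t_r


def bit_level_period(t_lm, t_ff, t_r):
    """Critical path of the bit-level pipelined multiplier: 2*T_LM + T_FF + T_r."""
    return 2.0 * t_lm + t_ff + t_r


def sampling_frequency(period_s, word_length, digit_size=4):
    """Samples per second when a word of `word_length` bits is processed
    `digit_size` bits per clock cycle: f = N / (W * T_clk)."""
    cycles_per_word = word_length / digit_size
    return 1.0 / (cycles_per_word * period_s)


if __name__ == "__main__":
    # Placeholder delays in seconds (hypothetical, for illustration only).
    t_lm, t_fast, t_ff, t_r = 2e-9, 0.5e-9, 1e-9, 2.5e-9
    for name, period in (
        ("digit-level", digit_level_period(t_lm, t_fast, t_ff, t_r)),
        ("bit-level", bit_level_period(t_lm, t_ff, t_r)),
    ):
        f = sampling_frequency(period, word_length=16)
        print(f"{name:12s} period={period * 1e9:.2f} ns  f={f / 1e6:.2f} MHz")
```

Which of the two schemes gives the shorter period depends on how the fast-carry delay compares with the logic module delay, so the comparison has to be made with measured values for a given technology.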
