Optimization and Modeling of FPGA Circuitry in Advanced Process Technology. Charles Chiasson

Optimization and Modeling of FPGA Circuitry in Advanced Process Technology by Charles Chiasson A thesis submitted in conformity with the requirements for the degree of Master of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto Copyright 2013 by Charles Chiasson

Abstract

We develop a new fully-automated transistor sizing tool for FPGAs that features area, delay and wire load modeling enhancements over prior work to improve its accuracy in advanced process nodes. We then use this tool to investigate a number of FPGA circuit design related questions in a 22nm process. We find that building FPGAs out of transmission gates instead of the currently dominant pass-transistors, whose performance and reliability are degrading with technology scaling, yields FPGAs that are 15% larger but are 10-25% faster depending on the allowable level of gate boosting. We also show that transmission gate FPGAs with a separate power supply for their gate terminal enable a low-voltage FPGA with 50% less power and good delay. Finally, we show that, at a possible cost in routability, restricting the portion of a routing channel that can be accessed by a logic block input can improve delay by 17%.

Acknowledgements

First, I would like to express my sincerest gratitude to my supervisor Vaughn Betz for his guidance and motivation, for his technical help and for the tidbits of wisdom that he shared with me, knowingly or unknowingly, over the past two years. I learned so much in so little time and cannot imagine having had a better mentor. I also extend thanks to the other graduate students in Vaughn Betz's research group for all their help and support. Also, thanks for the lunch outings, the coffee breaks and the squash matches, among other things, that provided those much needed distractions. I would like to thank the Natural Sciences and Engineering Research Council of Canada, Altera Corporation and the University of Toronto for their financial support. Thanks also go to CMC Microsystems for providing the CAD tools used throughout this research. I would also like to thank David Lewis from Altera Corporation for the insightful discussions. Finally, thanks must undoubtedly go to my parents for nurturing my inherent desire to know why, for making me love the smell of new books, and simply, for being the best parents a kid could ask for. All of my accomplishments are certainly due to them.

Contents

1 Introduction
  1.1 Motivation
  1.2 Thesis Organization
2 Background
  2.1 FPGA Architecture
    2.1.1 Logic Block Architecture
    2.1.2 Routing Architecture
    2.1.3 Commercial BLE Architectures
  2.2 FPGA Architecture Assessment Methodology
  2.3 FPGA Circuit Design
    2.3.1 SRAM cells
    2.3.2 Routing Multiplexers
    2.3.3 Lookup Tables
    2.3.4 Flip-Flops
  2.4 Modeling of FPGA Circuitry
    2.4.1 Area Modeling
    2.4.2 Delay Modeling
  2.5 Automated Transistor Sizing
3 COFFE: Automated Optimization of FPGA Circuitry
  3.1 Introduction to COFFE
  3.2 Architecture
  3.3 Circuit Topologies
  3.4 Area Modeling
  3.5 Delay Modeling
    3.5.1 Non-Linearity of Transistor Resistance and Capacitance
    3.5.2 Topology Dependence of Transistor Resistance
  3.6 Wire Load Modeling
  3.7 Transistor Sizing Algorithm
    3.7.1 Divide-and-Conquer
    3.7.2 Pre-Determined P/N Ratios
    3.7.3 Detailed Algorithm
  3.8 Impact of Improved Wire Load Modeling
    3.8.1 Base Architecture
    3.8.2 Target Process Technology
    3.8.3 Results
  3.9 Integration of COFFE with VPR
4 Efficient FPGA Circuitry
  4.1 Fc_out for Single-Driver Routing and Multiple BLE Outputs
  4.2 Transmission Gate FPGAs
    4.2.1 Pass-Transistor Scaling Challenges
    4.2.2 Replacing Pass-Transistors with Transmission Gates
    4.2.3 Gate-Boosting Strategy
    4.2.4 Methodology
    4.2.5 Results
    4.2.6 Area and Delay Breakdown
  4.3 Separating V_DD and V_G for Low-Power FPGAs
  4.4 Track-Access Locality
5 Conclusions and Future Work
  5.1 Summary
  5.2 Future Work
A N-well Sharing Sample Layout
B FPGA Circuitry Schematics
C Detailed Transistor Sizing Results
D Area and Delay Breakdown
Bibliography

List of Tables

3.1 COFFE's expected input architecture parameters.
3.2 Resistance of a 4× minimum-width NMOS transistor for different circuit topologies (Figure 3.9) and switching-thresholds.
3.3 Rise-fall re-balancing and the effect of M on COFFE's transistor sizing solutions (example).
3.4 Base architecture parameters.
3.5 Subcircuit count per tile for base architecture.
3.6 Metal layer data used by COFFE for all circuit design investigations (ITRS [19]).
3.7 Impact of wire loading.
4.1 Effect of Fc_out on channel width and switch block multiplexers.
4.2 Area and delay for different Fc_out values.
4.3 Pass-transistor and transmission gate FPGA tile area for different levels of gate boosting.
4.4 Switch block multiplexer transistor sizes for PT and TG implementations for different levels of gate boosting (see Figure 4.2 for transistor labels). Note that with the exception of P/N ratios, COFFE uses integer granularity.
4.5 Pass-transistor and transmission gate FPGA critical path delay for different levels of gate boosting (VTR benchmarks).
4.6 Pass-transistor and transmission gate FPGA area-delay product for different levels of gate boosting (VTR benchmarks).
4.7 Pass-transistor and transmission gate FPGA relative dynamic power for different levels of gate boosting (VTR benchmarks).
4.8 Effect of cluster output track-access locality on area and delay. Input track-access span is set to 0.5.
4.9 Effect of cluster input track-access locality on area and delay. Output track-access span is set to 0.25.
C.1 Lookup table transistor sizes.
C.2 Switch block multiplexer transistor sizes.
C.3 Connection block multiplexer transistor sizes.
C.4 Local routing multiplexer transistor sizes. Note: We don't give a size for buf2 of the local routing multiplexer as it is replaced by the LUT input driver of Figure B.2.
C.5 BLE output to local interconnect.
C.6 BLE output to general routing.
C.7 Flip-flop and register selection multiplexer transistor sizes.
C.8 LUT input driver A.
C.9 LUT input driver B.
C.10 LUT input driver C with register feedback multiplexer (Figure B.3).
C.11 LUT input driver D.
C.12 LUT input driver E.
C.13 LUT input driver F.
D.1 Tile area breakdown.
D.2 Critical path delay breakdown.

List of Figures

1.1 Architecture exploration with manual (a) and automated (b) transistor-level design.
2.1 Tile-based FPGA.
2.2 Basic logic element (BLE).
2.3 Logic cluster architecture.
2.4 Routing segment lengths.
2.5 Multi-driver and single-driver routing architectures.
2.6 FPGA architecture assessment methodology with VPR.
2.7 Six transistor SRAM cell.
2.8 Different 8:1 pass-transistor multiplexer topologies.
2.9 Multiplexer followed by two-stage buffer with PMOS level-restorer.
2.10 Fully encoded MUX tree 3-LUT.
2.11 Minimum-width transistor area model.
2.12 Increasing drive strength with diffusion widening (b) or parallel diffusion regions (c). Note: Although not shown in the figure for simplicity, parallel diffusions must be connected together.
3.1 FPGA design flow.
3.2 COFFE's supported tile architecture.
3.3 COFFE's routing multiplexer circuit topologies.
3.4 Fully encoded MUX tree 6-LUT with internal re-buffering (partial view).
3.5 Static transmission gate-based master-slave register.
3.6 Transistor area prediction accuracy of original (Eq. 2.2) and improved (Eq. 3.1) area models against TSMC 65nm layouts.
3.7 Combining diffusion widening and parallel diffusion regions yields denser layouts (c).
3.8 A switch-level model.
3.9 Circuits used to measure transistor resistance.
3.10 Inverter NMOS and PMOS resistivity vs. transistor width.
3.11 COFFE's transistor sizing algorithm.
4.1 V_DD and V_Th scaling trends [12].
4.2 Generic two-level routing multiplexer with two-stage buffer implemented with pass-transistors (a) and transmission gates (b).
4.3 Effect of different gate boosting strategies on transmission gate switch block multiplexer delay (V_DD = 0.8 V).
4.4 CAD flow for each FPGA.
4.5 Tile area and critical path delay breakdown.
4.6 Critical path delay for pass-transistor (PT) and transmission gate (TG) FPGAs for different V_DD and V_G.
4.7 Dynamic power for pass-transistor (PT) and transmission gate (TG) FPGAs for different V_DD and V_G.
4.8 Power-delay product for pass-transistor (PT) and transmission gate (TG) FPGAs for different V_DD and V_G.
4.9 Cluster output wire load for different locality.
4.10 Cluster input wire load for different locality.
A.1 A single-level 4:1 pass-transistor multiplexer with two-stage buffer and level restorer.
A.2 Sample multiplexer layout with N-well sharing.
B.1 6-LUT.
B.2 LUT input driver.
B.3 LUT input driver with register feedback multiplexer.
B.4 Two-level multiplexer used for switch block, connection block and local routing multiplexers.
B.5 2:1 multiplexer used for BLE outputs.
B.6 Flip-flop with register input selection multiplexer.

Chapter 1 Introduction

1.1 Motivation

The design and fabrication of modern digital integrated circuits costs tens to hundreds of millions of dollars and requires large teams of engineers and years of effort. Indeed, the cost of developing a new 20nm chip has been estimated to be as high as $160 million USD [1]. While this may be acceptable for high-volume applications, it can be a significant burden for lower-volume designs, often preventing them from being fabricated in the latest process technologies.

Instead of being fabricated as a custom chip such as a standard cell-based application-specific integrated circuit (ASIC) or a full custom design, a digital design can be implemented in a field-programmable gate array (FPGA). FPGAs are pre-fabricated, programmable devices into which one can implement any arbitrary digital design in a matter of seconds. Therefore, FPGAs are an attractive alternative to ASICs or full custom designs because they allow the high non-recurring engineering costs and lengthy design times associated with semiconductor manufacturing to be completely avoided. However, digital designs that require high density, high performance or low power might not find FPGAs as attractive. It has been shown that FPGAs require 35× more silicon area, are 4× slower and consume 14× more dynamic power than ASICs [25]. Accordingly, minimizing the FPGA-to-ASIC gap, that is, making FPGAs as efficient as possible such that they become a competitive implementation medium for all types of applications, is one of the primary drivers of FPGA research for both academic researchers and commercial FPGA manufacturers.

The area, performance and power characteristics of an FPGA can be optimized at two main levels: architecture and transistor-level design. The architecture of an FPGA is defined by a number of parameters that describe the style and flexibility of its soft-logic blocks, dedicated hard-blocks and interconnect.
Finding an architecture that meets specific design goals and constraints involves setting these architectural parameters to specific values. However, these parameters interact in complex ways to produce area, delay and power trade-offs that are very difficult to quantify through analytical methods. For that reason, finding the right architectural parameter values is usually accomplished experimentally with automated architecture exploration tools such as VPR [7].

[Figure 1.1: Architecture exploration with manual (a) and automated (b) transistor-level design. Both panels show the steps: initial architecture parameters; transistor-level design (manual in (a), automated in (b)); evaluate architecture; change architecture parameters.]

For any architecture, there are a number of different transistor-level implementations. Transistor-level design consists of choosing circuit topologies for an architecture as well as sizing the transistors of those circuits. Both circuit topology selection and transistor sizing provide opportunities to optimize the area, delay and power of the architecture. In prior FPGA research work, transistor-level design was often performed manually, making it a task that required a significant amount of time and effort. This often had a negative impact on the architecture exploration flow, which would proceed as follows. Manual transistor-level design would be performed on some initial architecture. Then, this architecture would be assessed with an architecture exploration tool such as VPR. Based on the results of the assessment, the architecture parameters would be adjusted and the evaluation process would be repeated. Ideally, one would then re-optimize the transistor-level design to match the new architecture parameters. However, since manual transistor-level design was such a time and effort intensive task, this step would often be skipped. It was assumed that transistor sizes obtained with a previous architecture still applied to the new architecture and this new architecture was then evaluated without re-optimizing its transistor-level design. This architecture exploration flow is illustrated in Figure 1.1a. The new architecture could likely be made more efficient if its transistor sizes were re-optimized. As well, the detailed impact of new wire loads as the architecture and its area changed has often not been rigorously modeled, possibly leading to inaccurate architecture conclusions. In an environment where FPGAs need to be as efficient as possible to compete with ASICs, new architectures should be evaluated in their most efficient state.
It follows that re-optimizing the transistor sizes as the FPGA architecture is changed provides a more thorough design space exploration and should yield more efficient FPGAs. Automating the transistor-level design of FPGAs enables such frequent re-optimization (Figure 1.1b). In addition, an automated transistor-level design tool facilitates investigations relating to efficient FPGA circuitry. For example, an automated transistor-level design tool could be used to explore the impact of different circuit topologies or the impact of different layout choices on the area, delay and power of an FPGA. This thesis consists of two parts. In the first, we develop COFFE (Circuit Optimization For FPGA Exploration), a new fully-automated transistor sizing tool for FPGAs. Although an FPGA-specific transistor sizing tool has been developed in prior work [24], we have made significant improvements that are necessary in advanced process nodes. In the second part of this thesis, we use COFFE to investigate a number of circuit design related questions in advanced process technology.

1.2 Thesis Organization

This thesis is organized as follows. Chapter 2 provides background information on FPGA architecture, circuit design, modeling and optimization. Chapter 3 describes COFFE, a fully-automated transistor sizing tool for FPGAs developed as part of this thesis, as well as our area and delay modeling enhancements. A number of FPGA circuit design investigations are performed with COFFE in Chapter 4. Finally, Chapter 5 concludes this thesis and suggests future work.

Chapter 2 Background

This thesis is focused on the transistor-level design of SRAM-based FPGAs and related computer-aided design (CAD) tools. We develop a fully-automated transistor sizing tool for FPGAs in Chapter 3 and use it to investigate a number of FPGA circuit design related questions in Chapter 4. This chapter provides relevant background material. First, we review FPGA architecture and the standard FPGA architecture assessment methodology. Then, we describe common practices in FPGA circuit design as well as commonly used area and delay modeling techniques for these circuits. Finally, we review prior work on automated transistor sizing.

2.1 FPGA Architecture

An FPGA consists of an array of tiles that can each implement a small amount of logic and routing. Horizontal and vertical routing channels run on top of the tiles and allow them to be stitched together to perform larger functions. Figure 2.1 illustrates FPGA tile architecture at a high-level. A logic block (LB) supplies the tile's logic functionality. Connection blocks (CBs) provide connectivity between logic block inputs and routing channels. A switch block (SB) connects logic block outputs to routing channels and provides connectivity between wires within the routing channels. One replicates this basic tile to obtain a complete FPGA. Although Figure 2.1 shows logic and switching functions as distinct sub-blocks, an interleaved layout is more realistic and is what we assume throughout this work.

The FPGA architecture described in the previous paragraph represents a generic soft-logic-based FPGA. Modern FPGAs are more heterogeneous. That is, in addition to general purpose soft-logic blocks, they also contain dedicated hard-blocks such as multipliers, block memories or even embedded processors [36, 51, 4, 38].
In this work, we focus on the architecture and circuit design of the soft-logic portion as it still forms the backbone of an FPGA and typically accounts for a large fraction of its area (footnote 1) and critical path delay as shown in Section 4.2.6. However, since hard-blocks are an important part of modern FPGA architectures, all our VPR [7] experiments are performed with architecture files that contain multipliers and block memories along with our soft-logic blocks. We use the same multiplier and block memory designs across all our VPR experiments, and hence they are constant and do not affect the conclusions of our soft-logic investigations.

Footnote 1: In [50], it was reported that the core area of the largest Stratix III FPGA consists of 72% soft-logic and associated programmable routing; the other 28% being block memory and multipliers.

[Figure 2.1: Tile-based FPGA. Labels: FPGA tile, LB, CB, SB, routing channel.]

[Figure 2.2: Basic logic element (BLE). Labels: K-LUT, FF.]

2.1.1 Logic Block Architecture

Most FPGAs are built around the idea of using lookup tables (LUTs) to implement logic functions. A K-input LUT can implement any combinational logic function of K inputs. Since digital designs are rarely purely combinational, the basic logic element (BLE) of an FPGA consists of a K-LUT and a flip-flop (FF) that both feed a 2:1 multiplexer which allows the output of the BLE to be driven by either the LUT output or the FF output as illustrated in Figure 2.2 [7]. Although an FPGA logic block could consist of a single BLE, it is much more common to group several BLEs together in the same logic block to form a locally interconnected logic cluster as this fast local interconnect can improve performance and save general routing area [7, 2]. The number of inputs to a LUT (K) and the number of BLEs in a logic cluster (N) are two important architectural parameters affecting the area and performance of an FPGA. Ahmed and Rose showed in [2] that K = 4 to 6 and N = 4 to 10 are good choices in terms of area-delay product. Modern commercial architectures use comparable values for N and K (Virtex 7: K=6, N=8 [51] and Stratix V: K=6, N=10 [35]).

As illustrated in Figure 2.3, a logic cluster's local interconnect consists of two types of wires: local feedback wires and cluster input wires. There are typically N local feedback wires in a cluster; one for each BLE. Often, many BLEs in a cluster will share common inputs. Accordingly, the number of inputs to a cluster (I) is less than the number of distinct BLE inputs in a cluster (i.e. N × K). It was shown in [2] that (2.1) is a good estimate of the number of inputs required to achieve 98% LUT utilization.

\[ I = \frac{K}{2}\,(N + 1) \qquad (2.1) \]
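As a quick numeric illustration of (2.1), the following minimal sketch evaluates the cluster input estimate for a few (K, N) pairs mentioned above and compares it with the number of distinct BLE inputs N × K. The function names are hypothetical and chosen only for this example; how a non-integer result would be rounded is not specified in the text, so the value is returned unrounded.

```python
def cluster_inputs(k: int, n: int) -> float:
    """Estimate of the number of cluster inputs I from Eq. (2.1): I = K/2 * (N + 1).

    Hypothetical helper for illustration; the result is left unrounded because
    the text does not say how a fractional value should be handled.
    """
    if k < 1 or n < 1:
        raise ValueError("K and N must be positive integers")
    return k * (n + 1) / 2


def distinct_ble_inputs(k: int, n: int) -> int:
    """Total number of BLE input pins in a cluster of N BLEs with K-input LUTs."""
    return k * n


if __name__ == "__main__":
    # (K, N) pairs taken from the ranges and commercial examples quoted in the text.
    for k, n in [(4, 4), (6, 8), (6, 10)]:
        print(f"K={k}, N={n}: I={cluster_inputs(k, n):g} "
              f"cluster inputs vs. {distinct_ble_inputs(k, n)} BLE inputs")
```

For K = 6 and N = 10, the estimate gives 33 cluster inputs against 60 BLE input pins, which shows how much input sharing between BLEs the formula assumes.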

[Figure 2.3: Logic cluster architecture. Labels: local feedback wires, K local routing MUXes per BLE, total of N BLEs (one BLE shown with internal details: K-LUT, FF), N BLE outputs, I cluster inputs.]

Local routing multiplexers connect multiple local interconnect wires to each BLE input. These multiplexers are generally sparsely populated [29]. That is, BLE inputs can be connected to only a fraction of the wires in the local interconnect; we refer to this fraction as Fc_local. Sparsely populating the local routing multiplexers reduces their size and thus saves area. In [29], it was shown that reducing Fc_local from 1.0 to 0.5 reduces area by 10% with no degradation in critical path delay. However, as recommended by [29], between 2 and 5 spare cluster inputs should be added to (2.1) when sparsely populating the local routing multiplexers to maintain routability.

2.1.2 Routing Architecture

Logic blocks are interconnected by programmable routing channels that run horizontally and vertically on top of a tile (Figure 2.1). The number of tracks in a routing channel is referred to as its width (W). In this work, we assume that horizontal and vertical routing channels have equal widths, but it is possible for them to be different. For example, the horizontal routing channels on Stratix FPGAs are wider than the vertical channels due to the rectangular layout of their logic blocks [34]. A routing track is composed of wire segments that span one or more tiles. The length (L) of a routing segment specifies the number of tiles that it spans. For example, Figure 2.4 shows a routing channel that consists of four tracks of L = 2 wire segments and four tracks of L = 4 wire segments. Note that staggering the start point of wire segments as in Figure 2.4 is necessary for a tile-based layout as it ensures that all tiles remain identical [8]. A horizontal and a vertical routing channel intersect at each tile.
The set of programmable switches that allow connections to be made between routing tracks at this intersection is called a switch block (SB in Figure 2.1). Switch block flexibility (Fs) specifies the number of tracks to which any track can connect in a switch block. An Fs of 3, where each horizontal track connects to another horizontal track and two vertical tracks (and vice-versa), is common [49]. The specific tracks to which each track connects are determined by the switch block pattern [7, 37] as well as the routing driver architecture. In a multi-driver routing architecture (Figure 2.5a), a wire can be driven by multiple tri-state drivers at multiple points along its length. In contrast, in a single-driver routing architecture (Figure 2.5b), a wire can only be driven by a single multiplexer-based driver usually placed at one end of the wire. Figures 2.5a and 2.5b also show that logic block outputs connect to the routing tracks differently based on the routing driver architecture. That is, multi-driver architectures connect logic block outputs directly to the routing wires while single-driver architectures connect logic block outputs to the routing wires through switch-block multiplexers. Although multi-driver routing architectures have been widely used in the past [7, 2], single-driver routing has become the dominant routing architecture style in both academic research [28, 27, 24] and commercial FPGAs [34, 33]. In this work, we focus on single-driver routing architectures. In [28], Lemieux et al. found that FPGAs with single-driver routing had 9% lower delay and were 25% smaller than FPGAs with multi-driver routing.

[Figure 2.4: Routing segment lengths. Labels: FPGA tiles, length 2 wire segments, length 4 wire segments.]

Connection block multiplexers connect multiple routing tracks to each logic block input (see Figure 2.5). The number of tracks that can connect to each logic block input is called the connection block input flexibility (Fc_in). Similarly, the number of routing wires that each logic block output can connect to is given by the connection block output flexibility (Fc_out). Reducing Fc_in from W to 0.2W as the logic cluster size increases from N = 1 to 20 and using an Fc_out of W/N were found to be good choices in [7]. These interconnect flexibility values have generally been used as rules of thumb in subsequent FPGA research.
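The rules of thumb quoted from [7] can be turned into track counts with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the channel width W = 100 is an arbitrary example value, and only the two endpoints given in the text (Fc_in = W at N = 1 and Fc_in = 0.2W at N = 20) are evaluated, since the text does not say how Fc_in varies in between. Rounding to whole tracks is also an assumption of this example.

```python
def fc_out_tracks(channel_width: int, cluster_size: int) -> int:
    """Routing wires each logic block output can connect to, using Fc_out = W/N.

    Rounding to a whole number of tracks is an assumption made for this sketch.
    """
    return round(channel_width / cluster_size)


def fc_in_tracks(channel_width: int, fraction: float) -> int:
    """Tracks that can connect to each logic block input for Fc_in = fraction * W.

    This is also the number of inputs of each connection block multiplexer,
    since one such multiplexer feeds each logic block input.
    """
    return round(fraction * channel_width)


if __name__ == "__main__":
    W = 100  # hypothetical channel width, not a value from the text
    # Endpoints quoted in the text: Fc_in = W for N = 1 and Fc_in = 0.2W for N = 20.
    for n, fraction in [(1, 1.0), (20, 0.2)]:
        print(f"N={n}: Fc_in={fc_in_tracks(W, fraction)} tracks, "
              f"Fc_out={fc_out_tracks(W, n)} tracks")
```

With these example numbers, a cluster of 20 BLEs would use 20-input connection block multiplexers and each output would reach 5 routing wires.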
2.1.3 Commercial BLE Architectures

The BLEs of modern commercial FPGAs are much more complex than the commonly used academic BLE described in Section 2.1.1 (Figure 2.2). Instead of a single K-LUT, some modern FPGA architectures [33, 35, 51] use fracturable LUTs, which are LUTs that can be configured as one large LUT or multiple smaller LUTs. For example, the Stratix V fracturable 6-LUT can be split into two 5-LUTs or four 4-LUTs provided that the functions being mapped to these LUTs meet certain constraints [35]. Modern BLEs also commonly support configuring LUTs as memories (LUTRAM) or shift registers and usually contain hard arithmetic carry logic [35, 52]. However, to keep the scope of this work tractable, we only consider regular K-LUTs, which are still relevant, and we do not consider carry logic as current academic CAD tools do not fully support this functionality.

The commonly used academic BLE shown in Figure 2.2 has a very limited ability to use both the lookup table and flip-flop together. Modern commercial BLEs include additional 2:1 multiplexers to allow the lookup table and flip-flop to be used in concert in many more ways [3, 52]. These extra multiplexers are included in our designs and will be described in more detail in Section 3.2.

[Figure 2.5: Multi-driver and single-driver routing architectures. (a) Multi-driver architecture: tri-state drivers, drivers at mid-points, LB output connects to routing wire via tri-state driver, connection block MUX. (b) Single-driver architecture: switch block MUX, no drivers at mid-points, LB output connects to routing wire via SB mux, connection block MUX.]

[Figure 2.6: FPGA architecture assessment methodology with VPR. Flow: benchmark circuits and architecture description; synthesize and map circuits to FPGA LUTs, FFs, etc.; synthesized benchmark circuits and VPR architecture description file; within VPR: pack into logic clusters, place clusters into FPGA, route connections between clusters; analyze timing and area.]

2.2 FPGA Architecture Assessment Methodology

The quality of an FPGA in terms of area, performance and power consumption is a function of the architectural parameters described in Section 2.1. These architecture parameters interact in complex ways; hence determining the best choice for each parameter is a challenging task. Although there has been some work towards developing analytical models to evaluate FPGA architectures [46, 26, 16], the standard architecture assessment procedure used by both commercial FPGA manufacturers and academic researchers is an experimental one that consists of implementing benchmark circuits on a candidate architecture in order to evaluate its area, delay and power.

Figure 2.6 shows the standard academic CAD flow used to evaluate FPGA architectures [7]. The CAD flow proceeds as follows. Benchmark circuits are first synthesized and mapped into lookup tables (LUTs), flip-flops (FFs) and hard-blocks (multipliers and block memories) based on a description of the architecture. LUTs and FFs are then packed into clusters in a manner that attempts to keep related LUTs and FFs in the same cluster such that connections between them can be routed through the logic cluster's fast local interconnect. Next, each cluster is placed into a specific logic block on the FPGA that minimizes both the delay and the wire length of connections between logic clusters as much as possible. Once all logic clusters have been placed, connections between logic blocks are routed through the FPGA's general purpose interconnect.
The routing algorithm tries to minimize the benchmark circuit's critical path delay, while using the least amount of routing resources possible. Finally, timing analysis is performed to determine the implemented benchmark circuit's critical path delay and area is calculated based on tile area and the number of logic blocks required by the placement.

The packing, placement and routing phases of the flow of Figure 2.6 are performed by VPR [7]. Since many of the algorithms used by VPR are timing-based, the VPR architecture file must describe the delays through the lookup tables, routing multiplexers and any other circuitry that makes up the FPGA. The delays of these circuits depend on the circuit topologies used, as well as the transistor sizing of the FPGA circuitry. Consequently, evaluating an FPGA architecture requires first completing its transistor-level design.

2.3 FPGA Circuit Design

As mentioned in Section 2.1, we only consider soft-logic-based FPGAs with single-driver routing architectures in this thesis. Soft-logic FPGA architectures consist entirely of SRAM cells, routing multiplexers, lookup tables and flip-flops. This section describes commonly used circuit topologies and circuit design practices for these structures.

2.3.1 SRAM cells

An FPGA typically contains millions of memory bits used to configure routing multiplexers and store lookup table logic functions. Because there are so many of them, a key design goal for these memory bits is small area. Stability is also important, as state flipping would cause problems such as incorrectly configured routing multiplexers. A six transistor SRAM cell (Figure 2.7) has been the standard implementation in FPGA research [7] as it achieves both design goals reasonably well.

[Figure 2.7: Six transistor SRAM cell. Labels: WL, BL, BL (complement), V_SRAM+, V_SRAM-.]

2.3.2 Routing Multiplexers

Routing multiplexers account for a large fraction of the area and delay of an FPGA. Consequently, it is crucial to choose a circuit implementation that is as efficient as possible. There are a number of approaches that can be taken to build a multiplexer but most commercial FPGAs and almost all academic FPGA studies use an NMOS pass-transistor-based approach because each switch requires only one transistor, minimizing area. Figure 2.8 shows three of the most commonly used pass-transistor multiplexer topologies. Each multiplexer style possesses a different area-delay tradeoff that is a function of the number of multiplexer inputs [27, 9].
For example, since it has just one pass-transistor on the signal path, a 1-level multiplexer is generally faster than a 2-level multiplexer. But, for a large number of inputs, a 1-level multiplexer requires more SRAM cells than a 2-level multiplexer and can thus have larger area. Furthermore, if the number of inputs is large enough, a 1-level multiplexer could even become slower than a 2-level multiplexer due to a greater number of transistors loading the output node.
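To make the SRAM-cell and loading tradeoff above concrete, the following is a minimal sketch that counts resources for the three topologies. It is an illustration under stated assumptions, not a model taken from the thesis: the function name `mux_resources` is hypothetical, the 1-level multiplexer is assumed to use one SRAM cell per input, the 2-level multiplexer is assumed to share its first-level select bits across groups, and the tree is assumed to be fully encoded with one SRAM cell per level.

```python
import math


def mux_resources(num_inputs, topology, group_size=None):
    """Return (pass_transistors, sram_cells, transistors_on_signal_path).

    Hypothetical counting model for illustration only.
    """
    if num_inputs < 2:
        raise ValueError("a multiplexer needs at least two inputs")

    if topology == "1-level":
        # Assumption: one pass-transistor and one SRAM cell per input.
        return num_inputs, num_inputs, 1

    if topology == "2-level":
        # Assumption: first-level groups of size s1 share s1 select bits,
        # and the second level has one transistor and one bit per group.
        s1 = group_size or math.isqrt(num_inputs - 1) + 1  # ceil(sqrt(n))
        groups = math.ceil(num_inputs / s1)
        return num_inputs + groups, s1 + groups, 2

    if topology == "tree":
        # Assumption: fully encoded binary tree, power-of-two input count,
        # one SRAM cell (true and complement outputs) per tree level.
        levels = num_inputs.bit_length() - 1
        if 1 << levels != num_inputs:
            raise ValueError("tree sketch assumes a power-of-two input count")
        return 2 * num_inputs - 2, levels, levels

    raise ValueError(f"unknown topology: {topology}")


if __name__ == "__main__":
    for n in (8, 32):
        for style in ("tree", "1-level", "2-level"):
            print(n, style, mux_resources(n, style))
```

Under these assumptions, a 32-input 1-level multiplexer needs 32 SRAM cells while the 2-level version needs 12, which is the kind of gap that lets the 2-level topology win on area as the input count grows.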

Figure 2.8: Different 8:1 pass-transistor multiplexer topologies. (a) Tree MUX. (b) 1-level MUX. (c) 2-level MUX.

Figure 2.9: Multiplexer followed by two-stage buffer with PMOS level-restorer.

It was shown in [9] that a 2-level multiplexer generally yields a lower area-delay product than a 1-level or tree multiplexer. Commercial FPGAs also commonly use 2-level multiplexers [33].

Although they are beneficial in terms of area, pass-transistors have an important disadvantage: they are incapable of passing a full logic-high voltage. That is, their output voltage saturates at approximately $V_G - V_{Th}$, where $V_G$ is the gate voltage and $V_{Th}$ is the threshold voltage of the transistor. In FPGA circuitry, the output of a pass-transistor-based routing multiplexer typically feeds a multi-stage buffer [7, 30, 33]. Static power dissipation in these buffers caused by the reduced voltage swing of pass-transistors has long been a cause for concern [7]. To mitigate this problem, gate boosting [7] (applying a voltage larger than the supply voltage $V_{DD}$ to the pass-transistor gate) and PMOS level-restorers [30, 33] have been used to help pull pass-transistor output voltages up to $V_{DD}$. Figure 2.9 shows a routing multiplexer followed by a two-stage buffer equipped with a PMOS level-restorer.
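The reduced logic-high level described above can be illustrated with a short first-order calculation. This is only a sketch: the function name and the voltage values are hypothetical and do not come from the thesis or from any particular process, and second-order effects such as the body effect are ignored.

```python
def pass_transistor_high_level(v_gate, v_th, v_dd):
    """First-order logic-high level passed by an NMOS pass-transistor.

    The output follows the input up to roughly (gate voltage - threshold
    voltage) and can never exceed the supply-level input.
    """
    return min(v_dd, v_gate - v_th)


# Hypothetical values chosen only for illustration.
V_DD = 0.8    # supply voltage (V)
V_TH = 0.3    # threshold voltage (V)
BOOST = 0.2   # extra gate voltage applied when gate boosting (V)

unboosted = pass_transistor_high_level(V_DD, V_TH, V_DD)
boosted = pass_transistor_high_level(V_DD + BOOST, V_TH, V_DD)

if __name__ == "__main__":
    print(f"gate at V_DD:  output high ~ {unboosted:.2f} V")
    print(f"gate boosted:  output high ~ {boosted:.2f} V")
```

With these assumed numbers, boosting narrows the gap to the supply voltage without closing it, which suggests why a level-restorer may still be paired with modest gate boosting.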

2.3.3 Lookup Tables

Like routing multiplexers, lookup tables also use pass-transistor-based multiplexer circuitry, but the multiplexer input and control connectivity is reversed. In a lookup table, SRAM cells connect to the inputs of the multiplexer and hold the logic function's truth table, while the gates of the multiplexer are controlled by the lookup table inputs. Consequently, lookup tables are generally implemented as fully encoded multiplexer trees, such that each level of the tree can be connected to a LUT input [7]. Figure 2.10 shows a 3-input fully encoded multiplexer tree lookup table followed by a two-stage buffer.

Figure 2.10: Fully encoded MUX tree 3-LUT.

2.3.4 Flip-Flops

Flip-flops are generally implemented as standard master-slave registers [7]. However, some commercial FPGAs use flip-flops that are more advanced. For example, Altera's Stratix V FPGAs use flip-flops based on pulse latches and configurable pulse width generators to improve performance [35].

2.4 Modeling of FPGA Circuitry

Evaluating an FPGA architecture with the assessment methodology described in Section 2.2 requires models that allow us to estimate the area and delay of FPGA circuitry, because fabricating an integrated circuit for each architecture to measure area and delay is not practical. In this section, we describe commonly used area and delay modeling approaches for FPGAs. These models are also useful for transistor sizing, which we will discuss in Section 2.5.

Figure 2.11: Minimum-width transistor area model.

2.4.1 Area Modeling

Creating a complete layout is the best way to determine the exact area of an FPGA. However, this process is much too time consuming when multiple designs need to be explored. A variety of approaches have been used to estimate area more quickly, such as counting transistors or counting SRAM cells, but the most widely used in FPGA research is the minimum-width transistor area model introduced in [7].

In this model, layout area is expressed in units of minimum-width transistor areas. A minimum-width transistor is defined as the smallest possible contactable transistor for a specific process technology, and one minimum-width transistor area is the area of this transistor plus the spacing to neighboring transistors, as shown in Figure 2.11. Unlike area models that simply count transistors or SRAM cells, the minimum-width transistor area model provides an actual estimate of layout area. This is an important distinction because, as well as being more accurate, actual layout area estimates enable better estimates of wire loads, since wire lengths are layout dependent.

Transistors in FPGA circuitry often require more drive strength than that provided by a minimum-width transistor. A transistor's drive strength can be increased by either widening its diffusion region (Figure 2.12b) or by adding parallel diffusion regions (Figure 2.12c). Consequently, increasing a transistor's drive strength increases its area. The widely used area model of [7] estimates the layout area of a transistor with drive strength x, in units of minimum-width transistor areas, with (2.2), which was obtained by averaging the layout areas that result from either widening the diffusion region or adding parallel diffusion regions to increase drive strength.
$$\mathrm{Area}(x) = 0.5 + 0.5x \qquad (2.2)$$

Then, [7] calculates the area of an FPGA subcircuit by simply summing the areas of all the transistors in that subcircuit. Note from (2.2) that doubling a transistor's drive strength does not double its area. This is due to the fact that increasing a transistor's drive strength only increases certain transistor dimensions while others remain constant. For example, the spacing to neighboring transistors remains the same regardless of a transistor's drive strength.
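As a small worked example of (2.2) and the subcircuit summation described above, the sketch below evaluates the model for a few drive strengths. The function names and the example drive strengths are hypothetical, and rejecting drive strengths below 1 is an assumption made here because a minimum-width transistor is the smallest device the model describes.

```python
def transistor_area(drive_strength):
    """Area of one transistor in minimum-width transistor areas, per (2.2)."""
    if drive_strength < 1:
        # Assumption: x = 1 is the minimum-width transistor, so smaller
        # values are treated as invalid in this sketch.
        raise ValueError("drive strength must be at least 1")
    return 0.5 + 0.5 * drive_strength


def subcircuit_area(drive_strengths):
    """Area of a subcircuit as the sum of its transistor areas."""
    return sum(transistor_area(x) for x in drive_strengths)


if __name__ == "__main__":
    # Doubling the drive strength from 1 to 2 raises the area from 1.0 to
    # 1.5 minimum-width transistor areas, not to 2.0.
    print(transistor_area(1), transistor_area(2))
    # Hypothetical subcircuit with three transistors of strengths 1, 2 and 4.
    print(subcircuit_area([1, 2, 4]))
```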

Figure 2.12: Increasing drive strength with diffusion widening (b) or parallel diffusion regions (c). (a) Minimum drive strength. (b) 2× minimum drive strength. (c) 2× minimum drive strength. Note: Although not shown in the figure for simplicity, parallel diffusions must be connected together.

2.4.2 Delay Modeling

Time-domain circuit simulators such as HSPICE are generally the most accurate way to estimate the delay of a circuit. However, time-domain simulation can be computationally intensive, making it impractical when a large number of delay measurements need to be obtained. For example, the timing analysis phase of the architecture assessment flow described in Section 2.2 involves measuring delay for the thousands of nets in a benchmark circuit; performing time-domain simulation for each one would lead to prohibitively long runtimes. Instead, previous FPGA research work has typically modeled wires and transistors as linear resistances and capacitances, such that a transistor-based circuit can be modeled as an RC-tree network [22, 7, 24]. The delay of this network can then be estimated with the Elmore [15] or the Penfield-Rubinstein [20] delay models, which are much quicker than time-domain simulations. With the Elmore delay model, the delay $T_D$ of a path is given by:

$$T_D = \sum_{i \in \mathrm{path}} R_i \cdot C(\mathrm{subtree}_i) \qquad (2.3)$$

where $R_i$ is the equivalent resistance of element $i$ along the path and $C(\mathrm{subtree}_i)$ is the total downstream capacitance rooted at element $i$.

An enhanced version of the Elmore delay model was proposed in [39]. Since it is more difficult to model a buffer as a simple RC circuit due to the buffer's intrinsic delay, [39] combines the Elmore delay model with a common model of buffer delay where a buffer is modeled as a constant delay and a resistor.
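The Elmore delay of (2.3) can be sketched in a few lines. The example below is a hedged illustration rather than the implementation used by any particular tool: the `RCNode` class, the node names and all resistance and capacitance values are hypothetical. Resistances are in kilo-ohms and capacitances in femtofarads so that their product is in picoseconds.

```python
from dataclasses import dataclass, field


@dataclass
class RCNode:
    """One element of an RC tree (hypothetical structure for illustration)."""
    name: str
    resistance: float   # resistance of the element driving this node (kOhm)
    capacitance: float  # capacitance lumped at this node (fF)
    children: list = field(default_factory=list)


def subtree_capacitance(node):
    """Total downstream capacitance rooted at this element."""
    return node.capacitance + sum(subtree_capacitance(c) for c in node.children)


def elmore_delay(path):
    """Elmore delay in ps along a source-to-sink list of RCNode objects.

    Each element contributes its resistance times all the capacitance it
    must charge, including side branches that are not on the path.
    """
    return sum(n.resistance * subtree_capacitance(n) for n in path)


# Hypothetical RC tree: driver -> a -> sink, with a side branch hanging off a.
sink = RCNode("sink", resistance=0.5, capacitance=4.0)
branch = RCNode("branch", resistance=0.2, capacitance=3.0)
a = RCNode("a", resistance=0.5, capacitance=2.0, children=[branch, sink])
driver = RCNode("driver", resistance=1.0, capacitance=1.0, children=[a])

delay_ps = elmore_delay([driver, a, sink])

if __name__ == "__main__":
    print(f"Elmore delay to sink: {delay_ps} ps")
```

Note that the side branch adds to the delay seen at the sink through its capacitance only; its own resistance does not appear because it is not on the path.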
This approach maps well to FPGA circuitry, which consists mostly of pass-transistors and buffers, and was adopted as the delay model for VPR in [7]. With this model, the delay $T_D$ of a path is given by:

$$T_D = \sum_{i \in \mathrm{path}} \left( R_i \cdot C(\mathrm{subtree}_i) + T_{buf,i} \right) \qquad (2.4)$$

where $T_{buf,i}$ is the buffer's intrinsic delay if element $i$ is a buffer or 0 otherwise [7].

2.5 Automated Transistor Sizing

Transistor sizing is a well-studied problem that consists of improving a circuit's performance by increasing the sizes of its transistors and thus provides yet another level, in addition to architecture and circuit design, at which the area and delay characteristics of an FPGA can be adjusted. The transistor sizing optimization problem is usually formulated in one of three ways:

1. Minimize some function of area and delay.
2. Minimize area subject to a delay constraint.
3. Minimize delay subject to an area constraint.

There has been much prior work on automated transistor sizing for custom circuitry. Fishburn and Dunlop showed in [17] that modeling transistors as linear resistances and capacitances and calculating the delay of the resulting RC circuits with the Elmore [15] or the Penfield-Rubinstein [20] delay model (i.e. (2.3)) allows the transistor sizing problem to be formulated as a convex optimization problem, which guarantees that any local minimum is the global minimum. With this useful property, [17] develops TILOS, a transistor sizing tool for custom circuits based on a heuristic method that iteratively identifies a circuit's critical path and increases transistor sizes on that path until all timing constraints are met. Despite the convexity of the problem, TILOS's heuristic is such that it can terminate with a suboptimal solution [45]. Algorithms guaranteeing the optimal solution through convex optimization [44] or mathematical relaxation techniques [10, 47] have subsequently been proposed, but these algorithms, along with TILOS, all suffer from their reliance on linear device models and the Elmore delay, which have long been known to be inaccurate [40, 21]. To enhance accuracy, at the cost of increased computational complexity, some transistor sizing algorithms have used time-domain simulation to obtain delay estimates [14, 13].

The programmability of FPGAs adds unique features to the transistor sizing problem, which makes FPGA-specific transistor sizing tools valuable. Kuon and Rose proposed such a tool in [24]. Their FPGA transistor sizing approach differs from transistor sizing algorithms for custom circuits because it deals with two features unique to FPGAs. The first of these unique features is repetition. As described in Section 2.1, an FPGA consists of an array of tiles.
Since these tiles are all identical, transistor-level design only needs to be performed for one of them. This design can then be replicated to obtain a complete FPGA. Similar design space reductions can be found within a tile. For example, a switch block can include over 100 logically equivalent multiplexers whose transistor-level design should be kept identical. Consequently, only 80 unique transistors need to be sized when designing an FPGA's soft logic despite there being billions of transistors on the chip, which is in contrast to transistor sizing for custom circuits where the whole chip must be considered. This reduced design space makes HSPICE-based optimization practical for FPGAs, but as we show in Section 3.7, we must still search this space intelligently to keep runtime reasonable.

The second feature unique to FPGA transistor sizing is the undefined critical path. Because they are programmable, FPGAs have application-dependent critical paths, which implies that, at design time, there is no clear critical path to optimize for delay. To deal with this issue, [24] optimizes a representative path that contains one of each type of FPGA subcircuit (LUTs, MUXes, etc.). Delay is taken as a weighted sum of the delay of each subcircuit, and the weighting scheme is chosen based on the frequency with which each subcircuit was found on the critical paths of placed and routed benchmark circuits. Optimizing a representative critical path still presents a huge design space, which Kuon and Rose tackle with a two-phase algorithm: an exploratory phase that utilizes linear device models and a TILOS-like transistor sizing heuristic to keep CPU times reasonable, followed by an HSPICE-based fine-tuning phase that adjusts the transistor sizes to account for the inaccuracies of linear models.

In [46], Smith et al. present a method that enables the rapid and concurrent optimization of high-level architecture parameters and transistor sizes for FPGAs through the use of analytic architecture models, linear device models and a convex optimization-based transistor sizing algorithm. They show that this concurrent optimization can have a significant impact on architectural conclusions versus a separate optimization.
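The representative-path delay used by [24], described earlier in this section as a weighted sum of subcircuit delays, can be sketched as follows. This is only an illustration: the function name, the subcircuit names, the delay values and the weights are all hypothetical and are not taken from [24] or from any benchmark measurement.

```python
def representative_path_delay(subcircuit_delays, weights):
    """Weighted sum of subcircuit delays along a representative path.

    Both arguments map a subcircuit name to a number; a missing weight
    raises KeyError so that no subcircuit is silently ignored.
    """
    return sum(weights[name] * delay for name, delay in subcircuit_delays.items())


# Hypothetical subcircuit delays in picoseconds.
delays_ps = {
    "lut": 200.0,
    "switch_block_mux": 100.0,
    "connection_block_mux": 80.0,
}

# Hypothetical weights standing in for how often each subcircuit appears on
# the critical paths of placed and routed benchmark circuits.
weights = {
    "lut": 0.25,
    "switch_block_mux": 0.5,
    "connection_block_mux": 0.25,
}

path_delay_ps = representative_path_delay(delays_ps, weights)

if __name__ == "__main__":
    print(f"representative path delay: {path_delay_ps} ps")
```

A transistor sizing tool can then treat this single number as the delay term of its objective, so that speeding up a frequently used subcircuit counts for more than speeding up a rarely used one.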

Chapter 3

COFFE: Automated Optimization of FPGA Circuitry

When developing a new chip, FPGA architects are faced with two main tasks: choosing an architecture for their FPGA and performing the transistor-level design of that architecture. As described in Section 2.2, choosing an architecture is typically accomplished experimentally with architecture exploration tools such as VPR [7]. By implementing benchmark circuits on a proposed FPGA, these tools allow architects to evaluate the area, delay and power impact of various architectural choices. Then, based on their observations, architects can select an FPGA architecture that meets their design goals and constraints.

Transistor-level design consists of selecting circuit topologies for the various subcircuits that implement the chosen architecture, as well as sizing the transistors of those subcircuits. Transistor-level design is an essential precursor to the evaluation of an architecture because it provides accurate area, delay and power estimates of the underlying FPGA circuitry; these estimates are required inputs to the architecture exploration tools. Transistor sizing also provides an additional opportunity to tune the area, delay and power of an FPGA.

Therefore, developing a new FPGA is an iterative process that involves performing the transistor-level design of various architectures before evaluating them through synthesis, placement and routing experiments. This interdependence between architecture exploration and transistor-level design necessitates automated design tools if high-quality results are to be obtained in reasonable amounts of time.

In this chapter, we describe COFFE (Circuit Optimization For FPGA Exploration), a fully-automated transistor sizing tool for FPGAs that we developed as part of this thesis. COFFE enables the design flow detailed above by providing area, delay and power estimates of properly sized FPGA circuitry.
COFFE also enables design exploration of FPGA circuitry and we will use COFFE in such a capacity in Chapter 4. Although COFFE solves the same problem as Kuon and Rose's FPGA transistor sizing tool [24] (see Section 2.5), we have made significant improvements which are necessary for FPGAs in advanced process nodes; these improvements will be described in the following sections.

3.1 Introduction to COFFE

Figure 3.1 shows the FPGA design flow we wish to enable with COFFE. COFFE is used to perform transistor-level optimization for some architecture of interest, thus producing accurate area and delay estimates for the subcircuits of this architecture (LUTs, routing multiplexers, etc.). These estimates are used by VPR to evaluate the architecture through place and route experiments. Based on the results of the assessment, the architecture parameters are adjusted and sent back to COFFE to begin a new iteration of optimization and evaluation.

Figure 3.1: FPGA design flow.

COFFE's circuit optimizer makes area and performance trade-offs through transistor sizing. Like [24], COFFE's optimization objective is of the form $\mathrm{Area}^b \cdot \mathrm{Delay}^c$, thus allowing for different area and performance tradeoffs by varying $b$ and $c$. Creating a complete layout is the most accurate way to obtain the area and delay measurements needed during transistor sizing. However, for the iterative design flow of Figure 3.1, this approach is impractical as layout is a very time consuming task. Instead, COFFE estimates area with an improved version of the minimum-width transistor area model (see Section 3.4) and measures delay with HSPICE simulations. Although previous FPGA transistor sizing tools have used linearized models of transistors to measure delay during certain phases of the optimization, we show in Section 3.5 that such models are highly inaccurate for the fine-grained transistor-level design we wish to undertake in advanced process nodes such as the 22nm process we use in this work.

COFFE automatically generates the SPICE netlists required for delay measurement based on the input architecture parameters and the circuit topologies described in Sections 3.2 and 3.3 respectively.
These netlists are parametrized such that COFFE's circuit optimizer can change the transistor sizes by simply changing a transistor size parameter list. To obtain meaningful delays, COFFE is careful to ensure that these netlists include realistic transistor and wire loading. Transistor loads are relatively easy to determine based on architectural parameters and circuit topologies. Wire loads, on the other hand, are layout dependent, making them more difficult to determine since the exact layout is not known. COFFE estimates wire loads with the model described in Section 3.6.

3.2 Architecture

Figure 3.2 shows the tile architecture that COFFE supports in its designs and Table 3.1 lists the architecture parameters that COFFE expects as inputs. Parameters listed in the top portion of Table 3.1