Optimization and Modeling of FPGA Circuitry in Advanced Process Technology. Charles Chiasson
Optimization and Modeling of FPGA Circuitry in Advanced Process Technology
by Charles Chiasson

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science, Graduate Department of Electrical and Computer Engineering, University of Toronto

Copyright 2013 by Charles Chiasson
Abstract

We develop a new fully-automated transistor sizing tool for FPGAs that features area, delay and wire load modeling enhancements over prior work to improve its accuracy in advanced process nodes. We then use this tool to investigate a number of FPGA circuit design related questions in a 22nm process. We find that building FPGAs out of transmission gates instead of the currently dominant pass-transistors, whose performance and reliability are degrading with technology scaling, yields FPGAs that are 15% larger but are 10-25% faster depending on the allowable level of gate boosting. We also show that transmission gate FPGAs with a separate power supply for their gate terminal enable a low-voltage FPGA with 50% less power and good delay. Finally, we show that, at a possible cost in routability, restricting the portion of a routing channel that can be accessed by a logic block input can improve delay by 17%.
Acknowledgements

First, I would like to express my sincerest gratitude to my supervisor Vaughn Betz for his guidance and motivation, for his technical help and for the tidbits of wisdom that he shared with me, knowingly or unknowingly, over the past two years. I learned so much in so little time and cannot imagine having had a better mentor.

I also extend thanks to the other graduate students in Vaughn Betz's research group for all their help and support. Also, thanks for the lunch outings, the coffee breaks and the squash matches, among other things, that provided those much needed distractions.

I would like to thank the Natural Sciences and Engineering Research Council of Canada, Altera Corporation and the University of Toronto for their financial support. Thanks also go to CMC Microsystems for providing the CAD tools used throughout this research. I would also like to thank David Lewis from Altera Corporation for the insightful discussions.

Finally, thanks must undoubtedly go to my parents for nurturing my inherent desire to know why, for making me love the smell of new books, and simply, for being the best parents a kid could ask for. All of my accomplishments are certainly due to them.
Contents

1 Introduction
    Motivation
    Thesis Organization

2 Background
    FPGA Architecture
    Logic Block Architecture
    Routing Architecture
    Commercial BLE Architectures
    FPGA Architecture Assessment Methodology
    FPGA Circuit Design
    SRAM cells
    Routing Multiplexers
    Lookup Tables
    Flip-Flops
    Modeling of FPGA Circuitry
    Area Modeling
    Delay Modeling
    Automated Transistor Sizing

3 COFFE: Automated Optimization of FPGA Circuitry
    Introduction to COFFE
    Architecture
    Circuit Topologies
    Area Modeling
    Delay Modeling
    Non-Linearity of Transistor Resistance and Capacitance
    Topology Dependence of Transistor Resistance
    Wire Load Modeling
    Transistor Sizing Algorithm
    Divide-and-Conquer
    Pre-Determined P/N Ratios
    Detailed Algorithm
    Impact of Improved Wire Load Modeling
    3.8.1 Base Architecture
    Target Process Technology
    Results
    Integration of COFFE with VPR

4 Efficient FPGA Circuitry
    Fc,out for Single-Driver Routing and Multiple BLE Outputs
    Transmission Gate FPGAs
    Pass-Transistor Scaling Challenges
    Replacing Pass-Transistors with Transmission Gates
    Gate-Boosting Strategy
    Methodology
    Results
    Area and Delay Breakdown
    Separating VDD and VG for Low-Power FPGAs
    Track-Access Locality

5 Conclusions and Future Work
    Summary
    Future Work

A N-well Sharing Sample Layout
B FPGA Circuitry Schematics
C Detailed Transistor Sizing Results
D Area and Delay Breakdown
Bibliography
List of Tables

3.1 COFFE's expected input architecture parameters
Resistance of a 4× minimum-width NMOS transistor for different circuit topologies (Figure 3.9) and switching-thresholds
Rise-fall re-balancing and the effect of M on COFFE's transistor sizing solutions (example)
Base architecture parameters
Subcircuit count per tile for base architecture
Metal layer data used by COFFE for all circuit design investigations (ITRS [19])
Impact of wire loading
Effect of Fc,out on channel width and switch block multiplexers
Area and delay for different Fc,out values
Pass-transistor and transmission gate FPGA tile area for different levels of gate boosting
Switch block multiplexer transistor sizes for PT and TG implementations for different levels of gate boosting (see Figure 4.2 for transistor labels). Note that with the exception of P/N ratios, COFFE uses integer granularity
Pass-transistor and transmission gate FPGA critical path delay for different levels of gate boosting (VTR benchmarks)
Pass-transistor and transmission gate FPGA area-delay product for different levels of gate boosting (VTR benchmarks)
Pass-transistor and transmission gate FPGA relative dynamic power for different levels of gate boosting (VTR benchmarks)
Effect of cluster output track-access locality on area and delay. Input track-access span is set to
Effect of cluster input track-access locality on area and delay. Output track-access span is set to
C.1 Lookup table transistor sizes
C.2 Switch block multiplexer transistor sizes
C.3 Connection block multiplexer transistor sizes
C.4 Local routing multiplexer transistor sizes. Note: We don't give a size for buf2 of the local routing multiplexer as it is replaced by the LUT input driver of Figure B
C.5 BLE output to local interconnect
C.6 BLE output to general routing
C.7 Flip-flop and register selection multiplexer transistor sizes
C.8 LUT input driver A
C.9 LUT input driver B
C.10 LUT input driver C with register feedback multiplexer (Figure B.3)
C.11 LUT input driver D
C.12 LUT input driver E
C.13 LUT input driver F
D.1 Tile area breakdown
D.2 Critical path delay breakdown
List of Figures

1.1 Architecture exploration with manual (a) and automated (b) transistor-level design
2.1 Tile-based FPGA
2.2 Basic logic element (BLE)
2.3 Logic cluster architecture
2.4 Routing segment lengths
2.5 Multi-driver and single-driver routing architectures
2.6 FPGA architecture assessment methodology with VPR
2.7 Six transistor SRAM cell
2.8 Different 8:1 pass-transistor multiplexer topologies
2.9 Multiplexer followed by two-stage buffer with PMOS level-restorer
2.10 Fully encoded MUX tree 3-LUT
2.11 Minimum-width transistor area model
2.12 Increasing drive strength with diffusion widening (b) or parallel diffusion regions (c). Note: Although not shown in the figure for simplicity, parallel diffusions must be connected together
FPGA design flow
COFFE's supported tile architecture
COFFE's routing multiplexer circuit topologies
Fully encoded MUX tree 6-LUT with internal re-buffering (partial view)
Static transmission gate-based master-slave register
Transistor area prediction accuracy of original (Eq. 2.2) and improved (Eq. 3.1) area models against TSMC 65nm layouts
Combining diffusion widening and parallel diffusion regions yields denser layouts (c)
A switch-level model
Circuits used to measure transistor resistance
Inverter NMOS and PMOS resistivity vs. transistor width
COFFE's transistor sizing algorithm
VDD and VTh scaling trends [12]
Generic two-level routing multiplexer with two-stage buffer implemented with pass-transistors (a) and transmission gates (b)
Effect of different gate boosting strategies on transmission gate switch block multiplexer delay (VDD = 0.8V)
4.4 CAD flow for each FPGA
Tile area and critical path delay breakdown
Critical path delay for pass-transistor (PT) and transmission gate (TG) FPGAs for different VDD and VG
Dynamic power for pass-transistor (PT) and transmission gate (TG) FPGAs for different VDD and VG
Power-delay product for pass-transistor (PT) and transmission gate (TG) FPGAs for different VDD and VG
Cluster output wire load for different locality
Cluster input wire load for different locality
A.1 A single-level 4:1 pass-transistor multiplexer with two-stage buffer and level restorer
A.2 Sample multiplexer layout with N-well sharing
B.1 6-LUT
B.2 LUT input driver
B.3 LUT input driver with register feedback multiplexer
B.4 Two-level multiplexer used for switch block, connection block and local routing multiplexers
B.5 2:1 multiplexer used for BLE outputs
B.6 Flip-flop with register input selection multiplexer
Chapter 1

Introduction

1.1 Motivation

The design and fabrication of modern digital integrated circuits costs tens to hundreds of millions of dollars and requires large teams of engineers and years of effort. Indeed, the cost of developing a new 20nm chip has been estimated to be as high as $160 million USD [1]. While this may be acceptable for high-volume applications, it can be a significant burden for lower-volume designs, often preventing them from being fabricated in the latest process technologies.

Instead of being fabricated as a custom chip such as a standard cell-based application-specific integrated circuit (ASIC) or a full custom design, a digital design can be implemented in a field-programmable gate array (FPGA). FPGAs are pre-fabricated, programmable devices on which one can implement any arbitrary digital design in a matter of seconds. Therefore, FPGAs are an attractive alternative to ASICs or full custom designs because they allow the high non-recurring engineering costs and lengthy design times associated with semiconductor manufacturing to be completely avoided.

However, digital designs that require high density, high performance or low power might not find FPGAs as attractive. It has been shown that FPGAs require 35× more silicon area, are 4× slower and consume 14× more dynamic power than ASICs [25]. Accordingly, minimizing the FPGA-to-ASIC gap, that is, making FPGAs as efficient as possible such that they become a competitive implementation medium for all types of applications, is one of the primary drivers of FPGA research for both academic researchers and commercial FPGA manufacturers.

The area, performance and power characteristics of an FPGA can be optimized at two main levels: architecture and transistor-level design. The architecture of an FPGA is defined by a number of parameters that describe the style and flexibility of its soft-logic blocks, dedicated hard-blocks and interconnect.
Finding an architecture that meets specific design goals and constraints involves setting these architectural parameters to specific values. However, these parameters interact in complex ways to produce area, delay and power trade-offs that are very difficult to quantify through analytical methods. For that reason, finding the right architectural parameter values is usually accomplished experimentally with automated architecture exploration tools such as VPR [7].

For any architecture, there are a number of different transistor-level implementations. Transistor-level design consists of choosing circuit topologies for an architecture as well as sizing the transistors of those circuits. Both circuit topology selection and transistor sizing provide opportunities to optimize the area, delay and power of the architecture. In prior FPGA research work, transistor-level design was often performed manually, making it a task that required a significant amount of time and effort. This often had a negative impact on the architecture exploration flow, which would proceed as follows. Manual transistor-level design would be performed on some initial architecture. Then, this architecture would be assessed with an architecture exploration tool such as VPR. Based on the results of the assessment, the architecture parameters would be adjusted and the evaluation process would be repeated. Ideally, one would then re-optimize the transistor-level design to match the new architecture parameters. However, since manual transistor-level design was such a time and effort intensive task, this step would often be skipped. It was assumed that transistor sizes obtained with a previous architecture still applied to the new architecture, and this new architecture was then evaluated without re-optimizing its transistor-level design. This architecture exploration flow is illustrated in Figure 1.1a.

Figure 1.1: Architecture exploration with manual (a) and automated (b) transistor-level design.

The new architecture could likely be made more efficient if its transistor sizes were re-optimized. As well, the detailed impact of new wire loads as the architecture and its area changed has often not been rigorously modeled, possibly leading to inaccurate architecture conclusions. In an environment where FPGAs need to be as efficient as possible to compete with ASICs, new architectures should be evaluated in their most efficient state.
It follows that re-optimizing the transistor sizes as the FPGA architecture is changed provides a more thorough design space exploration and should yield more efficient FPGAs. Automating the transistor-level design of FPGAs enables such frequent re-optimization (Figure 1.1b). In addition, an automated transistor-level design tool facilitates investigations relating to efficient FPGA circuitry. For example, an automated transistor-level design tool could be used to explore the impact of different circuit topologies or the impact of different layout choices on the area, delay and power of an FPGA. This thesis consists of two parts. In the first, we develop COFFE (Circuit Optimization For FPGA Exploration), a new fully-automated transistor sizing tool for FPGAs. Although an FPGA-specific transistor sizing tool has been developed in prior work [24], we have made significant improvements that are necessary in advanced process nodes. In the second part of this thesis, we use COFFE to investigate a number of circuit design related questions in advanced process technology.
1.2 Thesis Organization

This thesis is organized as follows. Chapter 2 provides background information on FPGA architecture, circuit design, modeling and optimization. Chapter 3 describes COFFE, a fully-automated transistor sizing tool for FPGAs developed as part of this thesis, as well as our area and delay modeling enhancements. A number of FPGA circuit design investigations are performed with COFFE in Chapter 4. Finally, Chapter 5 concludes this thesis and suggests future work.
Chapter 2

Background

This thesis is focused on the transistor-level design of SRAM-based FPGAs and related computer-aided design (CAD) tools. We develop a fully-automated transistor sizing tool for FPGAs in Chapter 3 and use it to investigate a number of FPGA circuit design related questions in Chapter 4. This chapter provides relevant background material. First, we review FPGA architecture and the standard FPGA architecture assessment methodology. Then, we describe common practices in FPGA circuit design as well as commonly used area and delay modeling techniques for these circuits. Finally, we review prior work on automated transistor sizing.

2.1 FPGA Architecture

An FPGA consists of an array of tiles that can each implement a small amount of logic and routing. Horizontal and vertical routing channels run on top of the tiles and allow them to be stitched together to perform larger functions. Figure 2.1 illustrates FPGA tile architecture at a high level. A logic block (LB) supplies the tile's logic functionality. Connection blocks (CBs) provide connectivity between logic block inputs and routing channels. A switch block (SB) connects logic block outputs to routing channels and provides connectivity between wires within the routing channels. One replicates this basic tile to obtain a complete FPGA. Although Figure 2.1 shows logic and switching functions as distinct sub-blocks, an interleaved layout is more realistic and is what we assume throughout this work.

The FPGA architecture described in the previous paragraph represents a generic soft-logic-based FPGA. Modern FPGAs are more heterogeneous. That is, in addition to general purpose soft-logic blocks, they also contain dedicated hard-blocks such as multipliers, block memories or even embedded processors [36, 51, 4, 38]. In this work, we focus on the architecture and circuit design of the soft-logic portion as it still forms the backbone of an FPGA and typically accounts for a large fraction of its area (see footnote 1) and critical path delay, as shown in Section [...]. However, since hard-blocks are an important part of modern FPGA architectures, all our VPR [7] experiments are performed with architecture files that contain multipliers and block memories along with our soft-logic blocks. We use the same multiplier and block memory designs across all our VPR experiments, and hence they are constant and do not affect the conclusions of our soft-logic investigations.

Footnote 1: In [50], it was reported that the core area of the largest Stratix III FPGA consists of 72% soft-logic and associated programmable routing; the other 28% being block memory and multipliers.
Figure 2.1: Tile-based FPGA.

Figure 2.2: Basic logic element (BLE).

2.1.1 Logic Block Architecture

Most FPGAs are built around the idea of using lookup tables (LUTs) to implement logic functions. A K-input LUT can implement any combinational logic function of K inputs. Since digital designs are rarely purely combinational, the basic logic element (BLE) of an FPGA consists of a K-LUT and a flip-flop (FF) that both feed a 2:1 multiplexer, which allows the output of the BLE to be driven by either the LUT output or the FF output as illustrated in Figure 2.2 [7]. Although an FPGA logic block could consist of a single BLE, it is much more common to group several BLEs together in the same logic block to form a locally interconnected logic cluster, as this fast local interconnect can improve performance and save general routing area [7, 2].

The number of inputs to a LUT (K) and the number of BLEs in a logic cluster (N) are two important architectural parameters affecting the area and performance of an FPGA. Ahmed and Rose showed in [2] that K = 4 to 6 and N = 4 to 10 are good choices in terms of area-delay product. Modern commercial architectures use comparable values for N and K (Virtex 7: K=6, N=8 [51] and Stratix V: K=6, N=10 [35]).

As illustrated in Figure 2.3, a logic cluster's local interconnect consists of two types of wires: local feedback wires and cluster input wires. There are typically N local feedback wires in a cluster, one for each BLE. Often, many BLEs in a cluster will share common inputs. Accordingly, the number of inputs to a cluster (I) is less than the number of distinct BLE inputs in a cluster (i.e. N × K). It was shown in [2] that (2.1) is a good estimate of the number of inputs required to achieve 98% LUT utilization.

$$ I = \frac{K}{2}(N + 1) \tag{2.1} $$
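As a quick numerical check of the cluster input estimate, the following minimal Python sketch evaluates Eq. (2.1), read here as I = (K/2)(N + 1), next to the N × K count of distinct BLE inputs. The function names are illustrative only and are not taken from any tool discussed in this thesis.

```python
def cluster_inputs(k: int, n: int) -> float:
    """Cluster input estimate of Eq. (2.1): I = (K / 2) * (N + 1)."""
    if k < 1 or n < 1:
        raise ValueError("K and N must be positive integers")
    return k / 2 * (n + 1)


def distinct_ble_inputs(k: int, n: int) -> int:
    """Total number of BLE input pins in a cluster of N BLEs with K-input LUTs."""
    return n * k


if __name__ == "__main__":
    # (K, N) pairs: a small academic cluster and the two commercial
    # configurations quoted in the text (K=6, N=8 and K=6, N=10).
    for k, n in [(4, 4), (6, 8), (6, 10)]:
        print(f"K={k}, N={n}: I = {cluster_inputs(k, n):g} "
              f"vs. N*K = {distinct_ble_inputs(k, n)} BLE inputs")
```

For every pair shown, the estimated number of cluster inputs is well below N × K, which is the input sharing that the text describes.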
Figure 2.3: Logic cluster architecture.

Local routing multiplexers connect multiple local interconnect wires to each BLE input. These multiplexers are generally sparsely populated [29]. That is, BLE inputs can be connected to only a fraction of the wires in the local interconnect; we refer to this fraction as Fc,local. Sparsely populating the local routing multiplexers reduces their size and thus saves area. In [29], it was shown that reducing Fc,local from 1.0 to 0.5 reduces area by 10% with no degradation in critical path delay. However, as recommended by [29], between 2 and 5 spare cluster inputs should be added to (2.1) when sparsely populating the local routing multiplexers to maintain routability.

2.1.2 Routing Architecture

Logic blocks are interconnected by programmable routing channels that run horizontally and vertically on top of a tile (Figure 2.1). The number of tracks in a routing channel is referred to as its width (W). In this work, we assume that the widths of horizontal and vertical routing channels are equal, but it is possible for them to be different. For example, the horizontal routing channels on Stratix FPGAs are wider than the vertical channels due to the rectangular layout of their logic blocks [34]. A routing track is composed of wire segments that span one or more tiles. The length (L) of a routing segment specifies the number of tiles that it spans. For example, Figure 2.4 shows a routing channel that consists of four tracks of L = 2 wire segments and four tracks of L = 4 wire segments. Note that staggering the start point of wire segments as in Figure 2.4 is necessary for a tile-based layout as it ensures that all tiles remain identical [8]. A horizontal and a vertical routing channel intersect at each tile.
The set of programmable switches that allow connections to be made between routing tracks at this intersection is called a switch block (SB in Figure 2.1). Switch block flexibility (Fs) specifies the number of tracks to which any track can connect in a switch block. An Fs of 3, where each horizontal track connects to another horizontal track and two vertical tracks (and vice-versa), is common [49]. The specific tracks to which each track connects are determined by the switch block pattern [7, 37] as well as the routing driver architecture. In a multi-driver routing architecture (Figure 2.5a), a wire can be driven by multiple tri-state drivers at multiple points along its length. In contrast, in a single-driver routing architecture (Figure 2.5b), a wire can only be driven by a single multiplexer-based driver, usually placed at one end of the wire. Figures 2.5a and 2.5b also show that logic block outputs connect to the routing tracks differently based on the routing driver architecture. That is, multi-driver architectures connect logic block outputs directly to the routing wires while single-driver architectures connect logic block outputs to the routing wires through switch-block multiplexers. Although multi-driver routing architectures have been widely used in the past [7, 2], single-driver routing has become the dominant routing architecture style in both academic research [28, 27, 24] and commercial FPGAs [34, 33]. In this work, we focus on single-driver routing architectures. In [28], Lemieux et al. found that FPGAs with single-driver routing had 9% lower delay and were 25% smaller than FPGAs with multi-driver routing.

Figure 2.4: Routing segment lengths (length 2 and length 4 wire segments spanning FPGA tiles).

Connection block multiplexers connect multiple routing tracks to each logic block input (see Figure 2.5). The number of tracks that can connect to each logic block input is called the connection block input flexibility (Fc,in). Similarly, the number of routing wires that each logic block output can connect to is given by the connection block output flexibility (Fc,out). Reducing Fc,in from W to 0.2W as the logic cluster size increases from N = 1 to 20 and using an Fc,out of W/N were found to be good choices in [7].
These interconnect flexibility values have generally been used as rules of thumb in subsequent FPGA research.

2.1.3 Commercial BLE Architectures

The BLEs of modern commercial FPGAs are much more complex than the commonly used academic BLE described in Section 2.1.1 (Figure 2.2). Instead of a single K-LUT, some modern FPGA architectures [33, 35, 51] use fracturable LUTs, which are LUTs that can be configured as one large LUT or multiple smaller LUTs. For example, the Stratix V fracturable 6-LUT can be split into two 5-LUTs or four 4-LUTs provided that the functions being mapped to these LUTs meet certain constraints [35]. Modern BLEs also commonly support configuring LUTs as memories (LUTRAM) or shift registers and usually contain hard arithmetic carry logic [35, 52]. However, to keep the scope of this work tractable, we only consider regular K-LUTs, which are still relevant, and we do not consider carry logic as current academic CAD tools do not fully support this functionality.

The commonly used academic BLE shown in Figure 2.2 has a very limited ability to use both the lookup table and flip-flop together. Modern commercial BLEs include additional 2:1 multiplexers to allow the lookup table and flip-flop to be used in concert in many more ways [3, 52]. These extra multiplexers are included in our designs and will be described in more detail in Section 3.2.
Figure 2.5: Multi-driver and single-driver routing architectures. (a) Multi-driver architecture: the LB output connects to the routing wire via a tri-state driver, with drivers at mid-points. (b) Single-driver architecture: the LB output connects to the routing wire via an SB mux, with no drivers at mid-points.
Figure 2.6: FPGA architecture assessment methodology with VPR.

2.2 FPGA Architecture Assessment Methodology

The quality of an FPGA in terms of area, performance and power consumption is a function of the architectural parameters described in Section 2.1. These architecture parameters interact in complex ways; hence determining the best choice for each parameter is a challenging task. Although there has been some work towards developing analytical models to evaluate FPGA architectures [46, 26, 16], the standard architecture assessment procedure used by both commercial FPGA manufacturers and academic researchers is an experimental one that consists of implementing benchmark circuits on a candidate architecture in order to evaluate its area, delay and power.

Figure 2.6 shows the standard academic CAD flow used to evaluate FPGA architectures [7]. The CAD flow proceeds as follows. Benchmark circuits are first synthesized and mapped into lookup tables (LUTs), flip-flops (FFs) and hard-blocks (multipliers and block memories) based on a description of the architecture. LUTs and FFs are then packed into clusters in a manner that attempts to keep related LUTs and FFs in the same cluster such that connections between them can be routed through the logic cluster's fast local interconnect. Next, each cluster is placed into a specific logic block on the FPGA in a way that minimizes both the delay and the wire length of connections between logic clusters as much as possible. Once all logic clusters have been placed, connections between logic blocks are routed through the FPGA's general purpose interconnect. The routing algorithm tries to minimize the benchmark circuit's critical path delay while using the least amount of routing resources possible. Finally, timing analysis is performed to determine the implemented benchmark circuit's critical path delay, and area is calculated based on tile area and the number of logic blocks required by the placement.

The packing, placement and routing phases of the flow of Figure 2.6 are performed by VPR [7]. Since many of the algorithms used by VPR are timing-based, the VPR architecture file must describe the delays through the lookup tables, routing multiplexers and any other circuitry that makes up the FPGA. The delays of these circuits depend on the circuit topologies used, as well as the transistor sizing of the FPGA circuitry. Consequently, evaluating an FPGA architecture requires first completing its transistor-level design.

Figure 2.7: Six transistor SRAM cell.

2.3 FPGA Circuit Design

As mentioned in Section 2.1, we only consider soft-logic-based FPGAs with single-driver routing architectures in this thesis. Soft-logic FPGA architectures consist entirely of SRAM cells, routing multiplexers, lookup tables and flip-flops. This section describes commonly used circuit topologies and circuit design practices for these structures.

2.3.1 SRAM cells

An FPGA typically contains millions of memory bits used to configure routing multiplexers and store lookup table logic functions. Because there are so many of them, a key design goal for these memory bits is small area. Stability is also important, as state flipping would cause problems such as incorrectly configured routing multiplexers. A six transistor SRAM cell (Figure 2.7) has been the standard implementation in FPGA research [7] as it achieves both design goals reasonably well.

2.3.2 Routing Multiplexers

Routing multiplexers account for a large fraction of the area and delay of an FPGA. Consequently, it is crucial to choose a circuit implementation that is as efficient as possible. There are a number of approaches that can be taken to build a multiplexer, but most commercial FPGAs and almost all academic FPGA studies use an NMOS pass-transistor-based approach because each switch requires only one transistor, minimizing area. Figure 2.8 shows three of the most commonly used pass-transistor multiplexer topologies. Each multiplexer style possesses a different area-delay tradeoff that is a function of the number of multiplexer inputs [27, 9].
For example, since it has just one pass-transistor on the signal path, a 1-level multiplexer is generally faster than a 2-level multiplexer. But, for a large number of inputs, a 1-level multiplexer requires more SRAM cells than a 2-level multiplexer and can thus have larger area. Furthermore, if the number of inputs is large enough, a 1-level multiplexer could even become slower than a 2-level multiplexer due to a greater number of transistors loading the output node.
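The SRAM cell trade-off between multiplexer topologies can be illustrated with a rough count. The Python sketch below rests on assumptions that are not stated in the text: a 1-level multiplexer uses one SRAM cell per input (one-hot select), a 2-level multiplexer shares its first-level select cells across all groups and sizes both levels near the square root of the input count, and a tree multiplexer uses one encoded select bit per level. All function names are hypothetical.

```python
import math


def one_level_sram(n_inputs: int) -> int:
    """1-level MUX, assumed one-hot: one SRAM cell per input."""
    return n_inputs


def two_level_sizes(n_inputs: int) -> tuple[int, int]:
    """Assumed 2-level sizing: level 1 = ceil(sqrt(n)), level 2 = ceil(n / level 1)."""
    level1 = math.isqrt(n_inputs - 1) + 1  # integer ceil(sqrt(n)) for n >= 1
    level2 = math.ceil(n_inputs / level1)
    return level1, level2


def two_level_sram(n_inputs: int) -> int:
    """2-level MUX: level-1 select cells are shared by every group, plus level-2 cells."""
    level1, level2 = two_level_sizes(n_inputs)
    return level1 + level2


def tree_sram(n_inputs: int) -> int:
    """Tree MUX with encoded selects: ceil(log2(n)) SRAM cells, one per level."""
    return (n_inputs - 1).bit_length()


if __name__ == "__main__":
    for n in (8, 16, 32):
        print(f"{n}:1 MUX  1-level={one_level_sram(n)}  "
              f"2-level={two_level_sram(n)}  tree={tree_sram(n)} SRAM cells")
```

Under these assumptions the gap between the 1-level and 2-level counts widens as the number of inputs grows, which is consistent with the area argument above. The count says nothing about delay, which also depends on the number of pass-transistors in series and on transistor sizing.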
Figure 2.8: Different 8:1 pass-transistor multiplexer topologies. (a) Tree MUX. (b) 1-level MUX. (c) 2-level MUX.

Figure 2.9: Multiplexer followed by two-stage buffer with PMOS level-restorer.

It was shown in [9] that a 2-level multiplexer generally yields a lower area-delay product than a 1-level or tree multiplexer. Commercial FPGAs also commonly use 2-level multiplexers [33].

Although they are beneficial in terms of area, pass-transistors have an important disadvantage: they are incapable of passing a full logic-high voltage. That is, their output voltage saturates at approximately VG - VTh, where VG is the gate voltage and VTh is the threshold voltage of the transistor. In FPGA circuitry, the output of a pass-transistor-based routing multiplexer is typically driven by a multi-stage buffer [7, 30, 33]. Static power dissipation in these buffers caused by the reduced voltage swing of pass-transistors has long been a cause for concern [7]. To mitigate this problem, gate boosting [7] (applying a voltage larger than the supply voltage VDD to the pass-transistor gate) and PMOS level-restorers [30, 33] have been used to help pull pass-transistor output voltages up to VDD. Figure 2.9 shows a routing multiplexer followed by a two-stage buffer equipped with a PMOS level-restorer.
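The reduced logic-high level of a pass-transistor, and the effect of gate boosting on it, can be sketched with a first-order model in Python. The model below simply clamps the output at VG - VTh and ignores body effect and leakage; the 0.8 V supply matches a value used elsewhere in this thesis, while the 0.3 V threshold is a made-up number chosen only for illustration.

```python
def pass_transistor_high(v_in: float, v_gate: float, v_th: float) -> float:
    """First-order logic-high output of an NMOS pass-transistor.

    The output follows the input until it saturates at v_gate - v_th.
    Body effect and leakage are ignored in this sketch.
    """
    return min(v_in, v_gate - v_th)


def gate_voltage_for_full_swing(v_dd: float, v_th: float) -> float:
    """Smallest gate voltage that passes a full v_dd under the same model."""
    return v_dd + v_th


if __name__ == "__main__":
    V_DD = 0.8   # supply voltage in volts
    V_TH = 0.3   # hypothetical threshold voltage in volts

    unboosted = pass_transistor_high(V_DD, v_gate=V_DD, v_th=V_TH)
    boosted_gate = gate_voltage_for_full_swing(V_DD, V_TH)
    boosted = pass_transistor_high(V_DD, v_gate=boosted_gate, v_th=V_TH)

    print(f"no boosting:   output high = {unboosted:.2f} V")
    print(f"gate at {boosted_gate:.2f} V: output high = {boosted:.2f} V")
```

In this simplified picture, the unboosted output falls short of VDD by one threshold voltage, which is the swing that a level-restorer or gate boosting has to recover.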
Figure 2.10: Fully encoded MUX tree 3-LUT.

2.3.3 Lookup Tables

Like routing multiplexers, lookup tables also use pass-transistor-based multiplexer circuitry, but the multiplexer input and control connectivity is reversed. In a lookup table, SRAM cells connect to the inputs of the multiplexer and hold the logic function's truth table, while the gates of the multiplexer are controlled by the lookup table inputs. Consequently, lookup tables are generally implemented as fully-encoded multiplexer trees, such that each level of the tree can be connected to a LUT input [7]. Figure 2.10 shows a 3-input fully encoded multiplexer tree lookup table followed by a two-stage buffer.

2.3.4 Flip-Flops

Flip-flops are generally implemented as standard master-slave registers [7]. However, some commercial FPGAs use flip-flops that are more advanced. For example, Altera's Stratix V FPGAs use flip-flops based on pulse latches and configurable pulse width generators to improve performance [35].

2.4 Modeling of FPGA Circuitry

Evaluating an FPGA architecture with the assessment methodology described in Section 2.2 requires that we develop models that allow us to estimate the area and delay of FPGA circuitry because fabricating an integrated circuit for each architecture to measure area and delay is obviously not practical. In this section, we describe commonly used area and delay modeling approaches for FPGAs. These models are also useful for transistor sizing, which we will discuss in Section 2.5.
Figure 2.11: Minimum-width transistor area model.

Area Modeling

Creating a complete layout is the best way to determine the exact area of an FPGA. However, this process is much too time-consuming when multiple designs need to be explored. A variety of approaches have been used to estimate area more quickly, such as counting transistors or counting SRAM cells, but the most widely used in FPGA research is the minimum-width transistor area model introduced in [7].

In this model, layout area is expressed in units of minimum-width transistor areas. A minimum-width transistor is defined as the smallest possible contactable transistor for a specific process technology, and one minimum-width transistor area is the area of this transistor plus the spacing to neighboring transistors, as shown in Figure 2.11. Unlike area models that simply count transistors or SRAM cells, the minimum-width transistor area model provides an actual estimate of layout area. This distinction is important because, as well as being more accurate, layout area estimates enable better estimates of wire loads, since wire lengths are layout dependent.

Transistors in FPGA circuitry often require more drive strength than a minimum-width transistor provides. A transistor's drive strength can be increased either by widening its diffusion region (Figure 2.12b) or by adding parallel diffusion regions (Figure 2.12c). Consequently, increasing a transistor's drive strength increases its area. The widely used area model of [7] estimates the layout area of a transistor with drive strength x, in units of minimum-width transistor areas, with (2.2), which was obtained by averaging the layout areas that result from either widening the diffusion region or adding parallel diffusion regions to increase drive strength.
$$\mathrm{Area}(x) = 0.5 + 0.5x \qquad (2.2)$$

Then, [7] calculates the area of an FPGA subcircuit by simply summing the areas of all the transistors in that subcircuit. Note from (2.2) that doubling a transistor's drive strength does not double its area. This is because increasing a transistor's drive strength only increases certain transistor dimensions while others remain constant. For example, the spacing to neighboring transistors remains the same regardless of a transistor's drive strength.
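A minimal sketch of this area model is given below. It assumes the linear form $\mathrm{Area}(x) = 0.5 + 0.5x$ that is commonly quoted for the model of [7]; the function names and the example drive strengths are hypothetical.

```python
def transistor_area(x):
    """Area of a transistor with drive strength x, in minimum-width
    transistor areas (assumed form: 0.5 + 0.5 * x)."""
    return 0.5 + 0.5 * x


def subcircuit_area(drive_strengths):
    """Subcircuit area as the sum of its transistor areas."""
    return sum(transistor_area(x) for x in drive_strengths)


# Doubling drive strength from 1 to 2 raises area by 50%, not 100%,
# because the constant term (e.g. spacing to neighbors) does not scale.
print(transistor_area(1), transistor_area(2))

# Hypothetical subcircuit with three transistors of drive strengths 1, 2 and 4.
print(subcircuit_area([1, 2, 4]))
```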
Figure 2.12: Increasing drive strength with diffusion widening (b) or parallel diffusion regions (c): (a) minimum drive strength, (b) 2× minimum drive strength, (c) 2× minimum drive strength. Note: Although not shown in the figure for simplicity, parallel diffusions must be connected together.

Delay Modeling

Time-domain circuit simulators such as HSPICE are generally the most accurate way to estimate the delay of a circuit. However, time-domain simulation can be computationally intensive, making it impractical when a large number of delay measurements need to be obtained. For example, the timing analysis phase of the architecture assessment flow described in Section 2.2 involves measuring delay for the thousands of nets in a benchmark circuit; performing time-domain simulation for each one would lead to prohibitively long runtimes. Instead, previous FPGA research has typically modeled wires and transistors as linear resistances and capacitances, such that a transistor-based circuit can be modeled as an RC-tree network [22, 7, 24]. The delay of this network can then be estimated with the Elmore [15] or the Penfield-Rubinstein [20] delay models, which are much quicker than time-domain simulations. With the Elmore delay model, the delay $T_D$ of a path is given by:

$$T_D = \sum_{i \in \mathrm{path}} R_i \, C(\mathrm{subtree}_i) \qquad (2.3)$$

where $R_i$ is the equivalent resistance of element $i$ along the path and $C(\mathrm{subtree}_i)$ is the total downstream capacitance rooted at element $i$.

An enhanced version of the Elmore delay model was proposed in [39]. Since it is more difficult to model a buffer as a simple RC circuit due to the buffer's intrinsic delay, [39] combines the Elmore delay model with a common model of buffer delay where a buffer is modeled as a constant delay and a resistor.
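The Elmore computation in (2.3), together with the constant-delay buffer term just described, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the helper handles just an unbranched RC chain, and the caller is expected to supply downstream capacitances that already reflect any isolation provided by buffers.

```python
def chain_subtree_caps(node_caps):
    """Downstream capacitance seen by each element of an unbranched RC chain.

    node_caps[i] is the capacitance at the output node of element i;
    the result at index i is the sum of node_caps[i:].
    """
    caps, total = [], 0.0
    for c in reversed(node_caps):
        total += c
        caps.append(total)
    caps.reverse()
    return caps


def elmore_delay(resistances, subtree_caps, buffer_delays=None):
    """Sum of R_i * C(subtree_i) over the path, plus an optional intrinsic
    delay for each element that is a buffer (0 for all other elements)."""
    if buffer_delays is None:
        buffer_delays = [0.0] * len(resistances)
    return sum(r * c + t
               for r, c, t in zip(resistances, subtree_caps, buffer_delays))


# Hypothetical two-element chain: resistances in ohms, capacitances in farads.
r = [1000.0, 2000.0]
c_sub = chain_subtree_caps([1e-15, 2e-15])
print(elmore_delay(r, c_sub))
```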
This approach maps well to FPGA circuitry, which consists mostly of pass-transistors and buffers, and was adopted as the delay model for VPR in [7]. With this model, the delay $T_D$ of a path is given by:

$$T_D = \sum_{i \in \mathrm{path}} \left( R_i \, C(\mathrm{subtree}_i) + T_{buf,i} \right) \qquad (2.4)$$

where $T_{buf,i}$ is the buffer's intrinsic delay if element $i$ is a buffer, or 0 otherwise [7].

2.5 Automated Transistor Sizing

Transistor sizing is a well-studied problem that consists of improving a circuit's performance by increasing the sizes of its transistors. It thus provides yet another level, in addition to architecture and circuit design, at which the area and delay characteristics of an FPGA can be adjusted. The transistor sizing optimization problem is usually formulated in one of three ways:
1. Minimize some function of area and delay.
2. Minimize area subject to a delay constraint.
3. Minimize delay subject to an area constraint.

There has been much prior work on automated transistor sizing for custom circuitry. Fishburn and Dunlop showed in [17] that modeling transistors as linear resistances and capacitances and calculating the delay of the resulting RC circuits with the Elmore [15] or the Penfield-Rubinstein [20] delay model (i.e. (2.3)) allows the transistor sizing problem to be formulated as a convex optimization problem, which guarantees that any local minimum is the global minimum. With this useful property, [17] develops TILOS, a transistor sizing tool for custom circuits based on a heuristic method that iteratively identifies a circuit's critical path and increases transistor sizes on that path until all timing constraints are met. Despite the convexity of the problem, TILOS's heuristic can terminate with a suboptimal solution [45]. Algorithms guaranteeing the optimal solution through convex optimization [44] or mathematical relaxation techniques [10, 47] have subsequently been proposed, but these algorithms, along with TILOS, all rely on linear device models and the Elmore delay, which have long been known to be inaccurate [40, 21]. To enhance accuracy, at the cost of increased computational complexity, some transistor sizing algorithms have used time-domain simulation to obtain delay estimates [14, 13].

The programmability of FPGAs adds unique features to the transistor sizing problem, which makes FPGA-specific transistor sizing tools valuable. Kuon and Rose proposed such a tool in [24]. Their FPGA transistor sizing approach differs from transistor sizing algorithms for custom circuits because it deals with two features unique to FPGAs. The first of these unique features is repetition. As described in Section 2.1, an FPGA consists of an array of tiles.
Since these tiles are all identical, transistor-level design only needs to be performed for one of them. This design can then be replicated to obtain a complete FPGA. Similar design space reductions can be found within a tile. For example, a switch block can include over 100 logically equivalent multiplexers whose transistor-level design should be kept identical. Consequently, only 80 unique transistors need to be sized when designing an FPGA's soft logic despite there being billions of transistors on the chip, in contrast to transistor sizing for custom circuits, where the whole chip must be considered. This reduced design space makes HSPICE-based optimization practical for FPGAs, but as we show in Section 3.7, we must still search this space intelligently to keep runtime reasonable.

The second feature unique to FPGA transistor sizing is the undefined critical path. Because they are programmable, FPGAs have application-dependent critical paths, which implies that at design time there is no clear critical path to optimize for delay. To deal with this issue, [24] optimizes a representative path that contains one of each type of FPGA subcircuit (LUTs, MUXes, etc.). Delay is taken as a weighted sum of the delay of each subcircuit, and the weighting scheme is chosen based on the frequency with which each subcircuit was found on the critical paths of placed and routed benchmark circuits. Optimizing a representative critical path still presents a huge design space, which Kuon and Rose tackle with a two-phase algorithm: an exploratory phase that uses linear device models and a TILOS-like transistor sizing heuristic to keep CPU times reasonable, followed by an HSPICE-based fine-tuning phase that adjusts the transistor sizes to account for the inaccuracies of linear models.

In [46], Smith et al. present a method that enables the rapid and concurrent optimization of high-level architecture parameters and transistor sizes for FPGAs through the use of analytic architecture models, linear device models and a convex optimization-based transistor sizing algorithm. They show that this concurrent optimization can have a significant impact on architectural conclusions versus a separate optimization.
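To make the TILOS-style greedy heuristic discussed in this section more concrete, the sketch below sizes a short chain of drivers modeled with linear resistances and capacitances and the Elmore delay. It is a simplified illustration under stated assumptions, not the algorithm of [17] or [24]: it handles a single path only, every candidate bump has the same width (and thus roughly the same area) cost so sensitivity reduces to the delay gain, and all names and RC constants are hypothetical.

```python
def path_delay(widths, r_unit=1.0, c_gate=1.0, c_diff=0.5,
               r_source=1.0, c_load=64.0):
    """Elmore delay of a chain of drivers using linear RC device models.

    A fixed source resistance drives the gate of the first stage. Stage i
    has resistance r_unit / w_i and drives its own diffusion capacitance
    plus the gate capacitance of the next stage (c_load for the last one).
    """
    delay = r_source * c_gate * widths[0]
    for i, w in enumerate(widths):
        next_cap = c_gate * widths[i + 1] if i + 1 < len(widths) else c_load
        delay += (r_unit / w) * (c_diff * w + next_cap)
    return delay


def greedy_size(widths, target_delay, step=1.0, max_iters=1000):
    """Repeatedly widen the transistor whose bump reduces delay the most,
    until the delay target is met or no single bump helps."""
    widths = list(widths)
    for _ in range(max_iters):
        current = path_delay(widths)
        if current <= target_delay:
            break
        best_gain, best_index = 0.0, None
        for i in range(len(widths)):
            trial = widths.copy()
            trial[i] += step
            gain = current - path_delay(trial)
            if gain > best_gain:
                best_gain, best_index = gain, i
        if best_index is None:  # no improving move: stop, possibly suboptimal
            break
        widths[best_index] += step
    return widths


sized = greedy_size([1.0, 1.0, 1.0], target_delay=20.0)
print(sized, path_delay(sized))
```

Because the loop stops as soon as the constraint is met, or as soon as no single bump helps, it illustrates how such a heuristic can end at a suboptimal sizing even when the underlying problem is convex.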
Chapter 3

COFFE: Automated Optimization of FPGA Circuitry

When developing a new chip, FPGA architects are faced with two main tasks: choosing an architecture for their FPGA and performing the transistor-level design of that architecture. As described in Section 2.2, choosing an architecture is typically accomplished experimentally with architecture exploration tools such as VPR [7]. By implementing benchmark circuits on a proposed FPGA, these tools allow architects to evaluate the area, delay and power impact of various architectural choices. Then, based on their observations, architects can select an FPGA architecture that meets their design goals and constraints.

Transistor-level design consists of selecting circuit topologies for the various subcircuits that implement the chosen architecture, as well as sizing the transistors of those subcircuits. Transistor-level design is an essential precursor to the evaluation of an architecture because it provides accurate area, delay and power estimates of the underlying FPGA circuitry; these estimates are required inputs to the architecture exploration tools. Transistor sizing also provides an additional opportunity to tune the area, delay and power of an FPGA. Therefore, developing a new FPGA is an iterative process that involves performing the transistor-level design of various architectures before evaluating them through synthesis, placement and routing experiments. This interdependence between architecture exploration and transistor-level design necessitates automated design tools if high-quality results are to be obtained in reasonable amounts of time.

In this chapter, we describe COFFE (Circuit Optimization For FPGA Exploration), a fully-automated transistor sizing tool for FPGAs that we developed as part of this thesis. COFFE enables the design flow detailed above by providing area, delay and power estimates of properly sized FPGA circuitry.
COFFE also enables design exploration of FPGA circuitry, and we will use COFFE in such a capacity in Chapter 4. Although COFFE solves the same problem as Kuon and Rose's FPGA transistor sizing tool [24] (see Section 2.5), we have made significant improvements which are necessary for FPGAs in advanced process nodes; these improvements will be described in the following sections.

3.1 Introduction to COFFE

Figure 3.1: FPGA design flow.

Figure 3.1 shows the FPGA design flow we wish to enable with COFFE. COFFE is used to perform transistor-level optimization for some architecture of interest, thus producing accurate area and delay estimates for the subcircuits of this architecture (LUTs, routing multiplexers, etc.). These estimates are used by VPR to evaluate the architecture through place and route experiments. Based on the results of the assessment, the architecture parameters are adjusted and sent back to COFFE to begin a new iteration of optimization and evaluation.

COFFE's circuit optimizer makes area and performance trade-offs through transistor sizing. Like [24], COFFE's optimization objective is of the form $\mathrm{Area}^b \cdot \mathrm{Delay}^c$, thus allowing for different area and performance trade-offs by varying $b$ and $c$. Creating a complete layout is the most accurate way to obtain the area and delay measurements needed during transistor sizing. However, for the iterative design flow of Figure 3.1, this approach is impractical as layout is a very time-consuming task. Instead, COFFE estimates area with an improved version of the minimum-width transistor area model (see Section 3.4) and measures delay with HSPICE simulations. Although previous FPGA transistor sizing tools have used linearized models of transistors to measure delay during certain phases of the optimization, we show in Section 3.5 that such models are highly inaccurate for the fine-grained transistor-level design we wish to undertake in advanced process nodes such as the 22nm process we use in this work.

COFFE automatically generates the SPICE netlists required for delay measurement based on the input architecture parameters and the circuit topologies described in Sections 3.2 and 3.3 respectively.
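The effect of the exponents in an objective of the form Area^b Delay^c can be illustrated with a short sketch. The function names, the candidate sizings and the numbers below are hypothetical; the weighted delay follows the representative-path idea of [24] described in Section 2.5 and is not COFFE's actual implementation.

```python
def weighted_delay(subcircuit_delays, weights):
    """Representative-path delay: weighted sum of subcircuit delays."""
    return sum(weights[name] * d for name, d in subcircuit_delays.items())


def cost(area, delay, b=1.0, c=1.0):
    """Optimization objective of the form Area^b * Delay^c."""
    return (area ** b) * (delay ** c)


# Two hypothetical sizings: A is smaller, B is larger but faster.
candidate_a = {"area": 100.0, "delay": 10.0}
candidate_b = {"area": 130.0, "delay": 8.0}

for b, c in [(1, 1), (1, 2)]:
    cost_a = cost(candidate_a["area"], candidate_a["delay"], b, c)
    cost_b = cost(candidate_b["area"], candidate_b["delay"], b, c)
    best = "A" if cost_a < cost_b else "B"
    print(f"b={b}, c={c}: cost A={cost_a:.0f}, cost B={cost_b:.0f}, best={best}")
```

With equal exponents the smaller sizing is preferred in this example, while raising the delay exponent shifts the preference to the faster, larger sizing.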
These netlists are parametrized such that COFFE's circuit optimizer can change the transistor sizes by simply changing a transistor size parameter list. To obtain meaningful delays, COFFE is careful to ensure that these netlists include realistic transistor and wire loading. Transistor loads are relatively easy to determine based on architectural parameters and circuit topologies. Wire loads, on the other hand, are layout dependent, making them more difficult to determine since the exact layout is not known. COFFE estimates wire loads with a wire load model described later in this chapter.

Architecture

Figure 3.2 shows the tile architecture that COFFE supports in its designs and Table 3.1 lists the architecture parameters that COFFE expects as inputs. Parameters listed in the top portion of Table 3.1
1014 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 24, NO. 7, JULY 2005 Static Leakage Reduction Through Simultaneous V t /T ox and State Assignment Dongwoo Lee, Student
More informationPROGRAMMABLE ASIC INTERCONNECT
PROGRAMMABLE ASIC INTERCONNECT The structure and complexity of the interconnect is largely determined by the programming technology and the architecture of the basic logic cell The first programmable ASICs
More informationFPGA-SPICE: A Simulation-based Power Estimation Framework for FPGAs
FPGA-SPICE: A Simulation-based Power Estimation Framework for FPGAs ifan Tang, Pierre-Emmanuel Gaillardon and Giovanni De icheli Integrated Systems aboratory (SI), École Polytechnique Fédérale de ausanne
More informationVery Large Scale Integration (VLSI)
Very Large Scale Integration (VLSI) Lecture 6 Dr. Ahmed H. Madian Ah_madian@hotmail.com Dr. Ahmed H. Madian-VLSI 1 Contents Array subsystems Gate arrays technology Sea-of-gates Standard cell Macrocell
More informationLeakage Power Minimization in Deep-Submicron CMOS circuits
Outline Leakage Power Minimization in Deep-Submicron circuits Politecnico di Torino Dip. di Automatica e Informatica 1019 Torino, Italy enrico.macii@polito.it Introduction. Design for low leakage: Basics.
More informationDigital Microelectronic Circuits ( ) Pass Transistor Logic. Lecture 9: Presented by: Adam Teman
Digital Microelectronic Circuits (361-1-3021 ) Presented by: Adam Teman Lecture 9: Pass Transistor Logic 1 Motivation In the previous lectures, we learned about Standard CMOS Digital Logic design. CMOS
More informationDESIGNING powerful and versatile computing systems is
560 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 15, NO. 5, MAY 2007 Variation-Aware Adaptive Voltage Scaling System Mohamed Elgebaly, Member, IEEE, and Manoj Sachdev, Senior
More informationA Low-Power High-speed Pipelined Accumulator Design Using CMOS Logic for DSP Applications
International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume. 1, Issue 5, September 2014, PP 30-42 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org
More informationFast Placement Optimization of Power Supply Pads
Fast Placement Optimization of Power Supply Pads Yu Zhong Martin D. F. Wong Dept. of Electrical and Computer Engineering Dept. of Electrical and Computer Engineering Univ. of Illinois at Urbana-Champaign
More informationCPE/EE 427, CPE 527 VLSI Design I: Homeworks 3 & 4
CPE/EE 427, CPE 527 VLSI Design I: Homeworks 3 & 4 1 2 3 4 5 6 7 8 9 10 Sum 30 10 25 10 30 40 10 15 15 15 200 1. (30 points) Misc, Short questions (a) (2 points) Postponing the introduction of signals
More informationDesigning Information Devices and Systems II Fall 2017 Note 1
EECS 16B Designing Information Devices and Systems II Fall 2017 Note 1 1 Digital Information Processing Electrical circuits manipulate voltages (V ) and currents (I) in order to: 1. Process information
More informationElectronic Circuits EE359A
Electronic Circuits EE359A Bruce McNair B206 bmcnair@stevens.edu 201-216-5549 1 Memory and Advanced Digital Circuits - 2 Chapter 11 2 Figure 11.1 (a) Basic latch. (b) The latch with the feedback loop opened.
More informationRamon Canal NCD Master MIRI. NCD Master MIRI 1
Wattch, Hotspot, Hotleakage, McPAT http://www.eecs.harvard.edu/~dbrooks/wattch-form.html http://lava.cs.virginia.edu/hotspot http://lava.cs.virginia.edu/hotleakage http://www.hpl.hp.com/research/mcpat/
More informationNanoFabrics: : Spatial Computing Using Molecular Electronics
NanoFabrics: : Spatial Computing Using Molecular Electronics Seth Copen Goldstein and Mihai Budiu Computer Architecture, 2001. Proceedings. 28th Annual International Symposium on 30 June-4 4 July 2001
More informationImplementation of Efficient 5:3 & 7:3 Compressors for High Speed and Low-Power Operations
Volume-7, Issue-3, May-June 2017 International Journal of Engineering and Management Research Page Number: 42-47 Implementation of Efficient 5:3 & 7:3 Compressors for High Speed and Low-Power Operations
More informationRuixing Yang
Design of the Power Switching Network Ruixing Yang 15.01.2009 Outline Power Gating implementation styles Sleep transistor power network synthesis Wakeup in-rush current control Wakeup and sleep latency
More informationAndrew Clinton, Matt Liberty, Ian Kuon
Andrew Clinton, Matt Liberty, Ian Kuon FPGA Routing (Interconnect) FPGA routing consists of a network of wires and programmable switches Wire is modeled with a reduced RC network Drivers are modeled as
More informationLow Power 32-bit Improved Carry Select Adder based on MTCMOS Technique
Low Power 32-bit Improved Carry Select Adder based on MTCMOS Technique Ch. Mohammad Arif 1, J. Syamuel John 2 M. Tech student, Department of Electronics Engineering, VR Siddhartha Engineering College,
More informationLow-Power Digital CMOS Design: A Survey
Low-Power Digital CMOS Design: A Survey Krister Landernäs June 4, 2005 Department of Computer Science and Electronics, Mälardalen University Abstract The aim of this document is to provide the reader with
More informationEC 1354-Principles of VLSI Design
EC 1354-Principles of VLSI Design UNIT I MOS TRANSISTOR THEORY AND PROCESS TECHNOLOGY PART-A 1. What are the four generations of integrated circuits? 2. Give the advantages of IC. 3. Give the variety of
More informationA Novel Flipflop Topology for High Speed and Area Efficient Logic Structure Design
IOSR Journal of Electronics and Communication Engineering (IOSR-JECE) e-issn: 2278-2834,p- ISSN: 2278-8735. Volume 6, Issue 2 (May. - Jun. 2013), PP 72-80 A Novel Flipflop Topology for High Speed and Area
More informationEE241 - Spring 2004 Advanced Digital Integrated Circuits. Announcements. Borivoje Nikolic. Lecture 15 Low-Power Design: Supply Voltage Scaling
EE241 - Spring 2004 Advanced Digital Integrated Circuits Borivoje Nikolic Lecture 15 Low-Power Design: Supply Voltage Scaling Announcements Homework #2 due today Midterm project reports due next Thursday
More informationThe challenges of low power design Karen Yorav
The challenges of low power design Karen Yorav The challenges of low power design What this tutorial is NOT about: Electrical engineering CMOS technology but also not Hand waving nonsense about trends
More informationGeared Oscillator Project Final Design Review. Nick Edwards Richard Wright
Geared Oscillator Project Final Design Review Nick Edwards Richard Wright This paper outlines the implementation and results of a variable-rate oscillating clock supply. The circuit is designed using a
More informationIMPLEMANTATION OF D FLIP FLOP BASED ON DIFFERENT XOR /XNOR GATE DESIGNS
IMPLEMANTATION OF D FLIP FLOP BASED ON DIFFERENT XOR /XNOR GATE DESIGNS 1 MADHUR KULSHRESTHA, 2 VIPIN KUMAR GUPTA 1 M. Tech. Scholar, Department of Electronics & Communication Engineering, Suresh Gyan
More informationPOWER ESTIMATION FOR FIELD PROGRAMMABLE GATE ARRAYS. Kara Ka Wing Poon B.A.Sc, University of British Columbia, 1999
POWER ESTIMATION FOR FIELD PROGRAMMABLE GATE ARRAYS by Kara Ka Wing Poon B.A.Sc, University of British Columbia, 999 A thesis submitted in partial fulfillment of the requirements for the degree of Master
More informationLearning Outcomes. Spiral 2 8. Digital Design Overview LAYOUT
2-8.1 2-8.2 Spiral 2 8 Cell Mark Redekopp earning Outcomes I understand how a digital circuit is composed of layers of materials forming transistors and wires I understand how each layer is expressed as
More informationA Low Power Array Multiplier Design using Modified Gate Diffusion Input (GDI)
A Low Power Array Multiplier Design using Modified Gate Diffusion Input (GDI) Mahendra Kumar Lariya 1, D. K. Mishra 2 1 M.Tech, Electronics and instrumentation Engineering, Shri G. S. Institute of Technology
More informationUMAINE ECE Morse Code ROM and Transmitter at ISM Band Frequency
UMAINE ECE Morse Code ROM and Transmitter at ISM Band Frequency Jamie E. Reinhold December 15, 2011 Abstract The design, simulation and layout of a UMAINE ECE Morse code Read Only Memory and transmitter
More informationLecture 4&5 CMOS Circuits
Lecture 4&5 CMOS Circuits Xuan Silvia Zhang Washington University in St. Louis http://classes.engineering.wustl.edu/ese566/ Worst-Case V OL 2 3 Outline Combinational Logic (Delay Analysis) Sequential Circuits
More informationPROCESS-VOLTAGE-TEMPERATURE (PVT) VARIATIONS AND STATIC TIMING ANALYSIS
PROCESS-VOLTAGE-TEMPERATURE (PVT) VARIATIONS AND STATIC TIMING ANALYSIS The major design challenges of ASIC design consist of microscopic issues and macroscopic issues [1]. The microscopic issues are ultra-high
More informationMS Project :Trading Accuracy for Power with an Under-designed Multiplier Architecture Parag Kulkarni Adviser : Prof. Puneet Gupta Electrical Eng.
MS Project :Trading Accuracy for Power with an Under-designed Multiplier Architecture Parag Kulkarni Adviser : Prof. Puneet Gupta Electrical Eng., UCLA - http://nanocad.ee.ucla.edu/ 1 Outline Introduction
More informationReduction of Peak Input Currents during Charge Pump Boosting in Monolithically Integrated High-Voltage Generators
Reduction of Peak Input Currents during Charge Pump Boosting in Monolithically Integrated High-Voltage Generators Jan Doutreloigne Abstract This paper describes two methods for the reduction of the peak
More information