IMPLICATIONS OF FUTURE TECHNOLOGIES ON THE DESIGN OF FPGAs


The Pennsylvania State University
The Graduate School
Department of Computer Science and Engineering

IMPLICATIONS OF FUTURE TECHNOLOGIES ON THE DESIGN OF FPGAs

A Thesis in Computer Science and Engineering
by Aman Gayasen

© 2006 Aman Gayasen

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

December 2006

The thesis of Aman Gayasen was reviewed and approved by the following:

Mahmut Kandemir, Associate Professor of Computer Science and Engineering, Thesis Co-Adviser, Co-Chair of Committee
Vijaykrishnan Narayanan, Associate Professor of Computer Science and Engineering, Thesis Co-Adviser, Co-Chair of Committee
Mary Jane Irwin, Professor of Computer Science and Engineering
Vittal Prabhu, Associate Professor of Industrial and Manufacturing Engineering
Raj Acharya, Professor of Computer Science and Engineering, Head of the Department of Computer Science and Engineering

Signatures are on file in the Graduate School.

Abstract

The Field Programmable Gate Array (FPGA) industry is going through an exciting phase. The market leaders, Xilinx and Altera, announce new products almost every year. Their CAD tools also keep adding new features. The growing popularity of FPGAs demands that we sustain this growth. This thesis explores new technologies for continuing the improvement of FPGAs in the future.

In this thesis, we study the effect of three main future technologies. First, we evaluate FPGA designs for scaled CMOS technologies, 65nm and below. The main problems here are leakage, temperature, and process variation. Second, we look at 3-D stacking of multiple dies within a package. Since this technology is still being perfected, we have several parameters to play with. For example, the properties of the vias that provide communication among the different layers (inter-layer vias) are very different from the other wires, especially pitch and length. This brings about an asymmetry in the FPGA fabric. How this influences the FPGA architecture is a question we try to answer. Furthermore, stacking multiple layers increases the power density, which increases the junction temperature. This thesis studies the impact of stacking on temperature, and proposes thermal-aware organization of FPGAs. Finally, we look at some technologies that are still in their infancy, such as molecular switches, carbon nanotubes, and silicon nanowires. Specifically, we explore the use of such technologies to implement the interconnect fabric in an FPGA.

Table of Contents

List of Tables
List of Figures
Acknowledgments
Chapter 1. Introduction
  FPGA Architectures
Chapter 2. Related Work
Chapter 3. Reducing Leakage Energy in FPGAs
  Using Sleep Transistors; RCP: Region-Constrained Placement; Combining RCP and Time-Based Control; Experimentation; Time-based leakage control; Results and Analysis; Time-based Leakage Control; Summary; Dual-Vdd FPGA Architecture; Fully Programmable (FP); Logic Programmable (LP); Level Conversion Methodology; Vdd Assignment; Power Estimation; Results and Analysis; FP Architecture; LP Architecture; Summary
Chapter 4. Three-Dimensional FPGAs
  Background; 3-D Switch Boxes; 3-D Technology Overview; 3-D Detailed Routing Architecture; Switch Box Topology; Experimentation; Architecture and Technology Parameters; Experimentation Flow; Area Model; Results and Analysis; Thermal Issues in 3-D FPGAs; Thermal-Characterization of FPGAs: 2-D to 3-D; 4.3.2 Thermal-Aware 3-D FPGA Organization; Summary
Chapter 5. Technology Alternatives for Nanoscale FPGA Interconnects
  Nanotechnology Primitives; Related Work; Nanoscale FPGA Architectures; Arch1: Using non-lithographic nano-wires and molecular switches; Arch2: FPGA using lithographic wires and molecular switches; Comparative Evaluation; Results; Summary
Chapter 6. Summary and Future Directions
  Future Directions
References

List of Tables

3.1 Characteristics of benchmark designs
Comparison of High-to-Low and Low-to-High algorithms (LC at CLB inputs, Vddh = 1.1V, Vddl = 0.8V)
Via properties
Power densities in 4VFX100 (Freq: 500MHz)
Effect of stacking on temperature
Parameters for temperature estimation in HS3d
Thermal-aware 3-D FPGA design

List of Figures

1.1 Traditional FPGA architecture
1.2 Virtex-2 FPGA architecture
3.1 FPGA containing sleep transistors
3.2 Leakage energy breakdown
3.3 a) Horizontal and b) Vertical styles of RCP on an XC2V40 FPGA for a region size of 2×4 slices. Required number of regions is 100 (13 regions)
3.4 Different placements for an example design. In part (c), each module is bounded by a polygon
3.5 Experimental Flow
3.6 Average leakage power savings for RCP and normal placement
3.7 Leakage power savings for RCP for 4×16 region for all designs
3.8 Average clock frequency for RCP
3.9 Average leakage energy savings for RCP and normal placement
3.10 Leakage power savings for time-based leakage control
3.11 Leakage energy savings for time-based leakage control
3.12 Supply transistors used for programmable Vdd
3.13 Fully programmable dual-vdd architecture (FP)
3.14 Logic programmable dual-vdd architecture (LP)
3.15 Level converter circuit
3.16 Experimental Flow
3.17 Distribution of path delays
3.18 Power consumption for different Vddl's. Vddh=1.1V
3.19 Power consumption for different architectures and algorithms. Vddh=1.1V, Vddl=0.9V
3.20 Average power breakdown between logic and routing resources. Vddh=1.1V, Vddl=0.9V
3.21 Average power consumption for different critical path delay tolerances. Vddh=1.1V, Vddl=0.9V
3.22 Critical path delay for LP FPGA with different extents of Vddl resources. Vddh=1.1V, Vddl=0.9V
3.23 Energy consumption in LP FPGAs. Vddh=1.1V, Vddl=0.9V
4.1 3-D switch boxes. X0, Y0, X1, Y1 mark their sides
4.2 Two kinds of stacking
4.3 3-D FPGA
4.4 3-D switch boxes for H=4, V=
4.5 Experimentation flow
4.6 Comparing 2-D and 3-D FPGAs
4.7 Comparing the switch boxes for 5-layer FPGA
4.8 Comparing the switch boxes for different via technologies for 5-layer FPGA
4.9 Comparing the switch boxes for different process nodes for 5-layer FPGA
4.10 Virtex-4 FX100 device (not to scale)
4.11 Thermal profile of 4VFX
4.12 Effect of stacking on peak temperature
4.13 3-D FPGA organizations
5.1 FPGA using nano-wires and molecular switches
5.2 D organization of nano-wires
5.3 Critical path delays in the 3 architectures
5.4 Dependence of performance on molecular switch's ON resistance
5.5 Resistance and Capacitance values of single-length NiSi nano-wires
5.6 Performance of a design (misex3) using metal nano-wires

Acknowledgments

I am grateful to my advisers, Dr. Vijay and Dr. Kandemir, for their support throughout my Ph.D. Without their guidance, both in professional and personal matters, I would never have completed this thesis. I am also thankful to members of MDL for creating a friendly work environment. Some of the work in this thesis was done with the help of other MDL students. While it is impossible to thank everyone who might have influenced my research indirectly, I am attempting to thank those who worked closely with me on several projects. Yuh-Fang and Ki-Yong helped me with the FPGA power work. Priya helped with the thermal work. Besides them, Suresh worked with me on several projects. I also enjoyed some enlightening conversations with Vijay Sai and Greg Link. During the last semester at Penn State, I also worked with Soumya, Prasanth, and Sungmin. Besides them, my neighbor in the lab, Jooheung, was a constant source of inspiration. I also wish to thank Ing-Chao for the ping-pong games; they helped me focus when I was under stress.

Outside Penn State, I frequently collaborated with Tim Tuan and Arif Rahman of Xilinx Research Labs. I am grateful for their help. They, and Satyaki Das, were excellent mentors during my internships at Xilinx.

Chapter 1

Introduction

Field Programmable Gate Arrays (FPGAs) are Integrated Circuits (ICs) containing programmable logic and interconnect elements. The "Field" in FPGA denotes their ability to be programmed by the end-user. The "Gate Array" signifies their similarity to conventional mask-programmed gate arrays. FPGAs belong to a broader category of field-programmable devices, called Programmable Logic Devices (PLDs), which include PLA (Programmable Logic Arrays), PAL (Programmable Array Logic), and CPLD (Complex PLD). While PLAs and PALs can implement only two-level logic, both FPGAs and CPLDs can implement multi-level logic. CPLDs consist of multiple PAL elements interconnected through a programmable switch matrix. In contrast, FPGAs contain several small programmable logic elements connected using a mixture of short and long wires and programmable switches. While CPLDs offer more predictable timing, they lag FPGAs in logic capacity. Because of their large capacities and superior device utilization, FPGAs are among the most popular programmable devices.

FPGAs present significant advantages over microprocessors as well as ASICs. Compared to microprocessors, they offer higher performance for parallel applications. Compared to ASICs, they offer a simpler design flow and lower Non-Recurring Engineering (NRE) costs. Therefore, they are suited for small-to-medium volumes of production and for products where time-to-market is critical. Furthermore, the regular structure of FPGAs makes them highly amenable to shrinking geometries, and therefore, they usually are at the forefront of new technologies. Consequently, by using FPGAs, designers can get the advantages of advanced top-of-the-line process technologies without worrying about the complexities that accompany technology scaling. For all the above reasons, FPGAs are poised to be among the most popular devices of the future.

At the time of their introduction in the mid-eighties, FPGAs were primarily used for prototyping and to implement glue logic. However, over a period of twenty years, especially since the late nineties, their market has diversified significantly. The inclusion of embedded processors, memory, and DSP blocks provides the complete platform to create embedded systems [1, 2]. Their inherent parallelism, coupled with an increase in size and decrease in logic delays, allows them to be used as hardware accelerators for high performance applications (e.g., [3, 4, 5]). People are also working to create scalable supercomputers using an array of off-the-shelf FPGAs [6]. Furthermore, the introduction of low-cost FPGAs by both Xilinx [1] and Altera [2] has enabled the use of FPGAs in consumer markets. Technology research group Gartner Dataquest forecasts that the market for programmable logic devices, which incorporates reconfigurable computing with FPGAs, will double in a period of five years, from $3.1 billion in 2005 to $6.2 billion in 2010 [7].

In order to maintain this growth in the FPGA market, FPGAs must consistently improve in performance, size, and features. This thesis explores future technologies that will be crucial in sustaining this improvement. Future technologies can be divided into the following three categories.

The first category consists of scaled CMOS technologies, as predicted by ITRS [8]. It predicts that the industry will move to 22nm technology. These technologies will face severe power and reliability problems. Leakage power, which until 130nm was only a minor component of the total power, has already become a severe problem in sub-100nm technologies. Furthermore, because of increased power densities on the die, the die temperature is also increasing. Higher temperature in turn causes a plethora of problems, including an increase in leakage power and a reduction in silicon lifetime. In severe cases, heat could also melt the package and cause total disruption. Besides power and temperature, variability, both manufacturing and long-term, is a serious problem. Defect rates are also expected to increase for smaller technologies. In this study, we focus on reducing leakage power and temperature in an FPGA.

The second category comprises evolutionary technologies, such as stacking multiple device layers to create a three-dimensional (3-D) IC. Three-D stacking is helpful in reducing wire lengths, which translates into a reduction in the FPGA area and power consumption. A timing-driven placement and routing tool can also use 3-D to reduce the critical path delay. The vertical connections in a 3-D technology are much larger than the metal wires in a 2-D chip. Therefore, we would normally try to reduce their number. A key challenge for 3-D FPGAs is designing the routing architecture such that we use the vertical connections judiciously. Further, temperature is a major concern here, because stacking multiple layers increases the effective power density. Our results show that going from a single layer to a 4-layer FPGA could increase the peak temperature by a factor of 2.4 (see Chapter 4).

The final category contains non-lithographic technologies, such as carbon nanotubes, silicon nanowires, and molecular switches. We broadly call them nanotechnologies. Although several scientists are working on them, these technologies are still in their initial stages of development. The key question here is, "What are the desired properties in such technologies for them to be better than scaled CMOS?" With this information, we give valuable feedback to the chemists and material scientists who are developing these technologies, and also set reasonable expectations from nanotechnologies.

The remainder of the thesis is organized as follows. Chapter 2 reviews the existing literature related to this study. Chapter 3 discusses the challenges in reducing leakage energy in future CMOS technology nodes, and presents two techniques to reduce leakage. Chapter 4 develops a detailed routing architecture for 3-D FPGAs, and also studies thermal issues in them. Chapter 5 explores nanotechnology alternatives to implement the interconnect fabric in an FPGA. Finally, Chapter 6 summarizes the contributions of this thesis, and presents possible directions for future research in this field.

1.1 FPGA Architectures

Figure 1.1 shows the traditional island-style FPGA architecture. It consists of a 2-D array of configurable logic blocks (CLBs) in a sea of routing wires. The CLBs typically contain multiple Look-Up Tables (LUTs) and Flip-Flops (FFs). The routing wires connect among themselves through programmable switches, forming a switch block. Similarly, these wires also connect to the CLBs, forming connection blocks.

Fig. 1.1. Traditional FPGA architecture, panels (a) and (b)

The modern FPGA has a more complex architecture than the one shown in Figure 1.1. An example of a modern FPGA is the Virtex-2 FPGA, shown in Figure 1.2. It stores the configuration information in SRAM cells, each of which typically consists of 6 transistors. The basic logic element in Virtex-2 is called a slice. A slice consists of 2 LUTs, 2 FFs, fast carry logic, and some wide MUXes [1]. A CLB in turn consists of 4 slices and an interconnect switch matrix. The interconnect switch matrix consists of large multiplexers (as large as 32-to-1) controlled by configuration SRAM cells. Note that Figure 1.2 is not drawn to scale, and in reality the interconnect switches account for nearly 70% of the CLB area. The FPGA contains an array of such CLBs along with block RAMs (BRAMs), multipliers, and IO blocks, as depicted in Figure 1.2. Altera's FPGAs are also similar in technology to Virtex-2.
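The slice and CLB organization described above can be illustrated with a small model. The following is only a sketch, not the Virtex-2 implementation: it treats a LUT as a table of configuration bits addressed by its inputs, and counts the LUT configuration bits in one CLB using the 2-LUTs-per-slice, 4-slices-per-CLB, and 6-transistors-per-SRAM-cell figures from the text. The 4-input LUT size, the fully encoded multiplexer select lines, and all function names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Figures taken from the text above.
LUTS_PER_SLICE = 2
SLICES_PER_CLB = 4
TRANSISTORS_PER_SRAM_CELL = 6
# Assumption (not stated in the text above): each LUT has 4 inputs.
LUT_INPUTS = 4


@dataclass
class LUT:
    """A k-input LUT: 2**k configuration bits, one per input combination."""
    config_bits: List[int]

    def evaluate(self, inputs: Sequence[int]) -> int:
        # The inputs form an address into the configuration memory;
        # inputs[0] is treated as the most significant address bit.
        index = 0
        for bit in inputs:
            index = (index << 1) | (bit & 1)
        return self.config_bits[index]


def lut_from_function(k: int, fn: Callable[..., int]) -> LUT:
    """Fill the 2**k configuration bits by tabulating fn over all inputs."""
    bits = []
    for index in range(2 ** k):
        inputs = [(index >> (k - 1 - i)) & 1 for i in range(k)]
        bits.append(fn(*inputs) & 1)
    return LUT(bits)


def clb_lut_config_bits() -> int:
    """LUT configuration bits in one CLB (routing configuration excluded)."""
    return SLICES_PER_CLB * LUTS_PER_SLICE * 2 ** LUT_INPUTS


def clb_lut_sram_transistors() -> int:
    """Transistors spent on the SRAM cells that hold those LUT bits."""
    return clb_lut_config_bits() * TRANSISTORS_PER_SRAM_CELL


def mux_select_bits(n_inputs: int) -> int:
    """Select bits for an n-to-1 mux, assuming fully encoded select lines."""
    return (n_inputs - 1).bit_length()


# Hypothetical example: program a LUT as a 4-input XOR.
xor4 = lut_from_function(LUT_INPUTS, lambda a, b, c, d: a ^ b ^ c ^ d)
print(xor4.evaluate([1, 0, 1, 1]))
print(clb_lut_config_bits(), clb_lut_sram_transistors())
print(mux_select_bits(32))
```

The count covers only the LUT contents. Real routing multiplexers may use select encodings with more configuration bits than the fully encoded minimum computed here, which is in keeping with the observation that the interconnect dominates the CLB area.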

Fig. 1.2. Virtex-2 FPGA architecture

A different kind of FPGA is the antifuse-based FPGA offered by Actel, which is one-time programmable. Actel and Lattice also offer flash-based FPGAs. In this study, we focus only on SRAM-based FPGAs.

Chapter 2

Related Work

Ever since the first FPGAs were introduced by Xilinx in the mid-80s, they have been a popular topic for research. Their programmability offers interesting avenues for creativity. Researchers at the University of Toronto performed early research on FPGA architecture [9]. They used the area, delay, and area-delay product as metrics to evaluate FPGA architectures. They also developed tools to allow FPGA architecture exploration [10].

Meanwhile, in the late 90s, researchers at Berkeley started looking at the energy consumption in FPGAs. Energy consumption was becoming crucial because of the growing demand for the use of FPGAs in embedded devices. They proposed low-swing interconnect circuits and an interconnect architecture optimized to reduce the energy-delay product [11]. Some studies also analyzed the dynamic power consumption of commercial FPGAs: first of a Xilinx XC4003A FPGA [12] and, more recently, of Virtex-2 [13]. Both studies observed that the interconnect fabric consumes the majority of the dynamic power. Similar to the early study at Toronto (which focused on area and delay), some researchers studied the influence of architecture parameters, such as LUT size, cluster size, and segment length, on power consumption [14, 15]. Studies have also tried to reduce the dynamic power through modifications in the CAD tools, ranging from clustering [16] and place and route [17] to bitstream manipulation [18]. The bitstream manipulation technique modified the LUT configuration bits to reduce dynamic power [18]. Recently, Lamoureux and Wilton [19] proposed a complete power-driven CAD flow, and studied the interaction among the different CAD stages.

All the above studies focused on dynamic power consumption. With shrinking transistor sizes, leakage power is also becoming important. FPGA researchers recognized this, and therefore, the past two years have seen several studies on FPGA leakage power (see [20] for a survey). Since FPGAs use several transistors to provide programmability, their leakage power consumption is significantly higher than that of a custom circuit implementing the same functionality. Tuan and Lai [21] performed a detailed analysis of leakage power in Xilinx CLBs. They concluded that leakage must be significantly reduced to enable the use of FPGAs in mobile applications. Several techniques to reduce leakage in FPGAs have also been proposed. Two of them proposed the use of sleep transistors [22, 23]. While researchers at MIT [22] proposed a fine-grained leakage control scheme, embedding sleep transistors within the CLB circuit, Gayasen et al. [23] advocated a coarser region-based leakage control, and proposed constraining the design to a minimum number of regions to reduce leakage. At the circuit level, Azizi and Najm [24] presented low-leakage circuits for LUTs. Since the leakage of routing muxes depends strongly on their input values, Srinivasan et al. [25] presented circuits to reduce leakage in the routing fabric by setting desired values at the inputs of the unused routing muxes. Lodi et al. [26] developed low-leakage circuits for the FPGA routing switch. Rahman and Polavarapuv [27] evaluated several low-leakage design techniques for FPGAs. One of them was the use of a heterogeneous routing fabric, consisting of a mixture of high and low threshold (Vt) transistors. Since high Vt reduces the leakage current at the expense of an increase in delay, the router needs to pick the correct resources based on the slack available. This idea was proposed for a more commercial architecture later [28], where detailed experiments helped them decide which resources to slow down without affecting performance. At the CAD level, Hassan et al. [29] proposed a low-leakage packing algorithm that packed the LUTs exhibiting similar idleness together so that they could be shut down together. Anderson et al. [30] presented a no-cost technique to reduce leakage by selecting the polarities of logic signals appropriately. A similar technique was proposed in [31], but with asymmetric SRAM cells [32]. Chapter 3 presents our region-based leakage control technique.

Researchers have previously proposed dual-Vdd techniques for ASICs [33, 34]. The dual-Vdd ASIC uses high Vdd (Vddh) only to supply the timing-critical blocks, and saves power on the non-critical ones by supplying them a lower Vdd (Vddl). Li et al. [35] first applied the idea to FPGAs. They fixed the voltages of logic blocks and attempted to place the design such that timing-critical blocks used high Vdd. This approach did not provide enough power savings unless some performance degradation was allowed. Therefore, a programmable-Vdd FPGA was next proposed [36, 37], where the circuit blocks could be programmed to run on Vddh or Vddl. In [36], the Vdd of only logic blocks could be programmed. All routing resources remained at Vddh, and the emphasis was on reducing dynamic power while keeping the leakage constant. Gayasen et al. applied the programmable-Vdd idea to routing resources as well as logic, and reduced both dynamic and leakage power [37, 38]. Later, Lin et al. [39] evaluated several variants of the dual-Vdd architecture, and also improved the voltage assignment algorithm by formulating it as a linear programming problem [40]. All these approaches required two power supplies and two power grids. To eliminate these overheads, Anderson and Najm [41] proposed a circuit that utilized the threshold drop across an NMOS transistor to locally generate an alternate power supply for every routing switch. Chen et al. [42] also presented a cut enumeration algorithm targeting low-power technology mapping for FPGA architectures with dual supply voltages. In Chapter 3, we present our dual-Vdd architectures.

Several studies have recognized the overheads of the programmable interconnect fabric in an FPGA. The interconnect resources take almost 70% of the die area and consume the major part of FPGA power [21, 13]. Furthermore, for most designs, they also constitute more than 50% of the critical path delay. Therefore, FPGA interconnect merits special attention.

In order to reduce the interconnect area, researchers have proposed 3-D FPGAs, consisting of multiple stacked 2-D FPGAs. More than a decade ago, Alexander et al. [43] presented a 3-D FPGA that used package-level integration to stack multiple 2-D FPGAs interconnected using solder bumps. The minimum pitch of these vertical interconnects was 100µm. Campenhout et al. [44] proposed opto-electronic FPGAs, in which the inter-chip communication used optical links. The optical links provide a large vertical channel density. The Rothko 3-D FPGA [45] was a 3-D extension of the Triptych sea-of-gates architecture [46], consisting of routing and logic blocks. The 3-D integration was done at the wafer level and inter-layer communication used metal vias. A dynamically reconfigurable 3-D FPGA was presented in [47], which consisted of three physical layers: routing and logic block layer, routing layer, and memory layer.

Recently, Lin et al. [48] analyzed the performance benefits of a monolithically stacked 3-D FPGA. Their 3-D integration technology provided very fine vias, which allowed them to stack the configuration memory on top of the rest of the FPGA (logic blocks and interconnects). Researchers have also looked at theoretical models for 3-D FPGAs. Rahman et al. [49] presented an analytical model for predicting interconnect requirements in 3-D FPGAs, and estimated over 50% reduction in channel width, interconnect delay, and power dissipation, when compared to 2-D FPGAs. Kwon et al. [50] recently extended this model to incorporate clustered logic blocks (similar to Virtex-2 [1]). On the CAD front, Ababei et al. [51, 52] recently presented a partitioning-based placement algorithm for 3-D FPGAs, which primarily focused on reducing the inter-layer vias. However, their router was not timing-driven.

Although several researchers have proposed 3-D FPGAs, the detailed routing architecture of a 3-D FPGA remains unexplored. Ababei et al. [51] assumed a subset switch block. Although Wu et al. [53] designed universal 3-D switch blocks, they used track count as the sole metric of quality. Furthermore, they assumed that the number of inter-layer vias is the same as the horizontal channel width. In today's technology, especially if we stack more than two layers, the vias are much thicker than the horizontal wires (1µm vs. 0.1µm), which makes this assumption impractical. In Chapter 4, we propose 3-D switch block designs considering the special via properties [54].

Three-D technology is known to suffer from thermal issues: stacking multiple layers increases the effective power density in the package. Package designers have been considering thermal issues in 2-D ICs for a long time. Instead of considering variations in the temperatures on the die, they designed the package to support the worst-case specifications of the design. As designing the package for the worst-case junction temperature started becoming too expensive, researchers started looking at design-level solutions to reduce the temperature. Dynamic thermal management (DTM) techniques use thermal sensors to monitor the junction temperature and control the power consumption of the design on the basis of the temperature [55]. Common techniques include clock gating, and voltage and frequency scaling when the temperature increases beyond a threshold. Thermal-aware floorplanning is another design-level solution. Here, the floorplan tries to reduce the hotspots on the die by distributing the temperature uniformly [56, 57]. Researchers have mostly focused on microprocessors in these works. Thermal placement is a similar technique applied at the placement stage. Chen and Sapatnekar [58] proposed a partition-driven algorithm for standard cell thermal placement. Thermal floorplanning and placement are particularly attractive because they impact performance less than DTM.

On the modeling front, several researchers have developed tools for estimating the die temperature. Among them, HotSpot [59] is an architecture-level thermal simulator, which can perform transient as well as steady-state temperature estimation. HS3d [60] is another architecture-level tool that performs only steady-state temperature estimation, but is orders of magnitude faster than HotSpot. Since in this work we look at only steady-state temperatures, we use HS3d.

Recently, some researchers have proposed solutions for thermal issues in 3-D ICs too. Cong et al. [61] suggested thermal-driven floorplanning for 3-D. Goplen and Sapatnekar [62] also proposed a temperature-driven placement algorithm for 3-D standard cell ASICs. Studies have also indicated that careful insertion of thermal vias can reduce the peak temperature [63, 64].

Thermal issues in FPGAs are relatively unexplored. Some researchers have proposed the use of distributed sensors for monitoring temperatures in FPGAs [65, 66]. They, however, considered only CLBs in the fabric, and consequently observed very little temperature variation across the die. In Chapter 4, we characterize the thermal profile of a real platform FPGA [67], and then observe the effect of stacking on temperature. We also suggest alternate organizations to reduce the temperature.

In the long term, even 3-D may not provide the desired performance. Therefore, we need to explore alternate technologies. Studies have looked at using some non-lithographic technologies to manufacture FPGAs. DeHon [68], Goldstein [69], and Tour [70] have previously proposed programmable architectures using some form of nanostructures that are made using self-assembly. Goldstein tried to make crossbar-based devices by aligning nano-wires in two planes at right angles to each other. The crosspoints contained molecules that provided programmable logic as well as interconnections. This design suffered from signal degradation, as there was no way to restore the signal using only two-terminal devices. DeHon overcame this problem by using SiNW-based FETs to restore the signals, and proposed a PLA structure. However, the logic functionality in that architecture was limited to OR (and inversion). Tour, instead, proposed replacing the logic blocks by nanocells and connecting them using metal wires. This suffered from the problem of training these nanocells, which were assumed to consist of a randomly connected mass of molecules. Furthermore, since the bottleneck in current FPGAs lies in the interconnect, Tour's architecture does not help solve this problem.

All the above architectures propose drastic changes in the existing CMOS technology as well as the design methodologies. In Chapter 5, we propose an architecture that blends with existing technology easily, and preserves all the design methodologies and flexibility in logic functionality [71].

26 15 Chapter 3 Reducing Leakage Energy in FPGAs With the development of FPGAs in new technologies - 90nm and below, optimizing leakage power 1 is becoming imperative. As the transistor feature sizes and threshold voltages reduce, and the number of transistors used in FPGAs increase, the overall leakage power is rapidly increasing. Consequently, the leakage problem is anticipated to be a major obstacle for FPGA applications in both high performance and low-power embedded designs. Due to this trend, we need to focus on leakage power optimizations going beyond prior power optimization techniques for FPGAs that focus primarily on reducing the dynamic energy [11, 12]. 3.1 Using Sleep Transistors The flexibility provided by the FPGA structures in placing different applications results in a large portion of the components being unutilized [21]. In fact, the typical logic utilization for the designs experimented is 62%. A similar trend holds for larger benchmark suites of greater than 100 designs in different target devices [21]. These unutilized resources in an FPGA serve as a good candidate for leakage optimizations. Reducing leakage power has already been the focus of optimization in various non-fpga architectures. These optimizations have ranged from circuit to software 1 Unlike dynamic power which is expended only when the hardware component in question exercised, leakage power is spent even if the component is idle.

27 16 approaches [72, 73, 74]. Among these techniques, a popular one to reduce both the subthreshold and gate leakage components is to switch off the power supply to the circuit by introducing a high-threshold voltage sleep transistor between the circuit and its supply rail. The sizing of this sleep transistor has an impact on both the performance and the area overheads imposed. Specifically, its sizing should be large for better performance. However, this increases both the area penalty and the ability to reduce leakage current (as wider transistors leak more). The optimal sizing of these sleep transistors has been the focus of prior efforts and the peak current required by the supply gated circuit serves as the reference for this sizing [75]. Since the peak current for different portions of a circuit do not normally occur simultaneously, prior work has used the approach of controlling a clustered group of circuits together with a single sleep transistor [75]. This optimization helps to reduce the area penalty as compared to using sleep transistors with each individual sub-circuit. It should be noted that sleep transistors can be used to control leakage in FPGAs as well. An obvious approach would be to place unused CLBs into low-power states using sleep transistors (see Figure 3.1). However, such a fine-grain (at individual CLB level) power management of the FPGA fabric can introduce a significant area penalty, which may not be tolerated in many designs. Instead, in this paper, we propose a strategy, whereby the FPGA fabric is divided into regions, each of which can be independently controlled through a sleep transistor. A region is a rectangular array of CLBs, and is the minimum unit of power management. This approach is similar to the clustering technique mentioned in the paragraph above. Our experimental results indicate that area of the CLB arrays including the sleep transistor area overhead can be reduced by

Fig. 3.1. FPGA containing sleep transistors

5% when moving from using regions with 4 logic slices to regions with 256 logic slices. By selecting a suitable region size, one can control the area overheads and at the same time achieve large leakage savings.

Based on this region concept, we also propose a placement technique, referred to as Region-Constrained Placement (RCP), that tries to use a minimum number of regions for a given application, thereby increasing the number of unused regions that can be switched off. A key observation from our results is that the leakage power savings obtained using RCP on an FPGA with coarse-grain regions are larger than those obtained using normal placement on an FPGA with fine-grain regions.

The maximum savings that can be obtained from the leakage management scheme discussed above is limited by the volume of the unused regions. Consequently, we also utilize a time-based control scheme that reduces leakage even in the utilized portions of the FPGA by switching the power supply off and on, exploiting the idleness in portions of the design. Specifically, the time-based scheme dynamically turns off the power supply to all regions containing only idle modules. We investigate combinations of the time-based control scheme with two variants of RCP: (i) module-level RCP, which uses RCP to place each module of the design that exhibits a distinct idleness profile individually, and (ii) design-level RCP, which places the entire design using RCP. In both variants, the power supply is turned off to all regions that contain only idle modules. Our experiments show that the time-based RCP scheme can provide additional energy savings compared to statically switching off only the unused portions.


The leakage distribution in a Xilinx FPGA in 90nm technology, with the exception of the BRAMs and multipliers, was shown to be 38% in the configuration SRAMs, 34% in the interconnect matrix, 16% in LUTs and 12% in other logic [21]. Since many of the techniques proposed for saving leakage energy in on-chip memory can be applied to BRAMs (and because they are not used by most of our designs), our leakage optimizations in this work do not target them. In order to reduce the leakage energy in the configuration SRAM, we increased the threshold voltage of the configuration SRAM to obtain a 98% reduction in leakage energy while increasing configuration time by 20%. Since configuration time is not critical in most of our target designs, this tradeoff for power savings is reasonable. The resulting leakage breakdown in our system is shown in Figure 3.2. The focus of this work is on reducing the leakage energy in the LUTs, arithmetic logic and flip-flops, which account for 45% of the total leakage energy. While our work focuses on the slices, the technique can be extended to switching off the routing resources as well. This is a part of our planned future work.

Fig. 3.2. Leakage energy breakdown

Fig. 3.3. (a) Horizontal and (b) vertical styles of RCP on an XC2V40 FPGA for a region size of 2×4 slices. Required number of slices is 100 (13 regions)

In order to provide leakage control, the FPGA is divided into regions. A region consists of one or more neighboring slices (potentially across different CLBs), and is the minimum power management unit (granularity). Sleep transistors embedded in the FPGA fabric control the power supply to the individual regions. In this architecture, the control bit for the power switch (see Figure 3.1) of a region determines whether the region is supply-gated or not. The control bits of the different regions are set during the configuration of the FPGA. The area overhead associated with the control bits (and the associated wiring) is proportional to the number of regions, while their impact on leakage energy is relatively small due to the use of high-threshold transistors for the configuration bits. Thus, the area overhead favors a smaller number of large regions.

An important issue in the design of this architecture is the sizing of the power switches. To have negligible impact on performance, the power switches should be large enough to support the peak current requirements of the logic slices that they control. Since the peak current for a larger region is less than the sum of the peak currents of the smaller regions constituting it, larger regions can achieve similar performance with a smaller area overhead. In order to show this impact, we experimented with two different region sizes of 256 slices and 4 slices using XPower [1]. A single region of 256 slices had a peak current that was 68% of the sum of the peak currents of the 64 regions of 4 slices each that constitute the same area. Next, we performed SPICE simulations to estimate the sleep transistor size for various region sizes. Assuming a slice area of 5000 sq. micron (from custom layout), the estimated area penalty for a region size of 4 slices was around 15%, while that for 256 slices was 10%. This motivates the use of large region sizes.
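The relation between region size and the number of regions (and hence control bits) can be illustrated with a minimal Python sketch. It is not part of the experimental flow: the function names are hypothetical, the 16×16-slice array corresponds to the XC2V40 mentioned later in this chapter, and the 100-slice design is assumed to be the example in the Figure 3.3 caption. The count of supply-gateable regions is a best case that assumes the placer packs the design into the fewest possible regions.

```python
import math

def regions_in_device(dev_w, dev_h, reg_w, reg_h):
    """Number of whole reg_w x reg_h regions tiling a dev_w x dev_h slice array.
    One sleep-transistor control bit is assumed per region."""
    return (dev_w // reg_w) * (dev_h // reg_h)

def regions_needed(used_slices, reg_w, reg_h):
    """Fewest regions that can hold used_slices slices (ideal packing)."""
    return math.ceil(used_slices / (reg_w * reg_h))

def best_case_gateable(dev_w, dev_h, reg_w, reg_h, used_slices):
    """Upper bound on regions left completely unused, so they can be gated."""
    total = regions_in_device(dev_w, dev_h, reg_w, reg_h)
    return max(total - regions_needed(used_slices, reg_w, reg_h), 0)

# Assumed example: 16x16-slice device, 2x4-slice regions, 100 used slices.
print(regions_in_device(16, 16, 2, 4))        # 32 regions, 32 control bits
print(regions_needed(100, 2, 4))              # 13 regions
print(best_case_gateable(16, 16, 2, 4, 100))  # 19 regions
```

Larger regions reduce the number of control bits but round the occupied area up to a coarser grain, which is the tradeoff discussed above.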

The amount of leakage reduction due to the introduction of the power switch is also influenced by the sizing and threshold voltage of the sleep transistor, and by whether a PMOS or NMOS transistor is used to gate the VDD or ground power supply rail. The leakage reduction varies from 85% to 98% based on these factors, incurring performance degradation varying from 0% to 30% [76]. In our experiments, we use a PMOS gate switch providing 90% leakage reduction.

RCP: Region-Constrained Placement

The placement of the design has a significant impact on the ability to supply-gate the logic slices in our region-based architecture. Because the normal design flow has no notion of regions, the PAR tool tends to scatter the utilized slices across different regions (see Figure 3.4(a)). Since regions with partially used slices cannot be supply-gated, the potential for leakage energy savings is reduced. Thus, we propose a new region-constrained placement strategy, RCP, that takes the region concept into account explicitly. The basic principle of RCP is to constrain the placement of the design to specific regions of the FPGA (see Figure 3.4(b)) and leave some regions of the FPGA completely unused, so that they can be supply-gated. This in turn helps to maximize the potential leakage savings. In our implementation of RCP, we place the design into contiguous regions to the extent possible and utilize two different styles, horizontal and vertical placement, as shown in Figure 3.3. While the horizontal and vertical placements utilize the same number of logic slices, they do not provide similar performance results due to asymmetry in the target Virtex-II architecture. For example, there are fast carry


chains running vertically in the FPGA, but not horizontally. Furthermore, there are more slices in a column than in a row in all Virtex-II parts (except XC2V40, which has 16 slices in both directions). While we confine the utilization of logic slices to the specified regions, the constraints on routing resources are kept soft in order to circumvent issues with routing congestion, routing of IO signals and unroutability. This permits the use of routing resources outside the regions that have logic placed in them. As part of our future work, we plan to investigate a supply-gating mechanism that also switches off interconnect muxes.

Combining RCP and Time-Based Control

Fig. 3.4. Different placements for an example design: (a) Traditional, (b) RCP, (c) Module-level RCP. In part (c), each module is bounded by a polygon

It should be observed that RCP is essentially a static technique where the unutilized FPGA space (regions) can be shut off at configuration time (before the execution).

While it is easy to implement, it may not be as effective in designs that occupy a large portion of the FPGA space (which in turn limits the potential leakage savings). However, for designs with modules that remain inactive over significant durations of time, we can employ a time-based control scheme that reduces leakage even in the utilized portions of the FPGA by switching the power supply off and on, exploiting the idleness in portions of the design. Specifically, the time-based scheme turns off the power supply to all regions containing only idle modules. We investigate combinations of the time-based control scheme with two variants of RCP: (i) module-level RCP, which uses RCP to place each module of the design that exhibits a distinct idleness profile individually, and (ii) design-level RCP, which places the entire design using RCP. In both variants, the power supply is turned off to all regions that contain only idle modules.

We can implement the idea of time-based control as follows. The gate voltage of a sleep transistor is still controlled by a configuration bit. However, instead of configuring this bit statically when the design is loaded on the FPGA, dynamic reconfiguration [77] of these control bits is used to switch a sleep transistor on or off. In order to limit the overhead of reconfiguring these control bits, the sleep transistor should not change state very frequently. Furthermore, support for reconfiguring just these control bits may be useful, since the minimum reconfigurable block in current Virtex-II technology is a frame [77]. Reconfiguration time for one frame varies from 2 µs for the smallest to 23 µs for the largest FPGA. Support for reconfiguring only the sleep transistor configuration bits can reduce this time, but may increase area overheads due to the configuration circuit.
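The rule "turn off all regions containing only idle modules" can be sketched in a few lines of Python. This is an illustrative model only, not the tool flow used in this work: the region names, module names and function names are hypothetical, and the placement is represented simply as a map from each region to the set of modules that have logic in it.

```python
def gateable_regions(region_modules, active_modules):
    """Regions whose supply can be switched off: those holding no active
    module. Completely unused regions (empty sets) always qualify."""
    active = set(active_modules)
    return {region for region, modules in region_modules.items()
            if not (set(modules) & active)}

def control_bits_to_rewrite(previously_gated, now_gated):
    """Sleep-transistor control bits that must be reconfigured when the
    set of active modules changes (regions whose gating state flips)."""
    return set(previously_gated) ^ set(now_gated)

# Hypothetical placement: region r1 is shared by two modules, r3 is unused.
placement = {
    "r0": {"mod_a"},
    "r1": {"mod_a", "mod_b"},
    "r2": {"mod_c"},
    "r3": set(),
    "r4": {"always_on"},
}

gated_b = gateable_regions(placement, {"mod_b", "always_on"})
gated_c = gateable_regions(placement, {"mod_c", "always_on"})
print(sorted(gated_b))                                    # ['r0', 'r2', 'r3']
print(sorted(gated_c))                                    # ['r0', 'r1', 'r3']
print(sorted(control_bits_to_rewrite(gated_b, gated_c)))  # ['r1', 'r2']
```

In this toy placement, the shared region r1 stays powered whenever either of its modules is active, which is why keeping modules in non-overlapping regions increases the number of regions that can be switched off dynamically.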

With increasing FPGA sizes, it is possible to envision an entire system on an FPGA. In such designs, many parts of the design may remain inactive for long durations. Time-based control is a promising approach for such designs. Figure 3.4(c) shows an example design placement using module-level RCP for time-based leakage control. We see from this figure that the modules of the design get placed on non-overlapping regions, thus maximizing the number of regions that can be dynamically switched off. Note that this slightly decreases the statically unused portion of the FPGA (because, in order to ensure the inter-module region exclusivity needed for module-level RCP, some regions can only be partially filled). Still, our experiments show a significant increase in leakage savings due to module-level RCP.

Experimentation

In order to investigate the energy savings due to the proposed approach, we selected a set of applications and used the Xilinx Virtex-II FPGA as our target hardware. The selected applications include 14 publicly available reference designs provided by Xilinx, 4 designs from the ITC'99 benchmark suite, 3 academic designs and 14 commercial designs available internally at Xilinx. Table 3.1 provides the important characteristics of each application and lists the number of slices, IO blocks (IOBs), block RAMs (BRAMs) and multipliers (MULTs) used in the designs, along with the target FPGA device used for the mapping. Note that on average only 62% of the slices were used. Industry6 is an extreme case: although only 4% of the slices are used, the I/O requirements of the design prevent it from being mapped to a smaller FPGA.

These designs were then implemented using the experimental flow illustrated in Figure 3.5 to evaluate the energy savings possible due to the proposed optimizations. The specific steps in this design flow are elaborated below.

All the designs were synthesized for area optimization from their HDL representation using the Xilinx Synthesis Technology (XST). This synthesis step produced a gate-level netlist. Next, the designs were mapped onto the smallest possible Virtex-II FPGA device, with the place and route effort level set to high (level 5). After the mapping and completion of place and route (PAR), an NCD file that contains the placed and routed design is generated. The map process also generates a MAP report, which is used to implement RCP. The maximum clock frequency for the design was estimated by running the post-PAR static timing analysis tool, TRACE, on the mapped design. The NCD file was translated to an ASCII file in XDL format using the xdl tool. This ASCII file was processed using a customized tool developed for this project to determine the unused regions of the FPGA for the given region sizes. Using this information, the leakage savings possible with the standard placement process were obtained (assuming that the regions that are completely unused are switched off).

In order to determine the leakage savings using RCP, the synthesized gate-level netlist (NGC file) was re-used. The MAP report from the normal mapping was used to determine the number of logic slices used in the design. Based on the number of slices obtained and the size of the regions, a User Constraints File (UCF) was created to restrict the placement to a specific number of regions. Different UCFs were created for the horizontal and vertical styles of RCP, and for different region sizes. The mapping and place and route performed under the specified constraints produce an NCD file. Similar to the

Fig. 3.5. Experimental Flow

normal placement scheme, the maximum clock frequency for the design is estimated by using TRACE. Leakage energy savings are evaluated in this case by assuming that the power supply to all unused regions is turned off.

As explained earlier, an estimated 45% of the total leakage occurs in the logic slices (Figure 3.2). Furthermore, as explained in the beginning of this chapter, leakage reduces to 10% of the original using supply-gating with PMOS transistors. Thus, if for some design 25% of the slices can be switched off, then the leakage power is reduced by $0.45 \times 0.25 \times 0.9 = 10.125\%$. Furthermore, suppose that after RCP the clock frequency degrades to 97% of the original. Then, the new leakage energy (Power-Delay-Product, PDP) will be $(100\% - 10.125\%) / 0.97 = 92.65\%$ of the original.

Our experiments were performed for different region choices. Region widths of 2, 4, 8, 16 slices, and heights of 2, 4, 8, 16 slices were considered. Thus, a total of 16 different region choices were explored. Furthermore, as explained earlier, two styles of RCP, horizontal and vertical, were explored. Thus, a total of 32 different varieties of RCP were explored.

Time-based leakage control

The experiments for time-based leakage control were performed using an academic design implementing an Adaptive Viterbi Algorithm (AVA) decoder [78]. The design consists of 3 AVA decoders of varying constraint lengths (4, 6, and 9). Different decoders are selected depending on the noise levels in the transmission channel. If the noise level is high, then the decoder with a larger constraint length is selected. In [78], the authors utilize reconfiguration to switch between decoders of different constraint lengths. We

modified the design by statically mapping the 3 different sizes of decoders on the FPGA, and selecting the right decoder depending on the noise in the channel. For this work, we assumed that an input coming into the FPGA decides which decoder to choose. The design was mapped onto an XC2V1500-bg575. The resource usage was 5469 slices (71%), 90 IOBs (22%), 0 BRAMs and 0 multipliers. The three different decoders occupied 718, 1846, and 2854 slices respectively. Another module, which remained active all the time (the branch metric generator), occupied 51 slices. The advantage of this design is that the decoding can be done much more rapidly if the channel is not noisy. The drawback is that at any given time, two decoders are sitting idle. This gives scope for switching off the unused decoders.

We estimated and compared the leakage savings for this design for design-level RCP, module-level RCP and normal placement, assuming run-time leakage control. We also compared the savings obtained from run-time leakage control with those from static control. In order to estimate the leakage savings from run-time control, we assumed that each of the 3 decoders is active for equal durations. Thus, any of the three decoders can be switched off for two-thirds of the total time.

Results and Analysis

Figure 3.6 plots the average estimated leakage power savings obtained by switching off the unused regions in the FPGA. The savings are represented as a percentage of the total leakage (that occurs without any switching off). A region represented as 2×4 is 2 slices wide and 4 slices high. Plots are shown both with and without RCP. For both RCP and normal placement, leakage savings decrease with increasing region size. However, the decrease for RCP is very small compared to normal placement. As is

Fig. 3.6. Average leakage power savings for RCP and normal placement.

Fig. 3.7. Leakage power savings for RCP for 4×16 region for all designs.

Fig. 3.8. Average clock frequency for RCP.

Fig. 3.9. Average leakage energy savings for RCP and normal placement.

evident from the plots, RCP clearly outperforms normal placement. Especially for large region sizes, RCP provides more than 6 times the savings of normal placement. This happens because, although the number of slices used is the same in both cases, with normal placement they are scattered across regions. Larger regions accentuate this problem.

We observed that the leakage power savings are strongly dependent on the resource usage of a design. Figure 3.7 plots the variation, across all designs, of leakage power savings for a single region choice. It shows that the leakage power savings vary significantly depending on the design. For some designs, there is no leakage saving because those designs occupy all the regions of the FPGA. Leakage power is reduced by more than 20% for 40% of the designs.

However, the constraint on the placement due to RCP can influence the timing of the signals. Figure 3.8 plots the average estimated clock frequencies achieved using RCP, expressed as a percentage of the frequency estimated for normal placement. A region represented as 2×4 h refers to the horizontal style of RCP with a region of height 4 slices and width 2 slices. Similarly, a region represented as 2×4 v refers to the vertical style with the same region size and shape. The plot shows that for all regions, the average clock frequency is within 8% of the original clock frequency.

The performance penalty can result in longer execution time and consequently increase the duration of leakage. To capture this impact, Figure 3.9 plots the average estimated leakage energy savings for RCP as well as for normal placement. We note that except for very fine-grain regions, RCP always results in higher leakage energy savings. The difference between the two increases for large region sizes. Again note that small

region size incurs larger area overhead due to larger effective sleep transistor size, more routing and control signals, and more configuration bits (which increases configuration time too).

Fig. 3.10. Leakage power savings for time-based leakage control.

Fig. 3.11. Leakage energy savings for time-based leakage control.

Time-based Leakage Control

Figure 3.10 plots leakage power savings for dynamic and static leakage controls for the AVA decoder design. The savings from dynamic control are shown for module-level RCP (modules get placed in non-overlapping regions), design-level RCP, and normal placement. The savings from static leakage control are shown for design-level RCP and for normal placement.

It is observed that time-based leakage control results in very large savings compared to static control. Furthermore, among the different placement strategies for time-based control, module-level RCP outperforms the others. Design-level RCP performs better than normal placement in most cases, but in some cases normal placement results in larger savings. This happens because, with normal placement, the 3 different modules are placed slightly apart (because the placer has a larger area available to place the modules). Therefore, only a few regions are shared among the different modules. With design-level RCP, the placer has a smaller area in which to fit the entire design. This increases the overlap among the 3 modules, which prevents those regions from being switched off dynamically.

Figure 3.11 plots leakage energy savings for time-based control (module-level RCP, design-level RCP, normal placement with no RCP) and static control (RCP, normal placement with no RCP) for the AVA decoder design. It is observed that time-based control results in very large savings compared to static control. Also, in all but two cases, module-level RCP results in the largest energy savings.

It must be observed that the plots shown above do not account for the additional overhead of dynamically reconfiguring the control bits. However, even assuming that reconfiguration incurs a 10% increase in overall execution time and a consequent leakage energy penalty, we find that module-level RCP with time-based leakage control provides 19% more leakage savings (27% without the reconfiguration overhead) than a normal placement with static leakage control.
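The leakage estimates in this section rest on a simple model from the experimental setup: about 45% of the total leakage occurs in the logic slices, supply gating removes 90% of a gated slice's leakage, and a lower clock frequency stretches the time over which leakage is spent. The Python sketch below restates that model. The function and constant names are hypothetical, and the formulas are a reconstruction that is consistent with the 92.65% leakage energy quoted for the worked example (25% of the slices gated, clock at 97% of the original).

```python
SLICE_LEAKAGE_SHARE = 0.45   # fraction of total leakage in the logic slices
GATING_REDUCTION = 0.90      # leakage removed from a gated slice (PMOS switch)

def leakage_power_remaining(gated_slice_fraction):
    """Leakage power relative to the original (1.0 means no savings)."""
    saved = SLICE_LEAKAGE_SHARE * GATING_REDUCTION * gated_slice_fraction
    return 1.0 - saved

def leakage_energy_remaining(gated_slice_fraction, clock_ratio):
    """Leakage energy (power-delay product) relative to the original.
    clock_ratio is the new clock frequency divided by the original one;
    execution time scales as 1 / clock_ratio."""
    return leakage_power_remaining(gated_slice_fraction) / clock_ratio

# Worked example: 25% of the slices gated, clock at 97% of the original.
print(round(100 * leakage_power_remaining(0.25), 3))         # 89.875
print(round(100 * leakage_energy_remaining(0.25, 0.97), 2))  # 92.65
```

Under this model, leakage energy savings shrink as the clock degrades, which is why the energy plots are reported alongside the power plots.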

Fig. 3.12. (a) (b) (c) Supply transistors used for programmable Vdd

3.1.5 Summary

Our work demonstrates that switching off parts of an FPGA can result in significant leakage savings in most designs. The savings can be further increased by using Region-Constrained Placement (RCP). Furthermore, if RCP is used, then the switch-off granularity need not be very fine, since leakage savings decrease very gradually with increasing region size. Thus, considering the area overhead of having very small regions, a large region size coupled with RCP looks to be a practical choice. Module-level RCP is a promising enhancement for designs in which some modules stay inactive for significant durations of time.

3.2 Dual-Vdd FPGA

Reducing the supply voltage (Vdd) is an effective technique for reducing both dynamic and static power. Dynamic power varies quadratically with supply voltage, while both sub-threshold leakage (due to Drain Induced Barrier Lowering, DIBL) and gate leakage vary exponentially. However, reducing the supply voltage negatively affects

the circuit performance. Dual-Vdd is a popular technique to reap the benefits of voltage scaling without its performance penalty. The timing-critical blocks in the design operate on the normal Vdd (or Vddh), while non-critical blocks operate on a second supply rail with a lower voltage (or Vddl). While dual-Vdd ICs have been successfully used in low-power ASICs and custom ICs [34], no commercial FPGA today uses multiple Vdds for power reduction.²

² Xilinx Virtex-II FPGAs use different supply voltages for I/O and the core. Pass transistors used for interconnects are also supplied higher gate voltages to eliminate the Vt drop. However, this is not targeted to reduce power.

The difficulty of designing a dual-Vdd FPGA is that the optimal Vdd assignment changes from one design to another. Consequently, if logic blocks are statically determined to be operating at low or high Vdd, the placement and routing algorithms need to be modified accordingly (e.g., [35]). However, static assignment of Vdd to the blocks may make it impossible to reduce power or to meet timing constraints for some designs. In contrast, Vdd-programmability for each block helps to tune the number of high and low Vdd blocks as desired by the application. In this approach, the challenge is in determining the Vdd assignment of each block. The need for level converters wherever a low-Vdd block drives a high-Vdd block, and the associated delay and energy overheads, are important considerations when performing these Vdd assignments. Furthermore, the positioning of the level converters influences the ability to assign lower Vdds to the routing blocks.

In our programmable dual-Vdd architecture, the Vdd of a circuit block is selected between Vddh and Vddl by using two high-Vt transistors (supply transistors) connecting the block to the supplies (see Figure 3.12). This circuit was previously used by [36]. The

state (ON/OFF) of each supply transistor is controlled by a configuration bit, which is set by the Vdd assignment algorithm. The configuration bits are set either to connect the block to one of the power supplies, or to completely disconnect the block from both power supply lines when the block is unused or idle.

We evaluate the effectiveness of different Vdd assignment algorithms and implementation choices for an island-style FPGA architecture designed in 65nm technology. Our results demonstrate that one of the Vdd assignment techniques provides an average power saving of 61% across different MCNC benchmarks.

Architecture

We propose two types of dual-Vdd architectures. The first, the Fully Programmable (FP) architecture, allows all logic blocks (CLBs) and routing resources to be independently programmed as Vddh or Vddl. The second, Logic Programmable (LP), gives that flexibility only to CLBs, and fixes the voltages of the routing resources. Both architectures are built on cluster-based island-style FPGAs, with the configuration stored in SRAM cells. The basic logic element (BLE) consists of a 4-input LUT and a flip-flop. Multiple BLEs cluster together to form a CLB (see Figure 3.13). In both architectures, level conversion takes place only at CLB pins. For this purpose, CLB pins have level converters (LCs) attached to them. A multiplexer allows the level converter to be bypassed if level conversion is not needed at that pin. Placing the level converters only at CLB pins reduces the complexity of the routing fabric, and also limits the area and leakage overhead of level converters.
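The three supply states of a block (connected to Vddh, connected to Vddl, or disconnected from both) can be modeled with one configuration bit per supply transistor. The Python sketch below is only an illustration: the names and the bit ordering are hypothetical, and a bit value of 1 is assumed to mean that the corresponding supply transistor is ON. The combination with both transistors ON is rejected because it would connect the two supply rails through the block.

```python
from enum import Enum

class Supply(Enum):
    OFF = "off"    # block unused or idle: both supply transistors OFF
    VDDL = "vddl"  # non-critical block on the low supply
    VDDH = "vddh"  # timing-critical block on the normal supply

# Assumed encoding: (vddh_transistor_on, vddl_transistor_on)
_ENCODING = {
    Supply.OFF: (0, 0),
    Supply.VDDL: (0, 1),
    Supply.VDDH: (1, 0),
}

def supply_bits(state):
    """Configuration bits for the two supply transistors of one block."""
    return _ENCODING[state]

def decode_supply(bits):
    """Inverse of supply_bits; (1, 1) would connect both rails to the block."""
    for state, encoded in _ENCODING.items():
        if encoded == tuple(bits):
            return state
    raise ValueError("invalid configuration: both supply transistors ON")

print(supply_bits(Supply.VDDL))   # (0, 1)
print(decode_supply((0, 0)))      # Supply.OFF
```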

Fig. 3.13. Fully programmable dual-Vdd architecture (FP): (a) Dual-Vdd CLB, (b) Dual-Vdd routing mux

Fully Programmable (FP)

The FP architecture facilitates a configurable supply voltage for logic blocks and routing multiplexers. Figure 3.13(a) shows how the CLB is configured using high-Vt supply transistors to operate at two different voltages. We experimented with two variants of FP, differing in the placement of the level converters. While the first version places LCs at the output pins of CLBs, the second places them at CLB input pins. Figure 3.13(a) shows the first case, where only the output pins of a CLB have LCs attached to them. In this case, a net with multiple fanouts operates at high Vdd if any one of the CLBs driven by this net is at high Vdd (since the signal's voltage level does not change in the routing fabric). This limits the number of routing muxes that can operate at low Vdd, and therefore is less effective in reducing routing power compared to the case when LCs are attached to CLB input pins. However, the drawback of keeping LCs at the input pins of CLBs (apart from the area penalty) is that a larger number of LCs is needed, which increases the leakage in logic blocks.
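The effect of level converter placement on the voltage of a net can be sketched as follows. This is a simplified Python model, not the Vdd assignment algorithm evaluated in this work: the function name and the "H"/"L" labels are hypothetical. For the output-LC variant, it applies the rule stated above (any high-Vdd sink forces the net high) and additionally assumes that a net driven by a high-Vdd CLB stays high. For the input-LC variant, it assumes the net simply follows its driver, with conversion happening at each sink pin.

```python
def net_vdd(driver_vdd, sink_vdds, lc_at):
    """Supply level ("H" or "L") of the routing resources on one net.

    driver_vdd: Vdd of the driving CLB.
    sink_vdds:  Vdd of every CLB driven by the net.
    lc_at:      "output" (LCs at CLB output pins) or
                "input"  (LCs at CLB input pins).
    """
    if lc_at == "output":
        # The signal level cannot change inside the routing fabric, so one
        # high-Vdd sink (or a high-Vdd driver, by assumption) forces the
        # whole net to high Vdd.
        return "H" if driver_vdd == "H" or "H" in sink_vdds else "L"
    if lc_at == "input":
        # Assumed: conversion happens at each sink pin, so the net runs
        # at the driver's level.
        return driver_vdd
    raise ValueError("lc_at must be 'output' or 'input'")

# A low-Vdd driver with one timing-critical (high-Vdd) sink out of three:
print(net_vdd("L", ["L", "H", "L"], "output"))  # H
print(net_vdd("L", ["L", "H", "L"], "input"))   # L
```

In this example, the routing muxes on the net can run at low Vdd only in the input-LC variant, at the cost of one level converter per CLB input pin.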

Our results support this reasoning, but show that overall leakage is lower for the second case. Figure 3.13(b) shows a routing multiplexer (mux) in the FP architecture. The multiplexer's output is connected to a level-restoring buffer to restore the Vt drop through the NMOS-based multiplexer. Note that the same set of supply transistors controls the voltage of the configuration SRAM cells and the level-restoring buffer. Since the configuration SRAM is not timing critical, the supply transistors need to be sized just large enough to supply the maximum current needed by the level-restoring buffer.

If a circuit block (CLB or routing mux) is completely unused, then in order to save leakage, it is desirable to completely switch off that block. This is achieved by keeping a separate configuration bit for every supply transistor.³ Although this incurs more area overhead, it results in significant leakage savings, since resource utilization in an FPGA is typically low [21, 23]. Due to the area overhead of level converters and supply transistors (and the associated configuration SRAM cells), the dual-Vdd FPGA takes approximately 50% more area than a single-Vdd FPGA.

³ In case of a routing mux, we need to pull down the control signals when the mux is unused. The pull-down transistors can be sized very small.

The majority of leakage in an FPGA occurs in the configuration SRAM cells. It has previously been shown [23] that by increasing the threshold voltage of the configuration SRAM, its leakage can be reduced by 98%, while increasing configuration time by 20%. Since configuration time is not critical in most of our target designs, this tradeoff for power savings is reasonable. For applications where configuration time is crucial, we have

proposed the use of Asymmetric SRAM cells [31]. In order to see the effect of dual-Vdd on power consumption, we have neglected the configuration SRAM leakage both for the single-supply design and for the dual-supply design (since the reduction of configuration SRAM leakage is achieved by increasing its threshold voltage, and is equally applicable to both single- and dual-supply designs).

Logic Programmable (LP)

The LP architecture facilitates a configurable supply only for logic blocks (see Figure 3.14). The routing resources run at supplies fixed at the time of device fabrication. The routing switches contain sleep transistors to cut off their power supply when not used.

The FP dual-Vdd FPGA of the previous section results in a large area penalty of about 50%. A key observation is that most of the area is consumed by the routing resources. By fixing the supply voltages of routing resources, an LP FPGA eliminates the supply transistors and associated configuration SRAM cells in the routing fabric. Instead, we need only one sleep transistor per routing switch. This sleep transistor is controlled by the SRAM cell that controls the state of the routing switch. This more than halves the area cost of supply transistors in the routing fabric. Compared with a single-Vdd FPGA, the area penalty for an LP FPGA is close to 20%. This circuit is similar to one of the circuits in [39], with the difference that in our case the supply voltage could be either Vddh or Vddl, while they fixed the supply to Vddh for routing.

Every logic block still has its own supply transistors, and can be independently programmed to function at Vddh or Vddl. In order to further reduce the area penalty

due to these supply transistors, we share the supply transistors among multiple logic blocks. Since the CLBs do not normally all draw their maximum currents at the same time, the shared supply transistor can be sized smaller than the sum of the independent supply transistors. Hence, the area overhead of supply transistors is reduced. Level conversion still occurs only at CLB pins. However, unlike FP, we do not have the flexibility to set the Vdd of nets to match that of the logic blocks connected to them. Therefore, we need to allow for level conversion at both input and output pins of CLBs. The LP architectures are especially suited for low-cost applications with low power requirements.

Fig. 3.14. Logic programmable dual-Vdd architecture (LP)

Level Conversion

Level converters have been studied widely ever since multi-Vdd circuits were proposed [33, 79]. The area, delay and power overheads of level converters prohibit random

Vdd assignment to logic elements of a circuit. For the present work, we have used the level converter circuit shown in Figure 3.15, and a 65nm Berkeley Predictive SPICE model [80] to simulate it. For an FPGA architecture where level converters are placed at CLB input pins, four level converters are required per BLE. For a Vddh of 1.1V and Vddl of 0.9V, the LC delay is almost 17% of the delay of an LUT, and as much as 41% of the clock-to-q delay of the flip-flop. This significant delay in the LC prohibits the use of many LCs within a logical path of the circuit. In contrast to delay, power consumption in an LC was observed to be negligible (< 1%) compared to a BLE. This allows us to place LCs at all pins of a CLB and still save power.

Fig. 3.15. Level converter circuit

Methodology

We used VPR and its power model [10, 14] for this work. MCNC benchmarks were used to evaluate the dual-Vdd architecture and Vdd assignment algorithms. The architecture of the FP FPGA closely resembles a modern FPGA. The LUT size of 4 and the cluster size of 8 LUTs are the same as in a Xilinx Virtex-II device. The routing channel

consists of 200 tracks, with buffered segments of lengths 1, 2, and 6, as well as long lines. The switch block used a Wilton topology [81].

Fig. 3.16. Experimental flow

For LP, however, we simplified the fabric to resemble the one used by [82]. The CLB consists of 4 BLEs. The routing fabric consists of only length-four segments, which [82] showed to be the best for area and speed. We further changed the switch block topology to Subset. These simplifications made it easier to implement the LP architecture in VPR. A Subset switch block connects only segments of the same type. In an LP FPGA, we wanted no connections from a Vddl routing resource to a Vddh resource because the routing switches did not have any level converters. Using a Subset switch block made it easier to guarantee this (by creating a separate segment type for each Vdd). This, however, also disallows connections from Vddh to Vddl routing resources, and therefore the power savings we report here for LP could be improved. For the purpose of comparison of FP with LP, this restriction is justified

because we do not allow such connections for FP either. Furthermore, we chose all segments to be of length 4 because we did not want nets to solely use longer or shorter wires. Because of the Subset topology, only wires of the same type would connect, and therefore a length-6 wire would not connect to a length-2 wire (which does not resemble a modern commercial FPGA architecture, such as Virtex-II). Despite these simplifications, we believe our results to be indicative of other segmented routing architectures as well.

Circuit simulations were performed in SPICE using 65nm BSIM4 device models. Delays of the BLE and LC were obtained from these simulations. Power consumption, both static and dynamic, of the LC was also obtained through SPICE simulations. Figure 3.16 shows the experimental flow. The flow deviates from a normal VPR flow after the place and route stage. We first assign voltages to all CLBs using the algorithms discussed below, and then estimate the power of the design placed and routed on the target dual-Vdd architecture. Assigning voltages after routing makes the timing analysis more accurate, since all the routing delays get incorporated in the timing graph.

Vdd Assignment

In order to be effective, a dual-Vdd scheme requires that paths in the circuit vary in their delays. If all paths have the same delay, then all circuit elements will require high Vdd to maintain the performance of the design. Figure 3.17 shows the distribution of path delays averaged over the MCNC benchmarks. We observe that path delays in a circuit vary considerably. Therefore, a dual-Vdd scheme can be expected to reduce the power consumption significantly. Figure 3.17 also shows the path delays after using our dual-Vdd assignment algorithms.

Algorithm 1. Vdd assignment: Low-to-High (assuming LCs at CLB input pins)

  Assign Vddl to all CLBs and routing muxes
  P ← list of all paths in the design
  T ← longest path delay when all blocks operate at Vddh
  T_d ← x·T, where x ≥ 1 is a user-defined performance metric
  critical_path ← {P_i ∈ P | delay(P_i) > T_d}
  for all CLBs do
    criticality(CLB) ← number of paths passing through CLB
  end for
  while critical_path not empty do
    P_k ← path ∈ critical_path with maximum delay
    N ← all blocks through which P_k flows
    Sort N based on criticality (first entry has most paths)
    while delay(P_k) > T_d do
      N_i ← first(N)
      N ← N − N_i
      Assign Vddh to N_i and all routing muxes driven by N_i
      Update delay of all paths passing through N_i
    end while
    critical_path ← critical_path − {P_k}
  end while
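The Low-to-High heuristic can be illustrated with a minimal Python sketch. This is not the thesis implementation inside VPR: it assumes a toy timing model in which a path is just a list of block names and its delay is the sum of per-block delays (routing muxes and level converters are ignored). Following the prose definition, criticality counts critical paths through a block. The names `path_delay`, `assign_low_to_high`, `d_high` and `d_low` are hypothetical.

```python
from collections import Counter


def path_delay(path, vdd, d_high, d_low):
    """Sum of block delays on a path under the current Vdd assignment."""
    return sum(d_high[b] if vdd[b] == "H" else d_low[b] for b in path)


def assign_low_to_high(paths, d_high, d_low, x=1.0):
    """Start with every block at Vddl and raise blocks until timing is met.

    Assumption: d_low[b] >= d_high[b] for every block and x >= 1.
    """
    blocks = sorted({b for p in paths for b in p})
    vdd = {b: "L" for b in blocks}
    all_high = dict.fromkeys(blocks, "H")
    # T_d = x * T, with T the longest path delay when everything runs at Vddh.
    t_d = x * max(path_delay(p, all_high, d_high, d_low) for p in paths)

    critical = [p for p in paths if path_delay(p, vdd, d_high, d_low) > t_d]
    criticality = Counter(b for p in critical for b in set(p))

    while critical:
        # Handle the currently slowest critical path first.
        p_k = max(critical, key=lambda p: path_delay(p, vdd, d_high, d_low))
        # Most critical blocks first; the name breaks ties deterministically.
        candidates = sorted((b for b in set(p_k) if vdd[b] == "L"),
                            key=lambda b: (-criticality[b], b))
        while candidates and path_delay(p_k, vdd, d_high, d_low) > t_d:
            vdd[candidates.pop(0)] = "H"
        critical.remove(p_k)
    return vdd


if __name__ == "__main__":
    paths = [["a", "b"], ["c"]]
    d_high = {"a": 10, "b": 10, "c": 25}
    d_low = {"a": 11, "b": 20, "c": 30}
    print(assign_low_to_high(paths, d_high, d_low))
```

In this toy example the sketch raises both `a` and `b`, although raising `b` alone would already meet the target. This is the kind of over-correction the text attributes to the Low-to-High ordering.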

Algorithm 2. Vdd assignment: High-to-Low (assuming LCs at CLB input pins)

  Assign Vddh to all CLBs and routing muxes
  P ← list of all paths in the design
  T ← longest path delay when all blocks operate at Vddh
  T_d ← x·T, where x ≥ 1 is a user-defined performance metric
  vddl_delay(P_i) ← delay(P_i) when all blocks in P_i are at Vddl
  critical_path ← {P_i ∈ P | vddl_delay(P_i) > T_d}
  for all CLBs do
    criticality(CLB) ← number of paths passing through CLB
  end for
  while critical_path not empty do
    P_k ← path ∈ critical_path with maximum delay
    N ← all blocks through which P_k flows
    Sort N based on criticality (last entry has most paths)
    while (delay(P_k) < T_d) & (N not empty) do
      N_i ← first(N)
      N ← N − N_i
      Assign Vddl to N_i and all routing muxes driven by N_i
      Calculate delays of all paths flowing through N_i
      if any of the delays > T_d then
        Reset N_i and all routing muxes driven by N_i to Vddh
      else
        Update delays of all paths flowing through N_i
      end if
    end while
    critical_path ← critical_path − {P_k}
  end while
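A comparable sketch of the High-to-Low variant is given below, under the same toy assumptions (a path is a list of block names, delay is the sum of block delays, routing muxes and level converters are ignored). Two simplifications are assumptions of this sketch rather than statements from the thesis: blocks that lie on no critical path are lowered up front, and every block on a critical path is tried once, with the change reverted if any path through it would exceed the target. The function and parameter names are hypothetical.

```python
from collections import Counter


def path_delay(path, vdd, d_high, d_low):
    """Sum of block delays on a path under the current Vdd assignment."""
    return sum(d_high[b] if vdd[b] == "H" else d_low[b] for b in path)


def assign_high_to_low(paths, d_high, d_low, x=1.0):
    """Start with every block at Vddh and lower blocks where timing allows.

    Assumption: d_low[b] >= d_high[b] for every block and x >= 1.
    """
    blocks = sorted({b for p in paths for b in p})
    vdd = {b: "H" for b in blocks}
    t_d = x * max(path_delay(p, vdd, d_high, d_low) for p in paths)

    all_low = dict.fromkeys(blocks, "L")
    critical = [p for p in paths
                if path_delay(p, all_low, d_high, d_low) > t_d]
    criticality = Counter(b for p in critical for b in set(p))

    # Assumption of this sketch: a block on no critical path can go low
    # immediately, because every path through it meets T_d even at all-Vddl.
    for b in blocks:
        if criticality[b] == 0:
            vdd[b] = "L"

    through = {b: [p for p in paths if b in p] for b in blocks}
    ordered = sorted(critical,
                     key=lambda p: path_delay(p, all_low, d_high, d_low),
                     reverse=True)
    for p_k in ordered:
        # Least critical blocks first; the name breaks ties deterministically.
        for b in sorted((b for b in set(p_k) if vdd[b] == "H"),
                        key=lambda b: (criticality[b], b)):
            vdd[b] = "L"
            if any(path_delay(p, vdd, d_high, d_low) > t_d for p in through[b]):
                vdd[b] = "H"  # revert: some path through b would be too slow
    return vdd


if __name__ == "__main__":
    paths = [["a", "b"], ["c"]]
    d_high = {"a": 10, "b": 10, "c": 25}
    d_low = {"a": 11, "b": 20, "c": 30}
    print(assign_high_to_low(paths, d_high, d_low))
    # -> {'a': 'L', 'b': 'H', 'c': 'H'}
```

Because the sketch keeps examining the remaining blocks on a path after one of them fails, block `a` ends up at low Vdd in this example even though `b` must stay high.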


Fig. 3.17. Distribution of path delays

We use the heuristic shown in Algorithm 1 for Vdd assignment. Initially we assign low Vdd to all CLBs in the FPGA, and find those paths whose delays become greater than the desired clock period. We call such paths critical. Those CLBs which do not belong to any of the critical paths can be kept at low voltage without affecting the performance of the design. Some of the remaining CLBs and routing muxes need to operate at high Vdd so that the design's performance target is met. The order in which these CLBs are analyzed is crucial for the performance of the heuristic. We define the criticality of a CLB as the number of critical paths that pass through this CLB. The CLBs within a path are analyzed in decreasing order of their criticalities. We start with CLBs on the most critical path, and proceed to shorter paths in decreasing order of their delay. Algorithm 1 handles the case when LCs are at CLB inputs. In that case all routing muxes driven by a CLB have the same voltage as the CLB. For the other situation, when LCs are at CLB outputs, the voltage of routing muxes driving a CLB is the same as that of the CLB.

In order to enumerate all paths whose delays become larger than the required clock period, we used the algorithm proposed by [83]. It maintains all paths in a heap data structure with their delays as the keys. Each path also maintains all the branch points in the path in increasing order of their branch slacks (branch slack is defined as the decrease in path delay if a particular branch point is used to generate a new path).

We also experimented with a variant (High-to-Low) of the above algorithm, in which all the CLBs are initially kept at high Vdd and then some of them are changed to low Vdd (see Algorithm 2). Before changing a CLB to low Vdd, we need to make sure that this will not increase the delay of some other path in the circuit above the desired clock period. The number of low-Vdd blocks using both versions, for a Vddh of 1.1V and a Vddl of 0.8V (for 65nm technology), is shown in Table 3.2. For 10 out of 15 designs, the High-to-Low (h2l) version performs better than Low-to-High (l2h). This happens because in the h2l case, when the CLBs on a particular path are being analyzed to see whether they can run at low Vdd, the algorithm continues to look at all the other CLBs on the path even after it has failed to change the Vdd of some CLB. In contrast, in the l2h case, the algorithm keeps changing CLBs on a path to high Vdd (in decreasing order of criticality) until the delay of the path is less than the required clock period. This sometimes reduces the path's delay more than necessary.

For the LP FPGA, the core of the Vdd assignment algorithm remains the same as that for FP. The main differences lie in the way the routing segments are handled. Since their Vdds are fixed, the assignment algorithm does not assign voltages to them.

Additionally, since this architecture allows level conversion at both inputs and outputs of the logic blocks, we modify the assignment algorithm accordingly.

Power Estimation

After all logic blocks have been assigned appropriate supply voltages, we estimate the power consumption of the entire FPGA. We concentrate only on the power consumption in the core of the FPGA, and do not try to optimize or estimate IO power consumption. Furthermore, we did not estimate the power consumption in the global routing grid used for clock distribution.

In order to estimate dynamic power, VPR's power model calculates transition densities at all internal nodes of the FPGA, assuming that all inputs to the FPGA have the same static probability (default: 0.5). Capacitances are estimated from the capacitance values of a MOSFET, and those of wires and switches, all of which need to be provided in the architecture file taken by VPR as an input. We used the Berkeley Predictive 65nm technology parameters for our experimentation. We modified VPR's dynamic power model to include dual supply voltages. The dynamic power of a circuit element scales by a factor of $(V_{ddl}/V_{ddh})^2$ when its voltage is reduced from Vddh to Vddl. SPICE simulations of an LC provided its energy values for different pairs of Vddh and Vddl. We used these energy values and the transition density at the input of an LC to calculate its dynamic power.

VPR has a basic leakage model, which calculates sub-threshold leakage due to weak inversion. However, in a 65nm technology, two more effects, namely DIBL and gate leakage, become significant, and need to be included in the leakage estimation. We

also modified the leakage model to take into account multiple supply voltages and sleep modes. Specifically, the following modifications were made to VPR's leakage estimation:

1. Gate leakage and sub-threshold leakage due to DIBL were included in the leakage estimation. In order to estimate the leakage of a single MOSFET, we used results from SPICE simulations. BSIM4 device models for 65nm were used. Simulations were performed for various supply voltages to get leakage numbers for different voltages. These numbers were incorporated into the power model of VPR to estimate the gate leakage of the entire FPGA.
2. We estimated average leakage in a routing multiplexer by halving the worst-case leakage, as discussed in [27]. To verify the results, we simulated multiplexers of various sizes and structures and found our leakage estimate to be very close to the SPICE results.
3. In the dual-Vdd FPGA, unused logic blocks and routing muxes are kept in a sleep state by switching off both the supply transistors. Circuit simulations in SPICE showed that in sleep mode, the leakage of a circuit block reduces to 10% of the original (high-Vdd) leakage.
4. To estimate level converter leakage, we obtained the leakage number for one level converter from SPICE simulations, and multiplied this by the number of level converters in the FPGA.
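The scaling rules described above can be collected into a short Python sketch. It is only an illustration of the arithmetic, not VPR's power model: the function names, the block-state encoding (`"H"`, `"L"`, `"sleep"`) and the per-block leakage inputs are hypothetical, and the level converter's dynamic power is assumed to be the product of energy per transition and transition density, which presumes the density is expressed in transitions per unit time.

```python
SLEEP_FRACTION = 0.10  # sleep-mode leakage relative to high-Vdd leakage (from the text)


def scaled_dynamic_power(p_dyn_vddh, vddh, vddl):
    """Dynamic power of an element moved from Vddh to Vddl: P * (Vddl/Vddh)^2."""
    return p_dyn_vddh * (vddl / vddh) ** 2


def lc_dynamic_power(energy_per_transition, transition_density):
    """Assumed form: SPICE energy per transition times input transition density."""
    return energy_per_transition * transition_density


def average_mux_leakage(worst_case_leakage):
    """Average routing-mux leakage, estimated as half the worst case."""
    return 0.5 * worst_case_leakage


def block_leakage(state, leak_vddh, leak_vddl):
    """Leakage of one block; leak_vddh and leak_vddl would come from SPICE."""
    if state == "sleep":
        return SLEEP_FRACTION * leak_vddh
    return leak_vddh if state == "H" else leak_vddl


def total_leakage(blocks, n_level_converters, leak_per_lc):
    """blocks: iterable of (state, leak_vddh, leak_vddl) tuples."""
    core = sum(block_leakage(s, lh, ll) for s, lh, ll in blocks)
    return core + n_level_converters * leak_per_lc


if __name__ == "__main__":
    # Moving an element from 1.1V to 0.9V leaves about 67% of its dynamic power.
    print(scaled_dynamic_power(1.0, 1.1, 0.9))
```

With Vddh = 1.1V and Vddl = 0.9V the quadratic factor is (0.9/1.1)^2, roughly 0.67, so lowering the supply alone removes about a third of an element's dynamic power.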

Fig. 3.18. Power consumption for different Vddl values. Vddh=1.1V.

Fig. 3.19. Power consumption for different architectures and algorithms. Vddh=1.1V, Vddl=0.9V

Fig. 3.20. Average power breakdown between logic and routing resources. Vddh=1.1V, Vddl=0.9V

Fig. 3.21. Average power consumption for different critical path delay tolerances. Vddh=1.1V, Vddl=0.9V

Results and Analysis

In this section, we first evaluate the FP architecture (Figures 3.18, 3.19, 3.20, 3.21) and then compare it with LP (Figures 3.22, 3.23).

FP Architecture

Power in the dual-Vdd architecture strongly depends on the values of Vddh and Vddl. In order to understand this dependence, and to come up with a good voltage choice, we fixed Vddh at 1.1V and varied Vddl from 0.8V to 1.0V. Figure 3.18 shows the power consumption for different Vddl values (using the High-to-Low algorithm, with LCs at CLB inputs). Note that for 11 (out of 15) designs, a Vddl value of 0.9V results in maximum power savings. When Vddl is increased to 1.0V, although the number of CLBs on low Vdd increases, the total power consumption increases. This happens because the power consumption of the circuit elements at 1.0V is significantly higher than at 0.9V. Interestingly, when we reduce Vddl to 0.8V, power consumption again increases because the number of CLBs and routing muxes on low Vdd becomes too low. Therefore, for all other results in this section, we use a Vddl of 0.9V. For this case, the average power reduction is close to 61%.

Figure 3.19 shows the power consumption of the designs for the two algorithms, High-to-Low (h2l) and Low-to-High (l2h), and for level converter placements at CLB outputs (LCo) or inputs (LCi). (h2llci denotes the High-to-Low algorithm with LCs at CLB inputs.) Note that for most designs, the High-to-Low algorithm outperforms the Low-to-High algorithm. This is expected because, as shown above (see Table 3.2), the High-to-Low algorithm resulted in a larger number of low-Vdd CLBs. Furthermore, the

placement of LCs at CLB inputs saves more power (average: 61%) than their placement at outputs (average: 57%). This happens because LC leakage is not large enough to overshadow the gains we get in the routing power by placing LCs at CLB inputs.

Figure 3.20 shows the static and dynamic power consumption in both logic and routing resources for the different algorithms and LC placements. An important observation is that not all components of power are reduced by the same factor. The reduction in dynamic power is much less than that in leakage. For example, using the High-to-Low algorithm and placing LCs at CLB inputs saves 24% dynamic power and 76% leakage power. This can be attributed to two factors. First, since an FPGA contains a large number of unused circuit elements, it is possible to reduce the leakage in them by switching them off. Second, leakage varies exponentially with supply voltage, but dynamic power varies only quadratically with supply voltage. Note that leakage in routing resources reduces to less than 17% of the original, because in most designs it is possible to put a large number of routing muxes in the sleep state, as they are sparsely used. Another trend to note is that the logic portion of leakage is larger when LCs are placed at CLB inputs (LCi) than when they are placed at CLB outputs (LCo). This implies that the larger overall power saving for the LCi case comes entirely from the routing resources.

Figure 3.21 shows what happens when we modify the Vdd assignment algorithm to allow some degradation in the performance of the design. In the figure, a delay value of 110% denotes a 10% performance penalty. Note that these delay values may increase after circuit implementation due to the use of supply transistors, and due to a possible increase of wire lengths (since total CLB area and consequently inter-CLB distances

increase). Using h2llci, a 10% decrease in performance increases the average power saving by around 4%, but beyond 20%, the power remains almost constant.

LP Architecture

For LP architectures, since we hard-wire the supply voltages of the routing fabric, the critical path delay of the design may be affected. Therefore, we first look at the impact of LP on the delays of all designs. Figure 3.22 shows both the average and worst-case delays for the benchmark designs. Restricting the maximum increase in delay due to LP to 20%, we decide to keep 50% of the routing resources on low Vdd. Note that the average increase in delay for this architecture is only 3% over the FP architecture. The slightly irregular variation in delay is due to the heuristic nature of the router. In this delay comparison, we do not include the increase in delay caused by the resistance of the supply transistors, by the mux at CLB pins that selects between Vddl and the level-converted signal, or by the longer wires that result from an increase in the FPGA area. The increase, however, is minimal, and is highly dependent on the circuit implementation. For example, [84] demonstrate effective supply-gating of circuits with a performance penalty of less than 10%. [36] observed a penalty of 5% for dual-Vdd circuits when they used regular-Vt gate-boosted supply transistors.

We realized that if the FPGA has too many routing resources, it is possible that none of the low-voltage resources get used, and the delay of the design remains the same as that for a single-Vdd FPGA (if the router is timing-driven). To avoid such a scenario, we first found the minimum channel width for every design using VPR, and

then used 130% of the minimum as the channel width. This is different from the above FP experiments. However, while comparing FP with LP, we used the LP channel width for both architectures. Also note that the CLB here consists of 4 BLEs instead of the 8 in the FP experiments.

Figure 3.23 shows the total FPGA energy (power-delay product) obtained using this architecture for different spatial granularities. The label h2l-50-2x1 on the x-axis refers to the architecture where 50% of the routing resources are at Vddl, and the supply transistors are shared among CLBs in clusters of dimension 2×1. We compare energy instead of power because the critical path delays of designs mapped on LP FPGAs are different from those on FP FPGAs. The Vdd assignment algorithm remains h2l (High-to-Low) for all of them. Compared with FP, LP increases the energy by about 4.1%. The routing energy increases because the supply voltage of the routing resources is fixed. However, the energy used by logic blocks decreases by about 1.5% because the presence of LCs at both CLB inputs and outputs gives us more flexibility in assigning Vdds to them. We further observe that the use of 4×4 clusters increases the total energy by about 12% (compared with FP).

Summary

We presented two types of dual-Vdd FPGA. The fully programmable (FP) FPGA reduced the total energy by about 60% on average at the expense of an area penalty of about 50%. The logic programmable (LP) FPGA reduced the total energy by 57.3% with about a 20% area increase compared to a single-supply FPGA. LP, however, resulted in an average increase of 3% in the critical path delay over FP.

We also explored different Vdd assignment algorithms and level converter placements for the FP architecture. Experiments demonstrated that the High-to-Low algorithm coupled with placement of level converters at the input pins of CLBs resulted in maximum power savings. The dynamic power was reduced by 24%, while the reduction in static power was close to 76%. In future, the implementation of the LP dual-Vdd architecture can be modified to allow connections from Vddh to Vddl resources in the routing fabric. Further, the routing architecture can be improved to use segments of different lengths.

Table 3.1. Characteristics of benchmark designs

 #   Design                 #Slices       #IOBs      #BRAMs   #MULTs    FPGA device
 1   xapp248                96(37%)       17(19%)    0        0         XC2V40-cs144
 2   xapp270\des            4,723(92%)    189(58%)   0        0         XC2V1000-fg456
 3   xapp270\triple-des     14,273(99%)   301(62%)   0        0         XC2V3000-fg676
 4   xapp288\ser decoder    50(19%)       20(22%)    0        0         XC2V40-cs144
 5   xapp288\par decoder    107(41%)      28(31%)    0        0         XC2V40-cs144
 6   xapp                   (39%)         166(83%)   0        0         XC2V250-fg456
 7   xapp298                70(27%)       16(18%)    0        0         XC2V40-cs144
 8   xapp                   (31%)         262(99%)   1(2%)    0         XC2V500-fg456
 9   xapp610                1,369(89%)    20(21%)    0        8(33%)    XC2V250-cs
10   xapp611                1,534(99%)    20(21%)    0        16(66%)   XC2V250-cs
11   xapp615                1,155(75%)    45(48%)    0        2(8%)     XC2V250-cs
12   xapp621                1,305(84%)    29(31%)    0        0         XC2V250-cs
13   xapp625\video          254(99%)      63(71%)    0        0         XC2V40-cs
14   xapp                   (10%)         278(85%)   0        0         XC2V1000-fg
15   itc99\b04              110(42%)      21(23%)    0        0         XC2V40-cs
16   itc99\b05              259(50%)      39(42%)    0        0         XC2V80-cs
17   itc99\b12              214(83%)      13(13%)    0        0         XC2V40-cs
18   itc99\b14              2,432(79%)    88(51%)    0        2(6%)     XC2V500-fg
19   ava\k4                 724(47%)      85(92%)    0        0         XC2V250-cs
20   ava\k7                 2,034(66%)    85(49%)    0        0         XC2V500-fg
21   ava\k9                 2,895(94%)    85(49%)    0        0         XC2V500-fg
22   industry1              1,954(38%)    279(64%)   0        0         XC2V1000-ff
23   industry2              2,488(80%)    185(70%)   0        0         XC2V500-fg
24   industry3              2,513(81%)    132(50%)   0        0         XC2V500-fg
25   industry4              5,777(75%)    182(34%)   0        0         XC2V1500-ff
26   industry5              5,153(67%)    65(12%)    0        0         XC2V1500-ff
27   industry6              206(4%)       287(66%)   0        0         XC2V1000-ff
28   industry7              3,505(68%)    251(58%)   0        0         XC2V1000-ff
29   industry8              1,602(52%)    60(22%)    0        0         XC2V500-fg
30   industry9              2,280(44%)    293(67%)   0        0         XC2V1000-ff
31   industry10             3,663(71%)    224(51%)   0        0         XC2V1000-ff
32   industry11             4,364(85%)    172(40%)   0        0         XC2V1000-ff
33   industry12             97(37%)       80(90%)    0        0         XC2V40-fg
34   industry13             411(80%)      84(70%)    0        0         XC2V80-fg
35   industry14             1,288(83%)    186(93%)   2(8%)    0         XC2V250-fg456
     Average                61.89%        50.82%     0.29%    3.23%     -

Table 3.2. Comparison of High-to-Low and Low-to-High algorithms (LC at CLB inputs, Vddh = 1.1V, Vddl = 0.8V)

Columns: Design, # CLBs, # Vddl CLBs (Low-to-High), # Vddl CLBs (High-to-Low)
Designs: alu, apex, apex, bigkey, des, dsip, elliptic, ex, ex5p, misex, pdc, s, seq, spla, tseng

Fig. 3.22. Critical path delay for LP FPGA with different extents of Vddl resources. Vddh=1.1V, Vddl=0.9V.

Fig. 3.23. Energy consumption in LP FPGAs. Vddh=1.1V, Vddl=0.9V

Chapter 4

Three-Dimensional FPGAs

As transistors become faster and designs get larger, the delay incurred in the interconnecting metal wires becomes significant. Consequently, reducing the wire length is crucial for future technologies. Three-dimensional (3-D) integration is a promising technique for reducing wire lengths. By stacking multiple silicon wafers interconnected with fine vias, the average wire length in the designs is significantly reduced, which improves their performance. Other gains, such as reduced design footprint and the ability to integrate different technologies, further favor 3-D ICs.

Field Programmable Gate Arrays (FPGAs) are consistently improving in capacity and performance, and are now among the most popular devices in the market. With their regular structure, they also scale easily to future technologies. However, the large overheads of their programmable interconnect are severely limiting their growth. The programmable interconnect resources take almost 70% of the die area, and consume the major part of FPGA power. Furthermore, for most designs, they also constitute more than 50% of the critical path delay. Therefore, a reduction in the interconnect resources, by going to 3-D, will greatly benefit FPGAs.

The advantages of 3-D FPGAs have evoked significant interest, and several studies have looked at them in the past. More than a decade ago, Alexander et al. [43] presented a 3-D FPGA that used package-level integration to stack multiple 2-D FPGAs

interconnected using solder bumps. The minimum pitch of these vertical interconnects was 100µm. Campenhout et al. [44] proposed opto-electronic FPGAs, in which the inter-chip communication used optical links. The optical links provide a large vertical channel density. The Rothko 3-D FPGA [45] was a 3-D extension of the Triptych sea-of-gates architecture [46], consisting of routing and logic blocks. The 3-D integration was done at the wafer level and inter-layer communication used metal vias. A dynamically reconfigurable 3-D FPGA was presented in [47], which consisted of three physical layers: a routing and logic block layer, a routing layer, and a memory layer. Recently, Lin et al. [48] analyzed the performance benefits of a monolithically stacked 3-D FPGA. Their 3-D integration technology provided very fine vias, which allowed them to stack the configuration memory on top of the rest of the FPGA (logic blocks and interconnects).

Researchers have also looked at theoretical models for 3-D FPGAs. Rahman et al. [49] presented an analytical model for predicting interconnect requirements in 3-D FPGAs, and estimated over 50% reduction in channel width, interconnect delay, and power dissipation, when compared to 2-D FPGAs. Kwon et al. [50] recently extended this model to incorporate clustered logic blocks (similar to Virtex-II [1]). On the CAD front, Ababei et al. [51, 52] recently presented a partitioning-based placement algorithm for 3-D FPGAs, which primarily focused on reducing the inter-layer vias. However, their router was not timing-driven.

Although several researchers have proposed 3-D FPGAs, the detailed routing architecture of a 3-D FPGA remains unexplored. Ababei et al. [51] assumed a subset switch block (see definition in Section 4.1.1). Although Wu et al. [53] designed universal 3-D switch blocks, they used track count as the sole metric of quality. Furthermore,

they assumed that the number of inter-layer vias is the same as the horizontal channel width. In today's technology, especially if we stack more than two layers, the vias are much thicker than the horizontal wires (1µm vs. 0.1µm), which makes this assumption impractical.

This chapter consists of two main sections. In Section 4.2, we explore six 3-D switch box (SB) topologies for the case when the vias are fewer than the horizontal wires. These switch boxes range from a simple extension of the 2-D subset SB (used in prior studies [51]) to 3-D universal SBs with additional flexibility for the inter-layer vias. Section 4.1 gives a brief overview of 2-D switch boxes and 3-D technology. The switch box topologies explored in this study are described in Section 4.2, which also explains the experimentation methodology and analyzes the exploration results. Using detailed area and delay models, we estimate their impact on FPGA area, delay, and area-delay product. The results indicate that the area-delay product (ADP) depends heavily on the SB topology: our best SB reduces ADP by 10% compared to the subset SB. Section 4.3 analyzes (Section 4.3.1) and reduces the thermal issues in 2-D and 3-D FPGAs. A thermal-aware 3-D FPGA design reduces the peak temperature by about 16°C. Finally, Section 4.4 summarizes the contributions of this chapter.


Fig. 4.1. 2-D switch boxes: (a) Subset, (b) Universal. $X_0$, $Y_0$, $X_1$, $Y_1$ mark their sides.

4.1 Background

4.1.1 2-D Switch Boxes

Our study will focus on island-style SRAM-based FPGAs. FPGAs from Xilinx and Altera belong to this category. The logic block (CLB) consists of Look-Up Tables (LUTs) and Flip-Flops (FFs). Routing wires (tracks) and programmable switches constitute the routing channel. Channel width refers to the number of tracks in a channel. The CLBs connect to the channel through connection boxes. The routing wires connect among themselves through switch boxes. Switch box topology refers to the connectivity provided by the switch box. Researchers have explored several topologies [85, 86, 81, 87, 88] (see Figure 4.1). The subset (also called disjoint) topology, used in Xilinx XC4000 FPGAs, connects tracks of the same number in all four directions. This divides the channel into disjoint sets of

tracks, and a net uses the same track number for its route. The universal topology provides more flexibility than disjoint. It facilitates connectivity for all possible global routes of two-terminal nets. Research has shown that the universal switch box results in fewer tracks in the channel [89]. Hyper-universal switch boxes provide even greater flexibility, and facilitate the connectivity for all possible global routes of k-terminal nets [90]. However, they use more switches than universal switch boxes.

Fig. 4.2. Two kinds of stacking: (a) Face-to-Face, (b) Face-to-Back

4.1.2 3-D Technology Overview

3-D chip design is a promising methodology to alleviate many interconnect problems. Current state-of-the-art chips are two-dimensional, which means that they have

only one plane of active layer that contains all the devices. Note that although no transistor (device) is stacked on top of another transistor (device), the metal wires interconnecting these devices typically span multiple layers, with the higher layers occupied by global wires. 3-D ICs extend this concept to the devices by stacking multiple device layers in the vertical dimension.

Table 4.1. Via properties

         Thickness   Pitch   Height
Via 1    1µm         3µm     10µm
Via 2    2µm         5µm     20µm
Via 3    5µm         10µm    50µm

Several technologies, such as beam recrystallization, silicon epitaxial growth, processed wafer bonding, and solid phase crystallization, enable the vertical integration of multiple device layers [91]. Among these technologies, wafer bonding is particularly promising. It involves the bonding of two fully processed wafers (on which the devices and interconnects have already been fabricated). Since the individual wafers are fabricated separately, it is possible to integrate completely different technologies, and have a very large number of layers. The inter-layer vias in this technology can be as fine as 1µm × 1µm at a 3µm pitch [92].

The wafers can be bonded in two ways: face-to-face or face-to-back. In the former, a wafer is inverted to bond with another wafer (see Figure 4.2 (a)). This reduces the area overhead of the inter-layer vias because they do not need to pass through the silicon substrate. However, this limits the number of layers to only two. The second way, face-to-back, does not invert the wafer (see Figure 4.2 (b)). Consequently, it can integrate more than two layers of silicon. However, since the

inter-layer vias now need to pass through the Si layer, they take up die space. In this study, we evaluate these two wafer-bonding techniques for 3-D FPGA integration.

Since the wafer-bonding 3-D technology is still being perfected, several methods are being explored. These methods result in different via dimensions and wafer thicknesses. For this study, we explore three different methods, which result in the via dimensions shown in Table 4.1. Via 1 reflects the process from Tezzaron [92], which uses a wafer thickness of 10µm. Because they are so thin, these wafers lack mechanical strength, and require the use of handle wafers during processing. At the other extreme is via 3, which uses 50µm wafers and reflects the process in [93]. A larger wafer thickness imparts mechanical strength to the wafers, and eliminates the need for handle wafers. Via 2 reflects an intermediate process that we use to illustrate the trends due to via dimensions. An integration technology from MIT uses SOI wafers to reduce the device layer thickness to less than a micron [94]. We do not model this technology in this work.

4.2 3-D Detailed Routing Architecture

We extend the island-style architecture of 2-D FPGAs to 3-D (see Figure 4.3). The CLB consists of LUTs and FFs. The switch box is modified to connect the inter-layer vias (ILVs) to the horizontal wires (CHANX and CHANY), and also to other ILVs. The ILVs form channels in the vertical direction (CHANZ). The architecture is symmetric in the X and Y directions, i.e., CHANX and CHANY contain the same number of tracks. CHANZ, however, differs from CHANX and CHANY in its width, which is influenced by the via density provided by the 3-D technology. We use V to

Fig. 4.3. 3-D FPGA

Fig. 4.4. 3-D switch boxes for H=4, V=2: (a) Subset, (b) Subset-split, (c) Subset-twist, (d) Subset-more, (e) Universal-twist, (f) Universal-more

refer to the number of vias (i.e., vertical channel width) and H for the horizontal channel width. Figure 4.3 shows the case when H = V = 3. CHANZ differs from CHANX and CHANY in another respect too. The length of these vias depends on the wafer thickness, which is typically much smaller than the average 2-D wire length (e.g., wafer thickness = 10µm for Tezzaron's process [92], length of a wire spanning 4 CLBs = 150µm in a 65nm process). These differences between vertical and horizontal channels must be accounted for to design a good 3-D FPGA. Next, we describe the various 3-D architectures we explored. Where appropriate, we also discuss how technology parameters influence our design.

4.2.1 Switch Box Topology

The flexibility, $F_s$, of a switch box (SB) refers to the number of wires to which each incoming wire can connect. Previous studies have shown that for a 2-D FPGA, an $F_s$ of 3 provides good routability [85]. In such SBs, a track connects to one track on each of the other three sides of the SB. Subset and universal topologies are examples of such SBs (see Figure 4.1). These 2-D SBs are extended to 3-D by adding two more faces, which contain terminals for vertical wires: one for going up, and another for going down. Since the vias will be fewer than the horizontal wires, the two vertical faces will contain fewer terminals than the other four.

Figure 4.4 shows the SBs we created for this study for H=4 and V=2. Normally, the 3-D SB is visualized as a cube, where each face of the cube represents one of the

directions. However, for ease of illustration, we have flattened the SB and shown it as a hexagon, where each side represents a direction: North (Y0), South (Y1), East (X1), West (X0), Top (Z1), or Bottom (Z0). Furthermore, we show only the connections to the vertical faces (Z0 and Z1). For all SBs, the horizontal wires (CHANX and CHANY) use either the subset or universal connections among themselves. These connections were described earlier and illustrated in Figure 4.1. For clarity, we do not show the horizontal connections in Figure 4.4. The first four SBs use subset connections among the horizontal wires, and the last two use universal connections. Figure 4.4 also tabulates the connections from the vertical faces, where Xi,j refers to the jth terminal on the Xi face of the SB.

The first SB (subset, see Figure 4.4 (a)) is an extension of the 2-D subset SB. This SB connects the same track number on all sides. Consequently, the entire routing fabric gets divided into disjoint subsets, and a net uses the same track number for its entire route. Note that only the first V of the H horizontal wires connect to the vias. While these wires have a flexibility of 5 (3 connections to the other horizontal directions, and 2 to the vertical ones), the other wires connect only to horizontal tracks (flexibility = 3). Apart from decreasing the routing flexibility, this results in a difference in the capacitive loads of the horizontal wires: large for the first V wires, and small for the rest.

The second SB (subset-split, see Figure 4.4 (b)) modifies the subset SB by allowing the first V horizontal tracks to connect to the vias going above, and the last V to those going below. This implies that twice as many horizontal wires now connect to the vertical wires. Therefore, if nets do not fan out at the SB, this SB provides greater flexibility in the vertical directions. A limitation, however, is that the first V

can only go above, and the last V only below. Consequently, if a net needs to fan out to both Top and Bottom, it needs to use two horizontal tracks (compared to one for subset). This SB distributes the capacitive loads on the horizontal tracks more evenly than the subset SB.

The subset-split SB, although more flexible than subset, suffers from the disjoint property of subset SBs: the entire routing fabric is divided into disjoint subsets, and a net can use only one of those subsets. Each disjoint subset consists of vertical track i and horizontal tracks i and H - i - 1 (where i ∈ {0, 1, ..., V - 1}). In order to improve upon this, we modified the connections to the vertical faces as shown in Figure 4.4 (c). Now, terminal Z0,0 connects to track 1 on side X0, but to track 0 on side X1. This allows the net to switch tracks at the SBs. We call this SB subset-twist.

The main objective of the subset-twist SB is to improve the flexibility in the vertical direction. Another way to achieve this is by adding more switches to the vertical faces, which is the approach used by the next SB, subset-more (see Figure 4.4 (d)). Here, vertical terminal i connects to both terminals i and H - i - 1 on the horizontal faces (where i ∈ {0, 1, ..., V - 1}). The extra switches have a two-fold effect. On the one hand, they improve the flexibility in the vertical direction; on the other, they increase the area of the SB and the capacitive loads on the wires.

The next two switch boxes use universal connections among the horizontal wires. The vertical connections in the universal-twist SB are identical to those in the subset-twist SB (see Figure 4.4 (e)). However, due to the universal connections among the horizontal wires, it provides greater flexibility. The last SB, universal-more, further increases the flexibility by adding more switches to the vertical faces. For example, in Figure 4.4 (f), track 0 on

side Z0 connects to both tracks 1 and 3 on the X0 side. These extra switches improve the flexibility in the vertical direction, but also increase the area of the SB and the capacitive loads on the wires.

Experimentation

We modified VPR [10], an FPGA place and route tool available from the University of Toronto, to model our 3-D FPGA architectures. We refer to this tool as 3-D VPR. It uses simulated annealing to place the logic blocks and then routes the nets using a modified PathFinder algorithm. Both placement and routing are timing-driven, i.e., they try to reduce the delays of critical paths. The 2-D placement algorithm of VPR optimizes the following cost function:

$$\mathrm{Cost}_{2D} = \alpha \cdot \mathrm{Cost}_{timing} + (1 - \alpha) \cdot \mathrm{Cost}_{cong2D}$$

$$\mathrm{Cost}_{timing} = \sum_{i=1}^{N_{nets}} \left[ \sum_{j=1}^{num\_sinks(i)} delay(i, j) \right]$$

$$\mathrm{Cost}_{cong2D} = \sum_{i=1}^{N_{nets}} q(i) \left[ \frac{bb_x(i)}{C_{av,x}(i)^{\beta}} + \frac{bb_y(i)}{C_{av,y}(i)^{\beta}} \right]$$

where N_nets is the number of nets in the design, num_sinks(i) is the number of sink pins of net i, and delay(i, j) is the estimated delay from the source of net i to sink number j. For each net i, bb_x(i) and bb_y(i) denote the x and y spans of its bounding box, respectively. The q(i) factor compensates for the fact that the bounding-box wire length model underestimates the wiring necessary to connect nets with more than three terminals. Its

value depends on the number of terminals of net i. C_av,x(i) and C_av,y(i) are the average channel capacities in the x and y directions, respectively, over the bounding box of net i. The exponent β adjusts the weight given to congestion in the cost function. The larger the value of β, the more wiring in narrow channels is penalized relative to wiring in wider channels. A value of 1 has been previously found to work best, and is used in this work.

To the 2-D cost function, we add a term, Cost_spanz, to reduce the vertical span of the nets. This is similar to what was proposed in [51], except that, as in the congestion cost terms for the x and y directions, we incorporate congestion in Cost_spanz:

$$\mathrm{Cost}_{3D} = \alpha \cdot \mathrm{Cost}_{timing} + \beta \cdot \mathrm{Cost}_{cong2D} + \gamma \cdot \mathrm{Cost}_{spanz}$$

$$\mathrm{Cost}_{spanz} = \sum_{i=1}^{N_{nets}} q(i) \left[ \frac{bb_z(i)}{C_{av,z}(i)^{\beta}} \right]$$

Note that the coefficient β in Cost_3D is the weight of the 2-D congestion term and is distinct from the exponent β inside the congestion terms. By varying the values of α, β, and γ for two of the benchmark designs, we found that α = 0.5, β = 0.1, and γ = 0.4 give the smallest critical path delay. Hence, we use these values in all our experiments.

Architecture and Technology Parameters

The logic blocks in our experiments consist of four 4-input LUTs and four FFs, with 10 inputs and 4 outputs. All the inputs are equivalent, and so are the outputs; that is, every input pin can internally drive any LUT input. The pins are uniformly distributed around the sides of the CLB. Each output pin connects to 25% of the tracks in the adjacent channel, and every input pin connects to 60% of the adjacent tracks. All horizontal

segments (CHANX and CHANY) in the routing fabric span 1 CLB, and are driven by tri-state buffers. The vertical channel (CHANZ) has vias that span only a single layer. When these vias are very short (10um), we use minimum-size pass-transistor switches to drive them. However, when they are 50um high, we use a 5X tri-state buffer switch to drive them. In contrast, the buffers driving the CHANX and CHANY segments are always 5X the minimum size, and consist of two stages. We calculated the resistance and capacitance values for the vias and horizontal wires by using the Predictive Technology Model (PTM) [95]. Timing parameters for switches were derived from Spice simulations using 65nm BPTM.

We explored a spectrum of 3-D technologies: the via properties shown in Table 4.1, a number of layers varying from 2 to 5, and either face-to-face (f2f) or face-to-back (f2b) bonding technology. The finest vias of 1um thickness are in line with Tezzaron's process [92], while the coarsest ones (of 5um thickness) reflect the process from [93].

Experimentation Flow

Figure 4.5 shows the experimentation flow. A design in blif format is packed into clusters (CLBs) of 4-LUTs using T-VPack. On the basis of the number of CLBs in the design, 3-D VPR creates the smallest FPGA fabric that would contain the design. It takes the number of layers as an input, and finds the minimum square size of one layer, assuming that all layers contain the same number of CLBs. The packed netlist is then placed and routed using 3-D VPR to find the minimum number of vias for a

large horizontal channel width (= 80 for 5 layers). The router performs a binary search over the number of vias to find the minimum value. Fixing the number of vias to 130% of the minimum value, we re-route the design to find the minimum possible channel width. Thus, this flow gives priority to reducing the number of vias instead of the channel width, which makes sense because the vias take more area than the horizontal wires. However, most FPGAs provide more than the minimum number of channels to ensure good performance for the worst case too. Along similar lines, we add 30% to the minimum via and channel-width numbers while evaluating the FPGA. Using these values (which may be different for every design), we re-route the design to obtain the critical path delay of the routed design. This flow is repeated for every switch-block type for all 20 MCNC benchmark designs.

Fig. 4.5. Experimentation flow
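The search described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the 3-D VPR implementation: the predicate `routes_with(vias, width)` is a hypothetical stand-in for a full place-and-route run, and the sketch assumes routability is monotonic (a design that routes with some number of vias or tracks also routes with more).

```python
from typing import Callable


def min_feasible(lo: int, hi: int, routes: Callable[[int], bool]) -> int:
    """Binary search for the smallest n in [lo, hi] with routes(n) True.

    Assumes monotonicity: if the design routes with n resources,
    it also routes with any larger n.
    """
    if not routes(hi):
        raise ValueError("design does not route even at the upper bound")
    while lo < hi:
        mid = (lo + hi) // 2
        if routes(mid):
            hi = mid          # mid works, so the minimum is at or below mid
        else:
            lo = mid + 1      # mid fails, so the minimum is above mid
    return lo


def with_margin(minimum: int, margin_percent: int = 30) -> int:
    """Add the margin (30% in the text) and round up, using integer math."""
    return -(-minimum * (100 + margin_percent) // 100)


def evaluate_design(routes_with: Callable[[int, int], bool],
                    max_vias: int, max_width: int,
                    wide_channel: int = 80) -> tuple:
    """Return (vias, channel width) used for the final timing run."""
    # Step 1: minimum vias, with a generous horizontal channel width.
    min_vias = min_feasible(1, max_vias, lambda v: routes_with(v, wide_channel))
    # Step 2: fix vias at 130% of the minimum, then minimize channel width.
    vias = with_margin(min_vias)
    min_width = min_feasible(1, max_width, lambda w: routes_with(vias, w))
    # Step 3: add 30% to the minimum channel width as well.
    return vias, with_margin(min_width)


# Hypothetical routability rule, used only to exercise the flow.
toy_router = lambda vias, width: vias >= 10 and width >= 20
result = evaluate_design(toy_router, max_vias=100, max_width=200)
```

With the toy rule above, the minimum of 10 vias and 20 tracks becomes 13 vias and 26 tracks after the 30% margin. In a real flow each call to the predicate is an expensive routing run, which is why a binary search rather than a linear scan is used.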

Area Model

VPR estimates area by counting the number of transistors in the fabric. This works because the 2-D FPGA area is transistor-dominated. In the 3-D case, however, we must add the via areas to the transistor areas. The two types of 3-D integration technologies discussed earlier need different area models. In the case of face-to-face (f2f) bonding, the inter-layer vias (ILVs) do not pass through the silicon (see Figure 4.2). Consequently, they do not take any die area. In contrast, face-to-back (f2b) bonding requires vias to pass through the silicon (through-vias). In this case, every via consumes some Si area. We incorporate the area overhead of these through-vias in our area estimates. When comparing the area of two architectures, we estimate the total FPGA area and divide it by the number of CLBs in the fabric to estimate the area per CLB. Thus, the area numbers in the next section include the area for one logic block (CLB) and the routing resources (horizontal wires, switches, and vias) associated with it.

Results and Analysis

Here, we show the results for two extremes of 3-D integration: first, a simple stack of two layers; and second, a more aggressive stack of 5 layers. Together they capture the trends seen by varying the number of layers in a 3-D FPGA. While the two-layer FPGA can be fabricated using f2f or f2b wafer bonding, the 5-layer FPGA must be fabricated using f2b. For all these technology points, we evaluate the effects of the different via dimensions shown in Table 4.1. The metric we primarily look at to evaluate an architecture is the area-delay product (ADP), because it is inversely proportional to the

Fig. 4.6. Comparing 2-D and 3-D FPGAs

Fig. 4.7. Comparing the switch boxes for 5-layer FPGA

throughput of the device [96]. In all the figures in this section, we plot the geometric means over 20 MCNC benchmarks.

The first step towards evaluating 3-D FPGAs is comparing them with 2-D FPGAs. Figure 4.6 shows the average area (per CLB), delay, and ADP for 1, 2, and 5 layers in 65nm technology. For both 2 and 5 layers, it shows the results for the three via technologies of Table 4.1. The key "2-layers-f2f-3um" in the figure refers to the use of 2 device layers, stacked using f2f bonding with vias at 3um pitch (via 1 in Table 4.1). Figure 4.6 uses the same switch box (universal-twist) for all cases. The area is estimated as explained under Area Model. Note that the area reduces as we increase the number of layers, or reduce the pitch of the vias. The smallest area is obtained when five layers are used with 3um-pitch vias, in which case the CLB's area is only 84% of the single-layer case. Furthermore, we observe that the area of the 2-layer FPGA using f2f bonding remains constant with increasing via pitches. This happens because the vias in this case are accommodated within the transistors' footprint, and the CLB area is determined by the transistors.

The critical path delay also reduces with an increasing number of layers (second set of bars in Figure 4.6). The 5-layer FPGA with 5um-pitch vias (best case) reduces the delay by 24.7% compared with the single-layer case, and by 14% compared with the 2-layer case. This happens because interconnect lengths (and hence delays) reduce as we increase the number of layers. The choice between f2f and f2b technologies does not have any significant impact on the delay.

The reductions of area and delay in 3-D combine to significantly reduce the area-delay product of the FPGA (third set of bars in Figure 4.6). The 5-layer FPGA reduces

the area-delay product by 36% (for 3um-pitch vias), while a 2-layer FPGA does so by about 20%, when compared to a single-layer FPGA. These results justify the interest in 3-D FPGAs, and demonstrate that we can obtain significant improvements even with the relatively simple integration of two FPGA layers.

Now, we explore the different switch boxes to find which one gives the best values for area, delay, and area-delay product. Figure 4.7 shows the results for 5 layers, using a 65nm process and 3um-pitch vias (via 1 in Table 4.1). The results for 2 layers follow a similar trend. The first set of bars in Figure 4.7 compares the flexibility in the vertical direction of the various SBs by looking at the minimum number of vias they need for the designs to route. Observe that the universal-more SB provides the greatest flexibility (minimum number of vias). In fact, it uses only 49% of the vias needed by the subset SB. It also results in the minimum channel width among all the SBs. However, the total area is determined by both the vias and the number of transistors in the fabric. Since universal-more uses extra switches to increase flexibility, the total area taken by the FPGA using the universal-more SB is larger than that of the one with the universal-twist SB. This indicates that the universal-twist SB provides greater flexibility per switch than the universal-more SB.

While the area reduces to 88% when the universal-twist SB is used instead of the subset SB, the critical path delay does not show such a strong variation. This happens because the timing-driven router of 3-D VPR gives less weight to congestion for timing-critical nets, which implies that they almost always take the shortest possible route. The smallest delay is obtained for the subset-split case. Note that adding more switches to the SB increases the delay, which is explained by the larger parasitic capacitances

due to these switches. Because the variation in delay is small, the trend for the area-delay product is similar to that for area. The universal-twist SB offers the lowest area-delay product, 91% of that for the subset SB.

Next, we explore how the via properties affect the choice of SB for the 5-layer FPGA. Figure 4.8 compares the area-delay product for different SBs for the three via technologies of Table 4.1. The x-axis is labeled as <via pitch>-<via height>. Intuitively, as the vias become larger, we would prefer the SB that requires the minimum number of vias. Figure 4.8 demonstrates this trend. As vias become larger, the difference between the area-delay products for universal-twist and universal-more (which requires the minimum number of vias) reduces. This happens because, as vias become larger, the area taken by the vias starts dominating the total area. However, even for 10um-pitch vias (the largest case), the universal-twist SB continues to provide the smallest area-delay product.

We also look at the effect of technology scaling on the performance of our SBs in a 5-layer FPGA (see Figure 4.9). The vias are assumed to remain at 3um pitch while the CMOS technology scales from 65nm to 45nm and 32nm. Again, universal-twist remains the best SB for all process nodes. Since the via dimensions remain constant among the different process nodes, the area penalty due to through-vias increases as transistor dimensions shrink. Consequently, the universal-more SB (which requires the minimum number of vias) improves as the process scales. However, even for the 32nm node, the universal-twist SB remains the best from an area-delay product perspective.
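As a rough consistency check on the area-delay numbers quoted in this section, the normalized ADP can be recomputed from the normalized area and delay. The sketch below is illustrative only: the function names are ours, the per-benchmark values are hypothetical, and the two quoted ratios (84% area, 24.7% delay reduction) are reported in the text for different via pitches, so the product is only expected to land near the reported 36% ADP reduction.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive values, as used to average over benchmarks."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalized_adp(area_ratio, delay_ratio):
    """Area-delay product relative to the 2-D baseline (1.0 means no change)."""
    return area_ratio * delay_ratio


# Ratios quoted in the text for the 5-layer FPGA, relative to a single layer.
area_ratio = 0.84            # CLB area is 84% of the single-layer case
delay_ratio = 1.0 - 0.247    # critical path delay reduced by 24.7%
adp = normalized_adp(area_ratio, delay_ratio)
adp_reduction_percent = 100.0 * (1.0 - adp)

# Hypothetical per-benchmark ratios, to show why geometric means compose:
# the geometric mean of per-benchmark ADPs equals the product of the
# geometric means of area and delay.
areas = [0.80, 0.85, 0.90]
delays = [0.70, 0.75, 0.80]
per_benchmark_adp = [a * d for a, d in zip(areas, delays)]
lhs = geometric_mean(per_benchmark_adp)
rhs = geometric_mean(areas) * geometric_mean(delays)
```

The product comes out at roughly 0.63, that is, a reduction of a little under 37%, which is in line with the reported figure given rounding and the mixed via pitches. The second half illustrates one reason geometric means are convenient here: they commute with the area-delay product, which arithmetic means do not.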

Fig. 4.8. Comparing the switch boxes for different via technologies for 5-layer FPGA

Fig. 4.9. Comparing the switch boxes for different process nodes for 5-layer FPGA

Thermal Issues in 3-D FPGAs

Junction temperature is a growing concern in integrated circuits. Improvements in fabrication technology, circuit design, architecture, and tools have all contributed towards an increase in logic density as well as clock frequency. Increased logic density and performance have in turn led to an increase in power densities, which manifests itself in the form of high temperatures. FPGAs are following a similar trend. Recent articles on thermal management from leading FPGA manufacturers [97, 98] clearly indicate the growing importance of thermal issues in FPGA designs. Since three-dimensional integration increases the effective power density, 3-D ICs suffer from even higher temperatures.

Die temperature must be controlled because it impacts the timing, leakage power, package design, and lifetime of the device. Circuits run slower when they are hot, and their lifetime reduces exponentially with increasing temperature. Besides, plastic packages can only withstand relatively low temperatures. Furthermore, leakage power increases exponentially with temperature, which can cause thermal runaway. All these factors have forced chip manufacturers to employ techniques to control the die temperature. These techniques can be divided into two categories, namely package-level and design-level.

Package designers have been considering thermal issues for a long time. Heat sinks, spreaders, and fans are the most common examples of package-level techniques. Instead of considering variations in the temperatures on the die, they design the package to support the worst-case specifications of the design. They typically provide the user with the thermal resistance (θ_JA) of the package, which is used to estimate the junction

temperature (T_J) using

$$T_J = T_A + \theta_{JA} \cdot \mathrm{Power} \qquad (4.1)$$

where T_A is the ambient temperature, and Power refers to the total power consumed by the chip.

As designing the package for the worst-case junction temperature started becoming too expensive, researchers started looking at design-level solutions to reduce the temperature. A common example is dynamic thermal management (DTM), where the design is run at a reduced power (and performance) if the chip temperature increases beyond a previously set threshold. Thermal sensors measure the temperature, and power is reduced by lowering the clock frequency or the supply voltage, or by clock-gating [55]. Design-level techniques can also aid in removing the heat generated by the design. For example, thermal-aware floorplanning tries to reduce the hotspots on the die by distributing the temperature uniformly [56, 57]. Researchers have mostly focused on microprocessors in these works. Thermal placement is a similar technique applied at the placement stage. Chen and Sapatnekar [58] proposed a partition-driven algorithm for standard-cell thermal placement. Thermal floorplanning and placement are particularly attractive because they impact the performance less than DTM.

On the modeling front, several researchers have developed tools for estimating the die temperature. Among them, HotSpot [59] is an architecture-level thermal simulator, which can perform transient as well as steady-state temperature estimation. HS3d [60] is another architecture-level tool that performs only steady-state temperature estimation, but is orders of magnitude faster than HotSpot. Both HS3d and HotSpot provide the

flexibility to set several package and die parameters, such as the spreader thickness, package-to-air thermal resistance (r_convec), and substrate thickness. Since in this work we look only at steady-state temperatures, we use HS3d.

Recently, some researchers have proposed solutions for thermal issues in 3-D ICs too. Cong et al. [61] suggested thermal-driven floorplanning for 3-D. Goplen and Sapatnekar [62] also proposed a temperature-driven placement algorithm for 3-D standard-cell ASICs. Studies have also indicated that careful insertion of thermal vias can reduce the peak temperature [63, 64].

Thermal issues in FPGAs are relatively unexplored. Some researchers have proposed the use of distributed sensors for monitoring temperatures in FPGAs [65, 66]. They, however, considered only CLBs in the fabric, and consequently observed very little temperature variation across the die. In contrast, we focus on platform FPGAs, containing embedded circuit blocks including high-speed transceivers, multipliers, DLLs, and memories (see Figure 4.10) [1, 2]. Here, we first characterize the temperature distribution in a modern 2-D FPGA, and then observe how it changes when we stack multiple such layers. We further propose changes in the placement of hard blocks in a 3-D FPGA to reduce the die temperature.

Thermal Characterization of FPGAs: 2-D to 3-D

Most modern FPGAs incorporate hard blocks in the fabric (e.g., Virtex-4, see Figure 4.10). Table 4.2 shows the power densities for the blocks in a Virtex-4 FPGA. Observe that the power densities vary from 0.78 for the DSP blocks to for the DCMs. This vast range results in large temperature variations within the FPGA die


Fig. 4.10. Virtex-4 FX100 device (not to scale)

Table 4.2. Power densities in 4VFX100 (Freq: 500MHz)

Block type            Power density (normalized to CLB)
DSP                   0.78
CLB                   1.00
PPC                   1.32
IOB                   2.33
BRAM   Dual Port      3.85
       Single Port    1.93
MGT    Transceiver    7.75
       Transmitter    4.22
       Receiver       4.11
PMCD                  11.4
DCM    High Freq
       Low Freq       9.84

Fig. 4.11. Thermal profile of 4VFX100 (temperature in C over X and Y location in cm)

Fig. 4.12. Effect of stacking on peak temperature

Table 4.3. Effect of stacking on temperature

#Layers   3-D Tech   Vias                 Temperature (C)
                                          Peak   Average   Min
          Ref [92]   No via
          Ref [92]   Via 1 (Table 4.1)
          Ref [92]   Max vias
          Ref [93]   No via
          Ref [93]   Via
          Ref [93]   Max vias
          Ref [92]   No via
          Ref [92]   Via
          Ref [92]   Max vias
          Ref [93]   No via
          Ref [93]   Via
          Ref [93]   Max vias
          Ref [92]   No via
          Ref [92]   Via
          Ref [92]   Max vias
          Ref [93]   No via
          Ref [93]   Via
          Ref [93]   Max vias

Table 4.4. Parameters for temperature estimation in HS3d

Parameter             Value
Ambient temperature   45 C
r_convec              0.5 C/W
Substrate thickness   500 um
Spreader thickness    1 mm
Sink thickness        6.9 mm
Glue thickness        2 um

(see Figure 4.11). The hotspots occur near the MGTs and DCMs, which are about 14 C above the coolest portions.

Table 4.3 shows the temperatures for 3-D FPGAs consisting of identical FPGA layers of 4VFX100. The temperatures were estimated using HS3d [60] with the parameters listed in Table 4.4. The r_convec value of 0.5 C/W reflects the thermal resistance of a high-end package with a moderate heat sink. We estimated temperatures for two extremes of 3-D technologies: one with very thin layers and fine vias (Tezzaron's process, Via 1 of Table 4.1), and another with 5um vias and 50um layers (Via 3 of Table 4.1). For both these technology points, we also varied the number of inter-layer thermal vias between the two extremes of no thermal vias and the maximum possible number of thermal vias. Table 4.3 shows the temperatures for these two corners along with a more realistic number based on the via pitches in Table 4.1.

As expected, the peak temperature increases with the number of layers, from 89.4 C for a 2-D FPGA to C for a 4-layer FPGA using Tezzaron's process. The intra-package temperature variation also increases with the number of layers, from 14.4 C for a 2-D FPGA to 55.0 C for a 4-layer FPGA. This large variation in temperature indicates that the peak temperature could be reduced by distributing the hot blocks more evenly across the fabric. Interestingly, the 3-D technology parameters change the temperatures only slightly. For a 4-layer FPGA, layer thickness changes the peak temperature by up to 4.4 C, while thermal vias could decrease the peak temperature by up to 3.4 C. Figure 4.12 shows the effect of stacking on temperature, as well as the possible variations due to 3-D technology parameters.
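To make the package-level estimate of Equation 4.1 concrete, the following is a minimal sketch of the lumped model. It is not how HS3d computes temperatures: HS3d resolves the temperature across the die, whereas this model yields a single number. The ambient temperature is taken from Table 4.4; the thermal resistance and the power values are assumptions chosen only for illustration (the value 0.5 C/W borrows r_convec as a stand-in for θ_JA, which in practice also includes other terms).

```python
def junction_temperature(t_ambient_c: float,
                         theta_ja_c_per_w: float,
                         power_w: float) -> float:
    """Lumped estimate T_J = T_A + theta_JA * Power (Equation 4.1)."""
    return t_ambient_c + theta_ja_c_per_w * power_w


T_AMBIENT_C = 45.0   # ambient temperature from Table 4.4
THETA_JA = 0.5       # C/W, assumed stand-in for the package resistance
LAYER_POWER_W = 20.0 # hypothetical power of one FPGA layer

# One layer versus two identical layers stacked in the same footprint:
# the package sees twice the power through the same thermal resistance.
t_single = junction_temperature(T_AMBIENT_C, THETA_JA, LAYER_POWER_W)
t_stacked = junction_temperature(T_AMBIENT_C, THETA_JA, 2 * LAYER_POWER_W)
rise_single = t_single - T_AMBIENT_C
rise_stacked = t_stacked - T_AMBIENT_C
```

Under these assumptions the rise above ambient doubles when a second identical layer is stacked, which is the first-order reason stacking raises temperatures. The lumped model says nothing about hotspots or intra-die variation, which is why a spatially resolved tool is used for the results in Table 4.3.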

Table 4.5. Thermal-aware 3-D FPGA design

FPGA Design                 Temperature (C)
                            Peak     Average   Minimum
2-D                         86.9     78.0      73.8
2-layer stacked             128.48   111.11    102.78
2-layer thermal             112.92   111.19    110.
2-layer thermal inverted

Fig. 4.13. 3-D FPGA organizations: (a) 2-layer stacked, (b) 2-layer thermal (Layer 1 and Layer 2 shown for each)

Thermal-Aware 3-D FPGA Organization

Recently, a study proposed alternate organizations for a 2-D FPGA to reduce the intra-die temperature variations [67]. Using a fully utilized Virtex-4 FX100 FPGA as an example, it demonstrated a reduction in peak die temperature of about 6 C. Since the temperature variation is larger in a 3-D FPGA, we would expect thermal organization to have a greater impact. To demonstrate this, we design a thermal-aware 2-layer FPGA. For ease of experimentation, we consider only 4 types of blocks in the FPGA, namely CLB, BRAM, DSP, and MGT. These blocks consume the majority of the area in 4VFX100. The peak temperature for a 2-D FPGA containing these blocks is 86.9 C.

In the first case, we stack two such identical layers to form a 2-layer stacked FPGA (see Figure 4.13(a)). The peak temperature for this FPGA is 128.48 C. Note that stacking the hot blocks significantly increases the power density, and therefore the temperature. Hence, next, we keep all the MGTs, DSPs, and BRAMs on a single layer. The second layer now consists only of CLBs (see Figure 4.13(b)). This change in floorplan can be implemented easily with the column-based modular architecture of Virtex-4 (ASMBL) [1]. This reduces the peak temperature to 112.92 C (2-layer thermal in Table 4.5). The temperature variation also drops from 25.7 C for the stacked design to only 2.6 C for the thermal-aware design.

In the previous experiments, the heat sink is attached closest to the layer consuming the maximum power. Previous studies have suggested that this should be preferred. In fact, researchers have proposed thermal-aware 3-D floorplanning that tries to place the hot blocks closer to the sink [61]. In order to see the effect of sink placement, we

attached it to the layer containing only CLBs in the 2-layer thermal organization. Table 4.5 also shows the temperature for this case (2-layer thermal inverted). We observe that the temperature increases only very slightly because of this change. This happens because the vertical distances are small compared to the horizontal dimensions of the FPGA.

4.4 Summary

This chapter demonstrated that 3-D FPGAs can provide significant advantages over 2-D by reducing the interconnect area and the total area-delay product. The 3-D FPGA with 5 layers and 3um-pitch vias reduces the area-delay product of a 2-D FPGA by 36%. We designed and evaluated several switch boxes for 3-D FPGAs, and showed that the area-delay product depends heavily on the switch box topology. In 65nm technology, the area-delay product for our universal-twist switch box is 15% lower than that of the subset switch box for 5um-pitch vias. We further showed that the universal switch boxes become even better with scaling process technology, as well as with larger vias. However, adding more switches to the universal SB does not provide any benefit.

3-D integration, however, increases the die temperature. Our experiments indicate that the peak temperature for a 4-layer FPGA is 2.4 times that of a single-layer FPGA. However, the large variation in temperature within the 3-D package allows us to re-organize the 3-D FPGA to reduce the peak temperature. For a 2-layer FPGA, the peak temperature reduced by 16 C when the design was altered to create a more uniform temperature profile.

Chapter 5

Technology Alternatives for Nanoscale FPGA Interconnects

The previous decade has seen large-scale concerted efforts to develop nano-scale technologies that will help sustain Moore's law. Innovations in lithographic CMOS technologies have indicated that it would be possible to scale CMOS at least until the second half of the next decade. However, conventional lithographic techniques suffer from increasing fabrication costs, which may ultimately limit their application. Recently, a (comparatively) low-cost and reliable nano-imprint lithography technique has been proposed [99, 100], which raises the hopes of obtaining cost-effective nanoscale fabrication. However, at present, this imprint technique is limited to very regular structures, and is unlikely to produce the complex structures that current lithography can produce. While nano-imprint as well as conventional lithography are top-down techniques, there are several bottom-up assembly techniques [101] in which molecules assemble to form nano-structures. Although these techniques are expected to be very low cost, they suffer from yield issues and are limited to very simple geometries.

Modern high-end FPGAs contain a variety of resources, and are not restricted to a simple array of logic blocks consisting of Look-Up Tables (LUTs) connected using programmable switch blocks. In current FPGAs, apart from the basic programmable blocks, there exist RAM modules, some hard-coded blocks (e.g., multipliers), and even some full processors (e.g., PowerPC processors). Apart from them, the basic programmable logic

block itself has been augmented to contain non-LUT structures, like fast carry-chain circuits. There have been advances in the interconnect architecture too. Modern FPGAs consist of segments of different lengths, each with different connectivity. However, it is widely accepted that the interconnect is the major bottleneck in FPGAs. The interconnect multiplexers in Xilinx's Virtex-2 FPGAs take around 70% of the CLB area. Furthermore, even after careful timing-driven packing and placement, interconnects are the dominant source of delay for most designs. In addition, the power consumption in a typical FPGA-mapped design is dominated (> 70%) by the interconnect resources [13].

In this chapter, we explore different solutions to the interconnect problem in the nano-scale regime. We explore nano-wires of different widths and materials as interconnect. We also explore replacing the pass-transistor switches in current FPGAs with molecular switches [101, 102] that provide reprogrammable connections between wires. This removes the need for SRAM cells to control the state of the switch, since these molecules store the state within themselves. This is similar to anti-fuse FPGAs, but, in contrast to anti-fuse technology, these molecules are reprogrammable. Furthermore, we expect the structure of the CLB to be more difficult to realize efficiently in a technology more amenable to regular structures. Therefore, the logic blocks in our architecture are fabricated using lithographic techniques.

5.1 Nanotechnology Primitives

Several nano-structure fabrication techniques have been proposed over the past few years. Among them, Nano-imprint [99, 100] and Dip Pen Nano-lithography (DPN) [103]

are the most promising techniques. In the case of nano-imprint technology [99, 100], e-beam lithography (or a similar technique) is used to create a mould, which is subsequently used to emboss the circuit on other chips for mass production. The mould can be made very fine, and the technique is expected to scale down to feature sizes of a few nanometers. DPN [103], in contrast, uses an Atomic Force Microscope (AFM) to write the circuit on the die. Although it is inherently slower than nano-imprint, using multiple AFM tips improves the writing speed significantly. This has been demonstrated to produce very small features, and is expected to fabricate features smaller than 10nm. Directed self-assembly [101] is another approach towards making nano-structures. Although this may be the cheapest way to make circuits, it suffers from very high defect rates. Note that all these technologies (nano-imprint, DPN, and self-assembly) are expected to be limited to very simple geometries.

It has been shown that it is possible to get sets of parallel wires using any of the above techniques. Therefore, we propose to use them (preferably nano-imprint) to make only wires in the FPGA. These wires could be made using a single crystal of metal-silicide (e.g., NiSi nano-wires [104]) or made out of metal. Carbon nanotube wires could also be considered, although a recent work claimed that carbon nanotubes may not be better than metal wires with respect to reducing interconnect delays [105].

In addition to the wires, we also need programmable switches to provide programmable connections among the wires and between wires and logic pins. In the FPGAs of Xilinx and Altera, these are made using pass transistors and SRAM cells, while Actel's Axcelerator FPGAs use one-time programmable anti-fuse material. At the nanoscale, we can use single-molecule switches that exhibit reversible switching behavior [70].

These molecules self-assemble at the cross-points of nano-wires, and can be switched between ON and OFF states by the application of a voltage bias. It is desirable that these switches have a very low ON resistance and a very large OFF resistance. ON resistances of hundreds of ohms and OFF-to-ON ratios of 1000 have been observed recently [102]. Note that very fast switching characteristics are not essential for FPGAs, because these switches will not be configured very frequently and the FPGA configuration time is normally not critical.

Early work in molecular switching suffered from filament formation due to the small gap separating the nano-wires. Consequently, the switching behavior observed was due to the metallic filament rather than the molecule. Chemists at several research institutions are targeting this problem. One such (as yet unpublished) work from our collaborating chemists can increase the vertical separation among wires to 30nm and uses nano-spheres to provide programmable connections. In line with this work, we experiment with a fixed vertical separation between nano-wires of 30nm.

Related Work

DeHon [68], Goldstein [69], and Tour [70] have previously proposed programmable architectures using some form of nano-structures that are made using self-assembly. Goldstein tried to make crossbar-based devices by aligning nano-wires in two planes at right angles to each other. The crosspoints contained molecules that provided programmable logic as well as interconnections. This approach suffered from signal degradation, as there was no way to restore the signal using only two-terminal devices. DeHon overcame this problem by using SiNW-based FETs to restore the signals, and proposed a PLA

structure. However, the logic functionality in that architecture was limited to OR (and inversion). Tour, instead, proposed replacing the logic blocks by nanocells and connecting them using metal wires. This suffered from the problem of training these nanocells, which were assumed to consist of a randomly connected mass of molecules. Furthermore, since the bottleneck in current FPGAs lies in the interconnect, Tour's architecture does not help solve this problem. All the above architectures propose drastic changes to the existing CMOS technology as well as to the design methodologies. We propose an architecture that blends easily with existing technology, and preserves all the design methodologies and the flexibility in logic functionality.

5.2 Nanoscale FPGA Architectures

We explored FPGA architectures with varying degrees of nanoscale integration in the interconnect fabric. The logic block in all architectures is assumed to be made using 22nm lithography (which [8] predicts to be available in 2016). In the first architecture (arch1), we consider the inter-CLB wires to be made using some nano-fabrication technology and the interconnect switches to be made using self-assembled molecular switches. Both metal and metal-silicide nano-wires are explored. Note that this organization needs decoders to address the (nano) wires. In the second architecture (arch2), we assume inter-CLB copper wires fabricated using advanced lithography but keep molecular switches to connect them. In order to make the exploration tractable, we limit the inter-CLB metal wires to only two levels (M3 and M4). The main difference between arch1 and arch2 is the

attainable wire pitch (down to 10nm for arch1, 54nm for arch2). Finally, we compare these architectures with the current island-style FPGA architecture containing pass-transistor switches (arch3), scaled to the 22nm technology node.

Arch1: Using non-lithographic nano-wires and molecular switches

Figure 5.1 shows the proposed architecture, and figure 5.2 shows how the different technologies are stacked together. The logic block remains in silicon, and uses the M1 and M2 layers for local connections. The IO pins of the logic block are in the M2 layer, and the nano-wires are on top of this. Molecular switches provide programmable connections between nano-wires and between nano-wires and logic blocks. Note that each layer in figure 5.2(a) is isolated from its adjacent layers by a dielectric. The salient features of this architecture are described below.

Interconnect wires

A good interconnect material must have a low resistivity, a large current-carrying capacity, and the ability to be made at small pitches. A low resistivity is needed to have a small delay, which is determined by the RC product. While copper wires are expected to have a resistivity of 2.2µΩ-cm at the 22nm technology node [8], NiSi nanowires have been shown to have resistivities of around 10µΩ-cm [104]. Even with poorer resistivities, NiSi nanowires may be preferred due to their ability to sustain a current density of up to a hundred times that of copper (> A/cm²). Some nano-fabrication technology may be needed to fabricate wires at pitches of less than 10nm.¹

¹ The wire pitch at the 22nm node is predicted to be 54nm.
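To make the resistivity comparison concrete, the following is a minimal sketch that applies the textbook relation R = ρL/A to a cylindrical wire. The 25µm length and 15nm diameter are illustrative assumptions (both values appear elsewhere in this chapter for arch1), the function name is hypothetical, and the calculation uses the quoted resistivities directly, ignoring any size-dependent scattering effects.

```python
import math

def wire_resistance(resistivity_uohm_cm, length_m, diameter_m):
    """Resistance (ohms) of a cylindrical wire from R = rho * L / A."""
    rho_ohm_m = resistivity_uohm_cm * 1e-8          # 1 micro-ohm-cm = 1e-8 ohm-m
    area_m2 = math.pi * (diameter_m / 2.0) ** 2     # circular cross-section
    return rho_ohm_m * length_m / area_m2

# Assumed geometry: single-length wire of 25 um with a 15 nm diameter.
LENGTH = 25e-6
DIAMETER = 15e-9

r_cu = wire_resistance(2.2, LENGTH, DIAMETER)    # copper, 2.2 micro-ohm-cm
r_nisi = wire_resistance(10.0, LENGTH, DIAMETER) # NiSi, about 10 micro-ohm-cm

print(f"Cu   wire: {r_cu / 1e3:.1f} kOhm")
print(f"NiSi wire: {r_nisi / 1e3:.1f} kOhm")
print(f"NiSi / Cu ratio: {r_nisi / r_cu:.2f}")
```

For a fixed geometry the resistance ratio equals the resistivity ratio, so under these assumptions a NiSi wire is roughly 4.5 times as resistive as a copper wire of the same dimensions.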

We experimented with different routing architectures, consisting of different segment lengths. It has been previously shown that a segmented routing architecture is better than non-segmented ones [81]. The logic block (8 LUT+FFs) in 22nm technology is expected to be around 12.5µm x 12.5µm. In addition to this, the decoders take some space. Therefore, a single-length wire in our architecture needs to run 25µm, a double-length wire 38µm, and a triple-length wire 50µm. Assuming 50µm as the limit for the length of these wires, we investigate architectures having a maximum segment length of 3 logic blocks.

Interface to CMOS

The problem of interfacing such nano-structures with the structures made using traditional lithography was addressed in [68]. These nano-wires can be accessed with a decoder made using advanced nano-imprint technology. [106] also proposes a stochastic approach to addressing these wires, and claims that we can uniquely address these wires with high probability if the number of wires is large. [68] proposed the use of N control signals for a decoder that is used to address N wires. We use a similar technique, and therefore account for 15 decoder control signals for 200 wires in the FPGA channel. Note that these decoders are needed only to configure the switches, and are switched off at operation time.

Programmable Switches

As described in section 5.2, arch1 uses molecular switches that can be made to assemble at the cross-points of the wires. After this, these switches can be configured to make the desired connections by applying the correct voltages at the wires (similar to anti-fuse FPGAs).

Configuring the FPGA

The logic functionality of this FPGA can be easily programmed using SRAM cells. Programming the routing is similar to anti-fuse FPGAs, except that we need decoders to address the nano-wires. The main concept is that the wires should be activated in some particular order to avoid affecting wrong switches. [107] presents a way to program the anti-fuses in an anti-fuse FPGA, which is directly applicable to our architecture too. Initially, all the molecular switches are off and all the wires are pre-charged to a voltage $V_p/2$. This is required to ensure that the voltage difference of $V_p$ is applied only to the desired switch. Then the two wires that need to be connected through a switch are addressed using a decoder and pulled to $V_p$ and ground respectively, thus applying a voltage difference of $V_p$ to the molecular switch that needs to be turned on. Note that $V_p$ needs to be larger than the operating voltage. Experiments with molecular switches have shown a value of 1.75V [70], which is more than double the operating voltage at the 22nm node.

We also envision a possibility of using the CLB logic itself to program the molecular switches. In order to do that, the configuration will need to go through the following steps. First, the global clock resources need to be configured. Next, the CLB (logic) is configured to drive appropriate control signals to the address decoder. Note that

since different CLBs cannot communicate at this stage, all control signals need to be synchronized with the global clock signal. Furthermore, since the configuration time is usually not critical, we can afford to minimize the configuration logic (that needs to fit within a single CLB). Next, the routing (molecular) switches are programmed, followed by configuration of the CLBs to implement the user design. Note that this configuration methodology will greatly simplify the programming circuitry when compared to anti-fuse FPGAs.

Capacitance and Area Estimation

The capacitance of a single-length wire², $C_{1\_wire}$, in arch1 is estimated as follows.

$$C_{1\_wire} = 4\,N_{channel}\,C_{nano\_jn} + (2\,N_{clb\_pins} + 2\,N_{decoder})\,C_{micro\_jn} + 2\,C_{couple}$$

where $N_{channel}$ is the number of wires in the FPGA channel (channel width), $C_{nano\_jn}$ is the junction capacitance between two nano-wires, $N_{clb\_pins}$ is the number of IO pins in the logic block, $N_{decoder}$ refers to the number of control signals in the decoders, $C_{micro\_jn}$ is the junction capacitance between a lithographic wire and a nano-wire, and $C_{couple}$ is the coupling capacitance with an adjacent wire.

² A wire that spans adjacent CLBs.

The junction capacitance between any two wires, $C_{junc}$, is calculated using [68]

$$C_{junc} = \frac{2\pi\epsilon L}{\ln\left(\frac{2h}{r}\right)},$$

where $\epsilon$ is the permittivity of the dielectric separating the wires (we assumed SiO$_2$), $r$ is the radius of the wires and $h$ is the separation between the wires. For $C_{nano\_jn}$, $L = 2r$ and $h$ was kept as 30nm, and for calculating $C_{micro\_jn}$, $L$ was changed to the lithographic metal half pitch (54nm for the 22nm node). $C_{couple}$ was estimated using the equation for two long parallel cylindrical conductors,

$$C_{couple} = \frac{\pi\epsilon L}{\ln\left(\frac{D}{2a} + \sqrt{\left(\frac{D}{2a}\right)^2 - 1}\right)},$$

where $D$ is the spacing between the axes of the two cylinders, $a$ is the radius of the cylinders, and $L$ is the length of the cylinders (wires). We observed that the coupling capacitance calculated using the above equation was always larger than the capacitance calculated using the Berkeley device group's interconnect model [80], and therefore used the above as a pessimistic value.

The area of the arch1 FPGA is equal to the area of the logic blocks plus the area of the decoders when the pitch of the nano-wires is within 25nm. For larger wire pitches, the area is determined by the wires and is proportional to the square of the wire pitch. Note that when the area of the device increases, the lengths of the wires also increase and consequently, the wire capacitance and resistance per CLB length change.
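The capacitance estimate above can be sketched numerically as follows. This is a minimal illustration, not the thesis's actual tooling: the function names are hypothetical, the relative permittivity of SiO2 (3.9) and the count of 40 logic pins are assumptions, and the geometry (15nm wire diameter, 25nm pitch, 30nm vertical separation, 25µm wire length, channel width of 200, 15 decoder signals) is taken from values quoted in this chapter.

```python
import math

EPS0 = 8.854e-12          # vacuum permittivity, F/m
EPS_SIO2 = 3.9 * EPS0     # assumed relative permittivity of SiO2

def c_junction(length, h, r, eps=EPS_SIO2):
    """Junction capacitance: 2*pi*eps*L / ln(2h/r)."""
    return 2.0 * math.pi * eps * length / math.log(2.0 * h / r)

def c_couple(length, d, a, eps=EPS_SIO2):
    """Coupling between two long parallel cylinders of radius a, axis spacing d."""
    x = d / (2.0 * a)
    return math.pi * eps * length / math.log(x + math.sqrt(x * x - 1.0))

def c_single_wire(n_channel, n_clb_pins, n_decoder, c_nano_jn, c_micro_jn, c_cpl):
    """Total capacitance of a single-length wire, term by term as in the text."""
    return (4 * n_channel * c_nano_jn
            + (2 * n_clb_pins + 2 * n_decoder) * c_micro_jn
            + 2 * c_cpl)

# Assumed geometry (metres).
R_WIRE = 7.5e-9       # wire radius for a 15 nm diameter
H_SEP = 30e-9         # vertical separation between crossing wires
PITCH = 25e-9         # axis-to-axis spacing of adjacent nano-wires
WIRE_LEN = 25e-6      # single-length wire
LITHO_L = 54e-9       # overlap length used for the lithographic junction

c_nano = c_junction(2 * R_WIRE, H_SEP, R_WIRE)   # L = 2r for nano-nano junctions
c_micro = c_junction(LITHO_L, H_SEP, R_WIRE)     # L = 54 nm for litho-nano junctions
c_cpl = c_couple(WIRE_LEN, PITCH, R_WIRE)

c_wire = c_single_wire(n_channel=200, n_clb_pins=40, n_decoder=15,
                       c_nano_jn=c_nano, c_micro_jn=c_micro, c_cpl=c_cpl)
print(f"C_nano_jn  = {c_nano * 1e18:.2f} aF")
print(f"C_micro_jn = {c_micro * 1e18:.2f} aF")
print(f"C_couple   = {c_cpl * 1e15:.2f} fF")
print(f"C_1_wire   = {c_wire * 1e15:.2f} fF")
```

Under these assumptions the two coupling terms and the many nano-wire junctions contribute most of the total, which is why the wire pitch and the channel width both matter for the wire capacitance.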

Arch2: FPGA using lithographic wires and molecular switches

Arch1, described in the previous subsection, needs decoders for addressing the nano-wires, which increases the complexity of the fabrication process. Therefore, we also explore an FPGA that uses conventional lithographic metal wires as the interconnect, with molecular switches at their cross-points (as in the previous architecture). Note that assuming a channel width of 200 (same as arch3, and similar to commercial SRAM-based FPGAs), the area of the CLB will be determined by the wires instead of the logic. For 22nm technology, ITRS predicts a wire pitch of 54nm. For a channel width of 200, we will need 400 wires within the CLB pitch. This comes out to 400 x 54nm = 21.6µm. In addition to that, we will need space for the logic pins, which calculates to 40 x 54nm = 2.16µm. Therefore, the CLB dimensions in this case are projected to be 23.76µm x 23.76µm, which is only slightly smaller than the current Xilinx CLB scaled to 22nm technology (25µm x 25µm).

5.3 Comparative Evaluation

We used VPR [10] to model the various FPGA architectures and evaluate their performance.

Modeling Arch1 in VPR

In order to model arch1 in VPR, we added a new type of switch box that allows a wire to connect only to the wires at right angles to it. This was done because in arch1, molecules assemble only at wire cross-over points and not between two wires running in the same direction. In order to account for the large defect rates expected at this scale,

we started by assuming that only half of the switches are operational. However, due to the immensely large number of programmable switches in our architecture (even when only half of the switches are visible), VPR takes extremely long (> 2 days on a SunBlade-2000, for a 191 CLB design) to finish the placement and routing of the designs. In order to facilitate experimenting with multiple designs, we limited the number of switches in VPR to only about 1% of the total physically present switches. Consequently, in VPR, the CLB outputs have switches to only half of the wires in the channel, and a wire can connect to only 4 other wires in the switch box, two in each of the perpendicular directions. The performance we obtained by limiting the number of switches was not very different from that obtained by keeping all the switches for the few designs we initially experimented with. Since the flexibility provided by our switch box is still greater than that of the switch boxes built into VPR, we expect that our switch box is not very limiting, and that similar results will be obtained when considering all switches too. Note that since we still counted the junction capacitances between all crossing wires, our results for the proposed architectures should be considered as the lower bound, and could be enhanced by improvements in the tools.

We used MCNC benchmark circuits for all experimentation. These designs varied in size from 131 to 806 CLBs. In order to have reasonable performance, we kept the routing segmented, with 20% single-length, 30% double-length, and 50% triple-length wires.

Results

Figure 5.3 shows the critical path delays of all the designs when mapped to the three architectures. The results for arch1 use a spacing s of 10nm between the nanowires

and a wire diameter of 15nm. The lithographic wire pitch was kept as 54nm, as predicted by [8] for the 22nm node. The resistance of the molecular switch was assumed to be 1kΩ, and the material for the nano-wire was assumed to be copper (resistivity = 2.2µΩ-cm [8]). Note that the delay is maximum for arch3 (lithographic, SRAM-based), and the delays for arch1 and arch2 are comparable. However, the area of the arch1 FPGA is only about 30% of the arch2 FPGA. The average reduction in critical path delay was 30% for arch2 and 32% for arch1, when compared to arch3.

The performance of the designs (mapped on arch1 and arch2 FPGAs) strongly depends on the molecular switch resistance. For our experimentation we assumed that the off resistance of the switch is sufficiently high to consider it as an open circuit. Results for varying the molecular on resistance from 100Ω to 100kΩ (a typical value is around 10kΩ today) are shown in figure 5.4. It is observed that the delay of the circuit increases very sharply beyond 10kΩ. In fact, the delay becomes as large as 20X for arch1 when the molecular resistance is 100kΩ. The delay value for arch1 using NiSi nanowires remains larger than that of arch3 for all values of molecular resistance. This happens due to the very large resistance of these wires. Note that these NiSi nano-wires can support large current densities, while the metal nano-wires may in reality be limited by electro-migration.

Figure 5.5 shows the variation of resistance and capacitance of single-length NiSi nano-wires with wire dimensions. The notation R-25 means resistance for nano-wires with a pitch of 25nm. The plot shows results for wire pitches ranging from 25nm to 55nm. Note that as the wire pitch is increased, the area of the FPGA increases, thereby increasing the wire length. Therefore, we can see a slight increase in the wire resistance when the pitch is increased even when the width of the wire remains the same. The

capacitance value at 50nm width clearly reaches unacceptable limits (>20fF). At the other extreme, the resistance values are very large (>100kΩ) when the width of the wire is reduced to 5nm. Note that looking at the RC product of the wire alone is not expected to give an indication of the performance of the FPGA, since every net will go through some molecular switches (with resistances) and into the input pins of logic blocks (with capacitances).

Figure 5.6 shows the variation of performance of arch1 with varying wire dimensions for the design misex3; other designs showed a similar behavior. Note that the performance of arch1 is inferior to arch3 when the molecular resistance is 100kΩ or 10kΩ. However, as the molecular resistance reduces, arch1 starts performing better than arch3. The figure is divided into vertical sections of separate wire pitches. For every wire pitch, we experimented with several wire dimensions. Note that for Rswitch=100kΩ, delay increases monotonically (except ) with the width of the wire for a fixed pitch. This happens because the large switch resistance makes the net delay very sensitive to the capacitance of the wire. With the delay of the design being dominated by the routing delay (and because the logic delay remains almost the same for different wire dimensions), the delay of the design increases with capacitance. The other extreme occurs when Rswitch is 100Ω, in which case the delay decreases with increasing width due to the reduction in wire resistance. Rswitch values of 10kΩ and 1kΩ show intermediate behavior.

5.4 Summary

In this chapter we explored several nano-scale interconnect technologies for FPGAs. First, we replaced the FPGA interconnect fabric by nano-structures: lithographic wires

by nano-wires made using nano-imprint technology, and switches by molecular switches. Second, we used lithographic wires connected using programmable molecular switches. The results for these two were compared with the current FPGA architecture containing pass-transistor switches, scaled to 22nm. We found that the first architecture provided the best performance with the least area. The area reduced to 30% of the scaled architecture, and the critical path delay reduced by 32% on average. The second architecture improved the performance over the scaled FPGA, but the area reduction was only 10%. Using NiSi nano-wires instead of metal nano-wires was not good for performance, but may be useful to counter electromigration. The resistance of the molecular switch was found to be a crucial factor in the performance of the design, and values lower than 10kΩ were observed to be critical for performance.

This kind of exploratory research is highly interdisciplinary, and building successful nanoscale devices requires synergy between the architects and the chemists. One of the motivations of this work was to set the requirements for these nanoscale technologies for the chemists who are actually developing them. From the results we conclude that molecular switches with on-resistances of around 1kΩ are needed for good performance. Furthermore, materials with lower resistivities than NiSi nanowires must be explored for fabricating nano-wires. Architectural improvements and throughput-oriented designs may utilize the area benefits of nanotechnologies to provide faster application run-times even with higher molecule and wire resistances.
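The sensitivity to switch resistance can be illustrated with a first-order RC estimate of a single routing hop, in which a molecular switch drives a distributed RC wire. This is only a sketch and not the VPR-based evaluation used for the results above: the function name is hypothetical, and the wire resistance (about 3.1kΩ) and wire capacitance (about 7fF) are assumed round values for a single-length copper nano-wire.

```python
def hop_delay(r_switch, r_wire, c_wire):
    """Elmore-style delay (seconds) of a switch in series with a distributed RC wire."""
    return (r_switch + 0.5 * r_wire) * c_wire

R_WIRE = 3.1e3    # ohms, assumed single-length copper nano-wire
C_WIRE = 7e-15    # farads, assumed single-length wire capacitance

delays = {}
for r_switch in (100.0, 1e3, 1e4, 1e5):
    delays[r_switch] = hop_delay(r_switch, R_WIRE, C_WIRE)
    print(f"Rswitch = {r_switch:>8.0f} Ohm -> hop delay = {delays[r_switch] * 1e12:7.1f} ps")
```

In this simple model the hop delay is nearly insensitive to the switch while its resistance stays below the wire resistance, and grows almost linearly with it once the switch dominates, which is qualitatively in line with the sharp increase in delay observed beyond 10kΩ.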

Fig. 5.1: FPGA using nano-wires and molecular switches.

Fig. 5.2: 3-D organization of nano-wires, panels (a) and (b).
