Methodologies for Tolerating Cell and Interconnect Faults in FPGAs


IEEE TRANSACTIONS ON COMPUTERS, VOL. 47, NO. 1, JANUARY 1998

Fran Hanchek, Member, IEEE, and Shantanu Dutt, Member, IEEE

Abstract: The very high levels of integration and submicron device sizes used in current and emerging VLSI technologies for FPGAs lead to higher occurrences of defects and operational faults. Thus, there is a critical need for fault tolerance and reconfiguration techniques for FPGAs to increase chip yields (with factory reconfiguration) and/or system reliability (with field reconfiguration). We first propose techniques utilizing the principle of node-covering to tolerate logic or cell faults in SRAM-based FPGAs. A routing discipline is developed that allows each cell to cover (that is, to be able to replace) its neighbor in a row. Techniques are also proposed for tolerating wiring faults by means of replacement with spare portions. The replaceable portions can be individual segments, or else sets of segments, called grids. Fault detection in the FPGAs is accomplished by separate testing, either at the factory or by the user. If reconfiguration around faulty cells and wiring is performed at the factory (with laser-burned fuses, for example), it is completely transparent to the user. In other words, user configuration data loaded into the SRAM remains the same, independent of whether the chip is defect-free or whether it has been reconfigured around defective cells or wiring. This is a major advantage for hardware vendors who design and sell FPGA-based logic (e.g., glue logic in microcontrollers, video cards, DSP cards) in production-scale quantities. Compared to other techniques for fault tolerance in FPGAs, our methods are shown to provide significantly greater yield improvement, and a 35 percent non-FT chip yield for an FPGA is more than doubled.

Index Terms: Fault tolerance, Field Programmable Gate Array (FPGA), yield improvement, cell faults, wiring faults, reconfiguration.

F. Hanchek is with Intel Corporation, 500 NE Elam Young Parkway, AL4-51, Hillsboro, OR. E-mail: fhanchek@scic.intel.com.
S. Dutt is with the University of Illinois at Chicago, Department of Electrical Engineering and Computer Science, 110 Science and Engineering Offices, 851 S. Morgan St., Chicago, IL. E-mail: dutt@eecs.uic.edu.
For information on obtaining reprints of this article, please send e-mail to: tc@computer.org.

1 INTRODUCTION

Our model characterizes a Field Programmable Gate Array (FPGA) as being composed of an array of programmable logic cells surrounded by channels of segmented programmable interconnection wiring. The channel wiring architecture is shown in Fig. 1, and a logic cell consists of programmable combinational functions with optional output registers. The group of segment-to-segment interconnections at each channel row and column intersection is called a switch box. This model is similar to the Xilinx FPGA architecture [4], [7]. We are interested in reprogrammable FPGAs, in which data in a configuration memory overlaid on the array defines the interconnection of the channel wiring and the functions performed by the logic cells. Signals from the memory control pass transistors to make or break programmable connections and define the functions realized by the array. Reprogrammability allows a test configuration to be programmed into an FPGA at the factory, or at power-up in a user's system, before reprogramming to the desired configuration.

Current very high levels of integration and smaller device sizes can lead to higher occurrences of all types of faults: permanent (fabrication or operational), transient, and intermittent. Thus, fault tolerance (FT) in FPGAs is desirable from two points of view. For the chip manufacturer, it can increase the yield of usable chips.
If faults can be detected using techniques like [15], [16], [], [3] and reconfigured around at the factory in a manner such that they are transparent to the user, then these otherwise useless chips do not have to be discarded. For the user, fault tolerance can increase reliability, which means reduced downtime and lower maintenance costs. If a board containing a faulty FPGA must be removed from service to be tested, and reconfiguring the FPGA can be done as easily as replacing it, then fault tolerance is the more cost-effective alternative.

We propose here techniques whereby FPGAs can be reconfigured around faults in the logic cells and the wiring, thus leading to increased yield and reliability. A preliminary version of the technique addressing only faults in the logic cell portions of an array appears in [9]. Here, we expand the scope of this technique and show how it also enables faults to be tolerated in additional portions of the configuration memory. Additionally, after completing modifications to adapt the routing software to the needs of our technique, we now present empirical results on the amount of wiring track overhead that would be required by the cell FT technique.

Two new techniques, whereby faults can also be tolerated in the wiring channels, have appeared in [10]. Here, we provide additional information on their implementation and show how they can be used by themselves as well as in combination with logic cell FT. These are believed to be the first user-transparent techniques developed specifically to address FPGA wiring channel faults, and they are significant, considering that at least half of the array area is occupied by the wiring channels [3], [5]. In this paper, we will emphasize factory reconfiguration for yield improvement, but our techniques also apply to reconfiguration around faults by the user.

Fig. 1. FPGA generic channel wiring architecture with segment length spanning one cell. The set of dashed/dotted segments forms a grid.

1.1 Existing Fault Tolerance Methods

Since FPGAs are composed of a large number of identical cells in a highly regular array, two methods of providing fault tolerance are readily apparent. One is simply to note the locations of faulty cells and reroute the user's circuit to avoid them, using spares or other unused cells instead. This has been proposed for use in programmable gate arrays designed to be programmed at the factory, and, potentially, requires a different routing for every chip to which the user's circuit is mapped [18], [0]. It has also been proposed for (primarily) logic cell faults in FPGAs in [1], [19], and, specifically, for wiring faults in [1]. However, requiring the layout tools to perform a new routing of a circuit for each new faulty cell or wiring location encountered puts a heavy burden on the user, who must also keep track of all of the different routings for a given circuit design.

A variation on this rerouting technique makes fault tolerance transparent to the user by using extra wiring and factory-configured switches on the tracks to physically warp the channel routing segments around a faulty cell while maintaining the same logical routing configuration. The extra switch delay overhead is too high for use in reconfiguring around individual FPGA cells, but can be employed more efficiently using blocks of cells [14]. However, the area overhead of spare blocks of cells and extra wiring is also very high.

The other fault-tolerant technique, adding spare rows and/or columns of cells, is intended for reconfiguration at the factory, making the technique transparent to the user. To reconfigure around a faulty row, fuses are burned at the factory such that nonfaulty rows are remapped to include the spare row.
For the faulty row to be transparent, it is necessary to maintain the original connectivity between the rows on either side of the faulty one. One method of doing this is to employ longer wiring segments in the vertical channels [13]. Each segment spanning m rows is lengthened to span m + 1 rows, and the extra portion of the segment added for fault tolerance cannot have cell connections made to it except after reconfiguration. This reduces connection flexibility for certain segment sizes. For m = 1 (a common useful segment length), this technique will not allow connections along a track (segmented channel wire) from one cell to another if they are an odd number of cells apart. For m = 2, an additional track must be added to each pair of tracks to ensure that any cell can connect to any other using that track group, and this restores the original connection flexibility of the non-FT array. A spare column of cells is also used by Altera [6], but the array architecture is different from the model considered here in that it does not use segmented wiring.

1.2 Proposed Fault-tolerant Method

Our proposed methods of fault tolerance require neither the factory nor the user to generate new routing maps to reconfigure around faulty cells or wiring, as is required by [1], [18], [19], [0], [1], so no additional time is spent with routing tools each time a chip is to be programmed. Instead, the original configuration data can be reused. Our cell FT method involves a routing strategy that requires the use of additional wiring segments. On the other hand, no explicit additional tracks are needed in the channels in order to avoid the loss of connection flexibility seen in [13].

We also propose two FT wiring techniques. The first technique, based on replacement of individual wiring segments, introduces no extra wiring path delays due to wiring switches except where a faulty segment has been reconfigured around.
In our second technique, which replaces groups of segments called grids, no extra switches are added in the channel wiring, so there are never any extra wiring path delays. These techniques contrast with [14], where extra switches embedded in the wiring channels cause extra wiring path delays whether faults are reconfigured around or not.

Our cell FT technique has the low spare cell overhead of spare rows or columns, but promises better yield improvement because it is a finer-grained FT method in that it is able to tolerate many more fault patterns (one fault per row or column, if there is a spare row or column, respectively). When cell FT and wiring FT techniques are combined, their ability to tolerate faults in more areas of the array results in even greater yield improvement.

The rest of this paper is organized as follows. Section 2 discusses implementation of the node-covering based technique we have developed for cell FT. It also explains how to minimize or eliminate any additional signal path delay that might be introduced by the need, in the cell FT technique, to attach additional wiring segments to the regular nets. Section 3 presents data for a number of benchmark circuits on the additional segment and track overheads incurred by the cell FT technique. In Section 4, we introduce our techniques for tolerating channel wiring faults, and, in Sections 5 and 6, we analyze and compare the yield improvement generated by our cell FT and wiring FT techniques to the yield improvement afforded by spare column, and spare row and column, FT techniques [13]. Finally, Section 7 presents our conclusions.

Fig. 2. (a) A covering graph for a 1-FT FPGA with four primary cells and one spare cell. (b) Reconfiguration in the covering graph after a cell becomes faulty.

2 THE NODE-COVERING METHOD FOR CELL FAULT TOLERANCE

Under the principle of node-covering [7], each primary node, or cell, u in the FPGA is assigned a cover cell which can be reconfigured to replace it in the event that cell u becomes faulty. Primary cells are assigned to cover other primary cells in a chain-like manner, with a spare cell covering the last primary cell in the chain, as seen in the covering graph of Fig. 2a. Node covering does not provide an FT solution in itself, but essentially engenders a disciplined mode of thinking about fault tolerance in any system or circuit. Innovative and potentially different solutions need to be developed for applying node covering efficiently to the system or circuit of interest. The node covering philosophy has resulted in very efficient FT and reconfiguration solutions for multiprocessor systems [7], [8], and arithmetic circuits [6].

In an FPGA, if a faulty cell is identified through testing, the FPGA can be reconfigured such that the faulty cell is replaced by its cover, which in turn is replaced by its own cover, and so on until a spare cell in the chain is reached. This can be seen in the reconfigured covering graph of Fig. 2b. We define these chains to be along either rows or columns of the array, with a spare cell at the end of each row or column. Our FT FPGA designs will be able to tolerate one faulty cell in each row (or column), and this will subsequently be called the node-covering cell FT method.

In order for a cell to cover another cell (the dependent cell), two conditions must be met. First, the cover cell must be able to duplicate the functionality of the dependent cell. This is easy in an FPGA, since all cells are identical. Configuration data for the dependent cell itself is simply transposed to the cover cell. Second, the cover cell must be able to duplicate the connectivity of the dependent cell with respect to the rest of the array. Our method of ensuring connectivity is described next.

2.1 Cover Segments

Consider a row of the FPGA to be a fault-tolerant group, with the rightmost cell being a spare. We will assume the generic channel wiring of Fig. 1, with segments of length 1, though the technique also applies to segments of longer lengths. As is the case with many FPGAs, the channel segment interconnections will be considered to be separate from the cell-to-channel-segment connections, which will be associated with the cell configuration. Therefore, when cell configuration data is transposed to a cover cell, all of the cell-to-channel connection data is transposed as well.

Fig. 3 shows some representative nets routed to allow the replacement of each cell by its cover cell. These nets are shown as (a) initially configured, and (b) reconfigured around faulty cells. To meet the connectivity requirement, each net connected to a cell through a channel segment must also include the corresponding channel segment (a cover segment) bordering the cover cell. Cover segments are included in a net in one of two ways. First, segments in the net may already be in positions to act as covers. For example, in Fig. 3a, the channel segment connecting to Cell A is covered by the net channel segment connecting to Cell B. Second, if no such segment exists, additional segments, termed reserved segments, must be attached to the net to provide covers. These constitute overhead, since they are only in use when the circuit is reconfigured around a faulty cell. For example, the channel segment connecting to Cell B must be covered by a reserved segment that can be connected to Cell C, which covers Cell B.

In the reconfigured circuit shown in Fig. 3b, cell-to-channel connection data for the faulty cell and all cells to the right of the faulty cell in each row have been transposed one cell to the right. The cover segments present in the channels allow the logical connectivity to be maintained (i.e., Cell A still connects to Cell B). Although Fig. 3 shows both faulty cells in the leftmost column, note that one faulty cell can be tolerated in any position in each row.

Fig. 3. (a) Fault-tolerant routing of nets using cover segments. (b) Reconfiguration around faulty cells.
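The cover-segment rule for a horizontal channel can be sketched in a few lines of Python. This is an illustrative model only, not the authors' routing software. It assumes a single track whose length-1 segments are indexed by the column of the cell they border, and that each cell's cover is its right-hand neighbor. The names `reserved_segments`, `shift_pins`, `net_segments`, and `pin_columns` are hypothetical.

```python
def reserved_segments(net_segments, pin_columns):
    """Return the segments that must be reserved for a net on one track.

    net_segments: set of column indices of segments already in the net.
    pin_columns:  columns of the cells the net connects to on this track.
    Assumption: the cover of the cell in column c is the cell in column c + 1,
    so the cover segment of segment c is segment c + 1 on the same track.
    """
    reserved = set()
    for col in pin_columns:
        cover = col + 1
        if cover not in net_segments:
            reserved.add(cover)  # no existing segment can act as the cover
    return reserved


def shift_pins(pin_columns, faulty_col):
    """Transpose connections at or right of the faulty column one cell right."""
    return [c + 1 if c >= faulty_col else c for c in pin_columns]


# Cells A and B sit in columns 0 and 1; the net uses segments 0 and 1.
net = {0, 1}
pins = [0, 1]
extra = reserved_segments(net, pins)          # segment 1 covers A; B needs 2
moved = shift_pins(pins, faulty_col=0)        # fault in column 0
still_connected = all(c in net | extra for c in moved)
print(extra, moved, still_connected)
```

In this toy case only the rightmost connection needs a reserved segment, which mirrors the Cell A / Cell B / Cell C example in the text.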

With fault-tolerant groups defined along rows, covering a cell connection to a horizontal channel segment requires, at most, one reserved segment. Covering a cell connection to a vertical channel segment, however, can require up to two reserved segments if the cover is not already provided in the net. This is illustrated in Fig. 3a, where it can be seen that the vertical channel segment connecting to Cell C is already covered by channel segments of the same net connecting to Cell D. However, for the net connection to Cell D, two reserved segments are required for the connection to Cell D's cover, the spare cell at the end of that row. One is the vertical cover segment that can be connected to the spare cell if and when the spare replaces Cell D; the other is a horizontal segment to connect the vertical segment to the net.

A point-to-point path is defined as a connection either between a pair of cell terminals or between an I/O pad and a cell terminal. A net consists of such intersecting point-to-point paths. When two paths of a net connect at some point, the unused track segments available might not allow them to share the same segment at the connection point, and thus be connected along a single track. In such a situation, the router may perform a track-hop, in which the two point-to-point paths are routed along different tracks (in the same channel), using the cell terminal to connect the tracks. This is illustrated in Fig. 3a with a track-hop at the connection to Cell E. In such a situation, there are two segments of the same net connected to Cell E, and both must be covered. The cover segment for the lower one is already part of the net, and a reserved segment over cover Cell F must be added to cover the other, as seen in Fig. 3a. A reconfiguration is shown in Fig. 3b, where the track-hop is now performed at the new Cell E (previously Cell F).
2.2 Routing Heuristics

In order to make the most efficient use of wiring resources, routers consider such things as net size and the number of routes available for a point-to-point path when choosing the order in which nets and point-to-point paths are routed. Some of the heuristics we have used for determining which net to route next include choosing the largest remaining net and choosing the net containing the point-to-point path having the fewest remaining track choices. Similar heuristics are used in choosing which point-to-point path to route first, such as starting with the longest path and starting with the point-to-point path having the fewest track choices. The heuristics are applied in combinations in an effort to obtain the best possible track overhead for FT. While these heuristics provide reductions in track usage for non-FT, as well as FT, routing, we describe two others below that are specific to our FT routing techniques.

Segment Reuse. The segment reuse heuristic takes advantage of the properties of the reserved segments and attempts to share segments between nets. It does this by locating certain net end segments that could also be reused as reserved segments by another net. Since the reserved segment is not connected to the other net until needed for reconfiguration around a fault, the first net is free to use it until then. At that time, it is disconnected from the first net, which would not need it in the reconfigured circuit. Thus, the router can overlap nets at the shared segment without actually shorting the nets together.

Preferred Routing Direction. When the global router selects the channels to be used in routing from a logic cell pin to the next pin located to the upper or lower right, it must use both a horizontal and a vertical channel.
Since our node-covering scheme has chosen to place cover segments to the right, routing first to the right (of the current pin connection) in the horizontal channel before bending up or down in the vertical channel allows at least one cover segment to be automatically included in the net. Routing vertically first would miss this opportunity. This heuristic thus reduces segment overhead by minimizing the need for adding reserved segments in such cases.

2.3 Reconfiguration

The reconfiguration procedure described here assumes that cover segments necessary for reconfiguration around a faulty cell are provided by the routing tools and included in the initial configuration data. Our model of the configuration memory assumes that its data is loaded serially, as is the case with both Altera and Xilinx architectures [1], [7].

Fig. 4a shows the configuration memory for a fault-tolerant row of an FPGA, where the last cell in the row is a spare. Configuration data for a row of cells and their connections to the channel wiring is grouped contiguously by cells in a shift register. Configuration data for the channel segment interconnections is kept separate from the cell data for a row, since it does not get rerouted when there are faulty cells. It can be stored either in its own portion of the same shift register or else (more simply) in its own shift register. The last (spare) cell in the row is initially reset to a null configuration to disconnect it from the array.

Configuration data produced by the user's placement and routing tools assumes that the primary cells are nonfaulty and maps them to the user's circuit accordingly. However, it also includes channel segment covers to allow for reconfiguration around one faulty cell per row. Both spares and faulty cells are transparent to the user.
When configuration data is loaded in the absence of faults, it simply bypasses the spare cell en route to the other rows, and a factory-enabled signal forces the configuration pass-transistors of the spare cell to the OFF state in order to disconnect that cell from the array, as shown in Fig. 4a. If a cell is faulty, the faulty cell's configuration pass-transistors are turned off instead. The user does not need to know the location of factory-detected faulty cells and does not need to modify the configuration data obtained from the router. Instead, laser-programmable links burned at the factory allow configuration data to bypass the detected faulty cells and travel to the cover cells when it is loaded into the FPGA. Configuration data originally intended for the faulty cell and the cells following it in that row are rerouted by a factory-programmed multiplexer (mux) in the cell so that it automatically flows instead to the cover cells, including the spare cell, as shown in Fig. 4b. Since the FT routing rules ensure that cover segments are already programmed into the channel segment configuration data, a correct reconfiguration is automatically achieved when the configuration data is loaded, and any factory-detected faults are transparent to the user adhering to these rules.
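The effect of the bypass links on a single row can be illustrated with a short Python sketch. It models only the outcome of loading (which physical cell ends up holding which block of cell data), not the shift register or mux hardware. The function name `load_row`, the `NULL` marker, and the list representation are assumptions made for illustration.

```python
NULL = None  # stands for the null configuration that disconnects a cell


def load_row(cell_configs, faulty=None):
    """Place per-cell configuration blocks into a row with one spare.

    cell_configs: blocks for the n primary cells, in row order, exactly as
                  produced by the user's tools (unaware of any fault).
    faulty:       physical index (0..n) of the bypassed cell, or None.
    Returns a list of n + 1 physical cell contents; the last entry is the spare.
    """
    n = len(cell_configs)
    if faulty is not None and not 0 <= faulty <= n:
        raise ValueError("faulty index must lie within the row")
    physical = [NULL] * (n + 1)
    for logical, block in enumerate(cell_configs):
        # Data meant for the faulty cell and every cell after it moves one
        # position to the right; cells left of the fault are unaffected.
        shifted = faulty is not None and logical >= faulty
        physical[logical + 1 if shifted else logical] = block
    return physical


user_data = ["A", "B", "C", "D"]
print(load_row(user_data))            # fault-free: spare stays null
print(load_row(user_data, faulty=1))  # second cell bypassed
```

The same `user_data` is passed in both calls, which is the point of the scheme: the configuration data does not change, only where it lands.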

Fig. 4. (a) Row of factory reconfigurable FPGA cells. (b) Reconfigured around faulty cell with configuration data for the faulty cell reset to the null state. Flow path of configuration data is highlighted.

If the laser-programmable links are replaced by switches controlled by nonvolatile microconfiguration memory bits (e.g., EEPROM), reconfiguration can be performed by both the factory and the user. As with laser-programmable links, reconfiguration performed at the factory is transparent to the user. However, a user detecting any additional fault now has the capability of reading from and writing to the microconfiguration memory to reconfigure around the new fault.

It is not necessary that a faulty cell's configuration memory remain entirely fault-free. Since the shift register path bypasses the faulty cell, breaks in the faulty cell's shift register chain do not interfere with the ability of the remaining cells to be configured. When a faulty cell is reset to the null state to prevent it from interfering with the rest of the array, a signal sent to each of the configuration bits controlling the cell-to-channel interconnections forces each of these bits to the OFF state. Fig. 5 shows a configuration memory bit with a reset input. It can be seen that only a minimal portion of the memory bit circuitry is required to remain fault-free in order to ensure that the pass-transistor is OFF.

2.4 Eliminating Reserved Segment Delay for Cell FT

Due to the additional parasitic capacitance, the propagation delay of a net will increase if reserved segments are left attached to the nets before any reconfiguration. This effect can be minimized and even eliminated if reserved segments are not actually connected to the nets until they are needed. We present two different methods to facilitate this.
The first requires additional memory external to the FPGA, while the second requires additional circuitry inside the FPGA.

In the first method, the routing tools would produce two separate maps of track interconnections for each channel row. The reconfiguration map would have all of the reserved segments connected, and the normal map would connect only those segments necessary for fault-free operation. Both maps would be loaded into the FPGA upon initialization, one after the other, for each channel row. A faulty cell present in a row would cause the reconfiguration maps for the channels above and below that row to be selected for shifting into the channel configuration memory. Otherwise, the normal maps would be shifted in and the reconfiguration maps shunted aside. Only those nets affected by the reconfiguration would experience a propagation delay penalty, and it would be much less than if all of the reserved segments were connected. Either additional system memory (if the FPGA is loaded by the system when initialized) or a larger configuration PROM would be required. Only the track segment connection data needs to be duplicated, and it is estimated to be about 30 percent of the total configuration data.

The second method makes use of the fact that there is a simple relationship between the logic values of the configuration bits controlling the cell-to-channel connections and the logic values of the configuration bits controlling the segment-to-segment connection switches required to connect the reserved segments for a reconfigured cell. Circuitry associated with each cell would sense that the cell has been reconfigured as a cover and would force ON the appropriate pass transistors to connect cover segments as directed by the cell-to-channel connection data then in effect. This can be seen for a top cell-to-channel connection in Fig. 6a, where the output of the cell configuration memory bit controlling the cell-to-channel pass transistor also controls which channel routing configuration pass transistor to turn ON. Each cell-to-channel configuration memory bit is simply ANDed with the cell_is_reconfigured signal to set the appropriate channel routing configuration memory bit, whether it already happens to be set or not. Fig. 6b applies to either a left or right cell connection, where it must be assured that two pass transistors are ON. Here, similar AND functions set both of the channel routing configuration bits that connect the reserved segments.

Fig. 5. Configuration memory cell showing reset input and the portion of the circuitry required to remain fault-free.

3 OVERHEADS OF CELL FAULT TOLERANCE

Node-covering has been shown to be effective in providing FT with low area overhead and very good yield improvement in VLSI circuits. We have designed FT multipliers with area overheads of 40 percent that provide yield improvement greater than 50 percent [6]. While the potential for node-covering cell FT to improve yield in FPGAs is

significant, the technique can have an effect on the routability of a circuit. Some nets may have a high reserved channel segment overhead, and this may reduce the circuit's routability. In order to quantify this effect, we analytically and empirically determine the number of segments and routing tracks required with and without our FT techniques.

Fig. 6. Automatic connection of reserved segments to a cover cell for reconfiguration. (a) Top cell-to-channel connection. (b) Right cell-to-channel connection.

3.1 Analysis

The segment overhead required by node-covering cell FT depends primarily on the circuit parameters and on the way the circuit is mapped to the FPGA. In order to provide a first order estimate of this overhead, we present the following theorem, which gives best case, worst case, and average expected overheads.

THEOREM 1. A circuit consisting of nonbranching nets that travel from left to right monotonically incurs the following segment overheads when the node-covering cell FT technique is applied. Best case overhead is e/(l(p - e)). Worst case overhead is 4/l. Average overhead is 1p - e61h + v6 + e p - e l + a where the average segment overhead at a horizontal connection segment is a4 9 8l + 4l + 4l + 1 h = 4l + 4l + 1 and the average segment overhead at a vertical connection segment is 8l a544l + 8l + 19 v =. 4l + 8l + 1 In the above expressions, e is the number of nets, p is the number of logic cell pins used, l is the average length in segments between adjacent logic cell pins in a net, and a is the probability of a track-hop occurring at a pin.

A brief version of the proof of this theorem is presented in the Appendix, and complete details are available in [11]. In Section 3.3, we compare the results predicted by the theorem to the actual segment overheads we have obtained.
3.2 Experimental Procedure

We have obtained FPGA routing software from the University of Toronto, and also several MCNC and other benchmark circuits whose netlists are in a format compatible with the routing tools. The software consists of a global router, which assigns channels to the route of each point-to-point signal connection in a net, and a detailed router, which chooses the actual track segments within the channels to be used by the nets. Thus, we are able to compare the number of segments and tracks required in a non-FT circuit to the number after FT techniques have been applied.

We input the node-covering specifications to the global router. However, since it is not known at that time where the detailed router might perform a track-hop, it was necessary to modify the detailed router to add the reserved segments necessary to cover the instances of track-hopping. We have also modified the detailed router to implement several different routing heuristics, some of which were mentioned in Section 2.2, in order to minimize the track overhead required for FT.

A few of the nets in these benchmark circuits connect 20, 30, or more logic cells. The simple segmented channel wiring model we employ does not consider that such nets would be routed more efficiently using nonsegmented global wiring. A few such resources are typically available for routing clock and reset signals in commercial FPGAs with otherwise segmented architectures [7]. Since such nets would be routed to all or most of the cells, there would probably not be a great need for reserved segments, as covers would already be present in the net. We have thus followed the lead of the University of Toronto in choosing to limit the fanout of a net to 10 logic cells when using the short wiring segments []. This gives an average reduction in the number of circuit nets of about 5 percent.

The detailed router is able to compute signal propagation delay in a net, from the net's source to its furthest destination.
It does this by modeling the net as an RC tree, including the ON and OFF resistances of the pass transistors, the capacitances of the transistor source and drain diffusions, and the capacitances of wiring segments [17]. Resistance of the wires is considered negligible in this model. The effect on propagation delay due to reconfiguring an FPGA around faulty logic cells was determined by modifying this portion of the detailed router to change the trees to reflect reconfigurations around various fault patterns. Single faulty cells at the beginning of a row were simulated, as well as the case of a faulty cell at the beginning of every row.
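To make the RC-tree delay computation concrete, the following Python sketch uses the Elmore delay, a common first-order estimate for RC trees. The text says only that the router models each net as an RC tree following [17]; the use of Elmore delay here is an assumption, as are the function name `elmore_delays` and the array layout. The sketch also ignores the OFF resistances that the authors' model includes, and, as in the text, treats wire resistance as negligible so that resistance appears only at pass transistors.

```python
def elmore_delays(parent, resistance, capacitance):
    """Elmore delay from the root (net source) to every node of an RC tree.

    parent[i]:      index of node i's parent; node 0 is the root (parent -1).
                    Nodes must be numbered so that parent[i] < i.
    resistance[i]:  resistance of the switch between parent[i] and node i
                    (ignored for the root).
    capacitance[i]: lumped capacitance at node i (segment plus diffusion).
    """
    n = len(parent)
    downstream = list(capacitance)
    # Accumulate subtree capacitance from the leaves toward the root.
    for i in range(n - 1, 0, -1):
        downstream[parent[i]] += downstream[i]
    delay = [0.0] * n
    # Each switch contributes R times all capacitance it must charge.
    for i in range(1, n):
        delay[i] = delay[parent[i]] + resistance[i] * downstream[i]
    return delay


# Three segments in series, in arbitrary units.
chain = [-1, 0, 1]
r = [0.0, 1.0, 1.0]
print(elmore_delays(chain, r, [1.0, 1.0, 1.0]))
# Same net with extra capacitance at the far end, as if a reserved
# segment were left attached there.
print(elmore_delays(chain, r, [1.0, 1.0, 2.0]))
```

Comparing the two printed lists shows why Section 2.4 aims to keep reserved segments disconnected until they are needed: added capacitance at one node raises the delay of every node whose path shares a switch with it.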

TABLE 1
PREDICTED AND ACTUAL SEGMENT OVERHEAD FOR ROUTING CIRCUITS WITHOUT HIGH FANOUT NETS
Columns: Circuit; Number of Nets; Number of Pins; Number of Non-FT Segments; Predicted Segment Overhead %; Number of FT Segments; Actual Segment Overhead %.
Rows: too_large, example, vda, alu, alu, symml, C, C, C, apex, k, term, CbnrD, DbnrD, EbnrD, bus_cntld, dmad, dram_fsmd, Average.

TABLE 2
TRACK OVERHEAD FOR ROUTING CIRCUITS WITHOUT HIGH FANOUT NETS WHEN USING BEST OVERALL HEURISTIC AND ALSO WHEN USING THE BEST HEURISTIC FOR EACH CIRCUIT
Columns: Circuit; Number of Nets; Non-FT Tracks; Best Heuristic: FT Tracks, % Overhead; Best Combination: FT Tracks, % Overhead.
Rows: too_large, example, vda, alu, alu, symml, C, C, C, apex, k, term, CbnrD, DbnrD, EbnrD, bus_cntld, dmad, dram_fsmd, Average.

Empirical Results

We have routed 18 benchmark circuits, ranging in size from 79 to 508 nets, and calculated segment and track overheads for node-covering cell FT. Table 1 compares segment overheads predicted by Theorem 1 to the actual overheads. Parameters for the predicted overheads were determined using empirical data from the circuits: the average length l between logic cell pins connected by a net is 3.8 segments, and the track-hop probability a was likewise taken from the routed circuits. Agreement between the predicted overhead of 39 percent and the actual overhead of 31 percent is reasonably good, and the difference is likely due to the simplifying assumption that the nets do not branch, while they actually do. While having multiple paths leaving a connecting point increases the number of terminal points that must be covered, it also reduces the number of nonterminal points and the number of reserved segments needed to cover them.
Table 2 shows the results of the best overall heuristic when circuits are routed without the high fanout (fanout > 10) nets (see footnote 1). The heuristic that chose the largest remaining net next, and started with the longest point-to-point path, proved to give the best results: an average track overhead of 43 percent. Since our goal is not actually to produce the most efficient router, but rather to determine the overhead of our FT technique, we also calculated average overheads based on using the most efficient heuristic for each circuit. These results are shown in Table 2 under Best Combination, and indicate that the actual track overhead for node-covering FT is considerably lower, at least as low as 34 percent.

1. As explained in the experimental procedure, high fanout nets were not routed for the data presented here. If all nets were routed, the averages in Table 2 would be 48 percent for Best Heuristic and 44 percent for Best Combination.

TABLE 3
PROPAGATION DELAY INCREASE DUE TO LOGIC CELL FAULT TOLERANCE
Columns: Circuit; Tracks Used; Average Non-FT Net Delay; Single Fault: Average FT Net Delay, % Increase; One Fault/Row: Average FT Net Delay, % Increase.
Rows: too_large, example, vda, alu, alu, symml, C, C, C, apex, k, term, CbnrD, DbnrD, EbnrD, bus_cntld, dmad, dram_fsmd, Average.
Net delay times are in nanoseconds.

This is very close to the overhead of 30 percent that we first predicted in [9], before completing the router modifications to cover instances of track-hopping and to provide more efficient routing heuristics. One manufacturing alternative, then, is to fabricate the FT FPGA with 34 percent more tracks, so that, after reconfiguration, the full routing capability of a non-FT FPGA (with the regular number of tracks) is maintained. Another alternative is to keep the number of tracks the same, with the understanding that the routing capability would not be as great as for a non-FT chip with the same number of tracks. In other words, circuits that use up to 75 percent (100%/134% ≈ 75%) of the available tracks could still be mapped to FPGAs with one cell defect per row. Since a typical circuit does not require all of the available tracks, lower routing capability might be acceptable to many users, especially since the higher yield afforded by fault tolerance would translate into lower chip prices.

The effect of reconfiguration on signal propagation delay was determined by simulating a single faulty cell and also by simulating the maximum reconfigurability of the node-covering cell FT technique: one faulty cell in every row. The number of tracks present in an FPGA affects the capacitance at each node in a net, and, thus, affects the delay.
Therefore, in order to isolate only the effect on delay due to reconfiguration, propagation delays of the non-FT cases were computed using the same number of tracks as the FT cases. Table 3 shows the average signal propagation delays of a net in each circuit when routed in a non-FT FPGA and in a reconfigured FT FPGA. It is seen in the table that, for the FT FPGA, the average propagation delay of a net affected by reconfiguration around a single faulty cell increases only slightly, by about 5 percent. With a faulty cell in every row, the average delay increase is only about 9 percent, which is still very small.

4 FAULT-TOLERANT WIRING

Wiring channel area in an FPGA is significant, occupying 50 to 90 percent of the chip area [3], [5]. In an effort to reduce the chip area occupied by wire segments in the routing channels, wire width and separation have been reduced. However, these reductions increase the chances of faults, such as breaks or shorts, in the wiring segments, and special techniques have been developed specifically to test for wiring faults in FPGAs [15]. As an alternative to increasing the width and separation of the wires in order to reduce wire breakage and shorting and thereby increase chip yield, we propose fault-tolerant techniques for handling wiring faults. Our wiring FT techniques are compatible with the node-covering-based FT methods applied to cells, and, when combined with the cell FT methods, they further increase the yield of usable chips.

It is assumed that interconnect wiring faults will be detected during testing, using a technique such as that of [15]. Types of wire segment faults that can be tolerated include breaks, shorts to power or ground, and shorts between two adjacent segments. As with the replacement of faulty logic cells, our wiring FT techniques work by replacing faulty wiring portions by adjacent ones, until eventually a spare wiring portion is reached.
In a segmented channel wiring architecture, contiguous wiring segments in a horizontal or vertical channel constitute a track. A grid consists of a set of interconnectable tracks, one per channel, along all horizontal and vertical channels (see Fig. 1). It is possible in our techniques to consider the replaceable portions to be either individual segments or entire grids, leading to different designs that trade off between area overheads and chip yields to different degrees.

2. We assume a switchbox can connect a segment to only a corresponding horizontal or vertical segment (as seen in Fig. 1). If a switchbox can connect a segment to segments of other tracks (e.g., from track i to track i + 1), the grid definition would change to a set of corresponding interconnectable tracks. For FT, such a switchbox would need to be slightly modified. If track i can connect to track i + 1, for example, it would also have to be able to connect to track i + 2 in the event that track i + 1 is faulty. One way to do this would be to have a laser-programmable connection between i's switch connection to i + 1 and (i + 1)'s switch connection to i + 2.

Fig. 7. Fault-tolerant wiring segments (a) before reconfiguration and (b) after reconfiguration around faulty segment. (c) Repositioning circuitry for cell-to-channel segment configuration data bits for p/4 = 4.

4.1 Fault-tolerant Segments

Our technique for replacing individual faulty wiring segments provides a high degree of fault tolerance in the channel wires: up to one faulty segment in the channel portion along each side of every cell can be tolerated. Fig. 7 shows the mechanism for replacing faulty segments, which includes the incorporation of a spare segment in each channel. Fig. 7a shows the situation before replacement has taken place. To reconfigure around a faulty segment, the switch boxes at each end of the segment must disconnect from it and connect instead to the segment below it, as shown in Fig. 7b, where the uppermost segment is faulty due to either a break or a short. The segment below is, in turn, replaced by the next segment, and so on, until the spare is used. In the case of a short between two adjacent segments, the uppermost of the two segments is considered to be the faulty one. It is disconnected from the switch boxes and simply remains attached, via the short circuit, to its replacement.

Details in Figs. 7a and 7b show how fuses and pass transistors can be used to connect the wiring segments to the rest of the circuitry within the switch box. During reconfiguration, which needs to be done at the factory, fuses at the ends of the faulty segment and at the ends of each of the lower segments are burned using a laser, and then fuses are burned to allow resistors to turn on the pass transistors connecting the replacement segments. Note that depletion mode nMOS transistors can be used as pullup resistors.
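The replacement chain described above (the faulty segment is replaced by the one below it, that one by the next, and so on until the spare is used) amounts to a simple index shift. The following sketch illustrates the mapping only; the function name, the zero-based indexing with 0 as the uppermost segment, and the default of four tracks are assumptions made for this example, and the fuse and pass-transistor circuitry of Fig. 7 is not modeled.

```python
def physical_segment(logical, faulty=None, tracks=4):
    """Map a logical segment index (0 = uppermost) to a physical segment.

    A channel portion holds `tracks` working segments plus one spare at
    physical index `tracks`. With no fault the mapping is the identity.
    With a faulty segment, that segment and every segment below it are
    replaced by the next one down, so the last logical segment lands on
    the spare. (Hypothetical helper, for illustration only.)
    """
    if not 0 <= logical < tracks:
        raise ValueError("logical index out of range")
    if faulty is None:
        return logical
    if not 0 <= faulty <= tracks:
        raise ValueError("faulty index out of range")
    return logical if logical < faulty else logical + 1


def segments_for_short(upper):
    """For a short between segments `upper` and `upper + 1`, the uppermost
    segment is treated as the faulty one, as in the text."""
    return upper
```

With four tracks and physical segment 1 faulty, `[physical_segment(k, faulty=1) for k in range(4)]` gives `[0, 2, 3, 4]`: segment 0 is untouched, and the spare (index 4) is brought into use.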
Configuration data for the segment-to-segment interconnections in the channel remains the same as without segment replacement. However, provision must be made for remapping the original logical cell-to-channel segment connection relationships onto the replacement segments. Consider an FPGA cell having p pins, p/4 pins per side, connecting to track segments in the surrounding channels. If there are t tracks in a channel, then each of the t segments on one side of a cell has programmable connections to p/4 cell pins. The configuration bits controlling the cell-tosegment connections on one side of a cell can then be arranged consecutively in the configuration memory shift register as t groups of p/4 bits. Such an arrangement is shown in Fig. 7c, where there are also p/4 bits for connections to the spare segment. Configuration bit positions in the shift register that correspond to connections to the faulty segment can be bypassed in the same manner that positions are bypassed when reconfiguring around a faulty cell. Thus, the bits are relocated to positions that correspond to the replacement segments. In Fig. 7c, burning the two fuses associated with the group of p/4 shift register bits that correspond to the faulty segment will cause that group to be bypassed. The spare group is then brought into use and the spare segment gets included in the set of usable segments. Since the spare group is normally bypassed, it is configured such that burning its fuses brings it into use (see Fig. 7c). Time Overhead. For fault-free wiring, there are no additional pass transistors in a signal path between cells, compared to a non-ft FPGA, so there is no increase in propagation delay. When a faulty wire segment is replaced, only two pass transistor delays are added to a signal path passing through the reconfigured portion of a channel. For long paths, this is not a significant percentage increase, and for

short paths, the delay is likely to already be small enough, compared to delays in longer nets, that an increase may be tolerated.

Hardware Overhead. Counting the additional elements (pass transistors, fuses, and resistors) needed in both the segment replacement circuitry and the cell-to-channel segment configuration bit repositioning circuitry, the overhead is 12 or 16 elements per wire segment. The overhead depends on whether or not it is practical to share some of the fuse and pull-up combinations controlling the pass transistors in these different circuit areas. A shift register configuration memory bit and the pass transistor it controls require about 11 transistors. There are seven to 11 of these bits associated with each wire segment, assuming contributions from the segment-to-segment interconnection bits and from the p/4 = 2 to 4 cell-to-segment connection bits for both of the cells that can connect to a segment. This gives a quite reasonable element overhead of approximately 10 to 15 percent or 13 to 21 percent, depending on pull-up sharing. The overhead of the spare segment itself, and the additional cell-to-segment configuration bits associated with it, depends on the initial number of tracks and will be fairly small. When breaks and shorts in the wire segments are the primary defects of concern, and there is a favorable tradeoff between adding extra elements to tolerate these defects and increasing wire width and spacing to reduce the defect probability, this FT technique should provide good yield improvement. In Section 5, we analyze the yield afforded by this method.

Fig. 8. Reconfiguration around faulty tracks for grid replacement.

4.2 Fault-tolerant Grids

Replacing an entire grid provides a much more coarsely grained fault tolerance than does replacing individual segments.
However, faults can be tolerated not only in the wiring segments themselves, but in the switch box as well, including the shift register memory configuration bits that program the switch box. As with the technique for replacing individual wiring segments, a spare wiring portion must be added, in this case, a spare grid. Unlike in that technique, though, there are no additional elements required to be added within the channels, thus obviating any overhead in signal path delays after reconfiguration. Reconfiguration is accomplished instead at the periphery of the cell matrix as the user s configuration data is loaded into the FPGA. The configuration memory shift register may be arranged as a single shift register, or it may branch to different areas of an FPGA. Here, we consider a branched shift register, although our FT concepts can be adapted easily to a single shift register. Fig. 8 shows the input shift register, through which user configuration data is input to the chip, and several branch shift registers. The input shift register is loaded with a bit for each track of each horizontal channel, and a branch shift register to carry the segment-to-segment interconnection data for a track begins at each of these bit positions. When the input shift register is full, its bits are shifted into the branch shift registers. The muxes shown in Fig. 8 control which branch shift register gets a particular bit from the input shift register. In order to tolerate a faulty grid, these muxes are factory programmed (or user programmed, for operational FT) to bypass the tracks of the faulty grid and route the bits into the branch shift registers corresponding to the replacement tracks. As with the segment replacement technique, a faulty track is replaced by the next track, which is replaced by the next track, and so on, until the spare track is used. This is done for each track of a faulty grid, and so replaces the entire grid by an adjacent one. 
Since the cell-to-channel segment connections will change uniformly throughout the array in the event of a grid replacement, this connection data can be modified automatically as it is loaded into the cells (recall that cell-to-channel segment connection data is associated with the cell in our FT cell techniques, described earlier). Fig. 9a shows a possible path of configuration data entering a cell. The muxes are those used to bypass faulty cells, and grid replacement is entirely compatible with the FT cell techniques. Note that the path always enters the Track 0 connection position first. Cell data can then be arranged in the cell data vector shown in Fig. 9b, in which the configuration bits for connecting a pin of the cell to one or more of the t tracks (here, t = 4) are grouped together. The cell

programming data supplied by a user does not include the nulls in this vector. They can be inserted by a factory (or user) programmed state machine on the chip as the data is loaded into the FPGA. Normally, without a faulty grid, nulls are placed in the vector for the connections to the spare tracks in order to disconnect the cell from the tracks of the spare grid. In the case of a faulty grid, the state machine places the nulls instead at the positions of the tracks of the faulty grid. This disconnects the cell from the tracks of the faulty grid and remaps the configuration bits for cell connections onto the replacement grids. Such a scheme is depicted in Fig. 9c. As with the segment-to-segment interconnection data, an input shift register is loaded, interleaving a bit for each row of cells. These bits will all be a particular bit of the cell data vector for each cell in a column of cells. Either these bits or else nulls will be chosen by the state machine's Data Select signal and then shifted into the branch shift registers for the cell rows by the Shift Control signal. Nulls are chosen whenever their positions in the cell data vector arrive. This is either the normal position corresponding to the spare grid, or else, in the case of a fault, the position corresponding to the faulty grid. When the branch shift registers are full, they will each contain a string of the vectors of Fig. 9b, one vector for each cell.

Overheads. As stated above, there is no additional time delay in any signal path in a circuit reconfigured around a faulty grid. Also, the hardware overhead is extremely low, since only one simple state machine is required to reposition the cell-to-channel segment connection data for all of the cells. The overhead of the spare grids themselves depends on how many spares are desired.
If there is more than one, it simplifies the reconfiguration circuitry if they are placed one per group of grids rather than used as global spares.

Fig. 9. (a) Path of configuration data (for programming cell logic and cell-to-channel interconnections) through a cell. (b) Vector containing cell configuration data, when Track/Grid 1 is faulty. (c) Mechanism for repositioning cell-to-channel segment connection data to bypass a faulty grid.

5 YIELD IMPROVEMENT

An important benefit of fault tolerance is an improvement in chip yield, where chip yield is defined as the percentage of usable chips rather than simply the percentage of defect-free chips. We present a comparison of our node-covering cell FT technique and FT interconnect techniques to the spare rows and/or columns techniques of [13], based on a set of common parameters.

5.1 Analysis Procedure

A small array of logic cells is defined, to which will be added one spare row and/or column of cells or one spare cell per row. A floor plan of the array layout is shown in Fig. 10, which is based on an FPGA layout presented in [4]. Our set of common parameters assumes that the interconnect channels occupy 50 percent of the array area in the non-FT FPGA and that 50 percent of the channel width is devoted to 16 tracks of wire segments. The remainder of the channel consists of active circuitry, as described next. The interconnect channel area can also be divided into segment-to-segment and cell-to-segment areas, as seen in Fig. 10. Each of these areas consists of wiring, pass transistors to create the actual programmable interconnections, and configuration memory shift register bits that control the pass transistors. In the channel wiring architecture model of Fig. 1, there are six transistors in the channel segment to channel segment switch and four transistors needed for the cell to channel connections on one wiring segment.
A rough estimate of wire area is that there is 50 percent more in the segment-to-segment interconnect area than in one of the cell-to-segment interconnect areas. Using these ratios, we approximate a segment-to-segment interconnect area of the floor plan to be 50 percent greater than a cell-to-segment interconnect area in a non-ft FPGA. When additional tracks are added for fault tolerance, the cell-to-segment interconnect area grows linearly with the track increase. If tracks of only one orientation are added (e.g., horizontal tracks), the segment-to-segment interconnect area also grows linearly. Otherwise, it grows as the square of the increase in the number of tracks. For our computations, the nominal size of a tile (cell area + segment-to-segment area +

two cell-to-segment interconnect areas) in the non-FT array was taken as 0.004 cm².

Fig. 10. Floor plan of FPGA array.
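The common parameters above fix how a non-FT tile divides into its parts: channels take 50 percent of the area, and the segment-to-segment block is taken to be 50 percent larger than one cell-to-segment block. A small sketch of that arithmetic follows. The function and key names are hypothetical, and the nominal tile area is passed in as a parameter rather than assumed.

```python
def tile_areas(tile_area, channel_fraction=0.5, ss_to_cs_ratio=1.5):
    """Split a non-FT tile into cell, segment-to-segment (ss), and
    cell-to-segment (cs) block areas.

    A tile = cell + one ss block + two cs blocks. The channel area
    (ss + 2 * cs) is `channel_fraction` of the tile, and
    ss = ss_to_cs_ratio * cs. (Illustrative helper, not from the paper.)
    """
    channel_area = tile_area * channel_fraction
    cell_area = tile_area - channel_area
    # Solve channel_area = ss + 2*cs with ss = ratio * cs.
    cs_block = channel_area / (ss_to_cs_ratio + 2.0)
    ss_block = ss_to_cs_ratio * cs_block
    return {"cell": cell_area, "ss_block": ss_block, "cs_block": cs_block}
```

Calling `tile_areas(0.004)` with the nominal tile size quoted in the text gives a cell area of half the tile, with the remaining half split in the ratio 1.5 : 1 : 1 among the segment-to-segment block and the two cell-to-segment blocks.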

Each of the FT techniques compared here can tolerate faults in the FPGA logic cells and in certain portions of the interconnect channels. In the spare row and/or column techniques [13], the FT portions of the channel can be combined with the logic cell, forming an FT unit area, for purposes of yield analysis. In our schemes, some of the FT channel area can be combined with the cell, while the other portions are divided into FT unit areas for yield analysis according to the FT interconnect techniques used. Chip yield is calculated as the product of the yields of the FT portions of the array and the yield of the non-FT portions. Fig. 11 shows the scope of the FT and non-FT areas for the different techniques.

From the description of spare row and/or column techniques in [13], we deduce that the segment-to-segment interconnect areas are non-FT (see Figs. 11a and 11b). For the spare column method, faults can be tolerated in the cell-to-segment interconnect areas of the vertical channels and in some portions of the horizontal channels, as indicated in Fig. 11a. In this method, the horizontal wire segments themselves are non-FT, as are the portions of the horizontal cell-to-segment interconnect areas that are needed to ensure that a faulty column of cells is disconnected from the horizontal wires. Such a disconnect mechanism could be similar to the one shown in Fig. 5 for the node-covering FT cell technique. When both a spare column and a spare row are used, only portions of both the horizontal and vertical cell-to-segment interconnect areas can tolerate faults. Thus, the FT unit area is smaller, as seen in Fig. 11b. For cell FT using node-covering (Figs. 11c and 11d), the FT unit area for the cell is the same as in Fig.
11b, as portions of the cell-to-segment areas must remain fault-free in order to ensure that a faulty cell is disconnected from the channel wiring. With FT wiring segments, Fig. 11c shows that the segment-to-segment interconnect areas are non-FT and that each wire segment becomes an FT unit. For the FT wiring grid technique, Fig. 11d shows that the FT units are formed by dividing the entire wire segment and segment-to-segment interconnect area into grids.

Fig. 11. FT unit areas and non-FT areas of an array for different FT schemes. (a) Spare column [13]. (b) Spare row and column [13]. (c) Node covering cell FT with FT wire segments. (d) Node covering cell FT with FT wiring grids.

Details of our yield analysis are available in [11], but are summarized here. To analyze the yields of the various array areas of the different FT schemes, we use the Poisson yield model [5], in which defects are assumed to occur independently and the probability that an FPGA cell is defect-free is given by

$$p = e^{-D_n A_c}, \tag{1}$$

where $A_c$ is the cell area and $D_n$ is the defect density, or the number of defects per unit area. This model is easy to use, but it should be noted that it tends to underestimate the yield when the $DA$ product increases beyond unity [5]. Even beyond that limit, however, it remains useful for relative yield comparisons, which are our main concern.

We first present yield equations for the analysis of FT techniques applied to logic cells. For each of the fault-tolerant techniques, we compute the yield of an array of n rows by m columns of cells as the sum of the probabilities of all usable configurations of defective and defect-free cells. For the technique employing a spare column, this is

$$Y_{sc} = p^{n(m+1)} + (m+1)\,p^{nm} \sum_{i=1}^{n} \binom{n}{i} p^{\,n-i} (1-p)^{i}, \tag{2}$$

where $\binom{n}{i}$ is the number of ways of choosing i out of n items. For the technique employing both a spare row and a spare column, the yield expression is

$$Y_{src} = p^{(m+1)(n+1)} + (m+1)(n+1)\,p^{(m+1)(n+1)-1}(1-p) + (m+1)\,p^{m(n+1)} \sum_{i=2}^{n+1} \binom{n+1}{i} p^{\,n+1-i}(1-p)^{i} + (n+1)\,p^{n(m+1)} \sum_{j=2}^{m+1} \binom{m+1}{j} p^{\,m+1-j}(1-p)^{j} + \cdots, \tag{3}$$

where the remaining terms, each containing the factor $p^{mn}$, cover the fault patterns that need both the spare row and the spare column, and, for our node-covering cell FT technique, it is

$$Y_{nc} = \left[ p^{m+1} + (m+1)\,p^{m}(1-p) \right]^{n}. \tag{4}$$

In analyzing the yield of our FT interconnect techniques, p becomes the probability that an individual wire segment or grid is defect-free, where the area in the Poisson model is the area of a wire segment or a grid. FT wiring yield is computed as the sum of the probabilities of all usable configurations of defective and defect-free wiring portions. The yields for a segment and a grid, respectively, are expressed as

$$Y_{seg} = \left[ p^{s+1} + (s+1)\,p^{s}(1-p) \right]^{2mn}, \tag{5}$$

$$Y_{grid} = p^{s+1} + (s+1)\,p^{s}(1-p), \tag{6}$$

where s is the number of segments across the width of a channel. For the non-FT portions of the array in each of the FT techniques, the non-FT yield is

$$Y_{NFT} = e^{-D_n A_{NFT}}, \tag{7}$$

where $A_{NFT}$ is the total non-FT area. The product of the yields of the FT portions of the array and the yield of the non-FT portions is the chip yield.

Areas of cell and interconnect portions of the FPGA for use in analyzing the yields of the various FT techniques are derived in the Appendix and summarized here. They are calculated by applying adjustment factors to the areas of a non-FT array, depending on the particular technique. These factors, and the other variables used in the area calculations, are defined as follows:

$A_c$: area of logic cell in non-FT array
$A_{cs}$: area of cell-to-segment interconnect block associated with one side of a logic cell, i.e., 1/2 the area of the cell-to-segment interconnect block of Fig. 10 in non-FT array
$A_{ss}$: area of segment-to-segment interconnect block in non-FT array
$t$: number of tracks per channel in non-FT array
$T_I$: track increase factor in a spare row and/or column FT scheme or in the node-covering cell FT scheme
$T_{AI}$: dimension (height and/or width) increase factor for cell-to-segment and segment-to-segment interconnect blocks
$F_{NFT}$: fraction of configuration memory bit circuitry used to disconnect a faulty cell from the channel; it cannot tolerate faults
$F_{FT}$: fraction of configuration memory bit circuitry not used to disconnect a faulty cell from the channel; it can tolerate faults
$F_W$: fraction of channel width devoted to wire segments
$O_{cs}$: FT overhead fraction for cell-to-segment circuitry in FT segment scheme
$O_{ss}$: non-FT overhead fraction for segment-to-segment circuitry in FT segment scheme

Consideration is given in the analysis to the need for adding up to 50 percent more horizontal tracks in the spare column FT technique of [13], in order to retain the original routing flexibility. The FT unit cell area is calculated as

$$A_{c,sc} = A_c + 2A_{cs} + 2A_{cs} T_{AI} (1 - F_W) F_{FT} \tag{8}$$

and the total non-FT area is

$$A_{NFT,sc} = n(m+1) \left( A_{ss} T_{AI} + 2A_{cs} T_{AI} \left[ F_W + (1 - F_W) F_{NFT} \right] \right). \tag{9}$$

When both a spare row and a spare column are used, additional tracks must be added in both the horizontal and vertical channels. The FT unit cell area then becomes smaller,

$$A_{c,src} = A_c + 4A_{cs} T_{AI} (1 - F_W) F_{FT}. \tag{10}$$

Accordingly, the non-FT area becomes much larger, as follows:

$$A_{NFT,src} = (n+1)(m+1) \left( A_{ss} T_{AI}^{2} + 4A_{cs} T_{AI} \left[ F_W + (1 - F_W) F_{NFT} \right] \right). \tag{11}$$

Our technique of node-covering for fault tolerance in the cells is projected to require a 34 percent increase in the number of tracks in the channels, so the analysis includes this as well. For node-covering cell FT with FT segments, cell area and segment area, respectively, are

$$A_{c,ncs} = A_c + 4A_{cs} T_{AI} (1 - F_W) F_{FT} (1 + O_{cs}), \tag{12}$$

$$A_{seg} = \frac{2A_{cs} T_{AI} F_W}{t T_I + 1}, \tag{13}$$

and the total non-FT area is

$$A_{NFT,ncs} = n(m+1) \left[ A_{ss} T_{AI}^{2} \left( 1 + O_{ss}(1 - F_W) \right) + 4A_{cs} T_{AI} (1 - F_W) F_{NFT} \right]. \tag{14}$$

For FT grids combined with node-covering cell FT, the expression for cell area $A_{c,ncg}$ is the same as with the spare row and column technique, and the area of a grid is

$$A_{grid,ncg} = \frac{n(m+1) \left[ A_{ss} T_{AI}^{2} + 4A_{cs} T_{AI} F_W \right]}{t T_I + 1}. \tag{15}$$

The non-FT area is very small,

$$A_{NFT,ncg} = n(m+1) \left[ 4A_{cs} T_{AI} (1 - F_W) F_{NFT} \right]. \tag{16}$$

Combining the area expressions with the yield equations presented above, and then varying the defect density and/or non-FT cell area, produces the array yield. Gross yield takes into account the fact that increasing chip area to

implement FT techniques means that fewer chips can be fabricated from a wafer. It is calculated by dividing the FT array yield by the ratio of the area of an FT array to the area of the non-FT array. For our node-covering cell FT with wiring FT techniques, we calculate yield results both with and without the extra tracks required to retain the full routability of a non-FT array, as FPGAs with slightly reduced routing capability are still useful chips. Since the spare row and/or column techniques also require additional tracks, we also calculate their yields with less than their full complement of extra tracks, and then compare all of the techniques normalized to the same reduced routability. To take into account the lower expected selling price of FPGAs with reduced routability, we define gross profit yield to be gross yield multiplied by a price factor. This factor is 1 - 0.5 × RRF, where 0.5 is the fraction of chip area occupied by the wiring channels and RRF is the Reduced Routability Factor.

Fig. 12. Yields afforded by various fault-tolerant techniques versus the yield of an FPGA without fault tolerance. (a) Yield comparison of cell arrays. (b) Gross yield comparison.

5.2 Results

Chip yields for the various FT techniques are plotted against the yield of a non-FT array, as shown in Fig. 12a. For the yields of Fig. 12 incorporating our node-covering cell FT technique, the 34 percent track increase necessary to maintain full routability with that technique has been assumed. It can be seen that combining FT grids with node-covering cell FT affords the greatest yield improvement in the FPGA, doubling a 30 to 40 percent non-FT yield. The FT segment technique combined with node-covering cell FT, although it provides very high fault tolerance in the channel segments, shows less yield improvement due to the absence of fault tolerance in the segment-to-segment interconnect area.
Lower yield improvement is given by the spare column technique, again because of the lack of fault tolerance in the segment-to-segment interconnect area and in the horizontal channel segments. A spare row and column, while providing high fault tolerance for the cells, require too much additional track and segment-to-segment interconnect area. As these areas are non-FT, there is very little overall yield improvement.

Gross yield is calculated by dividing the FT yields of Fig. 12a by the ratio of the area of an FT array to the area of the non-FT array. From the area expressions given above, (8) through (16), these area ratios can be easily computed. Areas themselves are found by multiplying the FT unit areas of an array by the number of units (cells, segments, or grids) and adding these areas to the non-FT area of the array. For the spare column technique, only the number of horizontal tracks is increased, giving a low area ratio of 1.7. With both a spare row and a spare column, the number of tracks is increased in both the horizontal and vertical channels, giving the highest area ratio. For our techniques of node-covering cell FT with FT interconnect, track numbers are also increased in both the horizontal and vertical channels, but the increase is not as great as for the spare row and column technique. This gives more moderate area ratios of 1.55 and 1.48 for our techniques with FT segments and FT grids, respectively. The FT segment technique has the higher ratio due to the extra elements needed to facilitate segment replacement within the channels.

Fig. 12b shows gross yield for the four FT techniques. The area increase of the spare row and column technique is too large to allow any gross yield increase. The spare column technique and our technique of node-covering cell FT with FT segments both show a small gross yield improvement, but it disappears as the non-FT yield reaches about 40 percent.
However, at that point, our technique of node-covering cell FT with FT grids still gives a gross yield of nearly 55 percent (more than a 30 percent yield improvement) and continues to provide a gross yield improvement until non-FT yield reaches 60 percent. It is relevant to note that Xilinx FPGA chip yields are lower than 50 percent for their largest arrays (48 × 48 cells) [28].

In the discussion of Section 3.3, it is stated that not adding extra tracks when using our node-covering cell FT technique results in reduced routability, approximately 75 percent of that for a non-FT array. However, such chips can be acceptable to many users if the cost is lower than that of a chip with full routability. Greater yield is possible without the extra tracks than when they are added, and this greater yield will reduce the cost. We show the chip yields of versions of our node-covering cell FT and wiring FT techniques with reduced

routing capability in Fig. 13a, plotted against non-FT yield. Since the spare row and/or column techniques also require additional tracks (50 percent more) in order to retain full routability, arrays with the same reduced routability can also be defined for these techniques. A 50 percent track increase results in 24 tracks, so 75 percent routability results in 24 × 75% = 18 tracks for the spare column and spare row and column techniques in Fig. 13. Note that, while the numbers of tracks differ, all FT designs are being compared at the same degree of routability (75 percent).

Fig. 13. Yields afforded by various fault-tolerant techniques versus the yield of an FPGA without fault tolerance when FT arrays have equal reduced routing capability (75 percent for all techniques). (a) Yield comparison of cell arrays. (b) Gross yield comparison.

Chip yields are indeed greater in Fig. 13a than in Fig. 12a. More dramatic, however, are the gross yields of Fig. 13b. Here, all of the area overheads are smaller, resulting in much greater yield improvement than shown in Fig. 12b. Area overhead ratios for reduced routability versions of the arrays are 1.13 and 1.7 for the spare column and the spare row and column techniques, respectively, and 1.19 and 1.14 for node-covering cell FT with segment FT and grid FT, respectively. To take into account the lower expected selling price of FPGAs with reduced routability, we have defined gross profit yield to be gross yield multiplied by a price factor. This factor is 1 - 0.5 RRF, where 0.5 is the fraction of chip area occupied by the wiring channels and RRF is the Reduced Routability Factor, which is 0.25 for 75 percent routability. Fig. 14 shows gross profit yield for the various FT techniques.
The node-covering cell FT technique with grid FT again shows the highest performance, providing a yield of 70 percent (a 40 percent improvement) when non-FT yield is 50 percent. It continues to provide a gross profit yield improvement until non-FT yield exceeds 75 percent.

Fig. 14. Gross profit yields afforded by various fault-tolerant techniques versus the yield of an FPGA without fault tolerance when FT arrays have equal reduced routing capability (75 percent for all techniques).

6 FT RESOURCE ALLOCATION

Since the interconnect channels might occupy as much as 90 percent of the area of an FPGA [5], there may be instances when it is more beneficial to allocate resources to providing FT in the wiring rather than in the logic cells. The analysis in Section 5 assumed that interconnect occupied 50 percent of the FPGA area. However, as the relative proportion of interconnect area increases, the yields decrease for all of the techniques presented. This can be seen in Fig. 15, where only node-covering with FT grids provides any yield improvement at all when 70 percent of the chip area is devoted to channel interconnect.

Fig. 15. Gross yields afforded by various fault-tolerant techniques versus the yield of an FPGA without fault tolerance when 70 percent of chip area is devoted to channel interconnect.
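The yield measures used in Sections 5 and 6 reduce to a few lines of arithmetic, sketched below in Python. This is only an illustration of the definitions quoted in the text (gross yield as FT yield divided by the area ratio, and the price factor 1 - 0.5 RRF). The function names are hypothetical, the FT array yield of 0.57 is an invented sample value rather than a result from the paper, and RRF = 0.25 is assumed here to be one minus the 75 percent routability. The area ratio of 1.14 is the reduced-routability value quoted for node-covering cell FT with grid FT.

```python
def gross_yield(ft_yield: float, area_ratio: float) -> float:
    """Gross yield: FT array yield divided by the FT-to-non-FT area ratio."""
    if area_ratio <= 0:
        raise ValueError("area_ratio must be positive")
    return ft_yield / area_ratio


def price_factor(rrf: float, wiring_fraction: float = 0.5) -> float:
    """Price factor 1 - wiring_fraction * RRF (wiring_fraction is 0.5 in Section 5)."""
    return 1.0 - wiring_fraction * rrf


def gross_profit_yield(ft_yield: float, area_ratio: float, rrf: float,
                       wiring_fraction: float = 0.5) -> float:
    """Gross profit yield: gross yield multiplied by the price factor."""
    return gross_yield(ft_yield, area_ratio) * price_factor(rrf, wiring_fraction)


def relative_improvement(ft_value: float, non_ft_value: float) -> float:
    """Yield improvement expressed relative to the non-FT yield."""
    if non_ft_value <= 0:
        raise ValueError("non_ft_value must be positive")
    return ft_value / non_ft_value - 1.0


if __name__ == "__main__":
    ft_yield = 0.57    # hypothetical FT array yield, for illustration only
    area_ratio = 1.14  # grid FT with node-covering, reduced routability (from the text)
    rrf = 0.25         # assumption: 1 - 0.75 routability

    print(f"gross yield        = {gross_yield(ft_yield, area_ratio):.3f}")
    print(f"price factor       = {price_factor(rrf):.3f}")
    print(f"gross profit yield = {gross_profit_yield(ft_yield, area_ratio, rrf):.3f}")
    # The text reports a 70 percent yield against a 50 percent non-FT yield.
    print(f"relative improvement = {relative_improvement(0.70, 0.50):.0%}")
```

Expressing the improvement relative to the non-FT yield matches how the text reports it: a 70 percent yield against a 50 percent non-FT yield is described as a 40 percent improvement.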

Yields using spare rows and/or columns are degraded compared to Fig. 12b because increasing the wiring area proportion increases the amount of non-FT area. Our techniques combining node-covering cell FT with wiring FT suffer a decrease in yield because the extra tracks required by node-covering for full routability cause a proportionately larger array area increase as the interconnect area fraction increases.

When the interconnect area proportion in an FPGA increases relative to the logic cell area, greater yield improvement can be obtained by concentrating FT resources on the wiring rather than on the logic cells. Since our wiring FT techniques are independent of the node-covering cell FT techniques, they can be used by themselves. Note also that, without the node-covering cell FT technique, there is no area overhead due to spare logic cells and no need to add more tracks in order to maintain full routability. Our FT grid technique has lower area overhead than our FT segment technique, and so will be used to illustrate the relative benefits of applying FT to only the wiring. The parameters are the same as stated in Section 5.1 regarding the number of tracks and relative sizes of cell-to-segment and segment-to-segment interconnect areas. Only the fraction of array area occupied by the interconnect channels is varied.

Fig. 16. Gross yield comparison of wiring FT versus wiring FT with logic cell FT as the proportion of chip area devoted to channel interconnect increases. (a) 50 percent interconnect. (b) 70 percent interconnect. (c) 90 percent interconnect.

Fig. 16a shows that gross chip yield improvement for FT grids at a 50 percent wiring fraction is not as great as that of FT grids combined with node-covering cell FT when non-FT yields are below 50 percent.
However, the wiring FT by itself does give yield improvement where the combined techniques do not, up to a non-FT yield of about 80 percent. The choice of which technique to use, then, would depend on the expected yield conditions. As the wiring fraction increases to 70 percent and 90 percent, the FT grid technique by itself is clearly preferable, as seen in Figs. 16b and 16c, respectively. At a non-FT yield of 35 percent, it provides a yield of 45 percent (a 30 percent yield improvement) when the wiring fraction is 70 percent, and a yield of 47 percent (a 35 percent yield improvement) when the wiring fraction is 90 percent.

7 CONCLUSIONS

We have presented fault-tolerant techniques to increase the yield of FPGAs by reconfiguring around factory-detected faults in the logic cells and the interconnect wiring in a user-transparent manner. These techniques can also allow the user to reconfigure around operational faults for increased reliability and availability of FPGAs. One technique makes use of a node-covering method to tolerate cell faults by allowing a cell in the array to be replaced by its neighbor if it becomes faulty. Reconfiguration is simplified by a routing discipline that provides the hooks for cell replacement by placing the necessary channel segments near the cover cells beforehand. Channel segments reserved for use in reconfiguration do not add extra parasitic delay to nets in a non-reconfigured array, since they are connected (automatically) only when needed. Our wiring FT techniques make use of spare wiring portions that allow either a wire segment or a grid to be replaced when faulty.

No rerouting is necessary in the event of a fault, and the configuration data for the faulty cell, including the connections to the channel wiring, is simply transposed to the cover cell. Likewise, configuration data for a faulty wiring portion is transposed to the replacement segment or grid. The simplicity of the reconfiguration method is an improvement over other techniques that require a new routing for each different faulty cell location.

Our techniques are also an improvement over row and/or column replacement done at the factory in that they offer substantially greater yield improvement than either of those methods. For example, when non-FT yield is 35 percent, the spare row and column technique offers insignificant yield improvement, while the spare column technique affords a 35 percent yield improvement. At this same non-FT yield, our method combining cell FT and grid FT provides about 55 percent greater yield than the spare column technique. In fact, cell and grid FT more than doubles the non-FT yield. When gross yield is considered, the other techniques cease to provide any yield improvement when non-FT yield increases beyond about 40 percent, while our method of cell and grid FT still affords more than 30 percent yield improvement.
Since the interconnect channels might occupy as much as 90 percent of the area in some FPGA designs [5], it can be beneficial to be able to concentrate FT resources on the wiring rather than on the logic cells. As the wiring fraction increases to 70 percent and beyond, the FT grid technique used without cell FT is clearly preferable to any of the other techniques. At a non-FT yield of 35 percent, it provides a yield improvement of at least 30 percent.

APPENDIX: SEGMENT OVERHEAD

THEOREM 1. A circuit consisting of nonbranching nets that travel from left to right monotonically incurs the following segment overheads when the node-covering cell FT technique is applied. Best case overhead is $e/(l(p - e))$. Worst case overhead is $4/l$. Average overhead is

$$\frac{(p - e)(h + v) + 2e}{2(p - e)(l + a)},$$

where $h$ is the average segment overhead at a horizontal connection segment and $v$ is the average segment overhead at a vertical connection segment; each is a function of $l$ and $a$, with closed forms derived in [11]. In the above expressions, $e$ is the number of nets, $p$ is the number of logic cell pins used, $l$ is the average length in segments between adjacent logic cell pins in a net, and $a$ is the probability of a track-hop occurring at a pin.

PROOF. The proof is presented briefly here; complete details are included in [11]. Nets are made up of point-to-point paths which connect at the logic cell pins. Segment overhead is incurred due to the need to cover track-hops at the connecting points of these paths, as well as due to the need to add reserved segments whenever cover segments are not already provided by the net itself. In the best case, there will be no track-hops, and all of the internal pin connections of each net will be covered by segments that are already part of the net. There will be only one reserved segment required to cover the last pin connection at the rightmost, or terminal, end of each of the e nets.
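The three overhead expressions of Theorem 1 can be evaluated numerically with the short Python sketch below. It is not taken from the paper: the function names and the sample parameter values are hypothetical, and h and v are passed in as plain numbers rather than computed. The average expression is written as the total average overhead (p - e)(h + v)/2 + e divided by the (p - e)(l + a) segments of a non-FT circuit.

```python
def best_case_overhead(e: int, p: int, l: float) -> float:
    """Best case: one reserved segment per net over l(p - e) net segments."""
    _check(e, p, l)
    return e / (l * (p - e))


def worst_case_overhead(l: float) -> float:
    """Worst case: four overhead segments per l-segment portion of a net."""
    if l <= 0:
        raise ValueError("l must be positive")
    return 4.0 / l


def average_overhead(e: int, p: int, l: float, a: float,
                     h: float, v: float) -> float:
    """Average case: ((p - e)(h + v) + 2e) / (2(p - e)(l + a)).

    h and v are the average overheads at horizontal and vertical connection
    segments; they are supplied by the caller (assumed values in this sketch).
    """
    _check(e, p, l)
    if not 0.0 <= a <= 1.0:
        raise ValueError("a is a probability and must lie in [0, 1]")
    return ((p - e) * (h + v) + 2 * e) / (2 * (p - e) * (l + a))


def _check(e: int, p: int, l: float) -> None:
    """Reject parameter sets for which the expressions are undefined."""
    if p <= e:
        raise ValueError("p must exceed e so that p - e nonterminal pins exist")
    if l <= 0:
        raise ValueError("l must be positive")


if __name__ == "__main__":
    # Hypothetical circuit: 10 nets, 40 pins, 3 segments between pins,
    # 20 percent track-hop probability, assumed h = 0.5 and v = 1.0.
    e, p, l, a, h, v = 10, 40, 3.0, 0.2, 0.5, 1.0
    print(f"best    = {best_case_overhead(e, p, l):.4f}")
    print(f"average = {average_overhead(e, p, l, a, h, v):.4f}")
    print(f"worst   = {worst_case_overhead(l):.4f}")
```

For these sample values the average falls between the best and worst cases, which is a quick sanity check on whatever h and v are supplied.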
Since there are l(p - e) segments used in the nets, the overhead factor is e/(l(p - e)).

In the worst case, the net itself will not provide covers for any of the connections (requiring two reserved segments for the worst case cover when the segment is vertical), and the net will track-hop at each of the connecting points of the point-to-point paths (requiring two more reserved segments in the worst case). Thus, each l-segment portion of each net will incur four overhead segments, for an overhead factor of 4/l.

For the average overhead, we consider the probabilities of pin connections being made at horizontal or vertical segments, and the probability of track-hops occurring. The case of a horizontal segment at a nonterminal pin of a net is shown in Fig. 17a. The net could continue along the same track and automatically cover the connection without any segment overhead, or it could track-hop and continue in that direction with one segment of overhead to cover the track-hop. The net could, instead, turn upward or downward, in which case a horizontal reserved segment would be required to cover the connection. If it should track-hop as well, an additional reserved segment would be required to cover the track-hop. The probability of continuing horizontally versus vertically depends on the probability that the next connection point is located in an area that would be reached by traveling horizontally. We derive these probabilities from the relative sizes of the areas the net must reach, and then factor in the probabilities of track-hopping to determine the average segment overhead h for a connection at a horizontal segment. Similarly, for the case of a vertical segment, as in Fig. 17b, we

calculate the average segment overhead v for a vertical connection. Since there are p - e nonterminal pins, with an average segment overhead of (h + v)/2, the total average segment overhead is (p - e)(h + v)/2 + e, where e segments are added to cover the terminal pins. In a non-FT circuit, l(p - e) segments are used. Accounting for track-hops, the total number of segments in a non-FT circuit is then (p - e)(l + a), and the average FT segment overhead factor becomes

$$\frac{(p - e)(h + v) + 2e}{2(p - e)(l + a)}.$$

Fig. 17. Areas reachable by an l-segment path. (a) From a horizontal segment. (b) From a vertical segment.

REFERENCES

[1] Altera Corporation, FLEX8000 Programmable Logic Device Family Data Sheet.
[2] S. Brown, Univ. of Toronto, private communication, May.
[3] R. Cliff et al., "A Dual Granularity and Globally Interconnected Architecture for a Programmable Logic Device," Proc. IEEE Custom Integrated Circuits Conf.
[4] P. Chow et al., "A 1.2µm CMOS FPGA Using Cascaded Logic Blocks and Segmented Routing," Proc. Oxford 1991 Int'l Workshop Field Programming Logic and Applications, FPGAs, W. Moore and W. Luk, eds. Abingdon, England: Abingdon EE & CS Books.
[5] J. Cunningham, "The Use and Evaluation of Yield Models in Integrated Circuit Manufacturing," IEEE Trans. Semiconductor Manufacturing, vol. 3, no. 2, May.
[6] S. Dutt and F. Hanchek, "REMOD: A New Hardware- and Time-Efficient Methodology for Designing Fault-Tolerant Arithmetic Circuits," IEEE Trans. VLSI Systems, vol. 5, Mar.
[7] S. Dutt and J.P. Hayes, "Some Practical Issues in the Design of Fault-Tolerant Multiprocessors," IEEE Trans. Computers, vol. 41, no. 5, May 1992.
[8] S. Dutt and N.R. Mahapatra, "Node Covering, Error Correcting Codes and Multiprocessors with Very High Average Fault Tolerance," IEEE Trans. Computers, vol. 46, no. 9, Sept.
[9] F. Hanchek and S. Dutt, "Node-Covering Based Defect and Fault Tolerance Methods for Increased Yield in FPGAs," Proc. Int'l Conf. VLSI Design, pp. 5-9, Jan.
[10] F. Hanchek and S. Dutt, "Design Methodologies for Tolerating Cell and Interconnect Faults in FPGAs," Proc. Int'l Conf. Computer Design, Oct.
[11] F. Hanchek and S. Dutt, "Design Methodologies for Tolerating Logic and Interconnect Faults in FPGAs," technical report, Sept. 1997.
[12] N. Hastie and R. Cliff, "The Implementation of Hardware Subroutines on Field Programmable Gate Arrays," Proc. IEEE Custom Integrated Circuits Conf.
[13] F. Hatori et al., "Introducing Redundancy in Field Programmable Gate Arrays," Proc. IEEE Custom Integrated Circuits Conf.
[14] N. Howard, A. Tyrrell, and N. Allinson, "The Yield Enhancement of Field-Programmable Gate Arrays," IEEE Trans. VLSI Systems, vol. 2, no. 1, Mar.
[15] W.K. Huang, X.T. Chen, and F. Lombardi, "On the Diagnosis of Programmable Interconnect Systems: Theory and Applications," Proc. 14th IEEE VLSI Test Symp.
[16] W.K. Huang and F. Lombardi, "An Approach for Testing Programmable/Configurable Field Programmable Gate Arrays," Proc. 14th IEEE VLSI Test Symp.
[17] M. Khellah, S. Brown, and Z. Vranesic, "Modelling Routing Delays in SRAM-based FPGAs," Proc. CCVLSI, pp. 6B.13-6B.18, Nov.
[18] V. Kumar, A. Dahbura, F. Fisher, and P. Juola, "An Approach for the Yield Enhancement of Programmable Gate Arrays," Proc. IEEE Int'l Conf. Computer-Aided Design, pp. 6-9, Nov.
[19] J. McDonald, B. Philhower, and H. Greub, "A Fine Grained, Highly Fault-tolerant System Based on WSI and FPGA Technology," Proc. Oxford 1991 Int'l Workshop Field Programming Logic Applications, FPGAs, W. Moore and W. Luk, eds. Abingdon, England: Abingdon EE & CS Books.
[20] J. Narasimhan, K. Nakajima, C. Rim, and A. Dahbura, "Yield Enhancement of Programmable ASIC Arrays by Reconfiguration of Circuit Placements," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 13, no. 8, Aug.
[21] K. Roy and S. Nag, "On Routability for FPGAs under Faulty Conditions," IEEE Trans. Computers, vol. 44, no. 11, pp. 1,296-1,305, Nov.
[22] C. Stroud, S. Konala, P. Chen, and M. Abramovici, "Built-In Self-Test of Logic Blocks in FPGAs (Finally, a Free Lunch: BIST without Overhead!)," Proc. 14th IEEE VLSI Test Symp.
[23] C. Stroud, E. Lee, and M. Abramovici, "BIST-Based Diagnostics of FPGA Logic Blocks," Proc. Int'l Test Conf.
[24] Field-Programmable Gate Array Technology, S. Trimberger, ed. Boston: Kluwer Academic.
[25] S. Trimberger, Xilinx Corporation, private communication, July.
[26] J. Turner, "FPGA Yield Enhancement through Redundancy," unpublished presentation at Second ACM Int'l Workshop Field Programmable Gate Arrays, reported in SIGDA Newsletter, vol. 4, nos. 1/2.
[27] Xilinx, Inc., The Programmable Logic Data Book.
[28] Private communication with Xilinx engineers when S. Dutt visited Xilinx, Aug.


Fran Hanchek (S'75-M'82) received the BS and MS degrees in electrical engineering from Michigan Technological University, Houghton, in 1976 and 1981, respectively. He worked for American Edwards Laboratories of Santa Ana, California, in medical electronics and for Dart Container Corporation of Mason, Michigan, in industrial control before receiving the PhD degree in electrical engineering from the University of Minnesota, Minneapolis. Currently, he works for Intel Corporation in Aloha, Oregon, doing integrated circuit design. He is a licensed professional engineer and a member of the IEEE.

Shantanu Dutt (S'87-M'90) received the BE degree in electronics and communication engineering from the M.S. University of Baroda, India, in 1983, the MTech degree in computer engineering from the Indian Institute of Technology, Kharagpur, India, in 1984, and the PhD degree in computer science and engineering from the University of Michigan, Ann Arbor. He was a research and development engineer at CMC Ltd., Secunderabad, India. He is currently an associate professor in the Department of Electrical Engineering and Computer Science, University of Illinois-Chicago. He was previously with the Department of Electrical Engineering, University of Minnesota, Twin Cities. He was awarded a National Merit Scholarship by the Government of India, a University Fellowship by the M.S. University of Baroda, a Rackham Predoctoral Fellowship by the University of Michigan, and a Research Initiation Award by the U.S. National Science Foundation. His current technical interests include CAD for VLSI circuits, parallel and distributed computing, fault-tolerant computing, and computer architecture. He has published about 40 papers in archival journals and refereed conferences in all these areas.
He received a Best-Paper award at the Design Automation Conference. One of his papers in the area of fault-tolerant multiprocessor design was selected to be among the most influential papers published over the last 25 years of the IEEE International Fault-Tolerant Computing Symposium. He has been a session chair for a number of conferences. He is on the program committee of the Fault-Tolerant Computing Symposium, 1997 and 1998, and of the International Conference on Parallel Processing. Dr. Dutt was on a U.S. National Science Foundation panel for selecting CAREER awards. He has contributed an invited article on Roundoff Errors to the Wiley Encyclopedia of Electrical and Electronics Engineering, to appear. He is a member of the IEEE, the IEEE Computer Society, and the ACM Special Interest Groups on Computer Architecture and Design Automation.
