
WHICH CONCURRENT ERROR DETECTION SCHEME TO CHOOSE?

Subhasish Mitra and Edward J. McCluskey
Center for Reliable Computing
Departments of Electrical Engineering and Computer Science
Stanford University, Stanford, California
http://crc.stanford.edu

Abstract

Concurrent error detection (CED) techniques (based on hardware duplication, parity codes, etc.) are widely used to enhance system dependability. All CED techniques introduce some form of redundancy. Redundant systems are subject to common-mode failures (CMFs). While most studies of CED techniques focus on area overhead, few analyze the CMF vulnerability of these techniques. In this paper, for the first time, we present simulation results to quantitatively compare various CED schemes based on their area overhead and the protection (data integrity) they provide against multiple failures and CMFs. Our results indicate that, for the simulated combinational logic circuits, although diverse duplex systems (with two different implementations of the same logic function) sometimes have marginally higher area overhead, they provide significant protection against multiple failures and CMFs compared to other CED techniques like parity prediction.

1. Introduction

Concurrent Error Detection (CED) techniques are widely used to enhance system dependability [Sellers 68, Kraft 81, Hsiao 81, Rao 89, Chen 92, Pradhan 96, Webb 97, Spainhower 99]. Almost all CED techniques function according to the following principle. Let us suppose that the system under consideration realizes a function f and produces output f(i) in response to an input sequence i. A CED scheme generally contains another unit which independently predicts some special characteristic of the system output f(i) for every input sequence i. Finally, a checker unit checks whether the special characteristic of the output actually produced by the system in response to input sequence i is the same as the one predicted, and produces an error signal when a mismatch occurs. Some examples of the characteristics of f(i) are: f(i) itself, its parity, 1s count, 0s count, transition count, etc.
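The principle above can be sketched in a few lines of Python. This is only an illustrative model, not the authors' implementation: the 2-input, 3-output function `f`, the stuck-at fault in `f_faulty`, and all helper names are assumptions made for the example, and the predictor is modelled as a fault-free evaluation of the chosen characteristic.

```python
from itertools import product

def f(x):
    """Hypothetical fault-free function: 2 inputs, 3 outputs."""
    a, b = x
    return (a & b, a | b, a ^ b)

def f_faulty(x):
    """Same logic with output bit 0 stuck at 1 (illustrative single fault)."""
    return (1,) + f(x)[1:]

# Output characteristics mentioned in the text: f(i) itself, parity, 1s count.
CHARACTERISTICS = {
    "identity": lambda out: out,
    "parity": lambda out: sum(out) % 2,
    "ones_count": lambda out: sum(out),
}

def run_ced(system, characteristic, x):
    """Return (output, error_signal) for one input x."""
    out = system(x)
    predicted = characteristic(f(x))  # independent predictor, assumed fault-free
    return out, characteristic(out) != predicted

def data_integrity_preserved(system, characteristic):
    """True if every incorrect output raises the error signal."""
    for x in product((0, 1), repeat=2):
        out, error = run_ced(system, characteristic, x)
        if out != f(x) and not error:
            return False
    return True

if __name__ == "__main__":
    for name, characteristic in CHARACTERISTICS.items():
        print(name, data_integrity_preserved(f_faulty, characteristic))
```

Because the assumed fault corrupts a single output bit, all three characteristics preserve data integrity in this toy case; the schemes differ once a fault can corrupt several bits at once.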
The architecture of a general CED scheme is shown in Fig. 1.1. Any CED scheme is characterized by the class of failures in the presence of which the system data integrity is preserved. By data integrity, we mean that the system either produces correct outputs or indicates erroneous situations when incorrect outputs are produced. In the literature on fault tolerance, this property has been referred to as the fault-secure property [Siewiorek 92]. It may be noted that the general architecture of a CED scheme such as Fig. 1.1 relies on the use of hardware redundancy (predictor and checker circuits) [Pradhan 96] for error detection. Time redundancy techniques like alternate-data-retry and recomputation with shifted operands [Shedletsky 78, Patel 82] can also be used for concurrent error detection. Time redundancy directly affects the system performance, although the hardware cost is generally less than that of hardware redundancy. The focus of this paper is on CED techniques using hardware redundancy.

[Figure 1.1 blocks: Input, Function f, Output, Output Characteristic Predictor, Predicted Output Characteristic, Checker, Error]
Figure 1.1. General architecture of a concurrent error detection scheme

Several CED schemes have been proposed and used commercially for designing reliable computing systems [Hsiao 81, Chen 92, Webb 97, Spainhower 99]. These techniques mainly differ in their error-detection capabilities and the constraints they impose on the system design. There are many publications on system design with concurrent error detection. These include designs of datapath circuits (like adders, multipliers, etc.) [Sellers 68, Nicolaidis 93, Nicolaidis 97], and general combinational and sequential logic circuits [Aksenova 75, Jha 93, De 94, Touba 97, Zeng 99] with concurrent error detection. Checker circuit designs for concurrent error detection are described in [Wakerly 78, McCluskey 90]. Almost all publications on CED focus on their area/performance overhead. Reliability analysis of systems with concurrent error detection is presented in [Ramamoorthy 75].
However, the systems considered are restricted to those with redundancy through replication.

All the above-mentioned CED techniques guarantee system data integrity against single faults. However, these CED schemes are vulnerable to multiple faults and common-mode failures. Common-mode failures are a special and very important cause of multiple faults. Common-mode failures (CMFs) produce multiple faults, occurring generally due to a single cause; the system data integrity is not guaranteed in the presence of CMFs. These include design mistakes and operational failures that may be due to external (such as EMI, power-supply disturbances and radiation) or internal causes [Avizienis 84, Lala 94]. CMFs in redundant VLSI systems are surveyed in [Mitra 00a]. Design diversity has been proposed in the past to protect redundant systems against common-mode failures. While most of the previous efforts towards a definition of design diversity were qualitative, in an earlier paper [Mitra 99a] we developed a metric to quantify diversity among several designs and used this metric to analyze the reliability of redundant systems in the presence of CMFs.

It may be argued that, unlike systems with duplication, concurrent error detection techniques based on error detecting codes (e.g., parity) introduce inherent diversity in the system. Thus, qualitatively, these systems must be well protected against CMFs. The problem studied in this paper is to compare five CED techniques for general combinational logic circuits based on their area overhead and their vulnerability to multiple failures and CMFs. The CED techniques considered are those based on identical and diverse duplication, parity prediction, and Berger and Bose-Lin codes. These techniques are general and can be used for any system, unlike some other application-specific error detection techniques such as [Mahmood 84, Jou 88, Huang 00].

This paper is organized as follows. Section 2 presents a brief overview of various CED techniques. In Sec. 3, we present simulation results to compare these CED techniques. Section 4 describes analysis techniques to quantify the vulnerability of various CED schemes to multiple failures and CMFs. Some attempts to explain the simulation results of Sec. 3 and some open questions are reported in Sec. 5.
The use of transition counting and residue codes for concurrent error detection is discussed in Sec. 6. Section 7 presents a system-level view of the CED techniques studied in this paper. Finally, we conclude in Sec. 8.

2. An Overview of Various CED Techniques

2.1. Duplex System

A duplex system is an example of a classical redundancy scheme that can be used for concurrent error detection [Sellers 68, Kraft 81, Sedmak 78]. Figure 2.1 shows the basic structure of a duplex system. Duplication has been used for concurrent error detection in numerous systems including the Bell Switching System [Kraft 81], systems from companies like Stratus and Sequoia [Pradhan 96] and also in the IBM G5 and G6 processors [Webb 97, Spainhower 99].

In any duplex system there are two modules (shown in Fig. 2.1 as Module 1 and Module 2) that implement the same logic function. The two implementations are not necessarily the same. A comparator is used to check whether the outputs from the two modules agree. If the outputs disagree, the system indicates an error. For a duplex system, data integrity is preserved as long as both modules do not produce identical errors (assuming that the comparator is fault-free). Since the comparator is crucial to the correct operation of the duplex system, special self-checking comparator designs (e.g., two-rail checker [McCluskey 90]) that guarantee data integrity against single comparator faults must be used.

[Figure 2.1 blocks: Module 1, Module 2, Comparator]
Figure 2.1. A Duplex System

2.2. Parity Prediction

Parity prediction is a widely used CED technique. The even/odd parity function indicates whether the number of 1s in a set of binary digits is even or odd. Techniques for designing datapath logic circuits and general combinational circuits with parity prediction have been described in [Sellers 68, Kraft 81, Nicolaidis 93, Nicolaidis 97, De 94, Touba 97]. CED techniques with parity prediction in sequential circuits are described in [Zeng 99].

Figure 2.2 shows the basic architecture of a system with concurrent error detection using a single parity bit.
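A minimal sketch of single-bit parity checking is given here. It illustrates a general property of parity, namely that any odd number of output bit errors is detected while any even number escapes, which is why the number of outputs a single fault can corrupt matters. The output vector, the bit positions that are flipped, and the helper names are hypothetical, and the parity predictor is assumed fault-free.

```python
def parity(bits):
    """Parity of a bit vector: 1 if the number of 1s is odd, else 0."""
    return sum(bits) % 2

def parity_checker(outputs, predicted_parity):
    """Error signal: True when actual and predicted parity disagree."""
    return parity(outputs) != predicted_parity

def flip(bits, positions):
    """Return a copy of bits with the given positions inverted."""
    return tuple(b ^ 1 if i in positions else b for i, b in enumerate(bits))

correct = (1, 0, 1, 1)           # hypothetical fault-free outputs Z1..Z4
predicted = parity(correct)      # parity predictor, assumed fault-free

one_error = flip(correct, {2})       # fault confined to one logic cone
two_errors = flip(correct, {1, 2})   # fault in logic shared by two cones

print("fault-free flagged:", parity_checker(correct, predicted))
print("one-bit error flagged:", parity_checker(one_error, predicted))
print("two-bit error flagged:", parity_checker(two_errors, predicted))
```

In this sketch the two-bit error leaves the parity unchanged, so the checker raises no error signal even though the output word is wrong.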
The circuit has m outputs and is designed in such a way that there is no sharing among the logic cones generating each of the outputs. Thus, a single fault can affect at most one output bit position. The parity of the outputs is predicted independently. The parity checker checks whether the actual parity of the outputs matches the predicted parity [McCluskey 90].

[Figure 2.2 blocks: Outputs Z1 - Zm, Predicted Parity P, Parity Checker]
Figure 2.2. Parity prediction using a single parity bit

The restriction of no logic sharing among different logic cones can result in large area overhead for circuits with a single parity bit. Hence, the idea of using a single parity bit has been extended to multiple parity bits. This technique partitions the primary outputs into different parity groups. Sharing is allowed only among logic cones of the outputs that belong to different parity groups. There is a parity bit associated with the outputs in each parity group. The outputs of each parity group are checked using a parity checker. Figure 2.3 shows the general structure of a combinational logic circuit with two parity groups.

[Figure 2.3 blocks: Outputs Z1 - Zk with predicted parity P1 and a Parity Checker; Outputs Zk+1 - Zm with predicted parity P2 and a Parity Checker]
Figure 2.3. Multiple parity bits for concurrent error detection

In the circuit of Fig. 2.3, there are two parity groups G1 and G2. The parity group G1 contains the outputs Z1, ..., Zk. P1 is the predicted parity for this parity group. It predicts the parity of the primary outputs in G1. The parity group G2 contains the outputs Zk+1, ..., Zm. P2 is the predicted parity bit associated with this parity group. There is sharing between logic cones corresponding to the outputs Zk and Zk+1. No logic sharing is allowed among the cones corresponding to outputs Z1, ..., Zk (Zk+1, ..., Zm). Sharing is allowed among logic cones corresponding to other output groups such as Zh and Zj, 1 ≤ h ≤ k, k+1 ≤ j ≤ m.

2.3. Unidirectional Error Detecting Codes

CED techniques based on unidirectional error detecting codes have been proposed in the past. A unidirectional error detecting code assumes that all errors are unidirectional; i.e., they change 0s to 1s or 1s to 0s but never both at the same time. Two unidirectional error detecting codes used for concurrent error detection are Berger codes [Berger 61] and Bose-Lin codes [Bose 85].

For the Berger code, a code-word is formed by appending a binary string representing the number of 0s (or the bit-wise complement of the number of 1s) in the given information word. Thus, for an information word consisting of n bits, the Berger code requires $\lceil \log_2 (n+1) \rceil$ extra bits to represent the number of 0s (or the bit-wise complement of the number of 1s) in the information word. The Berger code has the capability of detecting all unidirectional errors. Figure 2.4 shows a concurrent error detection technique using Berger codes. Since the Berger code is a unidirectional error detection code, it is important to ensure that a single fault causes unidirectional errors at the outputs. This imposes a restriction that the logic circuits should be synthesized in such a way that they are inverter-free [Jha 93]. Inverters can only appear at the primary inputs. In general, for Berger codes used to detect unidirectional errors on communication channels, the check-bits represent the bit-wise complement of the number of 1s in the information word. However, since concurrent error detection techniques are designed to guarantee data integrity in the presence of single faults, a single fault can affect either the actual logic function or the logic circuit that predicts the number of 1s at the output, but never both at the same time (since there is no logic sharing between the actual circuit and the circuit that predicts the number of 1s). Thus, we need not obtain a bit-wise complementation of the number of 1s [Das 98]. The checker design for Berger codes is described in [Marouf 78].

[Figure 2.4 blocks: Logic Function (inverter-free) producing Output; Predict 1s count (inverter-free); Checker]
Figure 2.4. Concurrent Error Detection Using Berger Codes

[Figure 2.5 blocks: Logic Function (inverter-free, max. fanout: 2 outputs) producing Output; Predict 1s count mod 4 (inverter-free, max. fanout: 2 outputs); Checker]
Figure 2.5. Concurrent error detection using Bose-Lin codes

Bose-Lin codes are capable of detecting t-bit unidirectional errors in the code-word. The constructions of Bose-Lin codes for t = 2 and t = 3 are given in [Bose 85]. Design of logic circuits with concurrent error detection based on Bose-Lin codes has been reported in [Das 98]. Figure 2.5 shows the architecture of a system with concurrent error detection based on a 2-bit unidirectional error detecting Bose-Lin code. Just like Berger codes, we want the circuit to be inverter-free (except at the primary inputs) so that any single fault creates unidirectional errors at the outputs. We also need a restriction on the amount of logic sharing since the code is capable of detecting at most 2 unidirectional errors. The restriction is that any logic gate in the circuit can be shared by the logic cones of at most two primary outputs. Checker circuits for Bose-Lin codes can be obtained from [Jha 91].

3. Simulation Results

In this section, we provide simulation results to compare the five CED schemes (identical and diverse duplication, parity prediction, Berger codes and Bose-Lin codes) described in Sec. 2, based on their area overhead and their vulnerability to multiple failures and CMFs. The simulation results show the superiority of diverse duplication over other conventional CED schemes for the simulated designs.

We considered some combinational logic circuits from the MCNC 91 benchmark suite for simulation purposes. We used the Sis tool [Sentovich 92] for synthesizing circuits. For designing a diverse duplex system (with different implementations), we generated truth tables with complemented outputs and synthesized them using Sis. Finally, we added inverters at the outputs of the resulting implementation. For duplex systems, all the synthesis optimizations can be applied. We used espresso for two-level minimization and script.rugged available with the Sis tool for multi-level optimization, and mapped the circuits to the LSI Logic G10p technology library [LSI 96]. For the CED scheme with parity prediction we used the technique in [Touba 97]. For synthesizing circuits with Berger codes, we must ensure that the individual circuits are inverter-free. The synthesis technique has been described in [Jha 93]. We used algebraic transformations (using script.algebraic available with Sis) during multi-level logic synthesis so that the circuits are inverter-free. For synthesizing circuits with Bose-Lin codes, a similar approach was used.
However, we have to limit the fanout structure such that a gate can be shared by a maximum of two output functions. The technique in [Das 98] was used for synthesizing logic circuits with Bose-Lin codes.

Table 3.1. Area overhead of various CED schemes

Circuit   Identical Duplex   Diverse Duplex   Parity   Berger Code   Bose-Lin
Z5xp1     8                  836              840      1335          1068
inc       743                751              69       854           807
squar5    507                485              465      67            570
ex5.0     646                649              593      815           755
misex1    41                 43               367      468           488
sao2      754                787              748      983           864
rd73      474                480              683      853           763
rd84      64                 684              971      1135          1056

Table 3.1 shows a comparison of the area overhead (in terms of the G10p cell areas reported by the Sis tool) of various CED schemes for eight MCNC benchmark circuits. It is clear from Table 3.1 that the area overhead of CED techniques based on Berger codes and Bose-Lin codes is much higher than that of techniques based on parity prediction or duplication. For many circuits, the area overhead of parity prediction is marginally less than that of duplication. Similar observations have been made in [Zeng 99]. Hence, for the rest of this paper we focus mainly on CED techniques based on duplication and parity prediction.

Next, we present simulation results on the vulnerability of CED techniques to multiple failures and CMFs (permanent or temporary). First, we consider the case of permanent faults. In dependable systems, it is realistic to assume that a corrective action is initiated after the system generates an error signal. Thus, for any system with concurrent error detection, data integrity is guaranteed as long as the system does not produce an undetected corrupt output before indicating the presence of an error. In the following discussion, we focus on systems consisting of combinational logic circuits. However, the entire discussion can be extended to sequential logic circuits.

[Figure 3.7: Venn diagram of output events: Correct Outputs ($y_{i,j}$), Detected Errors ($z_{i,j}$), Undetected Errors ($1 - y_{i,j} - z_{i,j}$)]
Figure 3.7. Various components of output events of a system with CED along with their probabilities

The probability that the data integrity of a combinational logic system is guaranteed up to time t in the presence of a fault pair $(f_i, f_j)$ is derived in the following way. Given an input distribution, let us suppose that the probability that the system produces correct outputs in the presence of $(f_i, f_j)$ is $y_{i,j}$; the probability that the system produces incorrect outputs that can be detected is $z_{i,j}$. Figure 3.7 shows a Venn diagram to explain $y_{i,j}$ and $z_{i,j}$. Assuming that the fault pair is permanent, the probability that the system data integrity is guaranteed up to time t (after the occurrence of the fault pair) is:

$$ y_{i,j}^{t} + \sum_{k=1}^{t} y_{i,j}^{k-1} z_{i,j} = y_{i,j}^{t} + \frac{z_{i,j}}{1 - y_{i,j}} \left( 1 - y_{i,j}^{t} \right) $$

The above expression can be derived from the fact that the system must either produce correct outputs up to time t or indicate an error signal for the first time without producing any corrupt data before t. From the above expression for data integrity, it is clear that the term $w_{i,j} = \frac{z_{i,j}}{1 - y_{i,j}}$ plays an important role in determining the system data integrity up to time t. This term $w_{i,j}$, the detected fraction, is the fraction of output error events detected in the presence of the fault pair $(f_i, f_j)$. If the value of this term is 1, the system either produces correct outputs or indicates erroneous situations when incorrect outputs are produced. If the value is 0, the system never produces any error signal when incorrect outputs are produced. Note that, if a CED-based system produces correct outputs for all input combinations even in the presence of a fault, then the fault is redundant.

We used the following procedure to estimate the protection against multiple and common-mode failures provided by CED techniques based on duplication and parity prediction. For each single-stuck-at fault $f_i$ in each of these circuits, we simulated exhaustively all fault pairs and input combinations to identify another single-stuck-at fault $f_j$ in the same circuit that had the minimum value of $w_{i,j}$. Hence, the fault pair $(f_i, f_j)$ can be regarded as a worst-case permanent fault pair. Finally, we averaged the $w_{i,j}$ values over all the worst-case permanent fault pairs to obtain the average value of the worst-case detected fraction of incorrect outputs. Such a metric is pessimistic because we are considering the worst-case permanent fault pairs. The results are shown in Table 3.2. The benchmark circuits are small enough so that exhaustive simulation is possible.

Table 3.2. Average value of the detected fraction of incorrect outputs for the worst-case permanent faults

Circuit   Identical Duplex   Diverse Duplex   Parity
Z5xp1     0                  0.70             0.46
inc       0                  0.68             0.45
squar5    0                  0.55             0.53
ex5.0     0                  0.30             0.0
misex1    0                  0.54             0.40
sao2      0                  0.60             0.06
rd73      0                  0.60             0.40
rd84      0                  0.66             0.51

Table 3.2 demonstrates the advantages of using diverse duplex systems over other CED schemes. It may be noted that for diverse duplex systems, we found several worst-case permanent fault pairs with the value of $w_{i,j}$ equal to 1. This means that, even in the worst case, system data integrity is guaranteed for these fault pairs in the diverse duplex system. However, we did not find any such worst-case permanent fault pairs for systems with parity prediction.
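The data-integrity expression for permanent fault pairs can be checked numerically. The sketch below evaluates the sum term by term and compares it with the closed form $y^t + w(1 - y^t)$, where $w = z/(1-y)$ is the detected fraction; the function names and the sample values of y, z and t are illustrative assumptions, not values from the simulations.

```python
def integrity_by_sum(y, z, t):
    """y^t + sum_{k=1..t} y^(k-1) * z, evaluated term by term."""
    return y ** t + sum(y ** (k - 1) * z for k in range(1, t + 1))

def detected_fraction(y, z):
    """w = z / (1 - y); only defined for y < 1 (non-redundant fault pairs)."""
    return z / (1.0 - y)

def integrity_closed_form(y, z, t):
    """y^t + w * (1 - y^t), the closed form of the geometric sum."""
    return y ** t + detected_fraction(y, z) * (1.0 - y ** t)

if __name__ == "__main__":
    y = 0.5  # hypothetical probability of a correct output per cycle
    for z in (0.0, 0.3, 0.5):  # hypothetical detected-error probabilities
        for t in (1, 10, 100):
            print(z, t,
                  round(integrity_by_sum(y, z, t), 6),
                  round(integrity_closed_form(y, z, t), 6))
```

For large t the value approaches w, which is why the detected fraction is the natural figure of merit for permanent faults: with w = 1 integrity is always preserved, and with w = 0 it decays as $y^t$.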
In addition to the above worst-case analysis, we studied the CED techniques in various other ways. These simulations also indicate that diverse duplication provides better data integrity compared to identical duplication or parity prediction against multiple and common-mode failures. In Table 3.3a (3.3b), for the above benchmark circuits, we report the percentage of all worst-case permanent fault pairs in parity prediction (diverse duplication) with greater or equal detectability of incorrect outputs compared to the worst-case permanent fault pairs in diverse duplication (parity prediction). For example, for the Z5xp1 circuit, 80% of the worst-case permanent fault pairs in diverse duplication have values of incorrect output detectability greater than or equal to those in parity prediction (Table 3.3b).

Table 3.3. (a) Percentage of worst-case permanent fault pairs in parity prediction with greater (or equal) incorrect output detectability compared to those in diverse duplication. (b) Percentage of worst-case permanent fault pairs in diverse duplication with greater (or equal) incorrect output detectability compared to those in parity prediction

Circuit   (a) Percentage   (b) Percentage
Z5xp1     0 %              80 %
inc       %                81 %
squar5    49 %             55 %
ex5.0     50 %             60 %
misex1    35 %             70 %
sao2      10 %             90 %
rd73      5 %              77 %
rd84      3 %              77 %

It may be argued that CMFs and multiple failures may have temporary effects and that it may be inaccurate to model them as permanent faults. Next, we present simulation results to compare the vulnerability of various CED schemes to temporary CMFs and multiple failures (possibly due to transient failures like radiation upsets, power-supply disturbances, etc. or intermittent failures) that persist for a single clock cycle.
The vulnerability of a CED scheme to a fault pair $(f_i, f_j)$ resulting from such a failure is given by $d_{i,j}$, which is the conditional probability that the system either produces correct outputs or generates an error signal if an incorrect output is produced in the presence of $(f_i, f_j)$ for a given input distribution. Note that, for duplex systems, $d_{i,j}$ is the same as the diversity with respect to the fault pair $(f_i, f_j)$ as described in [Mitra 99a].

Table 3.4. Average value of $d_{i,j}$ over the worst-case temporary fault pairs

Circuit   Diverse Duplex   Parity Prediction
Z5xp1     0.90             0.70
inc       0.9              0.78
squar5    0.90             0.86
ex5.0     0.89             0.68
misex1    0.90             0.70
sao2      0.93             0.57
rd73      0.90             0.80
rd84      0.88             0.70

For simulation purposes, in each of these benchmark circuits with CED, for each single-stuck-at fault $f_i$, we simulated exhaustively all fault pairs and input combinations to identify another single-stuck-at fault $f_j$ in the same circuit that had the minimum value of $d_{i,j}$. Hence, the fault pair $(f_i, f_j)$ can be regarded as a worst-case temporary fault pair. Finally, we averaged the $d_{i,j}$ values over all the worst-case temporary fault pairs. These numbers are reported in Table 3.4 for CED schemes based on diverse duplication and parity prediction. The benchmark circuits are small enough so that exhaustive simulation is possible.

The simulation results in this section demonstrate the advantages of diverse duplication in providing protection against multiple failures and CMFs compared to other CED schemes. However, the major problem with diverse duplication is to develop techniques for synthesizing logic functions with diversity. We have investigated some techniques for designing two implementations of any given combinational logic circuit in order to maximize diversity [Mitra 00b].

4. Analysis of Vulnerability to Multiple Failures and CMFs

The vulnerability of a duplex system to multiple failures and CMFs can be quantified using the design diversity metric presented in [Mitra 99a, Mitra 99b] and is not repeated here. For systems with parity prediction, if a single fault or multiple faults affect a single logic cone, the data integrity of the system is preserved. However, if a failure causes faults in two logic cones in the same parity group, the data integrity is not guaranteed. A CED technique with a single parity bit is one with a single parity group.

Referring to Fig. 2.3, let us suppose that fault $f_i$ affects the parity prediction logic of P2 and that fault $f_j$ affects the part of the logic that is shared by primary outputs Zk and Zk+1. Let $V_i$ be the set of input combinations in response to which the parity prediction logic produces an incorrect output in the presence of $f_i$. Similarly, let $V_j$ be the set of input combinations in response to which the cone of logic affected by $f_j$ produces an incorrect output for Zk+1 but not for Zk. If the fault $f_j$ produces errors on output Zk, the error will be detected by the parity bit P1 and data integrity will be preserved. Assuming that all input combinations are equally likely, the value of $d_{i,j}$ is $1 - \frac{|V_i \cap V_j|}{2^n}$, where n is the number of inputs of the logic circuit in Fig. 2.3.

The system considered in Table 4.1 has four outputs and two parity groups.
The first parity bit is the parity of the first two outputs and the second parity bit is the parity of the remaining two outputs. Also assume that there is sharing between the logic cones of the first and the third output functions. The fault-free outputs and the corresponding parity bits are shown in Table 4.1. Let us suppose that a CMF manifests itself as a single-stuck-at fault pair $(f_i, f_j)$, where $f_i$ affects the logic shared by the first and third primary outputs and $f_j$ affects the logic cone that predicts the first parity bit. The faulty outputs are shown in the last two columns of Table 4.1. For the first and the fourth input combinations, an error will be reported by the parity checker corresponding to the second parity bit. For the second input combination, an error will be reported by the checker corresponding to the first parity bit. For the third input combination, the system produces erroneous outputs and none of the checkers can detect this erroneous situation. Thus, the data integrity is compromised for the third input combination only.

Table 4.1. Diversity calculation for the fault pair $(f_i, f_j)$ in a CED scheme with two parity bits

Inputs   Fault-free outputs   Fault-free parity   Faulty outputs   Faulty parity
00       0 1 0 0              0 1                 1 1 1 0          1 1
01       1 0 0 1              0 0                 1 0 0 1          1 0
10       0 0 1 1              1 1                 1 0 1 1          0 1
11       1 1 1 1              1 1                 1 1 0 1          0 1

Note that, for any fault pair $(f_i, f_j)$, $d_{i,j} = y_{i,j} + z_{i,j}$. Hence, $y_{i,j}$ and $z_{i,j}$ can also be calculated for the fault pair $(f_i, f_j)$ using techniques similar to those used for calculating $d_{i,j}$.

5. Theoretical Analysis and Open Questions

In this section, we present some attempts to provide a theoretical explanation of the simulation results reported in Sec. 3. It is clear from our discussions in Sec. 4 that the analysis of vulnerability of various CED schemes to multiple failures and CMFs is dependent on the $d_{i,j}$ and $z_{i,j}$ values of different faults. Given these values, the analysis is simple.
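As a concrete instance of such a calculation, the sketch below recomputes $y_{i,j}$, $z_{i,j}$ and $d_{i,j}$ for the fault pair of Table 4.1, assuming the four input combinations are equally likely. Two points are assumptions of this sketch rather than statements from the paper: the parity bits are treated as odd parity (inferred from the fault-free rows of the table), and the helper names are invented for illustration.

```python
from fractions import Fraction

# input: (fault-free outputs, fault-free parity, faulty outputs, faulty parity)
TABLE_4_1 = {
    "00": ((0, 1, 0, 0), (0, 1), (1, 1, 1, 0), (1, 1)),
    "01": ((1, 0, 0, 1), (0, 0), (1, 0, 0, 1), (1, 0)),
    "10": ((0, 0, 1, 1), (1, 1), (1, 0, 1, 1), (0, 1)),
    "11": ((1, 1, 1, 1), (1, 1), (1, 1, 0, 1), (0, 1)),
}
GROUPS = ((0, 1), (2, 3))  # output indices covered by parity bits 1 and 2

def odd_parity(bits):
    """Parity bit that makes the total number of 1s odd (assumed convention)."""
    return 1 - sum(bits) % 2

def checker_flags(outputs, parity_bits):
    """One boolean per parity group: True when that checker reports an error."""
    return [odd_parity([outputs[i] for i in group]) != p
            for group, p in zip(GROUPS, parity_bits)]

def classify(row):
    """Label a table row as 'correct', 'detected' or 'undetected'."""
    good_out, _good_par, bad_out, bad_par = row
    if bad_out == good_out:
        return "correct"
    return "detected" if any(checker_flags(bad_out, bad_par)) else "undetected"

def y_z_d(table):
    """Return (y, z, d) with d = y + z for equally likely inputs."""
    kinds = [classify(row) for row in table.values()]
    y = Fraction(kinds.count("correct"), len(kinds))
    z = Fraction(kinds.count("detected"), len(kinds))
    return y, z, y + z

if __name__ == "__main__":
    print(y_z_d(TABLE_4_1))
```

Under these assumptions only input 10 yields an undetected incorrect output, so three of the four input combinations preserve data integrity.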
However, it may be very difficult to deduce relationships among the sets of $d_{i,j}$ and $z_{i,j}$ values of faults in a system with hardware duplication and a system with parity checking. This is because the constraints used to synthesize the systems with different CED techniques are different. For example, CED techniques based on a single parity bit do not allow any logic sharing among the logic cones corresponding to different outputs. On the other hand, CED techniques using hardware duplication do not impose any fanout or logic sharing restrictions within a module. It has been demonstrated in [Mitra 99b, Mitra 00b] that fanout restrictions and logic sharing affect the detectability and the $d_{i,j}$ values of fault pairs in redundant systems.

In this section, we analyze these systems based on simplistic error models (e.g., the Bernoulli or the q-ary error model used for signature analysis [Saxena 97]) to provide an insight into the simulation results. Note that these error models have many drawbacks and hence the simplistic assumptions associated with these models are questionable [Saxena 97].

For the Bernoulli error model with parameter p, it is assumed that the probability that a fault produces an error on any output bit is p and is independent of errors on other outputs. Thus, the probability that a particular error vector (obtained by XOR-ing the fault-free and the faulty output) with i errors appears (for an n-output circuit) is $p^{i} (1-p)^{n-i}$. The probability that any arbitrary circuit produces erroneous outputs in the presence of a fault $f_i$ is

$$ \sum_{i=1}^{n} \binom{n}{i} p^{i} (1-p)^{n-i}. $$

Hence, for a system with diverse duplication, the expected value of $(1 - d_{i,j})$ for $(f_i, f_j)$ is

$$ \sum_{i=1}^{n} \binom{n}{i} \left[ p^{i} (1-p)^{n-i} \right]^{2} = \left[ p^{2} + (1-p)^{2} \right]^{n} - (1-p)^{2n}. $$

This is because the system data integrity is not preserved only when both modules produce identical errors.

For a system with parity prediction using a single parity bit, consider a fault pair $(f_i, f_j)$ where $f_i$ and $f_j$ affect logic cones corresponding to outputs g and h, respectively. If the probability that $f_i$ ($f_j$) produces an error in any output bit for a general logic circuit with no restrictions on logic sharing is p, then the probability that both $f_i$ and $f_j$ produce errors at outputs g and h, respectively, at the same time is $p^{2}$. Since there is no sharing among logic cones in a circuit with parity prediction, the probability that $f_i$ ($f_j$) produces an error on output g (h) is p; however, the probability that $f_i$ ($f_j$) produces an error on any other output is 0. Thus, the expected value of $(1 - d_{i,j})$ for a fault pair $(f_i, f_j)$ is $p^{2}$. Note that the detectability values of faults $f_i$ and $f_j$ have reduced drastically for the system with parity prediction; this will produce lower values of $(1 - d_{i,j})$ compared to diverse duplication for practical values of p (< 1). The simulation results contradict this prediction, because the assumption of independence of errors on different output bits does not hold for general logic circuits.

However, suppose that, for a diverse duplex system, we have one implementation which has no logic sharing among the different output cones and that the other implementation does not have any constraint on the amount of logic sharing. Let us suppose that fault $f_i$ affects the logic cone of output g in the first implementation and fault $f_j$ affects the second implementation. The probability that $f_i$ produces an error on output bit g is p. However, since there is no logic sharing among the different output cones, $f_i$ does not affect the other output bits.
Hence, the data integrity of the diverse duplex system is not preserved in the presence of $(f_i, f_j)$ only when $f_j$ produces an error on output g and no error on other output bits in the second implementation. The probability of this event is $p (1-p)^{n-1}$. Hence, the expected value of $(1 - d_{i,j})$ for a fault pair $(f_i, f_j)$ is $p^{2} (1-p)^{n-1}$, which is less than $p^{2}$. Thus, in this scenario, even with the Bernoulli model we find that the data integrity of a diverse duplex system is better than that of parity prediction. For convenience of the above analysis, it is assumed that both faults $f_i$ and $f_j$ produce an error at any output bit with probability p. However, similar analysis can be performed and similar conclusions can be reached when the value of parameter p is different for $f_i$ and $f_j$.

At the other extreme, we can consider the q-ary model [Pradhan 91]. For the q-ary error model, it is assumed that, in an n-output circuit and for a fault f with detectability q (probability that the fault produces incorrect outputs), the probability of any non-zero error vector (obtained by XOR-ing the fault-free and faulty responses) is $\frac{q}{2^{n} - 1}$. Note that there are $2^{n} - 1$ non-zero error vectors. Hence, for a system with diverse duplication, the value of $(1 - d_{i,j})$ for a fault pair $(f_i, f_j)$ is $\frac{q^{2}}{2^{n} - 1}$. Note that, for a system with identical duplication, the expected value of $(1 - d_{i,j})$ for a worst-case fault pair $(f_i, f_j)$ is q (since the worst-case fault pairs affect identical leads in both modules).

For a system with parity prediction using a single parity bit, consider a fault pair $(f_i, f_j)$ where $f_i$ and $f_j$ affect logic cones corresponding to outputs g and h, respectively. If the detectability of $f_i$ in a general logic circuit (with no restrictions on logic sharing) is q, then the detectability of $f_i$ in the circuit with parity prediction is approximately $\frac{q}{2}$. This is because, out of $2^{n} - 1$ error vectors, $2^{n-1}$ produce an error on output bit g; hence, the probability that the fault $f_i$ produces an error on output g in a general logic circuit is $\frac{q \, 2^{n-1}}{2^{n} - 1}$, which is approximately $\frac{q}{2}$.
Hence, the expected value of $(1 - d_{i,j})$ for a worst-case fault pair $(f_i, f_j)$ in a circuit with parity prediction using a single parity bit is $\frac{q^{2}}{4}$ (when both the faults produce errors), which is more than $\frac{q^{2}}{2^{n} - 1}$ for circuits with more than two outputs. Similar analysis can be performed for circuits with multiple parity bits. Hence, diverse duplication provides better data integrity against multiple failures and CMFs compared to identical duplication and parity prediction. For convenience of the above analysis, it is assumed that both faults $f_i$ and $f_j$ have the same detectability q. However, similar analysis can be performed and similar conclusions can be reached when $f_i$ and $f_j$ have different detectability values.

Table 5.1. $(1 - d_{i,j})$ value for fault pair $(f_i, f_j)$

Model             Diverse Duplex                                  Parity
Bernoulli model   $[(1-p)^{2} + p^{2}]^{n} - (1-p)^{2n}$ *        $p^{2}$
                  $p^{2} (1-p)^{n-1}$ **
q-ary model       $q^{2} (2^{n} - 1)^{-1}$                        $0.25\,q^{2}$

*  Both implementations have no fanout restrictions
** One implementation has no output cone sharing

The results presented in this section are summarized in Table 5.1. Note that the Bernoulli and the q-ary models may not be realistic for many logic circuits, as pointed out in [Saxena 97]. Hence, we reiterate that the problem of developing more sophisticated and elegant models for theoretically analyzing the vulnerability of various CED techniques to multiple failures and CMFs is open.
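The entries of Table 5.1 can be evaluated numerically. The sketch below checks the Bernoulli closed form for diverse duplication against a direct enumeration of all non-zero error vectors and then compares the models for sample parameters; the function names and the values of p, q and n are illustrative assumptions.

```python
from itertools import product

def bernoulli_diverse_enumerated(p, n):
    """Sum over non-zero error vectors e of P(e)^2: both modules show the same error."""
    total = 0.0
    for e in product((0, 1), repeat=n):
        k = sum(e)
        if k:
            prob = p ** k * (1 - p) ** (n - k)
            total += prob * prob
    return total

def bernoulli_diverse_closed(p, n):
    """[p^2 + (1-p)^2]^n - (1-p)^(2n): no fanout restrictions in either module."""
    return (p * p + (1 - p) ** 2) ** n - (1 - p) ** (2 * n)

def bernoulli_diverse_unshared(p, n):
    """p^2 (1-p)^(n-1): one implementation has no output cone sharing."""
    return p * p * (1 - p) ** (n - 1)

def bernoulli_parity(p):
    """p^2: single parity bit, faults in two different output cones."""
    return p * p

def qary_diverse(q, n):
    """q^2 / (2^n - 1)."""
    return q * q / (2 ** n - 1)

def qary_parity(q):
    """0.25 q^2."""
    return 0.25 * q * q

if __name__ == "__main__":
    p, q, n = 0.1, 0.5, 4  # hypothetical parameters
    print("Bernoulli diverse (enumerated):", bernoulli_diverse_enumerated(p, n))
    print("Bernoulli diverse (closed form):", bernoulli_diverse_closed(p, n))
    print("Bernoulli diverse (one unshared):", bernoulli_diverse_unshared(p, n))
    print("Bernoulli parity:", bernoulli_parity(p))
    print("q-ary diverse:", qary_diverse(q, n), "q-ary parity:", qary_parity(q))
```

One detail this makes visible: under the q-ary model the parity value $0.25\,q^2$ exceeds the diverse duplex value only when $2^n - 1 > 4$, that is, for circuits with at least three outputs.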

6. Transition Count and Residue for CED

As mentioned in Sec. 1, any CED technique predicts a particular characteristic of the system output. The output characteristics considered in the previous sections are the output itself, parity functions and 1's (or 0's) count. Some other possible output characteristics are transition count and residue modulo some number.

Transition counting has been used in the past as a compaction technique for circuit responses during off-line test [Hayes 76]. For a CED scheme based on transition count, the special output characteristic is the number of up (0→1) and down (1→0) transitions in a given output vector. Transition counting is not a favorable method for concurrent error detection. First, the maximum value of the total number of transitions (up and down) in an n-bit binary word is n-1. Thus, the number of bits needed to represent the number of transitions is equal to the number of bits required to represent the number of 1's (or 0's) in the same word. Transition counting has another serious problem. Suppose that the correct output word from a system is 100100. The number of up-transitions (0→1) in the output word is 1 and the number of down-transitions (1→0) is 2. Suppose that, in the presence of a fault that causes a single error, the output word is changed to 100110. The number of up and down transitions in the erroneous output word is the same as that of the correct output word; hence, this error is not detected.

Residue codes are used for concurrent error detection mainly in datapath elements like adders, multipliers, etc. [Langdon 70, Avizienis 71]. Given an n-bit output vector, the output characteristic captured by a residue code modulo b is the binary representation of the number x = y mod b, where y is the n-bit number represented by the given output word. The recommended value of b is of the form $2^m - 1$. When b = 3, we need two bits to represent the residue of any number. For a CED scheme using residue checking modulo 3, there cannot be any logic sharing between any cones corresponding to any two primary outputs.
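The two output characteristics discussed in this section can be illustrated with a short sketch. It is a minimal illustration, assuming output words are given as bit strings with the most significant bit first. The helper names `transition_counts` and `residue` are hypothetical and do not come from any tool used in this paper.

```python
def transition_counts(word):
    """Return (up, down): the numbers of 0->1 and 1->0 transitions in a bit string."""
    pairs = list(zip(word, word[1:]))     # adjacent bit pairs, left to right
    up = sum(1 for a, b in pairs if (a, b) == ("0", "1"))
    down = sum(1 for a, b in pairs if (a, b) == ("1", "0"))
    return up, down


def residue(word, b=3):
    """Residue modulo b of the number represented by a bit string."""
    return int(word, 2) % b


if __name__ == "__main__":
    # Correct word 100100 and the single-error word 100110 from the text.
    print(transition_counts("100100"), transition_counts("100110"))
    # An error whose arithmetic value is a multiple of b leaves the residue unchanged.
    print(residue("000000"), residue("000011"), residue("000101"))
```

Both words in the transition-count example yield one up-transition and two down-transitions, so the single error goes unnoticed. For the residue check with b = 3, the words 000000 and 000011 share residue 0 (the error value is 3), whereas 000101 has residue 2 and would be flagged.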
If there is logic sharing between two cones corresponding to bit positions i and j (i > j), a single fault can cause errors in these two positions. Suppose that (i - j) is odd and the correct output word has 0's in bit positions i and j. If a single fault causes both bit positions to be flipped to 1, the value of the output word changes by $2^i + 2^j = 2^j(2^{i-j} + 1)$, which is divisible by 3 because $2^{i-j} \equiv -1 \pmod 3$ when (i - j) is odd; the residue is unchanged and the error is not detected. A similar argument can be made when (i - j) is even: if the correct output word has a 1 in bit position i and a 0 in bit position j and both bits are flipped, the value changes by $-(2^i - 2^j) = -2^j(2^{i-j} - 1)$, which is again divisible by 3. Thus, we need two extra bits even though there cannot be any logic sharing, unlike parity checking where we need only a single bit. Our synthesis results for b = 3 and 7 also show that the area required for a CED technique based on residue checking is very high for general logic circuits. Note that this result is not true for datapath logic circuits like adders and multipliers because simple general schemes can be devised to predict the residue of the sum or product of two numbers (arithmetic coding). This result is also supported by the following observation in [Langdon 70]: the residue mod 3 check adder is not economical unless the addition operands are already provided with the mod 3 check bits.

7. System-Level Issues

In the previous sections, we mainly focused on CED techniques for combinational logic blocks. In Fig. 7.1 we present a system-level view of concurrent error detection. The system in Fig. 7.1a contains a combinational logic block implementing a logic function f; the logic block obtains its inputs from register X and the outputs are stored in register Z. In Fig. 7.1b, we present a duplication-based CED technique (identical or diverse) for the system in Fig. 7.1a. The combinational logic blocks N1(f) and N2(f) implement function f. Registers X and Z and the system bus are duplicated; this can possibly cause high area overhead. In order to create diversity in the register contents, register X2 (Z2) can store the complemented forms of the contents of register X1 (Z1). Figure 7.1c presents a CED scheme based on parity prediction for the system in Fig. 7.1a. Each register has a single parity bit (Px for X and Pz for Z).
It has been demonstrated in Sec. 3, through simulation, that the area overhead of combinational logic blocks with parity prediction is marginally less than that of duplication; however, if the register flip-flops and bus lines are counted, the scheme in Fig. 7.1c has significantly less logic area overhead than Fig. 7.1b. Figure 7.1d presents a CED scheme that uses diverse duplication for combinational logic blocks and parity prediction for registers and bus lines. Thus, we can achieve significant improvement in protection against multiple and common-mode failures (through diverse duplication) while the total area overhead is comparable to that of parity prediction (Fig. 7.1c). For this purpose, we need a tree of XOR gates, as shown in Fig. 7.1d. The CED scheme in Fig. 7.1d needs two extra 2-input XOR gates and one 2-input OR gate (XOR tree and the equality checker) for each output of the combinational logic block compared to the CED scheme in Fig. 7.1c. Note that the XOR tree may have significant delay overhead. This delay overhead can be reduced by increasing the number of parity bits (i.e., the number of extra flip-flops in the registers). Interesting problems analyzing this area-delay trade-off can be studied in this context. The XOR tree in Fig. 7.1d can be eliminated if the parity bit of the register is generated from a dual-rail checker used to check the outputs of the combinational logic [Nicolaidis 93]. Routing overhead of the designs in Fig. 7.1b, 7.1c and 7.1d has not been considered in the above discussion.

8. Conclusions

The theory and instrumentation of various concurrent error detection techniques have been subjects of active research interest from the late 1950s until today. However, no systematic study on the vulnerability of these CED schemes to multiple failures and CMFs has been reported in the past. In this paper, for the first time, we provided analytical formulas and simulation results to quantify the possible effects of multiple failures and CMFs on systems employing well-known CED schemes.
The main conclusions of this paper are: (1) Our simulation results on benchmark circuits reveal that we obtain marginal reduction in logic area by using CED schemes based on parity prediction instead of duplication; (2) CED schemes based on Berger codes and Bose-Lin codes incur very high logic area overhead; (3) For the simulated designs, diverse duplex systems with different implementations of the same logic function have a significant advantage over other CED schemes in providing protection against multiple failures and CMFs. This advantage makes diverse duplex systems a prominent candidate for implementing concurrent error detection in dependable systems. This result supports many of the observations in [Sedmak 78]. Looking at the future, research efforts must focus on cost-effective ways of designing CED techniques based on diverse duplication to reduce their area overhead while obtaining significant protection against multiple failures and CMFs.

9. Acknowledgments

This work was supported by Defense Advanced Research Projects Agency (DARPA) under Contract No. DABT63-97-C-0024. The authors wish to thank Prof. Nur Touba of Univ. of Texas at Austin and Nirmal Saxena, Philip Shirvani and Robert Huang of Stanford CRC.

10. References

[Aksenova 75] Aksenova, G. P. and E. S. Sogomonyan, Design of Self-Checking Built-In Check Circuits for Automata with Memory, Automation and Remote Control, Vol. 36, pp. 1169-1177, July 1975.
[Avizienis 71] Avizienis, A., Arithmetic Codes: Cost and Effectiveness Studies for Application in Digital System Design, IEEE Trans. Computers, Vol. C-20, No. 11, pp. 1322-1331, Nov. 1971.
[Avizienis 84] Avizienis, A. and J. P. J. Kelly, Fault Tolerance by Design Diversity: Concepts and Experiments, IEEE Computer, pp. 67-80, Aug. 1984.
[Berger 61] Berger, J. M., A Note on Error Detection Codes for Asymmetric Channels, Information and Control, Vol. 4, pp. 68-73, 1961.
[Bose 85] Bose, B. and D. J. Lin, Systematic Unidirectional Error-Detecting Codes, IEEE Trans. Comp., pp. 1026-1032, Nov. 1985.
[Chen 92] Chen, C. L., et al., Fault-tolerance Design of the IBM Enterprise System/9000 Type 9021 Processors, IBM Journal Res. and Dev., Vol. 36, No. 4, pp. 765-779, July 1992.
[Das 98] Das, D. and N. A. Touba, Synthesis of Circuits with Low-Cost Concurrent Error Detection based on Bose-Lin codes, VLSI Test Symp., pp. 309-315, 1998.
[De 94] De, K., C. Natarajan, D. Nair and P. Banerjee, RSYN: A System for Automated Synthesis of Reliable Multilevel Circuits, IEEE Trans. VLSI, Vol. 2, pp. 186-195, June 1994.

Figure 7.1. Systems with CED: (a) Example (b) Identical or Diverse Duplication (c) Parity prediction (d) Diverse duplication for combinational logic; parity prediction for registers and bus

[Hayes 76] Hayes, J. P., Transition Count Testing of Combinational Logic Circuits, IEEE Trans. Computers, Vol. C-25, No. 6, pp. 613-620, June 1976.
[Huang 00] Huang, W., N. R. Saxena and E. J. McCluskey, A Reliable LZ Data Compressor on Reconfigurable Coprocessors, Proc. IEEE Symp. Field Programmable Custom Computing Machines, 2000.
[Hsiao 81] Hsiao, M-Y, W. C. Carter, J. W. Thomas and W. R. Stringfellow, Reliability, Availability and Serviceability of IBM Computer Systems: A Quarter Century of Progress, IBM Journal of Research and Development, Vol. 25, No. 5, pp. 453-469, Sept. 1981.
[Jha 91] Jha, N. K., Totally Self-Checking Checker Designs for Bose-Lin, Bose, and Blaum Codes, IEEE Trans. CAD, Vol. 10, No. 1, pp. 136-143, Jan. 1991.
[Jha 93] Jha, N. K. and S. J. Wang, Design and Synthesis of Self-Checking VLSI Circuits, IEEE Trans. CAD, Vol. 12, pp. 878-887, June 1993.
[Jou 88] Jou, J-Y, and J. A. Abraham, Fault-Tolerant FFT Networks, IEEE Trans. Computers, Vol. 37, No. 5, pp. 548-561, May 1988.
[Kraft 81] Kraft, G. D. and W. N. Toy, Microprogrammed Control and Reliable Design of Small Computers, 1981.
[LSI 96] G10-p Cell-Based ASIC Products Databook, LSI Logic, May 1996.
[Lala 94] Lala, J. H. and R. E. Harper, Architectural principles for safety-critical real-time applications, Proc. of the IEEE, vol. 82, no. 1, pp. 25-40, Jan. 1994.
[Langdon 70] Langdon, G. G. and C. K. Tang, Concurrent Error Detection for Group Look-ahead Binary Adders, IBM Journal Res. and Dev., pp. 563-573, Sept. 1970.
[Mahmood 84] Mahmood, A., D. M. Andrews and E. J. McCluskey, Executable Assertions and Flight Software, Proc. AIAA/IEEE Digital Avionics Systems Conf., pp. 346-351, 1984.
[Marouf 78] Marouf, M. A. and A. D. Friedman, Design of Self-checking Checkers for Berger Codes, Proc. FTCS, pp. 179-184, 1978.
[McCluskey 90] McCluskey, E. J., Design Techniques for Testable Embedded Error Checkers, IEEE Computer, Vol. 23, No. 7, pp. 84-88, July 1990.
[Mitra 99a] Mitra, S., N. R. Saxena and E. J. McCluskey, A Design Diversity Metric and Reliability Analysis for Redundant Systems, Intl. Test Conf., pp. 662-671, 1999.
[Mitra 99b] Mitra, S., N. R. Saxena and E. J. McCluskey, A Design Diversity Metric and Analysis of Redundant Systems, Technical Report, Center for Reliable Computing, CRC-TR 99-4, Stanford University, 1999.
[Mitra 00a] Mitra, S., N. R. Saxena and E. J. McCluskey, Common-Mode Failures in Redundant VLSI Systems: A Survey, IEEE Trans. Reliability, 2000, To appear.
[Mitra 00b] Mitra, S. and E. J. McCluskey, Combinational Logic Synthesis for Diversity in Duplex Systems, Proc. Intl. Test Conf., 2000.
[Nicolaidis 93] Nicolaidis, M., Efficient Implementations of Self-Checking Adders and ALUs, Proc. Intl. Symp. Fault-Tolerant Computing, pp. 586-595, 1993.
[Nicolaidis 97] Nicolaidis, M., R. O. Duarte, S. Manich and J. Figueras, Fault-secure Parity Prediction Arithmetic Operators, IEEE Design and Test of Computers, Vol. 14, No. 2, pp. 60-71, 1997.
[Patel 82] Patel, J. H. and L. Y. Fung, Concurrent Error Detection in ALUs by Recomputing with Shifted Operands, IEEE Trans. Computers, Vol. C-31, No. 7, pp. 589-595, July 1982.
[Pradhan 91] Pradhan, D. K., and S. K. Gupta, A New Framework for Designing and Analyzing BIST Techniques and Zero Aliasing Compression, IEEE Trans. Computers, Vol. 40, No. 6, pp. 743-763, 1991.
[Pradhan 96] Pradhan, D. K., Fault-Tolerant Computer System Design, Prentice Hall, 1996.
[Ramamoorthy 75] Ramamoorthy, C. V. and Y-W Han, Reliability Analysis of Systems with Concurrent Error Detection, IEEE Trans. Computers, Vol. C-24, No. 9, pp. 868-878, Sept. 1975.
[Rao 89] Rao, T. R. N. and E. Fujiwara, Error-Control Coding for Computer Systems, Prentice-Hall, 1989.
[Saxena 97] Saxena, N. R., and E. J. McCluskey, Parallel Signature Analysis Design with Bounds on Aliasing, IEEE Trans. Computers, Vol. 46, No. 4, pp. 425-438, April 1997.
[Sedmak 78] Sedmak, R. M. and H. L. Liebergot, Fault-Tolerance of a General-Purpose Computer Implemented by Very Large Scale Integration, Proc. FTCS, pp. 137-143, 1978.
[Sellers 68] Sellers, F., M-Y Hsiao and L. W. Bearnson, Error Detection Logic for Digital Computers, McGraw-Hill Book Company, 1968.
[Sentovich 92] Sentovich, E. M., et al., SIS: A System for Sequential Circuit Synthesis, ERL Memo. No. UCB/ERL M92/41, EECS, UC Berkeley, CA 94720.
[Shedletsky 78] Shedletsky, J. J., Error Correction by Alternate-Data Retry, IEEE Trans. Computers, pp. 106-112, Feb. 1978.
[Siewiorek 92] Siewiorek, D. P. and R. S. Swarz, Reliable Computer Systems: Design and Evaluation, Digital Press, 1992.
[Spainhower 99] Spainhower, L. and T. A. Gregg, S/390 Parallel Enterprise Server G5 fault tolerance, IBM Journal of Research Development, Vol. 43, pp. 863-873, Sept./Nov. 1999.
[Touba 97] Touba, N. A. and E. J. McCluskey, Logic Synthesis of Multilevel Circuits with Concurrent Error Detection, IEEE Trans. CAD, Vol. 16, pp. 783-789, July 1997.
[Wakerly 78] Wakerly, J., Error Detecting Codes, Self-checking Circuits and Applications, 1978.
[Webb 97] Webb, C. F., and J. S. Liptay, A High Frequency Custom S/390 Microprocessor, IBM Journal Res. and Dev., Vol. 41, No. 4/5, pp. 463-474, 1997.
[Zeng 99] Zeng, C., N. R. Saxena and E. J. McCluskey, Finite State Machine Synthesis with Concurrent Error Detection, Proc. Intl. Test Conf., pp. 672-680, 1999.