
ECE 254 / CPS 225

Outline
  Basic Concepts
  Physical Redundancy
  Error Detecting/Correcting Codes
  Re-Execution Techniques
  Backward Error Recovery Techniques

Basic Idea
  Start with a k-bit data word
  Add r check bits
  Total = n-bit codeword (n = k + r)
  Map the 2^k data words into a codeword space of size 2^n
  Overhead = r/k
    » E.g., for (single-bit) parity, the overhead is 1/k

Hamming Distance
  Hamming distance (HD): number of bit positions in which two words differ from each other
    » E.g., 0010 and 1110 have a Hamming distance of 2
  For a group of codewords, the minimum HD between any two codewords determines the code's ability to detect and/or correct errors
    » This is a fundamental rule, not just some ad-hoc reasoning

Hamming Distance Visual: HD=2
  [Figure: codeword space with HD = 2]
  HD = 2: can detect single errors
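The definition of Hamming distance above can be checked with a few lines of Python. This is a minimal sketch: the function names and the three-bit even-parity codeword set are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations


def hamming_distance(a: str, b: str) -> int:
    """Count the bit positions in which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(x != y for x, y in zip(a, b))


def minimum_distance(codewords) -> int:
    """Minimum Hamming distance over all pairs of codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))


# Assumed example code: all 3-bit words with an even number of ones.
even_parity_code = ["000", "011", "101", "110"]

print(hamming_distance("0010", "1110"))    # 2, the example from the slide
print(minimum_distance(even_parity_code))  # 2, so single errors are detectable
```

The minimum over all codeword pairs, not the distance between any one pair, is what determines the code's detection and correction ability.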

Hamming Distance Visual: HD=3
  [Figure: codeword space with HD = 3]
  HD = 3:
    » Can correct single errors
    » Can detect single and double errors (if used for detection only)

Hamming Distance and Error Detection
  Can detect up to t-bit errors if HD >= t + 1
  [Figure: HD = 2, detects 1-bit errors]
  What if we receive 111?
    » Could've been 011
    » Could've been 101
    » Could've been 110
    » What about 001? Or 000??
  What if we receive 011? Could it have been 001???
  [Figure: HD = 3, detects 1- and 2-bit errors]

Hamming Distance and Error Correction
  Can correct up to t-bit errors if HD >= 2t + 1
  Can correct t-bit errors and detect p more (up to t + p) if HD >= 2t + p + 1
  Single Error Correction, Double Error Detection (SECDED): t = 1, p = 1, so HD >= 4
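The three distance conditions (HD >= t + 1, HD >= 2t + 1, HD >= 2t + p + 1) can be turned around to ask what a given minimum distance buys. The helpers below are a sketch with hypothetical names; in `supports`, t is the number of errors corrected and p the number of additional errors detected.

```python
def max_detectable(hd: int) -> int:
    """Largest t with hd >= t + 1 (detection only)."""
    return hd - 1


def max_correctable(hd: int) -> int:
    """Largest t with hd >= 2t + 1."""
    return (hd - 1) // 2


def supports(hd: int, t: int, p: int) -> bool:
    """True if a code with minimum distance hd can correct t-bit errors
    and still detect up to t + p bit errors: hd >= 2t + p + 1."""
    return hd >= 2 * t + p + 1


print(max_detectable(2))   # 1: single-bit parity detects single errors
print(max_correctable(3))  # 1: HD = 3 corrects single errors
print(supports(3, 1, 1))   # False: HD = 3 is not enough for SECDED
print(supports(4, 1, 1))   # True: SECDED needs HD >= 4
```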

Hamming Distance and Error Correction
  Can correct up to t-bit errors if HD >= 2t + 1
  Can correct t-bit errors and detect p more if HD >= 2t + p + 1
  What if we receive 011?
    » More likely to have been 111
    » But could've been 000
    » Guess that it was 111
  What if we receive 111? Could it have been 000?

Single-bit Parity
  Simplest error detection code
  Adds one bit of redundancy to each data word
  Even (odd) parity: add a bit such that the total number of ones in the codeword is even (odd)
    » E.g., 001010 gets a parity bit of 0 for even parity (1 for odd)
  Can detect all single-bit errors
    » Hamming distance >= 2
    » Could be greater than 2 if data words don't use all bit combinations
  Drawback: only guaranteed to detect single-bit errors (any even number of bit flips goes undetected)

More Redundancy Than Single-Bit Parity
  Repetition: 0 -> 00000, 1 -> 11111
    » Good: Hamming distance = 5
    » Bad: Overhead = 400%
  Fortunately, we can get more cost-effective codes!
  But first we start off with a general way to detect and diagnose single-bit errors

Overlapping Parity (for single-bit errors)
  k = 4 information bits: i3 i2 i1 i0
  r = 3 parity bits: p2 p1 p0

  Which bit has error?   Parity bits affected
  i3                     p2, p1, p0
  i2                     p2, p1
  i1                     p2, p0
  i0                     p1, p0
  p2                     p2
  p1                     p1
  p0                     p0

  When receiving a codeword, re-compute the 3 parity bits and compare them to those that were sent. If they differ, the pattern of mismatches diagnoses the error!
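The overlapping-parity table can be exercised directly. The sketch below reads the coverage of each parity bit off the table (p2 covers i3, i2, i1; p1 covers i3, i2, i0; p0 covers i3, i1, i0) and assumes even parity, which the slide does not specify. The dictionary layout and function names are illustrative.

```python
# Which information bits each parity bit covers, derived from the table above.
COVERAGE = {
    "p2": ("i3", "i2", "i1"),
    "p1": ("i3", "i2", "i0"),
    "p0": ("i3", "i1", "i0"),
}
INFO_BITS = ("i3", "i2", "i1", "i0")


def op_encode(info: dict) -> dict:
    """Attach the three (even) parity bits to a dict of information bits."""
    word = dict(info)
    for parity, covered in COVERAGE.items():
        word[parity] = sum(info[i] for i in covered) % 2
    return word


def op_diagnose(received: dict):
    """Return the name of the single bit in error, or None if none is seen."""
    mismatched = frozenset(
        parity
        for parity, covered in COVERAGE.items()
        if received[parity] != sum(received[i] for i in covered) % 2
    )
    if not mismatched:
        return None
    for i in INFO_BITS:
        if mismatched == frozenset(p for p, c in COVERAGE.items() if i in c):
            return i
    # Every 2- or 3-element mismatch set matches an information bit above,
    # so what remains is a single mismatched parity bit: it is the bad one.
    (only,) = mismatched
    return only


word = op_encode({"i3": 1, "i2": 0, "i1": 1, "i0": 1})
corrupted = dict(word, i1=word["i1"] ^ 1)  # flip i1 in transit
print(op_diagnose(word))       # None
print(op_diagnose(corrupted))  # i1
```

The diagnosis is only meaningful under the single-bit error assumption; a double error produces a mismatch pattern that points at the wrong bit.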

Generalized Overlapping Parity Codes
  The previous slide showed how to use overlapping parity to detect and diagnose single-bit errors
  For single-bit errors, there are k + r possible errors
    » Therefore, we need 2^r >= k + r + 1 to uniquely diagnose errors (the extra 1 is the error-free case)
  In general, can extend this scheme to detect and diagnose more than single-bit errors
  General approach called Hamming Codes

(7,4) Hamming Code
  Class of (n,k) Hamming codes, e.g., (7,4) [r = n - k = 3]
  Let i1, i2, i3, i4 be the information bits
  Let p1, p2, p4 be the check bits
    » p1 = i1 xor i2 xor i4
    » p2 = i1 xor i3 xor i4
    » p4 = i2 xor i3 xor i4
  Let H be the Parity Check Matrix
    » If C is a codeword, then H C = 0 (multiplication modulo 2!)
    » Else, H C = S, where S is the syndrome
    » Syndrome identifies where the error occurred (i.e., which bit)
  This works out like magic because of some cute math

(7,4) Hamming Code, cont'd

        p1 p2 i1 p4 i2 i3 i4
  H =    1  0  1  0  1  0  1
         0  1  1  0  0  1  1
         0  0  0  1  1  1  1

  Info word 0101: p1 = 0, p2 = 1, p4 = 0, so the codeword is 0100101
  Example 1: received error-free codeword R = 0100101
    » Compute syndrome: S = H R = 0 = [0 0 0]
  Example 2: received R = 0110101 (i.e., error in bit position 3)
    » Compute syndrome: S = H R = [1 1 0]; read backwards this is 011 = 3

Cyclic Redundancy Check (CRC)
  Considers dataword and codeword to be polynomials
    » E.g., i_0, i_1, i_2, ..., i_{n-1}  ->  i_0 + i_1 X + i_2 X^2 + ... + i_{n-1} X^{n-1}
  Codeword = Dataword * Generator
    » C(X) = D(X) * G(X)
    » G(X) is a pre-defined CRC polynomial (depends on particular code)
    » Additions performed during multiplication are mod 2: 0+0 = 0, 0+1 = 1+0 = 1, 1+1 = 0
  At receiver, divide n-bit codeword by CRC polynomial
    » D(X) = C(X) / G(X)
    » If remainder is non-zero, we've detected an error
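The (7,4) Hamming code example can be reproduced in a few lines. This sketch uses the parity check matrix H and the bit order p1 p2 i1 p4 i2 i3 i4 from the slide; the function names are illustrative assumptions.

```python
# Parity check matrix from the slide; columns are p1 p2 i1 p4 i2 i3 i4.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def hamming74_encode(i1, i2, i3, i4):
    """Build the 7-bit codeword from four information bits."""
    p1 = i1 ^ i2 ^ i4
    p2 = i1 ^ i3 ^ i4
    p4 = i2 ^ i3 ^ i4
    return [p1, p2, i1, p4, i2, i3, i4]


def syndrome(received):
    """Compute S = H R with all arithmetic modulo 2."""
    return [sum(h * b for h, b in zip(row, received)) % 2 for row in H]


def error_position(received):
    """Read the syndrome backwards as a binary number (0 means no error)."""
    s = syndrome(received)
    return s[0] + 2 * s[1] + 4 * s[2]


def correct(received):
    """Flip the bit named by the syndrome, assuming at most one bit error."""
    fixed = list(received)
    position = error_position(received)
    if position:
        fixed[position - 1] ^= 1
    return fixed


codeword = hamming74_encode(0, 1, 0, 1)
print(codeword)                     # [0, 1, 0, 0, 1, 0, 1]
received = [0, 1, 1, 0, 1, 0, 1]    # Example 2: error in bit position 3
print(syndrome(received))           # [1, 1, 0]
print(error_position(received))     # 3
print(correct(received) == codeword)  # True
```

The syndrome equals the binary position of the flipped bit because column j of H is the binary representation of j (least significant bit in the top row).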

CRC Properties and Varieties
  An n-bit CRC check can detect all burst errors of fewer than n bits and all but about 1 in 2^n of the longer multi-bit errors
  Examples:
    » CRC-12: G(X) = X^12 + X^11 + X^3 + X^2 + X + 1
    » CRC-16: G(X) = X^16 + X^15 + X^2 + 1
    » Ethernet uses CRC-32
  More bits -> better error detection capability

CRC Implementation
  Why is CRC popular? Easy to implement!
    » Just need shifters and XORs
  [Figure: circuit for CRC-12 with register bits b11, b10, b9, bits 8-1, b0 and the input data. Also known as a linear feedback shift register (LFSR).]

Reed-Solomon Codes
  Popular ECC for CDs, DVDs, wireless communications, etc.
  k data symbols, each of which is s bits
  r parity symbols, each of which is also s bits
  Can correct up to r/2 symbols that contain errors
    » Or can correct up to r symbol erasures
    » Erasure = error in a known symbol
  Denoted by RS(n,k)
  Common example: RS(255, 223) with s = 8
    » n = 255 -> 255 codeword bytes
    » k = 223 -> 223 dataword bytes
    » r = 32 -> can correct errors in <= 16 bytes

Reed-Solomon Codes, cont'd
  There exist many flavors of RS codes, each of which is tailored to a specific purpose
    » Cross-Interleaved Reed-Solomon Coding (CIRC), used in CDs, can correct an error burst of up to 4000 bits!
    » 4000 bits is roughly equivalent to 2.5 mm on the CD surface
  RS codes are best for a bursty error model
    » Just as good at handling 1 error in a symbol or s errors in a symbol
  Codewords created by multiplying datawords with a generator polynomial (like CRC)
  I will not provide detail on how the code works or why Galois fields are involved (nor will I tell you who Galois was)
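Returning to the CRC slides: the multiply-then-divide scheme C(X) = D(X) * G(X) can be sketched with Python integers, where bit j of an integer is the coefficient of X^j. The shift-and-XOR loops mirror what the LFSR circuit does in hardware. The function names and the sample dataword are assumptions; note also that many deployed CRCs append the remainder to the data instead of multiplying, which this sketch does not model.

```python
# CRC-16 generator from the slide: X^16 + X^15 + X^2 + 1.
CRC16_G = (1 << 16) | (1 << 15) | (1 << 2) | 1


def gf2_multiply(a: int, b: int) -> int:
    """Multiply two polynomials with coefficients mod 2 (addition is XOR)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def gf2_divmod(dividend: int, divisor: int):
    """Polynomial long division mod 2; returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("generator polynomial must be non-zero")
    quotient = 0
    width = divisor.bit_length()
    while dividend.bit_length() >= width:
        shift = dividend.bit_length() - width
        quotient |= 1 << shift
        dividend ^= divisor << shift  # subtract (XOR) the shifted divisor
    return quotient, dividend


data = 0b1011001                       # assumed sample dataword D(X)
codeword = gf2_multiply(data, CRC16_G)

recovered, remainder = gf2_divmod(codeword, CRC16_G)
print(recovered == data, remainder)    # True 0

corrupted = codeword ^ (1 << 5)        # flip one bit in transit
print(gf2_divmod(corrupted, CRC16_G)[1] != 0)  # True: error detected
```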

Berger Codes
  The r check bits are the binary encoding of the number of zeros in the k-bit dataword
    » Number of check bits: r = ceil(log2(k + 1))
  Can detect all single-bit errors and all unidirectional multi-bit errors
    » Unidirectional: all bit errors are either from 0 -> 1 or from 1 -> 0
  Good for detecting coupling faults
    » Change in one bit erroneously causes change(s) in other bit(s)
    » Models short circuits (including bridging faults)

Bose-Lin Codes
  Very similar to Berger codes, but with a fixed number of check bits
    » Lower overhead for check bits
    » But can only detect up to r unidirectional errors
  Check bits = (number of zeros) modulo 2^r

Arithmetic Codes
  Codes that are preserved by arithmetic operations
    » If X and Y are codewords, then Z = F(X,Y) is a codeword
  Arithmetic codes let us detect errors in ALUs
  Two types of codes, where f(X) is the encoding of X and C(X) is the check symbol computed from X
    » Separable: f(X) = concatenation of X and C(X), denoted X, C(X)
    » Non-separable: f(X) != X, C(X)
    » Why is separability a desirable feature? Think about hardware implementation issues
  Example (assume addition is performed modulo M)
    » AN code: f(X) = A * X
    » A * ((X + Y) mod M) = (A*X + A*Y) mod (A*M)

Self-Checking Circuits
  What properties/invariants can we build into circuits such that codeword inputs do not lead to codeword outputs in the presence of faults?
  Self-testing circuit: for every fault from a prescribed set, the circuit produces a non-codeword output in response to at least one codeword input
  Fault-secure circuit: for every fault from a prescribed set, the circuit never outputs an incorrect codeword in response to codeword inputs
  Totally self-checking: self-testing AND fault-secure
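As a concrete illustration of the Berger code slide, the sketch below appends the zero count to a dataword and checks received words. The function names and the sample dataword are assumptions; the check-bit width uses `k.bit_length()`, which equals ceil(log2(k + 1)) for k >= 1.

```python
def berger_check_bits(data: str) -> str:
    """Binary encoding of the number of zeros in the dataword."""
    k = len(data)
    if k == 0:
        raise ValueError("dataword must not be empty")
    r = k.bit_length()  # same value as ceil(log2(k + 1))
    return format(data.count("0"), f"0{r}b")


def berger_encode(data: str) -> str:
    """Separable codeword: dataword followed by its check bits."""
    return data + berger_check_bits(data)


def berger_is_valid(word: str, k: int) -> bool:
    """Recompute the check bits from the first k bits and compare."""
    return berger_check_bits(word[:k]) == word[k:]


data = "1011001"                       # k = 7, three zeros
codeword = berger_encode(data)
print(codeword)                        # 1011001011

# Unidirectional error: two 1 -> 0 flips in the data part are detected.
print(berger_is_valid("0010001" + "011", 7))  # False

# Non-unidirectional error: one 1 -> 0 and one 0 -> 1 flip cancel out.
print(berger_is_valid("0111001" + "011", 7))  # True (undetected)
```

The last case shows why the guarantee is limited to unidirectional errors: opposite flips leave the zero count unchanged.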

Other Coding Schemes
  Many error detecting/correcting codes exist
    » Many of them require more math than belongs in this course
    » Refer to the numerous textbooks on this topic
  Reasons for other types of codes
    » Error models: multiple-bit errors, burst errors (particularly for network communication), byte errors
    » Cost-efficiency
    » Ease of hardware implementation

Implementing EDC/ECC in Hardware
  Where does EDC/ECC get used?
    » Disk, CD-ROM
    » Memory (DRAM, SRAM)
    » Buses
    » Network
  Tradeoff between EDC and ECC
    » ECC: forward error recovery. Often on the critical path, so it can slow down even a fault-free system
    » EDC: backward error recovery. Detecting an error leads to recovery (can be slow)
  So would you use ECC or EDC in your L1 cache? How about in DRAM?