Fixed-to-fixed mapping: we encode source symbols of fixed length into fixed-length code sequences.
Fixed-to-variable mapping: we encode source symbols of fixed length into variable-length code sequences.

Source Symbol  Probability P_k  Code I word c_k  Length l_k  Code II word c_k  Length l_k
s0             1/2              00               2           0                 1
s1             1/4              01               2           10                2
s2             1/8              10               2           110               3
s3             1/8              11               2           1111              4

Variable-to-fixed mapping: we encode variable-length source sequences into fixed-length code words.
Variable-to-variable mapping: we encode variable-length source sequences into variable-length code words.
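To see the trade-off numerically, the two codes in the table can be compared on expected codeword length. A minimal sketch (symbol names and code tables transcribed from the table above):

```python
from fractions import Fraction

pmf = {"s0": Fraction(1, 2), "s1": Fraction(1, 4),
       "s2": Fraction(1, 8), "s3": Fraction(1, 8)}
code_i  = {"s0": "00", "s1": "01", "s2": "10",  "s3": "11"}
code_ii = {"s0": "0",  "s1": "10", "s2": "110", "s3": "1111"}

def avg_length(code):
    # Expected codeword length: sum over symbols of P_k * l_k
    return sum(pmf[s] * len(w) for s, w in code.items())

print(float(avg_length(code_i)))   # 2.0 bits/symbol (fixed length)
print(float(avg_length(code_ii)))  # 1.875 bits/symbol (variable length)
```

Because the frequent symbol s0 gets the one-bit word, the variable-length code beats the fixed-length code on average.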
Shannon-Fano Code

It is a technique for constructing a prefix code based on a set of symbols and their probabilities (estimated or measured). It is suboptimal in the sense that it does not always achieve the lowest possible expected codeword length. The technique was proposed by Claude Shannon and is also attributed to Robert Fano, who published a similar method.

Algorithm:
1- Source symbols are listed in order of decreasing probability from left to right.
2- Divide the list into two parts, with the total probability (or frequency of occurrence) of the left part being as close to the total of the right as possible.
3- The left part of the list is assigned the binary digit 0, and the right part is assigned the digit 1. This means that the codes for the symbols in the first part will all start with 0, and the codes in the second part will all start with 1.
4- Recursively apply steps 2 and 3 to each of the two halves, subdividing groups and adding bits to the codes until each symbol has become a corresponding code leaf on the tree.
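A minimal Python sketch of these four steps (function and variable names are my own, not from the lecture; ties in step 2 are broken toward the later split so that the output matches the tree built in the following slides):

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, weight) pairs sorted by decreasing weight.
    Returns a dict {symbol: codeword}."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(w for _, w in symbols)
    # Step 2: choose the split where the left and right totals are closest
    # (ties broken toward the later split).
    best_split, best_diff, running = 1, total, 0
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        diff = abs(total - 2 * running)
        if diff <= best_diff:
            best_diff, best_split = diff, i
    # Steps 3-4: prefix the left part with 0, the right with 1, and recurse.
    codes = {}
    for bit, part in (("0", symbols[:best_split]), ("1", symbols[best_split:])):
        for sym, suffix in shannon_fano(part).items():
            codes[sym] = bit + suffix
    return codes

print(shannon_fano([("a", 9), ("b", 8), ("c", 6), ("d", 5), ("e", 4), ("f", 2)]))
# {'a': '00', 'b': '01', 'c': '100', 'd': '101', 'e': '110', 'f': '111'}
```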
Ex: Assume a sequence over the alphabet A = {a, b, c, d, e, f} with the occurrence weights {9, 8, 6, 5, 4, 2}, respectively. Apply Shannon-Fano coding and discuss the suboptimality.
(Figures: step-by-step construction of the Shannon-Fano tree for this example. The total weight 34 is first split into {a, b} = 17 and {c, d, e, f} = 17; the left group then splits into {a} and {b}, the right group into {c, d} = 11 and {e, f} = 6, and finally into {c}, {d}, {e = 4}, and {f = 2}.)
The resulting Shannon-Fano code:

Symbol  Frequency  P_k    Code
a       9          0.265  00
b       8          0.235  01
c       6          0.176  100
d       5          0.147  101
e       4          0.118  110
f       2          0.059  111

L̄ = Σ_k P_k l_k = 2×(0.235 + 0.265) + 3×(0.176 + 0.147 + 0.118 + 0.059) = 2.5 bits/symbol

For comparison, the source entropy is H(S) ≈ 2.451 bits/symbol, so the code spends about 0.05 bit/symbol above the entropy bound.
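Both figures are easy to verify; a small check (weights and codeword lengths transcribed from the table above):

```python
from math import log2

weights = {"a": 9, "b": 8, "c": 6, "d": 5, "e": 4, "f": 2}
lengths = {"a": 2, "b": 2, "c": 3, "d": 3, "e": 3, "f": 3}
total = sum(weights.values())                      # 34 symbols in all

avg_len = sum(weights[s] * lengths[s] for s in weights) / total
entropy = -sum(w / total * log2(w / total) for w in weights.values())
print(round(avg_len, 3), round(entropy, 3))        # 2.5 2.451
```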
The Huffman procedure is based on two observations regarding optimum prefix codes:
1- Symbols that occur more frequently (have a higher probability of occurrence) will have shorter codewords than symbols that occur less frequently.
2- The two symbols that occur least frequently will have codewords of the same length.

It is commonly used for lossless compression. The Huffman code is a prefix, variable-length code that can achieve the shortest average code length for a given input alphabet with a given pmf. In general, Huffman coding is a form of statistical coding, as not all characters occur with the same frequency (probability). The process of finding the optimal code was algorithmized by Huffman. A code is an optimal prefix code if it has the same average codeword length as a Huffman code for the given pmf.
Ex: Consider the following short text: "Eerie eyes seen near lake". Build a Huffman tree to encode the text.
- Get the count of each letter
- Sort from lowest to highest count
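A quick way to obtain the sorted counts (assuming spaces are coded as ordinary symbols; the original slides may treat punctuation slightly differently):

```python
from collections import Counter

text = "Eerie eyes seen near lake"
counts = sorted(Counter(text).items(), key=lambda kv: kv[1])
print(counts)
# [('E', 1), ('i', 1), ('y', 1), ('l', 1), ('k', 1),
#  ('r', 2), ('s', 2), ('n', 2), ('a', 2), (' ', 4), ('e', 8)]
```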
- Create a new node whose count is the sum of the two lowest counts
- Insert the new node into the sorted list at its proper place (its precedence), then repeat: keep combining the two lowest-count nodes until only a single root node remains
The frequency of the root node should equal the number of characters in the text. With every left branch labeled 0 and every right branch labeled 1, each letter's codeword is read off the path from the root down to its leaf.
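This invariant can be checked by building the tree programmatically; a sketch using a min-heap (the tie-breaker field only keeps heap comparisons well defined when counts are equal):

```python
import heapq
from collections import Counter

text = "Eerie eyes seen near lake"
# Heap entries are (count, tie_breaker, node); a leaf node is a letter.
heap = [(n, i, ch) for i, (ch, n) in enumerate(Counter(text).items())]
heapq.heapify(heap)
next_id = len(heap)
while len(heap) > 1:
    n1, _, left = heapq.heappop(heap)      # the two lowest-count nodes...
    n2, _, right = heapq.heappop(heap)
    heapq.heappush(heap, (n1 + n2, next_id, (left, right)))  # ...merged
    next_id += 1
root_count = heap[0][0]
print(root_count, len(text))  # 25 25: root frequency = character count
```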
As the finished tree shows, the least frequent symbols get codewords of the same length, while more frequent symbols get shorter codewords, exactly the two observations the Huffman procedure is built on.
Huffman Algorithm
1- Source symbols are listed in order of decreasing probability (frequency).
2- The two source symbols of lowest probability are assigned a 0 and a 1 (splitting stage).
3- These two source symbols are combined into a new source symbol with probability equal to the sum of the two original probabilities. (The list of source symbols, and therefore of source statistics, is thereby reduced in size by one.)
4- The probability of the new symbol is placed in the list in accordance with its value.
The steps are repeated until we are left with a final list of only two source statistics, to which a 0 and a 1 are assigned.
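A compact implementation of this procedure (a sketch: it merges via a min-heap instead of a re-sorted list, and reads the 0/1 assignments off the merge tree at the end, which is equivalent. Note that this particular tie-breaking produces the minimum-variance form of the code for the example below, so its lengths {2, 2, 2, 3, 3} match the "alternative solution" two slides down; the first solution's lengths {1, 2, 3, 4, 4} give the same 2.2-bit average):

```python
import heapq
import itertools

def huffman(pmf):
    """pmf: dict {symbol: probability}. Returns {symbol: codeword}."""
    ids = itertools.count()                  # tie-breaker for equal probs
    heap = [(p, next(ids), sym) for sym, p in pmf.items()]
    heapq.heapify(heap)
    while len(heap) > 1:                     # steps 2-4, repeated
        p1, _, n1 = heapq.heappop(heap)      # the two least probable nodes
        p2, _, n2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(ids), (n1, n2)))
    codes = {}
    def walk(node, prefix):                  # read codewords off the tree
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"      # lone-symbol edge case
    walk(heap[0][2], "")
    return codes

print(huffman({"S2": 0.4, "S1": 0.2, "S3": 0.2, "S0": 0.1, "S4": 0.1}))
# {'S1': '00', 'S3': '01', 'S0': '100', 'S4': '101', 'S2': '11'}
```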
Ex: Find the Huffman code for the following source, given the corresponding probabilities (marginal pmf): S0 = 0.1, S1 = 0.2, S2 = 0.4, S3 = 0.2, S4 = 0.1.

Sorting by decreasing probability and repeatedly combining the two lowest entries:

Symbol  Stage 1  Stage 2  Stage 3  Stage 4  Code
S2      0.4      0.4      0.4      0.6      1
S1      0.2      0.2      0.4      0.4      01
S3      0.2      0.2      0.2               000
S0      0.1      0.2                        0010
S4      0.1                                 0011

H(S) = 0.4 log2(1/0.4) + 2×(0.2) log2(1/0.2) + 2×(0.1) log2(1/0.1) = 2.122 bits/symbol
L̄ = 0.4×1 + 0.2×2 + 0.2×3 + 0.1×4 + 0.1×4 = 2.2 bits/symbol
CR = 3/2.2 = 1.364 (relative to a 3-bit fixed-length code for the five symbols)
σ² = Σ_k P_k (l_k − L̄)² = 1.36
Alternative solution: when the combined probability ties with existing entries, place the combined symbol as high in the list as possible:

Symbol  Stage 1  Stage 2  Stage 3  Stage 4  Code
S2      0.4      0.4      0.4      0.6      00
S1      0.2      0.2      0.4      0.4      10
S3      0.2      0.2      0.2               11
S0      0.1      0.2                        010
S4      0.1                                 011

H(S) = 2.122 bits/symbol as before, and
L̄ = 0.4×2 + 0.2×2 + 0.2×2 + 0.1×3 + 0.1×3 = 2.2 bits/symbol, so CR = 3/2.2 = 1.364 again, but now
σ² = Σ_k P_k (l_k − L̄)² = 0.16

Hence, to obtain a minimum-variance Huffman code, we always put the combined symbol as high in the list as possible.
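A short script confirms the quoted numbers for both solutions (probabilities and codeword lengths transcribed from the two tables above):

```python
from math import log2

probs = [0.4, 0.2, 0.2, 0.1, 0.1]          # S2, S1, S3, S0, S4
lengths_1 = [1, 2, 3, 4, 4]                # first solution
lengths_2 = [2, 2, 2, 3, 3]                # alternative solution

print(round(-sum(p * log2(p) for p in probs), 3))   # H(S) = 2.122
for lengths in (lengths_1, lengths_2):
    avg = sum(p * l for p, l in zip(probs, lengths))
    var = sum(p * (l - avg) ** 2 for p, l in zip(probs, lengths))
    print(round(avg, 2), round(var, 2))
# 2.2 1.36  (first solution)
# 2.2 0.16  (alternative: same average, much lower variance)
```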
Conclusions

The Huffman encoding process (the Huffman tree) is not unique. Among the alternatives, the code with the lowest codeword-length variance is the best, because it ensures the least amount of variability in the encoded stream. An encoding scheme that minimizes the average length is called optimal coding. The Huffman code is optimal, and its average length satisfies

H(S) ≤ L̄ < H(S) + 1   (looser bound)

In fact (from the literature), it can be shown that if P_max is the largest probability in the probability model, then

for P_max ≥ 0.5:  H(S) ≤ L̄ < H(S) + P_max
for P_max < 0.5:  H(S) ≤ L̄ < H(S) + P_max + 0.086   (tighter bound)

The average codeword length equals the marginal entropy exactly when all marginal probabilities are integer powers of 1/2, i.e., of the form 2^(−n).
Huffman vs. Shannon-Fano

Applying Huffman coding to the same alphabet (weights {9, 8, 6, 5, 4, 2}):

Symbol  Frequency  P_k    Code
a       9          0.265  01
b       8          0.235  10
c       6          0.176  11
d       5          0.147  001
e       4          0.118  0000
f       2          0.059  0001

Merging stages: (0.265, 0.235, 0.176, 0.176, 0.147) → (0.324, 0.265, 0.235, 0.176) → (0.412, 0.324, 0.265) → (0.588, 0.412) → 1.

L̄ = Σ_k P_k l_k = 2×(0.235 + 0.265 + 0.176) + 3×0.147 + 4×(0.118 + 0.059) = 1.352 + 0.441 + 0.708 ≈ 2.5 bits/symbol

With exact probabilities, both codes average exactly 85/34 = 2.5 bits/symbol for this particular source; any apparent difference in the rounded figures is a rounding artifact. The general conclusion still stands: the Huffman code is guaranteed to be optimal for any pmf, whereas Shannon-Fano coding is suboptimal in general and can produce a strictly larger average length on other sources.
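Computing both averages with exact probabilities (codeword lengths taken from the two tables) confirms the tie for this source:

```python
from fractions import Fraction

weights = {"a": 9, "b": 8, "c": 6, "d": 5, "e": 4, "f": 2}
sf_lengths      = {"a": 2, "b": 2, "c": 3, "d": 3, "e": 3, "f": 3}
huffman_lengths = {"a": 2, "b": 2, "c": 2, "d": 3, "e": 4, "f": 4}
total = sum(weights.values())

for name, lengths in (("Shannon-Fano", sf_lengths),
                      ("Huffman", huffman_lengths)):
    avg = sum(Fraction(weights[s], total) * lengths[s] for s in weights)
    print(name, avg, float(avg))   # both are exactly 5/2 = 2.5 bits/symbol
```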