Information Theory and Communication
Optimal Codes
Ritwik Banerjee
rbanerjee@cs.stonybrook.edu
Roadmap
- Examples and Types of Codes
- Kraft Inequality
- McMillan Inequality
- Entropy bound on data compression
- Shannon Code
- Huffman Code
- Wrong Code
- Stochastic and Stationary Processes
Wrong Code
We have seen that the minimum expected codeword length per symbol can be made arbitrarily close to the entropy. For stationary stochastic processes, this means that the entropy rate is the expected number of bits per symbol required to describe the process.

But what happens if the code is designed for the wrong distribution? The true distribution may be unknown to us, and the wrong distribution may happen to be the best available approximation of it.

Let $p(x)$ and $q(x)$ denote the true and the wrong distributions, respectively. Consider the Shannon code assignment, whose expected codeword length is within 1 bit of the theoretical optimum for $q(x)$:
$$l(x) = \left\lceil \log \frac{1}{q(x)} \right\rceil$$
We will not achieve expected length $L \approx H(p)$.
Wrong Code
What is the increase in the expected description length due to the wrong distribution? It is given by the KL divergence $D(p \| q)$.

Theorem. The expected length under $p(x)$ of the code assignment $l(x) = \left\lceil \log \frac{1}{q(x)} \right\rceil$ satisfies
$$H(p) + D(p \| q) \le E_p[l(X)] < H(p) + D(p \| q) + 1$$
where the expectation is taken with respect to the distribution $p$. That is, the divergence is a measure of the penalty of using an approximation for the coding process.
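To make the penalty concrete, here is a minimal Python sketch (the distributions p and q are made-up illustrative values, not from the slides) comparing the expected Shannon codeword length under the wrong distribution with the bound $H(p) + D(p \| q)$:

```python
import math

# Hypothetical example distributions over a 4-symbol alphabet.
p = [0.5, 0.25, 0.125, 0.125]   # true distribution
q = [0.25, 0.25, 0.25, 0.25]    # wrong (assumed) distribution

# Shannon code lengths designed for q: l(x) = ceil(log2(1/q(x)))
lengths = [math.ceil(math.log2(1 / qx)) for qx in q]

# Expected length under the true distribution p.
expected_len = sum(px * lx for px, lx in zip(p, lengths))

# Entropy H(p) and KL divergence D(p||q), both in bits.
H_p = -sum(px * math.log2(px) for px in p)
D_pq = sum(px * math.log2(px / qx) for px, qx in zip(p, q))

print(f"E_p[l(X)]          = {expected_len:.3f}")   # 2.000
print(f"H(p) + D(p||q)     = {H_p + D_pq:.3f}")     # 2.000 (lower bound)
print(f"H(p) + D(p||q) + 1 = {H_p + D_pq + 1:.3f}") # 3.000 (upper bound)
```

Here the lower bound is met with equality because $q$ is dyadic, so the ceiling in the length assignment has no effect.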
Code Classes and McMillan Inequality
We have shown that instantaneous codes satisfy the Kraft inequality. Six years after Kraft's proof, McMillan showed that the same if-and-only-if result also holds for the larger class of uniquely decodable codes. We will skip the proof of this version of the inequality:

The codeword lengths of a uniquely decodable $D$-ary code satisfy
$$\sum_i D^{-l_i} \le 1$$
Conversely, given a set of integers $l_i$ satisfying this inequality, there exists a uniquely decodable code with these integers as its codeword lengths.
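As a quick illustration (a minimal sketch, not from the slides), one can check the Kraft-McMillan sum for a proposed set of codeword lengths:

```python
def kraft_mcmillan_sum(lengths, D=2):
    """Return the Kraft-McMillan sum for D-ary codeword lengths."""
    return sum(D ** (-l) for l in lengths)

# Lengths {2, 2, 2, 3, 3} satisfy the inequality for a binary code:
print(kraft_mcmillan_sum([2, 2, 2, 3, 3]))  # 3*0.25 + 2*0.125 = 1.0 <= 1

# Lengths {1, 1, 2} violate it: no uniquely decodable binary code exists.
print(kraft_mcmillan_sum([1, 1, 2]))        # 1.25 > 1
```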
Huffman Codes
Given a distribution, a prefix code with the shortest expected length (i.e., optimal compression) can be constructed by an algorithm discovered by Huffman in 1952. It is commonly used for lossless data compression.

Let $\mathcal{X} = \{1, 2, 3, 4, 5\}$ and let $X$ be a random variable taking values from $\mathcal{X}$ with the following distribution:
$$p(1) = 0.25 \quad p(2) = 0.25 \quad p(3) = 0.2 \quad p(4) = 0.15 \quad p(5) = 0.15$$

- The least frequent values should have the longest codewords.
- The codeword lengths for 4 and 5 must be equal. Otherwise, we could delete a bit from the longer codeword and still retain a prefix-free code, which would imply that the code we had was not optimal.
- Construct a code where the two longest codewords differ only in the last bit: combine 4 and 5 into a single source with probability 0.3.
- Keep combining the two least probable items into a single source until there is only one item left. (A sketch of this construction in code follows below.)
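The merging procedure is easy to express with a priority queue. The following is a minimal Python sketch (the name build_huffman is my own, not from the slides), applied to the distribution above:

```python
import heapq
from itertools import count

def build_huffman(probs):
    """Build a binary Huffman code; probs maps symbol -> probability."""
    tiebreak = count()  # avoids comparing dicts when probabilities tie
    # Each heap entry: (probability, tiebreak, {symbol: codeword-so-far})
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Pop the two least probable items and merge them into one source,
        # extending their partial codewords by one leading bit.
        p0, _, code0 = heapq.heappop(heap)
        p1, _, code1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in code0.items()}
        merged.update({s: "1" + c for s, c in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

probs = {1: 0.25, 2: 0.25, 3: 0.2, 4: 0.15, 5: 0.15}
code = build_huffman(probs)
for sym in sorted(code):
    print(sym, code[sym])
# Codeword lengths are {2, 2, 2, 3, 3}, so the expected length is
# 0.25*2 + 0.25*2 + 0.2*2 + 0.15*3 + 0.15*3 = 2.3 bits.
```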
Huffman Codes
There is a binary tree structure corresponding to this process!

What about other alphabets? In general, for a $D$-ary code alphabet, the algorithm combines the $D$ least probable items into a single source at each step. For example, if the codewords are over the ternary alphabet $D = \{1, 2, 3\}$, we get a ternary tree.
Huffman Codes
Insufficient number of symbols
Not enough symbols to combine $D$ items at a time? Create dummy variables with 0 probability! At each stage, the number of symbols is reduced by $D - 1$. Therefore, after $k$ merges, we want the total number of symbols to be $1 + k(D - 1)$. (See the sketch after this slide.)

Optimality of Binary Huffman Codes
Theorem. For any distribution, there exists an instantaneous code with minimum expected length such that the following properties hold:
1. The codeword lengths are ordered inversely to the probabilities.
2. The two longest codewords have the same length.
3. The two longest codewords, which correspond to the two least likely symbols, differ only in the last bit.
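The padding requirement is easy to compute. A small sketch (the helper name is my own), assuming we pad the initial symbol count up to the nearest value of the form $1 + k(D-1)$:

```python
def num_dummy_symbols(m, D):
    """Number of zero-probability dummy symbols needed so that m symbols
    can be merged D at a time down to a single source."""
    if m <= 1:
        return 0
    # We need (m + dummies) == 1 (mod D-1); pad up to the next valid count.
    return (1 - m) % (D - 1)

# A ternary (D=3) code over 6 symbols needs one dummy: 6 + 1 = 7 = 1 + 3*2.
print(num_dummy_symbols(6, 3))  # 1
print(num_dummy_symbols(5, 2))  # 0: binary codes never need dummies
```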
Huffman Code is not unique
- At each split, we have two choices for labeling the branches: 01 or 10.
- When multiple items are tied with identical probabilities, there are multiple ways of ordering their codewords.
Source Coding and 20-Questions
We want the optimal sequence of yes/no questions to determine an object from a set of objects, assuming that we know the probability distribution on these objects.
- A sequence of such questions is equivalent to a code for the object.
- A question depends only on the answers to the questions asked previously.
- The sequence of answers uniquely determines the object. Therefore, if we model the yes/no answers by 0s and 1s, we have a unique binary encoding for each object in the set.
- The average length of this code is simply the average number of questions asked, and it can be minimized by a Huffman code. That is, the Huffman code determines the optimal sequence of questions that will identify the object.

The expected number of questions in this process, $E(Q)$, satisfies
$$H(X) \le E(Q) < H(X) + 1$$
(A numerical check follows below.)
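For the earlier five-symbol example, we can verify these bounds numerically (a quick sketch; the Huffman lengths {2, 2, 2, 3, 3} come from the construction shown earlier):

```python
import math

p = [0.25, 0.25, 0.2, 0.15, 0.15]
huffman_lengths = [2, 2, 2, 3, 3]  # from the construction above

H = -sum(px * math.log2(px) for px in p)                  # entropy H(X)
EQ = sum(px * lx for px, lx in zip(p, huffman_lengths))   # expected questions

print(f"H(X) = {H:.4f}")   # about 2.2855 bits
print(f"E(Q) = {EQ:.4f}")  # 2.3 questions: H(X) <= E(Q) < H(X) + 1
```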
Alphabetic Codes
Consider a special case of the 20-questions game:
- the elements of $\mathcal{X} = \{1, 2, \ldots, m\}$ are in decreasing order of probability (i.e., $p_1 \ge p_2 \ge \ldots \ge p_m$), and
- the only questions allowed are of the form "Is $X > a$?" (for some $a$).

The code constructed by the Huffman algorithm may not correspond to slice sets of the form $\{x : x < a\}$. But we can do the following:
- Take the optimal code lengths found by the Huffman algorithm.
- Use these lengths to assign symbols to the tree by taking the first available node at the current level.

This is not a Huffman code, but it is another optimal code. At each non-leaf node, it splits the set into two subsets, $\{x : x \le a\}$ and $\{x : x > a\}$. Such codes are also called alphabetic codes because the tree construction process leads to an alphabetical ordering of the codewords. (A sketch of the length-based assignment follows below.)
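The "first available node" assignment is what is usually called a canonical code. A minimal sketch (the function name is my own), assuming the symbols arrive with their optimal lengths in nondecreasing order:

```python
def canonical_code(lengths):
    """Assign codewords from sorted lengths by always taking the first
    available node at each level; the codewords come out in order."""
    codewords = []
    code = 0
    prev_len = lengths[0]
    for l in lengths:
        code <<= (l - prev_len)   # descend to the deeper level if needed
        codewords.append(format(code, f"0{l}b"))
        code += 1                 # next available node at this level
        prev_len = l
    return codewords

# Lengths from the running example (symbols already sorted by probability):
print(canonical_code([2, 2, 2, 3, 3]))  # ['00', '01', '10', '110', '111']
```

The result is prefix-free, has the same expected length as the Huffman code, and its codewords sorted as binary strings follow the symbol ordering, so every internal node corresponds to a threshold question "Is $X > a$?".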
Code Redundancy
- The Huffman code has the shortest average codeword length: $L_{\text{Huffman}} \le L$ for any prefix code with average length $L$.
- The redundancy of a random variable $X$ is defined as the difference between the average Huffman codeword length and the entropy $H(X)$.
- The redundancy of Huffman coding is bounded above by $p_1 + 0.086$, where $p_1$ is the probability of the most common symbol.
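For the running example, the bound is easy to check by hand (a worked computation using the values derived earlier, where the most common symbol has probability $p_1 = 0.25$):

```latex
\begin{align*}
L_{\text{Huffman}} - H(X) &\approx 2.3 - 2.2855 = 0.0145 \\
p_1 + 0.086 &= 0.25 + 0.086 = 0.336 \\
0.0145 &< 0.336 \quad \checkmark
\end{align*}
```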
Huffman coding is optimal
Theorem. If $C$ is a Huffman code and $C'$ is any uniquely decodable code, then $L(C) \le L(C')$.

The proof for binary alphabets can be extended to general $D$-ary alphabets. Huffman coding is a greedy algorithm: it works by combining the least likely symbols at each step. In general, greedy approaches do not lead to globally optimal solutions, but in the case of Huffman coding, the greedy strategy does yield a global optimum.