A Brief Introduction to Information Theory and Lossless Coding


1 INTRODUCTION

This document is intended as a guide for students studying 4C8 who have had no prior exposure to information theory. All of the following material is covered in 3C5/4BIO2. Information Theory is a subject first described in the seminal paper "A Mathematical Theory of Communication" by Claude Shannon. Although Information Theory is concerned with communications in general, here we are only interested in Information Theory as it relates to compression.

2 DATA, INFORMATION AND REDUNDANCY

The first key concept is the difference between these three terms, which is best explained by an example. Consider the following sentence:

THE ENGLISH LANGUAGE IS ABOUT ONE HALF REDUNDANT

This sentence conveys a message (or information) about the redundancy of English. However, it is possible to convey the same message using fewer words (or data):

ENGLISH IS HALF REDUNDANT

So the original sentence uses more words than are necessary to convey the message. Loosely we can say that

DATA = INFORMATION + REDUNDANCY.

By removing redundancy, we can reduce the amount of data used to convey a message without reducing the information content. Therefore, we can think of information content as the minimum amount of data that we can use to represent the information contained within a file.

3 ASSIGNING CODEWORDS TO SYMBOLS

In general, any file, whether an image, video, audio or text, consists of a string of symbols. The set of possible symbols in a file is called the alphabet. To represent each symbol on a computer we need to assign it a string of bits.

3.1 A BASIC EXAMPLE

Consider a file that uses an alphabet of only 4 symbols: A, B, C and D. The file consists of a string of these symbols chosen at random, for example ABACDDCABBCA. One way to assign a codeword to each symbol is to use a binary number of fixed length for each symbol, for example:

A 00    B 01    C 10    D 11

The file in binary is then

000100101111100001011000

In general, the number of bits needed for each symbol is log2(#symbols), rounded up to an integer. So a 4-symbol alphabet requires 2 bits per symbol, a file with an 8-symbol alphabet would require 3 bits per symbol, and so on.

3.2 AN EXAMPLE WITH ONLY 3 SYMBOLS

Consider the following file, for which log2(#symbols) is not an integer:

ABABCCBCACBCABB

1. #symbols = 3, which is greater than 2, so more than 1 bit is needed per symbol. We could use 2 bits per symbol:

A 00    B 01    C 10

As the file is 15 symbols long, we require 30 bits of data to encode the entire file. The encoded file is then

000100011010011000100110000101

2. Notice that the code 11 is unused. Perhaps we can use fewer bits for one of the symbols, for example:

A 00    B 01    C 1
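To make the codetable idea concrete, here is a minimal Python sketch (not part of the original notes; the encode helper is just an illustrative name) that reproduces the two encodings of the file in Section 3.2:

# Encode a string of symbols by concatenating the codeword of each symbol.
def encode(symbols, codetable):
    return "".join(codetable[s] for s in symbols)

file = "ABABCCBCACBCABB"                       # the example file of Section 3.2

fixed = {"A": "00", "B": "01", "C": "10"}      # 2 bits per symbol
shorter_c = {"A": "00", "B": "01", "C": "1"}   # 1-bit codeword for C

print(encode(file, fixed), len(encode(file, fixed)))          # 30 bits
print(encode(file, shorter_c), len(encode(file, shorter_c)))  # 25 bits

Running it prints the 30-bit and 25-bit strings shown above.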

So the file is

0001000111011001011000101

That is 25 bits in total, so it obviously takes fewer bits to encode. However, is it possible to decode the original string of symbols using only the above string of bits and the codetable?

4 DECODING AND INSTANTANEOUS CODES

Decoding a file is easy when the length of each codeword is the same and the codetable is known. We just divide the file into chunks according to the codeword length and translate each chunk using the codetable (Example 1 in Section 3.2):

00,01,00,01,10,10,01,10,00,10,01,10,00,01,01

This becomes trickier when the codeword lengths differ. However, in Example 2 of Section 3.2 the file can still be decoded easily by reading ahead in the bit stream until we see a code that exists in the codetable:

A 00    B 01    C 1

Starting at the beginning of the stream, the first bit 0 is not a codeword, but the first 2 bits together (00) are the code for symbol A:

0001000111011001011000101

Proceeding from the 3rd bit, the 3rd and 4th bits (01) are the codeword for B:

01000111011001011000101

Repeating this process, the next two decoded symbols are A and B. The remaining portion of the file left to be decoded is now

11011001011000101

The next bit is 1, which is the code for symbol C. We decode the symbol and proceed from the next bit in the stream. Repeating this process until the end gives back the original unencoded file. Codes that can be decoded this way are called INSTANTANEOUS CODES. When choosing codewords for symbols, these types of codes are always used.
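As a rough illustration of this look-ahead decoding (not part of the original notes; decode is a hypothetical helper), the following Python sketch accumulates bits until they match a codeword in the table:

# Decode an instantaneous code by reading bits until the bits seen so far
# match a codeword in the table.
def decode(bits, codetable):
    inverse = {code: symbol for symbol, code in codetable.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:                # safe only because no codeword
            symbols.append(inverse[current])  # is a prefix of another codeword
            current = ""
    return "".join(symbols)

codetable = {"A": "00", "B": "01", "C": "1"}
print(decode("0001000111011001011000101", codetable))  # ABABCCBCACBCABB

This greedy matching only works because the code is instantaneous: once the accumulated bits match a codeword, no longer codeword could also match.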

4.1 DEFINITION

A set of codewords is instantaneous if and only if no prefix of any codeword is itself one of the codewords in the set. A prefix of a codeword is any run of bits taken from its start, excluding the full codeword itself. For example:

Codeword    Prefixes
01          0
0111001     0, 01, 011, 0111, 01110, 011100

Looking at all of the codetables used in the previous examples, it can be seen that they are all instantaneous codes.

5 TOWARDS MINIMIZING FILE SIZES - ENTROPY

Q. Revisiting Example 2 in Section 3.2, would we get a smaller file size if we used the 1-bit codeword for a symbol other than C?

A. Yes. We should give it to the symbol that occurs most often, which is the symbol B. In terms of probability, we give it to the symbol that occurs with the highest probability.

Symbol    Number of Occurrences    Probability
A         4                        4/15
B         6                        6/15
C         5                        5/15

So if we give the shortest code to the symbol B, we save one more bit overall (24 bits instead of 25).

This leads to a more general question. If we have a file that contains many symbols (e.g. 256 for an image) that each occur with a different probability, what codeword length should we choose for each symbol? To answer this question, we need to look at the concept of entropy.

5.1 ENTROPY

In "A Mathematical Theory of Communication" Claude Shannon developed the concept of entropy as a means to quantify the average information content per symbol of a file with a set of symbols whose probabilities are known. Under Shannon's reasoning, the amount of information gained, I_k, upon observation of a symbol k of probability p_k in the file is given by

I_k = -log2(p_k) = log2(1/p_k).

Information content is also measured in units of bits. The entropy of an information source is defined as the average amount of information gained upon observation of a symbol generated by the source. It is simply the average of the information content of each symbol in the alphabet, weighted in proportion to the probability of each symbol occurring:

H(X) = -Σ_k p_k log2(p_k)

Recalling Example 2 in Section 3.2, the entropy of the information source that generates the file is

H(X) = -(4/15) log2(4/15) - (6/15) log2(6/15) - (5/15) log2(5/15) = 1.57 bits/symbol.

5.2 SOURCE CODING THEOREM

The Source Coding Theorem (also known as Shannon's Source Coding Theorem) establishes that the entropy is the minimum bound on the average codeword length that will preserve the information content of the file. We cannot go lower than this value without losing information. It also says that we can always achieve an average codeword length within 1 bit per symbol of this minimum. In practice it is possible to get much closer to the entropy than that.

Recall that we needed 24 bits to encode the 15 symbols in the earlier example. Therefore the average codeword length is 24/15 = 1.6 bits/symbol, which is only slightly greater than the entropy (1.57 bits/symbol). Another way to compute the average codeword length is as a weighted average of the codeword length, l_k, of each symbol:

L̄ = Σ_k p_k l_k.

Hence, in our example, L̄ = (4/15)×2 + (6/15)×1 + (5/15)×2 = 1.6 bits/symbol.
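For concreteness, the following Python sketch (not part of the original notes) reproduces the entropy and average codeword length for the file of Section 3.2, assuming the code lengths A: 2, B: 1, C: 2 discussed above:

# Entropy H(X) = -sum p_k log2(p_k) and average length Lbar = sum p_k l_k
# for the example file of Section 3.2.
from collections import Counter
from math import log2

file = "ABABCCBCACBCABB"
counts = Counter(file)                           # A: 4, B: 6, C: 5
probs = {s: n / len(file) for s, n in counts.items()}

H = -sum(p * log2(p) for p in probs.values())
print(round(H, 2))                               # 1.57 bits/symbol

lengths = {"A": 2, "B": 1, "C": 2}               # shortest codeword given to B
L = sum(probs[s] * lengths[s] for s in probs)
print(round(L, 2))                               # 1.6 bits/symbol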

6 PRACTICAL CODING ALGORITHMS - THE HUFFMAN ALGORITHM

Unfortunately, the coding theorem tells us nothing about how we might select the optimal code length for each symbol. This was easy for our example with only 3 symbols, but in general it is much more difficult.

Example 3. Consider an information source with four possible symbols A, B, C and D, where

p_A = 0.6, p_B = 0.25, p_C = 0.1, p_D = 0.05.

The entropy of the source works out to be H(X) = 1.49 bits/symbol. Let's consider 2 possible ways of encoding the source.

1. Assign a fixed codeword length of 2 bits to each symbol, for example:

A 00    B 01    C 10    D 11

The average codeword length (L̄) here is 2 bits/symbol, which is within 1 bit per symbol of the entropy, in accordance with the coding theorem, but is not very good. The coding efficiency, defined as η = H(X)/L̄, is 1.49/2 ≈ 75%. We should be able to do better.

2. Assign a codeword length close to the information content of each symbol. For example, we could round up the information content to the nearest integer:

Symbol, k    Probability, p_k    Information Content, I_k    Codeword Length, l_k
A            0.6                 0.74                        1
B            0.25                2                           2
C            0.1                 3.32                        4
D            0.05                4.32                        5

We could then assign an instantaneous code with the above code lengths as follows:

A 0    B 10    C 1100    D 11010

This would give an average codeword length of

L̄ = 0.6×1 + 0.25×2 + 0.1×4 + 0.05×5 = 1.75 bits/symbol

So this is obviously better than before. However, we could have used 4 bits for symbol D and the code would still have been instantaneous, so this code is not optimal.

6.1 THE HUFFMAN CODING ALGORITHM

In 1951 David Huffman proposed a method of assigning optimal codewords to symbols, and more specifically of assigning optimal code lengths. No other method of mapping codewords to individual symbols will give a lower average codeword length. The algorithm builds a binary tree in which the symbols form the leaves, working from the leaves towards the root. The tree is built by performing the following step over and over:

1. Replace the two symbols with the lowest probabilities/frequencies with a new symbol whose probability/frequency is the sum of the probabilities/frequencies of the two symbols, until there is only one symbol left.

The Huffman tree for the above example is shown in Figure 1. The length of the code for each symbol is given by the number of branches on the path between that symbol and the root of the tree. Therefore the optimal codeword lengths for this example are 1 bit for A, 2 bits for B and 3 bits for C and D.

[Figure 1: The Huffman coding tree for Example 3. Symbols C and D are grouped together first (red), then symbol B and the replacement symbol CD are grouped (green), and finally A is grouped with the replacement symbol BCD (purple).]

You can choose any bits you want for each codeword as long as the correct lengths are used and the set of codes is instantaneous. This is generally done by giving each branch of the binary tree a bit value, where the two branches leaving each node are given opposite values. This can be done in any order. The table below summarises the results for this example.

Symbol, k    Probability, p_k    Codeword Length, l_k    Code
A            0.6                 1                       1
B            0.25                2                       01
C            0.1                 3                       001
D            0.05                3                       000

The average codeword length obtained using the Huffman code is

L̄ = 0.6×1 + 0.25×2 + 0.1×3 + 0.05×3 = 1.55 bits/symbol

and the efficiency is now

η = 1.49/1.55 ≈ 96%.
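The merging step described above is easy to implement with a priority queue. The following Python sketch (not part of the original notes; huffman_code is a hypothetical helper) reproduces the codeword lengths in the table; the individual bits may differ from Figure 1 depending on how the branches are labelled, but the average length is the same:

# Huffman coding: repeatedly merge the two least probable entries, prefixing
# a bit to the partial codes of the symbols in each merged entry.
import heapq

def huffman_code(probs):
    # Each heap entry: (probability, tie-breaker, {symbol: partial code}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p0, _, codes0 = heapq.heappop(heap)   # lowest probability
        p1, _, codes1 = heapq.heappop(heap)   # second lowest
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (p0 + p1, count, merged))
        count += 1
    return heap[0][2]

probs = {"A": 0.6, "B": 0.25, "C": 0.1, "D": 0.05}
code = huffman_code(probs)
print(code)                      # codeword lengths 1, 2, 3, 3 as in the table
print(round(sum(probs[s] * len(c) for s, c in code.items()), 2))  # 1.55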

6.1.1 Practical Considerations

The Huffman algorithm is widely used in the compression of all types of data and appears in the majority of compression algorithms and standards (ZIP, PNG, JPEG, MPEG, MP3, AAC etc.). This is due to the simplicity of the encoding procedure and to the fact that it works reliably. To compress a file, the file is first scanned to generate a histogram of the symbols it contains, and the Huffman tree and codetable are then calculated. In order to decompress the file, the codetable must be included along with the compressed data. This obviously leads to a data overhead; however, the overhead is less significant for larger files.

7 A PROBLEM - SOURCES WITH MEMORY, JOINT ENTROPY AND ENTROPY RATE

In the previous sections we have been assuming that an information source simply selects symbols independently at random from a set to form a file. Such sources are sometimes referred to as memoryless. However, this is obviously not a practical model for real-life files. Consider a text file that starts with the following characters:

The sky is blu

Obviously, we would expect the next letter to be the character "e" with high probability, and much higher than the marginal probability of the character "e" occurring in English. If we use the Huffman coding procedure as described previously (i.e. based on the marginal probabilities of the characters), we will be using a codeword that is too long and hence the code will be inefficient. Practical information sources like this are referred to as sources with memory.

More fundamentally, the concept of entropy has to be adapted to deal with sources with memory. Rather than thinking of a file as being generated by one information source, we can consider each observed symbol as being generated by a different information source, where the probabilities of the symbols vary. Although in theory it would be possible to calculate a different Huffman codetable for each symbol, this is obviously out of the question given the overhead of including the codetables in the compressed file. So we have to do something else. There are many different types of entropy that can be used to describe sources with memory; here we will look at only two: joint entropy and the entropy rate.

7.1 JOINT ENTROPY

Consider a file of N symbols, each generated by separate information sources X_1, X_2, ..., X_N. Then the joint entropy is defined as

H(X_1, X_2, ..., X_N) = -Σ p_{k1,k2,...,kN} log2(p_{k1,k2,...,kN}),

where the sum is taken over all possible combinations of symbols and p_{k1,k2,...,kN} is the joint probability distribution over the N symbols. The joint entropy is essentially the information content of the entire file. It is the theoretical minimum number of bits that we need to encode the file.

7.2 ENTROPY RATE

If the joint entropy represents the information content of a file, then the entropy rate is the average information content of a given symbol. The entropy rate of a source X is defined as

H(X) = lim_{N→∞} H(X_1, X_2, ..., X_N) / N.

In practice it is almost impossible to calculate. For example, it is not feasible to capture entirely in probabilistic terms the spatial dependency of the pixel intensities in an image.

The entropy rate generalizes the concept of entropy to account for sources with memory. It is the minimum bound on the average number of bits needed to encode a symbol of any file, whether the information source has memory or is memoryless. Hence, the entropy rate is notionally the best lossless compression we can achieve. However, we almost never know what this optimum is. If a source is memoryless, then the entropy rate is the same as the entropy of the source calculated from the marginal probability distribution of the symbols (i.e. using the equation in Section 5.1). If the source has memory, then the entropy rate is always less. Because the entropy rate is normally less than the entropy, more compression is possible than is predicted by the entropy value. In fact, the entropy of an 8-bit greyscale image is commonly greater than 7 bits per symbol, but as we have seen already in 4C8, we can achieve lossless compression at a much lower bit rate.
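As a rough numerical illustration of the entropy-rate definition (not part of the original notes), the following Python sketch estimates the per-symbol entropy of length-N blocks, H(X_1, ..., X_N)/N, for a synthetic source with memory in which each symbol tends to repeat its predecessor; the source model and the repeat probability of 0.9 are illustrative assumptions only:

# Estimate the per-symbol entropy of length-N blocks from a sample sequence.
from collections import Counter
from math import log2
import random

random.seed(0)
seq, prev = [], "A"
for _ in range(100000):
    # Assumed toy source with memory: keep the previous symbol with
    # probability 0.9, otherwise pick uniformly from the alphabet.
    prev = prev if random.random() < 0.9 else random.choice("ABCD")
    seq.append(prev)

def block_entropy_per_symbol(seq, N):
    blocks = ["".join(seq[i:i + N]) for i in range(0, len(seq) - N + 1, N)]
    counts = Counter(blocks)
    total = sum(counts.values())
    H = -sum((n / total) * log2(n / total) for n in counts.values())
    return H / N

for N in (1, 2, 4, 8):
    print(N, round(block_entropy_per_symbol(seq, N), 3))

For such a source the per-symbol block entropy falls below the single-symbol entropy as N grows, which is exactly why exploiting memory allows more compression than the marginal entropy suggests.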

The lossless compression algorithms used in file formats like ZIP, GZIP and PNG use dictionary coding techniques in conjunction with the Huffman algorithm to achieve better efficiency. These algorithms achieve average codeword lengths that tend to the entropy rate as the file length tends to infinity, although their efficiency is lower for small files. DEFLATE is an example of such a coding technique. The most popular dictionary coding algorithms are LZ77, LZ78 and LZW; the initials refer to the names of the authors of the papers in which the algorithms were proposed (Abraham Lempel, Jacob Ziv and Terry Welch) and the numbers to the years of publication of those papers. You will most likely look at dictionary coding techniques in your telecoms courses. As they are not commonly used in lossy compression formats for audio and video, we will not revisit them here.