Communication Theory II
Lecture 13: Information Theory (cont'd)
Ahmed Elnakib, PhD
Assistant Professor, Mansoura University, Egypt
March 22nd, 2015
Lecture Outlines
o Source Code Generation
o Source Coding Theorem
o Lossless Data Compression Algorithms
o Prefix Coding
o Huffman Coding
o Lempel-Ziv Coding
Source Encoder: How to design?
Source Code Generation: Source Encoder
o How do we represent data generated by a discrete source of information?
- Process: source encoding
- Device: source encoder
o Requirements:
- The codeword produced by the encoder is in binary form
- The codeword is represented with the minimum number of bits (as low as H(S))
o Perfect source encoder: the source code is uniquely decodable
- The original source sequence can be reconstructed perfectly from the encoded binary sequence
Shannon's First Theorem: The Source-Coding Theorem
o Given a discrete memoryless source whose output is denoted by the random variable S, the entropy H(S) imposes the following bound on the average codeword length L̄ for any source-encoding scheme:

\bar{L} \geq H(S)

o According to this theorem, the entropy H(S) represents a fundamental limit on the average number of bits per source symbol necessary to represent a discrete memoryless source
When is L̄ = H(S) Achieved? (The Optimal Value)
o Suppose each time we transmit a sequence (S^N) of N messages (an N-th-order extended source); the average information per message is the entropy

H(m) = E[I_i] = \sum_{i=1}^{n} P_i I_i = \sum_{i=1}^{n} P_i \log_2(1/P_i) \ \text{bits}
Average Codeword Length of a Source Encoder
o For a source with n symbols, where symbol s_i occurs with probability P_i and is assigned a codeword of l_i bits, the average codeword length is

\bar{L} = \sum_{i=1}^{n} P_i l_i \ \text{bits/symbol}

[Figure omitted: block diagram of a source encoder]
Efficiency/Redundancy of a Source Encoder
o The coding efficiency is defined as:

\eta = \bar{L}_{\min} / \bar{L} = H(m) / \bar{L}

o The redundancy γ is defined as:

\gamma = 1 - \eta

(A short computational sketch follows below.)
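As a minimal sketch of these definitions, the following Python snippet computes H(S), L̄, η, and γ for an illustrative source; the probabilities and codeword lengths below are assumptions for the demo, not values from the lecture.

```python
import math

def entropy(probs):
    """Source entropy H(S) in bits per symbol."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

def avg_codeword_length(probs, lengths):
    """Average codeword length L-bar = sum(P_i * l_i)."""
    return sum(p * l for p, l in zip(probs, lengths))

# Illustrative source (assumed): 4 symbols with dyadic probabilities,
# encoded with codeword lengths 1, 2, 3, 3 bits.
probs = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]

H = entropy(probs)                       # fundamental limit H(S)
L = avg_codeword_length(probs, lengths)  # achieved average length
eta = H / L                              # efficiency
gamma = 1 - eta                          # redundancy

print(f"H(S) = {H:.3f} bits, L-bar = {L:.3f} bits")
print(f"efficiency = {eta:.3f}, redundancy = {gamma:.3f}")
```

For this dyadic source the bound L̄ ≥ H(S) is met with equality (η = 1, γ = 0); with less convenient probabilities the code falls short of the entropy limit and η < 1.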
Lossless Data Compression
o A common characteristic of signals generated by physical sources is that, in their natural form, they contain a significant amount of redundant information
- E.g., a record of business transactions constitutes a redundant sequence in the sense that any two adjacent symbols are typically correlated with each other
o Lossless data compression: an operation performed on a digital signal that removes redundant information from the signal prior to transmission (with no loss of information)
- Produces an output code that efficiently represents the source with the minimum average number of bits per symbol
- The original data can be reconstructed with no loss of information
o Limits of lossless data compression: the entropy of the source establishes the fundamental limit on the removal of redundancy from the data, L̄ ≥ H(S)
o Procedure: assign short descriptions to the most frequent outcomes of the source output and longer descriptions to the less frequent ones
Types of Source Codes: Prefix Codes
o A prefix code is a code in which no codeword is the prefix of any other codeword
o Example: which of the following represents a prefix code? (A prefix-check sketch follows below.)

[Table omitted: the source symbols, their probabilities of occurrence, and candidate Codes I, II, and III]
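Since the table's codewords did not survive extraction, here is a minimal sketch of how the prefix condition can be checked programmatically; the three candidate codes below follow a common textbook version of this example and are assumptions, not necessarily the slide's exact Codes I-III.

```python
def is_prefix_code(codewords):
    """Return True if no codeword is the prefix of any other codeword."""
    for i, a in enumerate(codewords):
        for j, b in enumerate(codewords):
            if i != j and b.startswith(a):
                return False
    return True

# Illustrative candidate codes for four symbols s0..s3 (assumed).
codes = {
    "Code I":   ["0", "1", "00", "11"],     # '0' is a prefix of '00'
    "Code II":  ["0", "10", "110", "111"],  # prefix-free
    "Code III": ["0", "01", "011", "0111"], # '0' is a prefix of '01'
}
for name, cw in codes.items():
    verdict = "prefix code" if is_prefix_code(cw) else "not a prefix code"
    print(f"{name}: {verdict}")
```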
Properties of Prefix Codes
o Uniquely decodable
- The end of each codeword is always recognizable (see the decision tree for Code II)
o Instantaneous code (why?)
- Each codeword can be decoded as soon as its last bit arrives, with no lookahead (a decoder sketch follows below)
o Offers the possibility of realizing an average codeword length that can be made arbitrarily close to the source entropy:

H(S) \leq \bar{L} < H(S) + 1

where L̄ represents the average codeword length of the prefix code

[Figure omitted: decision tree for Code II, with leaves s0, s1, s2, s3]
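To make "instantaneous" concrete, here is a minimal decoder sketch: it scans the bit stream and emits a symbol the moment a complete codeword is recognized, which is exactly what walking the decision tree from the root accomplishes. The Code II codewords are the same assumed ones used above.

```python
def decode_prefix(bits, codebook):
    """Instantaneously decode a bit string using a prefix code.

    codebook maps symbol -> codeword. Because no codeword is a prefix
    of another, a symbol can be emitted the moment its codeword is
    completed; no future bits are ever needed.
    """
    inverse = {cw: sym for sym, cw in codebook.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:          # complete codeword recognized
            out.append(inverse[buf])
            buf = ""
    if buf:
        raise ValueError("bit stream ended mid-codeword")
    return out

# Code II (assumed codewords, as above)
code_ii = {"s0": "0", "s1": "10", "s2": "110", "s3": "111"}
print(decode_prefix("01011011110", code_ii))
# -> ['s0', 's1', 's2', 's3', 's1']
```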
Extended Prefix Codes
o Higher-order extended prefix codes can offer higher efficiency
- For the n-th-order extension of the source, the average codeword length L̄_n of the extended code satisfies

H(S) \leq \bar{L}_n / n < H(S) + 1/n

- Average length per message: L̄_n / n
o The average codeword length of an extended prefix code can therefore be made as small as the entropy of the source, provided the extended code has a high enough order, in accordance with the source-coding theorem
o The price we pay for decreasing the average codeword length is increased decoding complexity, brought about by the high order of the extended prefix code
Optimal Prefix Codes: Huffman Encoding
o Optimal in the sense that the code has the shortest expected length
o Huffman coding algorithm (a runnable sketch follows below):
1. Splitting stage: the source symbols are listed in order of decreasing probability; the two source symbols of lowest probability are assigned 0 and 1
2. These two source symbols are then combined into a new source symbol with probability equal to the sum of the two original probabilities
- The list of source symbols, and therefore the source statistics, is thereby reduced in size by one
- The probability of the new symbol is placed in the list in accordance with its value
3. The procedure is repeated until we are left with a final list of only two source symbols, to which 0 and 1 are assigned
4. The code for each (original) source symbol is found by working backward and tracing the sequence of 0s and 1s assigned to that symbol and its successors
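A minimal sketch of the algorithm above, using Python's heapq to keep the symbol list ordered by probability. Tie-breaking among equal probabilities is arbitrary, so the resulting codewords may differ from the lecture's (Huffman codes are not unique), but the average length is the same. The six probabilities at the bottom are assumptions for the demo, since Example 1's table did not survive extraction.

```python
import heapq
from itertools import count

def huffman(probabilities):
    """Build a binary Huffman code.

    probabilities: dict mapping symbol -> probability.
    Returns a dict mapping symbol -> codeword string.
    """
    # Heap entries: (probability, tiebreaker, {symbol: partial codeword}).
    tie = count()
    heap = [(p, next(tie), {sym: ""}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Combine the two least probable entries, prepending 0/1 to
        # the codewords of every symbol inside each entry.
        p0, _, group0 = heapq.heappop(heap)
        p1, _, group1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in group0.items()}
        merged.update({s: "1" + c for s, c in group1.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]

# Illustrative 6-message source (assumed probabilities).
probs = {"m1": 0.30, "m2": 0.25, "m3": 0.15,
         "m4": 0.12, "m5": 0.10, "m6": 0.08}
code = huffman(probs)
avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(code)
print(f"L-bar = {avg_len:.2f} bits")
```

The dictionary-of-partial-codewords representation mirrors step 4 of the algorithm: each merge prepends the branch bit, so working "backward" happens automatically as the tree is built.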
Example 1
(a) Construct a Huffman code for a source that produces 6 messages with the following probabilities:

[Table omitted: the six messages and their probabilities]

(b) Compute its efficiency
(c) Plot the decision tree
(d) Use the constructed Huffman code to encode and decode the given sequence
Example 1 Solution

[Table omitted: the six messages, their probabilities, and the successive Huffman reduction stages]
Example 1 Solution (cont'd)

[Slide content omitted: continuation of the Huffman construction; details did not survive extraction]
Example 1 Solution (cont'd)

(a) [Omitted: the resulting Huffman codewords]
(b) [Omitted: the computed efficiency]
Example 1 Solution (cont'd)
(c) Decision tree

[Figure omitted: the decoding decision tree; starting from the initial state, successive 0/1 branches lead to the leaves m1 through m6]
Example 1 Solution (cont'd)
(d) Notes:
(1) The Huffman code is a prefix code, i.e., uniquely decodable
(2) The Huffman code is not unique. Why?
- Different placements of the combined symbol yield different codes, but the average length remains constant
- Lower code-length variance is achieved by moving the probability of the combined symbol as high as possible in the list (a small numerical illustration follows below)
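A small numerical illustration of note (2), using a classic textbook five-symbol source; the probabilities and the two length assignments are assumptions for the demo, not Example 1's values. Both placement rules give the same average length, but placing the combined symbol as high as possible yields a much smaller variance of codeword length.

```python
# Illustrative source (assumed) and the codeword lengths produced by
# the two tie-breaking rules when probabilities are equal.
probs = [0.4, 0.2, 0.2, 0.1, 0.1]
lengths_high = [2, 2, 2, 3, 3]  # combined symbol moved as HIGH as possible
lengths_low  = [1, 2, 3, 4, 4]  # combined symbol moved as LOW as possible

def stats(probs, lengths):
    """Average codeword length and its variance."""
    avg = sum(p * l for p, l in zip(probs, lengths))
    var = sum(p * (l - avg) ** 2 for p, l in zip(probs, lengths))
    return avg, var

print(stats(probs, lengths_high))  # (2.2, 0.16)
print(stats(probs, lengths_low))   # (2.2, 1.36)
```

A low variance is desirable in practice because it keeps the encoder's output rate steadier, reducing buffering requirements.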
Example 2

[Table omitted: the messages and their probabilities; from the solution slide, the source has six messages]
Example 2 Solution
o The number of messages must equal 4 + 3k (as required for a quaternary Huffman code, in which each reduction removes three symbols from the list)
- For k = 1 (one reduction), the number of messages should be 7
- We should therefore add one dummy message with a probability equal to 0 (a sketch of this padding rule follows below)
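A minimal sketch of the padding rule, generalized to an r-ary Huffman code (the slide's 4 + 3k case corresponds to r = 4): valid message counts are r + k(r-1), so the number of zero-probability dummy messages is whatever brings n up to the next valid count.

```python
def num_dummies(n, r):
    """Number of zero-probability dummy messages needed so that an
    n-message source can be encoded with an r-ary Huffman code.

    Valid counts satisfy n = r + k*(r-1), i.e. n ≡ 1 (mod r-1).
    """
    if r < 2:
        raise ValueError("code alphabet must have at least 2 symbols")
    if r == 2 or n <= 1:
        return 0                # binary Huffman never needs padding
    return (1 - n) % (r - 1)

print(num_dummies(6, 4))  # 1 -> pad 6 messages to 7, as in Example 2
print(num_dummies(7, 4))  # 0 -> already a valid count
```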
Example 2 Solution (cont'd)

[Slide content omitted: the worked Huffman reduction did not survive extraction]
Example 3

[Problem statement omitted; from the solution slides, the source emits two messages with probabilities 0.8 and 0.2, and Huffman codes are constructed for the source and for its second- and third-order extensions]
Example 3 Solution

Messages    Probabilities
m1          0.8
m2          0.2

o With only two messages, the Huffman code assigns one bit to each (e.g., 0 and 1), so L̄ = 1 bit, while H(S) ≈ 0.722 bits, giving η ≈ 72.2%
Example 3 Solution: Second-Order Extension
o The extended source has four messages with probabilities 0.64, 0.16, 0.16, and 0.04; constructing a Huffman code for these yields L̄_2/2 = 0.78 bits per original message and η ≈ 92.6%

[Slide details omitted: the worked Huffman table for the second-order extension]
Example 3 Solution: Third-Order Extension
o The extended source has eight messages, with probabilities from 0.8^3 = 0.512 down to 0.2^3 = 0.008; constructing a Huffman code for these gives a still smaller average length per original message L̄_3/3 and a still higher efficiency η, approaching H(S) as predicted by the source-coding theorem (see the sketch after this slide)

[Slide details omitted: the worked Huffman table for the third-order extension]
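A sketch tying Example 3 back to the source-coding theorem, reusing the huffman() function from the earlier sketch: it builds the n-th-order extension of the (0.8, 0.2) source and shows L̄_n/n decreasing toward H(S) ≈ 0.722 bits as n grows.

```python
import math
from itertools import product

def extend(probs, n):
    """n-th-order extension: joint probabilities of all n-symbol blocks
    (independent symbols, since the source is memoryless)."""
    return {blk: math.prod(probs[s] for s in blk)   # Python 3.8+
            for blk in product(probs, repeat=n)}

base = {"m1": 0.8, "m2": 0.2}
H = sum(p * math.log2(1 / p) for p in base.values())
print(f"H(S) = {H:.4f} bits")

for n in (1, 2, 3):
    ext = extend(base, n)
    code = huffman(ext)           # huffman() from the earlier sketch
    avg = sum(ext[b] * len(code[b]) for b in ext) / n
    print(f"n = {n}: L-bar/n = {avg:.4f} bits, efficiency = {H/avg:.3f}")
```

Running this reproduces the trend of the example: one bit per message for the plain source, 0.78 bits per message for the second-order extension, and a value closer still to H(S) for the third-order extension.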
Drawbacks of Huffman Coding
o Requires knowledge of a probabilistic model of the source
- Source statistics are not always known a priori
o Not well suited to modeling text sources
- The codebook grows exponentially fast in the size of each super-symbol (grouping of letters)
- Impractical storage requirements
o How can these limitations be overcome?
- By using Lempel-Ziv coding
Lempel-Ziv Coding
o Currently the standard algorithm for file compression. Why?
- Simpler
- Adaptive (no prior source statistics needed)
- Fixed-length code
- Suitable for synchronous transmission
- Practical to implement
o In practice, fixed blocks of 12 bits are used (a codebook of 2^12 = 4096 possible entries)
o Idea: the source data stream is parsed into segments that are the shortest subsequences not encountered previously
Example
Consider the binary sequence 000101110010100101...
(a) Construct the Lempel-Ziv code for this sequence
(b) Show how to decode the sequence at the receiver
Solution
o Parsing the sequence into the shortest subsequences not encountered previously (with 0 and 1 stored initially in positions 1 and 2):

Position:       1     2     3      4      5      6      7      8      9
Subsequence:    0     1     00     01     011    10     010    100    101
Encoded block:  -     -     0010   0011   1001   0100   1000   1100   1101

o Each 4-bit encoded block consists of a 3-bit pointer to the root subsequence followed by a 1-bit innovation symbol
Solution (cont'd)
o Each encoded block = pointer to the root subsequence + innovation symbol
o The Lempel-Ziv decoder uses the pointer to identify the root subsequence and then appends the innovation symbol
o E.g., for the binary encoded block 1101 in position 9:
- The last bit, 1, is the innovation symbol
- The remaining bits, 110, point to the root subsequence 10 in position 6
- Hence, the block 1101 is decoded into 101, which is correct
(A runnable sketch of the encoder and decoder follows below.)
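A minimal sketch of the parsing and pointer/innovation encoding described above (an LZ78-style scheme, with positions 1 and 2 pre-loaded with 0 and 1 as in the example). The pointer width is fixed at 3 bits here only to reproduce the example's 4-bit blocks; a real implementation would use wider fixed blocks (e.g., the 12 bits mentioned earlier).

```python
def lz_encode(bits, ptr_width=3):
    """Parse `bits` into the shortest previously unseen subsequences and
    encode each as a fixed-width pointer plus a 1-bit innovation symbol.

    Assumes the stream ends exactly on a phrase boundary, as in the
    example, and that root positions fit in ptr_width bits."""
    book = {"0": 1, "1": 2}          # positions 1 and 2 pre-loaded
    blocks, buf = [], ""
    for b in bits:
        buf += b
        if buf not in book:          # shortest subsequence not seen before
            root, innovation = buf[:-1], buf[-1]
            blocks.append(format(book[root], f"0{ptr_width}b") + innovation)
            book[buf] = len(book) + 1
            buf = ""
    return blocks

def lz_decode(blocks, ptr_width=3):
    """Invert lz_encode: the pointer identifies the root subsequence,
    then the innovation symbol is appended."""
    book = {1: "0", 2: "1"}
    out = []
    for blk in blocks:
        root = book[int(blk[:ptr_width], 2)]
        sub = root + blk[ptr_width:]
        book[len(book) + 1] = sub
        out.append(sub)
    return "".join(out)

blocks = lz_encode("000101110010100101")
print(blocks)             # ['0010', '0011', '1001', '0100', '1000', '1100', '1101']
print(lz_decode(blocks))  # '000101110010100101'
```

Note how the decoder never needs the source statistics: it rebuilds the same codebook as the encoder, entry by entry, purely from the received blocks.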
Questions?