LECTURE VI: LOSSLESS COMPRESSION ALGORITHMS
DR. OUIEM BCHIR
STORAGE SPACE
Uncompressed graphics, audio, and video data require substantial storage capacity.
Storing uncompressed video is not feasible with today's storage technology (CD & DVD).
Transmission of uncompressed video over digital networks requires very high bandwidth.
To be cost-effective and feasible, multimedia systems must use compressed video and audio streams.
INTRODUCTION
Compression: the process of coding that effectively reduces the total number of bits needed to represent certain information.
General data compression scheme: the input data pass through an encoder (compression), the resulting codes are stored or transmitted, and a decoder (decompression) reconstructs the output data.
INTRODUCTION
If the compression and decompression processes induce no information loss, the compression scheme is lossless; otherwise, it is lossy.
Compression ratio = B0 / B1, where B0 is the number of bits before compression and B1 is the number of bits after compression.
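As a quick worked illustration (the numbers are made up, not taken from the slides): a clip that occupies 10 MB uncompressed and 2 MB after compression has a compression ratio of B0 / B1 = 10 / 2 = 5.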
COMPRESSION STEPS
TYPES OF COMPRESSION
Symmetric compression: roughly the same time is needed for the encoding and decoding phases.
Asymmetric compression: the compression process is performed once and enough time is available, hence compression can take longer; decompression is performed frequently and must be done fast.
STATISTICAL ENCODING (FREQUENCY DEPENDENT)
Fixed-length coding: use an equal number of bits to represent each symbol; an alphabet of n symbols requires L >= log2(n) bits per symbol. This is good encoding for symbols with equal probability of occurrence, but it is not efficient if the probabilities are unequal.
Variable-length encoding: frequently occurring characters are represented with shorter strings than seldom occurring characters. Statistical encoding is dependent on the frequency of occurrence of a character or of a sequence of data bytes.
Setting: you are given a sequence of symbols S1, S2, S3, ... and the probability of occurrence of each symbol P(Si) = Pi.
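To make the contrast concrete, here is a minimal Python sketch; the four-symbol source, its probabilities, and the example codeword lengths are made up for illustration only:

```python
from math import ceil, log2

# Hypothetical skewed source: four symbols, with S1 dominating.
probs = {"S1": 0.70, "S2": 0.15, "S3": 0.10, "S4": 0.05}

# Fixed-length coding ignores the probabilities: every symbol costs the same.
fixed_bits = ceil(log2(len(probs)))          # 2 bits per symbol

# A variable-length prefix code, e.g. S1=0, S2=10, S3=110, S4=111,
# gives the frequent symbols shorter codewords.
var_lengths = {"S1": 1, "S2": 2, "S3": 3, "S4": 3}
avg_bits = sum(probs[s] * var_lengths[s] for s in probs)

print(fixed_bits, avg_bits)                  # 2 vs 1.45 bits per symbol on average
```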
BASICS OF INFORMATION THEORY
The entropy η of an information source with alphabet S = {s1, s2, ..., sn} is
η = H(S) = Σ_{i=1..n} pi log2(1/pi) = − Σ_{i=1..n} pi log2(pi)
where pi is the probability that symbol si will occur in S, and log2(1/pi) is the amount of information contained in si, which corresponds to the number of bits needed to encode si.
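A minimal Python sketch of this definition, computing η directly from symbol frequencies (the sample inputs are only illustrative):

```python
from collections import Counter
from math import log2

def entropy(data) -> float:
    """eta = sum_i p_i * log2(1 / p_i): average information (bits) per symbol."""
    counts = Counter(data)
    n = len(data)
    return sum((c / n) * log2(n / c) for c in counts.values())

print(entropy("HELLO"))             # ~1.92 bits per symbol
print(entropy(bytes(range(256))))   # uniform over 256 values -> 8.0
```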
EXAMPLE
Uniform distribution over 256 gray levels: pi = 1/256 for every level; hence the entropy of the image is log2(256) = 8 bits per pixel.
ENTROPY AND CODE LENGTH
The entropy η is a weighted sum of the terms log2(1/pi); it represents the average amount of information contained per symbol in the source S.
The entropy η specifies the lower bound for the average number of bits needed to code each symbol in S, i.e. η <= l̄, where l̄ is the average length (measured in bits) of the codewords produced by the encoder.
RUN-LENGTH CODING
Memoryless source: an information source that is independently distributed, i.e. the value of the current symbol does not depend on the values of previously appeared symbols.
Run-Length Coding (RLC) is not memoryless: it exploits the memory present in the information source.
Rationale for RLC: if the information source has the property that symbols tend to form continuous groups, then such a symbol and the length of the group can be coded.
RUN-LENGTH CODING (RLC)
Content-dependent coding: RLC replaces a sequence of identical consecutive bytes with the byte and its number of occurrences; the run is marked with a special flag '!'.
RLC algorithm: if the same byte occurs at least 4 consecutive times, count the number of occurrences and write the compressed data in the format: the counted byte, the flag '!', the number of occurrences. Shorter runs are copied unchanged. A sketch follows below.
Example:
Uncompressed sequence: ABCCCCCCCCCDEFFFFGGG
Compressed sequence: ABC!9DEF!4GGG (from 20 bytes down to 13 bytes)
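A minimal Python sketch of the scheme described above, assuming '!' is the flag byte, that runs shorter than four bytes are copied unchanged, and that the flag character does not occur in the data:

```python
def rlc_encode(data: str, flag: str = "!") -> str:
    """Runs of 4 or more identical bytes become <byte><flag><run length>;
    shorter runs are copied as-is."""
    out = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1
        run = j - i
        if run >= 4:
            out.append(f"{data[i]}{flag}{run}")
        else:
            out.append(data[i] * run)
        i = j
    return "".join(out)

print(rlc_encode("ABCCCCCCCCCDEFFFFGGG"))  # -> ABC!9DEF!4GGG (13 bytes from 20)
```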
VARIABLE-LENGTH CODING (VLC)
Shannon-Fano algorithm: a top-down approach.
1. Sort the symbols according to the frequency count of their occurrences.
2. Recursively divide the symbols into two parts, each with approximately the same total count, until all parts contain only one symbol.
Example: coding of HELLO (a sketch of the procedure follows below).
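A minimal Python sketch of this top-down procedure. The split heuristic simply cuts at the first point where the running count reaches half the total, so it produces one of the possible HELLO coding trees, not necessarily the one drawn on the following slides:

```python
from collections import Counter

def shannon_fano(items):
    """items: list of (symbol, count) pairs sorted by descending count.
    Split where the running count first reaches half the total, prepend
    '0' to codes in the first part and '1' to the rest, and recurse
    until every part holds a single symbol."""
    if len(items) == 1:
        return {items[0][0]: ""}
    total = sum(c for _, c in items)
    acc, split = 0, 1
    for i, (_, c) in enumerate(items[:-1], start=1):
        acc += c
        if acc >= total / 2:
            split = i
            break
    codes = {}
    for sym, code in shannon_fano(items[:split]).items():
        codes[sym] = "0" + code
    for sym, code in shannon_fano(items[split:]).items():
        codes[sym] = "1" + code
    return codes

counts = sorted(Counter("HELLO").items(), key=lambda kv: -kv[1])
print(shannon_fano(counts))  # one possible result: {'L': '00', 'H': '01', 'E': '10', 'O': '11'}
```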
Another coding tree for HELLO by Shannon-Fano
HUFFMAN CODING ALGORITHM
Characters are stored with their probabilities.
The number of bits of the coded characters differs: the shortest code is assigned to the most frequently occurring character.
To determine the Huffman code, we construct a binary tree:
Leaves are the characters to be encoded.
Nodes contain the occurrence probabilities of the characters belonging to the subtree.
0 and 1 are assigned to the branches of the tree arbitrarily; therefore different Huffman codes are possible for the same data.
A Huffman table is generated; Huffman tables must be transmitted with the compressed data. A construction sketch follows below.
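A minimal Python sketch of the bottom-up tree construction, using a min-heap keyed on frequency counts (counts stand in for probabilities). As noted above, the 0/1 branch assignment is arbitrary, so other equally valid tables exist for the same input:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Repeatedly merge the two nodes with the smallest counts into one
    subtree, then read 0/1 codes off the final tree."""
    heap = [(count, i, sym) for i, (sym, count) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate case: one distinct symbol
        return {heap[0][2]: "0"}
    tiebreak = len(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, tiebreak, (left, right)))
        tiebreak += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):         # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                               # leaf: a symbol
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

print(huffman_codes("HELLO"))  # e.g. {'H': '00', 'E': '01', 'O': '10', 'L': '11'}
```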
EXAMPLE OF HUFFMAN CODING
PROPERTIES OF HUFFMAN CODING
Unique prefix property: no Huffman code is a prefix of any other Huffman code; this precludes any ambiguity in decoding.
Optimality: a minimum-redundancy code, proved optimal for a given data model (i.e., a given, accurate probability distribution).
The two least frequent symbols have Huffman codes of the same length, differing only in the last bit.
Symbols that occur more frequently have shorter Huffman codes than symbols that occur less frequently.
The average code length l̄ for an information source S is strictly less than η + 1, i.e. η <= l̄ < η + 1 (checked on the HELLO example below).
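A small Python check of the bound η <= l̄ < η + 1 on the HELLO example, using one valid Huffman table for it (the specific codewords, e.g. L=0, H=10, E=110, O=111, are just one possible assignment):

```python
from collections import Counter
from math import log2

text = "HELLO"
n = len(text)
probs = {s: c / n for s, c in Counter(text).items()}

# Entropy: eta = sum p_i * log2(1/p_i)
eta = sum(p * log2(1 / p) for p in probs.values())

# Codeword lengths for one valid Huffman table: L=0, H=10, E=110, O=111.
code_len = {"L": 1, "H": 2, "E": 3, "O": 3}
avg_len = sum(probs[s] * code_len[s] for s in probs)

print(round(eta, 3), avg_len)   # ~1.922 and 2.0, so eta <= avg_len < eta + 1 holds
```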
ARITHMETIC CODING
Each symbol is coded by considering the prior data, so the encoded sequence must be read from the beginning; no random access is possible.
The message is represented by an interval of real numbers between 0 and 1: when the message becomes longer, the interval shortens and the number of bits needed to represent the interval increases.
ARITHMETIC VS. HUFFMAN
Arithmetic encoding does not encode each symbol separately; Huffman encoding does.
Arithmetic encoding transmits only the length of the encoded string; Huffman encoding transmits the Huffman table.
Compression ratios of both are similar.
ARITHMETIC CODING ENCODER
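A minimal Python sketch of the encoder's interval-narrowing loop. The probability model below (the [low, high) sub-interval assigned to each of A, C, E and the terminator '$') is an assumed example, not the table used in the slides:

```python
# Assumed example model: each symbol owns a [low, high) sub-interval of [0, 1)
# proportional to its probability; '$' acts as the terminator symbol.
MODEL = {
    "A": (0.0, 0.2),
    "C": (0.2, 0.5),
    "E": (0.5, 0.9),
    "$": (0.9, 1.0),
}

def arithmetic_encode(message: str):
    """Narrow [low, high) once per symbol; any number in the final
    interval identifies the whole message."""
    low, high = 0.0, 1.0
    for sym in message:
        rng = high - low
        sym_low, sym_high = MODEL[sym]
        low, high = low + rng * sym_low, low + rng * sym_high
    return low, high

low, high = arithmetic_encode("CAEE$")
print(low, high)   # any value in [low, high) encodes "CAEE$"
```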
Example: Encode Symbols CAEE$
The final step in arithmetic encoding calls for the generation of a number that falls within the range [low, high). The bit-generation procedure sketched below ensures that the shortest binary codeword is found.
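A minimal Python sketch of that bit-generation step, growing a binary fraction one bit at a time until its value lands inside [low, high); a real coder would use integer arithmetic with rescaling rather than floats:

```python
def generate_codeword(low: float, high: float) -> str:
    """Try to set each new fraction bit to 1, but keep it only if the value
    stays below `high`; stop as soon as the value reaches `low`.
    The resulting binary fraction lies in [low, high)."""
    if low == 0.0:
        return "0"                      # 0.0 itself already falls in [0, high)
    bits, value, k = [], 0.0, 1
    while value < low:
        bit = 2.0 ** -k
        if value + bit < high:          # setting this bit keeps us below high
            value += bit
            bits.append("1")
        else:
            bits.append("0")
        k += 1
    return "".join(bits)

# e.g. generate_codeword(*arithmetic_encode("CAEE$")), using the sketch above
```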
ARITHMETIC CODING DECODER
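A minimal Python sketch of the decoder, reusing the assumed MODEL and arithmetic_encode from the encoder sketch above: it repeatedly finds the sub-interval containing the received value, emits that symbol, rescales the value back into [0, 1), and stops at the terminator '$':

```python
def arithmetic_decode(value: float) -> str:
    """Decode a value taken from a valid encoder interval.
    Assumes MODEL from the encoder sketch above is in scope."""
    out = []
    while True:
        for sym, (sym_low, sym_high) in MODEL.items():
            if sym_low <= value < sym_high:
                out.append(sym)
                value = (value - sym_low) / (sym_high - sym_low)
                break
        if out[-1] == "$":              # '$' is the terminator in the assumed model
            return "".join(out)

low, high = arithmetic_encode("CAEE$")
print(arithmetic_decode((low + high) / 2))   # -> "CAEE$"
```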
Decoding symbols CAEE$