Huffman Coding - A Greedy Algorithm. Slides based on Kevin Wayne / Pearson-Addison Wesley

Huffman Coding - A Greedy Algorithm. Slides based on Kevin Wayne / Pearson-Addison Wesley

Greedy Algorithms: Build up solutions in small steps. Make local decisions. Previous decisions are never reconsidered. Slides based on Kevin Wayne / Pearson-Addison Wesley 2

The Task at Hand: Encoding symbols using bits. Suppose we want to represent symbols using a computer: letters of the English alphabet (or, more broadly, ASCII characters), pixels of an image, audio information from a sound clip. Ultimately these are converted to bits. How do we represent these symbols? How do we do it efficiently, using the fewest bits, while making sure that we can decode the bits to recover our original symbols? This is a fundamental aspect of data compression. Slides based on Kevin Wayne / Pearson-Addison Wesley 3

Types of Codes that can be used to represent symbols: Fixed-length codes. Every symbol is encoded using an equal number of bits. E.g. the (extended) ASCII code: 256 symbols, each encoded using 8 bits: A = 01000001, ! = 00100001, 8 = 00111000, q = 01110001. Decoding: simply chop the data into blocks of 8 bits each and decode one symbol for each block. Ease of decoding is an advantage of fixed-length codes. Slides based on Kevin Wayne / Pearson-Addison Wesley 4
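To make the fixed-length scheme concrete, here is a minimal Python sketch (my own illustration, not part of the slides) that encodes a short string with 8 bits per symbol and decodes it by chopping the bit stream into 8-bit blocks:

    def fixed_length_encode(text):
        # Each character becomes exactly 8 bits (extended ASCII range assumed).
        return "".join(format(ord(ch), "08b") for ch in text)

    def fixed_length_decode(bits):
        # Decoding is trivial: cut the stream into 8-bit blocks, one symbol per block.
        return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))

    encoded = fixed_length_encode("A!8q")
    print(encoded)                       # 01000001001000010011100001110001
    print(fixed_length_decode(encoded))  # A!8q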

Types of Codes that can be used to represent symbols: Fixed-length codes. Every symbol is encoded using an equal number of bits. E.g. color image pixels: most consumer cameras generate 24-bit color images (8 bits for Red, 8 bits for Green, 8 bits for Blue). If an 8-Megapixel image isn't compressed, it requires 24 Megabytes of storage! Question: how can we encode symbols more efficiently? We'll define efficiency as the average number of bits needed per symbol. Slides based on Kevin Wayne / Pearson-Addison Wesley 5

Consider Encoding English Text using Fixed-Length Codes. Suppose we restrict ourselves to just the 26 uppercase letters, plus 6 extra characters (space, comma, period, !, ?, ). That gives 32 symbols, so a fixed-length code needs 5 bits per symbol (2^5 = 32). Is this the most efficient way to represent these symbols? Why not? It assumes that their frequency of occurrence is uniform. What might we try instead? We'd conceivably do better if we could somehow assign shorter codewords to frequently occurring symbols, at the expense of longer codewords for infrequently occurring symbols. Slides based on Kevin Wayne / Pearson-Addison Wesley 6

Letter Probabilities in the English Alphabet

Symbol   Probability      Symbol   Probability
Space    0.1859           N        0.0574
A        0.0642           O        0.0632
B        0.0127           P        0.0152
C        0.0218           Q        0.0008
D        0.0317           R        0.0484
E        0.1031           S        0.0514
F        0.0208           T        0.0796
G        0.0152           U        0.0228
H        0.0467           V        0.0083
I        0.0575           W        0.0175
J        0.0008           X        0.0013
K        0.0049           Y        0.0164
L        0.0321           Z        0.0005
M        0.0198

N. Abramson, Information Theory and Coding, McGraw Hill, 1963. Slides based on Kevin Wayne / Pearson-Addison Wesley 7

Variable Length Codes: Assign shorter codewords to more frequent symbols, longer codewords to less frequent symbols.

Symbol   Probability   Code I   Code II   Code III
A        0.60          00       0         0
B        0.30          01       1         10
C        0.05          10       10        110
D        0.05          11       11        111

What's the problem with Code II? Suppose we receive the encoding 11011. Is that BBABB, or BCD, or DAD, or ...? Code II is not uniquely decodable. Slides based on Kevin Wayne / Pearson-Addison Wesley 8
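To see the ambiguity mechanically, here is a small brute-force parser (my own illustration, not from the slides) that lists every way a bit string can be split into Code II codewords:

    # Code II from the table above; the helper is hypothetical, for illustration only.
    CODE_II = {"A": "0", "B": "1", "C": "10", "D": "11"}

    def decodings(bits, code):
        # Return every symbol sequence whose concatenated codewords equal `bits`.
        if bits == "":
            return [""]
        results = []
        for symbol, word in code.items():
            if bits.startswith(word):
                results += [symbol + rest for rest in decodings(bits[len(word):], code)]
        return results

    print(decodings("11011", CODE_II))
    # Several parses come back (BBABB, BCD, DAD, among others), so Code II is not uniquely decodable.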

Variable Length Codes: Assign shorter codewords to more frequent symbols, longer codewords to less frequent symbols.

Symbol   Probability   Code I   Code II   Code III
A        0.60          00       0         0
B        0.30          01       1         10
C        0.05          10       10        110
D        0.05          11       11        111

Code III is a prefix-free variable length code (somewhat confusingly, we will refer to this as a prefix code): no codeword is a prefix of any other codeword. Prefix codes are uniquely decodable. Slides based on Kevin Wayne / Pearson-Addison Wesley 9

Variable Length Codes: Assign shorter codewords to more frequent symbols, longer codewords to less frequent symbols.

Symbol   Probability   Code I   Code II   Code III
A        0.60          00       0         0
B        0.30          01       1         10
C        0.05          10       10        110
D        0.05          11       11        111

What is the efficiency of the fixed-length Code I versus prefix Code III? Fixed length requires 2 bits per symbol. Code III requires 0.6*1 + 0.3*2 + (0.05 + 0.05)*3 = 1.5 bits per symbol. Slides based on Kevin Wayne / Pearson-Addison Wesley 10
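A quick check of that expected-length arithmetic (an illustrative sketch, not from the slides):

    # Average bits per symbol = sum over symbols of P(symbol) * len(codeword).
    probabilities = {"A": 0.60, "B": 0.30, "C": 0.05, "D": 0.05}
    code_I   = {"A": "00", "B": "01", "C": "10",  "D": "11"}
    code_III = {"A": "0",  "B": "10", "C": "110", "D": "111"}

    def average_bits(code, probs):
        return sum(probs[s] * len(code[s]) for s in probs)

    print(round(average_bits(code_I, probabilities), 2))    # 2.0 bits per symbol
    print(round(average_bits(code_III, probabilities), 2))  # 1.5 bits per symbol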

Optimal Prefix Codes: Among all possible prefix codes, can we devise an algorithm that will give us an optimal prefix code, i.e. one that most efficiently encodes the symbols, with the lowest average number of bits per symbol? Huffman codes are optimal prefix codes. Slides based on Kevin Wayne / Pearson-Addison Wesley 11

Representing Prefix Codes as Binary Trees: Each leaf node represents a symbol. Each root-to-leaf path represents the encoding of that symbol: traveling to the left child is a 0, traveling to the right child is a 1. Why is it a prefix code? For one symbol's codeword to be a prefix of another's, its node would have to be a non-leaf, but symbols sit only at leaves. Slides based on Kevin Wayne / Pearson-Addison Wesley 12
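As a small illustration (my own sketch, not from the slides), the tree for Code III can be written as nested pairs; decoding walks left on 0 and right on 1, restarting at the root whenever a leaf is reached:

    # Tree for Code III (A=0, B=10, C=110, D=111): internal nodes are
    # (left, right) pairs, leaves are symbol strings.
    code_iii_tree = ("A", ("B", ("C", "D")))

    def decode(bits, tree):
        symbols, node = [], tree
        for bit in bits:
            node = node[0] if bit == "0" else node[1]  # 0 -> left child, 1 -> right child
            if isinstance(node, str):                  # reached a leaf: emit the symbol, restart at the root
                symbols.append(node)
                node = tree
        return "".join(symbols)

    print(decode("0101100111", code_iii_tree))  # ABCAD (encoded as 0 10 110 0 111)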

The Huffman Coding Algorithm Generates a Prefix Code (a binary tree). Overall idea, a bottom-up approach: Start with all symbols as leaf nodes, and associate with each symbol its frequency of occurrence. REPEAT until only one symbol remains: select the two least frequently occurring symbols (ties can be broken arbitrarily) and link them together as children of a new common parent symbol; associate with this new parent symbol the combined frequency of the two children; remove the children from the collection of symbols being considered and replace them with the new parent symbol. Each step introduces a new parent symbol but removes two children. Each step builds the tree from the leaves up to the root (when there is just one symbol left, the root, the algorithm terminates). Slides based on Kevin Wayne / Pearson-Addison Wesley 13

The Huffman Coding Algorithm Generates a Prefix Code (a binary tree). Codewords for each symbol are generated by traversing from the root of the tree to the leaves: each traversal to a left child corresponds to a 0, each traversal to a right child corresponds to a 1.

Huffman([a_1, f_1], [a_2, f_2], ..., [a_n, f_n]):
1. if n = 1 then
2.    code[a_1] <- the empty codeword
3. else
4.    let f_i, f_j be the 2 smallest f's
5.    Huffman([a_i, f_i + f_j], [a_1, f_1], ..., [a_n, f_n])   (the list omits a_i and a_j; the name a_i is reused for the merged symbol)
6.    code[a_j] <- code[a_i] + 0
7.    code[a_i] <- code[a_i] + 1

Slides based on Kevin Wayne / Pearson-Addison Wesley 14
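Here is a runnable Python translation of that recursion (my own sketch, not the slides' code; dictionaries stand in for the symbol/frequency lists):

    def huffman(freqs):
        """freqs: dict symbol -> frequency. Returns dict symbol -> codeword string."""
        if len(freqs) == 1:
            (only,) = freqs
            return {only: ""}                       # base case: a single symbol gets the empty codeword
        # Pick the two least frequent symbols a_i, a_j (ties broken arbitrarily).
        a_i, a_j = sorted(freqs, key=freqs.get)[:2]
        merged = dict(freqs)
        f_i, f_j = merged.pop(a_i), merged.pop(a_j)
        merged[(a_i, a_j)] = f_i + f_j              # new parent symbol with the combined frequency
        code = huffman(merged)                      # recursively code the smaller alphabet
        parent = code.pop((a_i, a_j))               # the parent's codeword becomes the common prefix
        code[a_j] = parent + "0"
        code[a_i] = parent + "1"
        return code

    print(huffman({"A": 0.60, "B": 0.30, "C": 0.05, "D": 0.05}))
    # {'A': '0', 'B': '10', 'D': '110', 'C': '111'} -- the same codeword lengths as Code III
    # (the exact bit patterns depend on tie-breaking and on which child gets 0 or 1).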

The Huffman Coding Algorithm is a Greedy Algorithm: at each step it makes a local decision to combine the two lowest-frequency symbols. Complexity, assuming n symbols to start with: a linear scan requires O(n) to identify the two smallest frequencies, so T(n) <= T(n-1) + dn, which gives O(n^2). Can we do better? Consider storing the frequencies in a heap: O(n log n) to build the heap, and each iteration requires two extract-min operations (O(log n) each) and one insertion of a new element (O(log n)). That gives O(n log n) to generate the Huffman codes. Slides based on Kevin Wayne / Pearson-Addison Wesley 15
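A heap-based sketch of the same construction, using Python's heapq module (again my own illustration, not the slides' code; the tie-breaking counter is only there to keep heap entries comparable when frequencies are equal):

    import heapq
    from itertools import count

    def huffman_heap(freqs):
        """Build codewords with a min-heap keyed on frequency: O(n log n) overall."""
        tiebreak = count()
        # Heap entries: (frequency, tiebreaker, subtree); a subtree is a symbol or a (left, right) pair.
        heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:                                   # n - 1 merges
            f1, _, left = heapq.heappop(heap)                  # extract-min: O(log n)
            f2, _, right = heapq.heappop(heap)                 # extract-min: O(log n)
            heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))  # insert: O(log n)
        code = {}
        def walk(node, prefix):                                # read codewords off the finished tree
            if isinstance(node, tuple):
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:
                code[node] = prefix or "0"                     # single-symbol alphabets get "0"
        walk(heap[0][2], "")
        return code

    print(huffman_heap({"A": 0.60, "B": 0.30, "C": 0.05, "D": 0.05}))
    # Same codeword lengths (1, 2, 3, 3) as the recursive version above.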

Properties of Optimal Prefix Codes. Lemma 1: Let x, y be symbols with f_x > f_y. Then in an optimal prefix code, length(c_x) <= length(c_y); if x occurs more often than y, its codeword must be no longer than the codeword for y. Lemma 2: If w is a longest codeword in an optimal prefix code, there must be another codeword of equal length. Lemma 3: Let x, y be the symbols with the smallest frequencies. Then there exists an optimal prefix code in which their codewords differ only in the final bit. Theorem: The prefix code output by the Huffman algorithm is optimal. This can be shown by induction; see the book (pages 174-175) for the proof. Slides based on Kevin Wayne / Pearson-Addison Wesley 16

Extensions and Drawbacks of Huffman Codes: Each codeword is an integer number of bits. Suppose we are encoding a sequence containing just two symbols, x and y, where x has a frequency of 99% and y has a frequency of 1%. The Huffman code will use one bit to represent x and one bit to represent y. But we know that most of the time the symbol is x. Can we do better? Arithmetic coding can be used to approach an average number of bits per symbol equal to the entropy of the data. Entropy represents the amount of information contained in the data. If f_x = 0.5 and f_y = 0.5, then the encoding of one symbol carries one bit of information; the entropy for that distribution is one bit per symbol. If f_x = 1 and f_y = 0, then the encoding of one symbol carries zero bits of information; the entropy for that distribution is zero, since it is a certain outcome and there is no information encoded. Distributions in between will have entropy between 0 and 1 bit. Slides based on Kevin Wayne / Pearson-Addison Wesley 17
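To put a number on the 99%/1% example, here is a small entropy calculation (illustrative; the formula H = sum of p * log2(1/p) is the standard Shannon entropy, which the slide refers to but does not write out):

    import math

    def entropy(probs):
        # Shannon entropy in bits: H = sum(p * log2(1/p)), skipping zero-probability symbols.
        return sum(p * math.log2(1 / p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))    # 1.0 bit per symbol
    print(entropy([1.0, 0.0]))    # 0.0 bits per symbol (a certain outcome)
    print(entropy([0.99, 0.01]))  # about 0.08 bits per symbol, far below Huffman's 1 bit per symbol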

Extensions and Drawbacks of Huffman Codes: As described, Huffman coding assumes each symbol is independent. That's like saying that when we encode letters in the English alphabet, after we encode a q, the next letter to be encoded is equally likely to be any of the 26 letters. Suppose we had symbols w, x, y and z, each occurring 25% of the time, but they always occurred in pairs: xxyywwzzxxzzwwwwxxyywwyyzzxxyyzzxxwwyywwyy. We could use two bits to encode each symbol, but really we only need one codeword for each pair of symbols. Huffman codes can often be improved by considering 1st-order dependencies (neighboring symbols not being independent). Slides based on Kevin Wayne / Pearson-Addison Wesley 18
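A quick way to see the gain on the slide's example string (an illustrative sketch, not from the slides): count individual symbols versus non-overlapping pairs.

    from collections import Counter

    s = "xxyywwzzxxzzwwwwxxyywwyyzzxxyyzzxxwwyywwyy"   # the example sequence from the slide
    pairs = [s[i:i + 2] for i in range(0, len(s), 2)]  # chop into non-overlapping pairs

    print(Counter(s))      # the four symbols appear with roughly equal frequency -> about 2 bits/symbol if coded independently
    print(Counter(pairs))  # only the pairs xx, yy, ww, zz ever occur, so 2 bits per pair = 1 bit per symbol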

Extensions and Drawbacks of Huffman Codes: As described, Huffman codes do not adapt to changes in distributions. Suppose the symbols we are encoding have one frequency distribution at the beginning of the sequence, but a different frequency distribution starting in the middle of the sequence. One Huffman code will be optimal only for the average distribution. We'd ideally like to have multiple Huffman codes, or an adaptive Huffman code that adjusts to the real frequency distribution as it changes over time. Slides based on Kevin Wayne / Pearson-Addison Wesley 19

Extensions and Drawbacks of Huffman Codes: All of these concepts apply in the field of data compression and image compression. Images have distributions that are often not 50%-50%: imagine encoding and sending a black-and-white fax; typically most of the pixels are white and a small percentage are black. Images have correlation: the color of one pixel is highly correlated with the color of its neighbor, so coding should not treat them as independent. Images have distributions that vary over time (space): one part of an image might be sky with a certain color distribution; another part might be grass, beach, water, buildings, etc., with another distribution of colors. Slides based on Kevin Wayne / Pearson-Addison Wesley 20

Huffman Coding in the Courts: The JPEG image coding standard uses Huffman coding. Forgent sued companies using JPEG (anybody with digital cameras in their products) for royalties based on patent US 4698672. Slides based on Kevin Wayne / Pearson-Addison Wesley 21