CSE 100: BST AVERAGE CASE AND HUFFMAN CODES

Recap: Average-case analysis of a successful find in a BST with N nodes
Goal: the expected total depth, taken over all BSTs with N nodes

Recap: Probability of having i nodes in the left subtree
P_N(i): the probability that T_L has i nodes
D(N|i): expected total depth of BSTs with i nodes in T_L
$$D(N) = \sum_{i=0}^{N-1} P_N(i)\, D(N \mid i)$$
[Figure: a root with left subtree T_L (i nodes) and right subtree T_R (N-i-1 nodes)]
What determines the number of nodes in the left subtree (T_L)?

Towards a recurrence relation for average BST total depth
What is D(N|i) in terms of D(i) and D(N-i-1)?
Hint: all nodes in each subtree are 1 deeper in tree T
A. D(i) + D(N-i-1)
B. D(i) + D(N-i-1) + 1
C. D(i) + D(N-i-1) + N
[Figure: tree T with left subtree T_L (i nodes) and right subtree T_R (N-i-1 nodes)]

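Average total depth of a BST with N nodes
$$D(N) = \sum_{i=0}^{N-1} P_N(i)\, D(N \mid i)$$
True or false: The term in the blue box is equal to the term in the red box
[Figure: two terms of the expanded sum, one highlighted in a blue box and one in a red box]
A. True
B. False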

$$N \cdot D(N) = (N+1) \cdot D(N-1) + 2N - 1$$
How does this help us again?
A. We can solve it to yield a formula for D(N) that does not involve N!
B. We can use it to compute D(N) directly
C. I have no idea, I'm totally lost
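As a sanity check on the recurrence, here is a minimal Python sketch that evaluates it directly and compares the result with a brute-force average over every insertion order. It assumes the root is at depth 1, D(0) = 0, distinct keys, and that all N! insertion orders are equally likely; the function names are illustrative and not from the slides.

```python
from fractions import Fraction
from itertools import permutations


def total_depth_recurrence(n):
    """D(n) from N*D(N) = (N+1)*D(N-1) + 2N - 1, assuming D(0) = 0."""
    d = Fraction(0)
    for k in range(1, n + 1):
        d = (Fraction(k + 1) * d + 2 * k - 1) / k
    return d


def total_depth_of_insertion_order(keys):
    """Total depth (root at depth 1) of the BST built by inserting keys in order."""
    root = None
    children = {}  # key -> [left child, right child]
    total = 0
    for key in keys:
        depth = 1
        if root is None:
            root = key
        else:
            node = root
            while True:
                depth += 1  # the new key lands at least one level below node
                side = 0 if key < node else 1
                nxt = children[node][side]
                if nxt is None:
                    children[node][side] = key
                    break
                node = nxt
        children[key] = [None, None]
        total += depth
    return total


def total_depth_brute_force(n):
    """Average total depth over all n! insertion orders of n distinct keys."""
    orders = list(permutations(range(n)))
    return Fraction(sum(total_depth_of_insertion_order(p) for p in orders), len(orders))


for n in range(1, 7):
    print(n, total_depth_recurrence(n), total_depth_brute_force(n))
```

The brute-force loop is only practical for small N, which is exactly why a recurrence that avoids enumerating N! orders is useful.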

Through unwinding and some not-so-complicated algebra (which you can find in your reading, a.k.a. Paul's slides) we arrive at:
[Equation: closed form for D(N)]
No N! to be seen! Yay!
And with a little more algebra, we can even show an approximation:
[Equation: approximation for D(N)]
Conclusion: The average time to find an element in a BST with no restrictions on shape is Θ(log N).
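For reference, one way to carry out the unwinding is sketched below. It assumes the recurrence N*D(N) = (N+1)*D(N-1) + 2N - 1 with D(0) = 0 and writes H_N for the N-th harmonic number; the exact form shown on the original slide may differ, so treat this as a standard derivation rather than a transcription.

```latex
% Divide the recurrence by N(N+1) and split the last term into partial fractions:
\frac{D(N)}{N+1} = \frac{D(N-1)}{N} + \frac{2N-1}{N(N+1)}
                 = \frac{D(N-1)}{N} + \frac{3}{N+1} - \frac{1}{N}

% Unwind down to D(0) = 0, with H_N = \sum_{k=1}^{N} 1/k:
\frac{D(N)}{N+1} = \sum_{k=1}^{N} \left( \frac{3}{k+1} - \frac{1}{k} \right)
                 = 2 H_N + \frac{3}{N+1} - 3

% Closed form and the resulting average depth per node:
D(N) = 2 (N+1) H_N - 3N
\qquad
\frac{D(N)}{N} \approx 2 \ln N \approx 1.386 \log_2 N
```

Checking small cases against the recurrence gives D(1) = 1, D(2) = 3, and D(3) = 17/3, which agree with the closed form.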

The importance of being balanced
BSTs average time for find: Θ(log_2 N)
What does this tell us?
On average, things are not so bad, provided assumptions 1 and 2 hold
But the probabilistic assumptions we made often don't hold in practice
Assumption #1 may not hold: we may search some keys many more times than others
Assumption #2 may not hold: approximately sorted input is actually quite likely, leading to unbalanced trees with worst-case cost closer to O(N) when N is large
We would like our search trees to be balanced
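To make the sorted-input concern concrete, the following small Python sketch builds a plain BST (no rebalancing) from sorted keys and from shuffled keys and compares the heights. It assumes distinct integer keys; `bst_height` and the seed value are illustrative choices, not part of the course code.

```python
import random


def bst_height(keys):
    """Height (number of levels) of the BST built by inserting keys in order."""
    left, right, depth = {}, {}, {}
    root = None
    height = 0
    for key in keys:
        if root is None:
            root = key
            depth[key] = 1
        else:
            node = root
            while True:
                child = left if key < node else right
                if node in child:
                    node = child[node]  # keep descending
                else:
                    child[node] = key  # attach as a new leaf
                    depth[key] = depth[node] + 1
                    break
        height = max(height, depth[key])
    return height


n = 1000
sorted_height = bst_height(range(n))  # every key goes to the right: a linked list

rng = random.Random(0)  # fixed seed so the run is repeatable
shuffled = list(range(n))
rng.shuffle(shuffled)
random_height = bst_height(shuffled)

print("sorted input height:", sorted_height)
print("shuffled input height:", random_height)
```

With sorted input the height equals N, so a find can cost Θ(N); with shuffled input the height stays within a small multiple of log N.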

The importance of being balanced
We would like our search trees to be balanced
Two kinds of approaches:
Deterministic methods guarantee balance, but operations are somewhat complicated to implement (AVL trees, red-black trees)
Randomized methods (treaps, Randomized Search Trees): from our analysis, deliberate randomness in constructing the tree helps! Operations are simpler to implement. Balance is not absolutely guaranteed, but is achieved with high probability
We will return to this topic later in the course

Changing gears: Data Compression Problem
What is the encoding scheme that would result in the shortest binary representation? Why is this important?
[Figure: an encoding scheme producing a binary representation (101010010101001010101010 101010101001001010101001 110100101010101001001010)]

How do we encode data?
Step 1: Figure out the symbols that constitute the data (the alphabet, or dictionary)
Step 2: Determine the binary code word for each symbol
Step 3: Replace each symbol by its code word
For example, if the alphabet was {s, p, a, m}, we might define the following encoding:
Symbol  Code word
s
p
a
m
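The three steps can be sketched in a few lines of Python. The code table below is a hypothetical choice for illustration (it matches the 2-bit code words used as Code A later in the deck), and `encode` is an illustrative name.

```python
def encode(text, code):
    """Replace each symbol of text by its code word (step 3)."""
    # Step 1: the symbols that actually occur in the data.
    alphabet = set(text)
    missing = alphabet - set(code)
    if missing:
        raise ValueError(f"no code word for symbols: {sorted(missing)}")
    return "".join(code[symbol] for symbol in text)


# Step 2: a hypothetical code word assignment for the alphabet {s, p, a, m}.
code = {"s": "00", "p": "01", "a": "10", "m": "11"}

print(encode("spam", code))
```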

Fixed length encoding
In fixed length encoding, each symbol is represented using a fixed number of bits
For example, if the alphabet was {s, p, a, m}, we might define the following encoding:
Symbol  Code word
s       00
p       01
a       10
m       11
For a dictionary consisting of M symbols, what is the minimum number of bits needed to encode each symbol (assume fixed length binary codes)?
A. 2^M
B. M
C. M/2
D. ceil(log_2 M)
E. None of these
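One way to explore the question is to count how many bits are needed to give each of M symbols a distinct fixed-width binary index. The sketch below does this with integer arithmetic; the helper names are illustrative, and the special case for a one-symbol alphabet (still using 1 bit) is an assumption made so that every code word is non-empty.

```python
def fixed_length_bits(m):
    """Smallest width w with 2**w >= m, i.e. ceil(log2(m)) for m >= 2."""
    if m < 1:
        raise ValueError("need at least one symbol")
    return max(1, (m - 1).bit_length())


def fixed_length_code(symbols):
    """Assign each symbol its index, written in binary with a common width."""
    width = fixed_length_bits(len(symbols))
    return {sym: format(i, f"0{width}b") for i, sym in enumerate(symbols)}


print(fixed_length_bits(4), fixed_length_code(["s", "p", "a", "m"]))
print(fixed_length_bits(5))
```

Using `bit_length` avoids the floating-point rounding that `math.log2` can introduce for large M.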

Variable length codes
Text file: ssssssssssssssss ssssspppamppam
Symbol  Frequency  Code A  Code B
s       0.6        00      0
p       0.2        01      1
a       0.1        10      10
m       0.1        11      11
Is code B better than code A?
A. Yes
B. No
C. Depends

Variable length codes
Text file: ssssssssssssssss ssssspppamppam
Symbol  Frequency  Code A  Code B
s       0.6        00      0
p       0.2        01      1
a       0.1        10      10
m       0.1        11      11
Average length (code A) = 2 bits/symbol
Average length (code B) = 0.6*1 + 0.2*1 + 0.1*2 + 0.1*2 = 1.2 bits/symbol
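The average-length arithmetic is a frequency-weighted sum of code word lengths. A minimal Python sketch, using the frequencies and code tables from this slide (the function name is illustrative):

```python
def average_length(code, freq):
    """Expected bits per symbol: sum of frequency * code word length."""
    return sum(freq[symbol] * len(code[symbol]) for symbol in freq)


freq = {"s": 0.6, "p": 0.2, "a": 0.1, "m": 0.1}
code_a = {"s": "00", "p": "01", "a": "10", "m": "11"}
code_b = {"s": "0", "p": "1", "a": "10", "m": "11"}

print("Code A:", average_length(code_a, freq))
print("Code B:", average_length(code_b, freq))
```

A shorter average length is only useful if the encoded bits can still be decoded unambiguously, which this calculation does not check.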

Decoding variable length codes
Text file: ssssssssssssssss ssssspppamppam
Symbol  Frequency  Code A  Code B
s       0.6        00      0
p       0.2        01      1
a       0.1        10      10
m       0.1        11      11
Decode the binary pattern 0110 using Code B?
A. spa
B. sms
C. Not enough information to decode
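To explore the decoding question, the sketch below enumerates every way a bit string can be split into code words of a given table. It is a brute-force illustration with an illustrative function name, not a practical decoder, and it uses the Code A and Code B tables from this slide.

```python
def all_decodings(bits, code):
    """Return every symbol string whose encoding under code equals bits."""
    if not bits:
        return [""]  # exactly one way to decode nothing
    results = []
    for symbol, word in code.items():
        if bits.startswith(word):
            # Consume this code word and decode whatever remains.
            for rest in all_decodings(bits[len(word):], code):
                results.append(symbol + rest)
    return results


code_a = {"s": "00", "p": "01", "a": "10", "m": "11"}
code_b = {"s": "0", "p": "1", "a": "10", "m": "11"}

print("Code A:", all_decodings("0110", code_a))
print("Code B:", all_decodings("0110", code_b))
```

Under Code A the pattern has a single parse, while under Code B the enumeration finds several, because some Code B code words are prefixes of others.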