Assignment 4: Permutations and Combinations

CS244 - Randomness and Computation
Assigned February 18, Due February 27
March 10, 2015

Note: Python doesn't have a nice built-in function to compute binomial coefficients, and surprisingly, matplotlib does not seem to either. If you want to take the trouble, you can download the scipy package (same site as numpy) and then type

    import scipy.special
    scipy.special.binom(n,k)

It is probably simpler just to use the following function for computing these coefficients:

    def combinations(n,k):
        # n choose k, built up as a running product (returned as a float)
        prod = 1.0
        for j in range(k):
            prod = prod*(n-j)/(j+1)
        return prod

I will post this code on the website.

I can't seem to leave the birthday stuff alone, and I had to restrain myself from making every problem about birthdays.

1. Use the exponential approximation to estimate how many people need to be present in order for the probability of a coincidental birthday to be greater than 0.9, 0.95, 0.99. (So there are three answers here.) Then answer the questions

again using the exact probabilities; you may want to write just a little bit of Python code. Compare the two results; they should be quite close.

Solution. We approximate the probability of no coincidental birthday in a group of k people by $e^{-k^2/2N}$, where N = 365. We thus have to solve equations of the form

$$e^{-k^2/730} = a,$$

where a = 0.1, 0.05, 0.01. Taking logs of both sides and simplifying a little gives

$$k = \sqrt{730 \ln \tfrac{1}{a}}.$$

Substituting 10, 20, and 100 for 1/a gives the solutions 40.998, 46.76, 57.98. Let's round these up to 41, 47, 58. To check the answer against the exact probabilities, we use the following code:

    >>> def birthday_coincidence(numpeople):
    ...     # j accumulates the probability that all birthdays are distinct
    ...     j = 1
    ...     for k in range(1, numpeople):
    ...         j *= 1.0*(365-k)/365
    ...     return 1-j
    ...
    >>> birthday_coincidence(40)
    0.891231809817949
    >>> birthday_coincidence(41)
    0.9031516114817354
    >>> birthday_coincidence(46)
    0.9482528433672548
    >>> birthday_coincidence(47)
    0.9547744028332994
    >>> birthday_coincidence(57)
    0.9901224593411699
    >>> birthday_coincidence(58)
    0.9916649793892612
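The comparison between the approximate and exact thresholds can also be automated. The following is a minimal sketch, not the code posted on the course website; the function names `approx_threshold` and `exact_threshold` are made up for this illustration, and N = 365 equally likely birthdays is assumed throughout.

```python
import math

N = 365  # assumed number of equally likely birthdays


def approx_threshold(target, n_days=N):
    """Real-valued k solving exp(-k^2 / (2 * n_days)) = 1 - target."""
    return math.sqrt(2 * n_days * math.log(1.0 / (1.0 - target)))


def exact_threshold(target, n_days=N):
    """Smallest group size whose exact coincidence probability exceeds target."""
    no_match = 1.0  # probability that k people all have distinct birthdays
    k = 1
    while 1.0 - no_match <= target:
        no_match *= (n_days - k) / n_days  # one more person avoids the first k
        k += 1
    return k


for target in (0.9, 0.95, 0.99):
    approx = approx_threshold(target)
    print(target, round(approx, 3), math.ceil(approx), exact_threshold(target))
```

Each printed row shows the target probability, the real-valued approximate solution, its rounded-up value, and the exact smallest group size, so the two columns on the right can be compared directly.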

In the first two cases, our approximation gave the best possible answer. In the third case it was off by one (57 would have been a better answer).

2. Suppose you have a database of biographies of prominent people from the past. Each biography contains a date of birth and a date of death. If there are 1000 records in the database, what is the probability that two of them share both a date of birth and a date of death (we are ignoring the year of birth and the year of death, and just looking at the month and the day)? You should use the exponential estimate for the generalized birthday problem.

Solution. It's just the birthday problem with k = 1000 and N = 365^2. The probability of at least one shared birthday-death day pair is approximately

$$1 - e^{-1000^2/(2 \cdot 365^2)} \approx 0.9766.$$

It's almost a sure thing. In fact, with only 500 people, the probability of a shared pair is already well over 50% (about 61%).

3. You walk into a room with k people. What is the probability that someone in the room has the same birthday as you? (Observe that this is very different from the question we asked earlier, about whether there is any pair of people in the room with the same birthday.) Express this exactly, and then approximate it using the exponential approximation $1 - x \approx e^{-x}$ for small positive x. How many people need to be in the room for the probability to exceed one-half?

Solution. The probability that a randomly chosen person has a different birthday from me is $1 - \frac{1}{365}$. The probability that k people all have different birthdays from me is $(1 - \frac{1}{365})^k$, so the probability that someone shares my birthday is exactly $1 - (1 - \frac{1}{365})^k$. With the exponential approximation, the probability of no match is about $e^{-k/365}$. Let's find the value of k that makes this one-half: we take logs and reciprocals and get

$$k \approx 365 \ln 2 = 252.998.$$

An exact calculation shows

$$\left(1 - \tfrac{1}{365}\right)^{253} = 0.4995,$$

so our approximation was very accurate: 253 people are needed for the probability of a match to exceed one-half.
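The estimates in Problems 2 and 3 are easy to check numerically. Here is a minimal sketch; the helper names `approx_shared` and `prob_match_me` are hypothetical and chosen only for this illustration.

```python
import math


def approx_shared(k, n_outcomes):
    """Exponential estimate 1 - exp(-k^2 / (2N)) that some pair among k coincides."""
    return 1.0 - math.exp(-k * k / (2.0 * n_outcomes))


def prob_match_me(k, n_days=365):
    """Exact probability that at least one of k people shares my birthday."""
    return 1.0 - (1.0 - 1.0 / n_days) ** k


# Problem 2: N = 365^2 possible (birth day, death day) pairs, k = 1000 records.
print(approx_shared(1000, 365 ** 2))

# Problem 3: smallest k whose exact probability exceeds one-half,
# printed next to the approximate solution 365 ln 2.
k = 1
while prob_match_me(k) <= 0.5:
    k += 1
print(k, 365 * math.log(2))
```

Calling `approx_shared` with other values of k shows how quickly the probability in Problem 2 grows with the size of the database.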

4. If I asked you to compute the probabilities of various poker hands, it would take you less than a millisecond to find the Wikipedia page "Poker odds" with all the answers, complete with the number of relevant outcomes for each hand expressed in terms of binomial coefficients. So I had to make up some new poker hands and ask you their probabilities. Explain your reasoning carefully, and try to express your answers both in terms of binomial coefficients and powers, and as numerical values. It's easy to be led astray here, and a very good way to check your answer is to write a simulation. You are not required to do this for the homework, but it's not a bad idea if you want to see if you were right.

(a) The picture cards are the three ranks Jack, Queen, King. What is the probability of getting all picture cards?

Solution. There are 12 picture cards, so the total number of 5-card hands containing only picture cards is $\binom{12}{5}$. The desired probability is thus

$$\binom{12}{5} \Big/ \binom{52}{5} = 0.0003047,$$

which is quite a lot smaller than I would have guessed!

(b) Two of the suits contain black cards, and two of the suits contain red cards. What is the probability of having all 5 cards be the same color?

Solution. There are 26 red cards and 26 black cards. We can proceed as in (a) to compute the probability of getting all red cards. The desired probability is twice this value. So the answer is:

$$2 \binom{26}{5} \Big/ \binom{52}{5} = 0.0506.$$

(c) What is the probability of having all five cards belong to exactly two of the suits? Remember there are two ways this can split: 3 of one suit and 2 of the other, or 4 of one suit and 1 of the other.

Solution. For a 3-2 split, there are 4 ways to choose the 3-suit, and then 3 ways to choose the 2-suit. Once the suits are chosen, there are $\binom{13}{3}$ ways to choose 3 cards from the 3-suit, and $\binom{13}{2}$ ways to choose 2 cards from the 2-suit. So the number of distinct hands in which there are three cards from one suit and two from the other is

$$4 \cdot 3 \cdot \binom{13}{3} \binom{13}{2}.$$

By essentially identical reasoning, the number of distinct hands in which there are four cards from one suit and one from another suit is

$$4 \cdot 3 \cdot \binom{13}{4} \binom{13}{1}.$$

Put it all together and the total number of hands in question is

$$4 \cdot 3 \cdot \binom{13}{3} \binom{13}{2} + 4 \cdot 3 \cdot \binom{13}{4} \binom{13}{1} = 379236.$$

So the desired probability is

$$379236 \Big/ \binom{52}{5} = 0.1459.$$

Here is another way to get the same result; it's hard to say if this is simpler or not. First choose our two suits: there are $\binom{4}{2} = 6$ ways to do this. Then choose 5 cards from the 26 cards in the 2 suits. This gives

$$6 \binom{26}{5} = 394680.$$

The problem is that in this tabulation, we have also counted the hands that consist of cards from a single suit, and moreover, we have counted each of these hands several times. For example, there are $\binom{13}{5}$ hands consisting entirely of hearts, and in our tabulation, each of these has been counted three times (once for each of the suit pairs hearts-spades, hearts-clubs, hearts-diamonds). That means we must subtract

$$4 \cdot 3 \cdot \binom{13}{5} = 15444$$

from our total. And, what do you know, 394680 - 15444 = 379236. It always feels good when two different methods give the same answer!

5. There are two candidates in an election. Candidate A has received 55% of the votes, candidate B 45%. There is a very large number of voters (several million, let's say). We randomly sample 100 voters. This is sampling without replacement (we should not poll the same voter twice!), but the voter pool is so large

that you can treat it as a problem of sampling with replacement, which makes the calculation somewhat easier. What is the probability that in the sample, candidate B receives more votes? Express this answer as a formula using the binomial coefficients, and then compute the probability exactly. HINT: Think of the underlying experiment as flipping 100 biased coins in succession. We saw how to express the probability of getting exactly k heads in terms of binomial coefficients, so here you will have a sum of about 50 such probabilities. You will thus need to write a little code to answer the question.

Solution. Just as a reality check, we would expect this answer to be less than one-half, because candidate A received more votes overall. By the coin analogy, the probability that candidate B receives exactly k votes is

$$\binom{100}{k} \, 0.45^k \, 0.55^{100-k}.$$

Thus the probability that candidate B receives strictly more votes than candidate A is the sum of all these values as k varies from 51 to 100. A quick computation with Python gives

$$\sum_{k=51}^{100} \binom{100}{k} \, 0.45^k \, 0.55^{100-k} = 0.1346.$$

This shows you something about the effectiveness of polling: if we have a truly representative sample, and a 55-45 margin, then we can predict the result of the election correctly 87% of the time by sampling only 100 people. With a sample of 200 people, the success rate rises to 93%.

6. (Real birthdays) This is the most involved problem in terms of programming, although not all that deep in terms of math. One very useful part of the problem concerns how to sample from a given nonuniform distribution. I am going to give you actual data on birth dates in the United States from one year. You are to simulate the birthday problem using this distribution, and then superimpose a plot of the result on the one obtained from exact calculation using the uniform distribution model. The birth data for 1978 is posted on the course website.
I found this at the Chance website from Dartmouth, which also has the Grinstead-Snell book, but I don't know the original source for the data. You will want to read the second column into a Python list. If you've forgotten (or never knew) how to do this, you can use the following code (of course you have to change the full path name for the file):

    # Read the second column (number of births) for each of the 365 days.
    infile = open('/Users/straubin/teaching/244/244website/birthday.txt', 'r')
    bdaylist = []
    for j in range(365):
        s = infile.readline().split()
        bdaylist.append(int(s[1]))
    infile.close()

Figure 1: Distribution of US birthdays in 1978

(a) Make a scatter plot or a stem plot of the data. You can see the nonuniformity very clearly: it is somewhat exaggerated if you display the plot with the default settings, so I suggest that you base the y-coordinates at 0, using ylim(0,11000). I find the results astonishing. There is indeed a seasonal variation (explained by what? planning for the optimal time for the baby to be born? seasonal variation in sexual behavior? in fertility?) but the amazing thing is that it looks as if there are two entirely separate data series, with roughly the same seasonal variation, but one significantly lower than the other. (Speculation about the variation is not part of the assignment, just some random musing.)
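One way to probe the apparent two-series pattern is to average the daily counts by day of the week. The following is a minimal sketch under two assumptions that the handout does not state: that the list of counts is in calendar order starting from January 1, 1978, and that a helper named `mean_births_by_weekday` (hypothetical) is an acceptable way to organize the calculation. The list `fake_counts` is made-up placeholder data, not the real 1978 figures; with the real data you would pass `bdaylist` instead.

```python
import datetime

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")


def mean_births_by_weekday(counts, year=1978):
    """Average count per weekday, assuming counts[0] belongs to January 1."""
    start = datetime.date(year, 1, 1)
    buckets = {name: [] for name in WEEKDAYS}
    for offset, count in enumerate(counts):
        day = start + datetime.timedelta(days=offset)
        buckets[WEEKDAYS[day.weekday()]].append(count)
    return {name: sum(vals) / len(vals) for name, vals in buckets.items()}


# Placeholder data (NOT the real 1978 counts): lower values on weekends.
start = datetime.date(1978, 1, 1)
fake_counts = [
    8000 if (start + datetime.timedelta(days=j)).weekday() >= 5 else 9500
    for j in range(365)
]

for name, mean in mean_births_by_weekday(fake_counts).items():
    print(name, round(mean))
```

If the averages for some weekdays sit well below the others, that would account for a second, lower band of points in the scatter plot.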

Solution. The code to produce the plot is posted on the website, and the plot is in Figure 1. There definitely is a seasonal variation: the plot shows a spike in births around the end of October and at the very end of the year, and a low in mid-April. The October and April births point to more people conceiving around Christmastime and fewer at the height of summer (I'm not sure why). I wonder if the year-end spike is there for tax purposes; that shows some very careful planning!

But the real surprise is that the data falls into two series, and the difference between these two series is larger than the variation within each series. What could explain this extraordinary non-seasonal variation? If you study the data carefully, you'll realize that the lower series largely consists of two days out of every week. This is probably due to scheduled Caesarean sections, which constitute a very significant fraction of total births in the United States; the hospitals don't like to schedule these for weekends.

(b) You are to plot, for k = 1 to about 65, the probability of coincidental birthdays using this probability distribution. The first hurdle is that you need some way of generating random birthdays based on this probability model. You can code this by hand using rand(), but there is a built-in method in numpy. You have to add to your program

    import numpy.random as npr

You then use a function called choice. To see how this works, a call to

    npr.choice([1,2,3,4,5,6],p=[0.3, 0.3, 0.2, 0.1,0.05,0.05])

will generate a value in {1, 2, 3, 4, 5, 6} distributed according to the probability mass function p: that is, 1 will occur with probability 0.3, 2 with probability 0.3, etc. Use this function and the information read in from the file to randomly generate birthdays according to the given distribution. The second hurdle is efficiently performing the simulation.
You can do this any way you like, but there is a nice trick for speeding things up, based on the following insight: Consider the experiment where we repeatedly sample people from the population until we find a birthday that is the same as one we have already drawn, and look at the random variable that gives the number of rounds this experiment lasts. Our original plot of birthday probabilities is just the cumulative distribution function of this random variable, assuming uniformly distributed birthdays. This means that we can repeatedly sample birthdays with the choice function, and make a cumulative histogram. You have to scale things so that the values rise from 0 to 1, and use the value returned by the histogram function to produce

a line plot.

(c) Add to your program the code that generates the plot for the probability of coincidental birthdays under the uniform distribution, and superimpose the two. Do you see much difference between the two plots? How adequately does the uniform probability distribution model the real-life version of this problem?

Solution. The code to produce the plot is posted on the website, and the plot itself is shown in Figure 2. It performs 10,000 trials of the experiment of repeatedly sampling birthdays until a duplicate birthday is found, and recording the number of samples. It then plots a cumulative histogram of the result (as a line graph, not a bar graph). This is superimposed on a calculation of the exact probabilities assuming 365 equally likely birthdays. There is only a very slight difference between the two plots. So for purposes of the repeated-birthday problem, the simple uniformity model gives accurate results, in spite of the nonuniformity present in the real-life data.
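The simulation described in this solution can be sketched without any plotting. The version below is not the code posted on the website: it swaps in the standard library's `random.choices` for `npr.choice` so that it runs without numpy, it uses made-up placeholder counts in place of the 1978 data, and the function names are hypothetical. It returns the two curves as lists, which could then be passed to a line-plot call.

```python
import random
from itertools import accumulate


def rounds_until_repeat(days, cum_weights, rng):
    """Draw weighted birthdays until one repeats; return the number of draws."""
    seen = set()
    while True:
        day = rng.choices(days, cum_weights=cum_weights)[0]
        if day in seen:
            return len(seen) + 1
        seen.add(day)


def simulated_cdf(counts, trials=10000, max_k=65, seed=0):
    """Estimate P(repeat within the first k draws) for k = 1..max_k."""
    rng = random.Random(seed)
    days = range(len(counts))
    cum_weights = list(accumulate(counts))  # weights proportional to births
    hist = [0] * (max_k + 1)
    for _ in range(trials):
        r = rounds_until_repeat(days, cum_weights, rng)
        if r <= max_k:
            hist[r] += 1
    running, cdf = 0, []
    for k in range(1, max_k + 1):
        running += hist[k]             # cumulative histogram
        cdf.append(running / trials)   # scaled to rise from 0 toward 1
    return cdf


def uniform_cdf(max_k=65, n_days=365):
    """Exact P(repeat among k people) for k = 1..max_k, uniform birthdays."""
    no_match, cdf = 1.0, []
    for k in range(1, max_k + 1):
        cdf.append(1.0 - no_match)
        no_match *= (n_days - k) / n_days
    return cdf


# Placeholder counts (NOT the real 1978 data): two days in seven are lower.
placeholder_counts = [9500 if j % 7 < 5 else 8000 for j in range(365)]

sim = simulated_cdf(placeholder_counts)
exact = uniform_cdf()
print(max(abs(s - e) for s, e in zip(sim, exact)))
```

The printed value is the largest gap between the simulated and exact curves over k = 1 to 65, which gives a single number for "how different are the two plots".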

Figure 2: Probability of a repeated birthday, showing the results of a simulation based on the 1978 data, and the exact probabilities assuming uniform distribution of birthdays
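As a cross-check on the counting arguments in Problems 4 and 5, the binomial-coefficient expressions can be evaluated directly. This is a minimal sketch that assumes Python 3.8 or later, where `math.comb` is available; it is used here in place of the `combinations` function from the note at the top of the handout. The function name `prob_b_wins` is hypothetical.

```python
from math import comb

total_hands = comb(52, 5)  # number of 5-card hands

# Problem 4(a): all five cards are picture cards (12 of them in the deck).
p_pictures = comb(12, 5) / total_hands

# Problem 4(b): all five cards the same color (26 red, 26 black).
p_one_color = 2 * comb(26, 5) / total_hands

# Problem 4(c): exactly two suits, counted two ways.
direct = (4 * 3 * comb(13, 3) * comb(13, 2)
          + 4 * 3 * comb(13, 4) * comb(13, 1))
by_subtraction = comb(4, 2) * comb(26, 5) - 4 * 3 * comb(13, 5)
p_two_suits = direct / total_hands


def prob_b_wins(n=100, p=0.45):
    """Problem 5: P(B gets strictly more than half of n sampled votes), n even."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))


print(p_pictures, p_one_color)
print(direct, by_subtraction, p_two_suits)
print(prob_b_wins(100), prob_b_wins(200))
```

Changing `n` in `prob_b_wins` shows how the chance of a misleading sample shrinks as the sample grows.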