Chapter 12 Notes (Sample Surveys)

In everything we have done thus far, the data were given, and the subsequent analysis was exploratory in nature. This type of statistical analysis is known as exploratory data analysis (EDA). Here, and in the next chapter, we will study techniques for producing or collecting data to answer specific questions. We will see that methods of collecting a sample of data in an unbiased fashion hinge on the idea of randomness.

Three Keys to a Good Sample

1. Sampling: In most problems, we are interested in learning something about a population of individuals. However, the population is often too large or too difficult to examine completely, so we take a sample of individuals from the population, which we hope is representative of the population as a whole.

Examples:
What is the average size of Ponderosa pines in a certain area?
What proportion of Missoula residents have served on a jury?
What proportion of American adults approve of the recently approved bail-out plan?
What proportion are satisfied with the state of the country?

Polls, such as this last example, are known as sample surveys.
The most important aspect of a sample, no matter how the sample is taken, is that it is representative of the population from which it comes. If the sample is not representative of the population, we say it is biased. A biased sample is a useless sample! Bias arises in many ways. Consider the following examples, which illustrate poor sampling techniques leading to biased samples. What's wrong with these samples?

Examples:
(a) Suppose CNN takes a poll where they ask viewers to call in and state whether or not they are happy in their marriage. 90% of the call-ins say they are unhappy in their marriage. Do we conclude that among the US population, 90% of all married people are unhappy in their marriage?
(b) A university instructor wants to know how students feel about their statistics course. As students come to her office hours, she asks them to answer a few questions about the course. How accurate will the information gathered be?
(c) A survey was given to UM students regarding their opinions on possible new businesses to open in the UC. The survey was administered to any student willing to fill it out. Would the responses received be accurate or somehow biased?
(d) Historical Mishaps:
1936
1948
This case illustrates a poor sampling technique that led to a sample not being representative of the intended population.

In addition to the problems raised in the examples above (voluntary response, interviewer bias, convenience sampling), another common problem is nonresponse. People are sometimes difficult to locate or simply refuse to cooperate. How many of you have thrown out a mail survey or hung up the phone on someone who wants to ask you a few questions? If nonrespondents would have answered differently from those who did respond, this can introduce bias in your results. How can we protect against unseen sources of bias?

(e) Wording of Questions: "Don't you agree that social workers should earn more money than they currently earn?"
2. Randomization: The key to avoiding the introduction of bias in a sample is the use of randomization in selecting which population units will comprise the sample.

Examples:
(a) Reconsider the UC survey on new businesses. Although we might think that simply having students fill out the survey voluntarily is just as good as sampling students at random, can you think of sources of bias that might result?
(b) Suppose a biologist captures and radio tags 50 cutthroat trout in the Rock Creek drainage to study the types of habitats in which these fish live. Do you think these 50 fish are representative of all cutthroat in terms of their habitat?

Random selection of units to comprise a sample from the population protects against a particular type of bias known as selection bias. Such bias is the result of important but unrealized differences in the units of the population relative to what you are measuring. This is one of the startling truths about sampling. The introduction of randomness in selection actually allows us to draw accurate conclusions about the population.
3. Sample Size: The fundamental question when planning a study is: How large a sample do we need for the sample to be representative of the population? Although you might be tempted to think we should take a certain fraction or percentage of the population, it turns out that the size of the population (as long as it's large) is unimportant. In other words, a sample of 100 Missoula residents will be about as representative of the Missoula population as a sample of 100 US residents is of the entire US.

If the sample consists of the entire population, it is called a census. What problems might we encounter in trying to take a census?
(a)
(b)
(c)
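The claim that population size is unimportant can be checked with a small simulation. The sketch below is only an illustration: the population sizes (70,000 and 300,000,000), the true proportion of 0.3, and the function name are made-up assumptions, not figures from these notes. It draws many simple random samples of size 100 from each population and compares how much the sample proportions vary.

```python
import random
import statistics

def spread_of_sample_proportions(pop_size, n, true_p=0.3, reps=2000, seed=1):
    """Standard deviation of the sample proportion over repeated samples of size n.

    Units are labeled 0..pop_size-1; the first true_p fraction of labels
    are treated as having the characteristic (an arbitrary, hypothetical setup).
    """
    rng = random.Random(seed)
    n_yes = int(pop_size * true_p)
    proportions = []
    for _ in range(reps):
        # sampling without replacement from the labels, as in an SRS
        chosen = rng.sample(range(pop_size), n)
        proportions.append(sum(label < n_yes for label in chosen) / n)
    return statistics.pstdev(proportions)

# Hypothetical round numbers standing in for a city and a country
city_spread = spread_of_sample_proportions(pop_size=70_000, n=100)
country_spread = spread_of_sample_proportions(pop_size=300_000_000, n=100)

print(round(city_spread, 3), round(country_spread, 3))
```

Under these assumptions the two spreads should come out nearly equal, both close to sqrt(0.3 * 0.7 / 100), which is about 0.046. The sample size n drives the variability, not the population size.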
Parameters and Statistics: Typically, the purpose behind taking a sample is to gain information about some aspect of the population as a whole. In particular, we are often interested in estimating the mean or standard deviation of some variable, or the proportion of population units with some characteristic. For example, we might want to estimate: the average energy bill for Missoula residents, or the proportion of Americans currently unemployed, or the mean annual income of Montana residents.

These unknown population quantities are called parameters. The point of taking a sample is to estimate these unknown parameters with statistics computed from the sample. (For example, we might use the sample mean energy bill ȳ from a sample of 30 Missoula households to estimate the true but unknown average energy bill of Missoula residents.)

Notation: Common notation used is summarized below.

Name                     Statistic (Sample)    Parameter (Population)
Mean                     ȳ (y-bar)             µ (pronounced mu)
Standard Deviation       s                     σ (pronounced sigma)
Proportion               p̂ (p-hat)             p
Correlation              r                     ρ (pronounced rho)
Regression coefficient   b                     β (pronounced beta)
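As a small illustration of statistics computed from a sample, the sketch below uses a made-up list of eight monthly energy bills (hypothetical data, not from any real survey) and computes the sample mean, sample standard deviation, and a sample proportion. Each is an estimate of the corresponding unknown population parameter.

```python
import statistics

# Hypothetical monthly energy bills (dollars) for a sample of 8 households
bills = [92.5, 110.0, 75.25, 130.0, 88.0, 101.5, 95.0, 120.75]

y_bar = statistics.mean(bills)    # sample mean, estimates mu
s = statistics.stdev(bills)       # sample standard deviation (n - 1 divisor), estimates sigma
# sample proportion of households with a bill over $100, estimates p
p_hat = sum(b > 100 for b in bills) / len(bills)

print(y_bar, round(s, 2), p_hat)
```

The values of µ, σ, and p for the whole population stay unknown; a different sample of eight households would give different values of the three statistics.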
Other Important Sampling Terminology

Sampling Unit: The sampling unit is the basic unit on which we will measure the variables of interest (one value per unit); units might be people, households, animals, plots of land, etc.

Sampling Frame: The sampling frame is a list of individual units from which the sample is chosen. This will not always be the same as the population of interest - examples?

Sampling Variability: This is the notion that every time we take a sample from some population, we will generally not get the same answer. Consider taking a sample of size n = 10 from this class to estimate the average maximum distance traveled by foot in a day by a 241 student.

            Average Distance
Sample 1
Sample 2
Sample 3

If we were to repeat this many times, we would have several distance averages, hopefully distributed around the true average. The variability in these averages is what we mean by sampling variability. The distribution of these averages is known as the sampling distribution of the mean maximum distance traveled in a sample of size 10.
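Sampling variability can be seen directly by simulation. The sketch below is an illustration under stated assumptions: the class of 60 students and their distances are randomly generated, hypothetical values, not real class data. It repeatedly draws samples of size 10 and records each sample's average.

```python
import random
import statistics

rng = random.Random(241)  # fixed seed so the run is repeatable

# Hypothetical class of 60 students: maximum distance walked in a day (miles)
class_distances = [round(rng.uniform(0.5, 12.0), 1) for _ in range(60)]
true_mean = statistics.mean(class_distances)

def sample_mean(values, n, rng):
    """Average of one simple random sample of size n drawn without replacement."""
    return statistics.mean(rng.sample(values, n))

# Repeat the sampling 1000 times to approximate the sampling distribution
means = [sample_mean(class_distances, 10, rng) for _ in range(1000)]

print("first three sample means:", [round(m, 2) for m in means[:3]])
print("true mean:", round(true_mean, 2))
print("mean of sample means:", round(statistics.mean(means), 2))
print("spread of sample means:", round(statistics.stdev(means), 2))
```

The individual sample means differ from one another, but their average should sit close to the true class mean. The list of 1000 averages is a simulated approximation of the sampling distribution.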
Identify each of the following as a parameter or a statistic, and give the symbol used to represent it.

In a sample of 2290 U.S. voters, 65% claim they will be voting for a certain Presidential candidate on election day.
The proportion of all U.S. voters voting in the last election that voted for Barack Obama.
51% of all U.S. babies are boys.
The proportion of all UM campus computers with working versions of SPSS.
The 15% of women in the US Senate.
The standard deviation of monthly incomes for 50 Missoula residents.
The proportion of students at the University of Montana who participate in school-sponsored athletics.

Now that we know all the terminology, let's consider some basic sampling methods which rely on randomness to select a representative sample from the population.
1. Simple Random Sample (SRS): Consider selecting a sample of size n. If this sample is drawn so that every possible sample of size n has the same chance of being selected, it is said to be a simple random sample (SRS).

Example: Suppose we have a piece of land and we want to estimate the volume of timber or the number of woodpecker nests on the piece of land. A census might be too costly. One simple way to take a sample might be to divide the area into equal-sized blocks as shown below. The blocks should be small enough to survey reliably. Suppose the area is divided into 36 blocks and we've decided to survey a sample of 9 blocks. To select an SRS, label the blocks in any order. Go to the random number table and select a row at random to generate the sample.

01 02 03 04 05 06
07 08 09 10 11 12
13 14 15 16 17 18
19 20 21 22 23 24
25 26 27 28 29 30
31 32 33 34 35 36

For example, suppose we choose row 7:

73184 95907 05179 51002 83374 52297 07769 99792 78365 93487

Starting at row 7, select an SRS of 9 plots, and mark them on the picture above.
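One common way to use a row of the random number table, assumed here since the notes do not spell out the convention, is to read the digits in pairs from left to right, keep any pair that matches a block label (01 to 36), and skip pairs that are out of range or already chosen. A minimal sketch of that procedure, with a hypothetical function name:

```python
def srs_from_digits(digit_row, population_size, sample_size):
    """Read two-digit labels left to right from a row of random digits.

    Keeps labels from 1 to population_size, skips repeats and out-of-range
    pairs, and stops once sample_size labels have been chosen.
    Assumes labels have at most two digits.
    """
    digits = digit_row.replace(" ", "")
    chosen = []
    for i in range(0, len(digits) - 1, 2):
        label = int(digits[i:i + 2])
        if 1 <= label <= population_size and label not in chosen:
            chosen.append(label)
        if len(chosen) == sample_size:
            break
    return chosen

row_7 = "73184 95907 05179 51002 83374 52297 07769 99792 78365 93487"
plots = srs_from_digits(row_7, population_size=36, sample_size=9)
print(plots)  # prints [18, 7, 5, 17, 10, 2, 22, 36, 34]
```

Under this reading convention, the pairs 73, 49, 59, and so on are discarded because no block carries those labels, and the second occurrence of 07 is skipped because that block is already in the sample.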
Do the plots selected look random? In an SRS, every combination of 9 blocks has the same probability of being selected. Selecting an SRS does not guarantee that the particular sample selected is perfectly representative of the population. It is not the sample you select which is unbiased; it's the procedure by which the sample is selected which is unbiased.

If we were selecting an SRS from an alphabetical list of 36 people, we probably wouldn't worry that the names weren't evenly distributed through the list, since we have no reason to believe that the variable being measured (e.g., their opinion on some issue) is associated with their position on the list. However, in this example, we might know that there is geographic variation across the area (perhaps the left side is at a higher elevation than the right side). If this were true, we could use this extra information to ensure a more geographically representative sample by taking a stratified random sample of plots.
2. Stratified Random Sample: Suppose we divide the area into 3 rectangular subareas (from left to right with the elevation gradient) each containing 12 plots. Then take a separate SRS of size 3 within each subarea (using different random numbers for each subarea). This still gives a sample of size 9 as before, but under this plan, the sample taken is more equally representative of the varying elevations in the area. Note that we only need to label the individuals within each stratum.

01 07 | 01 07 | 01 07
02 08 | 02 08 | 02 08
03 09 | 03 09 | 03 09
04 10 | 04 10 | 04 10
05 11 | 05 11 | 05 11
06 12 | 06 12 | 06 12
Elevation

Use row 29 from the random number table to select a stratified random sample, starting in the left stratum and proceeding to the right:

72042 12287 21081 48426 44321 58765

Select plots 01-12 according to:

Plot   Random Numbers     Plot   Random Numbers
1      00-07              7      48-55
2      08-15              8      56-63
3      16-23              9      64-71
4      24-31              10     72-79
5      32-39              11     80-87
6      40-47              12     88-95

Ignore the values 96, 97, 98, 99. Why do we do this?

In what situations does stratified random sampling work best versus simple random sampling?
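The lookup table above assigns eight two-digit values to each of the 12 plots, which uses up 00 through 95 and leaves 96 through 99 unassigned. The sketch below applies that table to row 29. Two conventions are assumptions on my part rather than something the notes state: digits are read in pairs from left to right, and reading simply continues along the row when moving to the next stratum (so each stratum uses different random numbers). Function names are hypothetical.

```python
def plot_from_two_digits(value):
    """Map 00-95 to plots 1-12 in blocks of eight values; 96-99 are unused."""
    if value > 95:
        return None
    return value // 8 + 1

def stratified_from_digits(digit_row, n_strata, per_stratum):
    """Fill each stratum in turn with distinct plots read from the digit row."""
    digits = digit_row.replace(" ", "")
    pairs = [int(digits[i:i + 2]) for i in range(0, len(digits) - 1, 2)]
    strata = [[] for _ in range(n_strata)]
    current = 0  # index of the stratum being filled (0 = leftmost)
    for value in pairs:
        if current == n_strata:
            break
        plot = plot_from_two_digits(value)
        # skip unusable values and plots already chosen in this stratum
        if plot is None or plot in strata[current]:
            continue
        strata[current].append(plot)
        if len(strata[current]) == per_stratum:
            current += 1
    return strata

row_29 = "72042 12287 21081 48426 44321 58765"
selection = stratified_from_digits(row_29, n_strata=3, per_stratum=3)
print(selection)  # prints [[10, 1, 3], [3, 11, 2], [2, 11, 4]]
```

Because each plot owns exactly eight of the usable values, every plot within a stratum has the same chance of being picked on each read. The same plot number can appear in different strata, since labels are reused within each subarea.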
3. Systematic random sampling: An alternative to a stratified random sample, which works quite well in sampling over a geographic area, is a systematic random sample. Systematic sampling is easiest when the population size is a multiple of the sample size n, as it is here. We then calculate the population size divided by the sample size, 36/9 = 4. Next, we choose one of the first 4 plots at random using the random number table.

01 02 03 04 05 06
07 08 09 10 11 12
13 14 15 16 17 18
19 20 21 22 23 24
25 26 27 28 29 30
31 32 33 34 35 36

The sample consists of this first plot chosen, and then every fourth plot after that. To take a systematic sample here, use row 14 of the random number table:

87736

Systematic sampling also works well for sampling plots along a transect, names from a list, etc. One advantage of systematic sampling is that we don't have to know how many individuals there are in the population because we don't have to assign a label to every individual beforehand. Other advantages? Issues?
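The systematic procedure can be sketched in a few lines. The convention for picking the starting plot is an assumption here: read single digits of the row from left to right and take the first one that falls between 1 and 4. Function names are hypothetical.

```python
def first_digit_in_range(digit_row, low, high):
    """Return the first single digit in the row that lies between low and high."""
    for ch in digit_row.replace(" ", ""):
        if low <= int(ch) <= high:
            return int(ch)
    raise ValueError("no usable digit in this row")

def systematic_sample(population_size, sample_size, start):
    """Take plot `start`, then every k-th plot after it, where k = N / n."""
    step = population_size // sample_size  # 36 / 9 = 4 in the example
    return list(range(start, population_size + 1, step))

start = first_digit_in_range("87736", 1, 4)
sample = systematic_sample(36, 9, start)
print(start, sample)  # prints 3 [3, 7, 11, 15, 19, 23, 27, 31, 35]
```

With this convention the digits 8, 7, 7 are skipped and the start is plot 3. Only one random choice is made; the rest of the sample is fixed by the step of 4.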
4. Cluster Sampling: In cluster sampling, we first select a random sample of clusters of individuals, and then survey every individual in the selected clusters.

Examples:
(a) To survey households in Missoula, I might select an SRS of 20 street blocks and include every household on the selected blocks.
(b) To estimate the average height of trees in an area, we might randomly select 5 plots and measure every tree in each plot.

When would we use a cluster sample versus a stratified random sample?

5. Multistage Designs: All of the sampling methods discussed above require a list of every individual in the population. Often, such a list is not available (such as sampling US households), so multistage sample designs are used. For example, we might take a stratified random sample of counties in the US (with geographic regions as strata), then an SRS of blocks within each county, and an SRS of households within each block. Here, we only need a list of households for the selected blocks, not every household in the US.
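A simplified two-stage version of the multistage idea can be sketched as follows. Everything in it is hypothetical: the 40 street blocks, the made-up block sizes, and the helper name list_households, which stands in for the field work of listing households on a block. The point it illustrates is that household lists are built only for the blocks selected in the first stage.

```python
import random

rng = random.Random(12)  # fixed seed so the run is repeatable

# Stage 0: the only list we start with is a list of street blocks (hypothetical)
block_ids = list(range(1, 41))

def list_households(block_id):
    """Stand-in for field work: list the households on one block (made-up sizes)."""
    n_households = 8 + block_id % 5
    return [f"block{block_id}-house{h}" for h in range(1, n_households + 1)]

# Stage 1: SRS of 5 blocks
selected_blocks = rng.sample(block_ids, 5)

# Stage 2: SRS of 3 households within each selected block
sample = []
for block in selected_blocks:
    households = list_households(block)  # list built only for selected blocks
    sample.extend(rng.sample(households, 3))

print(selected_blocks)
print(sample)
```

Replacing the second stage with "include every household on the block" would turn this sketch into the cluster sample of example (a) above.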
Data Collection Strategies: Having discussed some of the details involved with taking good samples, the table below summarizes the advantages and disadvantages of some common data collection methods.

Strategies            Advantages                 Disadvantages
Personal Interview    High response rate         Interviewer bias
                                                 Leading questions
                                                 Cost/time
Telephone Interview   Less expensive             Good lists unavailable (Undercoverage)
                      Easy to monitor            Must be shorter
Questionnaires        Inexpensive                Low response rate
                      No interviewer bias        Bias?
Direct Observation    Generally very accurate    Time consuming
                                                 Observer error