Variance Estimation in US Census Data from 1960-2010. Kathryn M. Coursolle. Lara L. Cleveland. Steven Ruggles. Minnesota Population Center

Variance Estimation in US Census Data from 1960-2010

Kathryn M. Coursolle
Lara L. Cleveland
Steven Ruggles

Minnesota Population Center
University of Minnesota-Twin Cities

September, 2012

This paper was prepared for submission to the annual meeting of the Population Association of America. Research for this project was supported by the National Institutes of Health (grant number R01HD043392). Opinions expressed herein are those of the authors. Please do not cite or quote without permission of the authors. Address correspondence to Kathryn M. Coursolle, Minnesota Population Center (MPC), University of Minnesota, 50 Willey Hall, 225 19th Avenue South, Minneapolis, MN 55455, e-mail: kathrync@umn.edu.

Abstract

Modern census microdata feature complex sample designs that are clustered within households and incorporate stratification. Yet researchers often calculate standard errors using methods designed for simple random samples. Variance estimates that adjust for clustering and stratification can differ dramatically from estimates that assume simple random sampling. Examining potential differences in variance estimation in recent IPUMS-USA samples is essential because US census microdata are among the most heavily used data sources for social, historical, demographic, and policy research. This project uses decennial census data from 1960-2000 and American Community Survey data from 2000-2010 to compare standard errors computed under the assumption of simple random sampling with estimates that adjust for clustering and stratification and, for recent ACS data, with estimates based on subsample replicate weights. We conclude by discussing potential implications of these techniques for statistical inference.

Background

The Integrated Public Use Microdata Series (IPUMS-USA) consists of more than fifty high-precision samples of the US population drawn from the decennial censuses of 1850-2000 and the American Community Surveys of 2000-2010. These samples represent the richest source of US microdata and have been heavily used in demographic research. For example, census microdata were used in more articles in Demography than any other data source in recent decades.

Census microdata are gathered using complex sample designs that are clustered by household, incorporate stratification, and sometimes have differential probabilities of selection. However, most researchers apply methods of variance estimation designed for simple random samples. Failure to adjust for clustering and stratification in the sample design may lead to incorrect standard errors and invalid statistical inferences (Davern & Strief; Kish, 1995; Lohr, 2000). The impact of sample design on standard errors has been documented for historical census data from 1850-1950 (Davern, Ruggles, Swenson, Alexander, & Oakes, 2009). However, differences in standard errors after adjusting for clustering and strata have not been tested in modern census data from 1960-2010, and sampling techniques in modern census data differ substantially from those of the historical census samples.

Using decennial census data from 1960-2000 and American Community Survey (ACS) data from 2000-2010, we evaluate the impact of sample design on standard error estimates. We compare standard error estimates computed under the assumption of simple random sampling with variance estimates that account for clustering and strata using Taylor series linearization. In the ACS 2005-2010 samples we also compare these estimates with standard errors computed from the Census Bureau's subsample replicate weights. We conclude by discussing strategies for estimating standard errors in modern census microdata and potential directions for future revisions of this research.

Background

Modern census microdata are individual-level data clustered by household, drawn under sample designs that incorporate stratification. For variables that tend to be similar within households, such as race and birthplace, adjusting for clustering may produce standard errors that are larger than estimates that assume a simple random sample of the same size (Cleveland, Davern, & Ruggles, 2011; Graubard & Korn, 2002). In the worst case, when the characteristics of the people in each household are identical, standard errors are inversely proportional to the square root of the number of households rather than the number of individuals. However, variance estimates for variables that tend to be heterogeneous within households, such as age and sex, may actually be smaller than estimates under simple random sampling. Stratification, on the other hand, tends to have the opposite effect of clustering. After adjusting for stratification, standard errors can be smaller than those under simple random sampling when the characteristics of individuals or clusters are homogeneous within strata.

Stratification in IPUMS-USA samples from 1960-2010

For the 1960-2000 decennial IPUMS-USA samples, strata were based on the criteria the Census Bureau used to select PUMS samples. For 1960, 38 strata were defined on the basis of household size, home ownership, race, and group quarters residence. The procedures used to select cases for inclusion in the 1970 public use samples were similar to those used in 1960, but were slightly more elaborate. Seventy-five strata were created based on home ownership, race, sex of head, household size, presence of own children, inmate status, and other residence in group quarters. For the 1980 samples, strata were created based on race, Spanish origin, home ownership, and presence of own children, producing 51 strata.

For the 1990 and 2000 samples, strata were created based on presence of own children, race, Spanish origin, and home ownership. In addition, to avoid singleton strata, the Asian race categories were collapsed into a single category; the same criterion was used for the 2000 samples (White/Other Race/Two or More Races Hispanic; White/Other Race/Two or More Races Non-Hispanic; Asian and Pacific Islander; Black and American Indian). For the 1990 samples, age was not used for non-institutional group quarters, again to avoid singletons. Any remaining singletons were collapsed into the White Non-Hispanic Origin strata. These methods produced 119 strata for the 1990 samples and 131 strata for the 2000 samples.

For the American Community Survey samples, strata are based on the lowest level of geography available in the sample. For the 2000-2004 samples, each state forms a stratum. From the 2005 ACS sample onward, each unique Public Use Microdata Area (PUMA) forms a stratum. For more detailed information see: http://usa.ipums.org/usa/complex_survey_vars/strata_historical.shtml.

Subsample Replicate Weights in the ACS

Replicate weights were added to the ACS starting in 2005. These weights are produced by the Census Bureau and allow a single sample to mimic multiple samples, which can produce more informed standard error estimates that reflect relevant sample design information. Standard errors produced using replicate weighting techniques are usually larger, and produce more conservative statistical inferences, than those computed under the assumption of simple random sampling (Davern & Strief). The Census Bureau recommends using replicate weights to obtain unbiased standard error estimates (US Census Bureau, 2005). However, these procedures are often cumbersome and take substantially more computing time than Taylor series estimates. It is therefore worthwhile to know whether standard errors that adjust for clustering and strata are similar to those obtained with the ACS replicate weights.

Results

Table 1 compares standard errors computed with several methods for selected variables in census data from 1960-2000 and ACS data from 2000, 2004, 2005, and 2010.[1] The first column shows the estimate of the population parameter from the IPUMS sample, and the second column presents the standard error estimated under the assumption that the survey design was a simple random sample.[2] This estimate uses the person weight only. The third and fourth columns display the ratio of the standard error from the Taylor series and replicate weight methods, respectively, to the standard error assuming simple random sampling. A ratio above one indicates that the standard error is larger than the standard error from a simple random sample of the same size, and a ratio below one indicates that it is smaller.

[1] Results were very similar for other ACS samples. To present simplified results, only these samples are included.

[2] In some census years, such as 1990 and 2000, the person weight also adjusts for aspects of probability sampling. See: http://usa.ipums.org/usa/chapter2/chapter2.shtml for more information.

Turning first to the decennial census samples from 1960-2000, characteristics that tend to be homogeneous within households, such as foreign-born status, socioeconomic index, and race, often produce larger standard errors than techniques that assume simple random sampling in several of the sample years. This suggests that, for research examining those characteristics, standard errors calculated under the assumption of simple random sampling may produce less conservative criteria for statistical significance. However, the opposite is the case for characteristics that are more likely to be heterogeneous within households, such as age, gender, marital status, school enrollment, and labor force participation. Generally the variance of these parameters tends to be smaller after adjusting for clustering and stratification. Indeed, prior research suggests that standard errors that adjust for clustering and stratification may be smaller than simple random sample standard errors when the effects of stratification are more pronounced (Davern & Strief, but see also Kish, 1995).

We next present the comparison of variance estimates for the American Community Survey samples. Although these samples are clustered by household in the same way as the decennial samples, their pseudo-strata are defined by the lowest level of geography available in each survey year. For these samples, all of the Taylor series standard errors are larger than the standard errors that would be obtained from a simple random sample of the same size, with the exception of gender. For the 2005 and 2010 ACS, the table also presents ratios of the standard error calculated using the subsample replicate weights. For marital status, foreign-born, and socioeconomic index, the variance estimates were larger with the subsample replicate weights than under the assumption of simple random sampling. The opposite was generally true for the other measures; the one exception is nonwhite in 2005, with a ratio of 1.06. Differences between the ratios of the Taylor series and replicate weight methods were fairly modest, with the exception of age, but the computing burden was substantially lower with the Taylor series techniques. In future revisions, we plan to analyze differences between the Taylor series and replicate weight methods in greater detail.
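The Taylor series comparison reported in this section can be illustrated with a minimal sketch. It assumes the standard with-replacement linearization estimator for a weighted mean (no finite population correction), which may differ in detail from the software the authors used. The function name, argument names, and toy data are hypothetical. The "simple random sample" standard error is obtained, as in the text, by using the weights only: every person is treated as a separate cluster in a single stratum.

```python
import numpy as np

def linearized_se_mean(y, w, strata, cluster):
    """Weighted mean and its Taylor-linearized standard error.

    Assumes clusters are sampled with replacement within strata
    (no finite population correction). Hypothetical helper, not the
    authors' code.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    strata = np.asarray(strata)
    cluster = np.asarray(cluster)

    mean = np.sum(w * y) / np.sum(w)
    # Linearized (score) value of each person for the ratio estimator.
    z = w * (y - mean) / np.sum(w)

    var = 0.0
    for h in np.unique(strata):
        in_h = strata == h
        ids, inv = np.unique(cluster[in_h], return_inverse=True)
        totals = np.bincount(inv, weights=z[in_h])  # cluster totals of z
        n_h = len(ids)
        if n_h < 2:
            raise ValueError("stratum with a single cluster (singleton)")
        var += n_h / (n_h - 1) * np.sum((totals - totals.mean()) ** 2)
    return mean, np.sqrt(var)

# Toy data (hypothetical): 4 households of 2 people, one stratum, equal weights.
household = np.array([1, 1, 2, 2, 3, 3, 4, 4])
person = np.arange(8)                 # each person as their own "cluster"
one_stratum = np.zeros(8, dtype=int)
weights = np.ones(8)

homogeneous = np.array([1, 1, 0, 0, 1, 1, 0, 0])    # identical within household
heterogeneous = np.array([1, 0, 1, 0, 1, 0, 1, 0])  # differs within household

mean_hom, se_hom_design = linearized_se_mean(homogeneous, weights, one_stratum, household)
_, se_hom_srs = linearized_se_mean(homogeneous, weights, one_stratum, person)
_, se_het_design = linearized_se_mean(heterogeneous, weights, one_stratum, household)

ratio_hom = se_hom_design / se_hom_srs  # above one: clustering inflates the SE
print(mean_hom, ratio_hom, se_het_design)
```

In this toy example the variable that is identical within households yields a ratio above one, while the variable that always differs within households yields a design-based standard error of zero, mirroring the direction of the contrast between foreign-born status and gender described above.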

Discussion

This paper compares standard errors calculated under the assumption of simple random sampling, with adjustment for clustering and stratification, and with the ACS replicate weights in the IPUMS 1960-2010 samples. For the decennial samples, Taylor series standard error estimates were often smaller than standard errors obtained from a simple random sample of the same size, except for variables that tend to be highly correlated within households and are not included in the design of strata, such as foreign-born. In contrast, estimates based on the geography-based pseudo-strata in the ACS samples generally yielded larger standard errors than the assumption of simple random sampling. For these samples, Taylor series methods would lead to a more conservative criterion for statistical inference.

It is important to remember, however, that most IPUMS samples are quite large, so there is little risk of drawing incorrect conclusions because of underestimated standard errors. For analyses that examine only small subpopulations, the risk could be higher. Providing examples of when this may be the case seems like a logical next step for this research. Future revisions of this project will also compare in more depth the differences between standard errors computed using the ACS replicate weights and the Taylor series estimates. In addition, it may be useful to create subsample replicates for the decennial census samples in order to compare Taylor series variance results to a gold standard.
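For readers unfamiliar with the replicate weight approach discussed above, the following is a minimal sketch of how a replicate-based standard error for a weighted mean is typically computed. It assumes the formula the Census Bureau documents for ACS public use files, in which the variance is 4/R times the sum of squared deviations of the R replicate estimates from the full-sample estimate (R = 80 in the ACS). The paper itself does not state the formula, so treat the scale factor as an assumption to check against the PUMS accuracy documentation. The function name and the tiny two-replicate data set are hypothetical and purely illustrative.

```python
import numpy as np

def replicate_se_mean(y, w, rep_w, scale=4.0):
    """Weighted mean and its replicate-weight standard error.

    rep_w has one column per replicate weight. The default scale of 4
    follows the ACS successive-difference formula (assumption; see lead-in).
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    rep_w = np.asarray(rep_w, dtype=float)

    theta = np.sum(w * y) / np.sum(w)  # full-sample estimate
    # Re-estimate the mean once per replicate weight column.
    theta_r = (rep_w * y[:, None]).sum(axis=0) / rep_w.sum(axis=0)
    n_reps = rep_w.shape[1]
    var = scale / n_reps * np.sum((theta_r - theta) ** 2)
    return theta, np.sqrt(var)

# Synthetic illustration only: 4 people and 2 made-up replicate weights.
# Real ACS data would supply 80 replicate weight columns.
y = np.array([1, 0, 1, 0])
w = np.array([1.0, 1.0, 1.0, 1.0])
rep_w = np.array([[2.0, 1.0],
                  [1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 1.0]])

theta, se_rep = replicate_se_mean(y, w, rep_w)
print(theta, se_rep)
```

Because the estimate must be recomputed once per replicate, the cost grows with the number of replicate weights, which is consistent with the heavier computing burden noted in the text.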

References

Cleveland, L. L., Davern, M., & Ruggles, S. 2011. Drawing Statistical Inferences from International Census Data. IPUMS-International Working Paper: https://international.ipums.org/international/resources/misc_docs/cleveland_davern_ruggles_variance.pdf

Davern, M., Ruggles, S., Swenson, R., Alexander, J. T., & Oakes, J. M. 2009. Drawing Statistical Inferences from Historical Census Data, 1850-1950. Demography 46: 429-449.

Davern, M., & Strief, J. IPUMS User Note: Issues Concerning the Calculation of Standard Errors (i.e., variance estimation) Using IPUMS Data Products. Ipums.org: http://usa.ipums.org/usa/resources/complex_survey_vars/usernote_variance.pdf

Graubard, B., & Korn, E. 2002. Inferences for superpopulation parameters using sample surveys. Statistical Science 17: 73-96.

Kish, L. 1995. Survey Sampling. Wiley Classics Library Edition. New York, NY: Wiley and Sons.

Lohr, S. 2000. Sampling: Design and Analysis. Pacific Grove, CA: Duxbury Press.

US Census Bureau. 2005. PUMS Accuracy of the Data (2005). Washington, DC: US Census Bureau.

Table 1. Standard Errors Assuming Simple Random Samples Compared with Taylor Series and Subsample-Replicate Estimates: Selected Person Characteristics

Columns:
  Mean/%     = sample mean or percentage
  SRS SE     = standard error assuming simple random sampling
  Taylor     = ratio of standard error estimate to simple random sample, Taylor series adjusting for clustering and strata
  Replicate  = ratio of standard error estimate to simple random sample, subsample replicate method (2005-2010 ACS)

Characteristic                   Mean/%    SRS SE    Taylor   Replicate

1960
Age (mean)                        31.03    0.0165     0.93
Male (%)                          49.12    0.0004     0.75
Married (%)                       46.38    0.0004     0.71
Nonwhite (%)                      11.47    0.0002     0.33
Foreign-born (%)                   6.02    0.0002     1.38
Socioeconomic Index (mean)        16.88    0.0174     0.95
Enrolled in School (%)            24.43    0.0003     0.94
Labor force participant (%)       38.87    0.0004     0.86

1970
Age (mean)                        31.83    0.0111     0.84
Male (%)                          48.57    0.0002     0.71
Married (%)                       45.04    0.0002     0.49
Nonwhite (%)                      12.33    0.0002     0.79
Foreign-born (%)                   5.72    0.0001     1.44
Socioeconomic Index (mean)        19.93    0.0124     0.95
Enrolled in School (%)            29.43    0.0003     0.89
Labor force participant (%)       40.64    0.0002     0.89

1980
Age (mean)                        33.44    0.0066     0.88
Male (%)                          48.47    0.0001     0.75
Married (%)                       44.71    0.0001     0.83
Nonwhite (%)                      14.46    0.0001     1.06
Foreign-born (%)                   7.15    0.0001     1.44
Socioeconomic Index (mean)        22.04    0.0078     1.04
Enrolled in School (%)            27.70    0.0001     0.97
Labor force participant (%)       46.61    0.0001     0.99

1990
Age (mean)                        34.83    0.0069     0.83
Male (%)                          48.73    0.0002     0.77
Married (%)                       43.60    0.0002     0.91
Nonwhite (%)                      19.63    0.0001     1.28
Foreign-born (%)                   9.23    0.0001     1.45
Socioeconomic Index (mean)        24.86    0.0086     1.00
Enrolled in School (%)            26.12    0.0001     1.02
Labor force participant (%)       50.35    0.0002     0.92

2000
Age (mean)                        35.81    0.0066     0.87
Male (%)                          48.99    0.0002     0.78
Married (%)                       42.72    0.0001     0.93
Nonwhite (%)                      24.90    0.0001     1.32
Foreign-born (%)                  12.31    0.0001     1.39
Socioeconomic Index (mean)        25.48    0.0083     0.99
Enrolled in School (%)            27.23    0.0001     0.98
Labor force participant (%)       49.33    0.0002     0.94

2000 American Community Survey
Age (mean)                        35.62    0.0422     1.38
Male (%)                          48.82    0.0010     0.80
Married (%)                       42.19    0.0010     1.14
Nonwhite (%)                      25.52    0.0009     1.87
Foreign-born (%)                  12.33    0.0007     1.60
Socioeconomic Index (mean)        26.39    0.0538     1.15
Enrolled in School (%)            26.63    0.0009     1.13
Labor force participant (%)       50.54    0.0010     1.06

2004 American Community Survey
Age (mean)                        36.22    0.0243     1.38
Male (%)                          48.94    0.0006     0.80
Married (%)                       42.36    0.0006     1.15
Nonwhite (%)                      24.38    0.0005     1.81
Foreign-born (%)                  13.23    0.0004     1.50
Socioeconomic Index (mean)        26.30    0.0312     1.14
Enrolled in School (%)            26.42    0.0005     1.12
Labor force participant (%)       50.92    0.0006     1.06

2005 American Community Survey
Age (mean)                        36.38    0.0164     1.37      0.85
Male (%)                          49.02    0.0004     0.81      0.50
Married (%)                       42.16    0.0004     1.14      1.59
Nonwhite (%)                      25.36    0.0004     1.74      1.06
Foreign-born (%)                  13.64    0.0003     1.48      1.15
Socioeconomic Index (mean)        26.36    0.0211     1.11      1.37
Enrolled in School (%)            26.04    0.0003     1.12      0.88
Labor force participant (%)       51.07    0.0004     1.05      0.92

2010 American Community Survey
Age (mean)                        37.35    0.0152     1.36      0.48
Male (%)                          49.15    0.0003     0.83      0.23
Married (%)                       39.09    0.0003     1.14      1.58
Nonwhite (%)                      25.83    0.0003     1.72      0.97
Foreign-born (%)                  14.27    0.0003     1.44      1.10
Socioeconomic Index (mean)        25.32    0.0191     1.10      1.53
Enrolled in School (%)            26.74    0.0003     1.13      0.61
Labor force participant (%)       50.76    0.0003     1.05      0.86

Source: 1960, 1970, 1980, 1990, 2000, 2005, 2010 IPUMS samples.
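As a short worked example of reading Table 1, the sketch below converts one row into a design-adjusted standard error and a design effect. It uses the 2010 ACS nonwhite row (simple random sample standard error 0.0003, Taylor series ratio 1.72) and relies on the usual convention that the design effect is the square of the standard error ratio. The variable names are hypothetical, and the table entries are rounded, so the results are approximate.

```python
# Values copied from the 2010 ACS "Nonwhite (%)" row of Table 1.
srs_se = 0.0003        # standard error assuming simple random sampling
taylor_ratio = 1.72    # Taylor series SE divided by the SRS SE

# Design-adjusted standard error, in the same units as the table's SE column.
adjusted_se = taylor_ratio * srs_se

# Design effect: ratio of variances, i.e. the squared SE ratio.
design_effect = taylor_ratio ** 2

print(round(adjusted_se, 6), round(design_effect, 4))
```

A design effect near 3 means that, for this variable, the clustered sample carries roughly the information of a simple random sample about one third its size.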