Modernizing Disclosure Avoidance: Report on the 2020 Disclosure Avoidance Subsystem as Implemented for the 2018 End-to-End Test (Continued)


Simson L. Garfinkel
Chief, Center for Disclosure Avoidance Research, U.S. Census Bureau
2017 Census Scientific Advisory Committee Fall Meeting
Suitland, MD, 11:00 AM, September 15, 2017

Acknowledgments
This presentation incorporates work by:
- Dan Kifer (Scientific Lead)
- John Abowd (Chief Scientist)
- Tammy Adams, Robert Ashmead, Aref Dajani, Jason Devine, Michael Hay, Cynthia Hollingsworth, Meriton Ibrahimi, Michael Ikeda, Philip Leclerc, Ashwin Machanavajjhala, Christian Martindale, Gerome Miklau, Brett Moran, Ned Porter, Anne Ross and William Sexton

Outline
- Motivation
- Differentially private 2020 Disclosure Avoidance System
  - High-level goals
  - Flow diagrams
  - Query examples
- Conclusion

Motivation: To protect the privacy of individual survey responses
- 2010 Census: 5.6 billion independent tabular summaries published, based on 308 million person records.
- Database reconstruction (Dinur and Nissim 2003) is a serious disclosure threat that every system producing statistical tabulations from confidential data must acknowledge.
- The confidentiality edits applied to the 2010 Census were not designed to defend against this kind of attack.

The Disclosure Avoidance Subsystem (DAS) implements the privacy protections for the decennial Census.
Features of the DAS:
- Operates on the edited Census records
- Designed to make Census records safe to tabulate
Flow: Census Edited File -> Disclosure Avoidance System -> Hundred-percent Detail File (2000 and 2010) or Microdata Detail File (2020)

The 2000 and 2010 Disclosure Avoidance Systems relied on swapping households.
Advantages of swapping:
- Easy to understand
- Does not affect state counts if swaps are within a state
- Can be run state-by-state
- Operation is invisible to rest of Census processing
Disadvantages:
- Does not provide formal privacy guarantees
- Does not protect against database reconstruction attacks
- Privacy guarantee relies on lack of external data
[Figure: State X containing Town 1 and Town 2]
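The swapping idea above can be illustrated with a minimal sketch. It is not the Census Bureau's procedure, whose pairing rules and swap rates are not described in these slides; the record layout (`town`, `size`), the `swap_rate` parameter, and the random pairing are assumptions made only for illustration. The sketch shows why swaps confined to one state leave the state count unchanged.

```python
import random
from collections import Counter


def swap_households(households, swap_rate, rng):
    """Swap the town labels of randomly paired households within one state.

    `households` is a list of dicts with hypothetical keys "town" and "size".
    Returns a new list; the input is left unmodified.
    """
    swapped = [dict(h) for h in households]
    n_pairs = int(len(swapped) * swap_rate / 2)
    # Draw distinct indices so no household is swapped twice.
    chosen = rng.sample(range(len(swapped)), 2 * n_pairs)
    for i, j in zip(chosen[::2], chosen[1::2]):
        swapped[i]["town"], swapped[j]["town"] = swapped[j]["town"], swapped[i]["town"]
    return swapped


def state_population(households):
    """Total persons in the state, summed over all households."""
    return sum(h["size"] for h in households)


def households_per_town(households):
    """Number of households carrying each town label."""
    return Counter(h["town"] for h in households)


rng = random.Random(0)
# Synthetic "State X" with two towns (made-up data).
state_x = [
    {"town": rng.choice(["Town 1", "Town 2"]), "size": rng.randint(1, 6)}
    for _ in range(100)
]
after = swap_households(state_x, swap_rate=0.1, rng=rng)
print(state_population(state_x), state_population(after))
```

Because only town labels move, the state total and the number of households per town are preserved, while the population of each town can change. Nothing in the procedure bounds what an attacker with external data can infer, which is the disadvantage the slide lists.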

The 2000 and 2010 Disclosure Avoidance System operated as a filter on the Census Edited File:
- Enumeration responses, unduplication: Census Unedited File
- Edits, imputations: Census Edited File
- Confidentiality edits (household swapping), tabulation recodes: Hundred-percent Detail File
- Pre-specified tabular summaries: PL94-171, SF1, SF2 (SF3, SF4, in 2000)
- Special tabulations and post-census research

The 2020 Census disclosure avoidance system will use differential privacy to defend against a reconstruction attack.
Differential privacy provides:
- Provable bounds on the accuracy of the best possible database reconstruction given the released tabulations.
- Algorithms that allow policy makers to decide the trade-off between accuracy and privacy.
[Figure: plot of data accuracy against privacy-loss budget (ε)]
Pre-Decisional
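To make the accuracy versus privacy trade-off concrete, here is a minimal sketch of the Laplace mechanism, a standard differentially private technique for counting queries. The slides do not say which noise mechanism the DAS uses, so the choice of Laplace noise, the function names, and the sensitivity of 1 are all assumptions for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    # expovariate takes a rate; rate 1/scale gives an exponential with mean `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(true_count, epsilon, trials, rng):
    """Average absolute error over repeated noisy releases of one count."""
    total = 0.0
    for _ in range(trials):
        total += abs(noisy_count(true_count, epsilon, rng) - true_count)
    return total / trials


rng = random.Random(2020)
for eps in (0.1, 1.0, 10.0):
    print(eps, round(mean_abs_error(1000, eps, 2000, rng), 3))
```

The expected absolute error of this mechanism is 1/ε for a sensitivity-1 count, so a larger privacy-loss budget buys more accuracy at the cost of weaker privacy. That is the lever the slide describes as a policy decision.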

The 2020 Disclosure Avoidance System also operates as a filter, but it's much more visible.
- Enumeration responses, unduplication: Census Unedited File
- Edits, imputations: Census Edited File
- Global Confidentiality Protection Process (Disclosure Avoidance System): Microdata Detail File (releasable)
- Privacy Budget, Accuracy Decisions
- Pre-specified tabular summaries: PL94-171, SF1, SF2
- Special tabulations and post-census research

The 2020 Disclosure Avoidance System relies on infusing formally private noise.
Advantages of noise infusion with formal privacy:
- Easy to understand
- Provable and tunable privacy guarantees
- Privacy guarantees do not depend on external data
- Protects against database reconstruction attacks
- Privacy operations are composable
Disadvantages:
- Entire country must be processed at once for best accuracy
- Every use of private data must be tallied in the privacy loss budget ε
[Figure: Global Confidentiality Protection Process (Disclosure Avoidance System)]

Differentially Private Disclosure Avoidance System: Requirements
DAS must be able to read the Census Edited File (CEF):
- CEF must be exactly specified and contain all information necessary for all tabulation recodes
- CEF must be kept confidential after DAS runs (as it was for historical censuses)
DAS must generate the Microdata Detail File (MDF):
- Must contain all information that appears in any publicly released table (e.g. PL94-171, SF1, SF2)
- Should not contain any information that does not appear in a publicly released table
- May be publicly released (in whole or in part)
Non-functional requirements:
- The disclosure avoidance system must provably move information from the CEF to PL94/SF1/SF2 with an adjustable total privacy-loss budget
- The source code and parameters for the DAS will be made publicly available

Why generate a differentially private MDF?
- Familiar to internal and external stakeholders
- Operates with tabulation system to produce PL-94 and SF-1 tabulations
- Guarantees population totals (voting age, non-voting age, householder) exact at all levels of geography
- Consistency among query answers

Some queries must be privacy preserving. Some queries must be exact ("invariant").
Specific PL-94 queries must be exact, per the 2000 Department of Justice letter to the Director, Kenneth Prewitt:
- Block population
- Block voting age population
- Block householders & vacancies
Other PL-94 and SF-1 queries will not be exact:
- Age distribution under 18
- Age distribution 18 and over
- Race and ethnicity distribution
- Household relationship distribution
- Household ownership distribution
Final privacy-loss budget determined by Data Stewardship Executive Policy Committee (DSEP) with recommendation from Disclosure Review Board (DRB)

How the 2020 System Works: High-level Overview
The new system is similar to swapping, with key differences:
- Every record in the population may be modified
  - But modifications are bounded by DOJ policy and global privacy budget.
- Records in the tabulation data have no exact counterpart in the confidential data
  - There is no one-to-one mapping between CEF and MDF records.
  - But there are the same number of records for every block.
- Explicitly protected tabulations (PL-94 and SF-1) have provable, public accuracy levels
  - 2020 will publish the algorithms, the parameters and the accuracy of the tabulations.

Proposed Top-Down Algorithm
- National table of US population: 2 x 255 x 17 x 115
  - Sex: Male / Female
  - Race + Hispanic: 255 possible values
  - Relationship to Householder: 17
  - Age: 0-114
- Spend ε_1 privacy-loss budget
- National table with all 500,000 cells filled, structural zeros imposed, with accuracy allowed by ε_1
- Reconstruct individual micro-data without geography: 325,000,000 records
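As a quick arithmetic check on the national table, the sketch below multiplies out the four listed dimensions. The dictionary keys are illustrative names, not identifiers from the DAS. The four dimensions together give 997,050 cells; without the two-valued sex dimension the product is 498,525, which is close to the rounded 500,000 figure on the slide.

```python
from math import prod

# Dimension sizes as listed on the slide; key names are illustrative only.
DIMENSIONS = {
    "sex": 2,             # Male / Female
    "race_hispanic": 255, # Race + Hispanic combinations
    "relationship": 17,   # Relationship to Householder
    "age": 115,           # Ages 0 through 114
}


def cell_count(dims):
    """Number of cells in a full cross-tabulation of the given dimensions."""
    return prod(dims.values())


all_cells = cell_count(DIMENSIONS)
without_sex = cell_count({k: v for k, v in DIMENSIONS.items() if k != "sex"})
print(all_cells, without_sex)
```

With roughly 325,000,000 person records spread over a table of this size, many cells are empty or nearly empty, which is why noise has to be added to the whole table rather than only to occupied cells.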

State-level
- State-level tables for only certain queries; structural zeros imposed; dimensions chosen to produce best accuracy for PL-94 and SF-1
- Spend ε_2 privacy-loss budget
- Target state-level tables required for best accuracy for PL-94 and SF-1
- Exact state voting-age, non-voting age, and householder counts as enumerated
- Construct best-fitting individual micro-data with state geography: 325,000,000 records now including state identifiers

County-level
- County-level tables for only certain queries; structural zeros imposed; dimensions chosen to produce best accuracy for PL-94 and SF-1
- Spend ε_3 privacy-loss budget
- Target county-level tables required for best accuracy for PL-94 and SF-1
- Exact county voting-age, non-voting age, and householder counts as enumerated
- Construct best-fitting individual micro-data with state and county geography: 325,000,000 records now including state and county identifiers

Census tract-level
- Tract-level tables for only certain queries; structural zeros imposed; dimensions chosen to produce best accuracy for PL-94 and SF-1
- Spend ε_4 privacy-loss budget
- Target tract-level tables required for best accuracy for PL-94 and SF-1
- Exact tract voting-age, non-voting age, and householder counts as enumerated
- Construct best-fitting individual micro-data with state, county, and tract geography: 325,000,000 records now including state, county, and tract identifiers

Block-level
- Block-level tables for only certain queries; structural zeros imposed; dimensions chosen to produce best accuracy for PL-94 and SF-1
- Spend ε_5 privacy-loss budget
- Target block-level tables required for best accuracy for PL-94 and SF-1
- Exact block voting-age, non-voting age, and householder counts as enumerated
- Construct best-fitting individual micro-data with state, county, tract and block geography: 325,000,000 records now including state, county, tract, and block identifiers

MDF for tabulating
- Construct best-fitting individual micro-data with state, county, tract and block geography: 325,000,000 records now including state, county, tract, and block identifiers
- MDF used for tabulating PL-94, SF-1
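Each level of the top-down walk combines noisy tables with counts that must stay exact. The sketch below shows one simple way to turn noisy child counts into non-negative integers that sum to an exact parent total, using proportional scaling and largest-remainder rounding. This is a stand-in chosen for brevity: the slides say only that "best-fitting" micro-data are constructed and do not specify the fitting method, so `fit_to_total` and its approach are assumptions.

```python
def fit_to_total(noisy_counts, exact_total):
    """Return non-negative integers, roughly proportional to the noisy
    counts, that sum exactly to `exact_total`."""
    # Negative noisy counts are not meaningful as populations; clip to zero.
    clipped = [max(0.0, x) for x in noisy_counts]
    mass = sum(clipped)
    if mass == 0:
        # No usable signal: fall back to an even split.
        clipped = [1.0] * len(noisy_counts)
        mass = float(len(noisy_counts))
    scaled = [x * exact_total / mass for x in clipped]
    result = [int(s) for s in scaled]  # floor, since values are non-negative
    shortfall = exact_total - sum(result)
    # Hand the leftover units to the cells with the largest fractional parts.
    by_fraction = sorted(
        range(len(scaled)), key=lambda i: scaled[i] - result[i], reverse=True
    )
    for i in by_fraction[:shortfall]:
        result[i] += 1
    return result


# Hypothetical noisy block counts inside a tract whose exact total is 20.
blocks = fit_to_total([10.4, -2.0, 5.6, 4.1], exact_total=20)
print(blocks, sum(blocks))
```

Applied level by level (nation to states, states to counties, and so on), a step like this keeps every child table consistent with its parent. Post-processing of this kind uses only the noisy values and the invariant totals, so it spends no additional privacy-loss budget.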

MDF for tabulating: How accurate is the MDF?
- Construct best-fitting individual micro-data with state, county, tract and block geography: 325,000,000 records now including state, county, tract, and block identifiers
- MDF used for tabulating PL-94, SF-1
- Disclosure Avoidance Certificate (requires ε_A):
  - Certifies that the DAS passed tests
  - Reports the accuracy of the MDF

Operational Decisions
- Set total privacy loss budget: ε
- Ensure that ε_1 + ε_2 + ε_3 + ε_4 + ε_5 + ε_A = ε
- Within each stage, allocate privacy-loss budget between:
  - PL-94
  - Parts of SF-1 not in PL-94
These are policy levers provided by the system. Levers are set by the Data Stewardship Executive Policy Committee.
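The budget constraint above is simple bookkeeping, sketched below. The stage names, the numeric allocations, and the `pl94_share` parameter are hypothetical; no budget values had been decided at the time of the presentation.

```python
import math


def budget_is_balanced(total_epsilon, stage_epsilons, abs_tol=1e-9):
    """Check that the per-stage budgets add up to the total budget."""
    if total_epsilon <= 0 or any(e < 0 for e in stage_epsilons.values()):
        raise ValueError("budgets must be non-negative and the total positive")
    spent = math.fsum(stage_epsilons.values())
    return math.isclose(spent, total_epsilon, rel_tol=0.0, abs_tol=abs_tol)


def split_stage(stage_epsilon, pl94_share):
    """Divide one stage's budget between PL-94 and the rest of SF-1."""
    if not 0.0 <= pl94_share <= 1.0:
        raise ValueError("pl94_share must be between 0 and 1")
    pl94 = stage_epsilon * pl94_share
    return {"PL-94": pl94, "SF-1 not in PL-94": stage_epsilon - pl94}


# Hypothetical allocation of a total budget of 1.0 across the five
# geographic stages (ε_1 to ε_5) plus the accuracy certificate (ε_A).
stages = {
    "national": 0.25,
    "state": 0.25,
    "county": 0.125,
    "tract": 0.125,
    "block": 0.125,
    "certificate": 0.125,
}
print(budget_is_balanced(1.0, stages))
print(split_stage(stages["block"], pl94_share=0.5))
```

Summing the stage budgets relies on the composability of differential privacy noted earlier in the talk: the privacy loss of the whole pipeline is bounded by the sum of the losses of its stages.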

Inputs Used by the Development Team
- Lists of matrices in technical documentation express core queries in the workload:
  - PL94: https://www.census.gov/prod/cen2010/doc/pl94-171.pdf
  - SF1: https://www.census.gov/prod/cen2010/doc/sf1.pdf
  - SF2: https://www.census.gov/prod/cen2010/doc/sf2.pdf
- Over 1,000 pages of edit specifications for 2010 CEF
- Uncurated tabulation recode programs

We are creating a framework for Disclosure Avoidance Systems:
- Development & Test Mode
- Production Mode
Testing Systems:
- DAS0: 100% accuracy, no privacy (no disclosure avoidance)
- DAS1: 100% privacy, no accuracy
- DAS2: bottom-up engine
Operational System:
- DAS3: top-down engine
[Figure: DAS0, DAS1, DAS2, and DAS3 placed on a plot of data accuracy against privacy-loss budget (ε)]

Plans for the 2018 End-to-End Test
- The 2018 End-to-End test will incorporate differential privacy
- Likely DAS2 (bottom-up algorithm)
- Only the prototype PL94-171 files will be produced
- No decisions yet regarding the privacy-loss budget or accuracy level
Questions?

Reference
Dinur, Irit and Kobbi Nissim (2003). Revealing information while preserving privacy. In Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (PODS '03). ACM, New York, NY, USA, 202-210. DOI: 10.1145/773153.773173.