The Statistical Administrative Records System and Administrative Records Experiment 2000: System Design, Successes, and Challenges
Dean H. Judson
Planning, Research and Evaluation Division
U.S. Census Bureau

How Administrative Records Are Created and Used
  [Diagram: stages, in order]
    Events and Objects (population)
    Observed Events and Objects ("sampling frame")
    Recorded Events and Objects (administrative record)
    Database
    Presentation (query results and displays)
  [Diagram: factors acting on the stages]
    Policy changes that change the definition of events and objects
    Ontologies and thresholds for observation
    Data collection
    Data entry errors and coding schemes
    Data management issues
    Query structure and spurious structure
11/20/2000 U.S. CENSUS BUREAU

Ontologies and Data Quality
  [Figure: four panels, each mapping one set of numbered states (State 1, State 2, ...) to another. Panel titles: Proper Representation, Incomplete Representation, Ambiguous Representation, Meaningless States.]
  Source: Wand and Wang, 1996:90

Background and History
  Statistical Administrative Records System (StARS)
    Six large Federal input files: IRS 1040, IRS 1099, Selective Service, Medicare, Indian Health Service, HUD-TRACS
    One lookup file: SSA/Census Numident
  AREX 2000
    Attempt to use StARS data to simulate an administrative records census

A Diagrammatic Depiction of Files Used to Create the Final StARS Database
  [Flowchart. Box labels recovered from the diagram, with box numbers where legible:]
    5.05  Person Characteristic File (PCF) (aka 14.100)
    5.15  SSS Person Edited File
    5.20  IRS 1040 Person Edited File
    5.25  IRS 1099 Person Edited File
    5.30  Medicare Person Edited File
    5.35  HUD-TRACS Person Edited File
    5.40  IHS Person Edited File
    5.45  Medicaid Person Edited File (future possibility)
    5.50  CHUMS Person Edited File (future possibility)
    5.55  FAFSA Person Edited File (future possibility)
    5.60  Composite Person Output
    5.65  Original Address Pointers
    5.70  Address Output (aka 4.25)
    5.80  Updated Address Pointers
    5.85  Merge
    5.90  Person Output
  [Processing steps shown without a legible box number:]
    Concatenate, sort, and unduplicate
    Unduplicate & Reset Address Pointers
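
The "concatenate, sort, and unduplicate" step named in the diagram can be illustrated with a minimal Python sketch. The record layout (`ssn`, `source`, `name`), the sample values, and the rule of keeping the first record per SSN after sorting are all assumptions made for illustration; the slides do not specify the actual unduplication logic.

```python
from itertools import chain


def concatenate_sort_unduplicate(*files):
    """Combine person records from several files and keep one record per SSN.

    Hypothetical rule: after sorting by (ssn, source), the first record
    seen for each SSN is kept and later ones are dropped.
    """
    combined = sorted(chain.from_iterable(files),
                      key=lambda rec: (rec["ssn"], rec["source"]))
    seen = set()
    unduplicated = []
    for rec in combined:
        if rec["ssn"] not in seen:
            seen.add(rec["ssn"])
            unduplicated.append(rec)
    return unduplicated


# Made-up person edited files (SSNs and names are placeholders).
irs_1040 = [{"ssn": "000-00-0002", "source": "IRS 1040", "name": "SAM SMITH"}]
medicare = [
    {"ssn": "000-00-0001", "source": "Medicare", "name": "JANE DOE"},
    {"ssn": "000-00-0002", "source": "Medicare", "name": "SAM SMITH"},
]

people = concatenate_sort_unduplicate(irs_1040, medicare)
```

A production rule would likely choose among duplicate records on substantive grounds (recency, source reliability) rather than sort order.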

Characteristics of Files Included in the StARS System
  IRS Individual Master 1040 File:
    Tax year data; April, 2000 refers to tax year 1999
    TY 99 file arrives October, 2000
    Business entities, estates, other institutions included
    120 million records/year
    Households below the filing threshold do not need to file
    Tax Filing Unit ≠ Housing Unit
      Czajka, 2000: 10-20% of addresses are PO Boxes, business addresses, tax preparers
    Limited microdata content:
      TY95+: SSNs of dependents requested, recorded
      Czajka, 2000: 1987 study: 0.5% of primary filer, 1.6% of secondary filer, 3.4% of dependents' SSNs in error
      Age, race, sex, Hispanic origin microdata not available

Characteristics of Files Included in the StARS System, cont.
  IRS Information Returns (1099) File:
    Tax year data; April, 2000 refers to tax year 1999
    TY 99 file arrives October, 2000
    Business entities, estates, other institutions included
    775 million records/year
    Recipient address ≠ Housing Unit
      Czajka, 2000: 10-20% of addresses are PO Boxes, business addresses, tax preparers
    Limited microdata content:
      Age, race, sex, Hispanic origin microdata not available

Characteristics of Files Included in the StARS System, cont.
  Selective Service File:
    About 13 million records
    Registration required in 1940, suspended in 1975, resumed in 1980
    Presumably, males 18-25 are required to inform SSS when they move
    Females, non-immigrant aliens, hospitalized, incarcerated, and institutionalized males, and members of the armed forces are exempt
    Limited microdata content:
      Race, Hispanic origin microdata not available
    Address information may not be current

Characteristics of Files Included in the StARS System, cont.
  Medicare Enrollment Database (EDB):
    Current and historical Medicare enrollment
    Active and inactive cases
    35-40 million records at any one point in time; September 93: 77 million records (active + inactive)
    Proxy recipients listed on the file (e.g., John Doe's benefits c/o Jane Doe; John Doe's benefits c/o nursing home)
    A small portion of records at any point in time are probably deceased (Kim and Sater, 2000)
    Used in population estimates system for 65+ household population estimates

Characteristics of Files Included in the StARS System, cont.
  Medicare EDB, cont.:
    Recipient Address ≠ Housing Unit
      Proxy recipients
    Coverage is believed high (93-102%) but not perfect and unevenly distributed geographically
      Snowbird states appear to have lower ratios of Medicare to 65+ population than non-snowbird states
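
The coverage figures above are ratios of Medicare enrollment to the 65+ population. A minimal sketch of that arithmetic follows; the area names and counts are made up, chosen only so that the results fall at the two ends of the 93-102% range quoted on the slide.

```python
def coverage_ratio(medicare_enrollees, population_65_plus):
    """Medicare enrollment as a percentage of the 65+ population estimate."""
    if population_65_plus <= 0:
        raise ValueError("population_65_plus must be positive")
    return 100.0 * medicare_enrollees / population_65_plus


# Hypothetical areas: (Medicare enrollees, estimated 65+ population).
areas = {
    "Area A": (93_000, 100_000),
    "Area B": (102_000, 100_000),
}

ratios = {name: coverage_ratio(enrolled, pop)
          for name, (enrolled, pop) in areas.items()}
```

A ratio above 100% means the file lists more enrollees at addresses in an area than the population estimate for that area.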

Characteristics of Files Included in the StARS System, cont.
  Indian Health Service patient file:
    About 10 million patient/transaction records
    Transaction record ≠ person record
    Unduplication: about 10 million patient records, 2 million unduplicated SSNs
    Many missing SSNs: about 20% missing SSNs

Characteristics of Files Included in the StARS System, cont.
  Housing and Urban Development Tenant Rental Assistance Certification System (HUD-TRACS):
    HUD subsidy payments
    Currently, about 3.3 million records
    Short form data for all members of household (Race/Hispanic only for head of household)
    Address information may represent project or landlord address

Characteristics of Files Included in the StARS System, cont.
  Census NUMIDENT File:
    750 million transaction records
    400 million individual SSN records
    Post 1985: Enumeration at birth
    For each SSN: Date of birth, gender, race, place of birth
    About 50-60 million persons on the file are deceased but not identified as such
    No current residence information on the file
    Taxpayer ID Numbers (TINs) not on the file
    About 35% of SSNs on file have alternate names (marriage, divorce, etc.)
    6% missing gender
    Race coding has changed (prior to 1980, 3 races: White, Black, Other); 20% either unknown or other
    About 25% of SSNs have transactions with different race codes

StARS Processing Diagrams
  Two Goals:
    For person data: One output record per person, assigned to an individual residence corresponding as closely as possible to Census residence definitions, in a household structure corresponding as closely as possible to Census household structure, containing microdata corresponding as closely as possible to Census short form microdata, and excluding persons who are not in the population of interest.
    For address data: One output record per individual housing unit at a Basic Street Address, geocoded to Census TIGER geography, with address microdata and concepts corresponding as closely as possible to DMAF address fields and concepts, and excluding locations that are not in the population of interest.

StARS Processing Overview
  [Flowchart. Box labels recovered from the diagram, listed by box number; the IHS file is the example file shown:]
    15.05   Process file this cycle? (No: Hold for next cycle)
    15.10   Hold for next cycle
    15.15   Program Development: Address Processing Program
    15.20   Address Data Processing
    15.25   Address Output
    15.30   Program Development: Person Editing Program
    15.35   Person Editing
    15.40   Edited IHS File
    15.45   Program Development: SSN Verification Program
    15.50   Social Security Number (SSN) Verification
    15.55   Verified IHS File
    15.60   Is current year's PCF available? (Yes/No)
    15.65   Create Person Characteristic File (PCF)
    15.70   Person Characteristic File (PCF)
    15.75   Process Person Data
    15.80   Person Output
    15.85   Program Development: Household Processing Program
    15.90   Household Data Processing
    15.95   Household Output
    15.100  Program Development: Final Output Program
    15.105  Final StARS Processing
    15.110  Final StARS Output
    15.115  Data Delivery
            End

Administrative Records Experiment in 2000 (AREX 2000)
  Five selected sites in Maryland and Colorado
    MD: Baltimore city, Baltimore county; CO: El Paso county, Douglas county, Jefferson county
  Attempt to simulate an Administrative Records Census
  Not all aspects of an Administrative Records Census are simulated
    Group Quarters survey
    Coverage measurement survey
  Special operations not included in StARS
    Request for physical address (PO boxes/RRs)
    MAFGOR Geocoding
    Field verification of addresses not matched to DMAF

AREX 2000 Overview Flowchart, Methods 1 and 2
  [Flowchart. Box labels recovered from the diagram, listed by box number; responsible units in parentheses. Part of the chart is labeled "Method 2 Only (Bottom-Up)".]
    17.05   Start Planning & OMB Approval (PRED)
    17.10   Acquire National Administrative Records File (PRED)
    17.15   National Administrative Address Records File
    17.20   Computer geocode the National File (GEO)
    17.25   Maryland & Colorado (MD&CO) Geocoded Files (with test site records flagged)
    17.30   Receive MD&CO Files from GEO (PRED)
    17.35   Create StARS 1999 from MD&CO Files (PRED)
    17.40   StARS 1999 Master Housing File (MHF) for MD&CO
    17.45   Perform Exploratory Data Analysis (EDA) on test sites (PRED)
    17.50   Additional Geocoded Test Site Records
    17.55   Additional Ungeocoded Test Site Records
    17.60   Copy test site records to create AREX Address File (PRED)
    17.65   AREX Address File
    17.70   Extract test site records from MD&CO Files (GEO)
    17.75   Extract ungeocoded city-style records (GEO)
    17.80   Clerical Resolution of Ungeocoded Addresses (MAFGOR) (GEO/FLD/RCCs)
    17.85   Update AREX Address File with MAFGOR results (PRED)
    17.90   Geocoded City-style AREX Addresses
    17.95   Copy P.O. Box and rural-style addresses (PRED)
    17.100  AREX P.O. Box and rural-style addresses (aka 2.40)
    17.110  Request for Physical Addresses Mailout & Processing (DSCMO/NPC/GEO/RCCs)
    17.115  Update AREX Address File with Req. for Phys. Addr. results (PRED)
    17.120  DMAF
    17.125  Obtain DMAF from DSCMO (PRED)
    17.130  Pull off address records from DMAF by AREX test site counties (PRED)
    17.135  Match Geocoded City-style AREX Addresses to DMAF (PRED)
    17.140  Perform clerical review of match results (PRED)
    17.145  Unmatched Admin. Record Addresses
    17.150  Field Address Verification & Processing (FLD / DSCMO / NPC)
    17.155  Update AREX Address File with Fld. Addr. Ver. & Proc. results (PRED)
    17.160  Unmatched DMAF Addresses
    17.165  Obtain person data from Census 2000 (DSCMO)
    17.170  GQ Person Data from Census
    17.175  StARS Person Data
    17.180  AREX Address File (after MAFGOR, Request for Physical Addresses, and Field Address Verification updates)
    17.185  Matched Addresses
    17.190  Census 2000 Person Data
    17.195  Post-Processing
  For details, see AREX 2000: Administrative Records Research File Processing Flowcharts.
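
The flowchart's match of geocoded city-style AREX addresses to the DMAF yields three groups: matched addresses, unmatched administrative record addresses, and unmatched DMAF addresses. The sketch below shows that three-way partition with exact matching on a normalized string. The `normalize` rule (upper-case, drop periods, collapse whitespace) and the sample addresses are assumptions for illustration only; the slides do not describe the actual matching software.

```python
def normalize(address):
    """Hypothetical normalization: upper-case, drop periods, collapse spaces."""
    return " ".join(address.upper().replace(".", "").split())


def partition_match(admin_addresses, dmaf_addresses):
    """Split two address lists into matched and unmatched groups."""
    admin = {normalize(a) for a in admin_addresses}
    dmaf = {normalize(a) for a in dmaf_addresses}
    return {
        "matched": admin & dmaf,
        "unmatched_admin": admin - dmaf,   # candidates for field verification
        "unmatched_dmaf": dmaf - admin,
    }


result = partition_match(
    ["1 Apple St", "5 apple st.", "3 Apple St Apt 1"],
    ["1 APPLE ST", "3 APPLE ST APT 1", "7 APPLE ST"],
)
```

Exact matching of this kind misses alternate spellings and unit designations, which is why the flowchart follows the computer match with clerical review.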

AREX 2000 Evaluation Plans
  Evaluation 1: Comparison of both methods' site- and block-level counts of population by race, Hispanic origin, age groups, and gender with comparable decennial census counts
  Evaluation 2: Analyzing selected components of the AREX implementation processing
  Evaluation 3: Comparison of bottom-up housing unit and household level information with comparable Census 2000 housing unit and household information
  Evaluation 4: Assessing the feasibility of using administrative records in lieu of a field interview to obtain data on nonresponding households

Major Analytic Issues with StARS Processing
  Ontologies
    A delivery address suitable for receiving a payment check may not suffice for putting individuals at a street address
    Difficult to distinguish individual units within the Basic Street Address
    Race coding: Hispanic Origin is a separate race on NUMIDENT
    Transaction data ≠ person data
    How many names does a person have (and in what order)?
  Proxies
    IRS & Medicare records
    Example:
      JOHN WILSON
      C/O MARY SMITH
      1004 LAUREL LANE
      ROCKMONT, MD 22345
    The address is for Mary Smith. John Wilson may or may not live there.
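
A proxy address like the one in the example can at least be flagged mechanically when it carries a "C/O" line. The sketch below is one hedged way to do that in Python; the accepted spellings (`C/O`, `CARE OF`) and the function name `find_proxy` are assumptions, and a flag of this kind says nothing about whether the beneficiary actually lives at the address.

```python
import re

# Hypothetical pattern: a line that starts with "C/O" or "CARE OF".
CARE_OF = re.compile(r"^\s*(?:C/O|CARE OF)\s+(.+)$", re.IGNORECASE)


def find_proxy(address_lines):
    """Return the proxy's name if any line is a care-of line, else None."""
    for line in address_lines:
        match = CARE_OF.match(line)
        if match:
            return match.group(1).strip()
    return None


record = ["JOHN WILSON", "C/O MARY SMITH", "1004 LAUREL LANE", "ROCKMONT, MD 22345"]
proxy = find_proxy(record)
```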

Major Analytic Issues with StARS Processing, cont.
  Addresses that are difficult to place on the ground
    Huang and Kim, 2000: About 10% of addresses are rural style
    PO Boxes: 45% for IHS, 9.5% for Medicare, 7.5% for IRS 1040, 6.8% for SSS, 3.8% for IRS 1099, 0.4% for HUD-TRACS
    Sater, 1995 IRS/CPS match: 86.5% of tax return cases had the same address as residence address, 94% coded to same county
    Example:
      John Smith
      H&R BLOCK
      P.O. BOX 12
      GREENWAY, MD 29752
  Addresses with both business and residential components
    Example:
      Dean H. Judson
      JUDSON OLD GROWTH LOGGING & SPOTTED OWL EXTERMINATION SERVICES
      45850 BACKWOODS HIGHWAY
      BOONDOCKS, OR 96432
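
The distinction between P.O. Box, rural-style, and city-style addresses can be sketched as a simple pattern classifier. The regular expressions and category names below are assumptions written for this illustration, not the rules used in StARS, and they would misclassify many real addresses.

```python
import re

# Hypothetical patterns for three address styles.
PO_BOX = re.compile(r"\bP\.?\s*O\.?\s*BOX\b", re.IGNORECASE)
CITY_STYLE = re.compile(r"^\s*\d+\s+\S")            # leading house number
RURAL = re.compile(r"\b(?:RURAL ROUTE|RR)\s*\d+", re.IGNORECASE)


def classify_address_line(line):
    """Classify one delivery line as po_box, city_style, rural_style, or other."""
    if PO_BOX.search(line):
        return "po_box"
    if CITY_STYLE.match(line):
        return "city_style"
    if RURAL.search(line):
        return "rural_style"
    return "other"


styles = [classify_address_line(line) for line in
          ["P.O. BOX 12", "BOX 2 RURAL ROUTE 37", "486 MAIN STREET", "H&R BLOCK"]]
```

The city-style test runs before the rural test so that a line such as "100 Route 4", which has a house number, is not treated as a rural route.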

Major Analytic Issues with StARS Processing, cont.
  Unduplication and matching
    When addresses or personal characteristics are measured with substantial variation, it is often not obvious whether a particular pair of records represents a duplicate or not. Yet, with multiple files, unduplication decisions must be made.

  CHUMS-enhanced IMH File
    A  Banana St
    B  17 Banana St
    C  19 Banana St Apt 5
    D  44 MLK, Jr. Blvd
    E  100 Route 4
    F  7 Marie Ln
    G  Wife Mrs. Smith
    H  5 Apple St
    I  27 Apple St
    J  Apple St
    K  9999 Apple St
    L  3 Apple St Apt 5
    M  1 Apple St
    N  3 Apple St Apt A
    O  3 Apple St ZZ
    P  3 Apple St
    Q  3 Apple St Apt 1

  MAF
    1 Apple St
    3 Apple St Apt 1
    3 Apple St Apt 2
    3 Apple St Apt 3
    3 Apple St Apt 4
    7 Apple St
    9 Apple St
    # Apple St
    # Martin Luther King, Jr. Blvd
    # Pennsylvania Ave
    7 Maria Ln

Major Analytic Issues with StARS Processing, cont.
  Outcome of "CHUMS-enhanced IMH File" / MAF Match

  MATCH
  Street  BSA  BSA+Unit  Example  Possible Explanations
  NO      N/A  N/A       A,B,C    1 Street is not in MAF, either it was just missing or it's a new street
                         D,E      2 Different, but valid representation of street name
                         F        3 Misspelling of street name
                         G        4 Erroneous street name
  YES     NO   N/A       H        1 BSA is not in MAF, either it was just missing or it's a new BSA - There is a "hole" in MAF
                         I        2 BSA is not in MAF, either it was just missing or it's a new BSA - A missing "street extension"
                         J        3 Existing street with no incoming street number
                         K        4 Erroneous street number
  YES     YES  NO        L        1 Unit not in MAF, either it was just missing or it's a new unit
                         M        2 Valid match - a BSA without separate units
                         N        3 Different representation of a unit
                         O        4 Erroneous unit information
                         P        5 Missing unit information
  YES     YES  YES       Q        1 Valid match
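
The three match levels in the table (street, Basic Street Address, BSA plus unit) can be expressed as a short hierarchical check. The following Python sketch is only an illustration: the `Address` tuple, the exact-string comparison, and the small MAF built from the Apple St examples are assumptions, and the code does not attempt the standardization needed to resolve cases such as D, E, or F.

```python
from typing import NamedTuple, Optional


class Address(NamedTuple):
    number: Optional[str]   # house number, None if missing
    street: str
    unit: Optional[str] = None


def match_outcome(rec, maf):
    """Return (Street, BSA, BSA+Unit) outcomes in the table's YES/NO/N/A terms."""
    streets = {a.street for a in maf}
    bsas = {(a.number, a.street) for a in maf}
    if rec.street not in streets:
        return ("NO", "N/A", "N/A")
    if (rec.number, rec.street) not in bsas:
        return ("YES", "NO", "N/A")
    if rec.unit is None or rec not in maf:
        return ("YES", "YES", "NO")
    return ("YES", "YES", "YES")


# Hypothetical encoding of part of the example MAF.
maf = {
    Address("1", "Apple St"),
    Address("3", "Apple St", "Apt 1"),
    Address("3", "Apple St", "Apt 2"),
    Address("7", "Apple St"),
}

outcome_q = match_outcome(Address("3", "Apple St", "Apt 1"), maf)
outcome_h = match_outcome(Address("5", "Apple St"), maf)
```

As in the table, a record without a unit (example M) lands in the YES/YES/NO row even though it may be a valid match, so the outcome alone does not settle the unduplication decision.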

Major Analytic Issues with StARS Processing, cont.
  Variations in data from different sources
    Huang and Kim, 2000: Of the 50% of SSNs found on multiple files,
      about 1% have more than one gender recorded
      about 32% have multiple addresses
      about 2% have multiple races
  Imputation from the NUMIDENT
    Many files have limited microdata. For those that are found on the NUMIDENT, we can impute microdata from the approximately equivalent NUMIDENT fields.
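
Imputation from the NUMIDENT amounts to a lookup by SSN that fills fields the source file lacks. A minimal sketch follows, assuming records are dictionaries keyed by hypothetical field names (`ssn`, `date_of_birth`, `gender`, `race`) and that only missing values are filled; the actual StARS rules for reconciling conflicting values are not described on the slide.

```python
def impute_from_numident(record, numident,
                         fields=("date_of_birth", "gender", "race")):
    """Fill missing fields in a record from the NUMIDENT entry for its SSN.

    Values already present in the record are left untouched.
    """
    filled = dict(record)
    lookup = numident.get(record.get("ssn"))
    if lookup is None:          # SSN not found on the NUMIDENT
        return filled
    for field in fields:
        if filled.get(field) is None and lookup.get(field) is not None:
            filled[field] = lookup[field]
    return filled


# Made-up NUMIDENT entry and a source record with limited microdata.
numident = {
    "000-00-0001": {"date_of_birth": "1930-05-01", "gender": "F", "race": None},
}
source_record = {"ssn": "000-00-0001", "gender": None, "race": None}

imputed = impute_from_numident(source_record, numident)
```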

Major Analytic Issues with StARS Processing, cont.
  Changing information states
    Distinct problem from point-in-time data collection
    Information states change over time/over databases
    Address information ages over time and varies over databases
      Example 1 (Dated 10/14/98 from Medicare):
        SAM SMITH
        BOX 2 RURAL ROUTE 37
        WESTPORT, VA 32784
      Example 2 (From TY97 IRS file, filed sometime in 1998):
        SAM SMITH
        486 MAIN STREET
        FAIRFIELD, VA 33412
    Mortality information ages over time and varies over databases
    One database provides information about the other, provided that matching can be performed
  Data processing requires complex, and substantively important, decision logic at each step
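
The Sam Smith example shows why choosing the "current" address requires decision logic: the Medicare address has an exact date, while the IRS address is known only to have been filed sometime in 1998. One hedged way to represent this is to give each address a date interval and declare a winner only when the intervals do not overlap. The `DatedAddress` type and the interval rule are assumptions for illustration, not the StARS rule.

```python
from datetime import date
from typing import NamedTuple, Optional


class DatedAddress(NamedTuple):
    source: str
    address: str
    earliest: date   # earliest date the address could refer to
    latest: date     # latest date the address could refer to


def most_recent(a: DatedAddress, b: DatedAddress) -> Optional[DatedAddress]:
    """Return the clearly more recent address, or None if the dates overlap."""
    if a.earliest > b.latest:
        return a
    if b.earliest > a.latest:
        return b
    return None


medicare = DatedAddress("Medicare", "BOX 2 RURAL ROUTE 37, WESTPORT, VA 32784",
                        date(1998, 10, 14), date(1998, 10, 14))
irs_ty97 = DatedAddress("IRS TY97", "486 MAIN STREET, FAIRFIELD, VA 33412",
                        date(1998, 1, 1), date(1998, 12, 31))

choice = most_recent(medicare, irs_ty97)
```

Here `choice` is `None`: the IRS return could have been filed before or after 10/14/98, so the dates alone cannot settle which address is current.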

References
  Bye, B. (1998). Race and ethnicity modeling with SSA Numident Data: Interim report: File development and tabulations. Unpublished document available from the U.S. Bureau of the Census.
  Bryant, C. (1995). Comparing the LUCA address list to local records. Paper presented at the 1995 State Data Center Meeting, San Francisco, CA, April 4, 1995.
  Czajka, J. (1999). Can we count on administrative records in future U.S. Censuses? Presentation at the Bureau of the Census, December 15, 1999.
  Huang, E., and Kim, J. (2000). One Percent Sample Study Report (SRD-DRAFT). Unpublished document available from the U.S. Bureau of the Census, February 10, 2000.
  Judson, D.H., and Popoff, C.L. (2000). Research Use of Administrative Records. Unpublished document.
  Judson, Dean H. (2000). The Statistical Administrative Records System: System Design, Successes, and Challenges. Unpublished document.
  Kim, Myoung Ouk, and Sater, Douglas (2000). Defining the Medicare Data Universe for the U.S. Census Bureau's Population Estimates Program. Paper presented at the Southern Demographic Association meetings, New Orleans, LA, August 29, 2000.
  Sater, D. (1995). Differences in Location of Households and Tax Filing Units. Paper presented at the 1995 meeting of the Population Association of America, San Francisco, CA, April 6, 1995.
  Wand, Yair, and Wang, Richard Y. (1996). Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39: 86-95.
  Zanutto, E. (1996). Estimating a population roster from an incomplete census using mailback questionnaires, administrative records, and sampled nonresponse followup. Presentation to the U.S. Bureau of the Census, August 6, 1996.
  Zanutto, E., and Zaslavsky, A. (1999). Using Administrative Records to Impute for Nonresponse. Paper presented at the International Conference on Survey Nonresponse, Portland, OR, October 29, 1999.