Use of administrative sources and registers in the Finnish EU-SILC survey
Workshop on best practices for EU-SILC revision
Marie Reijo, Senior Researcher
Content
- Preconditions for good register utilisation
- Register use in the Finnish SILC/IDS, overview
- Register use by the Finnish SILC/IDS survey stages
  - Sampling design and sample selection
  - Weighting and unit non-response correction
  - Data collection and processing
  - Data analysis
- Integrated modules, e.g. HFCS 2013
Preconditions: comprehensive and reliable register system
- Basic registers
- Major registers (incl. statistical registers by Statistics Finland)
- Statistics production and releasing by Statistics Finland
- Efficient information system for collecting registers
- Register-based census system created with the 1970 census; from the 1987 census entirely from administrative sources
- Totally register-based statistics, e.g. Statistics on taxable income since 1969, Total statistics on income distribution (TSID) since 1995
- Unified identification codes, exact matching
- Registers used for sample-based surveys since the 1970s; HBS originates in 1966 and Income distribution statistics (IDS) in 1977, with SILC integrated in 2004
- Legislative basis for statistical purposes
- Public approval
- Best practices (Statistics Finland 2004; UN/ECE 2007; UN 2012; see also Wallgren & Wallgren 2014)
Register use in the Finnish SILC/IDS, overview

Stage: Sampling (1st phase)
  Sources: The Population Information System
  Linkage units: -
  Methods: Direct use
  Aim: Sampling frame, sample selection (master sample) and update

Stage: Sampling (2nd phase)
  Sources: The Population Information System, taxation register
  Linkage units: Person, household-dwelling unit
  Methods: Deterministic record linkage
  Aim: Strata construction, selection of persons from the master sample by stratum

Stage: Data collection
  Sources: Several register sources
  Linkage units: Person, enterprise, region
  Methods: Deterministic record linkage
  Aim: Auxiliary data to the sample for the CATI Blaise questionnaire (data editing in interviews). Replacing interview questions and providing substitutive information for target variables (data collection for target variables).

Stage: Data processing
  Sources: Several register sources
  Linkage units: Person, dwelling, building, enterprise, region
  Methods: Deterministic record linkage, and methods to derive, estimate, impute and code variables, e.g. regression estimation, stratification
  Aim: Auxiliary information for checking and editing interviewed data, detecting and correcting errors (e.g. inconsistencies at unit level) for target variables. Auxiliary and substitutive information for editing and imputing missing information for target variables. Combining information with interviewed or register information to derive and form target variables.

Stage: Estimation, weighting
  Sources: The register data on household-dwelling population by Statistics Finland, the TSID data
  Linkage units: Person, household-dwelling unit, region
  Methods: Deterministic record linkage, several methods, e.g. regression estimation, calibration methods
  Aim: Information for unit non-response analysis, unit non-response correction, adjusting data to the target (total) population, using data on crucial frequencies and on income and income receiver sums

Stage: Quality analysis
  Sources: Total data, e.g. TSID
  Linkage units: Person
  Methods: Direct use, deterministic record linkage
  Aim: Data comparisons, unit non-response (e.g. panel attrition) and other analysis
Register sources in sampling
- Registers:
  - Basic register: Population Information System of the Population Register Centre (persons; buildings and dwellings)
  - National Board of Taxes (taxation)
- Data of Statistics Finland:
  - Sample frame: total data copy of persons, buildings and dwellings
  - Master sample
  - Master sample by stratum
  - SILC/IDS sample
Register use for two-phase stratified sampling
- Sample frame from the Population Information System, up to date
  - Persons residing permanently in Finland at the end of the year, ordered by domicile code (address)
  - Unified identification codes for persons
- Persons selected systematically for the 1st-phase master sample (about 50 000)
  - Over-coverage (persons not in the target population at sy t-1; 31.12.) excluded, checked against updated register data
- Socioeconomic strata for the 2nd-phase sample selection
  - Data linked from the taxation register (sy t-2) to the persons living in the sample person's household-dwelling unit
  - 12 strata: information on taxable income type and level, defined by the highest earner in the household-dwelling unit
- SILC/IDS gross sample (about 13 500 persons) selected by simple random sampling with non-proportional allocation from the strata
- Using taxation register data for stratification ensures less biased estimates for important output measures.
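The two-phase design on this slide can be illustrated with a small sketch. It is not the production procedure of Statistics Finland: the frame, the field names (`pid`, `domicile`, `stratum`), the three strata and the allocation figures are all hypothetical, and only the general technique (systematic selection from an ordered frame, then simple random sampling within strata with a non-proportional allocation) follows the slide.

```python
import random
from collections import defaultdict

def systematic_sample(frame, n, rng):
    """Phase 1: pick n units from an ordered frame, random start, fixed step."""
    step = len(frame) / n
    start = rng.random() * step
    last = len(frame) - 1
    return [frame[min(int(start + i * step), last)] for i in range(n)]

def stratified_srs(units, allocation, rng):
    """Phase 2: simple random sampling within each stratum.

    allocation maps stratum -> sample size (non-proportional by choice).
    Returns (unit, phase-2 weight) pairs, weight = N_h / n_h within the master sample.
    """
    by_stratum = defaultdict(list)
    for unit in units:
        by_stratum[unit["stratum"]].append(unit)
    selected = []
    for stratum, members in sorted(by_stratum.items()):
        n_h = min(allocation.get(stratum, 0), len(members))
        for unit in rng.sample(members, n_h):
            selected.append((unit, len(members) / n_h))
    return selected

rng = random.Random(2015)

# Hypothetical frame: persons ordered by domicile code; the stratum stands in
# for the income-based strata derived from the taxation register.
frame = [{"pid": i, "domicile": f"{i:06d}", "stratum": i % 3} for i in range(10_000)]
frame.sort(key=lambda person: person["domicile"])

master = systematic_sample(frame, 1_000, rng)      # 1st phase: master sample
allocation = {0: 50, 1: 100, 2: 150}               # assumed allocation, oversampling stratum 2
sample = stratified_srs(master, allocation, rng)   # 2nd phase: gross sample

print(len(master), len(sample))
```

The phase-2 weights sum to the size of the master sample, which is a quick check that the unequal allocation is compensated in the weights.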
Register sources in weighting and unit non-response correction
- Administrative registers:
  - Population Register Centre (persons, buildings and dwellings)
  - National Board of Taxes (taxation)
  - Finnish Centre for Pensions (pensions)
  - Social Insurance Institution (social insurance)
  - National Institute for Health and Welfare (social assistance)
- Other register sources: Education fund, State Treasury, Financial Supervision Authority, Ministry of Agriculture and Forestry...
- Statistics Finland, data and statistics:
  - Population data
  - Household-dwelling units
  - Total statistics on income distribution data
  - SILC/IDS
Register use for weighting and unit non-response correction
- Unit non-response analysis by register data
- Calibration of non-response adjusted design weights by frequencies and sums from the household-dwelling unit and TSID data by Statistics Finland (register household-dwelling population and household-dwelling units at sy t-1; 31.12. and their income for sy t-1):
  - Number of households
  - Sex * age (5-year) groups of the household-dwelling population, the oldest age group 85+
  - Number of members in the household-dwelling unit (1, 2, ..., 6+)
  - Region (NUTS 3, with Helsinki and the capital area separated)
  - Degree of urbanisation
  - Sums of the 12 income components
  - Number of receivers of the 3 income components
- Standard methods and calibration variables are used over the years
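As an illustration of calibrating weights to known register totals, the sketch below uses raking (iterative proportional fitting) on two categorical margins. This is a stand-in chosen for simplicity: the slides only say "calibration methods", and the actual procedure also calibrates to income sums, which raking on categories does not cover. The variable names, the six-unit sample and the target totals are hypothetical.

```python
def rake(weights, categories, targets, max_iter=100, tol=1e-9):
    """Adjust weights so weighted category counts match the target totals.

    weights:    starting weights (e.g. non-response adjusted design weights)
    categories: {variable: [category of each unit]}
    targets:    {variable: {category: known population total}}
    """
    w = list(weights)
    for _ in range(max_iter):
        largest_adjustment = 0.0
        for var, labels in categories.items():
            # Current weighted total per category of this variable.
            totals = {}
            for wi, label in zip(w, labels):
                totals[label] = totals.get(label, 0.0) + wi
            # Scale each category to its known total.
            factors = {label: targets[var][label] / totals[label] for label in totals}
            w = [wi * factors[label] for wi, label in zip(w, labels)]
            largest_adjustment = max(
                largest_adjustment, max(abs(f - 1.0) for f in factors.values())
            )
        if largest_adjustment < tol:
            break
    return w

def weighted_totals(w, labels):
    totals = {}
    for wi, label in zip(w, labels):
        totals[label] = totals.get(label, 0.0) + wi
    return totals

# Hypothetical respondents and register totals (both margins sum to 600).
categories = {
    "sex": ["F", "F", "F", "M", "M", "M"],
    "hh_size": ["1", "2", "1", "2", "1", "2"],
}
targets = {
    "sex": {"F": 330.0, "M": 270.0},
    "hh_size": {"1": 310.0, "2": 290.0},
}
start_weights = [100.0] * 6

calibrated = rake(start_weights, categories, targets)
sex_totals = weighted_totals(calibrated, categories["sex"])
size_totals = weighted_totals(calibrated, categories["hh_size"])
print(sex_totals, size_totals)
```

After convergence the weighted counts reproduce both sets of register totals, which is the property the calibration step relies on.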
Total disposable household income means by strata, 1st wave
[Figure: mean total disposable household income (1000 euros) by stratum. Series: Mean (sample); Mean (design weight, non-response adjusted); Mean (calibrated weight).]
Source: IDS/SILC sy2015
Total disposable household income means by strata, 4th wave
[Figure: mean total disposable household income (1000 euros) by stratum. Series: Mean (sample); Mean (design weight, non-response adjusted); Mean (calibrated weight).]
Source: IDS/SILC sy2015
Register sources in data collection and processing
- Administrative registers:
  - Population Register Centre (persons, buildings and dwellings)
  - National Board of Taxes (taxation)
  - Finnish Centre for Pensions (pensions)
  - Social Insurance Institution (social insurance)
  - National Institute for Health and Welfare (social assistance)
- Other register sources: Education fund, State Treasury, Financial Supervision Authority, Ministry of Agriculture and Forestry...
- Statistics Finland, registers and data:
  - Business register
  - Student register
  - Register on degrees
- Statistics Finland, statistics:
  - Population data
  - Families
  - Household-dwelling units
  - Total statistics on income distribution data, incl. indebtedness
  - SILC/IDS
Register use in data collection and processing
- Detecting and correcting erroneous responses for target variables during the interview
  - Auxiliary information is prefilled in the CATI/CAPI Blaise questionnaire by exact matching, for the persons of the household-dwelling unit in wave I and of the housekeeping unit in waves II-IV.
  - Household members (sy t) are determined first in the interview; where there is an exact match, the register information is used.
  - Automatic coding during the interview.
- Editing and coding interviewed data for variables in the statistics database system, either automatically (programmed) or manually (loaded to the editing system display)
  - Register data linked to persons (exact matching).
- Forming target variables by record linkage, e.g. data on income, or by editing or imputing item non-response for objective-type variables with statistical methods
  - Exact matching.
- Standard editing rules, if there are no changes in sources or definitions
- Consistency of data from different sources is ensured for units.
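Exact matching on a unified identification code amounts to a keyed join. The sketch below shows the idea with plain dictionaries; the field names (`person_id`, `pension_income`, `degree`) and the records are hypothetical, and real production linkage runs inside the statistics database system rather than in a script like this.

```python
def link_exact(sample_persons, register_records, key="person_id"):
    """Deterministic record linkage: join register data to sample persons by ID.

    Returns (linked, unmatched). Register fields are prefixed with "reg_" so
    that interviewed and register values stay distinguishable after linkage.
    """
    register_by_id = {record[key]: record for record in register_records}
    linked, unmatched = [], []
    for person in sample_persons:
        record = register_by_id.get(person[key])
        if record is None:
            unmatched.append(person)      # no exact match: nothing to prefill
            continue
        merged = dict(person)
        for name, value in record.items():
            if name != key:
                merged["reg_" + name] = value
        linked.append(merged)
    return linked, unmatched

# Hypothetical sample persons and register extract.
sample_persons = [
    {"person_id": "A1", "wave": 1},
    {"person_id": "B2", "wave": 2},
    {"person_id": "C3", "wave": 1},
]
register_records = [
    {"person_id": "A1", "pension_income": 0, "degree": "MSc"},
    {"person_id": "B2", "pension_income": 14200, "degree": None},
]

linked, unmatched = link_exact(sample_persons, register_records)
print(len(linked), "linked;", len(unmatched), "without a register match")
```

Keeping the register values under separate names mirrors the practice, described on the income slide, of storing original register, interviewed and derived variables separately.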
Data collection for variables from registers
- Register use has many advantages, e.g. lower response burden and costs, better accuracy
- Assessing register exploitation: what is efficient and sufficient for SILC data quality?
- Relevance?
  - Definitions: SILC variables vs. register variables
  - Opinions and other subjective data are rarely available from registers
  - Not all factual variables are available from registers
- Validity of the factual data that are available from registers
  - Comprehensiveness and completeness
  - Reference time periods and time points
  - Register data: no information available for the interview time point
- => Data consistency, of multipurpose survey data in particular
  - Consistency within domains
  - Consistency between domains
- Delays in statistical domain registers vs. SILC timeliness
- Coherence of statistics in the statistical system
Case: Income
- Almost all of the SILC/IDS income comes from registers, about 98-99 %
- Statistical data on the household-dwelling population by Statistics Finland as base data, plus many comprehensive register sources:
  - Earliest register received in April, others mostly in August to November
  - The final taxation register received in November
  - TSID released in December (survey year)
- Errors are possible (e.g. missing units, missing or erroneous items); updated data are then needed from register providers
  - Preliminary error detection first by the Data Collection Unit of Statistics Finland
- Data filled in both the TSID and the SILC/IDS sample database files
  - Common, consistent income classification by detailed register items; information on changes obtained beforehand for data collection and planning
  - Unified data compilation, e.g. edited and derived variables formed for the total data and the sample, kept apart from register files and variables
  - Original register, interviewed and derived variables in separate files of the statistics production database. Contents described in the metadata system.
  - Macro and micro checks; the sample is used for error detection at unit level
  - Early registers used for editing interviewed data, checked against final data
Case: Main activity
- Income comes from registers for the calendar year; many main activity variables are filtered by PL031 (current = December), and definitions are based on the person's own perception.
- Interviewed IDS activity months are edited against registers for the reference year: decision rules are based on income type and level and other factual information on the person's economic position. Overlapping activities are allowed for edited IDS months: sum = 12 or > 12.
- SILC PL073-PL090 and PL211A-PL211L: PL211L = PL031 (December).
- Final IDS months: edited for 11 % of persons
- Final December (PL031): edited for 4 % of persons
- Final PL073-PL090: edited for 15 % of persons. The number of months from both sources was equal for 85 % of persons.
- PL211A-PL211L: months are the same for 86.5 %; errors were corrected for about 2 %, if the same main activity (incl. PL031) lasted for the whole year. No other corrections.
- Consistency between SILC and IDS months; IDS months are used for the classification of socioeconomic groups.
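Two of the constraints on this slide are simple enough to express as unit-level checks: edited IDS months may overlap, so they sum to 12 or more, and the December entry of the monthly calendar (PL211L) equals PL031. The sketch below checks only these two; the actual decision rules based on income type and level are not given on the slide and are not reproduced. The data layout and activity labels are hypothetical.

```python
def check_main_activity(ids_months, pl211, pl031):
    """Return a list of consistency problems for one person.

    ids_months: {activity label: number of months} after editing (overlaps allowed)
    pl211:      list of 12 monthly activity codes, PL211A (January) .. PL211L (December)
    pl031:      current (December) activity code
    """
    problems = []
    if sum(ids_months.values()) < 12:
        problems.append("IDS months sum to less than 12")
    if len(pl211) != 12:
        problems.append("PL211A-PL211L does not cover 12 months")
    elif pl211[-1] != pl031:
        problems.append("PL211L differs from PL031")
    return problems

# Hypothetical person: employed all year, studying alongside for 4 months.
consistent = check_main_activity(
    ids_months={"employed": 12, "student": 4},
    pl211=[1] * 12,
    pl031=1,
)

# Hypothetical person with a gap in months and a December mismatch.
inconsistent = check_main_activity(
    ids_months={"unemployed": 5, "employed": 6},
    pl211=[5] * 5 + [1] * 7,
    pl031=5,
)

print(consistent)
print(inconsistent)
```

Units flagged by checks of this kind would still need the register-based decision rules, or manual review, to decide which source to trust.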
Case: Housing
- Discrepancy between household definitions (housekeeping and household-dwelling units): sharing the same dwelling (i.e. rentals) with another household, or dispersing across many dwellings
- Discrepancy between interviewed and register dwellings, incl. variables irrespective of household definition (HH010, HH021):
  - Definitions: household's main vs. permanent or usual residence
  - Measurement error, reference time: responses vs. registrations
  - Measurement error, quality: responses vs. registrations
- However, for the sample units (= S-R):
  - dwelling municipality is the same for 99 %
  - dwelling type (apartments or flats vs. others) for 96 %
  - housing tenure for 88 %
  - but number of rooms for only 50 %. The number of rooms differs in detached houses with 5 or more rooms.
- When the dwelling is checked for all persons responsible for accommodation (HB080, HB090), the dwelling municipality is the same for 99 % and the dwelling type for 96 % of persons, i.e. no change from the figures above
- Register data are used primarily for automatic editing (erroneous or missing values) of objective-type data, linked to S-R.
- More effort on exploiting registers? More effort on decision rules for validating the reported main dwelling of the housekeeping unit.
Data analysis: systematic comparisons of estimates
- Comparisons with household-dwelling population and TSID data:
  - Analysing sampling and estimation effects.
  - Variables from registers linked to SILC/IDS sample units, adjusting away household and other definitional effects: comparisons of total sums and frequencies.
    - Household definition
    - Income discrepancy due to interviewed income items
    - Other discrepancies, e.g. income classifications
- Comparisons of sums, frequencies and classifications with register statistics by Statistics Finland, e.g. NA, TSID.
- Comparisons of frequencies, sums and classifications with external register statistics, e.g. the ESSPROS statistics by the National Institute for Health and Welfare
Integration of HFCS with the SILC 2013 survey
- The Finnish SILC sample was used for the HFCS (2nd wave) compilation.
- Clearly defined domain, related to income data
- Many register and other statistical data sources (in addition to the major registers) and many focused techniques were used for the hard-to-interview HFCS data:
  - Unit linking from registers (comprehensive sources)
  - Register-based estimation and imputation methods based on data available for statistical units from external sources, e.g. separate valuation, perpetual inventory method
  - Statistical matching from HBS by common register variables, e.g. predictive mean matching, file concatenation
  - Some of the wealth data, e.g. opinion-type items, were interviewed
  - Additional variables in calibration
- Methods are being developed further for the next HFCS (3rd wave) in the 2017 SILC survey, combined with the SILC ad hoc module on wealth and consumption
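Predictive mean matching, named above as one of the statistical matching techniques, can be sketched in a few lines. This is a deliberately simplified, deterministic variant with a single common variable and the single nearest donor; practical implementations usually use several common register variables and draw at random among the closest donors. The donor file, the variable roles and all numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def pmm_impute(donors, recipient_xs):
    """Give each recipient the observed y of the donor with the closest predicted mean.

    donors:       list of (x, y) pairs, x = common variable, y = variable to transfer
    recipient_xs: x values of units that lack y
    """
    xs = [x for x, _ in donors]
    ys = [y for _, y in donors]
    intercept, slope = fit_line(xs, ys)
    donor_pred = [intercept + slope * x for x in xs]
    imputed = []
    for x in recipient_xs:
        pred = intercept + slope * x
        nearest = min(range(len(donors)), key=lambda i: abs(donor_pred[i] - pred))
        imputed.append(ys[nearest])   # an observed donor value, not the model prediction
    return imputed

# Hypothetical donor file (e.g. HBS): register income and observed consumption, 1000 euros.
donors = [(10, 9), (20, 15), (30, 24), (40, 28)]
# Hypothetical recipients (e.g. SILC units): register income only.
imputed = pmm_impute(donors, [12, 33])
print(imputed)
```

Because the imputed values are taken from real donor records, they stay within the range of observed data, which is the usual argument for matching over direct regression prediction.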
Thank you for your attention