A GI Science Perspective on Geocoding: Accuracy, Repeatability and Implications for Geospatial Privacy Paul A Zandbergen Department of Geography University of New Mexico
Geocoding as an Example of Applied GI Science Why geocoding? Arguably the most successful application of GISc In very widespread use Fundamental step in many types of spatial analysis Often considered as relatively easy and largely error-free Big questions: How good is current geocoding? What are the minimum quality expectations for geocoding? What are the effects of errors in geocoding on spatial analysis? What are the (unintended) consequences of widespread, high-quality geocoding? How can geocoding be improved?
Types of Geocoding County ZIP code (ZIP and ZIP+2/4) Street Parcel Address point Routes Types of geocoding will vary by country and region, but in the United States street geocoding is by far the most widely used, both for research and commercial applications
What's Behind the Geocoding Process? Probabilistic Record Linkage Standardization Soundex Well-established algorithms, dating back to early efforts by US Census (DIME, TIGER) Ongoing research into modifications, including Markov-chain models
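Soundex is the easiest of the techniques named on this slide to show in a few lines. The sketch below follows the standard American Soundex rules (first letter kept, consonants mapped to digits, repeated codes collapsed, result padded to four characters). It is an illustration only, not the matching code of any particular geocoder, and the function name `soundex` is chosen here for the example.

```python
def soundex(name: str) -> str:
    """Return the 4-character American Soundex code for a name."""
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]                 # the first letter is kept as-is
    prev = codes.get(letters[0], "")    # its code still suppresses a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not break a run of equal codes
        code = codes.get(ch, "")        # vowels and Y have no code
        if code and code != prev:
            result += code
        prev = code                     # a vowel resets the run
    return (result + "000")[:4]         # pad or truncate to 4 characters


# A misspelled street name still produces the same code.
print(soundex("Maple"), soundex("Mapel"))  # M140 M140
```

Because both spellings share a code, a geocoder can treat "Mapel St" as a candidate match for "Maple St" and leave the final decision to the scoring step of the record linkage.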
How Does Street Geocoding Work?
[Figure: street segment with address ranges 701-799 and 700-798, showing 747 Main St with its end offset (%) and side offset]
1. Find the zone (ZIP, City, etc.)
2. Match the street (by Name, Type, Dir, etc.)
3. Match the segment with the proper range
4. Linear interpolation along segment
5. Apply offsets
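Steps 4 and 5 can be sketched as follows, assuming the segment has already been matched and its address range is known. The function name, parameter names and the straight two-point segment in projected coordinates are assumptions made for this illustration; real reference data uses multi-vertex polylines, and which side carries the odd numbers depends on the data set.

```python
import math


def interpolate_address(house_number, low, high, start_xy, end_xy,
                        side_offset=0.0, side="left", end_offset_pct=0.0):
    """Place a house number along a straight street segment."""
    # Step 4: fraction of the way through the address range (0 at low, 1 at high).
    fraction = 0.5 if high == low else (house_number - low) / (high - low)

    # Step 5a: end offset (%) keeps points away from both ends of the segment.
    e = end_offset_pct / 100.0
    fraction = e + fraction * (1.0 - 2.0 * e)

    dx = end_xy[0] - start_xy[0]
    dy = end_xy[1] - start_xy[1]
    x = start_xy[0] + fraction * dx
    y = start_xy[1] + fraction * dy

    # Step 5b: side offset moves the point perpendicular to the segment.
    length = math.hypot(dx, dy)
    if length > 0 and side_offset:
        sign = 1.0 if side == "left" else -1.0
        x += sign * side_offset * (-dy / length)
        y += sign * side_offset * (dx / length)
    return x, y


# 747 Main St on a 100 m segment running east, range 701-799, 10 m side offset.
x, y = interpolate_address(747, 701, 799, (0.0, 0.0), (100.0, 0.0),
                           side_offset=10.0, side="left")
print(round(x, 1), round(y, 1))  # 46.9 10.0
```

The interpolation assumes houses are evenly spaced over the full address range, which is one reason positional error grows where parcels are large or ranges are only partly used.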
Typical Errors Spelling issues Incomplete street address Prefix, suffix, direction conflicts Apartment and unit numbers Ambiguous street names Outdated/incomplete street reference data Well documented types of errors
Framework for Geocoding Quality 1. Completeness: the percentage of records that could be reliably matched 2. Positional accuracy: the difference between the geocoded location and the true location 3. Repeatability/robustness: agreement between results from repeated geocoding
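The three quality measures can each be reduced to a small calculation. The sketch below assumes projected coordinates in meters, `None` for an unmatched record, and a distance tolerance as the test of "agreement" between two geocoding runs; that tolerance-based definition and all function names are assumptions for illustration, not a definition taken from the slides.

```python
import math


def match_rate(results):
    """Completeness: percentage of records that received a location."""
    matched = sum(1 for r in results if r is not None)
    return 100.0 * matched / len(results)


def positional_errors(geocoded, reference):
    """Positional accuracy: distance from each matched point to its reference point."""
    return [math.dist(g, t) for g, t in zip(geocoded, reference) if g is not None]


def agreement_rate(run_a, run_b, tolerance_m):
    """Repeatability: percentage of records matched in both runs that agree
    to within tolerance_m (an assumed definition of agreement)."""
    pairs = [(a, b) for a, b in zip(run_a, run_b)
             if a is not None and b is not None]
    if not pairs:
        return 0.0
    agree = sum(1 for a, b in pairs if math.dist(a, b) <= tolerance_m)
    return 100.0 * agree / len(pairs)


# Made-up coordinates, used only to exercise the functions.
reference = [(0.0, 0.0), (100.0, 0.0), (200.0, 50.0), (300.0, 0.0)]
run_a = [(30.0, 40.0), None, (200.0, 60.0), (300.0, 0.0)]
run_b = [(30.0, 40.0), (100.0, 5.0), (260.0, 60.0), (300.0, 0.0)]

print(match_rate(run_a))                    # 75.0
print(positional_errors(run_a, reference))  # [50.0, 10.0, 0.0]
print(round(agreement_rate(run_a, run_b, 25.0), 1))  # 66.7
```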
What do we know? Completeness Surprisingly little research Most research makes ad-hoc decisions on what is acceptable Positional accuracy Errors can be substantial, but distribution not well characterized Larger error in rural areas Repeatability/robustness Limited research Effect of different geocoding algorithms appears limited Quality of street reference data the determining factor Potential trade-offs: e.g. increase match rate while sacrificing positional accuracy
Research Case Studies Data sets School children and schools in Orange County (Orlando) Sex offenders in Florida Banks and grocery stores in Florida Fishing and boat license holders in Florida Analyses Positional accuracy Repeatability Effects of positional error on spatial analysis Comparison of alternatives
Maple Street NE
Left range: 401-453 Right range: 400-460
441 Maple St NE
GoogleMaps: 441 Maple St NE
MapQuest: 441 Maple St NE
Error Measurement Tool Fishbone Tool
Results get messy sometimes
The Squeeze Effect
Typical Error Distribution [Figure: cumulative frequency (%) versus positional error (m), 0 to 1,000 m; N = 104,865 school children in Orange County, FL] Zandbergen, P.A. 2007. Influence of geocoding quality on environmental exposure assessment of children living near high traffic roads. BMC Public Health, 7:37.
Typical Error Estimates
Statistic          Value (m)
Min                ~0
Max                32,356
Mean               66
SD                 435
Median             41
90th percentile    100
95th percentile    137
99th percentile    373
N                  104,865 (count)
[Figure: log-normalized Q-Q plot, expected normal versus observed value]
Based on the 90th percentile, typical street geocoding does not meet the accuracy standards for a 1:100,000 scale map based on National Map Accuracy Standards!
Zandbergen, P.A. 2007. Influence of geocoding quality on environmental exposure assessment of children living near high traffic roads. BMC Public Health, 7:37.
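The comparison with the National Map Accuracy Standards can be checked with a short calculation. For maps at scales of 1:20,000 or smaller, the horizontal standard allows no more than 10 percent of tested points to be off by more than 1/50 inch at map scale (1/30 inch for larger scales). The sketch below converts that to ground distance and compares it with the 90th percentile error of about 100 m reported on this slide; the function name is chosen here for illustration.

```python
INCH_IN_METERS = 0.0254


def nmas_horizontal_tolerance_m(scale_denominator):
    """Ground distance that 90% of tested points must fall within under NMAS."""
    # 1/30 inch for scales larger than 1:20,000, 1/50 inch otherwise.
    map_tolerance_inch = 1.0 / 30.0 if scale_denominator < 20000 else 1.0 / 50.0
    return scale_denominator * map_tolerance_inch * INCH_IN_METERS


tolerance = nmas_horizontal_tolerance_m(100000)
p90_error_m = 100.0  # 90th percentile positional error from the slide

print(round(tolerance, 1))       # 50.8
print(p90_error_m <= tolerance)  # False
```

A 90th percentile error of about 100 m is roughly twice the 50.8 m allowed at 1:100,000, which is the basis for the statement on the slide.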
Quality of Street Reference Data
Use of Offsets Small offsets are commonly used Very minor effect on positional accuracy Optimum ~25 meters
Effect of Density and Parcel Size [Figure: cumulative number of locations (%) versus positional error (feet), 0 to 1,000 ft, for four classes: Urban Residential - Parcel < 1 acre; Urban Residential - Parcel > 1 acre; Rural Residential - Parcel < 1 acre; Rural Residential - Parcel > 1 acre] Zandbergen, P.A. Spatial variability in the positional accuracy of street geocoding. International Journal of Geographical Information Science (under review).
Repeatability How do geocoding results (match rate and positional accuracy) vary by: Geocoding algorithms Street reference data Quality of street reference data is the most significant factor Local street centerlines and/or E-911 data are typically superior How do commercial firms compare to results from GIS analysts? Again, quality of street reference data dominates, which for most commercial firms is good (unless they use TIGER 2000 data) Substantial variability among providers No relationship between cost and quality
Volusia County, FL 4 land use types 3 street datasets Zandbergen, P.A. Repeatability of street geocoding. Computers and Geosciences (under review).
Example: Sex Offender Residency Restrictions Zandbergen, P.A. and T.C. Hart. 2006. Reducing housing options for convicted sex offenders: Investigating the impact of residency restriction laws using GIS. Justice Research and Policy, 8:1-24.
Errors in Classification
                                 Street geocoding of schools and offenders
Parcel geocoding of schools
and offenders                    In      Out     Total
In                               55      103     158
Out                               5      460     465
Total                            60      563     623
The off-diagonal cells (5 and 103) are the false positives and false negatives.
Mapping Residency Restrictions: Parcel-level Zandbergen, P.A. and T.C. Hart. 2006. Reducing housing options for convicted sex offenders: Investigating the impact of residency restriction laws using GIS. Justice Research and Policy, 8:1-24.
Example: Schools and School Children in Proximity to High Traffic Roads
Error and Bias in Analysis Results Zandbergen, P.A. 2007. Influence of geocoding quality on environmental exposure assessment of children living near high traffic roads. BMC Public Health, 7:37.
Alternatives to Street Geocoding Street geocoding is dominant in the US Techniques used in other jurisdictions: Postal code (Canada) Address points (Australia and UK) Two alternatives for the US: Parcel boundaries Address points Both produce exceptional positional accuracy, but application currently limited by: Data availability / consistency Unproven match rates
Match rate results [Figure: match rates (%) for different databases for Bay County, FL, comparing Address, Parcel and Roads geocoding for Commercial Banks, Daycares, Elevators, Fishing Licenses, Grocery Stores and Sex Offenders] Zandbergen, P.A. A comparison of address point, parcel and street geocoding techniques. Computers, Environment and Urban Systems (under review).
Error Propagation Modeling Not very well developed, for something as simple and common as street geocoding Monte-Carlo simulation of variability in: Match rate Positional accuracy Determine effects of error/bias on: Clustering Proximity analysis Neighborhood assignment Contribute to standards and procedures for geocoding In development.
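A minimal Monte-Carlo sketch of the idea, applied to a proximity (buffer) analysis. Each run drops records at random according to the match rate, displaces the remaining points by a random positional error, and counts how often a point lands on the wrong side of the buffer. The log-normal error model, the `sigma_log` value and all names are assumptions made for illustration; this is not the model described as in development on the slide.

```python
import math
import random


def simulate_buffer_errors(points, target, buffer_m, median_error_m,
                           sigma_log, match_rate, n_runs=1000, seed=0):
    """Estimate false positive/negative rates for a buffer test under geocoding error."""
    rng = random.Random(seed)
    false_pos = false_neg = matched = 0
    for _ in range(n_runs):
        for p in points:
            if rng.random() >= match_rate:
                continue  # record not geocoded in this run
            # Assumed error model: log-normal distance, uniform random direction.
            error = rng.lognormvariate(math.log(median_error_m), sigma_log)
            angle = rng.uniform(0.0, 2.0 * math.pi)
            q = (p[0] + error * math.cos(angle), p[1] + error * math.sin(angle))
            truly_in = math.dist(p, target) <= buffer_m
            seen_in = math.dist(q, target) <= buffer_m
            matched += 1
            false_pos += int(seen_in and not truly_in)
            false_neg += int(truly_in and not seen_in)
    if matched == 0:
        return {"false_positive_rate": 0.0, "false_negative_rate": 0.0}
    return {"false_positive_rate": false_pos / matched,
            "false_negative_rate": false_neg / matched}


# Made-up residences at known distances from a point of interest at the origin.
homes = [(60.0, 0.0), (90.0, 0.0), (120.0, 0.0), (400.0, 0.0)]
rates = simulate_buffer_errors(homes, (0.0, 0.0), buffer_m=100.0,
                               median_error_m=41.0, sigma_log=1.0,
                               match_rate=0.9)
print(rates)
```

The same loop structure extends to clustering or neighborhood assignment by replacing the buffer test with the analysis of interest and comparing its result on the true and perturbed points.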
Geospatial Privacy Collection of individual level data is growing rapidly in public and private sector Analysis at the individual level is very attractive for researchers (overcomes MAUP and ecological fallacy issues) Spatial identifiers have not received the same level of concern as individual identifiers (like name) As geocoding becomes easier, cheaper and more accurate, so does the ability to reverse geocode spatial data Geospatial privacy has been recognized in the literature, but limited formal requirements or guidelines exist that recognize the increasing availability and capabilities of geospatial tools
Reverse Geocoding
Geographic Masking Traditional approach to protect privacy is through geographic masking, typically a random perturbation Masking may alter data in undesirable ways; the effects of masking on spatial analysis have not received a lot of attention How much masking is needed to adequately protect geospatial privacy is poorly understood
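Random perturbation masking can be sketched as moving each point a random distance in a random direction. The version below draws the displacement uniformly over a ring between a minimum and maximum shift, so that a point is never left in place when `min_shift` is greater than zero. The function name, the ring-shaped variant and the distances used are assumptions for illustration; the slide does not specify a particular masking method or how much displacement is adequate.

```python
import math
import random


def mask_point(x, y, max_shift, rng, min_shift=0.0):
    """Displace a point by a random distance and direction (projected units)."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    # Square root of a uniform draw on the squared radii spreads points
    # evenly over the area of the ring rather than clustering near the center.
    radius = math.sqrt(rng.uniform(min_shift ** 2, max_shift ** 2))
    return x + radius * math.cos(angle), y + radius * math.sin(angle)


rng = random.Random(42)
original = (500000.0, 3150000.0)  # made-up projected coordinates in meters
masked = mask_point(original[0], original[1], max_shift=200.0,
                    rng=rng, min_shift=50.0)
shift = math.dist(original, masked)
print(50.0 <= round(shift, 6) <= 200.0)  # True
```

Running an analysis on both the original and the masked points is a direct way to measure how much the masking itself changes the result.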
Tracking Technologies Old-fashioned radio telemetry Traditional GPS GPS-enabled cellphones Hybrid GPS / Wi-Fi Widespread adoption of (real-time) tracking is imminent
GPS-enabled Cellphones
Real-time Tracking
Google Earth Plug-ins
Real-time Online Tracking
Future Developments Address point geocoding will grow in importance and become the standard in the US Individual-level data will become a more widespread unit of analysis Tracking technologies will start to make a major impact Geospatial privacy protection needs to catch up to current trends
Elements of my research agenda Ongoing: Geocoding quality (completeness, positional accuracy, repeatability) Error propagation modeling Reverse geocoding and geographic masking Reliability of tracking technologies (i.e. cellphones vs. Wi-Fi vs. GPS) and implications for geocoding Planned for: Spatial-temporal geocoding
Spatial-temporal Geocoding Where people spend their time, not just where their mailing address is. Short-term: Where people live, work, shop etc. How people spend their day/week Long-term: Migration over years/decades How people spend their life Applications: transportation, emergency management, urban planning, environmental health How do we do this technically, practically and ethically?