A Probabilistic Geocoding System based on a National Address File

Peter Christen, Tim Churches and Alan Willmore
Data Mining Group, Australian National University
Centre for Epidemiology and Research, New South Wales Department of Health
Contact: peter.christen@anu.edu.au
Project web page: http://datamining.anu.edu.au/linkage.html
Funded by the NSW Department of Health
Peter Christen, December 2004

Outline
  Geocoding
  Geocoded National Address File (G-NAF)
  Febrl geocoding system
  Address cleaning and standardisation
  Processing G-NAF
  Geocode matching engine
  First results and geocoding examples
  Future work

Geocoding
The process of assigning geographical coordinates (longitude and latitude) to addresses
It is estimated that 80% to 90% of governmental and business data contain address information (US Federal Geographic Data Committee)
Useful in many application areas
  GIS, spatial data mining
  Health, epidemiology
  Business, census, taxation
Various commercial systems available (e.g. MapInfo, www.geocode.com)

Geocoding techniques
Street centreline based (many commercial systems)
Property parcel centre based (our approach)
A recent study found substantial differences, especially in rural areas (Cayo and Talbot; Int. Journal of Health Geographics, 2003)

Geocoded National Address File
Need for a national address file recognised in 1990
32 million source addresses from 13 organisations
5-phase cleaning and integration process
Resulting database consists of 22 files or tables
Hierarchical model (separate geocodes for each)
  Address sites
  Streets
  Localities (towns and suburbs)
Aliases and multiple locations possible

Simplified G-NAF data model
[Figure: entity-relationship diagram of the following tables]
  Address Detail
  Address Alias
  Address Site
  Address Site Geocode
  Street
  Street Locality Alias
  Street Locality Geocode
  Locality
  Locality Alias
  Locality Geocode

G-NAF file characteristics (New South Wales data only)

G-NAF data file            Number of records / attributes
ADDRESS_ALIAS              289,788 / 6
ADDRESS_DETAIL             4,45,365 / 28
ADDRESS_SITE               4,096,507 / 6
ADDRESS_SITE_GEOCODE       3,336,778 / 2
LOCALITY                   5,07 / 7
LOCALITY_ALIAS             700 / 5
LOCALITY_GEOCODE           4,978 /
STREET                     58,083 / 6
STREET_LOCALITY_ALIAS      5,584 / 6
STREET_LOCALITY_GEOCODE    28,609 / 3

Febrl geocoding system
[Figure: system architecture with three modules]
  Process G-NAF module: G-NAF data files, AustPost data and GIS data pass through Febrl clean and standardise and Build inverted indexes to produce inverted index data files
  Geocoding module: a user data file passes through Febrl clean and standardise and the Febrl geocode match engine to produce a geocoded user data file
  Web server module: Web interface input data passes through the same steps to produce geocoded Web data
Febrl (Freely extensible biomedical record linkage) is open source, object oriented and written in Python
Experimental platform for rapid prototyping of new and improved linkage algorithms
Modules for data cleaning and standardisation, data linkage, deduplication, and geocoding

Address cleaning and standardisation
Real world data is often dirty (missing values, different coding formats, typographical errors, out-of-date data)
For accurate geocode matching, we want clean data in well defined fields
Febrl address cleaning is a three step process
1. Input data is cleaned (make lower case, remove certain characters, correct misspellings and abbreviations)
2. Split input into a list of words and numbers, then tag them (using rules and user definable look-up tables)
3. Give tag lists to a probabilistic hidden Markov model (which assigns tags to output fields)
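Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration, not Febrl's actual code: the correction table, the locality table, the tag table and the function names are all hypothetical, and the real look-up tables are user definable and far larger. The tag codes (NU, UN, WT, LN, PC) are the ones used in the HMM standardisation example.

```python
import re

# Hypothetical look-up tables for illustration only.
CORRECTIONS = {"st": "street", "rd": "road", "sydeny": "sydney"}
LOCALITIES = {("north", "sydney"): "north_sydney"}
TAGS = {"street": "WT", "road": "WT", "north_sydney": "LN"}


def clean(raw):
    """Step 1: lower-case, strip punctuation, correct known misspellings."""
    text = re.sub(r"[^a-z0-9 ]", " ", raw.lower())
    return [CORRECTIONS.get(word, word) for word in text.split()]


def merge_localities(words):
    """Join known two-word locality names into a single token."""
    out, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in LOCALITIES:
            out.append(LOCALITIES[pair])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out


def tag(words):
    """Step 2: tag each token using the look-up table and simple rules."""
    tags = []
    for word in words:
        if word in TAGS:
            tags.append(TAGS[word])
        elif re.fullmatch(r"\d{4}", word):
            tags.append("PC")  # assumed rule: four digits is a postcode
        elif word.isdigit():
            tags.append("NU")  # any other number
        else:
            tags.append("UN")  # unknown word
    return tags


words = merge_localities(clean("73 Miller St, NORTH SYDENY 2060"))
print(words)
print(tag(words))
```

The four-digit postcode rule is a simplification: a four-digit street number would be tagged PC as well, which is one reason the tag list is handed to a probabilistic model in step 3 rather than mapped to output fields directly.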

HMM standardisation example
[Figure: hidden Markov model with states Start, Wayfare Number, Wayfare Name, Wayfare Type, Locality Name, Territory, Postal Code and End, connected by transition probabilities]
Raw input: 73 Miller St, NORTH SYDENY 2060
Cleaned into: 73 miller street north sydney 2060
Word and tag lists:
  [73, miller, street, north_sydney, 2060]
  [NU, UN, WT, LN, PC]
Example path through HMM:
  Start -> Wayfare Number (NU) -> Wayfare Name (UN) -> Wayfare Type (WT) -> Locality Name (LN) -> Postal Code (PC) -> End
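How an HMM turns a tag list into output fields can be illustrated with a small Viterbi decoder. The sketch below is not Febrl's implementation, and every probability in it is invented for the example (the values in the slide's figure cannot be mapped to individual transitions from this transcript). It only shows the technique: find the most probable state path for the observed tags, then pair each word with its state.

```python
# Illustrative model only: all probabilities below are made up.
STATES = ["wayfare_number", "wayfare_name", "wayfare_type",
          "locality_name", "postal_code"]
START = {"wayfare_number": 0.9, "wayfare_name": 0.1}
TRANS = {
    "wayfare_number": {"wayfare_name": 0.95, "locality_name": 0.05},
    "wayfare_name": {"wayfare_type": 0.95, "locality_name": 0.05},
    "wayfare_type": {"locality_name": 0.95, "postal_code": 0.05},
    "locality_name": {"postal_code": 0.8, "locality_name": 0.2},
    "postal_code": {},
}
EMIT = {
    "wayfare_number": {"NU": 0.9, "UN": 0.1},
    "wayfare_name": {"UN": 0.8, "LN": 0.2},
    "wayfare_type": {"WT": 1.0},
    "locality_name": {"LN": 0.8, "UN": 0.2},
    "postal_code": {"PC": 0.9, "NU": 0.1},
}


def viterbi(tags):
    """Return (probability, state path) of the most likely path for tags."""
    # best[s] holds the best (probability, path) ending in state s.
    best = {s: (START.get(s, 0.0) * EMIT[s].get(tags[0], 0.0), [s])
            for s in STATES}
    for observed in tags[1:]:
        step = {}
        for s in STATES:
            prob, path = max(
                ((best[p][0] * TRANS[p].get(s, 0.0), best[p][1])
                 for p in STATES),
                key=lambda candidate: candidate[0],
            )
            step[s] = (prob * EMIT[s].get(observed, 0.0), path + [s])
        best = step
    return max(best.values(), key=lambda candidate: candidate[0])


words = ["73", "miller", "street", "north_sydney", "2060"]
tags = ["NU", "UN", "WT", "LN", "PC"]
probability, path = viterbi(tags)
fields = dict(zip(path, words))
print(fields)
```

With these made-up numbers the decoder follows the same path as the slide's example, so `fields` maps `wayfare_number` to `73`, `wayfare_name` to `miller`, and so on.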

Processing G-NAF
Two step process
1. Do cleaning and standardisation as discussed (to make G-NAF data similar to input data)
2. Build inverted indices (sets, implemented as keyed hash tables with field values as keys)
   Example (postcode): 2000 :(603099,656024)
Within the geocode matching engine, intersections are used to find matching records
Inverted indices are built for 23 G-NAF fields
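A minimal sketch of step 2, assuming the standardised G-NAF records are available as dictionaries keyed by record identifier. The record identifiers, field names and values are hypothetical; only the idea (one hash table per field, mapping each field value to the set of records containing it, then intersecting sets) comes from the slide.

```python
from collections import defaultdict

# Hypothetical standardised records: record id -> field values.
records = {
    101: {"street_name": "miller", "street_type": "street", "postcode": "2060"},
    102: {"street_name": "miller", "street_type": "road", "postcode": "2060"},
    103: {"street_name": "pitt", "street_type": "street", "postcode": "2000"},
}


def build_inverted_indices(records):
    """Build one inverted index per field: value -> set of record ids."""
    indices = defaultdict(lambda: defaultdict(set))
    for record_id, fields in records.items():
        for field, value in fields.items():
            indices[field][value].add(record_id)
    return indices


indices = build_inverted_indices(records)

# Matching is a set intersection across the fields of the input address.
matches = (indices["street_name"]["miller"]
           & indices["street_type"]["street"]
           & indices["postcode"]["2060"])
print(matches)  # {101}
```

Set intersection keeps each look-up cheap because only the records sharing a field value are touched, not the whole file.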

Additional data files
Use external Australia Post postcode and suburb look-up tables for correcting and imputing (e.g. if a suburb has a unique postcode this value can be imputed if missing, or corrected if wrong)
Use boundary files for postcodes and suburbs to build neighbouring region lists
  Idea: People often record a neighbouring suburb or postcode if it has a higher perceived social status
  Create lists for direct and indirect neighbours (neighbouring levels 1 and 2)
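Both ideas are easy to sketch. The tables below are placeholders rather than real Australia Post or boundary data, and the function names are hypothetical. Level 0 is taken to mean the region itself, level 1 its direct neighbours and level 2 the neighbours of those neighbours.

```python
# Placeholder look-up table: suburb -> postcodes (illustrative values).
SUBURB_POSTCODES = {
    "north sydney": {"2060"},
    "twin town": {"1111", "2222"},  # ambiguous, so nothing can be imputed
}

# Placeholder adjacency, as might be derived from boundary files.
ADJACENT = {
    "2060": {"2061", "2065"},
    "2061": {"2060"},
    "2065": {"2060", "2064"},
    "2064": {"2065"},
}


def impute_postcode(suburb, postcode):
    """Impute or correct the postcode when the suburb has exactly one."""
    candidates = SUBURB_POSTCODES.get(suburb, set())
    if len(candidates) == 1:
        return next(iter(candidates))
    return postcode


def neighbours(region, level):
    """Return all regions within `level` steps of `region`."""
    seen = {region}
    frontier = {region}
    for _ in range(level):
        frontier = {n for r in frontier for n in ADJACENT.get(r, ())} - seen
        seen |= frontier
    return seen


print(impute_postcode("north sydney", None))
print(sorted(neighbours("2060", 1)))
print(sorted(neighbours("2060", 2)))
```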

Geocode matching engine
Rules based approach for exact or approximate matching
Start with address and street level matching set intersection
Intersect with locality matching set (start with neighbouring level 0, if no match increase to 1, finally 2)
Refine with postcode, unit, property matches
Return best possible match coordinates
Match status:
  Exact / average address
  Exact / many street
  Exact / many locality / no match
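The widening search over neighbouring levels might look roughly like the sketch below. It is a guess at the control flow, not the Febrl engine: the indices, record identifiers and neighbour lists are invented, the refinement by postcode, unit and property is left out, and the function returns matching record identifiers instead of coordinates. It also assumes that "average" and "many" both mean more than one candidate remained at that level.

```python
# Invented inverted indices (values are record ids).
NUMBER_INDEX = {"73": {1, 7}}
STREET_INDEX = {("miller", "street"): {1, 2, 3}}
LOCALITY_INDEX = {"north sydney": {1, 2}, "crows nest": {3, 7}}
# Invented neighbour lists: localities within levels 0, 1 and 2.
NEIGHBOUR_LEVELS = {
    "north sydney": [{"north sydney"}, {"north sydney", "crows nest"},
                     {"north sydney", "crows nest"}],
    "crows nest": [{"crows nest"}, {"crows nest", "north sydney"},
                   {"crows nest", "north sydney"}],
}


def locality_ids(locality, level):
    """Record ids in the locality or its neighbours up to `level`."""
    names = NEIGHBOUR_LEVELS.get(locality, [{locality}] * 3)[level]
    return set().union(*(LOCALITY_INDEX.get(n, set()) for n in names))


def match(number, street, locality):
    """Return (match status, sorted record ids) for a standardised address."""
    street_ids = STREET_INDEX.get(street, set())
    address_ids = street_ids & NUMBER_INDEX.get(number, set())
    # Try address level first, then street level, widening the locality.
    for name, ids, plural in (("address", address_ids, "average"),
                              ("street", street_ids, "many")):
        for level in (0, 1, 2):
            found = ids & locality_ids(locality, level)
            if found:
                kind = "exact" if len(found) == 1 else plural
                return f"{kind} {name}", sorted(found)
    found = locality_ids(locality, 0)
    if found:
        kind = "exact" if len(found) == 1 else "many"
        return f"{kind} locality", sorted(found)
    return "no match", []


print(match("73", ("miller", "street"), "north sydney"))
print(match("73", ("miller", "street"), "crows nest"))
print(match("99", ("miller", "street"), "north sydney"))
```

In the second call the address is not found in the recorded locality, but widening to neighbouring level 1 recovers the exact address, which is the case the neighbouring region lists are meant to handle.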

First results

Match status                   Number of records   Percentage
Exact address level match      7,288               72.87 %
Average address level match    213                 2.13 %
Exact street level match       1,290               12.90 %
Many street level match        154                 1.54 %
Exact locality level match     917                 9.17 %
Many locality level match      135                 1.35 %
No match                       3                   0.03 %

10,000 NSW Land and Property Information records
Average 43 milliseconds for geocoding one record on a 480 MHz UltraSPARC II

Geocoding examples
[Figure: map of geocoded addresses]
  Red dots: Febrl geocoding (G-NAF based)
  Blue dots: Street centreline based geocoding

Future work
  Improve probabilistic data cleaning and standardisation
  Improve performance (scalability and parallelism)
  Improve matching algorithm
  Improve user interface (currently simple Web demo)
  Provide feedback on G-NAF to improve data quality
  Develop privacy preserving geocoding