Language Standardization for Mortality Coding: A German Approach
Stefanie Weber, Orlando Özer


MEETING OF WHO COLLABORATING CENTRES FOR THE FAMILY OF INTERNATIONAL CLASSIFICATIONS
Tunis, Tunisia, 29 Oct. - 4 Nov. 2006

Abstract

In Germany, the 16 federal states (Länder) code their mortality data manually and separately, and there has long been a desire to harmonize the regional differences in coding. Even though training has been conducted regularly, only up to 70% of the data seem to be coded manually in the same way. The reasons for differences in assigning the underlying cause range from unequal levels of knowledge to differing interpretations of the coding instructions and of medical expressions.

Automated coding was considered a good way of reaching uniform coding results, but existing systems do not handle the German language. Developing a separate tool for Germany was considered too resource-intensive, and translating the American MMDS dictionary too error-prone. The development of IRIS was therefore watched closely, and Germany decided to try IRIS for mortality coding. IRIS not only offers automated coding from entered ICD-10 codes but also allows a dictionary and standardization tables for a specific language to be included. We therefore decided to adapt the German morbidity index to mortality needs and to create language standardization tables specific to the German mortality vocabulary.

The first tests showed that the vocabulary of the morbidity index (74,000 terms) does not suffice for automated coding. Analysis of terms from death certificates showed the great variety of text combinations and the creativity of German physicians with abbreviations and new terms. Standardization through regular expressions, as used in IRIS, was therefore intensified. Even though the results show the obstacles and limitations of standardization for a language as complicated as German, they are quite encouraging. In a relatively short period of time the recognition of single-line texts was raised from 30% to 70%. We are therefore quite confident that, for the implementation planned in Germany for 2008, the system will be very efficient and will harmonize the output data significantly.

This document is not issued to the general public, and all rights are reserved by the World Health Organization (WHO). The document may not be reviewed, abstracted, quoted, reproduced or translated, in part or in whole, without the prior written permission of WHO. No part of this document may be stored in a retrieval system or transmitted in any form or by any means - electronic, mechanical or other - without the prior written permission of WHO. The views expressed in documents by named authors are solely the responsibility of those authors.

WHO-FIC 2006

Content
Abstract
Introduction
Coding
First results
Progress in testing
Analysis
Discussion

Introduction

In many spoken languages, words change and are amended over the years. Grammatical inflections are used, and compound words can be altered by rearranging their components. New vocabulary is added to a language, whereas some old terms fall out of use. While this phenomenon is a positive aspect of a spoken language, it complicates electronic processing of the vocabulary. Much research has already been invested in this topic, and many approaches to natural language processing have been created. Most of these approaches use nearest-match algorithms or phrase interpretation. For the automated processing of the vocabulary on death certificates these approaches are insufficient: the entries are short, and altering only one letter can change the meaning of a word used to express mortality information.

Coding

As Germany would like to implement automated mortality coding and decided to use IRIS for this project, the tasks at hand were to create a dictionary and to find ways of standardizing the German vocabulary on death certificates. For morbidity, a dictionary of 74,000 entries already existed, with normalized permutations of each entry. This dictionary was included in Iris for preliminary tests. Additionally, Iris offers the possibility of standardizing entries through regular expressions (RegEx), a widely used technique for finding and matching defined strings. Because RegEx are a powerful tool for standardizing entries, we decided to use them sparingly to avoid too many mistakes in the first test.

First results

In the first tests we used the morbidity dictionary only and tested single-line certificates to avoid interference from too many factors in one test run. Only very few standardization steps were applied, for the special German characters such as ä, ö and ü. For the test we used about 250,000 one-line certificates, some of them with two or more diagnoses on a single line. The data were provided by local cancer registries in paper format and were typed in manually. The results were discouraging, with only about 30% of the one-line entries matched. An analysis of about 80,000 of these certificates was then conducted in order to create new entries for the dictionary and to generate RegEx for the standardization tables of Iris.
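The combination of RegEx standardization and exact dictionary lookup described above can be sketched roughly as follows. This is a minimal illustration, not the format of the Iris standardization tables, which this paper does not show: the rule list, the three dictionary entries and the function names are all assumptions made for the example. The lookup is deliberately an exact match after standardization, since a nearest-match approach is considered unsuitable for short certificate entries.

```python
import re

# Hypothetical standardization steps, applied in order (pattern, replacement).
# They are illustrative only and are not taken from the Iris tables.
STANDARDIZATION_STEPS = [
    (re.compile(r"ä"), "ae"),
    (re.compile(r"ö"), "oe"),
    (re.compile(r"ü"), "ue"),
    (re.compile(r"ß"), "ss"),
    (re.compile(r"[.,;]"), " "),  # punctuation becomes a space
    (re.compile(r"\s+"), " "),    # collapse runs of whitespace
]

# Hypothetical mini-dictionary: standardized term -> ICD-10 code.
DICTIONARY = {
    "herzinsuffizienz": "I50.9",
    "pneumonie": "J18.9",
    "lungenoedem": "J81",
}


def standardize(text):
    """Lower-case the entry and apply every standardization step in order."""
    text = text.lower()
    for pattern, replacement in STANDARDIZATION_STEPS:
        text = pattern.sub(replacement, text)
    return text.strip()


def lookup(text):
    """Return the code for an exact match after standardization, else None."""
    return DICTIONARY.get(standardize(text))


def recognition_rate(lines):
    """Share of single-line entries that find a dictionary match."""
    if not lines:
        return 0.0
    return sum(lookup(line) is not None for line in lines) / len(lines)
```

With these assumed rules, `lookup("Lungenödem.")` finds the entry for `lungenoedem`, while a truncated entry such as `Herzinsuff.` stays unrecognized until either a dictionary entry or a further standardization step covers it.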

Screenshot 1: Standardization table in Iris for the German language

Progress in testing

Over the following months various new tests were conducted, and the rate of automatic recognition of one-line certificates rose to 70% within a short period of time. To achieve this rise we added about 200 terms to the existing dictionary and applied about 1,800 standardization steps. Still, for complete certificates (with an average of 2.3 lines) this resulted in automated coding and assignment of the underlying cause (UC) for only 24%. Building on these results, we focused on a test deck of 44,000 certificates that included all completed lines of each certificate. Of these certificates, 25,207 had been manually pre-coded.
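One way to see why the certificate-level rate lags behind the line-level rate: a certificate can only be coded automatically if every one of its lines is recognized. The sketch below assumes, purely for illustration, that lines are recognized independently and with the same probability, and it plugs in the average of 2.3 lines rather than the real distribution of certificate lengths. Neither simplification comes from the tests reported here, and the function name is hypothetical.

```python
def all_lines_recognized(line_rate, lines_per_certificate):
    """Probability that every line of a certificate is recognized,
    assuming independent lines with an identical recognition rate."""
    return line_rate ** lines_per_certificate


# 70% single-line recognition, 2.3 lines per certificate on average
rough_ceiling = all_lines_recognized(0.70, 2.3)
print(f"{rough_ceiling:.0%}")  # prints 44%
```

The observed 24% lies well below this rough figure. That would be consistent with further losses when the underlying cause is selected, although the independence assumption itself may simply not hold.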

Graph 1: Number of certificates coded through Iris (vertical axis: number of certificates, 0 to 50,000; bars: all certificates; UC through Iris on all certificates; pre-coded certificates; UC through Iris on pre-coded certificates; match of manual UC and Iris UC)

25% of the manually pre-coded certificates could be coded automatically through Iris. For about 53% of these, the automated UC was identical to the manually assigned UC. This is only a little lower than what previous studies on coding variation among the 16 Länder show.

Analysis

The next step was to analyze the certificates where the Iris coding did not agree with the manual UC. Furthermore, the certificates that could not be coded automatically through Iris had to be checked. We have analyzed almost 1,000 certificates so far. We found that about 50% of the manually assigned UC were wrong. This has to be put into perspective: for the test we used only the lines of Part 1 and Part 2 of the certificate and left out the additional information on the certificate, such as the epicrisis given by the physician or the category of accident, which is a separate field on the German certificate. The Iris coding might therefore have been wrong as well, because we did not consider this additional information.

For about 48% of the certificates we did not have a corresponding dictionary entry for at least one expression. New texts for the dictionary resulting from the previous work had not been included in this test, as they have to be validated according to internal regulations before being entered into the dictionary. We expect, though, to have about 6,000 new terms ready for the next test and with them to lower this percentage significantly.

About 32% of the certificates contained at least one text that was misspelled, and as many as 44% contained at least one abbreviation. Only a small share of these could potentially be corrected through standardization; most of them will have to be corrected manually before automated coding.

Another 4.3% of the certificates could not be coded correctly because they contained external cause information. Here the problem lies to some extent within ACME, as it does not return two codes for these cases: the external cause code and the injury code itself. This problem was discussed at the last Iris meeting in Alexandria and will be taken care of in one of the next ACME versions.

Table 1: Results from analysis of approximately 1,000 certificates

Problem                                                  Certificates with this problem   Certificates analyzed   %
Manually assigned UC wrong                               494                              943                     52%
One or more expressions on certificate not in dictionary 455                              953                     48%
One or more expressions on certificate spelled wrong     305                              953                     32%
One or more expressions on certificate abbreviated       420                              953                     44%

Discussion

Using Iris with a language as complicated as German poses many challenges. Even with a pre-existing dictionary for morbidity, we could not avoid intensive work on language standardization and adaptation of the dictionary. Still, as the rise in recognition of single-line entries shows, results can be improved considerably with relatively few steps at the outset. Later, of course, once the standardization tables and the dictionary are elaborate and efficient, further gains in the automated assignment of the UC can only be achieved in smaller steps. The way the texts are entered into Iris in the first place plays a key role in the level of automatically assigned UC.
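As a quick arithmetic check, the percentages reported in Table 1 can be recomputed from the counts given there. The snippet below does only that; the shortened row labels and the function name are chosen for the example.

```python
# (problem, certificates with this problem, certificates analyzed), from Table 1
TABLE_1 = [
    ("Manually assigned UC wrong", 494, 943),
    ("Expression not in dictionary", 455, 953),
    ("Expression spelled wrong", 305, 953),
    ("Expression abbreviated", 420, 953),
]


def percentage(count, total):
    """Share of analyzed certificates, rounded to a whole percent."""
    return round(100 * count / total)


for problem, count, total in TABLE_1:
    print(f"{problem}: {percentage(count, total)}%")
```

The recomputed values are 52%, 48%, 32% and 44%, matching the table. Note that the first row is based on 943 certificates and the other rows on 953.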
As we had estimated, the tests showed that at least 44% of German death certificates cannot be recognized correctly because of abbreviations and spelling mistakes. Standardization can absorb this only to a certain degree, and it carries the risk of misinterpreting very short abbreviations. Problem areas of ACME that cannot be handled automatically (external causes, maternal mortality, etc.) are passed on to Iris and lead to the rejection of a certain number of cases.
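The risk with very short abbreviations can be illustrated with a guard that expands an abbreviation only when it has exactly one known expansion and otherwise flags the entry for manual review. This is a sketch under assumptions: the abbreviation table, its expansions and the function name are hypothetical and do not come from the Iris standardization tables.

```python
import re

# Hypothetical abbreviation table; the expansions are illustrative only.
ABBREVIATIONS = {
    "copd": ["chronisch obstruktive lungenerkrankung"],
    "khk": ["koronare herzkrankheit"],
    "hi": ["herzinsuffizienz", "herzinfarkt"],  # assumed to be ambiguous
}


def expand_abbreviations(text):
    """Expand unambiguous abbreviations word by word.

    Returns (expanded_text, needs_manual_review). Ambiguous abbreviations
    are left untouched and set the review flag instead of being guessed.
    """
    needs_review = False

    def replace(match):
        nonlocal needs_review
        token = match.group(0)
        expansions = ABBREVIATIONS.get(token.lower())
        if expansions is None:
            return token  # not a known abbreviation
        if len(expansions) > 1:
            needs_review = True  # do not guess between expansions
            return token
        return expansions[0]

    expanded = re.sub(r"\w+", replace, text)
    return expanded, needs_review
```

Leaving ambiguous entries untouched trades a lower automatic rate for fewer wrong codes, which matches the expectation that most abbreviated entries need manual correction before automated coding.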

Even if we reach only 30% automated assignment of the UC in the first test year in Germany, this will still represent 180,000 certificates per year. With this prospect, more time could be spent on difficult cases, and the standardization of mortality coding for international statistics will be taken a step forward. With Iris in use in test areas in Germany starting next year and with the ongoing tests at DIMDI, we hope to obtain further important information on terms missing from the dictionary and on new entries for the standardization tables of Iris.

Dr. Stefanie Weber
WHO-FIC Collaborating Centre for the German Language
German Institute for Medical Documentation and Information (DIMDI)
Waisenhausgasse 36-38A
50676 Köln
Germany
Email: stefanie.weber@dimdi.de
Phone: +49 221 4724 485
Fax: +49 221 4724 444
http://www.dimdi.de