Automating the Detection of Personally Identifiable Information (PII) in Japanese-American WWII Incarceration Camps

Richard Marciano, William Underwood, Mohammad Hanaee, Connor Mullane, Aakanksha Singh, Zayden Tethong

Computational Archival Science Workshop #3, IEEE International Conference on Big Data 2018
December 12, 2018, Westin Hotel, Seattle, Washington

The records of the War Relocation Authority (WRA), Record Group 210 (1941-47), held at the National Archives in Washington, D.C. and Maryland, comprise over 100 series, including motion picture films, drawings of incarceration centers, photos, maps, correspondence, yearbooks, and rosters. Series 51 and 52 have immense value for survivors of the camps, their families, and historians, yet they are still not accessible. Series 51, the Internal Security Case Reports from 1942 to 1946, comprises narrative reports prepared by camp investigators, police officers, and directors of internal security. The reports relate cases of alleged disorderly conduct, rioting, seditious behavior, and similar incidents at each of the 10 camps. They give detailed information on the names and camp addresses of the persons involved, the time and place of the alleged incident, an account of what happened, and a statement of action taken by the investigating officer.

Objective: Infuse Computational Thinking into Archival Science

A. Creating Data

The increasingly computational nature of working with data in archival science underscores the importance of developing computational thinking practices in the classroom. Part of the challenge is teaching students that answers are drawn from the data available. In many cases archivists use computational tools to generate data at scales that would otherwise be impossible.

WRA Form 26 register (Japanese-American Internee Data File, NARA AAD)

Last Name  First Name  Birth Year  Original State  Gender  Birth Place      Family No  Individual No  File Number  Assembly Center
ABE        FRANK       1910        CA              M       CA               24067      24067A         208156       None
ABE        FRANK       1940        CA              M       CA               24067      24067          201888       None
ABE        FRANK       1905        CA              M       Honolulu county  8605       08605A         950783       Fresno
ABE        FRANK       1913        CA              M       Oregon           18050      18050B         805536       Santa Anita
ABE        FRANK       1881        CA              M       Sakhalin         34424      34424C         207865       None

Final Accountability Rosters (FAR), Box 8 -- #269

#: 96
Last: ABE
First: Tomo Frank
Family No: 18050
Sex: M
Birth: 2-1-13
Citizenship: R
Alien #: -
Entry: Granada, CO
Entry Date: 9-18-43
Preevacuation Addr: Los Angeles, CA
Type of Final Departure: Dept. of Justice Internment
Date of Final Departure: 6-24-45
Destination of Final Departure: Santa Fe, NM

B. Manipulating Data

Computational tools make it possible to efficiently and reliably manipulate large and complex archival holdings. Data manipulation includes sorting, filtering, cleaning, normalizing, and joining disparate datasets.
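As a minimal sketch of normalizing and joining, the following Python links a FAR entry to its WRA Form 26 row using the sample values shown in this deck. The helper names (family_key, birth_year, join_far_to_form26), the dictionary layout, and the century rule for two-digit years are assumptions made for illustration, not the project's actual code.

```python
# Sample rows transcribed from the Form 26 and FAR excerpts in this deck.
FORM26 = [
    {"last": "ABE", "first": "FRANK", "birth_year": 1910, "individual_no": "24067A"},
    {"last": "ABE", "first": "FRANK", "birth_year": 1905, "individual_no": "08605A"},
    {"last": "ABE", "first": "FRANK", "birth_year": 1913, "individual_no": "18050B"},
]
FAR = [
    {"last": "ABE", "first": "Tomo Frank", "family_no": "18050", "birth": "2-1-13"},
]


def family_key(number):
    """Normalize a family or individual number: keep digits, drop leading zeros."""
    digits = "".join(ch for ch in str(number) if ch.isdigit())
    return digits.lstrip("0")


def birth_year(far_birth):
    """Convert a FAR 'M-D-YY' date to a four-digit year.

    Assumption: the rosters date from the 1940s, so YY <= 46 is read as 19YY
    and anything larger as 18YY.
    """
    yy = int(far_birth.split("-")[-1])
    return 1900 + yy if yy <= 46 else 1800 + yy


def join_far_to_form26(far_rows, form26_rows):
    """Pair each FAR row with Form 26 rows sharing family number, last name, and birth year."""
    index = {}
    for row in form26_rows:
        index.setdefault(family_key(row["individual_no"]), []).append(row)

    matches = []
    for far in far_rows:
        for candidate in index.get(family_key(far["family_no"]), []):
            same_name = candidate["last"].upper() == far["last"].upper()
            same_year = candidate["birth_year"] == birth_year(far["birth"])
            if same_name and same_year:
                matches.append((far, candidate))
    return matches


if __name__ == "__main__":
    for far, form26 in join_far_to_form26(FAR, FORM26):
        print(far["first"], "->", form26["individual_no"])
```

Joining on a normalized key rather than on the name alone matters here because the register contains several people recorded as ABE, FRANK.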

C. Analyzing Data

There are many strategies that can be employed when analyzing data for use in an archival context, including looking for patterns or anomalies, defining rules to categorize data, and identifying trends and correlations. We used named-entity recognition (NER) software, the open-source GATE, to extract metadata from the incident cards. GATE works by pattern matching through recognition rules, which are often refined through iterative tuning. For example, a rule for recognizing a person's name would be based on a last name, followed by a comma, followed by a Japanese first name, followed by an Anglo first name in parentheses. As we process additional cards, we note other styles of names, so the pattern is generalized to account for stylistic variations. If the pattern is made robust enough, it will eventually work on all of the instances of names.

GATE, General Architecture for Text Engineering, https://gate.ac.uk/
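The name rule described above can be approximated with a regular expression. This is a Python sketch and not a GATE rule; the pattern, the function name find_names, and the sample sentences are illustrative assumptions. The parenthesized Anglo first name is made optional to show one step of generalizing the rule.

```python
import re

# Last name, comma, first name, then an optional Anglo first name in parentheses.
NAME_RULE = re.compile(
    r"(?P<last>[A-Z][A-Za-z'-]+),\s*"
    r"(?P<first>[A-Z][a-z]+)"
    r"(?:\s*\((?P<anglo>[A-Z][a-z]+)\))?"
)


def find_names(text):
    """Return one dict per matched name; 'anglo' is None when no parenthesized name follows."""
    return [match.groupdict() for match in NAME_RULE.finditer(text)]


if __name__ == "__main__":
    # Illustrative sentence, not a quotation from an incident card.
    print(find_names("Report filed on ABE, Tomo (Frank) at the mess hall."))
```

A rule this simple will also match other capitalized word pairs separated by a comma, which is one reason the rules need iterative tuning against real cards.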

D. Visualizing Data

Communicating results is an essential component of understanding archival data, and computational tools can greatly facilitate that process. Tools include both conventional visualizations such as graphs and charts, as well as dynamic, interactive displays.

Figure: Box 8, WRA Form 26, FAR, Tule Lake

E. Designing Computational Models

The ability to create, refine, and use models of phenomena is a central practice. Models can include flowcharts and diagrams. Part of taking advantage of computational power is designing new models that can be run on a computational device. There are many reasons that might motivate designing a computational model, including wanting to better understand a phenomenon under investigation or to test a hypothesis. Students will be able to define the components of the model, describe how they interact, and decide what data the model will produce.

F. Constructing Computational Models

An important practice is the ability to create new or extend existing computational models. This requires being able to encode the model features in a way that a computer can interpret.

G. Computer Programming

This practice enables students to explore archival problems using computational problem-solving practices such as programming, algorithm development, and creating computational abstractions. The ability to encode instructions in such a way that a computer can execute them is a powerful skill for investigating archival problems. Programs can be as short as ten-line Python scripts.

H. Developing Modular Computational Solutions

When working toward a specific archival outcome, there are often a number of steps or components involved in the process; these steps, in turn, can be broken down in a variety of ways that affect how easily they can be reused, repurposed, and debugged. Developing solutions in a modular way makes it easier to construct them incrementally, test components independently, and increase the likelihood that components will be useful for future problems. We make use of abstraction and functional programming through modular components such as PII_DateCheck(), FORM26_lookup(), and FAR_lookup(). This allows for reusable chunks of code that can be tested locally. The larger program is the composition of these modules, which makes it both more readable and more maintainable.
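The composition of these modules might look like the sketch below. Only the three function names come from the slides; their signatures, the in-memory tables, the review_person wrapper, and the date rule inside PII_DateCheck are hypothetical placeholders, since the deck does not specify them.

```python
# Hypothetical in-memory tables keyed by (last name, family number),
# filled with values from the sample records shown earlier in this deck.
FORM26 = {("ABE", 18050): {"birth_year": 1913, "birth_place": "Oregon"}}
FAR = {("ABE", 18050): {"departure_date": "6-24-45", "destination": "Santa Fe, NM"}}


def FORM26_lookup(last_name, family_no):
    """Return the Form 26 record for a person, or None if there is no match."""
    return FORM26.get((last_name.upper(), family_no))


def FAR_lookup(last_name, family_no):
    """Return the Final Accountability Roster record, or None if there is no match."""
    return FAR.get((last_name.upper(), family_no))


def PII_DateCheck(birth_year, as_of_year, restriction_years):
    """Placeholder policy: restricted while the person's age is under the threshold.

    The real rule is not given in the slides; the caller supplies the threshold.
    """
    return as_of_year - birth_year < restriction_years


def review_person(last_name, family_no, as_of_year, restriction_years):
    """Compose the three modules into one review step for a name found on a card."""
    form26 = FORM26_lookup(last_name, family_no)
    if form26 is None:
        return {"found": False}
    return {
        "found": True,
        "restricted": PII_DateCheck(form26["birth_year"], as_of_year, restriction_years),
        "far": FAR_lookup(last_name, family_no),
    }


if __name__ == "__main__":
    print(review_person("Abe", 18050, as_of_year=2018, restriction_years=75))
```

Because each function takes plain values and returns plain values, each can be tested on its own before the composed step is run against a full box of cards.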

I. Troubleshooting and Debugging

Troubleshooting broadly refers to the process of figuring out why something is not working or behaving as expected. There are a number of strategies one can employ while troubleshooting a problem, including clearly identifying the issue, systematically testing the system to isolate the source of the error, and reproducing the problem so that potential solutions can be tested reliably. To facilitate group debugging, we use an interactive, server-based, shared version of Jupyter Notebook. The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

Jupyter Notebook Documentation, see: https://media.readthedocs.org/pdf/jupyter-notebook/latest/jupyter-notebook.pdf
Project Jupyter, see: http://jupyter.org/

PII Algorithm Performance

Computational Archival Science (Mapping Computational Thinking to Archival Practices). Movement of People (Global Journeys, Local Communities), Presidential Transparency (Managing White House Correspondents Association - WHCA - Pool Reports), Revealing Untold Stories (Japanese-American WWII Camps), Racial Zoning (Mapping Inequality), Community Displacement (Human Face of Big Data), Cyberinfrastructure for Digital Curation (Big Data Archives, Enhancing User Access to Big Data Archives),