The (In)ability to Triangulate in Data Driven Healthcare Research

Philip Resnik
University of Maryland
resnik@umd.edu
SBS Decadal Survey - Workshop on Culture, Language, and Behavior
National Academies of Sciences, Engineering, and Medicine
October 11, 2017

Modeling political attitudes using behavior

Legislator a votes 'Yea' on bill b with probability

p(v_{a,b} = Yea) = Φ(u_a x_b + y_b)

where
u_a is the one-dimensional ideal point (political attitude) of legislator a, on a liberal-to-conservative axis,
x_b is the polarity of bill b,
y_b is the popularity of bill b.

Martin and Quinn, 2002; Bafumi et al., 2005; Gerrish and Blei, 2011
Figure adapted from Viet-An Nguyen
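The vote probability on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the cited authors' implementation: it assumes Φ is the standard normal CDF (a probit link), it assumes negative ideal points mean "liberal" and positive mean "conservative", and all numeric values are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF; an assumed reading of the slide's Φ."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def p_yea(u_a: float, x_b: float, y_b: float) -> float:
    """p(v_{a,b} = Yea) = Φ(u_a * x_b + y_b).

    u_a: one-dimensional ideal point of legislator a
    x_b: polarity of bill b
    y_b: popularity of bill b
    """
    return normal_cdf(u_a * x_b + y_b)


if __name__ == "__main__":
    # Hypothetical legislators and bill, for illustration only.
    liberal, conservative = -1.0, 1.0
    x_b, y_b = 2.0, 0.0  # positive polarity, neutral popularity
    print("liberal:     ", round(p_yea(liberal, x_b, y_b), 3))
    print("conservative:", round(p_yea(conservative, x_b, y_b), 3))
```

With positive polarity the conservative legislator is more likely to vote 'Yea' than the liberal one. Raising y_b raises the 'Yea' probability for everyone, which is why the slide calls it the bill's popularity.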

Triangulating: behavior and language

[Figure: bill text, speeches, and votes]

Nguyen et al., 2015
Extending Gerrish and Blei 2012, Lauderdale and Clark 2014
Figure adapted from Viet-An Nguyen

Practical concerns: blame, gov't shutdown
Party principle concerns: debt, taxes, entitlements

[Figure labels: Establishment, Tea Party]

Nguyen et al., Tea Party in the House: A Hierarchical Ideal Point Topic Model and Its Application to Republican Legislators in the 112th Congress. Association for Computational Linguistics, Beijing, July 2015.

Triangulating: behavior and language

[Figure: language; non-political actors; votes; real-time responses; social media sentiment]

Daniel Argyle, Philip Resnik, and Vlad Eidelman, Using Ideal Point Models to Characterize Political Reactions in Non-Political Actors, Seventh Annual Conference on New Directions in Analyzing Text as Data, Boston, Oct 14-15, 2016

Adapted from http://multimedia.3m.com/mws/media/988566o/paths-to-success-cac-nlp-white-paper.pdf.

A sampling of NLP research datasets

Blog Authorship Corpus: consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. 681,288 posts and over 140 million words. (298 MB)

Cornell Movie Dialog Corpus: contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts: 220,579 conversational exchanges between 10,292 pairs of movie characters, 617 movies. (9.5 MB)

Enron Email Data: consists of 1,227,255 emails with 493,384 attachments covering 151 custodians. (210 GB)

Hansards text chunks of Canadian Parliament: 1.3 million pairs of aligned text chunks (sentences or smaller fragments) from the official records (Hansards) of the 36th Canadian Parliament. (82 MB)

Reddit Submission Corpus: all publicly available Reddit submissions from January 2006 to August 31, 2015. (42 GB)

Twitter Sentiment140: 1.6 million Tweets related to brands/keywords. (77 MB)

Yahoo! Answers Comprehensive Questions and Answers: Yahoo! Answers corpus as of 10/25/2007. Contains 4,483,032 questions and their answers. (3.6 GB)

https://github.com/niderhoff/nlp-datasets

A sampling of healthcare NLP research datasets

SemEval-2017: Clinical TempEval. 400 manually de-identified clinical notes and pathology reports from cancer patients at the Mayo Clinic.

CLEF eHealth 2016. Suominen H, Zhou L, Hanlen L, Ferraro G. Benchmarking Clinical Speech Recognition and Information Extraction: New Data, Methods, and Evaluations. JMIR Med Inform 2015;3(2):e19. Synthetic dataset of 101 handover records.

MIMIC-III, a freely accessible critical care database. Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. Scientific Data (2016). ~2M free text notes from ~40K critical care patients at Beth Israel Deaconess Medical Center.

CLPsych 2015. Triage of posts from a mental health forum; 65K posts.

De Choudhury, Munmun, et al. Discovering Shifts to Suicidal Ideation from Mental Health Content in Social Media. CHI (2016). ~80K posts from mental health related forums on Reddit.

CLPsych 2016. Triage of posts from a mental health peer support forum; 65K posts.

Not clinical ground truth

What's the problem?

HIPAA balkanizes research
Language data is hard to fully de-identify
EHRs create pressure to avoid language
It's easy to just work on something else

NAACL 2016 keynote
Adapted from https://people.csail.mit.edu/regina/talks/cnlp.pdf

What's the problem?

HIPAA balkanizes research
    Researchers can't fix HIPAA
Language data is hard to fully de-identify
    High accuracy automation isn't enough
EHRs create pressure to avoid language
    NLP is helping, but not fast enough
It's easy to just work on something else
    We need to find a different way
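The claim that high-accuracy automation isn't enough for de-identification can be made concrete with simple arithmetic. The sketch below is illustrative and is not from the talk: the recall figure and mention counts are hypothetical, and it assumes the system misses each identifier independently of the others.

```python
def expected_leaks(recall: float, n_mentions: int) -> float:
    """Expected number of identifying mentions left in the text."""
    return (1.0 - recall) * n_mentions


def p_any_leak(recall: float, n_mentions: int) -> float:
    """Probability that at least one identifying mention is missed,
    assuming independent misses (a simplifying assumption)."""
    return 1.0 - recall ** n_mentions


if __name__ == "__main__":
    recall = 0.99  # hypothetical per-mention recall of a de-identifier
    for n in (10, 100, 1000):  # hypothetical counts of identifying mentions
        print(n, round(expected_leaks(recall, n), 2), round(p_any_leak(recall, n), 4))
```

Under these assumptions, even a system that catches 99% of identifiers is very likely to leave at least one behind once a corpus holds many identifying mentions.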

Philip Resnik and Deanna Kelly, Development of Computational Modeling to Identify Symptom Changes in Schizophrenia and Depression, UMB/UMCP MPower seed grant
ourdatahelps.org
umd.ourdatahelps.org

Recruiting
Consent infrastructure
Non-PHI clinical data
Social media
Collection/anonymization infrastructure
Computing environment

Progress

UPenn Linguistic Data Consortium (LDC)
    Framingham Heart Study project
Health Natural Language Processing Center (hnlp)
    LDC-like repository/dissemination of healthcare data
NIH All of Us (Precision Medicine Initiative)
    EHR data may be sent directly by the participant's health care provider or sent by the participant to the program through Sync for Science.
    The initial data types to be included are demographics, visits, diagnoses, procedures, medications, laboratory tests, and vital signs, but may be expanded to all parts of the EHR, including health care provider notes.

A chicken-egg problem:
    The unstructured data problem requires more investment
    We need progress with unstructured data to justify it

Take-aways

Healthcare is a national security issue
Language data is a hugely valuable resource for triangulation
We have a lot of catching up to do
More needs to be done faster