University of Massachusetts Medical School escholarship@umms University of Massachusetts and New England Area Librarian e-science Symposium 2015 e-science Symposium Apr 9th, 9:15 AM Data Science @ the NIH: What is Happening & What is Coming: A Conversation Philip E. Bourne National Institutes of Health Follow this and additional works at: http://escholarship.umassmed.edu/escience_symposium Part of the Public Health Commons, Scholarly Communication Commons, and the Science and Technology Policy Commons This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 License. Bourne, Philip E., "Data Science @ the NIH: What is Happening & What is Coming: A Conversation" (2015). University of Massachusetts and New England Area Librarian e-science Symposium. 5. http://escholarship.umassmed.edu/escience_symposium/2015/program/5 This material is brought to you by escholarship@umms. It has been accepted for inclusion in University of Massachusetts and New England Area Librarian e-science Symposium by an authorized administrator of escholarship@umms. For more information, please contact Lisa.Palmer@umassmed.edu.
Data Science @ the NIH What is Happening & What is Coming A Conversation Philip E. Bourne, PhD, FACMI Associate Director for Data Science National Institutes of Health March 31, 2015
This is Just the Beginning. Evidence: Google car, 3D printers, Waze, robotics, sensors. From: The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies by Erik Brynjolfsson & Andrew McAfee
Addressing the Opportunities & Challenges. 6/12, 2/14, 3/14. Findings: sharing data & software through catalogs; support methods and applications development; need more training; need campus-wide IT strategy; hire CSIO; continued support throughout the lifecycle
What Have I Learned Thus Far? Working with the full spectrum of data types is challenging. Xtreme translation. A large ship takes a long time to stop and turn, but a great crew helps. That crew is in places I was not used to. There are complexities, based on the funding ecosystem, that I could not have imagined going in.
What Have I Learned Thus Far? Policies take time when they come from the bottom up, but they may work, i.e., be implemented and adhered to. Policies from the top down can be problematic. What you set out to do is often not what you end up doing, e.g., precision medicine, NLM rethink. This is just the beginning.
Additional NIH Disruptors
Additional NIH Disruptors. NLM 15 Year Vision Statement (Personal View): The National Biomedical Knowledge Portal is a community resource dedicated to the preservation of, and free, open and collaborative access to, the world's biomedical research output. http://www.plexsci.com/assets/files/1/images/knowledge-management-.jpg
Early Findings. Bad news: we do not yet have a data sustainability plan; global policies define the why but not the how; we do not know how all the data we currently have are used; we need to ramp up training programs in data science. Good news: genuine willingness across the ICs to address the problems; global communities are emerging and should be nurtured; we are beginning to define & quantify the issues, e.g., reproducibility; disruptors accelerate change.
Office of Biomedical Data Science Mission Statement To foster an open ecosystem that enables biomedical research to be conducted as a digital enterprise that enhances health, lengthens life and reduces illness and disability & to train the next generation of data scientists Goals expanded from recommendations in the June 2012 DIWG and BRWWG reports.
The BD2K Program is Central to the Mission. [Figure: BD2K funding by fiscal year, FY14 through FY21; vertical axis from $0 to $120,000,000; planned funds in black, available funds in green]
Elements of The Digital Enterprise: Communities; Policies; Infrastructure; Virtuous Research Cycle. Intersection: Sustainability; Efficiency; Collaboration; Training
Consider an example
Big Data: The study involved MRI images & GWAS data from over 30,000 people. Collaboration: Data came from many different sites affiliated with the ENIGMA consortium. Methods: To homogenize data from different sites, the group designed standardized protocols for image analysis, quality assessment, genetic imputation, and association. Found five novel genetic variants. Results provided insight into the variability of brain development, and may be applied to the study of neuropsychiatric dysfunction.
Policies: Now & Forthcoming. Data Sharing: genomic data sharing announced; data sharing plans on all research awards; data sharing plan enforcement; machine readable plan; repository requirements to include grant numbers. http://www.nih.gov/news/health/aug2014/od-27.htm
Policies - Forthcoming. Data Citation. Goal: legitimize data as a form of scholarship. Process: machine readable standard for data citation (done); endorsement of data citation for inclusion in NIH biosketch, grants, reports, etc.; example formats for human readable data citations; slowly work into NLM/NCBI workflow. dbGaP in the cloud (soon!)
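To make the pairing of a machine readable citation with a human readable format concrete, here is a minimal Python sketch. It is not the NIH or NLM standard mentioned on the slide: the field names (creators, year, title, repository, identifier), the rendering order, and the sample record are all hypothetical and chosen only for illustration.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class DataCitation:
    """A hypothetical data citation record (field names are assumptions)."""
    creators: list
    year: int
    title: str
    repository: str
    identifier: str

    def to_json(self):
        # Machine readable form: a JSON object with stable key order.
        return json.dumps(asdict(self), sort_keys=True)

    def to_text(self):
        # Human readable form: one possible layout, not an endorsed format.
        authors = "; ".join(self.creators)
        return (f"{authors} ({self.year}). {self.title}. "
                f"{self.repository}. {self.identifier}")


# Made-up record used only to exercise the two renderings.
citation = DataCitation(
    creators=["Doe, J.", "Roe, R."],
    year=2015,
    title="Example imaging dataset",
    repository="Example Repository",
    identifier="example:0001",
)
print(citation.to_json())
print(citation.to_text())
```

Both renderings come from the same record, so a citation listed in a biosketch or report could in principle be traced back to the machine readable entry through its identifier.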
Infrastructure - The Commons. [Diagram: the Commons linking Labs, BD2K Centers, Software, Standards, and the DDICC]
The Commons: Digital Objects (with UIDs); Search (indexed metadata); Computing Platform. Vivien Bonazzi, George Komatsoulis
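The first two layers of this slide, digital objects with UIDs and search over indexed metadata, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Commons design: the class names DigitalObject and CommonsIndex, the use of random UUIDs as UIDs, the keyword index, and the sample objects are all hypothetical.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """A dataset, tool, or workflow (hypothetical structure)."""
    name: str
    kind: str
    metadata: dict = field(default_factory=dict)
    # Assumption: a random UUID stands in for whatever UID scheme is adopted.
    uid: str = field(default_factory=lambda: str(uuid.uuid4()))


class CommonsIndex:
    """Toy registry: resolve objects by UID, search them by metadata terms."""

    def __init__(self):
        self._objects = {}
        self._index = defaultdict(set)  # lowercase term -> set of UIDs

    @staticmethod
    def _terms(obj):
        values = [obj.name, obj.kind, *map(str, obj.metadata.values())]
        return {word.lower() for value in values for word in value.split()}

    def register(self, obj):
        self._objects[obj.uid] = obj
        for term in self._terms(obj):
            self._index[term].add(obj.uid)
        return obj.uid

    def resolve(self, uid):
        return self._objects[uid]

    def search(self, query):
        # An object matches only if every query term is in its indexed terms.
        terms = [t.lower() for t in query.split()]
        if not terms:
            return []
        uids = set.intersection(*(self._index.get(t, set()) for t in terms))
        return sorted((self._objects[u] for u in uids), key=lambda o: o.name)


# Made-up objects used only to exercise the registry.
commons = CommonsIndex()
uid = commons.register(
    DigitalObject("Example imaging dataset", "dataset", {"modality": "MRI"}))
commons.register(
    DigitalObject("Example imputation pipeline", "software", {"domain": "genetics"}))
hits = commons.search("mri dataset")
print([obj.name for obj in hits])
```

The UID gives each object a stable handle independent of where it is stored, while the index answers queries from metadata alone, which is what would let a computing platform locate objects without scanning their contents.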
The Commons: Compute Platforms. The Commons Conceptual Framework. Public cloud platforms: Google, AWS (Amazon), Microsoft (Azure), IBM, other? Supercomputing (HPC) platforms: traditionally low access by NIH. Other platforms? In-house compute solutions; private clouds, HPC; pharma; The Broad; Bionimbus
[George Komatsoulis] The Commons: Business Model
NIH philip.bourne@nih.gov Turning Discovery Into Health