Vanderbilt CQS: Next Steps for Next-Generation Success

Yu Shyr, PhD
03/15/2018
SCLC Research Consortium Meeting

Shakespeare's Word Counts
1985: Gary Taylor found in the Bodleian Library a poem of 429 words that he attributed to Shakespeare:
Shall I die, shall I fly
Lovers' barbs and deceits
Sorrow breeding?...
Controversy: Was this poem written by Shakespeare?
Statistician predicts 6.97 new words in the poem.
Actually 9 new words: admirations, besots, exiles, inflection, joying, scanty, speck, tormentor, explain.
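The 6.97 figure can be reproduced with a short calculation. The sketch below assumes the prediction comes from the Efron-Thisted estimator, truncated to its first two terms, and it uses counts that are not on the slide: a known canon of 884,647 words in which 14,376 word types occur exactly once and 4,343 exactly twice. These inputs are assumptions taken from Efron and Thisted's published Shakespeare counts, and the function name is illustrative only.

```python
# Sketch of an Efron-Thisted style prediction of new word types.
# ASSUMPTION: the counts below are not on the slide; they are the published
# Shakespeare canon counts and should be checked before reuse.
CANON_WORDS = 884_647                        # total words in the known canon (assumed)
TYPES_SEEN_X_TIMES = {1: 14_376, 2: 4_343}   # n_x: word types seen exactly x times (assumed)


def expected_new_words(sample_words, canon_words=CANON_WORDS, counts=TYPES_SEEN_X_TIMES):
    """Alternating series sum_x (-1)^(x+1) * n_x * t^x, with t = sample size / canon size."""
    t = sample_words / canon_words
    return sum((-1) ** (x + 1) * n_x * t ** x for x, n_x in counts.items())


prediction = expected_new_words(429)
print(f"Predicted new words in a 429-word poem: {prediction:.2f}")
```

Under these assumptions the result is about 6.97, close to the 9 new words actually observed.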

Big data marks the beginning of a major transformation
By 2020, big data is expected to reach 40 zettabytes (1 ZB = 1 million PB).
The number of connected devices has already surpassed the number of humans on the planet.
By 2020, the number of connected devices is projected to exceed 50 billion.
Big data is dramatically transforming all sectors of society: cloud computing, machine learning, AI, digital humanities, personalized medicine, intelligent business models, etc.

Big data marks the beginning of a major transformation
UK Biobank - Genotyping and Imputation Data Release, May 2017.
This document provides further information for the release of genotyping and imputation data for all 500,000 participants in UK Biobank.


STATISTICS, AI AND MACHINE LEARNING: A GRAMMAR FOR FUTURE WORLD DEVELOPMENT

Skin cancer, the most common human malignancy, is primarily diagnosed visually, beginning with an initial clinical screening and followed potentially by dermoscopic analysis, a biopsy and histopathological examination. We train a convolutional neural network (CNN) using a dataset of 129,450 clinical images consisting of 2,032 different diseases. We test its performance against 21 board-certified dermatologists on biopsy-proven clinical images. The authors demonstrated an artificial intelligence capable of classifying skin cancer with a level of competence comparable to dermatologists.

This should have been a warning that the big data were over-fitting the small number of cases, a standard concern in data analysis.
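The over-fitting concern can be illustrated with simulated data. The sketch below is not drawn from any study mentioned here: it generates pure noise, searches 2,000 candidate predictors for the one that best matches an outcome on only 10 cases, and then checks that predictor on 500 held-out cases. All sizes, the seed, and the variable names are assumptions chosen for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)                        # fixed seed so the run is repeatable
N_TRAIN, N_TEST, N_PREDICTORS = 10, 500, 2000  # illustrative sizes (assumed)

# The outcome and every predictor are independent noise: no real signal exists.
outcome = [rng.gauss(0, 1) for _ in range(N_TRAIN + N_TEST)]

best_train_r, best_test_r = 0.0, 0.0
for _ in range(N_PREDICTORS):
    predictor = [rng.gauss(0, 1) for _ in range(N_TRAIN + N_TEST)]
    r_train = pearson(predictor[:N_TRAIN], outcome[:N_TRAIN])
    if abs(r_train) > abs(best_train_r):
        # Keep the predictor that looks best on the small training set.
        best_train_r = r_train
        best_test_r = pearson(predictor[N_TRAIN:], outcome[N_TRAIN:])

print(f"Best correlation on {N_TRAIN} cases: {best_train_r:+.2f}")
print(f"Same predictor on {N_TEST} new cases: {best_test_r:+.2f}")
```

With many candidates and few cases, the winning predictor looks strong on the training cases even though it carries no information, and its correlation on new cases falls back toward zero.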

Simpson's Paradox
Pilot On-time Arrival Rate

Time of Day   John                    Mary
Day           90 out of 100 (90%)     19 out of 20 (95%)
Night         10 out of 20 (50%)      75 out of 100 (75%)
Overall       100 out of 120 (83%)    94 out of 120 (78%)
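The reversal in the on-time figures above can be checked directly. The following sketch uses only the counts given for John and Mary; the variable and function names are illustrative.

```python
from fractions import Fraction

# (on-time arrivals, total flights) per pilot and time of day, from the slide.
arrivals = {
    "John": {"Day": (90, 100), "Night": (10, 20)},
    "Mary": {"Day": (19, 20), "Night": (75, 100)},
}


def rate(pilot, time_of_day):
    """On-time rate for one pilot within one time-of-day stratum."""
    on_time, flights = arrivals[pilot][time_of_day]
    return Fraction(on_time, flights)


def overall_rate(pilot):
    """On-time rate pooled over day and night flights."""
    on_time = sum(o for o, _ in arrivals[pilot].values())
    flights = sum(f for _, f in arrivals[pilot].values())
    return Fraction(on_time, flights)


for pilot in arrivals:
    print(
        f"{pilot}: day {float(rate(pilot, 'Day')):.0%}, "
        f"night {float(rate(pilot, 'Night')):.0%}, "
        f"overall {float(overall_rate(pilot)):.0%}"
    )
```

Mary has the better rate in both strata, yet John has the better pooled rate, because most of John's flights are day flights (where on-time rates are high) while most of Mary's are night flights.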

Unique position and capability of the CQS
Founded in 2011 to coordinate and integrate quantitative sciences across the institution.
CQS membership: 48 faculty from 23 departments/divisions campus-wide.
Between 2011 and 2017, CQS published >2,650 papers in peer-reviewed journals.
Members currently collaborate on 173 federal grants and 39 foundation awards.
Since 2013, CQS has funded 16 pilot project awards (12 VUMC/4 VU); since 2015, CQS has funded 20 travel awards (18 VU/2 VUMC).
Since 2014, the CQS Summer Institute has trained >800 students, staff and faculty (Dean of the Graduate School, Dr. Mark Wallace, and Interdisciplinary Graduate Program (IGP) Director, Dr. James Patton).

The project has five major areas of interest:
To continue development of novel methodologies, algorithms, and tools for processing, analyzing, and interpreting big data
To expand big data to additional disciplines
Vanderbilt Data Science Vision Workgroup (Co-Chairs: Shyr & Berlind): The foundation for a Vanderbilt Institute of Data Science
To enhance the CQS state-of-the-art informatics platform, Synalytics, with a focus on complementary functionality to Vanderbilt's flagship EDC system, REDCap
To broaden support for successful CQS programs, in particular, our Summer Institute

Aim Two: Expansion of big data analysis to additional fields of research
Currently, CQS has four program areas: biostatistics, bioinformatics, systems biology, and data coordination.
Our experience and capacity position the CQS to extend big data analysis to additional disciplines: chemical, civil, computer, electrical and mechanical engineering, business economics, digital humanities and the social sciences.
Big data analysis expansion can accelerate discovery across all divisions of the university.

Center for Quantitative Sciences
Why Synalytics?
Handling of complex longitudinal data
Inter-variable validations
Complex branching logic
Field calculations and auto-population
Audit trail/study
Data monitoring workflow including auto-queries
Job scheduling for system actions, notifications, alerts on a timer
State machines for differing presentation of forms and features per user type
Statistical/bioinformatic data analysis and reporting
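To make two of the listed features concrete, the sketch below shows what branching logic and inter-variable validation with auto-queries might look like in general terms. It is entirely hypothetical: the field names, functions, and rules are invented for illustration and do not reflect the actual Synalytics or REDCap interfaces.

```python
from datetime import date

# HYPOTHETICAL sketch: no name here comes from Synalytics or REDCap.


def visible_fields(record):
    """Branching logic: ask for pack-years only when the participant smokes."""
    fields = ["smoker", "enrollment_date", "last_visit_date"]
    if record.get("smoker") == "yes":
        fields.append("pack_years")
    return fields


def auto_queries(record):
    """Inter-variable validation: return query messages for inconsistent fields."""
    queries = []
    enrolled = record.get("enrollment_date")
    visit = record.get("last_visit_date")
    if enrolled and visit and visit < enrolled:
        queries.append("last_visit_date precedes enrollment_date")
    if record.get("pack_years") is not None and "pack_years" not in visible_fields(record):
        queries.append("pack_years entered for a non-smoker")
    return queries


record = {
    "smoker": "no",
    "pack_years": 20,
    "enrollment_date": date(2018, 3, 15),
    "last_visit_date": date(2018, 1, 10),
}
for message in auto_queries(record):
    print("QUERY:", message)
```

In this sketch each rule looks at more than one field at a time, which is the distinction between inter-variable validation and a simple per-field range check.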

END

Questions