What is the UC Irvine Data Science Initiative?
  Padhraic Smyth
  Director of the UCI Data Science Initiative
  Department of Computer Science
  University of California, Irvine
A Revolution in the Technology of Data
  Graphic from Ray Kurzweil, singularity.com
  Padhraic Smyth, UC Irvine Data Science Initiative, Oct 2014
A Paradigm Shift in Data Analysis
  Technological drivers
    Sensors (cheap and ubiquitous)
    Data storage (everyone is a data owner)
    Computational power
    Data analysis methods
    Data access via the internet
  Convergence... tremendous demand for data analysis
    In the sciences, in medicine, in engineering, in business, and more
  In the past, this demand was met by statistics, but...
    Does not scale up: too few statisticians
    Even statisticians need computers
Human Computers
  The historical meaning of the term "computer": one who computes (i.e., a person)
  Statisticians have been using computers for centuries
    e.g., Karl Pearson's team of human computers around 1900
  ...but human computers could only work on relatively small problems
Statistics and Computing
  Post World War II
    Increasing use of computing to solve algorithmic aspects of statistical analyses
  1960s
    Development of statistical computing and exploratory data analysis
  1980s
    Computing allowed statisticians to explore more flexible models
    Increase in use of non-parametric techniques and simulation methods
  1990s
    Development of machine learning: very flexible predictive modeling techniques
  Today
    Distinctions between statistics and computer science often blurred
    Interface is a very active and exciting research area
"The role of theory in research is being dangerously ignored in favor of purely empirical work that proceeds without so much as a hypothesis."
  Public Opinion Quarterly, 1972
From http://www.tylervigen.com/
How Much Climate Data Do We Actually Have?
  Image from http://cimss.ssec.wisc.edu/
  Image from ipcc.ch
What is Data Science?
  Algorithms, Databases, Systems + Statistics, Optimization, Machine Learning + Decisions, Policy, Privacy
  Data, Models and Predictions, Humans
What is Data Science?
  Algorithms, Databases, Systems + Statistics, Optimization, Machine Learning + Decisions, Policy, Privacy
  Applications of Data Analysis
    Science, Medicine, Engineering, Humanities, Business
Challenges in Data Science
  Statistical
    Data is often observational, not a random sample
    How can we better combine theory and data-driven modeling?
  Algorithmic
    Scalability: how to work with an N^3 algorithm when N = 100 million?
    Can the models be updated automatically?
  Human and Socio-Cultural
    Balancing privacy and data usage
    Better tools to allow data users to see inside the "black box"
  Educational
    Shortage of people with skills in both statistics and computer science
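To give a sense of scale for the N^3 question above, here is a back-of-the-envelope sketch in Python. The rate of 10^9 basic operations per second and the helper name `runtime_seconds` are assumptions made for illustration; they are not figures from the talk.

```python
# Rough runtime estimate for an algorithm that needs n**exponent basic operations.
# ASSUMPTION: the machine performs 1e9 basic operations per second (hypothetical).
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def runtime_seconds(n, exponent, ops_per_second=1e9):
    """Estimated seconds to run n**exponent operations at the assumed rate."""
    return n ** exponent / ops_per_second


n = 100_000_000  # N = 100 million, as on the slide

linear_seconds = runtime_seconds(n, 1)                   # algorithm linear in N
cubic_years = runtime_seconds(n, 3) / SECONDS_PER_YEAR   # N^3 algorithm

print(f"linear in N: {linear_seconds:.1f} seconds")
print(f"cubic in N:  {cubic_years:.2e} years")
```

Under these assumptions a linear-time pass finishes in a fraction of a second, while the cubic algorithm would run for tens of millions of years, which is why an N^3 method cannot simply be applied as-is at this scale.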
The Shape of Data
  d = number of variables
  N = number of samples
The Shape of Data over Time
  Pre 1990: small N, d
  Post 1990: large d
  Post 2005: large N
  Large N is good (many algorithms are linear in N)...
  ...but large d is a challenge, both statistically and computationally
Computer Systems 101
  CPU, RAM, Disk
How Far Away is the Data?
  Random Access Times (from the CPU)
    RAM: 10^-8 seconds
    Disk: 10^-3 seconds
How Far Away is the Data?
  Effective Distances (from the CPU)
    RAM: 1 meter
    Disk: 100 kilometers
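The two slides above are linked by simple proportionality: if a RAM access is pictured as a 1 meter trip, a disk access that takes 10^5 times longer corresponds to about 100 kilometers. The sketch below reproduces that arithmetic; the function name `effective_distance_m` is hypothetical, and the access times are the order-of-magnitude values from the slide.

```python
# Order-of-magnitude random access times taken from the slide.
RAM_ACCESS_SECONDS = 1e-8
DISK_ACCESS_SECONDS = 1e-3


def effective_distance_m(access_seconds,
                         reference_seconds=RAM_ACCESS_SECONDS,
                         reference_m=1.0):
    """Scale a reference distance (RAM = 1 meter) by the ratio of access times."""
    return reference_m * access_seconds / reference_seconds


slowdown = DISK_ACCESS_SECONDS / RAM_ACCESS_SECONDS       # disk vs. RAM
disk_km = effective_distance_m(DISK_ACCESS_SECONDS) / 1000

print(f"disk is about {slowdown:.0e} times slower than RAM")
print(f"effective distance to disk: about {disk_km:.0f} km")
```

The analogy is a reminder that algorithms which repeatedly make random accesses to disk pay a very large penalty compared with those that keep their working data in RAM.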
Legislation on Restrictions on Data Collection
What is the UCI Data Science Initiative?
  Campus-wide initiative supported by the UCI Office of Academic Initiatives
  Started July 1, funded for 3 years
  One of 5 currently-funded initiatives
  From the Office of Academic Initiatives:
    "Initiatives are expected to encompass projects that involve research, undergraduate and graduate education programs, outreach to public and private organizations, and philanthropic potential... the intent of this program is to support initiatives with a wider range of activities than can be accommodated by ORUs or campus and school research centers... teams of engaged faculty are expected to develop new programs of interschool excellence."
From the Initiative Website
  "Data Science encompasses the full spectrum of theories and methods that use data to understand and make predictions about the world around us. This includes fundamental research on statistical methods, prediction algorithms, data management techniques, and policy issues; as well as a broad range of domain-specific data-driven research problems in the sciences, engineering, humanities, education, medicine, and business."
  Website: http://datascience.uci.edu
Faculty Advisory Board
  Anima Anandkumar (Engineering)
  Jessica Utts, Pierre Baldi, Geof Bowker, Michael Carey (Information and Computer Sciences)
  Peter Krapp (Humanities)
  Jim Randerson (Physical Sciences)
  Suzanne Sandmeyer (Medicine)
  Vijay Gurbaxani (Business)
  Mark Warschauer (Education)
  Kevin Thornton (Biological Sciences)
  George Tita (Social Ecology)
  Mark Steyvers (Social Sciences)
  Tom Boellstoff
Current Activities of the Data Science Initiative
  Mini-Symposia
  Short-Courses
  Data Science Undergraduate Major Proposal
    Emphasis on statistics and computer science
    A minor option planned for later
  Other
    Support for large proposal efforts
    Clearing house for data-science-related information via mailing list and Website
    More activities being planned
Data Science Website: http://datascience.uci.edu
Short Courses
  Fall 2014
    Introduction to R (Nov 14th)
    Introduction to Linux and HPC (Nov 17th)
    Analyzing Data in Linux (Nov 18th)
    Application deadline: November 1st (on Web site)
  2015
    Advanced R (early 2015)
    Exploratory Data Analysis in Python
    Software Carpentry (early 2015)
    Big Data Management
    Predictive Modeling in Python
    + Repeats of R and Linux/HPC courses
Research MiniSymposia
  ½ day to full day MiniSymposia on topics of relevance to data science
    Interdisciplinary in nature
    Roughly once per quarter
  Currently in planning mode for 2015
    Statistical and Algorithmic Modeling of Social Network Data
      Social Science, Statistics, Computer Science
      March 2015 (tentative)
    Data Analysis in Education: Learning Analytics
      Education, Machine Learning
      Spring 2015 (tentative)
    Business and Informatics
      Spring 2015 (tentative)
    Additional topics under consideration
How can you Participate?
  Visit the Website, sign up for the mailing list, attend events
  Participate in short-courses
    Suggest new short courses
    Volunteer to teach a short course
  Propose/organize a mini-symposium
    Organize and chair a half/full day mini-symposium
    Emphasis on emerging research topics, data-centric, inter-disciplinary
    To start? Contact the faculty advisory board member in your school (names on Website)
  Grant proposals in the Data Science area
    Want to write a joint proposal with a data science angle and need collaborators?
    Let us know and we will try to make things happen
  If you have an idea... let us know
What the Initiative does not have...
  Open faculty positions
  Hardware
  Direct funding for research projects
  Consulting support for projects
Today's Kickoff Event
  Algorithms, Databases, Systems + Statistics, Optimization, Machine Learning + Decisions, Policy, Privacy
  Applications of Data Analysis
    Science, Medicine, Engineering, Humanities, Business
Session 1: Foundations
  Algorithms, Databases, Systems + Statistics, Optimization, Machine Learning + Decisions, Policy, Privacy
  Michael Carey, Hal Stern, Pierre Baldi, Geof Bowker
  Applications of Data Analysis
    Science, Medicine, Engineering, Humanities, Business
Session 2: Applications
  Algorithms, Databases, Systems + Statistics, Optimization, Machine Learning + Decisions, Policy, Privacy
  Short Applications Talks on:
    Text Analysis, Particle Physics, Engineering, Genomics, the Environment, and Business