From Morphological Box to Multidimensional Datascapes S. George Center for Data-Driven Discovery and Dept. of Astronomy, Caltech AstroInformatics 2016, Sorrento, Italy, October 2016
Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it... Dan Ariely
What is Fundamentally New Here? The information volumes and rates grow exponentially: most data will never be seen by humans. A great increase in the data information content: data-driven vs. hypothesis-driven science. A great increase in the information complexity: there are patterns in the data that cannot be comprehended by humans directly.
From Morphological Box to the Observable Parameter Spaces. Fritz Zwicky. Zwicky's concept: explore all possible combinations of the relevant parameters in a given problem; these correspond to the individual cells in a Morphological Box. Example: Zwicky's discovery of the compact dwarfs.
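A Morphological Box can be read as the Cartesian product of the allowed values along each parameter axis. The sketch below is a minimal illustration under that reading; the axis names and values are hypothetical and do not come from Zwicky's work or from this talk.

```python
from itertools import product

# Hypothetical parameter axes, chosen only to illustrate the idea.
axes = {
    "luminosity": ["low", "high"],
    "size": ["compact", "extended"],
    "color": ["blue", "red"],
}


def morphological_box(axes):
    """Return every cell of the box: one dict per combination of axis values."""
    names = list(axes)
    return [
        dict(zip(names, combo))
        for combo in product(*(axes[name] for name in names))
    ]


cells = morphological_box(axes)
print(len(cells))  # 2 * 2 * 2 = 8 cells
```

Each cell is a candidate class of object; cells with no known members are where a systematic search would look first.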
Systematic Exploration of the Observable Parameter Space (OPS). Its axes are defined by the observable quantities. Every observation, surveys included, carves out a hypervolume in the OPS. Technology opens new domains of the OPS, leading to new discoveries.
Measurements Parameter Space: colors of stars and quasars (SDSS). Physical Parameter Space: Fundamental Plane of hot stellar systems (plot labels: E, dSph, GC). Dimensionality ≤ the number of observed quantities. Both are populated by objects or events.
Measurements Parameter Space: color-magnitude diagram. Physical Parameter Space: H-R diagram (via theory + other data). Not filled uniformly: clustering indicates different families. Clustering + dimensionality reduction = correlations. High dimensionality poses analysis challenges.
Exploration of Parameter Spaces is the Central Problem of Data Science. Clustering, classification, correlation and outlier searches: Machine Learning is the key methodology. Challenges: algorithm and data model choices; data incompleteness; feature selection and dimensionality reduction; uncertainty estimation; scalability; visualization; etc. These are especially acute with the data dimensionality.
Pattern or Structure (Correlations, Clustering, Outliers, etc.) Discovery in High-Dimensional Parameter Spaces. D >> 3 parameter space hypercube. High-D data cloud: mostly noise, of an arbitrary distribution. But in some corner of some sub-D projection of this data space, there is something that is not noise.
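One simple way to look for structure hidden in a low-dimensional projection is to scan every pair of axes and score each 2-D projection. The sketch below uses the absolute Pearson correlation as the score, on synthetic Gaussian noise with one planted correlation. The data, the choice of score, and the function names are all assumptions made for illustration, not the method used in the talk.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def strongest_projection(columns):
    """Scan all 2-D projections; return (|r|, i, j) for the most correlated pair."""
    return max(
        (abs(pearson(columns[i], columns[j])), i, j)
        for i, j in combinations(range(len(columns)), 2)
    )


# Synthetic data cloud: 6 axes of pure noise (an assumption for this demo).
rng = random.Random(0)
n_points, n_dims = 500, 6
columns = [[rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_dims)]

# Plant a correlation between axes 2 and 4; every other pair stays noise.
columns[4] = [v + 0.3 * rng.gauss(0, 1) for v in columns[2]]

r, i, j = strongest_projection(columns)
print(i, j, round(r, 2))
```

The number of pairs grows as D(D-1)/2, and real structure may be nonlinear or live in more than two dimensions, so a brute-force scan like this only scratches the surface for D >> 3.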
From Light Curves to Feature Vectors. We compute ~ 70 parameters and statistical measures for each light curve: amplitudes, moments, periodicity, etc. This turns heterogeneous light curves into homogeneous feature vectors in the parameter space. Apply a variety of automated classification methods. [Figure: example light curve, Mag vs. MJD (x10^4)]
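The step from a light curve to a feature vector can be sketched as follows. This is a minimal illustration that computes only six simple statistics (amplitude and a few moments), not the ~70 measures mentioned above, and it omits periodicity. The function name and the sample magnitudes are hypothetical.

```python
from math import sqrt


def light_curve_features(mags):
    """Map a list of magnitudes of any length to a fixed-length feature dict."""
    n = len(mags)
    mean = sum(mags) / n
    std = sqrt(sum((m - mean) ** 2 for m in mags) / n)
    # Standardized third and fourth moments; defined as 0 for a flat curve.
    skew = sum((m - mean) ** 3 for m in mags) / n / std**3 if std > 0 else 0.0
    kurt = sum((m - mean) ** 4 for m in mags) / n / std**4 - 3.0 if std > 0 else 0.0
    ordered = sorted(mags)
    mid = n // 2
    median = ordered[mid] if n % 2 else 0.5 * (ordered[mid - 1] + ordered[mid])
    return {
        "amplitude": max(mags) - min(mags),
        "mean": mean,
        "median": median,
        "std": std,
        "skew": skew,
        "kurtosis": kurt,
    }


# Two made-up light curves with different numbers of observations.
a = [16.2, 16.5, 16.9, 16.4]
b = [17.0, 17.1, 16.8, 16.9, 17.2, 16.7]
fa, fb = light_curve_features(a), light_curve_features(b)
print(list(fa) == list(fb))  # same features in the same order for both curves
```

Because every light curve maps to the same set of features regardless of its sampling, the resulting vectors can be fed to any standard classifier.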
Optimizing Feature Selection. Rank features in the order of classification quality for a given classification problem, e.g., RR Lyrae vs. W UMa. [Figure panels: RR Lyrae; eclipsing binary (W UMa)] (Lead: C. Donalek)
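Ranking features for a two-class problem can be illustrated with a simple per-feature separation score. The sketch below uses the Fisher score (squared gap between class means divided by the sum of class variances) as a stand-in; the talk does not say which ranking criterion was used, and the feature names and values here are made up.

```python
def fisher_score(a, b):
    """Two-class separation for one feature: (mean gap)^2 / (sum of variances)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / len(a)
    vb = sum((x - mb) ** 2 for x in b) / len(b)
    return (ma - mb) ** 2 / (va + vb) if (va + vb) > 0 else float("inf")


def rank_features(class_a, class_b):
    """Inputs map feature name -> list of values. Return names, best first."""
    scores = {name: fisher_score(class_a[name], class_b[name]) for name in class_a}
    return sorted(scores, key=scores.get, reverse=True)


# Made-up toy values for two classes; not real survey measurements.
rr_lyrae = {
    "amplitude": [0.9, 1.0, 1.1, 0.8],
    "skew": [-0.5, -0.6, -0.4, -0.5],
    "mean_mag": [16.0, 17.0, 15.5, 16.5],
}
w_uma = {
    "amplitude": [0.4, 0.5, 0.3, 0.4],
    "skew": [0.1, 0.3, 0.2, 0.2],
    "mean_mag": [16.2, 16.8, 15.7, 16.4],
}

ranking = rank_features(rr_lyrae, w_uma)
print(ranking)  # ['skew', 'amplitude', 'mean_mag'] for these toy values
```

A univariate score like this ignores interactions between features; classifier-based rankings can capture those at a higher computational cost.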
Quasar Selection in a Combined Parameter Space of Variability and WISE Colors. [Figure label: QSO region] Initial results from the Kepler field: a 100% success rate! (Leads: M. Graham, D. Stern)
Looking for Outliers in the QSO Variability Parameter Space. Spectra for the outliers in this parameter space (with anomalous/unusual variability patterns) show: BAL QSOs with evolving spectra; type-changing quasars (between Type 1 and Type 2); double-peak emitters; correlated photometric/spectroscopic variability. (Lead: M. Graham)
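An outlier search in a parameter space can be sketched with a distance-based score. The example below scores each point by the distance to its k-th nearest neighbor, so isolated points score high. This is a generic technique chosen for illustration, not necessarily what was used for the QSO sample, and the 2-D points are synthetic.

```python
from math import dist


def knn_outlier_scores(points, k=3):
    """Score each point by the distance to its k-th nearest neighbor.

    Larger scores mean more isolated points. Requires len(points) > k.
    """
    scores = []
    for i, p in enumerate(points):
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    return scores


# Synthetic feature vectors: a tight clump plus one isolated point.
points = [
    (0.0, 0.0),
    (0.1, 0.0),
    (0.0, 0.1),
    (0.1, 0.1),
    (0.05, 0.05),
    (5.0, 5.0),
]

scores = knn_outlier_scores(points, k=3)
print(scores.index(max(scores)))  # index of the most isolated point: 5
```

In practice the features would be standardized first so that no single axis dominates the distance, and the candidates would then be followed up, for example spectroscopically.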
From the Information Technology to the Cognition Technology: Towards a Human-Computer Collaborative Discovery Vannevar Bush (1945) (1960)
(Lead: M. Graham)
The Rise of the Machines: Science on the Carbon-Silicon Interface. Data processing: automated data quality control (anomaly/fault detection/repair). Data mining and analysis: clustering, classification, outlier or anomaly detection; pattern recognition, multivariate correlation search; machine discovery of analytical relationships; assisted dimensionality reduction for visualization. Code design and implementation: from art to science?
A Key Challenge: Visualizing Multidimensional Data Spaces. Hyperdimensional structures (clusters, correlations, etc.) may be present in many complex data sets, whose dimensionality may be D ~ 10^2 to 10^4, or higher. It is a matter of data understanding, choosing the right data mining algorithms, and interpreting the results. We are biologically limited to perceiving up to ~ 3-12(?) dimensions. What good are the data if we cannot effectively extract knowledge from them?
Traditional 2D Visualization. Quasar colors in an 8-dimensional parameter space: typical 2-D projections.
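To see why 2-D projections scale poorly, it helps to count them. The short sketch below enumerates all axis pairs of a D-dimensional space; the function name is hypothetical.

```python
from itertools import combinations
from math import comb


def axis_pairs(n_dims):
    """List every pair of axes, i.e. every distinct 2-D projection."""
    return list(combinations(range(n_dims), 2))


pairs = axis_pairs(8)
print(len(pairs), comb(8, 2))  # both are 28
```

An 8-dimensional space already needs 28 scatter plots to show every pair of axes, and no single plot shows structure that involves three or more axes at once.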
Diving Into Multidimensional Datascapes. New interactive and collaborative data visualization tools using immersive or augmentative Virtual Reality.
Effective Navigation and Interaction in VR. Beyond a keyboard and a mouse: gesture-based interfaces and control devices. Developing optimal user interaction tools and methods for the new VR/AR platforms.
Telepresence and Holoportation. Scientific collaboration in shared virtual spaces, collaborative visual data exploration. Virtual Mars at JPL (S. Davidoff, J. Norris, et al.). Holoportation with Microsoft HoloLens™.
Why Virtual Reality? Multi > 3; multi-D ≠ multiple 1-D. VR/AR is the next computing platform, following on the mainframe, desktop, and mobile. VR solves the problems that traditionally plagued 3-D visualization: occlusion, perspective, navigation, etc. The key concepts are proprioception (sense of the relative position) and kinesthesia (movement sense). VR is a natural platform for collaborative visual exploration and scientific collaboration. Leverages a multi-$B investment by the games industry.
Data Science Methodology Transfer. There are common challenges and a common underlying methodology to much of data science (computing, IT, ML, statistics...). How can we transfer the cyberinfrastructure developments, experience, and solutions from one scientific domain to others?
Center for Data-Driven Discovery. A new research center at Caltech. Serves research efforts Institute-wide. A part of a new, Caltech-JPL joint initiative for data science and technology. The goals are to assist faculty in formulation and execution of data-intensive projects, and facilitate interdisciplinary sharing of methods, ideas, novel projects, etc.
From Sky Surveys to Neurobiology. Using the data analytics tools based on ML, developed for the analysis of sky surveys, to design better diagnostics for autism. Feature importance using random forests => Next: correlate with MRI scans (with R. Adolphs et al.). J. Bunn, CD3
From Sky Surveys to Neurobiology. Feature importance => 6-dimensional parameter space. [Figure labels: Mixed <=> Control; Outlier] Cylinders = Autistic, Cubes = Control; Striped = Male, Solid = Female. C. Donalek, CD3
From Sky Surveys to Neurobiology. Symbolic regression finds the best-fitting mathematical description of a sample of data via an evolutionary algorithm. Cast binary classification as class = g(f(x_1, x_2, x_3, ..., x_n)), where f(x) is the equation of the discriminating hyperplane. Dependent features: "I find it easy to put myself in somebody else's shoes"; "I can tell if someone is masking their true emotion"; "I feel at upset". Accuracies of ~90%, but small sample data set and feature degeneracy. M. Graham, CD3
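The composition class = g(f(x)) can be sketched as below. Here f is fixed to a linear hyperplane with made-up weights, and g is a simple threshold; in symbolic regression the form of f would instead be searched for by an evolutionary algorithm, which this sketch does not attempt. All names and numbers are hypothetical.

```python
def f(x, weights, bias):
    """Signed offset of x from the hyperplane weights . x + bias = 0."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def g(value):
    """Threshold function: map the sign of f(x) to a binary class label."""
    return 1 if value > 0 else 0


def classify(x, weights, bias):
    """Binary classification cast as class = g(f(x))."""
    return g(f(x, weights, bias))


# Made-up hyperplane in a 2-feature space: x1 - x2 + 0.5 = 0.
weights, bias = (1.0, -1.0), 0.5
print(classify((1.0, 2.0), weights, bias))  # f = -0.5, so class 0
print(classify((3.0, 1.0), weights, bias))  # f = 2.5, so class 1
```

Separating f from g keeps the learned expression interpretable: f says how the features combine, and g only decides which side of the boundary a sample falls on.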
The Key Points. A systematic exploration of high-dimensionality data spaces is a key arena for any data-rich science, astronomy included. Machine learning and computational statistics tools are essential, and many challenges remain. Uses of machine intelligence will lead to a collaborative human-computer discovery, and a cognition technology. Multidimensional visualization is a key bottleneck. Virtual reality will be a powerful platform for both data visualization and scientific collaboration. Many data science challenges are common to all fields; their solutions constitute a rise of the new scientific methodology, and methodology transfer can and should be done.