COMPUTATIONAL SOCIAL SCIENCE AND ADVANCED COMPUTING INFRASTRUCTURE: CHALLENGES AND OPPORTUNITIES Myron Gutmann Directorate for the Social, Behavioral and Economic Sciences March, 2012 1 10/24/11
Portrait of Herman Hollerith courtesy of the Computer History Museum, www.computerhistory.org
Hollerith Electric Tabulator, US Census Bureau, 1908. Photograph by Waldon Fawcett. Library of Congress, LC-USZ62-45687. Image courtesy of the Early Office Museum, www.earlyofficemuseum.com
TAKING THE CENSUS, 1870. ILLUSTRATION IN HARPER'S WEEKLY, NOVEMBER 19, 1870, P. 749
Photo Credit: U.S. Census Bureau, Public Information Office. Digital ID: cph 3b39850. Source: b&w film copy neg. Reproduction Number: LC-USZ62-93675 [LC]
SBE RESEARCH INFRASTRUCTURE
- NHGIS
- General Social Survey
COMMUNITY STRUCTURE OF POLITICAL BLOGS (2004) SHOWN USING A GEM LAYOUT IN THE GUESS VISUALIZATION AND ANALYSIS TOOL. THE COLORS REFLECT POLITICAL ORIENTATION: RED FOR CONSERVATIVE AND BLUE FOR LIBERAL. ORANGE LINKS GO FROM LIBERAL TO CONSERVATIVE, AND PURPLE ONES FROM CONSERVATIVE TO LIBERAL. THE SIZE OF EACH BLOG REFLECTS THE NUMBER OF OTHER BLOGS THAT LINK TO IT.
From "The Political Blogosphere and the 2004 US Election: Divided They Blog" by Lada Adamic and Natalie Glance
From "The Collective Dynamics of Smoking in a Large Social Network" by Nicholas A. Christakis and James H. Fowler, New England Journal of Medicine 358:21 (May 22, 2008)
New York Times, March 17, 2010
SUMMARY SO FAR
- Long tradition of using computational technology in SBE research
- Shared traditions grew out of large-scale surveys and traditions of archiving, sharing, and reuse
- Newest research infrastructure is solidly cyber and largely sustainable
- Next-generation research questions at new scales while preserving confidentiality and privacy
SBE 2020: WHAT WE LEARNED
Goals:
- Identify decadal-scale research through a community-based process
- Understand the programmatic implications for the directorate
Input: 252 white papers, several campus visits, and attendance at professional organizations to solicit input and ideas
Report: Rebuilding the Mosaic (http://www.nsf.gov/sbe/sbe_2020)
Vision: Future research will be collaborative, multidisciplinary, and data intensive, and will address societal problems and fundamental scientific questions
FUTURE SBE RESEARCH: TECHNOLOGY AND DATA DRIVERS
- Scale: more data from more sources (environmental, sensor, administrative, survey, commercial, usage, and so on)
- Density (merge, overlap, georectify)
- Tools (statistics, GIS, network analysis, modeling, scenarios)
- Granularity (fMRI, administrative, commercial, and behavioral level)
- Greater access to and demand for high-performance computational resources
SBE DEMAND: TWO MAJOR DIMENSIONS
Platform for analysis:
- Access to data (discovery)
- Access to related tools and software
- Access to compute cycles
- Access to assistance, training, and relevant expertise
Infrastructure:
- Maintain, archive, store, and preserve data in a usable form (understanding that some datasets can be very large, e.g., fMRI)
- Make available a core set of tools to enable comparable results
BIGGER, FASTER, SMARTER. BUT HOW BIG IS BIG? HOW FAST IS FAST?
- "Big data": datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze (McKinsey Global Institute, Big data: the next frontier for innovation, competition, and productivity, May 2011)
- "Fast": faster than can be achieved at (1) the desktop; (2) the local cluster maintained by the department; (3) the computational resources I am accustomed to using
WHERE ARE WE SEEING APPLICATIONS... AND FRUSTRATIONS?
What do investigators want to do?
- Text analysis at scale
- Network analysis
- Some statistical techniques (e.g., Bayesian analysis)
- Simulation
- Visualization
And what fails?
- The task exceeds the capabilities of the software package or the operating system
- The task exceeds the computational power of the resource ("too slow")
- The task requires skills the investigator or team doesn't have ("hire a programmer")
DATA ANALYSIS, PROCESSING, AND MERGING AND REGISTRATION
- Text analysis and language processing techniques at scale to understand dimensions of innovation, firm behavior, and productivity
- Patent, award, citation, and text databases
- Record linking (individuals, firms) and geocoding
- Name/entity disambiguation
- Unstructured or semi-structured text
- Large-scale data (IPUMS now has > 800 million records)
- Machine-learning techniques on images (fMRI) to enable encoding
- Merging data from multiple streams (sensor, brain, administrative)
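As an illustration of the record linking and name disambiguation items above, here is a minimal sketch in Python. It is not a method described in the talk: the functions `normalize`, `similarity`, and `link_records`, the sample names, and the 0.85 threshold are all hypothetical. It uses only the standard-library `difflib.SequenceMatcher`; production record linkage at IPUMS scale would need blocking and more robust matching.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop periods and commas, and collapse whitespace."""
    cleaned = name.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_records(left, right, threshold=0.85):
    """Pair each name in `left` with its best match in `right`.

    A pair is kept only if its similarity reaches `threshold`
    (an assumed cutoff; it would need tuning on real data).
    """
    links = []
    for a in left:
        best = max(right, key=lambda b: similarity(a, b), default=None)
        if best is not None and similarity(a, best) >= threshold:
            links.append((a, best))
    return links


# Hypothetical firm names from two sources.
patents = ["Acme Corp.", "Zenith Ltd"]
awards = ["ACME Corp", "Borealis Inc"]
print(link_records(patents, awards))
```

This brute-force comparison is quadratic in the number of records, which is one reason linkage at the scales mentioned above becomes a computing-infrastructure problem.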
NETWORK ANALYSIS
Social:
- Exploit social media, such as Twitter, to understand social networks and the transmission of information (e.g., financial)
- Extensive application in commercial research
Neuroscience/brain function:
- Interactions of neurons in large-scale neural systems
From "Entropy of Dialogues Creates Coherent Structures in E-mail Traffic" by Jean-Pierre Eckmann, Elisha Moses, and Danilo Sergi, Proceedings of the National Academy of Sciences 101(40): 14333-14337
SIMULATION AND MODELING
Processes:
- Are inherently compute intensive
- Must be repeated (Monte Carlo) to remove noise
- Involve comparisons of models
A hardware/software issue:
- Not all problems can be subdivided and distributed; some must be run in parallel
Applications in:
- Decision making and global climate change
- Brain function and specialization
- Learning
Credit: Matthew K Leonard, University of California, San Diego
Avian Flu Timeseries, www.nature.com
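To make the "must be repeated (Monte Carlo) to remove noise" point concrete, here is a minimal sketch in Python. Everything in it is assumed for illustration: `noisy_run` stands in for a real simulation, and the true value (10.0) and noise level (2.0) are arbitrary. It shows that averaging more runs gives a steadier estimate, at the cost of proportionally more compute.

```python
import random
import statistics


def noisy_run(rng):
    """One hypothetical simulation run: a true value of 10.0 plus Gaussian noise."""
    return 10.0 + rng.gauss(0.0, 2.0)


def monte_carlo_mean(n_runs, seed=0):
    """Average the outcome of n_runs independent runs."""
    rng = random.Random(seed)
    return statistics.fmean(noisy_run(rng) for _ in range(n_runs))


def spread_of_estimates(n_runs, n_repeats=200):
    """Standard deviation of the averaged estimate across repeated experiments."""
    estimates = [monte_carlo_mean(n_runs, seed=s) for s in range(n_repeats)]
    return statistics.stdev(estimates)


# The spread shrinks roughly as 1 / sqrt(n_runs).
for n in (1, 100):
    print(n, round(spread_of_estimates(n), 3))
```

Because each run here is independent, this workload can be subdivided and distributed; tightly coupled models of the kind noted above cannot be split so easily.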
WHERE ARE THE CHALLENGES?
- "Dirty data": using commercial, administrative, and usage data will require new solutions; disambiguation solutions will also be useful for cleaning non-traditional data sources.
- Commercial, administrative, and usage data may be restricted for subsequent use or may be covered by competing regulatory regimes.
- Streamed data requires extensive post-processing.
- Large and linked datasets may be exploited to identify individuals.
- Notions of consent affect use of legacy and future datasets.
- Public perceptions of the research may affect how data might be used.
WHERE IS THE SCIENCE?
- Decision making, in the context of climate change but also in many other areas
- Effective policy making, for science and more generally
- Networks (social, information, neural) as either the object of study (e.g., Twitter) or a mechanism for understanding relationships (firm behavior, innovation)
- Neuroscience
- Learning
- What else?
NSF'S ROLE: CIF21 ADVANCED COMPUTING INFRASTRUCTURE
- Foundational research in computation
- Partnerships with the scientific domains
- Building, testing, and deploying innovative and sustainable resources in collaborative ecosystems
- Education and workshop programs in relevant scientific and technical areas
- Development and evaluation of transformational and grand challenge programs
SBE'S ROLE
- Continue investments in the existing data and computational infrastructure and in upgrades to it
- Release a new FY2012 solicitation: "Building Community and Capacity for Data-Intensive Research in the Social, Behavioral, and Economic Sciences and in Education and Human Resources"
- Seek opportunities to provide relevant training
- Continue programmatic support for infrastructure activities such as data collection, management, archiving, and storage
QUESTIONS FOR YOU
- If social and behavioral scientists self-censor, defining experiments in terms of what they know or believe they know how to do, what should we do?
- What is the real capacity need? Cycles? Software? New data? Services?
- What's the role of NSF, other public bodies, universities, and researchers?
- How should we provide those services at the campus, regional, and national levels?
THANK YOU!