Building an Infrastructure for Data Science: Data and the Librarian's Role
IAMSLIC, Anchorage, August 2012
Linda Pikula, NOAA and IODE GEMIM
Lots and lots of data
The predicted data deluge is a reality in many fields. Scientific instruments are generating data at greater speed, densities, and detail than heretofore possible. Older data are being digitized from print and analog forms at a prodigious rate. As data storage capacity increases and storage cost decreases, preserving these vast amounts of data is becoming feasible. Improvements in searching, analysis, and visualization tools are enabling scholars to interpret ever-larger amounts of data.
Borgman, UCLA
Digital data are the common thread
Digital technologies are reshaping the practice of science. Digital imaging, sensors, analytical instrumentation and other technologies are becoming increasingly central to experimental and observational research in all areas of science. Increases in computational capacity and capability drive more powerful modeling, simulation, and analysis to link theory and experimentation and extend the reach of science. Improvements in network capacity continually increase access to information, instrumentation and colleagues around the globe.
From Harnessing the Power of Digital Data for Science and Society, 2009
Vision
A digital scientific data universe in which data creation, collection, documentation, analysis, preservation, and dissemination can be appropriately, reliably and readily managed. This will enhance the return on our nation's research and development investment by ensuring that digital data realize their full potential as catalysts for progress in our global society.
Scholarly Life Cycle for Data or The Role of Data in the Value Chain
Scholarly Communication Cycle - Open Access - L. Lyon
Definitions
o E-Science (UK): computationally intensive science that is carried out in highly distributed network environments, or science that uses immense data sets that require grid computing... (Wikipedia)
o Cyberinfrastructure (US): describes the new research environments that support advanced data acquisition, data storage, data management, data integration, data mining, data visualization and other computing and information processing services over the Internet. (Wikipedia)
o Data-intensive science
Scholarly Info Cycle for Data, Value Chain: Method, Library Role
o Legitimization of Data: Trust in Data, Peer Review
o Registration of Data: Metadata, preservation, curation
o Certification of Data
o Dissemination of Data: Access, preservation
Four categories of data
o Observational
o Computational
o Experimental
o Records
Incentives to share data
Scholars' concerns, Publishers' concerns, Librarians' concerns
o Agreements among research partners
o COLLABORATION
o RECOGNITION
o Reciprocity
o Coercion
o Open Science
o Economic
o Preservation, access, documentation
o Documentation, provenance, access, preservation
Incentives NOT to share data
o Rewarded for publication, not data management
o Performance evaluations, bibliometrics, peer citing
o Difficult and time-consuming to document data for another's use subsequent to own use
o Competition for grant funding and recognition amongst scientists
o Keep control of intellectual property
o Publishers, scientists
Key issues for Data and e-science: Issues for Librarians
o Discovery and Identification: What data exist? Where are the data and how can they be accessed?
o Access: Who has access? How will the privacy of both users and research subjects be protected? What kinds of rights management structures need to be established, if any?
o Interoperability: In what formats will data be stored and presented? What kinds of metadata will be applied? How will variables be described? What data models apply?
o Retention Criteria: Are the data likely to be reused? Will another researcher be able to reasonably replicate or build upon the original results using these data? What is the cost of metadata creation, and how does that compare to the expected value of the data to other researchers?
Key issues for Data and e-science: Issues for Librarians
o Migration/Preservation: Will data need to be converted or migrated in order to be usable? Will legacy system configurations need to be preserved or emulated in order to ensure long-term usability of the data?
o Idiosyncratic practices for data management: How were the data managed in the laboratory environment? If researchers developed their own ad hoc systems, what impact will this have on how the data will need to be stored for future usability?
o Culture of data as private good: On what grounds do researchers and institutions object to data sharing? Is there a sense that the data are personally or institutionally owned? Is this the case legally or ethically?
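The discovery, access, and interoperability questions above can be turned into a simple completeness check on a dataset description. The following is only a minimal sketch in Python: the `DatasetRecord` class, its field names, and the `missing_fields` helper are hypothetical illustrations, not part of any metadata standard or of the original presentation.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DatasetRecord:
    """Hypothetical description of one dataset (field names are assumptions)."""
    title: str = ""                # Discovery: what data exist?
    location: str = ""             # Discovery: where are the data?
    access_policy: str = ""        # Access: who has access, under what rights?
    file_format: str = ""          # Interoperability: storage format
    variables: dict = field(default_factory=dict)  # Interoperability: variable descriptions
    likely_reuse: bool = False     # Retention: are the data likely to be reused?


def missing_fields(record):
    """Return the names of descriptive fields that are still empty."""
    return [
        f.name
        for f in fields(record)
        if f.name != "likely_reuse" and not getattr(record, f.name)
    ]


# Example: a partially documented oceanographic dataset.
record = DatasetRecord(
    title="Sea surface temperature time series",
    file_format="CSV",
    variables={"sst": "sea surface temperature, degrees Celsius"},
)
print(missing_fields(record))  # ['location', 'access_policy']
```

A check like this does not answer the retention question on its own; it only shows which descriptive gaps remain before the data could be found and used by someone else.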
EXAMPLES OF LIBRARY INVOLVEMENT IN THE DATA ARENA
ROLES LIBRARIANS AND LIBRARY STAFF ARE FULFILLING
o Data management including collection, organization, description, curation, archiving, and dissemination; creating a plan
o Creation of new data- and scholarship-based electronic resources for university and/or public use
o Development of new models, standards, and architectures for various aspects of data management, description, etc.
SOME EXAMPLES OF LIBRARY INVOLVEMENT IN THE DATA ARENA
ROLES LIBRARIANS AND LIBRARY STAFF ARE FULFILLING cont.
o Building accessible linkages between all the components and stages of research, from data to researchers to publications
o Bridging institutional hierarchies and departmental divisions in service of interdisciplinary initiatives
Oceanography Fisheries Atmospheric Sciences
Oceanographic Data
o Carbon Cycles
o Ocean Temperature, Color, Depth, Salinity
o Time Series
o Mixed layer surface currents
o Meridional heat transport
o Global Heat Storage
o Global Surface Currents
o Essential Climate Variables data (salinity, chlorophyll, altimetry, surface wind and current)
o Wave Data
o Coastal Climatologies
o Data for use in Marine Spatial Planning and Decision support applications for climate, ecosystems and coastal planning
Atmospheric - Air/Sea Interaction
o Air pressure and winds
o Near surface winds (ocean surface)
o Hurricane and storm data
o Other
Fisheries Data
o Fisheries: catch, abundance, sex, size
o Commercial fisheries: landings/exploitation
o Recreational fisheries
o Stock assessments/abundance, species, habitat assessments, surveys at sea, recruitment
o Environmental: habitat, water quality, climate cycles
o International Organizations that collect or maintain Fisheries Statistics: http://www.st.nmfs.noaa.gov/st1/International_National_Organizations.html
Ten Questions to Begin a Conversation With Your Faculty About Data Curation
1. What is the story of your data?
2. What form and format are the data in?
3. What is the expected lifespan of your data?
4. How could your data be used, reused, and repurposed?
5. How large is your dataset, and what is its rate of growth?
6. Who are potential audiences for your data?
7. Who owns the data?
8. Does the dataset include any sensitive information?
9. What publications or discoveries have resulted from the data?
10. How should the data be made accessible?
Witt, M. & Carlson, J. (2007). Conducting a data interview. http://docs.lib.purdue.edu/lib_research/81/.
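Answers to questions 3 and 5 (lifespan and rate of growth) can feed a rough storage estimate. The sketch below assumes a constant annual growth rate compounded yearly, which is a simplification introduced here for illustration; the function name `projected_size_gb` and the example figures are hypothetical and do not come from the presentation or from Witt and Carlson.

```python
def projected_size_gb(current_gb, annual_growth_rate, years):
    """Estimate dataset size after `years`, assuming constant compound growth.

    annual_growth_rate is a fraction, e.g. 0.5 means the dataset
    grows by 50 percent each year (an assumed, simplified model).
    """
    if current_gb < 0 or years < 0:
        raise ValueError("current_gb and years must be non-negative")
    return current_gb * (1 + annual_growth_rate) ** years


# Example: a 100 GB dataset growing 50 percent per year, kept for 2 years.
print(projected_size_gb(100, 0.5, 2))  # 225.0
```

Real datasets rarely grow this regularly, so a figure like this is best treated as a starting point for the conversation with the researcher rather than a plan.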
References:
o Borgman, Christine L. Scholarship in the digital age: information, infrastructure, and the Internet. MIT Press, Cambridge, Mass., 2007.
o The Data deluge
o Borgman, Christine L. Towards a new information space: innovation and renovations. Online video.
o Brandt, D.S. et al. Investigating data curation practices across disciplines. IMLS Grant, Purdue University and University of Illinois, Urbana, 2007.
o Curating e-science data. Digital Curation Centre, Aug 24, 2006... The data represent a new foundation for new research, science,... JISC Senior Management Briefing Paper, The Data deluge
o Harnessing the power of digital data for science and society. Report of the Interagency Working Group on Digital Data to the Committee on Science of the National Science and Technology Council. January 2009.
o Pikula, Linda. Data sources in marine and oceanographic science. OceanTeacher Digital Curation Course, IODE, Oostende, Be. September 2011.
o E-Science Talking Points for ARL Deans and Directors, by E. Jones, October 24, 2008. http://www.arl.org/rtl/escience/