Scientific Reproducibility and Software

Victoria Stodden
Information Society Project @ Yale Law School
<vcs@stanford.edu>
Institute for Computational Engineering and Sciences, The University of Texas at Austin
February 8, 2010

Agenda
1. Error Control: Verification, Validation, and Error Quantification (V, V, & EQ)
2. Hypothesis: Increased Reproducibility Needed to Comply with the Scientific Method
3. Survey: Barriers to Open Code/Data
4. Untangling Intellectual Property Issues
5. Barriers such as software lock-in and idea evolution (Gaussian)
Software needs: analog to digital so need to track (survey)

Controlling Error is Central to Scientific Progress
"The scientific method's central motivation is the ubiquity of error - the awareness that mistakes and self-delusion can creep in absolutely anywhere and that the scientist's effort is primarily expended in recognizing and rooting out error."
David Donoho et al. (2009)

Software Verification
Accuracy with which a computational model delivers the results of the underlying mathematical model:
Solution verification: does the discretization error approach zero as we approach the continuous model?
Code verification: test suites of problems with known solutions and known rates of convergence.
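As an illustration of code verification against a problem with a known solution and a known rate of convergence, here is a minimal sketch. The choice of the composite trapezoid rule and of the test integral is an assumption made for this example and does not come from the talk. The rule is expected to converge at second order, so the observed order between successive grid refinements should approach 2.

```python
import math


def trapezoid(f, a, b, n):
    """Composite trapezoid rule on [a, b] with n equal subintervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


def observed_orders(f, a, b, exact, ns):
    """Return the error at each resolution and the observed convergence
    order between each pair of successive resolutions."""
    errors = [abs(trapezoid(f, a, b, n) - exact) for n in ns]
    orders = [
        math.log(errors[i] / errors[i + 1]) / math.log(ns[i + 1] / ns[i])
        for i in range(len(ns) - 1)
    ]
    return errors, orders


# Known solution (illustrative test problem): the integral of sin(x)
# over [0, pi] is exactly 2.
ns = [8, 16, 32, 64]
errors, orders = observed_orders(math.sin, 0.0, math.pi, 2.0, ns)

for n, err in zip(ns, errors):
    print(f"n={n:3d}  error={err:.3e}")
print("observed orders:", [round(p, 3) for p in orders])
```

The shrinking errors correspond to solution verification, and comparing the observed order with the theoretical one corresponds to code verification. An order that stalls well below 2 would suggest a bug in the implementation.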

Model Validation
Accuracy of a computational model with respect to observed data:
Model error: misspecification
Observation error, measurement error

Uncertainty Quantification
Numerical accuracy of estimates
Sensitivity of given results to boundary conditions
Model calibration metrics
Parameter estimation, confidence intervals
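One common way to attach a confidence interval to a parameter estimate is the percentile bootstrap. The sketch below is only an illustration: the talk does not name a method, and the measurements, the function name `bootstrap_ci`, and its parameters are hypothetical.

```python
import random
import statistics


def bootstrap_ci(sample, estimator, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for estimator(sample).

    A fixed seed keeps the interval itself reproducible.
    """
    rng = random.Random(seed)
    n = len(sample)
    estimates = sorted(
        estimator([sample[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lower = estimates[round(alpha * n_resamples)]
    upper = estimates[round((1.0 - alpha) * n_resamples) - 1]
    return lower, upper


# Hypothetical measurements, made up for illustration only.
measurements = [9.8, 10.1, 10.4, 9.9, 10.0, 10.3, 9.7, 10.2, 10.1, 9.9]

point = statistics.mean(measurements)
low, high = bootstrap_ci(measurements, statistics.mean)
print(f"mean = {point:.2f}, 95% interval = [{low:.2f}, {high:.2f}]")
```

Fixing the random seed matters here for the same reason the talk gives for releasing code: without it, a reader rerunning the analysis would get a slightly different interval and could not tell whether the difference was meaningful.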

Controlling Error is Central to the Scientific Method
"In stochastic modeling the possibility of erroneous decisions cannot be eliminated, and the best one can do is to seek methods of making decisions that, in a sense, minimize the risk of mistakes."
Jerzy Neyman, "Statistics - Servant of All Sciences," Science, 1955, p. 401

Hypothesis
Verification, Validation, & Error Quantification are necessary but not sufficient for the practice of the scientific method.
Error control is not insular, but also involves community vetting.

Reproducibility
Computational science: the researcher works with code or data in generating published results.
Reproducibility: the ability of others to recreate and verify computational results, given appropriate software and computing resources.

Reproducibility: Hypothesis
Facts are established through social acceptance (Latour), through independent replication, and through open inspection.
Reproducibility of results is essential for computational science to conform with the scientific method.

Science is Changing
A transformation of the scientific enterprise through massive computation, in scale, scope, and pervasiveness, is currently underway.
JASA June 1996: 9 of 20 articles computational
JASA June 2006: 33 of 35 articles computational

Example: Community Climate System Model (CCSM)
Collaborative system simulation
Code available by permission
Data output files by permission

Example: High Energy Physics
4 LHC experiments at CERN: 15 petabytes produced annually
Data shared through grid to mobilize computing power
Director of CERN (Heuer): "Ten or 20 years ago we might have been able to repeat an experiment. They were simpler, cheaper and on a smaller scale. Today that is not the case. So if we need to re-evaluate the data we collect to test a new theory, or adjust it to a new development, we are going to have to be able [to] reuse it. That means we are going to need to save it as open data." Computer Weekly, August 6, 2008

Example: Astrophysics Simulation Collaboratory
Data and code sharing within community
Interface for dynamic simulation
Mid-1930s: calculate the motion of cosmic rays in Earth's magnetic field.

Example: Proofs
Mathematical proof via simulation, not deduction
Breakdown point: 1/sqrt(2 log(p))
A valid proof? A contribution to the field of mathematics?

The Third Branch of the Scientific Method
Branch 1: Deductive/Theory: e.g. mathematics; logic
Branch 2: Inductive/Empirical: e.g. the machinery of hypothesis testing; statistical analysis of controlled experiments
Branch 3: Large scale extrapolation and prediction: knowledge from computation, or tools for established branches?

Contention About 3rd Branch
Anderson: The End of Theory. (Wired, June 2008)
Hillis Rebuttal: We are looking for patterns first then create hypotheses as we always have. (The Edge, June 2008)
Weinstein: Simulation underlies existing branches
1. Tools to build intuition (branch 1)
2. Hypotheses to test (branch 2)

Emerging Credibility Crisis in Computational Science
Typical scientific communication doesn't include code, data, test suites.
Much published computational science is nearly impossible to replicate.
Accession to 3rd branch of the scientific method involves the production of routinely verifiable knowledge.

Potential Solution: Really Reproducible Research
Pioneered by Jon Claerbout
"An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures."
(quote from David Donoho, "Wavelab and Reproducible Research," 1995)

Barriers to Sharing: Survey
Hypotheses:
1. Scientists are primarily motivated by personal gain or loss.
2. Scientists are primarily worried about being scooped.

Survey of Computational Scientists
Subfield: Machine Learning
Sample: American academics registered at top Machine Learning conference (NIPS).
Respondents: 134 responses from 638 requests.

Reported Sharing Habits
81% claim to reveal some code and 84% claim to reveal some data.
Visual inspection of their websites: 30% had some code posted, 20% had some data posted.

Top Reasons Not to Share
Reason                                   Code   Data
Time to document and clean up            77%    54%
Not receiving attribution                44%    42%
Possibility of patents                   40%    -
Legal barriers (i.e. copyright)          34%    41%
Time to verify release with admin        -      38%
Potential loss of future publications    30%    35%
Dealing with questions from users        52%    34%
Competitors may get an advantage         30%    33%
Web/Disk space limitations               20%    29%

For example...

Top Reasons to Share
Reason                              Code   Data
Encourage scientific advancement    91%    81%
Encourage sharing in others         90%    79%
Be a good community member          86%    79%
Set a standard for the field        82%    76%
Improve the caliber of research     85%    74%
Get others to work on the problem   81%    79%
Increase in publicity               85%    73%
Opportunity for feedback            78%    71%
Finding collaborators               71%    71%

Have you been scooped?
Idea Theft                          Count   Proportion
At least one publication scooped    53      0.51
2 or more scooped                   31      0.30
No ideas stolen                     50      0.49
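The proportions in this table can be recomputed from the counts. The sketch below assumes the denominator is the 103 respondents who answered the question (53 scooped at least once plus 50 with no ideas stolen), with the "2 or more" row treated as a subset of the first row. That reading is an inference from the numbers, not something stated on the slide.

```python
# Counts taken from the slide.
counts = {
    "At least one publication scooped": 53,
    "2 or more scooped": 31,
    "No ideas stolen": 50,
}

# Assumption: "2 or more scooped" is a subset of "At least one publication
# scooped", so the respondents to this question are the first and last rows.
total = counts["At least one publication scooped"] + counts["No ideas stolen"]

proportions = {label: round(n / total, 2) for label, n in counts.items()}

for label, share in proportions.items():
    print(f"{label:35s} {counts[label]:3d}  {share:.2f}")
```

Under that assumption the recomputed values match the slide (0.51, 0.30, 0.49).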

Preliminary Findings
Surprise: Motivated to share by communitarian ideals.
Not surprising: Reasons for not revealing reflect private incentives.
Surprise: Scientists not that worried about being scooped.
Surprise: Scientists quite worried about IP issues.

Legal Barriers to Reproducibility
Original expression of ideas falls under copyright by default (written expression, code, figures, tables, ...)
Copyright creates exclusive rights vested in the author to:
  reproduce the work
  prepare derivative works based upon the original
Exceptions and limitations: Fair Use, Academic purposes

Creative Commons
Founded by Larry Lessig to make it easier for artists to share and use creative works
A suite of licenses that allows the author to determine terms of use attached to works

Creative Commons Licenses
A notice posted by the author removing the default rights conferred by copyright and adding a selection of:
BY: if you use the work, attribution must be provided
NC: work cannot be used for commercial purposes
ND: derivative works not permitted
SA: derivative works must carry the same license as the original work
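How the four elements combine into a license name can be sketched as a small validator. The two rules it enforces are not on the slide and are stated here as assumptions based on the six standard Creative Commons licenses: attribution (BY) is always included, and ND cannot be combined with SA, since SA governs derivative works that ND forbids. The function name `license_name` is hypothetical.

```python
# The four license elements, in the order used in standard license names.
ELEMENTS = ("BY", "NC", "ND", "SA")


def license_name(elements):
    """Build a license name such as "CC BY-NC-SA" from a selection of elements.

    Raises ValueError for selections outside the standard combinations.
    """
    chosen = set(elements)
    unknown = chosen - set(ELEMENTS)
    if unknown:
        raise ValueError(f"unknown license elements: {sorted(unknown)}")
    if "BY" not in chosen:
        # Assumption: every license in the standard suite requires attribution.
        raise ValueError("BY (attribution) must be included")
    if {"ND", "SA"} <= chosen:
        # SA applies to derivative works, which ND does not permit.
        raise ValueError("ND and SA cannot be combined")
    return "CC " + "-".join(e for e in ELEMENTS if e in chosen)


print(license_name(["BY"]))
print(license_name(["SA", "BY", "NC"]))
```

Under these assumptions exactly six combinations are accepted: BY, BY-NC, BY-ND, BY-SA, BY-NC-ND, and BY-NC-SA.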

License Logos

Open Source Software Licensing
Creative Commons follows the licensing approach used for open source software, but adapted for creative works
Code licenses:
  BSD license: attribution
  GNU GPL: attribution and share alike
  Hundreds of software licenses...

Apply to Scientific Work?
Remove copyright's block to fully reproducible research
Attach a license with an attribution component to all elements of the research compendium (including code, data), encouraging full release.
Solution: Reproducible Research Standard

Reproducible Research Standard
Realignment of legal framework with scientific norms:
Release media components (text, figures) under CC BY.
Release code components under Modified BSD or similar.
Both licenses free the scientific work of copying and reuse restrictions and have an attribution component.
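The recommendation on this slide amounts to a mapping from component type to license, which can be sketched as a lookup. The names `RRS_LICENSES`, `license_for`, and `license_compendium`, and the example file names, are hypothetical. Data is deliberately left out of the mapping because this slide gives no license for it.

```python
# Component type -> recommended license, as listed on the slide.
RRS_LICENSES = {
    "text": "CC BY",
    "figure": "CC BY",
    "code": "Modified BSD",
}


def license_for(component_type):
    """Return the license the slide recommends for one component type."""
    try:
        return RRS_LICENSES[component_type]
    except KeyError:
        raise ValueError(
            f"no recommendation on this slide for {component_type!r}"
        ) from None


def license_compendium(components):
    """Map each file in a research compendium to its recommended license.

    `components` maps a file name to its component type.
    """
    return {name: license_for(kind) for name, kind in components.items()}


# Hypothetical compendium, for illustration only.
compendium = {
    "paper.tex": "text",
    "fig1.pdf": "figure",
    "analysis.py": "code",
}
for name, chosen in license_compendium(compendium).items():
    print(f"{name:12s} -> {chosen}")
```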

ShareAlike Inappropriate
"ShareAlike": licensing provision that requires identical licensing of downstream libraries
Issue 1: Control of independent scientists' work
Issue 2: Incompatibility of differing licenses with this provision
GPL not suitable for scientific code.

Releasing Data?
Raw facts not copyrightable.
Original selection and arrangement of these facts is copyrightable. (Feist Publ'ns, Inc. v. Rural Tel. Serv. Co., 499 U.S. 340 (1991))

Benefits of RRS
Focus becomes release of the entire research compendium
Hook for funders, journals, universities
Standardization avoids license incompatibilities
Clarity of rights (beyond Fair Use)
IP framework supports scientific norms
Facilitation of research, thus citation, discovery

Reproducibility is an Open Problem (and scale matters)
Simple case: open data and small scripts. Suits simple definition.
Hard case: inscrutable code, organic programming.
Harder case: massive computing platforms, streaming data.
Can we have reproducibility in the hard cases?

Solutions for Harder Cases
Tools for reproducibility:
  Standardized testbeds
  Open code for continuous data processing, flags for continuous verifiability
  Standards and platforms for data sharing
  Provenance and workflow tracking tools (Mesirov)
Tools for attribution:
  Generalized contribution tracking
  Legal attribution/license tracking and search (RDFa)
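To make the idea of provenance tracking concrete, here is a minimal sketch that records content hashes of the input data and code behind a result, along with the parameters and interpreter version, and later reports which files have changed. It is not the tool referenced on the slide. The function names and the record layout are assumptions made for this illustration.

```python
import hashlib
import json
import platform
import sys
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 digest of a file's contents, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(inputs, code_files, parameters):
    """Describe what produced a result: environment, parameters, file hashes."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "parameters": parameters,
        "inputs": {str(p): sha256_of(p) for p in inputs},
        "code": {str(p): sha256_of(p) for p in code_files},
    }


def changed_files(record):
    """Return recorded files that are missing or whose contents now differ."""
    changed = []
    for section in ("inputs", "code"):
        for path, expected in record[section].items():
            if not Path(path).exists() or sha256_of(path) != expected:
                changed.append(path)
    return changed


def save_record(record, path):
    """Write the record as JSON so it can be published with the results."""
    Path(path).write_text(json.dumps(record, indent=2, sort_keys=True))
```

A record like this could be written next to each published figure. An empty list from `changed_files` then indicates that the data and code on disk are still byte-for-byte the ones that generated it.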

Case Study: mloss.org
Machine Learning Open Source Software
Active code repository
Code release at least as important as data release
Open question: software support

Case Study: DANSE
Neutron scattering
Make new data available
Unify software for analysis

Case Study: Wolfram Alpha
Obscure code - testbeds for verifiability
Dataset construction methods opaque (claims copyright over outputs)

Openness and Taleb's Criticism
Open Access movement removes the notion of a scientific community

Real and Potential Wrinkles
Reproducibility neither necessary nor sufficient for correctness, but essential for dispute resolution
Software lock-in and the evolution of scientific ideas (standards lock-in)
Attribution in digital communication:
  Legal attribution and academic citation not isomorphic
  Contribution tracking (RDFa)
RRS: Need for individual scientist to act
"progress depends on artificial aids becoming so familiar they are regarded as natural" I.J. Good, "How Much Science Can You Have at Your Fingertips," 1958

Papers and Links
Enabling Reproducible Research: Open Licensing for Scientific Innovation
15 Years of Reproducible Research in Computational Harmonic Analysis
The Legal Framework for Reproducible Research in the Sciences: Licensing and Copyright
http://www.stanford.edu/~vcs
http://www.stanford.edu/~vcs/conferences/RoundtableNov212009/

Appendix: Attribution
Legal attribution and academic citation not isomorphic.
Minimize administrative burden
Evolving norms / field specific norms / technology
"keep intact all copyright notices for the Work and provide, reasonable to the medium or means You are utilizing."