The Five R's for Developing Trusted Software Frameworks to increase confidence in, and maximise reuse of, Open Source Software
Ryan Fraser 1, Lutz Gross 2, Lesley Wyborn 3, Ben Evans 3 and Jens Klump 1
1 CSIRO; 2 University of Queensland; 3 NCI, Australian National University
@NCInews
The NCI Integrated Data-Intensive Science Platform Data Services THREDDS Server-side analysis and visualization VDI: Cloud scale user desktops on data 10PB+ Research Data Web-time analytics software
A deluge of 10 PB of Shared Science Data
CMIP5: 3 PB
Earth Observ.: 2 PB
Marine Videos: 10 TB
Astronomy (Optical): 200 TB
Atmosphere: 2.4 PB
Water Ocean: 1.5 PB
Weather: 340 TB
Bathy, DEM: 100 TB
Geophysics: 300 TB
BOM, GA, CSIRO, ANU, Other National, International
Mirrored from major science agencies and other sources
A Tsunami of CPUs
Raijin: 57,472 cores (Intel Xeon Sandy Bridge technology, 2.6 GHz) in 3592 compute nodes; 160 TBytes (approx.) of main memory; Infiniband FDR interconnect; and 7 PBytes (approx.) of usable fast filesystem (for short-term scratch space). 1.5 MW power; 100 tonnes of water in cooling.
Partner Cloud: Same generation of technology as Raijin (Intel Xeon Sandy Bridge technology, 2.6 GHz) but only 1500 cores; Infiniband FDR interconnect; Collaborative platform for services and The platform for hosting non-batch services
NCI Nectar Cloud: Same generation as partner cloud; Non-managed environment; Weak integration
There is a famine of trusted software: the free lunch is over
Herb Sutter, "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software", https://www.cs.utexas.edu/~lin/cs380p/free_lunch.pdf
Stages and Key Events in Software Evolution
Stages:
1950-1985: Write your own
1986-2010: Commercial: Buy Buy Buy
2010-2015: Rise of Open Source: scalability becomes king
Key events:
1983: GNU available
1991: Linux released
2006: The Free Lunch Ended
OK, so what is the issue?
How do I find the software I need?
Problem 1: Understand flood risk in Perth. Solution: Simulate tsunami hazard(s), develop risk map.
Problem 2: I want to model earthquake risk. Solution: Simulate fault movement, develop risk map.
And when I find the code(s), how do I:
- appreciate the dependencies and understand the environment required?
- know if the code is scientifically robust (trust)?
- know if I will crash the facility I chose to run it on (trust)?
Software sites are springing up
How do I find the code for the problem I want to solve?
2015 Data-Intensive Science Climate Report!!!
Data Deluge; Tsunami of CPUs; Famine of Software; Megadrought of funding
Source: http://jordanrussiacenter.org/event/feast-famine/
The approach to software development is changing
[Figure: Conceptual Section (Fault, Intrusive, Sandstone), Mathematical Model (e.g. f(x) = x + sin(x)) and Numerics Section (FEM, Particle Code, Generic Algorithm), linked by a Conceptual-Mathematical Interface and a Mathematical-Numerics Interface]
Software in the era of Data Intensive Science
EVOLUTION
Software needs to transition from a set of individual research projects to a production infrastructure via a trusted software framework
Source: http://www.nsf.gov/pubs/2012/nsf12113/nsf12113.pdf
Introducing the 5 R's of a Trusted Software Framework
Number. Component: Goal
1. Register: Find the required software
2. Review: Can I trust it?
3. Reference: Who else was game enough to use it?
4. Run: Get cracking
5. Repeat: Provide on-line exemplars
Component 2: Peer Review of software
Component 3: Reference - Who else has used it?
Figshare, Impactstory
Impactstory is an open-source, web-based tool that helps scientists explore and share the diverse impacts of all their research products, from traditional ones like journal articles to emerging products like blog posts, datasets, and software.
Component 4: Run
Component 5: Reference implementations (1)
A job's console log can be inspected
All of a job's outputs are also accessible
Each job has a lifecycle that can be managed
Source: Carina Kemp - The Virtual Geophysics Laboratory 2015 and beyond
Component 5: Reference implementations (2)
Each provenance record tracks all inputs, outputs, processing scripts and other metadata, allowing repeatability and transparency
... Input/output data...
Successful jobs can have their entire process captured in an ISO 19115 provenance record
Source: Carina Kemp - The Virtual Geophysics Laboratory 2015 and beyond
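The idea of a provenance record that tracks inputs, outputs and processing scripts can be illustrated with a minimal Python sketch. This is an assumption-laden illustration, not the Virtual Geophysics Laboratory implementation: the function names (build_provenance, same_run), the dictionary layout and the sample file names are hypothetical, and the record is a plain dictionary rather than an ISO 19115 encoding.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest used to fingerprint an artefact."""
    return hashlib.sha256(data).hexdigest()


def build_provenance(job_id: str, script: str,
                     inputs: Dict[str, bytes],
                     outputs: Dict[str, bytes]) -> dict:
    """Assemble a hypothetical provenance record for one job run."""
    return {
        "job_id": job_id,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "script_sha256": sha256_of(script.encode("utf-8")),
        "inputs": {name: sha256_of(blob) for name, blob in inputs.items()},
        "outputs": {name: sha256_of(blob) for name, blob in outputs.items()},
    }


def same_run(a: dict, b: dict) -> bool:
    """Two runs repeat each other if script, inputs and outputs all match."""
    return all(a[key] == b[key] for key in ("script_sha256", "inputs", "outputs"))


# Illustrative data only: the script text and file names are made up.
script = "print('simulate')"
first = build_provenance("job-1", script, {"dem.nc": b"elevation"}, {"risk.nc": b"map"})
rerun = build_provenance("job-2", script, {"dem.nc": b"elevation"}, {"risk.nc": b"map"})
changed = build_provenance("job-3", script, {"dem.nc": b"other"}, {"risk.nc": b"map"})

print(json.dumps(first, indent=2))
print(same_run(first, rerun), same_run(first, changed))
```

Hashing the artefacts rather than storing them keeps the record small while still letting a later run be checked against the original.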
Reviewing the 5 R's of Building trusted software
1. Register
   Goal: Find the required software
   What: Finding relevant software from multiple open source code repositories.
2. Review
   Goal: Can I trust it?
   What: Verifying the software through peer review forums (Mozilla Science Lab/Journals) to help users know which codes to trust
3. Reference
   Goal: Who else used it?
   What: Linking the software to Figshare or ImpactStory, which help disseminate and measure the impact of scientific research, including program code
4. Run
   Goal: Get cracking
   What: Draws on information supplied in the registration process and benchmark cases described in the review to instantiate the scientific code
5. Repeat
   Goal: Provide on-line exemplars
   What: Provenance Workflow engines that capture information relating to a run of that software, input and output artefacts, and transactions
Building into a Science Software Solution Centre
Components: Hazards Virtual Laboratory (VL); Scientific Software Solution Centre (SSSC: Registry + Governance)
1. User asks: "What solutions do you have for inundation modelling?"
2. SSSC offers solutions: x, y & z
3. User chooses a solution
4. User requests solution details from SSSC (science code, dependencies, template)
5. Selected solution details are returned to the VL
6. Solution is instantiated and ready for the user in the VL
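The request flow between the virtual laboratory and the solution centre can be sketched in Python as a small in-memory registry. Everything here is hypothetical: the class and function names (Solution, SolutionCentre, instantiate), the solution names x, y, z and q, and the template format are stand-ins chosen for illustration, not an existing SSSC interface.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Solution:
    """Hypothetical solution details: science code, dependencies, template."""
    name: str
    problem: str
    science_code: str
    dependencies: Tuple[str, ...]
    template: str


class SolutionCentre:
    """Stand-in for the registry side of the solution centre."""

    def __init__(self, solutions: Iterable[Solution]) -> None:
        self._by_name = {s.name: s for s in solutions}

    def offer(self, problem: str) -> List[str]:
        """Step 2: list the names of solutions registered for a problem."""
        return sorted(s.name for s in self._by_name.values() if s.problem == problem)

    def details(self, name: str) -> Solution:
        """Steps 4 and 5: return the details of the chosen solution."""
        return self._by_name[name]


def instantiate(solution: Solution) -> str:
    """Step 6 stand-in: fill the template instead of provisioning anything."""
    return solution.template.format(
        code=solution.science_code, deps=" ".join(solution.dependencies)
    )


# Illustrative registry contents only.
template = "install {deps} && run {code}"
centre = SolutionCentre([
    Solution("x", "inundation modelling", "code-x", ("dep-a", "dep-b"), template),
    Solution("y", "inundation modelling", "code-y", ("dep-a",), template),
    Solution("z", "inundation modelling", "code-z", ("dep-c",), template),
    Solution("q", "earthquake risk", "code-q", ("dep-d",), template),
])

offers = centre.offer("inundation modelling")   # user asks, centre offers
chosen = centre.details(offers[0])              # user chooses, details returned
command = instantiate(chosen)                   # laboratory instantiates
print(offers, command)
```

Keeping the registry lookup separate from instantiation mirrors the split between the solution centre, which holds the metadata, and the virtual laboratory, which runs the code.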
How much of this stuff do we actually have?
1. Register
   What is missing: Metadata profile, DOI frameworks
   How: Develop a metadata standard which includes: licensing, hardware environments, testing procedures, critical dependencies, core scientific algorithms, numerical methods
2. Review: Geoscientific Model Development Journal, Mozilla Science
3. Reference: Figshare, Impact Story
4. Run: GitHub, SourceForge
5. Repeat: Provenance Workflow engines
For 1: We need to harness expertise in the data community to move forward
All components need to be linked and have persistent identifiers
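A registration record covering the metadata items listed for the Register component could look like the following Python sketch. No such standard exists yet according to the slide, so the class name, field names and sample values (including the placeholder identifier) are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SoftwareRecord:
    """Hypothetical registry entry; not a published metadata standard."""
    identifier: str   # persistent identifier, e.g. a DOI (placeholder below)
    name: str
    license: str
    hardware_environments: List[str] = field(default_factory=list)
    testing_procedures: List[str] = field(default_factory=list)
    critical_dependencies: List[str] = field(default_factory=list)
    core_algorithms: List[str] = field(default_factory=list)
    numerical_methods: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Name the empty fields so incomplete registrations can be flagged."""
        return [key for key, value in asdict(self).items() if not value]


# Illustrative values only; the identifier is a placeholder, not a real DOI.
record = SoftwareRecord(
    identifier="doi:10.0000/example",
    name="example-inundation-model",
    license="Apache-2.0",
    hardware_environments=["x86_64 Linux cluster"],
    critical_dependencies=["MPI", "NetCDF"],
    numerical_methods=["finite element method"],
)

print(json.dumps(asdict(record), indent=2))
print(record.missing_fields())
```

A completeness check like missing_fields is one way a register could refuse or flag entries that omit testing procedures or core algorithms before they reach review.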
Conclusions: R U now aware?
A trusted software framework:
1. Is critical to creating Data-Intensive Science Platforms
2. Will enable Researchers to: Rapidly access Reliable code, Reduce the time to deploy it, greatly facilitate Reuse and Reinstallation of code, and then Rejoice
3. Will provide operational Robustness around our Science
4. But we need to get started on Rolling-out the Register
R U Ready with Questions?
Ryan Fraser: ryan.fraser@csiro.au
Lutz Gross: l.gross@uq.edu.au
Lesley Wyborn: lesley.wyborn@anu.edu.au
Ben Evans: ben.evans@anu.edu.au
Jens Klump: jens.klump@anu.edu.au
http://forum.en.grepolis.com/showthread.php?55048-it-s-a-fairy-tale
2015 Science ICT Network Conference