A New Approach to Safety in Software-Intensive Systems

Nancy G. Leveson, Aeronautics and Astronautics Dept., Engineering Systems Division, MIT

Why do we need a new approach?
"Without changing our patterns of thought, we will not be able to solve the problems we created with our current patterns of thought." (Albert Einstein)
- Traditional safety engineering approaches were developed for relatively simple electro-mechanical systems
- Accidents in complex, software-intensive systems are changing their nature
- The role of humans in systems is changing
- We need more effective techniques for these new systems

It's still hungry and I've been stuffing worms into it all day.

Chain-of-Events Model
- Explains accidents in terms of multiple events, sequenced as a forward chain over time
  - Simple, direct relationship between events in chain
  - Events almost always involve component failure, human error, or energy-related event
- Forms the basis for most safety engineering and reliability engineering
  - Analysis: e.g., FTA, PRA, FMECA, Event Trees, etc.
  - Design: e.g., redundancy, overdesign, safety margins, ...
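The analysis techniques named on this slide (FTA, PRA) combine component failure probabilities through logic gates. The following is a minimal sketch of that arithmetic, assuming independent basic events; the component names and probabilities are hypothetical and chosen only for illustration.

```python
from math import prod

def and_gate(probs):
    """Output event occurs only if every input event occurs (independent inputs)."""
    return prod(probs)

def or_gate(probs):
    """Output event occurs if at least one input event occurs (independent inputs)."""
    return 1.0 - prod(1.0 - p for p in probs)

# Hypothetical per-mission failure probabilities (not from the slides).
p_sensor = 1e-3         # primary sensor fails
p_backup_sensor = 1e-3  # redundant sensor fails
p_valve = 1e-4          # relief valve fails

# Redundancy: the sensing function is lost only if both sensors fail.
p_sensing_lost = and_gate([p_sensor, p_backup_sensor])

# Top event: sensing lost OR valve failed.
p_top = or_gate([p_sensing_lost, p_valve])
print(f"P(top event) = {p_top:.3e}")
```

Every input to this calculation is a component failure. A scenario in which all components work as specified but interact unsafely has no place in the tree, which is the limitation the following slides discuss.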

"It's never what we don't know that stops us; it's what we do know that just ain't so."
Assumptions:
- Accidents are caused only by system component failure(s)
- Safety is increased by increasing the reliability of the individual system components. If components do not fail, then accidents will not occur.
- Software can be treated just like hardware (with perhaps a few minor changes)
- Highly reliable software is safe
- Human error can be treated like hardware failure

Accident with No Component Failures

Types of Accidents
- Component Failure Accidents
  - Single or multiple component failures
  - Usually assume random failure
- Component Interaction Accidents
  - Arise in interactions among components
  - Related to interactive complexity and tight coupling
  - Exacerbated by introduction of computers and software
  - New technology introduces unknowns and unk-unks (unknown unknowns)

Interactive Complexity
- Critical factor is intellectual manageability
- A simple system has a small number of unknowns in its interactions (within the system and with the environment)
- A system is interactively complex (intellectually unmanageable) when the level of interactions reaches the point where they can no longer be thoroughly
  - Planned
  - Understood
  - Anticipated
  - Guarded against

It's only a random failure, sir! It will never happen again.

Safety ≠ Reliability
- Safety and reliability are NOT the same
  - Sometimes increasing one can even decrease the other
  - Making all the components highly reliable will have no impact on system accidents
- For relatively simple, electro-mechanical systems with primarily component failure accidents, reliability engineering can increase safety
- But this is untrue for complex, software-intensive sociotechnical systems

Software-Related Accidents
- Are usually caused by flawed requirements
  - Incomplete or wrong assumptions about operation of controlled system or required operation of computer
  - Unhandled controlled-system states and environmental conditions
- Merely trying to get the software correct or to make it reliable will not make it safer under these conditions

Software-Related Accidents (2)
Software may be highly reliable and correct and still be unsafe:
- Correctly implements requirements, but specified behavior is unsafe from a system perspective
- Requirements do not specify some particular behavior required for system safety (incomplete)
- Software has unintended (and unsafe) behavior beyond what is specified in requirements
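The first case above can be illustrated with a small sketch. Everything in it is hypothetical (the requirement text, the function names, the altitude values); it only shows how code can satisfy its requirement exactly and still issue a hazardous command, because the requirement did not consider a spurious sensor signal.

```python
def engine_command(touchdown_signal: bool) -> str:
    """Hypothetical requirement: shut down the descent engine as soon as
    the touchdown signal is asserted. This function implements it exactly."""
    return "SHUTDOWN" if touchdown_signal else "CONTINUE"

def is_hazardous(command: str, altitude_m: float) -> bool:
    """System-level view: cutting the engine while still above the surface is a hazard."""
    return command == "SHUTDOWN" and altitude_m > 0.0

# The software is "correct" in both situations below: it does what the requirement says.
on_surface = engine_command(touchdown_signal=True)
print(is_hazardous(on_surface, altitude_m=0.0))    # genuine touchdown

# A spurious touchdown signal at 40 m: same code, same requirement, unsafe outcome.
in_descent = engine_command(touchdown_signal=True)
print(is_hazardous(in_descent, altitude_m=40.0))
```

No amount of testing against the stated requirement would flag the second call, since the flaw is in the requirement rather than in the code.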

Fumbling for his recline button Ted unwittingly instigates a disaster

STAMP: An Extended Causality Model
- Accidents arise from interactions among humans, machines, and the environment
  - Not simply chains of events or linear causality, but more complex types of causal connections (non-linear, feedback, ...)
- Safety is an emergent property that arises when system components interact with each other within a larger environment
  - A set of constraints related to behavior of system components enforces that property
  - Accidents occur when interactions violate those constraints (a lack of appropriate constraints on the interactions)

STAMP (2)
- Treat safety as a dynamic control problem rather than a component failure problem
  - O-ring did not control propellant gas release by sealing gap in field joint of Challenger Space Shuttle
  - Software did not adequately control descent speed of Mars Polar Lander
  - Temperature in batch reactor not adequately controlled in system design
  - Public health system did not adequately control contamination of the milk supply with melamine
  - Financial system did not adequately control the use of financial instruments
- Events are the result of the inadequate control
  - Result from lack of enforcement of safety constraints in system design and operations

STAMP (3)
A change in emphasis: from "prevent failures" to "enforce safety constraints on system behavior"
- Losses are the result of complex dynamic processes, not simply chains of failure events
- Most major accidents arise from a slow migration of the entire system toward a state of high risk
  - Need to control and detect this migration

STAMP (4)
- Includes component failures and what we do now, but extends to include new types of accident causes
- Compatible with MIL-STD-882
  - No changes to current military standards required to use it
- No harder than what we do now, and in some ways easier

Example Safety Control Structure

Control processes operate between levels of control
[Figure: control loop. The Controller, which holds a Model of Process, issues Control Actions to the Controlled Process and receives Feedback from it.]
Process models must contain:
- Required relationship among process variables
- Current state (values of process variables)
- The ways the process can change state
Explains software errors, human errors, component interaction accidents
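The control loop on this slide can be sketched in a few lines. This is an illustrative toy under stated assumptions, not part of the original deck: the `Tank` and `Controller` classes, the levels, and the fill rate are all hypothetical. The controller decides from its process model (its believed level), and feedback is what keeps that model consistent with the real process.

```python
from dataclasses import dataclass

@dataclass
class Tank:
    """Controlled process (hypothetical): a tank that overflows above capacity."""
    level: float = 5.0
    capacity: float = 10.0

    def apply(self, inflow: float) -> None:
        self.level += inflow

    @property
    def overflowed(self) -> bool:
        return self.level > self.capacity

@dataclass
class Controller:
    """Controller whose decisions rest on its process model, not on the real tank."""
    believed_level: float = 5.0  # process model: current state as the controller sees it
    target: float = 8.0

    def update(self, measured_level: float) -> None:
        self.believed_level = measured_level  # feedback refreshes the process model

    def decide(self) -> float:
        # Control action: fill at 1.0 per step until the believed level reaches the target.
        return 1.0 if self.believed_level < self.target else 0.0

def run(steps: int, feedback_available: bool) -> Tank:
    tank, ctrl = Tank(), Controller()
    for _ in range(steps):
        tank.apply(ctrl.decide())
        if feedback_available:
            ctrl.update(tank.level)
    return tank

print(run(10, feedback_available=True).overflowed)   # model tracks the process
print(run(10, feedback_available=False).overflowed)  # model goes stale, tank overfills
```

In the second run nothing has failed in the reliability sense: the tank, the fill logic, and the controller all behave as designed. The loss comes from a process model that no longer matches the controlled process.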

Uses for STAMP
- More comprehensive accident/incident investigation and root cause analysis
- Basis for new, more powerful hazard analysis techniques (STPA)
- Safety-driven design (physical, operational, organizational)
  - Can integrate safety into the system engineering process
  - Assists in design of human-system interaction and interfaces
- Organizational and cultural risk analysis
  - Identifying physical and project risks
  - Defining safety metrics and performance audits
  - Designing and evaluating potential policy and structural improvements
  - Identifying leading indicators of increasing risk ("canary in the coal mine")
- New holistic approaches to security

Does it work? Is it practical? (Technical)
- Safety analysis of new missile defense system (MDA)
- Safety-driven design of new JPL outer planets explorer
- Safety analysis of the JAXA HTV (unmanned cargo spacecraft to ISS)
- Incorporating risk into early trade studies (NASA Constellation)
- Orion (Space Shuttle replacement)
- Safety of maglev trains (Japan Central Railway)
- NextGen (for NASA, just starting)
- Accident/incident analysis (aircraft, petrochemical plants, air traffic control, railway accident, ...)

Does it work? Is it practical? (Social and Managerial)
- Analysis of the management structure of the Space Shuttle program (post-Columbia)
- Risk management in the development of NASA's new manned space program (Constellation)
- NASA Mission Control: re-planning and changing mission control procedures safely
- Food safety
- Safety in pharmaceutical drug development
- Risk analysis of outpatient GI surgery at Beth Israel Deaconess Hospital
- Analysis and prevention of corporate fraud

Evaluation (1)
- Performed a non-advocate risk assessment for inadvertent launch on the new BMDS
- Deployment and testing of BMDS were held up for 6 months because so many scenarios were identified for inadvertent launch. In many of these scenarios:
  - All components were operating exactly as intended
  - E.g., missing cases in software, obscure timing interactions
  - Could not be found by fault trees or other standard techniques
  - Complexity of component interactions led to unanticipated system behavior
- STPA also identified component failures that could cause inadvertent launch (most analysis techniques consider only these failure events)
- Now being used proactively as changes are made to the system
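One reason STPA surfaces scenarios with no failed component is that it examines each control action for the ways it can be unsafe, rather than starting from component failures. The sketch below enumerates candidate unsafe control actions using the four guide categories commonly used in STPA; the control action names are hypothetical, and each generated row is only a prompt for an analyst to judge against the system hazards.

```python
from itertools import product

# The four ways a control action can be unsafe, as commonly stated in STPA.
GUIDE_PHRASES = [
    "not provided when needed",
    "provided when unsafe",
    "provided too early, too late, or out of order",
    "stopped too soon or applied too long",
]

def candidate_ucas(control_actions):
    """Cross every control action with every guide phrase.

    Returns candidate rows only; deciding which ones lead to a hazard,
    and in what context, remains an analyst's judgment.
    """
    return [f"{action}: {phrase}" for action, phrase in product(control_actions, GUIDE_PHRASES)]

# Hypothetical control actions, not taken from the BMDS or HTV analyses.
actions = ["open valve", "close valve"]
rows = candidate_ucas(actions)
for row in rows:
    print(row)
```

Rows such as "provided too early, too late, or out of order" correspond to the timing interactions mentioned above, which a failure-oriented fault tree has no natural place for.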

Evaluation (2)
- Joint research project between MIT and JAXA to determine feasibility and usefulness of STPA for JAXA projects
- Comparison between STPA and FTA for HTV
  - Problems identified?
  - Resources required?

Comparison between STPA and FTA
[Figure: HTV causal factors, each marked in the original as either "Identified by both (STPA and FTA)" or "Identified by STPA only"]
- ISS component failures
- Crew mistakes in operation
- Crew process model inconsistent
- Activation missing/inappropriate
- Activation delayed
- HTV component failures
- HTV state changes over time
- Out of range radio disturbance
- Physical disturbance
- t, x feedback missing/inadequate
- t, x feedback delayed
- t, x feedback incorrect
- Flight Mode feedback missing/inadequate
- Flight Mode feedback incorrect
- Visual Monitoring missing/inadequate
- Wrong information/directive from JAXA/NASA GS

Conclusions
- Traditional system safety techniques are based on assumptions that are no longer true for the systems we are building
- Trying to add software and human error to them is hopeless
- New, more sophisticated causality models are needed to handle the new causes of accidents
- New, more powerful hazard analysis techniques, like STPA, work better than the traditional techniques for complex systems
- Software that is correct (satisfies its specifications) and reliable is NOT necessarily safe