A New Approach to Safety in Software-Intensive Systems Nancy G. Leveson Aeronautics and Astronautics Dept. Engineering Systems Division MIT
Why do we need a new approach?

"Without changing our patterns of thought, we will not be able to solve the problems we created with our current patterns of thought." (Albert Einstein)

- Traditional safety engineering approaches were developed for relatively simple electro-mechanical systems
- Accidents in complex, software-intensive systems are changing their nature
- The role of humans in systems is changing
- We need more effective techniques for these new systems
"It's still hungry, and I've been stuffing worms into it all day."
Chain-of-Events Model

- Explains accidents in terms of multiple events, sequenced as a forward chain over time
- Simple, direct relationship between events in the chain
- Events almost always involve component failure, human error, or an energy-related event
- Forms the basis for most safety-engineering and reliability-engineering analysis (e.g., FTA, PRA, FMECA, Event Trees) and design (e.g., redundancy, overdesign, safety margins)
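The chain-of-events analyses named above (FTA in particular) reduce an accident to a Boolean combination of discrete failure events. A minimal sketch of that style of evaluation is below; the gate structure and event names are hypothetical, invented for illustration, not taken from any real analysis.

```python
# Minimal sketch of fault-tree evaluation in the chain-of-events tradition:
# the top event occurs when a Boolean combination of basic component-failure
# events occurs. All event names here are hypothetical.

def AND(*branches):
    return lambda events: all(b(events) for b in branches)

def OR(*branches):
    return lambda events: any(b(events) for b in branches)

def basic(name):
    return lambda events: name in events

# Hypothetical tree: loss of cooling requires pump failure AND
# (backup-pump failure OR operator missing the alarm).
top = AND(
    basic("pump_fails"),
    OR(basic("backup_pump_fails"), basic("operator_misses_alarm")),
)

print(top({"pump_fails", "backup_pump_fails"}))  # True: top event occurs
print(top({"pump_fails"}))                       # False: redundancy holds
```

Note what the model can express: only the presence or absence of discrete failure events. An accident in which every component behaves exactly as designed (the "component interaction accidents" discussed on the following slides) has no representation in such a tree.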
"It's never what we don't know that stops us; it's what we do know that just ain't so."

Assumptions:
- Accidents are caused only by system component failure(s)
- Safety is increased by increasing the reliability of the individual system components; if components do not fail, then accidents will not occur
- Software can be treated just like hardware (with perhaps a few minor changes); highly reliable software is safe
- Human error can be treated like hardware failure
Accident with No Component Failures
Types of Accidents

Component Failure Accidents:
- Single or multiple component failures
- Usually assume random failure

Component Interaction Accidents:
- Arise in interactions among components
- Related to interactive complexity and tight coupling
- Exacerbated by the introduction of computers and software
- New technology introduces unknowns and "unk-unks" (unknown unknowns)
Interactive Complexity

The critical factor is intellectual manageability. A simple system has a small number of unknowns in its interactions (within the system and with its environment). A system is interactively complex (intellectually unmanageable) when the level of interactions reaches the point where they can no longer be thoroughly:
- Planned
- Understood
- Anticipated
- Guarded against
"It's only a random failure, sir! It will never happen again."
Safety ≠ Reliability

- Safety and reliability are NOT the same; sometimes increasing one can even decrease the other
- Making all the components highly reliable will have no impact on system accidents
- For relatively simple, electro-mechanical systems with primarily component failure accidents, reliability engineering can increase safety, but this is untrue for complex, software-intensive sociotechnical systems
Software-Related Accidents

Are usually caused by flawed requirements:
- Incomplete or wrong assumptions about the operation of the controlled system or the required operation of the computer
- Unhandled controlled-system states and environmental conditions

Merely trying to get the software correct or to make it reliable will not make it safer under these conditions.
Software-Related Accidents (2)

Software may be highly reliable and correct and still be unsafe:
- It correctly implements its requirements, but the specified behavior is unsafe from a system perspective
- The requirements do not specify some particular behavior required for system safety (they are incomplete)
- The software has unintended (and unsafe) behavior beyond what is specified in the requirements
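The second case on this slide, an incomplete requirement, can be made concrete with a small hypothetical sketch. The requirement, valve, and sensor failure code below are invented for illustration: the code satisfies its stated requirement exactly, yet the requirement says nothing about a failed sensor, so an unsafe behavior slips through with no software "error" anywhere.

```python
# Hypothetical illustration of "correct but unsafe": the (incomplete)
# requirement says only "close the valve when pressure exceeds 100".
# The code implements that requirement perfectly, but the requirement
# never mentions the sensor's failure value, so a failed sensor
# yields a command that is correct per the spec and unsafe in fact.

FAILED_SENSOR = -1  # hypothetical failure code the requirement omits

def valve_command(pressure: int) -> str:
    """Correctly implements the stated requirement."""
    if pressure > 100:
        return "close"
    return "open"

print(valve_command(120))            # "close": as required
print(valve_command(FAILED_SENSOR))  # "open": meets the spec, unsafe
```

No amount of testing against the requirement, and no increase in reliability, would flag the second call as wrong; the flaw is in what the requirement fails to say.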
"Fumbling for his recline button, Ted unwittingly instigates a disaster."
STAMP: An Extended Causality Model

- Accidents arise from interactions among humans, machines, and the environment: not simply chains of events or linear causality, but more complex types of causal connections (non-linear, feedback, …)
- Safety is an emergent property that arises when system components interact with each other within a larger environment
- A set of constraints related to the behavior of the system components enforces that property
- Accidents occur when interactions violate those constraints (a lack of appropriate constraints on the interactions)
STAMP (2)

Treat safety as a dynamic control problem rather than a component failure problem:
- The O-ring did not control propellant gas release by sealing the gap in the field joint of the Challenger Space Shuttle
- The software did not adequately control the descent speed of the Mars Polar Lander
- The temperature in a batch reactor was not adequately controlled in the system design
- The public health system did not adequately control contamination of the milk supply with melamine
- The financial system did not adequately control the use of financial instruments

Events are the result of the inadequate control; they result from a lack of enforcement of safety constraints in system design and operations.
STAMP (3): A change in emphasis

From preventing failures to enforcing safety constraints on system behavior:
- Losses are the result of complex dynamic processes, not simply chains of failure events
- Most major accidents arise from a slow migration of the entire system toward a state of high risk
- We need to control and detect this migration
STAMP (4)

- Includes component failures and what we do now, but extends to include new types of accident causes
- Compatible with MIL-STD-882; no changes to current military standards are required to use it
- No harder than what we do now, and in some ways easier
Example Safety Control Structure
Control processes operate between levels of control: a Controller issues Control Actions to a Controlled Process and receives Feedback from it, deciding what to do based on its Model of the Process.

Process models must contain:
- The required relationships among process variables
- The current state (values of the process variables)
- The ways the process can change state

This explains software errors, human errors, and component interaction accidents.
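The control loop on this slide can be sketched in a few lines. The controller, altitude values, and thresholds below are hypothetical, chosen only to show the mechanism: the controller acts on its *model* of the process, not on the process itself, so when feedback is missing or delayed the model diverges from reality and a control action that is "correct" with respect to the model becomes unsafe in fact.

```python
# Sketch of the STAMP control loop: a controller issues control actions
# based on its model of the controlled process, and feedback keeps that
# model consistent with the real process state. All names and numbers
# here are hypothetical.

class Controller:
    def __init__(self):
        self.model_altitude = 1000  # controller's belief, not ground truth

    def feedback(self, measured_altitude):
        # Feedback updates the process model.
        self.model_altitude = measured_altitude

    def control_action(self):
        # The decision uses the model, not the actual process state.
        return "fire_descent_engine" if self.model_altitude < 100 else "coast"

actual_altitude = 50          # real state of the controlled process
ctrl = Controller()

print(ctrl.control_action())  # "coast": stale model, unsafe action
ctrl.feedback(actual_altitude)
print(ctrl.control_action())  # "fire_descent_engine": model now consistent
```

The same structure accounts for the accident types the slide lists: a software controller with a wrong process model, a human operator whose mental model of the plant is out of date, or two correctly functioning components whose models of each other diverge.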
Uses for STAMP

- More comprehensive accident/incident investigation and root cause analysis
- Basis for new, more powerful hazard analysis techniques (STPA)
- Safety-driven design (physical, operational, organizational)
- Can integrate safety into the system engineering process
- Assists in design of human-system interaction and interfaces
- Organizational and cultural risk analysis
- Identifying physical and project risks
- Defining safety metrics and performance audits
- Designing and evaluating potential policy and structural improvements
- Identifying leading indicators of increasing risk (the "canary in the coal mine")
- New holistic approaches to security
Does it work? Is it practical? (Technical)

- Safety analysis of a new missile defense system (MDA)
- Safety-driven design of a new JPL outer planets explorer
- Safety analysis of the JAXA HTV (unmanned cargo spacecraft to the ISS)
- Incorporating risk into early trade studies (NASA Constellation)
- Orion (Space Shuttle replacement)
- Safety of maglev trains (Japan Central Railway)
- NextGen (for NASA, just starting)
- Accident/incident analysis (aircraft, petrochemical plants, air traffic control, railway accidents, …)
Does it work? Is it practical? (Social and Managerial)

- Analysis of the management structure of the Space Shuttle program (post-Columbia)
- Risk management in the development of NASA's new manned space program (Constellation)
- NASA mission control: re-planning and changing mission control procedures safely
- Food safety
- Safety in pharmaceutical drug development
- Risk analysis of outpatient GI surgery at Beth Israel Deaconess Hospital
- Analysis and prevention of corporate fraud
Evaluation (1)

Performed a non-advocate risk assessment for inadvertent launch on the new BMDS. Deployment and testing of the BMDS were held up for 6 months because so many scenarios for inadvertent launch were identified.

In many of these scenarios:
- All components were operating exactly as intended (e.g., missing cases in software, obscure timing interactions)
- The scenarios could not be found by fault trees or other standard techniques
- The complexity of the component interactions led to unanticipated system behavior

STPA also identified component failures that could cause inadvertent launch (most analysis techniques consider only these failure events). The analysis is now being used proactively as changes are made to the system.
Evaluation (2)

Joint research project between MIT and JAXA to determine the feasibility and usefulness of STPA for JAXA projects: a comparison between STPA and FTA for the HTV. What problems were identified? What resources were required?
Comparison between STPA and FTA

Causal factors considered (the original figure marks each as identified by both STPA and FTA, or as identified by STPA only):
- ISS component failures
- Crew mistakes in operation
- Crew process model inconsistent
- Activation missing/inappropriate
- Activation delayed
- HTV component failures
- HTV state changes over time
- Out-of-range radio disturbance
- Physical disturbance
- t, x feedback missing/inadequate
- t, x feedback delayed
- t, x feedback incorrect
- Flight Mode feedback missing/inadequate
- Flight Mode feedback incorrect
- Visual monitoring missing/inadequate
- Wrong information/directive from JAXA/NASA GS
Conclusions

- Traditional system safety techniques are based on assumptions that are no longer true for the systems we are building; trying to add software and human error to them is hopeless
- New, more sophisticated causality models are needed to handle the new causes of accidents
- New, more powerful hazard analysis techniques, like STPA, work better than the traditional techniques for complex systems
- Software that is correct (satisfies its specifications) and reliable is NOT necessarily safe