Designing for recovery New challenges for large-scale, complex IT systems


Designing for recovery New challenges for large-scale, complex IT systems Prof. Ian Sommerville School of Computer Science St Andrews University Scotland

St Andrews A small Scottish town on the north-east coast of the UK. Home of golf. Scotland's oldest university (founded in 1413). A small university focusing on research and teaching excellence.

A question to the audience A system is designed to maintain the value of some integer variable (say B), and to provide information about B to users. The value of this variable [in the world] is X, with the value of X changing over time. The system specification states that the value of B should be X. Sometimes the system reports to users (correctly) that B = X; sometimes the system reports to users that B = Y, where Y < X. In circumstances where the system reports that B = Y (i.e. it provides an incorrect value), is this a failure?

Complex IT systems Organisational systems that support different functions within an organisation. Can usually be considered as systems of systems, i.e. different parts are systems in their own right. Usually distributed, and normally constructed by integrating existing systems/components/services. Not subject to limitations derived from the laws of physics (so, no natural constraints on their size). Data intensive, with very long lifetime data. An integral part of wider socio-technical systems.

Characteristics of complex IT systems Operational independence of the system elements Managerial independence of the system elements Multiple stakeholder viewpoints Evolutionary development Emergent behaviour Geographic distribution

Socio-technical systems Diagram: a software-intensive system embedded within system users, business processes, organisational culture, and laws, regulations, custom & practice.

Reductionism Reductionism: an approach to understanding the nature of complex things by reducing them to the interactions of their parts, or to simpler or more fundamental things. Reductionism underpins most engineering, including software engineering. Reductionism has problems with scale: when things get too big, reductionist approaches become intellectually unmanageable because of the complexity of the interactions between the parts of the whole.

Software engineering Developments in software engineering have largely adopted a reductionist perspective: Design methodologies Formal methods Agile approaches Software architecture Model-driven engineering Reductionist approaches to software engineering have been successful in allowing us to construct larger software systems More effective reductionist approaches allow us to deal with increasingly complicated systems.

Reductionist assumptions Control Reductionist approaches assume that we have control over the organisation of the system. It is then possible to decompose the system into parts that can themselves be engineered using reductionist approaches A rational world Reductionist approaches assume that rationality will be the principal influence in decision making Definable problems Reductionist approaches assume that the problem can be defined and the system boundaries established

Complex and complicated systems Reductionist approaches are intended to help deal with complicated systems, i.e. systems where there are many interactions between components but which can (in principle) be understood and controlled. However, we are now building complex systems where it is impossible to acquire and maintain a complete understanding of the system, and where elements are independently controlled and often have undocumented side-effects.

Services = complexity Diagram: four sets of services (S1, S2, S3, S4) with many interconnections between them.

What is failure? From a reductionist perspective, a failure can be considered to be a deviation from a specification. An oracle can examine a specification, observe a system's behaviour and detect failures. Failure is an absolute: the system has either failed or it hasn't. Of course, some failures are more serious than others; it is widely accepted that failures with minor consequences are to be expected and tolerated.
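This reductionist "oracle" view can be sketched in a few lines of Python. Everything here is a hypothetical illustration (the bed-count specification anticipates the hospital example later in the talk); it is not the specification of any real system.

```python
# A minimal sketch of the reductionist view of failure: the specification
# is a predicate, and an "oracle" flags any observed behaviour that
# deviates from it. The bed-count spec below is a hypothetical example.

def make_oracle(spec):
    """Wrap a specification predicate as an absolute failure detector."""
    def oracle(observed):
        # Failure is absolute: the spec either holds or it does not.
        return not spec(observed)
    return oracle

# Spec: the reported number of beds must equal the actual number.
bed_spec = lambda obs: obs["reported"] == obs["actual"]
detect_failure = make_oracle(bed_spec)

print(detect_failure({"reported": 5, "actual": 5}))  # False: meets spec
print(detect_failure({"reported": 3, "actual": 5}))  # True: a "failure"
```

Note what this oracle cannot express: the judgement, made later in the talk, that an under-report only matters when it affects whether a patient can actually be admitted.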

A question to the audience A hospital system is designed to maintain information about available beds for incoming patients and to provide information about the number of beds to the admissions unit. It is assumed that the hospital has a number of empty beds and this changes over time. The variable B reflects the number of empty beds known to the system. Sometimes the system reports that the number of empty beds is the actual number available; sometimes the system reports that fewer than the actual number are available. In circumstances where the system reports that an incorrect number of beds are available, is this a failure?

Bed management system The percentage of system users who considered the system's incorrect reporting of the number of available beds to be a failure was 0%. Mostly, the number did not matter so long as it was greater than 1. What mattered was whether or not patients could be admitted to the hospital. When the hospital was very busy (available beds = 0), then people understood that it was practically impossible for the system to be accurate. They used other methods to find out whether or not a bed was available for an incoming patient.

Failure is a judgement Specifications are a simplification of reality. Users don't read, and don't care about, specifications. Whether or not system behaviour should be considered to be a failure depends on the judgement of an observer of that behaviour. This judgement depends on: The observer's expectations The observer's knowledge and experience The observer's role The observer's context or situation The observer's authority

System failure Failures are not just catastrophic events but normal, everyday system behaviour that disrupts normal work and that means people have to spend more time on a task than necessary. A system failure occurs when a direct or indirect user of a system has to carry out extra work, over and above that normally required to carry out some task, in response to some inappropriate system behaviour. This extra work constitutes the cost of recovery from system failure.

Failures are inevitable Technical reasons When systems are composed of opaque and uncontrolled components, the behaviour of these components cannot be completely understood. Failures can often be considered failures in data rather than failures in behaviour. Socio-technical reasons Changing contexts of use mean that the judgement of what constitutes a failure changes as the effectiveness of the system in supporting work changes. Different stakeholders will interpret the same behaviour in different ways because of different interpretations of the problem.

Conflict inevitability Impossible to establish a set of requirements where stakeholder conflicts are all resolved Therefore, successful operation of a system for one set of stakeholders will inevitably mean failure for another set of stakeholders Groups of stakeholders in organisations are often in perennial conflict (e.g. managers and clinicians in a hospital). The support delivered by a system depends on the power held at some time by a stakeholder group.

Where are we? Large-scale information systems are inevitably complex systems Such systems cannot be created using a reductionist approach Failures are a judgement and this may change over time Failures are inevitable and cannot be engineered out of a system

The way forward Systems design has to be seen as part of a wider process of socio-technical systems engineering We need to accept that technical system failures will always occur and examine how we can design these systems to allow the broader socio-technical systems to recognise, diagnose and recover from these failures

Software dependability A reductionist approach to software dependability takes the view that software failures are a consequence of software faults Techniques to improve dependability include Fault avoidance Fault detection Fault tolerance These approaches have taken us quite a long way in improving software dependability. However, further progress is unlikely to be achieved by further improvement of these techniques as they rely on a reductionist view of failure.

Failure recovery Recognition Recognise that inappropriate behaviour has occurred Hypothesis Formulate an explanation for the unexpected behaviour Recovery Take steps to compensate for the problem that has arisen
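The three stages above can be sketched as a simple driver loop. The recognise/hypothesise/recover functions are hypothetical stand-ins for whatever people (or tools) actually do at each stage; none of the names comes from a real system.

```python
# Sketch of the recognition -> hypothesis -> recovery cycle.

def handle_incident(behaviour, recognise, hypothesise, recover):
    """Drive one pass through the failure-recovery cycle."""
    if not recognise(behaviour):
        return "normal operation"          # nothing inappropriate seen
    explanation = hypothesise(behaviour)   # why might this have happened?
    return recover(explanation)            # compensate for the problem

# Hypothetical usage, reusing the bed-management example:
recognise = lambda b: b["reported"] != b["actual"]
hypothesise = lambda b: "stale data from ward returns"
recover = lambda why: f"fallback: phone the wards ({why})"

print(handle_incident({"reported": 3, "actual": 5},
                      recognise, hypothesise, recover))
```

The point of separating the stages is that design can help each one differently: earlier recognition, visibility for hypothesis formation, and flexibility for recovery actions.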

Coping with failure Socio-technical systems are remarkably robust because people are good at coping with unexpected situations when things go wrong. We have the unique ability to apply previous experience from different areas to unseen problems. Individuals can take the initiative, adopt responsibilities and, where necessary, break the rules or step outside the normal process of doing things. People can prioritise and focus on the essence of a problem

Recovering from failure Local knowledge Who to call; who knows what; where things are Process reconfiguration Doing things in a different way from that defined in the standard process Work-arounds, breaking the rules (safe violations) Redundancy and diversity Maintaining copies of information in different forms from that maintained in a software system Informal information annotation Using multiple communication channels Trust Relying on others to cope

Design for recovery The aim of a strategy of design for recovery is to: Ensure that system design decisions do not increase the amount of recovery work required Make system design decisions that make it easier to recover from problems Earlier recognition of problems Visibility to make hypotheses easier to formulate Flexibility to support recovery actions Designing for recovery is a holistic approach to system design and not (just) the identification of recovery requirements Should support the natural ability of people and organisations to cope with problems

Problems Security and recoverability Automation hiding Process tyranny Multi-organisational systems

Security and recoverability There is an inherent tension between security and recoverability. Recoverability Relies on trusting operators of the system not to abuse privileges that they may have been granted to help recover from problems. Security Relies on mistrusting users and restricting access to information on a need-to-know basis.

Automation hiding A problem with automation is that information becomes subject to organisational policies that restrict access to that information. Even when access is not restricted, we don't have any shared culture of how to organise a large information store. Generally, the authorisation model maintained by the system is based on normal rather than exceptional operation. When problems arise and/or when people are unavailable, breaking the rules to solve these problems is made more difficult.

Process tyranny Increasingly, there is a notion that standard business processes can be defined and embedded in systems that support these processes. Implicitly or explicitly, the system enforces the use of the standard process. But this assumes three things: The standard process is always appropriate The standard process has anticipated all possible failures The system can respond in a timely way to process changes

Multi-organisational systems Many rules enforced in different ways by different systems. No single manager or owner of the system. Who do you call when failures occur? Information is distributed - users may not be aware of where information is located, who owns information, etc. Processes involve remote actors so process reconfiguration is more difficult Restricted information channels (e.g. help unavailable outside normal business hours; no phone numbers published, etc.) Lack of trust. Owners of components will blame other components for system failure. Learning is inhibited and trust compromised.

Design guidelines Local knowledge Process reconfiguration Redundancy and diversity Trust

Local knowledge Local knowledge includes knowledge of who does what, how authority structures can be bypassed, what rules can be broken, etc. Impossible to replicate entirely in distributed systems but some steps can be taken Maintain information about the provenance of data Who provided the data, where the data came from, when it was created, edited, etc. Maintain organisational models Who is responsible for what, contact details
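Maintaining provenance, as suggested above, might look like the following minimal sketch: each record carries who provided it, where it came from, and when it was created or edited. The class and field names are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Sketch: data carries its own provenance so that, when problems arise,
# people know who to ask. Field names are illustrative assumptions.

@dataclass
class ProvenanceRecord:
    value: object
    provided_by: str      # who provided the data
    source: str           # where the data came from
    created: datetime     # when it was created
    edits: list = field(default_factory=list)

    def edit(self, editor: str, new_value) -> None:
        """Record the editor and previous value before overwriting."""
        self.edits.append((editor, self.value,
                           datetime.now(timezone.utc)))
        self.value = new_value

rec = ProvenanceRecord(value=2, provided_by="ward clerk",
                       source="ward A3 count",
                       created=datetime.now(timezone.utc))
rec.edit("admissions desk", 1)
print(rec.value, len(rec.edits))  # 1 1
```

The edit history is what makes the record useful for recovery: it points at the people who hold the local knowledge about how the value came to be.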

Process reconfiguration Make workflows explicit rather than embedding them in the software. Not just continue buttons! Users should know where they are and where they are supposed to go. Support workflow navigation/interruption/restart. Design systems with an emergency mode where the system changes from enforcing policies to auditing actions. This would allow the rules to be broken, but the system would maintain a log of what has been done and why, so that subsequent investigations could trace what happened. Support "Help, I'm in trouble!" as well as "Help, I have a problem".
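The emergency mode described above can be sketched as a gate that normally enforces an authorisation policy, but in emergency mode permits rule-breaking while auditing every such action. All names here are hypothetical; a real system would also need to control who can declare an emergency.

```python
# Sketch: enforce-by-default, audit-in-emergency.

class PolicyGate:
    def __init__(self, policy):
        self.policy = policy        # policy(user, action) -> bool
        self.emergency = False
        self.audit_log = []

    def perform(self, user, action, reason=None):
        if self.policy(user, action):
            return True             # normal, authorised action
        if self.emergency:
            # Rules may be broken, but the system records what was
            # done and why, so investigations can trace what happened.
            self.audit_log.append((user, action, reason))
            return True
        return False                # policy enforced: action refused

gate = PolicyGate(policy=lambda user, action: user == "duty manager")
print(gate.perform("nurse", "reassign bed"))   # False: refused
gate.emergency = True
print(gate.perform("nurse", "reassign bed",
                   reason="duty manager unreachable"))  # True: audited
```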

Redundancy and diversity Maintaining a single golden copy of data may be efficient but it may not be effective or desirable Encourage the creation of shadow systems and provide import and export from these systems Allow schemas to be extended Schemas for data are rarely designed for problem solving. Always allow informal extension (a free text box) so that annotations, explanations and additional information can be provided Maintain organisational models To allow for multi-channel communications when things go wrong
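The "free text box" suggestion above amounts to leaving an informal extension point in every schema. A trivial sketch, with invented field names:

```python
# Sketch: structured fields plus a free-text extension point, so that
# annotations and explanations the schema designers never anticipated
# can be attached while a problem is being solved.

bed_record = {
    "ward": "A3",
    "beds_free": 2,
    "annotations": [],          # free-text extension point
}
bed_record["annotations"].append(
    "Bed 7 shows as free but is away for repair; call estates first."
)
print(len(bed_record["annotations"]))  # 1
```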

Trust Trust is inherent in problem recovery as it involves trusting people to be well-intentioned and to focus on solving the problem rather than on narrower concerns As we move away from co-located systems, which allow personal relationships to be created, establishing trust becomes more and more difficult There is some research on trust models but it is not clear (to me) how this can be applied to recoverability

Current research Our current work is concerned with the development of responsibility models that make responsibilities across different organisations explicit These models show who is responsible for what and the resources required to discharge responsibilities They provide a basis for maintaining local knowledge about a situation and discovering who to involve when problems have to be solved
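In minimal form, a responsibility model of the kind described above might look like the following sketch. All entries and names are invented for illustration; the talk does not specify a notation.

```python
# Hypothetical sketch of a responsibility model: who is responsible for
# what, and which resources they need to discharge the responsibility.

responsibilities = {
    "bed management": {
        "responsible": "admissions manager",
        "contact": "ext. 4410",
        "resources": ["bed management system", "ward phone list"],
    },
}

def who_to_call(task: str) -> str:
    """Local knowledge: discover who to involve when problems arise."""
    entry = responsibilities.get(task)
    return entry["contact"] if entry else "unknown"

print(who_to_call("bed management"))  # ext. 4410
```

Making such a model explicit, and shared across organisations, is what lets it serve as a substitute for the informal local knowledge that distributed systems erode.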

Summary A reductionist approach to software engineering is no longer viable, on its own, for complex systems engineering. Improving existing software engineering methods will help but will not deal with the problems of complexity that are inherent in distributed systems of systems. We must learn to live with normal, everyday failures. Design for recovery involves designing so that the work required to recover from a failure is minimised. Recovery strategies include supporting information redundancy and annotation, and maintaining organisational models.