CONFERENCE PROCEEDINGS OF REFEREED PAPERS

Proceedings of the Improving Systems and Software Engineering Conference (ISSEC)
Achieving the Vision
Canberra, 10-12 August 2009

Editor: Angela Tuffley

All papers contained in these Proceedings have been subjected to anonymous peer review by at least two members of the review panel. Publication of any data in these proceedings does not constitute a recommendation. Any opinions, findings or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the conference sponsors. All papers in this publication are copyright (but have been released for reproduction) by their respective authors and/or organisations.

ISBN: 978-0-9807680-0-8

CONFERENCE SECRETARIAT
Eventcorp Pty Ltd
PO Box 3873
South Brisbane BC QLD 4101 AUSTRALIA
Tel: +617 3334 4460
Fax: +617 3334 4499

CONTENTS

ABSTRACTS AND BIOGRAPHIES

SYSTEMS ENGINEERING AND SYSTEMS INTEGRATION
THE EFFECT OF THE DISCOVERY OF ADDITIONAL WORK ON THE DYNAMIC PROJECT WORK MODEL .... 15
PETER A.D. BROWN, ALAN C. MCLUCAS, MICHAEL J. RYAN
LESSONS LEARNED FROM THE SYSTEMS ENGINEERING MICROCOSM SANDPIT .... 27
QUOC DO, PETER CAMPBELL, SHRAGA SHOVAL, MATTHEW J. BERRYMAN, STEPHEN COOK, TODD MANSELL AND PHILLIP RELF
MAKING ARCHITECTURES WORK FOR ANALYSIS .... 37
RUTH GANI, SIMON NG
SYSTEMS ENGINEERING IN-THE-SMALL: A PRECURSOR TO SYSTEMS ENGINEERING IN-THE-LARGE .... 49
PHILLIP A. RELF, QUOC DO, SHRAGA SHOVAL, TODD MANSELL, STEPHEN COOK, PETER CAMPBELL, MATTHEW J. BERRYMAN

SOFTWARE ENGINEERING
REQUIREMENTS MODELLING OF BUSINESS WEB APPLICATIONS: CHALLENGES AND SOLUTIONS .... 65
ABBASS GHANBARY, JULIAN DAY
DESIGN UNCERTAINTY THEORY - EVALUATING SOFTWARE SYSTEM ARCHITECTURE COMPLETENESS BY EVALUATING THE SPEED OF DECISION MAKING .... 77
TREVOR HARRISON, PROF. PETER CAMPBELL, PROF. STEPHEN COOK, DR. THONG NGUYEN

PROCESS IMPROVEMENT
APPLYING BEHAVIOR ENGINEERING TO PROCESS MODELING .... 95
DAVID TUFFLEY, TERRY ROUT

SAFETY MANAGEMENT AND ENGINEERING
KEYNOTE ADDRESS AND PAPER
BRINGING RISK-BASED APPROACHES TO SOFTWARE DEVELOPMENT PROJECTS .... 111
FELIX REDMILL
MODEL-BASED SAFETY CASES USING THE HIVE WRITER .... 123
TONY CANT, JIM MCCARTHY, BRENDAN MAHONY AND KYLIE WILLIAMS
THE APPLICATION OF HAZARD RISK ASSESSMENT IN DEFENCE SAFETY STANDARDS .... 135
C.B.H. EDWARDS; M. WESTCOTT; N. FULTON
INTEGRATING SAFETY AND SECURITY INTO THE SYSTEM LIFECYCLE .... 147
BRUCE HUNTER
WHAT CAN THE AGENT PARADIGM OFFER SAFETY ENGINEERING? .... 159
LOUIS LIU, ED KAZMIERCZAK AND TIM MILLER
COMPLEXITY & SAFETY: A CASE STUDY .... 171
GEORGE NIKANDROS

Programme Committee and Peer Review Panel

Programme Chair: Angela Tuffley (Griffith University)
Systems Engineering Chair: Stephen Cook (DASI, University of South Australia)
Software Engineering Chair: Leon Sterling (University of Melbourne)
Systems Integration Chair: Todd Mansell (Defence Science & Technology Organisation)
Process Improvement Chair: Terry Rout (Software Quality Institute, Griffith University)
Safety Management and Engineering Chair: Tony Cant (Defence Science & Technology Organisation)

Committee and Review Panel Members
Wesley Acworth, Warwick Adler, Matt Ashford, Judy Bamberger, Clive Boughton, Peter Campbell, David Carrington, Quoc Do, Tim Ferris, Aditya Ghose, Jim Kelly, Martyn Kibel, Peter Lindsay, Viviana Mascardi, Tariq Mahmood, Brendan Mahony, Duncan McIntyre, Tafline Murnane, George Nikandros, Adrian Pitman, Phillip A. Relf, Stephen Russell, Mark Staples, Paul Strooper, Kuldar Taveter, Richard Thomas

ABSTRACTS AND BIOGRAPHIES

“The Effect of the Discovery of Additional Work on the Dynamic Project Work Model” - Peter Brown

ABSTRACT
In their paper “Knowledge sharing and trust in collaborative requirements analysis”, Luis Luna-Reyes et al built on the body of system dynamics knowledge to propose a model of project work that demonstrates key dynamic elements of IT projects. By expanding the model to include additional criteria, useful insights can be derived into the likely impact of work that is required to achieve essential project outcomes but is not identified at the beginning of a project. Essential new work discovered during the course of a project increases the project’s scope, requiring re-planning. In addition to the relatively straightforward scope increase resulting from undiscovered work, work discovered late in a project usually requires some of the work already completed satisfactorily – particularly integration work, testing and sometimes design and fabrication – to be redone. Where scope changes resulting from these two impacts are significant, re-approval or even project termination may result. Organisations can use insights gained through application of the expanded model to improve initial project planning and to manage more effectively the ‘unknowns’ associated with project scope.

BIOGRAPHY
In 2004, following a successful career in the RAAF, Peter joined KoBold Group, a rapidly growing services and systems company recognised in both government and commercial sectors for innovative, high quality systems solutions and implementations. Peter is currently part of the Customs project team responsible for managing the Australian Maritime Identification System Program, a high profile, national maritime security-related initiative. Peter teaches engineering management and system dynamics at UNSW (ADFA) in a part-time capacity and is currently enrolled in a research degree into the system dynamics aspects of project management through the School of Information Technology and Electrical Engineering.

“Lessons Learned from the Systems Engineering Microcosm Sandpit” – Quoc Do

ABSTRACT
Lessons learned can be highly valuable to engineering projects.
They provide a means for systems engineers to thoroughly investigate and anticipate potential project risks before starting the project. Up-front analysis of the end-to-end process pays on-going dividends. This paper describes: 1) an evolutionary Microcosm for investigating systems integration issues, fostering model-based systems engineering research, and accelerating systems engineering education; 2) the lessons learned during the first stage of the Microcosm development; and 3) how these lessons learned have informed the design and implementation of the Microcosm Stage Two. Interestingly, the lessons learned from the Microcosm Stage One reflect many of the common lessons learned found in much larger industry projects. This demonstrates the Microcosm Sandpit’s capability to replicate a wide range of systems development issues common to complex systems. Thus it provides an ideal environment for systems engineering education, training and research.

BIOGRAPHY
Dr. Quoc Do works at the Defence and Systems Institute (DASI), University of South Australia. He completed his B.Eng, M.Eng and PhD at the University of South Australia in 2000, 2002 and 2006 respectively. His research interests are in the areas of mobile robotics (UAVs & UGVs), vision systems, systems engineering and systems integration research and education, and model-based systems engineering.

“Making Architectures Work For Analysis” - Ruth Gani

ABSTRACT
The Department of Defence mandates an architectural approach to developing and specifying capability. According to the Chief Information Officer Group, application of the Defence Architectural Framework ‘enables ICT architectures to contribute to effectively acquiring and implementing business and military capabilities, ensuring interoperability, and value for money and can be used to enhance the decision making process’. Architectural views form a component of the Operational Concept Document for each major project within Defence. However, the utility of architectures is often limited by poor appreciation of their potential to support design decisions. In support of a key Defence capability acquisition project, the Defence Science and Technology Organisation used project architectural data to support analysis of the risks associated with the integration of the capability into the broader Defence Information Environment (DIE). Key to the analysis were:
• transformation of the architectural data into an analytically tractable form;
• creation of a comprehensive database open to querying; and
• presentation of results in a manner meaningful to decision-makers.
Results were expressed in terms of the impact that poor compliance with accepted DIE standards would have on the ability of the proposed system to conduct operational missions. The methodology used provides a template for future effective analytical use of architectures – important if Defence is to make best use of the architectural information required for each project. The study also highlights the importance of building an architecture with a view to its purpose and touches on the challenges of stove-piped architectural development.

BIOGRAPHY
Ruth Gani has worked for the Defence Science and Technology Organisation since 2001. She has been involved in a range of studies, including analysis to support capability acquisition, architecture development and evaluation, and technology trending.
“Systems Engineering In-The-Small: A Precursor to Systems Engineering In-The-Large” - Phillip A. Relf

ABSTRACT
The teaching of the systems engineering process is made problematic by the breadth of experience required of a practising systems engineer. A ‘gentle’ project-based introduction to systems engineering practice is currently being investigated under the Microcosm programme. The Microcosm programme integrates a robotics-based system-of-systems as the first stage in building a systems engineering teaching environment. Lessons learnt have been collected from the Microcosm Stage One project, and the systems engineering processes used during the project have been captured. This paper analyses the processes and lessons learnt, compares them against typical large-scale Defence systems engineering projects, and discusses the lessons learnt captured by the systems engineers who had been working in-the-small. While executing the case study it was found that the use of robust industrial systems engineering processes would have militated against the occurrence of those lessons learnt that are already known to industry, but that applying these industrial processes would have caused the Microcosm project schedule to be exceeded. As the Microcosm Stage One project was successfully completed, effort must now be expended to ensure that the participants understand the limitations and strengths of systems engineering in-the-small procedures and also understand the issues associated with scaling up the procedures.

BIOGRAPHY
Dr. Phillip Anthony Relf gained his PhD from the Engineering faculty of the University of Technology, Sydney, Australia. He has over three decades of experience in large system integration, most of which was gained while working in the Defence industry.

“Requirements Modelling of Business Web Applications: Challenges and Solutions” - Abbass Ghanbary

ABSTRACT
The success of web application development projects depends greatly upon the accurate capture of the business requirements. This paper discusses the limitations of current modelling techniques in capturing business requirements in order to engineer a new software system. These limitations are identified by modelling the flow of information in the process of converting user requirements to a physical system. This paper also defines the factors that influence change in business requirements. The captured business requirements are then translated into pictorial and visual illustrations in order to simplify the complex project. The authors define the limitations of current modelling techniques in communicating those business requirements to various stakeholders. The authors also review possible solutions to those limitations, which will form the basis for a more systematic investigation in the future.

BIOGRAPHY
Abbass Ghanbary (PhD) completed his PhD at the University of Western Sydney. Abbass is focused on the issues and challenges faced by business integration modelling techniques, investigating improvements to Web Services applications across multiple organisations. Abbass is also a consultant in industry in addition to his lecturing and tutoring at the university. He is a member of the Australian Computer Society and is active in attending various forums, seminars and discussions.
Abbass is also a committee member of the Quantitative Enterprise Software Performance (QESP) association. “DESIGN UNCERTAINTY THEORY - Evaluating Software System Architecture Completeness by Evaluating the Speed of Decision Making” - Trevor Harrison ABSTRACT There are two common approaches to software architecture evaluation [Spinellis09, p.19]. The first class of evaluation methods determines properties of the architecture, often by modelling or simulation of one or more aspects of the system. The second, and broadest, class of evaluation methods is based on questioning the architects to assess the architecture. This research paper details a third, more finegrained approach to evaluation by assuming an architecture emanates from a large set of design and design-related decisions. Evaluating an architecture by evaluating decision making and decision rationale is not new (see Section 3). The novel approach here is to base an evaluation largely on the time dimensions of decision making. These time dimensions are (1) time allowed for architecting, and (2) speed of architecting. It is proposed that progress of architecture can be measured at any point in time. For example: “Is this project on track during the concept development stage of a system life cycle?” The answer can come from knowing how many decisions should be expected to be finalised at a particular moment in time, taking into account a plethora of human factors affecting the prevailing decision-making environment. Though aimed at ongoing evaluations of large military software architectures, the literature review for this research will examine architectural decisions from the disciplines of systems engineering, information technology, product management and enterprise architecture. BIOGRAPHY Trevor Harrison's research interests are in software systems architecture and knowledge management. His background is in software development (real-time information systems), technology change management and software engineering process improvement. Before studying full-time for a PhD, he spent 6 years with Logica and 11 years with the Motorola Australia Software Centre. He has a BSc(Hons) in Information Systems from Staffordshire University and an MBA (TechMgt) from La Trobe University. 7 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 “Applying Behavior Engineering to Process Modeling” - David Tuffley ABSTRACT The natural language used by people in everyday life to express themselves is often prone to ambiguity. Examples abound of misunderstandings occurring due to a statement having two or more possible interpretations. In the software engineering domain, clarity of expression when specifying the requirements of software systems is one situation where absence of ambiguity is important. Dromey’s (2006) Behavior Engineering is a formal method that reduces or eliminates ambiguity in software requirements. This paper seeks an answer to the question: can Dromey’s (2006) Behavior Engineering reduce or eliminate ambiguity when applied to the development of a Process Reference Model? BIOGRAPHY David Tuffley is a Lecturer in the School of ICT at Griffith University, and a Consultant with the Software Quality Institute since 1999. Before academia, David consulted in the computer industry for 17 years. Beginning in London in the 1980's as a Technical Writer, he progressed to business analysis and software process improvement work. 
His commercial work has been in the public and private sectors in the United Kingdom and Australia. David is currently doing postgraduate research on developing a process reference model for the leadership of virtual teams.

KEYNOTE ADDRESS AND PAPER

“Bringing Risk-Based Approaches to Software Development Projects” - Felix Redmill

ABSTRACT
The history of software development is strewn with failed projects and wasted resources. Reasons for this include, among others:
• Failure to take an engineering approach, despite using the epithet ‘software engineering’;
• Focus on process rather than product;
• Failure to learn lessons and use them as the basis of permanent improvement;
• Neglect to recognise the need for high-quality project management;
• Reliance on tools to the exclusion of understanding first principles; and
• Focus on what is required without consideration of what could go wrong.
If change is to be achieved, and software development is to become an engineering discipline, an engineering approach must be embraced. This paper does not attempt to spell out the many aspects of engineering discipline. Rather, it addresses the risk-based way of thinking and acting that typifies the modern engineering approach, particularly in safety engineering, and it proposes a number of ways in which a risk-based approach may be incorporated into the structure of software development. Taking a risk-based approach means attempting to predict what undesirable outcomes could occur in the future (within a defined context) and taking decisions – and actions – to provide an appropriate level of confidence that they will not occur. In other words, it uses knowledge of risk to inform decisions and actions. But, if knowledge of risk is to be used, that knowledge must be gained, which means acquiring appropriate information. In safety engineering, such an approach is essential because the occurrence of accidents deemed to be preventable is not considered acceptable. (As retrospective investigation almost always shows how accidents could have been prevented, this often gives rise to contention, but that’s another matter.) In the security field, although a great deal of practice is carried out ad hoc, standards are now based on a risk-based approach: identifying the threats to a system, determining the system’s vulnerabilities, and planning to nullify the threats and reduce the vulnerabilities in advance. However, in much of software development, the typical approach is to arrive at a product only by following a specification of what is required. Problems are found and fixed rather than anticipated, and consideration is seldom given to such matters as the required level of confidence in the ‘goodness’ of any particular system attributes. A risk-based approach carries the philosophy of predicting and preventing, and this is an asset both in the development of products and the management of projects. This paper therefore proposes some first steps in creating a foundation for the development of such an approach in software development and project management. The next section briefly introduces the subject of risk, and this is followed by introductions to two techniques, used in risk analysis, which are applicable in all fields and are therefore useful as general-purpose tools. Subsequent sections offer thoughts on the introduction of a risk-based approach into the various stages of software development projects.
It is hoped that the explanations offered in this paper are easily understandable, but they do not comprise a textbook. Risk is a broad and tricky subject, and this paper does not purport to offer a full education in it. BIOGRAPHY Based in London, UK, Felix Redmill is a self-employed consultant, lecturer and writer. His fields of activity are the related subjects of risk and its management, safety engineering, project management, software engineering, and the application of risk principles to other fields, such as software testing. 9 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 With a BSc in Electrical Engineering, he started work as a Computer Programmer and, thereafter, had parallel careers in telecommunications and software engineering, as engineer and manager, for more than 20 years prior to setting up his own consulting business. In the 1970s he attended Manchester University on a bursary to do an MSc in Computation, and was seconded to Essex University to carry out research into the stored program control of telephone exchanges. He has since conducted private research into several subjects, including risk-based software testing. He gained experience in all aspects of engineering, including maintenance, managed many system-development projects and, as head of department, designed and led a number of quality-improvement campaigns. He was the inaugurating Co-ordinator of the UK_ Safety-Critical Systems Club in 1991, organised sixteen annual Safety-critical Systems Symposia, and still edits its newsletter, Safety Systems, which is now in its eighteenth year. Felix has been an invited lecturer at several universities in the UK and other countries in safety engineering and management and in various aspects of software engineering and is an Honorary Professor at Lancaster University. He has published and presented papers and articles on many subjects, including telecommunications, computing, software engineering, project management, requirements engineering, Fagan inspection, quality management, risk, safety engineering, the safety standard IEC 61508, and engineering education. Some papers and articles have been published in other languages: French, Spanish, Russian, Polish, Arabic, and Japanese. He has also written and edited a number of books on some of these subjects, and has been invited to give keynote addresses in the USA, Australia, India, Poland, Germany, as well as the UK. He is a Chartered Engineer, a Fellow of both the Institution of Engineering and Technology and the British Computer Society, and a Member of the Institute of Quality Assurance. He is currently active in promoting professionalism among safety engineers, developing the profession of safety engineering and helping to define its direction. “Model-Based Safety Cases Using the HiVe Writer” - Tony Cant ABSTRACT A safety case results from a rigorous safety engineering process. It involves reasoned arguments, based on evidence, for the safety of a given system. The DEF(AUST)5679 standard provides detailed requirements and guidance for the development of a safety case. DEF(AUST)5679 safety cases involve a number of highly inter-related documents; tool support is needed to manage the process and to maintain consistency in the face of change. 
The HiVe Writer is a tool that supports structured technical documentation via a centrally-managed datastore, so that any documents created within the tool are constrained to be consistent with this datastore and therefore with each other. This paper discusses how the HiVe Writer can be used to support safety case development. We consider the safety case for a fictitious Phased Array Radar Target Illuminator (PARTI) system and show how the HiVe Writer can support hazard analysis for the PARTI system.

BIOGRAPHY
Tony Cant currently leads the High Assurance Systems (HAS) Cell in DSTO’s Command, Control, Communications and Intelligence Division. His work focuses on the development of tools and techniques for providing assurance that critical systems will meet their requirements. Tony has also led the development of the newly published Defence Standard DEF(AUST)5679 Issue 2, entitled “Safety Engineering for Defence Systems”. Tony obtained a BSc(Hons) in 1974 and PhD in 1979 from the University of Adelaide, as well as a Grad Dip in Computer Science from the Australian National University (ANU) in 1991. He held research positions in mathematical physics at the University of St Andrews, Tel Aviv University, the University of Queensland and the ANU. He also worked in the Commonwealth Department of Industry, Technology and Commerce in science policy before joining DSTO in 1990.

“The Application of Hazard Risk Assessment in Defence Safety Standards” - C.B.H. Edwards

ABSTRACT
Hazard Risk Assessment (HRA) is a special case of Probabilistic Risk Assessment (PRA) and provides the theoretical basis for a number of safety standards. Measurement theory suggests that implicit in this basis are assumptions that require careful consideration if erroneous conclusions about system safety are to be avoided. These assumptions are discussed and an extension of the HRA process is proposed. The methodology of this extension is exemplified in recent work by Jarrett and Lin. Further development of safety standards and the possibility of achieving a harmonization of the different approaches to assuring system safety are suggested.

BIOGRAPHY
Christopher Edwards is a Senior Systems Safety Analyst. Prior to joining Defence in 1979, Chris was a member of the CSIRO's Division of Mathematics and Statistics, where he worked as a consultant statistician. Chris has over 15 years' experience in the management and development of safety-critical software-intensive systems. Since retiring in 2001 he has been contracted as the Safety Evaluator for a number of defence systems which have involved the use of Def(Aust)5679 as the safety standard. Chris is currently the Treasurer of the Australian Safety Critical Systems Association (aSCSa) and sits on the executive committee of that organisation.

“Integrating safety and security into the system lifecycle” - Bruce Hunter

ABSTRACT
We live in a brave new world where Information Security threats emerge faster than control mechanisms can be deployed to limit their impact. Information Security is not only an issue for financial systems but poses greater risks for control systems in critical infrastructure, which depend not only on their continued functionality, but also on the safety of their operation.
This new dimension to the dependability of systems has been recognised by some safety and security standards, but not much has been done to ensure the conflicting requirements and measures of security and safety are effectively managed. Conflicts in the implementation of safety and security aspects of systems arise from the differing values and objectives they are to achieve. Neglecting the relationship between functional, safety and security issues can lead to systems that are neither functional, safe nor secure. This paper proposes an integrated model to ensure that safety and security requirements are effectively treated throughout the system lifecycle, along with functional and performance elements, maintaining ongoing compatibility between their individual objectives.

BIOGRAPHY
Bruce Hunter ([email protected]) is the Quality and Business Improvement Manager for the Security Solutions & Services and Aerospace divisions of Thales Australia. In this role Bruce is responsible for product and process assurance as well as the management of its reference system and its improvement. Bruce has a background in IT, systems and safety engineering in the fire protection and emergency shutdown industry and has had over 30 years of experience in the application of systems and software processes to complex real-time software-based systems. Bruce is a contributing member of the Standards Australia IT6-2 committee, which is currently reviewing the next edition of the IEC 61508 international functional safety standards series. Bruce is also a Certified Information Security Manager and Certified Information Systems Auditor.

“What can the agent paradigm offer safety engineering?” - Tim Miller

ABSTRACT
A current trend in safety-critical applications is towards larger, more complex systems. The agent paradigm is designed to support the development of such complex systems. Despite this, agents are having minimal impact in safety-critical applications. In this paper, we investigate how the agent paradigm offers benefits to traditional safety engineering processes. We demonstrate that concepts such as roles, goals, and interactions narrow the gap between engineering and safety analysis, and provide a natural mechanism for managing re-analysis after change. Specifically, we investigate the use of HAZard and OPerability studies (HAZOP) in agent-oriented software engineering. This offers a first step towards broadening the scope of systems that can be analyzed using agent-oriented concepts.

BIOGRAPHY
Tim Miller is a lecturer in the Department of Computer Science and Software Engineering at the University of Melbourne. Tim completed his PhD at the University of Queensland before taking up a four-year postdoctoral research position at the University of Liverpool, UK, where he worked on the highly successful PIPS (Personalised Information Platform for Life and Health Services) project. Tim's research interests include agent-oriented software engineering, models of multi-agent interaction, computational modelling & analysis of complex systems, and software testing.

“Complexity & Safety: A Case Study” - George Nikandros

ABSTRACT
Despite correct requirements, competent people, and robust procedures, unsafe faults occasionally arise. This paper reports on one such incident; one that involves a railway level crossing.
Whilst the direct cause of the failure was defective application control data, it was a defect that would be difficult to foresee and, if foreseen, to test for. A sequel to this failure was the sequence of events taken to correct the defect: in the haste to correct the defect, another unsafe failure was introduced.

BIOGRAPHY
George is an electrical engineer with some 30 years' experience in railway signalling. George is chairman of the Australian Safety Critical Systems Association. George has published papers, is credited as the author of the Standards Australia Handbook “Safety Issues for Software” and is a co-author of the book “New Railway Environment – A multi-disciplinary business concept”. George represents the Australian Computer Society on the Engineers Australia / Australian Computer Society Joint Board in Software Engineering. George is a Chartered Member of Engineers Australia, a Fellow of the Institution of Railway Signal Engineers, and a Member of the Australian Computer Society.

SYSTEMS ENGINEERING AND SYSTEMS INTEGRATION

THE EFFECT OF THE DISCOVERY OF ADDITIONAL WORK ON THE DYNAMIC PROJECT WORK MODEL
Peter A.D. Brown, Alan C. McLucas, Michael J. Ryan

ABSTRACT
In their paper “Knowledge sharing and trust in collaborative requirements analysis”, Luis Luna-Reyes et al built on the body of system dynamics knowledge to propose a model of project work that demonstrates key dynamic elements of IT projects. By expanding the model to include additional criteria, useful insights can be derived into the likely impact of work that is required to achieve essential project outcomes but is not identified at the beginning of a project. Essential new work discovered during the course of a project increases the project’s scope, requiring re-planning. In addition to the relatively straightforward scope increase resulting from undiscovered work, work discovered late in a project usually requires some of the work already completed satisfactorily – particularly integration work, testing and sometimes design and fabrication – to be redone. Where scope changes resulting from these two impacts are significant, re-approval or even project termination may result. Organisations can use insights gained through application of the expanded model to improve initial project planning and to manage more effectively the ‘unknowns’ associated with project scope.

BACKGROUND
Project Work Model
In their paper “Knowledge sharing and trust in collaborative requirements analysis”, Luis Luna-Reyes, Laura J. Black, Anthony M. Cresswell and Theresa A. Pardo (Luna-Reyes et al. 2008) built on the body of system dynamics knowledge to propose a model of system development project work that demonstrates key dynamic elements of IT projects. Their model is represented in Figure 1 below, modified only to observe current stock-and-flow diagramming conventions (McLucas 2003), (Sterman 2000), (Coyle 1996). The authors intend to further develop the model proposed by Luna-Reyes et al to facilitate a systemic examination of the dynamic structure and characteristics of projects and their effects on projects’ performance and outcomes.
This paper will examine the effects of discovering additional work after a project starts on that project’s performance. It is intended to be the first in a series of papers examining projects’ dynamic behaviour, eventually leading to improved investment decision-making.

Figure 1: Representation of Luna-Reyes Dynamic Project Work Model (Luna-Reyes et al. 2008)

The Luna-Reyes stock-and-flow diagram shows that, of an amount of work to be done, some is done correctly and some is flawed and must be reworked. It also shows that of the rework undertaken, a proportion is likely to contain flaws and require reworking yet again. In addition to a parameter Error Fraction that represents the probability of doing work incorrectly, Luna-Reyes et al recognise three highly aggregated parameters that influence the behaviour of the model: Concreteness, Transformability and Learning-by-Doing. It is the interaction of these parameters with the other elements of a project’s structure that causes the changes over time that can be so difficult to manage effectively and that impact so significantly on project outcomes. An understanding of these parameters is crucial to the development of a valid model for use in simulating project dynamic behaviour and performance. An improved appreciation of the influences of the key parameters may be gained from the influence diagram in Figure 2 below.

Figure 2: Influence diagram – Dynamic Project Work Model based on work by Luna-Reyes et al

Concreteness
In their paper, Luna-Reyes et al examine the efficacy of requirements identification and definition in a collaborative project environment. In doing so, they define Concreteness as the ‘specificity’ of representations elicited during cross-boundary or trans-system interface work on requirements discovery. However, in the more general context of dynamic project work modelling, Concreteness could be better interpreted as those representations that depict the extent to which the mechanisms and interrelationships influencing a system are understood in terms of their potential to contribute to or impede achievement of desired outcomes. In other words, Concreteness is a picture of how visible key stakeholders’ wants and needs are to the organisation or organisations developing a specific system or solution, how far apart those various needs are, and what constraints (dependent and independent, exogenous and endogenous) help make up the specific development environment. Factors influencing Concreteness may include, inter alia:
• the maturity and worldliness of stakeholder organisations;
• how clearly each stakeholder organisation understands their own and other stakeholders’ needs and desired project outcomes;
• the validity and completeness of, and the disparity between, organisations’ mental models of the system development environment and desired project outcomes; and
• the specificity and tangibility of such understanding to organisations’ abilities to convert ‘understanding’ to ‘achievement’.
In the model, ‘Concreteness’ is applied directly to the Rate of Doing New Work Correctly, the Rate of Doing New Work Incorrectly and the Rate of Discovering Rework.
Transformability
Luna-Reyes et al define ‘Transformability’ as the likelihood that an organisation or actor is able to recognise deficiencies in concreteness representations and define and successfully apply the corrective actions needed (Luna-Reyes et al. 2008). In a more general sense, Transformability is the ability to recognise deficiencies in requirements and take action to correct them. In the model, ‘Transformability’ is applied to the processing of KNOWN REWORK, influencing whether rework is done correctly or must be reworked again.

Learning by doing
‘Learning-by-Doing’ is a parameter that represents the level of knowledge of stakeholders’ needs and desired outcomes, including knowledge of effective ways of achieving those outcomes. Learning-by-Doing is an output parameter only, influenced by the Rate of Doing New Work Correctly, the Rate of Doing Rework Correctly and the Rate of Discovering Rework.

Error Fraction
As indicated above, Error Fraction is the probability that the outcome of work attempted will be incorrect or faulty, and it relates to how difficult specific work is to the organisation or organisations undertaking that work. Error Fraction applies to all rates in the model except the Rate of Discovering Rework.

Parameter definitions
Currently, apart from the context-constrained definitions of the key model parameters offered by Luna-Reyes et al, there is little other than empirical evidence and logical thinking to support several of the definitions above. For now it seems appropriate to define the parameters and their relationships with each other using a ‘black box’ approach. In doing so, key parameters can be defined in terms of their inputs and outputs, what elements of the model they influence and to what extent they do so. The authors offer the modified definitions of ‘Concreteness’, ‘Transformability’ and ‘Learning-by-Doing’ hypothesised above more in response to a need for these definitions to effectively model projects than to any mathematical proof or agreed definition of the terms. It should be noted that the key parameters (especially ‘Transformability’ and ‘Concreteness’) are likely to influence the various rates in the model in different ways and to different extents. A parameter may have a negligible influence on some rates for some projects and a massive influence on others. Furthermore, a parameter might only influence a subset of the domain of causes of a rate’s variations, not the whole domain. The nature of the key parameters ‘Concreteness’, ‘Transformability’ and ‘Learning-by-Doing’ and their relationships with other project parameters requires further research and will be addressed in future papers.

How the basic model works
In the basic model at Figure 1, work is done for as long as the stock KNOWN NEW WORK remains greater than zero. New work is undertaken at a certain rate which may vary over time, and will result either in work done correctly at a rate influenced by the rate factor ‘(1 - Error Fraction)’, accumulating in the stock WORK DONE CORRECTLY, or work done incorrectly at a rate influenced by the rate factor ‘Error Fraction’, accumulating in the stock UNDISCOVERED REWORK. The need for rework is discovered gradually, and work transfers from UNDISCOVERED REWORK to KNOWN REWORK at the rate the need for rework is discovered. The discovery of rework is influenced by Concreteness. The better requirements are understood, the more likely and more quickly deficiencies will be recognised.
Known Rework is then processed, influenced by the Error Fraction and Transformability, and flows either to WORK DONE CORRECTLY or back to UNDISCOVERED REWORK. It is important to note that in an actual project, stakeholders do not have visibility of accumulating UNDISCOVERED REWORK, believing instead that this work has accumulated as WORK DONE CORRECTLY. Stakeholders only recognise the requirement for rework as requirements thought to be met are found not to be. In the model, recognition causes flawed work to flow from UNDISCOVERED REWORK to KNOWN REWORK at the rate the need for rework is discovered, but in real life, stakeholders are likely to visualise rework flowing from WORK DONE CORRECTLY to KNOWN REWORK.

Issues with the basic model
Most project management methodologies in use today place heavy emphasis on adequately defining requirements prior to solution development and implementation. While this optimal outcome is possible in some circumstances where a project’s field of endeavour is well travelled and relatively straightforward, it is rarely the case for complex projects. This is particularly so for ‘green field’ endeavours where the system being developed will be the initial system and developers must do without a bottom-up view of the business needs. It may also be the case in circumstances where businesses’ strategic, operational and working level plans require substantial development or re-development. Likewise, in cases when substantial organisational change will be required to fully utilise the capability a solution provides, it may not be possible to develop a mature set of requirements prior to the commencement of system development. These realities have been demonstrated on many occasions and are the reason that more complex system life cycle models such as evolutionary acquisition and spiral development were conceived and are in extensive use today. Given that requirements often cannot be comprehensively and accurately established prior to complex system design, development and, in some cases, construction and implementation, it should be recognised that requirements must continue to be elaborated as projects proceed. Consequently, new work will nearly always be discovered over the life of a project, including after the time requirements development activities were scheduled to be complete. Therefore – referring to the model – KNOWN NEW WORK will not always be the known quantity it ideally should be, but will continue to vary over the life of a project every time new work is discovered.

Adapting the Model
If we are to examine the effects of additional work discovered after a project has commenced, the basic Luna-Reyes model must be modified by the addition of elements representing the discovery of new work. An influence diagram for the modified model is shown in Figure 3 below.

Figure 3: Influence diagram accounting for discovery of new work after project commencement

The new model shows that as additional new work is discovered over the duration of a project, the value of KNOWN NEW WORK increases. Additionally, the discovery of new work will often require some of the work already done correctly (WORK DONE CORRECTLY) to be redone, increasing the project’s rework (KNOWN REWORK) as a result.
Even more distressingly, the discovery of new work after a project has commenced sometimes makes work already completed and accepted redundant. A stock-and-flow representation of the adapted model is shown below in Figure 4.

Figure 4: Stock-and-flow diagram – Dynamic Project Work Model modified to account for new work after project commencement

Because ‘Learning by Doing’ is an output parameter and it is not of central interest to this discussion, it can be removed from the model for the purposes of this study without affecting the validity of simulation results.

Perfect Project
In order to establish a project performance baseline against which other variations and enhancements may be compared, the key parameters ‘Concreteness’, ‘Transformability’ and ‘Error Fraction’ are given values that will equate to a ‘perfect’ project. Also, in a perfect project, no additional work would be discovered. To achieve this, the value for Error Fraction was set and maintained at zero to represent no errors, and the values for Transformability and Concreteness were set and maintained at 100%. For the perfect project, the value of additional work discovered was also set to zero. As a result of these changes, it is possible to simplify the model further for this special case only (see Figure 5 below).

Figure 5: Influence diagram of 'perfect' Dynamic Project Work Model

The stock-and-flow diagram for this ‘perfect’ model is shown below in Figure 6. It is the same as that in Figure 4 but, because there are no errors, no undiscovered new work, and ‘Transformability’ and ‘Concreteness’ have values of 1, only the top part of the basic model is used.

Figure 6: Stock-and-flow diagram of 'Perfect' Dynamic Project Work Model

MODEL BEHAVIOUR
Assumptions and constraints
When examining a model’s behaviour, it is first necessary to understand the assumptions and constraints built into it. Firstly, major projects are usually developed in phases and stages. This model examines one phase/stage only and does not consider inter-stage influences in the analysis of project performance. The discovery of new work is assumed to occur in very small increments and is treated as continuous for the purposes of this study. In real life, discovery of additional work is likely to occur in discrete parcels; further research on the way new work is discovered after a project has begun is required and will be addressed in future papers. In the model, for this study, the rates at which new work is undertaken (correctly and incorrectly), the rate at which rework is discovered and the rate at which rework is undertaken (correctly and incorrectly) are held constant. In real life (and in future research), rates will vary over the life of a project. The parameter values used in this model were selected as reasonable values to start with based on the authors’ experience. They will be subject to more rigorous research and analysis in the near future. In setting parameter values, it was assumed that the amount of additional work discovered after a project has commenced (UNDISCOVERED NEW WORK) is related to the complexity of the system being developed and the environment it is being developed in. Therefore a parameter representing an aggregation of influences, loosely titled ‘complexity’, was adopted and expressed as a fraction of the project’s initially known new work.
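To make the preceding description concrete, the sketch below is one minimal, discrete-time reading of the adapted model under the assumptions stated so far (constant rates, continuous discovery of additional work). The function and parameter names, the rework-discovery fraction, the one-day time step and the way the parameters scale the rates are illustrative assumptions by the editor, not the authors' published implementation; the flows that make completed work redundant are omitted for brevity.

```python
# Illustrative sketch only: a simplified discrete-time reading of the adapted
# Dynamic Project Work Model. Parameter names and numerical choices are assumptions.

def simulate(initial_tasks=2000.0, work_rate=10.0, error_fraction=0.0,
             concreteness=1.0, transformability=1.0,
             discovery_multiplier=0.0, complexity_fraction=0.0,
             rework_discovery_fraction=0.1, dt=1.0, max_days=5000):
    known_new = initial_tasks                                # KNOWN NEW WORK
    undiscovered_new = complexity_fraction * initial_tasks   # UNDISCOVERED NEW WORK
    done = 0.0                                               # WORK DONE CORRECTLY
    undisc_rework = 0.0                                      # UNDISCOVERED REWORK
    known_rework = 0.0                                       # KNOWN REWORK
    total = known_new + undiscovered_new
    day = 0.0
    while done < total - 1e-6 and day < max_days:
        # New work is processed while KNOWN NEW WORK remains; Concreteness and the
        # Error Fraction split it between correct work and undiscovered rework.
        new_rate = min(work_rate * concreteness, max(known_new, 0.0) / dt)
        new_correct = new_rate * (1.0 - error_fraction)
        new_incorrect = new_rate * error_fraction
        # The need for rework is discovered gradually, influenced by Concreteness.
        discover_rework = concreteness * rework_discovery_fraction * undisc_rework
        # Known rework is processed, influenced by Transformability and Error Fraction.
        rework_rate = min(work_rate, max(known_rework, 0.0) / dt)
        rework_correct = rework_rate * transformability * (1.0 - error_fraction)
        rework_incorrect = rework_rate - rework_correct
        # Additional work is discovered at a multiple of the rate of doing new work.
        discover_new = min(discovery_multiplier * work_rate,
                           max(undiscovered_new, 0.0) / dt)

        known_new += (discover_new - new_rate) * dt
        undiscovered_new -= discover_new * dt
        done += (new_correct + rework_correct) * dt
        undisc_rework += (new_incorrect + rework_incorrect - discover_rework) * dt
        known_rework += (discover_rework - rework_rate) * dt
        day += dt
    return day

# 'Perfect' project baseline: 2000 tasks, no errors, no additional work -> 200 days.
print(simulate())
# Illustrative run resembling a slow-discovery, complex case (25% additional work).
print(simulate(discovery_multiplier=0.1, complexity_fraction=0.25))
```

Under these assumptions the baseline run reproduces the 200-day 'perfect' project described below, while non-zero discovery parameters extend the duration, which is the behaviour the test cases are designed to explore.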
It was also recognised that the position on a project’s time line when additional work was discovered (represented in the model as the Rate New Work Discovered) might be just as relevant to project performance as the amount of additional work discovered. For consistency, the rate new work is discovered was based on the rate at which KNOWN NEW WORK is processed. Finally, the rates at which newly discovered work caused work already completed and accepted (WORK DONE CORRECTLY) to be reworked or to become redundant were set to a fraction of the rate of discovery of the additional work (Rate New Work Discovered).

Project performance baseline
Initially, the model was set up as a ‘perfect’ project comprising 2000 tasks with zero errors, Concreteness and Transformability set at 100% and no additional work discovered. The duration for the ‘perfect’ project was 200 days.

The effect of additional work on project performance
Based on the assumptions and constraints above, three test cases each containing three scenarios were constructed.

Table 1: DPWM modelling cases

Test Case | Scenario | Rate of Discovery of Additional Work | Complexity
0 | - | No additional work | Not applicable
1 | a | Slow: Discovery Rate = 0.1 * rate of doing new work | Simple (5% of Initial Known New Work)
1 | b | Slow: Discovery Rate = 0.1 * rate of doing new work | Complex (25% of Initial Known New Work)
1 | c | Slow: Discovery Rate = 0.1 * rate of doing new work | Very Complex (50% of Initial Known New Work)
2 | a | Medium: Discovery Rate = 1 * rate of doing new work | Simple (5% of Initial Known New Work)
2 | b | Medium: Discovery Rate = 1 * rate of doing new work | Complex (25% of Initial Known New Work)
2 | c | Medium: Discovery Rate = 1 * rate of doing new work | Very Complex (50% of Initial Known New Work)
3 | a | Fast: Discovery Rate = 2 * rate of doing new work | Simple (5% of Initial Known New Work)
3 | b | Fast: Discovery Rate = 2 * rate of doing new work | Complex (25% of Initial Known New Work)
3 | c | Fast: Discovery Rate = 2 * rate of doing new work | Very Complex (50% of Initial Known New Work)

A plot of the test case results is shown in Figure 7 below.

Figure 7: Effect of additional work on project performance

The model was designed to simulate work already completed and accepted that required rework or was made redundant as a result of the discovery of additional work. Arbitrary values were selected of 10% for the rate at which newly discovered work made work already completed redundant, and 20% for the rate at which it caused completed work to be reworked. The values for total redundant and reworked tasks are shown in Table 2 below.

Table 2: Work made redundant or requiring rework due to discovery of new work

Complexity (% of Initial Known Work) | Completed work made redundant (Tasks) | Completed work requiring rework (Tasks)
5% | 10 | 20
25% | 50 | 100
50% | 100 | 200

Discussion
Predictably, the plot shows that the addition of new work will extend project duration for all test cases. It also shows that for the slower rate of discovery and more complex projects, duration may be extended because not all the new work is discovered before the known new work is completed. In reality, that situation would not normally arise, but there have been instances of projects that have not stopped due to the continued addition of new work.
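As a cross-check on Table 2, the redundant and reworked totals follow directly from the values already stated: 2000 initially known tasks, additional work equal to the complexity fraction of that figure, and the 10% and 20% rates read as fractions of the total additional work discovered. A short sketch under that reading:

```python
# Cross-check of Table 2: additional work is the complexity fraction of the 2000
# initially known tasks; 10% of that additional work makes completed work redundant
# and 20% forces completed work to be reworked.
initial_known_work = 2000
for complexity in (0.05, 0.25, 0.50):
    additional = complexity * initial_known_work
    print(f"{complexity:.0%}: redundant = {0.10 * additional:.0f} tasks, "
          f"rework = {0.20 * additional:.0f} tasks")
# -> 5%: 10/20, 25%: 50/100, 50%: 100/200, matching Table 2.
```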
The absolute values for work made redundant or requiring rework due to the discovery of additional work are relatively small. However, if the data are considered in the context of a project that was approved on the basis that it would take 200 days to complete, and whose duration then increased to 300 days due to additional work, of which more than 20 days of work had to be redone and over 10 days of work could not be used, the impact becomes significant. The significance of the discovery of additional work will vary from project to project depending on what effort and resources are required to undertake the work, how those resources are allocated and how large an impact the additional work has on work already completed. It is generally appreciated, though, that the later in a project changes are introduced, the greater the impact they are likely to have on work already completed and on overall project cost.

Conclusions
This paper builds on the body of system dynamics and project management work in proposing an improved model for examining the dynamic behaviour of projects in relation to performance and outcomes. Even in its current basic form, the model provides useful insights into the ways the discovery of additional work over a project’s life affects project performance. It is not always possible, particularly for complex projects, to comprehensively map and define project requirements prior to system design and development. From experience and the preliminary modelling results outlined in this paper, it is likely that a project’s performance will be sensitive to the discovery of additional work after commencement, causing not only significant schedule and cost over-runs but also unanticipated rework, and sometimes even causing a portion of completed work to become nugatory and be discarded.

Future work
Further research is required into the nature of new work, how it is discovered and how sensitive project duration is to the time it takes to replan, accommodate and deal with additional work, as is improved definition of key dynamic project parameters including Error Fraction, Concreteness, Transformability and Learning-by-Doing, followed by analysis of their influence on project performance and outcomes. Future research will build on the modest start described in this paper, resulting (it is hoped) in a more effective means of managing dynamic project and program risk.

Bibliography
Abdel-Hamid, T. and Madnick, S. Software Project Dynamics: An Integrated Approach. Englewood Cliffs, NJ: Prentice Hall, 1991.
Coyle, R.G. System Dynamics Modelling: A Practical Approach. 1st ed. London: Chapman & Hall/CRC, 1996.
Luna-Reyes, L.F., Black, L.J., Cresswell, A.M. and Pardo, T.A. “Knowledge sharing and trust in collaborative requirements analysis.” System Dynamics Review (Wiley) 24, No. 3 (2008).
McLucas, A.C. Decision Making: Risk Management, Systems Thinking and Situation Awareness. Canberra: Argos Press, 2003.
Sterman, J.D. Business Dynamics: Systems Thinking and Modelling for a Complex World. Boston, MA: Irwin/McGraw-Hill, 2000.
Wolstenholme, E.F. System Enquiry: A System Dynamics Approach. Chichester: John Wiley & Sons, 1990.

Lessons Learned from the Systems Engineering Microcosm Sandpit

Quoc Do, Peter Campbell, Shraga Shoval, Matthew J. Berryman, Stephen Cook, Todd Mansell and Phillip Relf

Defence Science and Technology Organisation, Edinburgh, South Australia.
Ph. +61 8 8259 7566, Email: {todd.mansel, phillip.relf}@dsto.defence.gov.au

Defence and Systems Institute, University of South Australia, Mawson Lakes Campus, SA.
Ph. +61 8 8302 3551, Email: {quoc.do, peter.campbell, shraga.shoval, matthew.berryman, stephen.cook}@unisa.edu.au

Abstract
Lessons learned can be highly valuable to engineering projects. They provide a means for systems engineers to thoroughly investigate and anticipate potential project risks before starting the project. Up-front analysis of the end-to-end process pays on-going dividends. This paper describes: 1) an evolutionary Microcosm for investigating systems integration issues, fostering model-based systems engineering research, and accelerating systems engineering education; 2) the lessons learned during the first stage of the Microcosm development; and 3) how these lessons learned have informed the design and implementation of the Microcosm Stage Two. Interestingly, the lessons learned from the Microcosm Stage One reflect many of the common lessons learned found in much larger industry projects. This demonstrates the Microcosm Sandpit’s capability to replicate a wide range of systems development issues common to complex systems. Thus it provides an ideal environment for systems engineering education, training and research.

INTRODUCTION
The Microcosm program was established by the University of South Australia and the Defence Science and Technology Organisation in 2006 to foster research, training and education in systems engineering, with a particular focus on conducting research into how to manage the issues of system integration that arise in the development of large system-of-systems projects. These issues are described succinctly in (Norman and Kuras, 2006), for example, and are understood to arise from multiple causes. One of the most important sources of systems integration difficulty is the need to assemble these systems from a number of different systems that have often been designed and built for different purposes, by different manufacturers, and to operate in different environments from those expected in the new system of systems. The Microcosm program has been deliberately designed to mimic this situation on a small scale, so that the same type of issues will occur and research and teaching of these issues can be carried out in a small and manageable environment.

The Microcosm program is executed in multiple stages that adopt the Spiral Development Model. Stage One of the Microcosm development project has been completed, with the identification of over 30 lessons learned, which have strong parallels with those that arise in real projects. This first stage was carried out with the following inadequate planning and management characteristics that are often reflected in real projects:
a. Unclear and shifting requirements,
b. A tight schedule and very tight funding,
c. Inadequate systems engineering and project management planning,
d. Development based on components supplied by a number of different manufacturers, causing numerous interface issues, and
e. Inadequate understanding of the possible environmental effects on the system’s performance.
Both governments and industries have recognized the need to document and apply the knowledge gained from past experience to support current and future projects, in order to avoid the repetition of past failures and mishaps and to promote successful practices. Traditionally, lessons learned are associated with failures involving the loss of valuable resources.
However, project successes can and should also be recorded as lessons learned. These lessons are generally stored and maintained in a database and regarded as a valuable resource for rectifying causes of failure, or informing decision makings that are likely to improve future performance. One of the valuable systems engineering products generated from the Microcosm Stage One is a lessons learned database. Many lessons learned databases that have been built in the past have ended up in total neglect within a short time because they did not provide potential future users with easy access or the cues to promote their use. The Microcosm lessons learned database is intended to be an interactive living database to provide continuous support to the project into improved practices of systems engineering, and address systems integration issues as well as the use of Microcosm systems engineering process products for education purposes. This paper gives a brief description of stage one of the Microcosm project, and then discusses the lessons learned during its development, how they have informed the execution of the second stage of the project and the database design to make them available on the project Wiki. MICROCOSM PROGRAM The Microcosm program is aimed at developing an evolutionary facility, namely the Microcosm Sandpit, which will expand in capability to meet the wider and longer-term requirements of its stakeholders by adopting the Spiral development model as described in (Buede, 2000). It is essentially an open-ended project with multiple stages, where each stage will enhance existing capabilities as well as develop new capabilities to meet the growing needs of stakeholders. It resembles the evolutionary nature of military systems upgrades on a much smaller scale. This is a challenge for the traditional systems engineering paradigm, since the systems engineering process is usually taught and largely practiced using a linear system development process (ie., the Waterfall model), unsuited to deal with evolving systems. 28 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 The Microcosm Sandpit provides a flexible means to explore systems engineering practices within a facility that utilises real and simulated autonomous systems operating and interacting with humans, the physical and the simulated environments. The evolving Microcosm Sandpit fosters systems engineering practices in both research and teaching environments. Essentially, it provides a systems engineering and systems integration environment to be used by stakeholders to stage demonstrations, conduct experiments, train staff, and to evaluate systems configuration and operation. The Microcosm programme has six defined use-cases: Simulation/Stimulation: Investigation of the interactions between a mixture of models and physical hardware in real-time and simulated scenarios, including hardware in the loop, simulation architectures and plug-and-play hardware and software. Human Agent-Based Modelling for Systems Engineering: Development of human operational models and interfaces in scenarios based on teaming agents, human replacement agents, socio-technical models, and research into the insertion of data and image files into an OPNET Modeller. Modelling, Simulation and Performance Analysis: Development of object oriented modelling of system devices, and statistical modelling of captured real-life data. 
Autonomous Vehicles Research: Investigation of swarming robots, cooperative robots, dynamic task allocation, robustness and reliability, and combined communication and localisation. Systems Engineering Approach to Evolving Model Development: Investigation of SE–lifecycle models, evolving agents, platforms and environments, and the transition from low to high fidelity models. Systems Enhancement Research: System analysis, system parameterisation, optimisation, algorithm development, kit development, and research into HumanComputer Interfaces (HCI). Microcosm Stage One Architecture The Microcosm programme high-level architecture has three distinct parts (Mansell et al., 2008): the Microcosm Information Management System (MIMS), the Modelling and Simulation Control System (MASCS), and the Microcosm Physical System (MPS), illustrated in Figure 1. The MIMS is an integrated information management system that stores all the systems engineering products associated with the project through each spiraldevelopment cycle. The MASCS is a simulation and control subsystem that contains synthetic models of Microcosm’s components including environment models, simulated autonomous vehicles, and a suit of onboard and off-board sensors models. In addition, the system provides the capability for hardware-in-the-loop (HIL) simulation through the use of a common interface between simulated and physical components that provides seamless interactions between these components in a given operational scenario. Finally, the MPS consists of all the physical components of the Microcosm facility, including the autonomous robotic vehicles, external sensors and external environments. Microcosm Stage One Implementation The Microcosm Stage One operational scenario consists of two unmanned ground vehicles (Pioneer 3DXs), and a fixed global external sensor: SICK LMS 291 laser sensor as shown in 29 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 Figure 2(a). Each mobile robot has an on-board laptop, vision sensor, ultrasonic sensor, laser sensor and wheel encoders. The global sensors are used for intruder detection (i.e. a person), which send notification to the two mobile robots to intercept, assess, perform threat mitigation or neutralize the intruder. In addition to the physical components, synthetic models have been developed for the P3DXs, laser sensors, ultrasonic sensor, vision sensors, and the operating environment. This provides the flexibility to run scenarios using only physical systems, synthetic models alone or a combination of real components and synthetic models. Also it provides a powerful environment to explore systems integration issues, and modelbased systems engineering research. Figure 1. The high level architecture of the Microcosm Sandpit. The Microcosm Sandpit Stage One system implementation has been successfully completed and evaluated (Do et al., 2009) using a service oriented architecture (SOA). The SOA is based on the Decentralised Software Service (DSS) (Nielsen and Chrysanthakopoulos, 2008) and the Concurrent Control Runtime (CCR) library (Morgan, 2008) from Microsoft and is illustrated in Figure 2(b). The CCR is a managed library that provides classes and methods for concurrency, coordination and error handling. It enables segments of codes to operate independently, pass messages and execute in parallel. The DSS, on the other hand, extends the CCR capability across processes and machines. 
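The CCR and DSS are .NET libraries with their own service and message-passing APIs, which are not reproduced here. Purely to illustrate the coordination style the preceding paragraph describes, independent segments of code that run in parallel and interact only by passing messages, the following Python sketch uses asyncio queues. The service names, readings and the one-metre engagement threshold are illustrative assumptions, not MRDS code.

```python
import asyncio

async def laser_service(out_queue):
    """Independently running segment: publishes (hypothetical) range readings as messages."""
    for reading in [4.2, 3.1, 0.9]:
        await out_queue.put(("laser_range_m", reading))
        await asyncio.sleep(0.1)
    await out_queue.put(("laser_done", None))

async def control_service(in_queue):
    """Consumes messages and coordinates behaviour without sharing state with the sensor."""
    while True:
        topic, value = await in_queue.get()
        if topic == "laser_done":
            break
        action = "engage intruder" if value < 1.0 else "keep following"
        print(f"range={value} m -> {action}")

async def main():
    queue = asyncio.Queue()
    # The two services execute concurrently and interact only via the message queue,
    # mirroring the decoupled, message-passing style described above.
    await asyncio.gather(laser_service(queue), control_service(queue))

asyncio.run(main())
```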
Both CCR and DSS are provided within the Microsoft Robotic Development Studio (MRDS) (Johns and Taylor, 2008). (a) (b) Figure 2. (a) Operational view (OV1) of the Microcosm Stage One scenario. (b) Service-oriented architecture for the Microcosm Stage One implementation. 30 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 Another important aspect of the Microcosm Sandpit development is that the simulation environment is implemented using the simulation capability of the Microsoft Robotic Development Studio (MRDS). It uses the AGEIA PhysX physics engine to simulate physical interactions within the virtual environment, such as friction, gravity and collision (Johns and Taylor, 2008). The Microcosm Sandpit physical environment and the robots are modelled to scale in the simulation, and sensor accuracy is modelled with a modest level of fidelity. The modelled environment is shown in Figure 3. The synthetic robots are configured to replicate the motions of their real counterparts in the real environment, performing the operational scenario depicted in Figure 2 (a). A synthetic intruder is also modelled and its motion is emulated based on the intruder’s position calculated by the ground-based laser sensor in the physical environment. Intruder and robots’ motion data are supplied to the simulation environment by the Master-Runtime Control. Figure 3. Aerial view of the synthetic environment of the Microcosm Sandpit. The operation of the stage one operation scenario is based on centralised control architecture, as depicted in Figure 2(b). The Master Runtime Control is responsible for intruder detection, sensor fusion, creating and maintaining an operational picture, and requesting positional update from the robots. It also instructs the Simulation Orchestrator to emulate the robots and intruder motions. Upon intruder detection, the robots receive the intruder’s position from the Master-Runtime Control. Robot Two pivots and tracks the intruder using the onboard camera, while Robot One performs its own path planning to follow and intercept the intruder using a finite state machine (Cook et al., 2009). After the initialisation state Robot One transits automatically into the Standby state. Upon intruder detection it progresses to the Follow-Intruder state, in which it performs path planning and follows the intruder. While in the Follow-Intruder state, if an obstacle appears in the way (detected by the onboard laser sensor), Robot One transits to the Obstacle-Avoidance state, and resumes the previous state when the obstacle is cleared. When the robot is within a metre of the intruder, it enters Intruder Engagement state and announces successful interception to the intruder. A voice message is also transmitted on each state transition to enable observers to assess progress through the scenario. Should the intruder leave the guarded area, the robot enters the Return-To-Base state. 31 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 MICROCOSMS LESSONS LEARNED Lessons learned databases are a collection of analysed data from a variety of current and historical events. Traditionally these events are associated with failures, and involving the loss of valuable resources. Lessons learned databases provide a resource that can be used to rectify the causes of failure, or choose pathways that are likely to improve future performance. 
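As a concrete illustration of what such a database holds, the sketch below shows one possible record structure and a simple retrieval helper. The field names (category, driving event, lesson, recommendation) are our own paraphrase of the elements described in this section, not the schema actually used on the Microcosm wiki.

```python
from dataclasses import dataclass, field

@dataclass
class LessonLearned:
    """One record of a lessons learned database (field names are illustrative only)."""
    lesson_id: str            # e.g. "MLL03"
    category: str             # e.g. "Integration", "Interfaces", "Safety"
    driving_event: str        # the failure or success that triggered the lesson
    lesson: str               # what was learned
    recommendation: str       # recommended future action
    tags: list = field(default_factory=list)

def lessons_for_category(database, category):
    """Return the lessons relevant to a given issue category, e.g. when planning a new stage."""
    return [rec for rec in database if rec.category.lower() == category.lower()]

# Example use: pull the integration-related lessons before designing the next stage.
db = [
    LessonLearned("MLL03", "Integration",
                  "COTS interface documentation was incomplete and ambiguous",
                  "Interface testing is needed before system test",
                  "Prove COTS communication against the interface specification early"),
]
print([rec.lesson_id for rec in lessons_for_category(db, "integration")])
```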
Many large civilian and military organizations around the world manage a database that includes a description of the driving events (failures or success), the lesson(s) learned from these events and recommendations for future actions. In this section we describe some events that are included in the lessons learned database of the Microcosm program. Although the Microcosm database is driven by relatively simple events, the database has similar outcomes and recommendations to the traditional lessons learned databases found in larger systems engineering/integration organisations. Furthermore, the unique configuration of Microcosm offers an environment in which systems engineering practices are developed and extended in a forgiving environment, and provides a large variety of driving events with little risk of loss of resources. The following table gives examples of lessons learned in categories that are maintained in our lessons learned database. For each entry the nature of the issue and what we learned as a consequence of this issue’s manifestation has been summarised in the third and fourth columns respectively. Table.1: Samples of the lessons learned from the Microcosm Stage One. Id MLL01 Issue Type Equipment Operation Issue Description EMI issue with the flex compass being interfered with by the robot’s electrically powered motors. MLL02 Environment MLL03 Integration MLL04 Interfaces MLL05 Computer Hardware Robot’s localisation system was unable to cope with measurement inconsistencies, introduced by the environmental conditions i.e., uneven floor surfaces. COTS equipment manufacturer’s interface documentation was incomplete and ambiguous. Team members were developing their common software interfaces but assumed incompatible units of measurement and dissimilar coordinate systems for various data fields. Processing power, while adequate for a single application, was found to be inadequate when the same CPU was required to support multiple applications. MLL06 Configuration Management System data was stored in a common folder on a server but system baselines were not clearly defined. It was difficult to identify working modules 32 What Was Learned? EMI issues are difficult to predict during the design stage and can also be intermittent in their manifestation. Hardware prototyping and extensive system testing is required to ensure that EMI issues are identified early. Information redundancy, supported by multiple different sensor types, is required to compensate for sensor errors realised within an imperfect environment. Interface testing is necessary to prove that the COTS equipment communicates as expected before system test is conducted. An interface specification is required to effectively define the system component interfaces. Each affected team member is hence required to ‘sign up to’ this interface specification. Before loading an application set on a computing platform a processor performance model should be developed to confirm that adequate computing resources are available over all system operational scenarios. System baselines need to be defined and the relevant ‘snap shot’ of the system kept unmodified in a known directory structure to ensure that development to the next Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 Id Issue Type Issue Description for each baseline and differentiate completed modules from work-inprogress ones. 
Due to a software issue, the robot went ‘rouge’ during a test, which could have resulted in damage to the robot, before manual action was forthcoming to secure the robot’s motion. See MLL07 MLL07 Safety MLL08 Emergency Procedures MLL09 Project Management Our initial effort estimates gave a value of 1.1 person years. However, the actual expenditure was 1.5 person years. MLL10 Technology adoption/inse rtion MLL11 Lesson Learned Capture The self-localisation of the robot was based on only odometry with no correctional global positioning. It was found that the odometry suffers from accumulative errors that lead the robot to a “lost state” within a short-period of time. Lessons learned were captured at the end. It was found that many of the same issues were independently discovered by multiple team members, which could have been avoided in the subsequent cases, given early access to populating the Microcosm Stage One lessons learned database What Was Learned? system baseline progresses from a known state. A readily accessible and simple method of halting the robot’s motion (i.e., kill switch) should be provided as part of the system functionality. A procedure and deployment of this procedure (i.e., staff training) was found to be necessary to address potential safety issues. Our systems engineering process was not robust enough to mitigate against the rework required to recover from various issues, see this table for examples. Our systems engineering process will now be evaluated and process improvement considered as appropriate. Model-based systems engineering should be used to inform the feasibility of the intended methodology and technology prior to their insertion. Lesson learned should be recorded immediately after they occurred to avoid duplicated failures. This requires the lessons learned database to have online access, and have email notification of new entry to all team members. IMPACTS OF LESSONS LEARNED ON THE MICROCOSM STAGE TWO Lessons learned can be regarded as a critical source of information for risk mitigation and optimal project planning. However, their effective use and maintenance still remain as a challenge to the engineering community. One of the aims of the Microcosm program is to investigate methods for the effective use of the lessons learned to inform the design and implementation of future stages of the Microcosm project, and also their use in systems engineering education. The captured lessons learned from the Microcosm Stage One have informed the execution of the Microcosm Stage Two work in two different aspects: engineering implementation and systems engineering process. The former has occurred in the system design and implementation phases, where stronger emphasis on understanding and designing to the interface issues between different system components is being addressed as the result of the lesson learned id-MLL03 and id-MLL04 in Table.1. In particular, the following areas are considered: communication protocol, deterministic response, Quality of Service (QoS), type of messages/data being passed between services, data update rates of each sensors and 33 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 services, all embedded in a hardware-software system deployment architecture framework design. 
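As an illustration of the kind of interface agreement that lessons id-MLL03 and id-MLL04 motivate, the sketch below shows a message definition in which units, coordinate frame and update rate are explicit and checked. It is a generic Python sketch of the idea only, since the actual Stage Two interfaces are defined as service contracts within the MRDS framework, and all names here are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class Frame(Enum):
    SANDPIT_GLOBAL = "sandpit_global"   # fixed frame anchored to the Sandpit floor plan
    ROBOT_BODY = "robot_body"           # frame attached to an individual robot

@dataclass(frozen=True)
class PoseMessage:
    """A pose report exchanged between services.

    This docstring plays the role of the interface specification that each team member
    'signs up to': distances are metres, angles are radians, timestamps are seconds,
    and the coordinate frame travels with the data so mismatched assumptions fail loudly.
    """
    source_service: str
    timestamp_s: float
    x_m: float
    y_m: float
    heading_rad: float
    frame: Frame
    update_rate_hz: float = 10.0   # agreed nominal publication rate

def validate(msg: PoseMessage) -> None:
    """Reject messages that silently break the agreed contract."""
    if msg.frame is not Frame.SANDPIT_GLOBAL:
        raise ValueError(f"{msg.source_service}: poses must be published in the global frame")
    if msg.update_rate_hz <= 0:
        raise ValueError("update rate must be positive")
```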
Similarly, the system design process is being informed by lesson learned id-MLL10 in Table 1, where the capability of preliminary system components is being investigated by creating statistical models of the components, to ascertain whether the intended technology will meet our system's requirements, irrespective of what is stated in the various product specifications. For instance, an Ultra-Wide Band (UWB) positioning system was considered as an indoor analogue to GPS for updating the robots' positions. A model-based systems engineering approach was adopted to inform the system design; the outcome indicated that a UWB positioning system alone would not be sufficient to provide global positioning updates for the robots, due to large positional errors. This led to the insertion of an additional requirement to procure an extra SICK LMS-291 laser sensor. Note that this was identified before the project started rather than toward the end of the project, as might otherwise have happened. Furthermore, the tailored systems engineering processes for the Microcosm Stage Two have been informed by the lessons learned recorded in Table 1. For instance, with reference to lesson learned id-MLL09, our estimate of the total effort required (1.1 person-years) was out by 36%, due largely to the time allocated for the test and evaluation phase. This lesson has informed the time allocation for the systems integration test and evaluation phase of the Microcosm Stage Two, increasing it from 5% to 20% of the overall project schedule. This is believed to be the general expectation in industry settings, and hence rework has potentially been mitigated as a consequence of lesson learned id-MLL10.

PROPOSED RESEARCH INTO THE USABILITY OF A LESSONS LEARNED DATABASE

In order to make use of the captured lessons learned it is important to have the right software infrastructure in place and the right methodology and environment for using it. The wiki software being used will allow lessons learned to be linked to relevant architectures and design patterns (which can also be stored in the wiki), and the impacts of changes in these can be captured and stored in the wiki's history. Having the right processes includes making sure people are using the wiki, using email to notify wiki updates for time-critical lessons learned, deciding on the right course of action as a result of a lesson learned, and implementing the resulting changes. Learning can take place at a number of levels (Grisogono and Spaans, 2008). In the context of Microcosm, level 1 learning would be tuning existing systems engineering processes as a result of lessons learned. Level 2 learning includes improving the processes used to support level 1 learning; this includes modifications to the structure of the wiki and the processes for learning, along with expanding and modifying the set of engineering processes used in Microcosm. Level 3 learning is learning about level 2 learning: capturing lessons learned regarding the use of the wiki database and the processes adopted for acquiring the lessons learned. The Microcosm wiki could capture level 3 learning through research into the usability of the lessons learned database. Level 4 learning is about aligning how the lessons learned are used with measures of the real-world performance of the Microcosm. For example, if a different way of communicating architectural changes is used, is this reflected in better software performance?
Level 5 learning is about how lessons learned are used in a co-operative 34 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 setting. To be most effective at learning, all five levels have to be implemented. Therefore, the Microcosm lessons learned research program focuses on how well each of the five levels of learning can be achieved through the use of the Microcosm wiki and other processes. CONCLUSIONS This paper has discussed the usefulness of capturing lessons learned from the Microcosm Stage One and illustrated how these lessons learned have informed the design and implementation, and the tailored systems engineering processes of the Microcosm Stage Two. The captured lessons learned were from a small-scale project but reflect many of the same lessons learned reported in large-scale projects. This has demonstrated the Microcosm Sandpit’s merits in generating systems engineering products that could be used in systems engineering education, training and research. The captured lessons learned are stored on the project Wiki that will be equipped with interactive mechanisms for autonomously engaging and proving insightfully information at various stages of the Microcosm program as part of its model-based systems engineering research theme. REFERENCES Buede, D. M. 2000. The Engineering Design of Systems, Wiley Publishing Inc. Cook, S., T. Mansell, Q. Do, P. Campbell, P. Relf, S. Shoval and S. Russell, 2009. Infrastructure to Support Teaching and Research in the Systems Engineering of Evolving Systems. 7th Annual Conference on Systems Engineering Research 2009 (CSER 2009), accepted for publication, Loughborough, UK. Do, Q., T. Mansell, P. Campbell and S. Cook, 2009. A Simulation Architecture For Modelbased Systems Engineering and Education. SimTect 2009, accepted for publication, Adelaide, Australia. Grisogono, A.-M. and M. Spaans, 2008. Adaptive Use of Networks to Generate an Adaptive Task Force. 13th ICCRTS: C2 for Complex Endeavors. Johns, K. and T. Taylor 2008. Professional Microsoft Robotics Developer Studio Wiley Publishing Inc. Mansell, T., P. Relf, S. Cook, P. Campbell, S. Shoval, Q. Do and C. Ross, 2008. MicrocosmA Systems Engineering and Systems Integration Sandpit. Asia-Pacific Conference on Systems Engineering - APCOSE, Japan. Morgan, S. 2008. Programming Microsoft Robotics Studio, Microsoft Press, US. Nielsen, H. F. and G. Chrysanthakopoulos. 2008. "Decentralized Software Services Protocol – DSSP/1.0." http://download.microsoft.com/download/5//6/B/56B49917-65E8494A-BB8C-3D49850DAAC1/DSSP.pdf. Norman, D. and M. Kuras 2006. Engineering Complex Systems. Complex Engineered Systems: Science Meets Technology. D. Braha, A. Minai and Y. Bar-Yam, Springer. 35 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 This page intentionally left blank 36 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 MAKING ARCHITECTURES WORK FOR ANALYSIS Ruth Gani B.AppSc., Simon Ng BSc./BEng. PhD. Joint Operations Division Defence Science and Technology Organisation INTRODUCTION Capability integration is an important step towards delivering joint capability for Defence— indeed, interoperability is central to enabling effective joint, combined and coalition operations. 
In the words of the Department’s Defence Capability Development Manual, ‘Joint interoperability is to be seen as an essential consideration for all ADF capability development proposals’ (Australian Department of Defence 2006, p47). A central component of ensuring interoperability is the integration of systems across capabilities. According to Weill from the MIT Center for Information Systems Research (Weill 2007), the enterprise architecture is the ‘organising logic for key business process and IT capabilities reflecting the integration and standardisation requirements of the firm’s operating model’. The Australian Department of Defence has adopted an enterprise architectures approach through the Defence Architectural Framework (Australian Department of Defence 2006, p47). The Chief Information Officer Group states that the application of the Defence Architectural Framework ‘enables ICT architectures to contribute to effectively acquiring and implementing business and military capabilities, ensuring interoperability, and value for money and can be used to enhance the decision making process’ (Chief Information Officer Group, 2009) Architectures are intended to document a system in order to support reasoning about that system. This encompasses the use of the architecture to manage and direct system changes, to measure system effectiveness and performance and to describe and visualise the system (Institute of Electrical and Electronics Engineers, 2000). As such, architectures must be tractable from this perspective. Unfortunately, the utility of architectures can be limited by poor appreciation of their potential to support this sort of reasoning or by a lack of understanding of who might qualify as an ‘end-user’ of the architecture. The Defence Science and Technology Organisation (DSTO) plays an important role in supporting Defence capability development and acquisition projects. Part of this role is to examine the risks and mitigation strategies associated with the integration of the project into the wider Defence capability. As such, DSTO can be an ‘end-user’ of the architectures developed as part of the capability development process (DSTO can, of course, also be involved in the development of architectures for Defence). This paper reports on lessons on the use of architectures drawn from DSTO’s support to a key Defence capability acquisition project. It demonstrates the importance of documenting architectures in a manner that makes them useful for reasoning by presenting the remedial efforts that were needed to make the extant project architecture amenable to analysis. Because of the classified nature of much of the material related to the Defence project in question, the name of the project and details of the proposed system options under consideration have been withheld. 37 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 OVERVIEW The aim of the DSTO study presented herein was to answer two questions: 1. What were the important interoperability requirements and associated information domain standards necessary for the new Defence capability to operate within the Defence Information Environment (DIE)? 2. What was the impact of poor interoperability (due to poor standards compliance) on the new capability’s capacity to conduct its missions and to meet defined information exchange requirements? 
These questions were answered by comparing standards associated with the information exchange requirements identified by the project with standards defined for the Defence Information Environment, producing a 'degree of compliance' rating. If the degree of compliance was 1, then the capability being developed by the project would be fully interoperable (within the limits of resolution of the study). If the degree of compliance was 0, then the capability would not be interoperable at all.

THE DEFENCE INFORMATION ENVIRONMENT

Figure 1. A logical representation of the DIE, showing the Defence Information Domains (DID) and the layered Defence Information Infrastructure (DII): Data, User Applications, Common Services, Networks/Datalinks and Bearers.

The Defence Information Environment (DIE) as represented in Figure 1 (above) is divided into layers. Specific layers were of particular relevance to the project, those being:
• the Data, User Applications and Common Services layers of the Defence Information Infrastructure (DII);
• the information management aspects of the Network/Datalinks; and
• the Bearers layer.

Each layer of the DII has an associated set of approved standards which are documented within the Approved Technology Standards List (Chief Information Officer Group 2006). The capability on which this study was focussed exists in the space defined by sensors and weapons to the right of the diagram. This capability interfaces with other entities (be they ships, planes or headquarters) through the DIE.

SOURCING DATA AND INFORMATION

Data concerning three types of entities were required for DSTO's purpose:
• data concerning the Information Exchange Requirements between the new capability and the DIE;
• a list of relevant DIE standards (current and likely future); and
• a list of missions that might need to be undertaken in the context of Defence fielding the new capability.

The project's architecture data provided the central source of information about the types of information exchanges needed between the new capability and the broader Defence Information Environment. In other words, the architecture products, the project Operational Concept Document (OCD) and other project documents (such as the preliminary Functional Performance Specification (FPS) and the Communication Information Systems (CIS) report) articulated the types of information that needed to be exchanged, the methods and media that would be used to exchange the information, and some of the associated standards. Standards associated with the DII layers of interest were sourced from the Approved Technology Standards List (ATSL) and other Defence documentation. Future standards assumed the adoption of the US Distributed Common Ground System Integrated Backbone (DIB) standards into the DIE.1

1 References related to the project or restricted Defence information are not included.

A list of the missions being undertaken by the new capability was contained within the OCD. However, the architecture products were expressed in terms of scenarios, which were too context specific for generating statements of generic mission effectiveness into the future.
The missing mission data was extracted from the OCD (for the new capability) and from other doctrinal sources for the other capabilities (nodes) in the system with which information exchanges were required.

ARCHITECTURE DATA: THE UNDERLYING PROBLEMS

As stated, the design of an architecture is mediated by the purpose for which it will be used. The architecture provided with the project's OCD was flawed on a number of levels.

• The architecture was developed in a fragmented way. The key 'operational view' data that form the DAF were developed for the project under two different contracts at two different times. This led to inconsistencies that the adoption of a framework approach aiding configuration management is meant to avoid. Specifically, items of information specified in one area of the architecture were inconsistent, in terms and definitions, with items specified in other parts of the architecture. This made it impossible to directly trace from operational activities to information exchanges between nodes. The answer to the question 'what activity does this information exchange relate to' could only be guessed at. A second inconsistency was the use of missions as the basis of one area of the architecture and scenarios as the basis of the other. These fundamental disconnects made interpretation of the relationships between operational entities, information and operational activities very difficult.

• The architecture consisted of views, but very little documented underlying data. An architecture is a collection of data and data relationships. Unfortunately, the architecture provided by the project was a set of 'views' (diagrams) with no explicit information about the underlying data structures or relationships. No repository existed that could be filtered, sorted or in any way manipulated. In essence, the architecture products supplied as part of the OCD were aimed at securing transit of the document through 'first pass', but they were poorly suited to our analytical purposes and (as shall be seen) required considerable data 'scrubbing'.

• The elements of the architecture were ambiguous. Development of the elements of the architecture was not done consistently. For instance, capability nodes were defined at different levels of resolution: weapons hardware, radar systems, individual ship types and a surface action group (SAG) were all present as nodes, with relationships expressed as information 'needlines' between them. There was no indication as to which ships constituted a SAG and which ships within the SAG might be responsible for each needline. Some decision early on as to the fidelity required for analysis could have saved effort in the long run. Another problem was in the expression of Information Exchange Requirements (IERs) in the IER matrix. A sample of the IER matrix showing one of the hundreds of IERs processed is shown in Table 1 (below).

Table 1: Selected columns for a row appearing in the IER matrix.
Information Element Name: Weather forecasts
Content Description: Weather Forecasts (incl. visibility)
Triggering Element: Bureau of Meteorology (BOM)
Activity: Start of Mission
Producer: BOM
Consumer: Command and Control Headquarters
Media System: Internet
Media Method: Line
Temporal Information: 1 way

The IER matrix was a primary source of information for the study, but it contained significant ambiguities.
The IERs were described as being ‘one way’, ‘two-way’, ‘network’, etc. and multiple information types were identified for each IER. The ‘oneway’ IERs were manageable; ‘two way’ or ‘network’ made it difficult to determine which information was flowing in what direction. The ‘media system’ and ‘media method’ fields were occasionally populated by multiple data options. For instance, if the system field contained ‘Mainframe, PC, Mac, Notebook’ and the method field contains ‘wireless, network, hand carriage, line of sight, beyond line of sight’ then decisions must be made as to which of the cross- 40 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 product (system x method) options were valid and needed. Some combinations could be automatically ruled out (mainframe x hand carriage) but it wasn’t possible to determine whether other combinations made sense and whether or not they were essential to meeting the IER. In some cases the systems or methods mentioned were not discrete (eg. TADIL, Datalink, Link A), leading to boundary problems. If an IER mentions Link A, do you also relate it to Datalink and TADIL? Problems can be introduced by broadening the scope of node to node IER media methods and systems. x The ‘users’ or purpose of the architecture was not well understood. It was clear that the architecture views were developed to facilitate a stage in the approval of the project. This doesn’t preclude the appropriate development of the architecture for other uses, but the caveat attached to the front of the architectural views stated that no analytical purpose had been established as part of the design brief for the architecture. In other words, the developers of the architecture were not given an explicit statement of all the uses to which the architecture would be put, although they may very well have made implicit assumptions about what the architecture might be used for and by whom it might be used. Some of the problems raised above could have been avoided at the design stage of the data gathering process. Much of the effort in the study was in correcting or compensating for the above data consistency and configuration problems. FIXING AND RECONCILING THE UNDERLYING DATA Given the challenges inherent in the architectural views supplied in the project’s OCD, two options were available: 1. Load the views into a Commercial Off-The-Shelf (COTS) architectural product; or 2. Create a purpose-built relational database to house the study data and implied data relationships derived from the OCD views supplied. The first option (COTS) entailed purchasing software licenses and training staff to a reasonable level of expertise. Further, it wasn’t possible to guarantee that a COTS tool would easily accommodate such ill-configured data. Ironically, configuration management may have limited our flexibility in manipulating and analysing what poor data was available. Due to these risks and costs, this option was discounted. The second option required the creation of a simple relational database with just enough detail to support the data we had available using Defence application tools already in place. We chose to create a lightweight prototype for the purposes of supporting this project alone: it would be flexible (or slack) enough to accommodate the badly formed data. Importantly, it would provide a queriable repository, one that wasn’t available from the project architecture itself. 
To construct the database, it was necessary to define the logical relations between the data of interest within the OCD architecture views. The logical relationships between data types are described in the Entity Relationship Diagram (ERD) in Figure 2 (below).

Figure 2. Logical IER matrix Entity Relationship Diagram (entities: Scenario, IER, Need, Node and Media).

Unfortunately, this relational model was not broad enough in scope to address the analytical questions the study wanted to answer, and so it was extended to accommodate the standards list and missions list, as seen in Figure 3 (below). Standard data was related to the media methods and systems that needed to comply with the standards. Missions were related to the IERs that supported the mission activity and the nodes that were actually involved in carrying out mission activities.

Figure 3. Extended logical Entity Relationship Diagram (adding Standard and Mission entities).

Figure 4. Final Entity Relationship Diagram for the study database. The final design contains entity tables (including Scenario, Mission, IER, Need, Node, Activity, Band, Organisation, DataSource, Standard, MediaMethod, MediaType and MediaSystem) and many-to-many 'mapping' tables (including IERMediaMapping, IERMissionMapping, ScenarioIERMapping, MissionNodeMapping, NodeMediaMethodMapping, NodeMediaSystemMapping, StandardMediaMethodMapping, StandardMediaSystemMapping and StandardMediaTypeMapping).

This logical structure was instantiated as a simple MS Access database. To reduce duplication, tables to facilitate many-to-many relationships were introduced. This ensured that the lookup or 'mapping' tables containing the table 'joins' were insulated from entity data changes. Such relationship tables appear in Figure 4 (above) as tables whose names end in 'Mapping' (i.e. IERMediaMapping). Entity data was entered using various methods:
• by file import into the appropriate table (DIB data into the Standards table);
• using cut-and-paste on a record-by-record basis (ATSL data into the Standards table);
• executing insert queries (duplicating each 'two-way' IER to create two 'one-way' IERs with reversed source and sink nodes); and
• by manual data entry (Scenario, Mission, IER, Need, Node, Media).

After the data was entered, we tested the system by recreating the IER matrix using database queries and matching the output against that provided in the OCD. Other queries showing Node - Mission relationships and Node - IER Method - IER System relationships were also generated and scrutinised for error. This building of an enriched relational database was the first step in tackling the shortcomings inherent in the original data. Not only did it correct these shortcomings, but it also provided a repository through which questions linking different parts of the architecture could be asked and meaningful answers arrived at. For example, based on this newly constructed database, it was possible to ask questions like 'what will the mission impact be if the project adopts a standard for this tactical data link that mismatches with the standard mandated in the ATSL?'.
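The study database was a purpose-built MS Access database; the following SQLite sketch is included only to make the mapping-table idea and the 'two-way to one-way' insert query concrete. Table and column names are simplified assumptions rather than the study's actual schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Simplified entity and mapping tables (names are illustrative, not the study schema).
cur.executescript("""
CREATE TABLE IER (ier_id INTEGER PRIMARY KEY, name TEXT,
                  source_node TEXT, sink_node TEXT, direction TEXT);
CREATE TABLE MediaSystem (media_system_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE IERMediaMapping (ier_id INTEGER, media_system_id INTEGER);
""")

cur.execute("INSERT INTO IER VALUES (1, 'Weather forecasts', 'BOM', 'C2 HQ', 'two way')")
cur.execute("INSERT INTO MediaSystem VALUES (1, 'Internet')")
cur.execute("INSERT INTO IERMediaMapping VALUES (1, 1)")

# Split each 'two way' IER into two 'one way' IERs with source and sink reversed,
# mirroring the insert-query data entry method described above.
cur.execute("""
INSERT INTO IER (name, source_node, sink_node, direction)
SELECT name, sink_node, source_node, 'one way' FROM IER WHERE direction = 'two way'
""")
cur.execute("UPDATE IER SET direction = 'one way' WHERE direction = 'two way'")

# The repository can then be queried across the architecture, e.g. which IERs use which media.
for row in cur.execute("""
SELECT IER.name, IER.source_node, IER.sink_node, MediaSystem.name
FROM IER JOIN IERMediaMapping USING (ier_id)
         JOIN MediaSystem USING (media_system_id)"""):
    print(row)
```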
This presented a significant step forward in making the data analysable, and it highlights the importance of understanding the purpose for which the architecture is built.

DEVELOPING AN ANALYSIS APPROACH

The analysis approach involved generating a score for each IER's compliance with the identified standards and then aggregating the scores in order to give meaning at a systemic level. It was of little or no value to simply provide the project with a view of which media systems had poor interoperability, but it was useful to highlight risks to mission outcomes that arose from this lack of compliance. Therefore, the compliance score for an IER is the aggregation of all media system and media method scores associated with the IER. A media score is related to how many standards associated with the media are duplicated between the new capability standards list and the DIE standards list. When considering the future DIE, we incorporated the 'other' (DIB) standards into the DIE standards list. See Figure 5 (below) for a pictorial view of the aggregation. Note that the fidelity and confidence in the resulting scores diminishes at each level as the score is diluted by continual aggregations and outliers are camouflaged. Three cases were explored: worst case, medium case and best case. In the medium case analysis, it was assumed that the risk associated with information exchange to and from the proposed system was proportional to the fraction of standards common to the system and the DIE (incorporating the DIB in the case of the future standards analysis).

Figure 5. The data structure supporting the analyses undertaken: standards (DIE standards, proposed system required standards and 'other' standards) are scored against media methods and media systems, which are aggregated to IERs, needlines and missions, with confidence in the scores diminishing at each level of aggregation.

To illustrate the method used here, consider the analysis of system compliance with the DIE:

1. Each standard was assigned a value of '1' if it was common to both the proposed system and the DIE, or '0' otherwise;

2. For each media method, two scores were determined:
a. the 'Positives', which was the number of standards relevant to that media method that were marked as '1';
b. the 'Total', which was the total number of standards relevant to that media method.
If the 'Positives' equalled the 'Total' then all standards associated with a media method were, by definition, common to both the proposed system and the DIE. A similar procedure was used to score the media systems.

3. The degree of compliance for each IER was determined as follows. For an example IER with one associated media system:media method pair A:A', the compliance score P is determined by:

P = c(A, A') =
\begin{cases}
0, & \text{if } A = \emptyset \\
0, & \text{if } A' = \emptyset \\
\dfrac{\mathrm{Positives}(A)}{\mathrm{Total}(A)} \times \dfrac{\mathrm{Positives}(A')}{\mathrm{Total}(A')}, & \text{otherwise.}
\end{cases}

In the medium case analysis, for IERs with more than one media system:media method pairing, the average of the compliance scores for the IER was calculated:

P = \frac{1}{N} \sum_{\lambda=1}^{N} c(\lambda, \lambda') \quad \text{for an IER with } N \text{ media pairs.}

An identical approach was used to determine the degree of compliance for each needline and mission.
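A minimal sketch of this scoring follows, assuming the pair score is the product of the two Positives/Total ratios as reconstructed above, that the medium case is the average over an IER's media pairs, and that the best and worst cases (described in the next paragraph) take the maximum and minimum pair score respectively. All function and variable names are ours, not the study's.

```python
def pair_score(positives_sys, total_sys, positives_meth, total_meth):
    """Compliance c(A, A') for one media system:media method pair (0 when either set is empty)."""
    if total_sys == 0 or total_meth == 0:
        return 0.0
    return (positives_sys / total_sys) * (positives_meth / total_meth)

def ier_compliance(pairs, case="medium"):
    """Aggregate the pair scores for one IER.

    `pairs` is a list of (positives_sys, total_sys, positives_meth, total_meth) tuples.
    Medium case: the average of the pair scores; best/worst: the max/min pair score.
    """
    scores = [pair_score(*p) for p in pairs] or [0.0]
    if case == "medium":
        return sum(scores) / len(scores)
    if case == "best":
        return max(scores)
    if case == "worst":
        return min(scores)
    raise ValueError(f"unknown case: {case}")

# Example: an IER carried over two media pairs, one fully compliant and one partly compliant.
pairs = [(4, 4, 3, 3), (2, 4, 1, 2)]
print(ier_compliance(pairs, "medium"))  # 0.625
print(ier_compliance(pairs, "best"))    # 1.0
print(ier_compliance(pairs, "worst"))   # 0.25
```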
In the best and worst case analysis, it was assumed that the risk associated with information exchange between the proposed system and the external environment was determined by the best or worst case risk associated with a given media system:media method pair, but otherwise the approach was for all intents the same as that described above. An identical approach was taken to determine the degree of compliance for each needline and mission. The best/worst case analysis provides an idea of the risk spread for a given IER or mission.

REPRESENTING RESULTS

A challenge for any analysis is to represent results in a way that is meaningful to the client. In this study, simply stating the compliance levels across information exchanges would have conveyed the underlying analysis but not expressed it in terms that the clients (military operators) would be likely to appreciate. Instead, considerable work was done to build the relational database to allow the impact of poor compliance on missions to be determined. A simple scheme was used to express the risk of mission failure. Results were presented in a client report to the IPT in graphical and tabular form. The risk profile graphs for IERs and missions were presented and the implications for Defence discussed. A sample graph is shown in Figure 6 (below). In an attempt to focus any remedial effort and identify 'low-hanging fruit', a coarse sensitivity analysis was done. The results were included in the client report in graphical form akin to Figure 7 (below). It is clear from such an output that further work regarding options for improving the compliance of 'System X' should have a positive effect on mission risk. To allow the Defence project team to conduct further investigation, the client report was accompanied by the study tools (database and spreadsheet) in electronic form; that is, the database was not only used to support the DSTO study, but also serves as a tool for ongoing analysis within the project itself. The project team have the option of using the tools to check their high-level compliance as standards or documents develop throughout the future life of the project. This transfer of knowledge in explicit form adds value to the analysis already undertaken.

Figure 6. Proportion of missions in each risk category as a result of non-compliance of standards between the proposed system and the DIE (numbers are illustrative, not actual):

Risk category                 Best case   Medium case   Worst case
Very High (0.0 < P <= 0.2)    0.0220      0.1500        0.1600
High (0.2 < P <= 0.4)         0.0280      0.2600        0.3700
Moderate (0.4 < P <= 0.6)     0.4200      0.3300        0.4700
Low (0.6 < P <= 0.8)          0.5300      0.2600        0.0000
Very Low (0.8 < P <= 1.0)     0.0000      0.0000        0.0000

Figure 7. The relative contribution made by each media system (top) and media method (bottom) to the overall compliance performance and associated risk of the proposed system's information exchange with the broader DIE (results are illustrative only).
47 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 Apart from advice concerning the proposed system’s interfaces, the DSTO has made many recommendations about the architectural products supplied as part of the OCD. The Defence project team are now investing in their capability architecture as an important engineering and analysis tool. The faults that were identified in form and content have been taken on board and a coherent, self-consistent architecture is currently being produced. CONCLUSIONS Architectures form the basis for reasoning about systems. In Defence, they are a mandated part of the capability development process. However, projects often lack an appreciation of the likely uses to which an architecture may be put. This paper has highlighted the importance of considering the users in any development of an architecture. It has flagged specific shortcomings in the data presented to DSTO by the project and detailed the effort that went into remedial work. This level of effort is costly and time consuming, and would be obviated if a more encompassing approach to the development of the architecture took place up front. Finally, it has flagged several important lessons, which may seem obvious conceptually but which are often overlooked in capability development: x x x x An architecture is only as valuable as it is consistent, coherent and documented. Views, mismatches in resolution or elementary content and lack of specificity undermine the utility of the architecture in supporting reasoning; Any compliance process must take into account the quality and validity of the architectural data and products, not just the form of the Architectural views. Defence should assess whether their compliance process meets such a goal; Analysis of architectural data requires the data to be contained in analysable form, but it isn’t enough to perform the analysis: the results must be meaningful to the end audience. To make it meaningful requires an understanding of the end audience and the questions they are ultimately going to want answered. This information, in turn, dictates the form of the architecture. Often, multiple audiences exist and multiple questions need to be answered, which amplifies the need to move beyond views towards a structured repository; and finally, Transferring knowledge isn’t simply about providing advice and recommendations; it is also about transferring tools and the capacity for analysis to the end user. REFERENCES Australian Department of Defence, (2006) Defence Capability Development Manual, Defence Publishing Service, Canberra. Chief Information Officer Group, (2006) Defence Information Environment (DIE) Approved Technology Standards List (ATSL) (version 2.5), Canberra. Chief Information Officer Group, (2009) Directorate of Architecture Practice Management (DAPM) - Defence Architecture Framework. [Online] (Updated 6 Apr 2009) Institute of Electrical and Electronics Engineers, (2000) IEEE Std-1471-2000 Recommended Practice for Architectural Description of Software-Intensive Systems, IEEE. Weill, P. (2007) Innovating with information systems: what do the most agile firms in the world do?, Proceedings of the 6th e-Business Conference-PwC & IESE. Barcelona, Spain 27 March 2007. 48 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 SYSTEMS ENGINEERING IN-THE-SMALL: A PRECURSOR TO SYSTEMS ENGINEERING IN-THE-LARGE Phillip A. 
Relf1; Quoc Do3; Shraga Shoval3; Todd Mansell2; Stephen Cook3; Peter Campbell3; Matthew J. Berryman3

1 Raytheon Australia, email: [email protected]
2 Defence Science and Technology Organisation, email: [email protected]
3 University of South Australia, email: Matthew.Berryman, Peter.Campbell, Stephen.Cook, Quoc.Do, Shraga.Shoval {@UniSA.edu.au}

Abstract – The teaching of the systems engineering process is made problematic by the sheer breadth of experience required of a practising systems engineer. A 'gentle' project-based introduction to systems engineering practice is currently being investigated under the Microcosm programme. The Microcosm programme integrates a robotics-based system-of-systems as the first stage in building a systems engineering teaching environment. Lessons learnt have been collected from the Microcosm Stage One project, and the systems engineering processes used during the project have been captured. This paper analyses the processes and lessons learnt, compares them against typical large-scale Defence systems engineering projects, and discusses the lessons learnt captured by the systems engineers who had been working in-the-small. While executing the case study it was found that the use of robust industrial systems engineering processes would have militated against the occurrence of those lessons learnt that are already known to industry, but that with these industrial processes the Microcosm project schedule would have been exceeded. As the Microcosm Stage One project was successfully completed, effort must now be expended to ensure that the participants understand the limitations and strengths of systems engineering in-the-small procedures and also understand the issues associated with scaling up those procedures.

INTRODUCTION

This paper reports a case study of a systems engineering project that was conducted in-the-small, namely the Microcosm Stage One project, and contrasts the lessons learnt against industrial systems engineering in-the-large practices. Where possible, the relevant industrial practices are cited from published papers. Finally, recommendations are given to balance the education of systems engineers who have only worked in-the-small, to offer an appreciation of working in-the-large.

BACKGROUND

Contrary to popular belief, systems engineering has been with us for some considerable time, but we are still struggling to define what it is and to understand how to teach it. We are dogged in our need to build increasingly complex systems but are frequently hindered by our own cognitive limitations. Microcosm is a systems engineering 'sandpit' that is designed to allow our novice systems engineers to learn the explicit and tacit knowledge required of a systems engineer, and to guide them on the path to transition from systems engineering in-the-small to systems engineering in-the-large.

Systems Engineering Scope and History

McQuay (2005) surveyed the literature and reports that a general consensus on the scope of systems engineering has been achieved, but Concalves (2008) notes that this has not always been the case. Traditionally, systems engineering has used the "V" model (i.e., top-down definition, and bottom-up integration and test) (Walden 2007) and has been characterised by systems engineering standards such as ANSI/ITAA EIA-632, IEEE Std 1220, ISO/IEC 15288 and MIL-STD-499, to name a few. Systems engineering in its simplest manifestation is concerned with the design of the whole and not with the design of the parts (IEEE 2000).
Lightfoot (1996) defines systems engineering as the controlled application of procedures, standards and tools to a problem such that the developed solution manifests to satisfy a specific need. INCOSE (2009) expands on this definition: Systems Engineering integrates all the disciplines and specialty groups into a team effort forming a structured development process that proceeds from concept to production to operation. Systems engineering considers both the business and the technical needs of all customers with the goal of providing a quality product that meets the user needs. Systems engineering was born out of World War II (Brown and Scherer 2000). Since the 1940’s the Defence sector has used systems engineering practices to develop complex systems (McQuay 2005) which have been used to coordinate information, material and people in the support of operational missions (Brown and Scherer 2000). However, systems engineering practice continues to suffer unrest driven by economic and political pressures (Hellestrad 1999). As an example, in 1994 The Secretary of Defence directed that commercial off-the-shelf (COTS) equipment should be used to encourage open systems architectural development and as insurance against obsolescence issues (IEEE 2000), hence changing the scope of systems engineering practice in the process. Specifically, the importance of systems requirements in the systems engineering lifecycle of new projects has been reduced substantially as COTS equipment, by definition, already exists and has been built to another customer’s system requirements specification. Classical systems engineering is essentially a sequential, iterative development process that results in the generation of a system (Rebovich 2008). The systems engineer works on the assumption that the classical systems engineering process is driven by requirements which ultimately result in a system (Rebovich 2008, Meilich 2005). However, classical systems 50 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 engineering is outmoded (Rebovich 2008) particularly due to the complexities imposed by the use of COTS equipment as system components (Dahmann and Baldwin 2008, Walden 2007, Rickman 2001). It has been recognised that the essence of systems engineering has become how to integrate the system components into a whole (Chase 1966). More recently, academics and practitioners alike now refer to systems integration as the process being undertaken in the development of systems (Ai and Zhang 2008, Mindock and Watney 2008, Meilich 2005). However, industry strongly rejects the practice of using “systems integration” as a synonym for “systems engineering” as it creates confusion in that the term would refer both to an engineering discipline and to a step in the systems engineering lifecycle. Systems Engineering Process Boarder (1995) analysed the systems engineering process and identified 400+ unique activities. While using these activities to guide the development of a relatively simple system, a total in excess of 1,000 distinct activities were subsequently identified (Boarder 1995). Independently, Raytheon have developed their Integrated Product Development System (IPDS) which, at the lowest level, contains over 1,000 activities (Rickman 2001). The Raytheon IPDS integrates three systems engineering lifecycle models (i.e., evolutionary, spiral and waterfall) into a tailorable systems engineering process (Rickman 2001). 
The employment of such complex procedures may appear excessive to the novice but none the less, the effective application of the systems engineering process can reduce the cost of developing a system by the prevention of errors and hence results in the reduction of rework (Lewkowicz 1988). Education Needs Lann (1997) stated that systems engineering was the “least rigorous” of the accepted engineering disciplines. Given this assertion, it is apparent that some concerted education effort was required but how to proceed? Systems engineering encompasses both technical and non-technical (i.e., cultural, economic, operational, organisational, political and social) contexts (Rebovich 2008, Stevens 2008, Lee 2007). In addition, a systems engineer must be taught problem solving techniques but an undue influence on this aspect can in actuality retard the learning of other systems engineering skills (Concalves 2008). Systems engineers must also be taught how to manage complexity: including management complexity, methodology complexity, research complexity and systems complexity (Thissen 1997). However, due to the scope of systems engineering (see the INCOSE definition as an example), one person cannot be expected to hold all the prerequisite skills and hence systems engineering must be practised by a team of individuals who collectively hold the prerequisite skills (Concalves 2008). Existing systems engineers have agreed that systems engineering leaning can only effectively occur though experience and that guided experimentation is the best approach (Newman 2001). As formal education has been based extensively on the delivery of explicit knowledge (i.e., knowledge that can be readily communicated), a paradigm shift is required to provide tacit knowledge (i.e., knowledge that cannot easily be communicated – like learning how to ride a bicycle) (Concalves 2008). 51 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 We need an ability to provide systems engineering capability if we are to sustain growth while managing system complexity (Concalves 2008). However, this necessary systems engineering knowledge, which is created in the minds of individuals, does not disseminate throughout an organisation naturally (Pope et al. 2006). This knowledge must be actively transmitted to those neophytes who will become the next generation of systems engineers. Management have attempted to disseminate systems engineering knowledge by capturing this knowledge within procedures and then mandating the use of these procedures within their organisation. However, experience has shown that this practice fails and that systems engineering knowledge is created only through the practice of relevant formal education combined with applicable systems engineering experience (Chase 1966). Currently there is a global shortage of systems engineers (Concalves 2008). There are relatively few systems engineering degrees being awarded, which is further exacerbating the problem. Brown (2000) has estimated that there are less than 500 BS, approximately 250 Master and approximately 50 PhD systems engineering degrees awarded per year in the US. For the same year, the National Academics Press (2008) published figures of 413 BS, 626 Master and 75 PhD systems engineering degrees (which are in the same order of magnitude ranges as the previous figures). 
These figures have only marginally increased to 723 BS, 1,150 Master and 104 PhD systems engineering degrees awarded in 2006 (The National Academics Press 2008), compared to approximately 74,000 general engineering degrees awarded in 2004 (IEE 2004). Strategies to instil systems engineering competencies into general engineering graduates have included: attending formal courses; developed either internally or externally to the company; and providing on-the-job experience, which can be slow to achieve results made questionable when not accompanied by appropriate theory (Concalves 2008). Asbjornsen and Hamann (2000) suggest that a minimum of 300 hours is required to give an engineer only the most basic systems engineering knowledge. However, industry can be reluctant to make this investment and would prefer to employ suitably knowledgeable systems engineers. Hence, the impetus to relegate systems engineering training to the universities has come from industry (Asbjornsen and Hamann 2000). When approached by British Aerospace to develop a systems engineering course, Loughborough University had some concerns regarding the content of the degree (Newman 2001). The university interviewed potential supporters for the degree to identify the expectation to be placed on the degree graduate’s abilities, knowledge and skills. It became apparent that the students would require eight to ten years of formal course work just to learn the engineering basics prior to teaching the knowledge particular to systems engineering (Newman 2001). Loughborough University subsequently developed a less ambitious four year systems engineering BEng degree and a five year Master degree. The university found that in addition to the usual student attributes, the systems engineering Master degree students: provided mutual support to their fellow students; had a strong sense of identity; demonstrated a strong ability to cope with change; were able to provide innovative solutions; and were comfortable presenting their work in open forums (Newman 2001). 52 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 The USAFA have also developed a four year systems engineering Bachelor degree. The USAFA degree in first year introduces students to systems engineering by having the students develop a boost glider where they must use aeronautical engineering to support their system design, astronautics engineering to launch the glider, electrical engineering to effect control during glider flight, mechanical engineering to develop a robust design, and civil engineering to build a launch pad (George 2007). The John Moores University, Liverpool has also employed experimentation as part of their systems engineering education programme. The university has used a systems engineering project employing a robot design for many reasons. Some of these reasons include: the ability to use a simple design which is readily understandable by first year students; the project has a particular relevance to the electrical and manufacturing component of systems engineering; and the project is readily extendable making it ideal as a student project (Boyle and Kaldos 1997). The teaching of systems engineering, similar to the execution of complex systems engineering projects, must be addressed incrementally. The University of Queensland found that the IEEE Std 1220 standard proved to be too complicated in its untailored form for the students to understand (Mann and Radcliffe 2003). 
Novice engineers have difficulty assimilating systems engineering concepts as they are unable to see their relevance to the task (Mann and Radcliffe 2003) and have an incomplete appreciation for what can go wrong in a project (Concalves 2008). Similarly, the sequence in which systems engineering concepts are presented is important to trainee systems engineers. As an example, systems engineers working on small systems may develop a process based on their experiences with these systems and become unable to modify their approach to address more complex systems engineering challenges, even in the face of evidence that their processes are inadequate (Concalves 2008). The primary author has noted that another oversight in the education of systems engineers is that they are not generally taught to recognise systems engineering problems that are too complex for the waterfall or even the spiral models of system development, and for which an evolutionary model must, in these instances, be employed.

Systems Engineering In-The-Large
Systems engineering academia and practitioners alike now recognise that the engineering of small-scale, bounded, predictable systems (i.e., systems engineering in-the-small) is inherently different from the engineering of large-scale complex systems (i.e., systems engineering in-the-large) (Stevens 2008). Watt and Willey (2003) list the specific characteristics of large-scale systems engineering projects that make them different from small-scale systems engineering projects, i.e.:
- Management of subcontractors.
- Management of factors that are considered to be high risk.
- Competing viewpoints between stakeholder-held system priorities.
- Integration of multiple system components and multiple technologies, some of which may not exist at project start.
- Development cycles measured in years and operational life-spans measured in decades, which has an impact on obsolescence issues and which must be considered during the design phase of the project.
- Projects typically funded by large organisations, with large budgets and often with high public visibility.

When integrating COTS equipment to build a system-of-systems, the initial system requirements are often changed to allow for unambiguous mapping to the individual COTS equipment (Walden 2007, Rickman 2001). Similarly, COTS equipment quite often does not seamlessly integrate with other COTS equipment, and 'glue' and/or 'wrapper' software is required to effect the actual systems integration (Walden 2007). This software is typically supported, within large systems engineering projects, by a single software entity often called a Data Server or, more recently, an Integration Backbone (Raytheon 2009).

CASE STUDY METHOD
A literature review was conducted to discover the current and historical scope of the systems engineering domain. This literature review also covered the known failings apparent in systems engineer training, particularly with reference to the transition from systems engineering in-the-small to systems engineering in-the-large. The result of the literature review has already been presented within the Background section of this paper. Industrial systems engineering processes, and the actual industrial practices that the primary author has been privy to, were compared to the practice demonstrated on a small systems engineering project.
The primary author has been closely involved with the Microcosm Stage One engineering project and also conducted interviews with the participants to understand their practices during the project. Judgement was made as to whether the industrial systems engineering processes could have alleviated the issues encountered by the Microcosm Stage One project systems engineers. For a description of the Microcosm Stage One project, see Mansell et al. (2008). CASE STUDY RESULTS The Microcosm programme (Mansell et al. 2008) has commenced the development of a systems engineering ‘sandpit’, using robot vehicles and airships, to teach necessary theory and skills to systems engineering students. The case study used the Microcosm programme as an example of systems engineering in-the-small and contrasted this program against various industry projects such as the Air Warfare Destroyer (AWD), Jindalee Operational Radar Network (JORN) and the Collins Replacement Combat System (RCS) projects which represented systems engineering in-the-large. The comparison was made by the primary author who has in excess of thirty years engineering experience. Results – Systems Engineering Process In the absence of a formal systems engineering procedure, an industrial process was heavily tailored and captured as a series of activities described within a Project Management Plan (PMP) on behalf of the Microcosm systems engineers by an industry experienced systems 54 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 engineer. The Work Breakdown Structure (WBS) contained within this industrial PMP proved to be too ambitious for the project schedule. This industrial PMP was abandoned and a lighter-weight PMP was developed by the staff who would be actually executing the Microcosm Stage One project. During the planning stage the Microcosm systems engineers were introduced to several tools (e.g., project scheduling software, the WBS concept) that not all members of the team were familiar with. This familiarisation process took time from the schedule. The industrial PMP referenced 95 activities in a WBS, which was reduced to 31 activities in the lighter-weight PMP. A corresponding reduction in effort saw the estimate of effort reduce from 3.01 person years to 1.09 person years. The Microcosm Stage One project was completed with an expenditure of 1.31 person years but as was noted above, some time was lost in tool familiarisation, which was not accounted for within the project schedule and some rework was also required to progress the project. The systems engineering process actually used by the Microcosm Stage One project was to conduct a needs analysis; develop a number of indicative scenarios; develop the system architecture; conduct requirements analysis followed by; system design; system implementation (i.e., build); system integration; and system test and evaluation. The needs analysis was informally communicated to the participants by a PowerPoint presentation and formally tracked using an Excel template developed to support the needs analysis activity (Shoval et al. 2008). The scenario definition (which were labelled as ‘use cases’ by the Microcosm systems engineers) broadly consisted of: (1) provision of a systems engineering environment; (2) provision of post graduate courses; (3) investigation of human-machine interfaces; (4) autonomous vehicles research; (5) investigation of model-based systems engineering approach; and (6) a demonstration of the robot vehicle’s operation. 
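A small calculation, using only the effort figures reported earlier in this section, puts the schedule outcome in perspective. The sketch below simply restates those figures and computes the percentage differences; no other data are assumed.

# Quick arithmetic on the effort figures reported for Microcosm Stage One.
industrial_estimate = 3.01   # person-years, original industrial PMP (95 WBS activities)
light_estimate      = 1.09   # person-years, lighter-weight PMP (31 WBS activities)
actual_effort       = 1.31   # person-years actually expended

overrun  = (actual_effort / light_estimate - 1) * 100
saving   = (1 - actual_effort / industrial_estimate) * 100
retained = 31 / 95 * 100

print(f"Overrun against the lighter-weight estimate: {overrun:.0f}%")   # ~20%
print(f"Actual effort below the industrial estimate: {saving:.0f}%")    # ~56%
print(f"WBS activities retained after tailoring:     {retained:.0f}%")  # ~33%

In other words, the project overran its own lighter-weight estimate by roughly twenty per cent, yet still consumed less than half the effort forecast by the untailored industrial plan.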
The Microcosm Stage One architecture describes three systems, i.e.; the Modelling and Simulation Control system; Microcosm Information Management system; and the Microcosm Physical system. The system architecture was described using the Department of Defense Architecture Framework (DoDAF): High-Level Operational Concept Graphic (OV1), Operational Node Connectivity Description (OV2), Systems/Services Communications Description (SV2) and functional flow diagrams. The systems requirements were not formally captured but were presented on PowerPoint slides during the Microcosm Stage One preliminary design review (PDR). The system design was developed (extensively individually) by the relevant systems engineers and was jointly presented during the PDR. Similarly, the system implementation, and system test and evaluation were extensively developed by the responsible systems engineers. The Microcosm Stage One system-of-systems was ‘sold off’ against a demonstration of the sixth scenario (i.e., robot vehicles operation scenario), in the absence of any formally recorded systems requirements. The first five scenarios defined by the scenario elucidation phase were 55 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 considered outside of the scope of the Microcosm Stage One project and hence were not applicable to the ‘sell off’ of the Microcosm Stage One project. Results – Lessons Learnt The Microcosm Stage One project suffered from some rework, which resulted in a number of lessons learnt by the Microcosm systems engineers (Do et al. 2009). The lessons learnt that were documented by the Microcosm systems engineers are broadly grouped as: (1) program management; (2) specialty engineering; and (3) COTS equipment related. In addition, lessons learnt were also extracted during interviews with the Microcosm systems engineers and it became apparent that they were dissatisfied with the level of their system documentation effort and also with their system engineering risk assessment. As Colwell (2002) notes, large systems require some considerable time to complete but that at completion the methods and tools, successfully used on the project, may now be obsolete, necessitating that we continually learn from our systems engineering endeavours. The project management issues included but were not limited to the under estimation of work effort, particularly the testing effort, consequently the project suffered from lack of staff resources and project documentation suffered. The general work effort estimation and specifically the testing effort estimation was acknowledged as being due to the inexperience of the Microcosm system engineers who developed the schedule. Industry would typically employ seasoned systems engineers who would also have access to historical data that could be employed to support the effort estimates. Industry has learnt and continues to learn the importance of allocating sufficient systems testing effort. For instance the Hubble telescope’s early problems were partly due to inadequate systems testing (Colwell 2002) and Constantinides (2003) states that the Mars Climate Orbiter and Mars Polar Lander were both lost due to inadequate system testing. The specialty engineering issues include but were not limited to configuration management, system modelling and safety. 
The Microcosm Stage One project suffered during system testing as it was difficult for the Microcosm systems engineers to identify tested software modules from those software modules that were being actively modified. Industry recognises this problem as a configuration management issue and implements configuration management processes specifically to deal with this issue. The Microcosm systems engineers discovered during system testing that their robot vehicle’s localisation software was unable to cope with measurement inconsistencies introduced by the environmental conditions (i.e., uneven floor surfaces). Industry would develop a system model which would, depending on the fidelity of the model, could be expected to predict such issues as positional measurement inconsistencies. Again during system testing, the robot vehicle went ‘rogue’, which could have resulted in damage to the robot before manual action was forthcoming to secure the robot vehicle’s motion. Industry would have employed a safety engineer whose task it would have been to consider safety as a project risk. However, all of these industrial solutions are applicable to systems engineering in-the-large and are of a lesser importance to systems engineering in-the-small as demonstrated by the successful completion of the Microcosm Stage One project. 56 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 Industry have used systems engineering policy documents such as ISO/IEC 15288 (2002) and MIL-STD-499C-Draft (2005) to develop their systems engineering process and as such have empowered specialty engineers such as dedicated configuration management engineers to manage project artefacts; modellers to model the proposed system architecture within the real environment to validate the system design; and safety engineers to ensure that the system design is safe to operate and does not cause unintentional damage. In the absence of a model for the system processing performance requirements, the processing capability of the system architecture was only found to be inadequate during system testing. Similarly, in the absence of a functional executable model (i.e., simulator) the Microcosm systems engineers realised a schedule slip when the COTS equipment failed and could not be repaired in a timely fashion. The Microcosm systems engineers, who were under considerable schedule pressure at the time, were not able to adequately document their system architecture and there is some evidence that a hierarchical system architecture that affords ‘full instrumentation’ (i.e., adequate test points) was not achieved. Full instrumentation was an initial requirement for the system architecture, see Mansell et al. (2008). Industry continues to learn these lessons too. Beutelschies (2002) states that the loss of the Mars Climate Orbiter was due in part to the lack of documented project-level decisions and lack of documented system architecture. The COTS equipment demonstrated issues relating to a miss-match between asynchronous and synchronous messaging; a disconnect between expected and actual measurement units; and processing performance. In addition, the Microcosm systems engineers showed frustration at their inability to access and hence modify the COTS equipment’s software. Learning that COTS equipment must be handled as though it was a ‘black box’ was a useful lesson as textbooks typically allocate very few pages to discussing COTS equipment. 
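The units disconnect noted above, together with the 'wrapper' software concept discussed earlier, lends itself to a short illustration. The following is a minimal, hypothetical sketch of a defensive wrapper placed around a black-box COTS component; the sensor, its units and the plausibility limit are invented for illustration and are not taken from the Microcosm project.

# Minimal illustrative wrapper around a hypothetical COTS range sensor that
# reports in feet while the rest of the system works in metres.
# All names, values and the interface are invented for illustration only.

FEET_TO_METRES = 0.3048

class CotsRangeSensor:
    """Stand-in for a black-box COTS component that cannot be modified."""
    def read_range(self):
        return 10.0          # feet, by (hypothetical) vendor convention

class RangeSensorWrapper:
    """'Wrapper' software: converts and sanity-checks COTS output at the boundary."""
    def __init__(self, cots_sensor, max_plausible_m=100.0):
        self._sensor = cots_sensor
        self._max_m = max_plausible_m

    def read_range_metres(self):
        value_m = self._sensor.read_range() * FEET_TO_METRES
        if not (0.0 <= value_m <= self._max_m):
            raise ValueError(f"implausible range reading: {value_m:.2f} m")
        return value_m

if __name__ == "__main__":
    wrapper = RangeSensorWrapper(CotsRangeSensor())
    print(wrapper.read_range_metres())   # 3.048

The point of such a wrapper is that unit conversion and plausibility checking happen once, at the system boundary, rather than being assumed (and occasionally forgotten) throughout the rest of the system.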
However, industry too continues to learn from the use of COTS equipment. Colwell (2002) recounts that the loss of the Mars Climate Orbiter was actually due to a disconnect between units of measurement, and Lann (1997) argues that the loss of Ariane 5 flight 501 was due to the inappropriate reuse of Ariane 4 software on a different rocket supporting an extensively different mission.

ANALYSIS
The industrial PMP was rejected, in favour of a lighter-weight PMP, on the grounds of its inability to support the imposed schedule constraints of the project. Consequently the lighter-weight PMP did not address certain activities whose presence would have militated against the potential for rework. This potential was realised and some rework did occur. However, the lighter-weight PMP was evidently sufficient to complete the Microcosm Stage One project, and hence the possibility exists that the Microcosm systems engineers, who used the lighter-weight PMP, may not necessarily appreciate the value of the industrial PMP when scaling up for more complex projects.

CONCLUSIONS
We need to teach our fledgling systems engineers the systems engineering process, expose them to systems engineering procedures and methods, and give them access to relevant tools. This exposure should be sufficient for them to recognise a systems engineering problem and decompose it into the necessary activities, and for the systems engineering student to be confident enough to tailor the systems engineering process to suit the particular system development. However, specifically from the experiences gained during the Microcosm Stage One project, systems engineers need to be taught that there are inherent differences between systems engineering in-the-small and systems engineering in-the-large, and as a minimum these differences are:
- Systems engineering processes that are used for small system developments may not necessarily work for large system developments.
- Students need to practise effort estimation, with exposure over many projects.
- Students need to be given exposure to project management tools.
- Students should be taught how to manage subcontractors.
- Students need to be taught how to work in a team, document their work for the team, and use strong Configuration Management processes that ensure the team has access to project artefacts in a known state.
- Students need to be taught that they cannot expect to do every systems engineering task proficiently, that they should expect to specialise in some area of systems engineering, and that they should be supportive of the team that collectively provides the required systems engineering proficiencies.

Systems engineering students also need to be counselled that they need a full decade of education and employment working as a systems engineer before they can validly claim the title of "Systems Engineer".

REFERENCES
Ai, X.; and Zhang, Z. (2008) Study on Results-Oriented Systems Engineering (ROSE), International Seminar on Future Information Technology and Management Engineering ANSI/ITAA EIA-632 (2003) Processes for Engineering a System, Information Technology Association of America (GEIA Group) Asbjornsen, O. A.; and Hamann, R. J. (2000) Toward a Unified Systems Engineering Education, IEEE Transactions on Systems, Man and Cybernetics, part C: Applications and Reviews Boarder, J.
(1995) Systems Engineering as a Process, IEEE Systems Engineering for Profit Boyle, A.; and Kaldos, A. (1997) Using Robots as a Means of Integrating Manufacturing Systems Engineering Education, IEE Colloquium on Robotics and Education Brown, D. E.; and Scherer, W. T. (2000) A Comparison of Systems Engineering Programs in the United States, IEEE Transactions on Systems, Man and Cybernetics, part C: Applications and Reviews 58 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 Chase, W. P. (1966) System Design: Basic Realities and Common Myths, IEEE Transactions on Aerospace and Electronic Systems, vol: 2, no: 4 Concalves, D. (2008) Developing Systems Engineers, Portland International Conference on Management of Engineering & Technology Dahmann, J.; and Baldwin, K. (2008) Understanding the Current State of US Defense Systems of Systems and the Implications for Systems Engineering, IEEE International Systems Conference Do, Q.; Campbell, P.; Shoval, S.; Berryman, M. J.; Cook, S.; Mansell, T.; Relf, P. (2009) Use of the Microcosm Environment for Generating a Systems Engineering and Systems Integration Lessons Learnt Database, Improving Systems and Software Engineering Conference George, L. (2007) Engineering 100: An Introduction to Engineering Systems at the US Air Force Academy, IEEE International Conference on Systems of Systems Engineering Hellestrand, G. R. (1999) The Revolution in Systems Engineering, IEEE Spectrum, vol: 36, issue: 9 IEE (2004) http://www.engtrends.com/IEE/1004D.php (accessed: 15May09) IEEE (2000) Overview: What is Systems Engineering?, IEEE Aerospace & Electronic Systems Magazine, Jubilee issue IEEE Std 1220 (2005) IEEE Standard for Application and Management of the Systems Engineering Process, IEEE INCOSE (2009) Systems Engineering Scope Definition http://www.incose.org/practice/shatissystemseng.aspx (accessed: 15May09) ISO/IEC 15288 (2002) Systems Engineering – System Life Cycle Processes, International Standard Organisation Lee, D. M. (2007) Structured Decision Making with Interpretive Structural Modeling (ISM): Implementing the Core of Interactive Management, Sorach Inc. Canada, ISBN: 0-9684914-13 Lewkowicz, P. E. (1988) Effective Systems Engineering for Very Large Systems: An Overview of Systems Engineering Considerations, Digest of the Aerospace Applications Conference Lightfoot, R. S. (1996) Systems Engineering: The Application of Processes and Tool in the Development of Complex Information Technology Solutions, Proceedings of the International Conference on Engineering and Technology Management McQuay, W. K. (2005) Distributed Collaborative Environments for Systems Engineering, IEEE Aerospace and Electronic Systems Magazine 59 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 Mann, L. M. W.; and Radcliffe, D. F. (2003) Using a Tailored Systems Engineering Process Within Capstone Design Projects to Develop Program Outcomes in Students, 33rd Annual Frontiers in Education Mansell, T.; Cook, S.; Relf, P.; Campbell, P.; Do, Q.; Shoval, S.; and Ross, C. (2008) Microcosm – A Systems Engineering and Systems Integration Sandpit, Asia-Pacific Conference on Systems Engineering Meilich, A. (2005) Systems of Systems (SoS) Engineering & Architecture Challenges in a Net Centric Environment, IEEE/SMC International Conference on System of Systems Engineering MIL-STD-499C-Draft (2005) Systems Engineering, Department of Defense Mindock, J.; and Watney, G. 
(2008) Integrating System and Software Engineering Through Modeling, IEEE Aerospace Conference National Academics Press (2008) http://books.nap.edu/openbook.php?record_id=12065&page=54#p200140399960054001 (accessed: 15May09) Newman, I. (2001) Observations on Relationships between Initial Professional Education for Software Engineering and Systems Engineering – A Case Study, Proceedings of the 14th Conference on Software Engineering Education and Training Pope, R. L.; Jones, K. W.; Jenkins, L. C.; Ramsev, J.; and Burnham, S. (2006) History of Science and Technology Systems Engineering: The Histner, IEEE/AIAA 25th Digital Avionics Systems Conference Raytheon (2009) http://www.raytheon.com/capabilities/products/dcgs/ (accessed: 18May09) Rebovich, G. (2008) The Evolution of Systems Engineering, 2nd Annual IEEE Systems Conference Rickman, D. M. (2001) Model Based Process Deployment, 20th Conference on Digital Avionics Systems Shoval, S.; Hari, A.; Russel, S.; Mansell, T.; and Relf, P. (2008) Design of a Systems Engineering Laboratory Using a Scenario Matrix, Six Annual Conference on Systems Engineering Research Stevens, R. (2008) Profiling Complex Systems, 2nd Annual IEEE Systems Conference Thissen, W. A. H. (1997) Complexity in Systems Engineering: Issues for Curriculum Design, IEEE International Conference on Systems, Man and Cybernetics Walden, D. D. (2007) The Changing Role of the Systems Engineer in a System of Systems (SoS) Environment, 1st Annual IEEE Systems Conference 60 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 Watt, D.; and Willey, K. (2003) The Project Management – Systems Engineering Dichotomy, Engineering Management Conference, Managing Technological Driven Organizations: The Human Side of Innovation and Change 61 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 This page intentionally left blank 62 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 SOFTWARE ENGINEERING 63 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 This page intentionally left blank 64 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 REQUIREMENTS MODELLING OF BUSINESS WEB APPLICATIONS: CHALLENGES AND SOLUTIONS Abbass Ghanbary, Consensus Advantage, E-mail: [email protected] Julian Day, Consensus Advanatge Email: [email protected] ABSTRACT The success of web application development projects greatly depend upon the accurate capturing of the business requirements. This paper discusses the limitations of the current modelling techniques while capturing the business requirements in order to engineer a new software system. These limitations are identified by modelling the flow of information in the process of converting user requirements to a physical system. This paper also defines the factors that influence the change in business requirements. Those captured business requirements are then transferred into pictorial and visual illustrations in order to simplify the complex project. In this paper, the authors define the limitations of the current modelling techniques while communicating those business requirements with various stakeholders. The authors in this paper also review possible solutions for those limitations which will form the basis for a more systematic investigation in the future. 
KEYWORDS: Modelling, Tools, Business requirements, Policies, Analyst, Testing, Process, Quality 1. INTRODUCTION There have been significant advances in the modelling theory and modelling tools in practice within past few decades to provide clearer presentation of the required web system. The success of web application development projects greatly depends upon the accurate capturing of the requirements. The analysis of the business requirements and appropriate modelling (of those requirements) leads to the correct design of the new system. The analysis and design of the new system can be classified as one of the most complex human activities since the analyst and designer need to intellectually identify the requirements, cope with the complexity and develop a new system that can satisfy those elaborated requirements. Modelling helps us to understand the reality of the existing system applications and processes and create newer reality in order to develop new systems [8]. Business Analyst (BA) starts the process of gathering and modelling the requirements of the system. The understanding and documentation of the BA is extended iteratively and incrementally into a solution level design [9]. The modelling plays an important part in the System Development Life Cycle (SDLC). Every model is an abstraction of reality [7]. The modelling tools present the business requirements as a pictorial illustration that can be visually reviewed. The generated model should enable the people (user, business (client) as well as the development team (Analyst, Designer and Programmer)) to identify the problem, propose solution, recognize the behavior of the system and plan how to implement the proposed solution. 65 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 The introduction of an information system to an organisation results in changes to their business processes, which in turn causes changes to the implemented information system [3]. At the same time there are numerous concerns due to the limitation of the current modelling techniques since various stakeholders can translate and understand these pictorial and visual illustrations in various ways. In current modelling techniques, a minor change in business requirements can also have a massive impact on the project specifically if the analysis and design of the system are in the parallel mode and happen simultaneously. These issues have massive impact on the quality of the developed system. This paper discusses the factors that impact on business requirements and identify the limitations of the current modelling techniques and provides the solution for those limitations. 2. COMMUNICATING BUSINESS REQUIREMENTS Project planning facilitates and supports the development of web applications based on corresponding business requirements and changes to those business requirements. In order to assure the quality in SDLC, the people (internal and external) with various sociocultural backgrounds must have a similar understanding of the business requirements and modelling techniques. This is so because almost always the requirements emerge as the most important factor from the user’s viewpoint in terms of their perception of quality. Requirements also play a crucial role in scoping, planning and executing the project. The communication of business requirements has two main aspects. 
The first aspect is the information loss during the communication and the additional changing factors (internal and external) to original business requirements. Figure 1 present the flow of business requirements and demonstrate the line of communication for the business requirements. The business requirement can get lost or corrupted during this transformation. Aspect 1- Information loss: The Flow of Business Requirements in SDLC System Users Business Analyst, Designer, Developer and Test Team Business (Client) Time and Budget Developed System Figure 1. The Flow of Business Requirements in SDLC 66 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 2.1. System Users The user in computing context is classified as people who use a computer system in order to perform their daily activities to complete a business process. The user of the technology adds value to the business (client) by improving efficiency, reducing costs, increasing revenue, and creating stronger relationships with organisation and their clients. The user has a major impact on the business requirements in SDLC as they are end users of the system under the development. The user can also be classified as the trigger for SDLC because these people identify the shortcomings of the current business process. The user has impact on SDLC before and after the product is developed. 2.2. Business (Client) Business (client) as a structured and organized entity identifies the need for the new system to increase quality and profit. In SDLC, business evaluates the need for the new system by assessing the cost and time involved while determining the benefits such as improving the profit and/or service. The business is in direct contact with the user in order to determine the requirements and dictating them to business analyst. The business must have clear understanding of the requirements and transfer them correctly in order to achieve the maximum quality in SDLC. 2.3. Time and Budget Time and budget plays a major role in SDLC specifically in the business requirements. The time and budget change the scope of the project which impacts on the scope of the business requirements by moving the less important requirements to the future phases of the project. 2.4. Business Analyst, Designer, Developer and Tester Team Business analyst, designer, developer and tester team alongside of the business (client) must ensure that they have the same level of understanding in all stages of the SDLC. This quality assurance currently is achievable by running numerous workshops based on the modelling document to make sure all the business requirements are captured and the involved people are in the same level of understanding. These numerous workshops are required because various people involved in the project are coming from different background and the modelled requirements (UML, EPC, OPEN…..) do not make sense to everyone. The consultants or business analyst must translate the documents to the business as well as the developers to make sure everybody has the same understanding of the created document. 2.5 Developed System The developed system will go under intense testing for the quality assurance by the qualified system tester, business and the users before entering the next cycle if required. 
67 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 However, there are so many other issues that influence the quality of the project that complicates the situation in SDLC such as rapid change of the business requirements. Aspect 2- Additional factors changing the business requirements Figure 2 presents additional factors that impact on business requirements mentioned earlier and may lead to changes in these requirements. The organizational and Government policies, roles and regulations might be known to the business (Client) as well as the development team. The development team as well as the business analyst must be aware of limitations of the technology and must have a good knowledge of the enterprise architecture (architecture of the system). The business analyst is also responsible to inform the business (client) if the requirements alter due to these issues. The architecture of the system also has a big impact on performance of the system. In many cases, the business requirements can be delivered but the existing architecture of the enterprise is unable to cope with the load and as a result the system might crash. The Dynamic Nature of the Business Requirements in SDLC Organisation's Government's Policies, Roles and Policies, Roles and Regulations Regulations Enterprise Architecture System Users Business Analyst, Designer, Developer and Test Team Business (Client) Limitation of Technology Time and Budget Developed System Figure 2. Other Factors Impacting the Business Requirements in System Development Life Cycle 2.6. Government’s Policies, Roles and Regulations The legal issues also play an important role in business requirements and it is the responsibility of the business as well as the business analyst to identify those Government policies. As an example, if company A wants to develop a system to sell a product to the 68 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 employee of the company B while the funds are transferred directly through the salary office of the company B, the company A must ensure that the taxation office approves such a transfer of funds. 2.7. Organisation’s Policies, Roles and Regulations The internal roles and policies of the organisations are important factors in changing the business requirements. Referring back to the previous example, what happens to the system under development, if the company A has the policy of not having the right to keep the details of the company B in their database? The business, business analyst and designer must identify these issues to either find a solution or change the business requirements. 2.8. System Architecture of the Organisation Every individual organisation has their own system architecture designed based on the level of their non-functionality requirements such as performance, security, implementation, maintenance and reusability. In relation to our previous example, if the company A must reveal some data of the company B employee to the finance department while these data is not stored in the Company A’s database due to the organisational policy. The business requirements have to be changed hence the organisation’s role has not allowed the database of the employee’s in company B to be registered inside the architecture of the Company A system. The developed system must access the database (Holding the details of Company B’s employee) outside of the firewall and forward it to the finance company. 
If the Company B is not allowing the data to be stored outside of the firewall then the business requirements has to be changed. 2.9. Limitation of Technology The limitation of technology also has a major impact on the business requirements. The business might have requirements while the technological capability does not exist. The research and development contributing and identifying these limitations and provide the solution based on these business requirements. The request of a system users or the pressure on the organisation initiates a request for a new system or the changes to an existing system. The user might identify the limitation of the existing system (which could be classified as a need for any system) and at the same time organisation might identify the need to change or create a new system based on the existing weakness (internal to the organisation) and threats (external to the organisation). The modelling must ensure that user requests are considered, the specific problem is identified, the purpose and the objectives of the new system is clearly demonstrated and the constraints in providing the solution to the business requirements have been registered. The following section discusses the limitation of the current modelling techniques. 3. LIMITATION OF THE CURRENT MODELLING TECHNIQUES Business requirements are modelled using the formalisms (such as UML activity, class, use case implementation, communication and state diagrams) these definitions 69 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 can rarely be understood by key business stakeholders or even business analysts [1]. The current software development techniques such as Rational Unified Process (RUP), Object-oriented Process, Environment and Notation (OPEN), Process Mentor and SAGE-II play a crucial role in the success of web development projects. However the following limitations were identified based on the existing literature and authors experience working on the various projects [11] [12] [13] [14]. A new modelling tool is needed for representation of requirements that is understood by end users as well as system analyst and developers considering that UML does not meet the capability and interest of end users while project lacks from incomplete requirements validation [4]. This tool must be reliable, flexible, modifiable, reusable and must be capable of transferring and communicating the complexity that is understandable by every one. The observation and investigation of the current modelling techniques in practice revealed that the current modeling techniques are unable to fully capture the business requirements as well as communicating it to the developers and the client. These limitations of the current modelling tools consist of the followings: 1. The current modelling tools are unable to present the overall view of the required system. The current modelling techniques do not provide a diagram to picture the overall requirement of the system. The current existing diagrams demonstrate the piece or one business process at a time considering that a system is supporting multiple business processes. A minor change in business requirements might have a big impact on architecture of the information changing all the created design. In the current modelling tools, the analyst and designer can not easily trace and evaluate the impact of any change on the remaining structure of the system. 2. 
The purpose for some of modelling tools are not explained as an example how an activity diagram helps the developer to create a system. The current modelling diagrams are providing a pictorial illustration of an individual task or a business process. The literature does not define how this pictorial illustration will lead to coding a program. The diagrams facilitate to breakdown the problem to smaller pieces and visually present them in order to just communicate the business requirements. The literature does not identify how these diagram leads and facilitates the developers to actually code a system. 3. There is no testing on current modelling to understand whether the analyst has captured the right functionalities. The testing only takes place after the system is developed. In the current SDLC, the analyst and designer capture the business requirements by reading the available documentation and through intense discussion by business and 70 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 the system users. The result of the discussion and reading the documentation are modelled in various ways that is not understandable by the business (client). The business (client) must sign a document that is not easily understandable since they are unfamiliar with the concept. The client realises the impact of the signed document when the actual system is developed. There should be a mechanism in place that business can run a test (on prototype) to make sure that the analyst and designers have captured the correct requirements. 4. Lack of proper standardisation leading to numerous workshops in order to understand the functionalities. In the current environment, there is not a unique way of capturing the business requirements since the understanding of the people are different. These limitations of the standards lead to the workshops and meetings to bring all the various stakeholders in to the same level of understanding. The details of the created documents in many ways can be translated only by the business analyst or the person responsible for creating the documentation. 5. Lack of understanding in which tool belong to which space. For example, the activity diagram is part of problem or solution space. There is a big ambiguity in understanding what tools belong to what phase of the development life cycle. In other words, while capturing the business requirements should the analyst just concentrate on analysis or at the same time they should identify the solution. 6. Lack of support in defining the non-functionality requirements. There is a big misunderstanding in order to identify when capturing the functionality requirements finishes and the non-functionality requirements starts. Currently, there is no place to define the non-functionality requirements in the use cases. The nonfunctionality requirements should be added as an additional row in the use case description. 7. Lack of understanding on how these modelling techniques lead to coding. The current modelling techniques facilitate to breakdown the problem into smaller pieces while it is not very clear how it will lead to coding. The only available diagram that actually leads to coding is the class diagram. The class diagram depending on the maturity level can be divided in to Analysis class diagram and Design class diagram. Identifying the classes and the relationship of those classes can be classified as the Analysis class diagram. 
The completion of the attributes and methods transfer the Analysis class diagram to the Design class diagram. The design class diagram should clearly demonstrate the expected behaviour of the system and remaining UML diagrams must show the behaviour of the system in various stages 71 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 of the web system development while the system is actually coded and it is developed. 8. The current modelling techniques do not prioritise the processes. The prioritisation of the business processes specifically while re-engineering is happening can be classified as the most crucial factor for the “success” and “Failure” of the project. Majority of the re-engineering studies and projects mainly deal with those core business processes that actually generate profit. The current modelling techniques do not support these prioritisations. There should be a mechanism in place to support ranking of the business processes and classify their significance impact on development of the web system. 9. Complexity of the process complicates the situation by producing a confusing model. The current modelling diagrams become more complicated as the project becomes more complex. This creates more confusion for the involved parties (Business, user, analysts, designers and developers). The complexity increases when the tools such as Visio or Rational Rose can not support the length and complexity of the document and have to move to the next page. 10. There are no clear definitions as to which tool is better for which specific project There should be a unique way to understand which of the current modelling techniques are better to address the specific project based on the complexity and related tasks and activities. Modelling should enable decision makers to filter out the irrelevant complexities of the real world, so the effort can be directed toward the most important parts of the system under study [2]. The above explanation identifies that while various modelling tools have been used in the industry the maximum benefits have never been achieved due to the explained limitations. The following section provides some solutions for these current shortcomings of the modelling techniques as far as the capturing of the business requirements are concerned. 4. ENHANCEMENT IN CAPTURING BUSINESS REQUIREMENTS There have been numerous studies by [5] [6] [10] proposing various construction processes to overcome problems while capturing the business requirements. The correct capturing of the business requirements facilitate to make the correct decision. In order to minimize the errors while capturing the business requirements and provide all stakeholders (with various backgrounds) with the similar 72 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 understanding of the captured requirements, the we have reviewed the following possibilities. 1. Create a prototype. The pictorial modelling could be reduced if a prototype is created based on the textual capture of the business requirements. The prototype could be enhanced as more information is achieved. The business can see the sample of the product while the developers can easily improve on the existing prototype to create the new system. This solution is currently being used in agile software development environment. In case of the complex system, the various prototypes can be created for the sub-systems. 
The creation of the screen design allows the business to understand all the needed attributes, commands and screen format are correct while they can also evaluate the possible performance of the system. 2. Identify when, and how information get lost or corrupted. There should be a mechanism in place to identify how the information are getting lost, misplaced or corrupted when they are transferred from user, business to the development. The loss of information or developing a system based on the corrupted information result in a system that is unable to perform the desired task which eventually caused the failure of the project. 3. Look at implemented mistakes (communication modes: text, hypertext, images, animation, video and audio). Recording the minutes, communications and meeting allows us to make sure the right requirements have been captured. These requirements can then be translated in the form of text, hypertext, images, animation, videos and audios to make sure the correct requirements have been captured before moving to the design phase of the new system. 4. Trust within involved parties. Trust also plays an important part amongst human interaction. Our life is based upon our trust otherwise is almost impossible to survive in the society. Similar pattern and procedure applies when the requirements are transferred. It is very important for all parties to trust each other otherwise is almost impossible to develop a new system capable of performing its desired task. 5. Mixture of existing models. The mixture of the existing modelling tools such as UML, OPEN, Process Mentor and Business Process Management Notation (BPMN) might be able to cover the limitation of the current individual modelling tools. Additional research and 73 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 investigation is required to understand whether this combination could address the shortcomings of the current modelling techniques. 6. Partitioning to sub problems. Breaking down the problems and partitioning them into various sections allows the modeller to have a better control over them. This technique might ensure that the information is not lost, misplaced or corrupted. This fact can be achieved by dividing the functional requirements in to various parts and create a dependency document for further traceability. 7. Identification of the similar patterns required in every system (User access level, …). The business analyst should identify the similar pattern required in every system and manage to use them rather than creating those requirements from scratch. For example, if a system needs a private webpage, the analyst should be able to use Access Control functionality and modelling. The modelling should be capable for providing the stakeholders with traceability in order to distinguish the functional dependency. 5. CONCLUSIONS The importance of the modelling was described in this paper followed by the important aspects of the System Development Life Cycle (SDLC) as far as the capturing of the business requirements are concerned. The limitation of the current modelling techniques while presenting those captured requirements was identified. As a result of those identified shortcomings of the modelling tools, some possibilities were reviewed. However, further research is in progress to construct and re-construct the new modelling tools in real projects within the various industries. 
The future research area is classified as followings: 1: Evaluate of various modelling techniques in real projects (UML, OPEN, Process Mentor....) in order to demonstrate how they link to the proposed limitations. 2: Individually test each limitation against each modelling techniques. 3: Based on each identified limitations (on each modelling techniques) propose and test the possible solutions in order to identify the need and develop new modelling technique. 4: Identifying the cost and feasibility of the solution. 5: Problem-specific decisions related to solutions. 6: Required enhancement and modifications acceptance. 7: How modelling could take place in terms of the information flow. 74 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 REFERENCES [1] Dubray, J.J. (2007). “ Composite Software Construction). USA. C4Media Incorporation. ISBN: 978-1-4357-0266-0 [2] Giaglis, G. M., A “Taxonomy of Business Process Modelling and Information Systems Modelling Techniques”. The international Journal of Flexible Manufacturing Systems, 13 (2001): 209-228. © Kulwar Academic publishers, Boston. [3] Ginige, A. “Aspect Based Conceptual Modelling of Web Applications”. Proceedings of the 2nd International United Information Systems Conference UNISCON 2008, Klagenfurt, Austria, April 2008. pp.123-134 [4] Kop, C. & Mayer, H. C., “Mapping Functional Requirements: from Natural Language to Conceptual Schemata”. Proceedings of 6th IASTED International Conference Software Engineering and Applications. Cambridge, USA. Nov 4-6, (2002),.PP. 82-88. [5] Leite, J.C.S. & Hadad, G.D.S. & Doorn, J.H. & Kaplan, G.N., “A Scenario Construction Process”, Journal of Requirements Egineering, Vol. 5 No. 1, 2000, Springer Verlag, , pp. 38–61. [6] Rolland, C. & Achour, C. B., Guiding the Construction of Textual Use Case Specifications, Data & Knowledge Engineering Journal, Vol. 25 No 1-2, 1998, North Holland Elsevier Science Publ., pp. 125–160. [7] Teague, L. C. & Pidgeon, C. W. (1991). “ Structured Analysis Methods for Computer Information Systems”. Macmillian Publishing Compnay. ISBN: 0-02-946559-1 [8] Unhelkar, B. (2005). “Practical Object Oriented Analysis”. Thomson Social Science Press. ISBN: 0-17-012298-0.pp: 15 [9] Unhelkar, B. (2005). “Practical Object Oriented Design”. Thomson Social Science Press. ISBN: 0-17-012299-9.pp: 3 [10] Zhu, H. & Jin, L., “Scenario Analysis in an Automated Tool for Requirements Engineering”, Journal of Requirements Engineering, Vol. 5 No. 1, 2000, Springer Verlag, pp. 2 – 22. [11] http://www-306.ibm.com/software/rational/offerings/ppm/. Downloaded: 2/07/2008 [12] http://www.processmentor.com/Architecture/Default.aspx. . Downloaded: 2/07/2008 [13] http://www.dialog.com.au/content/view/28/45/. Downloaded: 2/07/2008 [14] http://www.open.org.au/. Downloaded: 2/07/2008 75 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 This page intentionally left blank 76 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 DESIGN UNCERTAINTY THEORY - Evaluating Software System Architecture Completeness by Evaluating the Speed of Decision Making Trevor Harrison1, Prof. Peter Campbell1, Prof. Stephen Cook1, Dr. Thong Nguyen2 1 Defence and Systems Institute, 2Defence Science and Technology Organisation Contact: [email protected] ABSTRACT There are two common approaches to software architecture evaluation [Spinellis09, p.19]. 
The first class of evaluation methods determines properties of the architecture, often by modelling or simulation of one or more aspects of the system. The second, and broadest, class of evaluation methods is based on questioning the architects to assess the architecture. This research paper details a third, more fine-grained approach to evaluation by assuming an architecture emanates from a large set of design and design-related decisions. Evaluating an architecture by evaluating decision making and decision rationale is not new (see Section 3). The novel approach here is to base an evaluation largely on the time dimensions of decision making. These time dimensions are (1) time allowed for architecting, and (2) speed of architecting. It is proposed that progress of architecture can be measured at any point in time. For example: “Is this project on track during the concept development stage of a system life cycle?” The answer can come from knowing how many decisions should be expected to be finalised at a particular moment in time, taking into account a plethora of human factors affecting the prevailing decision-making environment. Though aimed at ongoing evaluations of large military software architectures, the literature review for this research will examine architectural decisions from the disciplines of systems engineering, information technology, product management and enterprise architecture. 1 INTRODUCTION The acceptance of software architecture as resulting from a set of design decisions is now relatively well established. Worldwide, the efforts of six separate communities or researchers have resulted in proposed updates to an IEEE standard and complementary tools to capture software architecture decision rationale. A literature review has revealed two blind spots though. Almost zero references to, and no appreciation of impact from those individual or team factors and their effects on decision making identified by the psychology and sociology sciences. (An assumption is made here that humans are making all architectural decisions.) Another blind spot is the absence of alternative decision-making philosophies such as heuristic-based architecting found in systems engineering, which recognizes architecting as an eclectic mix of rationale and naturalistic decision-making methods. This research aims to 77 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 quantify the uncertainty surrounding performance of the decision-making process(es) by modelling the range/distribution of possible legitimate times taken to finalise different types of architectural decisions. This will qualify the uncertainty surrounding processes with inherent variation, and have the capability to quantify a reduction in uncertainty. Decisions, decision relationships, decision-making strategies, and factors affecting speed of decision making will be modelled using agent based modelling and simulation. This choice of modelling allows exploration of the decision interactions and their time dimension sensitivities. The complex system under study is thus one of architecting. (The authors want to avoid a “systems thinking disability” [Senge06, p.51] by pulling apart a small piece of a complex decision-making system only to discover this has destroyed the very system under study.) 
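To make this intent concrete for the reader, the sketch below is a deliberately simplified Monte Carlo stand-in for the agent-based model the authors propose: each hypothetical project finalises a fixed number of decisions whose individual times vary randomly, and the ensemble of projects yields a distribution of completion times. The decision count and the lognormal parameters are invented purely for illustration.

```python
import random
import statistics


def project_completion_time(n_decisions: int = 200, mu: float = 0.4, sigma: float = 0.6) -> float:
    """Total time (arbitrary units) to finalise all decisions of one hypothetical project.
    Each decision time is lognormal: always positive, with a long right tail that
    mimics the inherent variation in decision-making speed."""
    return sum(random.lognormvariate(mu, sigma) for _ in range(n_decisions))


# An ensemble of simulated project instances; the spread of values is the kind of
# distribution of completion times sketched in Figure 8 later in the paper.
times = sorted(project_completion_time() for _ in range(10_000))
print("median completion time :", round(statistics.median(times), 1))
print("90th percentile        :", round(times[int(0.9 * len(times))], 1))
```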
The remainder of the paper is structured as follows: Section 2 covers the two time dimensions of decision making (timing and time period), Sections 3 and 4 cover architectural decisions in general, Sections 5 and 6 cover speed of decision making, and Section 7 covers the modelling of everything in Sections 2 through 6. Finally, a challenge to conventional research methods is presented in Section 8.

2 SIGNIFICANCE OF TIME AND TIMING

Timing refers to the time window within which it is most advantageous to make a decision. Time period refers to the optimal amount of time to spend on decision making. Both time dimensions have lower and upper bounds, and decision making outside these bounds incurs different types of penalties.

2.1 The Importance of Optimal Timing of Decisions – Avoiding Re-work

Even with a limited set of critical or high-priority decisions, the order of decisions can change the architecture [Rechtin00, p.130]; that is, an inappropriate order could over-constrain later decisions. At first glance, this may appear to speed up decision making by reducing the choices available later on. However, schedule overruns in other engineering activities will then occur to compensate for architecture deficiencies. The early detection of decision making that is happening too fast is closely related to the estimation of the time to spend on architecting activities.

2.2 The Importance of Optimal Timing of Decisions – Cost of Gathering Information

The quality of decision making often depends on the quality of the information available to make decisions. Utility graphs in [Williams09, Fig 2.4] show a cut-off point at which the cost of collecting information exceeds the benefit of better decision outcomes. Such graphs quantify a cut-off time for decisions on the most effective solution to a problem and/or the choice of concept.

2.3 Optimal Time to Expend on Software Architecture

Regression analysis to calibrate the Architecture and Risk Resolution (RESL) scale factor of the COCOMO II estimation model [Boehm03, p.221] confirmed the hypothesis that proceeding into software development with inadequate architecture and risk resolution causes project effort to increase, and that the percentage increase in rework effort is larger for large projects. Corresponding "sweet spots" have been identified for the architecture investment that minimises delay, shown in Figure 1. Expending either less or more time than these "sweet spots" results in rework (and thus schedule delay) to address deficiencies in a software architecture. NOTE: the larger the software system, the more time and effort should be expended on architectural design.

Figure 1: Minimum Effort to Achieve Least Rework Due to Architectural Deficiencies [Boehm03]

While useful for forecasting, the RESL factor is ineffective as a work-in-progress tracking measure during the early days of concept definition in major defence projects; architectural decisions are made at a time when very little, if any, code has been written. (In the next section, the authors propose to replace forecasting of thousands of lines of code with forecasting the finalisation of hundreds of design and design-related decisions.)

2.4 Decision Time Lines from Product Management

"Time line" refers to the period of time for which a decision outcome/choice is valid. Typically, for product management, this is the same as the "shelf life" of an individual technology component.
For example, for the choice of a personal computer for office use, the decision time line is approximately three years before the decision needs to be re-visited. Under Government contracts, protracted or belated decision-making will often shorten a decision’s time line (because a requirements specification stays fixed). 3 ARCHITECTURE GENRES AND THEIR RECOGNITION OF DECISIONS 79 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 This section reviews three different types of architecture; enterprise architecture, systems architecture and software architecture. The key message here is that there are many nontechnical, design-related decisions, which set the context for architectural design. 3.1 Design Decisions in Enterprise Architecture Drafting of “IEEE (draft) Std P1694 – Enterprise Strategic Decision Management (ESDM)” is underway to define a standard framework for the enterprise-level management of strategic decisions. Many strategic decisions set the context for modelling efforts such as architectural design. The top half of Figure 2 shows a Decision Network template covering business/strategy decisions, while the bottom half of Figure 2 covers Platform Architecture Management (PAM) decisions. A Decision Network provides a "50,000 foot" view of the decision-making process. It serves to focus resources on the most critical decisions and provides a higher level method of communication concerning a decision-making situation. A Decision Network provides a set of decisions which serves as a high level decision roadmap for the entire project. The roadmap provides an analysis plan that helps avoid the common pitfalls of "diving into detail" and "analysis paralysis". [Fitch99, p.41]. Figure2:DecisionNetworkTemplatefromIEEEStd(draft)P1694whereeachbox/bulletisamajordecision 3.2 Design Decisions in Software Engineering Kevin Sullivan was the first to claim that many software design decisions amount to decisions about capital investment under uncertainty and irreversibility [Sullivan96, p.15]. (Uncertainty about future requirements and system-wide qualities in say 10 to 20 years time.) Design decisions are like “call options”. To bind and implement a design decision is to exercise an option – to invest in a software asset such as architectural design [Sullivan96, p.16]. Thus a software architecture can also be viewed as portfolio of options. (There was 80 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 earlier pre-1996 research into decisions for software design, but, there was no recognition of the time dimension i.e. appropriate timing of decisions.) 3.3 Design Decisions in Systems Engineering An INCOSE 1996 conference paper [Novorita96] is one example of elevating decisions to be an equally important artefact as requirements and design models. Novorita’s paper details an information model underlying a systems development process. The intent of such an information model is to improve the speed of communication between marketing and engineering teams. Without such an information model, design information is inevitably found scattered and duplicated across numerous documents, spreadsheets, PowerPoint slides and support systems databases such as product help desks. The consequences of this are, for example, pieces of design information that have no individual owner and no relationship meta-data existing between the pieces of design information. 
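To illustrate the remedy that such an information model offers for this scattering, the following minimal sketch (not Novorita's actual schema) holds decisions alongside requirements, each with an explicit owner and explicit relationship links; the entity names and example records are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Requirement:
    req_id: str
    text: str


@dataclass
class Decision:
    """A design decision treated as a first-class artefact, with an owner and
    explicit links to the requirements or tasks it affects."""
    decision_id: str
    statement: str
    owner: str
    relates_to: list[str] = field(default_factory=list)


model = {
    "requirements": [Requirement("R-12", "The system shall export reports as PDF")],
    "decisions": [
        Decision("D-03", "Use an off-the-shelf PDF library", owner="lead.architect",
                 relates_to=["R-12"]),
    ],
}

# With owners and relationship metadata recorded, it becomes trivial to list
# decisions that would otherwise sit anonymously in slide decks or spreadsheets.
orphans = [d.decision_id for d in model["decisions"] if not d.relates_to]
print("decisions with no linked requirement or task:", orphans)   # []
```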
Figure 3 shows decisions as a major part of an information model, to bring all design information into one place. Decisions Risk Mgt Models Req’s Tasks Plans Documents Figure3EssentialDatatoSupportRequirements– includesDecisions[Novorita96,Fig.3] 4 INTERCONNECTEDNESS AND INTERACTIONS AMONGST DECISIONS “A phenomenon, sometimes acknowledged but rarely explicitly investigated is that decisions interact with one another. “ [Langley95, p.270] Previous section(s) took a static view of decisions. This section looks at the dynamics of decision-to-decision relationships. Different types of relationships have different effects on time, such as appearing to “speed up time” or “freeze time”. This is the first sign of complexity. The disciplines of Product Management and Enterprise Architecture have traditionally revealed strong connections between non-design decisions and architectural decisions. Architectural decisions for any product are closely linked with decisions about marketing strategy, manufacturing capabilities and product development management [Ulrich99, p.142]. 81 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 A more fine grained view of relationships amongst software architecture decisions themselves is contained in an ontology by [Kruchten04], shown in Table 1. The names of these relationships imply time dimension-related effects; time can either be speeded up, or, time can be frozen. Table1: Relationships Between Architectural Design Decisions [Kruchten04, pp. 4-6] Type Example/Explanation constrains “must use J2EE” constrains “use JBoss” forbids a decision prevents another decision being made enables “use Java” enables “use J2EE” subsumes “all subsystems are coded in Java” subsumes “subsystem X is coded in Java” conflicts with “must use J2EE” conflicts with “must use .Net” overrides “the communication subsystem will be coded in C++” overrides “the whole systems is developed in Java” comprises (is made of) this is stronger than ‘constrains’ an alternative to A and B are similar design decisions, addressing the same issue, but proposing different choices is bound to decision A constrains decision B and B constrains A is related to mostly for documentation and illustration reasons dependencies decision A depends on B if B constrains A relationship to external artefact “traces from” and “does not comply with” Further research by [Lee08] has attempted to visualise the relationships in Table1. 4.1 Classification of Decision Relationships There is apparently been no further research since the claim by Ann Langley et al that no comprehensive theory of decision interconnectedness exists [Langley95, p.270]. Though not attempting to develop such a theory, the journal article [Langley95] has attempted to work relationships into a typology shown in Table2. 82 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 Table2:TypesofLinkageBetweenDecisions[Langley95,pp.271273] Sequential Linkages Lateral Linkages Concerning the same basic issue at different points in time. Dealing with a major issue involves sub decisions. Links between different issues being considered concurrently. Concurrent decisions are linked as they share resources. 
Nesting Linkage Snowballing Linkage Recurrence Linkage Pooled Linkage Contextual Linkage Enabling linkage Evoking linkage Precursive Linkages Cutting across different issues and different times, as decisions taken on one issue affect subsequent decisions on other issues within the same organization. Pre-empting linkage Cascading linkage Merging linkage Learning linkage 5 SPEED OF ARCHITECTING (SPEED OF DECISION MAKING) “Truly successful decision making relies on a balance between deliberate and instinctive thinking.” [Gladwell05, p.141] This section presents another time dimension of decision making; speed. Similar to lower and upper bounds for both time period and timing of decisions, speed of decision making also has lower and upper limits. These limits vary from individual to individual human decision maker. 5.1 Speed of Architecting viewed as Short Cuts The word ‘heuristic’ means to discover. It is used in psychology to describe a method (often a short cut) that people use to try to solve problems under extremes of complexity, time pressure and lack of available information [Furnham08, p.116]. Heuristics applicable to systems architecting are well documented [Maier09]. The field of cognitive systems engineering [Rasmussen94] demonstrates that a mix of decision-making styles is valid. The “Decision Ladder” in Figure 4 represents the set of generic subtasks involved in decision making. States of knowledge are arranged in a normative, rational sequence. There are three legitimate points in the decision-making process to expedite certain steps based on heuristics. Heuristic shortcut connections are typical of a natural, not formalised, decision-making style. Taking heuristic short cuts are heavily dependent on the individual architect’s knowledge, experience and perspective [Curtis06]. 83 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 Figure4DecisionLadderwithHeuristicbasedShortcuts[Rasmussen94,p.65] The more popular short-cuts utilised within software engineering are patterns and pattern language. A pattern is a specific form of prescriptive heuristic. When a number of patterns in the same domain are collected together, they can form a pattern language. The idea of a pattern language is that it can be used as a tool for synthesizing fragments of a complete solution [Maier09, p.41]. 5.2 Speed of Architecting viewed as State Transitions The state transition chart in Figure 5 clearly has potential for decision-making loops (and therefore increasing uncertainty about achieving an optimal amount of time to invest in architectural design) before a particular decision is accepted. A visualising framework called ‘Profuse’ has been used to visualise these particular state transitions over a time period (known as “decision chronology”). The example in the right hand side of Figure 5 shows three decision creation or activity sessions over a two-week interval. The state of the decisions is denoted by the shape: Diamonds are ‘Idea’, circles are ‘Tentative’; and squares are ‘Decided’. 84 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 Figure5–DecisionChronologyofStateTransitions Combining Decisions and Factors Affecting Decision-Making Other loops (iterations) are seen in development process models e.g. “requirements loop” and “design loop” in systems engineering [Young01, p.133] e.g. “triple peaks” in software engineering [Rozanski05, p.75]. 
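A rough sketch of how the typed relationships in Table 1 and the decision states behind Figure 5 might be recorded together is given below; it is not drawn from Kruchten's tooling or from the Profuse framework, and all identifiers are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date

STATES = ("idea", "tentative", "decided")        # the states plotted in Figure 5


@dataclass
class Decision:
    decision_id: str
    statement: str
    state: str = "idea"
    history: list[tuple[str, date]] = field(default_factory=list)

    def move_to(self, new_state: str, when: date) -> None:
        """Record a state transition so a decision chronology can later be drawn."""
        assert new_state in STATES
        self.history.append((new_state, when))
        self.state = new_state


# Typed decision-to-decision relationships, using a small subset of Table 1.
relationships = [
    ("D-01", "constrains", "D-02"),        # "must use J2EE" constrains "use JBoss"
    ("D-03", "conflicts_with", "D-01"),    # "must use .Net" conflicts with "must use J2EE"
]

d = Decision("D-01", "must use J2EE")
d.move_to("tentative", date(2009, 3, 2))
d.move_to("decided", date(2009, 3, 16))
print(d.state, d.history)
```

A loop back to an earlier state (a 'decided' decision re-opened as 'tentative') is what makes the time to finalise a set of decisions uncertain; that time is the quantity the later sections set out to model.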
5.3 Speed of Software Architecting viewed as Hierarchy of Impacts Florentz & Huhn discuss “three levels of architectural decisions” for embedded automotive software. These levels are based on the point in time at which decisions are made [Florentz07, p.45]. The three layers are (1) top-level decisions, (2) high-level decisions, and (3) low-level decisions. Top level decisions vary the least1. High level decisions are made early on. Low level decisions, being the closest to system realization, vary the most i.e. provide the basis of architecture variants. Both predictability and known impact increase when moving from top-level decisions to low-level decisions. 5.4 Speed of Architecting viewed as Holistic Thinking Research by Microsoft Federal Systems Consulting [Curtis06] found a wide variation in the time to complete an architecture of IT Systems (data centres, servers, LANs and WANs). Even in cases where similar IT systems were being designed by architects with the same levels of knowledge and the same years of experience, time to achieve customer sign off for an architecture varied from three months to 18 months. Investigations revealed the speed of architecting was determined by an architect’s amount of “perspective”. ‘Perspective’ is the ability to consider impacts of any design decision upon other parts or processes of a business. The outcome of the Microsoft research has been the Perspective Based Architecture method. The PBA method is a question-based approach to architecting consisting of 46 questions. Similar to a Decision Network (e.g. Figure2), the focus of the PBA method is to help guide 1 Ahierarchybasedonprocessingratesoccurwithinanyecosystem[O’Neill86].Forexample,growthofa forestoccursoverdecades,leafgrowthovermonths,whilephotosynthesisoccursdaily. 85 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 designers and product planners in how to consider non-design decisions which are critical to the success of implementing any architecture. 5.5 The Speediest Decision Making – Unconscious Thinking “Thin slicing” is one explanation for split second decision-making; those moments when we know something without knowing why [Gladwell05]. For any architectural decision-making, there will be a certain amount of intangible design decisions. Some architectural design is made without an associated issue or decision; examples include "It worked for me last time", "first thing I tried at random", “taboo-based decisions”, or just insights for which no connection can be identified. 6 ENVIRONMENTAL FACTORS AFFECTING SPEED OF DECISION MAKING Identified earlier in section 4, individual architectural decision attributes (e.g. ‘priority’) and decision-to-decision relationships (e.g. ‘forbids’) are some of the factors that either constrain or accelerate the speed of decision making. Many more speed-affecting factors can be found amongst the sociology and psychology literature. 6.1 Factors of the Project Environment All software and system architectural design effort is carried out in a project environment, mostly within an organisation. In organisations, bias is manifest as a culturally engrained behaviour that constrains decisions and actions to a fixed and limited set of perceptions and responses [Whittingham09, p.2]. Figure6 highlights the influence of culture and leadership on decision-making processes of a team. 
Figure6AProjectEnvironment’sImpactonDecisionProcesses[Shore08,p.6] Decisions concerning selection of an engineering solution may be significantly influenced by biases – this factor has very little to do with the mechanics of an engineering solution. 6.2 Human Behaviour Factors Prospect theory from psychology [Furnham08, p.127] explains both why we act when we shouldn’t (things that should not have been done, but never the less were done), and why we 86 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 don’t act when we should (things that should have been done, but were not). The former category can be considered inefficient decision-making, the later unreliable decision-making. Both affect speed of decision making with a decelerating effect on speed. Inefficient decision making (e.g. not using configuration control on diagrams) has an immediate but ‘gentle’ deceleration. Unreliable decision making (e.g. not using explicitly documented systempartitioning criteria) has a delayed but almost abrupt, “show stopper” deceleration. A unique case study2 that has observed large software projects in-situ is [Curtis88]. Layer upon layer of people interactions affect design decision-making and subsequent productivity in a project. An accumulation of these effects can be represented in the “layered behavioural model” in Figure7. The size and structure of a project determines how much influence each layer has. A large project (such as those defence projects to be studied by this research) is affected by all factors! Figure7 LayeredBehaviorModel[Curtis88] Most research into software architecture decisions has restricted itself to studying decision making (1) at the ‘Individual’ level of the layered behavioural model in Figure7, and (2) postproject data gathering e.g. re-enactment of projects, e.g. recollections from project participation. This is simply due to the practicalities of studying a large project in-situ. The next two sections justify a synthetic, in-situ project simulation to get closer to a real decisionmaking environment, taking into account as many factors affecting speed of decision making as possible. 2 [Glass03]writesfifteenyearslaterthatnosimilarcasestudyofprojectsinsituhasbeencarriedoutsince. 87 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 7 MODELLING THE UNCERTAINTY ARCHITECTURAL DECISIONS TO FINALISE “The safest conclusion to be drawn from descriptive studies is that there is no single decision-making (or problem-solving) process. Mental activities vary enormously across individuals, situations and tasks.” [Hodgkinson08, p.467] For this research, the unquantifiable subject of observation is the architectural design decision-making process(es) and the associated throughput. What has to be quantified though, is the uncertainty of the time it takes to get through the decision-making process; that is, a range/distribution of times to make all those decisions which together constitute a architecture of a desired state of maturity. The inherent variation in speed of decision-making illustrated in the previous sections points towards a probability distribution function of all possible actual times when attempting to match an optimal time to expend on architecture design. Figure8 is an envisaged output from this research. It represents all possible decision-making completion times from 1st January to 31st December. The most likely date is 1st April. 
The date with a 50/50 chance of being true is 1st May. There is a 0% probability of completing all decision making prior to 1st January. It is the distribution that is the extent of (design) uncertainty.

Figure 8 – The Distribution is the Uncertainty

7.1 Agent Based Modelling and Simulation

To an observing outsider, the whole business of architecting is shrouded in uncertainty and complexity. Making any kind of generalisation or theory (taking a first step in understanding the complex system that is 'architecting') requires an ensemble of project instances [Macal08, p.14]. These instances must preserve many human factors and decision-to-decision relationships. Small adjustments of these should be enough to provide the randomness/stochastic nature of human decision making. A computational model run thousands of times, representing thousands of in-situ projects, will be used to produce the distribution envisaged in Figure 8. To reiterate, it is the distribution of possible time periods to finalise all architectural decision making that is the uncertainty.

At the time of writing, an agent-based model [Miller07, Ch.6] appears best suited to modelling the decision makers (agents), the decisions (also agents), decision interconnectedness and interactions, and the human behaviour and environment factors affecting the speed of decision making.

8 RESEARCH METHODS

A research method suited to understanding is required: one must understand the complex system of architecting before deciding what action to take regarding a time-sensitive, decision-based evaluation of an architecture.

8.1 Research Methods Suited to Research of Decision Making in Project Environments

The primary limitation of the software architectural design case studies in the literature review is the sample size of each study. Most of these studies compared or examined fewer than ten participants performing software design; thus the external validity of these studies is weak [Zannier05, p.4]. Furthermore, with the sample sizes being small, it is highly likely that those samples are not representative of the larger population of architects and designers. As a consequence, any results cannot be said to be statistically significant.

For the study of complex systems such as ecosystems, valid research can only be obtained from observations conducted "out in the wild" and not in a test tube, lab or zoo. The equivalent of "out in the wild" for architectural design decision making is a project that is in situ. Unfortunately, the main author participating in a live project alongside the researched participants does not fit within the time frame of a PhD. The research method will therefore consist of an artificial, computational model of a project in situ, with the ability to quickly and cheaply modify synthetic human factors and human environmental factors to see their effects on the speed of, and consequent time period for, decision making. This will be buttressed with discussions with architects, and attempts at decision data gathering from any software or system architecture development undertaken at local universities.

9 SUMMARY

This research paper has adopted the stance that architectural design is decision making, and that uncertainty pervades all design. (Architecting in major projects is about predicting the future: if I design the system thus, how will it behave?
[Hazelrigg98, p.657]) There is additional uncertainty surrounding the varying speed of architecting/decision making; this variation is inherent to numerous human factors affecting decision-making methods. Complexity arises from changes in the interrelatedness and interconnections of decisions themselves as time progresses. Modelling all this uncertainty is to be carried out using agent based modelling and simulation – a technique already used to understand complex systems where many components interact. The understanding will be a distribution of the legitimate time periods for architecting and timing of decisions. The first envisaged application is knowing whether a project is on track during the conceptual design stage of the system or product lifecycle 89 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 undertaken within large projects. The benefits arising from viewing architectures as a set of decisions is evaluation by all stakeholders, technical and non-technical. 10 REFERENCES Boehm, Barry, and Turner, Richard (2003), Balancing Agility and Discipline: A Guide for the Perplexed, Addison-Wesley Professional. Curtis, Bill, Herb Krasner and Neil Iscoe (1988), A Field Study of the Software Design Process for Large Systems, Communications of the ACM, November 1988, Volume 31, Number 11. Curtis, Lewis , and Cerbone, George (2006), “The Perspective-Based Architecture Method”, The Architecture Journal, Journal No. 9, October 2006, Microsoft Developer Network (MSDN) http://msdn.microsoft.com/en-us/architecture/bb219085.aspx , accessed November 2008. Fitch, John (1999), Structured Decision-Making & Risk Management, Student Course Notes, Systems Process Inc. Florentz, B., and Huhn, M. (2007), Architecture Potential Analysis: A Closer Look inside Architecture Evaluation, Journal of Software, Vol. 2, No. 4, October 2007. Furnham, Adrian (2008), 50 Psychology Ideas You Really Need to Know, Quercus Publishing Plc. Gladwell, Malcolm (2005), Blink: The Power of Thinking without Thinking, Penguin Books. Glass, Robert (2003), Facts and Fallacies of Software Engineering, Addison-Wesley. Hazelrigg, G.A. (1998), A Framework for Decision-Based Engineering Design, Journal of Mechanical Design, December 1998, Vol. 120. Hodgkinson, Gerald P., and Starbuck, William H. (2008), The Oxford Handbook of Organizational Decision Making, Oxford University Press, USA. Kruchten, Philippe (2004), An Ontology of Architectural Design Decisions in Software Intensive Systems, Proc. of the 2nd Workshop on Software Variability Management, Groningen, NL, Dec. 3-4, 2004. Langley, Ann et al (1995), Opening up Decision Making: The View from the Black Stool, Organization Science, Vol. 6, No. 3, May-June 1995. Lee, Larix and Kruchten, Philippe (2008), A Tool to Visualize Architectural Design Decisions, QoSA 2008, Lecture Notes in Computer Science, pp. 43–54, Springer-Verlag. Maier, Mark W., and Rechtin, Eberhardt (2009), The Art of Systems Architecting, Third Edition, CRC Press. Miller, John H., and Page, Scott E. (2007), Complex Adaptive Systems: An Introduction to Computational Models of Social Life, Princeton University Press. Novorita, Robert J. and DeGregoria, Gary L. (1996), Less is More: Capturing The Essential Data Needed for Rapid Systems Development, INCOSE 96 Systems Conference, July 1996, Boston. 90 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 O’Neill, R.V. 
et al (1986), A Hierarchical Concept of Ecosystems, Princeton University Press. Rasmussen, Jens, Annelise Mark Pejtersen, and L.P. Goodstein (1994), Cognitive Systems Engineering, Wiley-Interscience. Senge, Peter (2006), The Fifth Discipline, 2nd Revised edition, Random House Books. Spinellis, Diomidis, and Gousios, Georgios (2009), Beautiful Architecture, 1st Edition, O'Reilly Media, Inc.. Rechtin, Eberhardt (2000), System Architecting of Organizations – Why Eagles Can’t Swim, CRC Systems Engineering Series. Rozanski, Nick and Eóin Woods (2005), Software Systems Architecture: Working With Stakeholders Using Viewpoints and Perspectives, Addison-Wesley Professional. Shore, Barry (2008), Systematic Biases and Culture in Project Failures, Project Management Journal, December 2008, Vol.39, No. 4, pp.5-16. Sullivan, Kevin J. (1996), Software Design: The Options Approach, Joint proceedings of the second international software architecture workshop (ISAW-2) and international workshop on multiple perspectives in software development (Viewpoints '96), pp.15 – 18. Ulrich, Karl T. (1999), Product Design and Development, McGraw-Hill Inc.,US; 2nd Revised edition edition. Whittingham, Ian (2009), Hubris and Happenstance: Why Projects Fail, 30th march 2009, gantthead.com,. Young, Ralph R. (2001), Effective Requirements Practice, Addison-Wesley. Zannier, Carmen, and Maurer, Frank (2005 ), A Qualitative Empirical Evaluation of Design Decisions, Human and Social Factors of Software Engineering (HSSE) May 16, 2005, St. Louis, Missouri, USA BIOGRAPHY Trevor Harrison's research interests are in software systems architecture and knowledge management. His background is in software development (real-time information systems), technology change management and software engineering process improvement. Before studying full-time for a PhD, he spent 6 years with Logica and 11 years with the Motorola Australia Software Centre. He has a BSc(Hons) in Information Systems from Staffordshire University and an MBA (TechMgt) from La Trobe University. Prof. Peter Campbell is the Professor of Systems Modelling and Simulation and Research Leader in the Defence and Systems Institute (DASI) at the University of South Australia from 2004 and founding member of the Centre of Excellence for Defence and Industry Systems Capability (CEDISC), both of which have a focus on up-skilling government and defence industry in complex systems engineering and systems integration. He currently leads the design for the simulation component of the DSTO MOD funded Microcosm program and is 91 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 program director for two other DSTO funded complex system simulation projects. Through mid 2007, he consulted to CSIRO Complex Systems Science Centre to introduce complex system simulation tools to support economic planning of agricultural landscapes. 92 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 PROCESS IMPROVEMENT 93 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 This page intentionally left blank 94 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 APPLYING BEHAVIOR ENGINEERING TO PROCESS MODELING David Tuffley, Software Quality Institute, Griffith University Terry Rout, Software Quality Institute, Griffith University Nathan, Brisbane, Qld. 
4111, AUSTRALIA [email protected] | [email protected] Abstract: The natural language used by people in everyday life to express themselves is often prone to ambiguity. Examples abound of misunderstandings occurring due to a statement having two or more possible interpretations. In the software engineering domain, clarity of expression when specifying the requirements of software systems is one situation where absence of ambiguity is important. Dromey’s (2006) Behavior Engineering is a formal method that reduces or eliminates ambiguity in software requirements. This paper seeks an answer to the question: can Dromey’s (2006) Behavior Engineering reduce or eliminate ambiguity when applied to the development of a Process Reference Model? INTRODUCTION Behavior Engineering has proven successful at reducing or eliminating the ambiguity associated with software requirements (Dromey, 2006). But statements of software requirements are not the only kind of artefact developed in the software engineering domain that need to be clear and unambiguous. Process Reference Models (PRM) is another category of software development artefact that might also benefit from being clear and unambiguous. A Process Reference Model is a set of descriptions of process entities defined in a form suited to the assessment and measurement of process capability. PRMs have a formal mode of expression as prescribed by ISO/IEC 15504-2:2003. PRMs are the foundation for an agreed terminology for process assessment (Rout, 2003). The benefits of a method for achieving greater clarity are twofold: (a) PRM developers would gain from improving the efficiency of process model development, and (b) users of process models would benefit by achieving a clearer understanding of the underlying intention of a process which then serves as a consensus starting point for determining how a process might be applied in their own case. This paper therefore examines the ability of Behavior Engineering to disambiguate a particular PRM currently being developed. This paper illustrates how Dromey's Behavior Engineering method (2006) can be used to disambiguate process models, making the resulting model clearer and easier to understand. It is suggested that this method has broader applicability in the Software and Systems Engineering domains. The paper examines in detail how Behavior Engineering has been applied in practice to a specific Process Reference Model developed by the authors. Before and after views are given of several process outcomes that had already passed through three previous reviews to remove ambiguity. The Behavior Engineering analysis results in evident improvements to clarity. 95 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 In a more general sense, it is suggested that this method may be helpful to the modelling of processes at both project and organisational levels, including project-level processes, highlevel policy documents, and project agreements. WHAT IS BEHAVIOR ENGINEERING? Overview Essentially, Behavior Engineering is a method for assembling individual pieces to form an integrated component architecture. Each requirement is translated into its corresponding ‘behavior tree’ which describes unambiguously the precise behaviors of this particular requirement (Glass, 2004). 
The ‘tree’ is built up from (a) components, (b) the states the components become, (c) the events and decisions/constraints associated with the components, and (d) the causal, logical and temporal dependencies associated with the component (Glass, 2004). When each component is modelled in this way, and then integrated into a larger whole, a clear pattern of intersections becomes evident. The individual components fit together like a jigsaw puzzle to form a coherent component architecture in which the integrated behavior of the components is evident. One component establishes a precondition for another component to perform its function and so on. This allows a software system to be constructed out of its requirements, rather than merely satisfying its requirements (Glass, 2004). Duplications and redundancies are identified and removed, for example, when the same requirement is expressed twice using different language in different places. Another benefit is that requirements traceability is managed with greater efficiency by creating traceable linkages between requirements as they move towards implementation. Historical context The practices now described as behavior engineering evolved from earlier work in which an approach for clarifying and integrating requirements for complex systems was developed (Dromey, 2001). This remains a significant application of the approach (Dromey, 2006); however, as the technique evolved, it became apparent that it could be applied to more general descriptions of systems behavior (Milosevic and Dromey, 2002). To date, some preliminary exploration of applying the technique to the analysis and validation of process models has been undertaken. The OOSPICE Project (Stallinger et al, 2002) had the overall aim of improving time-tomarket, productivity, quality and re-use in software development by focussing on the processes and technology of component-based software development (CBD). OOSPICE combined four major concepts of software engineering: CBD, object-oriented development, process assessment and software process improvement. Its objectives were the definition of a (a) unified CBD process metamodel, (b) a CBD assessment methodology, (c) resulting in component-provider capability profiles, and (d) a new CBD methodology and extensions to the ISO/IEC 15504 Information Technology: Process Assessment. Joint Technical Committee IT-015, Software and Systems Engineering (2005). 96 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 A key part of the OOSPICE was the definition of a model of coherent processes addressing the issues of component-based development; the process model was strongly aligned to ISO/IEC 12207 Standard for Information Technology-Software Life Cycle Processes (1998), but incorporated significant additional processes that specifically addressed CBD issues. The process model was developed following the approach of ISO/IEC 12207, with a series of structured activities and defined tasks. Additional detail on input and output work products was also specified for each activity. The process model was examined using the behavior tree method in order to assess its consistency and completeness. The behavior tree analysis was highly successful; a total of 73 task level problems, 8 process level problems and numerous task integration problems were identified. 
In addition, examples were found where fragments of tasks were identified which subsequently have no integration point with the larger process tree – a weakness caused by unspecified or inconsistent task inputs or outputs. An indicative example of the behavior tree method applied to a single process is shown below (explanation of notation given later): 5.2.3.1 Proposal for New/Changed Software [Available] 5.2.3.1 Statement of Requirements [Written] 5.2.3.1 Statement of Requirements ? [Agrees to] Sponsor ? 5.2.3.1 Statement of Requirements [Available] 5.2.3.2 Statement of Requirements ?Available? 5.2.3.2 User Requirements [Expressed] 5.2.3.2 User Requirements ?Comprehensive? 5.2.3.2 User Requirements ?NOT: Comprehensive? 5.2.3.2 User Requirements [Available] 5.2.3.2 User Requirements ^ [Expressed] 5.2.3.5 User Requirements ?Available? 5.2.3.3 User Requirements ?Available? 5.2.3.5 Change Request [Submitted] 5.2.3.3 Statement of Requirements ?Available? 5.2.3.5 Configuration Management [Request Processed] 5.2.3.3 Traceability Report [Available] 5.2.3.1 Statement of Requirements ? NOT: [Agrees to] Sponsor ? 5.2.3.4 No inputs specified 5.2.3.4 Applicable Standards [Determined] Figure 1 – Behavior Tree Analysis – OOSPICE Process Model Figure 1 shows the integrated tree resulting from analysis of a single process. It clearly shows missing (dark shaded 5.2.3.1, 5,2,3,4) and "implied" but unstated elements (light shaded, 5.2.3.1, 5.2.3.3, 5.2.3.5), and also a failure in integration, resulting in lack of overall consistency (Ransom-Smith, McClung and Rout, 2002). The medium-shaded boxes were unchanged. 97 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 Encouraged by success on OOSPICE, the technique was subsequently applied to the review of the Capability Maturity Model Integration (V 1.2) (Chrissis, Konrad and Shrum, 2003). This work was undertaken in the context of a review of drafts of the SEI’s Capability Maturity Model Integration (CMMI) V1.2, and the results formed the basis of requests for change submitted to the SEI; given resource constraints, it was not possible to apply the technique to support the complete review, but where problems were seen to exist, an analysis was conducted. Figure 2 is indicative of how the technique helped to clarify an ambiguity in the specification of the Requirements Development Process Area in CMMI: PA157.IG101.SP101.N101 PA157.IG101. SP101.N101 PLC: Project Life Cycle PA157.IG101. SP101.N101 STAKEHOLDER {Customer} PA157.IG101. SP101.N101 PA157.IG101. SP101.N101 ? ALL PLC activities are addressed by the requirements. how do I show that in the model PA157.IG101.S P101.N101 ) Requirement+ ( ) Requirement# ( Requirement# addresses :> PLC Activity+ PA157.IG101. SP101.N101 PLC Activity# PA157.IG101. SP101.N101 PLC Activity# has impact on :> product PA157.IG101. SP101.N101 Requirement# has impact on :> product OR Text could mean either of these two Figure 2 – Behavior Tree Analysis – Requirements Development Process Area Given the potential identified in these two applications of the approach, it seemed logical to apply the Behavior Tree approach to the larger task of verifying a complete model. The subject of the current study is a specification for a set of organizational behaviors, specified in terms of purpose and outcomes of implementation, which would support and reinforce effective leadership in organizations, and particularly in integrated and virtual project teams. 
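For readers unfamiliar with the notation in Figure 1, the following sketch gives a rough, non-authoritative rendering of a behavior tree node in code; it is not the Behavior Engineering toolset, the structure is only an approximation of the notation, and the tags and component names are borrowed loosely from the figure.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BTNode:
    """One behavior tree node: a tag tracing back to the source clause, a component,
    and either a realised state [ ... ] or a condition/decision ? ... ?."""
    tag: str                                  # e.g. "5.2.3.1", the clause it was derived from
    component: str                            # e.g. "Statement of Requirements"
    state: Optional[str] = None               # e.g. "Written", "Available"
    condition: Optional[str] = None           # e.g. "Comprehensive"
    implied: bool = False                     # True if the node had to be added during integration
    children: list["BTNode"] = field(default_factory=list)


tree = BTNode("5.2.3.1", "Statement of Requirements", state="Written", children=[
    BTNode("5.2.3.2", "User Requirements", condition="Comprehensive", children=[
        BTNode("5.2.3.2", "User Requirements", state="Available"),
    ]),
])


def implied_nodes(node: BTNode) -> list[str]:
    """Tags of nodes that were implied rather than stated in the process text;
    in the OOSPICE analysis these pointed at missing or inconsistent task inputs."""
    found = [node.tag] if node.implied else []
    for child in node.children:
        found.extend(implied_nodes(child))
    return found


print(implied_nodes(tree))   # [] for this small, fully stated fragment
```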
Applying Behavior Engineering to a complete model From the discussion above, it might reasonably be hypothesised that given the parallels between process model and software system requirements (sets of required behaviors and attributes expressed in natural language) that Behavior Engineering may prove useful in verifying a process reference model. LEADERSHIP PROCESS REFERENCE MODEL PROJECT OVERVIEW The leadership of integrated virtual teams is a topic in the software engineering domain that has received little attention until a project to develop such a process reference model was undertaken by the Software Quality Institute. The topic is an important one, considering the increasing trend in a globalised environment for complex projects to be undertaken by virtual teams. The challenges of bringing any 98 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 complex project to a successful conclusion are multiplied by the coordination issues inherent in virtual environments. The Leadership Process Reference Model (PRM) is being developed using a Design Research (DR) approach (Hevner, 2004). DR is well-adapted to the software engineering domain, and IT development generally, being used to good effect by MIT’s Media Lab, Carnegie-Mellon’s Software Engineering Institute, Xerox’s PARC and Brunel’s Organization and System Design Centre (Vaishnavi and Kuechler,2004/5). In this project, DR is applied in the following way, consistent with Hevner’s guidelines: x x x x x x A designed artefact is produced from a perceived need, based on a comprehensive literature review. A series of review cycles follow in which the artefact is evaluated for efficacy by a range of stakeholders and knowledgeable persons and progressively improved. In this project, five reviews are performed. The first and second reviews involve interviews (four interviews per round) with suitably qualified practitioner project managers. These validate the content of the PRM. The third review applies ISO/IEC TR 24774 Software and systems engineering -Life cycle management -- Guidelines for process description (2007) to achieve consistency in form and terminology of PRMs in the Software Engineering domain. The fourth review applies Dromey’s Behavior Engineering to the draft PRM. The fifth review is by an Expert Panel comprised of recognized experts in the field of PRM-building. APPLYING BEHAVIOR ENGINEERING TO VERIFY PRM In this project, Behavior Tree (a subset of Behavior Engineering) verification is applied as the fourth (of five) reviews. Behavior Tree analysis could have been applied at any stage. The circumstances of this particular project determined that the Behavior Tree analysis verification was performed towards the end, not the first or last review stage. Being the second last review, it might be construed that the number and extent of changes that resulted from applying BE is an indication of its efficacy as a model verification tool. The previous three reviews notwithstanding, Behavior Tree analysis resulted in a significant number of changes. Indeed, most of the process outcomes needed to be reworded for clarity. Unnecessary qualifiers were removed, conjoined outcomes were split into two, each concerned with a single clear point. Behavior Engineering is comprised of a series of related activities, performed in a broad sequence, beginning with the Behavior Tree and followed by the Composition. 
With the space available, this paper concerns itself with the Behavior Tree component of the broader Behavior Engineering process. 99 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 BEHAVIOR TREES – FIVE W’S (AND ONE H) The Behavior Tree approach is based on the systematic application, with associated formal notation, of the principle of comprehensive factual description of an event known as the Five W’s (and one H) whose origins extend back to classical antiquity. In the 1st Century BC, Hermagoras of Temnos quoted the 'elements of circumstance' as the loci of an issue (Wooten, 1945): Quis, quid, quando, ubi, cur, quem ad modum, quibus adminiculis (Who, what, when, where, why, in what way, by what means) In the modern world, this dictum has evolved into who, what, when, where, why and how. This principle is widely recognised and practiced in diverse domains such as journalism and police work, indeed almost anywhere that comprehensive and unambiguous description of events or attributes is needed. Translated to the Software Engineering domain, who, what, when, where, why and how becomes Behavior Tree Notation. This is a branched structure showing component-states. The table below shows the application of the Behavior Tree aspect of BE in which each distinct component is described in terms of who, what, when, where, why and how, or the subset of these six descriptors that is applicable to this particular component. Behavior Trees are therefore defined as a formal, tree-like graphical device that represents behavior of individuals or networks of entities which realize or change states, make decisions, respond-to/cause events, and interact by exchanging information and/or passing control (Dromey, 2002). Naming conventions, elements and syntax are illustrated below: Table 1: Variable naming conventions (Dromey, 2007b) Table 2: Elements of a Behavior Tree node (Dromey, 2007b) 100 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 Figure 3: Behavior Tree Node concrete syntax example (Dromey, 2007b) FUNCTIONAL REQUIREMENT TO BEHAVIOR TREE – INFORMAL TO FORMAL Functional requirement. When a car arrives, if the gate is open, the car proceeds, otherwise if the gate is closed, when the driver presses the button, it causes the gate to open. Behavior Tree. Translating the above statement into Behavior tree is illustrated below: Figure 4: Functional requirement to Behavior Tree notation (Dromey, 2007a) 101 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 REMOVING AMBIGUITY Statement: The man saw the woman on the hill with a telescope. This statement can be interpreted at least three different ways, a situation not uncommon with natural language. The developer must determine which interpretation is valid. Figure 5: Resolving ambiguity using Behavior Tree notation (Dromey, 2007a) In this example, the application of BT notation clarifies the statement by establishing the precondition (that the woman is located on the hill) from which the primary behavior can then be distinguished (that the man saw the woman by using the telescope). APPLYING BEHAVIOR TREE NOTATION TO A PROCESS MODEL The left-hand column of the table below shows the outcomes of the V0.3 PRM before applying the Behavior Tree notation. 
The Behavior Tree component column is what results from applying the who, what, when, where, how and who (or subset) using formal notation, from which a clear, simple restatement of the outcome can be derived, as shown in the third column. Note that the material removed from the outcome is not discarded, but relocated to the Informative Material section (not shown) where it serves a useful purpose for persons seeking a fuller understanding of the outcome. Refer to the Rationale for Change for discussion on specific improvements made by applying Behavior Tree notation. In general terms, the improvements derived from the application of BT is greater clarity and economy of words (eg. in first example below 17 words in V0.3 becomes 8 in V0.4 by rephrasing ‘what is to be accomplished’ to simply ‘goal(s)’ and removing the qualifier ‘ideally seen as an accomplished fact’ to the informative section. BT highlighted where and how these economies of expression could be made by applying the process illustrated in Figure 5 to remove ambiguity; in other words the more informal language of V0.3 was rendered into formal language in V0.4 PRM. 102 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 An advantage of BT notation here is that it provides a rigorous, consistently applied editorial logic for people without qualifications and/or much experience as editors. An experienced editor may achieve the same results without BT notation, anyone else would arguably benefit from its application. Behavior Tree Component V0.3 PRM V0.4 PRM Rationale for change FIRST EXAMPLE Leader creates a shared vision of what is to be accomplished, ideally seen as an accomplished fact. Leader clearly communicates the shared vision with team, ideally seen as an accomplished fact. Leader facilitates strong commitment in team to achieving the shared vision, encouraging resilience in the face of goal frustrating events. New outcome in V0.4 Leader develops a concrete and achievable set of goals that support achievement of the shared vision. Leader creates a shared vision of the goal(s). 1.1.1 LEADER (creates) what SHARED VISION/ (of) GOAL(S) 1.1.2 LEADER (communicates) what SHARED VISION/ (of) GOAL(S) 1.1.3 LEADER (gets) what COMMITMENT / (to) SHARED VISION/ (of) GOAL(S) 1.1.4 LEADER (encourages) what RESILIENCE / (in) TEAM when GOALFRUSTRATING EVENTS 1.1.5 LEADER (develops) what OBJECTIVE(S) / (to) ACHIEVE what GOAL(S) Goal(s) not ‘what is to be accomplished’ Remove qualification (ideally seen as an accomplished fact) to Informative Material Leader communicates the shared vision of the goal(s) with the team. Goal(s) included Leader gets commitment from team to achieving the goal(s). Create a new outcome about resilience (it should be a stand-alone outcome rather than a qualification of the commitment to goals outcome. Leader encourages resilience in team when goal-frustrating events occur. New outcome focussing on the important issue of resilience in the face of goal-frustrating events Leader develops practical objective(s) to achieve the goal(s). Practical objectives support the achievement of the goal(s) Remove qualification altogether redundant Change ‘shared vision’ to ‘goals’ since the objectives derive directly from the goals. 
SECOND EXAMPLE Leader consistently displays integrity, characterised by 1.2.1 Leader behaves with integrity LEADER (behaves) 103 Remove qualifiers to the informative section Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 trustworthiness, and adherence to principle Leader consistently displays competence, characterised by technical, interpersonal, conceptual and reasoning skills what INTEGRITY 1.2.2 LEADER (behaves) what COMPETENCE Leader behaves competently Remove qualifiers to the informative section Leader provides teammembers with ondemand synchronous high-resolution communications media Rename ‘richlytextured’ to ‘hi-res’ (a more common term) THIRD EXAMPLE Leader provides richly-textured communications media for team members to use ondemand. 3.4.1 LEADER (provides) who TEAMMEMBERS when ON-DEMAND what HIGH-RES ICT / / (that is) SYNCHRONOUS Add ‘synchronous’ as appropriate Reorder the sentence to be subject-verbobject. FOURTH EXAMPLE Leader allocates project requirements before team members are recruited to verify the integrated team structure is appropriate to goals. 2.3.1 LEADER (verifies) what TEAMSTRUCRURE/ / when RECRUITING TEAMMEMBERS (before) how Leader verifies team structure before recruiting teammembers by allocating project requirements Restructure sentence to place emphasis on correct aspects (this outcome is primarily about verifying the team structure’) Leader develops highcapability selfmanaging performance functions where complex tasks are performed asynchronously Reword to simplify. ALLOCATING REQUIREMENTS FIFTH EXAMPLE Leader develops higher capability selfmanagement functions early in the project lifecycle where complex tasks are performed asynchronously in virtual environments (i.e. where temporal displacement is high). 3.5.2 LEADER (develops) what PERFORMANCEFUNCTIONS / (that are) SELFMANAGING / / (and) HIGHCAPABILITY when COMPLEX TASKS / /(are perf) ASYNCHRONOUSLY Take ‘early in project’ and put in Informative section. Table 3: Applying behavior tree notation to a process model The Behavior Tree notation analysis was performed by the first named author after receiving around 60 minutes of training from Professor Dromey. The data shown in Table 3 is a 104 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 representative subset of the review done on the V0.3 PRM. As an indication of the defect density, this version of the model contained 24 processes with 63 outcomes collectively. Almost all outcomes were changed in some way as a result of the analysis. The kind of defects found and fixed is represented in the table above. Defects are identified when the notation is applied, beginning with the main entity (leader in most cases), a verb that describes what the entity does (eg. develops, or verifies, or provides etc), and followed by the specific what, or who or when etc as makes sense for each outcome in order to build up a complete unit of sense. This process goes beyond simple editing however. When applied rigorously to the process model, a high-degree of consistency and clarity of expression is achieved. Even with competent editors, other process models (eg. OOSPICE and CMMI as discussed earlier) do not achieve this level of consistency and clarity. The analysis of the 24 processes and 63 outcomes took around six hours to perform, including the documenting of the analysis using the kind of table seen above (with PRM before and after, notation and rationale for change). 
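As a hypothetical illustration of the editorial logic applied in Table 3 (and not the formal Behavior Tree notation itself), the who-does-what decomposition of a draft outcome can be held as a small record and re-emitted as a single, plainly worded sentence; outcome 1.1.4 from the table is used as the example.

```python
from dataclasses import dataclass


@dataclass
class OutcomeComponents:
    """Who / verb / what (and optionally when) extracted from a draft outcome."""
    who: str
    verb: str
    what: str
    when: str = ""

    def restate(self) -> str:
        """Re-emit the outcome as one clear, subject-verb-object sentence."""
        sentence = f"{self.who} {self.verb} {self.what}"
        if self.when:
            sentence += f" {self.when}"
        return sentence + "."


outcome_1_1_4 = OutcomeComponents(who="Leader", verb="encourages",
                                  what="resilience in team",
                                  when="when goal-frustrating events occur")
print(outcome_1_1_4.restate())
# Leader encourages resilience in team when goal-frustrating events occur.
```

Qualifiers that do not survive the decomposition are the candidates for relocation to the Informative Material section, which is the pattern seen in the rationale column of Table 3.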
CONCLUSION The approach to specifying processes in terms of their purpose and outcomes was developed in the course of evolution of ISO/IEC 15504 (Rout, 2003) and is arguably a key innovation in the approach to process definition and assessment embodied in the Standard. By viewing a process (or collection of processes) in this way, it becomes clear that the outcomes represent the results of desired organizational behavior that, if institutionalised, will result in consistently achieving the prescribed purpose. The approach redirects the analysis of process performance from a focus on conformance to prescribed activities and tasks, to a focus on demonstration of preferred organizational behavior through achievement of outcomes. Given this, it is logical to see that the application of the Behavior Tree approach to the analysis of such process models will be effective. The earlier studies reported here were of a much smaller scale than the current study, which embraces the full scope of a comprehensive model of organizational behavior. The aim in applying the approach was to provide a more formalised verification of the integrity, consistency and completeness of the model than conventional approaches – based generally on expert review – could achieve. It may therefore be seen from Table 3 above that applying Behavior Tree notation to the draft outcomes of a process reference model produced significant improvement to the clarity of the outcomes by simplifying the language, reducing ambiguity and splitting outcomes into two where two ideas were embodied in the original. It is suggested, based on the evidence outlined above, the Behavior Engineering is a useful tool for model-builders in the domain of model-based process improvement. It reinforces the claims that the technique is a superior tool for the verification of complex descriptions of system behavior. 105 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 REFERENCES Chrissis, M.B., Konrad, M., & Shrum, S., (2003). CMMI Guidelines for Process Integration and Product Improvement. Addison-Wesley, Boston. Dromey, R.G. (2001) Genetic Software Engineering - Simplifying Design Using Requirements Integration, IEEE Working Conference on Complex and Dynamic Systems Architecture, Brisbane, Dec 2001. Dromey, R.G. (2002). From Requirements To Design – Without Miracles, Whitepaper published by the Software Quality Institute. Available: http://www.sqi.gu.edu.au/docs/sqi/gse/Dromey-ICSE-2003.pdf (accessed 13 April 2009) Dromey, R.G. (2006). Climbing Over the 'No Silver Bullet' Brick Wall, IEEE Software, Vol. 23, No. 2, pp.118-120. Dromey, R.G. (2007a). Principles for Engineering Large-Scale Software-Intensive Systems Available: http://www.behaviorengineering.org/docs/Eng-LargeScale-Systems.pdf (accessed 14 April 2009) pg 39. Dromey, R.G. (2007b). Behavior Tree Notation Available: http://www.behaviorengineering.org/docs/Behavior-Tree-Notation-1.0.pdf (accessed 10 June 2009) pg 2-3. Glass, R.L. (2004). Is this a revolutionary idea or not?, Communications of the ACM, Vol 47, No 11, pp. 23-25. Hevner, A., March, S., Park, J. and Ram, S. (2004). Design Science in Information Systems Research. MIS Quarterly 28(1): pp 75-105. ISO/EIA 12207 (1998) Standard for Information Technology-Software Life Cycle Processes. This Standard was published in August 1998. ISO/IEC 15504 (2003) Information Technology: Process Assessment. Joint Technical Committee IT-015, Software and Systems Engineering. Part 2 Performing an Assessment. 
This Standard was published on 2 June 2005.

ISO/IEC TR 24774 (2007). Software and systems engineering – Life cycle management – Guidelines for process description. This Standard was published in 2007.

Milosevic, Z., Dromey, R.G. (2002) On Expressing and Monitoring Behavior in Contracts, EDOC-2002, Proceedings, 6th International Enterprise Distributed Object Computing Conference, Lausanne, Switzerland, Sept, pp. 3-14.

M. Ransom-Smith, K. McClung and T. Rout, (2002) Analysis of D5.1 – initial CBD process model using the Behavior Tree method. Software Quality Institute report for the OOSPICE Project, December 4.

Rout, T.P. (2003) ISO/IEC 15504 - Evolution to an International Standard, Softw. Process Improve. Pract; 8: 27–40.

Stallinger, F., Dorling, A., Rout, T., Henderson-Sellers, B., Lefever, B., (2002) Software Process Improvement for Component-Based Software Engineering: An Introduction to the OOSPICE Project, EUROMICRO 2002, Dortmund, Germany, April.

Vaishnavi, V. and Kuechler, W. (2004/5). Design Research in Information Systems. January 20, 2004, last updated January 18, 2006. URL: http://www.isworld.org/Researchdesign/drisISworld.htm Authors e-mail: [email protected] [email protected]

Wooten, C.W. (2001) The orator in action and theory in Greece and Rome. Brill (Leiden, Boston).

SAFETY MANAGEMENT AND ENGINEERING

BRINGING RISK-BASED APPROACHES TO SOFTWARE DEVELOPMENT PROJECTS

Felix Redmill
Redmill Consultancy
London UK

INTRODUCTION

The history of software development is strewn with failed projects and wasted resources. Reasons for this include, among others:
• Failure to take an engineering approach, despite using the epithet ‘software engineering’;
• Focus on process rather than product;
• Failure to learn lessons and use them as the basis of permanent improvement;
• Neglect to recognise the need for high-quality project management;
• Reliance on tools to the exclusion of understanding first principles; and
• Focus on what is required without consideration of what could go wrong.

If change is to be achieved, and software development is to become an engineering discipline, an engineering approach must be embraced. This paper does not attempt to spell out the many aspects of engineering discipline. Rather, it addresses the risk-based way of thinking and acting that typifies the modern engineering approach, particularly in safety engineering, and it proposes a number of ways in which a risk-based approach may be incorporated into the structure of software development. Taking a risk-based approach means attempting to predict what undesirable outcomes could occur in the future (within a defined context) and taking decisions – and actions – to provide an appropriate level of confidence that they will not occur. In other words, it uses knowledge of risk to inform decisions and actions. But, if knowledge of risk is to be used, that knowledge must be gained, which means acquiring appropriate information.
In safety engineering, such an approach is essential because the occurrence of accidents deemed to be preventable is not considered acceptable. (As retrospective investigation almost always shows how accidents could have been prevented, this often gives rise to contention, but that’s another matter.) In the security field, although a great deal of practice is carried out ad hoc, standards are now based on a risk-based approach: identifying the threats to a system, determining the system’s vulnerabilities, and planning to nullify the threats and reduce the vulnerabilities in advance.

However, in much of software development, the typical approach is to arrive at a product only by following a specification of what is required. Problems are found and fixed rather than anticipated, and consideration is seldom given to such matters as the required level of confidence in the ‘goodness’ of any particular system attributes. A risk-based approach carries the philosophy of predicting and preventing, and this is an asset both in the development of products and the management of projects. This paper therefore proposes some first steps in creating a foundation for the development of such an approach in software development and project management.

The next section briefly introduces the subject of risk, and this is followed by introductions to two techniques, used in risk analysis, which are applicable in all fields and are therefore useful as general-purpose tools. Subsequent sections offer thoughts on the introduction of a risk-based approach into the various stages of software development projects. It is hoped that the explanations offered in this paper are easily understandable, but they do not comprise a textbook. Risk is a broad and tricky subject, and this paper does not purport to offer a full education in it.

NOTES ON RISK

Risk exists in every situation in which we find ourselves and arises from every decision and action that we take. Because of this, we are all practiced risk managers. But our familiarity with risk is intuitive rather than conscious, and:
• Our successes in intuitive risk management are mostly in simple situations;
• Our failures are usually not sufficiently serious to warrant conscious assessment, and we perceive them to be the result of bad luck rather than deficient risk management; and
• Our intuitive risk-management processes are, mostly, not effective in more complex situations, such as development projects and modern technological systems.

Psychologists have shown that our perception of risk is influenced by a number of factors, all of which are strongly subjective. They include:
• Whether the risk is taken voluntarily or not;
• Whether we believe ourselves to be in control or not;
• The level of uncertainty;
• The value of the prize for taking the risk; and
• Our level of fear.

Engineering risk analysis employs two factors: the probability of a defined undesirable event occurring within a defined time period, and the potential consequences if it did occur. As both lie in the future, neither can be determined with certainty, and the derivation of both must include subjectivity. However, given appropriate information, both are estimable. The key is information. In its absence, estimates of probability and consequence can be no more than guesses (as they often are in project ‘risk workshops’). The importance of adequate information cannot be over-emphasised.
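As a rough illustration of the two factors, and of the point that an estimate is only as good as the information behind it, the sketch below combines ordinal probability and consequence estimates and flags unsupported ones. The scales and the multiplicative ranking are assumptions for illustration, not a scheme proposed in this paper.

```python
# Illustrative only: the ordinal scales and multiplicative ranking are assumptions,
# not a scheme prescribed in this paper.
PROBABILITY = {"rare": 1, "unlikely": 2, "possible": 3, "likely": 4}
CONSEQUENCE = {"negligible": 1, "minor": 2, "major": 3, "severe": 4}

def risk_estimate(probability, consequence, sources):
    """Combine two ordinal estimates and record whether they rest on real information."""
    return {
        "rank": PROBABILITY[probability] * CONSEQUENCE[consequence],
        "basis": "supported" if sources else "guess",  # a 'risk workshop' guess otherwise
        "sources": list(sources),
    }

# An unsupported workshop guess versus an estimate backed by trusted information.
print(risk_estimate("likely", "major", sources=[]))
print(risk_estimate("unlikely", "severe",
                    sources=["defect history", "operational incident logs"]))
```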
If there is to be confidence in risk estimates, there must be confidence in the estimates of probability and consequence. And these, in turn, depend on information in which there is confidence, i.e. information from trusted sources, and sufficient of it to warrant the level of confidence that is required or claimed. A great part of risk analysis is, therefore, the acquisition of an adequate amount of information to provide the basis for risk estimates that are appropriate to the circumstances. But what is appropriate to the circumstances? This question is answered by considering such factors as the costs of getting it wrong, the level of confidence needed in the estimates, and the costs in time and resources to achieve a given level of confidence. The greater the importance of the enterprise, the more important it is to derive high confidence in risk estimates, so the more important it is to acquire an appropriate amount of information of 112 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 proven pedigree. Yet, thoroughness in acquiring appropriate information is often thwarted by: • • An intuitive belief that we understand risk better than we do; and Over-confidence in our ability to estimate by quickly ‘sizing-up’ (i.e. guessing). Project ‘risk workshops’, sometimes held to ‘identify’ project risks, do not do the trick – unless we are content with extremely low confidence. But once such a workshop has been held, many project participants, including management, are unaware of its inadequacy and believe its results. There is never a good excuse for failing to collect adequate information, for analysis of an adequacy of the right information almost always disproves pre-held beliefs. It is always wise to be suspicious of our preconceptions. A saying that is attributed, unreliably, to many different authorities makes the point: ‘It ain’t so much the things we don’t know that get us in trouble. It’s the things we know that ain’t so.’ Just as important as obtaining accurate risk values is the process of thinking that makes use of them. Risk-based thinking does not focus attention only on what we want to achieve; it also tries to determine what risks lie ahead, what options we have for managing them, how to decide between the options, and then what actions to take. In well understood situations, risks may be addressed intuitively, by reference to experience or documentation. Or, rules for the management of previously experienced risks may be created. Indeed, risk-management mechanisms are built into the processes considered integral to traditional project management. But such devices flounder in novel or complex circumstances, or when project managers, often because of inexperience or pressure from management, cut corners by eliminating, changing, or failing to enforce processes or rules whose origin and purpose they don’t understand. When risky situations are well understood, it may be possible to make risk-management decisions quickly and with confidence, without needing to obtain and analyse further information. But when a situation is not well understood, it is essential to be more formal. Then, the search for information needs to focus on sources that are relevant to the purpose of the study and on what contributes to improved understanding of the risks. The level of confidence that may be claimed in results depends on the pedigree of the sources. 
TWO GENERAL-PURPOSE TECHNIQUES

In safety engineering, various techniques for directing the search for relevant information have been designed and developed. In this section, two are described. Their application is not restricted to the field of safety; they are useful in most situations, including software development and project management.

Guidewords

Often we talk glibly about ‘failure’, as though it can occur in only one way. But there may be many ways of failure (many failure ‘modes’) and, crucially, each failure mode carries different potential consequences. One technique, Hazard and Operability Studies (HAZOP – see Redmill et al (1999)), is based on the use of ‘guidewords’, each of which focuses attention on a possible failure mode. By using a guideword to raise questions on its associated mode of failure, information is gathered on both the mode of failure’s likelihood of occurrence and its potential consequences.

A generic set of guidewords, with universal application, is presented in Table 1. In some cases they may need interpretation, depending on the circumstances. For example, ‘No’ may need to be interpreted as ‘None’ or ‘Never’; ‘As well as’ may need to be interpreted as ‘More’ when applied to an amount (say, of data), ‘Too long’ in the context of a time interval, or ‘Too high’ when applied to a rate of data transmission. It is this flexibility that makes the guidewords universal; without it, they would appear specific to certain situations.

Table 1: Guidewords and their Definitions
No – No part of the design intent is achieved
As well as – All the design intent is achieved, but with something more
Part of – Some of the design intent (but not all) is correctly achieved
Other than – None of the design intent is achieved, but something else is
Early – The design intent is achieved early, by clock time
Late – The design intent is achieved late, by clock time
Before – The design intent is achieved before something that should have preceded it
After – The design intent is achieved after something that it should have preceded

As a simple example of the application of guidewords, consider the production (by a software-based system) of an invoice. It is immediately clear that reference to ‘failure’ is vague, for there are many ways in which it may be faulty, and Table 2 shows the use of guidewords in identifying them.

Table 2: Use of Guidewords in Identifying and Examining Invoice Production Failure Modes (potential consequences not exhaustive)
No – Mode of failure: No invoice is produced. Potential consequences: Customer suffers no loss but may lose confidence in company. Company fails to collect payment.
As well as – Mode of failure: Invoice contains additional items for which the customer is not responsible. Potential consequences: Customer loses confidence and may cease to do business with the company.
Part of – Mode of failure: Invoice contains only some of customer’s items. Potential consequences: Customer may lose confidence in company. Company does not collect full payment.
Other than – Mode of failure: A document other than the invoice (perhaps another customer’s invoice) is produced. Potential consequences: Customer loses confidence and may cease to do business with the company. Company does not collect payment. A further invoice has to be produced.
Early – Mode of failure: Invoice is produced before all work is done. Potential consequences: Customer may be confused.
Late – Mode of failure: Invoice is produced after it should have been. Potential consequences: Payment is collected late. If systematic, company suffers cash-flow problem.
Before – Not relevant.
After – Not relevant.
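As an illustration of how the guidewords can drive systematic questioning of a design intent, the short sketch below turns Table 1 into a set of prompts. The prompt wording and the example design intent are illustrative additions, not part of HAZOP itself.

```python
# The guidewords and definitions below are those of Table 1; the prompt text and the
# example design intent are illustrative scaffolding only.
GUIDEWORDS = {
    "No": "No part of the design intent is achieved",
    "As well as": "All the design intent is achieved, but with something more",
    "Part of": "Some of the design intent (but not all) is correctly achieved",
    "Other than": "None of the design intent is achieved, but something else is",
    "Early": "The design intent is achieved early, by clock time",
    "Late": "The design intent is achieved late, by clock time",
    "Before": "The design intent is achieved before something that should have preceded it",
    "After": "The design intent is achieved after something that it should have preceded",
}

def prompts(design_intent):
    """Yield one failure-mode question per guideword, to be answered by the study team."""
    for word, meaning in GUIDEWORDS.items():
        yield (f"{word}: {meaning}. How could this happen to "
               f"'{design_intent}', and with what consequences?")

for question in prompts("an invoice is produced for each completed job"):
    print(question)
```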
Once the credible modes of failure and their potential consequences have been identified, further investigation may be carried out to determine the value of each failure mode to a defined stakeholder (in this case, the customer or the company). For example, we may be interested to determine what events might result in the loss of a customer (or customers in general), or what could lead to the company not being paid, or to a customer experiencing unacceptably low quality of service. Then, the actions taken would be to ensure that the risk of occurrence of such events is low – for example by improving reliability through design, strengthening the management and quality of development, being more rigorous in testing, or improving the monitoring of operation.

Used judiciously, these guidewords can facilitate identification of failure modes in just about any situation; they comprise a general-purpose tool. And, as seen from the above example, the steps taken to reduce the likelihood of failure, or to achieve quality, are not necessarily changed by taking a risk-based approach; rather, those steps are directed more effectively.

Fault Trees

Another universally applicable technique is fault tree analysis. In this method, a single ‘top event’ is selected and top-down analysis is carried out to identify its potential causes. First, the immediate causes are identified, as in the example shown in Figure 1. In this example, the top event would result from any one of the potential causes, so they are linked to it by the logical OR function.

[Figure 1: An example fault tree (to first-level causes) – the top event ‘Car fails to start’ is linked by OR to: no petrol delivered, no oxygen delivered, battery problem, electrical fault, other causes.]

In some cases, the occurrence of the top event would require the concurrence of two events, so they would be linked to it by a logical AND function, as in Figure 2. Clearly, reliability is improved if failure requires two (or more) simultaneous events rather than a single event, so the fault tree may be used to inspire a solution as well as serving as an analysis tool. Indeed, this is the principle of redundancy, which is mostly used in hardware replication, but also, in some cases, in software.

[Figure 2: A fault tree with causes linked by AND – the top event ‘Loss of power’ requires both ‘Mains fails’ and ‘Generator fails’.]

Once the immediate causes have been determined (see Figure 1), their (the second-level) potential causes are determined. For example, the failure to deliver petrol to the engine may be because there is none in the tank or its transmission from the tank is impeded, which in turn may result from a blockage, a leak, or a faulty pump. The battery problem may be that the battery is discharged (flat) or that the electrical connection is faulty. Then the third-level causes are determined, and so on, until a full analysis is completed. In systems in which probabilities of causes are known from historic data (for example, electromechanical systems whose component reliabilities are recorded), calculations may be made to arrive at a probability of occurrence of the top event. However, such results are not always accurate and can be misleading. Used qualitatively, the fault tree is a valuable tool in many situations.
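A fault tree used qualitatively can be captured in a few lines of code. The sketch below is an illustrative assumption rather than a fault tree analysis tool: it models the OR structure of Figure 1 and the AND structure of Figure 2, and evaluates whether the top event follows from a given set of basic events.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Event:
    name: str
    occurred: bool = False

@dataclass
class Gate:
    kind: str   # "OR" or "AND"
    name: str
    inputs: List[Union["Gate", Event]]

    def happens(self) -> bool:
        """Evaluate the gate from the states of its input events and sub-gates."""
        states = [i.happens() if isinstance(i, Gate) else i.occurred for i in self.inputs]
        return any(states) if self.kind == "OR" else all(states)

# Figure 1, qualitatively: any single first-level cause defeats the car.
car_fails = Gate("OR", "Car fails to start", [
    Event("No petrol delivered"),
    Event("No oxygen delivered"),
    Event("Battery problem", occurred=True),
    Event("Electrical fault"),
    Event("Other causes"),
])

# Figure 2: both causes must occur, which is the usual argument for redundancy.
power_lost = Gate("AND", "Loss of power", [Event("Mains fails", True), Event("Generator fails")])

print(car_fails.name, "->", car_fails.happens())    # True: one cause is enough
print(power_lost.name, "->", power_lost.happens())  # False: the redundant generator holds
```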
For example, a fault tree may be used to model the ways in which a project might have an unsuccessful outcome – and, thus, provide guidance on ways in which this could be avoided. It is traditionally said that there are three principal ‘dimensions’ of a project – time, budget, and to specification – so that project failure is defined in terms of any one of the three. So, if each of these is taken as a top event, the resulting fault trees would reveal the ways in which a project could fail. Deriving such fault trees for an entire project would result in too large a number of contributing causes, and be too vague for specific instances – though the causes that they throw up could be built into project management checklists. However, such fault trees could be useful if derived for the ends of project stages, or at defined milestones. Figure 3 shows an example of first-level causes for time overrun at the first project milestone. These, and the results of next-level causes, may be used as indicators of essential project-management responsibilities. Like guidewords, a fault tree is a tool of universal application, which may be employed to advantage in most risk analyses. This brief introduction neither explains every detail of fault tree creation nor presents its difficulties. A fuller account is given by Vessely et al (1981).

[Figure 3: A Fault Tree of First-level Causes of Time Overrun at the First Project Milestone – the top event is linked by OR to: project started late, project authority delayed, planning too optimistic, necessary documents unavailable, delay in creating project team, unexpected delays, team lacks necessary skills, other causes.]

RISK ANALYSIS AT THE OBJECTIVES STAGE OF A PROJECT

A thorough risk analysis consists of a number of activities:
• Identifying the hazards that give rise to risk;
• Collecting information to enable an estimate to be made of the likelihood of each hazard maturing into an undesirable event;
• Collecting information to enable an estimate to be made of the potential consequences if the undesirable event did occur;
• Analysing the information to derive estimates of likelihood and consequence;
• Combining the estimates to arrive at values of the risks arising out of the hazards; and
• Assessing the tolerability of each risk.

In safety engineering, and in other fields when the stakes are high, each of these activities must be carried out so as to provide high confidence in the results (indeed, the two techniques described above are used in the first four of these activities). However, it is not always necessary to be so thorough. Nor is it always possible to be, for there are times when sufficient information is not obtainable (e.g. at the early stages of a new disease, and in the early days of climate-change debate). The following two generalisations may be made:
• The thoroughness of a risk analysis and confidence in its results are constrained by the availability of appropriate information; and
• The necessity for thoroughness is usually dependent on the criticality of the situation.

An instance when detailed information is unavailable is at the Objectives stage of a project, when product details have not been designed and only strategic proposals have been made. Yet, at this point, carrying out a risk analysis is invaluable, for it can identify risks that could lead to project failure and huge consequential losses.
Indeed, the report into the London Ambulance Service (1993) showed that failure to consider the risks, particularly early in the project, was instrumental both in awarding the contract to replace the service’s allocation system to an unsuitable company and in the total failure of the project. The value of analysis at the Objectives stage may be exemplified by a proposal for a hospital’s patient-records system, the objectives of which might be stated by management as being to:
• Store the clinical and administrative records of all patients;
• Provide physicians with rapid access to patients’ medical histories;
• Provide the means of updating patients’ records with diagnoses, prescriptions, and illness histories;
• Provide nurses with rapid access to treatment and dosing specifications;
• Provide management with means of documenting, accessing and analysing all patient transactions, both medical and administrative; and
• Produce invoices for services provided to patients.

These objectives are defined from the perspectives of some of the system’s stakeholders, specifically the hospital’s management and medical staff. Typically, and importantly, they state what they require of the system – which amounts to the creation and updating of records, storage, provision of access, and the output of analysed information – all database and information-system facilities. The objectives do not reveal the difficulties involved in attempting to meet the stated goals or the risks attached to the failure of any of them. Yet, as revealed by the London Ambulance Service Inquiry, understanding such risks at this early stage is crucial. Deeper scrutiny reveals that a slow system would not be used by doctors (a fact not considered until too late by those responsible for the UK’s health systems), that the loss of records would result in the hospital not being paid for services, that nurses and administrators should not have access to full patient records, that if dosing information were corrupted patients’ lives would be threatened, and that unauthorized access could result in breaches of safety. Moreover, the safety risks are to the patients who, though at the heart of the system, are not mentioned in the objectives except by allusion. From these observations, it becomes apparent that the system should not only meet its functional requirements but also be highly available, secure and safe.

Such revelations are likely to be daunting to management who supposed their system to be a simple one. They should lead management to recognize the need for appropriate expertise in its specification, design, development and implementation – and, if the development project is to be contracted out, in contracting and management of the project. A risk analysis at the objectives stage facilitates:
• The clarification of objectives;
• The early detection of the implications of – and, importantly, the risks thrown up by – the stated objectives;
• The detection of conflicts between objectives; and
• The definition of objectives that should have been defined but weren’t.

In addition to facilitating better definition of objectives, a risk analysis at this stage throws up the need to take decisions about the future of the clarified objectives. For example, when management comes to understand that project risks include safety, will they want to proceed with the project as defined? In some cases they wouldn’t.
Options for action provided by analysis of the objectives include:
• Cancelling some objectives. This may be done, for example, because it is realized that their implementation would take too long, would require the use of technologies that are untried or for which we possess no expertise, or would carry risks (e.g. safety risks) with which we do not wish to be involved.
• Cancelling the project. This may be done if the risks exemplified in the previous point were carried by the core objectives.

And, if it is decided to proceed, that is to say, to accept the risks, the analysis provides the basis for defining:
• Requirements, to be included in the specification, for the appropriate management of the identified risks; and
• Special testing requirements for the risk management aspects of the design and, later, the product.

This combines risk-based and test-based approaches, and thus offers a new and deeper way of raising confidence in the ultimate success of both the project and the product. A risk analysis also provides information for the credentials required of (and, thus, for the selection of) a contractor, if development is to be contracted out – a point emphasised in the London Ambulance System Inquiry (1993).

RISK ANALYSIS AT THE SPECIFICATION STAGE

At the Objectives stage, there is little detailed information about an intended project, so both the objectives and the identified risks are mostly at a strategic level. But, as a project progresses, more detail is introduced at every stage, and this has two major effects:
• It allows more thorough identification and analysis of risks; and
• It is likely to introduce further risks.

Thus, risk analysis should be carried out when the principal product of each stage has been completed. For example, when the specification has been developed, a risk analysis should serve at least four purposes. First, it should identify and analyse new risks introduced by the functional requirements. Often these appear straightforward, but if, for example, they invoke a domain, such as safety or security, with which the developers are unfamiliar, or if they call for a level of system or functional integrity not previously achieved, the risks involved should be recognized and the implicated requirements given special design considerations.

Second, it should identify and analyse new risks introduced by the non-functional requirements. Often these are assumed to accord with what is ‘normal for such systems’ or at least technologically possible. But they should be carefully assessed. The required response times, load capabilities, and reliability may not easily be achievable, or not achievable at all by the staff employed to develop the system. Analysts should also identify the absence of necessary non-functional requirements – a frequent deficiency in specifications.

Third, it should identify and analyse the risks introduced by any constraints placed on the product or the project. Examples are: unfamiliar technologies, software and hardware platforms, systems with which the system must be integrated, and the time allowed for testing.

Fourth, the analysis should be used to determine if requirements have been specified for the mitigation of the risks identified at the Objectives stage. It is sensible also to identify requirements that do not contribute to any of the objectives.
Although they may add value for some stakeholders, they do not add strategic value (at least, not according to the project’s defined strategic intent) and are likely to cause time over-run. Given the results of the analysis, options for risk-based action include: • • • Cancel risky requirements – which requires re-consideration of the objectives; Contract the project, or parts of it, to others with appropriate competences and experience; and Accept the risks and plan their mitigation, e.g. by planning and implementing changes to the constitution and competence of the development team, by designing risk-reduction measures into the design, and by defining controls in operational procedures. Thus, the project now proceeds with the purpose not only of meeting the specified requirements, but also of mitigating identified risks and, thus, avoiding problems that could otherwise have occurred. Not to carry out a risk analysis and take appropriate actions results in the unrecognised risks maturing into project and product problems and, thus, leading to time over-run, budget over-spend, unsatisfactory products, or total project failure – examples of all of which abound. RISK ANALYSIS AT THE DESIGN STAGE When a design has been completed – whether architectural or detailed – it should be examined from two risk-based perspectives. First, a technical check should be made to ensure that it provides features that reduce the previously identified risks to tolerable levels. This check should trace a path through all previous risk analyses back to the Objectives stage. Each time an identified risk was accepted, requirements for mitigating it should have been introduced, and those to be met in the design should now be verified. The check should also establish whether any risk-mitigation requirements are to be effected by operational procedures, and if so, it should confirm that their design is not neglected. Second, a study should be undertaken to determine what risks the design introduces into the operational system or into the project. Typically, the rationale of a design is that it should meet the functional and non-functional requirements. But what if a failure should occur during operation (e.g. of a function or a component)? In some cases, the resulting loss may be 119 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 deemed unimportant; in others, it may be catastrophic. So, by carrying out a meticulous study of a design – say, using HAZOP and its guidewords, or fault trees, as discussed earlier – not only are the risks identified, so are the most critical aspects of the system. Actions that follow may then include: • • • • Making design modifications for risk reduction, e.g. introducing redundancy or protection functions, or even just error messages; Informing the software development teams of the most critical functions so that they can apply their most rigorous processes (including appropriate staff) to them; Defining test requirements for the risk-reduction functions and, more generally, providing information to inform risk-based testing; and Defining operational procedures to ensure that the risk-reduction and risk-management measures are effective and, if necessary, to introduce further risk management. RISK-BASED TESTING For well known reasons, exhaustive software testing is not possible in finite time and, if it were, it would not be cost-effective. Planned testing must therefore be selective. But on what basis should selection be made? 
It makes sense to carry out the most demanding testing on the most important aspects of the software, that is, where the risks attached to failure are greatest. But how might this be done systematically? Normally, derivation of a risk value requires estimates of both consequence and likelihood. However, though the consequence of failure of a defined service provided by the system may be estimated, there is no information on which to base an estimate of the service’s probability of failure prior to testing the relevant software. Yet there are ways in which a risk-based approach can achieve, or at least improve, both effectiveness and efficiency in testing. First, a ‘single-factor’ analysis may be conducted, based on consequence alone, on the basis that if the consequence of failure of a service is high then the probability of failure is desired to be low. On the assumption that testing – followed by effective fixing – reduces the probability of failure, estimates of the consequences of failure of all the services provided by a system may be used to determine the rigour of testing of the items of software that create those services. Of course, it cannot be proved that the probability of failure has been reduced to any given level, but a relationship between severity of consequence and rigour of testing can be defined for a project. This technique carries the major advantages that: • • Sufficient information is usually available for consequence to be determined accurately, provided that analysts meticulously seek it out; and Both the consequence analysis and the resulting test planning can be done in advance of development of the software itself, so it is a strategic technique. Many testers believe that they already carry out this type of risk-based test planning, but they are usually undone because they fail to take the trouble to collect the information necessary for making sensible consequence estimates. Confidence in an estimate can only be justified if it is supported by an adequacy of relevant information. And, as pointed out earlier, deep investigation of data almost always disproves preconceptions. Second, since estimates of the quality of software may be made, by observation and historic information, prior to testing, a single-factor analysis may be based on quality as a surrogate for probability. Confidence in quality estimates cannot be as high as those in consequence, and such estimates cannot be made until the software has been produced and inspected. 120 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 However, quality estimates may be used as a tactical technique late in a project – for example, to provide a basis for planning the reduction of testing because of the lack of time. Third, even if test planning has been based on a consequence-based analysis, a quality-based analysis can later be used to offer confidence in decisions on reducing or re-prioritising testing as time runs out. In this case, one might refer to ‘two-factor’ analysis. The scope of this paper allows only this brief introduction to the possibilities of risk-based testing, but the author has provided fuller details elsewhere (Redmill 2005). SUMMARY Traditionally, the culture in software development projects is to focus wholly on the production of what has been specified. 
The result is that risks that might have been foreseen – and mitigated – unexpectedly give rise to problems that throw projects off course and lead to defective – and, in some cases, unusable – products. Modern engineering thinking, on the other hand – particularly in domains such as safety and security – is to take a predict-andprevent approach by taking account of risks early. This entails carrying out risk analyses and basing risk-management activities on them, so as to reduce the likelihood of undesirable events later. This paper proposes the introduction of a risk-based approach into software development and project management. The paper outlines what such an approach implies and goes on to explain ways in which it can be implemented. It describes key aspects of two risk-analysis techniques employed in safety engineering, shows that they can in fact be used in all situations, and briefly demonstrates their application in software development projects. It then shows how risk analysis can be used at the various stages of projects, in particular at the Objectives, Specification, and Design stages. Carrying out risk analyses at these points provides options for developers, from strategic management at the Objectives stage to design decisions later on. It offers the opportunity to make changes in response to the acquired knowledge of risks: to cancel a project, or parts of it, to change requirements and adjust the specification, and to build risk-reduction features into the design. Further, by the early identification of critical system features, it also presents the opportunity for early planning of their testing. Further, the paper offers an overview of ways of carrying out risk-based testing, by using knowledge of risks to inform test planning and execution. This paper is not a final manifesto, or a textbook. It only introduces the subject of risk-based thinking. However, it is felt that the principles proposed could bring improvements to software development and project management and take these disciplines a step closer to embracing an engineering approach and culture. REFERENCES London Ambulance Service (1993). Report of the Inquiry into the London Ambulance Service. South West Thames Regional Health Authority, UK Redmill F, Chudleigh M and Catmur J (1999). System Safety: HAZOP and Software HAZOP. John Wiley & Sons, Chichester, UK Redmill F (2005). Theory and Practice of Risk-based Testing. Software Testing, Verification and Reliability, Vol. 15, No. 1 Vessely W E, Goldberg F F, Roberts N H and Haasl D F (1981). Fault Tree Handbook. U.S. Nuclear Regulatory Commission, Washington DC, USA 121 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 This page intentionally left blank 122 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 MODEL-BASED SAFETY CASES USING THE HiVe WRITER Tony Cant, Jim McCarthy, Brendan Mahony and Kylie Williams Command, Control, Communications and Intelligence Division Defence Science and Technology Organisation PO Box 1500, Edinburgh, South Australia 5111 email: [email protected] Abstract A safety case results from a rigorous safety engineering process. It involves reasoned arguments, based on evidence, for the safety of a given system. The DEF(AUST)5679 standard provides detailed requirements and guidance for the development of a safety case. 
DEF(AUST)5679 safety cases involve a number of highly inter-related documents; tool support is needed to manage the process and to maintain consistency in the face of change. The HiVe Writer is a tool that supports structured technical documentation via a centrallymanaged datastore so that any documents created within the tool are constrained to be consistent with this datastore and therefore with each other. This paper discusses how the HiVe Writer can be used to support safety case development. We consider the safety case for a fictitious Phased Array Radar Target Illuminator (PARTI) system and show how the HiVe Writer can support hazard analysis for the PARTI system. 1 INTRODUCTION Safety critical systems are those with the potential to cause death or injury as a result of accidents arising from unintended system behaviour. For such systems an effective safety engineering process (along with choice of the appropriate safety standards) must be established at an early stage of the acquisition lifecycle, and reflected in contract documents. This process culminates in a safety case: i.e. reasoned arguments, based on evidence, for the safety of the system. Safety cases are important because they not only help to provide the assurance of safety that is required by technical regulators, but can also – by providing increased understanding of safety issues early in the project lifecycle – help avert substantial costs at later stages in the project lifecycle. There are well-known methods and tools to support the development of safety cases. For example, the ASCAD (Adelard 2009) tool makes use of the “claims, arguments, evidence” (CAE) approach (Emmet & Cleland 2002), as do the hypertext systems AAA (Schuler & Smith 1992) and Aquanet (Marshall, Halasz, Rogers & Janssen 1991). Another approach is called GoalStructured Notation (Wilson, McDermid, Pygott & Tombs 1996). GSN represents elements (i.e. requirements, evidence, argument and context) and dependencies of the safety case by means of a graphical notation. Of the tools available today the most widely used is the Adelard Safety Case Environment (ACSE), which supports GSN as well as CAE (Adelard 2009). A safety case will usually involve a complex suite of documents built on various assurance artifacts and other forms of evidence. The methods and tools mentioned above are valuable, but they do not fully address the fact that the safety case must be guaranteed to be consistent and robust in the face of changes (Kelly & McDermid 1999): such changes may be trivial ones that 123 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 need to be tracked throughout the documentation, or they may be major changes that impact the whole safety argument. One approach to the consistency problem is to develop a glossary of technical terms and to ensure that such terms are used systematically throughout the safety case documentation. We speak of structured text to refer to documents with such embedded technical terms. The aggregration of technical terms used in the documentation forms a light-weight model for the safety case. Clearly there is considerable scope to provide tool support for developing such model-based safety cases. The HiVe (Hierarchical Verification Environment), currently under development at DSTO, is a general purpose tool for producing structured system documentation. 
It allows a user to develop the necessary technical glossary and ensures that references to technical terms are managed consistently throughout. It has a number of potential applications; in this paper we claim that the application of the HiVe to the development of safety cases offers many advantages. Another advantage to the HiVe’s implementation of structured text is the ability to enforce workflows through the introduction of form-like constructions. By structuring the safety case documentation in an appropriate way, it is possible to ensure structural compliance with the requirements of the standard. The HiVe may also be programmed to perform simple correctness checks on important details of the safety case or even to calculate the correct results of mandated analyses. In this paper, we describe a specialization of HiVe that supports the development of DEF(AUST)5679 compliant safety cases. In Section 2.1 we give an overview of DEF(AUST)5679, focusing on the requirements for hazard analysis. Section 2.2 summarises the HiVe Writer. In Section 3 we introduce the concept of model-based safety case, and discuss issues for tool support. Section 4 presents an overview of the hazard analysis for a realistic Defence case study. In Section 5 we show how this case study is captured within the HiVe Writer. Section 6 presents some conclusions and suggestions for further work. 2 BACKGROUND 2.1 DEF(AUST)5679 The recently published DEF(AUST)5679 Issue 2 (DSTO 2009b) provides detailed requirements and guidance for the development of safety cases. A safety case involves the following steps: • An analysis of the danger that is potentially presented by the system. This involves an assessment of the system hazards, along with the ways that accidents could occur, as well as their severity. This is called hazard analysis. • A system design that provides safety features (internal mitigations), i.e. a safety architecture. • Arguments that system components have been built in such a way that provides assurance of safety, called design assurance. • An overall narrative (or high-level argument) that is convincing to a third-party and pulls all the above together. The safety case must be acceptable to an auditor (whose role is to monitor the system engineering process and ensure that the procedural aspects of standards are followed); to an evaluator, whose role is to provide a thorough independent and objective review of the validity of the technical arguments that critical requirements are met by the system; and to a regulator, whose role is to set the policy framework within which decisions about system safety must be made (and who also may have the role of certifying that the system is sufficiently safe to be accepted). 124 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 For the purposes of this paper we shall describe in more detail the hazard analysis stage of the safety case. The aim of hazard analysis is to describe the system, its operational context and identify all possible accident scenarios (and their associated danger levels) that may be caused by a combination of the states of the system, environmental conditions and external events. An accident is an external event that could directly lead to death or injury. The severity of an accident is a measure of the degree of its seriousness in terms of the extent of injury or death resulting from the accident. 
System hazards are top-level states or events of the system from which an accident, arising from a further chain of events external to the system, could conceivably result. Accident scenarios describe a causally related collection of system hazards and coeffectors that lead to a defined accident. An external mitigation is an external factor that serves to suppress or reduce the occurrence of coeffectors within a given accident scenario. The strength of external mitigation must be assessed as one of low, medium or high. The danger level of an accident scenario is a function of the resulting accident severity and the assigned strength of external mitigation. There are six danger levels, labelled from D1 to D6. Some of the key requirements in DEF(AUST)5679 that are relevant for hazard analysis are reproduced in Figure 1. They have a special format: they are in bold face, with a unique paragraph number (shown here in square brackets), and usually reference one or more technical terms (shown in blue).

2.2 The HiVe Writer

The HiVe Writer represents a novel approach to the creation and management of complex suites of technical documentation. It blends both modelling and documentation through a synthesis of concepts from model-based design, literate programming, and the semantic web. It supports a range of documentation styles from natural language descriptions through to fully formal mathematical models. The Writer’s free text mode allows the author maximum expressive freedom. The Writer’s syntax-directed editing mode ensures compliance with documentation and notational standards. The two modes may be mixed freely in a single document. More details on the HiVe may be found in (Cant, Long, McCarthy, Mahony & Williams 2008); in this section we give a brief overview.

In the Writer, the modelling activity proceeds by interspersing commands through a normative design document (NDD), which serves as a “script” that builds up the model-based design. These commands serve to enter data into a centrally managed datastore. The datastore records the fundamental technical terms and other building blocks of our model — called formal elements — as well as properties satisfied by these formal elements. Commands may also be used to initiate interactions with external analysis tools and to record results in the datastore. Elements from the datastore may be freely referred to in any document. All such references are created (and are guaranteed to remain) consistent with the datastore, greatly simplifying the management of change propagation and consistency across complex suites of documentation.

The Writer also provides a highly sophisticated rendering layer that allows the user to present information from the datastore in numerous styles. This allows the user to target the presentation style to the needs of the intended audience and to create different presentations of the same information for different audiences. Designers are encouraged to write design, explanatory and technical documentation in parallel, with complete consistency and targeted presentation styles, thereby helping them to produce documents that convince others of the correctness of the design. The capabilities of the Writer can be extended via a powerful plug-in facility. In particular, plug-ins can be developed to support specific business processes and documentation standards. In Section 5, we describe a safety case tool that we have developed as a HiVe plug-in.
[8.6.2] The Hazard Analysis Report must provide a list of all Accidents arising in Accident Scenarios, including an estimate of the Accident Severity of each Accident as determined by Table 8.1.

Accident Severities (Table 8.1 in DEF(AUST)5679):
Catastrophic – Multiple Loss of Life
Fatal – Loss of Life
Severe – Severe Injury
Minor – Minor Injury

[8.9.4] The Supplier shall assign each Accident Scenario a Danger Level in accordance with the following conditions.
• For each Accident Scenario a default Danger Level shall be assigned based on the Accident Severity using Table 8.2.
• If no External Mitigations are present in the Accident Scenario, the Danger Level shall remain at the default value for that severity.
• If, for a given Accident Scenario, a strength of External Mitigation can be assigned, then the Danger Level shall be reduced from its default value according to Table 8.2.

[8.9.6] Danger Level assignments of greater than D4 must be explicitly justified in the Hazard Analysis Report, showing cause why stronger External Mitigation or Damage Limitation factors could not be introduced into the Operational Context.

Danger Levels (Table 8.2 in DEF(AUST)5679):
Accident Severity | Default Level | External Mitigation: Low | Medium | High
Catastrophic | D6 | D6 | D5 | D4
Fatal | D5 | D5 | D4 | D3
Severe | D4 | D4 | D3 | D2
Minor | D3 | D3 | D2 | D1

[8.10.2] The Supplier shall assign to the System a System Danger Level that is the maximum over all the Danger Levels assigned for Accident Scenarios.

Figure 1: Requirements for hazard analysis

3 MODEL-BASED SAFETY CASES

We have already noted that there are a number of potential benefits to be gained from adopting a light-weight modelling approach in safety case development. Here we discuss some of the experiments we have carried out in applying HiVe concepts in the DEF(AUST)5679 context.

3.1 Observations on compliance

In developing a safety case against a defined standard — in our case DEF(AUST)5679 — the matter of compliance assurance comes to the fore. There are various levels of compliance. As a trivial example, DEF(AUST)5679 requires the use of four accident severities; if the safety case actually makes use of five severities, then it will not be compliant. This kind of compliance is easy to check and should (ideally) be enforced at the time that the safety case is developed, so that such mistakes are impossible. We call this shallow (or surface) compliance. At the other end of the spectrum, for example, would be the case where the reasoning used to justify the accident severities is unconvincing. This would need to be identified by a skilled reviewer (or evaluator) and is an example of deep (non-)compliance with the standard. It can’t be automatically enforced during safety case development. Nevertheless, the safety case should be built and laid out in such a way as to facilitate the checking of all forms of compliance. We have identified a number of ways that the HiVe can enforce shallow compliance and support deep compliance.

3.2 HiVe DEF(AUST)5679 experiments

The development of the standard itself is a complex endeavour, complicated in the case of DEF(AUST)5679 Issue 2 by the parallel development of guidance papers and a significant worked case study (DSTO 2009a).
Our experiments in support of this process are described elsewhere (Cant et al. 2008). In short, we developed an extensive glossary of formal elements (such as accident, hazard analysis, severity etc) to manage consistent use of terminology across this large body of documentation and also a light-weight model of the various actors and processes treated in the standard. Although modest in scope, these modelling activities gave us useful tools for managing consistency and completeness across this large document suite. For example, the tools ensured that each requirement had a clearly defined responsible agent and automatically collected a table of the responsibilities of each agent in an appendix to the standard. We are confident that this tool support led to a significantly higher quality end product.

Encouraged by the results of this experiment, we began to consider the development of a HiVe plug-in for supporting the development of actual safety cases. Most obviously, safety case authors would benefit from access to the DEF(AUST)5679 technical glossary as well as an ability to define and manage a technical glossary specific to the system under consideration. The basic HiVe Writer already provides such capabilities, ensuring that all of the technical terms (as well as the requirements) of DEF(AUST)5679 can be directly referenced in the safety case under construction, with the HiVe maintaining consistency throughout. More ambitiously, we were interested in developing light-weight models for the necessary workflows and deliverables of DEF(AUST)5679.

[Figure 2: PARTI System and Environment – own ship with phased array radar beams and laser targeting beams, the ESSM, airborne threats and prohibited areas.]

Building on the basic datastore of DEF(AUST)5679 requirements we introduced specific commands for making compliance claims against the standard. For example, a command that describes and adds a new accident scenario to the list required under paragraph 8.6.4 of DEF(AUST)5679 (see Figure 1). Using structured text techniques, this command can ensure that each accident scenario has all of its required attributes, such as associated accident and danger level, properly defined. It can even automate any calculations required by DEF(AUST)5679, such as determining the danger level as modified by external mitigation according to Table 8.2 of DEF(AUST)5679 (see Figure 1). Properly implemented, such a collection of commands can ensure a very high degree of shallow compliance with the standard. They also provide useful guidance to the evaluator in determining deep compliance by directing attention to the critical compliance claims.

Once the compliance claims of the safety case are entered into the HiVe datastore, it becomes possible to make automated consistency and completeness checks. For example, by identifying an accident that has been declared but does not appear in any accident scenarios. A more sophisticated check is to ensure all accident scenarios with danger levels above D4 are properly justified in accordance with paragraph 8.9.6 of DEF(AUST)5679 (see Figure 1). The facilities described above have been integrated into a prototype HiVe plug-in for DEF(AUST)5679 safety case development. Currently, the plug-in focuses on the hazard analysis phase, but there are plans to extend it to a complete support tool for DEF(AUST)5679, possibly even including support for formal specification and verification.
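The following sketch suggests, in Python, the flavour of such automated consistency and completeness checks. It is not the HiVe's command language or datastore: the record layout, the function and the sample data (including the deliberately unused accident "A99") are invented for illustration.

```python
# Hypothetical records only: the HiVe's real commands, datastore and identifiers are not shown.
accidents = {"A5", "A8", "A10", "A99"}  # "A99" is invented to trigger the completeness check

scenarios = [
    {"id": "AS A 2", "accident": "A5",  "danger_level": "D4", "justified": False},
    {"id": "AS B 1", "accident": "A8",  "danger_level": "D5", "justified": False},
    {"id": "AS B 2", "accident": "A10", "danger_level": "D2", "justified": False},
]

def shallow_compliance_findings(accidents, scenarios):
    """Flag declared-but-unused accidents and unjustified danger levels above D4 (cf. 8.9.6)."""
    findings = []
    used = {s["accident"] for s in scenarios}
    for accident in sorted(accidents - used):
        findings.append(f"Accident {accident} is declared but appears in no accident scenario")
    for s in scenarios:
        # Lexicographic comparison is safe here because levels range over D1..D6 only.
        if s["danger_level"] > "D4" and not s["justified"]:
            findings.append(f"{s['id']} is assigned {s['danger_level']} (> D4) "
                            "without explicit justification")
    return findings

for finding in shallow_compliance_findings(accidents, scenarios):
    print(finding)
```

A check of this shape enforces shallow compliance mechanically, while directing an evaluator's attention to the claims that matter for deep compliance.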
4 THE PARTI SYSTEM

The Phased Array Radar Target Illumination (PARTI) system is a fictitious ship system that scans, detects, discriminates, selects and illuminates airborne threats. The illumination of targets provides a target fix for the existing Evolved Sea Sparrow Missile (ESSM) system. The PARTI system incorporates the phased array radar (PAR) and associated control functionality; a sub-system for detecting, discriminating and selecting airborne threats; an operator; and a laser designator and associated control functionality, for target illumination. The PARTI system and its environment are shown in Figure 2. The PARTI system was used as a case study for DEF(AUST)5679 (DSTO 2009b) and the detailed results presented in DEF(AUST)10679 (DSTO 2009a) (along with other material giving guidance on how to apply the standard). Here we are just interested in the hazard analysis for the PARTI system. The results of this analysis are summarized in Tables 1a–1c.

Id.      System Hazard
HAZ A    The PAR irradiates a prohibited area.
HAZ B    The laser illuminates a non-target object.
HAZ C    The PARTI sends an erroneous communication.
HAZ D    The laser fails to maintain target illumination.
HAZ E    The PARTI operates without command authorisation.

(a) System hazards

Id.    Accident                                                   Severity        Default Danger Level
A1     Missile or other ordnance causes damage to a non-target    Catastrophic    D6
A5     Helicopter crash due to RF interference                    Catastrophic    D6
A7     Laser causes damage to a non-target                        Catastrophic    D6
A8     Personnel injuries caused by Electromagnetic Radiation     Severe          D4
A9     Collision and Grounding of Ship                            Catastrophic    D6
A10    Laser causes eye damage to personnel                       Severe          D4
A11    Laser kills a person                                       Fatal           D5
A12    Personnel deaths caused by Electromagnetic Radiation       Catastrophic    D6

(b) Accidents

Accident Scenario    Hazard    Accident    Default DL    Mit. Strength    Assigned DL
AS A 2               HAZ A     A5          D6            High             D4
AS B 1               HAZ B     A8          D6            Medium           D5
AS B 2               HAZ B     A10         D4            High             D2

(c) Typical accident scenarios

Table 1: PARTI hazard analysis results

Recalling Section 2.1, Table 1a of system hazards and Table 1b of accidents are self-explanatory. For reasons of space, Table 1c only includes three representative accident scenarios – those used in Section 5. Similarly, we do not go into the details of the accident scenarios. For example, scenario AS A 2 involves (DSTO 2009a, PARTI-HAR): the PAR irradiat[ing] a prohibited area (hazard HAZ A) while a helicopter is close to the ship, causing the aircraft to malfunction, leading to helicopter crash due to RF interference (accident A5) with multiple fatalities (and so default danger level D6). The table records the calculated danger level accounting for mitigation by external coeffectors. In the example of AS A 2 the need for proximity and aircraft malfunction are two independent coeffectors; thus the danger level is lowered to D4. The interested reader can find full details of the PARTI case study in DEF(AUST)10679 (DSTO 2009a). One such detail is that among the scenarios not shown are those involving HAZ E and HAZ F, in which the PARTI emits arbitrarily lethal radar or laser radiation respectively. Such hazards can clearly lead to catastrophic accidents with little or no potential for external mitigation. Thus, as defined in Figure 1 (Clause 8.10.2), the system danger level is D6. Fortunately, it is fairly easy to design the PARTI (by limiting available power output) so as to eliminate these hazards.
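Since Table 8.2 amounts to reducing the default Danger Level by zero, one or two steps for Low, Medium and High External Mitigation respectively, the calculation can be sketched in a few lines of Python. The encoding below is ours, not part of the standard or the HiVe plug-in; it reproduces the assigned Danger Levels of Table 1c and the maximum rule of Clause 8.10.2.

# Sketch of the danger-level calculation of Table 8.2 (Figure 1); the table
# values are transcribed from the figure, the encoding is ours.

DEFAULT_DL = {"Catastrophic": 6, "Fatal": 5, "Severe": 4, "Minor": 3}
REDUCTION = {"None": 0, "Low": 0, "Medium": 1, "High": 2}  # steps below default

def danger_level(severity, mitigation="None"):
    """Return the Danger Level (e.g. 'D4') for an accident scenario."""
    return f"D{DEFAULT_DL[severity] - REDUCTION[mitigation]}"

def system_danger_level(scenario_levels):
    """Clause 8.10.2: the System Danger Level is the maximum over all scenarios."""
    return max(scenario_levels, key=lambda dl: int(dl[1:]))

# The three representative PARTI scenarios of Table 1c:
print(danger_level("Catastrophic", "High"))     # D4  (AS A 2)
print(danger_level("Catastrophic", "Medium"))   # D5  (AS B 1)
print(danger_level("Severe", "High"))           # D2  (AS B 2)
print(system_danger_level(["D4", "D5", "D2"]))  # D5 over these three scenarios alone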
129 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 Figure 3: The NDD for the PARTI system 5 THE PARTI HAZARD ANALYSIS IN THE HiVe In this section we demonstrate the HiVe Writer applied to the PARTI hazard analysis, making use of a prototype plug-in – here called the hazard analysis plug-in – to provide commands specific to hazard analysis. 5.1 Generic interface Figure 3 shows part of the HiVe Writer interface (Cant et al. 2008), as captured in a screenshot of a session using the hazard analysis plug-in. The top left hand window is the Project Navigator: this provides an easy mechanism for moving between the different documents in different open projects. Underneath the navigator is a formatting palette, which can be used to present information according to a given user-defined style — this is very useful for presenting the same information to different audiences. The main editor window shows part of the NDD for this project. The NDD provides a literate script that builds up the “model” on which the 130 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 Figure 4: The datastore after processing the first accident scenario safety case is built. The yellow background on a command indicates that it has been processed by the Writer. Direct interaction with the NDD is through a toolbar (top right of the screen) that allows the user to control the processing of commands in the script. 5.2 Hazard analysis support interface The content of the NDD in Figure 3 reflects the domain-specific nature of the hazard analysis plug-in. In particular, after declaring PARTI to be the name of the project as well as the system, we have the command “begin hazard analysis report for PARTI” which serves to set the context for further command invocations in the plug-in. The next command constructs a module that covers the operational context and system description (if these are not present — as DEF(AUST)5679 requires — the hazard analysis plug-in will complain after the NDD is processed, thus enforcing shallow compliance in this instance). The use of modules ensures that the definitions all lie within their own namespace in the project’s datastore. This is immediately used to good effect in the introduction of hazards and accidents, each defined in their own module. As yet unprocessed, in the next module introduced in Figure 3, are the definitions of the three accidents (along with their severities) from Table 1b. The snapshot shows just the first of these accident scenarios (AS A 2): it is introduced with some descriptive text, followed by two coeffectors. After we have processed down to the end of this block, we find that the datastore not only records this definition, but also computes automatically the default and final danger levels (in accordance with Table 8.2 of DEF(AUST)5679). This is shown in Figure 4. If we further process the next two accident scenarios and then try to end the hazard analysis, the HiVe will not permit this, because (according to DEF(AUST)5679 (Clause 8.9.6)), explicit justification is needed for any danger level assignments greater than D4 . Now we make use of the HiVe’s syntax directed-editing. Using a palette of commands we can enter the skeleton for the command giving explicit justification; we can then complete this using a second palette to enter the name of the second scenario. The NDD can now be completely processed (see Figure 5, which also shows the two palettes). 
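The "cannot end until justified" behaviour can be pictured with a small, purely illustrative sketch; the HiVe enforces this during NDD processing rather than through anything like the hypothetical class below.

# A rough sketch (our own construction, not the hazard analysis plug-in) of the
# workflow guard described above: the hazard analysis cannot be closed while any
# scenario with a danger level above D4 lacks an explicit justification.

class HazardAnalysisSession:
    def __init__(self, system):
        self.system = system
        self.scenarios = {}       # id -> assigned danger level, e.g. "D5"
        self.justifications = {}  # id -> justification text

    def add_scenario(self, scenario_id, assigned_dl):
        self.scenarios[scenario_id] = assigned_dl

    def justify(self, scenario_id, text):
        self.justifications[scenario_id] = text

    def end(self):
        pending = [sid for sid, dl in self.scenarios.items()
                   if int(dl[1:]) > 4 and sid not in self.justifications]
        if pending:
            raise ValueError(f"Cannot end hazard analysis: justify {pending}")
        return f"Hazard analysis for {self.system} complete"

session = HazardAnalysisSession("PARTI")
session.add_scenario("AS A 2", "D4")
session.add_scenario("AS B 1", "D5")
session.justify("AS B 1", "No stronger external mitigation is available ...")
print(session.end())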
6 CONCLUSION AND PROPOSED FURTHER WORK In this paper we have discussed the HiVe tool and how it can be used to address the problem of constructing convincing and trustworthy safety cases. We demonstrated the use of the HiVe on the hazard analysis phase of a safety case for a realistic case study. Work is now focused on extending the tool to cover the safety architecture and design assurance phases of the same example safety case. The architecture verification for the PARTI system has already been explored using a theorem prover (Mahony & Cant 2008); it will be instructive to capture this work 131 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 Figure 5: The processed NDD with two palettes within the HiVe tool as well. Constructing and managing safety cases present a huge challenge. It will be many years before there is general agreement on how a safety case should be structured; it will also take some time for tools to be available that are easy to use and achieve the desired properties that a safety case should have. Acknowledgments. The authors wish to thank the Defence Materiel Organisation for sponsorship and funding of The HiVe Writer prototype. References Adelard (2009), ‘The Adelard Safety Case Development (ASCAD) manual’. URL: http://www.adelard.com/web/hnav/resources/ascad/index.html Cant, T., Long, B., McCarthy, J., Mahony, B. & Williams, K. (2008), The HiVe writer, in ‘Systems Software Verification’, Elsevier. 132 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 DSTO (2009a), DEF(AUST)10679/Issue 1, Guidance Material For DEF(AUST)5679/Issue 2, Australian Government, Department of Defence. DSTO (2009b), DEF(AUST)5679/Issue 2: Safety Engineering for Defence Systems, Australian Government, Department of Defence. Emmet, L. & Cleland, G. (2002), Graphical notations, narratives and persuasion: a pliant systems approach to hypertext tool design, in ‘HYPERTEXT ’02: Proceedings of the thirteenth ACM conference on Hypertext and hypermedia’, ACM, New York, NY, USA, pp. 55–64. Kelly, T. P. & McDermid, J. A. (1999), A Systematic Approach to Safety Case Maintenance, in ‘SAFECOMP’, pp. 13–26. URL: citeseer.ist.psu.edu/kelly01systematic.html Mahony, B. & Cant, T. (2008), A Lightweight Approach to Formal Safety Architecture Assurance: The PARTI Case Study, in ‘SCS 2008: Proceedings of the Thirteenth Australian Conference on Safety-Related Programmable Systems’, Conferences in Research and Practice in IT., pp. 37–48. Marshall, C., Halasz, F., Rogers, R. & Janssen, W. (1991), Aquanet: a hypertext tool to hold your knowledge in place, in ‘HYPERTEXT ’91: Proceedings of the third annual ACM conference on Hypertext’, ACM, New York, NY, USA, pp. 261–275. Schuler, W. & Smith, J. B. (1992), Author’s Argumentation Assistant (AAA): a hypertext-based authoring tool for argumentative texts, in ‘Hypertext: concepts, systems and applications’, Cambridge University Press, New York, NY, USA, pp. 137–151. Wilson, S. P., McDermid, J. A., Pygott, C. H. & Tombs, D. J. (1996), Assessing complex computer based systems using the goal structuring notation, in ‘ICECCS ’96: Proceedings of the 2nd IEEE International Conference on Engineering of Complex Computer Systems (ICECCS ’96)’, IEEE Computer Society, Washington, DC, USA, p. 498. BIOGRAPHY Tony Cant currently leads the High Assurance Systems (HAS) Cell in DSTO’s Command, Control, Communications and Intelligence Division. 
His work focuses on the development of tools and techniques for providing assurance that critical systems will meet their requirements. Tony has also led the development of the newly published Defence Standard DEF(AUST)5679 Issue 2, entitled “Safety Engineering for Defence Systems”. Tony obtained a BSc(Hons) in 1974 and PhD in 1979 from the University of Adelaide, as well as a Grad Dip in Computer Science from the Australian National University (ANU) in 1991. He held research positions in mathematical physics at the University of St Andrews, Tel Aviv University, the University of Queensland and the ANU. He also worked in the Commonwealth Department of Industry, Technology and Commerce in science policy before joining DSTO in 1990. 133 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 This page intentionally left blank 134 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 THE APPLICATION OF HAZARD RISK ASSESSMENT IN DEFENCE SAFETY STANDARDS C.B.H. Edwards1 M. Westcott2 N. Fulton3 1 AMW Pty Ltd PO Box 468, Queanbeyan, NSW 2620 Email: [email protected] 2 CSIRO Mathematical and Information Sciences GPO Box 664, Canberra, ACT 2601 Email: [email protected] 3 CSIRO Mathematical and Information Sciences GPO Box 664, Canberra, ACT 2601 Email: [email protected] Abstract Hazard Risk Assessment (HRA) is a special case of Probabilistic Risk Assessment (PRA) and provides the theoretical basis for a number of safety standards. Measurement theory suggests that implicit in this basis are assumptions that require careful consideration if erroneous conclusions about system safety are to be avoided. These assumptions are discussed and an extension of the HRA process is proposed. The methodology of this extension is exemplified in recent work by Jarrett and Lin. Further development of safety standards and the possibility of achieving a harmonisation of the different approaches to assuring system safety are suggested. Keywords: Probabilistic Risk Assessment, Hazard Risk Assessment, Safety Standards, Safety Evaluation, Hazard Analysis 2 Introduction The use of Probabilistic Risk Assessment (PRA) is widespread in industry and government. Examples include environmental impact studies, food and drug management, border protection, bio-security and the insurance industry. In many organisations the use of PRA has become institutionalised, being applied in a prescriptive manner with little questioning about the assumptions implicit in the method. In the safety domain the risk of hazard realisation leading to an accident is of concern. This is known as Hazard Risk Assessment (HRA) and is a particular application of PRA. This paper examines the application of HRA in safety standards, which are used to guide the assessment of the safety of defence systems. In recent years there has been a growing body of literature expressing concern that the application of PRA can lead to false conclusions about the nature of perceived hazards. For example, Hessami (1999) provides an analysis of the limitations of PRA and (inter alia) notes: The Risk Matrices, once regarded the state-of-the-art in pseudo quantified assessment are essentially outmoded and inapt for today's complex systems and standards of best practice. 
They are a limited tool which cannot be universally applied in replacement for systematic assessment and it is not possible to compensate for their structural defects and enhance their credibility through customization of their numerical axes as advocated by the Standard (IEC). These also encourage an incremental as opposed to the holistic view of risks through arbitrary allocation of tolerability bands. In short risk matrices are best suited to the ranking of hazards with a view to prioritize the assessment effort. A systems framework is required to provide a suitable and sufficient environment for qualitative and quantitative assessment of risks within a holistic approach to safety.

A so called "precautionary principle" has evolved over a number of years and has been proposed as an alternative to the use of PRA. O'Brien (2000) provides a guide to the application of this principle. When describing the precautionary principle Wikipedia notes:

This is a moral and political principle which states that if an action or policy might cause severe or irreversible harm to the public or to the environment, in the absence of a scientific consensus that harm would not ensue, the burden of proof falls on those who would advocate taking the action, [Raffensperger C. & J. Tickner (1999)]. The principle implies that there is a responsibility to intervene and protect the public from exposure to harm where scientific investigation discovers a plausible risk in the course of having screened for other suspected causes. The protections that mitigate suspected risks can be relaxed only if further scientific findings emerge that more robustly support an alternative explanation. In some legal systems, as in the law of the European Union, the precautionary principle is also a general and compulsory principle of law, [Recuerda. (2006)].

Given the widespread use of HRA within the safety community it is important that the limitations of this approach to system safety be well understood and that some form of a precautionary principle is woven into the further development of safety standards.

3 Hazard Risk Assessment

3.1 HRA Process

HRA aims to identify the risk of hazards and to guide the application of resources to minimize assessed risk. This concept has been applied to a wide range of situations, ranging from relatively simple OH&S problems, such as office safety, to the acquisition of complex weapons systems. HRA attempts to quantify risk through the use of the Hazard Risk Index (HRI) measure. After the derivation of the HRI for a particular hazard, an assessment of the application of resources required to mitigate or remove the risk is made. Often this assessment is based on the As Low as Reasonably Practicable (ALARP) principle. Notably, the ALARP method allows for a statement of Residual Risk, i.e. the risk remaining after the completion of the safety process. Further discussion about ALARP can (inter alia) be found in Ale (2005).

3.2 Derivation of the HRI

The derivation of an HRI for a particular hazard, i.e. a hazard derived from the HRA process, is typically based on a tabulation of 'Likelihood' versus 'Consequence' as shown in Table 1. The acceptability of the HRI is then determined by a grouping of derived HRI. An example is shown in Table 2.

                             Consequence
Likelihood      Catastrophic    Critical    Major    Minor
Frequent        1               3           7        13
Probable        2               5           9        16
Occasional      4               6           11       18
Remote          8               10          14       19
Improbable      12              15          17       20

Table 1. Hazard Risk Index

HRI         Risk Level    Risk Acceptability
1 to 5      Extreme       Intolerable
6 to 9      High          Tolerable with continuous review
10 to 17    Medium        Tolerable with periodic review
18 to 20    Low           Acceptable with periodic review

Table 2. Acceptability of Risk
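A minimal sketch of the HRI derivation and its acceptability grouping, assuming the example values of Tables 1 and 2, is given below; as Section 3.3 argues, the resulting index is ordinal only and carries no quantitative meaning.

# Sketch of the Table 1 / Table 2 lookups above (example values only).

LIKELIHOODS = ["Frequent", "Probable", "Occasional", "Remote", "Improbable"]
CONSEQUENCES = ["Catastrophic", "Critical", "Major", "Minor"]

HRI_TABLE = {  # Table 1: likelihood -> HRI per consequence column
    "Frequent":   [1, 3, 7, 13],
    "Probable":   [2, 5, 9, 16],
    "Occasional": [4, 6, 11, 18],
    "Remote":     [8, 10, 14, 19],
    "Improbable": [12, 15, 17, 20],
}

def hri(likelihood, consequence):
    return HRI_TABLE[likelihood][CONSEQUENCES.index(consequence)]

def acceptability(hri_value):
    """Table 2 groupings of the HRI."""
    if hri_value <= 5:
        return "Extreme: Intolerable"
    if hri_value <= 9:
        return "High: Tolerable with continuous review"
    if hri_value <= 17:
        return "Medium: Tolerable with periodic review"
    return "Low: Acceptable with periodic review"

print(hri("Occasional", "Catastrophic"), acceptability(4))  # 4 Extreme: Intolerable
print(hri("Frequent", "Critical"), acceptability(3))        # 3 Extreme: Intolerable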
3.3 HRI Measurement Difficulties

The Likelihood and Consequence scales in Table 1 are ordinal measures. Thus within a particular row or column of Table 1 entries are ranked and comparison of those rankings is valid. For example, it is reasonable to assert that within the Critical Consequence column an Occasional likelihood is a worse outcome than a Remote likelihood. Comparisons of rankings from different rows or different columns are more problematic. To assert that Occasional but Catastrophic (HRI=4) is equivalent to Frequent and Critical (HRI=3) is difficult to justify, particularly in the absence of a quantified hazard consequence that is consistent with the hazard context. Thus a grouping of HRI, for example as shown in Table 2, is difficult to justify. Groupings may have some meaning if the context of the Likelihood and Consequence assessments is known and done on a case-by-case basis, i.e. the system or issue under evaluation together with the operational environment are well understood. Importantly, because a general a priori statement of HRI groupings has no theoretical basis, groupings of HRI used to ascribe a level of risk acceptability must be done on a case-by-case basis prior to the safety analysis, in a manner that takes into account the system context. Stevens (1946) provides a useful discussion on the theory of measurement scales, while Ford (1993) discusses the application of measurement theory to software engineering.

The HRI-based approach to safety has intuitive appeal to program managers and continues to be widely used. This practice follows from the fact that the ALARP concept appears to simplify the problem of resource allocation and that the concept of residual risk leads to qualitative statements of remaining safety actions, such as additional procedures, which once articulated provide a well-defined end to a safety program. It is of interest to note that the 'burden of proof' or 'required due diligence' for estimating the residual risk in the absence of hazard mitigation appears to be the same regardless of how high the inherent risk.

4 Assumptions and Limitations of HRA

4.1 The Importance of Context

One problem with a general application of the HRI, as shown in the example Tables 1 and 2, lies in the fact that it is not always possible to apply appropriate context scaling to the Likelihood and Consequence groups. The likelihood of various outcomes will be dependent on the context of the problem being studied, as will the severity of the realisation of a particular hazard. For example, the distribution of acceptability of HRI for faults in a Full Authority Digital Engine Control System (FADECS) is likely to be very different from an examination of the risks of an experimental drug treatment for patients with advanced forms of cancer. In the former case it is likely that there would be little tolerance of a fault in the FADECS, while in the latter patients might be willing to risk death if there was even a small chance of a cure. Prasad and McDermid (1999) discuss the importance of the context of a system when attempting to identify emergent properties such as dependability.
The importance of context in trying to assess the safety of complex systems is well illustrated by Hodge and Walpole (1999) where they adapted Boulding’s (1956) hierarchy of systems complexity to the Defence planning. The General Hierarchy of Systems was summarised and illustrated as seen in Figure 1 below. Figure 1. General Hierarchy of Systems The application of this concept to the appropriate use of safety standards follows from the fact that standards developed in an OH&S context are typically aimed at application at the Social level, while standards aimed at providing assurance that a complex system is safe are applied at the Control level. An attempt to apply an OH&S based standard to a 137 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 complex computer-based system is likely to produce erroneous estimation of the system’s safety. This topic is discussed later. There are a number of assumptions, limitations and requirements implicit in a valid HRA process. These include: a. The requirement for semantic consistency between the context description and the outcomes of a hazard realisation i.e. the Consequence description; b. the need for an exact quantitative description of the Risk Likelihood used, i.e. the likelihood function and the boundaries of the associated class groups; c. a consistent mathematical description of the Consequence of hazard outcomes; and d. the requirement for an ‘a priori’ mathematical model of the acceptability of risk that is consistent with the context of the analysis. Apart from the requirement for semantic consistency, Cox (2008a) has addressed these issues in a paper on the fundamentals of risk matrices. He concludes that PRA often has limited benefit, noting: The theoretical results in this article demonstrate that, in general, quantitative and semi quantitative risk matrices have limited ability to correctly reproduce the risk ratings implied by quantitative models, especially if the two components of risk (e.g., frequency and severity) are negatively correlated. Moreover, effective risk management decisions cannot in general be based on mapping ordered categorical ratings of frequency and severity into recommended risk management decisions or priorities, as optimal resource allocation may depend crucially on other quantitative information, such as the costs of different countermeasures, the risk reductions that they achieve, budget constraints, and possible interactions among risks or countermeasures (such as when fixing a leak protects against multiple subsequent adverse events). Cox (see also Cox (2008b)) makes many other important points, including that probabilities are not an appropriate measure for assessing the actions of intelligent adversaries. He also suggests three axioms that a risk matrix should satisfy and shows that many matrices used in practice do not meet them. A further observation is that using a large number of risk levels (or colours) in a matrix can give a spurious impression of the matrix’s ability to correctly reproduce model risk ratings. For a 5 x 5 matrix, his axioms imply the matrix should have exactly three levels of risk (the axioms require at least three levels, but Cox also recommends keeping the number of levels to a minimum). However, there are some cases where PRA might be usefully employed. 
As Cox (2008a) notes: If data are sufficiently plentiful, then statistical and artificial intelligence tools such as classification trees (Chen et al., 2006), rough sets (Dreiseitl et al., 1999), and vector quantization (Lloyd et al., 2007) can potentially be applied to help design risk matrices that give efficient or optimal (according to various criteria) discrete approximations to the quantitative distribution of risks. Other variations of conventional HRA aimed at better relating the likelihood and consequence pairs have been proposed. Swallom (2005) provides an example, while Jarrett and Lin (2008) suggest a practical process to consistently quantify likelihood and consequence, leading to a quantified HRI. The latter approach is strongly data dependent and appears to provide a thoughtful and defensible use of HRA. 4.2 Semantic Consistency The description of the system context and the outcomes from the realisation of a hazard can be a fraught process involving imprecise descriptions and relationships. Overcoming this problem for complex systems will often require considerable effort with the process being aided by formal analysis of the semantics involved in the description of the system design and resulting hazards. One method for achieving internal consistency of the description of system context is the application of set theoretic modelling. Wildman (2002) provides an example of this process. 5 Application of HRA in Safety Standards 5.1 Risk Based Standards Over the last two decades there has been a divergence in the theoretical basis of system safety standards. In essence there are two lines of thought. The conventional approach involves a process that attempts to apply HRA to identify and classify system hazards according to some sort of acceptability criteria. The alternative approach is a qualitative one, driven by system safety requirements, in which each accident scenario is assigned a danger level, and each component safety requirement is assigned an assurance level that dictates the level of rigour required to develop and analyse system components. The alternative approach is discussed later. Safety standards such as the UK DEF STD 00-56 and the ubiquitous US MIL-STD-882 (2000) are examples of the conventional approach. These are based on the ALARP approach to safety. This approach accepts the possibility of Residual Risk and attempts to quantify assurance through the use of the HRI metric. Note: The US MIL-STD-882D does not specifically call out ALARP but is instead based on a reasonability test similar to the ALARP approach. Locally, the RAN Standard ABR 6303 (2006) is another example of the application of ALARP. This standard has been widely promulgated and has been used to assess the safety of a number of complex systems. 138 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 While it may be possible to apply HRA-based safety standards to situations where the system is well understood, there are systems where it is possible to argue that, a priori, the application of HRA to provide an estimation of system assurance is inappropriate. For example, it may be appropriate to apply an HRA-based safety standard to a simple mechanical system where there are sufficient data to draw some inferences about the statistical properties of (Probability, Consequence) pairs and where the system context is well understood. 
However, the application of HRA to a computer intensive system requires further consideration and alternative strategies need consideration. In the case where the use of an HRA-based standard has been mandated it will be important to ensure that critical software components of the system are identified and treated appropriately. 6 Proposed Extension to HRA 6.1 Interpretation of Risk We assume it is possible to quantify the (Likelihood, Consequence) pairs so that the definition Risk = Likelihood x Consequence makes sense (Royal Society Study Group, 1992, Sections 1.3.2, 1.3.4). The meaning of the value of Consequence needs careful thought. In general this will be a random variable; that is, different realizations of a hazard will produce different consequences. In an HRA, which is prospective though perhaps informed by data, the values for Consequence could be decided by an individual or as part of a collective evaluation process. In the former case, the value could incorporate the risk perceptions of the individual. In the latter case, the value is likely to be closer to an average or expected Consequence. If so, there is a useful interpretation of risk as an expected cost rate; see h. below and Section 6.3.4. This conclusion also emphasises the importance of an inclusive and multidisciplinary process when evaluating Consequence. 6.2 Concept Application As noted previously, Jarrett (2008) appears to offer a practical method of developing an estimate of system safety assurance. The approach attempts to overcome some of the known deficiencies in the construction and use of risk matrices. It is based on work by Anderson (2006) and Jarrett and Lin (2008); the latter work is summarized in Jarrett (2008). Their mathematical basis for quantifying the margins of the matrix is very similar, though Jarrett and Lin embed this in a wider process for deriving a risk matrix. The stated intentions of this process are to “create greater transparency” and “develop a more quantitative approach”. The specific context for the work in Jarrett and Lin (2008) is assessment of maritime threats to Australia. The main steps of this process are as follows. a. Define the relevant hazard or threat categories. b. For each threat category, assess the consequences and the likelihood of the hazard. The consequences are also classified into categories. c. For each consequence category, the possible severities are listed, described and ranked. It is important that the severities with the same rank line up across the categories, so that they will be generally agreed to be comparable. A guide to severities is that, where possible, the steps should correspond to roughly 10-fold changes in “cost” (which might be dollars but could be fatalities, injuries, land areas affected, etc). d. The hazard is then assigned a score (rank) in each category. e. The overall consequence score for the hazard is calculated by combining the category scores in a particular way (see below). f. The likelihood of the hazard is assessed by its expected frequency, defined as the expected number of occurrences per annum. The score assigned to the likelihood also has steps that correspond to 10-fold changes in the frequency. Verbal descriptions of the scores can be given but are really only indicative; the number is the crucial element here. g. The risk score for the hazard is then given by Risk score = Consequence score + Likelihood score h. This score has a rough interpretation as a log10(expected annual cost) i. 
The possible scores for a hazard can be assembled into a matrix or table. If desired, each cell of the table can be assigned a measure of risk based on the risk score for the cell. This would look like a traditional risk matrix, but the scores have a definite quantitative interpretation that is transparent and can be validated against data.

6.2.1 Combination of category scores (Jarrett and Lin)

Suppose there are c categories and the associated category scores are s1, ..., sc. Then the combined (consequence) score is

S = log10(10^s1 + 10^s2 + ... + 10^sc)

This is similar to taking the maximum score, but it gives added weight to multiple occurrences of the maximum. For example, consider two cases, with c = 7.

I. There is one score of 5 and six scores of 1.
II. There are seven scores of 5.

In each case, the maximum score is 5, so if S were taken to be the maximum score both these cases would get the same score. However, with the proposed system:

SI = log10(100,000 + 60) = 5.0003; and
SII = log10(700,000) = 5.85

Thus case II has a substantially higher risk score, which seems appropriate since it has a high score in every category and so presumably is judged to have a more severe overall consequence. Variants of this basic scheme are clearly possible, and might be desirable in a particular case.
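As a check on the combination rule above, the following short calculation (ours, not part of the cited work) reproduces the two worked cases with c = 7.

# Reproduces the two worked cases of Section 6.2.1.
import math

def combined_consequence_score(scores):
    """S = log10(10**s1 + 10**s2 + ... + 10**sc) (Jarrett and Lin)."""
    return math.log10(sum(10 ** s for s in scores))

case_I = [5] + [1] * 6   # one score of 5, six scores of 1
case_II = [5] * 7        # seven scores of 5

print(round(combined_consequence_score(case_I), 4))   # 5.0003 (log10 of 100,060)
print(round(combined_consequence_score(case_II), 2))  # 5.85   (log10 of 700,000)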
6.2.2 Interpretation of cell entries

In h. above, the risk score is given the interpretation of a log(expected annual cost). The model behind this is as follows. Suppose hazardous events occur randomly in time at a rate λ. Each occurrence of an event has an associated cost which is a random variable from a distribution with mean µ. Provided the costs are independent of the occurrence process, the expected total cost per unit time is the product λµ. Taking logs gives the relation in g. above, and explains h.

6.3 Example

This example is taken largely from Jarrett (2008). It concerns maritime threats.

6.3.1 Threat categories

These were taken as:
• Maritime Terrorism
• Illegal Activity in a Protected Area
• Protected Area Breach
• Piracy
• Unauthorised Maritime Arrivals
• Illegal Exploitation of Natural Resources
• Marine Pollution and Biosecurity

6.3.2 Consequence categories and severity levels

The categories are given in the top row of the table below (from Jarrett (2008)), together with descriptions of the severities at ranks 4 and 5. (Note: The table is a fragment from the complete table provided in the CSIRO report).

Consequence category    Death, injury or illness                               Economic        Environmental                                                 Symbolic
5: Catastrophic         Mass fatalities, remains collection compromised        $5 billion+     Irreversible loss of a conservation value of a bioregion      Destruction of nationally important symbol
4: Major                Multiple fatalities, remains collection compromised    $1-5 billion    Damage to a conservation value where recovery > ten years     Serious damage to a nationally important symbol

One might dispute the equivalence of outcomes within a particular Consequence Category but it is clear that a high degree of discussion and consultation was involved in production of this table. This discussion and consultation are an essential part of the proposed risk assessment process. It is worth noting that at this point the above table of Consequence and associated Severities is similar to the approach provided by ABR 6303. The extension of the methodology proposed here would thus appear to be a natural extension of the ABR 6303 process. However, there are limitations to the application of this approach, particularly when dealing with computer intensive systems. These limitations are discussed later.

6.3.3 Likelihood

This is summarised in the following table (from Jarrett (2008)).

Likelihood        Description                                        Indicative Rate Australia-Wide    Likelihood Score
Rare              Aware of an event like this occurring elsewhere.   Prob 0.01 of event each year.     1
Unlikely          The event will occur from time to time.            Prob 0.1 of event each year.      2
Possible          The event will occur every few years.              One every three years.            2.5
Likely            The event will occur on an annual basis.           One every year.                   3
Very Likely       The event will occur two or three times a year.    Two to three events a year.       3.5
Almost Certain    The event will occur on about a monthly basis.     Ten events or more a year.        4

The following should be noted:
a. The 10-fold increase in frequency with each unit increase in the Likelihood Score. In this case, the authors have refined the scoring system to include some changes of 0.5; these are associated with a 3-fold change in frequency. This is broadly consistent, since log10 3 = 0.477 ≈ 0.5.
b. The decision to equate score 1 with a frequency of 1 event per 100 years. This is entirely arbitrary. We shall see shortly that it might be better to increase all likelihood scores by 1 in this instance.
c. The verbal descriptions in the first column are evocative but have no direct influence on the results. Effectively, they are defined by the frequencies. This is in contrast to other uses of risk matrices, where terms on the frequency/likelihood axis appear to be undefined (e.g. Fig. 9 in FWHA (2006)).
d. The caveats in Cox (2008b) about use of probabilities when the hazard results from the actions of intelligent adversaries should be kept in mind.

6.3.4 Risk score

This is defined by the sum of the consequence and likelihood scores. The interpretation mentioned, that of the log of the expected annual cost, can be seen as follows. A likelihood score of 3 corresponds to an expected frequency of one event per year. If all the category scores are 4, say, then the consequence score is about 4.85 (log10(7 x 10^4)), leading to a risk score of 7.85. Looking at the table above, severity level 4 is associated with a $ cost of order $1 billion. So the expected annual cost would also be of order $1 billion and its log would be 9. So the risk score represents roughly 1/10 of the annual expected cost. This is why increasing all the likelihood scores by 1 might be sensible in this case; it would give a closer match between risk score and log expected annual cost. The Risk matrix produced from this example is shown below.

                                                   Overall Consequence Score
Likelihood (Score)     Insignificant    Minor           Moderate        Major               Catastrophic
                       (1 to 1.85)      (2 to 2.85)     (3 to 3.85)     (4 to 4.85)         (5 to 5.85)
Rare (1)               Negligible       Low             Low             Moderate            Moderate
Unlikely (2)           Low              Low             Moderate        Moderate            High
Possible (2.5)         Low              Moderate        Moderate        Moderate to High    High
Likely (3)             Low              Moderate        Moderate        High                High to Extreme
Very Likely (3.5)      Moderate         Moderate        Moderate        High                Extreme
Almost Certain (4)     Moderate         Moderate        High            High to Extreme     Extreme

Here the choices for the cell entries will again be the outcome of an extensive discussion and consultation process.
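The scoring pipeline of Sections 6.2 and 6.3 can be reproduced in a few lines; the sketch below is ours, uses the worked figures from Section 6.3.4, and makes no claim about the CSIRO implementation.

# Risk score = consequence score + likelihood score, read roughly as the
# log10 of the expected annual cost (Section 6.3.4).
import math

def consequence_score(category_scores):
    return math.log10(sum(10 ** s for s in category_scores))

def risk_score(category_scores, likelihood_score):
    return consequence_score(category_scores) + likelihood_score

# Worked example from the text: all seven category scores equal 4 and a
# likelihood score of 3 (one event per year).
scores = [4] * 7
print(round(consequence_score(scores), 2))  # 4.85
print(round(risk_score(scores, 3), 2))      # 7.85
# Severity level 4 corresponds to a cost of order $1 billion (log10 = 9), so
# increasing all likelihood scores by 1, as suggested in note b, would bring
# the risk score closer to the log of the expected annual cost.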
If actions are associated with each risk level, these must also be carefully thought through and calibrated to be consistent, and appropriate for the perceived level. We note that this matrix might not accord with the recommendation in Cox (2008a) for a minimal number of levels (colours), though it does satisfy his three axioms. Because the isorisk contours on a log scale are straight lines, the banding in this matrix is more natural and defensible than in many other applications. 7 Discussion 7.1 Hazard Severity Based Standards An alternative to HRA is a qualitative approach based on a perceived severity of identified system hazards, which in turn dictates the level of rigour required to analyse a hazard. Issue 2 of DEF(AUST)5679 (2008) provides an example of this approach, where the standard asserts that the necessary system assurance will be obtained because the hazard has been ‘designed out’. An important demand made by Issue 2 of DEF(AUST)5679 is a tight coupling between the safety program and other aspects of system development. Thus safety requirements are determined as part of the general system requirements development process and the satisfaction of those requirements are incorporated into the system design and implementation phases. Importantly, and in contrast to HRA, DEF(AUST)5679 increases the ‘burden of proof’ as the inherent danger of a hazard increases, the notation used in the standard being Hazard Danger Levels. Application of Issue 2 of DEF(AUST)5679 to existing systems can present problems if the provenance of the system safety argument is uncertain or non-existent. In these circumstances the application of HRA in the manner suggested by Jarrett and Lin (2008) below appears to offer a practical method of developing an estimate of system safety assurance. Noting that the treatment of Non Development Items (NDIs) in DEF(AUST)5679 allows for the use of informal methods, it appears that a theoretically defensible approach to HRA could be incorporated into the standard when assessing NDIs. The SVRC Report (1999) provides further useful comparative information on the two different approaches, albeit on earlier versions of the standards. 7.2 Application of HRA in Safety Standards It is clear that the current use of HRA-based safety standards when assessing the safety of complex defence systems is fraught with difficulties. Not only is the assessment of likelihoods largely qualitative and not based on supporting data, but the associated consequences are unlikely to represent a global assessment of possible accidents. Both ABR 6303 and MIL-STD-882 tend to produce assurance assessments that could be readily challenged in the courts. They are both essentially ‘low assurance’ standards. The context of application of these two standards is important. MIL-STD-882 is a mature standard having evolved through its application to military systems in the USA. The results of such application are normally evaluated by a well 142 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 resourced government organisation such as the USN Weapons Systems Explosive and Safety Review Board (WSESRB), which can provide skills that allow the results of a MIL-STD-882 process to be examined carefully. Notably, the WSESRB has the executive authority to enforce the results of its findings. 
Thus, while the theoretical basis of the standard remains inadequate, the skill base of US organisations, such as the WSESRB, are able, to some extent, to compensate for the limitations of the standard. By way of comparison, safety organisations within the Australian Defence organisation do not possess the same degree of executive independence as that enjoyed by their American counterparts. The final decision on system deployment is not subjected to mandatory approval by the safety organisation, but rather is made at the ‘Social Level’ (of Figure 1) by non-safety personnel who take into account the recommendations of a safety assessment. For example, a system assessed as HRI 9 could be accepted into service with the residual risk being mitigated by procedure rather than by changes to the system recommended by a safety analyst. The result of the Australian Defence process is effectively a filtering process through different levels of the defence bureaucracy, which can result in diluted assessment of the safety argument. 7.3 A Safety Paradigm Shift During the latter phase of the long evolutionary development of MIL-STD-882 there has been a rapid development of programmable technology. Noting the acknowledged limitations of HRA in dealing with this technology, the logical conclusion is that a new paradigm for addressing computer intensive systems is required. A focus on safety requirements, and subsequently identified hazard severities, exemplified by Issue 2 of DEF(AUST)5679, appears to provide the basis for such a paradigm shift. Not only does this standard incorporate a precautionary principle into the assessment of system safety, it also provides for a more rigorous and defensible safety argument. The use of DEF(AUST)5679 as the default safety standard would not initially impose a markedly different process from the use of MIL-STD-882. Both standards require a hazard analysis to be conducted in the first instance, with the results of that analysis determining the nature of any subsequent safety effort. 7.4 Characteristics of Computer Intensive Systems Computer intensive systems are now widespread in both defence and civilian applications. As paragraph 1.4.2 Issue 2 of DEF(AUST)5679 notes: The implementation of system functions by SOFTWARE (or DIGITAL HARDWARE) represents some unique risks to safety. Firstly, the flexibility of programming languages and the power of computing elements such as current microprocessors means that a high level of complexity is easily introduced, thus making it harder to predict the behaviour of equipment under SOFTWARE control. Secondly, SOFTWARE appears superficially easy and cheap to modify. Thirdly, the interaction of other elements of the system with the SOFTWARE is often poorly or incompletely understood. The idea of ‘safety critical software’ is one fraught with conceptual difficulties. Software per se is simply a set of logical constructs which have (hopefully) been built according to a design requirement. As such, software is neither ‘safe’ nor ‘unsafe’, but rather may contain constructs that when executed in a particular environment (both platform and external environment) may produce unexpected results. Thus the context in which the software executes is just as important as the code itself when it comes to assessing system assurance. 
DEF(AUST)5679 has a particular focus on ensuring conformance of software with the design requirements, but as noted previously, can produce difficult management issues when applied to NDI or Military Off The Shelf (MOTS) products. Many of these products are either in military service or have a history of commercial application, and in these circumstances it is likely that system reliability data would be available. For example, an Inertial Measurement Unit (IMU) is a complex device that is commercially available and has application in both military and civilian systems. Thus a safety case for a system containing an IMU might be able to take advantage of reliability data in the way described above, where the IMU is treated as a black box within a wider system context. While such an approach would seem to be consistent with the treatment of NDI products in DEF(AUST)5679, it would not reduce the requirement for rigour in the analysis of the surrounding system and associated IMU system boundary. Rather the use of reliability data would augment the safety argument. It is clear that it is quite inappropriate to use the RAN Standard ABR6303 as a guide to assessing the assurance of complex computer-based systems. Not only is the standard aimed at assessing the risk of OH&S hazards, it is not data dependent and is qualitative in assessing likelihood risks. However, it is suggested that with the incorporation of the methodology discussed above the applicable scope of standard could be widened. In particular it would allow meaningful application to a larger class of physical systems. 7.5 Cultural Issues Anecdotal evidence suggests that many program managers and system engineers regard a safety program as a necessary evil, providing program uncertainty with little visible benefit. Such attitudes are inconsistent with system development experience, but more support from senior management is required if an attitudinal change is to be achieved. Such support should come from the reality that a well integrated safety program not only improves the system engineering process, but the final quality of the product. 143 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 Additionally, it behoves program managers to provide for the most complete and defensible safety program available with current technology, because in the event of a serious accident, it is inevitable that today’s safety efforts will be judged in the light of tomorrow’s safety standards. Because software based systems are logically complex it follows that it is not possible to assure system safety through testing alone. Conversely, testing combined with appropriate analysis offers the possibility of reducing the scope of a safety program. This follows from the fact that such a combination can limit the requirement for an otherwise almost infinite test regime, through an initial investment in the analysis of the safety critical properties of the system. In dealing with the issue of the scope of a testing regime it should be borne in mind that testing and analysis go hand in hand in supporting the safety argument. Different safety cultures interpret this truism differently, a fact which is reflected in the oft repeated anecdotal quotation: “In the USA they test, test and test and then analyse the results of the testing, while in Europe they analayse, analyse and analyse and then test the results of the analysis”. 
While it might seem trite to labour this point, it is apparent that this is a cultural difference that is reflected in the divergence in the theoretical basis for safety standards. 7.6 Harmonisation of Safety Standards Noting the limitations of safety assurance derived from HRA and the existence of safety standards based on both HRA and an assessment of hazard severity, there would seem to be a need to harmonise the two different approaches when assessing complex computer intensive systems. Such harmonisation should identify the validity of a particular approach to providing assurance of system safety. More particularly it is imperative that cases of inappropriate application of HRA are identified. The main harmonisation issue flows from the fact that DEF(AUST)5679 does not mandate the determination of a residual risk which, in the case of a HRA, is often made on the basis of unsupported qualitative assessments. Essentially, the real difference between the HRA and assurance based techniques lies in the mandated determination of safety risk. MIL-STD-882 does not provide adequate guidance on the design and implementation of computer intensive systems. As a result the standard is necessarily weak in requirements for assessing the assurance of the software product. The standard tries to address the issue through a concept of ‘software control categories’. This approach does little to improve the situation as system complexity often denies the accurate enumeration of these categories at an appropriate level of abstraction. So, while at a macro level, i.e., at a high level of abstraction, such categorisation is possible, identification of the actual code module responsible for the control function may not be easy. Interestingly, Swallom (2005) notes that: ….. the F/A-22 matrix adds a “designed out” column for hazards where risk has been reduced to zero. This acknowledgement suggests that, in the continuing attempts to extend the application of HRA, there has been a development of a tacit recognition that the hazard severity approach of mitigating hazards through careful design has some merit. In comparison to MIL-STD-882, DEF(AUST)5679 provides strong guidance on the design and implementation of computer intensive systems. While this concept works well for a true development process, the standards approach when dealing with the integration and acceptance of NDI has the potential to present project management with difficult financial decisions. A safety case developed under DEF(AUST)5679 will almost certainly provide enough context information to allow an informed groupings of risk and consequence to be made. Thus if demanded by a regulatory authority, it seems intuitive that the approach outlined by Jarrett and Lin (2008) could provide a translation from the DEF(AUST)5679 approach to a risk based approach. The point here is that while the process of moving from a severity based approach to a risk based approach appears possible, the reverse is likely to be much more difficult. 7.7 Further Development of DEF(AUST)5679 As noted earlier Issue 2 of DEF(AUST)5679 has the propensity to provide program managers with difficult problems if it is used to provide assurance for NDI based systems. Given the widespread use of software based NDI within defence it is clear that further development in this area would increase the appeal of the standard to program mangers. 
7.8 The Role of the Technical Regulatory Authority in Defence As noted earlier there are a number of Technical Regulatory Authorities (TRAs) embedded within the fabric of the Australian Department of Defence. The roles of these authorities vary in description, emphasis and basic function, but all claim ‘safety’ as part of their raison d’être for existence. So for example, the Defence Safety Management Agency will claim seniority in matters of Occupation Health and Safety (OH&S), whereas the Director General Technical Airworthiness (DGTA) claims ownership of air and ground systems safety within the Royal Australian Air Force (RAAF). Within the RAN there are a number of separate but interacting organisations involved in the assessment of system safety. The TRAs are supported by administrative proclamations issued by various levels within the Defence hierarchy. 144 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 The use of a specific safety standard is generally determined by the interested TRA. For example, DGTA almost invariably requires software be developed in accordance with the development standard RTA/DO-178B and argues that the development process ensures that the product is safe and suitable for service. An interesting defence of this process has been provided by Reinhardt (2008). The use of a particular standard to assure system safety must be endorsed by the TRA responsible for certifying the safety of the system, where the system function can range from the development of safety critical avionics software, through the development and deployment of explosive ordnance, to the methodology of reporting and investigating OH&S issues. Importantly, any change to the application of new safety standards within Defence requires the approval of the various TRAs. Thus any move away from HRA based safety standards within DoD would require TRA support. In the case that a TRA demands an HRA based safety assessment, the provision of a method of mapping from the hazard severity based approach of DEF(AUST)5679 to a HRI description of the outcome of the safety process would assist in the acceptance of outcomes of the safety analysis. 7.9 The Need for Further Research Given the pervasive use of HRA within the defence community there is an urgent need for research to further develop the theoretical basis for the application of HRI when assessing system safety. In this regard Cox (2008a) notes: In summary, the results and examples in this article suggest a need for caution in using risk matrices. Risk matrices do not necessarily support good (e.g., better-than-random) risk management decision and effective allocations of limited management attention and resources. Yet, the use of risk matrices is too widespread (and convenient) to make cessation of use an attractive option. Therefore, research is urgently needed to better characterize conditions under which they are most likely to be helpful or harmful in risk management decision making (e.g., when frequencies and severities are positively or negatively correlated, respectively) and that develops methods for designing them to maximize potential decision benefits and limit potential harm from using them. A potentially promising research direction may be to focus on placing the grid lines in a risk matrix to minimize the maximum loss from misclassified risks. In particular, there is a need to better understand the relationship between, and possible integration of, the competing safety methodologies, i.e. 
risk based, and those based on the concept of accident severity. Thus, in order to provide improved interoperability between safety standards it is suggested that research into the relationship and applicability between severity and risk based assessments of system safety be supported. Such research is not profitably done in isolation, but rather needs to be done in the context of assessing the assurance of real systems. This requires support from more than the primary research organisation. 8 Conclusions Hazard Risk Assessment provides an inadequate theoretical platform for assessing the safety of complex systems. Safety standards based on this approach can only be regarded as low assurance standards, not in tune with modern safety thinking. The extension to HRA proposed in this paper has the potential to extend the scope of the process to include many physical systems. However, this requires a concomitant increased emphasis on the collection and analysis of quantitative reliability data. This in turn demands the application of statistically sound data collection and analysis methodologies, an approach not commonly found in today’s safety community. Assessments of complex computer intensive systems continue to pose a particular problem for the safety analyst. Strict conformance to design requirements and careful design of test regimes can assist the task, but system complexity can make this approach expensive and time consuming, particularly if the safety requirements have not been identified or adequately analysed. 9 Acknowledgements The authors thank the referees for their constructive comments. 10 References Ale, B. J. M. (2005): Tolerable or Acceptable: A Comparison of Risk Regulation in the United Kingdom and in the Netherlands. Risk Analysis 25(2), 231-241, 2005. Anderson, K. (2006): A synthesis of risk matrices. Australian Safety Critical Systems Association Newsletter, 8-11, December 2006. ABR 6303 (2006): Australian Book of Reference 6303, NAVSAFE Manual, Navy Safety Management, Issue 4 Boulding, K.E. (1956): General Systems Theory – The Skeleton of Science, Management Science, 2(3), April 1956. Chen, J. J., Tsai, C. A., Moon, H., Ahn, H., Young, J. J., & Chen, C.H. (2006). Decision threshold adjustment in class prediction. SAR QSAR Environmental Research, 17(3), 337–352. Cox L.A. (2008a): What’s Wrong with Risk Matrices?, Risk Analysis, Risk Analysis 28, 497-512, 2008 Cox L.A. (2008b): Some Limitations of “Risk = Threat x Vulnerability x Consequence” for Risk Analysis of Terrorist Attacks. Risk Analysis, 28(6) 1749-1761, 2008 145 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 DEF(AUST)5679 (2008): Commonwealth of Australia Australian Defence Standard, Safety Engineering for Defence Systems, Issue 2 Dreiseitl, S., Ohno-Machado, L., & Vinterbo, S. (1999). Evaluating variable selection methods for diagnosis of myocardial infarction. Proc AMIA Symposium, 246–250. FWHA (2006): Risk Assessment and Allocation for Highway Construction Management, at http://international.fhwa.dot.gov/riskassess Ford G. (1993): Lecture Notes on Engineering Measurement for Software Engineers, CMU/SEI-93-EM-9, Carnegie Mellon University Hessami A.G. (1999): Risk Management: A Systems Paradigm, Systems Engineering, 2(3), 156-167. Hodge, R. and Walpole, J. (1999): A Systems Approach to Defence Planning – A work in Progress, Systems Engineering, Test & Evaluation Conference, Adelaide. 20-22 October 1999 Jarrett, R. (2008. 
Developing a quantitative and verifiable approach to risk assessment, CSIRO Presentation on Risk, August 2008 Jarrett, R. & Lin, X. (2008): Personal Communication. Lloyd, G. R., Brereton, R. G., Faria, R., & Duncan, J. C. (2007): Learning vector quantization for multiclass classification: Application to characterization of plastics. Journal of Chemical Information and Modeling, 47(4), 1553–1563. MIL-STD-882D, (2000): Department of Defense, Standard Practice for System Safety O'Brien, M. H. (2000): Beyond Democratization Of Risk Assessment: An Alternative To Risk Assessment Prasad, D. & McDermid, J. (1999): Dependability Evaluation using a Multi-Criteria Decision Analysis Procedure, dcca, p. 339, Dependable Computing for Critical Applications (DCCA '99). Raffensperger, C. & Tickner, J (eds.) (1999): Protecting Public Health and the Environment: Implementing the Precautionary Principle. Island Press, Washington, DC Recuerda, M. A. (2006): Risk and Reason in the European Union Law, 5 European Food and Feed Law Review Reinhardt, D. (2008): Considerations in the Preference for and Application of RTCA/DO-178B in the Australian Military Avionics Context,13th Australian Workshop on Safety Related Programmable Systems (SCS’08), Canberra, Conferences in Research and Practice in Information Technology, 100. Royal Society Study Group (1992: Risk: Analysis, Perception and Management. Royal Society, London. Stevens, S.S. (1946): On the Theory of Scales of Measurement, Science, 103(2684), June 7, 1946. SVRC Services (1999): International Standards Survey and Comparison to Def(Aust) 5679, Document ID: CA38809101, Issue: 1.1 Swallom, D. W. (2005): Safety Engineer, U.S. Army Aviation and Missile Command: A Common Mishap Risk Assessment Matrix for United States Department of Defense Aircraft Systems, 23rd International System Safety Conference, San Diego, Ca., 22-26 August 2005 UK DEF STD 00-56: Issue 4 1 June 2007, Safety Management Requirements for Defence Systems. Wildman, L. (2002): Requirements Reformulation using Formal Specifications: A Case Study, Software Verification Research Centre, University of Queensland. Wikipedia - the free encyclopedia, Precautionary Principle 146 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 INTEGRATING SAFETY AND SECURITY INTO THE SYSTEM LIFECYCLE Bruce Hunter INTRODUCTION System Safety and Information Security activities, while recognised as being critical aspects of a system, are often only assigned targets to achieve at the concept or requirements phases of development. They then are left to independently achieve outcomes that align with each other along with other aspects of the system throughout its life. This somewhat cynical view of the systems engineering model is reinforced by standards [IEC61508 Ed1, EN50126/28/29/59] that don’t require either an integrated approach or verification of compatibility between resulting safety and security controls. While some attempts have been made to integrate the practices of safety and security engineering [Ibrahim 2004)], key Information Security standards [IEC27001][ISO27005][SP 800-30] make no mention of the safety aspects of security controls. Only later standards [SP 800-82][ISA-99] [IEC62443] start to mention how security aspects support safety. Conversely the Functional Safety series (IEC61508) edition 1 makes no specific mention of security and its impact on achieving functional safety for the system. While later versions of sector safety standards (e.g. 
EN50126, 128, 129 and 159) include security aspects, they do not address how these interact with, and are supported by, security controls through the lifecycle. As identified later in this paper, treating safety and security activities independently in the system lifecycle can lead to unexpected and unwanted outcomes (see the locked fire-door example). Finding real-life examples of these issues is not easy, which may be because incidents are considered sensitive or because the relationship has not been clearly understood or recognised. In recent surveys more than 70% of organisations do not report security incidents to external parties [Richards 2009]. We have the legal system [Supreme Court of Queensland - Court of Decisions R v Boden] to thank for details of an interrelated safety and security incident that would have gone unnoticed and undocumented except for a sewerage system failure and subsequent spill.
Between 9 February 2000 and 23 April 2000 a former employee of a supplier deploying a radio-networked SCADA system for Maroochy Shire Council accessed computers controlling the sewerage system, altered electronic data in respect of particular sewerage pumping stations and caused malfunctions in their operations. The resultant sewerage spill was significant. It polluted over 500 metres of open drain in a residential area and flowed into a tidal canal. Cleaning up the spill and its effects took days and required the deployment of considerable resources. The court imposed a two year sentence covering: 1 count of using a restricted computer without the consent of its controller intending to cause detriment or damage and causing detriment greater than $5,000; 1 count of wilfully and unlawfully causing serious environmental harm; 26 hacking counts; 1 count of stealing a two-way radio; and 1 count of stealing a PDS compact 500 computer. The concurrent sentence imposed survived two appeals.
This case is now a benchmark for cyber security, albeit that sometimes the facts are exaggerated. The key issues associated with ensuring systems are safe and secure in their operation are shown in Table 1.
Table 1. Maroochy Cyber Attack (security aspect, with the compensating ISO27001 control objective in brackets)
- The safety assessment of the system would not have considered the security implications at the time, due to expectations of industry norms and a culture of "Security by Obscurity" [A.10.6 Network security management].
- The investigation found it difficult to differentiate between teething problems of the system being deployed (still not resolved from completion of installation in January) and the malicious hacking outcomes; this was also a source of the later appeals [A.10.10 Monitoring].
- The employee had vital equipment and knowledge in his possession after resignation from the supplier, including critical configuration software that allowed the system data and operation to be changed remotely [A.8.3 Termination or change of employment].
- The system did not discriminate between a masquerading rogue and real nodes in the network [A.11.4 Network access control and A.11.5 Operating system access control].
- 26 proven hacking attempts were made over a three month period; anecdotally there were more undiscovered events over a longer period [A.13 Information security incident management].
- An open communications network was used for what is a critical infrastructure operation, but again this was the industry norm for the time;
communication technology was transitioning from point-to-point links to digital networks [A.10.6 Network security management].
- The data controlling the system could be modified by an intruder based on past knowledge [A.12.5 Security in development and support processes].
- By hacking attempts, it was possible to disable alarms, hiding further changes and unauthorised operation of the system [A.11.6 Application and information access control].
Even the hacker himself was not immune from security issues; in appeal evidence the stolen laptop had problems in one of the hacking attempts because the "Chernobyl" virus had infected it.
While in hindsight it may be easy to see the risks associated with the lack of security controls that impacted the safety of the system (all of these issues could have been mitigated by the imposition of basic security objectives from ISO27001), the development and commissioning of industrial control systems at the time, and their supporting standards, would not have explicitly required this to be considered. It is easy to understand why there are good reasons to apply effective and timely security controls to systems to support both operational and functional safety integrity, but:
- Can they be addressed in isolation and still achieve their objectives?
- Aren't they the same anyway and achieve a compatible outcome?
VALUES, PRIORITIES AND COMPATIBILITY
Before you attempt to integrate two value-based systems, it is important that you are sure their value systems align. When it comes to safety and security, the values and priorities that drive the methodologies are not the same.
Assets versus People
The Common Criteria standard IEC 15408 addresses the assurance levels applied to security management and the criteria for evaluating and claiming assurance levels. This can be used as a level of trust that the security functions will be effective in protecting the confidentiality, integrity and availability of the products' associated assets. While this may provide some form of equivalent to the Safety Integrity Levels (SIL) associated with functional safety, there are basic differences in purpose and methodology that prevent this.
[Figure 1. IEC15408 Security Concepts and Relationship, adapted from IEC15408.1. The diagram relates owners, the countermeasures they impose, vulnerabilities, threat agents, the threats they give rise to, risk and the assets they wish to abuse or damage; safety functions, hazards and harm to people are added in dotted lines, with the annotation "Could this be missing when systems have safety implications?"]
Some explanation of the differences in approach can be seen in the Security Concepts and Relationship Model [IEC15408.1] reproduced in Figure 1. The prime value here is the assets that security protects from threat agents, whereas safety is about protecting against the risk of physical injury or damage to the health of people [IEC61508.0] (added into the diagram with dotted lines). This incompatibility of values leads to the likelihood of conflicting risks and controls being applied that may compromise system safety and security. This needs to be considered in addition to the interdependencies between safety and security controls and their impact, as outlined in Figure 2: controls in one domain may detract from or contribute to the effectiveness of the other.
An example of the possibly incompatible application of security controls at the expense of a safety outcome (email of sound files to police blocked) can be found in the proceedings and recommendations of the Coronial Inquest into the death of David Iredale and the issues with the NSW 000 Emergency service [Hall 2009][NSW Deputy State Coroner 2009].
[Figure 2. Safety and Security Control Contributors and Detractors. The figure relates the safety framework (hazards, faults, safety controls providing reliability, availability and maintainability for safety systems and people) to the security framework (threats, vulnerabilities, security controls providing confidentiality, integrity, availability and traceability for information assets). Contributors: security controls protect safety-related functions from malicious actions that could trigger hazardous action or compromise them (e.g. a denial-of-service attack locking up a safety-related system), and safety controls protect users and maintainers of information assets from hazards. Detractors: malware protection and fail-secure actions of security controls may degrade safety functions by absorbing free system time, and fail-safe actions of safety controls may add back-door vulnerabilities to information assets.]
Safety Integrity versus Security Priorities
Safety hazard and risk analysis, in association with any necessary risk reduction, achieves a residual risk rating, which is to be as low as reasonably practical (ALARP) [IEC61508.5]. Reliance on the likelihood of a dangerous failure associated with this risk will lead to a required SIL. This is a quantitative level, which is derived from the failure rates of the random and systematic failures associated with the elements of the system that support the safety function. Security risk evaluation, however, leads to a ranking of risk associated with the likelihood of a threat exploiting a vulnerability and compromising an asset. Control objectives are then applied to mitigate risks in priority order of the risk ranking identified. There is no guarantee that all risks will or can be treated, and risk treatment is invoked to reduce (by mitigating controls), retain (expecting the risk may be realised), remove or transfer the risk. This security approach of ranking rather than rating risk is clearly not compatible with either the ALARP principle or other safety risk methodology.
RISK ASSESSMENT COMPATIBILITY
Both safety and security engineering make extensive use of risk management to assess and mitigate risks that threaten the system integrity and compromise the safety of people and the security of assets. There are, however, important differences in the way risk management is applied and the decisions made as a result of the estimated risk.
Risk Impact
The consequence of hazardous events associated with safety relates to injury to people and their health, ranging from an individual with a minor injury to many people killed. Security risk impact usually relates to the value of the asset compromised, in dollar terms, from disruption to operation, disclosure of information, loss of reputation and business, and direct financial loss. Some standards [ANSI/ISA-99] quantify the security risk in terms that include safety outcomes, and it is quite feasible that risk impacts could be aligned for "like" consequences.
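As a minimal sketch of what aligning impacts for "like" consequences might look like, the mapping below pairs a qualitative security impact scale with safety severity categories. Both scales and the mapping itself are illustrative assumptions for this sketch, not values taken from ANSI/ISA-99 or any other cited standard.

# Illustrative only: a hypothetical mapping from security impact categories
# to safety severity categories, so that "like" consequences can be compared
# on a single scale during a combined risk assessment.
SECURITY_TO_SAFETY = {
    "negligible_business_impact":  "negligible",
    "service_disruption":          "marginal",      # assumed: an outage may delay a safety response
    "corruption_of_control_data":  "critical",      # assumed: corrupted data can drive the plant
    "loss_of_protection_function": "catastrophic",  # assumed: protection function defeated
}

def aligned_severity(security_impact: str) -> str:
    """Return the assumed safety severity for a given security impact category."""
    return SECURITY_TO_SAFETY.get(security_impact, "unassessed")

print(aligned_severity("corruption_of_control_data"))  # -> critical

In practice any such mapping would have to be agreed between the safety and security authorities for the particular system, since the same security event can have very different safety consequences in different installations.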
Other standards and guidance [NIST SP800-30][ISO27005][ITSEAG 2006] again do not take into account the impact on safety or injury, just the business impact.
Risk Likelihood
The comparison of likelihood methodology between safety and security is where these system attributes diverge markedly. Probability of failure for functional safety is broadly accepted for random as well as systematic failures; although malicious action should be considered as part of Preliminary Hazard Analysis, the probability of motive-based hazards is as yet unquantified but starting to be modelled [Moore, Cappelli, Trzeciak 2008]. Figure 3 illustrates the differences and alignment between safety and security failures in a generic cause-consequence model.
[Figure 3. Generic Cause-Consequence Model of Safety and Security Failures. The model traces intentional threats (almost certain and constantly evolving, exploiting vulnerabilities and security design faults), accidental safety design faults arising from lack of design rigour (quantification by standards) and random wear-out failures (quantification by MTBF and component life) through ineffective preventive and reactive controls to incidents and hazardous events, leading to critical and non-critical security and safety outcomes.]
Assigning probabilities to security exploitations is difficult, to say the least. Likelihood ratings in most standards are very qualitative, due to the non-deterministic nature of security threats and the vulnerabilities of rapidly evolving information technology. Claiming a security incident is improbable with a frequency of 1 in 1000 years is clearly "unrealistic" and open to abuse in manipulating the risk rating. Again, some security standards don't help and recommend up to five levels of likelihood from "Almost Certain" to "Rare" [ISO27005][ITSEAG 2006] with no quantification. The exponential rise of security incidents makes it hard to reliably quantify the frequency contribution to likelihood, and incident rates have reached such levels that likelihood is now expressed as a time to survive based on system type (in some cases currently less than an hour). Figure 4 [US CERT] shows the evolution of reported attacks tied to infrastructure-related incidents and the publication of safety and security standards. This "flooding" dictates an attack probability of 1.
Little mention was made of information security in the safety standards available in 2001. Security could have been considered as any other hazardous event risk, but this would have required domain knowledge of the possible threats and vulnerabilities of the system to attack. As outlined previously for the SCADA system for Maroochy Shire Council, safety practitioners would not have been fully aware of the vulnerabilities of an open communications network. More recent attacks on critical infrastructure, such as the Estonian and Georgian "cyber wars" [Nazario 2007] and the US power grids [Condon 2008], show that the risk to systems attached to open communication networks is a difficult risk to quantify.
[Figure 4. US CERT Trend in Reported Computer Attacks in Context. The chart plots CERT-reported attacks and vulnerabilities (in thousands) by year from 1992 to 2009, annotated with the publication of safety and security standards (MIL-STD-882C, IEC61508 Ed 1, BS7799, AS4444, NIST SP800-30, ISA-TR99, AS4048, ISO27001, IEC62443-3 and the anticipated IEC61508 Ed 2) and with attack-triggered incidents (the Maroochydore sewer spill, the Slammer worm and the Ohio nuclear plant, the Estonian and Georgian cyber wars, the US electricity grid, cyber profiling). CERT stopped attack reporting once the survival time for unprotected Windows systems fell to 30 to 100 minutes.]
Understanding probability in terms of random component failure, and the probability of introducing systematic faults under defined levels of development rigour, has proven foundations. The assignment of probability to motive-based attacks, however, is not tenable. Security attacks are driven by the attraction of the target to the motives of the attacker, and this is hard or even impossible to measure. Attack targets and mechanisms are continuing to evolve, from spamming to political agendas to cyber-crime. The probability of motive-based attacks is best considered as 1, with other protective layers introduced.
Along with attack targets and mechanisms, new security vulnerabilities are constantly emerging due to evolving technology driven by the needs of consumers. The convergence of system and communication technology to meet these needs has created previously unconsidered vulnerabilities for control systems. The time taken to fix discovered vulnerabilities means that patches alone cannot be relied on for protection. The XP vulnerability exploited by the "Conficker" malware, which was discovered and detected early by virus checkers, took several months for an effective patch and cleaning software. Conficker also used a different attack mechanism, in the form of "autorun" on USB thumb drives and other removable media, thus avoiding the usual detection and protection controls.
What if Safety is Reliant on Security Reliability?
One possible method to align safety and security risk likelihood is to use Layer of Protection Analysis (LOPA) as supported by IEC61511.3 and the yet-to-be-published Edition 2 of IEC61508.5. LOPA could be used to assign a failure probability to security risks as long as realistic values are attributed to each Protection Layer (PL) and the rules of Specificity (one security function), Independence (or separation from other layers [Hunter 2006]), Dependability (quantitative failure rate) and Auditability (validation of functionality) are applied.
[Figure 5. Probability Model of Security Defence in Depth. An attack from the open threat environment must breach successive protection layers (PL1 to PLN); each layer has a probability of failure on attack (PFAPL) determined by whether the time for the threat to exploit the next layer's vulnerability (TEXP) elapses before incident detection (TDET) and remediation (TREM), via the security incident response, security remediation control and other response controls, are complete.]
Knowing that threats and the vulnerabilities they exploit are subject to change, most organisations apply a defence-in-depth strategy, where no single control is relied on for protection against attack. This, however, is dependent on effective threat and vulnerability monitoring with fast and reliable incident response.
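The timing argument behind Figure 5 can be sketched numerically: a protection layer fails on attack when the exploit of the next layer's vulnerability completes before detection and remediation do, and (given independent layers) the layer PFAs multiply into an overall PFA. The sketch below is illustrative only; the exponential timing assumption and the mean times are invented for the example and are not values from the paper.

import random

def layer_pfa(mean_t_exploit, mean_t_detect, mean_t_remedy, trials=100_000):
    """Estimate P(exploit of the next layer completes before detection + remediation),
    assuming, purely for illustration, exponentially distributed times in hours."""
    failures = 0
    for _ in range(trials):
        t_exp = random.expovariate(1.0 / mean_t_exploit)
        t_det = random.expovariate(1.0 / mean_t_detect)
        t_rem = random.expovariate(1.0 / mean_t_remedy)
        if t_exp < t_det + t_rem:
            failures += 1
    return failures / trials

# Assumed mean times (hours) for three hypothetical protection layers.
layers = [(48, 1, 4), (24, 2, 8), (72, 1, 2)]
layer_pfas = [layer_pfa(*times) for times in layers]

total_pfa = 1.0
for pfa in layer_pfas:
    total_pfa *= pfa  # layers treated as independent, as the LOPA rules require
print(layer_pfas, total_pfa)

With the attack probability taken as 1, the product of the layer PFAs plays the same role as the mitigated likelihood in the LOPA example of Table 2 below.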
The monitoring and response must prevent an attack from breaching the next protective layer before the threat agent discovers and/or exploits its vulnerabilities, as illustrated in Figure 5. The Probability of Failure on Attack (PFA) for each PL can be derived from the probability of an attack exploiting the next protective layer before a mitigating response is implemented. The product of each layer's PFA then provides the total PFA (PFAavg). Table 2 is an example of the application of safety LOPA to security protection [IEC61511.3][IEC61508.5].
Table 2. LOPA Example with Security Defence in Depth Layers
- Initial risk event: a network attack performs an unauthorised operation; impact: environmental damage; severity level: Major.
- Initiating cause: hacking attack; initiating likelihood: 1 (the likelihood of attack cannot be predicted).
- PL1, Message encryption (XXX Standard): PFA 0.1.
- PL2, Firewall controls: PFA 0.1.
- PL3, Network controls (two-factor authentication): PFA 0.1.
- PL4, Application controls: PFA 1 (application vulnerabilities cannot be determined).
- Intermediate likelihood: 10^-3.
- Security response: PFA 0.1 (assuming the response completes before exploitation).
- Mitigated (residual) likelihood: 10^-4.
When a safety function is reliant on security protection against a likely threat of being compromised, the PFA could form part of the Probability of Failure on Demand. This practice would need considerably more work before it could be counted towards a resulting SIL, together with the establishment of dependable survival probability statistics. While the Common Criteria Evaluation Assurance Level may provide confidence and a level of trust in the security controls applied, this does not equate to a Safety Integrity Level. Care must also be taken, as in Figure 5, that remediation against the attack does not itself compromise a safety function.
APPLICATION OF SAFETY AND SECURITY CONTROLS
The application of security controls to support functional safety has been addressed in other papers [Smith, Russell, Looi 2003][Brostoff, Sasse 2002]. Inherent conflicts between safety and security methodology become evident in the mitigation controls against the associated risk. Risk discussion forums and application standards show anecdotal cases where incompatible security controls lead to hazardous situations in systems [NIST SP800-82][ISA-99]. Typically these relate to security scanning and penetration testing, and may have simply exposed inherent systematic faults that would eventually have led to these situations anyway. With an increasing threat environment for safety-related systems, and uncertainty about the vulnerabilities they contain, there is an increasing risk of ill-considered security controls being applied which directly degrade the functional safety of these systems.
Establishing Control Compatibility
One helpful, albeit obvious, way to manage where conflicts or incompatibilities arise is to divide functionality into value objectives ("must", "must not" and "don't care") for each functional aspect, as depicted in the compatibility chart proposed in Figure 6.
Figure 6. Proposed Control Compatibility Model
Security objective \ Safety objective:   Must          Don't care    Must not
Must:                                    Contributing  Compatible    Incompatible
Don't care:                              Compatible    Compatible    Compatible
Must not:                                Incompatible  Compatible    Contributing
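A minimal sketch of how the Figure 6 matrix could be applied mechanically, once safety and security objectives have been recorded for the same functional aspect, is given below; the encoding is an assumption made purely for illustration.

# Value objectives per functional aspect: "must", "must_not" or "dont_care".
# Pairs not listed explicitly are compatible, per the Figure 6 matrix.
COMPATIBILITY = {
    ("must", "must"):         "contributing",
    ("must", "must_not"):     "incompatible",
    ("must_not", "must"):     "incompatible",
    ("must_not", "must_not"): "contributing",
}

def compatibility(security_objective: str, safety_objective: str) -> str:
    """Look up the Figure 6 outcome for one functional aspect."""
    return COMPATIBILITY.get((security_objective, safety_objective), "compatible")

# e.g. locking doors from the inside is a security "must" but a safety "must not":
print(compatibility("must", "must_not"))  # -> incompatible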
For compatibility to be achieved, no "must" control in one aspect is allowed to coexist with a "must not" in the other aspect; other matches are compatible. This can be seen in the locked fire-door example following.
Conflicts - The Locked Fire-door Paradox
An example of safety and security conflict in everyday life is the paradox of locked fire doors. This has serious consequences for the safety of occupants, who may be trapped in a burning building or in the fire escape stairs. Conflicts between the safety and security controls can be identified and resolved by using Goal Structuring Notation (GSN) as in Figure 7 [Lipson, Weinstock 2008][Kelly, Weaver 2004].
[Figure 7. Simplified GSN view of Locked Fire-door Conflicts. The safety goals (keep occupants of an area safe from fire or other dangerous situations; evacuate occupants early to remove them from harmful and life-threatening conditions; have alternative paths of egress with unfettered passage) lead to the controls "ensure doors are unlocked from inside to allow quick egress" (Safety - Must) and "provide multiple exits for alternate egress in emergency". The security goals (protect assets from threats to their confidentiality, integrity and availability; secure the boundary against access from outside and compromise from inside) lead to "lock doors from outside to prevent unauthorised entry", "lock doors from inside to prevent unauthorised exit and breach" (Security - Must, but Safety - Must Not) and "limit entrances to reduce security risk", producing two conflicts with the safety controls. The conflicts can be realigned through controls such as video surveillance to identify pending breaches and alarmed exits to alert when security is compromised.]
If the safety and security controls in this example are considered together, then not only are conflicts identified early, they can also be modified to improve the effectiveness of both.
AN ALIGNED APPROACH
Rather than combining disparate methodologies for safety and security, this paper proposes Lifecycle Attribute Alignment to ensure effective and compatible safety and security controls are established and maintained at key lifecycle stages: concept, requirements, qualification and maintenance. In Figure 8, the interaction in these phases is shown in terms of alignment attributes (A), requirement allocation attributes (R) and verification effectiveness attributes (V). Engineering, system safety and security management plans should include these attributes as objectives to be achieved and maintained through the lifecycle.
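A minimal sketch of how such objectives might be recorded and checked across plans is given below. The attribute identifiers are those listed in Figure 8; the data structure and the example allocation of attributes to plans are assumptions made purely for illustration.

# Hypothetical check that the management plans between them cover every
# alignment (A), allocation (R) and verification (V) attribute of Figure 8.
EXPECTED = {f"A{i}" for i in range(1, 8)} | {f"R{i}" for i in range(1, 6)} | {"V1", "V2"}

def uncovered(plans: dict) -> list:
    """plans maps a plan name to the attribute identifiers it claims to address."""
    claimed = set().union(*plans.values()) if plans else set()
    return sorted(EXPECTED - claimed)

plans = {
    "System Safety Management Plan":        {"A2", "A3", "A6", "R2", "V1"},
    "Information Security Management Plan": {"A1", "A3", "A5", "R1", "R3", "R4", "V2"},
}
print(uncovered(plans))  # attributes not yet owned by any plan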
[Figure 8. Key Lifecycle Alignment Points. The figure maps system safety management and information security management activities onto the system lifecycle, from operational concept studies through requirements analysis, system/architectural design, implementation, integration and test, qualification testing, installation, acceptance and through-life support, to upgrade, obsolescence and withdrawal, and marks the alignment (A), allocation (R) and verification (V) points listed below.]
A1: Ensure security and operational concepts align.
A2: Ensure safety and operational concepts align.
A3: Ensure controls are free from conflict.
A4: Ensure validation includes compatibility.
A5: Ensure security updates don't compromise safety.
A6: Ensure functional updates don't compromise safety.
A7: Ensure functional updates don't compromise security.
R1: Ensure security risks are considered in hazard analysis.
R2: Ensure safety requirements are allocated.
R3: Ensure security requirements are allocated.
R4: Ensure security vulnerability updates are conducted.
R5: Ensure secure information is removed before disposal.
V1: Ensure safety requirements are validated and match current risk.
V2: Ensure security requirements are validated against vulnerabilities.
The yet-to-be-published Edition 2 of IEC61508.1 does reference newer security standards and the consideration of malevolent actions during hazard and risk analysis through the safety lifecycle. The Australian IT/6 Committee has recommended that security considerations, including compatibility, be added to IEC61508 at key points in the safety lifecycle, and hopefully these will appear in the final Edition 2 of the standard.
CONCLUSION
System safety and security are not the same in their values, methodology or stability; neither can they be treated independently if conflict between their mitigating controls is to be avoided. As discussed in this paper, two key issues limit the success of integrating safety and security in the systems lifecycle: incompatibility in risk management; and possible conflicts between mitigating controls. Using the safety LOPA technique may allow the determination of the probability of security control failure and the resulting dangerous failure probability. GSN or similar techniques should be applied where safety and security controls may conflict; GSN has also been used to develop more sound security cases [Lipson, Weinstock 2008]. Ensuring continued alignment of the dependence and compatibility of safety and security through the lifecycle is the key to their successful integration.
Dangers of not treating safety and security seriously together are: x An increasing risk of successful attacks on infrastructure systems with safety functions; x An increasing risk of mitigating security controls compromising safety functions somewhere in their lifecycle; and x The likely imposition of governmental controls on critical infrastructure protection if the industry cannot demonstrate adequate support for information security in the systems they supply, operate or maintain. Benefits of integrating safety and security into the lifecycle are not only safer and more secure systems but minimisation of the cost associated with late discovery issues in the implementation, acceptance or support phases; building safety and security in from the start is essential. REFERENCES ANSI/ISA-99 (2007) Security Guidelines and User Resources for Industrial Automation and Control Systems. AS IEC 61508 Ed. 1 (1998) Parts 0 to 7, Functional safety of electrical/ electronic/ programmable electronic safety-related systems. AS ISO/IEC 15408 Part 1-3 (2004) Information technology - Security techniques - Evaluation criteria for IT security AS/NZS ISO/IEC 27001 (2006), Information technology—Security techniques—Information security management systems—Requirements Brostoff, S., & Sasse, M. A. (2001, September). Safe and Sound: a safety-critical approach to security. Position paper presented at the New Security Paradigms Workshop 2001, Cloudcroft, New Mexico, USA. Condon, Stephanie (2008) Cyberattack threat spurs US rethink on power grids. ZDNet.co.uk Security threats Toolkit article, 15 Sep 2008 Hunter, B.R., (2006) Assuring separation of safety and non-safety related systems. In Proc. Eleventh Australian Workshop on Safety-Related Programmable Systems (SCS 2006), Melbourne, Australia. CRPIT, 69. Cant, T., Ed. ACS. 45-51. Hall, L., (April 28, 2009) Police had to fill in forms and wait for David Iredale phone Tapes, Article in the The Australian 157 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 Ibrahim, L. et al, (2004) Safety and Security Extensions for Integrated Capability Maturity Model. United States Federal Aviation Administration IEC61511-3:2003 Functional safety – Safety instrumented systems for the process industry sector. Part 3: Guidance for the determination of the required safety integrity levels. ISO IEC/PAS 62443-3 (2008) Security for industrial process measurement and control – Network and system security ISO/IEC 27005 (2008) Information technology — Security techniques — Information security risk management IT Security Expert Advisory Group - ITSEAG (2006) Generic SCADA Risk Management Framework, Australian Government Critical Infrastructure Advisory Council Kelly, Tim P., Weaver, Rob A. (2004), The Goal Structuring Notation – A Safety Argument Notation. Proceedings of the Dependable Systems and Networks 2004 Lipson, H., Weinstock, C. (2008) Evidence of Assurance: Laying the Foundation for a Credible Security Case. Department of Homeland Security Build Security In website, May 2008. Moore, A.P., Cappelli, D.M., Trzeciak, R. F., (May 2008) The “Big Picture” of Insider IT Sabotage Across U.S. Critical Infrastructures. CMU/SEI-2008-TR-009 Nazario, J. (2007) Explaining the Estonian cyberattacks. ZDNet.co.uk Security threats Toolkit article, 30 May 2007 NSW Deputy State Coroner, (07.05.2009) 1427/2006 Inquest into the death of David Iredale, Office of the State Coroner of New South Wales Richards, K. 
(2009) The Australian Business Assessment of Computer User Security: a national survey. Australian Institute of Crime Reports Research and Public Policy Series 102 Smith, J., Russell, S., Looi, M., (2003) Security as a Safety Issue in Rail Communications. In Proc. 8th Australian Workshop on Safety Critical Systems and Software (SCS’03). SP 800-30 (2002) Risk Management Guide for Information Technology Systems, US National Institute of Standards and Technology, SP 800-82 (2008) Guide to Industrial Control Systems (ICS) Security. US National Institute of Standards and Technology, Final public draft September 2008 Supreme Court of Queensland (2002)- Court of Decisions R v Boden, QCA 164 (10 May 2002) Appeal against Conviction and Sentence US-CERT, United States Emergency Response Team - http://www.us-cert.gov/ BIOGRAPHY Bruce Hunter ([email protected]) is the Quality and Business Improvement Manager for the Security Solutions & Services and Aerospace divisions of Thales Australia. In this role Bruce is responsible for product and process assurance as well as the management of its reference system and its improvement. Bruce has a background in IT, systems and safety engineering in the fire protection and emergency shutdown industry and has had over 30 years of experience in the application of systems and software processes to complex real-time software-based systems. Bruce is a contributing member of Standards Australia IT6-2 committee, which is currently reviewing the next edition of the IEC61508 international functional safety standards series. Bruce is also a Certified Information Security Manager and Certified Information Systems Auditor. 158 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 WHAT CAN THE AGENT PARADIGM OFFER SAFETY ENGINEERING? Louis Liu, Ed Kazmierczak and Tim Miller Department of Computer Science and Software Engineering The University of Melbourne Victoria, Australia, 3010 Abstract A current trend in safety-critical applications is towards larger, more complex systems. The agent paradigm is designed to support the development of such complex systems. Despite this, agents are having minimal impact in safety-critical applications. In this paper, we investigate how the agent paradigm offers benefits to traditional safety engineering processes. We demonstrate that concepts such as roles, goals, and interactions narrow that gap between engineering and safety analysis, and provide a natural mechanism for managing re-analysis after change. Specifically, we investigate the use of HAZard and OPerability studies (HAZOP) in agent-oriented software engineering. This offers a first step towards broadening the scope of systems that can be analyzed using agent-oriented concepts. Keywords: agent-oriented software engineering, safety-critical systems, safety analysis, HAZOP 1 INTRODUCTION A current trend in safety-critical systems is towards systems that are larger, more complex and have longer life-spans than their predecessors (Mellor, 1994). Many modern systems are characterised by being autonomous and independent nodes distributed over a network, having multiple modes, and more functionality than their predecessors (Milner, 1989). Further, such systems typically undergo numerous upgrades and adaptations over their lifetime. The multi-agent paradigm is well-suited to modelling and analysing such systems. 
Despite being tailored for the development of complex distributed systems there has been little uptake of agent-oriented software engineering (AOSE) methods in safety-critical systems development — either in research or in practice. Current practice in safety engineering centres around processes that ensure that the hazards of a system are identified, analysed and controlled in requirements, design and implementation (see for example (Ministry of Defence, 1996; RTCA, 1992; IEC, 2003)). Hazard analysis forms a critical part of the engineering of safety-critical systems and there are numerous techniques reported in the literature for conducting such hazard analysis. These analysis methods are predominantly team-based and rely on documented accident analyses from similar systems and the ability and experience of engineers to predict potential accidents. In this paper, we discuss how the agent paradigm offers benefits to traditional safety engineering processes. We demonstrate that concepts such as roles, goals, and interactions narrow that gap between engineering and safety analysis, and provide a natural mechanism for managing re-analysis after change. Specifically, we investigate the use of HAZard and OPerability studies (HAZOP) in agent-oriented software engineering, which we overview in Section 3. The goal of our research programme is to develop analytical methods for assuring safety in multi-agent systems. In Section 4, we illustrate a way of analysing multi-agent systems based on the idea of interactions, or how to perform a HAZOP study based on interactions. To do this we introduce the idea of an 159 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 interaction map and show how to adapt the HAZOP to interaction maps. We then go on to show how the role and goal paradigm found in methodologies such as Gaia (Zambonelli et al., 2003) and ROADMAP (Juan et al., 2002) can be used to improve the hazard identification and analysis process by providing direct feedback from safety analysis to design, how roles and goals can limit changes and how interaction maps lead naturally to quantitative safety analysis similar to Mahmood and Kazmierczak (2006) in the form of Bayesian accident networks. These improvements are used to reduce the total effort spent on manual identification and analysis, and to provide feedback into the system design. 2 AN ILLUSTRATIVE EXAMPLE To illustrate our ideas, we consider a simple fruit packing and palletising system. Different types of fruit are conveyed by means of several special conveyor systems to a central sorting and packing area. Each conveyor—here called a fruit line —transports a number of different types of fruit. Here a centralised sorting, packing, palletising and storing system sorts the fruit into boxes for shipping to supermarkets or for further processing. The system must implement the following four features: A. A packing feature, in which the system must select quality pieces of fruit, ignoring damaged and bruised fruit, and pack the selected fruit into boxes without bruising or damaging the fruit; B. A palletising feature, in which the system must place approximately 640kg of fruit onto a pallet in an 8 × 8 × 5 symmetrical arrangement of 2kg boxes of fruit; C. A wrapping and nailing feature, in which the system wraps the completed pallet in a protective sheet and nails the sheet to the pallet; and D. A storing feature in which the system must move the completed pallet to a cool store. 
The performance requirements on the system are that it should be able to pack 6 pallets per hour, and that it should be available for 166.5 hours per week, leaving 30 minutes per working day for routine cleaning, maintenance and recalibration. It is anticipated that humans will be required to interact with the fruit packing system, thus safety will be important.
An agent-oriented analysis using a role-based method such as Gaia or ROADMAP might arrive at the following decomposition of the problem: a main goal, which is to sort, palletise and store fruit, and a decomposition of the main goal into four key subgoals: (1) to sort the fruit; (2) to pack a 2kg box of fruit; (3) to palletise the boxes by placing enough boxes on the pallet to make up the 640kg pallet; and (4) to store the completed pallets in a cool store. Each goal is decomposed further into one or more roles. The roles are shown schematically in Figure 1 as stick-figures.
[Figure 1: A typical role-goal model for the fruit palletising system. The goal of packing and storing the fruit is decomposed into the four roles of packing, palletising, wrapping, and storing. The main goal ("Safely pack and store 6 pallets/hour", with the quality goal "Available for 99%, or 166.5 hours, per week") is decomposed into sorting the fruit (Sorting Role), packing the fruit into boxes (Packing Role), placing boxes on the pallet, wrapping and stapling (Palletising Role and Wrapping & Nailing Role) and storing the fruit (Storing Role).]
3 HAZARD AND OPERABILITY STUDIES
Traditional non-agent approaches to safety analysis rely heavily on constant hazard identification and analysis during all phases of the system development life-cycle. Hazards are the sources of potential accidents, and the aim in safety engineering is to identify and control the hazards of a system. There are many different techniques for hazard analysis in the literature, but those most often used are:
Exploratory methods, which aim to simply explore the system and its behaviour in order to identify hazards. Prominent among the exploratory methods are HAZOP studies, which we will investigate further below, and SHARD (Fenelon et al., 1994).
Causal methods, which work backwards from hazards to their possible causes. Prominent among causal methods is Fault Tree analysis (Leveson and Shimeall, 1991), which grows an AND-OR tree of events back from each hazard to system-level events.
Consequence methods, which begin by identifying system-level events and then investigating the consequences of those events. Prominent among consequence methods are Failure Modes and Effects Analysis (Palady, 1995) and Event Tree Analysis.
Here we investigate the use of HAZOP for conducting hazard analysis on the fruit packing system described in Section 2 above. HAZOP studies are a team-based method for identifying and analysing the hazards of a system. Originally developed for the chemical engineering industry (Kletz, 1986), HAZOP has been applied in many other engineering domains, including software (Ministry of Defence, 2000). Hazard and Operability studies are a well-established technique for Preliminary Hazard Analysis (PHA) whereby a specific set of guide-words is used to explore the behaviour of a system and the causes and consequences of deviations. For example, one HAZOP guide-word is "after", which prompts the analyst to explore the consequences of a component performing some action or sending some message after a key point in time.
HAZOP expects a sufficiently detailed understanding of the system such that the system components and their attributes are specified as either a system model or a set of requirements. A team of analysts selects each component and interconnection in turn, interprets the guide-words in the context of the system and applies the HAZOP guide-words to specific attributes of the component in the study. The output from a HAZOP study is a list of hazards, their causes and the consequences of each hazard. According to the HAZOP standard (Ministry of Defence, 2000), the output of a HAZOP study must include the following: (1) details of the hazards identified, and any means within the design or requirements model to detect and mitigate the hazard; (2) recommendations for mitigation of the hazards and their effects, based on the team's knowledge of the system and the revealed details of the hazard; and (3) recommendations for the later study of specific aspects of the design when there are uncertainties about the causes or consequences of a possible deviation from design intent.
What might a HAZOP look like if the packer role is implemented by an agent in the form of a moving pick-and-place robot? From the domain, the types of accidents that may occur involve excess loss of fruit leading to substantial financial losses, or one of the moving robots colliding with and injuring a human. An example of the analysis of a palletising and wrapping agent for the guide-words "before" and "after" is shown in Table 1.
- Guide-word: Before. Interpretation: the agent performs the nailing of the sheet before the pallet is ready. Possible cause: the agent fails to identify that the pallet is not filled. Consequences / implication: boxes of fruit may be damaged as they are moved to the pallet. Indication / protection: ... Recommendation: ...
- Guide-word: Before. Consequences / implication: the storeman (a human being) is injured through interacting with a pallet being nailed by the nailing agent. Indication / protection: ... Recommendation: ...
- Guide-word: After. Interpretation: the nailing agent performs its action after the pallet has been moved. Possible cause: the agent fails to identify that the pallet has left. Consequences / implication: boxes of fruit may be damaged if the pallet is moved without the protective covering; humans injured through an interaction with the wrapping and nailing agent when it is triggered late. Indication / protection: the status of the pallet and the proximity of humans to the agent must be checked. Recommendation: ...
Table 1: Partial results from a HAZOP applied to the wrapping and nailing ability in the palletiser example.
Despite HAZOP's simplicity there are several drawbacks in practice. The level of design detail, the interpretation of the guide-words, and the output of the HAZOP study result in a large amount of documentation for even small systems. The result is often a table of information accompanying a bulk of paperwork that becomes unmanageable in situations where changes to requirements, design, and technology occur frequently (Mahmood and Kazmierczak, 2006). However, HAZOP is used considerably by industry practitioners. In the remainder of this paper we show that the multi-agent system paradigm presents an opportunity to exploit agent-oriented concepts and techniques to complement and improve HAZOP-style safety analysis.
4 APPLYING HAZOP IN MULTI-AGENT SYSTEMS
HAZOP has been adapted to a number of different software analysis and design paradigms by interpreting components and attributes according to the paradigm.
For example, the HAZOP standard (Ministry of Defence, 2000) includes a guide to interpretation. To adapt HAZOP to the analysis of multi-agent systems requires an interpretation of the guide-words and attributes. Our first step therefore is to adapt HAZOP for the analysis of multi-agent systems. 4.1 HAZOP Based on Interactions Traditional non-agent based HAZOP is based on identifying components and their attributes. In the original HAZOP the components were pipes, valves and tanks and the attribute was flow. In systems HAZOP, the components can be hardware units or software modules such as packages or classes and the attributes are signals, messages or data flows depending on the analysis or design paradigm used. The problem for multi-agent systems is that they are often complex systems. Complex systems are generally viewed to have at least the following characteristics (Newman, 2003): 162 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 (1) Complex systems are composed of a large number and diversity of components. (2) Components in a complex system interact with each other. The interactions between components may be non-linear and may vary in strength. (3) Complex systems have operations of formation by which smaller components can be formed into larger structures. Complex systems also form decomposable hierarchies. Operations in different parts of a hierarchical structure may occur in different time scales. (4) The environment in which the system operates and the demands of the environment influence the behavior and evolution of a complex system. The key point in this view is the emphasis on interactions between components, and not on the components themselves. This observation hints at a HAZOP analysis based on interactions. HAZOP does not explicitly guide the analyst toward interactions, but rather relies on the analyst being able to understand or imagine what interactions might arise and how they may lead to hazards. Leveson (2004) has made the connection between complex systems, interactions and accidents, but has not used HAZOP in the analysis. 4.2 Interaction Maps We begin by developing an analysis notation that makes interactions explicit. Let us define “interaction” in the context of a multi-agent systems HAZOP. First we will need to define the key elements necessary for our analysis: actors, abilities and resources. Actors are entities that perform the system functions and in this paper are either roles or agents. To carry out their tasks actors require and produce resources. Abilities are the actors’ key functional capabilities, expressed as a set of tasks. Resources are defined as everything else in the system other than actors and abilities. There may be different types of resources such as physical resources from the environment such as fruit, conveyors and pallets in our example, or communication channels between actors. To achieve a task an actor performs one or more actions, for example, sending a message, actuating a device, accepting a task, or calling a software function in another actor. Definition 4.1 Given two actors or resources A and B we say that A interacts with B if and only if A influences the actions of B or B influences the actions of A. Some observations are necessary regarding Definition 4.1. The first is that the definition assumes a reciprocal relationship between actors. 
An actor's behaviour is characterised by the actions it performs; thus if A influences the actions of B then B's actions depend on A, and the converse. We include in our definition of interaction the case where A influences the actions of B but not the converse. The second is that interactions may be transitive but need not be. If A interacts with B, and B interacts with C, then A does not necessarily interact with C. If B influences the behaviour of C because of the influence of A, then we consider this an independent interaction. Further, interactions can be internal, so that, for example, A can change its own state without outside interference, influencing its own future behaviour.
The interactions between actors and resources describe how actors influence each other and how their operational environment influences them. Three types of interaction can exist in a system: (1) an actor interacts directly with the environment (a physical resource); (2) actors interact with each other via a resource, for example, a communication channel; and (3) resources interact with other resources.
We use the idea of interaction maps to identify and record the network of interactions that exist between actors and resources, as well as the abilities that agents have to interact with resources. Interaction maps are networks in which there are three types of entity: (1) resources are nodes drawn in rectangular boxes; (2) abilities are nodes drawn in hexagonal boxes; and (3) actors are collections of nodes and interactions. Edges in the network represent interactions. To be well formed, an interaction map must always have a resource node between any two ability nodes.
Figure 2 gives an example of an interaction map for the packer and palletiser roles. Using the interaction map we can see that the two actors (the packer and palletiser roles) interact indirectly via the "Completed Pallet" resource.
[Figure 2: An interaction map for the packer and palletiser roles. The packer role has the abilities "Fill Box with Fruit" and "Stack Box on Pallet", connected to the resources Fruit, Box, Filled Box, Pallet and Completed Pallet; the palletiser role has the abilities "Wrap Sheet around Pallet" and "Nail Sheet to Pallet", connected to the resources Protective Sheet, Nails, Plastic Covered Boxes and the Covered and Nailed Pallet, which the Storing Agent takes on.]
Figure 2 also illustrates how interaction maps define interaction within actors; for example, the packer role consists of the internal resources needed to cover and nail the protective sheet to the pallet as well as the abilities to achieve the goal. Interaction maps show the structure of the interactions in the system. Interaction maps exist at a higher level of abstraction than other methods of describing interactions, such as collaboration diagrams in UML. By identifying the interactions, the analyst can hypothesise which actors can interact to cause accidents and even identify the boundary conditions of the accident. Further, interaction maps help the analyst to uncover the causal factors of an accident and, by examining the interaction, may find ways of mitigating or stopping an accident-causing interaction from occurring (Mahmood and Kazmierczak, 2006). Methodologies such as Gaia and ROADMAP explicitly aim to identify the interactions in systems at an early stage of development. We argue that this makes deriving an interaction map a straightforward task given a system specification.
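A minimal sketch of how an interaction map might be represented and checked for well-formedness is given below; the data structures, names and edges are illustrative assumptions, not a notation defined by the paper.

# An interaction map as a set of typed nodes and undirected edges (interactions).
# Well-formedness rule from the text: a resource node must sit between any two
# ability nodes, i.e. no edge may connect an ability directly to an ability.
ABILITY, RESOURCE = "ability", "resource"

node_types = {
    "Fill Box with Fruit": ABILITY, "Stack Box on Pallet": ABILITY,
    "Wrap Sheet around Pallet": ABILITY, "Nail Sheet to Pallet": ABILITY,
    "Fruit": RESOURCE, "Box": RESOURCE, "Filled Box": RESOURCE,
    "Pallet": RESOURCE, "Completed Pallet": RESOURCE,
    "Protective Sheet": RESOURCE, "Nails": RESOURCE,
    "Covered and Nailed Pallet": RESOURCE,
}

edges = [
    ("Fruit", "Fill Box with Fruit"), ("Box", "Fill Box with Fruit"),
    ("Fill Box with Fruit", "Filled Box"), ("Filled Box", "Stack Box on Pallet"),
    ("Pallet", "Stack Box on Pallet"), ("Stack Box on Pallet", "Completed Pallet"),
    ("Completed Pallet", "Wrap Sheet around Pallet"),
    ("Protective Sheet", "Wrap Sheet around Pallet"),
    ("Wrap Sheet around Pallet", "Nail Sheet to Pallet"),   # deliberately ill-formed
    ("Nails", "Nail Sheet to Pallet"),
    ("Nail Sheet to Pallet", "Covered and Nailed Pallet"),
]

def ill_formed(edges, node_types):
    """Return edges that connect two ability nodes directly."""
    return [(a, b) for a, b in edges
            if node_types[a] == ABILITY and node_types[b] == ABILITY]

print(ill_formed(edges, node_types))  # flags the ability-to-ability edge

Actor membership (which nodes belong to the packer or palletiser role) could be layered on top as a grouping of nodes, matching the paper's view of actors as collections of nodes and interactions.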
As an example, consider a human interacting with a covered pallet in order to take it to the cool store. The analyst notes that covered pallets are the result of the Packer actor undertaking its "Wrapping and Nailing" ability, which can be hazardous if the human comes into proximity with the nailing device. To mitigate or avoid such an undesired interaction, a gate can be used to prevent humans from accessing pallets until the nailing has ceased. This implies that humans interact with the gate, and the hazardous interaction between humans and the pallet is avoided.
While we do not explore this idea further in this paper, it is possible to derive causal Bayesian networks from certain interaction maps to quantify risk. This is done by identifying and modelling the possible states of a resource during its lifetime and the possible states that an ability goes through during its execution. The set of states of each actor, ability and resource, and their associated probability distributions, can be thought of as random variables; for example, see Figure 3. The direction of the links in the network will depend on how the actors, resources, and abilities interact.
[Figure 3: An interaction map and its corresponding Bayesian network. An interaction map fragment with ability AA, resource RA, resource RB and ability AB maps to a Bayesian network over the random variables "states of AA", "states of RA", "states of RB" and "states of AB".]
If the states of the resource RA can be observed, then the probability of being in a given state at time t can be estimated. The same is true of the abilities. Alternatively, these can be measured during development.
4.3 Interpreting HAZOP Guide-words
Our approach to HAZOP uses the interaction map as well as the guide-words to guide the analysis. The analysis explores the effects of an actor's ability being applied incorrectly to a resource, or to the incorrect resource. The analysis uses actors as the system components of the study, and the actors' abilities as the attributes to which guide-words are applied. To apply HAZOP using interaction maps, we have to identify the interpretation of each of the guide-words with respect to interaction maps. Table 2 specifies our interpretation of each of the existing HAZOP guide-words for interaction maps. One can see that the aim of the guide-words is to investigate the effect of an actor incorrectly applying one of its abilities on the resources in the system.
- None: the ability does not influence the resource.
- More: the ability influences the resource more than intended.
- Less: the ability influences the resource less than intended.
- Part of: the ability influences the resource only partly as intended, or only part of the ability is exercised on the resource.
- Other than: the ability influences the resource in a way other than intended; or, the ability influences a resource other than the intended one.
- As well as: the ability influences the resource as intended, but influences additional resources.
- Before: the ability influences the resource before intended.
- After: the ability influences the resource after intended.
Table 2: HAZOP guide-words interpreted on the abilities of an actor.
The interpretation of the guide-words is quite general at this point. To apply them to a specific system, their interpretations must be further refined depending on the context in which they are applied.
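Because the general interpretations in Table 2 apply uniformly to every ability and every resource it touches, the expansion into candidate deviations can be mechanised. The sketch below is illustrative only; the data structures and the example abilities are assumptions, and the generated rows would still need to be assessed, and mostly discarded, by the analysis team.

GUIDE_WORDS = ["None", "More", "Less", "Part of", "Other than",
               "As well as", "Before", "After"]

# Abilities of each actor and the resources each ability touches in the
# interaction map (a hypothetical fragment of the palletiser example).
ABILITIES = {
    "Palletiser": {"Nail Sheet to Pallet": ["Protective Sheet", "Pallet", "Nails"]},
    "Packer":     {"Stack Box on Pallet": ["Filled Box", "Pallet"]},
}

def worksheet(abilities):
    """Yield one candidate HAZOP row per (actor, ability, resource, guide-word)."""
    for actor, ability_map in abilities.items():
        for ability, resources in ability_map.items():
            for resource in resources:
                for word in GUIDE_WORDS:
                    yield {"actor": actor, "ability": ability,
                           "resource": resource, "guide-word": word,
                           "interpretation": "", "cause": "", "consequence": "",
                           "protection": "", "recommendation": ""}

rows = list(worksheet(ABILITIES))
print(len(rows))  # 5 resources x 8 guide-words = 40 candidate rows

The value of the interaction map here is simply that it bounds the combinations to abilities and the resources they actually influence, rather than to all pairs of components.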
As an example, consider the ability of the palletiser role to nail the plastic sheet to the pallet. Table 3 outlines one interpretation of the guide-words for that ability.

Guide-word – Specific Interpretation
None – The palletiser does not nail the sheet down.
More – The palletiser uses too many nails.
Less – The palletiser uses too few nails.
Part of – The palletiser does not nail down the entire sheet.
Other than – The palletiser nails a resource other than the sheet.
As well as – The palletiser nails the sheet and a resource other than the sheet.
Before – The palletiser nails the sheet down earlier than intended.
After – The palletiser nails the sheet down later than intended.

Table 3: HAZOP Guide-words interpreted on the ability of the palletiser role to nail the plastic sheet to the pallet.

Using these guide-word interpretations, a simple method for the analysis can be given as follows.
1. Select an actor—role or agent—as the basis for study.
2. For each ability of the actor in turn, interpret the meaning of every guide-word in the current context, and explore the effects of the guide-word on each node (resource) connected to that ability in the interaction map.
3. Document the effect of every guide-word, and recommend a mitigation strategy if that effect results in a hazard.

As an example, again consider the ability of the palletiser role to nail the plastic sheet to the pallet. Using the interpretations of the guide-words from Table 3, we explore their effects, resulting in the observations about its effect on the pallet in Table 4.

Guide-word – Effect on Pallet
None – There is no sheet on the pallet, perhaps resulting in the hazard of fruit on the workshop floor.
More – No hazard.
Less – The sheet is not secure, perhaps resulting in the hazard of fruit on the workshop floor.
Part of – The sheet is not secure, perhaps resulting in the hazard of fruit on the workshop floor.
Other than – A number of possible hazards.
As well as – A number of possible hazards.
Before – The palletiser nails the sheet down before it is filled with fruit, possibly resulting in the hazard of fruit on the workshop floor (and damaged fruit) as it attempts to load more onto a covered pallet.
After – The palletiser nails the sheet down as the human storing agent attempts to pick it up, resulting in an injury to the human agent.

Table 4: The result of a HAZOP using the guide-words on the ability of the palletiser role to nail the plastic sheet to the pallet.
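Steps 1 to 3 lend themselves to a simple mechanical skeleton in which the analyst supplies the context-specific judgements. The sketch below is illustrative only: the effect strings paraphrase Tables 3 and 4, and the data structures are hypothetical rather than part of the described method.

```python
# Minimal sketch of the three-step analysis; names and data are illustrative.
GUIDE_WORDS = ["None", "More", "Less", "Part of", "Other than",
               "As well as", "Before", "After"]

# Step 1: the actor under study, with its abilities and connected resources
# taken from the interaction map.
actor = "Palletiser"
abilities = {"Nail Sheet to Pallet": ["Completed Pallet"]}

# Step 2: analyst-supplied effect of each deviation on each resource
# (entries paraphrase Table 4; missing entries mean no hazard was identified).
effects = {
    ("Nail Sheet to Pallet", "Completed Pallet", "None"):
        "No sheet on the pallet; possible fruit on the workshop floor",
    ("Nail Sheet to Pallet", "Completed Pallet", "After"):
        "Sheet nailed as the storing agent picks the pallet up; injury",
    # ... remaining guide-words elided ...
}

# Step 3: document every deviation and flag where mitigation is needed.
worksheet = []
for ability, resources in abilities.items():
    for resource in resources:
        for word in GUIDE_WORDS:
            effect = effects.get((ability, resource, word))
            worksheet.append({
                "actor": actor, "ability": ability, "resource": resource,
                "guide-word": word,
                "effect": effect or "No hazard identified",
                "mitigation required": effect is not None,
            })

for row in worksheet:
    print(row["guide-word"], "->", row["effect"])
```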
4.4 Design Feedback

An integral part of the analysis process is to use the HAZOP study to refine the analysis model. Each row of the HAZOP table corresponds to a deviation from intent applied to an ability and a possible consequence of that deviation. For example, the second row of Table 1 details a possible consequence of the guide-word "after" applied to the nailing ability of the packer role. The result of this analysis is that the packer role has a hazard associated with it through its nailing ability. Further, we can use the interaction map to gain insight into the potential interactions leading to the hazard.

How can the information in the HAZOP be used to refine the model? The HAZOP table identifies the hazards, but the interaction map shows what elements of the current model must interact to lead to the hazard. The model can be refined to mitigate or avoid such an interaction. There are several options for refining the model to do this: (1) we can manipulate the resources, that is, alter, add or delete resources, to control the hazard; or (2) we can manipulate the abilities of the actor to control the hazard. In Table 1, we wish to change the interaction between the storing role (played by a human agent) and the palletiser. An example of changing the resources is shown in the interaction map in Figure 4, in which we add a guard under the control of the palletiser and only allow the storing role access to the pallet if the palletiser actor deems it safe.

Figure 4: Modifying the palletiser by the addition of a resource and the alteration of an ability.

We may also be able to control the interaction by adding an additional ability to the palletiser actor. The updated role/goal model may appear as in Figure 5, where a new means of implementing the safety goal has been added as the result of the HAZOP study.

Figure 5: Updated role/goal model resulting from the analysis of the second row of the HAZOP table.

In general, each row of the HAZOP table can be used to extend the role/goal model in this way.

5 HAZOP BASED ON INTERACTIONS AND SYSTEM EVOLUTION

Most systems undergo some form of evolution, either to adapt them to situations that were not imagined at the time of their design, or to add additional features and functions desired by users. Adaptation to new situations is also a key feature of multi-agent systems, therefore safety analysis methods must be able to cope with change. Unfortunately, traditional non-agent HAZOP incurs a large overhead when dealing with changes to existing systems, as much of the HAZOP must be reworked. For example, the HAZOP standard (Ministry of Defence, 2000) specifies that for any change in the system design the entire HAZOP study should be redone. Interaction maps coupled with the role/goal analysis of multi-agent system requirements provide clear boundaries on what must be re-analysed in the event of a change.

5.1 Isolating Changes in the Role/Goal Model

The hierarchical nature of the models means that constraints and quality goals in lower-level models must be consistent with the higher-level models, but it also means that changes at the lower levels are isolated from the higher-level models. If we change a role but not a goal, then the new role must also meet the goal, so the goal does not need to be re-analysed. The role and its interactions, however, must be re-analysed. If we change an agent but not a role, then the agent's externally observable interactions are the same as those for the role. In this case, if the new agent introduces a new external interaction then it needs to be reflected back up to the role model; otherwise a HAZOP study on the agent model alone will be sufficient. If we change a role and an agent, then the system model will still be hierarchical.
In this case we can always perform a HAZOP study on the role model first, before performing a HAZOP on the agent models that implement the role. Observe, however, that if the agent model belongs to the role, then a change to the role model will imply a change in the agent model anyway. If the interaction map is correct then it will tell us what needs to be updated. If we are unsure what needs to be updated, then a HAZOP on the local change will indicate whether any new interactions or resources have been introduced into the system. As an example, consider a model in which the packer role being played by one agent evolves into a design in which the role is played by two agents (based on the role's abilities): one agent to place the sheet on the box, and one agent to nail the sheet down. The role specification has not changed in this case, and neither has the interaction map associated with the role model. Considering this, the HAZOP study does not need to be re-performed at this level. The agent model, and its related interaction map, have changed. As a result, the HAZOP study needs to be performed at this level.

5.2 The Key is the Interaction Map

In our analysis, the interaction maps specify which roles interact with which other actors, and through what resources. Consider again the example of changing the agent model such that the packer role is played by two agents instead of one. The HAZOP study must be redone on the two agents that now implement the role, but not on the role, provided that no new externally observable interactions have been added. What other actors in the system need to be re-analysed? The interaction map can be used to answer such a question. If we study the interaction map in Figure 2, we see that the only abilities affected are the palletiser's wrapping ability and the palletiser's nailing ability. The external interactions in Figure 2 with the packer and the storing agent remain unchanged. We conclude that this change requires only the two agents implementing the palletiser role to be re-analysed. It is then straightforward to identify the benefits of exploiting the hierarchical nature of many agent methodologies, together with the interaction map: we can significantly reduce the burden on safety engineers during design evolution by helping them to systematically identify which parts of a design must be re-analysed after a change. While this is possible using other development methodologies, the unique factor in the agent paradigm is that it forces developers to consider goals and interactions early in the development life-cycle.
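As an illustration of how the interaction map can answer the re-analysis question mechanically, the following sketch (hypothetical Python; the actor, ability and resource names are illustrative and simplified) flags the actors whose abilities share a resource with a changed actor. It is a sketch of the idea only, not the analysis procedure described above.

```python
# Hypothetical fragment of an interaction map: actor -> ability -> resources.
interaction_map = {
    "Packer":        {"Fill Box with Fruit": {"Fruit", "Box", "Filled Box"},
                      "Stack Box on Pallet": {"Filled Box", "Pallet"}},
    "Palletiser":    {"Wrap Sheet around Pallet": {"Pallet", "Protective Sheet"},
                      "Nail Sheet to Pallet": {"Nails", "Completed Pallet"}},
    "Storing Agent": {"Store Pallet": {"Completed Pallet"}},
}

def reanalysis_scope(changed_actor, imap):
    """Actors whose abilities share a resource with the changed actor and so
    are candidates for re-analysis after a local change; actors with no shared
    resources keep their existing HAZOP results."""
    touched = set().union(*imap[changed_actor].values())
    scope = set()
    for actor, abilities in imap.items():
        if actor == changed_actor:
            continue
        if any(resources & touched for resources in abilities.values()):
            scope.add(actor)
    return scope

# Splitting the agent(s) playing a role leaves its external interactions
# unchanged, so only actors sharing its resources need to be revisited.
print(reanalysis_scope("Packer", interaction_map))  # {'Palletiser'}
```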
6 RELATED WORK

Several authors have integrated safety analysis into agent systems. Dehlinger and Lutz (2005) propose a product-line approach to requirements by developing requirements schemata and reusing these to capture shifting requirements in multi-agent systems. Dehlinger and Lutz show how to group requirements so that the impact of change is minimised. Such an approach can be applied to safety requirements, but they do not discuss this. Feng and Lutz (2005) propose a bi-directional safety analysis for product-line multi-agent systems that extends Software Failure Modes, Effects and Criticality Analysis (FMECA) and Software Fault Tree Analysis (FTA) to incorporate multi-agent systems. They show how to generate safety cases and constraints in the Gaia role model. The aim of the product-line approach is to increase the reusability of safety analysis. Our work shares a similar motivation; however, we do not use product lines. Giese et al. (2003) propose a way to ensure safety in self-optimising mechatronic systems. They do not provide detailed analytical methods to generate safety cases, but instead discuss how to ensure safety in system hierarchies with safety cases already provided by domain safety experts. Bush (2005) proposes an extension of traditional HAZOP studies for the i* development model. Our work is closely related; however, Bush applies HAZOP analysis to goals. We believe that abilities and resources are useful for safety analysis, because goals are the conditions that agents desire, whereas the abilities and resources outline how those goals are achieved — something that we believe is more closely related to safety.

7 CONCLUSIONS AND FUTURE WORK

In this paper, we have demonstrated that existing techniques such as HAZOP studies can be used with agent-oriented software engineering methodologies with little extension. We have also demonstrated that the introduction of interaction maps can greatly ease the burden of re-analysis when changes to the system model occur. Dealing with change is perhaps more important for multi-agent systems than for traditional non-agent systems, as their very design is often aimed at adapting to changing circumstances. To this end the use of interaction maps becomes vital, as they help to identify the elements of the multi-agent system—roles, goals and agents—that need to be re-analysed in the event of changes to the system model. Despite greatly easing the burden of maintaining safety by re-analysing the system, if change is perpetual then the constant re-analysis of safety becomes a tiresome and costly overhead. The question is whether or not safety, once analysed, can be maintained by the system itself, even in the presence of constant change and evolution to the agents and even the roles. The goal of our research programme is to develop methods for assuring safety in multi-agent systems even in the presence of constant system evolution and adaptation. Our research programme involves the use of accident knowledge to allow agents to perform safety analysis of their own behaviour. This will allow agents to change their behaviour at runtime after taking into consideration the causes of accidents involving other agents, and is the subject of current and future research. It is hoped that our research programme will aid in the uptake of the agent paradigm in safety-critical systems.

References

Bush, D., August 2005. Modelling support for early identification of safety requirements: A preliminary investigation. In: Fourth International Workshop on Requirements for High Assurance Systems (RHAS'05 - Paris) Position Papers. Dehlinger, J., Lutz, R. R., 2005. A product-line requirements approach to safe reuse in multi-agent systems. In: International Conference on Software Engineering. Vol. 3914. pp. 1–7. Fenelon, P., McDermid, J., Pumfrey, D., Nicholson, M., 1994. Towards Integrated Safety Analysis and Design. ACM Applied Computing Review 2 (1), 21–32. Feng, Q., Lutz, R. R., 2005. Bi-directional safety analysis of product lines. Journal of Systems and Software 78 (2), 111–127.
Giese, H., Burmester, S., Klein, F., Schilling, D., Tichy, M., 2003. Multi-agent system design for safetycritical self-optimizing mechatronic systems with uml. In: OOPSLA 2003 - Sec- ond International Workshop on Agent-Oriented Methodologies, Anaheim, CA, USA. pp. 21–32. IEC, 2003. IEC 61508 Functional Safety of Programmable Electronics Safety-Related Systems. International Electrotechnical Commission. Juan, T., Pearce, A., Sterling, L., 2002. ROADMAP: Extending the Gaia methodology for complex open systems. In: Proceedings of the First International Joint Conference on Autonomous Agents and Multi-Agent Systems. ACM Press, pp. 3–10. Kletz, T. A., 1986. HAZOP & HAZAN notes on the identification and assessment of hazards. The Institution of Chemical Engineers, London. Leveson, N. G., April 2004. A new accident model for engineering safer systems. Safety Science 42 (4). Leveson, N. G., Shimeall, T. J., July 1991. Safety verification of Ada programs using software fault trees. IEEE Software 8 (4), 48–59. Mahmood, T., Kazmierczak, E., December 2006. A knowledge-based approach for safety analysis using system interactions. In: Asia Pacific Software Engineering Conference, APSEC’06. IEEE Computer Society Press. Mellor, P., 1994. CAD: Computer Aided Disaster. High Integrity Systems 1 (2), 101–156. Milner, R., 1989. Communication and Concurrency. International Series in Computer Science. Prentice Hall. Ministry of Defence, 1996. Defense Standard 00-56: Safety Management Requirements for Defence Systems. Ministry of Defence, 2000. Defense Standard 00-58: HAZOP Studies on Systems Containing Programmable Electronics. 2nd Edition. Newman, M. E. J., 2003. The structure and function of complex networks. SIAM Review 45, 167–256. Palady, P., 1995. Failure Modes and Effects Analysis. PT Publications, West Palm Beach Fl. RTCA, December 1992. RTCA DO-178B: Software Considerations in Airborne Systems and Equipment Certification. RTCA Inc. Zambonelli, F., Jennings, N. R., Wooldridge, M., 2003. Developing multiagent systems: The Gaia methodology. ACM Transactions on Software Engineering Methodology 12 (3), 317–370. 170 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 COMPLEXITY & SAFETY: A CASE STUDY Author George Nikandros, Chairman aSCSa, BE (Electrical) INTRODUCTION Despite correct requirements, competent people, and robust procedures, unsafe faults occasionally arise. This paper reports on the outcomes of an investigation into a series of related events: a series of events that involves a railway level crossing. Whilst the direct cause of the failure was defective application control data, it was a defect that would be difficult to foresee and if foreseen, to test for. The last failure event occurred after the correction was supposedly made. The correction was made as a matter of urgency. To understand the underlying complexity and safety issues, some background knowledge in relation to active level crossing controls i.e. flashing lights and boom gates and railway signalling is required. The paper therefore includes a description of the operation of the railway level crossing controls and the railway signalling associated with the case study. The official incident report is not in the public domain and therefore this paper has been prepared so as to not identify the location of the series of incidents, the identity of the organisations or the people involved. 
THE UNSAFE EVENTS There being three events, with the same unsafe outcome, in that a driver of a train was presented with a PROCEED aspect in the same trackside signal when the actively controlled crossing was open to road traffic i.e. the flashing lights were not flashing and the boom gates were in the raised position. Had the driver not observed the state of the active level crossing controls and proceeded on to the crossing, a collision with a road vehicle or pedestrian would have been very likely; the crossing is a busy crossing with some 4300 vehicles per day and 500 pedestrians per day. The first occurrence of this outcome occurred some seventeen days after the initial commissioning of a new signalling system and was not given the appropriate classification for investigation and action when logged. The second occurrence occurred two days later, a Saturday. This time the correct classification was made and actions were immediately initiated i.e. designer engineers were called in to identify and fix the problem. The third event occurred five days after the second occurrence and after the design flaw was supposedly removed. THE RAILWAY CONTROL SYSTEM Level Crossing Controls The key aim of active level crossing controls is to provide the road crossing user sufficient warning that a train is approaching and where boom gates are provided, to close the crossing to road traffic before the train enters the crossing. Once the train has passed, the crossing 171 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 needs to be reopened with minimal delay. If a second train approaches the crossing when already closed, the crossing is held closed. Figure 1 shows the typical train trigger points for controlled rail crossings for a unidirectional line. Figure 1: Typical train trigger points – one direction only Once opened the crossing needs to remain open for a sufficient time so as to ensure that the appropriate warning is again given to the road users. Particularly for busy roads, a level crossing should not be closed unnecessarily i.e. if a train stops short of the crossing at a signal displaying a STOP aspect for a time, then the crossing should be opened for road traffic. The signal should not then display a PROCEED aspect, until the appropriate warning is again given to the road crossing users. However level crossings are rarely located to make life simple. Having multiple tracks and locating a level crossing in the vicinity of a station stop significantly adds complexity. More than one train may approach the crossing simultaneously from both directions and trains may stop for long periods of time at the station platforms. Another complexity which usually occurs in urban areas is the use of road traffic control signals. There needs to be coordination (an interlock) between the road traffic control signals and the level crossing control signals; it would be unsafe to have a “GREEN” aspect in a road traffic signal for road vehicles to travel through the level crossing with the level crossing controls in the closing or closed states. The approach of a train needs to be detected earlier to enable the road traffic control system to cycle in sufficient time so that the signals allowing the road traffic across the level crossing to return to RED prior to the active level crossing controls begin closing the crossing. The road traffic signals also need to provide sufficient warning to the road users. 
Figure 2 shows the schematic of the level crossing of interest. It contains all the complexities mentioned.

Figure 2: Layout Schematic for Level Crossing. (The schematic shows 1, 2, 3 and 4 Signals and the controlled road intersection, the point the rear of a train must pass before 4 Signal changes to PROCEED, the point it must pass before the signals leading to 4 Signal change to PROCEED, and that there is 2 m between 4 Signal, the signal of interest, and the edge of the road.)

Rail Signalling Controls

The aim of the signalling system is to safely regulate the movement of trains on a railway network. The signalling system ensures that:
- the path ahead is clear;
- there is no other path set, or able to be set, for another train to encroach on the path set; and
- any active level crossing controls are primed to operate so as to provide the appropriate warning to the road crossing user and to close the crossing where boom gates are provided.
Only when all these conditions are satisfied is an authority to proceed issued. For the location of interest, the authority to proceed is conveyed via a PROCEED aspect in a trackside colour light signal.

Signals may be controlled or automatic. Controlled signals display a STOP aspect until commanded otherwise by a Train Controller (the person responsible for managing the movement of trains on the railway network). Although the Train Controller commands a signal to PROCEED, it only displays a PROCEED aspect if the signal interlocking system deems it safe to do so. Controlled signals automatically return to STOP after the passage of the train. Automatic signals are automatically commanded to PROCEED by the signal interlocking system, i.e. there is no participation of the Train Controller; the Train Controller can neither command them to STOP nor to PROCEED. Some controlled signals have an automatic mode which the Train Controller can select and deselect as necessary. Of the signals of interest, 3 Signal and 4 Signal are automatic signals; 1 Signal and 2 Signal are controlled signals.

If there are no trains in the vicinity, 3 Signal and 4 Signal will each display a PROCEED aspect. Figure 1 depicts an example of this condition; the signal near the crossing represents 4 Signal. As a train, Train A, approaches 4 Signal, the road traffic controls are commanded to cycle; after the allowed cycle time has elapsed, the flashing lights are activated to warn the road crossing users, and after the required warning time has elapsed, the boom gates descend to close the crossing. Whilst Train A remains on the approach to 4 Signal the crossing remains closed. When Train A passes 4 Signal, 4 Signal is automatically placed at STOP, and no other train can approach 4 Signal at STOP until the rear of Train A passes the point at which the signals applying to 4 Signal are permitted to display a PROCEED aspect (see Figure 2); this point is known as the overlap limit. The overlap is the safety margin provided at signals should the train driver misjudge the train's braking. Once the rear of Train A clears the level crossing and there is no other train approaching the crossing on the other tracks, the crossing control commences its opening sequence. When the crossing is opened, the road traffic signals resume their normal cycle.
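The behaviour just described can be summarised as a small crossing state machine plus an interlocking predicate for the automatic signal. The sketch below is an illustrative simplification in Python; the names and structure are hypothetical and do not represent the application logic actually used on the railway concerned.

```python
from enum import Enum

class Crossing(Enum):
    OPEN = 1
    CLOSING = 2
    CLOSED = 3
    OPENING = 4

def signal_4_may_show_proceed(track_ahead_clear, overlap_clear, crossing):
    """An automatic signal protecting the crossing clears only when the path
    ahead and its overlap are clear and the crossing is fully closed."""
    return track_ahead_clear and overlap_clear and crossing is Crossing.CLOSED

def next_crossing_state(state, train_approaching, road_signals_cycled,
                        warning_given, train_clear_of_crossing):
    """One step of the closing/opening sequence described in the text."""
    if state is Crossing.OPEN and train_approaching:
        return Crossing.CLOSING        # cycle road signals, start the warning
    if state is Crossing.CLOSING and road_signals_cycled and warning_given:
        return Crossing.CLOSED         # boom gates lowered
    if state is Crossing.CLOSED and train_clear_of_crossing and not train_approaching:
        return Crossing.OPENING
    if state is Crossing.OPENING:
        return Crossing.OPEN
    return state

state = Crossing.OPEN
state = next_crossing_state(state, train_approaching=True,
                            road_signals_cycled=False, warning_given=False,
                            train_clear_of_crossing=False)
print(state)                                          # Crossing.CLOSING
print(signal_4_may_show_proceed(True, True, state))   # False: not yet closed
```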
It is important to note that the rail control system influences the road traffic signals; the road traffic signals do not initiate any action in the rail control system. Once the rear of Train A is beyond the overlap limit for 4 Signal, anyone of the signals applying towards 4 Signal, assuming no other trains are in the vicinity, can be placed at PROCEED, thus allowing another train, Train B to approach 4 Signal; this time however 4 Signal is at STOP (Figure 3). Figure 3: Level crossing operation for following trains 174 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 Because of the close proximity of the 4 Signal to the level crossing, the level crossing needs to be closed to safeguard the road crossing users and the train in the advent of a braking misjudgement by the train driver. If Train B is detained at 4 Signal when at STOP for a sufficiently long enough period, for this case 35 seconds, the crossing opening sequence commences to allow road traffic flow to resume. Should the conditions to allow 4 Signal to display a PROCEED aspect be satisfied after the crossing is commanded open, then 4 Signal will remain at STOP until the crossing opening sequence is completed, the minimum road open time conditions are satisfied, the road traffic signals are again cycled, sufficient warning that the crossing is closing is given to the crossing users and the boom gates have lowered. If however Train B is detained at 4 Signal when at STOP and 4 Signal subsequently changes to display a PROCEED aspect i.e. the rear of Train A has passed the overlap limit for 3 Signal, within 35 seconds, the crossing remains closed until the rear of Train B clears the crossing, irrespective of how long the train takes. When the Train A passes 3 Signal, 3 Signal is automatically placed at STOP. When the rear Train A passes the overlap limit for 3 Signal, 4 Signal is automatically commanded to display a PROCEED aspect, but only does so if it is safe i.e. there is no train detained at 4 Signal with the level crossing open. SYSTEM ARCHITECTURE The signal interlocking system that performs the safety functions has a distributed architecture. The system consists of programmable logic controllers located geographically along the railway and interconnected with point to point serial data links, such that, referring to Figure 4, data that needs to go from Controller C to Controller A needs to go through Controller B. Figure 4: Distributed architecture showing area of control 175 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 It is important to note, that the architecture is not a master-slave architecture, where the slave controllers perform an input/output function as directed by the master controller. For this application the interlocking function is distributed over each of the controllers. Controller Technology The controllers are commercial-off-the-shelf (COTS) products specifically developed for railway signalling interlocking applications. All of the controllers are of the same type and version. Each controller maintains a time-stamped event log. However the controller clocks are not synchronised. The technology is modular and programmable. It uses plug-in modules for connectivity to different types and quantities of inputs and outputs. Thus only the hardware actually required for an application needs to be installed. The technology is supported by graphical tools for application design, simulation and testing. 
The suite of tools is used to define the modules and logical operation of the system and verify and validate the application logic. To satisfy the safety requirements the controllers operate on a fixed, nominally 1 second time cycle. Consequently an input change will not be immediately detected, however there is certainty as to when an input change will be detected and processed. THE DELIVERY PROCESS The system was delivered under a design, build, test and commission contract arrangement, where the contractor is responsible for delivery in accordance with the specification, and the railway organisation is responsible for verifying compliance and validation of the system to key signalling safety and operating principles. The contractor organisation was also the developer of the COTS controller technology and had a considerable history for deploying that technology on many railway networks, including that of the railway organisation commissioning the contract works. However this was the first time that a distributed interlocking architecture was to be deployed; neither the contractor personnel undertaking this work, nor the railway personnel verifying and validating this work had any prior experience in implementing a distributed interlocking architecture with this technology. The delivery model and the underlying processes had been well established. These had evolved over time and were considered best railway practice at the time. The personnel involved were appropriately qualified and experienced in the design of signal interlocking application logic of this complexity and in the use of the technology, albeit not in the design and implementation of the distributed interlocking architecture using this technology. 176 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 Hazard Analysis Because the project was considered “routine” there was no specific hazard analysis performed for the application design. The technology had been used for similar applications, albeit with a different architecture, and the application scenario i.e. an actively controlled level crossing in the vicinity of road traffic signals and station platforms, was not new. The hazards of the application were well understood. The potential hazards due to the processing latency of the controllers and their associated communications links were understood, but how to design for them was not. The application manual for the controllers did warn of the latency, but provided no guidance as to how this latency should be treated to eliminate the hazards or test for them. Application Logic The railway organisation specified the interlocking requirements for this application. The contractor designed the controller architecture, the modules and the application data and submitted the “design” for review by the railway. The reviewed design was amended as appropriate by the contractor and the application data produced for each of the controllers. The contractor tested the amended application design for compliance with the specification using both simulation tools and the target hardware (the personnel were required to be independent i.e. they were not involved in developing for the design under test). The application logic was then tested by the railway organisation to validate compliance with the key signalling safety and operating principles using simulators and the target hardware. 
Those tasked with the validation task had no involvement in the development of either the interlocking specification or any of the design reviews. THE FAILURES There were three unsafe events. The first two were due to the same latent defect, although the initiating event was different. To assist in understanding, the sequence of events for Event 1 is provided in Table 1. The time base used is Controller B. The time-stamps for Controllers A and C have been aligned with the Controller B time. The event sequence should be read with reference to Figures 5, 6 and 7. Figure 5: A state of the system prior to the Event 1 failure 177 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 Event 1 – The First Occurrence “C” “B” “A” 08:26:11 Event Train A approaches 3 Signal at STOP 08:27:13 Crossing closes for train approaching 1 Signal 08:28:18 3 Signal changes to PROCEED 08:28:29 08:28:32 Train B approaches 4 Signal at STOP [crossing already closed] Train A passes 3 Signal 08:29:05 08:29:04 Crossing called open – train at 4 Signal too long Train A (rear) passes 3 Signal overlap limit and 08:29:06 4 Signal changes to PROCEED 08:29:07 Crossing starts to open 08:29:07 Crossing called closed [4 Signal at PROCEED] 08:29:08 08:29:17 Crossing opens 08:29:41 Crossing commences to close 08:29:49 Crossing closed 08:30:22 Train B passes 4 Signal Table 1: Event 1 Sequence of Events The sequence of events show that 4 Signal was at PROCEED for some 40 seconds with a train on its approach and the level crossing not closed. The initiating event was Train A being detained at 3 Signal with Train B closely following. The reason for the detention of Train A at 3 Signal was because of incomplete works in relation to 3 Signal. Figure 6 shows the situation just as 3 Signal changes to PROCEED. Figure 6: Train A receives signal to continue, Train B at platform 178 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 Figure 7: The state of the system when the Event 1 failure occurred Figure 7 shows the situation as Train A clears the overlap beyond 3 Signal, thus enabling 4 Signal to be called to clear. Controller B allowed 4 Signal to display a PROCEED aspect at 08:29:06, because according to Controller B the crossing was closed and section of track from 4 Signal to 3 Signal and the overlap was clear. However Controller A had commanded the level crossing controls to open at 08:29:05, but Controller B did not receive and process this open command until 08:29:07. The failure is depicted in Figure 7. Once the crossing began to open, it could not again be closed until the crossing was open for the required crossing open time. The incident occurred because states of the crossing controls and 4 Signal in Controllers A and B were different for 1 second. The incident would not have happened if the conditions for 4 Signal to change to PROCEED were satisfied coincidently as the conditions to open the crossing were satisfied. Event 2 – Categorised correctly, investigated and fixed The initiating event was a failure of Controller C and the consequential loss of the communications link between Controllers B and C. Railway signalling systems have traditionally been required to fail safe. To meet this fail safe requirement, railway signal interlocking systems are required to fail to a safe state in the event of a failure i.e. trains are assumed to be everywhere, signals display STOP aspects and level crossings close. 
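Before continuing with Event 2, the one-second disagreement behind Event 1 can be reproduced with a toy model of two controllers that exchange state once per scan cycle, so that each sees the other's state one cycle late. The sketch below is illustrative only (hypothetical Python, not the real application data): it shows how Controller B can clear the signal on the very cycle in which Controller A commands the crossing open, because B is still acting on A's previous state.

```python
# Toy reconstruction of the Event 1 race between two controllers linked by a
# serial link and scanning on a fixed cycle. Names and logic are illustrative.

def simulate(cycles=4):
    a_crossing_closed = True        # Controller A's actual crossing state
    b_view_crossing_closed = True   # Controller B's (one-cycle-old) view of it
    signal4 = "STOP"

    for t in range(cycles):
        # Controller B: clears 4 Signal if, on its view, the crossing is
        # closed and the track ahead/overlap is clear (true from cycle 1 on).
        track_and_overlap_clear = t >= 1
        if track_and_overlap_clear and b_view_crossing_closed:
            signal4 = "PROCEED"

        # Controller A: the standing-too-long timer expires on the same
        # cycle, so A commands the crossing to open.
        if t == 1:
            a_crossing_closed = False

        print(f"cycle {t}: crossing closed={a_crossing_closed}, "
              f"B's view={b_view_crossing_closed}, 4 Signal={signal4}")

        # End of cycle: A's state only reaches B one cycle later.
        b_view_crossing_closed = a_crossing_closed

simulate()
# At cycle 1 the crossing is commanded open in the same cycle that 4 Signal
# changes to PROCEED: the two controllers disagree for one cycle, which is
# the window that produced the unsafe aspect.
```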
The loss of communications occurred whilst 1 Signal was at PROCEED in preparation for a future train. The failure resulted in the interlocking (Controller A) presuming that a train was approaching 1 Signal and closed the crossing and 4 Signal was placed at STOP because the track ahead was assumed to be occupied by a train. 179 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 A train approached 4 Signal at STOP at 14:42:47 automatically triggering the standing too long at platform timer. The timer times out at 14:43:22 and primes the crossing to open. The crossing does not open because of the presumed train approaching 1 Signal at PROCEED. At 14:44:00, the communications is re-established and the “presumed train” approaching 1 Signal disappears and, as 4 Signal was at STOP and the standing for too long timer for 4 Signal has expired, Controller A commands the crossing to open. However Controller B allowed 4 Signal to change to PROCEED because the track ahead was now confirmed clear when the communications recovered. At 14:44:02, the crossing was commanded to close because the interlocking (Controller A) conditions require the crossing closed when 4 Signal is at PROCEED and there is a train approaching i.e. the train standing at 4 Signal. However once the crossing began to open, it could not again be closed until the crossing was open for the required crossing open time. 4 Signal was at PROCEED for 42 seconds before the crossing was closed. The Fix The solution to the problem was to repeat the interlocking conditions requiring 4 Signal to be at STOP before opening the crossing in Controller A, in Controller B, thus ensuring that the states of 4 Signal and the level crossing control are always the same. Event 3 – After the Fix Some 5 days after the fault was supposedly corrected, there was another occurrence of the failure. This failure had a similar sequence of events as for Event 1, in that a train was detained at 3 Signal and a following train detained at 4 Signal long enough for the standing for too longer timer to expire. There was another train in the vicinity and it was approaching 1 Signal. The detention of the train at 3 Signal, this time, was due to a failure of Controller C which also caused a loss of communications between Controllers B and C. On recovery of Controller C and the re-establishment of communications between Controllers B and C, 3 Signal changed to PROCEED. The detained train moved on and when the rear of the train was beyond the overlap, 4 Signal changed to PROCEED, and the crossing called open. One second later, the crossing was commanded to close. This was essentially the same sequence of events for Events 1 and 2. So why did the fix not work? The reason why the fix did not work was because it was implemented incorrectly. Instead of requiring the crossing to be closed before 4 Signal changed to PROCEED when a train was standing at the signal, the implemented logic required 4 Signal to be at PROCEED to command the crossing to open. This effectively ensured that the crossing would always open automatically when 4 Signal changed from STOP to PROCEED. 180 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 THE FIX PROCESS The railway signalling industry gives unsafe signalling failures high priority. This failure was no exception. 
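Before describing how the fix was produced, the difference between the intended correction and the logic actually implemented (as described under Event 3 above) can be stated compactly. The following is an illustrative paraphrase in Python; the predicate names are hypothetical and do not reflect the actual interlocking data.

```python
def may_show_proceed_intended(track_and_overlap_clear, crossing_closed,
                              train_standing_at_signal):
    """Intended fix (paraphrased): when a train is standing at 4 Signal, the
    signal may only change to PROCEED if the crossing is still closed."""
    if train_standing_at_signal:
        return track_and_overlap_clear and crossing_closed
    return track_and_overlap_clear

def crossing_open_command_implemented(standing_timer_expired, signal4_at_proceed):
    """Implemented logic (paraphrased): 4 Signal at PROCEED became a condition
    for commanding the crossing open, which simply makes the crossing open
    whenever the signal steps from STOP to PROCEED; it does nothing to hold
    the signal at STOP while the crossing is opening."""
    return standing_timer_expired and signal4_at_proceed

# With a train standing at 4 Signal and the crossing already commanded open,
# the intended condition keeps the signal at STOP ...
print(may_show_proceed_intended(True, crossing_closed=False,
                                train_standing_at_signal=True))          # False
# ... whereas the implemented condition merely opens the crossing once the
# signal has already cleared.
print(crossing_open_command_implemented(standing_timer_expired=True,
                                        signal4_at_proceed=True))        # True
```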
Subsequent to the second failure, which occurred on a Saturday (the first occurrence was not correctly categorised and hence not immediately acted upon), the railway organisation investigated the failure, devised the solution and implemented the next day, a Sunday. Because it was a failure of the safety system, it was considered a matter of urgency to correct the problem. People were specially called in. The personnel involved were those who verified and validated the interlocking system supplied under contract and so should have had good knowledge of the interlocking logic. However, their collusion in the investigation, the identification of the solution and its subsequent design, test and commissioning, compromised their independence. The change was not tested using the simulation tools and test facilities as it was assumed that the sequence of events could not be accurately simulated. This was a timing issue and the events had to be timed to the second. One second either way meant that the failure would not have occurred. The change however was relatively simple. There was some attempt to test the deployment of the change using the target system. However this only confirmed that the fix had no adverse affect on the normal sequence of events. It was not possible to induce the changes of state with sufficient accuracy to prove that the problem was corrected. THE COMPLEXITY FACTOR The interlocking logic flaw which led to the unsafe failure events described above was a direct result of the complexity created by the architecture selected for this particular application. Whilst the people involved appreciated the timing complexities of distributed systems, there were no prescribed processes as to how to deal with transitioning internal states common to different controllers. It is important to note than had 4 Signal been a controlled signal instead of an automatic signal, the flaw would not have as readily been revealed. The request to clear 4 Signal from the Train Controller would have had to arrive within 1 second of the conditions allowing 4 Signal to change from STOP to PROCEED were satisfied. The problem is, there appears to be no obvious practical way of identifying such precise timing-related flaws. How can we be ever certain that there are no other similar flaws which have yet to be revealed? The system has been in service now some nine years and there have been no other interlocking logic flaws revealed. 179 Improving Systems and Software Engineering Conference (ISSEC), Canberr, Australia, August 2009 SAFETY None of the unsafe events resulted in any harm. However, that does not mean that this was not a serious safety flaw. The primary safety system failed and it was only the vigilance of the train drivers involved that prevented any harm from occurring. There is an increasing trend in the railway industry to automate train operations i.e. operate trains without a driver. Had there been no driver on the three trains involved in the failure events, then harm would have certainly occurred. The PROCEED aspect in 4 Signal would have been sufficient for the train to automatically depart with the crossing open to road traffic. If the controlled level crossing did not exist then the events would not have happened. The events only occurred because of the need to guarantee a minimum road open time before reclosing the crossing. CONCLUSION The versatility of programmable logic controllers tempt application designers to use them in ways not originally intended. 
Whilst the particular controllers had the functionality to communicate serially, the use of this functionality to construct such a distributed interlocking system was an innovative use of the technology. Whilst the equipment manuals did not preclude such use, they did warn about the latency of the serial links. The series of failures described in this case study demonstrates the subtlety of the design errors that can be introduced in a distributed system and that may lie dormant until revealed, sometimes with serious consequences. When such flaws are revealed, the urgency to correct them often creates a strong temptation to bypass the usual rigorous procedures. This case study demonstrates what can happen when such procedures are not adhered to, despite the involvement of appropriately competent people.