CONFERENCE PROCEEDINGS OF REFEREED PAPERS
Proceedings of the Improving Systems and Software Engineering Conference (ISSEC)
Achieving the Vision
Canberra, 10-12 August 2009
Editor: Angela Tuffley
All papers contained in these Proceedings have been subjected to anonymous peer review by
at least two members of the review panel.
Publication of any data in these proceedings does not constitute a recommendation. Any
opinions, findings or recommendations expressed in this publication are those of the authors
and do not necessarily reflect the views of the conference sponsors. All papers in this
publication are copyright (but have been released for reproduction) by their respective authors
and/or organisations.
ISBN: 978-0-9807680-0-8
CONFERENCE SECRETARIAT
Eventcorp Pty Ltd
PO Box 3873
South Brisbane BC QLD 4101
AUSTRALIA
Tel: +617 3334 4460
Fax: +617 3334 4499
CONTENTS
ABSTRACTS AND BIOGRAPHIES
SYSTEMS ENGINEERING AND SYSTEMS INTEGRATION
THE EFFECT OF THE DISCOVERY OF ADDITIONAL WORK ON THE DYNAMIC
PROJECT WORK MODEL .................................................................................................................. 15
PETER A.D. BROWN, ALAN C. MCLUCAS, MICHAEL J. RYAN
LESSONS LEARNED FROM THE SYSTEMS ENGINEERING MICROCOSM SANDPIT..... 27
QUOC DO, PETER CAMPBELL, SHRAGA SHOVAL, MATTHEW J. BERRYMAN, STEPHEN COOK, TODD
MANSELL AND PHILLIP RELF
MAKING ARCHITECTURES WORK FOR ANALYSIS ................................................................ 37
RUTH GANI, SIMON NG
SYSTEMS ENGINEERING IN-THE-SMALL: A PRECURSOR TO SYSTEMS ENGINEERING
IN-THE-LARGE...................................................................................................................................... 49
PHILLIP A. RELF, QUOC DO, SHRAGA SHOVAL, TODD MANSELL, STEPHEN COOK, PETER CAMPBELL,
MATTHEW J. BERRYMAN
SOFTWARE ENGINEERING
REQUIREMENTS MODELLING OF BUSINESS WEB APPLICATIONS: CHALLENGES
AND SOLUTIONS .................................................................................................................................. 65
ABBASS GHANBARY, JULIAN DAY
DESIGN UNCERTAINTY THEORY - EVALUATING SOFTWARE SYSTEM
ARCHITECTURE COMPLETENESS BY EVALUATING THE SPEED OF DECISION
MAKING .................................................................................................................................................. 77
TREVOR HARRISON, PROF. PETER CAMPBELL, PROF. STEPHEN COOK, DR. THONG NGUYEN
PROCESS IMPROVEMENT
APPLYING BEHAVIOR ENGINEERING TO PROCESS MODELING....................................... 95
DAVID TUFFLEY, TERRY ROUT
SAFETY MANAGEMENT AND ENGINEERING
KEYNOTE ADDRESS AND PAPER
BRINGING RISK-BASED APPROACHES TO SOFTWARE DEVELOPMENT PROJECTS 111
FELIX REDMILL
MODEL-BASED SAFETY CASES USING THE HIVE WRITER................................................. 123
TONY CANT, JIM MCCARTHY, BRENDAN MAHONY AND KYLIE WILLIAMS
THE APPLICATION OF HAZARD RISK ASSESSMENT IN DEFENCE SAFETY
STANDARDS ......................................................................................................................................... 135
C.B.H. EDWARDS; M. WESTCOTT; N. FULTON
INTEGRATING SAFETY AND SECURITY INTO THE SYSTEM LIFECYCLE .................... 147
BRUCE HUNTER
WHAT CAN THE AGENT PARADIGM OFFER SAFETY ENGINEERING? .......................... 159
LOUIS LIU, ED KAZMIERCZAK AND TIM MILLER
COMPLEXITY & SAFETY: A CASE STUDY................................................................................. 171
GEORGE NIKANDROS
Programme Committee and Peer Review Panel
Programme Chair
Angela Tuffley (Griffith University)
Systems Engineering Chair
Stephen Cook (DASI, University of South Australia)
Software Engineering Chair
Leon Sterling (University of Melbourne)
Systems Integration Chair
Todd Mansell (Defence Science & Technology Organisation)
Process Improvement Chair
Terry Rout (Software Quality Institute, Griffith University)
Safety Management and Engineering Chair
Tony Cant (Defence Science & Technology Organisation)
Committee and Review Panel Members
Wesley Acworth
Warwick Adler
Matt Ashford
Judy Bamberger
Clive Boughton
Peter Campbell
David Carrington
Quoc Do
Tim Ferris
Aditya Ghose
Jim Kelly
Martyn Kibel
Peter Lindsay
Viviana Mascardi
Tariq Mahmood
Brendan Mahony
Duncan McIntyre
Tafline Murnane
George Nikandros
Adrian Pitman
Phillip A. Relf
Stephen Russell
Mark Staples
Paul Strooper
Kuldar Taveter
Richard Thomas
ABSTRACTS AND BIOGRAPHIES
“The Effect of the Discovery of Additional Work on the Dynamic Project Work Model”
- Peter Brown
ABSTRACT
In their paper “Knowledge sharing and trust in collaborative requirements analysis”, Luis Luna-Reyes
et al built on the body of system dynamics knowledge to propose a model of project work that
demonstrates key dynamic elements of IT projects. By expanding the model to include additional
criteria, useful insights into the likely impact of work required to achieve essential project outcomes –
but not identified at the beginning of a project – can be derived.
Essential new work discovered during the course of a project increases a project’s scope, requiring replanning. In addition to the relatively straight-forward scope increase resulting from undiscovered
work, work discovered late in a project usually requires some of the work already completed
satisfactorily – particularly integration work, testing and sometimes design and fabrication – to be redone. Where scope changes resulting from these two impacts are significant, re-approval or even
project termination may result.
Organisations can use insights gained through application of the expanded model to improve initial
project planning and more effectively manage ‘unknowns’ associated with project scope.
BIOGRAPHY
In 2004, following a successful career in the RAAF, Peter joined KoBold Group, a rapidly growing
services and systems company recognised in both government and commercial sectors for innovative,
high quality systems solutions and implementations. Peter is currently part of the Customs project
team responsible for managing the Australian Maritime Identification System Program, a high profile,
national maritime security-related initiative. Peter teaches engineering management and systems
dynamics at UNSW (ADFA) in a part-time capacity and is currently enrolled in a research degree into
the system dynamics aspects of project management through the School of Information Technology
and Electrical Engineering.
“Lessons Learned from the Systems Engineering Microcosm Sandpit” – Quoc Do
ABSTRACT
Lessons learned can be highly valuable to engineering projects. They provide a means for systems
engineers to thoroughly investigate and anticipate potential project risks before starting the project.
Up-front analysis of the end-to-end process pays on-going dividends. This paper describes: 1) an
evolutionary Microcosm for investigating systems integration issues, fostering model-based systems
engineering research, and accelerating systems engineering education; 2) the lessons learned during
the first stage of the Microcosm development; 3) how these lessons learned have informed the design
and implementation of the Microcosm Stage Two. Interestingly, the lessons learned from the
Microcosm Stage One reflect many of the common lessons learned found in much larger industry
projects. This demonstrates the Microcosm Sandpit’s capability in replicating a wide range of systems
development issues common to complex systems. Thus it provides an ideal environment for systems
engineering education, training and research.
BIOGRAPHY
Dr. Quoc Do works at the Defence and Systems Institute (DASI), University of South Australia. He
completed his B.Eng, M.Eng and PhD at the University of South Australia in 2000, 2002 and 2006
respectively. His research interests are in the areas of mobile robotics (UAVs & UGVs), vision
systems, systems engineering and systems integration research and education, and model-based
systems engineering.
“Making Architectures Work For Analysis” - Ruth Gani
ABSTRACT
The Department of Defence mandates an architectural approach to developing and specifying
capability. According to the Chief Information Officer Group, application of the Defence Architectural
Framework ‘enables ICT architectures to contribute to effectively acquiring and implementing
business and military capabilities, ensuring interoperability, and value for money and can be used to
enhance the decision making process’. Architectural views form a component of the Operational
Concept Document for each major project within Defence. However, the utility of architectures is
often limited by poor appreciation of their potential to support design decisions.
In support of a key Defence capability acquisition project, the Defence Science and Technology
Organisation used project architectural data to support analysis of the risks associated with the
integration of the capability into the broader Defence Information Environment (DIE). Key to the
analysis were:
• transformation of the architectural data into an analytically tractable form;
• creation of a comprehensive database open to querying; and
• presentation of results in a manner meaningful to decision-makers.
Results were expressed in terms of the impact that poor compliance with accepted DIE standards
would have on the ability of the proposed system to conduct operational missions. The methodology
used provides a template for future effective analytical use of architectures—important if Defence is to
make best use of the architectural information required for each project. The study also highlights the
importance of building an architecture with a view to its purpose and touches on the challenges of
stove-piped architectural development.
BIOGRAPHY
Ruth Gani has worked for the Defence Science and Technology Organisation since 2001. She has been
involved in a range of studies, including analysis to support capability acquisition, architecture
development and evaluation, and technology trending.
“Systems Engineering In-The-Small: A Precursor to Systems Engineering In-The-Large” - Phillip A. Relf
ABSTRACT
The teaching of the systems engineering process is made problematic due to the enormity of
experience required of a practising systems engineer. A ‘gentle’ project-based introduction to systems
engineering practice is currently being investigated under the Microcosm programme. The
Microcosm programme integrates a robotics-based system-of-systems, as the first stages in building a
systems engineering teaching environment. Lessons learnt have been collected from the Microcosm
Stage One project and the systems engineering processes, used during the project, have been captured.
This paper analyses the processes and lessons learnt, compares them against typical large-scale
Defence systems engineering projects, and discusses the lesson learnt captured by the systems
engineers who had been working in-the-small. While executing the case study, it was found that the
occurrence of the lessons learnt that are known to industry would have been militated against by the
use of robust industrial systems engineering processes, but that the Microcosm project schedule would
then have been exceeded. As the Microcosm Stage One project was
successfully completed, effort must now be expended to ensure that the participants understand the
limitations and strengths of systems engineering in-the-small procedures and also understand the
issues associated with the scaling up of the procedures.
BIOGRAPHY
Dr. Phillip Anthony Relf gained his PhD from the Engineering faculty of the University of
Technology, Sydney Australia. He has over three decades experience in large system integration, most
of which was gained while working in the Defence industry.
“Requirements Modelling of Business Web Applications: Challenges and Solutions” - Abbass Ghanbary
ABSTRACT
The success of web application development projects greatly depends upon the accurate capturing of
the business requirements. This paper discusses the limitations of the current modelling techniques
while capturing the business requirements in order to engineer a new software system. These
limitations are identified by modelling the flow of information in the process of converting user
requirements to a physical system. This paper also defines the factors that influence the change in
business requirements. Those captured business requirements are then transferred into pictorial and
visual illustrations in order to simplify the complex project. In this paper, the authors define the
limitations of the current modelling techniques while communicating those business requirements with
various stakeholders. The authors in this paper also review possible solutions for those limitations
which will form the basis for a more systematic investigation in the future.
BIOGRAPHY
Abbass Ghanbary completed his PhD at the University of Western Sydney. Abbass is
focused on the issues and challenges faced by business integration modelling techniques, investigating
improvements to Web Services applications across multiple organisations. Abbass is also a
consultant in industry in addition to his lecturing and tutoring at the university. He is a member of the
Australian Computer Society and is active in attending various forums, seminars and discussions.
Abbass is also a committee member of the Quantitative Enterprise Software Performance (QESP)
association.
“DESIGN UNCERTAINTY THEORY - Evaluating Software System Architecture
Completeness by Evaluating the Speed of Decision Making” - Trevor Harrison
ABSTRACT
There are two common approaches to software architecture evaluation [Spinellis09, p.19]. The first
class of evaluation methods determines properties of the architecture, often by modelling or simulation
of one or more aspects of the system. The second, and broadest, class of evaluation methods is based
on questioning the architects to assess the architecture. This research paper details a third, more fine-grained approach to evaluation by assuming an architecture emanates from a large set of design and
design-related decisions. Evaluating an architecture by evaluating decision making and decision
rationale is not new (see Section 3). The novel approach here is to base an evaluation largely on the
time dimensions of decision making. These time dimensions are (1) time allowed for architecting,
and (2) speed of architecting. It is proposed that progress of architecture can be measured at any point
in time. For example: “Is this project on track during the concept development stage of a system life
cycle?” The answer can come from knowing how many decisions should be expected to be finalised
at a particular moment in time, taking into account a plethora of human factors affecting the prevailing
decision-making environment. Though aimed at ongoing evaluations of large military software
architectures, the literature review for this research will examine architectural decisions from the
disciplines of systems engineering, information technology, product management and enterprise
architecture.
BIOGRAPHY
Trevor Harrison's research interests are in software systems architecture and knowledge management.
His background is in software development (real-time information systems), technology change
management and software engineering process improvement. Before studying full-time for a PhD, he
spent 6 years with Logica and 11 years with the Motorola Australia Software Centre. He has a
BSc(Hons) in Information Systems from Staffordshire University and an MBA (TechMgt) from La
Trobe University.
“Applying Behavior Engineering to Process Modeling” - David Tuffley
ABSTRACT
The natural language used by people in everyday life to express themselves is often prone to
ambiguity. Examples abound of misunderstandings occurring due to a statement having two or more
possible interpretations. In the software engineering domain, clarity of expression when specifying the
requirements of software systems is one situation where absence of ambiguity is important. Dromey’s
(2006) Behavior Engineering is a formal method that reduces or eliminates ambiguity in software
requirements. This paper seeks an answer to the question: can Dromey’s (2006) Behavior Engineering
reduce or eliminate ambiguity when applied to the development of a Process Reference Model?
BIOGRAPHY
David Tuffley is a Lecturer in the School of ICT at Griffith University, and has been a Consultant with the
Software Quality Institute since 1999. Before academia, David consulted in the computer industry for
17 years. Beginning in London in the 1980s as a Technical Writer, he progressed to business analysis
and software process improvement work. His commercial work has been in the public and private
sectors in the United Kingdom and Australia. David is currently doing postgraduate research on
developing a process reference model for the leadership of virtual teams.
KEYNOTE ADDRESS AND PAPER
“Bringing Risk-Based Approaches to Software Development Projects” - Felix Redmill
ABSTRACT
The history of software development is strewn with failed projects and wasted resources. Reasons for
this include, among others:
• Failure to take an engineering approach, despite using the epithet ‘software engineering’;
• Focus on process rather than product;
• Failure to learn lessons and use them as the basis of permanent improvement;
• Neglect to recognise the need for high-quality project management;
• Reliance on tools to the exclusion of understanding first principles; and
• Focus on what is required without consideration of what could go wrong.
If change is to be achieved, and software development is to become an engineering discipline, an
engineering approach must be embraced. This paper does not attempt to spell out the many aspects
of engineering discipline. Rather, it addresses the risk-based way of thinking and acting that typifies
the modern engineering approach, particularly in safety engineering, and it proposes a number of ways
in which a risk-based approach may be incorporated into the structure of software development.
Taking a risk-based approach means attempting to predict what undesirable outcomes could occur in
the future (within a defined context) and taking decisions – and actions – to provide an appropriate
level of confidence that they will not occur. In other words, it uses knowledge of risk to inform
decisions and actions. But, if knowledge of risk is to be used, that knowledge must be gained, which
means acquiring appropriate information.
In safety engineering, such an approach is essential because the occurrence of accidents deemed to be
preventable is not considered acceptable. (As retrospective investigation almost always shows how
accidents could have been prevented, this often gives rise to contention, but that’s another matter.) In
the security field, although a great deal of practice is carried out ad hoc, standards are now based on a
risk-based approach: identifying the threats to a system, determining the system’s vulnerabilities, and
planning to nullify the threats and reduce the vulnerabilities in advance.
However, in much of software development, the typical approach is to arrive at a product only by
following a specification of what is required. Problems are found and fixed rather than anticipated, and
consideration is seldom given to such matters as the required level of confidence in the ‘goodness’ of
any particular system attributes.
A risk-based approach carries the philosophy of predicting and preventing, and this is an asset both in
the development of products and the management of projects. This paper therefore proposes some first
steps in creating a foundation for the development of such an approach in software development and
project management. The next section briefly introduces the subject of risk, and this is followed by
introductions to two techniques, used in risk analysis, which are applicable in all fields and are
therefore useful as general-purpose tools. Subsequent sections offer thoughts on the introduction of a
risk-based approach into the various stages of software development projects.
It is hoped that the explanations offered in this paper are easily understandable, but they do not
comprise a textbook. Risk is a broad and tricky subject, and this paper does not purport to offer a full
education in it.
BIOGRAPHY
Based in London, UK, Felix Redmill is a self-employed consultant, lecturer and writer. His fields of
activity are the related subjects of risk and its management, safety engineering, project management,
software engineering, and the application of risk principles to other fields, such as software testing.
With a BSc in Electrical Engineering, he started work as a Computer Programmer and, thereafter, had
parallel careers in telecommunications and software engineering, as engineer and manager, for more
than 20 years prior to setting up his own consulting business. In the 1970s he attended Manchester
University on a bursary to do an MSc in Computation, and was seconded to Essex University to carry
out research into the stored program control of telephone exchanges. He has since conducted private
research into several subjects, including risk-based software testing. He gained experience in all
aspects of engineering, including maintenance, managed many system-development projects and, as
head of department, designed and led a number of quality-improvement campaigns.
He was the inaugurating Co-ordinator of the UK Safety-Critical Systems Club in 1991, organised
sixteen annual Safety-critical Systems Symposia, and still edits its newsletter, Safety Systems, which
is now in its eighteenth year.
Felix has been an invited lecturer at several universities in the UK and other countries in safety
engineering and management and in various aspects of software engineering and is an Honorary
Professor at Lancaster University. He has published and presented papers and articles on many
subjects, including telecommunications, computing, software engineering, project management,
requirements engineering, Fagan inspection, quality management, risk, safety engineering, the safety
standard IEC 61508, and engineering education. Some papers and articles have been published in
other languages: French, Spanish, Russian, Polish, Arabic, and Japanese. He has also written and
edited a number of books on some of these subjects, and has been invited to give keynote addresses in
the USA, Australia, India, Poland, Germany, as well as the UK.
He is a Chartered Engineer, a Fellow of both the Institution of Engineering and Technology and the
British Computer Society, and a Member of the Institute of Quality Assurance. He is currently active
in promoting professionalism among safety engineers, developing the profession of safety engineering
and helping to define its direction.
“Model-Based Safety Cases Using the HiVe Writer” - Tony Cant
ABSTRACT
A safety case results from a rigorous safety engineering process. It involves reasoned arguments,
based on evidence, for the safety of a given system. The DEF(AUST)5679 standard provides detailed
requirements and guidance for the development of a safety case. DEF(AUST)5679 safety cases
involve a number of highly inter-related documents; tool support is needed to manage the process and
to maintain consistency in the face of change. The HiVe Writer is a tool that supports structured
technical documentation via a centrally-managed datastore so that any documents created within the
tool are constrained to be consistent with this datastore and therefore with each other. This paper
discusses how the HiVe Writer can be used to support safety case development. We consider the
safety case for a fictitious Phased Array Radar Target Illuminator (PARTI) system and show how the
HiVe Writer can support hazard analysis for the PARTI system.
BIOGRAPHY
Tony Cant currently leads the High Assurance Systems (HAS) Cell in DSTO’s Command, Control,
Communications and Intelligence Division. His work focuses on the development of tools and
techniques for providing assurance that critical systems will meet their requirements. Tony has also
led the development of the newly published Defence Standard DEF(AUST)5679 Issue 2, entitled
“Safety Engineering for Defence Systems”.
Tony obtained a BSc(Hons) in 1974 and PhD in 1979 from the University of Adelaide, as well as a
Grad Dip in Computer Science from the Australian National University (ANU) in 1991. He held
research positions in mathematical physics at the University of St Andrews, Tel Aviv University, the
University of Queensland and the ANU. He also worked in the Commonwealth Department of
Industry, Technology and Commerce in science policy before joining DSTO in 1990.
“The Application of Hazard Risk Assessment in Defence Safety Standards” - C.B.H.
Edwards
ABSTRACT
Hazard Risk Assessment (HRA) is a special case of Probabilistic Risk Assessment (PRA) and
provides the theoretical basis for a number of safety standards. Measurement theory suggests that
implicit in this basis are assumptions that require careful consideration if erroneous conclusions about
system safety are to be avoided. These assumptions are discussed and an extension of the HRA
process is proposed. The methodology of this extension is exemplified in recent work by Jarrett and
Lin. Further development of safety standards and the possibility of achieving a harmonization of the
different approaches to assuring system safety are suggested.
BIOGRAPHY
Christopher Edwards is a Senior Systems Safety Analyst. Prior to joining Defence in 1979 Chris was a
member of the CSIRO's Division of Mathematics and Statistics where he worked as a consultant
statistician. Chris has over 15 years experience in the management and development of safety-critical
software intensive systems. Since retiring in 2001 he has been contracted as the Safety Evaluator for a
number of defence systems which have involved the use of Def(Aust)5679 as the safety standard.
Chris is currently the Treasurer of the Australian Safety Critical Systems Association (aSCSa) and sits
on the executive committee of that organisation.
“Integrating safety and security into the system lifecycle” - Bruce Hunter
ABSTRACT
We live in a brave new world where Information Security threats emerge faster than control
mechanisms can be deployed to limit their impact. Information Security is not only an issue for
financial systems but has greater risks for control systems in critical infrastructure, which depend not
only on their continued functionality, but also on the safety of their operation.
This new dimension to the dependability of systems has been recognised by some safety and security
standards but not much has been done to ensure the conflicting requirements and measures of security
and safety are effectively managed.
Conflicts in the implementation of safety and security aspects of systems arise from the differing
values and objectives they are to achieve. Neglecting the relationship between functional, safety and
security issues can lead to systems that are neither functional, safe nor secure.
This paper proposes an integrated model to ensure the safety and security requirements are effectively
treated throughout the system lifecycle, along with functional and performance elements, maintaining
ongoing compatibility between their individual objectives.
BIOGRAPHY
Bruce Hunter ([email protected]) is the Quality and Business Improvement Manager
for the Security Solutions & Services and Aerospace divisions of Thales Australia. In this role Bruce
is responsible for product and process assurance as well as the management of its reference system and
its improvement.
Bruce has a background in IT, systems and safety engineering in the fire protection and emergency
shutdown industry and has had over 30 years of experience in the application of systems and software
processes to complex real-time software-based systems.
Bruce is a contributing member of Standards Australia IT6-2 committee, which is currently reviewing
the next edition of the IEC61508 international functional safety standards series. Bruce is also a
Certified Information Security Manager and Certified Information Systems Auditor.
“What can the agent paradigm offer safety engineering?” - Tim Miller
ABSTRACT
A current trend in safety-critical applications is towards larger, more complex systems. The agent
paradigm is designed to support the development of such complex systems. Despite this, agents are
having minimal impact in safety-critical applications.
In this paper, we investigate how the agent paradigm offers benefits to traditional safety engineering
processes. We demonstrate that concepts such as roles, goals, and interactions narrow the gap
between engineering and safety analysis, and provide a natural mechanism for managing re-analysis
after change. Specifically, we investigate the use of HAZard and OPerability studies (HAZOP) in
agent-oriented software engineering. This offers a first step towards broadening the scope of systems
that can be analyzed using agent-oriented concepts.
BIOGRAPHY
Tim Miller is a lecturer in the Department of Computer Science and Software Engineering at
University of Melbourne. Tim completed his PhD at the University of Queensland before taking up a
four-year postdoctoral research position at the University of Liverpool, UK, where he worked on the
highly successful PIPS (Personalised Information Platform for Life and Health Services) project. Tim's
research interests include agent-oriented software engineering, models of multi-agent interaction,
computational modelling & analysis of complex systems, and software testing.
“Complexity & Safety: A Case Study” - George Nikandros
ABSTRACT
Despite correct requirements, competent people, and robust procedures, unsafe faults occasionally
arise. This paper reports on one such incident; one that involves a railway level crossing. Whilst the
direct cause of the failure was defective application control data, it was a defect that would be difficult
to foresee and if foreseen, to test for.
A sequel to this failure is the sequence of events to correct the defect. In the haste to correct the defect,
another unsafe failure was introduced.
BIOGRAPHY
George is an electrical engineer with some 30 years experience in railway signalling. George is
chairman of the Australian Safety Critical Systems Association. George has published papers, is
credited as the author of the Standards Australia Handbook “Safety Issues for Software” and a co-author of the book “New Railway Environment – A multi-disciplinary business concept”. George
represents the Australian Computer Society on the Engineers Australia/ Australian Computer Society
Joint Board in Software Engineering. George is a Chartered Member of Engineers Australia, a Fellow
of the Institution of Railway Signal Engineers, and a Member of the Australian Computer Society.
SYSTEMS ENGINEERING
AND
SYSTEMS INTEGRATION
THE EFFECT OF THE DISCOVERY OF ADDITIONAL
WORK ON THE DYNAMIC PROJECT WORK MODEL
Peter A.D. Brown, Alan C. McLucas, Michael J. Ryan
ABSTRACT
In their paper “Knowledge sharing and trust in collaborative requirements analysis”,
Luis Luna-Reyes et al built on the body of system dynamics knowledge to propose a
model of project work that demonstrates key dynamic elements of IT projects. By
expanding the model to include additional criteria, useful insights into the likely
impact of work required to achieve essential project outcomes – but not identified at
the beginning of a project – can be derived.
Essential new work discovered during the course of a project increases a project’s
scope, requiring re-planning. In addition to the relatively straight-forward scope
increase resulting from undiscovered work, work discovered late in a project usually
requires some of the work already completed satisfactorily – particularly integration
work, testing and sometimes design and fabrication – to be re-done. Where scope
changes resulting from these two impacts are significant, re-approval or even project
termination may result.
Organisations can use insights gained through application of the expanded model to
improve initial project planning and more effectively manage ‘unknowns’ associated
with project scope.
BACKGROUND
Project Work Model
In their paper “Knowledge sharing and trust in collaborative requirements analysis”,
Luis Luna-Reyes, Laura J. Black, Anthony M. Cresswell and Theresa A. Pardo
(Luna-Reyes et al. 2008) built on the body of system dynamics knowledge to propose
a model of system development project work that demonstrates key dynamic elements
of IT projects. Their model is represented in Figure 1 below, modified only to observe
current stock-and-flow diagramming conventions (McLucas 2003), (Sterman 2000),
(Coyle 1996).
The authors intend to further develop the model proposed by Luna-Reyes et al to
facilitate a systemic examination of the dynamic structure and characteristics of
projects and their effects on projects’ performance and outcomes. This paper will
examine the effects of discovering additional work after a project starts on that
project’s performance. It is intended to be the first in a series of papers examining
projects’ dynamic behaviour, eventually leading to improved investment decision-making.
Figure 1: Representation of Luna-Reyes Dynamic Project Work Model (Luna-Reyes et al. 2008)
The Luna-Reyes Stock-and-Flow diagram shows that, of an amount of work to be
done, some is done correctly and some is flawed and must be reworked. It also shows
that of the re-work undertaken, a proportion is likely to contain flaws and require reworking yet again.
In addition to a parameter Error Fraction that represents the probability of doing work
incorrectly, Luna-Reyes et al recognise three highly aggregated parameters that
influence the behaviour of the model: Concreteness, Transformability and Learning-by-Doing. It is the interaction of these parameters with the other elements of a
project’s structure that causes the changes over time that can be so difficult to manage
effectively and that impact so significantly on project outcomes. An understanding of
these parameters is crucial to the development of a valid model for use in simulating
project dynamic behaviour and performance. An improved appreciation of the
influences of the key parameters may be gained from the influence diagram in
Figure 2 below.
Figure 2: Influence diagram – Dynamic Project Work Model based on work by Luna-Reyes et al.
Concreteness
In their paper, Luna-Reyes et al examine the efficacy of requirements identification
and definition in a collaborative project environment. In doing so, they define
Concreteness as the ‘specificity’ of representations elicited during cross-boundary or
trans system interface work on requirements discovery. However, in the more general
context of dynamic project work modelling, Concreteness could be better interpreted
as those representations that depict the extent to which the mechanisms and
interrelationships influencing a system are understood in terms of their potential to
contribute to or impede achievement of desired outcomes. In other words,
Concreteness is a picture of how visible key stakeholders’ wants and needs are to the
organisation or organisations developing a specific system or solution, how far apart
those various needs are and what constraints (dependent and independent, exogenous
and endogenous) help make up the specific development environment.
Factors influencing concreteness may include, inter alia:
• the maturity and worldliness of stakeholder organisations;
• how clearly each stakeholder organisation understands their own and other stakeholders’ needs and desired project outcomes;
• the validity and completeness of, and the disparity between, organisations’ mental models of the system development environment and desired project outcomes; and
• the specificity and tangibility of such understanding to organisations’ abilities to convert ‘understanding’ to ‘achievement’.
In the model, ‘Concreteness’ is applied directly to the Rate of Doing New Work
Correctly, the Rate of Doing New Work Incorrectly and the Rate of Discovering
Rework.
Transformability
Luna-Reyes et al define ‘Transformability’ as the likelihood that an organisation or
actor is able to recognise deficiencies in concreteness representations and define and
successfully apply the corrective actions needed (Luna-Reyes et al. 2008). In a more
general sense, Transformability is the ability to recognise deficiencies in requirements
and take action to correct them.
Ǯ”ƒ•ˆ‘”ƒ„‹Ž‹–›ǯ‹•ƒ’’Ž‹‡†–‘–Š‡ƒ–‡‘ˆ‘‹‰‡™‘”…‘””‡…–Ž›ƒ†–Š‡ƒ–‡
‘ˆ‘‹‰‡™‘”…‘””‡…–Ž›Ǣ‹–‹•‘–ƒ’’Ž‹‡†–‘–Š‡—†‡”–ƒ‹‰‘ˆǮ‡™ǯ™‘”Ǥ
Learning-by-Doing
‘Learning-by-Doing’ is a parameter that represents the level of knowledge of
stakeholders’ needs and desired outcomes, including knowledge of effective ways of
achieving those outcomes. Learning-by-Doing is an output parameter only,
influenced by the Rate of Doing New Work Correctly, the Rate of Doing Rework
Correctly and the Rate of Discovering Rework.
Error Fraction
As indicated above, Error Fraction is the probability that the outcome of work
attempted will be incorrect or faulty, and it relates to how difficult specific work is to
the organisation or organisations undertaking that work. Error Fraction applies to all
rates in the model except the Rate of Discovering Rework.
Parameter definitions
Currently, apart from the context-constrained definitions of the key model parameters
offered by Luna-Reyes et al, there is little other than empirical evidence and logical
thinking to support several of the definitions above. For now it seems appropriate to
define the parameters and their relationships with each other using a ‘black box’
approach. In doing so, key parameters can be defined in terms of their inputs and
outputs, what elements of the model they influence and to what extent they do so. The
authors offer the modified definitions of ‘Concreteness’, ‘Transformability’ and
‘Learning-by-Doing’ hypothesised above more in response to a need for these
definitions to effectively model projects than to any mathematical proof or agreed
definition of the terms.
It should be noted that the key parameters (especially ‘Transformability’ and
‘Concreteness’) are likely to influence the various rates in the model in different ways
and to different extents. A parameter may have a negligible influence on some rates
for some projects and a massive influence on others. Furthermore, a parameter might
only influence a subset of the domain of causes of a rate’s variations, not the whole
domain. The nature of the key parameters ‘Concreteness’, ‘Transformability’ and
‘Learning-by-Doing’ and their relationships with other project parameters requires
further research and will be addressed in future papers.
How the basic model works
In the basic model at Figure 1, work is done for as long as the stock KNOWN NEW
WORK remains greater than zero. New work is undertaken at a certain rate which
may vary over time, and will result either in work done correctly at a rate influenced
by the rate factor ‘(1-Error Fraction)’, accumulating in the stock WORK DONE
CORRECTLY, or work done incorrectly at a rate influenced by the rate factor ‘Error
Fraction’, accumulating in the stock UNDISCOVERED REWORK. The need for
rework is discovered gradually, and work transfers from UNDISCOVERED
REWORK to KNOWN REWORK at the rate the need for rework is discovered. The
discovery of rework is influenced by Concreteness. The better requirements are
understood, the more likely and more quickly deficiencies will be recognised. Known
Rework is then processed, influenced by the Error Fraction and Transformability and
flows either to WORK DONE CORRECTLY or back to UNDISCOVERED
REWORK.
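For readers who wish to experiment with the behaviour described above, a minimal numerical sketch of the basic model is given below. The paper does not specify the rate equations, so the flows here are simple capped, proportional rates; the parameters work_rate, rework_rate and discovery_fraction are illustrative assumptions rather than values taken from Luna-Reyes et al.

# Minimal Euler-integration sketch of the basic Dynamic Project Work Model.
# Stock names follow the paper; the rate equations and default parameter
# values are illustrative assumptions, not the authors' formulation.
def simulate_basic(known_new_work=2000.0, error_fraction=0.1,
                   concreteness=1.0, transformability=1.0,
                   work_rate=10.0, rework_rate=10.0,
                   discovery_fraction=0.05, dt=1.0, max_days=1000):
    """Run until (almost) all work sits in WORK DONE CORRECTLY; return (days, stocks)."""
    work_done_correctly = 0.0
    undiscovered_rework = 0.0
    known_rework = 0.0
    day = 0.0
    while known_new_work + undiscovered_rework + known_rework > 0.5 and day < max_days:
        # New work is processed at a capped rate; Concreteness scales throughput
        # and Error Fraction splits the flow into correct and flawed output.
        new_flow = min(work_rate * concreteness, known_new_work / dt)
        new_correct = new_flow * (1.0 - error_fraction)
        new_incorrect = new_flow * error_fraction
        # Flawed work is discovered gradually; Concreteness aids discovery.
        discovered = undiscovered_rework * discovery_fraction * concreteness
        # Known rework is processed; Transformability scales throughput.
        rework_flow = min(rework_rate * transformability, known_rework / dt)
        rework_correct = rework_flow * (1.0 - error_fraction)
        rework_incorrect = rework_flow * error_fraction
        # Update the four stocks.
        known_new_work -= new_flow * dt
        work_done_correctly += (new_correct + rework_correct) * dt
        undiscovered_rework += (new_incorrect + rework_incorrect - discovered) * dt
        known_rework += (discovered - rework_flow) * dt
        day += dt
    return day, {"WORK DONE CORRECTLY": work_done_correctly,
                 "UNDISCOVERED REWORK": undiscovered_rework,
                 "KNOWN REWORK": known_rework}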
It is important to note that in an actual project, stakeholders do not have visibility of
accumulating UNDISCOVERED REWORK, believing instead that this work has
accumulated as WORK DONE CORRECTLY. Stakeholders only recognise the
requirement for rework as requirements thought to be met are found not to be. In the
model, recognition causes flawed work to flow from UNDISCOVERED REWORK
to KNOWN REWORK at the rate the need for rework is discovered, but in real life,
stakeholders are likely to visualize rework flowing from WORK DONE
CORRECTLY to KNOWN REWORK.
Issues with the basic model
Most project management methodologies in use today place heavy emphasis on
adequately defining requirements prior to solution development and implementation.
While this optimal outcome is possible in some circumstances where a project’s field
of endeavour is well travelled and relatively straightforward, it is rarely the case for
complex projects. This is particularly so for ‘green field’ endeavours where the
system being developed will be the initial system and developers must do without a
bottom-up view of the business needs. It may also be the case in circumstances where
businesses’ strategic, operational and working level plans require substantial
development or re-development. Likewise, in cases when substantial organisational
change will be required to fully utilise the capability a solution provides, it may not be
possible to develop a mature set of requirements prior to the commencement of
system development. These realities have been demonstrated on many occasions and
are the reason that more complex system life cycle models such as evolutionary
acquisition and spiral development were conceived and are in extensive use today.
Given that requirements often cannot be comprehensively and accurately established
prior to complex system design, development and in some cases, construction and
implementation, it should be recognised that requirements must continue to be
elaborated as projects proceed. Consequently, new work will nearly always be
discovered over the life of a project, including after the time requirements
development activities were scheduled to be complete. Therefore – referring to the
model – KNOWN NEW WORK will not always be the known quantity it ideally
should be, but will continue to vary over the life of a project every time new work is
discovered.
Adapting the Model
If we are to examine the effects of additional work discovered after a project has
commenced, the basic Luna-Reyes model must be modified by the addition of
elements representing the discovery of new work. An influence diagram for the
modified model is shown in Figure 3 below.
Figure 3: Influence diagram accounting for discovery of new work after project commencement
The new model shows that as additional new work is discovered over the duration of
a project, the value of KNOWN NEW WORK increases. Additionally, the discovery
of new work will often require some of the work already done correctly (WORK
DONE CORRECTLY) to be redone, increasing the project’s rework (KNOWN
REWORK) as a result. Even more distressingly, the discovery of new work after a
project has commenced sometimes makes work already completed and accepted
redundant. A stock-and-flow representation of the adapted model is shown below in
Figure 4.
Figure 4: Stock-and-Flow Diagram – Dynamic Project Work Model modified to account for new work after project commencement
Because ‘Learning by Doing’ is an output parameter and it is not of central interest to
this discussion, it can be removed from the model for the purposes of this study
without affecting the validity of simulation results.
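The extension can be layered onto the basic simulation sketched earlier; the flow equations below are an assumed reading of Figure 4 rather than the authors' implementation. The stock of undiscovered new work would be initialised from the complexity of the project, and the 10% and 20% defaults anticipate the redundancy and rework fractions adopted later in this paper.

# Sketch of the discovery-of-new-work extension (an assumed reading of Figure 4).
# A stock of undiscovered new work drains into KNOWN NEW WORK, and fractions of
# that discovery rate pull completed work back into rework or discard it.
def discovery_step(undiscovered_new_work, known_new_work, work_done_correctly,
                   known_rework, discovery_rate, redundancy_fraction=0.10,
                   rework_fraction=0.20, dt=1.0):
    """Advance the new-work flows by one step; return the four updated stocks."""
    discovered = min(discovery_rate, undiscovered_new_work / dt)
    made_redundant = min(discovered * redundancy_fraction, work_done_correctly / dt)
    sent_to_rework = min(discovered * rework_fraction,
                         max(work_done_correctly / dt - made_redundant, 0.0))
    undiscovered_new_work -= discovered * dt
    known_new_work += discovered * dt
    # Redundant work simply leaves the model; reworked tasks rejoin KNOWN REWORK.
    work_done_correctly -= (made_redundant + sent_to_rework) * dt
    known_rework += sent_to_rework * dt
    return undiscovered_new_work, known_new_work, work_done_correctly, known_rework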
Perfect Project
In order to establish a project performance baseline against which other variations and
enhancements may be compared, the key parameters ‘Concreteness’,
‘Transformability’ and ‘Error Fraction’ are given values that will equate to a ‘perfect’
project. Also, in a perfect project, no additional work would be discovered. To
achieve this, the value for Error Fraction was set and maintained at zero to represent
no errors and the values for Transformability and Concreteness were set and
maintained at 100%. For the perfect project, the value of additional work discovered
was also set to zero.
As a result of these changes, it is possible to simplify the model further for this special
case only (see Figure 5 below).
Figure 5: Influence Diagram of ‘perfect’ Dynamic Project Work Model
The stock-and-flow diagram for this ‘perfect’ model is shown below in Figure 6. It is
the same as that in Figure 4 but because there are no errors, no undiscovered new
work and ‘Transformability’ and ‘Concreteness’ have values of 1, only the top part of
the basic model is used.
Figure 6: Stock-and-Flow Diagram of ‘Perfect’ Dynamic Project Work Model
MODEL BEHAVIOUR
Assumptions and constraints
When examining a model’s behaviour, it is first necessary to understand the
assumptions and constraints built into it. Firstly, major projects are usually developed
in phases and stages. This model examines one phase/stage only and does not
consider inter-stage influences in the analysis of project performance.
The discovery of new work is assumed to occur in very small increments and is
treated as continuous for the purposes of this study. In real life, discovery of
additional work is likely to occur in discrete parcels; further research on the way new
work is discovered after a project has begun is required and will be addressed in
future papers.
In the model, for this study, the rates at which new work is undertaken (correctly and
incorrectly), the rate at which rework is discovered and the rate at which rework is
undertaken (correctly and incorrectly) are held constant. In real life (and in future
research), rates will vary over the life of a project.
The parameter values used in this model were selected as reasonable values to start
with based on the authors’ experience. They will be subject to more rigorous research
and analysis in the near future.
In setting parameter values, it was assumed that the amount of additional work
discovered after a project has commenced (UNDISCOVERED NEW WORK) is
related to the complexity of the system being developed and the environment it is
being developed in. Therefore a parameter representing an aggregation of influences
loosely titled ‘complexity’ was adopted and expressed as a fraction of the project’s
initially known new work. It was also recognised that the position on a project’s time
line when additional work was discovered (represented in the model as the Rate New
Work Discovered) might be just as relevant to project performance as the amount of
additional work discovered. For consistency, the rate new work is discovered was
based on the rate at which KNOWN NEW WORK is processed. Finally, the rates at
which new work discovered caused work already completed and accepted (WORK
DONE CORRECTLY) to be reworked or to become redundant was set to a fraction
of the rate of discovery of the additional work (Rate New Work Discovered).
Project performance baseline
Initially, the model was set up as a ‘perfect’ project comprising 2000 tasks with zero
errors, Concreteness and Transformability set at 100% and no additional work
discovered. The duration for the ‘perfect’ project was 200 days.
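As a consistency check, the baseline can be reproduced with the illustrative simulator sketched earlier, assuming a processing rate of 10 tasks per day (2000 tasks in 200 days):

# Baseline ('perfect') project: zero errors, Concreteness and Transformability at
# 100%, no additional work; 10 tasks/day is an assumed rate implied by the baseline.
days, stocks = simulate_basic(known_new_work=2000.0, error_fraction=0.0,
                              concreteness=1.0, transformability=1.0,
                              work_rate=10.0)
assert days == 200.0 and stocks["WORK DONE CORRECTLY"] == 2000.0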
The effect of additional work on project performance
Based on the assumptions and constraints above, three test cases each containing three
scenarios were constructed.
Table 1: DPWM modelling cases

Test Case  Scenario  Rate of Discovery of Additional Work                 Complexity
0          -         No additional work                                   Not applicable
1          a         Slow Discovery Rate = 0.1 * rate of doing new work   Simple (5% of Initial Known New Work)
1          b         Slow Discovery Rate = 0.1 * rate of doing new work   Complex (25% of Initial Known New Work)
1          c         Slow Discovery Rate = 0.1 * rate of doing new work   Very Complex (50% of Initial Known New Work)
2          a         Medium Discovery Rate = 1 * rate of doing new work   Simple (5% of Initial Known New Work)
2          b         Medium Discovery Rate = 1 * rate of doing new work   Complex (25% of Initial Known New Work)
2          c         Medium Discovery Rate = 1 * rate of doing new work   Very Complex (50% of Initial Known New Work)
3          a         Fast Discovery Rate = 2 * rate of doing new work     Simple (5% of Initial Known New Work)
3          b         Fast Discovery Rate = 2 * rate of doing new work     Complex (25% of Initial Known New Work)
3          c         Fast Discovery Rate = 2 * rate of doing new work     Very Complex (50% of Initial Known New Work)
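The test cases in Table 1 can be expressed as a simple parameter sweep. The sketch below assumes the baseline figures of 2000 initially known tasks and a processing rate of 10 tasks per day; the rate multipliers and complexity fractions are those of Table 1 (the case 0 baseline is omitted).

# Parameter grid for the Table 1 sweep (illustrative; base figures assumed).
BASE_RATE = 10.0          # tasks/day - rate of doing new work
INITIAL_SCOPE = 2000.0    # initially known new work (tasks)

discovery_rates = {"slow": 0.1 * BASE_RATE,
                   "medium": 1.0 * BASE_RATE,
                   "fast": 2.0 * BASE_RATE}
complexities = {"simple": 0.05, "complex": 0.25, "very complex": 0.50}

# Each scenario pairs a discovery rate (cases 1-3) with the stock of additional
# work implied by its complexity fraction (scenarios a-c).
scenarios = [(rate_name, cplx_name,
              {"discovery_rate": rate,
               "undiscovered_new_work": fraction * INITIAL_SCOPE})
             for rate_name, rate in discovery_rates.items()
             for cplx_name, fraction in complexities.items()]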
A plot of the test case results is shown in Figure 7 below.
Figure 7: Effect of additional work on project performance¹
¹ The display for the Fast discovery rate overlies that for Medium.
The model was designed to simulate work already completed and accepted that
required rework or was made redundant as a result of the discovery of additional
work. Arbitrary values were selected: 10% for the rate at which newly discovered work made
already-completed work redundant, and 20% for the rate at which it caused completed work to
require rework. The values for total redundant and reworked tasks are shown
in Table 2 below.
Table 2: Work made redundant or requiring rework due to discovery of new work

Complexity (% of Initial Known Work)   Completed work made redundant (Tasks)   Completed work requiring rework (Tasks)
5%                                     10                                      20
25%                                    50                                      100
50%                                    100                                     200
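As an illustrative check on the middle row of Table 2, assuming the 10% and 20% fractions apply to the total additional work discovered:

additional work discovered = 0.25 × 2000 = 500 tasks
completed work made redundant = 0.10 × 500 = 50 tasks
completed work requiring rework = 0.20 × 500 = 100 tasks

The 5% and 50% rows follow in the same way.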
Discussion
Predictably, the plot shows that the addition of new work will extend project duration
for all test cases. It also shows that for the slower rate of discovery and more complex
projects, duration may be extended because not all the new work is discovered before
the known new work is completed. In reality, that situation would not normally arise,
but there have been instances of projects that haven’t stopped due to the continued
addition of new work.
The absolute values for work made redundant or requiring re-work due to the
discovery of additional work are relatively small; however, if you consider the data in
the context that a project was approved on the basis that it would take 200 days to
complete and that due to additional work, that duration increased to 300 days of
which more than 20 days work had to be re-done and over 10 days work could not be
used, the impact becomes significant.
The significance of the discovery of additional work will vary from project to project
depending on what effort and resources are required to undertake the work, how those
resources are allocated and how large an impact the additional work has on work
already completed. It is generally appreciated, though, that the later in a project
changes are introduced, the greater the impact they are likely to have on work already
completed and overall project cost.
Conclusions
This paper builds on the body of system dynamics and project management work in
proposing an improved model for examining the dynamic behaviour of projects in
relation to performance and outcomes. Even in its current basic form, the model
provides useful insights into the ways the discovery of additional work over a
project’s life affects project performance.
It is not always possible, particularly for complex projects, to comprehensively map
and define project requirements prior to system design and development. From
experience and the preliminary modelling results outlined in this paper, it is likely that
a project’s performance will be sensitive to the discovery of additional work after
commencement, causing not only significant schedule and cost over-runs but also
unanticipated rework and sometimes even a portion of work completed to become
nugatory and be discarded.
Future work
Further research into the nature of new work, how it is discovered and how sensitive
project duration is to the time it takes to replan, accommodate and deal with
additional work is required, as is improved definition of key dynamic project
parameters including Error Fraction, Concreteness, Transformability and Learning-by-Doing, followed by analysis of their influence on project performance and
outcomes. Future research will build on the modest start described in this paper,
resulting (it is hoped) in a more effective means of managing dynamic project and
program risk.
Bibliography
Abdel-Hamid, Tarek, and Stuart Madnick. Software Project Dynamics: An Integrated Approach. Upper Saddle River, New Jersey: Prentice Hall, 1991.
Coyle, R. G. System Dynamics Modelling: A Practical Approach. Vol. 1. London: Chapman & Hall, 1996.
Luna-Reyes, Luis F., Laura J. Black, Anthony M. Cresswell, and Theresa A. Pardo. “Knowledge sharing and trust in collaborative requirements analysis.” System Dynamics Review (John Wiley & Sons Ltd) 24, no. 3 (November 2008): 265-297.
McLucas, Alan C. Decision Making: Risk Management, Systems Thinking and Situation Awareness. Canberra, ACT: Argos Press, 2003.
Sterman, John. Business Dynamics: Systems Thinking and Modelling for a Complex World. Boston, Massachusetts: Irwin/McGraw-Hill, 2000.
Wolstenholme, Eric. System Enquiry: A System Dynamics Approach. Chichester, West Sussex: John Wiley & Sons Ltd, 1990.
Lessons Learned from the
Systems Engineering Microcosm Sandpit
Quoc Do, Peter Campbell, Shraga Shoval, Matthew J. Berryman, Stephen Cook, Todd Mansell and Phillip Relf

Defence and Systems Institute, University of South Australia, Mawson Lakes Campus, SA
Ph. +61 8 8302 3551
Email: quoc.do, peter.campbell, shraga.shoval, matthew.berryman, stephen.cook {@unisa.edu.au}

Defence Science and Technology Organisation, Edinburgh, South Australia
Ph. +61 8 8259 7566
Email: todd.mansel, phillip.relf {@dsto.defence.gov.au}
Abstract
Lessons learned can be highly valuable to engineering projects. They provide a means for
systems engineers to thoroughly investigate and anticipate potential project risks before
starting the project. Up-front analysis of the end-to-end process pays on-going dividends.
This paper describes: 1) an evolutionary Microcosm for investigating systems integration
issues, fostering model-based systems engineering research, and accelerating systems
engineering education; 2) the lessons learned during the first stage of the Microcosm
development; 3) how these lessons learned have informed the design and implementation of
the Microcosm Stage Two. Interestingly, the lessons learned from the Microcosm Stage One
reflect many of the common lessons learned found in much larger industry projects. This
demonstrates the Microcosm Sandpit’s capability in replicating a wide range of systems
development issues common to complex systems. Thus it provides an ideal environment for
systems engineering education, training and research.
INTRODUCTION
The Microcosm program was established by the University of South Australia, and the
Defence Science and Technology Organisation in 2006, to foster research, training and
education in systems engineering, with a particular focus on conducting research into better
understanding of how to manage the issues of system integration that arise in the
development of large system of systems projects. These issues are described succinctly in
(Norman and Kuras, 2006), for example, and are understood to arise from multiple causes.
One of the most important sources of systems integration difficulty is the need to
assemble these systems from a number of different systems that have often been designed and
built for different purposes, by different manufacturers, and to operate in different
environments from those expected in the new system of systems. The Microcosm program
has been deliberately designed to mimic this situation on a small scale, in order that the same
type of issues will occur and research and teaching of these issues can be carried out in a
small and manageable environment.
The Microcosm program is executed in multiple stages that adopt the Spiral Development
Model. Stage One of the Microcosm development project has been completed, with the
identification of over 30 lessons learned, which have strong parallels with those that arise in
real projects. This first stage was carried out with the following inadequate planning and
management characteristics that are often reflected in real projects:
a. Unclear and shifting requirements,
b. A tight schedule and very tight funding,
c. Inadequate systems engineering and project management planning,
d. Development based on components supplied by a number of different manufacturers, causing numerous interface issues, and
e. Inadequate understanding of the possible environmental effects on the system’s performance.
Both government and industry have recognised the need to document and apply the
knowledge gained from past experience to support current and future projects, in order to
avoid repeating past failures and mishaps and to promote successful practices.
Traditionally, lessons learned are associated with failures and often involve the loss of valuable
resources. However, project successes can and should also be recorded as lessons learned.
These lessons are generally stored and maintained in a database and regarded as a valuable
resource for rectifying causes of failure, or for informing decisions that are likely to
improve future performance.
One of the valuable systems engineering products generated from the Microcosm Stage One
is a lessons learned database. Many lessons learned databases that have been built in the past
have ended up in total neglect within a short time because they did not provide potential
future users with easy access or the cues to promote their use. The Microcosm lessons learned
database is intended to be an interactive, living database that provides continuous support to
the project: improving systems engineering practice, addressing systems integration issues,
and supporting the use of Microcosm systems engineering process products for education
purposes. This paper gives a brief description of Stage One of the Microcosm project, and then
discusses the lessons learned during its development, how they have informed the execution
of the second stage of the project and the database design to make them available on the
project Wiki.
MICROCOSM PROGRAM
The Microcosm program is aimed at developing an evolutionary facility, namely the
Microcosm Sandpit, which will expand in capability to meet the wider and longer-term
requirements of its stakeholders by adopting the Spiral development model as described in
(Buede, 2000). It is essentially an open-ended project with multiple stages, where each stage
will enhance existing capabilities as well as develop new capabilities to meet the growing
needs of stakeholders. It resembles the evolutionary nature of military systems upgrades on a
much smaller scale. This is a challenge for the traditional systems engineering paradigm,
since the systems engineering process is usually taught and largely practiced using a linear
system development process (i.e., the Waterfall model), which is ill-suited to dealing with
evolving systems.
The Microcosm Sandpit provides a flexible means to explore systems engineering practices
within a facility that utilises real and simulated autonomous systems operating and interacting
with humans, the physical and the simulated environments. The evolving Microcosm Sandpit
fosters systems engineering practices in both research and teaching environments.
Essentially, it provides a systems engineering and systems integration environment to be used
by stakeholders to stage demonstrations, conduct experiments, train staff, and to evaluate
systems configuration and operation. The Microcosm programme has six defined use-cases:
Simulation/Stimulation: Investigation of the interactions between a mixture of models
and physical hardware in real-time and simulated scenarios, including hardware in the
loop, simulation architectures and plug-and-play hardware and software.
Human Agent-Based Modelling for Systems Engineering: Development of human
operational models and interfaces in scenarios based on teaming agents, human
replacement agents, socio-technical models, and research into the insertion of data
and image files into an OPNET Modeller.
Modelling, Simulation and Performance Analysis: Development of object oriented
modelling of system devices, and statistical modelling of captured real-life data.
Autonomous Vehicles Research: Investigation of swarming robots, cooperative robots,
dynamic task allocation, robustness and reliability, and combined communication and
localisation.
Systems Engineering Approach to Evolving Model Development: Investigation of
SE–lifecycle models, evolving agents, platforms and environments, and the transition
from low to high fidelity models.
Systems Enhancement Research: System analysis, system parameterisation,
optimisation, algorithm development, kit development, and research into Human-Computer Interfaces (HCI).
Microcosm Stage One Architecture
The Microcosm programme high-level architecture has three distinct parts (Mansell et al.,
2008): the Microcosm Information Management System (MIMS), the Modelling and
Simulation Control System (MASCS), and the Microcosm Physical System (MPS),
illustrated in Figure 1. The MIMS is an integrated information management system that
stores all the systems engineering products associated with the project through each spiral-development cycle. The MASCS is a simulation and control subsystem that contains
synthetic models of Microcosm’s components including environment models, simulated
autonomous vehicles, and a suite of onboard and off-board sensor models. In addition, the
system provides the capability for hardware-in-the-loop (HIL) simulation through the use of a
common interface between simulated and physical components that provides seamless
interactions between these components in a given operational scenario. Finally, the MPS
consists of all the physical components of the Microcosm facility, including the autonomous
robotic vehicles, external sensors and external environments.
Microcosm Stage One Implementation
The Microcosm Stage One operational scenario consists of two unmanned ground vehicles
(Pioneer 3DXs), and a fixed global external sensor: SICK LMS 291 laser sensor as shown in
Figure 2(a). Each mobile robot has an on-board laptop, vision sensor, ultrasonic sensor, laser
sensor and wheel encoders. The global sensor is used for intruder detection (i.e., detecting a person)
and sends notification to the two mobile robots to intercept, assess, and perform threat
mitigation or neutralize the intruder.
have been developed for the P3DXs, laser sensors, ultrasonic sensor, vision sensors, and the
operating environment. This provides the flexibility to run scenarios using only physical
systems, synthetic models alone or a combination of real components and synthetic models.
It also provides a powerful environment in which to explore systems integration issues and to conduct model-based systems engineering research.
Figure 1. The high level architecture of the Microcosm Sandpit.
The Microcosm Sandpit Stage One system implementation has been successfully completed
and evaluated (Do et al., 2009) using a service oriented architecture (SOA). The SOA is
based on the Decentralised Software Service (DSS) (Nielsen and Chrysanthakopoulos, 2008)
and the Concurrency and Coordination Runtime (CCR) library (Morgan, 2008) from Microsoft and is
illustrated in Figure 2(b). The CCR is a managed library that provides classes and methods
for concurrency, coordination and error handling. It enables segments of code to operate
independently, pass messages and execute in parallel. The DSS, on the other hand, extends
the CCR capability across processes and machines. Both CCR and DSS are provided within
the Microsoft Robotic Development Studio (MRDS) (Johns and Taylor, 2008).
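CCR and DSS are .NET libraries, so the following is only a language-neutral Python analogue of the pattern described above, not the Microsoft API: independent segments of code that share no state, communicate by posting messages to a port, and execute concurrently.

# Illustrative Python analogue (not the Microsoft CCR/DSS API) of the pattern described
# above: independent services that pass messages through a port and run in parallel.
import queue
import threading


def sensor_service(out_port: queue.Queue) -> None:
    # Posts detections to its output port instead of calling other services directly.
    for position in [(1.0, 2.0), (1.5, 2.5), (2.0, 3.0)]:
        out_port.put(("intruder_position", position))
    out_port.put(("shutdown", None))


def robot_service(in_port: queue.Queue) -> None:
    # Reacts to whatever messages arrive; no shared state with the sensor service.
    while True:
        kind, payload = in_port.get()
        if kind == "shutdown":
            break
        print(f"robot: planning path towards intruder at {payload}")


if __name__ == "__main__":
    port: queue.Queue = queue.Queue()
    threads = [
        threading.Thread(target=sensor_service, args=(port,)),
        threading.Thread(target=robot_service, args=(port,)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()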
Figure 2. (a) Operational view (OV1) of the Microcosm Stage One scenario.
(b) Service-oriented architecture for the Microcosm Stage One implementation.
Another important aspect of the Microcosm Sandpit development is that the simulation
environment is implemented using the simulation capability of the Microsoft Robotic
Development Studio (MRDS). It uses the AGEIA PhysX physics engine to simulate physical
interactions within the virtual environment, such as friction, gravity and collision (Johns and
Taylor, 2008). The Microcosm Sandpit physical environment and the robots are modelled to
scale in the simulation, and sensor accuracy is modelled with a modest level of fidelity. The
modelled environment is shown in Figure 3. The synthetic robots are configured to replicate
the motions of their real counterparts in the real environment, performing the operational
scenario depicted in Figure 2 (a). A synthetic intruder is also modelled and its motion is
emulated based on the intruder’s position calculated by the ground-based laser sensor in the
physical environment. Intruder and robots’ motion data are supplied to the simulation
environment by the Master-Runtime Control.
Figure 3. Aerial view of the synthetic environment of the Microcosm Sandpit.
The operation of the Stage One scenario is based on a centralised control architecture,
as depicted in Figure 2(b). The Master Runtime Control is responsible for intruder detection,
sensor fusion, creating and maintaining an operational picture, and requesting positional
update from the robots. It also instructs the Simulation Orchestrator to emulate the robots and
intruder motions.
Upon intruder detection, the robots receive the intruder’s position from the Master-Runtime
Control. Robot Two pivots and tracks the intruder using the onboard camera, while Robot
One performs its own path planning to follow and intercept the intruder using a finite state
machine (Cook et al., 2009). After the initialisation state, Robot One transitions automatically
into the Standby state. Upon intruder detection it progresses to the Follow-Intruder state, in
which it performs path planning and follows the intruder. While in the Follow-Intruder state,
if an obstacle appears in the way (detected by the onboard laser sensor), Robot One transitions to
the Obstacle-Avoidance state, and resumes the previous state when the obstacle is cleared.
When the robot is within a metre of the intruder, it enters the Intruder-Engagement state and
announces successful interception to the intruder. A voice message is also transmitted on
each state transition to enable observers to assess progress through the scenario. Should the
intruder leave the guarded area, the robot enters the Return-To-Base state.
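The state machine just described can be summarised in a few lines. The sketch below is a simplified Python rendering of those transitions; the trigger conditions are condensed into boolean flags and a range value, which is an assumption for illustration rather than the project's actual implementation.

# Simplified sketch of the Robot One finite state machine described above
# (state names follow the text; transition triggers are condensed into flags).
from enum import Enum, auto


class State(Enum):
    STANDBY = auto()
    FOLLOW_INTRUDER = auto()
    OBSTACLE_AVOIDANCE = auto()
    INTRUDER_ENGAGEMENT = auto()
    RETURN_TO_BASE = auto()


def next_state(state: State, *, intruder_detected: bool, intruder_in_area: bool,
               obstacle_ahead: bool, range_to_intruder_m: float) -> State:
    if not intruder_in_area:
        return State.RETURN_TO_BASE            # intruder has left the guarded area
    if state is State.STANDBY and intruder_detected:
        return State.FOLLOW_INTRUDER
    if state is State.FOLLOW_INTRUDER and obstacle_ahead:
        return State.OBSTACLE_AVOIDANCE
    if state is State.OBSTACLE_AVOIDANCE and not obstacle_ahead:
        return State.FOLLOW_INTRUDER           # resume the previous state
    if state is State.FOLLOW_INTRUDER and range_to_intruder_m <= 1.0:
        return State.INTRUDER_ENGAGEMENT       # within a metre: announce interception
    return state


if __name__ == "__main__":
    s = State.STANDBY
    s = next_state(s, intruder_detected=True, intruder_in_area=True,
                   obstacle_ahead=False, range_to_intruder_m=4.0)
    print(s)  # State.FOLLOW_INTRUDER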
MICROCOSM LESSONS LEARNED
Lessons learned databases are a collection of analysed data from a variety of current and
historical events. Traditionally these events are associated with failures, often involving the
loss of valuable resources. Lessons learned databases provide a resource that can be used to
rectify the causes of failure, or choose pathways that are likely to improve future
performance. Many large civilian and military organizations around the world manage a
database that includes a description of the driving events (failures or success), the lesson(s)
learned from these events and recommendations for future actions. In this section we describe
some events that are included in the lessons learned database of the Microcosm program.
Although the Microcosm database is driven by relatively simple events, the database has
similar outcomes and recommendations to the traditional lessons learned databases found in
larger systems engineering/integration organisations. Furthermore, the unique configuration
of Microcosm offers an environment in which systems engineering practices are developed
and extended in a forgiving environment, and provides a large variety of driving events with
little risk of loss of resources.
The following table gives examples of lessons learned in categories that are maintained in our
lessons learned database. For each entry, the nature of the issue and what we learned as a
consequence of the issue’s manifestation are summarised under ‘Issue Description’ and
‘What Was Learned’ respectively.
Table 1: Samples of the lessons learned from the Microcosm Stage One.

MLL01 – Equipment Operation
Issue Description: EMI issue with the flex compass being interfered with by the robot’s electrically powered motors.
What Was Learned: EMI issues are difficult to predict during the design stage and can also be intermittent in their manifestation. Hardware prototyping and extensive system testing is required to ensure that EMI issues are identified early.

MLL02 – Environment
Issue Description: The robot’s localisation system was unable to cope with measurement inconsistencies introduced by the environmental conditions, i.e., uneven floor surfaces.
What Was Learned: Information redundancy, supported by multiple different sensor types, is required to compensate for sensor errors realised within an imperfect environment.

MLL03 – Integration
Issue Description: COTS equipment manufacturer’s interface documentation was incomplete and ambiguous.
What Was Learned: Interface testing is necessary to prove that the COTS equipment communicates as expected before system test is conducted.

MLL04 – Interfaces
Issue Description: Team members were developing their common software interfaces but assumed incompatible units of measurement and dissimilar coordinate systems for various data fields.
What Was Learned: An interface specification is required to effectively define the system component interfaces. Each affected team member is hence required to ‘sign up to’ this interface specification.

MLL05 – Computer Hardware
Issue Description: Processing power, while adequate for a single application, was found to be inadequate when the same CPU was required to support multiple applications.
What Was Learned: Before loading an application set on a computing platform, a processor performance model should be developed to confirm that adequate computing resources are available over all system operational scenarios.

MLL06 – Configuration Management
Issue Description: System data was stored in a common folder on a server but system baselines were not clearly defined. It was difficult to identify working modules for each baseline and differentiate completed modules from work-in-progress ones.
What Was Learned: System baselines need to be defined and the relevant ‘snap shot’ of the system kept unmodified in a known directory structure to ensure that development to the next system baseline progresses from a known state.

MLL07 – Safety
Issue Description: Due to a software issue, the robot went ‘rogue’ during a test, which could have resulted in damage to the robot, before manual action was forthcoming to secure the robot’s motion.
What Was Learned: A readily accessible and simple method of halting the robot’s motion (i.e., kill switch) should be provided as part of the system functionality.

MLL08 – Emergency Procedures
Issue Description: See MLL07.
What Was Learned: A procedure and deployment of this procedure (i.e., staff training) was found to be necessary to address potential safety issues.

MLL09 – Project Management
Issue Description: Our initial effort estimates gave a value of 1.1 person-years. However, the actual expenditure was 1.5 person-years.
What Was Learned: Our systems engineering process was not robust enough to mitigate against the rework required to recover from various issues (see this table for examples). Our systems engineering process will now be evaluated and process improvement considered as appropriate.

MLL10 – Technology adoption/insertion
Issue Description: The self-localisation of the robot was based only on odometry with no correctional global positioning. It was found that the odometry suffers from accumulative errors that lead the robot to a “lost state” within a short period of time.
What Was Learned: Model-based systems engineering should be used to inform the feasibility of the intended methodology and technology prior to their insertion.

MLL11 – Lesson Learned Capture
Issue Description: Lessons learned were captured at the end of the project. It was found that many of the same issues were independently discovered by multiple team members, which could have been avoided in subsequent cases given early access to a populated Microcosm Stage One lessons learned database.
What Was Learned: Lessons learned should be recorded immediately after they occur to avoid duplicated failures. This requires the lessons learned database to have online access, with email notification of new entries to all team members.
IMPACTS OF LESSONS LEARNED ON THE MICROCOSM STAGE TWO
Lessons learned can be regarded as a critical source of information for risk mitigation and
optimal project planning. However, their effective use and maintenance remain a
challenge to the engineering community. One of the aims of the Microcosm program is to
investigate methods for the effective use of the lessons learned to inform the design and
implementation of future stages of the Microcosm project, and also their use in systems
engineering education.
The captured lessons learned from the Microcosm Stage One have informed the execution of
the Microcosm Stage Two work in two respects: engineering implementation and the
systems engineering process. The former has occurred in the system design and
implementation phases, where a stronger emphasis on understanding and designing to the
interfaces between different system components is being applied as a result of lessons
learned id-MLL03 and id-MLL04 in Table 1. In particular, the following areas are
considered: communication protocol, deterministic response, Quality of Service (QoS), the
types of messages/data passed between services, and the data update rates of each sensor and
service, all embedded in a hardware-software system deployment architecture framework
design.
Similarly, the system design process is being informed by lesson learned id-MLL10 in
Table 1: the capability of candidate system components is being investigated up front by creating
statistical models of the components, to ascertain whether the intended technology will meet
the system’s requirements, irrespective of what is stated in the various product
specifications. For instance, an Ultra-Wide Band (UWB) positioning system was considered
as an indoor analogue to GPS for updating the robots’ positions. A model-based systems
engineering approach was adopted to inform the system design; the outcome indicated that a
UWB positioning system alone would not be sufficient to provide global position updates
for the robots, due to large positional errors. This led to the insertion of an additional
requirement to procure an extra SICK LMS-291 laser sensor. Note that this was identified
before the project started rather than toward the end of the project, as might otherwise have
happened.
Furthermore, the tailored systems engineering processes for the Microcosm Stage Two have
been informed by the lessons learned recorded in Table 1. For instance, with reference to
lesson learned id-MLL09, our estimate of the total effort required (approximately 1.1
person-years against an actual expenditure of 1.5 person-years, i.e. 1.5/1.1 ≈ 1.36) was out by
some 36%, due largely to the under-allocation of time for the test and evaluation phase. This
lesson has informed the time allocation for the systems integration test and evaluation phase
of the Microcosm Stage Two, increasing it from 5% to 20% of the overall project schedule. This
is believed to match the general expectation in industry settings, and hence rework has potentially
been mitigated as a consequence of this lesson learned.
PROPOSED RESEARCH INTO THE USABILITY OF A LESSONS LEARNED
DATABASE
In order to make use of the captured lessons learned it is important to have the right software
infrastructure in place and the right methodology/ environment for using it. The wiki software
being used will allow lessons learned to be linked to relevant architectures and design
patterns (which can also be stored in the wiki), and the impacts of changes in these can be
captured and stored in the wiki’s history. Having the right processes will include things like
making sure people are using the wiki, the use of email to notify wiki updates with time-critical lessons learned, deciding on the right course of action as a result of the lesson learned,
and implementing these changes.
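As a concrete illustration of the ‘capture immediately and notify’ process recommended by MLL11 and described above, the following minimal Python sketch records a lesson as soon as it is raised and pushes a notification to the team. The field names and classes are hypothetical, not the Microcosm wiki schema.

# Hypothetical sketch (illustrative field names, not the Microcosm wiki schema) of a
# lessons-learned entry captured immediately and pushed to team members.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class LessonLearned:
    lesson_id: str
    issue_type: str
    issue_description: str
    what_was_learned: str
    recorded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class LessonsLearnedStore:
    def __init__(self, notify: Callable[[LessonLearned], None]):
        self._entries: List[LessonLearned] = []
        self._notify = notify  # e.g. an email or wiki notification hook

    def record(self, lesson: LessonLearned) -> None:
        self._entries.append(lesson)
        self._notify(lesson)  # notify immediately, not at the end of the project


if __name__ == "__main__":
    store = LessonsLearnedStore(notify=lambda l: print(f"new lesson: {l.lesson_id}"))
    store.record(LessonLearned("MLL12", "Interfaces",
                               "Example issue", "Example recommendation"))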
Learning can take place at a number of levels (Grisogono and Spaans, 2008). In the context
of Microcosm, level 1 learning would be tuning existing systems engineering processes as a
result of lessons learns. Level 2 learning includes improving the processes used to support
level 1 learning. This includes modifications to the structure of the wiki and processes for
learning, along with expanding and modifying the set of engineering processes used in
Microcosm. Level 3 learning is learning about level 2 learning – capturing lessons learned
regarding the use of the wiki database and the processes adopted for acquiring the lessons
learned. The Microcosm wiki could capture level 3 learning by research into usability of the
lessons learned database. Level 4 learning is about aligning how the lessons learned are used
with measures of real world performance of the Microcosm. For example, if a different way
of communicating architectural changes is used, does this reflect in better software
performance? Level 5 learning is about how lessons learned are used in a co-operative
setting. To be most effective at learning, all five levels have to be implemented. Therefore,
the Microcosm lessons learned research program focuses on how well each of the five levels
of learning can be achieved through the use of the Microcosm wiki and other processes.
CONCLUSIONS
This paper has discussed the usefulness of capturing lessons learned from the Microcosm
Stage One and illustrated how these lessons learned have informed the design and
implementation, and the tailored systems engineering processes of the Microcosm Stage
Two. The captured lessons learned were from a small-scale project but reflect many of the
same lessons learned reported in large-scale projects. This has demonstrated the Microcosm
Sandpit’s merits in generating systems engineering products that could be used in systems
engineering education, training and research. The captured lessons learned are stored on the
project Wiki, which will be equipped with interactive mechanisms for autonomously engaging
users and providing insightful information at various stages of the Microcosm program as part
of its model-based systems engineering research theme.
REFERENCES
Buede, D. M. 2000. The Engineering Design of Systems, Wiley Publishing Inc.
Cook, S., T. Mansell, Q. Do, P. Campbell, P. Relf, S. Shoval and S. Russell, 2009.
Infrastructure to Support Teaching and Research in the Systems Engineering of
Evolving Systems. 7th Annual Conference on Systems Engineering Research 2009
(CSER 2009), accepted for publication, Loughborough, UK.
Do, Q., T. Mansell, P. Campbell and S. Cook, 2009. A Simulation Architecture for Model-based Systems Engineering and Education. SimTect 2009, accepted for publication,
Adelaide, Australia.
Grisogono, A.-M. and M. Spaans, 2008. Adaptive Use of Networks to Generate an Adaptive
Task Force. 13th ICCRTS: C2 for Complex Endeavors.
Johns, K. and T. Taylor 2008. Professional Microsoft Robotics Developer Studio, Wiley
Publishing Inc.
Mansell, T., P. Relf, S. Cook, P. Campbell, S. Shoval, Q. Do and C. Ross, 2008. Microcosm - A Systems Engineering and Systems Integration Sandpit. Asia-Pacific Conference on
Systems Engineering - APCOSE, Japan.
Morgan, S. 2008. Programming Microsoft Robotics Studio, Microsoft Press, US.
Nielsen, H. F. and G. Chrysanthakopoulos. 2008. "Decentralized Software Services Protocol
– DSSP/1.0."
http://download.microsoft.com/download/5//6/B/56B49917-65E8494A-BB8C-3D49850DAAC1/DSSP.pdf.
Norman, D. and M. Kuras 2006. Engineering Complex Systems. Complex Engineered
Systems: Science Meets Technology. D. Braha, A. Minai and Y. Bar-Yam, Springer.
MAKING ARCHITECTURES WORK FOR ANALYSIS
Ruth Gani B.AppSc., Simon Ng BSc./BEng. PhD.
Joint Operations Division
Defence Science and Technology Organisation
INTRODUCTION
Capability integration is an important step towards delivering joint capability for Defence—
indeed, interoperability is central to enabling effective joint, combined and coalition
operations. In the words of the Department’s Defence Capability Development Manual, ‘Joint
interoperability is to be seen as an essential consideration for all ADF capability
development proposals’ (Australian Department of Defence 2006, p47).
A central component of ensuring interoperability is the integration of systems across
capabilities. According to Weill from the MIT Center for Information Systems Research
(Weill 2007), the enterprise architecture is the ‘organising logic for key business process and
IT capabilities reflecting the integration and standardisation requirements of the firm’s
operating model’. The Australian Department of Defence has adopted an enterprise
architectures approach through the Defence Architectural Framework (Australian Department
of Defence 2006, p47). The Chief Information Officer Group states that the application of the
Defence Architectural Framework ‘enables ICT architectures to contribute to effectively
acquiring and implementing business and military capabilities, ensuring interoperability, and
value for money and can be used to enhance the decision making process’ (Chief Information
Officer Group, 2009).
Architectures are intended to document a system in order to support reasoning about that
system. This encompasses the use of the architecture to manage and direct system changes, to
measure system effectiveness and performance and to describe and visualise the system
(Institute of Electrical and Electronics Engineers, 2000). As such, architectures must be
tractable from this perspective. Unfortunately, the utility of architectures can be limited by
poor appreciation of their potential to support this sort of reasoning or by a lack of
understanding of who might qualify as an ‘end-user’ of the architecture.
The Defence Science and Technology Organisation (DSTO) plays an important role in
supporting Defence capability development and acquisition projects. Part of this role is to
examine the risks and mitigation strategies associated with the integration of the project into
the wider Defence capability. As such, DSTO can be an ‘end-user’ of the architectures
developed as part of the capability development process (DSTO can, of course, also be
involved in the development of architectures for Defence). This paper reports on lessons on
the use of architectures drawn from DSTO’s support to a key Defence capability acquisition
project. It demonstrates the importance of documenting architectures in a manner that makes
them useful for reasoning by presenting the remedial efforts that were needed to make the
extant project architecture amenable to analysis.
Because of the classified nature of much of the material related to the Defence project in
question, the name of the project and details of the proposed system options under
consideration have been withheld.
OVERVIEW
The aim of the DSTO study presented herein was to answer two questions:
1. What were the important interoperability requirements and associated information
domain standards necessary for the new Defence capability to operate within the
Defence Information Environment (DIE)?
2. What was the impact of poor interoperability (due to poor standards compliance)
on the new capability’s capacity to conduct its missions and to meet defined
information exchange requirements?
These questions were answered by comparing standards associated with the information
exchange requirements identified by the project with standards defined for the Defence
Information Environment, producing a ‘degree of compliance’ rating. If the degree of
compliance was 1, then the capability being developed by the project would be fully
interoperable (within the limits of resolution of the study). If the degree of compliance was 0,
then the capability would not be interoperable at all.
THE DEFENCE INFORMATION ENVIRONMENT
[Figure 1. A logical representation of the DIE. Labels recoverable from the figure include the Defence Information Domains (DID) and the Defence Information Infrastructure (DII) with its layers (Data, User applications, Common Services, Information Interoperability, Information Management, User devices, System Hardware, Networks/Datalinks, Bearers), together with Sensors, Weapons, Management, Operations, Policy and Doctrine, Organisation and Structures, People and Training, Processes and Procedures, Fixed, Deployed, Coalition, Allies, OGOs, Industry and NII.]
The Defence Information Environment (DIE) as represented in Figure 1 (above) is divided
into layers. Specific layers were of particular relevance to the project, those being:
• the Data, User Applications and Common Services layers of the Defence Information Infrastructure (DII);
• the information management aspects of the Network/Datalinks; and
• the Bearers layer.
Each layer of the DII has an associated set of approved standards which are documented
within the Approved Technology Standards List (Chief Information Officer Group 2006). The
capability on which this study was focussed exists in the space defined by sensors and
weapons to the right of the diagram. This capability interfaces with other entities (be they
ships, planes or headquarters) through the DIE.
SOURCING DATA AND INFORMATION
Data concerning three types of entities were required for DSTO’s purpose:
• data concerning the Information Exchange Requirements between the new capability and the DIE;
• a list of relevant DIE standards (current and likely future); and
• a list of missions that might need to be undertaken in the context of Defence fielding the new capability.
The project’s architecture data provided the central source of information about the types of
information exchanges needed between the new capability and the broader Defence
Information Environment. In other words, the architecture products, the project Operational
Concept Document (OCD) and other project documents (such as the preliminary Functional
Performance Specification (FPS) and the Communication Information Systems (CIS) report)
articulated the types of information that needed to be exchanged, the methods and media that
would be used to exchange the information and some of the associated standards.
Standards associated with the DII layers of interest were sourced from the Approved
Technology Standards List (ATSL) and other Defence documentation. Future standards
assumed the adoption of the US Distributed Common Ground System Integrated Backbone
(DIB) standards into the DIE. (References related to the project or restricted Defence information are not included.)
A list of the missions to be undertaken by the new capability was contained within the OCD.
However, the architecture products were expressed in terms of scenarios, which were too
context specific for generating statements of generic mission effectiveness into the future. The
missing mission data was extracted from the OCD (for the new capability) and other doctrinal
sources for other capabilities (nodes) in the system with which information exchanges were
required.
ARCHITECTURE DATA: THE UNDERLYING PROBLEMS
As stated, the design of an architecture is mediated by the purpose for which it will be used.
The architecture provided with the project’s OCD was flawed across a number of levels.
• The architecture was developed in a fragmented way.
The key ‘operational view’ data that form the DAF were developed for the project
under two different contracts at two different times. This led to inconsistencies that the
adoption of a framework approach aiding configuration management is meant to
avoid. Specifically, items of information specified in one area of the architecture were
inconsistent, in terms and definitions, with items specified in other parts of the
architecture. This made it impossible to directly trace from operational activities to
information exchanges between nodes. The answer to the question ‘what activity does
this information exchange relate to’ could only be guessed at. A second inconsistency
was the use of missions as the basis of one area of the architecture and scenarios as the
basis of the other. These fundamental disconnects made interpretation of the
relationship between operational entities, information and operational activities very
difficult.
• The architecture consisted of views, but very little documented underlying data.
An architecture is a collection of data and data relationships. Unfortunately, the
architecture provided by the project was a set of ‘views’—diagrams—with no explicit
information about underlying data structures or relationships. No repository existed
that could be filtered, sorted or in any way manipulated. In essence, the architecture
products supplied as part of the OCD were aimed at securing transit of the document
through ‘first pass’, but they were poorly suited to our analytical purposes and (as
shall be seen) required considerable data ‘scrubbing’.
• The elements of the architecture were ambiguous.
Development of the elements of the architecture was not done consistently. For
instance, capability nodes were defined at different levels of resolution: weapons
hardware, radar systems, individual ship types and a surface action group (SAG) were
all present as nodes with relationships expressed as information ‘needlines’ between
them. There was no indication as to which ships constituted a SAG and which ships
within the SAG might be responsible for each needline. Some decision early on as to
the fidelity required for analysis could have saved effort in the long run.
Another problem was in the expression of Information Exchange Requirements (IERs)
in the IER matrix. A sample of the IER matrix showing one of the hundreds of IERs
processed is shown in Table 1 (below).
Table 1: Selected columns for a row appearing in the IER matrix.
  Information Element Name: Weather forecasts
  Content Description: Weather Forecasts (incl. visibility)
  Triggering Element: Bureau of Meteorology (BOM)
  Activity: Start of Mission
  Producer: BOM
  Consumer: Command and Control Headquarters
  Media System: Internet
  Media Method: Line
  Temporal Information: 1 way
The IER matrix was a primary source of information for the study, but it contained
significant ambiguities. The IERs were described as being ‘one way’, ‘two-way’, ‘network’, etc., and multiple information types were identified for each IER. The ‘one-way’ IERs were manageable; ‘two-way’ or ‘network’ made it difficult to determine
which information was flowing in what direction.
The ‘media system’ and ‘media method’ fields were occasionally populated by
multiple data options. For instance, if the system field contained ‘Mainframe, PC,
Mac, Notebook’ and the method field contains ‘wireless, network, hand carriage, line
of sight, beyond line of sight’ then decisions must be made as to which of the cross-product (system x method) options were valid and needed. Some combinations could
be automatically ruled out (mainframe x hand carriage) but it wasn’t possible to
determine whether other combinations made sense and whether or not they were
essential to meeting the IER. In some cases the systems or methods mentioned were
not discrete (eg. TADIL, Datalink, Link A), leading to boundary problems. If an IER
mentions Link A, do you also relate it to Datalink and TADIL? Problems can be
introduced by broadening the scope of node to node IER media methods and systems.
• The ‘users’ or purpose of the architecture was not well understood.
It was clear that the architecture views were developed to facilitate a stage in the
approval of the project. This doesn’t preclude the appropriate development of the
architecture for other uses, but the caveat attached to the front of the architectural
views stated that no analytical purpose had been established as part of the design brief
for the architecture. In other words, the developers of the architecture were not given
an explicit statement of all the uses to which the architecture would be put, although
they may very well have made implicit assumptions about what the architecture might
be used for and by whom it might be used.
Some of the problems raised above could have been avoided at the design stage of the data
gathering process. Much of the effort in the study was in correcting or compensating for the
above data consistency and configuration problems.
FIXING AND RECONCILING THE UNDERLYING DATA
Given the challenges inherent in the architectural views supplied in the project’s OCD, two
options were available:
1. Load the views into a Commercial Off-The-Shelf (COTS) architectural product; or
2. Create a purpose-built relational database to house the study data and implied data
relationships derived from the OCD views supplied.
The first option (COTS) entailed purchasing software licenses and training staff to a
reasonable level of expertise. Further, it wasn’t possible to guarantee that a COTS tool would
easily accommodate such ill-configured data. Ironically, configuration management may have
limited our flexibility in manipulating and analysing what poor data was available. Due to
these risks and costs, this option was discounted.
The second option required the creation of a simple relational database with just enough detail
to support the data we had available using Defence application tools already in place. We
chose to create a lightweight prototype for the purposes of supporting this project alone: it
would be flexible (or slack) enough to accommodate the badly formed data. Importantly, it
would provide a queriable repository, one that wasn’t available from the project architecture
itself.
To construct the database, it was necessary to define the logical relations between the data of
interest within the OCD architecture views. The logical relationship between data types are
described in the Entity Relationship Diagram (ERD) in Figure 2 (below).
[Figure 2. Logical IER matrix Entity Relationship Diagram. Entities shown: Scenario, IER, Media, Need and Node; cardinality legend: one, two, one or more.]
Unfortunately, this relational model was not broad enough in scope to address the analytical
questions the study wanted to answer, and so it was extended to accommodate the standards
list and missions list, as seen in Figure 3 (below). Standard data was related to the media
methods and systems that needed to comply with the standards. Missions were related to IERs
that supported the mission activity and nodes that were actually involved in carrying out
mission activities.
[Figure 3. Extended logical Entity Relationship Diagram. Entities shown include Scenario, Standard, Media, IER, Need and Node; cardinality legend: one, two, zero or more, one or more.]
[Figure 4. Final Entity Relationship Diagram for the study database. Tables shown include Scenario, Mission, IER, Need, Node, MediaType, MediaSystem, MediaMethod, Standard, DataSource, Organisation, Band and Activity, together with mapping tables such as ScenarioIERMapping, IERMissionMapping, IERMediaMapping, MissionNodeMapping, NodeMediaMethodMapping, NodeMediaSystemMapping, StandardMediaMethodMapping, StandardMediaSystemMapping and StandardMediaTypeMapping; cardinality legend: zero or one, one, two, zero or more, one or more.]
This logical structure was instantiated as a simple MS Access database. To reduce
duplication, tables to facilitate many-to-many relationships were introduced. This ensured that
the lookup or ‘mapping’ tables containing the table ‘joins’ were insulated from entity data
changes. Such relationship tables appear in Figure 4 (above) as tables whose names end in
‘mapping’ (i.e. IERMediaMapping).
Entity data was entered using various methods:
• by file import into the appropriate table (DIB data into the Standards table);
• using cut-and-paste on a record-by-record basis (ATSL data into the Standards table);
• executing insert queries (duplicating each ‘two-way’ IER to create two ‘one-way’ IERs with reversed source and sink nodes); and
• by manual data entry (Scenario, Mission, IER, Need, Node, Media).
After the data was entered, we tested the system by recreating the IER matrix using database
queries and matching the output against that provided in the OCD. Other queries showing
Node - Mission relationships and Node - IER Method - IER System relationships were also
generated and scrutinised for error.
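A minimal sketch of the relational approach is given below using SQLite from Python. The table and column names are simplified stand-ins for the study's MS Access schema (they are assumptions, not the actual design), and the insert query shows how each ‘two-way’ IER can be split into two ‘one-way’ IERs with the source and sink reversed.

# Simplified sketch of the relational approach (illustrative schema, not the actual
# study database), including a mapping table and the two-way IER split.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Node  (node_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE IER   (ier_id INTEGER PRIMARY KEY, info_element TEXT,
                    producer INTEGER, consumer INTEGER, direction TEXT);
CREATE TABLE Media (media_id INTEGER PRIMARY KEY, system TEXT, method TEXT);
-- mapping table insulating the many-to-many join from entity data changes
CREATE TABLE IERMediaMapping (ier_id INTEGER, media_id INTEGER);
""")

conn.executemany("INSERT INTO Node VALUES (?, ?)",
                 [(1, "BOM"), (2, "C2 Headquarters")])
conn.execute("INSERT INTO IER VALUES (1, 'Weather forecasts', 1, 2, 'two-way')")

# Duplicate each two-way IER as a reversed one-way IER, then relabel the original.
conn.execute("""
INSERT INTO IER (info_element, producer, consumer, direction)
SELECT info_element, consumer, producer, 'one-way' FROM IER WHERE direction = 'two-way'
""")
conn.execute("UPDATE IER SET direction = 'one-way' WHERE direction = 'two-way'")

for row in conn.execute("SELECT ier_id, producer, consumer, direction FROM IER"):
    print(row)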
This building of an enriched relational database was the first step in tackling the shortcomings
inherent in the original data. Not only did it correct these shortcomings, but it also provided a
repository through which questions linking different parts of the architecture could be asked
and meaningful answers arrived at. For example, based on this newly constructed database, it
was possible to ask questions like ‘what will the mission impact be if the project adopts a
standard for this tactical data link that mismatches with the standard mandated in the ATSL?’.
This presented a significant step forward in making the data analysable, and it highlights the
importance of understanding the purpose for which the architecture is built.
DEVELOPING AN ANALYSIS APPROACH
The analysis approach involved generating a score for each IER’s compliance with the
identified standards and then aggregating the scores in order to give meaning at a systemic
level. It was of little or no value to simply provide the project with a view of which media
systems had poor interoperability, but it was useful to highlight risks to mission outcomes that
arose from this lack of compliance. Therefore, the compliance score for an IER is the
aggregation of all media system and media method scores associated with the IER. A media
score is related to how many standards associated with the media are duplicated between the
new capability standards list and the DIE standards list. When considering the future DIE, we
incorporated the ‘other’ (DIB) standards into the DIE standards list. See Figure 5 (below) for
a pictorial view of the aggregation. Note that the fidelity and confidence in the resulting
scores diminishes at each level as the score is diluted by continual aggregations and outliers
are camouflaged.
Three cases were explored: worst case, medium case and best case. In the medium case
analysis, it was assumed that the risk associated with information exchange to and from the
proposed system was proportional to the fraction of standards common to the system and the
DIE (incorporating the DIB in the case of the future standards analysis).
[Figure 5. The data structure supporting the analyses undertaken. Labels recoverable from the figure: Sink Mission, Needline, IER, Media System, Media Method and Standards (drawn from the DIE standards, the proposed system's required standards and other standards), with confidence ranging from High at the standards level to Low at the mission level.]
To illustrate the method used here, consider the analysis of system compliance with the DIE:
1. Each standard was assigned a value of:
‘1’ if it was common to both the proposed system and the DIE; or
‘0’ otherwise;
2. For each media method, two scores were determined:
a. the ‘Positives’, which was the number of standards relevant to that media
method that were marked as ‘1’;
b. the ‘Total’, which was the total number of standards relevant to that media
method.
If the ‘Positives’ equalled the ‘Total’ then all standards associated with a media
method were, by definition, common to both the proposed system and the DIE. A
similar procedure was used to score the media systems.
3. The degree of compliance for each IER was determined as follows:
For an example IER with one associated media system:media method pair A:A’, the
compliance score P is determined by:
P = c(A, A') =
    0,  if A = 0
    0,  if A' = 0
    [Positives(A) / Total(A)] x [Positives(A') / Total(A')],  otherwise
In the medium case analysis for IERs with more than one media system:media method pairing
the average of the compliance scores for the IER was calculated:
P = (1/N) Σ c(λ, λ'), where the sum runs over the N media system:media method pairs (λ, λ') of the IER.
An identical approach was used to determine the degree of compliance for each needline and
mission.
In the best and worst case analysis, it was assumed that the risk associated with information
exchange between the proposed system and the external environment was determined by the
best or worst case risk associated with a given media system:media method pair, but
otherwise the approach was for all intents the same as that described above. An identical
approach was taken to determine the degree of compliance for each needline and mission. The
best/worst case analysis provides an idea of the risk spread for a given IER or mission.
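Assuming the piecewise formula above multiplies the system and method fractions, the scoring and aggregation rules can be written down directly. The sketch below is an illustrative Python rendering under that assumption, not the study's actual tooling; the standards and pairs shown are invented.

# Sketch of the compliance scoring described above: per-media scores are the fraction
# of relevant standards shared with the DIE, a pair scores the product of its system
# and method fractions, and an IER aggregates its pairs by mean (medium case) or by
# max/min (best/worst case).
from statistics import mean
from typing import Iterable, Set, Tuple

MediaPair = Tuple[Set[str], Set[str]]  # (system standards, method standards)


def fraction_common(relevant: Set[str], die_standards: Set[str]) -> float:
    if not relevant:              # no standards defined for this media element
        return 0.0
    return len(relevant & die_standards) / len(relevant)


def pair_score(system_stds: Set[str], method_stds: Set[str],
               die_standards: Set[str]) -> float:
    return (fraction_common(system_stds, die_standards)
            * fraction_common(method_stds, die_standards))


def ier_score(pairs: Iterable[MediaPair], die_standards: Set[str],
              case: str = "medium") -> float:
    scores = [pair_score(sys, meth, die_standards) for sys, meth in pairs]
    if case == "best":
        return max(scores)
    if case == "worst":
        return min(scores)
    return mean(scores)


if __name__ == "__main__":
    die = {"STD-1", "STD-2", "STD-3"}             # illustrative standards only
    pairs = [({"STD-1", "STD-2"}, {"STD-2"}),     # system:method pair A:A'
             ({"STD-4"}, {"STD-1", "STD-3"})]
    for case in ("best", "medium", "worst"):
        print(case, round(ier_score(pairs, die, case), 3))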
REPRESENTING RESULTS
A challenge for any analysis is to represent results in a way that is meaningful to the client. In
this study, simply stating the compliance levels across information exchanges would have
conveyed the underlying analysis but not expressed it in terms that the clients (military
operators) would be likely to appreciate. Instead, considerable work was done to build the
relational database to allow the impact of poor compliance on missions to be determined. A
simple scheme was used to express the risk to mission failure.
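The binning implied by the categories shown in Figure 6 (lower compliance corresponds to higher risk) can be expressed as a simple mapping from a compliance score P to a risk label, for example:

# Sketch of the simple risk scheme: bin a compliance score P into the categories
# used in Figure 6 (Very High risk for the lowest compliance, Very Low for the highest).
def risk_category(p: float) -> str:
    if p <= 0.2:
        return "Very High"
    if p <= 0.4:
        return "High"
    if p <= 0.6:
        return "Moderate"
    if p <= 0.8:
        return "Low"
    return "Very Low"


if __name__ == "__main__":
    for p in (0.15, 0.45, 0.9):
        print(p, risk_category(p))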
Results were presented in a client report to the IPT in graphical and tabular form. The risk
profile graphs for IERs and Missions were presented and the implications for Defence
discussed. A sample graph is shown in Figure 6 (below):
In an attempt to focus any remedial effort and identify ‘low-hanging fruit’, a coarse sensitivity
analysis was done. The results were included in the client report in graphical form akin to
Figure 7 (below). It is clear from such an output that further work regarding options for
improving the compliance of ‘System X’ should have a positive effect on mission risk.
To allow the Defence project team to conduct further investigation, the client report was
accompanied by the study tools (database and spreadsheet) in electronic form—that is, the
database was not only used to support the DSTO study, but also as a tool for ongoing analysis
within the project itself. The project team have the option of using the tools to check their
high-level compliance as standards or documents develop throughout the future life of the
project. This transfer of knowledge in explicit form adds value to the analysis already
undertaken.
[Figure 6 chart: ‘Sink Mission Risk Profiles’; y-axis ‘Proportion’, x-axis ‘Risk Cases’. Underlying data:]

Risk category               Best case   Medium case   Worst case
Very High (0.0<P<=0.2)      0.0220      0.1500        0.1600
High (0.2<P<=0.4)           0.0280      0.2600        0.3700
Moderate (0.4<P<=0.6)       0.4200      0.3300        0.4700
Low (0.6<P<=0.8)            0.5300      0.2600        0.0000
Very Low (0.8<P<=1.0)       0.0000      0.0000        0.0000

Figure 6. Proportion of Missions in each risk category as a result of non-compliance of standards between the proposed system and the DIE (numbers are illustrative not actual).
[Figure 7 chart: ‘Media Systems: Sensitivity’; y-axis ‘Normalised contribution to IER compliance’, x-axis ‘Media system’; individual media system and media method labels are not reproduced here.]

Figure 7. The relative contribution made by each media system (top) and media method (bottom) to the overall compliance performance and associated risk of the proposed system’s information exchange with the broader DIE (results are illustrative only).
Apart from advice concerning the proposed system’s interfaces, the DSTO has made many
recommendations about the architectural products supplied as part of the OCD. The Defence
project team are now investing in their capability architecture as an important engineering and
analysis tool. The faults that were identified in form and content have been taken on board
and a coherent, self-consistent architecture is currently being produced.
CONCLUSIONS
Architectures form the basis for reasoning about systems. In Defence, they are a mandated
part of the capability development process. However, projects often lack an appreciation of
the likely uses to which an architecture may be put. This paper has highlighted the importance
of considering the users in any development of an architecture. It has flagged specific
shortcomings in the data presented to DSTO by the project and detailed the effort that went
into remedial work. This level of effort is costly and time consuming, and would be obviated
if a more encompassing approach to the development of the architecture took place up front.
Finally, it has flagged several important lessons, which may seem obvious conceptually but
which are often overlooked in capability development:
• An architecture is only as valuable as it is consistent, coherent and documented. Views, mismatches in resolution or elementary content and lack of specificity undermine the utility of the architecture in supporting reasoning;
• Any compliance process must take into account the quality and validity of the architectural data and products, not just the form of the Architectural views. Defence should assess whether their compliance process meets such a goal;
• Analysis of architectural data requires the data to be contained in analysable form, but it isn’t enough to perform the analysis: the results must be meaningful to the end audience. To make it meaningful requires an understanding of the end audience and the questions they are ultimately going to want answered. This information, in turn, dictates the form of the architecture. Often, multiple audiences exist and multiple questions need to be answered, which amplifies the need to move beyond views towards a structured repository; and finally,
• Transferring knowledge isn’t simply about providing advice and recommendations; it is also about transferring tools and the capacity for analysis to the end user.
REFERENCES
Australian Department of Defence, (2006) Defence Capability Development Manual, Defence
Publishing Service, Canberra.
Chief Information Officer Group, (2006) Defence Information Environment (DIE) Approved
Technology Standards List (ATSL) (version 2.5), Canberra.
Chief Information Officer Group, (2009) Directorate of Architecture Practice Management
(DAPM) - Defence Architecture Framework. [Online] (Updated 6 Apr 2009)
Institute of Electrical and Electronics Engineers, (2000) IEEE Std-1471-2000 Recommended
Practice for Architectural Description of Software-Intensive Systems, IEEE.
Weill, P. (2007) Innovating with information systems: what do the most agile firms in the
world do?, Proceedings of the 6th e-Business Conference-PwC & IESE. Barcelona, Spain
27 March 2007.
SYSTEMS ENGINEERING IN-THE-SMALL: A PRECURSOR
TO SYSTEMS ENGINEERING IN-THE-LARGE
Phillip A. Relf (1); Quoc Do (3); Shraga Shoval (3); Todd Mansell (2);
Stephen Cook (3); Peter Campbell (3); Matthew J. Berryman (3)
(1) Raytheon Australia, email: [email protected]
(2) Defence Science and Technology Organisation, email: [email protected]
(3) University of South Australia, email: Matthew.Berryman, Peter.Campbell, Stephen.Cook, Quoc.Do, Shraga.Shoval {@UniSA.edu.au}
Abstract – The teaching of the systems engineering process is made problematic due to the
enormity of experience required of a practising systems engineer. A ‘gentle’ project-based
introduction to systems engineering practice is currently being investigated under the
Microcosm programme. The Microcosm programme integrates a robotics-based system-of-systems, as the first stages in building a systems engineering teaching environment. Lessons
learnt have been collected from the Microcosm Stage One project and the systems
engineering processes, used during the project, have been captured. This paper analyses the
processes and lessons learnt, compares them against typical large-scale Defence systems
engineering projects, and discusses the lessons learnt captured by the systems engineers who
had been working in-the-small. While executing the case study it was found that the
occurrence of the lessons learnt that are already known to industry would have been militated
against by the use of robust industrial systems engineering processes, but that the Microcosm
project schedule, with these industrial processes, would have been exceeded.
Stage One project was successfully completed, effort must now be expended to ensure that the
participants understand the limitations and strengths of systems engineering in-the-small
procedures and also understand the issues associated with the scaling up of the procedures.
INTRODUCTION
This paper reports a case study of a systems engineering project that was conducted in-the-small, namely the Microcosm Stage One project, and contrasts the lessons learnt against
industrial systems engineering in-the-large practices. Where possible the relevant industrial
practices are cited from published papers. Finally, recommendations to balance the education
of systems engineers, who have only worked in-the-small, are given to offer appreciation for
working in-the-large.
BACKGROUND
Contrary to popular belief, systems engineering has been with us for some considerable time but we
are still struggling to define what it is and to understand how to teach it. We are dogged in our need
to build increasingly complex systems but are frequently being hindered by our own cognitive
limitations. Microcosm is a systems engineering ‘sandpit’ that is designed to allow our novice
systems engineers to learn the explicit and tacit knowledge required of a systems engineer, and
guide them on the path to transition from systems engineering in the small to systems engineering
in the large.
Systems Engineering Scope and History
McQuay (2005) surveyed the literature and reports that a general consensus on the scope of
systems engineering has been achieved, but Concalves (2008) notes that this has not always
been the case. Traditionally, systems engineering has used the “V” model (i.e., top-down
definition, and bottom-up integration and test) (Walden 2007) and has been characterised by
systems engineering standards such as ANSI/ITAA EIA-632, IEEE Std 1220, ISO/IEC 15288
and MIL-STD-499, to name a few. Systems engineering in its simplest manifestation is
concerned with the design of the whole and not with the design of the parts (IEEE 2000).
Lightfoot (1996) defines systems engineering as the controlled application of procedures,
standards and tools to a problem such that the developed solution manifests to satisfy a
specific need. INCOSE (2009) expands on this definition:
Systems Engineering integrates all the disciplines and specialty groups into a team
effort forming a structured development process that proceeds from concept to
production to operation. Systems engineering considers both the business and the
technical needs of all customers with the goal of providing a quality product that
meets the user needs.
Systems engineering was born out of World War II (Brown and Scherer 2000). Since the
1940’s the Defence sector has used systems engineering practices to develop complex
systems (McQuay 2005) which have been used to coordinate information, material and people
in the support of operational missions (Brown and Scherer 2000). However, systems
engineering practice continues to suffer unrest driven by economic and political pressures
(Hellestrad 1999). As an example, in 1994 The Secretary of Defence directed that
commercial off-the-shelf (COTS) equipment should be used to encourage open systems
architectural development and as insurance against obsolescence issues (IEEE 2000), hence
changing the scope of systems engineering practice in the process. Specifically, the
importance of systems requirements in the systems engineering lifecycle of new projects has
been reduced substantially as COTS equipment, by definition, already exists and has been
built to another customer’s system requirements specification.
Classical systems engineering is essentially a sequential, iterative development process that
results in the generation of a system (Rebovich 2008). The systems engineer works on the
assumption that the classical systems engineering process is driven by requirements which
ultimately result in a system (Rebovich 2008, Meilich 2005). However, classical systems
engineering is outmoded (Rebovich 2008) particularly due to the complexities imposed by the
use of COTS equipment as system components (Dahmann and Baldwin 2008, Walden 2007,
Rickman 2001). It has been recognised that the essence of systems engineering has become
how to integrate the system components into a whole (Chase 1966). More recently,
academics and practitioners alike now refer to systems integration as the process being
undertaken in the development of systems (Ai and Zhang 2008, Mindock and Watney 2008,
Meilich 2005). However, industry strongly rejects the practice of using “systems integration”
as a synonym for “systems engineering” as it creates confusion in that the term would refer
both to an engineering discipline and to a step in the systems engineering lifecycle.
Systems Engineering Process
Boarder (1995) analysed the systems engineering process and identified 400+ unique
activities. When these activities were used to guide the development of a relatively simple system, a total in excess of 1,000 distinct activities was subsequently identified (Boarder 1995). The
Independently, Raytheon have developed their Integrated Product Development System
(IPDS) which, at the lowest level, contains over 1,000 activities (Rickman 2001). The
Raytheon IPDS integrates three systems engineering lifecycle models (i.e., evolutionary,
spiral and waterfall) into a tailorable systems engineering process (Rickman 2001). The
employment of such complex procedures may appear excessive to the novice but none the
less, the effective application of the systems engineering process can reduce the cost of
developing a system by the prevention of errors and hence results in the reduction of rework
(Lewkowicz 1988).
Education Needs
Lann (1997) stated that systems engineering was the “least rigorous” of the accepted
engineering disciplines. Given this assertion, it is apparent that some concerted education
effort was required but how to proceed? Systems engineering encompasses both technical
and non-technical (i.e., cultural, economic, operational, organisational, political and social)
contexts (Rebovich 2008, Stevens 2008, Lee 2007). In addition, a systems engineer must be
taught problem-solving techniques but an undue emphasis on this aspect can in actuality
retard the learning of other systems engineering skills (Concalves 2008). Systems engineers
must also be taught how to manage complexity: including management complexity,
methodology complexity, research complexity and systems complexity (Thissen 1997).
However, due to the scope of systems engineering (see the INCOSE definition as an
example), one person cannot be expected to hold all the prerequisite skills and hence systems
engineering must be practised by a team of individuals who collectively hold the prerequisite
skills (Concalves 2008). Existing systems engineers have agreed that systems engineering
learning can only effectively occur through experience and that guided experimentation is the
best approach (Newman 2001). As formal education has been based extensively on the
delivery of explicit knowledge (i.e., knowledge that can be readily communicated), a
paradigm shift is required to provide tacit knowledge (i.e., knowledge that cannot easily be
communicated – like learning how to ride a bicycle) (Concalves 2008).
We need an ability to provide systems engineering capability if we are to sustain growth
while managing system complexity (Concalves 2008). However, this necessary systems
engineering knowledge, which is created in the minds of individuals, does not disseminate
throughout an organisation naturally (Pope et al. 2006). This knowledge must be actively
transmitted to those neophytes who will become the next generation of systems engineers.
Management have attempted to disseminate systems engineering knowledge by capturing this
knowledge within procedures and then mandating the use of these procedures within their
organisation. However, experience has shown that this practice fails and that systems
engineering knowledge is created only through the practice of relevant formal education
combined with applicable systems engineering experience (Chase 1966).
Currently there is a global shortage of systems engineers (Concalves 2008). There are
relatively few systems engineering degrees being awarded, which is further exacerbating the
problem. Brown (2000) has estimated that there are less than 500 BS, approximately 250
Master and approximately 50 PhD systems engineering degrees awarded per year in the US.
For the same year, the National Academics Press (2008) published figures of 413 BS, 626
Master and 75 PhD systems engineering degrees (which are in the same order of magnitude
ranges as the previous figures). These figures have only marginally increased to 723 BS,
1,150 Master and 104 PhD systems engineering degrees awarded in 2006 (The National
Academics Press 2008), compared to approximately 74,000 general engineering degrees
awarded in 2004 (IEE 2004).
Strategies to instil systems engineering competencies into general engineering graduates have
included: attending formal courses, developed either internally or externally to the company; and providing on-the-job experience, which can be slow to achieve results and is of questionable value when not accompanied by appropriate theory (Concalves 2008). Asbjornsen and Hamann
(2000) suggest that a minimum of 300 hours is required to give an engineer only the most
basic systems engineering knowledge. However, industry can be reluctant to make this
investment and would prefer to employ suitably knowledgeable systems engineers. Hence,
the impetus to relegate systems engineering training to the universities has come from
industry (Asbjornsen and Hamann 2000).
When approached by British Aerospace to develop a systems engineering course,
Loughborough University had some concerns regarding the content of the degree (Newman
2001). The university interviewed potential supporters for the degree to identify the
expectation to be placed on the degree graduate’s abilities, knowledge and skills. It became
apparent that the students would require eight to ten years of formal course work just to learn
the engineering basics prior to teaching the knowledge particular to systems engineering
(Newman 2001). Loughborough University subsequently developed a less ambitious four
year systems engineering BEng degree and a five year Master degree. The university found
that in addition to the usual student attributes, the systems engineering Master degree
students: provided mutual support to their fellow students; had a strong sense of identity;
demonstrated a strong ability to cope with change; were able to provide innovative solutions;
and were comfortable presenting their work in open forums (Newman 2001).
The USAFA have also developed a four year systems engineering Bachelor degree. The
USAFA degree in first year introduces students to systems engineering by having the students
develop a boost glider where they must use aeronautical engineering to support their system
design, astronautics engineering to launch the glider, electrical engineering to effect control
during glider flight, mechanical engineering to develop a robust design, and civil engineering
to build a launch pad (George 2007). Liverpool John Moores University has also
employed experimentation as part of their systems engineering education programme. The
university has used a systems engineering project employing a robot design for many reasons.
Some of these reasons include: a simple design is readily understandable by first-year students; the project has particular relevance to the electrical and manufacturing components of systems engineering; and the project is readily extendable, making it ideal as a student project (Boyle and Kaldos 1997).
The teaching of systems engineering, similar to the execution of complex systems engineering
projects, must be addressed incrementally. The University of Queensland found that the
IEEE Std 1220 standard proved to be too complicated in its untailored form for the students to
understand (Mann and Radcliffe 2003). Novice engineers have difficulty assimilating
systems engineering concepts as they are unable to see their relevance to the task (Mann and
Radcliffe 2003) and have an incomplete appreciation for what can go wrong in a project
(Concalves 2008). Similarly, the sequence in which systems engineering concepts are
presented is important to trainee systems engineers. As an example, systems engineers
working on small systems may develop a process based on their experiences with these
systems and become unable to modify their approach to address more complex systems
engineering challenges even in the face of evidence that their processes are inadequate
(Concalves 2008). The primary author has noted that another oversight in the education of
systems engineers is that they are not generally taught to recognise systems engineering
problems that are too complex for the waterfall or even the spiral models for system
development, and that an evolutionary model must, in these instances, be employed.
Systems Engineering In-The-Large
Systems engineering academia and practitioners alike now recognise that the engineering of
small-scale bounded predictable systems (i.e., systems engineering in-the-small) is inherently
different from the engineering of large-scale complex systems (i.e., systems engineering in-the-large) (Stevens 2008). Watt and Willey (2003) list the specific characteristics of large-scale systems engineering projects that make them different from small-scale systems engineering projects, i.e.:
• Management of subcontractors.
• Management of factors that are considered to be high risk.
• Competing viewpoints between stakeholder-held system priorities.
• Integration of multiple system components and multiple technologies, some of which may not exist at project start.
• Development cycles measured in years and operational life-spans measured in decades, which has an impact on obsolescence issues and which must be considered during the design phase of the project.
• Typically funded by large organisations, with large budgets and often with high public visibility.
When integrating COTS equipment to build a system-of-systems, the initial system
requirements are often changed to allow for unambiguous mapping to the individual COTS
equipment (Walden 2007, Rickman 2001). Similarly, quite often COTS equipment does not seamlessly integrate with other COTS equipment, and ‘glue’ and/or ‘wrapper’ software is required to effect the actual systems integration (Walden 2007). This software is typically supported, within large systems engineering projects, by a single software entity often called a Data Server or, more recently, an Integration Backbone (Raytheon 2009).
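To make the role of such ‘glue’ or ‘wrapper’ software concrete, the sketch below is illustrative only: the component, class and message names are assumptions for the example and are not drawn from the Microcosm or Raytheon designs. It shows how a thin wrapper might adapt a synchronous, imperial-unit COTS range finder to the asynchronous, metric messaging expected by the rest of a system-of-systems.

    # Illustrative 'wrapper' glue around a hypothetical COTS component.
    # Names and interfaces are assumptions for the sketch, not a real product API.
    import queue
    import threading

    FEET_TO_METRES = 0.3048

    class CotsRangeFinder:
        """Stand-in for a black-box COTS device: synchronous call, imperial units."""
        def read_range_feet(self) -> float:
            return 32.8  # fixed value for illustration only

    class RangeFinderWrapper:
        """Adapts the synchronous COTS call to asynchronous, metric messages."""
        def __init__(self, device: CotsRangeFinder, bus: "queue.Queue[dict]"):
            self._device = device
            self._bus = bus

        def poll_once(self) -> None:
            feet = self._device.read_range_feet()           # synchronous COTS call
            message = {"type": "range", "metres": feet * FEET_TO_METRES}
            self._bus.put(message)                          # asynchronous publish

    if __name__ == "__main__":
        bus: "queue.Queue[dict]" = queue.Queue()
        wrapper = RangeFinderWrapper(CotsRangeFinder(), bus)
        threading.Thread(target=wrapper.poll_once).start()
        print(bus.get())   # {'type': 'range', 'metres': 9.99744}

In a large project the publish step would typically go to the Data Server or Integration Backbone rather than to an in-process queue; the adapter pattern is the same.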
CASE STUDY METHOD
A literature review was conducted to discover the current and historical scope of the systems
engineering domain. This literature review also covered the known failings apparent in
systems engineer training, particularly with reference to the transition from systems
engineering in-the-small to systems engineering in-the-large. The result of the literature
review has already been presented within the Background section of this paper.
Industrial systems engineering processes and the actual industrial practices that the primary author has been privy to were compared to the practice demonstrated on a small systems
engineering project. The primary author has been closely involved with the Microcosm Stage
One engineering project and also conducted interviews with the participants to understand
their practices during the project. Judgement was made as to whether the industrial systems
engineering processes could have alleviated the issues encountered by the Microcosm Stage
One project systems engineers. For a description of the Microcosm Stage One project, see
Mansell et al. (2008).
CASE STUDY RESULTS
The Microcosm programme (Mansell et al. 2008) has commenced the development of a
systems engineering ‘sandpit’, using robot vehicles and airships, to teach necessary theory
and skills to systems engineering students. The case study used the Microcosm programme as
an example of systems engineering in-the-small and contrasted this programme against various
industry projects such as the Air Warfare Destroyer (AWD), Jindalee Operational Radar
Network (JORN) and the Collins Replacement Combat System (RCS) projects which
represented systems engineering in-the-large. The comparison was made by the primary
author who has in excess of thirty years engineering experience.
Results – Systems Engineering Process
In the absence of a formal systems engineering procedure, an industrial process was heavily
tailored and captured as a series of activities described within a Project Management Plan
(PMP) on behalf of the Microcosm systems engineers by an industry-experienced systems
engineer. The Work Breakdown Structure (WBS) contained within this industrial PMP
proved to be too ambitious for the project schedule. This industrial PMP was abandoned and
a lighter-weight PMP was developed by the staff who would be actually executing the
Microcosm Stage One project. During the planning stage the Microcosm systems engineers
were introduced to several tools (e.g., project scheduling software, the WBS concept) that not
all members of the team were familiar with. This familiarisation process took time from the
schedule.
The industrial PMP referenced 95 activities in a WBS, which was reduced to 31 activities in
the lighter-weight PMP. A corresponding reduction saw the effort estimate fall from 3.01 person years to 1.09 person years. The Microcosm Stage One project was
completed with an expenditure of 1.31 person years but as was noted above, some time was
lost in tool familiarisation, which was not accounted for within the project schedule and some
rework was also required to progress the project.
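As a back-of-envelope check of these figures (a calculation added for illustration, not part of the project records), the lighter-weight plan cut the estimate by roughly 64 per cent, and the actual expenditure exceeded that reduced estimate by roughly 20 per cent:

    # Back-of-envelope check of the Microcosm Stage One effort figures.
    industrial_estimate = 3.01   # person years, industrial PMP
    light_estimate = 1.09        # person years, lighter-weight PMP
    actual = 1.31                # person years, actual expenditure

    reduction = 1 - light_estimate / industrial_estimate
    overrun = actual / light_estimate - 1
    print(f"Estimate reduced by {reduction:.0%}, overrun against it {overrun:.0%}")
    # Estimate reduced by 64%, overrun against it 20%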
The systems engineering process actually used by the Microcosm Stage One project was to
conduct a needs analysis; develop a number of indicative scenarios; develop the system
architecture; conduct requirements analysis; then system design; system
implementation (i.e., build); system integration; and system test and evaluation. The needs
analysis was informally communicated to the participants by a PowerPoint presentation and
formally tracked using an Excel template developed to support the needs analysis activity
(Shoval et al. 2008).
The scenario definitions (which were labelled as ‘use cases’ by the Microcosm systems
engineers) broadly consisted of: (1) provision of a systems engineering environment; (2)
provision of post graduate courses; (3) investigation of human-machine interfaces; (4)
autonomous vehicles research; (5) investigation of model-based systems engineering
approach; and (6) a demonstration of the robot vehicle’s operation.
The Microcosm Stage One architecture describes three systems, i.e.; the Modelling and
Simulation Control system; Microcosm Information Management system; and the Microcosm
Physical system. The system architecture was described using the Department of Defense
Architecture Framework (DoDAF): High-Level Operational Concept Graphic (OV1),
Operational Node Connectivity Description (OV2), Systems/Services Communications
Description (SV2) and functional flow diagrams.
The systems requirements were not formally captured but were presented on PowerPoint
slides during the Microcosm Stage One preliminary design review (PDR). The system design
was developed (largely individually) by the relevant systems engineers and was jointly
presented during the PDR. Similarly, the system implementation, and system test and
evaluation were extensively developed by the responsible systems engineers.
The Microcosm Stage One system-of-systems was ‘sold off’ against a demonstration of the
sixth scenario (i.e., robot vehicles operation scenario), in the absence of any formally recorded
systems requirements. The first five scenarios defined by the scenario elucidation phase were
considered outside of the scope of the Microcosm Stage One project and hence were not
applicable to the ‘sell off’ of the Microcosm Stage One project.
Results – Lessons Learnt
The Microcosm Stage One project suffered from some rework, which resulted in a number of
lessons learnt by the Microcosm systems engineers (Do et al. 2009). The lessons learnt that
were documented by the Microcosm systems engineers are broadly grouped as: (1) program
management; (2) specialty engineering; and (3) COTS equipment related. In addition, lessons
learnt were also extracted during interviews with the Microcosm systems engineers and it
became apparent that they were dissatisfied with the level of their system documentation
effort and also with their systems engineering risk assessment. As Colwell (2002) notes, large systems require considerable time to complete and, at completion, the methods and tools successfully used on the project may already be obsolete, necessitating that we continually learn from our systems engineering endeavours.
The project management issues included, but were not limited to, the underestimation of work effort, particularly the testing effort; consequently, the project suffered from a lack of staff resources and project documentation suffered. The underestimation of the general work effort, and specifically of the testing effort, was acknowledged as being due to the inexperience of the Microcosm systems engineers who developed the schedule. Industry would typically
employ seasoned systems engineers who would also have access to historical data that could
be employed to support the effort estimates. Industry has learnt and continues to learn the
importance of allocating sufficient systems testing effort. For instance the Hubble telescope’s
early problems were partly due to inadequate systems testing (Colwell 2002) and
Constantinides (2003) states that the Mars Climate Orbiter and Mars Polar Lander were both
lost due to inadequate system testing.
The specialty engineering issues included, but were not limited to, configuration management,
system modelling and safety. The Microcosm Stage One project suffered during system
testing as it was difficult for the Microcosm systems engineers to identify tested software
modules from those software modules that were being actively modified. Industry recognises
this problem as a configuration management issue and implements configuration management
processes specifically to deal with this issue. The Microcosm systems engineers discovered
during system testing that their robot vehicle’s localisation software was unable to cope with
measurement inconsistencies introduced by the environmental conditions (i.e., uneven floor
surfaces). Industry would develop a system model which, depending on the fidelity of the model, could be expected to predict such issues as positional measurement
inconsistencies. Again during system testing, the robot vehicle went ‘rogue’, which could
have resulted in damage to the robot before manual action was forthcoming to secure the
robot vehicle’s motion. Industry would have employed a safety engineer whose task it would
have been to consider safety as a project risk. However, all of these industrial solutions are
applicable to systems engineering in-the-large and are of a lesser importance to systems
engineering in-the-small as demonstrated by the successful completion of the Microcosm
Stage One project.
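The kind of model that might have predicted the localisation problem need not be elaborate. The sketch below is an illustration only: the tolerance, noise level and step count are assumed values, not Microcosm parameters. It injects floor-induced measurement noise into a simple dead-reckoning loop and reports how often the accumulated position error stays within tolerance.

    # Illustrative fidelity check: does floor-induced sensor noise break localisation?
    # All numbers are assumed for the sketch; they are not Microcosm project values.
    import random

    def within_tolerance(steps: int = 200, noise_std: float = 0.05,
                         tolerance: float = 0.5) -> bool:
        """Accumulate per-step measurement noise and compare against a tolerance (metres)."""
        error = 0.0
        for _ in range(steps):
            error += random.gauss(0.0, noise_std)   # uneven-floor measurement noise
        return abs(error) <= tolerance

    runs = [within_tolerance() for _ in range(1000)]
    print(f"Localisation stayed within tolerance in {sum(runs)} of {len(runs)} runs")
    # With these assumed values, roughly half of the runs drift beyond tolerance,
    # flagging the problem before any hardware is built.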
Industry have used systems engineering policy documents such as ISO/IEC 15288 (2002)
and MIL-STD-499C-Draft (2005) to develop their systems engineering process and as such
have empowered specialty engineers such as dedicated configuration management engineers
to manage project artefacts; modellers to model the proposed system architecture within the
real environment to validate the system design; and safety engineers to ensure that the system
design is safe to operate and does not cause unintentional damage. In the absence of a model
for the system processing performance requirements, the processing capability of the system
architecture was only found to be inadequate during system testing. Similarly, in the absence
of a functional executable model (i.e., simulator) the Microcosm systems engineers realised a
schedule slip when the COTS equipment failed and could not be repaired in a timely fashion.
The Microcosm systems engineers, who were under considerable schedule pressure at the
time, were not able to adequately document their system architecture and there is some
evidence that a hierarchical system architecture that affords ‘full instrumentation’ (i.e.,
adequate test points) was not achieved. Full instrumentation was an initial requirement for
the system architecture, see Mansell et al. (2008). Industry continues to learn these lessons
too. Beutelschies (2002) states that the loss of the Mars Climate Orbiter was due in part to the
lack of documented project-level decisions and lack of documented system architecture.
The COTS equipment demonstrated issues relating to a mismatch between asynchronous
and synchronous messaging; a disconnect between expected and actual measurement units;
and processing performance. In addition, the Microcosm systems engineers showed
frustration at their inability to access and hence modify the COTS equipment’s software.
Learning that COTS equipment must be handled as though it was a ‘black box’ was a useful
lesson as textbooks typically allocate very few pages to discussing COTS equipment.
However, industry too continues to learn from the use of COTS equipment. Colwell (2002)
recounts that the loss of the Mars Climate Orbiter was actually due to a disconnect between units
of measurement and Lann (1997) argues that the loss of the Ariane 5 flight 501 was due to the
inappropriate reuse of the Ariane 4 software for use on a different rocket, supporting an
extensively different mission.
ANALYSIS
The industrial PMP was rejected on the grounds of its inability to support the imposed
schedule constraints of the project in favour of a lighter-weight PMP. Consequently the
lighter-weight PMP did not address certain activities, the presence of which would have been
able to militate against the potential for rework. This potential was realised and some rework
did occur. However, the lighter-weight PMP was obviously sufficient to complete the
Microcosm Stage One project and hence the possibility exists that the Microcosm systems
engineers, who used the lighter-weight PMP, may not necessarily appreciate the value of the
industrial PMP when scaling up for more complex projects.
CONCLUSIONS
We need to teach our fledgling systems engineers the systems engineering process, expose
them to systems engineering procedures and methods, and give them access to relevant tools.
This exposure should be sufficient for them to recognise a systems engineering problem and
decompose it into necessary activities, and for the systems engineering student to be confident
enough to tailor the systems engineering process to meet the particular system development.
However, specifically from the experiences gained during the Microcosm Stage One project,
systems engineers need to be taught that there are inherent differences between systems
engineering in-the-small and systems engineering in-the-large and as a minimum these
differences are:
• Systems engineering processes that are used for small system developments may not necessarily work for large system developments.
• Students need to practise effort estimation, with exposure over many projects.
• Students need to be given exposure to project management tools.
• Students should be taught how to manage subcontractors.
• Students need to be taught how to work in a team, document their work for the team and use strong Configuration Management processes that ensure that the team has access to project artefacts in a known state.
• Students need to be taught that they cannot expect to do every systems engineering task proficiently, and that they should expect to specialise in some area of systems engineering and be supportive of the team that collectively provides the required systems engineering proficiencies.
Systems engineering students also need to be counselled that they need a full decade of
education and employment working as a systems engineer, before they can validly claim
the title of “Systems Engineer”.
REFERENCES
Ai, X.; and Zhang, Z. (2008) Study on Results-Oriented Systems Engineering (ROSE),
International Seminar on Future Information Technology and Management Engineering
ANSI/ITAA EIA-632 (2003) Processes for Engineering a System, Information Technology
Association of America (GEIA Group)
Asbjornsen, O. A.; and Hamann, R. J. (2000) Toward a Unified Systems Engineering
Education, IEEE Transactions on Systems, Man and Cybernetics, part C: Applications and
Reviews
Boarder, J. (1995) Systems Engineering as a Process, IEEE Systems Engineering for Profit
Boyle, A.; and Kaldos, A. (1997) Using Robots as a Means of Integrating Manufacturing
Systems Engineering Education, IEE Colloquium on Robotics and Education
Brown, D. E.; and Scherer, W. T. (2000) A Comparison of Systems Engineering Programs in
the United States, IEEE Transactions on Systems, Man and Cybernetics, part C: Applications
and Reviews
Chase, W. P. (1966) System Design: Basic Realities and Common Myths, IEEE Transactions
on Aerospace and Electronic Systems, vol: 2, no: 4
Concalves, D. (2008) Developing Systems Engineers, Portland International Conference on
Management of Engineering & Technology
Dahmann, J.; and Baldwin, K. (2008) Understanding the Current State of US Defense
Systems of Systems and the Implications for Systems Engineering, IEEE International Systems
Conference
Do, Q.; Campbell, P.; Shoval, S.; Berryman, M. J.; Cook, S.; Mansell, T.; Relf, P. (2009) Use
of the Microcosm Environment for Generating a Systems Engineering and Systems
Integration Lessons Learnt Database, Improving Systems and Software Engineering
Conference
George, L. (2007) Engineering 100: An Introduction to Engineering Systems at the US Air
Force Academy, IEEE International Conference on Systems of Systems Engineering
Hellestrand, G. R. (1999) The Revolution in Systems Engineering, IEEE Spectrum, vol: 36,
issue: 9
IEE (2004) http://www.engtrends.com/IEE/1004D.php (accessed: 15May09)
IEEE (2000) Overview: What is Systems Engineering?, IEEE Aerospace & Electronic
Systems Magazine, Jubilee issue
IEEE Std 1220 (2005) IEEE Standard for Application and Management of the Systems
Engineering Process, IEEE
INCOSE (2009) Systems Engineering Scope Definition
http://www.incose.org/practice/shatissystemseng.aspx (accessed: 15May09)
ISO/IEC 15288 (2002) Systems Engineering – System Life Cycle Processes, International
Standard Organisation
Lee, D. M. (2007) Structured Decision Making with Interpretive Structural Modeling (ISM):
Implementing the Core of Interactive Management, Sorach Inc. Canada, ISBN: 0-9684914-13
Lewkowicz, P. E. (1988) Effective Systems Engineering for Very Large Systems: An Overview
of Systems Engineering Considerations, Digest of the Aerospace Applications Conference
Lightfoot, R. S. (1996) Systems Engineering: The Application of Processes and Tool in the
Development of Complex Information Technology Solutions, Proceedings of the International
Conference on Engineering and Technology Management
McQuay, W. K. (2005) Distributed Collaborative Environments for Systems Engineering,
IEEE Aerospace and Electronic Systems Magazine
Mann, L. M. W.; and Radcliffe, D. F. (2003) Using a Tailored Systems Engineering Process
Within Capstone Design Projects to Develop Program Outcomes in Students, 33rd Annual
Frontiers in Education
Mansell, T.; Cook, S.; Relf, P.; Campbell, P.; Do, Q.; Shoval, S.; and Ross, C. (2008)
Microcosm – A Systems Engineering and Systems Integration Sandpit, Asia-Pacific
Conference on Systems Engineering
Meilich, A. (2005) Systems of Systems (SoS) Engineering & Architecture Challenges in a Net
Centric Environment, IEEE/SMC International Conference on System of Systems
Engineering
MIL-STD-499C-Draft (2005) Systems Engineering, Department of Defense
Mindock, J.; and Watney, G. (2008) Integrating System and Software Engineering Through
Modeling, IEEE Aerospace Conference
National Academics Press (2008)
http://books.nap.edu/openbook.php?record_id=12065&page=54#p200140399960054001
(accessed: 15May09)
Newman, I. (2001) Observations on Relationships between Initial Professional Education for
Software Engineering and Systems Engineering – A Case Study, Proceedings of the 14th
Conference on Software Engineering Education and Training
Pope, R. L.; Jones, K. W.; Jenkins, L. C.; Ramsev, J.; and Burnham, S. (2006) History of
Science and Technology Systems Engineering: The Histner, IEEE/AIAA 25th Digital Avionics
Systems Conference
Raytheon (2009) http://www.raytheon.com/capabilities/products/dcgs/ (accessed: 18May09)
Rebovich, G. (2008) The Evolution of Systems Engineering, 2nd Annual IEEE Systems
Conference
Rickman, D. M. (2001) Model Based Process Deployment, 20th Conference on Digital
Avionics Systems
Shoval, S.; Hari, A.; Russel, S.; Mansell, T.; and Relf, P. (2008) Design of a Systems
Engineering Laboratory Using a Scenario Matrix, Sixth Annual Conference on Systems
Engineering Research
Stevens, R. (2008) Profiling Complex Systems, 2nd Annual IEEE Systems Conference
Thissen, W. A. H. (1997) Complexity in Systems Engineering: Issues for Curriculum Design,
IEEE International Conference on Systems, Man and Cybernetics
Walden, D. D. (2007) The Changing Role of the Systems Engineer in a System of Systems
(SoS) Environment, 1st Annual IEEE Systems Conference
Watt, D.; and Willey, K. (2003) The Project Management – Systems Engineering Dichotomy,
Engineering Management Conference, Managing Technological Driven Organizations: The
Human Side of Innovation and Change
SOFTWARE ENGINEERING
REQUIREMENTS MODELLING OF BUSINESS WEB
APPLICATIONS: CHALLENGES AND SOLUTIONS
Abbass Ghanbary,
Consensus Advantage,
E-mail: [email protected]
Julian Day,
Consensus Advantage
Email: [email protected]
ABSTRACT
The success of web application development projects greatly depends upon the accurate capturing of the business requirements. This paper discusses the limitations of the current modelling techniques in capturing the business requirements in order to engineer a new software system. These limitations are identified by modelling the flow of information in the process of converting user requirements into a physical system. This paper also defines the factors that influence change in business requirements. The captured business requirements are then transferred into pictorial and visual illustrations in order to simplify the complex project. In this paper, the authors define the limitations of the current modelling techniques when communicating those business requirements to various stakeholders. The authors also review possible solutions for those limitations, which will form the basis for a more systematic investigation in the future.
KEYWORDS: Modelling, Tools, Business requirements, Policies, Analyst, Testing, Process, Quality
1. INTRODUCTION
There have been significant advances in modelling theory and modelling tools in practice within the past few decades to provide a clearer presentation of the required web system.
The success of web application development projects greatly depends upon the accurate
capturing of the requirements. The analysis of the business requirements and appropriate
modelling (of those requirements) leads to the correct design of the new system. The
analysis and design of the new system can be classified as one of the most complex human
activities since the analyst and designer need to intellectually identify the requirements,
cope with the complexity and develop a new system that can satisfy those elaborated
requirements.
Modelling helps us to understand the reality of the existing system applications and
processes and to create a newer reality in order to develop new systems [8]. The Business Analyst (BA) starts the process of gathering and modelling the requirements of the system. The
understanding and documentation of the BA is extended iteratively and incrementally into a
solution level design [9]. Modelling plays an important part in the System Development Life Cycle (SDLC). Every model is an abstraction of reality [7]. Modelling tools present the business requirements as a pictorial illustration that can be visually reviewed.
The generated model should enable the people (the user and business (client), as well as the development team of analyst, designer and programmer) to identify the problem, propose a solution, recognise the behaviour of the system and plan how to implement the proposed solution.
The introduction of an information system to an organisation results in changes to its business processes, which in turn cause changes to the implemented information system [3]. At the same time there are numerous concerns due to the limitations of the current modelling techniques, since various stakeholders can translate and understand these
pictorial and visual illustrations in various ways. With the current modelling techniques, a minor change in business requirements can also have a massive impact on the project, specifically if the analysis and design of the system run in parallel and happen simultaneously.
These issues have a massive impact on the quality of the developed system. This paper discusses the factors that impact on business requirements, identifies the limitations of the current modelling techniques and provides solutions for those limitations.
2. COMMUNICATING BUSINESS REQUIREMENTS
Project planning facilitates and supports the development of web applications based on
corresponding business requirements and changes to those business requirements. In
order to assure quality in the SDLC, the people (internal and external) with various socio-cultural backgrounds must have a similar understanding of the business requirements and modelling techniques. This is so because the requirements almost always emerge as the most important factor from the user’s viewpoint in terms of their perception of quality. Requirements also play a crucial role in scoping, planning and executing the project. The communication of business requirements has two main aspects: the first is information loss during communication; the second is the additional factors (internal and external) that change the original business requirements. Figure 1 presents the flow of business requirements and demonstrates the line of communication for the business requirements. The business requirements can get lost or corrupted during this transfer.
Aspect 1- Information loss:
Figure 1. The Flow of Business Requirements in SDLC (System Users and Business (Client) communicating requirements to the Business Analyst, Designer, Developer and Test Team, constrained by Time and Budget, resulting in the Developed System)
2.1. System Users
The user, in a computing context, is classified as a person who uses a computer system in order to perform their daily activities to complete a business process. The user of the technology adds value to the business (client) by improving efficiency, reducing costs, increasing revenue, and creating stronger relationships with the organisation and its clients. The user has a major impact on the business requirements in the SDLC as they are the end users of the system under development. The user can also be classified as the trigger for the SDLC because these people identify the shortcomings of the current business process. The user has an impact on the SDLC before and after the product is developed.
2.2. Business (Client)
Business (client) as a structured and organized entity identifies the need for the new system
to increase quality and profit. In SDLC, business evaluates the need for the new system by
assessing the cost and time involved while determining the benefits such as improving the
profit and/or service. The business is in direct contact with the user in order to determine the requirements and dictate them to the business analyst. The business must have a clear understanding of the requirements and transfer them correctly in order to achieve the maximum quality in the SDLC.
2.3. Time and Budget
Time and budget play a major role in the SDLC, specifically in the business requirements. Time and budget change the scope of the project, which impacts on the scope of the business requirements by moving the less important requirements to future phases of the project.
2.4. Business Analyst, Designer, Developer and Tester Team
The business analyst, designer, developer and tester team, alongside the business (client), must ensure that they have the same level of understanding in all stages of the SDLC. This quality assurance is currently achievable by running numerous workshops based on the modelling document to make sure all the business requirements are captured and the involved people are at the same level of understanding.
These numerous workshops are required because the various people involved in the project come from different backgrounds and the modelled requirements (UML, EPC, OPEN, …) do not make sense to everyone. The consultants or business analyst must translate the documents for the business as well as the developers to make sure everybody has the same understanding of the created document.
2.5 Developed System
The developed system will undergo intense testing for quality assurance by the qualified system testers, the business and the users before entering the next cycle, if required. However, there are many other issues that influence the quality of the project and complicate the situation in the SDLC, such as rapid change of the business requirements.
Aspect 2- Additional factors changing the business requirements
Figure 2 presents additional factors that impact on business requirements mentioned earlier
and may lead to changes in these requirements. The organizational and Government
policies, roles and regulations might be known to the business (Client) as well as the
development team. The development team as well as the business analyst must be aware of
limitations of the technology and must have a good knowledge of the enterprise architecture
(architecture of the system). The business analyst is also responsible to inform the business
(client) if the requirements alter due to these issues. The architecture of the system also has
a big impact on performance of the system. In many cases, the business requirements can be
delivered but the existing architecture of the enterprise is unable to cope with the load and
as a result the system might crash.
Figure 2. Other Factors Impacting the Business Requirements in the System Development Life Cycle (the flow of Figure 1 with the additional factors: Organisation’s Policies, Roles and Regulations; Government’s Policies, Roles and Regulations; Enterprise Architecture; Limitation of Technology; Time and Budget)
2.6. Government’s Policies, Roles and Regulations
The legal issues also play an important role in business requirements and it is the
responsibility of the business as well as the business analyst to identify those Government
policies. As an example, if company A wants to develop a system to sell a product to the employees of company B while the funds are transferred directly through the salary office of company B, company A must ensure that the taxation office approves such a transfer of funds.
2.7. Organisation’s Policies, Roles and Regulations
The internal roles and policies of the organisations are important factors in changing the
business requirements. Referring back to the previous example, what happens to the system
under development if company A has a policy of not having the right to keep the details of company B in its database? The business, business analyst and designer
must identify these issues to either find a solution or change the business requirements.
2.8. System Architecture of the Organisation
Every individual organisation has its own system architecture designed based on the level of its non-functional requirements such as performance, security, implementation, maintenance and reusability. In relation to our previous example, company A may need to reveal some data about company B’s employees to the finance department while these data are not stored in company A’s database due to organisational policy. The business requirements have to be changed because the organisation’s policy has not allowed the details of company B’s employees to be registered inside the architecture of the company A system. The developed system must access the database (holding the details of company B’s employees) outside of the firewall and forward the data to the finance department. If company B does not allow the data to be stored outside of the firewall, then the business requirements have to be changed.
2.9. Limitation of Technology
The limitation of technology also has a major impact on the business requirements. The business might have requirements for which the technological capability does not exist. Research and development contributes by identifying these limitations and providing solutions based on these business requirements.
The request of system users, or pressure on the organisation, initiates a request for a new system or for changes to an existing system. The user might identify the limitations of the existing system (which could be classified as a need for a new system) and at the same time the organisation might identify the need to change or create a new system based on existing weaknesses (internal to the organisation) and threats (external to the organisation). The modelling must ensure that user requests are considered, the specific problem is identified, the purpose and the objectives of the new system are clearly demonstrated and the constraints in providing the solution to the business requirements have been registered. The following section discusses the limitations of the current modelling techniques.
3. LIMITATION OF THE CURRENT MODELLING TECHNIQUES
Business requirements are modelled using formalisms (such as UML activity, class, use case, implementation, communication and state diagrams); these definitions can rarely be understood by key business stakeholders or even business analysts [1]. The current software development techniques such as the Rational Unified Process (RUP), Object-oriented Process, Environment and Notation (OPEN), Process Mentor and SAGE-II play a crucial role in the success of web development projects. However, the following limitations were identified based on the existing literature and the authors’ experience working on various projects [11] [12] [13] [14]. A new modelling tool is needed for the representation of requirements that is understood by end users as well as system analysts and developers, considering that UML does not meet the capability and interest of end users and projects suffer from incomplete requirements validation [4]. This tool must be reliable, flexible, modifiable and reusable, and must be capable of transferring and communicating the complexity in a way that is understandable by everyone. The observation and investigation of the current modelling techniques in practice revealed that they are unable to fully capture the business requirements or to communicate them to the developers and the client. The limitations of the current modelling tools consist of the following:
1. The current modelling tools are unable to present the overall view of the required system.
The current modelling techniques do not provide a diagram to picture the overall requirements of the system. The existing diagrams demonstrate one piece or one business process at a time, whereas a system supports multiple business processes. A minor change in business requirements might have a big impact on the architecture of the information, changing all of the created design. With the current modelling tools, the analyst and designer cannot easily trace and evaluate the impact of any change on the remaining structure of the system.
2. The purpose of some modelling tools is not explained; for example, how an activity diagram helps the developer to create a system.
The current modelling diagrams provide a pictorial illustration of an individual task or a business process. The literature does not define how this pictorial illustration will lead to coding a program. The diagrams help to break the problem down into smaller pieces and to present them visually in order to communicate the business requirements. The literature does not identify how these diagrams lead and facilitate the developers to actually code a system.
3. There is no testing of the current modelling to understand whether the analyst has captured the right functionalities; testing only takes place after the system is developed.
In the current SDLC, the analyst and designer capture the business requirements by reading the available documentation and through intense discussion with the business and the system users. The results of the discussion and of reading the documentation are modelled in various ways that are not understandable by the business (client). The business (client) must sign a document that is not easily understandable since they are unfamiliar with the concepts. The client realises the impact of the signed document only when the actual system is developed. There should be a mechanism in place so that the business can run a test (on a prototype) to make sure that the analysts and designers have captured the correct requirements.
4. Lack of proper standardisation, leading to numerous workshops in order to understand the functionalities.
In the current environment, there is no unique way of capturing the business requirements since people’s understandings differ. These limitations of the standards lead to workshops and meetings to bring all the various stakeholders to the same level of understanding. The details of the created documents can, in many cases, be translated only by the business analyst or the person responsible for creating the documentation.
5. Lack of understanding of which tool belongs to which space; for example, whether the activity diagram is part of the problem or the solution space.
There is great ambiguity in understanding which tools belong to which phase of the development life cycle. In other words, while capturing the business requirements, should the analyst concentrate only on analysis, or should they identify the solution at the same time?
6. Lack of support in defining the non-functional requirements.
There is considerable misunderstanding about where capturing the functional requirements finishes and capturing the non-functional requirements starts. Currently, there is no place to define the non-functional requirements in the use cases. The non-functional requirements should be added as an additional row in the use case description.
7. Lack of understanding of how these modelling techniques lead to coding.
The current modelling techniques help to break the problem down into smaller pieces, but it is not very clear how this leads to coding. The only available diagram that actually leads to coding is the class diagram. The class diagram, depending on its maturity level, can be divided into an Analysis class diagram and a Design class diagram. Identifying the classes and the relationships of those classes can be classified as the Analysis class diagram. The completion of the attributes and methods transforms the Analysis class diagram into the Design class diagram. The design class diagram should clearly demonstrate the expected behaviour of the system, and the remaining UML diagrams must show the behaviour of the system at the various stages of the web system development while the system is actually coded and developed (a code sketch of this class-diagram-to-code step is given at the end of this section).
8. The current modelling techniques do not prioritise the processes.
The prioritisation of the business processes, specifically while re-engineering is happening, can be classified as the most crucial factor for the ‘success’ or ‘failure’ of the project. The majority of re-engineering studies and projects mainly deal with those core business processes that actually generate profit. The current modelling techniques do not support these prioritisations. There should be a mechanism in place to support ranking of the business processes and classifying their significance and impact on the development of the web system.
9. Complexity of the process complicates the situation by producing a confusing model.
The current modelling diagrams become more complicated as the project becomes more complex. This creates more confusion for the involved parties (business, users, analysts, designers and developers). The complexity increases when tools such as Visio or Rational Rose cannot support the length and complexity of the document and have to move to the next page.
10. There are no clear definitions as to which tool is better for which specific project.
There should be a unique way to understand which of the current modelling techniques is better suited to a specific project, based on its complexity and the related tasks and activities.
Modelling should enable decision makers to filter out the irrelevant
complexities of the real world, so the effort can be directed toward the most
important parts of the system under study [2]. The above explanation identifies that, while various modelling tools have been used in industry, the maximum benefits have never been achieved due to the explained limitations. The following section provides some solutions for these current shortcomings of the modelling techniques as far as the capturing of the business requirements is concerned.
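To illustrate point 7 above, the sketch below is a hypothetical example: the Order class and its members are invented for the illustration and are not drawn from any cited project. It shows how a design class diagram element, once its attributes and methods are completed, maps almost directly onto code, which is the step the other diagrams leave implicit.

    # Hypothetical design-class-diagram element realised directly in code.
    # Analysis class diagram: identifies the class 'Order' and its relationships.
    # Design class diagram: completes attributes and methods, shown below as code.
    from dataclasses import dataclass, field

    @dataclass
    class OrderLine:
        product_code: str
        quantity: int
        unit_price: float

    @dataclass
    class Order:
        order_id: str
        lines: list[OrderLine] = field(default_factory=list)

        def add_line(self, line: OrderLine) -> None:
            """Behaviour promised by the design class diagram."""
            self.lines.append(line)

        def total(self) -> float:
            return sum(line.quantity * line.unit_price for line in self.lines)

    order = Order("A-001")
    order.add_line(OrderLine("SKU-7", 2, 9.5))
    print(order.total())   # 19.0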
4. ENHANCEMENT IN CAPTURING BUSINESS REQUIREMENTS
There have been numerous studies [5] [6] [10] proposing various construction processes to overcome problems while capturing the business requirements. The correct capturing of the business requirements facilitates making the correct decisions.
In order to minimise the errors while capturing the business requirements and to provide all stakeholders (with various backgrounds) with a similar understanding of the captured requirements, we have reviewed the following possibilities.
1. Create a prototype.
The pictorial modelling could be reduced if a prototype is created based on the textual capture of the business requirements. The prototype could be enhanced as more information is obtained. The business can see a sample of the product while the developers can easily improve on the existing prototype to create the new system. This solution is currently being used in agile software development environments. In the case of a complex system, various prototypes can be created for the sub-systems. The creation of the screen design allows the business to confirm that all the needed attributes, commands and screen formats are correct, while they can also evaluate the likely performance of the system.
2. Identify when, and how, information gets lost or corrupted.
There should be a mechanism in place to identify how the information is getting lost, misplaced or corrupted when it is transferred from the user and the business to the development team. The loss of information, or developing a system based on corrupted information, results in a system that is unable to perform the desired task, which eventually causes the failure of the project.
3. Look at implemented mistakes (communication modes: text, hypertext, images, animation, video and audio).
Recording the minutes, communications and meetings allows us to make sure the right requirements have been captured. These requirements can then be translated into the form of text, hypertext, images, animation, video and audio to make sure the correct requirements have been captured before moving to the design phase of the new system.
4. Trust within the involved parties.
Trust also plays an important part in human interaction. Our lives are built on trust;
without it, it is almost impossible to function in society. A similar pattern applies
when requirements are transferred: it is very important for all parties to trust each
other, otherwise it is almost impossible to develop a new system capable of
performing its desired task.
5. Mixture of existing models.
A mixture of existing modelling tools such as UML, OPEN, Process Mentor and
Business Process Modelling Notation (BPMN) might be able to cover the limitations
of the individual modelling tools. Additional research and investigation is required
to understand whether this combination could address the shortcomings of the
current modelling techniques.
6. Partitioning into sub-problems.
Breaking the problems down and partitioning them into various sections gives the
modeller better control over them. This technique might ensure that information is
not lost, misplaced or corrupted. It can be achieved by dividing the functional
requirements into various parts and creating a dependency document for further
traceability; a minimal sketch of such a dependency record is given after this list.
7. Identification of the similar patterns required in every system (user access levels, …).
The business analyst should identify the similar patterns required in every system and
reuse them rather than creating those requirements from scratch. For example, if a
system needs a private webpage, the analyst should be able to reuse existing Access
Control functionality and modelling. The modelling should be capable of providing
the stakeholders with traceability in order to distinguish the functional dependencies.
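To make items 6 and 7 concrete, the following minimal Python sketch (our own illustration, not part of the proposals above) shows one way partitioned functional requirements and their dependencies could be recorded so that simple traceability questions can be answered. All requirement identifiers, partitions and dependencies are hypothetical.

```python
# Minimal sketch: partition functional requirements and record their dependencies
# so stakeholders can trace which requirements a change may affect.
# All requirement IDs, partitions and dependencies below are hypothetical examples.

requirements = {
    "FR-01": {"partition": "Access Control", "text": "Users log in to a private webpage"},
    "FR-02": {"partition": "Access Control", "text": "Administrators manage user roles"},
    "FR-03": {"partition": "Ordering",       "text": "Customers place purchase orders"},
}

# Dependency document: each requirement lists the requirements it depends on.
dependencies = {
    "FR-02": ["FR-01"],   # role management assumes login exists
    "FR-03": ["FR-01"],   # ordering is only available to logged-in users
}

def impacted_by(req_id):
    """Return the set of requirements that directly or indirectly depend on req_id."""
    impacted = set()
    frontier = [req_id]
    while frontier:
        current = frontier.pop()
        for dependent, needs in dependencies.items():
            if current in needs and dependent not in impacted:
                impacted.add(dependent)
                frontier.append(dependent)
    return impacted

if __name__ == "__main__":
    print("Changing FR-01 impacts:", sorted(impacted_by("FR-01")))
    # -> Changing FR-01 impacts: ['FR-02', 'FR-03']
```

A dependency record of this kind lets stakeholders ask which requirements are affected when one of them changes, which is the traceability sought in items 6 and 7.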
5. CONCLUSIONS
The importance of modelling was described in this paper, followed by the important
aspects of the System Development Life Cycle (SDLC) as far as the capture of
business requirements is concerned. The limitations of the current modelling techniques
when presenting those captured requirements were identified. In response to these identified
shortcomings of the modelling tools, some possible remedies were reviewed. Further
research is in progress to construct and re-construct new modelling tools in real projects
within various industries. The future research is classified as follows:
1: Evaluate various modelling techniques in real projects (UML, OPEN, Process
Mentor, ...) in order to demonstrate how they relate to the proposed limitations.
2: Individually test each limitation against each modelling technique.
3: Based on each identified limitation (for each modelling technique), propose and test
possible solutions in order to identify the need for, and develop, a new modelling technique.
4: Identify the cost and feasibility of the solution.
5: Make problem-specific decisions related to the solutions.
6: Gain acceptance of the required enhancements and modifications.
7: Determine how modelling could take place in terms of the information flow.
REFERENCES
[1] Dubray, J.J. (2007). “Composite Software Construction”. USA: C4Media Incorporation. ISBN: 978-1-4357-0266-0.
[2] Giaglis, G. M., “A Taxonomy of Business Process Modelling and Information Systems Modelling Techniques”. The International Journal of Flexible Manufacturing Systems, 13 (2001): 209-228. Kluwer Academic Publishers, Boston.
[3] Ginige, A., “Aspect Based Conceptual Modelling of Web Applications”. Proceedings of the 2nd International United Information Systems Conference UNISCON 2008, Klagenfurt, Austria, April 2008, pp. 123-134.
[4] Kop, C. & Mayer, H. C., “Mapping Functional Requirements: from Natural Language to Conceptual Schemata”. Proceedings of the 6th IASTED International Conference on Software Engineering and Applications, Cambridge, USA, Nov 4-6, 2002, pp. 82-88.
[5] Leite, J.C.S., Hadad, G.D.S., Doorn, J.H. & Kaplan, G.N., “A Scenario Construction Process”, Journal of Requirements Engineering, Vol. 5 No. 1, 2000, Springer Verlag, pp. 38-61.
[6] Rolland, C. & Achour, C. B., “Guiding the Construction of Textual Use Case Specifications”, Data & Knowledge Engineering Journal, Vol. 25 No. 1-2, 1998, North Holland Elsevier Science Publ., pp. 125-160.
[7] Teague, L. C. & Pidgeon, C. W. (1991). “Structured Analysis Methods for Computer Information Systems”. Macmillan Publishing Company. ISBN: 0-02-946559-1.
[8] Unhelkar, B. (2005). “Practical Object Oriented Analysis”. Thomson Social Science Press. ISBN: 0-17-012298-0, p. 15.
[9] Unhelkar, B. (2005). “Practical Object Oriented Design”. Thomson Social Science Press. ISBN: 0-17-012299-9, p. 3.
[10] Zhu, H. & Jin, L., “Scenario Analysis in an Automated Tool for Requirements Engineering”, Journal of Requirements Engineering, Vol. 5 No. 1, 2000, Springer Verlag, pp. 2-22.
[11] http://www-306.ibm.com/software/rational/offerings/ppm/. Downloaded: 2/07/2008.
[12] http://www.processmentor.com/Architecture/Default.aspx. Downloaded: 2/07/2008.
[13] http://www.dialog.com.au/content/view/28/45/. Downloaded: 2/07/2008.
[14] http://www.open.org.au/. Downloaded: 2/07/2008.
DESIGN UNCERTAINTY THEORY - Evaluating Software System
Architecture Completeness by Evaluating the Speed of Decision
Making
Trevor Harrison1, Prof. Peter Campbell1, Prof. Stephen Cook1, Dr. Thong Nguyen2
1 Defence and Systems Institute, 2 Defence Science and Technology Organisation
Contact: [email protected]
ABSTRACT
There are two common approaches to software architecture evaluation [Spinellis09, p.19].
The first class of evaluation methods determines properties of the architecture, often by
modelling or simulation of one or more aspects of the system. The second, and broadest,
class of evaluation methods is based on questioning the architects to assess the architecture.
This research paper details a third, more fine-grained approach to evaluation by assuming an
architecture emanates from a large set of design and design-related decisions. Evaluating an
architecture by evaluating decision making and decision rationale is not new (see Section 3).
The novel approach here is to base an evaluation largely on the time dimensions of decision
making. These time dimensions are (1) time allowed for architecting, and (2) speed of
architecting. It is proposed that the progress of an architecture can be measured at any point in time.
For example: “Is this project on track during the concept development stage of a system life
cycle?” The answer can come from knowing how many decisions should be expected to be
finalised at a particular moment in time, taking into account a plethora of human factors
affecting the prevailing decision-making environment. Though aimed at ongoing evaluations
of large military software architectures, the literature review for this research will examine
architectural decisions from the disciplines of systems engineering, information technology,
product management and enterprise architecture.
1 INTRODUCTION
The acceptance of software architecture as resulting from a set of design decisions is now
relatively well established. Worldwide, the efforts of six separate communities of researchers
have resulted in proposed updates to an IEEE standard and complementary tools to capture
software architecture decision rationale. A literature review has revealed two blind spots,
though. The first is that there are almost no references to, and no appreciation of the impact
of, the individual and team factors affecting decision making that have been identified by the
psychology and sociology sciences. (An assumption is made here that humans make all
architectural decisions.) The other blind spot is the absence of alternative decision-making
philosophies such as the heuristic-based architecting found in systems engineering, which
recognizes architecting as an eclectic mix of rational and naturalistic decision-making
methods. This research aims to
quantify the uncertainty surrounding performance of the decision-making process(es) by
modelling the range/distribution of possible legitimate times taken to finalise different types
of architectural decisions. This will qualify the uncertainty surrounding processes with
inherent variation, and have the capability to quantify a reduction in uncertainty.
Decisions, decision relationships, decision-making strategies, and factors affecting speed of
decision making will be modelled using agent based modelling and simulation. This choice
of modelling allows exploration of the decision interactions and their time dimension
sensitivities. The complex system under study is thus one of architecting. (The authors want
to avoid a “systems thinking disability” [Senge06, p.51] by pulling apart a small piece of a
complex decision-making system only to discover this has destroyed the very system under
study.)
The remainder of the paper is structured as follows: section 2 covers the two time
dimensions of decision making (timing and time period), sections 3 and 4 cover architectural
decisions in general, sections 5 and 6 cover speed of decision making, section 7 covers
modelling of everything in sections 2 through 6. Finally, a challenge to conventional
research methods is revealed in section 8.
2 SIGNIFICANCE OF TIME AND TIMING
Timing refers to the time window when it is most optimal to make a decision. Time period
refers to the optimal period of time to spend on decision making. Both time dimensions have
lower and upper bounds. Decision making outside of these bounds will incur different types
of penalties.
2.1 The Importance of Optimal Timing of Decisions – Avoiding Re-work
Even with a limited set of critical or high priority decisions, the order of decisions can change
the architecture [Rechtin00, p.130] i.e. inappropriate order could mean over-constrained
decisions later on. At first glance, this may speed up decision making by reducing choices
later on. However, schedule overruns in other engineering activities will occur to
compensate for architecture deficiencies. The early detection of decision-making happening
too fast is closely related to the estimation of time to spend on architecting activities.
2.2 The Importance of Optimal Timing of Decisions – Cost of Gathering Information
Quality of decision making is often dependent on the quality of information available to make
decisions. Utility graphs in [Williams09, Fig 2.4] show a cut-off point when the cost of
collecting information exceeds the benefit of quality decision outcomes. Such graphs
quantify a cut-off time for a decision(s) on the most effective solution to a problem, and/or
the choice of concept.
2.3 Optimal Time to Expend On Software Architecture
Regression analysis to calibrate the Architecture and Risk Resolution (RESL) scale factor
from the COCOMO II estimation model [Boehm03, p.221] confirmed the hypothesis that
proceeding into software development with inadequate architecture and risk resolution will
cause project effort to increase, and that the rework effort increase percentage will be larger
for large projects. Corresponding “sweet spots” have been identified for minimum delay
architecture investment, shown in Figure 1. Expending either less or more time than these
"sweet spots" will result in re-work (and thus schedule delay) to cater for deficiencies in a
software architecture. NOTE: the larger the software system, the more time and effort
should be expended on architectural design.
Figure 1: Minimum Effort to Achieve Least Rework Due to Architectural Deficiencies [Boehm03]
While useful for forecasting, the RESL factor is ineffective as a work-in-progress tracking
measure during the early days of concept definition in major defence projects; architectural
decisions are made at a time when very little, if any, code has been written. (In the next
section, the authors propose to replace forecasting of thousands of lines of code with
forecasting the finalisation of hundreds of design and design-related decisions.)
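The following toy Python model (ours, not the COCOMO II RESL calibration) illustrates the qualitative idea of the "sweet spot": total schedule is assumed to be architecting time plus rework that decays as architecting investment grows, and the optimum investment rises with project size. All constants and the decay function are invented for illustration only.

```python
# Toy illustration of an architecture-investment "sweet spot":
# total schedule = time spent architecting + rework caused by architectural deficiencies.
# The constants and decay function are invented; this is NOT the COCOMO II RESL calibration.

def rework_time(arch_time, project_size):
    """Assumed rework: shrinks as architecting time grows, and is larger for bigger projects."""
    base_rework = 2.0 * project_size          # hypothetical rework with no architecting at all
    return base_rework * (0.85 ** arch_time)  # hypothetical exponential decay

def total_schedule(arch_time, project_size):
    return arch_time + rework_time(arch_time, project_size)

for size in (10, 100, 1000):  # hypothetical project sizes (e.g. KSLOC)
    best = min(range(0, 100), key=lambda t: total_schedule(t, size))
    print(f"size={size:5d}  sweet spot ~ {best} time units of architecting")
# The optimum investment grows with project size, matching the qualitative point in the text.
```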
2.4 Decision Time Lines from Product Management
“Time line” refers to a time period for which a decision outcome/choice is valid. Typically
for product management, this is the same as the “shelf life” for an individual technology
component. For example, for the choice of a personal computer for office use, the decision
time line is approximately three years before the decision needs to be re-visited. Under
Government contracts, protracted or belated decision-making will often shorten a decision’s
time line (because a requirements specification stays fixed).
3 ARCHITECTURE GENRES AND THEIR RECOGNITION OF DECISIONS
This section reviews three different types of architecture: enterprise architecture, systems
architecture and software architecture. The key message here is that there are many
non-technical, design-related decisions that set the context for architectural design.
3.1 Design Decisions in Enterprise Architecture
Drafting of “IEEE (draft) Std P1694 – Enterprise Strategic Decision Management (ESDM)”
is underway to define a standard framework for the enterprise-level management of strategic
decisions. Many strategic decisions set the context for modelling efforts such as architectural
design. The top half of Figure 2 shows a Decision Network template covering
business/strategy decisions, while the bottom half of Figure 2 covers Platform Architecture
Management (PAM) decisions. A Decision Network provides a "50,000 foot" view of the
decision-making process. It serves to focus resources on the most critical decisions and
provides a higher level method of communication concerning a decision-making situation. A
Decision Network provides a set of decisions which serves as a high level decision roadmap
for the entire project. The roadmap provides an analysis plan that helps avoid the common
pitfalls of "diving into detail" and "analysis paralysis". [Fitch99, p.41].
Figure 2: Decision Network Template from IEEE Std (draft) P1694, where each box/bullet is a major decision
3.2 Design Decisions in Software Engineering
Kevin Sullivan was the first to claim that many software design decisions amount to
decisions about capital investment under uncertainty and irreversibility [Sullivan96, p.15].
(Uncertainty about future requirements and system-wide qualities in say 10 to 20 years time.)
Design decisions are like “call options”. To bind and implement a design decision is to
exercise an option – to invest in a software asset such as architectural design [Sullivan96,
p.16]. Thus a software architecture can also be viewed as a portfolio of options. (There was
earlier, pre-1996, research into decisions for software design, but there was no recognition of
the time dimension, i.e. the appropriate timing of decisions.)
3.3 Design Decisions in Systems Engineering
An INCOSE 1996 conference paper [Novorita96] is one example of elevating decisions to be
an equally important artefact as requirements and design models. Novorita’s paper details an
information model underlying a systems development process. The intent of such an
information model is to improve the speed of communication between marketing and
engineering teams. Without such an information model, design information is inevitably
found scattered and duplicated across numerous documents, spreadsheets, PowerPoint slides
and support systems databases such as product help desks. The consequences of this are, for
example, pieces of design information that have no individual owner and no relationship
meta-data existing between the pieces of design information. Figure 3 shows decisions as a
major part of an information model, to bring all design information into one place.
[Figure content not reproduced: boxes for Decisions, Risk Mgt, Models, Req's, Tasks, Plans and Documents.]
Figure 3: Essential Data to Support Requirements – includes Decisions [Novorita96, Fig. 3]
4 INTERCONNECTEDNESS AND INTERACTIONS AMONGST DECISIONS
“A phenomenon, sometimes acknowledged but rarely explicitly investigated is that decisions
interact with one another.” [Langley95, p.270]
Previous section(s) took a static view of decisions. This section looks at the dynamics of
decision-to-decision relationships. Different types of relationships have different effects on
time, such as appearing to “speed up time” or “freeze time”. This is the first sign of
complexity.
The disciplines of Product Management and Enterprise Architecture have traditionally
revealed strong connections between non-design decisions and architectural decisions.
Architectural decisions for any product are closely linked with decisions about marketing
strategy, manufacturing capabilities and product development management [Ulrich99, p.142].
A more fine-grained view of the relationships amongst software architecture decisions
themselves is contained in an ontology by [Kruchten04], shown in Table 1. The names of
these relationships imply time dimension-related effects: time can either be speeded up or frozen.
Table 1: Relationships Between Architectural Design Decisions [Kruchten04, pp. 4-6]
constrains: "must use J2EE" constrains "use JBoss"
forbids: a decision prevents another decision being made
enables: "use Java" enables "use J2EE"
subsumes: "all subsystems are coded in Java" subsumes "subsystem X is coded in Java"
conflicts with: "must use J2EE" conflicts with "must use .Net"
overrides: "the communication subsystem will be coded in C++" overrides "the whole system is developed in Java"
comprises (is made of): this is stronger than 'constrains'
an alternative to: A and B are similar design decisions, addressing the same issue, but proposing different choices
is bound to: decision A constrains decision B and B constrains A
is related to: mostly for documentation and illustration reasons
dependencies: decision A depends on B if B constrains A
relationship to external artefact: "traces from" and "does not comply with"
Further research by [Lee08] has attempted to visualise the relationships in Table 1.
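As a minimal sketch (ours, not a tool from [Kruchten04] or [Lee08]), typed relationships between decisions could be captured as a small labelled graph and then queried; the decisions and relationship instances below are taken from the examples in Table 1.

```python
# Minimal sketch of a typed decision-relationship graph in the spirit of Table 1.
# Decision names and relationship instances follow the examples given in the table.

from collections import defaultdict

class DecisionGraph:
    def __init__(self):
        # edges[source] -> list of (relationship_type, target)
        self.edges = defaultdict(list)

    def relate(self, source, rel_type, target):
        self.edges[source].append((rel_type, target))

    def related(self, source, rel_type):
        return [t for r, t in self.edges[source] if r == rel_type]

graph = DecisionGraph()
graph.relate("must use J2EE", "constrains", "use JBoss")
graph.relate("use Java", "enables", "use J2EE")
graph.relate("must use J2EE", "conflicts with", "must use .Net")
graph.relate("all subsystems are coded in Java", "subsumes", "subsystem X is coded in Java")

print(graph.related("must use J2EE", "constrains"))      # ['use JBoss']
print(graph.related("must use J2EE", "conflicts with"))   # ['must use .Net']
```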
4.1 Classification of Decision Relationships
There has apparently been no further research since the claim by Ann Langley et al. that no
comprehensive theory of decision interconnectedness exists [Langley95, p.270]. Though not
attempting to develop such a theory, the journal article [Langley95] has worked the
relationships into the typology shown in Table 2.
Table 2: Types of Linkage Between Decisions [Langley95, pp. 271-273]
Sequential Linkages: concerning the same basic issue at different points in time; dealing with a major issue involves sub-decisions (nesting, snowballing and recurrence linkages).
Lateral Linkages: links between different issues being considered concurrently; concurrent decisions are linked as they share resources (pooled and contextual linkages).
Precursive Linkages: cutting across different issues and different times, as decisions taken on one issue affect subsequent decisions on other issues within the same organization (enabling, evoking, pre-empting, cascading, merging and learning linkages).
5 SPEED OF ARCHITECTING (SPEED OF DECISION MAKING)
“Truly successful decision making relies on a balance between deliberate and instinctive
thinking.” [Gladwell05, p.141]
This section presents another time dimension of decision making: speed. Similar to the lower
and upper bounds on both the time period and timing of decisions, the speed of decision making
also has lower and upper limits. These limits vary from one individual human decision maker
to another.
5.1 Speed of Architecting viewed as Short Cuts
The word ‘heuristic’ means to discover. It is used in psychology to describe a method (often
a short cut) that people use to try to solve problems under extremes of complexity, time
pressure and lack of available information [Furnham08, p.116]. Heuristics applicable to
systems architecting are well documented [Maier09].
The field of cognitive systems
engineering [Rasmussen94] demonstrates that a mix of decision-making styles is valid. The
“Decision Ladder” in Figure 4 represents the set of generic subtasks involved in decision
making. States of knowledge are arranged in a normative, rational sequence. There are three
legitimate points in the decision-making process to expedite certain steps based on heuristics.
Heuristic shortcut connections are typical of a natural, not formalised, decision-making style.
Taking heuristic short cuts is heavily dependent on the individual architect’s knowledge,
experience and perspective [Curtis06].
Figure 4: Decision Ladder with Heuristic-based Shortcuts [Rasmussen94, p.65]
The more popular short-cuts utilised within software engineering are patterns and pattern
language. A pattern is a specific form of prescriptive heuristic. When a number of patterns
in the same domain are collected together, they can form a pattern language. The idea of a
pattern language is that it can be used as a tool for synthesizing fragments of a complete
solution [Maier09, p.41].
5.2 Speed of Architecting viewed as State Transitions
The state transition chart in Figure 5 clearly has potential for decision-making loops (and
therefore increasing uncertainty about achieving an optimal amount of time to invest in
architectural design) before a particular decision is accepted. A visualising framework called
‘Profuse’ has been used to visualise these particular state transitions over a time period
(known as “decision chronology”). The example in the right hand side of Figure 5 shows
three decision creation or activity sessions over a two-week interval. The state of the
decisions is denoted by the shape: diamonds are 'Idea', circles are 'Tentative', and squares
are 'Decided'.
Figure 5 – Decision Chronology of State Transitions, Combining Decisions and Factors Affecting Decision-Making
Other loops (iterations) are seen in development process models, e.g. the “requirements loop” and
“design loop” in systems engineering [Young01, p.133], and the “triple peaks” in software
engineering [Rozanski05, p.75].
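A minimal Python sketch (ours) of the Idea/Tentative/Decided states described above, allowing the loops back to earlier states that create uncertainty about total decision-making time; the permitted transitions are an assumption, since the full transition chart of Figure 5 is not reproduced here.

```python
# Minimal sketch of decision state transitions (Idea -> Tentative -> Decided),
# allowing loops back to earlier states. The permitted transitions are assumed;
# the full state transition chart of Figure 5 is not reproduced in the text.

ALLOWED = {
    "Idea":      {"Tentative"},
    "Tentative": {"Decided", "Idea"},   # a tentative choice may be dropped back to an idea
    "Decided":   {"Tentative"},         # a decision may be reopened (a loop)
}

class Decision:
    def __init__(self, name):
        self.name = name
        self.state = "Idea"
        self.history = ["Idea"]

    def move_to(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.name}: cannot go from {self.state} to {new_state}")
        self.state = new_state
        self.history.append(new_state)

d = Decision("use J2EE")
for step in ["Tentative", "Decided", "Tentative", "Decided"]:  # one reopening loop
    d.move_to(step)
print(d.history)   # ['Idea', 'Tentative', 'Decided', 'Tentative', 'Decided']
# The longer the history, the more time was consumed before the decision settled.
```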
5.3 Speed of Software Architecting viewed as Hierarchy of Impacts
Florentz & Huhn discuss “three levels of architectural decisions” for embedded automotive
software. These levels are based on the point in time at which decisions are made
[Florentz07, p.45]. The three layers are (1) top-level decisions, (2) high-level decisions, and
(3) low-level decisions. Top level decisions vary the least1. High level decisions are made
early on. Low level decisions, being the closest to system realization, vary the most i.e.
provide the basis of architecture variants. Both predictability and known impact increase
when moving from top-level decisions to low-level decisions.
5.4 Speed of Architecting viewed as Holistic Thinking
Research by Microsoft Federal Systems Consulting [Curtis06] found a wide variation in the
time to complete an architecture of IT Systems (data centres, servers, LANs and WANs).
Even in cases where similar IT systems were being designed by architects with the same
levels of knowledge and the same years of experience, time to achieve customer sign off for
an architecture varied from three months to 18 months. Investigations revealed the speed of
architecting was determined by an architect’s amount of “perspective”. ‘Perspective’ is the
ability to consider impacts of any design decision upon other parts or processes of a business.
The outcome of the Microsoft research has been the Perspective Based Architecture method.
The PBA method is a question-based approach to architecting consisting of 46 questions.
Similar to a Decision Network (e.g. Figure 2), the focus of the PBA method is to help guide
1 A hierarchy based on processing rates occurs within any ecosystem [O’Neill86]. For example, growth of a forest occurs over decades, leaf growth over months, while photosynthesis occurs daily.
designers and product planners in how to consider non-design decisions which are critical to
the success of implementing any architecture.
5.5 The Speediest Decision Making – Unconscious Thinking
“Thin slicing” is one explanation for split second decision-making; those moments when we
know something without knowing why [Gladwell05].
For any architectural decision making, there will be a certain number of intangible design
decisions. Some architectural design is made without an associated issue or decision;
examples include "It worked for me last time", "first thing I tried at random", “taboo-based
decisions”, or just insights for which no connection can be identified.
6 ENVIRONMENTAL FACTORS AFFECTING SPEED OF DECISION
MAKING
As identified earlier in section 4, individual architectural decision attributes (e.g. ‘priority’) and
decision-to-decision relationships (e.g. ‘forbids’) are some of the factors that either constrain
or accelerate the speed of decision making. Many more speed-affecting factors can be found
in the sociology and psychology literature.
6.1 Factors of the Project Environment
All software and system architectural design effort is carried out in a project environment,
mostly within an organisation. In organisations, bias is manifest as a culturally engrained
behaviour that constrains decisions and actions to a fixed and limited set of perceptions and
responses [Whittingham09, p.2]. Figure 6 highlights the influence of culture and leadership
on the decision-making processes of a team.
Figure 6: A Project Environment’s Impact on Decision Processes [Shore08, p.6]
Decisions concerning selection of an engineering solution may be significantly influenced by
biases – this factor has very little to do with the mechanics of an engineering solution.
6.2 Human Behaviour Factors
Prospect theory from psychology [Furnham08, p.127] explains both why we act when we
shouldn’t (things that should not have been done, but nevertheless were done), and why we
don’t act when we should (things that should have been done, but were not). The former
category can be considered inefficient decision-making, the latter unreliable decision-making.
Both have a decelerating effect on the speed of decision making. Inefficient decision
making (e.g. not using configuration control on diagrams) has an immediate but ‘gentle’
deceleration. Unreliable decision making (e.g. not using explicitly documented system-partitioning criteria) has a delayed but almost abrupt, “show stopper” deceleration.
A unique case study2 that has observed large software projects in-situ is [Curtis88]. Layer
upon layer of people interactions affect design decision-making and subsequent productivity
in a project. An accumulation of these effects can be represented in the “layered behavioural
model” in Figure7. The size and structure of a project determines how much influence each
layer has. A large project (such as those defence projects to be studied by this research) is
affected by all factors!
Figure 7: Layered Behavior Model [Curtis88]
Most research into software architecture decisions has restricted itself to studying decision
making (1) at the ‘Individual’ level of the layered behavioural model in Figure 7, and (2) through
post-project data gathering, e.g. re-enactment of projects or recollections from project
participants. This is simply due to the practicalities of studying a large project in-situ. The
next two sections justify a synthetic, in-situ project simulation to get closer to a real
decision-making environment, taking into account as many factors affecting the speed of
decision making as possible.
2 [Glass03] writes fifteen years later that no similar case study of projects in-situ has been carried out since.
7 MODELLING THE UNCERTAINTY TO FINALISE ARCHITECTURAL DECISIONS
“The safest conclusion to be drawn from descriptive studies is that there is no single
decision-making (or problem-solving) process. Mental activities vary enormously across
individuals, situations and tasks.” [Hodgkinson08, p.467]
For this research, the unquantifiable subject of observation is the architectural design
decision-making process(es) and the associated throughput. What has to be quantified,
though, is the uncertainty of the time it takes to get through the decision-making process; that
is, a range/distribution of the times to make all those decisions which together constitute an
architecture of a desired state of maturity.
The inherent variation in speed of decision-making illustrated in the previous sections points
towards a probability distribution function of all possible actual times when attempting to
match an optimal time to expend on architecture design. Figure 8 is an envisaged output from
this research. It represents all possible decision-making completion times from 1st January to
31st December. The most likely date is 1st April. The date with a 50/50 chance of being true
is the 1st May. There is a 0% probability of completing all decision making prior to 1st
January. It is the distribution that is the extent of (design) uncertainty.
Figure 8 – The Distribution is the Uncertainty
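As a minimal sketch (ours, not the authors' model) of how a distribution like Figure 8 could be produced, the following Monte Carlo simulation samples per-decision finalisation times from an assumed triangular distribution and sums them over many simulated project instances; the number of decisions and the distribution parameters are invented.

```python
# Minimal sketch: Monte Carlo estimate of the distribution of total decision-making time.
# The number of decisions and the per-decision triangular distribution are invented.

import random
random.seed(1)

N_DECISIONS = 200          # hypothetical number of architectural decisions
N_PROJECTS = 10_000        # simulated project instances

def project_completion_days():
    # assumed per-decision finalisation time, in days of elapsed effort
    return sum(random.triangular(0.1, 3.0, 0.5) for _ in range(N_DECISIONS))

samples = sorted(project_completion_days() for _ in range(N_PROJECTS))

def percentile(p):
    return samples[int(p / 100 * (len(samples) - 1))]

print(f"earliest    : {samples[0]:.0f} days")
print(f"median      : {percentile(50):.0f} days")
print(f"90th pct    : {percentile(90):.0f} days")
print(f"latest      : {samples[-1]:.0f} days")
# It is this spread between earliest and latest that constitutes the (design) uncertainty.
```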
7.1 Agent Based Modelling and Simulation
To an observing outsider, the whole business of architecting is shrouded in uncertainty and
complexity. To make any kind of generalisation or theory (to make a first step in
understanding the complex system that is ‘architecting’) requires an ensemble of project
instances [Macal08, p.14]. These instances must preserve many human factors and
decision-to-decision relationships. Small adjustments of these should be enough to provide the
randomness/stochastic nature of human decision making.
A computational model run thousands of times, representing thousands of in-situ projects,
shall be used to produce the distribution envisaged in Figure 8. To re-iterate, it is the distribution
of possible time periods to finalise all architectural decision-making that is the uncertainty.
At the time of writing, an agent-based model [Miller07, Ch.6] appears best suited to
modelling decision makers (agents), the decisions (also agents), decision interconnectedness
& interactions, and the human behaviour factors & human environment factors impacting
speed of decision making.
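A minimal agent-based sketch (ours, not the model under development) in which decisions are agents that can only progress once the decisions constraining them are finalised, and decision makers differ in speed; the decision network, speed factors and effort threshold are invented for illustration. Running such a model over many randomly perturbed instances would yield the kind of completion-time distribution discussed above.

```python
# Minimal agent-based sketch: decisions are agents that can only be finalised once the
# decisions constraining them are finalised; decision makers differ in speed.
# The network, speeds, effort threshold and time step are all invented for illustration.

import random
random.seed(2)

class DecisionAgent:
    def __init__(self, name, constrained_by=(), maker_speed=1.0):
        self.name = name
        self.constrained_by = list(constrained_by)
        self.maker_speed = maker_speed      # higher = faster decision maker
        self.progress = 0.0
        self.finalised_at = None

    def step(self, t, finalised_names):
        if self.finalised_at is not None:
            return
        if all(c in finalised_names for c in self.constrained_by):
            self.progress += self.maker_speed * random.uniform(0.5, 1.5)
            if self.progress >= 10.0:       # assumed effort needed per decision
                self.finalised_at = t

def run_project(agents):
    t = 0
    while any(a.finalised_at is None for a in agents):
        t += 1
        done = {a.name for a in agents if a.finalised_at is not None}
        for a in agents:
            a.step(t, done)
    return t

agents = [
    DecisionAgent("use Java", maker_speed=1.2),
    DecisionAgent("use J2EE", constrained_by=["use Java"], maker_speed=0.8),
    DecisionAgent("use JBoss", constrained_by=["use J2EE"], maker_speed=1.0),
]
print("project finished at time step", run_project(agents))
```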
8 RESEARCH METHODS
A research method suited to understanding is required. One must understand the complex
system of architecting before deciding what action to take regarding a time sensitive,
decision-based evaluation of an architecture.
8.1 Research Methods Suited to Research of Decision Making in Project Environments
The primary limitation of the software architectural design research case studies in the
literature review is the sample size of each study. Most of these studies compared or
examined less than ten participants performing software design; thus the external validity of
these studies is weak [Zannier05, p.4]. Furthermore, with the sample size being small, it is
highly likely those samples are not representative of the larger population of architects and
designers. As a consequence any results cannot be said to be statistically relevant.
For the study of complex systems such as ecosystems, valid research can only be obtained
from observations conducted “out in the wild” and not in a test tube, lab or zoo. The
equivalent of “out in the wild” for architectural design decision-making is inside a project
that is ‘in situ’. Unfortunately, the timeframe for the main author participating in a live
project together with researched participants is not within the time frame of a PhD.
The research method will thus have to consist of an artificial, computational model of a
project in situ, with the ability to quickly & cheaply modify synthetic human factors and
human environmental factors to see their effects on the speed of and consequent time period
for decision making. This is to be buttressed with discussions with architects, and attempts
at decision data gathering from any software or system architecture development undertaken
at local universities.
9 SUMMARY
This research paper has adopted the stance that architectural design is decision making, and
uncertainty pervades all design. (Architecting in major projects is about predicting the future;
if I design the system thus, how will it behave? [Hazelrigg98, p.657]) There is additional
uncertainty surrounding the varying speed of architecting/decision making; this variation is
inherent to numerous human factors affecting decision-making methods. Complexity arises
from changes in the interrelatedness and interconnections of decisions themselves as time
progresses. Modelling all this uncertainty is to be carried out using agent based modelling
and simulation – a technique already used to understand complex systems where many
components interact. The understanding will be a distribution of the legitimate time periods
for architecting and timing of decisions. The first envisaged application is knowing whether
a project is on track during the conceptual design stage of the system or product lifecycle
undertaken within large projects. The benefit arising from viewing architectures as a set of
decisions is that they can be evaluated by all stakeholders, technical and non-technical.
10 REFERENCES
Boehm, Barry, and Turner, Richard (2003), Balancing Agility and Discipline: A Guide for the
Perplexed, Addison-Wesley Professional.
Curtis, Bill, Herb Krasner and Neil Iscoe (1988), A Field Study of the Software Design Process for
Large Systems, Communications of the ACM, November 1988, Volume 31, Number 11.
Curtis, Lewis , and Cerbone, George (2006), “The Perspective-Based Architecture Method”, The
Architecture Journal, Journal No. 9, October 2006, Microsoft Developer Network (MSDN)
http://msdn.microsoft.com/en-us/architecture/bb219085.aspx , accessed November 2008.
Fitch, John (1999), Structured Decision-Making & Risk Management, Student Course Notes, Systems
Process Inc.
Florentz, B., and Huhn, M. (2007), Architecture Potential Analysis: A Closer Look inside Architecture
Evaluation, Journal of Software, Vol. 2, No. 4, October 2007.
Furnham, Adrian (2008), 50 Psychology Ideas You Really Need to Know, Quercus Publishing Plc.
Gladwell, Malcolm (2005), Blink: The Power of Thinking without Thinking, Penguin Books.
Glass, Robert (2003), Facts and Fallacies of Software Engineering, Addison-Wesley.
Hazelrigg, G.A. (1998), A Framework for Decision-Based Engineering Design, Journal of
Mechanical Design, December 1998, Vol. 120.
Hodgkinson, Gerald P., and Starbuck, William H. (2008), The Oxford Handbook of Organizational
Decision Making, Oxford University Press, USA.
Kruchten, Philippe (2004), An Ontology of Architectural Design Decisions in Software Intensive
Systems, Proc. of the 2nd Workshop on Software Variability Management, Groningen, NL,
Dec. 3-4, 2004.
Langley, Ann et al (1995), Opening up Decision Making: The View from the Black Stool,
Organization Science, Vol. 6, No. 3, May-June 1995.
Lee, Larix and Kruchten, Philippe (2008), A Tool to Visualize Architectural Design Decisions, QoSA
2008, Lecture Notes in Computer Science, pp. 43–54, Springer-Verlag.
Maier, Mark W., and Rechtin, Eberhardt (2009), The Art of Systems Architecting, Third Edition, CRC
Press.
Miller, John H., and Page, Scott E. (2007), Complex Adaptive Systems: An Introduction to
Computational Models of Social Life, Princeton University Press.
Novorita, Robert J. and DeGregoria, Gary L. (1996), Less is More: Capturing The Essential Data
Needed for Rapid Systems Development, INCOSE 96 Systems Conference, July 1996, Boston.
O’Neill, R.V. et al (1986), A Hierarchical Concept of Ecosystems, Princeton University Press.
Rasmussen, Jens, Annelise Mark Pejtersen, and L.P. Goodstein (1994), Cognitive Systems
Engineering, Wiley-Interscience.
Senge, Peter (2006), The Fifth Discipline, 2nd Revised edition, Random House Books.
Spinellis, Diomidis, and Gousios, Georgios (2009), Beautiful Architecture, 1st Edition, O'Reilly
Media, Inc.
Rechtin, Eberhardt (2000), System Architecting of Organizations – Why Eagles Can’t Swim, CRC
Systems Engineering Series.
Rozanski, Nick and Eóin Woods (2005), Software Systems Architecture: Working With Stakeholders
Using Viewpoints and Perspectives, Addison-Wesley Professional.
Shore, Barry (2008), Systematic Biases and Culture in Project Failures, Project Management Journal,
December 2008, Vol.39, No. 4, pp.5-16.
Sullivan, Kevin J. (1996), Software Design: The Options Approach, Joint proceedings of the second
international software architecture workshop (ISAW-2) and international workshop on multiple
perspectives in software development (Viewpoints '96), pp.15 – 18.
Ulrich, Karl T. (1999), Product Design and Development, McGraw-Hill Inc.,US; 2nd Revised edition
edition.
Whittingham, Ian (2009), Hubris and Happenstance: Why Projects Fail, 30th March 2009,
gantthead.com.
Young, Ralph R. (2001), Effective Requirements Practice, Addison-Wesley.
Zannier, Carmen, and Maurer, Frank (2005), A Qualitative Empirical Evaluation of Design
Decisions, Human and Social Factors of Software Engineering (HSSE), May 16, 2005, St.
Louis, Missouri, USA.
BIOGRAPHY
Trevor Harrison's research interests are in software systems architecture and knowledge
management. His background is in software development (real-time information systems),
technology change management and software engineering process improvement. Before
studying full-time for a PhD, he spent 6 years with Logica and 11 years with the Motorola
Australia Software Centre. He has a BSc(Hons) in Information Systems from Staffordshire
University and an MBA (TechMgt) from La Trobe University.
Prof. Peter Campbell has been the Professor of Systems Modelling and Simulation and Research
Leader in the Defence and Systems Institute (DASI) at the University of South Australia since
2004, and is a founding member of the Centre of Excellence for Defence and Industry Systems
Capability (CEDISC), both of which have a focus on up-skilling government and defence
industry in complex systems engineering and systems integration. He currently leads the
design for the simulation component of the DSTO MOD funded Microcosm program and is
program director for two other DSTO funded complex system simulation projects. Through
mid 2007, he consulted to CSIRO Complex Systems Science Centre to introduce complex
system simulation tools to support economic planning of agricultural landscapes.
PROCESS IMPROVEMENT
APPLYING BEHAVIOR ENGINEERING TO PROCESS
MODELING
David Tuffley, Software Quality Institute, Griffith University
Terry Rout, Software Quality Institute, Griffith University
Nathan, Brisbane, Qld. 4111, AUSTRALIA
[email protected] | [email protected]
Abstract: The natural language used by people in everyday life to express themselves is often
prone to ambiguity. Examples abound of misunderstandings occurring due to a statement
having two or more possible interpretations. In the software engineering domain, clarity of
expression when specifying the requirements of software systems is one situation where
absence of ambiguity is important. Dromey’s (2006) Behavior Engineering is a formal
method that reduces or eliminates ambiguity in software requirements. This paper seeks an
answer to the question: can Dromey’s (2006) Behavior Engineering reduce or eliminate
ambiguity when applied to the development of a Process Reference Model?
INTRODUCTION
Behavior Engineering has proven successful at reducing or eliminating the ambiguity
associated with software requirements (Dromey, 2006). But statements of software
requirements are not the only kind of artefact developed in the software engineering domain
that need to be clear and unambiguous. Process Reference Models (PRMs) are another category
of software development artefact that might also benefit from being clear and unambiguous.
A Process Reference Model is a set of descriptions of process entities defined in a form suited
to the assessment and measurement of process capability. PRMs have a formal mode of
expression as prescribed by ISO/IEC 15504-2:2003. PRMs are the foundation for an agreed
terminology for process assessment (Rout, 2003).
The benefits of a method for achieving greater clarity are twofold: (a) PRM developers would
gain from improving the efficiency of process model development, and (b) users of process
models would benefit by achieving a clearer understanding of the underlying intention of a
process which then serves as a consensus starting point for determining how a process might
be applied in their own case. This paper therefore examines the ability of Behavior
Engineering to disambiguate a particular PRM currently being developed.
This paper illustrates how Dromey's Behavior Engineering method (2006) can be used to
disambiguate process models, making the resulting model clearer and easier to understand. It
is suggested that this method has broader applicability in the Software and Systems
Engineering domains. The paper examines in detail how Behavior Engineering has been
applied in practice to a specific Process Reference Model developed by the authors. Before
and after views are given of several process outcomes that had already passed through three
previous reviews to remove ambiguity. The Behavior Engineering analysis results in evident
improvements to clarity.
In a more general sense, it is suggested that this method may be helpful to the modelling of
processes at both project and organisational levels, including project-level processes, high-level policy documents, and project agreements.
WHAT IS BEHAVIOR ENGINEERING?
Overview
Essentially, Behavior Engineering is a method for assembling individual pieces to form an
integrated component architecture. Each requirement is translated into its corresponding
‘behavior tree’ which describes unambiguously the precise behaviors of this particular
requirement (Glass, 2004). The ‘tree’ is built up from (a) components, (b) the states the
components become, (c) the events and decisions/constraints associated with the components,
and (d) the causal, logical and temporal dependencies associated with the component (Glass,
2004).
When each component is modelled in this way, and then integrated into a larger whole, a clear
pattern of intersections becomes evident. The individual components fit together like a jigsaw puzzle to form a coherent component architecture in which the integrated behavior of the
components is evident. One component establishes a precondition for another component to
perform its function and so on. This allows a software system to be constructed out of its
requirements, rather than merely satisfying its requirements (Glass, 2004).
Duplications and redundancies are identified and removed, for example, when the same
requirement is expressed twice using different language in different places. Another benefit is
that requirements traceability is managed with greater efficiency by creating traceable
linkages between requirements as they move towards implementation.
Historical context
The practices now described as behavior engineering evolved from earlier work in which an
approach for clarifying and integrating requirements for complex systems was developed
(Dromey, 2001). This remains a significant application of the approach (Dromey, 2006);
however, as the technique evolved, it became apparent that it could be applied to more general
descriptions of systems behavior (Milosevic and Dromey, 2002). To date, some preliminary
exploration of applying the technique to the analysis and validation of process models has
been undertaken.
The OOSPICE Project (Stallinger et al, 2002) had the overall aim of improving time-to-market, productivity, quality and re-use in software development by focussing on the
processes and technology of component-based software development (CBD). OOSPICE
combined four major concepts of software engineering: CBD, object-oriented development,
process assessment and software process improvement. Its objectives were the definition of
(a) a unified CBD process metamodel, (b) a CBD assessment methodology, (c) the resulting
component-provider capability profiles, and (d) a new CBD methodology and extensions to
ISO/IEC 15504 Information Technology: Process Assessment (Joint Technical Committee
IT-015, Software and Systems Engineering, 2005).
A key part of OOSPICE was the definition of a model of coherent processes addressing
the issues of component-based development; the process model was strongly aligned to
ISO/IEC 12207 Standard for Information Technology-Software Life Cycle Processes (1998),
but incorporated significant additional processes that specifically addressed CBD issues. The
process model was developed following the approach of ISO/IEC 12207, with a series of
structured activities and defined tasks. Additional detail on input and output work products
was also specified for each activity.
The process model was examined using the behavior tree method in order to assess its
consistency and completeness. The behavior tree analysis was highly successful; a total of 73
task level problems, 8 process level problems and numerous task integration problems were
identified. In addition, examples were found where fragments of tasks were identified which
subsequently have no integration point with the larger process tree – a weakness caused by
unspecified or inconsistent task inputs or outputs.
An indicative example of the behavior tree method applied to a single process is shown below
(explanation of notation given later):
[Figure content not reproduced: an integrated behavior tree of tasks 5.2.3.1 to 5.2.3.5, showing component states such as Statement of Requirements [Written]/[Available] and User Requirements [Expressed]/?Available?, with missing, implied and unchanged nodes distinguished by shading.]
Figure 1 – Behavior Tree Analysis – OOSPICE Process Model
Figure 1 shows the integrated tree resulting from analysis of a single process. It clearly shows
missing (dark shaded: 5.2.3.1, 5.2.3.4) and "implied" but unstated elements (light shaded:
5.2.3.1, 5.2.3.3, 5.2.3.5), and also a failure in integration, resulting in lack of overall
consistency (Ransom-Smith, McClung and Rout, 2002). The medium-shaded boxes were
unchanged.
Encouraged by success on OOSPICE, the technique was subsequently applied to the review
of the Capability Maturity Model Integration (V 1.2) (Chrissis, Konrad and Shrum, 2003).
This work was undertaken in the context of a review of drafts of the SEI’s Capability
Maturity Model Integration (CMMI) V1.2, and the results formed the basis of requests for
change submitted to the SEI; given resource constraints, it was not possible to apply the
technique to support the complete review, but where problems were seen to exist, an analysis
was conducted.
Figure 2 is indicative of how the technique helped to clarify an ambiguity in the specification
of the Requirements Development Process Area in CMMI:
[Figure content not reproduced: behavior tree fragments for PA157.IG101.SP101.N101 (PLC: Project Life Cycle; STAKEHOLDER {Customer}; Requirements) showing that the text could mean either "Requirement# has impact on :> product" or "PLC Activity# has impact on :> product", annotated "ALL PLC activities are addressed by the requirements. How do I show that in the model?" and "Text could mean either of these two".]
Figure 2 – Behavior Tree Analysis – Requirements Development Process Area
Given the potential identified in these two applications of the approach, it seemed logical to
apply the Behavior Tree approach to the larger task of verifying a complete model. The
subject of the current study is a specification for a set of organizational behaviors, specified in
terms of purpose and outcomes of implementation, which would support and reinforce
effective leadership in organizations, and particularly in integrated and virtual project teams.
Applying Behavior Engineering to a complete model
From the discussion above, it might reasonably be hypothesised that, given the parallels
between process models and software system requirements (both being sets of required behaviors
and attributes expressed in natural language), Behavior Engineering may prove useful in
verifying a process reference model.
LEADERSHIP PROCESS REFERENCE MODEL PROJECT OVERVIEW
The leadership of integrated virtual teams is a topic in the software engineering domain that
has received little attention until a project to develop such a process reference model was
undertaken by the Software Quality Institute.
The topic is an important one, considering the increasing trend in a globalised environment
for complex projects to be undertaken by virtual teams. The challenges of bringing any
complex project to a successful conclusion are multiplied by the coordination issues inherent
in virtual environments.
The Leadership Process Reference Model (PRM) is being developed using a Design Research
(DR) approach (Hevner, 2004). DR is well-adapted to the software engineering domain, and
IT development generally, being used to good effect by MIT’s Media Lab, Carnegie-Mellon’s
Software Engineering Institute, Xerox’s PARC and Brunel’s Organization and System Design
Centre (Vaishnavi and Kuechler, 2004/5).
In this project, DR is applied in the following way, consistent with Hevner’s guidelines:
- A designed artefact is produced from a perceived need, based on a comprehensive literature review.
- A series of review cycles follow in which the artefact is evaluated for efficacy by a range of stakeholders and knowledgeable persons and progressively improved. In this project, five reviews are performed.
- The first and second reviews involve interviews (four interviews per round) with suitably qualified practitioner project managers. These validate the content of the PRM.
- The third review applies ISO/IEC TR 24774 Software and systems engineering -- Life cycle management -- Guidelines for process description (2007) to achieve consistency in form and terminology of PRMs in the Software Engineering domain.
- The fourth review applies Dromey’s Behavior Engineering to the draft PRM.
- The fifth review is by an Expert Panel comprised of recognized experts in the field of PRM-building.
APPLYING BEHAVIOR ENGINEERING TO VERIFY PRM
In this project, Behavior Tree (a subset of Behavior Engineering) verification is applied as the
fourth (of five) reviews. Behavior Tree analysis could have been applied at any stage. The
circumstances of this particular project determined that the Behavior Tree analysis
verification was performed towards the end, not the first or last review stage.
Being the second last review, it might be construed that the number and extent of changes that
resulted from applying BE is an indication of its efficacy as a model verification tool. The
previous three reviews notwithstanding, Behavior Tree analysis resulted in a significant
number of changes. Indeed, most of the process outcomes needed to be reworded for clarity.
Unnecessary qualifiers were removed, conjoined outcomes were split into two, each
concerned with a single clear point.
Behavior Engineering comprises a series of related activities, performed in a broad
sequence, beginning with the Behavior Tree and followed by the Composition. Within the space
available, this paper concerns itself with the Behavior Tree component of the broader
Behavior Engineering process.
BEHAVIOR TREES – FIVE W’S (AND ONE H)
The Behavior Tree approach is based on the systematic application, with associated formal
notation, of the principle of comprehensive factual description of an event known as the Five
W’s (and one H) whose origins extend back to classical antiquity. In the 1st Century BC,
Hermagoras of Temnos quoted the 'elements of circumstance' as the loci of an issue (Wooten,
1945):
Quis, quid, quando, ubi, cur, quem ad modum, quibus adminiculis
(Who, what, when, where, why, in what way, by what means)
In the modern world, this dictum has evolved into who, what, when, where, why and how.
This principle is widely recognised and practiced in diverse domains such as journalism and
police work, indeed almost anywhere that comprehensive and unambiguous description of
events or attributes is needed.
Translated to the Software Engineering domain, who, what, when, where, why and how
becomes Behavior Tree Notation. This is a branched structure showing component-states. The
table below shows the application of the Behavior Tree aspect of BE in which each distinct
component is described in terms of who, what, when, where, why and how, or the subset of
these six descriptors that is applicable to this particular component.
Behavior Trees are therefore defined as a formal, tree-like graphical device that represents
behavior of individuals or networks of entities which realize or change states, make decisions,
respond-to/cause events, and interact by exchanging information and/or passing control
(Dromey, 2002). Naming conventions, elements and syntax are illustrated below:
Table 1: Variable naming conventions (Dromey, 2007b)
Table 2: Elements of a Behavior Tree node (Dromey, 2007b)
Figure 3: Behavior Tree Node concrete syntax example (Dromey, 2007b)
FUNCTIONAL REQUIREMENT TO BEHAVIOR TREE – INFORMAL TO FORMAL
Functional requirement. When a car arrives, if the gate is open, the car proceeds, otherwise
if the gate is closed, when the driver presses the button, it causes the gate to open.
Behavior Tree. Translating the above statement into Behavior tree is illustrated below:
Figure 4: Functional requirement to Behavior Tree notation (Dromey, 2007a)
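As a minimal sketch (ours, simplifying the published Behavior Tree notation rather than reproducing it), the car/gate requirement above could be captured as a tree of component-state nodes, which makes the causal ordering of the informal statement explicit.

```python
# Minimal sketch of the car/gate functional requirement as a tree of component-state nodes.
# This simplifies the published Behavior Tree notation to component / behaviour / children.

class BTNode:
    def __init__(self, component, behaviour, children=None):
        self.component = component      # e.g. CAR, GATE, DRIVER
        self.behaviour = behaviour      # a state [..], event ??..??, or condition ?..?
        self.children = children or []

    def show(self, depth=0):
        print("  " * depth + f"{self.component} {self.behaviour}")
        for child in self.children:
            child.show(depth + 1)

tree = BTNode("CAR", "??arrives??", [
    BTNode("GATE", "?open?", [
        BTNode("CAR", "[proceeds]"),
    ]),
    BTNode("GATE", "?closed?", [
        BTNode("DRIVER", "??presses button??", [
            BTNode("GATE", "[opens]", [
                BTNode("CAR", "[proceeds]"),
            ]),
        ]),
    ]),
])
tree.show()
```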
REMOVING AMBIGUITY
Statement: The man saw the woman on the hill with a telescope. This statement can be
interpreted in at least three different ways, a situation not uncommon with natural language. The
developer must determine which interpretation is valid.
Figure 5: Resolving ambiguity using Behavior Tree notation (Dromey, 2007a)
In this example, the application of BT notation clarifies the statement by establishing the
precondition (that the woman is located on the hill) from which the primary behavior can then
be distinguished (that the man saw the woman by using the telescope).
APPLYING BEHAVIOR TREE NOTATION TO A PROCESS MODEL
The left-hand column of the table below shows the outcomes of the V0.3 PRM before
applying the Behavior Tree notation. The Behavior Tree component column is what results
from applying the who, what, when, where, why and how (or a subset) using formal notation,
from which a clear, simple restatement of the outcome can be derived, as shown in the third
column. Note that the material removed from the outcome is not discarded, but relocated to
the Informative Material section (not shown) where it serves a useful purpose for persons
seeking a fuller understanding of the outcome. Refer to the Rationale for Change for
discussion on specific improvements made by applying Behavior Tree notation.
In general terms, the improvements derived from the application of BT are greater clarity and
economy of words (e.g. in the first example below, 17 words in V0.3 become 8 in V0.4 by
rephrasing 'what is to be accomplished' as simply 'goal(s)' and moving the qualifier
'ideally seen as an accomplished fact' to the informative section). BT highlighted where and
how these economies of expression could be made by applying the process illustrated in
Figure 5 to remove ambiguity; in other words, the more informal language of V0.3 was
rendered into formal language in the V0.4 PRM.
An advantage of BT notation here is that it provides a rigorous, consistently applied editorial
logic for people without editorial qualifications or much editing experience. An experienced
editor may achieve the same results without BT notation; anyone else would arguably benefit
from its application.
Each example below gives the V0.3 PRM outcome, the Behavior Tree component derived from it, the resulting V0.4 PRM outcome, and the rationale for change.

FIRST EXAMPLE

V0.3 PRM: Leader creates a shared vision of what is to be accomplished, ideally seen as an accomplished fact.
Behavior Tree component: 1.1.1 LEADER (creates) what SHARED VISION / (of) GOAL(S)
V0.4 PRM: Leader creates a shared vision of the goal(s).
Rationale for change: Goal(s), not 'what is to be accomplished'. Remove the qualification (ideally seen as an accomplished fact) to Informative Material.

V0.3 PRM: Leader clearly communicates the shared vision with team, ideally seen as an accomplished fact.
Behavior Tree component: 1.1.2 LEADER (communicates) what SHARED VISION / (of) GOAL(S)
V0.4 PRM: Leader communicates the shared vision of the goal(s) with the team.
Rationale for change: Goal(s) included.

V0.3 PRM: Leader facilitates strong commitment in team to achieving the shared vision, encouraging resilience in the face of goal-frustrating events.
Behavior Tree component: 1.1.3 LEADER (gets) what COMMITMENT / (to) SHARED VISION / (of) GOAL(S)
V0.4 PRM: Leader gets commitment from team to achieving the goal(s).
Rationale for change: Create a new outcome about resilience (it should be a stand-alone outcome rather than a qualification of the commitment-to-goals outcome).

V0.3 PRM: New outcome in V0.4.
Behavior Tree component: 1.1.4 LEADER (encourages) what RESILIENCE / (in) TEAM when GOAL-FRUSTRATING EVENTS
V0.4 PRM: Leader encourages resilience in team when goal-frustrating events occur.
Rationale for change: New outcome focussing on the important issue of resilience in the face of goal-frustrating events.

V0.3 PRM: Leader develops a concrete and achievable set of goals that support achievement of the shared vision.
Behavior Tree component: 1.1.5 LEADER (develops) what OBJECTIVE(S) / (to) ACHIEVE what GOAL(S)
V0.4 PRM: Leader develops practical objective(s) to achieve the goal(s).
Rationale for change: Practical objectives support the achievement of the goal(s). Remove the qualification (altogether redundant). Change 'shared vision' to 'goals' since the objectives derive directly from the goals.

SECOND EXAMPLE

V0.3 PRM: Leader consistently displays integrity, characterised by trustworthiness, and adherence to principle.
Behavior Tree component: 1.2.1 LEADER (behaves) what INTEGRITY
V0.4 PRM: Leader behaves with integrity.
Rationale for change: Remove qualifiers to the informative section.

V0.3 PRM: Leader consistently displays competence, characterised by technical, interpersonal, conceptual and reasoning skills.
Behavior Tree component: 1.2.2 LEADER (behaves) what COMPETENCE
V0.4 PRM: Leader behaves competently.
Rationale for change: Remove qualifiers to the informative section.

THIRD EXAMPLE

V0.3 PRM: Leader provides richly-textured communications media for team members to use on-demand.
Behavior Tree component: 3.4.1 LEADER (provides) who TEAM-MEMBERS when ON-DEMAND what HIGH-RES ICT / (that is) SYNCHRONOUS
V0.4 PRM: Leader provides team-members with on-demand synchronous high-resolution communications media.
Rationale for change: Rename 'richly-textured' to 'hi-res' (a more common term). Add 'synchronous' as appropriate. Reorder the sentence to be subject-verb-object.

FOURTH EXAMPLE

V0.3 PRM: Leader allocates project requirements before team members are recruited to verify the integrated team structure is appropriate to goals.
Behavior Tree component: 2.3.1 LEADER (verifies) what TEAM-STRUCTURE / when RECRUITING TEAM-MEMBERS (before) how ALLOCATING REQUIREMENTS
V0.4 PRM: Leader verifies team structure before recruiting team-members by allocating project requirements.
Rationale for change: Restructure sentence to place emphasis on the correct aspect (this outcome is primarily about verifying the team structure).

FIFTH EXAMPLE

V0.3 PRM: Leader develops higher capability self-management functions early in the project lifecycle where complex tasks are performed asynchronously in virtual environments (i.e. where temporal displacement is high).
Behavior Tree component: 3.5.2 LEADER (develops) what PERFORMANCE-FUNCTIONS / (that are) SELF-MANAGING / (and) HIGH-CAPABILITY when COMPLEX TASKS / (are performed) ASYNCHRONOUSLY
V0.4 PRM: Leader develops high-capability self-managing performance functions where complex tasks are performed asynchronously.
Rationale for change: Reword to simplify. Take 'early in project' and put in Informative section.

Table 3: Applying Behavior Tree notation to a process model
The Behavior Tree notation analysis was performed by the first named author after receiving
around 60 minutes of training from Professor Dromey. The data shown in Table 3 is a
representative subset of the review done on the V0.3 PRM. As an indication of the defect
density, this version of the model contained 24 processes with 63 outcomes collectively.
Almost all outcomes were changed in some way as a result of the analysis. The kinds of
defects found and fixed are represented in the table above.
Defects are identified as the notation is applied, beginning with the main entity (the leader in
most cases), then a verb that describes what the entity does (e.g. develops, verifies or provides),
followed by the specific what, who or when, as makes sense for each outcome,
in order to build up a complete unit of sense. This process goes beyond simple editing,
however. When applied rigorously to the process model, a high degree of consistency and
clarity of expression is achieved. Even with competent editors, other process models (e.g.
OOSPICE and CMMI, as discussed earlier) do not achieve this level of consistency and
clarity.
The analysis of the 24 processes and 63 outcomes took around six hours to perform, including
the documenting of the analysis using the kind of table seen above (with PRM before and
after, notation and rationale for change).
CONCLUSION
The approach to specifying processes in terms of their purpose and outcomes was developed
in the course of evolution of ISO/IEC 15504 (Rout, 2003) and is arguably a key innovation in
the approach to process definition and assessment embodied in the Standard. By viewing a
process (or collection of processes) in this way, it becomes clear that the outcomes represent
the results of desired organizational behavior that, if institutionalised, will result in
consistently achieving the prescribed purpose. The approach redirects the analysis of process
performance from a focus on conformance to prescribed activities and tasks, to a focus on
demonstration of preferred organizational behavior through achievement of outcomes.
Given this, it is logical to see that the application of the Behavior Tree approach to the
analysis of such process models will be effective. The earlier studies reported here were of a
much smaller scale than the current study, which embraces the full scope of a comprehensive
model of organizational behavior. The aim in applying the approach was to provide a more
formalised verification of the integrity, consistency and completeness of the model than
conventional approaches – based generally on expert review – could achieve.
It may therefore be seen from Table 3 above that applying Behavior Tree notation to the draft
outcomes of a process reference model produced significant improvement to the clarity of the
outcomes by simplifying the language, reducing ambiguity and splitting outcomes into two
where two ideas were embodied in the original.
It is suggested, based on the evidence outlined above, that Behavior Engineering is a useful
tool for model-builders in the domain of model-based process improvement. It also reinforces
claims that the technique is a superior tool for the verification of complex descriptions of
system behavior.
REFERENCES
Chrissis, M.B., Konrad, M., & Shrum, S., (2003). CMMI Guidelines for Process Integration
and Product Improvement. Addison-Wesley, Boston.
Dromey, R.G. (2001) Genetic Software Engineering - Simplifying Design Using
Requirements Integration, IEEE Working Conference on Complex and Dynamic Systems
Architecture, Brisbane, Dec 2001.
Dromey, R.G. (2002). From Requirements To Design – Without Miracles, Whitepaper
published by the Software Quality Institute. Available:
http://www.sqi.gu.edu.au/docs/sqi/gse/Dromey-ICSE-2003.pdf (accessed 13 April 2009)
Dromey, R.G. (2006). Climbing Over the 'No Silver Bullet' Brick Wall, IEEE Software, Vol.
23, No. 2, pp.118-120.
Dromey, R.G. (2007a). Principles for Engineering Large-Scale Software-Intensive Systems
Available: http://www.behaviorengineering.org/docs/Eng-LargeScale-Systems.pdf (accessed
14 April 2009) pg 39.
Dromey, R.G. (2007b). Behavior Tree Notation Available:
http://www.behaviorengineering.org/docs/Behavior-Tree-Notation-1.0.pdf (accessed 10 June
2009) pg 2-3.
Glass, R.L. (2004). Is this a revolutionary idea or not?, Communications of the ACM, Vol
47, No 11, pp. 23-25.
Hevner, A., March, S., Park, J. and Ram, S. (2004). Design Science in Information Systems
Research. MIS Quarterly 28(1): pp 75-105.
ISO/EIA 12207 (1998) Standard for Information Technology-Software Life Cycle Processes.
This Standard was published in August 1998.
ISO/IEC 15504 (2003) Information Technology: Process Assessment. Joint Technical
Committee IT-015, Software and Systems Engineering. Part 2 Performing an Assessment.
This Standard was published on 2 June 2005.
ISO/IEC TR 24774 (2007). Software and systems engineering -- Life cycle management -- Guidelines for process description. This Standard was published in 2007.
Milosevic, Z., Dromey, R.G. (2002) On Expressing and Monitoring Behavior in Contracts,
EDOC-2002, Proceedings, 6th International Enterprise Distributed Object Computing
Conference, Lausanne, Switzerland, Sept, pp. 3-14.
Ransom-Smith, M., McClung, K. and Rout, T. (2002) Analysis of D5.1 – initial CBD process
model using the Behavior Tree method. Software Quality Institute report for the OOSPICE
Project, December 4.
Rout, T.P. (2003) ISO/IEC 15504 - Evolution to an International Standard, Softw. Process
Improve. Pract; 8: 27–40.
Stallinger, F., Dorling, A., Rout, T., Henderson-Sellers, B., Lefever, B., (2002) Software
Process Improvement for Component-Based Software Engineering: An Introduction to the
OOSPICE Project, EUROMICRO 2002, Dortmund, Germany, April.
Vaishnavi, V. and Kuechler, W. (2004/5). Design Research in Information Systems. January
20, 2004, last updated January 18, 2006. URL:
http://www.isworld.org/Researchdesign/drisISworld.htm
Wooten, C.W. (2001) The orator in action and theory in Greece and Rome. Brill (Leiden,
Boston).
SAFETY MANAGEMENT
AND
ENGINEERING
BRINGING RISK-BASED APPROACHES
TO SOFTWARE DEVELOPMENT PROJECTS
Felix Redmill
Redmill Consultancy
London UK
INTRODUCTION
The history of software development is strewn with failed projects and wasted resources.
Reasons for this include, among others:
• Failure to take an engineering approach, despite using the epithet ‘software engineering’;
• Focus on process rather than product;
• Failure to learn lessons and use them as the basis of permanent improvement;
• Neglect to recognise the need for high-quality project management;
• Reliance on tools to the exclusion of understanding first principles; and
• Focus on what is required without consideration of what could go wrong.
If change is to be achieved, and software development is to become an engineering discipline,
an engineering approach must be embraced. This paper does not attempt to spell out the
many aspects of engineering discipline. Rather, it addresses the risk-based way of thinking
and acting that typifies the modern engineering approach, particularly in safety engineering,
and it proposes a number of ways in which a risk-based approach may be incorporated into
the structure of software development.
Taking a risk-based approach means attempting to predict what undesirable outcomes could
occur in the future (within a defined context) and taking decisions – and actions – to provide
an appropriate level of confidence that they will not occur. In other words, it uses knowledge
of risk to inform decisions and actions. But, if knowledge of risk is to be used, that
knowledge must be gained, which means acquiring appropriate information.
In safety engineering, such an approach is essential because the occurrence of accidents
deemed to be preventable is not considered acceptable. (As retrospective investigation almost
always shows how accidents could have been prevented, this often gives rise to contention,
but that’s another matter.) In the security field, although a great deal of practice is carried out
ad hoc, standards are now based on a risk-based approach: identifying the threats to a system,
determining the system’s vulnerabilities, and planning to nullify the threats and reduce the
vulnerabilities in advance.
However, in much of software development, the typical approach is to arrive at a product
only by following a specification of what is required. Problems are found and fixed rather
than anticipated, and consideration is seldom given to such matters as the required level of
confidence in the ‘goodness’ of any particular system attributes.
A risk-based approach carries the philosophy of predicting and preventing, and this is an asset
both in the development of products and the management of projects. This paper therefore
proposes some first steps in creating a foundation for the development of such an approach in
software development and project management. The next section briefly introduces the
subject of risk, and this is followed by introductions to two techniques, used in risk analysis,
which are applicable in all fields and are therefore useful as general-purpose tools.
Subsequent sections offer thoughts on the introduction of a risk-based approach into the
various stages of software development projects.
It is hoped that the explanations offered in this paper are easily understandable, but they do
not comprise a textbook. Risk is a broad and tricky subject, and this paper does not purport to
offer a full education in it.
NOTES ON RISK
Risk exists in every situation in which we find ourselves and arises from every decision and
action that we take. Because of this, we are all practiced risk managers. But our familiarity
with risk is intuitive rather than conscious, and:
• Our successes in intuitive risk management are mostly in simple situations;
• Our failures are usually not sufficiently serious to warrant conscious assessment, and we
  perceive them to be the result of bad luck rather than deficient risk management; and
• Our intuitive risk-management processes are, mostly, not effective in more complex
  situations, such as development projects and modern technological systems.
Psychologists have shown that our perception of risk is influenced by a number of factors, all
of which are strongly subjective. They include:
• Whether the risk is taken voluntarily or not;
• Whether we believe ourselves to be in control or not;
• The level of uncertainty;
• The value of the prize for taking the risk; and
• Our level of fear.
Engineering risk analysis employs two factors: the probability of a defined undesirable event
occurring within a defined time period, and the potential consequences if it did occur. As both
lie in the future, neither can be determined with certainty, and the derivation of both must
include subjectivity. However, given appropriate information, both are estimable.
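As a minimal sketch of how the two factors might be combined qualitatively, consider the following Python fragment. The scales and the matrix entries are illustrative placeholders introduced for this sketch only; they are not values taken from the paper or from any standard.

    # Illustrative only: a qualitative risk matrix with placeholder scales.
    LIKELIHOOD = ["rare", "possible", "likely"]
    CONSEQUENCE = ["minor", "major", "catastrophic"]

    RISK_MATRIX = {
        ("rare", "minor"): "low",        ("rare", "major"): "low",
        ("rare", "catastrophic"): "medium",
        ("possible", "minor"): "low",    ("possible", "major"): "medium",
        ("possible", "catastrophic"): "high",
        ("likely", "minor"): "medium",   ("likely", "major"): "high",
        ("likely", "catastrophic"): "high",
    }


    def risk_rating(likelihood: str, consequence: str) -> str:
        """Combine estimates of likelihood and consequence into a risk rating."""
        if likelihood not in LIKELIHOOD or consequence not in CONSEQUENCE:
            raise ValueError("estimate outside the defined qualitative scales")
        return RISK_MATRIX[(likelihood, consequence)]


    print(risk_rating("possible", "catastrophic"))  # -> "high"

The point of such a table is not the particular ratings but that both inputs are estimates resting on information; with poor information the output is no better than a guess.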
The key is information. In its absence, estimates of probability and consequence can be no
more than guesses (as they often are in project ‘risk workshops’). The importance of adequate
information cannot be over-emphasised. If there is to be confidence in risk estimates, there
must be confidence in the estimates of probability and consequence. And these, in turn,
depend on information in which there is confidence, i.e. information from trusted sources,
and sufficient of it to warrant the level of confidence that is required or claimed.
A great part of risk analysis is, therefore, the acquisition of an adequate amount of
information to provide the basis for risk estimates that are appropriate to the circumstances.
But what is appropriate to the circumstances? This question is answered by considering such
factors as the costs of getting it wrong, the level of confidence needed in the estimates, and
the costs in time and resources to achieve a given level of confidence. The greater the
importance of the enterprise, the more important it is to derive high confidence in risk
estimates, so the more important it is to acquire an appropriate amount of information of
proven pedigree. Yet, thoroughness in acquiring appropriate information is often thwarted by:
• An intuitive belief that we understand risk better than we do; and
• Over-confidence in our ability to estimate by quickly ‘sizing-up’ (i.e. guessing).
Project ‘risk workshops’, sometimes held to ‘identify’ project risks, do not do the trick –
unless we are content with extremely low confidence. But once such a workshop has been
held, many project participants, including management, are unaware of its inadequacy and
believe its results.
There is never a good excuse for failing to collect adequate information, for analysis of an
adequate amount of the right information almost always disproves pre-held beliefs. It is always wise
to be suspicious of our preconceptions. A saying that is attributed, unreliably, to many
different authorities makes the point: ‘It ain’t so much the things we don’t know that get us in
trouble. It’s the things we know that ain’t so.’
Just as important as obtaining accurate risk values is the process of thinking that makes use of
them. Risk-based thinking does not focus attention only on what we want to achieve; it also
tries to determine what risks lie ahead, what options we have for managing them, how to
decide between the options, and then what actions to take.
In well understood situations, risks may be addressed intuitively, by reference to experience
or documentation. Or, rules for the management of previously experienced risks may be
created. Indeed, risk-management mechanisms are built into the processes considered integral
to traditional project management. But such devices flounder in novel or complex
circumstances, or when project managers, often because of inexperience or pressure from
management, cut corners by eliminating, changing, or failing to enforce processes or rules
whose origin and purpose they don’t understand.
When risky situations are well understood, it may be possible to make risk-management
decisions quickly and with confidence, without needing to obtain and analyse further
information. But when a situation is not well understood, it is essential to be more formal.
Then, the search for information needs to focus on sources that are relevant to the purpose of
the study and on what contributes to improved understanding of the risks. The level of
confidence that may be claimed in results depends on the pedigree of the sources.
TWO GENERAL-PURPOSE TECHNIQUES
In safety engineering, various techniques for directing the search for relevant information
have been designed and developed. In this section, two are described. Their application is not
restricted to the field of safety; they are useful in most situations, including software
development and project management.
Guidewords
Often we talk glibly about ‘failure’, as though it can occur in only one way. But there may be
many ways of failure (many failure ‘modes’) and, crucially, each failure mode carries
different potential consequences. One technique, Hazard and Operability Studies (HAZOP –
see Redmill et al (1999)), is based on the use of ‘guidewords’, each of which focuses
attention on a possible failure mode. By using a guideword to raise questions on its associated
mode of failure, information is gathered on both the mode of failure’s likelihood of
occurrence and its potential consequences.
A generic set of guidewords, with universal application, is presented in Table 1. In some
cases they may need interpretation, depending on the circumstances. For example, ‘No’ may
need to be interpreted as ‘None’ or ‘Never’; ‘As well as’ may need to be interpreted as
‘More’ when applied to an amount (say, of data), ‘Too long’ in the context of a time interval,
or ‘Too high’ when applied to a rate of data transmission. It is this flexibility that makes the
guidewords universal; without it, they would appear specific to certain situations.
Table 1: Guidewords and their Definitions

No: No part of the design intent is achieved
As well as: All the design intent is achieved, but with something more
Part of: Some of the design intent (but not all) is correctly achieved
Other than: None of the design intent is achieved, but something else is
Early: The design intent is achieved early, by clock time
Late: The design intent is achieved late, by clock time
Before: The design intent is achieved before something that should have preceded it
After: The design intent is achieved after something that it should have preceded
As a simple example of the application of guidewords, consider the production (by a
software-based system) of an invoice. It is immediately clear that reference to ‘failure’ is
vague, for there are many ways in which it may be faulty, and Table 2 shows the use of
guidewords in identifying them.
Table 2: Use of Guidewords in Identifying and Examining Invoice Production Failure Modes

No: No invoice is produced. Potential consequences (not exhaustive): customer suffers no loss but may lose confidence in company; company fails to collect payment.
As well as: Invoice contains additional items for which the customer is not responsible. Potential consequences: customer loses confidence and may cease to do business with the company.
Part of: Invoice contains only some of customer’s items. Potential consequences: customer may lose confidence in company; company does not collect full payment.
Other than: A document other than the invoice (perhaps another customer’s invoice) is produced. Potential consequences: customer loses confidence and may cease to do business with the company; company does not collect payment.
Early: Invoice is produced before all work is done. Potential consequences: a further invoice has to be produced; customer may be confused.
Late: Invoice is produced after it should have been. Potential consequences: payment is collected late; if systematic, company suffers cash-flow problems.
Before: Not relevant.
After: Not relevant.
Once the credible modes of failure and their potential consequences have been identified,
further investigation may be carried out to determine the value of each failure mode to a
defined stakeholder (in this case, the customer or the company). For example, we may be
interested to determine what events might result in the loss of a customer (or customers in
general), or what could lead to the company not being paid, or to a customer experiencing
unacceptably low quality of service. Then, the actions taken would be to ensure that the risk
of occurrence of such events is low – for example by improving reliability through design,
strengthening the management and quality of development, being more rigorous in testing, or
improving the monitoring of operation.
Used judiciously, these guidewords can facilitate identification of failure modes in just about
any situation; they comprise a general-purpose tool. And, as seen from the above example,
the steps taken to reduce the likelihood of failure, or to achieve quality, are not necessarily
changed by taking a risk-based approach; rather, those steps are directed more effectively.
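The guideword process can also be viewed as simple, systematic bookkeeping: for each design intent, every guideword is considered in turn, and the resulting failure mode and consequences (or ‘not relevant’) are recorded. The sketch below illustrates that idea for the invoice example; the record layout and function names are conveniences invented for this sketch, not part of HAZOP itself.

    # Illustrative only: using the generic guidewords of Table 1 to drive the
    # recording of failure modes for a design intent (the invoice example).
    from dataclasses import dataclass
    from typing import Optional, List

    GUIDEWORDS = ["No", "As well as", "Part of", "Other than",
                  "Early", "Late", "Before", "After"]


    @dataclass
    class FailureModeRecord:
        design_intent: str
        guideword: str
        failure_mode: Optional[str]     # None where the guideword is not relevant
        consequences: List[str]


    def hazop_study(design_intent, interpretations):
        """Walk every guideword so that no failure mode is overlooked."""
        records = []
        for gw in GUIDEWORDS:
            mode, consequences = interpretations.get(gw, (None, []))
            records.append(FailureModeRecord(design_intent, gw, mode, consequences))
        return records


    invoice_interpretations = {
        "No": ("No invoice is produced", ["Company fails to collect payment"]),
        "Part of": ("Invoice contains only some of the customer's items",
                    ["Company does not collect full payment"]),
        "Late": ("Invoice is produced after it should have been",
                 ["Payment is collected late"]),
        # ... remaining guidewords interpreted, or left as not relevant
    }

    for record in hazop_study("Produce an invoice", invoice_interpretations):
        print(record.guideword, "->", record.failure_mode or "not relevant")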
Fault Trees
Another universally applicable technique is fault tree analysis. In this method, a single ‘top
event’ is selected and top-down analysis is carried out to identify its potential causes. First,
the immediate causes are identified, as in the example shown in Figure 1. In this example, the
top event would result from any one of the potential causes, so they are linked to it by the
logical OR function.
Figure 1: An example fault tree (to first-level causes). The top event ‘Car fails to start’ is linked by an OR gate to the first-level causes: no petrol delivered, no oxygen delivered, battery problem, electrical fault, and other causes.
In some cases, the occurrence of the top event would require the concurrence of two events,
so they would be linked to it by a logical AND function, as in Figure 2. Clearly, reliability is
improved if failure requires two (or more) simultaneous events rather than a single event, so
the fault tree may be used to inspire a solution as well as serving as an analysis tool. Indeed, this is the
principle of redundancy, which is mostly applied through hardware replication, but also, in some
cases, in software.
Figure 2: A fault tree with causes linked by AND. The top event ‘Loss of power’ occurs only if ‘Mains fails’ and ‘Generator fails’ occur together.
Once the immediate causes have been determined (see Figure 1), their (the second-level)
potential causes are determined. For example, the failure to deliver petrol to the engine may
be because there is none in the tank or its transmission from the tank is impeded, which in
turn may result from a blockage, a leak, or a faulty pump. The battery problem may be that
the battery is discharged (flat) or that the electrical connection is faulty. Then the third-level
causes are determined, and so on, until a full analysis is completed. In systems in which
probabilities of causes are known from historic data (for example, electromechanical systems
whose component reliabilities are recorded), calculations may be made to arrive at a
probability of occurrence of the top event. However, such results are not always accurate and
can be misleading.
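Where such calculations are attempted, the evaluation proceeds bottom-up: assuming independent basic events, an AND gate takes the product of its children’s probabilities, and an OR gate takes one minus the product of their complements. The sketch below illustrates this in Python; the numeric probabilities are invented for the example and are not taken from the paper.

    # Illustrative only: bottom-up evaluation of a small fault tree, assuming
    # independent basic events. The probabilities below are invented.
    from typing import List, Union

    Node = Union["Gate", float]  # a node is either a gate or a basic-event probability


    class Gate:
        def __init__(self, kind: str, children: List[Node]):
            assert kind in ("AND", "OR")
            self.kind = kind
            self.children = children

        def probability(self) -> float:
            probs = [c.probability() if isinstance(c, Gate) else c
                     for c in self.children]
            if self.kind == "AND":
                result = 1.0
                for p in probs:
                    result *= p
                return result
            # OR of independent events: 1 - product of the complements
            complement = 1.0
            for p in probs:
                complement *= (1.0 - p)
            return 1.0 - complement


    # Loss of power requires the mains AND the generator to fail together.
    loss_of_power = Gate("AND", [0.01, 0.05])
    # Car fails to start if any one of several causes occurs.
    car_fails = Gate("OR", [0.001, 0.0001, 0.02, 0.005])

    print(loss_of_power.probability())  # 0.0005
    print(car_fails.probability())      # roughly 0.026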
Used qualitatively, the fault tree is a valuable tool in many situations. For example, it may be
used to model the ways in which a project might have an unsuccessful outcome – and, thus,
provide guidance on ways in which this could be avoided. It is traditionally said that there are
three principal ‘dimensions’ of a project – time, budget, and to specification – so that project
failure is defined in terms of any one of the three. So, if each of these is taken as a top event,
the resulting fault trees would reveal the ways in which a project could fail. Deriving such
fault trees for an entire project would result in too large a number of contributing causes, and
be too vague for specific instances – though the causes that they throw up could be built into
project management checklists. However, such fault trees could be useful if derived for the
ends of project stages, or at defined milestones. Figure 3 shows an example of first-level
causes for time overrun at the first project milestone. These, and the results of next-level
causes, may be used as indicators of essential project-management responsibilities.
Like guidewords, a fault tree is a tool of universal application, which may be employed to
advantage in most risk analyses. This brief introduction neither explains every detail of fault
tree creation nor presents its difficulties. A fuller account is given by Vesely et al (1981).
Figure 3: A Fault Tree of First-level Causes of Time Overrun at the First Project Milestone. The top event ‘Time overrun at first milestone’ is linked by an OR gate to: project started late, project authority delayed, planning too optimistic, necessary documents unavailable, delay in creating project team, unexpected delays, team lacks necessary skills, and other causes.
RISK ANALYSIS AT THE OBJECTIVES STAGE OF A PROJECT
A thorough risk analysis consists of a number of activities:
• Identifying the hazards that give rise to risk;
• Collecting information to enable an estimate to be made of the likelihood of each hazard
  maturing into an undesirable event;
• Collecting information to enable an estimate to be made of the potential consequences if
  the undesirable event did occur;
• Analysing the information to derive estimates of likelihood and consequence;
• Combining the estimates to arrive at values of the risks arising out of the hazards; and
• Assessing the tolerability of each risk.
In safety engineering, and in other fields when the stakes are high, each of these activities
must be carried out so as to provide high confidence in the results (indeed, the two techniques
described above are used in the first four of these activities). However, it is not always
necessary to be so thorough. Nor is it always possible to be, for there are times when
sufficient information is not obtainable (e.g. at the early stages of a new disease, and in the
early days of climate-change debate). The following two generalisations may be made:
• The thoroughness of a risk analysis and confidence in its results are constrained by the
  availability of appropriate information; and
• The necessity for thoroughness is usually dependent on the criticality of the situation.
An instance when detailed information is unavailable is at the Objectives stage of a project,
when product details have not been designed and only strategic proposals have been made.
Yet, at this point, carrying out a risk analysis is invaluable, for it can identify risks that could
lead to project failure and huge consequential losses. Indeed, the report into the London
Ambulance Service (1993) showed that failure to consider the risks, particularly early in the
project, was instrumental both in awarding the contract to replace the service’s allocation
system to an unsuitable company and in the total failure of the project.
The value of analysis at the Objectives stage may be exemplified by a proposal for a
hospital’s patient-records system, the objectives of which might be stated by management as
being to:
• Store the clinical and administrative records of all patients;
• Provide physicians with rapid access to patients’ medical histories;
• Provide the means of updating patients’ records with diagnoses, prescriptions, and illness
  histories;
• Provide nurses with rapid access to treatment and dosing specifications;
• Provide management with means of documenting, accessing and analysing all patient
  transactions, both medical and administrative;
• Produce invoices for services provided to patients.
These objectives are defined from the perspectives of some of the system’s stakeholders,
specifically the hospital’s management and medical staff. Typically, and importantly, they
state what they require of the system – which amounts to the creation and updating of
records, storage, provision of access, and the output of analysed information – all database
and information-system facilities. The objectives do not reveal the difficulties involved in
attempting to meet the stated goals or the risks attached to the failure of any of them. Yet, as
revealed by the London Ambulance Service Inquiry, understanding such risks at this early
stage is crucial. Deeper scrutiny reveals that a slow system would not be used by doctors (a
fact not considered until too late by those responsible for the UK’s health systems), that the
loss of records would result in the hospital not being paid for services, that nurses and
administrators should not have access to full patient records, that if dosing information were
corrupted patients’ lives would be threatened, and that unauthorized access could result in
breaches of safety. Moreover, the safety risks are to the patients who, though at the heart of
the system, are not mentioned in the objectives except by allusion. From these observations,
it becomes apparent that the system should not only meet its functional requirements but also
be highly available, secure and safe.
Such revelations are likely to be daunting to management who supposed their system to be a
simple one. They should lead management to recognize the need for appropriate expertise in
its specification, design, development and implementation – and, if the development project
is to be contracted out, in contracting and management of the project.
A risk analysis at the objectives stage facilitates:
• The clarification of objectives;
• The early detection of the implications of – and, importantly, the risks thrown up by – the
  stated objectives;
• The detection of conflicts between objectives; and
• The definition of objectives that should have been defined but weren’t.
In addition to facilitating better definition of objectives, a risk analysis at this stage throws up
the need to take decisions about the future of the clarified objectives. For example, when
management comes to understand that project risks include safety, will they want to proceed
with the project as defined? In some cases they wouldn’t. Options for action provided by
analysis of the objectives include:
• Cancelling some objectives. This may be done, for example, because it is realized that
  their implementation would take too long, would require the use of technologies that are
  untried or for which we possess no expertise, or would carry risks (e.g. safety risks) with
  which we do not wish to be involved.
• Cancelling the project. This may be done if the risks exemplified in the previous point
  were carried by the core objectives.
And, if it is decided to proceed, that is to say, to accept the risks, the analysis provides the
basis for defining:
• Requirements, to be included in the specification, for the appropriate management of the
  identified risks; and
• Special testing requirements for the risk management aspects of the design and, later, the
  product.
This combines risk-based and test-based approaches, and thus offers a new and deeper way of
raising confidence in the ultimate success of both the project and the product.
A risk analysis also provides information for the credentials required of (and, thus, for the
selection of) a contractor, if development is to be contracted out – a point emphasised in the
London Ambulance Service Inquiry (1993).
RISK ANALYSIS AT THE SPECIFICATION STAGE
At the Objectives stage, there is little detailed information about an intended project, so both
the objectives and the identified risks are mostly at a strategic level. But, as a project
progresses, more detail is introduced at every stage, and this has two major effects:
• It allows more thorough identification and analysis of risks; and
• It is likely to introduce further risks.
Thus, risk analysis should be carried out when the principal product of each stage has been
completed. For example, when the specification has been developed, a risk analysis should
serve at least four purposes. First, it should identify and analyse new risks introduced by the
functional requirements. Often these appear straightforward, but if, for example, they invoke
a domain, such as safety or security, with which the developers are unfamiliar, or if they call
for a level of system or functional integrity not previously achieved, the risks involved should
be recognized and the implicated requirements given special design considerations.
Second, it should identify and analyse new risks introduced by the non-functional
requirements. Often these are assumed to accord with what is ‘normal for such systems’ or at
least technologically possible. But they should be carefully assessed. The required response
times, load capabilities, and reliability may not easily be achievable, or not achievable at all
by the staff employed to develop the system. Analysts should also identify the absence of
necessary non-functional requirements – a frequent deficiency in specifications.
Third, it should identify and analyse the risks introduced by any constraints placed on the
product or the project. Examples are: unfamiliar technologies, software and hardware
platforms, systems with which the system must be integrated, and the time allowed for testing.
Fourth, the analysis should be used to determine if requirements have been specified for the
mitigation of the risks identified at the Objectives stage.
It is sensible also to identify requirements that do not contribute to any of the objectives.
Although they may add value for some stakeholders, they do not add strategic value (at least,
not according to the project’s defined strategic intent) and are likely to cause time over-run.
Given the results of the analysis, options for risk-based action include:
• Cancel risky requirements – which requires re-consideration of the objectives;
• Contract the project, or parts of it, to others with appropriate competences and experience;
  and
• Accept the risks and plan their mitigation, e.g. by planning and implementing changes to
  the constitution and competence of the development team, by designing risk-reduction
  measures into the design, and by defining controls in operational procedures.
Thus, the project now proceeds with the purpose not only of meeting the specified
requirements, but also of mitigating identified risks and, thus, avoiding problems that could
otherwise have occurred. Not to carry out a risk analysis and take appropriate actions results
in the unrecognised risks maturing into project and product problems and, thus, leading to
time over-run, budget over-spend, unsatisfactory products, or total project failure – examples
of all of which abound.
RISK ANALYSIS AT THE DESIGN STAGE
When a design has been completed – whether architectural or detailed – it should be
examined from two risk-based perspectives. First, a technical check should be made to ensure
that it provides features that reduce the previously identified risks to tolerable levels. This
check should trace a path through all previous risk analyses back to the Objectives stage.
Each time an identified risk was accepted, requirements for mitigating it should have been
introduced, and those to be met in the design should now be verified. The check should also
establish whether any risk-mitigation requirements are to be effected by operational
procedures, and if so, it should confirm that their design is not neglected.
Second, a study should be undertaken to determine what risks the design introduces into the
operational system or into the project. Typically, the rationale of a design is that it should
meet the functional and non-functional requirements. But what if a failure should occur
during operation (e.g. of a function or a component)? In some cases, the resulting loss may be
deemed unimportant; in others, it may be catastrophic. So, by carrying out a meticulous study
of a design – say, using HAZOP and its guidewords, or fault trees, as discussed earlier – not
only are the risks identified, so are the most critical aspects of the system. Actions that follow
may then include:
• Making design modifications for risk reduction, e.g. introducing redundancy or protection
  functions, or even just error messages;
• Informing the software development teams of the most critical functions so that they can
  apply their most rigorous processes (including appropriate staff) to them;
• Defining test requirements for the risk-reduction functions and, more generally, providing
  information to inform risk-based testing; and
• Defining operational procedures to ensure that the risk-reduction and risk-management
  measures are effective and, if necessary, to introduce further risk management.
RISK-BASED TESTING
For well known reasons, exhaustive software testing is not possible in finite time and, if it
were, it would not be cost-effective. Planned testing must therefore be selective. But on what
basis should selection be made? It makes sense to carry out the most demanding testing on
the most important aspects of the software, that is, where the risks attached to failure are
greatest. But how might this be done systematically?
Normally, derivation of a risk value requires estimates of both consequence and likelihood.
However, though the consequence of failure of a defined service provided by the system may
be estimated, there is no information on which to base an estimate of the service’s probability
of failure prior to testing the relevant software. Yet there are ways in which a risk-based
approach can achieve, or at least improve, both effectiveness and efficiency in testing.
First, a ‘single-factor’ analysis may be conducted, based on consequence alone, on the basis
that if the consequence of failure of a service is high then the probability of failure is desired
to be low. On the assumption that testing – followed by effective fixing – reduces the
probability of failure, estimates of the consequences of failure of all the services provided by
a system may be used to determine the rigour of testing of the items of software that create
those services. Of course, it cannot be proved that the probability of failure has been reduced
to any given level, but a relationship between severity of consequence and rigour of testing
can be defined for a project. This technique carries the major advantages that:
• Sufficient information is usually available for consequence to be determined accurately,
  provided that analysts meticulously seek it out; and
• Both the consequence analysis and the resulting test planning can be done in advance of
  development of the software itself, so it is a strategic technique.
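A simple way of recording such a project-defined relationship between severity of consequence and rigour of testing is sketched below. The severity categories and rigour levels are placeholders chosen for the illustration, not values prescribed by the paper.

    # Illustrative only: a single-factor, consequence-based allocation of test
    # rigour. Category names and rigour levels are placeholders for this sketch.
    TEST_RIGOUR_BY_CONSEQUENCE = {
        "negligible": "basic functional tests",
        "marginal": "functional tests plus boundary-value analysis",
        "critical": "add stress, negative and regression testing",
        "catastrophic": "full structural coverage and independent review of results",
    }


    def plan_testing(services):
        """Assign test rigour to each service from the consequence of its failure.

        `services` maps a service name to the estimated consequence severity of
        its failure; the estimate is assumed to rest on adequately researched
        information, as stressed earlier in the paper.
        """
        return {name: TEST_RIGOUR_BY_CONSEQUENCE[severity]
                for name, severity in services.items()}


    plan = plan_testing({
        "produce invoice": "critical",
        "print monthly summary": "marginal",
    })
    for service, rigour in plan.items():
        print(f"{service}: {rigour}")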
Many testers believe that they already carry out this type of risk-based test planning, but they
are usually undone because they fail to take the trouble to collect the information necessary
for making sensible consequence estimates. Confidence in an estimate can only be justified if
it is supported by an adequacy of relevant information. And, as pointed out earlier, deep
investigation of data almost always disproves preconceptions.
Second, since estimates of the quality of software may be made, by observation and historic
information, prior to testing, a single-factor analysis may be based on quality as a surrogate
for probability. Confidence in quality estimates cannot be as high as confidence in consequence
estimates, and such estimates cannot be made until the software has been produced and inspected.
However, quality estimates may be used as a tactical technique late in a project – for
example, to provide a basis for planning the reduction of testing because of the lack of time.
Third, even if test planning has been based on a consequence-based analysis, a quality-based
analysis can later be used to offer confidence in decisions on reducing or re-prioritising
testing as time runs out. In this case, one might refer to ‘two-factor’ analysis.
The scope of this paper allows only this brief introduction to the possibilities of risk-based
testing, but the author has provided fuller details elsewhere (Redmill 2005).
SUMMARY
Traditionally, the culture in software development projects is to focus wholly on the
production of what has been specified. The result is that risks that might have been foreseen –
and mitigated – unexpectedly give rise to problems that throw projects off course and lead to
defective – and, in some cases, unusable – products. Modern engineering thinking, on the
other hand – particularly in domains such as safety and security – is to take a predict-and-prevent approach by taking account of risks early. This entails carrying out risk analyses and
basing risk-management activities on them, so as to reduce the likelihood of undesirable
events later. This paper proposes the introduction of a risk-based approach into software
development and project management.
The paper outlines what such an approach implies and goes on to explain ways in which it
can be implemented. It describes key aspects of two risk-analysis techniques employed in
safety engineering, shows that they can in fact be used in all situations, and briefly
demonstrates their application in software development projects. It then shows how risk
analysis can be used at the various stages of projects, in particular at the Objectives,
Specification, and Design stages. Carrying out risk analyses at these points provides options
for developers, from strategic management at the Objectives stage to design decisions later
on. It offers the opportunity to make changes in response to the acquired knowledge of risks:
to cancel a project, or parts of it, to change requirements and adjust the specification, and to
build risk-reduction features into the design. Further, by the early identification of critical
system features, it also presents the opportunity for early planning of their testing.
Further, the paper offers an overview of ways of carrying out risk-based testing, by using
knowledge of risks to inform test planning and execution.
This paper is not a final manifesto, or a textbook. It only introduces the subject of risk-based
thinking. However, it is felt that the principles proposed could bring improvements to
software development and project management and take these disciplines a step closer to
embracing an engineering approach and culture.
REFERENCES
London Ambulance Service (1993). Report of the Inquiry into the London Ambulance
Service. South West Thames Regional Health Authority, UK
Redmill F, Chudleigh M and Catmur J (1999). System Safety: HAZOP and Software HAZOP.
John Wiley & Sons, Chichester, UK
Redmill F (2005). Theory and Practice of Risk-based Testing. Software Testing, Verification
and Reliability, Vol. 15, No. 1
Vesely W E, Goldberg F F, Roberts N H and Haasl D F (1981). Fault Tree Handbook. U.S.
Nuclear Regulatory Commission, Washington DC, USA
MODEL-BASED SAFETY CASES
USING THE HiVe WRITER
Tony Cant, Jim McCarthy, Brendan Mahony and Kylie Williams
Command, Control, Communications and Intelligence Division
Defence Science and Technology Organisation
PO Box 1500, Edinburgh, South Australia 5111
email: [email protected]
Abstract
A safety case results from a rigorous safety engineering process. It involves reasoned
arguments, based on evidence, for the safety of a given system. The DEF(AUST)5679
standard provides detailed requirements and guidance for the development of a safety case.
DEF(AUST)5679 safety cases involve a number of highly inter-related documents; tool
support is needed to manage the process and to maintain consistency in the face of change.
The HiVe Writer is a tool that supports structured technical documentation via a centrally-managed datastore so that any documents created within the tool are constrained to be
consistent with this datastore and therefore with each other. This paper discusses how the
HiVe Writer can be used to support safety case development. We consider the safety case
for a fictitious Phased Array Radar Target Illuminator (PARTI) system and show how the
HiVe Writer can support hazard analysis for the PARTI system.
1 INTRODUCTION
Safety critical systems are those with the potential to cause death or injury as a result of accidents arising from unintended system behaviour. For such systems an effective safety engineering process (along with choice of the appropriate safety standards) must be established at
an early stage of the acquisition lifecycle, and reflected in contract documents. This process
culminates in a safety case: i.e. reasoned arguments, based on evidence, for the safety of the
system. Safety cases are important because they not only help to provide the assurance of safety
that is required by technical regulators, but can also – by providing increased understanding of
safety issues early in the project lifecycle – help avert substantial costs at later stages in the
project lifecycle.
There are well-known methods and tools to support the development of safety cases. For example, the ASCAD (Adelard 2009) tool makes use of the “claims, arguments, evidence” (CAE)
approach (Emmet & Cleland 2002), as do the hypertext systems AAA (Schuler & Smith 1992)
and Aquanet (Marshall, Halasz, Rogers & Janssen 1991). Another approach is called Goal-Structured Notation (Wilson, McDermid, Pygott & Tombs 1996). GSN represents elements
(i.e. requirements, evidence, argument and context) and dependencies of the safety case by
means of a graphical notation. Of the tools available today the most widely used is the Adelard
Safety Case Environment (ASCE), which supports GSN as well as CAE (Adelard 2009).
A safety case will usually involve a complex suite of documents built on various assurance
artifacts and other forms of evidence. The methods and tools mentioned above are valuable, but
they do not fully address the fact that the safety case must be guaranteed to be consistent and
robust in the face of changes (Kelly & McDermid 1999): such changes may be trivial ones that
need to be tracked throughout the documentation, or they may be major changes that impact the
whole safety argument.
One approach to the consistency problem is to develop a glossary of technical terms and
to ensure that such terms are used systematically throughout the safety case documentation.
We speak of structured text to refer to documents with such embedded technical terms. The
aggregation of technical terms used in the documentation forms a light-weight model for the
safety case. Clearly there is considerable scope to provide tool support for developing such
model-based safety cases. The HiVe (Hierarchical Verification Environment), currently under
development at DSTO, is a general purpose tool for producing structured system documentation. It allows a user to develop the necessary technical glossary and ensures that references to
technical terms are managed consistently throughout. It has a number of potential applications;
in this paper we claim that the application of the HiVe to the development of safety cases offers
many advantages.
Another advantage of the HiVe’s implementation of structured text is the ability to enforce
workflows through the introduction of form-like constructions. By structuring the safety case
documentation in an appropriate way, it is possible to ensure structural compliance with the
requirements of the standard. The HiVe may also be programmed to perform simple correctness
checks on important details of the safety case or even to calculate the correct results of mandated
analyses. In this paper, we describe a specialization of HiVe that supports the development of
DEF(AUST)5679 compliant safety cases.
In Section 2.1 we give an overview of DEF(AUST)5679, focusing on the requirements for
hazard analysis. Section 2.2 summarises the HiVe Writer. In Section 3 we introduce the concept
of model-based safety case, and discuss issues for tool support. Section 4 presents an overview
of the hazard analysis for a realistic Defence case study. In Section 5 we show how this case
study is captured within the HiVe Writer. Section 6 presents some conclusions and suggestions
for further work.
2 BACKGROUND

2.1 DEF(AUST)5679
The recently published DEF(AUST)5679 Issue 2 (DSTO 2009b) provides detailed requirements
and guidance for the development of safety cases. A safety case involves the following steps:
• An analysis of the danger that is potentially presented by the system. This involves an
  assessment of the system hazards, along with the ways that accidents could occur, as well
  as their severity. This is called hazard analysis.
• A system design that provides safety features (internal mitigations), i.e. a safety architecture.
• Arguments that system components have been built in such a way that provides assurance
  of safety, called design assurance.
• An overall narrative (or high-level argument) that is convincing to a third-party and pulls
  all the above together.
The safety case must be acceptable to an auditor (whose role is to monitor the system engineering process and ensure that the procedural aspects of standards are followed); to an evaluator,
whose role is to provide a thorough independent and objective review of the validity of the technical arguments that critical requirements are met by the system; and to a regulator, whose role
is to set the policy framework within which decisions about system safety must be made (and
who also may have the role of certifying that the system is sufficiently safe to be accepted).
For the purposes of this paper we shall describe in more detail the hazard analysis stage of
the safety case. The aim of hazard analysis is to describe the system and its operational context,
and to identify all possible accident scenarios (and their associated danger levels) that may be caused
by a combination of the states of the system, environmental conditions and external events.
An accident is an external event that could directly lead to death or injury. The severity
of an accident is a measure of the degree of its seriousness in terms of the extent of injury or
death resulting from the accident. System hazards are top-level states or events of the system
from which an accident, arising from a further chain of events external to the system, could
conceivably result. Accident scenarios describe a causally related collection of system hazards
and coeffectors that lead to a defined accident.
An external mitigation is an external factor that serves to suppress or reduce the occurrence
of coeffectors within a given accident scenario. The strength of external mitigation must be
assessed as one of low, medium or high. The danger level of an accident scenario is a function
of the resulting accident severity and the assigned strength of external mitigation. There are six
danger levels, labelled from D1 to D6.
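The general shape of such an assignment can be sketched in a few lines of Python. The default levels and reduction steps used here are placeholders only and are NOT the values mandated by Table 8.2 of DEF(AUST)5679; the sketch shows only how severity and mitigation strength combine.

    # Illustrative only: combining accident severity with the strength of
    # external mitigation to obtain a danger level. The defaults and the
    # reduction steps are placeholders, not the standard's Table 8.2 values.
    DEFAULT_DANGER_LEVEL = {        # hypothetical defaults per accident severity
        "Catastrophic": 6,
        "Fatal": 5,
        "Severe": 4,
        "Minor": 3,
    }
    REDUCTION = {"low": 1, "medium": 2, "high": 3}   # hypothetical reductions


    def danger_level(severity: str, external_mitigation: str = None) -> str:
        """Assign a danger level D1..D6 to an accident scenario."""
        level = DEFAULT_DANGER_LEVEL[severity]
        if external_mitigation is not None:
            level = max(1, level - REDUCTION[external_mitigation])
        return f"D{level}"


    print(danger_level("Fatal"))             # default level, no mitigation present
    print(danger_level("Fatal", "medium"))   # reduced by the assigned strength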
Some of the key requirements in DEF(AUST)5679 that are relevant for hazard analysis
are reproduced in Figure 1. They have a special format: they are in bold face, with a unique
paragraph number (shown here in square brackets), and usually reference one or more technical
terms (shown in blue).
2.2 The HiVe Writer
The HiVe Writer represents a novel approach to the creation and management of complex suites
of technical documentation. It blends both modelling and documentation through a synthesis
of concepts from model-based design, literate programming, and the semantic web. It supports
a range of documentation styles from natural language descriptions through to fully formal
mathematical models. The Writer’s free text mode allows the author maximum expressive
freedom. The Writer’s syntax-directed editing mode ensures compliance with documentation
and notational standards. The two modes may be mixed freely in a single document. More
details on the HiVe may be found in (Cant, Long, McCarthy, Mahony & Williams 2008); in this
section we give a brief overview.
In the Writer, the modelling activity proceeds by interspersing commands through a normative design document (NDD), which serves as a “script” that builds up the model-based design.
These commands serve to enter data into a centrally managed datastore. The datastore records
the fundamental technical terms and other building blocks of our model — called formal elements — as well as properties satisfied by these formal elements. Commands may also be
used to initiate interactions with external analysis tools and to record results in the datastore.
Elements from the datastore may be freely referred to in any document. All such references
are created (and are guaranteed to remain) consistent with the datastore, greatly simplifying the
management of change propagation and consistency across complex suites of documentation.
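To make the datastore idea concrete, a much simplified sketch is given below; the class and method names are hypothetical and are not the HiVe Writer's actual API:

    # Much simplified sketch of a datastore of formal elements with consistent references.
    # Class and method names are hypothetical, not the HiVe Writer's API.

    class Datastore:
        def __init__(self):
            self.elements = {}                 # formal elements keyed by name

        def define(self, name, description):
            """Record a formal element (e.g. a technical term) in the datastore."""
            self.elements[name] = description

        def reference(self, name):
            """Resolve a reference; undefined terms are rejected, so a document
            cannot drift out of step with the datastore."""
            if name not in self.elements:
                raise KeyError(f"'{name}' is not a defined formal element")
            return self.elements[name]

    store = Datastore()
    store.define("accident", "An external event that could directly lead to death or injury.")
    store.reference("accident")     # resolves consistently wherever it is cited
    # store.reference("mishap")     # would raise, flagging an undefined term

In the real tool, of course, this consistency is maintained across whole suites of documents rather than within a single script.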
The Writer also provides a highly sophisticated rendering layer that allows the user to
present information from the datastore in numerous styles. This allows the user to target the
presentation style to the needs of the intended audience and to create different presentations
of the same information for different audiences. Designers are encouraged to write design,
explanatory and technical documentation in parallel, with complete consistency and targeted
presentation styles, thereby helping them to produce documents that convince others of the
correctness of the design.
The capabilities of the Writer can be extended via a powerful plug-in facility.
[8.6.2] The Hazard Analysis Report must provide a list of all Accidents arising in Accident Scenarios, including an estimate of the Accident Severity of each Accident as determined by Table 8.1.

Accident Severity | Definition
Catastrophic      | Multiple Loss of Life
Fatal             | Loss of Life
Severe            | Severe Injury
Minor             | Minor Injury

Accident Severities (Table 8.1 in DEF(AUST)5679).

[8.9.4] The Supplier shall assign each Accident Scenario a Danger Level in accordance with the following conditions.
• For each Accident Scenario a default Danger Level shall be assigned based on the Accident Severity using Table 8.2.
• If no External Mitigations are present in the Accident Scenario, the Danger Level shall remain at the default value for that severity.
• If, for a given Accident Scenario, a strength of External Mitigation can be assigned, then the Danger Level shall be reduced from its default value according to Table 8.2.

[8.9.6] Danger Level assignments of greater than D4 must be explicitly justified in the Hazard Analysis Report, showing cause why stronger External Mitigation or Damage Limitation factors could not be introduced into the Operational Context.

Accident Severity | Default Level | External Mitigation: Low | Medium | High
Catastrophic      | D6            | D6                       | D5     | D4
Fatal             | D5            | D5                       | D4     | D3
Severe            | D4            | D4                       | D3     | D2
Minor             | D3            | D3                       | D2     | D1

Danger Levels (Table 8.2 in DEF(AUST)5679).

[8.10.2] The Supplier shall assign to the System a System Danger Level that is the maximum over all the Danger Levels assigned for Accident Scenarios.

Figure 1: Requirements for hazard analysis
In particular, plug-ins can be developed to support specific business processes and documentation standards. In Section 5, we describe a safety case tool that we have developed as a HiVe plug-in.
3 MODEL-BASED SAFETY CASES
We have already noted that there are a number of potential benefits to be gained from adopting
a light-weight modelling approach in safety case development. Here we discuss some of the
experiments we have carried out in applying HiVe concepts in the DEF(AUST)5679 context.
3.1 Observations on compliance
In developing a safety case against a defined standard — in our case DEF(AUST)5679 — the
matter of compliance assurance comes to the fore.
There are various levels of compliance.
As a trivial example, DEF(AUST)5679 requires the use of four accident severities; if the
safety case actually makes use of five severities, then it will not be compliant. This kind of
compliance is easy to check and should (ideally) be enforced at the time that the safety case is
developed, so that such mistakes are impossible. We call this shallow (or surface) compliance.
At the other end of the spectrum, for example, would be the case where the reasoning used
to justify the accident severities is unconvincing. This would need to be identified by a skilled
reviewer (or evaluator) and is an example of deep (non-)compliance with the standard. It can’t
be automatically enforced during safety case development. Nevertheless, the safety case should
be built and laid out in such a way as to facilitate the checking of all forms of compliance.
We have identified a number of ways that the HiVe can enforce shallow compliance and
support deep compliance.
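As a toy illustration of the distinction, shallow compliance can often be made structurally impossible to violate. The sketch below (ours, not part of the standard or the HiVe) restricts accident severities to the four values of Table 8.1, so the five-severities mistake mentioned above simply cannot be authored:

    from enum import Enum

    # Illustrative only: restricting severities to the four Table 8.1 values
    # makes this particular form of shallow non-compliance unrepresentable.
    class AccidentSeverity(Enum):
        CATASTROPHIC = "Multiple Loss of Life"
        FATAL = "Loss of Life"
        SEVERE = "Severe Injury"
        MINOR = "Minor Injury"

    def declare_accident(identifier, description, severity):
        """Accept only one of the four permitted severities."""
        if not isinstance(severity, AccidentSeverity):
            raise ValueError("severity must be one of the four Table 8.1 values")
        return {"id": identifier, "description": description, "severity": severity}

    declare_accident("A5", "Helicopter crash due to RF interference",
                     AccidentSeverity.CATASTROPHIC)    # accepted
    # declare_accident("A5", "...", "Moderate") would be rejected

Deep compliance, by contrast, still has to be argued for and judged by a skilled reviewer.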
3.2 HiVe DEF(AUST)5679 experiments
The development of the standard itself is a complex endeavour, complicated in the case of
DEF(AUST)5679 Issue 2 by the parallel development of guidance papers and a significant
worked case study (DSTO 2009a). Our experiments in support of this process are described
elsewhere (Cant et al. 2008). In short, we developed an extensive glossary of formal elements
(such as accident, hazard analysis, severity etc) to manage consistent use of terminology across
this large body of documentation and also a light-weight model of the various actors and processes treated in the standard. Although modest in scope, these modelling activities gave us
useful tools for managing consistency and completeness across this large document suite. For
example, the tools ensured that each requirement had a clearly defined responsible agent and automatically collected a table of the responsibilities of each agent in an appendix to the standard.
We are confident that this tool support led to a significantly higher quality end product.
Encouraged by the results of this experiment, we began to consider the development of a
HiVe plug-in for supporting the development of actual safety cases. Most obviously, safety
case authors would benefit from access to the DEF(AUST)5679 technical glossary as well as
an ability to define and manage a technical glossary specific to the system under consideration.
The basic HiVe Writer already provides such capabilities, ensuring that all of the technical terms
(as well as the requirements) of DEF(AUST)5679 can be directly referenced in the safety case
under construction, with the HiVe maintaining consistency throughout. More ambitiously, we
were interested in developing light-weight models for the necessary workflows and deliverables
of DEF(AUST)5679.
Figure 2: PARTI System and Environment (diagram showing the own ship, its phased array radar beams and laser targetting beams, the ESSM, threats, and prohibited areas)
Building on the basic datastore of DEF(AUST)5679 requirements we introduced specific
commands for making compliance claims against the standard. For example, one command describes a new accident scenario and adds it to the list required under paragraph 8.6.4 of DEF(AUST)5679 (see Figure 1). Using structured text techniques, this command can ensure that
each accident scenario has all of its required attributes, such as associated accident and danger
level, properly defined. It can even automate any calculations required by DEF(AUST)5679,
such as determining the danger level as modified by external mitigation according to Table 8.2
of DEF(AUST)5679 (see Figure 1).
Properly implemented, such a collection of commands can ensure a very high degree of
shallow compliance with the standard. They also provide useful guidance to the evaluator in
determining deep compliance by directing attention to the critical compliance claims.
Once the compliance claims of the safety case are entered into the HiVe datastore, it becomes possible to make automated consistency and completeness checks. One example is identifying an accident that has been declared but does not appear in any accident scenario.
A more sophisticated check is to ensure all accident scenarios with danger levels above D4 are
properly justified in accordance with paragraph 8.9.6 of DEF(AUST)5679 (see Figure 1).
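A rough sketch of what such automated checks could look like is given below; the data layout and function names are ours, not the HiVe's internal representation, and the example data are taken from Tables 1b and 1c:

    # Illustrative consistency and completeness checks over safety-case data.
    # Data layout and names are hypothetical, not the HiVe's internal format.

    def unreferenced_accidents(accidents, scenarios):
        """Accidents declared but appearing in no accident scenario."""
        used = {s["accident"] for s in scenarios}
        return [a for a in accidents if a not in used]

    def unjustified_high_danger(scenarios):
        """Scenarios above D4 lacking the explicit justification required by 8.9.6."""
        return [s["id"] for s in scenarios
                if int(s["assigned_dl"].lstrip("D")) > 4 and not s.get("justification")]

    accidents = {"A5": "Helicopter crash due to RF interference",
                 "A8": "Personnel injuries caused by Electromagnetic Radiation",
                 "A10": "Laser causes eye damage to personnel",
                 "A12": "Personnel deaths caused by Electromagnetic Radiation"}
    scenarios = [
        {"id": "AS A 2", "accident": "A5", "assigned_dl": "D4"},
        {"id": "AS B 1", "accident": "A8", "assigned_dl": "D5"},   # needs justification
        {"id": "AS B 2", "accident": "A10", "assigned_dl": "D2"},
    ]
    unreferenced_accidents(accidents, scenarios)   # ['A12']
    unjustified_high_danger(scenarios)             # ['AS B 1']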
The facilities described above have been integrated into a prototype HiVe plug-in for DEF(AUST)5679 safety case development. Currently, the plug-in focuses on the hazard analysis
phase, but there are plans to extend it to a complete support tool for DEF(AUST)5679, possibly
even including support for formal specification and verification.
4 THE PARTI SYSTEM
The Phased Array Radar Target Illumination (PARTI) system is a fictitious ship system that
scans, detects, discriminates, selects and illuminates airborne threats. The illumination of targets provides a target fix for the existing Evolved Sea Sparrow Missile (ESSM) system. The
PARTI system incorporates the phased array radar (PAR) and associated control functionality; a
sub-system for detecting, discriminating and selecting airborne threats; an operator; and a laser
designator and associated control functionality, for target illumination. The PARTI system and
its environment are shown in Figure 2.
The PARTI system was used as a case study for DEF(AUST)5679 (DSTO 2009b), and the detailed results are presented in DEF(AUST)10679 (DSTO 2009a) (along with other material giving guidance on how to apply the standard). Here we are just interested in the hazard analysis for the PARTI system. The results of this analysis are summarized in Tables 1a–1c.
Id    | System Hazard
HAZ A | The PAR irradiates a prohibited area.
HAZ B | The laser illuminates a non-target object.
HAZ C | The PARTI sends an erroneous communication.
HAZ D | The laser fails to maintain target illumination.
HAZ E | The PARTI operates without command authorisation.

(a) System hazards

Id  | Accident                                                 | Severity     | Default Danger Level
A1  | Missile or other ordnance causes damage to a non-target | Catastrophic | D6
A5  | Helicopter crash due to RF interference                 | Catastrophic | D6
A7  | Laser causes damage to a non-target                     | Catastrophic | D6
A8  | Personnel injuries caused by Electromagnetic Radiation  | Severe       | D4
A9  | Collision and Grounding of Ship                         | Catastrophic | D6
A10 | Laser causes eye damage to personnel                    | Severe       | D4
A11 | Laser kills a person                                    | Fatal        | D5
A12 | Personnel deaths caused by Electromagnetic Radiation    | Catastrophic | D6

(b) Accidents

Accident Scenario | Hazard | Accident | Default DL | Mit. Strength | Assigned DL
AS A 2            | HAZ A  | A5       | D6         | High          | D4
AS B 1            | HAZ B  | A8       | D6         | Medium        | D5
AS B 2            | HAZ B  | A10      | D4         | High          | D2

(c) Typical accident scenarios

Table 1: PARTI hazard analysis results
Recalling Section 2.1, Table 1a of system hazards and Table 1b of accidents are self-explanatory.
For reasons of space, Table 1c only includes three representative accident scenarios – those
used in Section 5. Similarly, we do not go into the details of the accident scenarios. For example,
scenario AS A 2 involves (DSTO 2009a, PARTI-HAR):
the PAR irradiat[ing] a prohibited area (hazard HAZ A) while a helicopter is close
to the ship, causing the aircraft to malfunction, leading to helicopter crash due to
RF interference (accident A5) with multiple fatalities (and so default danger level D6).
The table records the calculated danger level accounting for mitigation by external coeffectors. In the example of AS A 2, the need for proximity and aircraft malfunction are two independent coeffectors; thus the danger level is lowered to D4.
The interested reader can find full details of the PARTI case study in DEF(AUST)10679 (DSTO 2009a). One such detail is that among the scenarios not shown are those involving HAZ E and HAZ F, in which the PARTI emits arbitrarily lethal radar or laser radiation respectively. Such hazards can clearly lead to catastrophic accidents with little or no potential for external mitigation. Thus, as defined in Figure 1 (Clause 8.10.2), the system danger level is D6. Fortunately, it is fairly easy to design the PARTI (by limiting available power output) so as to eliminate these hazards.
Figure 3: The NDD for the PARTI system
5 THE PARTI HAZARD ANALYSIS IN THE HiVe
In this section we demonstrate the HiVe Writer applied to the PARTI hazard analysis, making
use of a prototype plug-in – here called the hazard analysis plug-in – to provide commands
specific to hazard analysis.
5.1 Generic interface
Figure 3 shows part of the HiVe Writer interface (Cant et al. 2008), as captured in a screenshot of a session using the hazard analysis plug-in. The top left hand window is the Project
Navigator: this provides an easy mechanism for moving between the different documents in
different open projects. Underneath the navigator is a formatting palette, which can be used to
present information according to a given user-defined style — this is very useful for presenting
the same information to different audiences. The main editor window shows part of the NDD
for this project. The NDD provides a literate script that builds up the “model” on which the
Figure 4: The datastore after processing the first accident scenario
safety case is built. The yellow background on a command indicates that it has been processed
by the Writer. Direct interaction with the NDD is through a toolbar (top right of the screen) that
allows the user to control the processing of commands in the script.
5.2 Hazard analysis support interface
The content of the NDD in Figure 3 reflects the domain-specific nature of the hazard analysis
plug-in. In particular, after declaring PARTI to be the name of the project as well as the system,
we have the command “begin hazard analysis report for PARTI” which serves to set the context
for further command invocations in the plug-in.
The next command constructs a module that covers the operational context and system
description (if these are not present — as DEF(AUST)5679 requires — the hazard analysis
plug-in will complain after the NDD is processed, thus enforcing shallow compliance in this
instance). The use of modules ensures that the definitions all lie within their own namespace in
the project’s datastore. This is immediately used to good effect in the introduction of hazards
and accidents, each defined in their own module.
As yet unprocessed, in the next module introduced in Figure 3, are the definitions of the
three accidents (along with their severities) from Table 1b. The snapshot shows just the first of
these accident scenarios (AS A 2): it is introduced with some descriptive text, followed by two
coeffectors. After we have processed down to the end of this block, we find that the datastore
not only records this definition, but also computes automatically the default and final danger
levels (in accordance with Table 8.2 of DEF(AUST)5679). This is shown in Figure 4.
If we further process the next two accident scenarios and then try to end the hazard analysis,
the HiVe will not permit this, because (according to DEF(AUST)5679 (Clause 8.9.6)), explicit
justification is needed for any danger level assignments greater than D4. Now we make use of the HiVe's syntax-directed editing. Using a palette of commands we can enter the skeleton
for the command giving explicit justification; we can then complete this using a second palette
to enter the name of the second scenario. The NDD can now be completely processed (see
Figure 5, which also shows the two palettes).
6 CONCLUSION AND PROPOSED FURTHER WORK
In this paper we have discussed the HiVe tool and how it can be used to address the problem
of constructing convincing and trustworthy safety cases. We demonstrated the use of the HiVe
on the hazard analysis phase of a safety case for a realistic case study. Work is now focused
on extending the tool to cover the safety architecture and design assurance phases of the same
example safety case. The architecture verification for the PARTI system has already been explored using a theorem prover (Mahony & Cant 2008); it will be instructive to capture this work
within the HiVe tool as well.

Figure 5: The processed NDD with two palettes
Constructing and managing safety cases present a huge challenge. It will be many years
before there is general agreement on how a safety case should be structured; it will also take
some time for tools to be available that are easy to use and achieve the desired properties that a
safety case should have.
Acknowledgments.
The authors wish to thank the Defence Materiel Organisation for sponsorship and funding of the HiVe Writer prototype.
References
Adelard (2009), ‘The Adelard Safety Case Development (ASCAD) manual’.
URL: http://www.adelard.com/web/hnav/resources/ascad/index.html
Cant, T., Long, B., McCarthy, J., Mahony, B. & Williams, K. (2008), The HiVe writer, in
‘Systems Software Verification’, Elsevier.
DSTO (2009a), DEF(AUST)10679/Issue 1, Guidance Material For DEF(AUST)5679/Issue 2,
Australian Government, Department of Defence.
DSTO (2009b), DEF(AUST)5679/Issue 2: Safety Engineering for Defence Systems, Australian
Government, Department of Defence.
Emmet, L. & Cleland, G. (2002), Graphical notations, narratives and persuasion: a pliant
systems approach to hypertext tool design, in ‘HYPERTEXT ’02: Proceedings of the
thirteenth ACM conference on Hypertext and hypermedia’, ACM, New York, NY, USA,
pp. 55–64.
Kelly, T. P. & McDermid, J. A. (1999), A Systematic Approach to Safety Case Maintenance, in
‘SAFECOMP’, pp. 13–26.
URL: citeseer.ist.psu.edu/kelly01systematic.html
Mahony, B. & Cant, T. (2008), A Lightweight Approach to Formal Safety Architecture Assurance: The PARTI Case Study, in ‘SCS 2008: Proceedings of the Thirteenth Australian Conference on Safety-Related Programmable Systems’, Conferences in Research
and Practice in IT., pp. 37–48.
Marshall, C., Halasz, F., Rogers, R. & Janssen, W. (1991), Aquanet: a hypertext tool to hold
your knowledge in place, in ‘HYPERTEXT ’91: Proceedings of the third annual ACM
conference on Hypertext’, ACM, New York, NY, USA, pp. 261–275.
Schuler, W. & Smith, J. B. (1992), Author’s Argumentation Assistant (AAA): a hypertext-based
authoring tool for argumentative texts, in ‘Hypertext: concepts, systems and applications’,
Cambridge University Press, New York, NY, USA, pp. 137–151.
Wilson, S. P., McDermid, J. A., Pygott, C. H. & Tombs, D. J. (1996), Assessing complex
computer based systems using the goal structuring notation, in ‘ICECCS ’96: Proceedings
of the 2nd IEEE International Conference on Engineering of Complex Computer Systems
(ICECCS ’96)’, IEEE Computer Society, Washington, DC, USA, p. 498.
BIOGRAPHY
Tony Cant currently leads the High Assurance Systems (HAS) Cell in DSTO’s Command, Control, Communications and Intelligence Division. His work focuses on the development of tools
and techniques for providing assurance that critical systems will meet their requirements. Tony
has also led the development of the newly published Defence Standard DEF(AUST)5679 Issue
2, entitled “Safety Engineering for Defence Systems”.
Tony obtained a BSc(Hons) in 1974 and PhD in 1979 from the University of Adelaide, as
well as a Grad Dip in Computer Science from the Australian National University (ANU) in
1991. He held research positions in mathematical physics at the University of St Andrews, Tel
Aviv University, the University of Queensland and the ANU. He also worked in the Commonwealth Department of Industry, Technology and Commerce in science policy before joining
DSTO in 1990.
THE APPLICATION OF HAZARD RISK ASSESSMENT IN
DEFENCE SAFETY STANDARDS
C.B.H. Edwards (1), M. Westcott (2), N. Fulton (3)

(1) AMW Pty Ltd, PO Box 468, Queanbeyan, NSW 2620. Email: [email protected]
(2) CSIRO Mathematical and Information Sciences, GPO Box 664, Canberra, ACT 2601. Email: [email protected]
(3) CSIRO Mathematical and Information Sciences, GPO Box 664, Canberra, ACT 2601. Email: [email protected]
Abstract
Hazard Risk Assessment (HRA) is a special case of Probabilistic Risk Assessment (PRA) and provides the theoretical
basis for a number of safety standards. Measurement theory suggests that implicit in this basis are assumptions that
require careful consideration if erroneous conclusions about system safety are to be avoided. These assumptions are
discussed and an extension of the HRA process is proposed. The methodology of this extension is exemplified in recent
work by Jarrett and Lin. Further development of safety standards and the possibility of achieving a harmonisation of the
different approaches to assuring system safety are suggested.
Keywords: Probabilistic Risk Assessment, Hazard Risk Assessment, Safety Standards, Safety Evaluation, Hazard
Analysis
2 Introduction
The use of Probabilistic Risk Assessment (PRA) is widespread in industry and government. Examples include
environmental impact studies, food and drug management, border protection, bio-security and the insurance industry.
In many organisations the use of PRA has become institutionalised, being applied in a prescriptive manner with little
questioning about the assumptions implicit in the method. In the safety domain the risk of hazard realisation leading to
an accident is of concern. This is known as Hazard Risk Assessment (HRA) and is a particular application of PRA.
This paper examines the application of HRA in safety standards, which are used to guide the assessment of the safety
of defence systems.
In recent years there has been a growing body of literature expressing concern that the application of PRA can lead to
false conclusions about the nature of perceived hazards. For example, Hessami (1999) provides an analysis of the
limitations of PRA and (inter alia) notes:
The Risk Matrices, once regarded the state-of-the-art in pseudo quantified assessment are essentially outmoded and
inapt for today's complex systems and standards of best practice. They are a limited tool which cannot be universally
applied in replacement for systematic assessment and it is not possible to compensate for their structural defects and
enhance their credibility through customization of their numerical axes as advocated by the Standard (IEC). These also
encourage an incremental as opposed to the holistic view of risks through arbitrary allocation of tolerability bands. In
short risk matrices are best suited to the ranking of hazards with a view to prioritize the assessment effort. A systems
framework is required to provide a suitable and sufficient environment for qualitative and quantitative assessment of
risks within a holistic approach to safety.
A so called “precautionary principle” has evolved over a number of years and has been proposed as an alternative to the
use of PRA. O’Brien (2000) provides a guide to the application of this principle. When describing the precautionary
principle Wikipedia notes:
This is a moral and political principle which states that if an action or policy might cause severe or irreversible harm
to the public or to the environment, in the absence of a scientific consensus that harm would not ensue, the burden of
proof falls on those who would advocate taking the action, [Raffensperger C. & J. Tickner (1999)]. The principle
implies that there is a responsibility to intervene and protect the public from exposure to harm where scientific
investigation discovers a plausible risk in the course of having screened for other suspected causes. The protections
that mitigate suspected risks can be relaxed only if further scientific findings emerge that more robustly support an
alternative explanation. In some legal systems, as in the law of the European Union, the precautionary principle is also
a general and compulsory principle of law, [Recuerda. (2006)].
Given the widespread use of HRA within the safety community it is important that the limitations of this approach to
system safety be well understood and that some form of a precautionary principle is woven into the further
development of safety standards.
3 Hazard Risk Assessment
3.1 HRA Process
HRA aims to identify the risk of hazards and to guide the application of resources to minimize assessed risk. This
concept has been applied to a wide range of situations, ranging from relatively simple OH&S problems, such as office
safety, to the acquisition of complex weapons systems. HRA attempts to quantify risk through the use of the Hazard
Risk Index (HRI) measure. After the derivation of the HRI for a particular hazard, an assessment of the application of
resources required to mitigate or remove the risk is made. Often this assessment is based on the As Low as Reasonably
Practicable (ALARP) principle. Notably, the ALARP method allows for a statement of Residual Risk, i.e. the risk
remaining after the completion of the safety process. Further discussion about ALARP can (inter alia) be found in Ale
(2005).
3.2 Derivation of the HRI
The derivation of an HRI for a particular hazard, i.e. a hazard derived from the HRA process, is typically based on a
tabulation of ‘Likelihood’ versus ‘Consequence’ as shown in Table 1. The acceptability of the HRI is then determined
by a grouping of derived HRI. An example is shown in Table 2.
Likelihood \ Consequence | Catastrophic | Critical | Major | Minor
Frequent                 | 1            | 3        | 7     | 13
Probable                 | 2            | 5        | 9     | 16
Occasional               | 4            | 6        | 11    | 18
Remote                   | 8            | 10       | 14    | 19
Improbable               | 12           | 15       | 17    | 20

Table 1. Hazard Risk Index
HRI      | Risk Level | Risk Acceptability
1 to 5   | Extreme    | Intolerable
6 to 9   | High       | Tolerable with continuous review
10 to 17 | Medium     | Tolerable with periodic review
18 to 20 | Low        | Acceptable with periodic review

Table 2. Acceptability of Risk
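As a small illustration of how Tables 1 and 2 are typically combined, the HRI and its acceptability can be read off mechanically from a (Likelihood, Consequence) pair. The snippet below simply transcribes the two example tables; it is not prescribed by any standard, and the names are ours:

    # Transcription of the example Tables 1 and 2; not prescribed by any standard.

    LIKELIHOODS = ["Frequent", "Probable", "Occasional", "Remote", "Improbable"]
    CONSEQUENCES = ["Catastrophic", "Critical", "Major", "Minor"]

    HRI_TABLE = [           # rows follow LIKELIHOODS, columns follow CONSEQUENCES
        [1, 3, 7, 13],
        [2, 5, 9, 16],
        [4, 6, 11, 18],
        [8, 10, 14, 19],
        [12, 15, 17, 20],
    ]

    def hri(likelihood, consequence):
        return HRI_TABLE[LIKELIHOODS.index(likelihood)][CONSEQUENCES.index(consequence)]

    def acceptability(index):
        if index <= 5:
            return "Extreme: Intolerable"
        if index <= 9:
            return "High: Tolerable with continuous review"
        if index <= 17:
            return "Medium: Tolerable with periodic review"
        return "Low: Acceptable with periodic review"

    hri("Occasional", "Catastrophic")                    # 4
    acceptability(hri("Occasional", "Catastrophic"))     # "Extreme: Intolerable"

The mechanical nature of this lookup is precisely what Section 3.3 questions: the numbers behave as if they were more than ordinal ranks.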
3.3 HRI Measurement Difficulties
The Likelihood and Consequence scales on Table 1 are ordinal measures. Thus within a particular row or column of
Table 1 entries are ranked and comparison of those rankings is valid. For example, it is reasonable to assert that within
the Critical Consequence column an Occasional likelihood is a worse outcome than a Remote likelihood. Comparisons
of rankings from different rows or different columns are more problematic. To assert that Occasional but Catastrophic
(HRI=4) is equivalent to Frequent and Critical (HRI=3) is difficult to justify, particularly in the absence of a quantified
hazard consequence that is consistent with the hazard context. Thus a grouping of HRI, for example as shown in Table
2, is difficult to justify. Groupings may have some meaning if the context of the Likelihood and Consequence
assessments is known and done on a case-by-case basis, i.e. the system or issue under evaluation together with the
operational environment are well understood. Importantly, because a general a priori statement of HRI groupings has
no theoretical basis, groupings of HRI used to ascribe a level of risk acceptability must be done on a case by case basis
prior to the safety analysis, in a manner that takes into account the system context. Stevens (1946) provides a useful
discussion on the theory of measurement scales, while Ford (1993) discusses the application of measurement theory to
software engineering.
The HRI-based approach to safety has intuitive appeal to program managers and continues to be widely used. This
practice follows from the fact that the ALARP concept appears to simplify the problem of resource allocation and that
the concept of residual risk leads to qualitative statements of remaining safety actions, such as additional procedures,
which once articulated provide a well defined end to a safety program. It is of interest to note that the ‘burden of proof’
or ‘required due diligence’ for estimating the residual risk in the absence of hazard mitigation appears to be the same
regardless of how high the inherent risk.
4 Assumptions and Limitations of HRA
4.1 The Importance of Context
One problem with a general application of the HRI, as shown in the example Tables 1 and 2 lies in the fact that it is not
always possible to apply appropriate context scaling to the Likelihood and Consequence groups. The likelihood of
various outcomes will be dependent on the context of the problem being studied, as will the severity of the realisation
of a particular hazard. For example, the distribution of acceptability of HRI for faults in a Full Authority Digital Engine
Control System (FADECS) is likely to be very different from an examination of the risks of an experimental drug
treatment for patients with advanced forms of cancer. In the former case it is likely that there would be little tolerance
of a fault in the FADECS, while in the latter patients might be willing to risk death if there was even a small chance of
a cure. Prasad and McDermid (1999) discuss the importance of the context of a system when attempting to identify
emergent properties such as dependability.
The importance of context in trying to assess the safety of complex systems is well illustrated by Hodge and Walpole
(1999) where they adapted Boulding's (1956) hierarchy of systems complexity to Defence planning. The General
Hierarchy of Systems was summarised and illustrated as seen in Figure 1 below.
Figure 1. General Hierarchy of Systems
The application of this concept to the appropriate use of safety standards follows from the fact that standards developed
in an OH&S context are typically aimed at application at the Social level, while standards aimed at providing assurance
that a complex system is safe are applied at the Control level. An attempt to apply an OH&S based standard to a
complex computer-based system is likely to produce erroneous estimation of the system’s safety. This topic is
discussed later.
There are a number of assumptions, limitations and requirements implicit in a valid HRA process. These include:
a. the requirement for semantic consistency between the context description and the outcomes of a hazard realisation, i.e. the Consequence description;
b. the need for an exact quantitative description of the Risk Likelihood used, i.e. the likelihood function and the boundaries of the associated class groups;
c. a consistent mathematical description of the Consequence of hazard outcomes; and
d. the requirement for an 'a priori' mathematical model of the acceptability of risk that is consistent with the context of the analysis.
Apart from the requirement for semantic consistency, Cox (2008a) has addressed these issues in a paper on the
fundamentals of risk matrices. He concludes that PRA often has limited benefit, noting:
The theoretical results in this article demonstrate that, in general, quantitative and semi quantitative risk matrices have
limited ability to correctly reproduce the risk ratings implied by quantitative models, especially if the two components
of risk (e.g., frequency and severity) are negatively correlated. Moreover, effective risk management decisions cannot
in general be based on mapping ordered categorical ratings of frequency and severity into recommended risk
management decisions or priorities, as optimal resource allocation may depend crucially on other quantitative
information, such as the costs of different countermeasures, the risk reductions that they achieve, budget constraints,
and possible interactions among risks or countermeasures (such as when fixing a leak protects against multiple
subsequent adverse events).
Cox (see also Cox (2008b)) makes many other important points, including that probabilities are not an appropriate
measure for assessing the actions of intelligent adversaries. He also suggests three axioms that a risk matrix should
satisfy and shows that many matrices used in practice do not meet them. A further observation is that using a large
number of risk levels (or colours) in a matrix can give a spurious impression of the matrix’s ability to correctly
reproduce model risk ratings. For a 5 x 5 matrix, his axioms imply the matrix should have exactly three levels of risk
(the axioms require at least three levels, but Cox also recommends keeping the number of levels to a minimum).
However, there are some cases where PRA might be usefully employed. As Cox (2008a) notes:
If data are sufficiently plentiful, then statistical and artificial intelligence tools such as classification trees (Chen et al.,
2006), rough sets (Dreiseitl et al., 1999), and vector quantization (Lloyd et al., 2007) can potentially be applied to help
design risk matrices that give efficient or optimal (according to various criteria) discrete approximations to the
quantitative distribution of risks.
Other variations of conventional HRA aimed at better relating the likelihood and consequence pairs have been
proposed. Swallom (2005) provides an example, while Jarrett and Lin (2008) suggest a practical process to consistently
quantify likelihood and consequence, leading to a quantified HRI. The latter approach is strongly data dependent and
appears to provide a thoughtful and defensible use of HRA.
4.2 Semantic Consistency
The description of the system context and the outcomes from the realisation of a hazard can be a fraught process
involving imprecise descriptions and relationships. Overcoming this problem for complex systems will often require
considerable effort with the process being aided by formal analysis of the semantics involved in the description of the
system design and resulting hazards. One method for achieving internal consistency of the description of system
context is the application of set theoretic modelling. Wildman (2002) provides an example of this process.
5 Application of HRA in Safety Standards
5.1 Risk Based Standards
Over the last two decades there has been a divergence in the theoretical basis of system safety standards. In essence
there are two lines of thought. The conventional approach involves a process that attempts to apply HRA to identify
and classify system hazards according to some sort of acceptability criteria. The alternative approach is a qualitative
one, driven by system safety requirements, in which each accident scenario is assigned a danger level, and each
component safety requirement is assigned an assurance level that dictates the level of rigour required to develop and
analyse system components. The alternative approach is discussed later.
Safety standards such as the UK DEF STD 00-56 and the ubiquitous US MIL-STD-882 (2000) are examples of the
conventional approach. These are based on the ALARP approach to safety. This approach accepts the possibility of
Residual Risk and attempts to quantify assurance through the use of the HRI metric. Note: The US MIL-STD-882D
does not specifically call out ALARP but is instead based on a reasonability test similar to the ALARP approach.
Locally, the RAN Standard ABR 6303 (2006) is another example of the application of ALARP. This standard has been
widely promulgated and has been used to assess the safety of a number of complex systems.
While it may be possible to apply HRA-based safety standards to situations where the system is well understood, there
are systems where it is possible to argue that, a priori, the application of HRA to provide an estimation of system
assurance is inappropriate. For example, it may be appropriate to apply an HRA-based safety standard to a simple
mechanical system where there are sufficient data to draw some inferences about the statistical properties of
(Probability, Consequence) pairs and where the system context is well understood. However, the application of HRA to a computer intensive system requires further consideration, and alternative strategies need to be explored. In the case
where the use of an HRA-based standard has been mandated it will be important to ensure that critical software
components of the system are identified and treated appropriately.
6 Proposed Extension to HRA
6.1 Interpretation of Risk
We assume it is possible to quantify the (Likelihood, Consequence) pairs so that the definition
Risk = Likelihood x Consequence
makes sense (Royal Society Study Group, 1992, Sections 1.3.2, 1.3.4).
The meaning of the value of Consequence needs careful thought. In general this will be a random variable; that is,
different realizations of a hazard will produce different consequences. In an HRA, which is prospective though perhaps
informed by data, the values for Consequence could be decided by an individual or as part of a collective evaluation
process. In the former case, the value could incorporate the risk perceptions of the individual. In the latter case, the
value is likely to be closer to an average or expected Consequence. If so, there is a useful interpretation of risk as an
expected cost rate; see h. below and Section 6.3.4.
This conclusion also emphasises the importance of an inclusive and multidisciplinary process when evaluating
Consequence.
6.2 Concept Application
As noted previously, Jarrett (2008) appears to offer a practical method of developing an estimate of system safety
assurance.
The approach attempts to overcome some of the known deficiencies in the construction and use of risk matrices. It is
based on work by Anderson (2006) and Jarrett and Lin (2008); the latter work is summarized in Jarrett (2008). Their
mathematical basis for quantifying the margins of the matrix is very similar, though Jarrett and Lin embed this in a
wider process for deriving a risk matrix. The stated intentions of this process are to “create greater transparency” and
“develop a more quantitative approach”. The specific context for the work in Jarrett and Lin (2008) is assessment of
maritime threats to Australia.
The main steps of this process are as follows.
a. Define the relevant hazard or threat categories.
b. For each threat category, assess the consequences and the likelihood of the hazard. The consequences are also classified into categories.
c. For each consequence category, the possible severities are listed, described and ranked. It is important that the severities with the same rank line up across the categories, so that they will be generally agreed to be comparable. A guide to severities is that, where possible, the steps should correspond to roughly 10-fold changes in "cost" (which might be dollars but could be fatalities, injuries, land areas affected, etc).
d. The hazard is then assigned a score (rank) in each category.
e. The overall consequence score for the hazard is calculated by combining the category scores in a particular way (see below).
f. The likelihood of the hazard is assessed by its expected frequency, defined as the expected number of occurrences per annum. The score assigned to the likelihood also has steps that correspond to 10-fold changes in the frequency. Verbal descriptions of the scores can be given but are really only indicative; the number is the crucial element here.
g. The risk score for the hazard is then given by: Risk score = Consequence score + Likelihood score.
h. This score has a rough interpretation as log10(expected annual cost).
i. The possible scores for a hazard can be assembled into a matrix or table. If desired, each cell of the table can be assigned a measure of risk based on the risk score for the cell. This would look like a traditional risk matrix, but the scores have a definite quantitative interpretation that is transparent and can be validated against data.
6.2.1 Combination of category scores (Jarrett and Lin)
Suppose there are c categories and the associated category scores are s1, ..., sc. Then the combined (consequence) score is

    S = log10(10^s1 + 10^s2 + ... + 10^sc)
This is similar to taking the maximum score, but it gives added weight to multiple occurrences of the maximum. For example, consider two cases, with c = 7.
I. There is one score of 5 and six scores of 1.
II. There are seven scores of 5.
In each case, the maximum score is 5, so if S were taken to be the maximum score both these cases would get the same score. However, with the proposed system:

    S_I = log10(100,000 + 60) = 5.0003; and
    S_II = log10(700,000) = 5.85.

Thus case II has a substantially higher risk score, which seems appropriate since it has a high score in every category and so presumably is judged to have a more severe overall consequence. Variants of this basic scheme are clearly possible, and might be desirable in a particular case.
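A few lines of code make the combination rule and the resulting risk score concrete; this is simply a transcription of the formula above and of step g., with illustrative function names:

    import math

    def combined_consequence_score(scores):
        """S = log10(10^s1 + ... + 10^sc): close to the maximum category score,
        but pushed upwards when several categories score near the maximum."""
        return math.log10(sum(10 ** s for s in scores))

    def risk_score(consequence_score, likelihood_score):
        """Risk score = Consequence score + Likelihood score (step g. above)."""
        return consequence_score + likelihood_score

    case_I = [5] + [1] * 6            # one score of 5, six scores of 1
    case_II = [5] * 7                 # seven scores of 5

    round(combined_consequence_score(case_I), 4)     # 5.0003
    round(combined_consequence_score(case_II), 2)    # 5.85
    round(risk_score(combined_consequence_score(case_II), 3), 2)   # 8.85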
6.2.2 Interpretation of cell entries
In h. above, the risk score is given the interpretation of a log(expected annual cost). The model behind this is as follows. Suppose hazardous events occur randomly in time at a rate λ. Each occurrence of an event has an associated cost which is a random variable from a distribution with mean µ. Provided the costs are independent of the occurrence process, the expected total cost per unit time is the product λµ. Taking logs gives the relation in g. above, and explains h.
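Written out, under the stated assumption that costs are independent of the occurrence process:

    \[
      \mathbb{E}[\text{annual cost}] = \lambda\mu
      \quad\Longrightarrow\quad
      \log_{10}\mathbb{E}[\text{annual cost}] = \log_{10}\lambda + \log_{10}\mu .
    \]

Since the Likelihood and Consequence scores are, up to fixed offsets, log10 of the frequency and of the typical cost, their sum (the risk score) tracks the log of the expected annual cost.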
6.3 Example
This example is taken largely from Jarrett (2008). It concerns maritime threats.
6.3.1 Threat categories
These were taken as:
• Maritime Terrorism
• Illegal Activity in a Protected Area
• Protected Area Breach
• Piracy
• Unauthorised Maritime Arrivals
• Illegal Exploitation of Natural Resources
• Marine Pollution and Biosecurity
6.3.2 Consequence categories and severity levels
The categories are given in the top row of the table below (from Jarrett (2008)), together with descriptions of the
severities at ranks 4 and 5. (Note: The table is a fragment from the complete table provided in the CSIRO report).
Severity        | Death, injury or illness                            | Economic     | Environmental                                              | Symbolic
5: Catastrophic | Mass fatalities, remains collection compromised     | $5 billion+  | Irreversible loss of a conservation value of a bioregion   | Destruction of nationally important symbol
4: Major        | Multiple fatalities, remains collection compromised | $1-5 billion | Damage to a conservation value where recovery > ten years  | Serious damage to a nationally important symbol
One might dispute the equivalence of outcomes within a particular Consequence Category but it is clear that a high
degree of discussion and consultation was involved in production of this table. This discussion and consultation are an
essential part of the proposed risk assessment process.
It is worth noting that at this point the above table of Consequence and associated Severities is similar to the approach
provided by ABR 6303. The extension of the methodology proposed here would thus appear to be a natural extension
of the ABR 6303 process. However, there are limitations to the application of this approach, particularly when dealing with computer intensive systems. These limitations are discussed later.
6.3.3 Likelihood
This is summarised in the following table (from Jarrett (2008)).
Likelihood     | Description                                      | Indicative Rate Australia-Wide | Likelihood Score
Rare           | Aware of an event like this occurring elsewhere. | Prob 0.01 of event each year.  | 1
Unlikely       | The event will occur from time to time.          | Prob 0.1 of event each year.   | 2
Possible       | The event will occur every few years.            | One every three years.         | 2.5
Likely         | The event will occur on an annual basis.         | One every year.                | 3
Very Likely    | The event will occur two or three times a year.  | Two to three events a year.    | 3.5
Almost Certain | The event will occur on about a monthly basis.   | Ten events or more a year.     | 4
The following should be noted:
a. The 10-fold increase in frequency with each unit increase in the Likelihood Score. In this case, the authors have refined the scoring system to include some changes of 0.5; these are associated with a 3-fold change in frequency. This is broadly consistent, since log10(3) = 0.477 ≈ 0.5.
b. The decision to equate score 1 with a frequency of 1 event per 100 years. This is entirely arbitrary. We shall see shortly that it might be better to increase all likelihood scores by 1 in this instance.
c. The verbal descriptions in the first column are evocative but have no direct influence on the results. Effectively, they are defined by the frequencies. This is in contrast to other uses of risk matrices, where terms on the frequency/likelihood axis appear to be undefined (e.g. Fig. 9 in FWHA (2006)).
d. The caveats in Cox (2008b) about use of probabilities when the hazard results from the actions of intelligent adversaries should be kept in mind.
6.3.4 Risk score
This is defined by the sum of the consequence and likelihood scores. The interpretation mentioned, that of the log of
the expected annual cost, can be seen as follows. A likelihood score of 3 corresponds to an expected frequency of one
event per year. If the entire category scores are 4, say, then the consequence score is about 4.85 (log10(7 x 10^4)), leading to a risk score of 7.85. Looking at the table above, severity level 4 is associated with a dollar cost of order $1 billion. So the expected annual cost would also be of order $1 billion and its log would be 9. The risk score thus sits roughly one unit below the log of the expected annual cost; that is, 10 raised to the risk score is roughly a tenth of the expected annual cost. This is why increasing all the likelihood scores by 1 might be sensible in this case; it would give a closer match between risk score and log expected annual cost.
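The arithmetic behind this worked example, under the stated assumptions (one event per year, all seven category scores equal to 4, severity level 4 costing of order $10^9):

    \[
      \text{Consequence score} = \log_{10}(7 \times 10^{4}) \approx 4.85,
      \qquad
      \text{Risk score} = 4.85 + 3 = 7.85,
    \]
    \[
      \log_{10}(\text{expected annual cost}) \approx \log_{10}(10^{9}) = 9,
    \]

so the risk score sits a little over one unit below the log of the expected annual cost, which is the mismatch that the suggested score adjustment is intended to remove.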
The Risk matrix produced from this example is shown below.
Overall Consequence Score \ Likelihood (score) | Rare (1)   | Unlikely (2) | Possible (2.5)   | Likely (3)      | Very Likely (3.5) | Almost Certain (4)
1 to 1.85 (Insignificant)                      | Negligible | Low          | Low              | Low             | Moderate          | Moderate
2 to 2.85 (Minor)                              | Low        | Low          | Moderate         | Moderate        | Moderate          | Moderate
3 to 3.85 (Moderate)                           | Low        | Moderate     | Moderate         | Moderate        | Moderate          | High
4 to 4.85 (Major)                              | Moderate   | Moderate     | Moderate to High | High            | High              | High to Extreme
5 to 5.85 (Catastrophic)                       | Moderate   | High         | High             | High to Extreme | Extreme           | Extreme
Here the choices for the cell entries will again be the outcome of an extensive discussion and consultation process. If
actions are associated with each risk level, these must also be carefully thought through and calibrated to be consistent,
and appropriate for the perceived level.
We note that this matrix might not accord with the recommendation in Cox (2008a) for a minimal number of levels
(colours), though it does satisfy his three axioms. Because the isorisk contours on a log scale are straight lines, the
banding in this matrix is more natural and defensible than in many other applications.
7 Discussion
7.1 Hazard Severity Based Standards
An alternative to HRA is a qualitative approach based on a perceived severity of identified system hazards, which in
turn dictates the level of rigour required to analyse a hazard. Issue 2 of DEF(AUST)5679 (2008) provides an example
of this approach, where the standard asserts that the necessary system assurance will be obtained because the hazard
has been ‘designed out’.
An important demand made by Issue 2 of DEF(AUST)5679 is a tight coupling between the safety program and other
aspects of system development. Thus safety requirements are determined as part of the general system requirements
development process and the satisfaction of those requirements are incorporated into the system design and
implementation phases. Importantly, and in contrast to HRA, DEF(AUST)5679 increases the ‘burden of proof’ as the
inherent danger of a hazard increases, the notation used in the standard being Hazard Danger Levels.
Application of Issue 2 of DEF(AUST)5679 to existing systems can present problems if the provenance of the system
safety argument is uncertain or non-existent. In these circumstances the application of HRA in the manner suggested by Jarrett and Lin (2008) above appears to offer a practical method of developing an estimate of system safety assurance.
Noting that the treatment of Non Development Items (NDIs) in DEF(AUST)5679 allows for the use of informal
methods, it appears that a theoretically defensible approach to HRA could be incorporated into the standard when
assessing NDIs.
The SVRC Report (1999) provides further useful comparative information on the two different approaches, albeit on
earlier versions of the standards.
7.2 Application of HRA in Safety Standards
It is clear that the current use of HRA-based safety standards when assessing the safety of complex defence systems is
fraught with difficulties. Not only is the assessment of likelihoods largely qualitative and not based on supporting data,
but the associated consequences are unlikely to represent a global assessment of possible accidents. Both ABR 6303
and MIL-STD-882 tend to produce assurance assessments that could be readily challenged in the courts. They are both
essentially ‘low assurance’ standards.
The context of application of these two standards is important. MIL-STD-882 is a mature standard having evolved
through its application to military systems in the USA. The results of such application are normally evaluated by a well
resourced government organisation such as the USN Weapons Systems Explosive and Safety Review Board
(WSESRB), which can provide skills that allow the results of a MIL-STD-882 process to be examined carefully.
Notably, the WSESRB has the executive authority to enforce the results of its findings. Thus, while the theoretical
basis of the standard remains inadequate, the skill base of US organisations, such as the WSESRB, is able, to some extent, to compensate for the limitations of the standard.
By way of comparison, safety organisations within the Australian Defence organisation do not possess the same degree
of executive independence as that enjoyed by their American counterparts. The final decision on system deployment is
not subjected to mandatory approval by the safety organisation, but rather is made at the ‘Social Level’ (of Figure 1) by
non-safety personnel who take into account the recommendations of a safety assessment. For example, a system
assessed as HRI 9 could be accepted into service with the residual risk being mitigated by procedure rather than by
changes to the system recommended by a safety analyst.
The Australian Defence process thus amounts to a filtering of the safety argument through different levels of the defence bureaucracy, which can result in a diluted assessment.
7.3 A Safety Paradigm Shift
During the latter phase of the long evolutionary development of MIL-STD-882 there has been a rapid development of
programmable technology. Noting the acknowledged limitations of HRA in dealing with this technology, the logical
conclusion is that a new paradigm for addressing computer intensive systems is required. A focus on safety
requirements, and subsequently identified hazard severities, exemplified by Issue 2 of DEF(AUST)5679, appears to
provide the basis for such a paradigm shift. Not only does this standard incorporate a precautionary principle into the
assessment of system safety, it also provides for a more rigorous and defensible safety argument.
The use of DEF(AUST)5679 as the default safety standard would not initially impose a markedly different process
from the use of MIL-STD-882. Both standards require a hazard analysis to be conducted in the first instance, with the
results of that analysis determining the nature of any subsequent safety effort.
7.4 Characteristics of Computer Intensive Systems
Computer intensive systems are now widespread in both defence and civilian applications. As paragraph 1.4.2 Issue 2
of DEF(AUST)5679 notes:
The implementation of system functions by SOFTWARE (or DIGITAL HARDWARE) represents some unique risks to
safety. Firstly, the flexibility of programming languages and the power of computing elements such as current
microprocessors means that a high level of complexity is easily introduced, thus making it harder to predict the
behaviour of equipment under SOFTWARE control. Secondly, SOFTWARE appears superficially easy and cheap to
modify. Thirdly, the interaction of other elements of the system with the SOFTWARE is often poorly or incompletely
understood.
The idea of ‘safety critical software’ is one fraught with conceptual difficulties. Software per se is simply a set of
logical constructs which have (hopefully) been built according to a design requirement. As such, software is neither
‘safe’ nor ‘unsafe’, but rather may contain constructs that when executed in a particular environment (both platform
and external environment) may produce unexpected results. Thus the context in which the software executes is just as
important as the code itself when it comes to assessing system assurance.
DEF(AUST)5679 has a particular focus on ensuring conformance of software with the design requirements, but as
noted previously, can produce difficult management issues when applied to NDI or Military Off The Shelf (MOTS)
products. Many of these products are either in military service or have a history of commercial application, and in these
circumstances it is likely that system reliability data would be available. For example, an Inertial Measurement Unit
(IMU) is a complex device that is commercially available and has application in both military and civilian systems.
Thus a safety case for a system containing an IMU might be able to take advantage of reliability data in the way
described above, where the IMU is treated as a black box within a wider system context. While such an approach
would seem to be consistent with the treatment of NDI products in DEF(AUST)5679, it would not reduce the
requirement for rigour in the analysis of the surrounding system and associated IMU system boundary. Rather the use
of reliability data would augment the safety argument.
It is clear that it is quite inappropriate to use the RAN Standard ABR6303 as a guide to assessing the assurance of
complex computer-based systems. Not only is the standard aimed at assessing the risk of OH&S hazards, it is not data
dependent and is qualitative in assessing likelihood risks. However, it is suggested that with the incorporation of the methodology discussed above, the applicable scope of the standard could be widened. In particular, it would allow meaningful application to a larger class of physical systems.
7.5 Cultural Issues
Anecdotal evidence suggests that many program managers and system engineers regard a safety program as a necessary
evil, providing program uncertainty with little visible benefit. Such attitudes are inconsistent with system development
experience, but more support from senior management is required if an attitudinal change is to be achieved. Such
support should come from the reality that a well integrated safety program not only improves the system engineering
process, but the final quality of the product.
Additionally, it behoves program managers to provide for the most complete and defensible safety program available
with current technology, because in the event of a serious accident, it is inevitable that today’s safety efforts will be
judged in the light of tomorrow’s safety standards.
Because software-based systems are logically complex, it is not possible to assure system safety through testing alone.
However, testing combined with appropriate analysis offers the possibility of reducing the scope of a safety program:
such a combination can limit the requirement for an otherwise almost infinite test regime, through an initial investment
in the analysis of the safety-critical properties of the system.
In dealing with the issue of the scope of a testing regime it should be borne in mind that testing and analysis go hand in
hand in supporting the safety argument. Different safety cultures interpret this truism differently, a fact which is
reflected in the oft repeated anecdotal quotation:
“In the USA they test, test and test and then analyse the results of the testing, while in Europe they analyse, analyse
and analyse and then test the results of the analysis”.
While it might seem trite to labour this point, it is apparent that this is a cultural difference that is reflected in the
divergence in the theoretical basis for safety standards.
7.6 Harmonisation of Safety Standards
Noting the limitations of safety assurance derived from HRA and the existence of safety standards based on both HRA
and an assessment of hazard severity, there would seem to be a need to harmonise the two different approaches when
assessing complex computer intensive systems. Such harmonisation should identify the validity of a particular
approach to providing assurance of system safety. More particularly it is imperative that cases of inappropriate
application of HRA are identified.
The main harmonisation issue flows from the fact that DEF(AUST)5679 does not mandate the determination of a
residual risk which, in the case of a HRA, is often made on the basis of unsupported qualitative assessments.
Essentially, the real difference between the HRA and assurance based techniques lies in the mandated determination of
safety risk.
MIL-STD-882 does not provide adequate guidance on the design and implementation of computer intensive systems.
As a result the standard is necessarily weak in requirements for assessing the assurance of the software product. The
standard tries to address the issue through a concept of ‘software control categories’. This approach does little to
improve the situation as system complexity often denies the accurate enumeration of these categories at an appropriate
level of abstraction. So, while at a macro level, i.e., at a high level of abstraction, such categorisation is possible,
identification of the actual code module responsible for the control function may not be easy.
Interestingly, Swallom (2005) notes that:
….. the F/A-22 matrix adds a “designed out” column for hazards where risk has been reduced to zero.
This acknowledgement suggests that, in the continuing attempts to extend the application of HRA, there has been a
development of a tacit recognition that the hazard severity approach of mitigating hazards through careful design has
some merit.
In comparison to MIL-STD-882, DEF(AUST)5679 provides strong guidance on the design and implementation of
computer intensive systems. While this works well for a true development process, the standard's approach
when dealing with the integration and acceptance of NDI has the potential to present project management with difficult
financial decisions.
A safety case developed under DEF(AUST)5679 will almost certainly provide enough context information to allow
informed groupings of risk and consequence to be made. Thus, if demanded by a regulatory authority, it seems intuitive
that the approach outlined by Jarrett and Lin (2008) could provide a translation from the DEF(AUST)5679 approach to
a risk based approach. The point here is that while the process of moving from a severity based approach to a risk based
approach appears possible, the reverse is likely to be much more difficult.
7.7 Further Development of DEF(AUST)5679
As noted earlier, Issue 2 of DEF(AUST)5679 has the propensity to present program managers with difficult problems if
it is used to provide assurance for NDI based systems. Given the widespread use of software based NDI within defence,
it is clear that further development in this area would increase the appeal of the standard to program managers.
7.8 The Role of the Technical Regulatory Authority in Defence
As noted earlier there are a number of Technical Regulatory Authorities (TRAs) embedded within the fabric of the
Australian Department of Defence. The roles of these authorities vary in description, emphasis and basic function, but
all claim ‘safety’ as part of their raison d’être. So, for example, the Defence Safety Management Agency
will claim seniority in matters of Occupational Health and Safety (OH&S), whereas the Director General Technical
Airworthiness (DGTA) claims ownership of air and ground systems safety within the Royal Australian Air Force
(RAAF). Within the RAN there are a number of separate but interacting organisations involved in the assessment of
system safety. The TRAs are supported by administrative proclamations issued by various levels within the Defence
hierarchy.
The use of a specific safety standard is generally determined by the interested TRA. For example, DGTA almost
invariably requires software to be developed in accordance with the development standard RTCA/DO-178B and argues that
the development process ensures that the product is safe and suitable for service. An interesting defence of this process
has been provided by Reinhardt (2008).
The use of a particular standard to assure system safety must be endorsed by the TRA responsible for certifying the
safety of the system, where the system function can range from the development of safety critical avionics software,
through the development and deployment of explosive ordnance, to the methodology of reporting and investigating
OH&S issues. Importantly, any change to the application of new safety standards within Defence requires the approval
of the various TRAs. Thus any move away from HRA based safety standards within DoD would require TRA support.
In the case that a TRA demands an HRA based safety assessment, the provision of a method of mapping from the
hazard severity based approach of DEF(AUST)5679 to a HRI description of the outcome of the safety process would
assist in the acceptance of outcomes of the safety analysis.
7.9 The Need for Further Research
Given the pervasive use of HRA within the defence community there is an urgent need for research to further develop
the theoretical basis for the application of HRI when assessing system safety. In this regard Cox (2008a) notes:
In summary, the results and examples in this article suggest a need for caution in using risk matrices. Risk matrices do
not necessarily support good (e.g., better-than-random) risk management decision and effective allocations of limited
management attention and resources. Yet, the use of risk matrices is too widespread (and convenient) to make
cessation of use an attractive option. Therefore, research is urgently needed to better characterize conditions under
which they are most likely to be helpful or harmful in risk management decision making (e.g., when frequencies and
severities are positively or negatively correlated, respectively) and that develops methods for designing them to
maximize potential decision benefits and limit potential harm from using them. A potentially promising research
direction may be to focus on placing the grid lines in a risk matrix to minimize the maximum loss from misclassified
risks.
In particular, there is a need to better understand the relationship between, and possible integration of, the competing
safety methodologies, i.e. those that are risk based and those based on the concept of accident severity. Thus, in order to provide
improved interoperability between safety standards it is suggested that research into the relationship and applicability
between severity and risk based assessments of system safety be supported. Such research is not profitably done in
isolation, but rather needs to be done in the context of assessing the assurance of real systems. This requires support
from more than the primary research organisation.
8 Conclusions
Hazard Risk Assessment provides an inadequate theoretical platform for assessing the safety of complex systems.
Safety standards based on this approach can only be regarded as low assurance standards, not in tune with modern
safety thinking.
The extension to HRA proposed in this paper has the potential to extend the scope of the process to include many
physical systems. However, this requires a concomitant increased emphasis on the collection and analysis of
quantitative reliability data. This in turn demands the application of statistically sound data collection and analysis
methodologies, an approach not commonly found in today’s safety community.
Assessments of complex computer intensive systems continue to pose a particular problem for the safety analyst. Strict
conformance to design requirements and careful design of test regimes can assist the task, but system complexity can
make this approach expensive and time consuming, particularly if the safety requirements have not been identified or
adequately analysed.
9 Acknowledgements
The authors thank the referees for their constructive comments.
10 References
Ale, B. J. M. (2005): Tolerable or Acceptable: A Comparison of Risk Regulation in the United Kingdom and in the
Netherlands. Risk Analysis 25(2), 231-241, 2005.
Anderson, K. (2006): A synthesis of risk matrices. Australian Safety Critical Systems Association Newsletter, 8-11,
December 2006.
ABR 6303 (2006): Australian Book of Reference 6303, NAVSAFE Manual, Navy Safety Management, Issue 4
Boulding, K.E. (1956): General Systems Theory – The Skeleton of Science, Management Science, 2(3), April 1956.
Chen, J. J., Tsai, C. A., Moon, H., Ahn, H., Young, J. J., & Chen, C.H. (2006). Decision threshold adjustment in class
prediction. SAR QSAR Environmental Research, 17(3), 337–352.
Cox L.A. (2008a): What’s Wrong with Risk Matrices?, Risk Analysis, 28, 497-512, 2008
Cox L.A. (2008b): Some Limitations of “Risk = Threat x Vulnerability x Consequence” for Risk Analysis of Terrorist
Attacks. Risk Analysis, 28(6) 1749-1761, 2008
DEF(AUST)5679 (2008): Commonwealth of Australia Australian Defence Standard, Safety Engineering for Defence
Systems, Issue 2
Dreiseitl, S., Ohno-Machado, L., & Vinterbo, S. (1999). Evaluating variable selection methods for diagnosis of
myocardial infarction. Proc AMIA Symposium, 246–250.
FHWA (2006): Risk Assessment and Allocation for Highway Construction Management, at
http://international.fhwa.dot.gov/riskassess
Ford G. (1993): Lecture Notes on Engineering Measurement for Software Engineers, CMU/SEI-93-EM-9, Carnegie
Mellon University
Hessami A.G. (1999): Risk Management: A Systems Paradigm, Systems Engineering, 2(3), 156-167.
Hodge, R. and Walpole, J. (1999): A Systems Approach to Defence Planning – A Work in Progress, Systems
Engineering, Test & Evaluation Conference, Adelaide, 20-22 October 1999.
Jarrett, R. (2008): Developing a quantitative and verifiable approach to risk assessment, CSIRO Presentation on Risk,
August 2008
Jarrett, R. & Lin, X. (2008): Personal Communication.
Lloyd, G. R., Brereton, R. G., Faria, R., & Duncan, J. C. (2007): Learning vector quantization for multiclass
classification: Application to characterization of plastics. Journal of Chemical Information and Modeling, 47(4),
1553–1563.
MIL-STD-882D, (2000): Department of Defense, Standard Practice for System Safety
O'Brien, M. H. (2000): Beyond Democratization Of Risk Assessment: An Alternative To Risk Assessment
Prasad, D. & McDermid, J. (1999): Dependability Evaluation using a Multi-Criteria Decision Analysis Procedure,
Dependable Computing for Critical Applications (DCCA '99), p. 339.
Raffensperger, C. & Tickner, J (eds.) (1999): Protecting Public Health and the Environment: Implementing the
Precautionary Principle. Island Press, Washington, DC
Recuerda, M. A. (2006): Risk and Reason in the European Union Law, 5 European Food and Feed Law Review
Reinhardt, D. (2008): Considerations in the Preference for and Application of RTCA/DO-178B in the Australian
Military Avionics Context,13th Australian Workshop on Safety Related Programmable Systems (SCS’08), Canberra,
Conferences in Research and Practice in Information Technology, 100.
Royal Society Study Group (1992): Risk: Analysis, Perception and Management. Royal Society, London.
Stevens, S.S. (1946): On the Theory of Scales of Measurement, Science, 103(2684), June 7, 1946.
SVRC Services (1999): International Standards Survey and Comparison to Def(Aust) 5679, Document ID: CA38809101, Issue: 1.1
Swallom, D. W. (2005): Safety Engineer, U.S. Army Aviation and Missile Command: A Common Mishap Risk
Assessment Matrix for United States Department of Defense Aircraft Systems, 23rd International System Safety
Conference, San Diego, Ca., 22-26 August 2005
UK DEF STD 00-56: Issue 4 1 June 2007, Safety Management Requirements for Defence Systems.
Wildman, L. (2002): Requirements Reformulation using Formal Specifications: A Case Study, Software Verification
Research Centre, University of Queensland.
Wikipedia - the free encyclopedia, Precautionary Principle
INTEGRATING SAFETY AND SECURITY INTO THE SYSTEM
LIFECYCLE
Bruce Hunter
INTRODUCTION
System Safety and Information Security activities, while recognised as critical aspects of a
system, are often assigned targets only at the concept or requirements phases of
development. They are then left to achieve, independently, outcomes that must align with each other
and with other aspects of the system throughout its life. This somewhat cynical view of the
systems engineering model is reinforced by standards [IEC61508 Ed1, EN50126/28/29/59] that
require neither an integrated approach nor verification of compatibility between the resulting
safety and security controls.
While some attempts have been made to integrate the practices of safety and security engineering
[Ibrahim 2004], key Information Security standards [IEC27001][ISO27005][SP 800-30] make
no mention of the safety aspects of security controls. Only later standards [SP 800-82][ISA-99]
[IEC62443] begin to mention how security aspects support safety. Conversely, edition 1 of the
Functional Safety series (IEC61508) makes no specific mention of security and its impact on
achieving functional safety for the system, and while later versions of sector safety standards (e.g.
EN50126, 128, 129 and 159) include security aspects, they do not describe how these interact with
and are supported by security controls through the lifecycle.
As identified later in this paper, treating safety and security activities independently in the system
lifecycle can lead to unexpected and unwanted outcomes (see the locked fire-door example). Finding real-life
examples of these issues is not easy, perhaps because incidents are considered sensitive or because the
relationship has not been clearly understood or recognised. In recent surveys
more than 70% of organisations do not report security incidents to external parties [Richards
2009]. We have the legal system [Supreme Court of Queensland - Court of Decisions R v Boden]
to thank for details of an interrelated safety and security incident that would have gone unnoticed
and undocumented except for a sewerage system failure and subsequent spill…
Between 9 February 2000 and 23 April 2000 a former employee of a supplier
deploying a radio-networked SCADA system for Maroochy Shire Council accessed
computers controlling the sewerage system, altered electronic data in respect of
particular sewerage pumping stations and caused malfunctions in their operations.
The resultant sewerage spill was significant. It polluted over 500 metres of open drain
in a residential area and flowed into a tidal canal. Cleaning up the spill and its effects
took days and required the deployment of considerable resources.
The court imposed a two year sentence covering: 1 count of using a restricted
computer without the consent of its controller intending to cause detriment or damage
and causing detriment greater than $5,000; 1 count of wilfully and unlawfully causing
serious environmental harm; 26 hacking counts; and 1 count of stealing a two-way
radio and 1 count of stealing a PDS compact 500 computer. The concurrent sentence
imposed survived two appeals.
This case is now a benchmark for cyber security, albeit one whose facts are sometimes exaggerated.
The key issues associated with ensuring systems are safe and secure in their operation are
shown in Table 1.
Table 1. Maroochy Cyber Attack

Security Aspect | Compensating ISO27001 Control Objective
The safety of the system would not have considered the security implications at the time due to expectations of industry norms and a culture of “Security by Obscurity” | A.10.6 Network security management
The investigation found it difficult to differentiate between teething problems of the system being deployed (still had not been resolved from completion of installation in January) and the malicious hacking outcomes (this was also a source of the later appeals) | A.10.10 Monitoring
The employee had vital equipment and knowledge in his possession after resignation from the supplier, including critical configuration software that allowed the system data and operation to be changed remotely | A.8.3 Termination or change of employment
The system did not discriminate between a masquerading rogue and real nodes in the network | A.11.4 Network and A.11.5 Operating system access control
26 proven hacking attempts were made over a three month period; anecdotally there were more undiscovered events over a longer period | A.13 Information security incident management
An open communications network was used for what is a Critical Infrastructure operation, but again this was the industry norm for the time; communication technology was transitioning from point-to-point links to digital networks | A.10.6 Network security management
The data controlling the system could be modified by an intruder based on past knowledge | A.12.5 Security in development and support processes
By hacking attempts, it was possible to disable alarms, hiding further changes and unauthorised operation of the system | A.11.6 Application and information access control
Even the hacker himself was not immune from security issues; in appeal evidence the stolen
laptop had problems in one of the hacking attempts as the “Chernobyl” virus had infected it.
While in hindsight it may be easy to see the risks associated with lack of security controls that
impacted the safety of the system (all of these issues could have been mitigated by the imposition
of basic security objectives from ISO27001), the development and commissioning of industrial
control systems at the time and their supporting standards would not have explicitly required this
to be considered.
It is easy to understand why there are good reasons to apply effective and timely security controls
to systems to support both operational and functional safety integrity, but:
- Can they be addressed in isolation and still achieve their objectives?
- Aren’t they the same anyway and achieve a compatible outcome?
VALUES, PRIORITIES AND COMPATIBILITY
Before you attempt to integrate two value-based systems, it is important that you are sure their
value systems align. When it comes to safety and security the values and priorities that drive the
methodologies are not the same.
Assets versus People
The Common Criteria standard IEC 15408 addresses the assurance levels applied to security
management and the criteria for evaluating and claiming those levels. An assurance level can be
used as a measure of trust that a product's security functions will be effective in protecting the
confidentiality, integrity and availability of its associated assets. While this may appear to provide
an equivalent to the Safety Integrity Levels (SIL) associated with functional safety, basic
differences in purpose and methodology prevent this.
[Figure 1 (diagram): the IEC15408 model in which owners value assets, wish to minimise risk, and impose countermeasures against threat agents who wish to abuse and may damage the assets; threats exploit vulnerabilities and give rise to risk. The adapted diagram adds, in dotted lines, the people, hazards and safety functions that may be missing when such systems have safety implications. Adapted from IEC15408.1.]
Figure 1. IEC15408 Security Concepts and Relationship
Some explanation of the differences in approach can be seen in the Security Concepts and
Relationship Model [IEC15408.1] reproduced in Figure 1. The prime value here is the assets that
security protects from threat agents, whereas safety is about protecting against the risk of
physical injury or damage to the health of people [IEC61508.0] (added into the diagram with
dotted lines). This incompatibility of values leads to the likelihood of conflicting risks and
controls being applied that may compromise system safety and security. This needs to be
considered in addition to the interdependencies between safety and security controls and their
impact as outlined in Figure 2: controls in one framework may detract from or contribute to the
effectiveness of the other.
An example of the possible incompatible application of security controls at the expense of safety
outcome (email of sound files to police blocked) can be found in the proceedings and
recommendations of the Coronial Inquest into the death of David Iredale and issues with the
NSW 000 Emergency service [Hall 2009][NSW Deputy State Coroner 2009].
[Figure 2 (diagram): the safety framework (hazards and faults addressed by safety controls providing reliability, availability, maintainability and safety for systems and people) alongside the security framework (threats and vulnerabilities addressed by security controls providing confidentiality, integrity, availability and traceability for information assets). Control contributors: security controls protect safety-related functions from malicious actions that could trigger hazardous action or compromise them (e.g. a denial-of-service attack locking up a safety-related system), and safety controls protect users and maintainers of information assets from hazards. Control detractors: “malware” protection and fail-secure actions of security controls may degrade safety functions by absorbing free system time, and fail-safe actions of safety controls may add back-door vulnerabilities to information assets.]
Figure 2. Safety and Security Control Contributors and Detractors
Safety Integrity versus Security Priorities
Safety hazard and risk analysis, in association with any necessary risk reduction, achieves a residual
risk rating, which is to be as low as reasonably practicable (ALARP) [IEC61508.5]. Reliance on the
likelihood of a dangerous failure associated with this risk leads to a required SIL. This is a
quantitative level, derived from the failure rate associated with the random and systematic
failures of the elements of the system that support the safety function.
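To make the quantitative nature of a SIL concrete, the minimal sketch below maps an average probability of failure on demand to a SIL using the commonly cited low-demand-mode bands of IEC 61508; the function name is mine, and the bands shown are the only values assumed.

```python
def sil_for_pfd_avg(pfd_avg):
    """Return the SIL met by an average probability of failure on demand,
    using the low-demand-mode bands published in IEC 61508-1."""
    bands = [
        (1e-5, 1e-4, 4),   # SIL 4: 1e-5 <= PFDavg < 1e-4
        (1e-4, 1e-3, 3),   # SIL 3
        (1e-3, 1e-2, 2),   # SIL 2
        (1e-2, 1e-1, 1),   # SIL 1
    ]
    for lower, upper, sil in bands:
        if lower <= pfd_avg < upper:
            return sil
    return None            # outside the tabulated bands

print(sil_for_pfd_avg(5e-3))   # -> 2
```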
Security risk evaluation, however, leads to a ranking of risks associated with the likelihood of a
threat exploiting a vulnerability and compromising an asset. Control objectives are then applied
to mitigate risks in priority order of the risk ranking identified. There is no guarantee that all risks
will or can be treated, and risk treatment is invoked to reduce (by mitigating controls), retain
(accepting that the risk may be realised), remove or transfer the risk. This security risk ranking,
rather than rating, approach is clearly not compatible with either the ALARP principle or other
safety risk methodologies.
RISK ASSESSMENT COMPATIBILITY
Both safety and security engineering make extensive use of risk management to assess and
mitigate risks that threaten system integrity and compromise the safety of people and the
security of assets. There are, however, important differences in the way risk management is
applied and in the decisions made as a result of the estimated risk.
Risk Impact
The consequences of hazardous events associated with safety relate to injury to people and their
health, ranging from an individual with a minor injury to many people killed. Security risk impact
usually relates to the value of the asset compromised, in dollar terms, arising from disruption to
operations, disclosure of information, loss of reputation and business, and direct financial loss.
Some standards [ANSI/ISA-99] quantify security risk in terms that include safety outcomes
and it is quite feasible that risk impacts could be aligned for “like” consequences. Other
standards and guidance [NIST SP800-30][ISO27005][ITSEAG 2006] again take no account
of the impact on safety or injury, just business impact.
Risk Likelihood
The comparison of likelihood methodology between safety and security is where these system
attributes diverge markedly. Probability of failure for functional safety is broadly accepted for
random as well as systematic failures; although malicious action should be considered as part of
Preliminary Hazard Analysis, the probability of motive-based hazards is as yet unquantified,
though it is starting to be modelled [Moore, Cappelli, Trzeciak 2008].
Figure 3 illustrates the differences and alignment between safety and security failures in a
generic cause-consequence model.
[Figure 3 (diagram): a generic cause-consequence model in which an intentional threat (almost certain, constantly evolving), security design faults, safety design faults (lack of design rigour) and wear-out (quantified by MTBF and component life) lead, via vulnerabilities, loss of control, and ineffective preventive and reactive controls, to incidents, hazardous events, systematic and random failures, and ultimately to critical or non-critical security and safety outcomes.]
Figure 3. Generic Cause-Consequence Model of Safety and Security Failures
Assigning probabilities to security exploitations is difficult, to say the least. Likelihood ratings in
most standards are very qualitative, owing to the non-deterministic nature of security threats and the
vulnerabilities of rapidly evolving information technology. Claiming a security incident is
improbable, with a frequency of 1 in 1000 years, is clearly “unrealistic” and open to abuse in
manipulating the risk rating. Again, some security standards do not help, recommending up to five
levels of likelihood from “Almost Certain” to “Rare” [ISO27005][ITSEAG 2006] with no
quantification.
The exponential rise of security incidents makes it hard to reliably quantify the frequency
contribution to likelihood; attack rates have reached such levels that exposure is now expressed as a
survival time based on system type (in some cases currently less than an hour). Figure 4 [US CERT]
shows the evolution of reported attacks, tied to infrastructure-related incidents and the publication of
safety and security standards. This “flooding” dictates an attack probability of 1.
Little was mentioned of information security in the safety standards available in 2001. Security
could have been considered like any other hazardous event risk, but this would have required
domain knowledge of the possible threats and vulnerabilities of the system to attack. As outlined
previously for the Maroochy Shire Council SCADA system, safety practitioners would not
have been fully aware of the vulnerabilities of an open communications network.
More recent attacks on critical infrastructure, such as the Estonian and Georgian “cyberwars”
[Nazario 2007] and the US power grids [Condon 2008], show that the risk to systems attached to
open communication networks is difficult to quantify.
[Figure 4 (chart): CERT-reported attacks and vulnerabilities (thousands) by year, 1992-2009, annotated with the publication of safety and security standards (MIL-STD-882C, BS7799, IEC61508 Ed1, NIST SP800-30, AS4444, AS4048, ISO27001, ISA-TR99, IEC62443-3 and the forthcoming IEC61508 Ed2) and with infrastructure-related incidents (the attack-triggered Maroochydore sewer spill, the Slammer worm at an Ohio nuclear plant, the Estonian and Georgian cyber wars, and US electricity grid cyber profiling). CERT stopped attack reporting once the survival time for unprotected Windows systems fell to 30-100 minutes.]
Figure 4. US CERT Trend in Reported Computer Attacks in Context
Understanding probability in terms of random component failure, and the probability of
introducing systematic faults under defined levels of development rigour, rests on proven foundations.
The assignment of probability to motive-based attacks, however, is not tenable. Security attacks
are driven by the attraction of the target to the motives of the attacker, and this is hard or even
impossible to measure. Attack targets and mechanisms continue to evolve, from spamming
to political agendas to cyber-crime. The probability of motive-based attacks is best taken as 1,
with other protective layers introduced.
Along with attack targets and mechanisms, new security vulnerabilities are constantly emerging
from evolving technology driven by the needs of consumers. The convergence of system
and communication technology to meet these needs has created previously unconsidered
vulnerabilities for control systems.
The time taken to fix discovered vulnerabilities means that patches alone cannot be relied on for
protection. The XP vulnerability exploited by the “Conficker” malware, although discovered and
detected early by virus checkers, took several months to receive an effective patch and cleaning
software. Conficker also used a different attack mechanism, “autorun” on USB thumb
drives and other removable media, thus avoiding the usual detection and protection controls.
What if Safety is Reliant on Security Reliability?
One possible method to align safety and security risk likelihood is to use Layer of Protection
Analysis (LOPA), as supported by IEC61511.3 and the yet-to-be-published Edition 2 of IEC61508.5.
LOPA could be used to assign a failure probability to security risks as long as realistic values are
attributed to each Protection Layer (PL) and the rules of Specificity (one security function),
Independence (separation from other layers [Hunter 2006]), Dependability (a quantitative failure
rate) and Auditability (validation of functionality) are applied.
[Figure 5 (diagram): in an open threat environment an attack works through protection layers 1..N. When a high-risk attack is detected, security incident response, security remediation controls and other response controls must act within the incident detection time (TDET) plus the incident remedy time (TREM); if the threat exploits the next layer’s vulnerability first (time TEXP), that protection layer is compromised. Each protection layer carries a probability of failure on attack, PFA_PL.]
Figure 5. Probability Model of Security Defence in Depth
Knowing that threats and the vulnerabilities they exploit are subject to change, most organisations
apply a defence-in-depth strategy in which no single control is relied on for protection against
attack. This, however, depends on effective threat and vulnerability monitoring with fast and
reliable incident response, which must prevent an attack from breaching the next protective layer
before the threat agent discovers and/or exploits its vulnerabilities, as illustrated in Figure 5.
The Probability of Failure on Attack (PFA) for each PL can be derived from the probability of an
attack exploiting the next protective layer before a mitigating response is implemented. The
product of each layer’s PFA then provides the total PFA_avg. Table 2 is an example of the
application of safety LOPA to security protection [IEC61511.3][IEC61508.5].
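Before turning to Table 2, the sketch below gives a rough illustration of how a per-layer PFA could be estimated from the timing race of Figure 5, namely the probability that detection plus remediation (TDET + TREM) does not beat the attacker's exploitation time (TEXP). The exponential timing distributions, the parameter values and the function name are illustrative assumptions of mine, not figures from the paper or any standard.

```python
import random

def layer_pfa(mean_t_det, mean_t_rem, mean_t_exp, trials=100_000):
    """Monte Carlo estimate of one protection layer's Probability of Failure
    on Attack (PFA): the layer fails when detection plus remediation takes
    longer than the attacker needs to exploit the next layer.  Exponential
    timings are an illustrative assumption."""
    failures = 0
    for _ in range(trials):
        t_det = random.expovariate(1.0 / mean_t_det)  # incident detection time
        t_rem = random.expovariate(1.0 / mean_t_rem)  # incident remedy time
        t_exp = random.expovariate(1.0 / mean_t_exp)  # time to exploit next layer
        if t_det + t_rem > t_exp:
            failures += 1
    return failures / trials

# Hypothetical values (hours): detection 1 h, remediation 4 h, and a mean of
# 48 h for the attacker to exploit the next layer's vulnerability.
print(layer_pfa(1.0, 4.0, 48.0))
```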
Table 2. LOPA Example with Security Defence in Depth Layers

Initial risk
Event impact description: network attack performs unauthorised operation
Severity level: Major Environmental Damage
Initiating cause: hacking attack
Initiating likelihood: 1 (can’t predict likelihood of attack)

Protection layers (Probability of Failure on Attack, PFA)
PL1 Message Encryption: PFA 0.1 (XXX Standard)
PL2 Firewall Controls: PFA 0.1
PL3 Network Controls: PFA 0.1 (two-factor authentication)
PL4 Application Controls: PFA 1 (can’t determine application vulnerabilities)

Residual likelihood
Intermediate likelihood: 10^-3
Security Response PFA: 0.1 (assuming response pre-exploitation)
Mitigated likelihood: 10^-4
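The arithmetic behind Table 2 is simply the product of the initiating likelihood, each layer's PFA and the security response PFA; the short sketch below reproduces it. The variable names are mine, and the values are those shown in the table.

```python
# Reproducing the Table 2 LOPA arithmetic: the mitigated likelihood is the
# initiating likelihood multiplied by each protection layer's PFA and by
# the security response PFA.
initiating_likelihood = 1.0          # attack probability taken as 1
layer_pfas = [0.1, 0.1, 0.1, 1.0]    # PL1 encryption, PL2 firewall,
                                     # PL3 network, PL4 application
security_response_pfa = 0.1

intermediate = initiating_likelihood
for pfa in layer_pfas:
    intermediate *= pfa              # 10^-3 after the four layers

mitigated = intermediate * security_response_pfa
print(f"{intermediate:.0e} {mitigated:.0e}")   # 1e-03 1e-04
```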
When a safety function is reliant on security protection against a likely threat of compromise,
the PFA could form part of the Probability of Failure on Demand. This practice would need
considerably more work, and the establishment of dependable survival-probability statistics,
before it could be counted towards a resulting SIL. While the Common Criteria Evaluation
Assurance Level may provide confidence and a level of trust in the security controls applied, it
does not equate to a Safety Integrity Level. Care must also be taken, as in Figure 5, that
remediation against the attack does not itself compromise a safety function.
APPLICATION OF SAFETY AND SECURITY CONTROLS
The application of security controls to support functional safety has been addressed in other
papers [Smith, Russell, Looi 2003][Brostoff, Sasse 2002].
Inherent conflicts between safety and security methodology become evident in the mitigation
controls applied against the associated risks. Risk discussion forums and application standards give
anecdotal cases where incompatible security controls lead to hazardous situations in systems [NIST
SP800-82][ISA-99]. Typically these relate to security scanning and penetration testing, and may
simply have exposed inherent systematic faults that would eventually have led to these situations.
With an increasing threat environment for safety-related systems and uncertainty about the
vulnerabilities they contain, there is an increasing risk of ill-considered security controls being
applied which directly degrade the functional safety of these systems.
Establishing Control Compatibility
Security Objectives | Safety: Must | Safety: Don’t care | Safety: Must not
Must | Contributing | Compatible | Incompatible
Don’t care | Compatible | Compatible | Compatible
Must not | Incompatible | Compatible | Contributing

Figure 6. Proposed Control Compatibility Model
One helpful, albeit obvious, way to manage where conflicts or incompatibilities arise is to divide
functionality into value objectives: “must”; “must not”; and “don’t care” for each functional
aspect as depicted in the compatibility chart proposed in Figure 6.
For compatibility to be achieved, no “must” control in one aspect may coexist with a “must not”
in the other; other combinations are compatible, as sketched below. This can also be seen in the
locked fire-door example that follows.
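A minimal sketch of that compatibility rule expressed as a lookup, following the Figure 6 model; the function and value names are mine, not part of any standardised notation.

```python
# Compatibility of a (security objective, safety objective) pair under the
# Figure 6 model: "must" against "must not" is incompatible, matching
# "must"/"must" or "must not"/"must not" objectives are mutually
# contributing, and every other combination is merely compatible.
MUST, DONT_CARE, MUST_NOT = "must", "don't care", "must not"

def compatibility(security_objective, safety_objective):
    pair = {security_objective, safety_objective}
    if pair == {MUST, MUST_NOT}:
        return "Incompatible"
    if pair in ({MUST}, {MUST_NOT}):
        return "Contributing"
    return "Compatible"

print(compatibility(MUST, MUST_NOT))      # Incompatible
print(compatibility(MUST_NOT, MUST_NOT))  # Contributing
print(compatibility(DONT_CARE, MUST))     # Compatible
```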
Conflicts - The Locked Fire-door Paradox
An example of safety and security conflict in everyday life is the paradox of locked fire doors,
which has serious consequences for the safety of occupants who may be trapped in a burning
building or in fire-escape stairs. Conflicts between the safety and security controls can be
identified and resolved using Goal Structuring Notation (GSN), as in Figure 7 [Lipson, Weinstock
2008][Kelly, Weaver 2004].
[Figure 7 (GSN diagram): the FireSafety goal (keep occupants of an area safe from fire or other dangerous situations, given that enclosed areas endanger occupants during a fire outbreak) is supported by EvacuateEarly and AlternativeEgress, leading to “safety - must” solutions (ensure doors are unlocked from inside to allow quick egress; provide multiple exits for alternate egress in emergency) and a “safety - must not” constraint. The AssetSecurity goal (protect assets from threats to their confidentiality, integrity and availability) is supported by SecureEntry and SecureExit, leading to “security - must” solutions (lock doors from outside to prevent unauthorised entry; lock doors from inside to prevent unauthorised exit and breach) and a “security - must not” constraint (limit entrances to reduce security risk). Two conflicts arise between these solutions, and two alignments resolve them (use video surveillance to identify pending breaches; alarm exits to alert when security is compromised).]
Figure 7. Simplified GSN view of Locked Fire-door Conflicts
If the safety and security controls in this example are considered together, then not only are
conflicts identified early, but the controls can be modified to improve the effectiveness of both.
AN ALIGNED APPROACH
Rather than combining disparate methodologies for safety and security, this paper proposes
Lifecycle Attribute Alignment, which ensures that effective and compatible safety and security
controls are established and maintained at key lifecycle stages: concept, requirements,
qualification and maintenance.
In Figure 8, the interaction in these phases is shown in terms of alignment attributes (A), requirement
allocation attributes (R) and verification effectiveness attributes (V). Engineering, System Safety
and Security Management plans should include these attributes as objectives to be achieved and
maintained through the lifecycle.
[Figure 8 (diagram): System Safety Management and Information Security Management activities mapped against lifecycle phases from operational concept studies, requirements analysis, preliminary hazard analysis and architectural design through requirements allocation, implementation, system integration and test, qualification testing, installation, acceptance and through-life support, under evolving security threats, technology and obsolescence; alignment (A), allocation (R) and verification (V) attributes mark the key interaction points.]
A1 – Ensure security and operational concepts align
A2 – Ensure safety and operational concepts align
A3 – Ensure controls are free from conflict
A4 – Ensure validation includes compatibility
A5 – Ensure security updates don’t compromise safety
A6 – Ensure functional updates don’t compromise safety
A7 – Ensure functional updates don’t compromise security
R1 – Ensure security risks are considered in hazard analysis
R2 – Ensure safety requirements are allocated
R3 – Ensure security requirements are allocated
R4 – Ensure security vulnerability updates are conducted
R5 – Ensure secure information is removed before disposal
V1 – Ensure safety requirements are validated and match current risk
V2 – Ensure security requirements are validated against vulnerabilities
Figure 8. Key Lifecycle Alignment Points
The yet-to-be-published edition of IEC61508.1 does reference newer security standards and the
consideration of malevolent actions during hazard and risk analysis throughout the safety lifecycle.
The Australian IT/6 Committee has recommended that security considerations, including
compatibility, be added to IEC61508 at key points in the safety lifecycle, and hopefully these will
appear in the final Edition 2 of the standard.
CONCLUSION
System Safety and Security are not the same in their values, methodology or their stability;
neither can they be treated independently if conflict is to be avoided in their mitigating controls.
As discussed in this paper, two key issues limit the success of integrating safety and
security in the systems lifecycle: incompatibility in risk management, and possible conflicts
between mitigating controls. Using the safety LOPA technique may allow the determination of the
probability of security control failure and of the resulting dangerous failure probability. GSN
or similar techniques should be applied where safety and security controls may conflict;
such techniques have also been used to develop more sound Security Cases [Lipson, Weinstock 2008].
Ensuring continued alignment of the dependence and compatibility of safety and security through
the lifecycle is the key to their successful integration. The dangers of not treating safety and
security together seriously are:
- an increasing risk of successful attacks on infrastructure systems with safety functions;
- an increasing risk of mitigating security controls compromising safety functions somewhere
in their lifecycle; and
- the likely imposition of governmental controls on critical infrastructure protection if the
industry cannot demonstrate adequate support for information security in the systems they
supply, operate or maintain.
The benefits of integrating safety and security into the lifecycle are not only safer and more secure
systems but also minimisation of the cost associated with late discovery of issues in the implementation,
acceptance or support phases; building safety and security in from the start is essential.
REFERENCES
ANSI/ISA-99 (2007) Security Guidelines and User Resources for Industrial Automation and
Control Systems.
AS IEC 61508 Ed. 1 (1998) Parts 0 to 7, Functional safety of electrical/ electronic/
programmable electronic safety-related systems.
AS ISO/IEC 15408 Part 1-3 (2004) Information technology - Security techniques - Evaluation
criteria for IT security
AS/NZS ISO/IEC 27001 (2006), Information technology—Security techniques—Information
security management systems—Requirements
Brostoff, S., & Sasse, M. A. (2001, September). Safe and Sound: a safety-critical approach to
security. Position paper presented at the New Security Paradigms Workshop 2001,
Cloudcroft, New Mexico, USA.
Condon, Stephanie (2008) Cyberattack threat spurs US rethink on power grids. ZDNet.co.uk
Security threats Toolkit article, 15 Sep 2008
Hunter, B.R., (2006) Assuring separation of safety and non-safety related systems. In Proc.
Eleventh Australian Workshop on Safety-Related Programmable Systems (SCS 2006),
Melbourne, Australia. CRPIT, 69. Cant, T., Ed. ACS. 45-51.
Hall, L., (April 28, 2009) Police had to fill in forms and wait for David Iredale phone tapes,
Article in The Australian
Ibrahim, L. et al, (2004) Safety and Security Extensions for Integrated Capability Maturity
Model. United States Federal Aviation Administration
IEC61511-3:2003 Functional safety – Safety instrumented systems for the process industry
sector. Part 3: Guidance for the determination of the required safety integrity levels.
ISO IEC/PAS 62443-3 (2008) Security for industrial process measurement and control –
Network and system security
ISO/IEC 27005 (2008) Information technology — Security techniques — Information security
risk management
IT Security Expert Advisory Group - ITSEAG (2006) Generic SCADA Risk Management
Framework, Australian Government Critical Infrastructure Advisory Council
Kelly, Tim P., Weaver, Rob A. (2004), The Goal Structuring Notation – A Safety Argument
Notation. Proceedings of the Dependable Systems and Networks 2004
Lipson, H., Weinstock, C. (2008) Evidence of Assurance: Laying the Foundation for a Credible
Security Case. Department of Homeland Security Build Security In website, May 2008.
Moore, A.P., Cappelli, D.M., Trzeciak, R. F., (May 2008) The “Big Picture” of Insider IT
Sabotage Across U.S. Critical Infrastructures. CMU/SEI-2008-TR-009
Nazario, J. (2007) Explaining the Estonian cyberattacks. ZDNet.co.uk Security threats Toolkit
article, 30 May 2007
NSW Deputy State Coroner, (07.05.2009) 1427/2006 Inquest into the death of David Iredale,
Office of the State Coroner of New South Wales
Richards, K. (2009) The Australian Business Assessment of Computer User Security: a national
survey. Australian Institute of Crime Reports Research and Public Policy Series 102
Smith, J., Russell, S., Looi, M., (2003) Security as a Safety Issue in Rail Communications. In
Proc. 8th Australian Workshop on Safety Critical Systems and Software (SCS’03).
SP 800-30 (2002) Risk Management Guide for Information Technology Systems, US National
Institute of Standards and Technology,
SP 800-82 (2008) Guide to Industrial Control Systems (ICS) Security. US National Institute of
Standards and Technology, Final public draft September 2008
Supreme Court of Queensland (2002)- Court of Decisions R v Boden, QCA 164 (10 May 2002)
Appeal against Conviction and Sentence
US-CERT, United States Emergency Response Team - http://www.us-cert.gov/
BIOGRAPHY
Bruce Hunter ([email protected]) is the Quality and Business Improvement
Manager for the Security Solutions & Services and Aerospace divisions of Thales Australia. In
this role Bruce is responsible for product and process assurance as well as the management of its
reference system and its improvement.
Bruce has a background in IT, systems and safety engineering in the fire protection and
emergency shutdown industry and has had over 30 years of experience in the application of
systems and software processes to complex real-time software-based systems.
Bruce is a contributing member of Standards Australia IT6-2 committee, which is currently
reviewing the next edition of the IEC61508 international functional safety standards series. Bruce
is also a Certified Information Security Manager and Certified Information Systems Auditor.
WHAT CAN THE AGENT PARADIGM OFFER SAFETY
ENGINEERING?
Louis Liu, Ed Kazmierczak and Tim Miller
Department of Computer Science and Software Engineering
The University of Melbourne
Victoria, Australia, 3010
Abstract
A current trend in safety-critical applications is towards larger, more complex systems. The agent
paradigm is designed to support the development of such complex systems. Despite this, agents are
having minimal impact in safety-critical applications.
In this paper, we investigate how the agent paradigm offers benefits to traditional safety engineering processes. We demonstrate that concepts such as roles, goals, and interactions narrow the gap
between engineering and safety analysis, and provide a natural mechanism for managing re-analysis
after change. Specifically, we investigate the use of HAZard and OPerability studies (HAZOP) in
agent-oriented software engineering. This offers a first step towards broadening the scope of systems
that can be analyzed using agent-oriented concepts.
Keywords: agent-oriented software engineering, safety-critical systems, safety analysis, HAZOP
1 INTRODUCTION
A current trend in safety-critical systems is towards systems that are larger, more complex and have
longer life-spans than their predecessors (Mellor, 1994). Many modern systems are characterised by
being autonomous and independent nodes distributed over a network, having multiple modes, and more
functionality than their predecessors (Milner, 1989). Further, such systems typically undergo numerous
upgrades and adaptations over their lifetime.
The multi-agent paradigm is well-suited to modelling and analysing such systems. Despite being tailored
for the development of complex distributed systems there has been little uptake of agent-oriented software
engineering (AOSE) methods in safety-critical systems development — either in research or in practice.
Current practice in safety engineering centres around processes that ensure that the hazards of a system
are identified, analysed and controlled in requirements, design and implementation (see for example
(Ministry of Defence, 1996; RTCA, 1992; IEC, 2003)). Hazard analysis forms a critical part of the
engineering of safety-critical systems and there are numerous techniques reported in the literature for
conducting such hazard analysis. These analysis methods are predominantly team-based and rely on
documented accident analyses from similar systems and the ability and experience of engineers to predict
potential accidents.
In this paper, we discuss how the agent paradigm offers benefits to traditional safety engineering processes. We demonstrate that concepts such as roles, goals, and interactions narrow the gap between
engineering and safety analysis, and provide a natural mechanism for managing re-analysis after change.
Specifically, we investigate the use of HAZard and OPerability studies (HAZOP) in agent-oriented software engineering, which we overview in Section 3.
The goal of our research programme is to develop analytical methods for assuring safety in multi-agent
systems. In Section 4, we illustrate a way of analysing multi-agent systems based on the idea of interactions, or how to perform a HAZOP study based on interactions. To do this we introduce the idea of an
interaction map and show how to adapt the HAZOP to interaction maps. We then go on to show how the
role and goal paradigm found in methodologies such as Gaia (Zambonelli et al., 2003) and ROADMAP
(Juan et al., 2002) can be used to improve the hazard identification and analysis process by providing direct feedback from safety analysis to design, how roles and goals can limit changes and how interaction
maps lead naturally to quantitative safety analysis similar to Mahmood and Kazmierczak (2006) in the
form of Bayesian accident networks. These improvements are used to reduce the total effort spent on
manual identification and analysis, and to provide feedback into the system design.
2 AN ILLUSTRATIVE EXAMPLE
To illustrate our ideas, we consider a simple fruit packing and palletising system. Different types of fruit
are conveyed by means of several special conveyor systems to a central sorting and packing area. Each
conveyor—here called a fruit line—transports a number of different types of fruit. Here a centralised
sorting, packing, palletising and storing system sorts the fruit into boxes for shipping to supermarkets or
for further processing.
The system must implement the following four features:
A. A packing feature, in which the system must select quality pieces of fruit, ignoring damaged and bruised fruit, and pack the selected fruit into boxes without bruising or damaging the fruit;
B. A palletising feature, in which the system must place approximately 640kg of fruit onto a pallet in an 8 × 8 × 5 symmetrical arrangement of 2kg boxes of fruit;
C. A wrapping and nailing feature, in which the system wraps the completed pallet in a protective sheet and nails the sheet to the pallet; and
D. A storing feature, in which the system must move the completed pallet to a cool store.
The performance requirements on the system are that it should be able to pack 6 pallets per hour, and
to be available for 166.5 hours per week, leaving 30 minutes per working day for routine cleaning,
maintenance and recalibration. It is anticipated that humans will be required to interact with the fruit
packing system, thus safety will be important.
An agent-oriented analysis using a role-based method such as Gaia or ROADMAP might arrive at the
following decomposition of the problem: a main goal, which is to sort, palletise and store fruit, and a
decomposition of the main goal into four key subgoals: (1) to sort the fruit; (2) to pack a 2kg box of fruit;
(3) to palletise the boxes by placing enough boxes on the pallet to make up the 640kg pallet; and (4) to
store the completed pallets in a cool store. Each goal is decomposed further into one or more roles. The
roles are shown schematically in Figure 1 as stick-figures.
3 HAZARD AND OPERABILITY STUDIES
Traditional non-agent approaches to safety analysis rely heavily on constant hazard identification and
analysis during all phases of the system development life-cycle. Hazards are the sources of potential
accidents and the aim in safety engineering is to identify and control the hazards of a system.
There are many different techniques for hazard analysis in the literature but the most often used are:
Exploratory Methods that aim to simply explore the system and its behaviour in order to identify hazards. Prominent among the exploratory methods are HAZOP studies that we will investigate further
below, and SHARD (Fenelon et al., 1994).
Causal Methods that work backwards from Hazards to their possible causes. Prominent among causal
methods is Fault Tree analysis (Leveson and Shimeall, 1991) that grows an AND-OR tree of events
back from each hazard to system-level events.
[Figure 1 (diagram): the top-level goal “Safely pack and store 6 pallets/hour; available for 99% (or 166.5 hours) per week” is decomposed into the subgoals of sorting the fruit, packing the fruit into boxes, placing boxes on a pallet, wrapping and stapling, and storing the fruit, supported by the Sorting, Packing, Palletising, Wrapping & Nailing and Storing roles.]
Figure 1: A typical role-goal model for the fruit palletising system. The goal of packing and storing the
fruit is decomposed into the four roles of packing, palletising, wrapping, and storing.
Consequence Methods that begin by identifying system level events and then investigating the consequences of those events. Prominent among consequence methods are Failure Modes and Effects
Analysis (Palady, 1995) and Event Tree Analysis.
Here we investigate the use of HAZOP for conducting hazard analysis on the fruit packing system described in section 2 above. HAZOP studies are a team-based method for identifying and analysing the
hazards of a system. Originally developed for the chemical engineering industry (Kletz, 1986), it has
been applied in many other engineering domains, including software (Ministry of Defence, 2000).
Hazard and Operability studies are a well-established technique for Preliminary Hazard Analysis (PHA)
whereby a specific set of guide-words is used to explore the behavior of a system and the causes and
consequences of deviations. For example, one HAZOP guide-word is “after”, which prompts the analyst
to explore the consequences of a component performing some action or sending some message after a
key point in time.
HAZOP expects a sufficiently detailed understanding of the system such that the system components and
their attributes are specified as either a system model or a set of requirements. A team of analysts selects
each component and interconnection in turn, interprets the guide-words in the context of the system and
applies the HAZOP guide-words to specific attributes of the component in the study.
The output from a HAZOP study is a list of hazards, their causes and the consequences of each hazard.
According to the HAZOP standard (Ministry of Defence, 2000), the output of a HAZOP study must
include the following:
(1) Details of the hazards identified, and any means within the design or requirements model to detect
and mitigate the hazard;
(2) Recommendations for mitigation of the hazards and their effects based on the team’s knowledge
of the system and the revealed details of the hazard;
(3) Recommendations for the later study of specific aspects of the design when there are uncertainties
about the causes or consequences of a possible deviation from design intent.
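As a small illustration of how the output listed above could be captured in a structured, change-tolerant form, the sketch below uses a simple record type; the class, field and example values are my own, grounded in the Table 1 contents below, and are not prescribed by the HAZOP standard.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class HazopEntry:
    """One row of HAZOP output: a guide-word applied to a component or
    interaction, with the hazard details the standard asks for."""
    component: str                 # component or interaction studied
    guide_word: str                # e.g. "before", "after", "no", "more"
    interpretation: str            # deviation explored
    causes: List[str]
    consequences: List[str]
    protections: List[str] = field(default_factory=list)      # detection/mitigation
    recommendations: List[str] = field(default_factory=list)

entry = HazopEntry(
    component="wrapping and nailing agent",
    guide_word="after",
    interpretation="nailing performed after the pallet has been moved",
    causes=["agent fails to identify that the pallet has left"],
    consequences=["boxes damaged if moved without protective covering",
                  "humans injured by late triggering of the agent"],
    protections=["check pallet status and proximity of humans"],
)
```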
What might a HAZOP look like if the packer role is implemented by an agent in the form of a moving
pick-and-place robot? From the domain, the types of accidents that may occur involve excess loss of
fruit leading to substantial financial losses or one of the moving robots colliding and injuring a human.
As an example, the analysis of a palletising and wrapping agent for the guide-words “before” and
“after” is shown in Table 1.
Guide-word: Before
Interpretation: Agent performs the nailing of the sheet before the pallet is ready.
Possible Cause: The agent fails to identify that the pallet is not filled.
Consequences / Implication: Boxes of fruit may be damaged as they are moved to the pallet.
Indication / Protection: ...
Recommendation: ...

Guide-word: Before
Interpretation: ...
Possible Cause: ...
Consequences / Implication: Storeman (human being) is injured through interacting with a pallet being nailed by the nailing agent.
Indication / Protection: ...
Recommendation: ...

Guide-word: After
Interpretation: The nailing agent performs its action after the pallet has been moved.
Possible Cause: The agent fails to identify that the pallet has left.
Consequences / Implication: Boxes of fruit may be damaged if the pallet is moved without the protective covering. Humans injured through an interaction with the wrapping and nailing agent when it is triggered late.
Indication / Protection: The status of the pallet and the proximity of humans to the agent must be checked.
Recommendation: ...
Table 1: Partial results from a HAZOP applied to the wrapping and nailing ability in the palletiser
example.
Despite HAZOP’s simplicity, there are several drawbacks in practice. The level of design detail, the
interpretation of the guide-words, and the required outputs of a HAZOP study result in a large amount of documentation
for even small systems. The result is often a table of information accompanied by a bulk of paperwork
that becomes unmanageable where changes to requirements, design, and technology occur
frequently (Mahmood and Kazmierczak, 2006). Nevertheless, HAZOP is used extensively by industry
practitioners. In the remainder of this paper we show that the multi-agent paradigm presents an
opportunity to exploit agent-oriented concepts and techniques to complement and improve HAZOP-style
safety analysis.
4 APPLYING HAZOP IN MULTI-AGENT SYSTEMS
HAZOP has been adapted to a number of different software analysis and design paradigms by interpreting
components and attributes according to the paradigm. For example, the HAZOP standard (Ministry
of Defence, 2000) includes a guide to interpretation. To adapt HAZOP to the analysis of multi-agent
systems requires an interpretation of the guide-words and attributes. Our first step therefore is to adapt
HAZOP for the analysis of multi-agent systems.
4.1 HAZOP Based on Interactions
Traditional non-agent based HAZOP is based on identifying components and their attributes. In the
original HAZOP the components were pipes, valves and tanks and the attribute was flow. In systems
HAZOP, the components can be hardware units or software modules such as packages or classes and the
attributes are signals, messages or data flows depending on the analysis or design paradigm used. The
problem for multi-agent systems is that they are often complex systems. Complex systems are generally
viewed to have at least the following characteristics (Newman, 2003):
(1) Complex systems are composed of a large number and diversity of components.
(2) Components in a complex system interact with each other. The interactions between components
may be non-linear and may vary in strength.
(3) Complex systems have operations of formation by which smaller components can be formed into
larger structures. Complex systems also form decomposable hierarchies. Operations in different
parts of a hierarchical structure may occur on different time scales.
(4) The environment in which the system operates and the demands of the environment influence the
behavior and evolution of a complex system.
The key point in this view is the emphasis on interactions between components, and not on the components themselves. This observation hints at a HAZOP analysis based on interactions. HAZOP does not
explicitly guide the analyst toward interactions, but rather relies on the analyst being able to understand
or imagine what interactions might arise and how they may lead to hazards. Leveson (2004) has made
the connection between complex systems, interactions and accidents, but has not used HAZOP in the
analysis.
4.2 Interaction Maps
We begin by developing an analysis notation that makes interactions explicit. Let us define “interaction”
in the context of a multi-agent systems HAZOP. First we will need to define the key elements necessary
for our analysis: actors, abilities and resources.
Actors are entities that perform the system functions and in this paper are either roles or agents. To carry
out their tasks actors require and produce resources. Abilities are the actors’ key functional capabilities,
expressed as a set of tasks. Resources are defined as everything else in the system other than actors and
abilities. There may be different types of resources: physical resources from the environment, such as the
fruit, conveyors and pallets in our example, or communication channels between actors. To achieve a
task an actor performs one or more actions, for example, sending a message, actuating a device, accepting
a task, or calling a software function in another actor.
Definition 4.1 Given two actors or resources A and B we say that A interacts with B if and only if A
influences the actions of B or B influences the actions of A.
Some observations are necessary regarding Definition 4.1. The first is that the definition assumes a
reciprocal relationship between actors. An actor’s behaviour is characterised by the actions it performs,
thus if A influences the actions of B then B’s actions depend on A and the converse. We include in our
definition of interaction the case where A influences the actions of B but not the converse.
The second is that interactions may be transitive but need not be. If A interacts with B, and B interacts
with C then A does not necessarily interact with C. If B influences the behaviour of C because of the
influence of A, then we consider this an independent interaction. Further, interactions can be internal
so that, for example, A can change its own state without outside interference, influencing its own future
behaviour.
The interactions between actors and resources describe how actors influence each other and how their
operational environment influences them. Three types of interaction can exist in a system: (1) an actor
interacts directly with the environment (physical resource); (2) actors interact with each other via a
resource, for example, a communication channel; and (3) resources interact with other resources.
We use the idea of interaction maps to identify and record the network of interactions that exist between
actors and resources, as well as the abilities that agents have to interact with resources. Interaction maps
are networks in which there are three types of entity:
(1) resources are nodes drawn in rectangular boxes;
(2) abilities are nodes drawn in hexagonal boxes; and
(3) actors are collections of nodes and interactions.
Edges in the network represent interactions. To be well formed an interaction map must always have a
resource node between any two ability nodes.
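As a rough illustration of how an interaction map might be represented and its well-formedness checked, the following Python sketch uses invented names (InteractionMap, add_interaction and so on); it is not part of any methodology described here.

    from collections import defaultdict

    ABILITY, RESOURCE = "ability", "resource"

    class InteractionMap:
        """Network of ability and resource nodes; actors own groups of ability nodes."""
        def __init__(self):
            self.kind = {}                 # node name -> ABILITY or RESOURCE
            self.edges = defaultdict(set)  # undirected interaction edges
            self.actor_of = {}             # ability node -> owning actor

        def add_node(self, name, kind, actor=None):
            self.kind[name] = kind
            if actor is not None:
                self.actor_of[name] = actor

        def add_interaction(self, a, b):
            self.edges[a].add(b)
            self.edges[b].add(a)

        def is_well_formed(self):
            # Rule: a resource node must always sit between any two ability nodes,
            # so no edge may join two ability nodes directly.
            return all(not (self.kind[a] == ABILITY and self.kind[b] == ABILITY)
                       for a in self.edges for b in self.edges[a])

    # A fragment of the palletiser example of Figure 2
    m = InteractionMap()
    m.add_node("Wrap Sheet around Pallet", ABILITY, actor="Palletiser")
    m.add_node("Nail Sheet to Pallet", ABILITY, actor="Palletiser")
    m.add_node("Protective Sheet", RESOURCE)
    m.add_node("Completed Pallet", RESOURCE)
    m.add_interaction("Wrap Sheet around Pallet", "Protective Sheet")
    m.add_interaction("Protective Sheet", "Nail Sheet to Pallet")
    m.add_interaction("Nail Sheet to Pallet", "Completed Pallet")
    assert m.is_well_formed()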
Figure 2 gives an example of an interaction map for the packer and palletiser roles. Using the interaction
map we can see that the two actors, the packer and palletiser roles, interact indirectly via the
“Completed Pallet” resource.
[Figure 2 shows the Packer (abilities "Fill Box with Fruit" and "Stack Box on Pallet"; resources Fruit, Box, Filled Box and Pallet) and the Palletiser (abilities "Wrap Sheet around Pallet" and "Nail Sheet to Pallet"; resources Plastic Covered Boxes, Protective Sheet, Nails, and Covered and Nailed Pallet), connected to the Storing Agent via the Completed Pallet resource.]
Figure 2: An interaction map for the packer and palletiser roles.
Figure 2 also illustrates how interaction maps define interactions within actors; for example, the packer
role consists of the internal resources needed to cover and nail the protective sheet to the pallet as well
as the abilities to achieve the goal. Interaction maps show the structure of the interactions in the system.
Interaction maps exist at a higher level of abstraction than other methods of describing interactions
such as collaboration diagrams in UML. By identifying the interactions, the analyst can hypothesise
which actors can interact to cause accidents and even identify the boundary conditions of the accident.
Further, interaction maps help the analyst to uncover any causal factor of the accident, and by examining
the interaction, may find ways of mitigating or stopping an accident-causing interaction from occurring
(Mahmood and Kazmierczak, 2006).
Methodologies such as Gaia and ROADMAP explicitly aim to identify the interactions in systems at an
early stage of development. We argue that this makes deriving an interaction map a straightforward task
given a system specification.
As an example, consider a human interacting with a covered pallet in order to take it to the cool store.
The analyst notes that covered pallets are the result of the Packer actor undertaking its “Wrapping and
Nailing” ability, which can be hazardous if the human comes into proximity with the nailing device. To
mitigate or avoid such an undesired interaction, a gate can be used to prevent humans from accessing
pallets until the nailing has ceased. This implies that humans interact with the gate, and the hazardous
interaction between humans and the pallet is avoided.
While we do not explore this idea further in this paper, it is possible to derive causal Bayesian networks
from certain interaction maps to quantify risk. This is done by identifying and modelling the possible
states of a resource during its lifetime and the possible states that an ability goes through during its execution. The set of states of each actor, ability and resource and their associated probability distributions can
be thought of as random variables; for example, see Figure 3. The direction of the links in the network
will depend on how the actors, resources, and abilities interact.
[Figure 3 shows an interaction map of ability AA, resource RA, ability AB and resource RB alongside a Bayesian network whose nodes are the states of AA, RA, AB and RB.]
Figure 3: An interaction map and its corresponding Bayesian network.
If the states of the resource RA can be observed then the probability of being in a given state at time t can
be estimated. The same is true of the abilities. Alternatively, these can be measured during development.
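A minimal way to obtain the probability distributions mentioned above is to estimate them from a log of observed states; the short sketch below (with invented state names) illustrates the idea and is not a full Bayesian-network construction.

    from collections import Counter

    def estimate_state_distribution(observations):
        """Estimate P(state) for a resource or ability from a log of observed states."""
        counts = Counter(observations)
        total = sum(counts.values())
        return {state: n / total for state, n in counts.items()}

    # Hypothetical observations of the "Completed Pallet" resource over several cycles
    log = ["empty", "filling", "filling", "covered", "nailed", "covered", "nailed", "nailed"]
    print(estimate_state_distribution(log))
    # {'empty': 0.125, 'filling': 0.25, 'covered': 0.25, 'nailed': 0.375}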
4.3 Interpreting HAZOP Guide-words
Our approach to HAZOP uses the interaction map as well as the guide-words to guide the analysis. The
analysis explores the effects of an actor’s ability being applied incorrectly to a resource, or to the incorrect
resource. The analysis uses actors as the system components of the study, and the actor’s abilities as the
attributes to which guide-words are applied.
To apply HAZOP using interaction maps, we have to identify the interpretation of each of the guide-words with respect to interaction maps. Table 2 specifies our interpretation of each of the existing HAZOP guide-words for interaction maps. One can see that the aim of the guide-words is to investigate the
effect of an actor incorrectly applying one of its abilities on the resources in the system.
None: The ability does not influence the resource.
More: The ability influences the resource more than intended.
Less: The ability influences the resource less than intended.
Part of: The ability influences the resource only partly as intended, or only part of the ability is exercised on the resource.
Other than: The ability influences the resource in a way other than intended; or, the ability influences a resource other than the intended one.
As well as: The ability influences the resource as intended, but influences additional resources.
Before: The ability influences the resource before intended.
After: The ability influences the resource after intended.
Table 2: HAZOP Guide-words interpreted on the abilities of an actor.
The interpretation of the guide-words is quite general at this point. To apply them to a specific system,
their interpretations must be further refined depending on the context in which they are applied. As an
example, consider the ability of the palletiser role to nail the plastic sheet to the pallet. Table 3 outlines
one interpretation of the guide-words for that ability.
Using these guide-word interpretations, a simple method for the analysis can be given as follows.
1. Select an actor (role or agent) as the basis for the study;
None: The palletiser does not nail the sheet down.
More: The palletiser uses too many nails.
Less: The palletiser uses too few nails.
Part of: The palletiser does not nail down the entire sheet.
Other than: The palletiser nails a resource other than the sheet.
As well as: The palletiser nails the sheet and a resource other than the sheet.
Before: The palletiser nails the sheet down earlier than intended.
After: The palletiser nails the sheet down later than intended.
Table 3: HAZOP Guide-words interpreted on the ability of the palletiser role to nail the plastic sheet to
the pallet.
2. For each ability of the actor in turn, interpret the meaning of every guide-word in the current
context, and explore the effects of the guide-word on each node (resource) connected to that ability
in the interaction map.
3. Document the effect of every guide-word, and recommend a mitigation strategy if that effect results
in a hazard.
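The three steps lend themselves to a mechanical enumeration of the combinations the team must consider. The sketch below assumes a simple dictionary representation of the abilities and of the resources connected to them in the interaction map; the interpretation, effect and mitigation columns are deliberately left blank for the analysts to fill in.

    GUIDE_WORDS = ["none", "more", "less", "part of", "other than",
                   "as well as", "before", "after"]

    def hazop_rows(actor, abilities, connected_resources):
        """Enumerate the (ability, guide-word, resource) combinations for one actor."""
        for ability in abilities[actor]:
            for resource in connected_resources[ability]:
                for gw in GUIDE_WORDS:
                    yield {"actor": actor, "ability": ability, "guide_word": gw,
                           "resource": resource, "interpretation": "", "effect": "",
                           "mitigation": ""}

    abilities = {"Palletiser": ["Nail Sheet to Pallet", "Wrap Sheet around Pallet"]}
    connected = {"Nail Sheet to Pallet": ["Protective Sheet", "Completed Pallet"],
                 "Wrap Sheet around Pallet": ["Protective Sheet"]}
    rows = list(hazop_rows("Palletiser", abilities, connected))
    print(len(rows))  # 3 ability-resource pairs x 8 guide-words = 24 candidate rows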
As an example, again consider the ability of the palletiser role to nail the plastic sheet to the pallet. Using
the interpretations of the guide-words from Table 3, we explore their effects, resulting in the observations
about the effect on the pallet shown in Table 4.
None: There is no sheet on the pallet, perhaps resulting in the hazard of fruit on the workshop floor.
More: No hazard.
Less: The sheet is not secure, perhaps resulting in the hazard of fruit on the workshop floor.
Part of: The sheet is not secure, perhaps resulting in the hazard of fruit on the workshop floor.
Other than: A number of possible hazards.
As well as: A number of possible hazards.
Before: The palletiser nails the sheet down before it is filled with fruit, possibly resulting in the hazard of fruit on the workshop floor (and damaged fruit) as it attempts to load more onto a covered pallet.
After: The palletiser nails the sheet down as the human storing agent attempts to pick it up, resulting in an injury to the human agent.
Table 4: The result of a HAZOP using the guide-words on the ability of the palletiser role to nail the
plastic sheet to the pallet.
4.4 Design Feedback
An integral part of the analysis process is to use the HAZOP study to refine the analysis model. Each
row of the HAZOP table corresponds to a deviation from intent applied to an ability and a possible
consequence of that deviation. For example, the second row of Table 1 details a possible consequence of
the guide-word “after” applied to the nailing ability of the packer role.
The result of this analysis is that the packer role has a hazard associated with it through its nailing ability.
Further, we can use the interaction map to gain insight into the potential interactions leading to the hazard.
How can the information in the HAZOP be used to refine the model?
The HAZOP table identifies the hazards but the interaction map shows what elements of the current
model must interact to lead to the hazard. The model can be refined to mitigate or avoid such an interaction. There are several options for refining the model to do this: (1) we can manipulate the
resources, that is, alter, add or delete resources, to control the hazard; or (2) we can manipulate
the abilities of the actor to control the hazard.
In Table 1, we wish to change the interaction between the storing role (played by a human agent) and the
palletiser. An example of changing the resources is shown in the interaction map in Figure 4, in which
we add a guard under the control of the palletiser and only allow the storing role access to the pallet if
the palletiser actor deems it safe.
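In terms of the hypothetical map representation sketched earlier, this mitigation amounts to rerouting the storing role's access to the pallet through a new Guard resource; a minimal illustration follows, in which the names echo Figure 4 but the structure is illustrative only.

    # Interaction edges as (node, node) pairs.
    edges = {("Nail Sheet to Pallet", "Covered and Nailed Pallet"),
             ("Storing Agent", "Covered and Nailed Pallet")}   # hazardous direct access

    # Refinement: insert a guard resource under the palletiser's control
    edges.discard(("Storing Agent", "Covered and Nailed Pallet"))
    edges.add(("Nail Sheet to Pallet", "Guard"))   # the palletiser controls the guard
    edges.add(("Storing Agent", "Guard"))          # the storing role now interacts with the guard
    print(sorted(edges))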
[Figure 4 shows the palletiser's interaction map with a Guard resource inserted between the wrapping and nailing abilities and the Storing Agent, so that the storing role no longer interacts directly with the covered and nailed pallet.]
Figure 4: Modifying the palletiser by the addition of a resource and the alteration of an ability.
We may also be able to control the interaction by adding an additional ability to the palletiser actor.
The updated role/goal model may appear as in Figure 5, where a new means of implementing the safety
goal has been added as a result of the HAZOP study.
[Figure 5 repeats the role/goal model of Figure 1 with an added note on the palletising role: "Add a resource - a guard - to prevent the palletiser interacting directly with the store agent while nailing the sheet."]
Figure 5: Updated role/goal model resulting from the analysis of the second row of the HAZOP table.
In general each row of the HAZOP table can be used to extend the role/goal model in this way.
5 HAZOP BASED ON INTERACTIONS AND SYSTEM EVOLUTION
Most systems undergo some form of evolution, either to adapt them to situations that were not imagined
at the time of their design, or to add additional features and functions desired by users. Adaptation to
new situations is also a key feature of multi-agent systems; therefore, safety analysis methods must be
able to cope with change. Unfortunately, traditional non-agent HAZOP incurs a large overhead when
dealing with changes to existing systems as much of the HAZOP must be reworked. For example, the
HAZOP standard (Ministry of Defence, 2000) specifies that for any change in the system design the
entire HAZOP study should be redone.
Interaction maps coupled with the role/goal analysis of multi-agent system requirements provide clear
boundaries on what must be re-analysed in the event of a change.
5.1 Isolating Changes in the Role/Goal Model
The hierarchical nature of the models means that constraints and quality goals in lower-level models must
be consistent with the higher-level models, but this also means that changes at the lower-level models are
isolated from the higher-level models.
If we change a role but not a goal, then the new role must also meet the goal, so the goal does not need
to be reanalysed. The role and its interaction, however, must be re-analysed. If we change an agent but
not a role, then the agent’s externally observable interactions are the same as those for the role. In this
case, if the new agent introduces a new external interaction then it needs to be reflected back up to the
role model. Otherwise a HAZOP study on the agent model alone will be sufficient.
If we change a role and an agent, then the system model will still be hierarchical. In this case we can
always perform a HAZOP study on the role model first before performing a HAZOP on the agent models
that implement the role. Observe however that, if the agent model belongs to the role, then a change to
the role model will imply a change in the agent model anyway.
If the interaction map is correct then it will tell us what needs to be updated. If we are unsure what needs
to be updated, then a HAZOP on the local change will indicate if there are new interactions or resources
introduced into the system.
As an example, consider a model in which the packer role, played by one agent, evolves into a
design in which the role is played by two agents (based on the role’s abilities): one agent to place the
sheet on the box, and one agent to nail the sheet down. The role specification has not changed in this
case, and neither has the interaction map associated with the role model. Considering this, the HAZOP
study does not need to be re-performed at this level. The agent model, and its related interaction map,
have changed. As a result, the HAZOP study needs to be performed at this level.
5.2 The Key is the Interaction Map
In our analysis, the interaction maps specify which roles interact with which other actors, and through
what resources. Consider again the example of changing the agent model such that the packer role is
played by two agents instead of one.
The HAZOP study must be redone on the two agents that now implement the role, but not on the role,
provided that no new externally observable interactions have been added.
What other actors in the system need to be re-analysed? The interaction map can be used to answer
such a question. If we study the interaction map in Figure 2, we see that the only abilities affected are
the palletiser’s wrapping ability and the palletiser’s nailing ability. The external interactions in Figure 2
with the packer and the storing agent remain unchanged. We conclude that this change requires only the
two agents implementing the palletiser role to be re-analysed.
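The re-analysis boundary can be read mechanically off the interaction map: the changed actor's abilities and the resources they directly touch are always re-analysed, while abilities of other actors sharing those resources need attention only if the external interaction itself has changed. A rough sketch, using the same invented dictionary representation as before:

    def reanalysis_scope(changed_actor, actor_of, edges):
        """Return (nodes to re-analyse, external abilities to review only if the
        interaction itself changed) after a change to one actor."""
        own = {n for n, a in actor_of.items() if a == changed_actor}
        touched = {r for ability in own for r in edges.get(ability, ())}
        external = {n for r in touched for n in edges.get(r, ())
                    if n in actor_of and actor_of[n] != changed_actor}
        return own | touched, external

    actor_of = {"Wrap Sheet around Pallet": "Palletiser",
                "Nail Sheet to Pallet": "Palletiser",
                "Stack Box on Pallet": "Packer"}
    edges = {"Wrap Sheet around Pallet": ["Protective Sheet"],
             "Nail Sheet to Pallet": ["Completed Pallet"],
             "Stack Box on Pallet": ["Completed Pallet"],
             "Protective Sheet": ["Wrap Sheet around Pallet"],
             "Completed Pallet": ["Nail Sheet to Pallet", "Stack Box on Pallet"]}
    scope, external = reanalysis_scope("Palletiser", actor_of, edges)
    print(scope)     # palletiser abilities plus the resources they touch
    print(external)  # {'Stack Box on Pallet'} - review only if that interaction changed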
It is straightforward to then identify the benefits of exploiting the hierarchical nature of many agent
methodologies, and the interaction map: we can significantly reduce the burden on the safety engineers
during design evolution by helping them to systematically identify which parts of a design must be
re-analysed after a change. While this is possible using other development methodologies, the unique
factor in the agent paradigm is that it forces developers to consider goals and interactions early in the
development life-cycle.
6 RELATED WORK
Several authors have integrated safety analysis into agent systems. Dehlinger and Lutz (2005) propose a
product-line approach to requirements by developing requirements schemata and reusing these to capture
shifting requirements in multi-agent systems. Dehlinger and Lutz show how to group requirements so
that the impact of change is minimised. Such an approach can be applied to safety requirements, but
they do not discuss this. Feng and Lutz (2005) handle safety analysis by proposing a
bi-directional safety analysis for product-line multi-agent systems that extends Software Failure Modes,
Effects and Criticality Analysis (FMECA) and Software Fault Tree Analysis (FTA) to incorporate multi-agent systems. They show how to generate safety cases and constraints in the Gaia role model. The aim
of the product-line approach is to increase the reusability of safety analysis. Our work shares a similar
motivation; however, we do not use product lines.
Giese et al. (2003) propose a way to ensure safety in self-optimising mechatronic systems. They do not
provide detailed analytical methods to generate safety cases, but instead discuss how to ensure safety in
system hierarchies with safety cases already provided by domain safety experts.
Bush (2005) proposes an extension of traditional HAZOP studies for the I* development model. Our
work is closely related; however, Bush applies HAZOP analysis to goals. We believe that abilities and
resources are useful for safety analysis because goals are the conditions that agents desire, whereas the
abilities and resources outline how those goals are to be achieved, something that we believe is more
closely related to safety.
7 CONCLUSIONS AND FUTURE WORK
In this paper, we have demonstrated that existing techniques such as HAZOP studies can be used with
agent-oriented software engineering methodologies with little extension. We have also shown that
the introduction of interaction maps can greatly ease the burden of re-analysis when changes to
the system model occur. Dealing with change is perhaps more important for multi-agent systems than for
traditional non-agent systems, as their very design is often aimed at adapting to changing circumstances.
To this end the use of interaction maps becomes vital, as they help to identify the elements of the multi-agent system (roles, goals, and agents) that need to be re-analysed in the event of changes to the system
model. Even so, although interaction maps greatly ease the burden of maintaining safety through re-analysis, if change is
perpetual then the constant re-analysis of safety becomes a tiresome and costly overhead.
The question is whether or not safety, once analysed, can be maintained by the system itself, even in the
presence of constant change and evolution to the agents and even the roles. The goal of our research
programme is to develop methods for assuring safety in multi-agent systems even in the presence of
constant system evolution and adaptation. Our research program involves the use of accident knowledge
to allow agents to perform safety analysis of their own behaviour. This will allow agents to change their
behaviour at runtime after taking into consideration the cause of accidents involving other agents, and is
the subject of current and future research. It is hoped that our research program will aid in the uptake of
the agent paradigm in safety-critical systems.
References
Bush, D., August 2005. Modelling support for early identification of safety requirements: A preliminary
investigation. In: Fourth International Workshop on Requirements for High Assurance Systems
(RHAS'05 - Paris) Position Papers.
Dehlinger, J., Lutz, R. R., 2005. A product-line requirements approach to safe reuse in multi-agent
systems. In: International Conference on Software Engineering. Vol. 3914. pp. 1–7.
Fenelon, P., McDermid, J., Pumfrey, D., Nicholson, M., 1994. Towards Integrated Safety Analysis and
Design. ACM Applied Computing Review 2 (1), 21–32.
Feng, Q., Lutz, R. R., 2005. Bi-directional safety analysis of product lines. Journal of Systems and
Software 78 (2), 111–127.
Giese, H., Burmester, S., Klein, F., Schilling, D., Tichy, M., 2003. Multi-agent system design for safety-critical self-optimizing mechatronic systems with UML. In: OOPSLA 2003 - Second International
Workshop on Agent-Oriented Methodologies, Anaheim, CA, USA. pp. 21–32.
IEC, 2003. IEC 61508: Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems. International Electrotechnical Commission.
Juan, T., Pearce, A., Sterling, L., 2002. ROADMAP: Extending the Gaia methodology for complex
open systems. In: Proceedings of the First International Joint Conference on Autonomous Agents and
Multi-Agent Systems. ACM Press, pp. 3–10.
Kletz, T. A., 1986. HAZOP & HAZAN notes on the identification and assessment of hazards. The Institution of Chemical Engineers, London.
Leveson, N. G., April 2004. A new accident model for engineering safer systems. Safety Science 42 (4).
Leveson, N. G., Shimeall, T. J., July 1991. Safety verification of Ada programs using software fault trees.
IEEE Software 8 (4), 48–59.
Mahmood, T., Kazmierczak, E., December 2006. A knowledge-based approach for safety analysis using
system interactions. In: Asia Pacific Software Engineering Conference, APSEC’06. IEEE Computer
Society Press.
Mellor, P., 1994. CAD: Computer Aided Disaster. High Integrity Systems 1 (2), 101–156.
Milner, R., 1989. Communication and Concurrency. International Series in Computer Science. Prentice
Hall.
Ministry of Defence, 1996. Defence Standard 00-56: Safety Management Requirements for Defence
Systems.
Ministry of Defence, 2000. Defence Standard 00-58: HAZOP Studies on Systems Containing Programmable Electronics. 2nd Edition.
Newman, M. E. J., 2003. The structure and function of complex networks. SIAM Review 45, 167–256.
Palady, P., 1995. Failure Modes and Effects Analysis. PT Publications, West Palm Beach Fl.
RTCA, December 1992. RTCA DO-178B: Software Considerations in Airborne Systems and Equipment
Certification. RTCA Inc.
Zambonelli, F., Jennings, N. R., Wooldridge, M., 2003. Developing multiagent systems: The Gaia
methodology. ACM Transactions on Software Engineering Methodology 12 (3), 317–370.
COMPLEXITY & SAFETY: A CASE STUDY
Author
George Nikandros, Chairman aSCSa, BE (Electrical)
INTRODUCTION
Despite correct requirements, competent people, and robust procedures, unsafe faults
occasionally arise. This paper reports on the outcomes of an investigation into a series of
related events involving a railway level crossing. Whilst the direct cause
of the failure was defective application control data, it was a defect that would be difficult to
foresee and, if foreseen, difficult to test for. The last failure event occurred after the correction was
supposedly made. The correction was made as a matter of urgency.
To understand the underlying complexity and safety issues, some background knowledge in
relation to active level crossing controls i.e. flashing lights and boom gates and railway
signalling is required. The paper therefore includes a description of the operation of the
railway level crossing controls and the railway signalling associated with the case study.
The official incident report is not in the public domain and therefore this paper has been
prepared so as to not identify the location of the series of incidents, the identity of the
organisations or the people involved.
THE UNSAFE EVENTS
There were three events, each with the same unsafe outcome: a train driver was
presented with a PROCEED aspect in the same trackside signal while the actively controlled
crossing was open to road traffic, i.e. the flashing lights were not flashing and the boom gates
were in the raised position. Had the driver not observed the state of the active level crossing
controls and instead proceeded onto the crossing, a collision with a road vehicle or pedestrian
would have been very likely; the crossing is busy, with some 4300 vehicles and
500 pedestrians per day.
The first occurrence was some seventeen days after the initial
commissioning of a new signalling system and was not given the appropriate classification for
investigation and action when logged. The second occurrence was two days later, a
Saturday. This time the correct classification was made and actions were immediately
initiated, i.e. design engineers were called in to identify and fix the problem. The third event
occurred five days after the second, and after the design flaw had supposedly been
removed.
THE RAILWAY CONTROL SYSTEM
Level Crossing Controls
The key aim of active level crossing controls is to provide the road crossing user sufficient
warning that a train is approaching and where boom gates are provided, to close the crossing
to road traffic before the train enters the crossing. Once the train has passed, the crossing
needs to be reopened with minimal delay. If a second train approaches the crossing when
already closed, the crossing is held closed. Figure 1 shows the typical train trigger points for
controlled rail crossings for a unidirectional line.
Figure 1: Typical train trigger points – one direction only
Once opened the crossing needs to remain open for a sufficient time so as to ensure that the
appropriate warning is again given to the road users.
Particularly for busy roads, a level crossing should not be closed unnecessarily i.e. if a train
stops short of the crossing at a signal displaying a STOP aspect for a time, then the crossing
should be opened for road traffic. The signal should not then display a PROCEED aspect,
until the appropriate warning is again given to the road crossing users.
However level crossings are rarely located to make life simple. Having multiple tracks and
locating a level crossing in the vicinity of a station stop significantly adds complexity. More
than one train may approach the crossing simultaneously from both directions and trains may
stop for long periods of time at the station platforms.
Another complexity which usually occurs in urban areas is the use of road traffic control
signals. There needs to be coordination (an interlock) between the road traffic control signals
and the level crossing control signals; it would be unsafe to have a “GREEN” aspect in a road
traffic signal for road vehicles to travel through the level crossing with the level crossing
controls in the closing or closed states. The approach of a train needs to be detected earlier to
enable the road traffic control system to cycle in sufficient time, so that the signals allowing
road traffic across the level crossing return to RED before the active level crossing
controls begin closing the crossing. The road traffic signals also need to provide sufficient
warning to the road users.
Figure 2 shows the schematic of the level crossing of interest. It contains all the complexities
mentioned.
[Figure 2 marks, for each direction, the point where the rear of a train must pass before No 4 Signal (or the signals leading to No 4 Signal) changes to PROCEED, shows Signals 1 to 4 and the controlled road intersection, and notes that the signal of interest, 4 Signal, is 2 m from the edge of the road.]
Figure 2: Layout Schematic for Level Crossing
Rail Signalling Controls
The aim of the signalling system is to safely regulate the movement of trains on a railway
network. The signalling system ensures that:
- the path ahead is clear, and
- there is no other path set, or able to be set, for another train to encroach on the path set, and
- any active level crossing controls are primed to operate so as to provide the appropriate warning to the road crossing user and close the crossing where boom gates are provided.
Only when all these conditions are satisfied is an authority to proceed issued.
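Read as a predicate, the authority to proceed is simply the conjunction of these conditions; the toy function below is purely illustrative and omits all of the detail of a real interlocking.

    def authority_to_proceed(path_ahead_clear, no_conflicting_path, crossing_primed):
        """A signal may show PROCEED only when every interlocking condition holds."""
        return path_ahead_clear and no_conflicting_path and crossing_primed

    print(authority_to_proceed(True, True, False))  # False: the crossing is not yet primed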
For the location of interest, the authority to proceed is conveyed via a PROCEED aspect in a
trackside colour light signal.
Signals may be controlled or automatic. Controlled signals are signals which display a STOP
aspect until commanded otherwise by a Train Controller (the person responsible for managing
the movement of trains on the railway network). Although the Train Controller commands a
signal to PROCEED, it only displays a PROCEED aspect if the signal interlocking system
deems it is safe to do so. Controlled signals automatically return to STOP after the passage of
the train.
Automatic signals are signals which are automatically commanded to PROCEED by the
signal interlocking system i.e. there is no participation of the Train Controller; the Train
Controller can neither command them to STOP nor PROCEED. Some controlled signals have
an automatic mode which the Train Controller can select and deselect as necessary.
Of the signals of interest, 3 Signal and 4 Signal are automatic signals, 1 Signal and 2 Signal
are controlled signals.
If there are no trains in the vicinity, 3 Signal and 4 Signal will each display a PROCEED
aspect. Figure 1 depicts an example of this condition; the signal near the crossing represents 4
Signal.
As a train, Train A, approaches 4 Signal, the road traffic controls are commanded to cycle,
and after the allowed cycle time has elapsed, the flashing lights are activated to warn the road
crossing users and after the required warning time has elapsed, the boom gates descend to
close the crossing.
Whilst Train A remains on the approach to 4 Signal the crossing remains closed. When Train A
passes 4 Signal, 4 Signal is automatically placed at STOP and no other train can
approach 4 Signal at STOP until the rear of Train A passes the point at which the signals
applying to 4 Signal are permitted to display a PROCEED aspect (see Figure 2); this point is
known as the overlap limit. The overlap is the safety margin provided at signals should the
train driver misjudge the train’s braking. Once the rear of Train A clears the level crossing
and there is no other train approaching the crossing on the other tracks, the crossing control
commences its opening sequence. When the crossing is opened, the road traffic signals
resume their normal cycle. It is important to note that the rail control system influences the
road traffic signals; the road traffic signals do not initiate any action in the rail control system.
Once the rear of Train A is beyond the overlap limit for 4 Signal, any one of the signals
applying towards 4 Signal, assuming no other trains are in the vicinity, can be placed at
PROCEED, thus allowing another train, Train B, to approach 4 Signal; this time, however, 4
Signal is at STOP (Figure 3).
Figure 3: Level crossing operation for following trains
Because of the close proximity of 4 Signal to the level crossing, the level crossing needs
to be closed to safeguard the road crossing users and the train in the event of a braking
misjudgement by the train driver.
If Train B is detained at 4 Signal at STOP for a sufficiently long period, in this
case 35 seconds, the crossing opening sequence commences to allow road traffic flow to
resume. Should the conditions to allow 4 Signal to display a PROCEED aspect be satisfied
after the crossing is commanded open, then 4 Signal will remain at STOP until the crossing
opening sequence is completed, the minimum road open time conditions are satisfied, the road
traffic signals are again cycled, sufficient warning that the crossing is closing is given to the
crossing users and the boom gates have lowered.
If, however, Train B is detained at 4 Signal at STOP and 4 Signal subsequently changes
to display a PROCEED aspect within 35 seconds, i.e. the rear of Train A has passed the
overlap limit for 3 Signal, the crossing remains closed until the rear of Train B clears the
crossing, irrespective of how long the train takes.
When Train A passes 3 Signal, 3 Signal is automatically placed at STOP. When the rear of
Train A passes the overlap limit for 3 Signal, 4 Signal is automatically commanded to display
a PROCEED aspect, but only does so if it is safe, i.e. there is no train detained at 4 Signal with
the level crossing open.
SYSTEM ARCHITECTURE
The signal interlocking system that performs the safety functions has a distributed
architecture. The system consists of programmable logic controllers located geographically
along the railway and interconnected with point to point serial data links, such that, referring
to Figure 4, data that needs to go from Controller C to Controller A needs to go through
Controller B.
Figure 4: Distributed architecture showing area of control
It is important to note that the architecture is not a master-slave architecture, in which the slave
controllers perform an input/output function as directed by the master controller. For this
application the interlocking function is distributed over each of the controllers.
Controller Technology
The controllers are commercial-off-the-shelf (COTS) products specifically developed for
railway signalling interlocking applications. All of the controllers are of the same type and
version.
Each controller maintains a time-stamped event log. However the controller clocks are not
synchronised.
The technology is modular and programmable. It uses plug-in modules for connectivity to
different types and quantities of inputs and outputs. Thus only the hardware actually required
for an application needs to be installed. The technology is supported by graphical tools for
application design, simulation and testing. The suite of tools is used to define the modules and
logical operation of the system and verify and validate the application logic.
To satisfy the safety requirements the controllers operate on a fixed, nominally 1 second time
cycle. Consequently an input change will not be detected immediately; however, there is
certainty as to when an input change will be detected and processed.
THE DELIVERY PROCESS
The system was delivered under a design, build, test and commission contract arrangement,
where the contractor is responsible for delivery in accordance with the specification, and the
railway organisation is responsible for verifying compliance and validation of the system to
key signalling safety and operating principles.
The contractor organisation was also the developer of the COTS controller technology and
had a considerable history of deploying that technology on many railway networks, including
that of the railway organisation commissioning the contract works. However this was the first
time that a distributed interlocking architecture was to be deployed; neither the contractor
personnel undertaking this work, nor the railway personnel verifying and validating this work
had any prior experience in implementing a distributed interlocking architecture with this
technology.
The delivery model and the underlying processes had been well established. These had
evolved over time and were considered best railway practice at the time.
The personnel involved were appropriately qualified and experienced in the design of signal
interlocking application logic of this complexity and in the use of the technology, albeit not in
the design and implementation of the distributed interlocking architecture using this
technology.
Hazard Analysis
Because the project was considered “routine” there was no specific hazard analysis performed
for the application design. The technology had been used for similar applications, albeit with
a different architecture, and the application scenario i.e. an actively controlled level crossing
in the vicinity of road traffic signals and station platforms, was not new. The hazards of the
application were well understood. The potential hazards due to the processing latency of the
controllers and their associated communications links were understood, but how to design for
them was not. The application manual for the controllers did warn of the latency, but provided
no guidance as to how this latency should be treated to eliminate the hazards or test for them.
Application Logic
The railway organisation specified the interlocking requirements for this application. The
contractor designed the controller architecture, the modules and the application data and
submitted the “design” for review by the railway.
The reviewed design was amended as appropriate by the contractor and the application data
produced for each of the controllers. The contractor tested the amended application design for
compliance with the specification using both simulation tools and the target hardware (the
personnel were required to be independent, i.e. they were not involved in developing the
design under test).
The application logic was then tested by the railway organisation to validate compliance with
the key signalling safety and operating principles using simulators and the target hardware.
Those tasked with the validation task had no involvement in the development of either the
interlocking specification or any of the design reviews.
THE FAILURES
There were three unsafe events. The first two were due to the same latent defect, although the
initiating event was different. To assist in understanding, the sequence of events for Event 1 is
provided in Table 1. The time base used is Controller B. The time-stamps for Controllers A
and C have been aligned with the Controller B time. The event sequence should be read with
reference to Figures 5, 6 and 7.
Figure 5: A state of the system prior to the Event 1 failure
Event 1 – The First Occurrence
Time (aligned to the Controller B time base) and event:
08:26:11 – Train A approaches 3 Signal at STOP
08:27:13 – Crossing closes for train approaching 1 Signal
08:28:18 – 3 Signal changes to PROCEED
08:28:29 – Train B approaches 4 Signal at STOP [crossing already closed]
08:28:32 – Train A passes 3 Signal
08:29:04 – Train A (rear) passes 3 Signal overlap limit
08:29:05 – Crossing called open – train at 4 Signal too long
08:29:06 – 4 Signal changes to PROCEED
08:29:07 – Crossing starts to open
08:29:07 – Crossing called closed [4 Signal at PROCEED]
08:29:08, 08:29:17 – Crossing opens
08:29:41 – Crossing commences to close
08:29:49 – Crossing closed
08:30:22 – Train B passes 4 Signal
Table 1: Event 1 Sequence of Events
The sequence of events shows that 4 Signal was at PROCEED for some 40 seconds with a
train on its approach and the level crossing not closed.
The initiating event was Train A being detained at 3 Signal with Train B closely following.
The reason for the detention of Train A at 3 Signal was because of incomplete works in
relation to 3 Signal. Figure 6 shows the situation just as 3 Signal changes to PROCEED.
Figure 6: Train A receives signal to continue, Train B at platform
Figure 7: The state of the system when the Event 1 failure occurred
Figure 7 shows the situation as Train A clears the overlap beyond 3 Signal, thus enabling 4
Signal to be called to clear.
Controller B allowed 4 Signal to display a PROCEED aspect at 08:29:06 because, according
to Controller B, the crossing was closed and the section of track from 4 Signal to 3 Signal and the
overlap was clear. However Controller A had commanded the level crossing controls to open
at 08:29:05, but Controller B did not receive and process this open command until 08:29:07.
The failure is depicted in Figure 7.
Once the crossing began to open, it could not again be closed until the crossing was open for
the required crossing open time.
The incident occurred because the states of the crossing controls and 4 Signal in Controllers A
and B were different for 1 second. The incident would not have happened had the conditions for
4 Signal to change to PROCEED been satisfied at the same instant as the conditions to open the
crossing.
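The one-second window can be illustrated with a toy discrete-time model of two controllers that exchange state once per processing cycle. The names and logic below are invented for illustration and greatly simplify the real system; they are not the actual application data.

    # Toy model: Controller A owns the crossing, Controller B owns 4 Signal.
    # A message sent by A is only seen by B on the following cycle.
    crossing_closed_A = True            # A's local view of the crossing
    crossing_closed_seen_by_B = True    # B's (possibly stale) view
    track_clear_to_3_signal = False
    in_flight = []                      # messages from A, delivered next cycle

    for cycle in range(3):
        for msg in in_flight:           # deliver last cycle's messages
            crossing_closed_seen_by_B = msg
        in_flight = []

        if cycle == 0:
            # A opens the crossing (train standing at 4 Signal too long)...
            crossing_closed_A = False
            in_flight.append(False)          # ...but B learns of it next cycle
            track_clear_to_3_signal = True   # rear of Train A clears the overlap

        # B's interlocking decision is based on its own, one-cycle-old view
        signal_4_proceed = track_clear_to_3_signal and crossing_closed_seen_by_B
        print("cycle %d: crossing closed (A)=%s, seen by B=%s, 4 Signal PROCEED=%s"
              % (cycle, crossing_closed_A, crossing_closed_seen_by_B, signal_4_proceed))
    # Cycle 0 shows B clearing 4 Signal while A has already commanded the crossing open.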
Event 2 – Categorised correctly, investigated and fixed
The initiating event was a failure of Controller C and the consequential loss of the
communications link between Controllers B and C. Railway signalling systems have
traditionally been required to fail safe. To meet this fail safe requirement, railway signal
interlocking systems are required to fail to a safe state in the event of a failure i.e. trains are
assumed to be everywhere, signals display STOP aspects and level crossings close.
The loss of communications occurred whilst 1 Signal was at PROCEED in preparation for a
future train. The failure resulted in the interlocking (Controller A) presuming that a train was
approaching 1 Signal and closing the crossing; 4 Signal was placed at STOP because the
track ahead was assumed to be occupied by a train.
A train approached 4 Signal at STOP at 14:42:47, automatically triggering the standing-too-long-at-platform timer. The timer timed out at 14:43:22 and primed the crossing to open. The
crossing did not open because of the presumed train approaching 1 Signal at PROCEED.
At 14:44:00, communications were re-established and the “presumed train” approaching 1
Signal disappeared and, as 4 Signal was at STOP and the standing-too-long timer for 4
Signal had expired, Controller A commanded the crossing to open.
However Controller B allowed 4 Signal to change to PROCEED because the track ahead was
now confirmed clear when the communications recovered.
At 14:44:02, the crossing was commanded to close because the interlocking (Controller A)
conditions require the crossing closed when 4 Signal is at PROCEED and there is a train
approaching i.e. the train standing at 4 Signal. However once the crossing began to open, it
could not again be closed until the crossing was open for the required crossing open time. 4
Signal was at PROCEED for 42 seconds before the crossing was closed.
The Fix
The solution to the problem was to repeat, in Controller B, the interlocking conditions in
Controller A requiring 4 Signal to be at STOP before opening the crossing, thus ensuring that
the states of 4 Signal and the level crossing control are always the same.
Event 3 – After the Fix
Some 5 days after the fault was supposedly corrected, there was another occurrence of the
failure.
This failure had a similar sequence of events to Event 1, in that a train was detained at 3
Signal and a following train was detained at 4 Signal long enough for the standing-too-long
timer to expire. There was another train in the vicinity, approaching 1 Signal.
The detention of the train at 3 Signal, this time, was due to a failure of Controller C which
also caused a loss of communications between Controllers B and C.
On recovery of Controller C and the re-establishment of communications between Controllers
B and C, 3 Signal changed to PROCEED. The detained train moved on and when the rear of
the train was beyond the overlap, 4 Signal changed to PROCEED, and the crossing called
open. One second later, the crossing was commanded to close. This was essentially the same
sequence of events as for Events 1 and 2.
So why did the fix not work?
The fix did not work because it was implemented incorrectly. Instead of
requiring the crossing to be closed before 4 Signal changed to PROCEED when a train was
standing at the signal, the implemented logic required 4 Signal to be at PROCEED to
command the crossing to open. This effectively ensured that the crossing would always open
automatically when 4 Signal changed from STOP to PROCEED.
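The difference between the intended fix and the implemented one can be paraphrased as two small predicates. The following is an illustrative paraphrase only, not the actual controller application data.

    def intended_signal_4_can_proceed(other_conditions, crossing_closed, train_standing_at_4):
        # Intended: with a train standing at 4 Signal, the crossing must already be
        # closed before 4 Signal is allowed to change to PROCEED.
        return other_conditions and (crossing_closed or not train_standing_at_4)

    def implemented_crossing_open_command(signal_4_at_proceed, standing_timer_expired):
        # As implemented: 4 Signal at PROCEED became part of the condition for commanding
        # the crossing open, so the crossing opens whenever 4 Signal clears with a train standing.
        return signal_4_at_proceed and standing_timer_expired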
THE FIX PROCESS
The railway signalling industry gives unsafe signalling failures high priority. This failure was
no exception. Subsequent to the second failure, which occurred on a Saturday (the first
occurrence was not correctly categorised and hence not immediately acted upon), the railway
organisation investigated the failure, devised the solution and implemented it the next day, a
Sunday.
Because it was a failure of the safety system, it was considered a matter of urgency to correct
the problem. People were specially called in.
The personnel involved were those who verified and validated the interlocking system
supplied under contract and so should have had good knowledge of the interlocking logic.
However, their involvement in the investigation, the identification of the solution and its
subsequent design, test and commissioning compromised their independence.
The change was not tested using the simulation tools and test facilities as it was assumed that
the sequence of events could not be accurately simulated. This was a timing issue and the
events had to be timed to the second. One second either way meant that the failure would not
have occurred.
The change, however, was relatively simple.
There was some attempt to test the deployment of the change using the target system.
However this only confirmed that the fix had no adverse effect on the normal sequence of
events. It was not possible to induce the changes of state with sufficient accuracy to prove that
the problem was corrected.
THE COMPLEXITY FACTOR
The interlocking logic flaw which led to the unsafe failure events described above was a
direct result of the complexity created by the architecture selected for this particular
application.
Whilst the people involved appreciated the timing complexities of distributed systems, there
were no prescribed processes as to how to deal with transitioning internal states common to
different controllers.
It is important to note that had 4 Signal been a controlled signal instead of an automatic
signal, the flaw would not have been as readily revealed. The request to clear 4 Signal from
the Train Controller would have had to arrive within 1 second of the conditions allowing 4
Signal to change from STOP to PROCEED being satisfied.
The problem is that there appears to be no obvious practical way of identifying such precise
timing-related flaws. How can we ever be certain that there are no other similar flaws which
have yet to be revealed? The system has now been in service for some nine years and no
other interlocking logic flaws have been revealed.
SAFETY
None of the unsafe events resulted in any harm. However, that does not mean that this was not
a serious safety flaw. The primary safety system failed and it was only the vigilance of the
train drivers involved that prevented any harm from occurring.
There is an increasing trend in the railway industry to automate train operations i.e. operate
trains without a driver. Had there been no driver on the three trains involved in the failure
events, then harm would certainly have occurred. The PROCEED aspect in 4 Signal would
have been sufficient for the train to automatically depart with the crossing open to road traffic.
If the controlled level crossing did not exist then the events would not have happened. The
events only occurred because of the need to guarantee a minimum road open time before
reclosing the crossing.
CONCLUSION
The versatility of programmable logic controllers tempts application designers to use them in
ways not originally intended. Whilst the particular controllers had the functionality to
communicate serially, the use of this functionality to construct such a distributed interlocking
system was an innovative use of the technology. Whilst the equipment manuals did not
preclude such use, they did warn about latency on the serial links.
The series of failures described in this case study demonstrates the subtlety of design errors
that can be created in a distributed system which may lie dormant until revealed, sometimes
with serious consequences.
When such flaws are revealed, the urgency to correct them often creates a strong temptation to
bypass the usual rigorous procedures. This case study demonstrates what can happen when
such procedures are not adhered to, despite the involvement of appropriately competent
people.