Standardising Causal Analysis

As a member of the German national committee for standards concerning the functional safety of electrical/electronic/programmable-electronic systems (known in the jargon as E/E/PE systems), I received on 11th May a document sent to another standards committee, proposing an international standardisation project for Root Cause Failure Analysis through the International Electrotechnical Commission, IEC, the ISO affiliate responsible for things computerish.

Now, I like to think I know something about Causal Analysis of accidents involving engineering artefacts. I proposed my method Why-Because Analysis (WBA) (see also examples of WBA amongst Causalis publications), based on the insights into causality of, amongst others, David Hume and David Lewis, somewhere around 15 years ago. We used it then predominantly for analysing accidents to large commercial aircraft, whose operation increasingly involved computational components, and I was one of the very few people I knew who was both an instrument-rated pilot and an analyst of the kinds of distributed computer-based systems found in such aircraft. Our (rather, my group’s tech-transfer company Causalis’s) first commercial analysis contract was in 1998, as an advisor to the lawyers for the plaintiffs in the civil lawsuit concerning the 1994 Nagoya A300 accident.

WBA attracted something of a following. Two divisions of Siemens, Rail Automation (which makes signalling systems) and Mass Transit (which makes trams), have adopted it as an internal company analytical procedure, and we started the Bieleschweig Workshops in Bielefeld and Braunschweig, whose first few meetings concentrated on Root Cause Analysis. The two German university departments which aid the German railways on accident analysis, the IfEV at TU Braunschweig and the Institute for Rail Systems at TU Dresden, also adopted WBA for research and teaching. We continue to use it, of course, to aid accident-compensation negotiations, and even to aid the criminal defence of an inappropriately accused microlight inspector.

Nancy Leveson at MIT has an accident analysis method which she announced in the early 2000s, STAMP. It is based on a hierarchical model of social organisation, due to Rasmussen and Svedung, with each level construed as a feedback control system. WBA is based largely on the rigorous application of one specific test for being a causal factor, the Counterfactual Test (roughly: A is a causal factor of B if both occurred and, had A not occurred, B would not have occurred either). Colleagues of ours at Siemens and TU Braunschweig compared the methods, as they then were, in 2003. There are of course other methods – Chris Johnson has surveyed some in his book, such as MES, STEP, MORT and PRISMA. Chris seems partial to ECF analysis. The ATSB used to use the so-called “Reason model”, after the analyst Jim Reason, formerly of Manchester University, and now uses Accimaps, a simple hierarchical representation due to Rasmussen before the more elaborate work with Svedung. (My contention is that all these methods use an informal rule-of-thumb intuitive version of the Counterfactual Test, and add stuff on top. WBA makes sure one gets at least the counterfactual analysis right, whatever else one wants to do. A universal method, if you like, even if you don’t use our software.)
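To make the Counterfactual Test a little more concrete, here is a minimal sketch in Python. It is an illustration only, not WBA itself and not our software. The toy runway-overrun model, the factor names and the function `is_necessary_causal_factor` are all hypothetical, and Lewis's "nearest possible world" is crudely simplified to "the same world with just that one factor switched off".

```python
from typing import Callable, Dict

World = Dict[str, bool]

# Hypothetical toy model (an assumption for illustration): the overrun occurs
# only if the runway is wet, the approach is fast and braking is late.
def overrun(world: World) -> bool:
    return world["wet_runway"] and world["fast_approach"] and world["late_braking"]

def is_necessary_causal_factor(
    factor: str, effect: Callable[[World], bool], actual: World
) -> bool:
    """Simplified Counterfactual Test.

    The factor passes if it and the effect both occurred, and the effect
    does not occur in the world that differs only in the factor being absent.
    """
    if not (actual[factor] and effect(actual)):
        return False
    counterfactual = {**actual, factor: False}  # crude stand-in for "nearest world"
    return not effect(counterfactual)

# What actually happened, including one fact that is irrelevant to the model.
actual = {
    "wet_runway": True,
    "fast_approach": True,
    "late_braking": True,
    "pilot_wore_blue_tie": True,
}

factors = [f for f in actual if is_necessary_causal_factor(f, overrun, actual)]
print(factors)
```

Even in this tiny model three separate factors pass the test, and the irrelevant fact does not. The test picks out a set of factors, not a single one.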

I read in the working draft of the RCFA standardisation proposal, under “Analysis Phase”, that

The analysis phase uses the collected data to build the sequence of events leading to the failure event, which is presented as a cause chain. The cause chain determines the direct, contributing, and root causes of the failure event. The direct cause is the first one in the cause chain, thus directly leading to the failure event. The root cause is the last one in the chain, while the contributing causes are the ones in between the direct and the root causes.

This root cause is the stopping point and is the place where, with appropriate corrective action, the problem will be eliminated and will not reoccur.

To be effective the analysis must establish a sequence of events or timeline to understand the relationships between contributory factors, the root causes and the failure event. The analysis will identify the reasons why the causes immediately preceding and surrounding the failure event existed, working backwards to the root causes.

I was -negatively- astonished. Much of this material contradicts what those of us who work in accident and failure analysis know.

I know people who are involved with the relevant German committee as guests. I wrote to our committee administrator listing some technical mistakes. He forwarded my note via the administrator of the responsible committee to the vice-chairman of that committee, who contacted the people I said I knew for their opinion on my technical points. They all responded that they agreed with the technical points (of course!). But it is also part of the process to decide whether a country supports a standardisation effort on a particular subject or not, and some indicated they would support the project; they are, however, guests not committee members.

I also wrote to my German committee colleagues directly, as well as to those others I know around the world who are interested in causal analysis of engineering failures and accidents. I wrote this note to the Safety-Critical System mailing list at the University of York. (Readers may follow the thread most easily by going to the archive page, choosing “thread view”, and searching for “New International Standard – Urgent Action Needed”, because there are ostensibly two thread titles referring to the same subject matter, and there are also some replies to my original note which occur in the thread view but are somehow not regarded as part of the thread. In the thread view, the messages are spatially, although not temporally, contiguous.) The responses were almost uniformly negative. Check out, for example, Rob Alexander’s, Andrew Rae’s and Nancy Leveson’s devastating short comments. Bertrand Ricque’s note pointed out that standardisation efforts may well be motivated by reasons other than technical.

I said the following in my personal e-mailing to colleagues.

The following things are most obviously technically wrong with this.

A1. There is usually no “cause chain”. There is rather an interconnected network (or mathematical “graph”) of causal factors. Here “usually” means: in my work I have seen no case of a chain which provided anything like an adequate causal analysis of a failure.

A2. There is usually no single “direct cause” as here defined. Rather, the causal factors are dependent on what events and states are regarded by the analyst(s) as relevant, and *relative to that choice* there may or may not be a “direct cause” as here defined. In most of the failures I have analysed, there is no single “direct cause” as here defined; rather, many.

A3. There is usually no single “root cause” as here defined. Rather, the causal factors are dependent on what events and states are regarded by the analyst(s) as relevant, and *relative to that choice* there may or may not be a single “root cause” as here defined. In most of the failures I have analysed, there is no single “root cause” as here defined; rather, many.

A4. There is usually no “stopping point” as here defined. Rather, the analyst(s) must invoke a “stopping rule” to say what further causes they no longer consider as relevant. This stopping rule is best formulated explicitly, and it represents a choice by the analyst (to use that explicit rule) rather than anything objective in the causality itself.

A5. The standard does not adequately define the various notions of “cause”, despite there being logically precise definitions in the scientific literature since, at the latest, 1973, and precise engineering-relevant definitions in the engineering literature since the 1990s.

To support point A5, consider the entire set of definitions of “cause” in the proposed standard:

[begin quote]

3.2 failure cause: circumstances during specification, design, manufacture or use that result in failure

3.4 direct cause: condition or action that directly resulted in the failure event, without which the failure event would not have occurred

3.5 contributing cause: condition or action that occurred that did not directly lead to the failure event and therefore by itself would not have caused the failure event

3.6 cause chain: cause and effect sequence in which a specific action creates a condition that contributes to or results in a failure event

3.7 root cause: condition or action that sets in motion the cause and effect chain that creates the failure event

[end quote]

The following things are technically wrong with these definitions.

B1. They are imprecise; e.g. “circumstances…that result in….”
i. What are “circumstances”? Events or states?
ii. What does it mean to “result in”?

B2. With the exception of the word “directly”, the definition of direct cause (3.4) is the so-called Counterfactual definition of causal factor; most (in some representations, all) of the factors represented in a causal graph satisfy this definition, not just one. The word “directly” remains undefined.

B3. All events and states in the world, *except* those that “directly led to the failure event”, satisfy this definition of contributing cause. My typing these words now did not directly result in the Fukushima nuclear power plant accident and therefore by itself would not have caused the Fukushima nuclear power plant accident. It follows, word for word, that according to this proposed standard my typing these words now is a contributing cause of the Fukushima nuclear power plant accident. This suggestion is absurd! But it is what the definition says.

B4. “Cause…chain” is a “sequence”. “Sequence” is not further defined. According to normal usage, a sequence is a linear ordering, a succession of items, one following another. There is nothing wrong with this definition as such. There is something very wrong with an analysis technique which suggests that “cause…chain” under this definition is what needs to be identified, as in A1 above.

B5. In the definition of root cause, the concept “sets in motion” is a pure metaphor and not a precise term. A causal chain does not “move” in anything other than a metaphorical sense. The causal graphs I print only move when the piece of paper on which they are printed moves. Otherwise, the network stays the way it was printed, thank heavens!

B6. In the definition of root cause, a cause chain is said to “create” a failure event. The word “create” applied to events is not further defined and it is unclear what it should mean.
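Points A1 to A4 can be illustrated with a small sketch in Python. The graph below is invented for illustration (it is not taken from any real analysis), and the names `causes_of`, `direct_causes`, `root_causes` and `is_chain` are hypothetical. Each factor maps to its immediate causes. A factor with an empty list is simply where the analyst's stopping rule ended the analysis, not anything objective in the causality.

```python
from typing import Dict, List, Set

CausalGraph = Dict[str, List[str]]

# Hypothetical causal graph: factor -> its immediate causal factors.
# An empty list marks where the (explicit) stopping rule was applied.
causes_of: CausalGraph = {
    "overrun": ["late_touchdown", "reduced_braking"],
    "late_touchdown": ["fast_approach", "tailwind"],
    "reduced_braking": ["wet_runway", "late_spoilers"],
    "late_spoilers": ["wet_runway"],
    "fast_approach": [],
    "tailwind": [],
    "wet_runway": [],
}

def direct_causes(graph: CausalGraph, failure: str) -> List[str]:
    """Immediate causal factors of the failure event."""
    return list(graph[failure])

def root_causes(graph: CausalGraph, failure: str) -> Set[str]:
    """Factors reachable from the failure that have no recorded causes."""
    seen: Set[str] = set()
    roots: Set[str] = set()
    stack = [failure]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        parents = graph.get(node, [])
        if not parents and node != failure:
            roots.add(node)
        stack.extend(parents)
    return roots

def is_chain(graph: CausalGraph) -> bool:
    """True if no factor has more than one immediate cause or effect."""
    effect_count: Dict[str, int] = {}
    for parents in graph.values():
        if len(parents) > 1:
            return False
        for parent in parents:
            effect_count[parent] = effect_count.get(parent, 0) + 1
    return all(count <= 1 for count in effect_count.values())

print(direct_causes(causes_of, "overrun"))
print(sorted(root_causes(causes_of, "overrun")))
print(is_chain(causes_of))
```

In this toy graph there are two "direct causes" and three "root causes", and the structure is not a chain. Moving the stopping rule (recording causes for, say, `tailwind`) changes the set of "root causes" without anything changing in the world.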

Some of the comments I had made were edited and supported by the responsible German committee and forwarded to the IEC. The IEC commentary format is very restrictive. It allows one to comment only on individual sentences in the original. This means that overall critique, for example that something is wrong-headed and contrary to the state of the art, as offered by Andrew Rae and Nancy Leveson in the notes referenced above, cannot be included.

Here is my paraphrase of the part of my critique that made it into the official commentary (I may not, of course, distribute the original).

In most cases, there is no “causal chain”: there is a (mathematical) graph (informally: a network) of causes.

“Stopping point” is an undefined term. It should be replaced by “stopping rule” which is explicitly defined by the analyst.

There is often no single place where the “problem” can be eliminated, but rather multiple places.

There is mostly no single “root cause”: there are usually multiple causes.

One can see very clearly here the reduction, both in words and in content, from my original critique. The IEC, and thereby the proposal originator and drafter(s) of the proposed standard, receive only this reduction.

Overall, I understand there was sufficient international support for a standardisation effort on Root Cause Failure Analysis to go ahead. Not all countries which supported the proposal nominated “experts”. Five did. You will search the literature on causal analysis in engineering in vain for any of their names.

There are a number of things wrong with this process. Some of them have been eloquently articulated by others I have referenced above. A further one is the reduction in content of the critique. Another is the obscurity of the process. Yet another is that the “experts” drafting the standard do not include anyone with an international scientific reputation in causal analysis.

There is a set of tropes about engineering standardisation procedures which were proposed by Derek Jones in the thread (see also his later note). I responded to – or, as a colleague called it, demolished – Derek’s points in this note.

Finally, now that we apparently have an international standardisation project for root-causal analysis of engineering failures, I encourage everyone who knows about such things to weigh in with their views, via the appropriate national committee, which, as Derek says, you can maybe manage to find out about with a few well-placed phone calls to some national organisation you think might know. But good luck in fitting your opinions on the IEC comments form. You have seen above what happened to mine.
