Root Cause Analysis

The International Electrotechnical Commission (IEC) is currently preparing an international standard to be known as IEC 62740 Root Cause Analysis. I prepared some material for potential inclusion in the standards document but, as of writing, it appears it will not be used. I think it is quite useful, so I am hereby making it available.

The paper on the RVS WWW-site, Root Cause Analysis: Terms and Definitions, Accimaps, MES, SOL and WBA, consists of

A vocabulary I put together defining the terms I think are needed to talk effectively about root-causal analysis, based on the International Electrotechnical Vocabulary (IEV), IEC 60050, which all international electrotechnical standards are required to use. I am not completely happy with a number of the IEV's definitions of fundamental concepts, and I make my discontent clear through notes which I have added to the IEV definitions. Other concepts are new, and not (yet) in the IEV. Readers might like to compare with the vocabulary which I prepared in 2008 for system safety under the auspices of Causalis Limited, Definitions for Safety Engineering.
Brief introductions to four root cause analysis methods for accidents: Accimaps (from Jens Rasmussen, successfully applied by Andrew Hopkins and now by the Australian Transport Safety Bureau), Multilevel Event Sequencing (MES, from Ludwig Benner, Jr., formerly of the US National Transportation Safety Board), Safety through Organisational Learning (SOL, from Babette Fahlbruch and SOL-VE GmbH, used in the German and Swiss nuclear industries), and Why-Because Analysis (WBA, originated by me and developed by colleagues at Uni Bielefeld RVS and Causalis Limited, used by two divisions of Siemens and now by the German Railways DB, as well as by Causalis for its accident analyses for clients). Each method description includes pictures, so that readers get an idea of how results are presented; a short section on process (what one does); and a section on strengths and limitations.

I think it would be a good thing to have similar descriptions for all methods in current industrial use for root cause analysis of significant incidents. My personal list of such methods currently stands as follows:

Accimaps (in the document)
Barrier Analysis. BA is really an a priori method favored in the process industries, but also used post hoc to determine which barriers failed and why. Typified in Reason’s “Swiss Cheese” diagram.
Causes-Tree Method (CTM). Widespread and, I am told, sometimes legally required in France for accident analysis.
Events and Causal Factors (ECF) Analysis and Diagrams. ECF is dealt with extensively in Chris Johnson’s Failure in Safety-Critical Systems: A Handbook of Accident and Incident Reporting.
Fault Tree Analysis (FTA). I had considered FTA primarily an ab-initio risk-analysis method at system design, but Nancy Leveson tells me she has seen more root cause analysis performed with the help of fault trees, sometimes put together after an incident rather than pre-existing, than with any other technique.
Fishbone or Ishikawa Diagrams. These are minimally a method, more a presentation technique, and not one I find particularly helpful. More applicable in industrial quality control than in significant-incident analysis, I would think.
Multilevel Event Sequencing (MES, and its associated technique STEP), in the document.
The Reason Model of human operational analysis, which involves human error in operations; classifications such as skill-based, rule-based and knowledge-based operations (SRK); the notion of latent errors, or misdesign of operations that allows mishap sequences to occur normally; and the “Swiss Cheese” model.
Safety through Organisational Learning (SOL, with its associated toolset SOL-VE), in the document.
STAMP and its associated methods: Leveson’s feedback-control-system model of critical-operational control, applied to the Rasmussen-Svedung hierarchy of operational, organisational and institutional context. It is dealt with extensively on Nancy Leveson’s WWW site.
TRIPOD, a method developed over many years by oil companies in cooperation with Jim Reason’s group, and in wide use in the oil industry.
Why-Because Analysis (WBA), in the document.

Besides these, there are special methods for root cause analysis of incidents involving human operations; maybe one can call these “human factors root cause analysis” methods. Amongst these are:

Connectionism Assessment of Human Reliability, CAHR, from Oliver Sträter’s group at Kassel, which has been used in analysing marine accidents and incidents.
Human Information-Processing Models. These originated with Peter Lindsay and Don Norman, and include methods sometimes used by NASA’s human factors research group (NASA Ames, at Moffett Field in California). Our PARDIA classification is such a model.
Human Factors Analysis and Classification System (HFACS).
Management Oversight and Risk Tree (MORT), developed by William Johnson for the US Nuclear Regulatory Commission and widely used in the US nuclear industry.
The SHEL model (note that the referenced page spells it mistakenly with two “l”s).
Shorrock and Kirwan’s TRACEr model for identifying and classifying cognitive error in air traffic management and control operations. For example, see this paper.

There are other promising methods which I could include, but I don’t know how much industrial “traction” they yet have. If readers could let me know of other worthwhile methods which have found some foothold in industry, I would be grateful. I would be even more grateful for descriptions of methods similar to those that are already in the document! Authorship will of course be acknowledged in the usual manner.
