A Watershed in System Safety Engineering?


The report on the 2006 RAF Nimrod accident has recently come out, and British safety engineers, at least, regard it as a major event. It is a milestone, and could be a watershed, in system safety engineering in Britain.

Put briefly, the report found that there had been various technical questions about the Nimrod design dating back to 1969. In 1979 the aircraft were modified with a supplementary cooling pack, some of whose hot lines were in proximity to parts of the fuel system. In 1989, the Nimrods were modified for air-to-air refuelling, which allowed the possibility of overflow of fuel into the parts of the aircraft occupied by the hot lines of the supplementary cooling pack. Indeed, this happened in 2006: the aircraft caught fire, and aircraft and crew were lost. This account comes from a summary article in The Daily Telegraph recommended by John McDermid, who has also privately cautioned against drawing conclusions without having read the report.

Which I am about to do (both). But first, here, the drawing of conclusions 🙂

There was a Safety Case prepared in this decade (that is, after the mods implicated in the accident) by BAe Systems. The Safety Case was criticised in the report as being incomplete and badly argued (as were the preparers and their MoD supervision – and these people are named!). I do not need persuading that many important Safety Cases are junk. Indeed, I have written publicly about it. But this one landed on the front pages of British newspapers.

I think it very likely that one of the consequences of the report will be increased attention paid to the quality of Safety Cases.

One obvious way to improve the quality of Safety Cases is to improve the quality of the argument in them. One could even say that these are the same: a Safety Case is nothing more nor less than one long argument. It is an argument that a system is sufficiently safe (whatever the criteria may be) to operate. And it is meant to reflect reality, which it can only do if the argument is sound (that is, valid reasoning from true premisses, which thereby establishes true conclusions).

So one obvious way to improve the quality of a Safety Case is to check its quality – that is, to have it reviewed by experts in its subject matter. The common subject matter of all Safety Cases is argumentation. The study of argument is known as logic, and the people who study argumentation are known as logicians.

Philosophy undergraduates are usually required to take at least two courses (one full year) of logic, no matter where they study. Our computer science students in Bielefeld study logic first in math (where they might be introduced to Boolean algebra), then in Theoretical Computer Science (where they are introduced to propositional logic and the syntax of predicate logic), and those who take a hardware track must do some more in Technical Informatics (which goes as far as expecting people to draw and evaluate Karnaugh diagrams and use the Quine-McCluskey minimisation algorithm).
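That last step – Karnaugh maps and Quine-McCluskey – is mechanical checking of a kind computers do well. As an illustration (the function and names here are my own, not from any course), one can verify a Boolean minimisation exhaustively in a few lines of Python:

```python
from itertools import product

def equivalent(f, g, n):
    """Two n-variable Boolean functions are equivalent iff they agree
    on all 2**n input combinations."""
    return all(f(*bits) == g(*bits) for bits in product([False, True], repeat=n))

# f = A·B + A·¬B : the two product terms differ only in B, so the
# Karnaugh-map / Quine-McCluskey merging step reduces them to the
# single literal A.
f = lambda a, b: (a and b) or (a and not b)
g = lambda a, b: a

print(equivalent(f, g, 2))  # True: the minimisation preserved the function
```

The point of the exhaustive check is exactly the point of the course material: a minimisation is correct only if the minimised function agrees with the original on every input.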

I used to examine these people orally. Remember they had had three courses which handled formal logic. To see what they knew about propositional logic, for example, I used to introduce a sentence such as A IMPLIES B OR A and ask them what it meant according to the traditional semantics. About two thirds of them would get their knickers in a twist. Some might say “it’s a tautology” and then I’d say “no, it’s not!”. Others might say “it can’t be simplified any further, so it means what it says” and then I’d say “no, it can be simplified to TRUE”. And then I would ask what was going on. Most couldn’t say. Then I would point out that the meaning was not usually given by syntactic equivalence (that is, reading the sentence out loud and saying what other sentences it was the same as or different from). So how is it given?

So, Exercise 1 for readers of this note has two parts: first, what’s going on here with this formula? Second, how is its meaning given?
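For readers who want a mechanical nudge towards the second question (a sketch in Python, and something of a hint): the classical semantics assigns a formula a truth value at each valuation of its atoms, so one can simply enumerate the valuations – here for both possible parses of the sentence:

```python
from itertools import product

def implies(p, q):
    """Classical material implication."""
    return (not p) or q

# Two possible parses of "A IMPLIES B OR A":
parse1 = lambda a, b: implies(a, b or a)   # A → (B ∨ A): OR binds tighter
parse2 = lambda a, b: implies(a, b) or a   # (A → B) ∨ A

for a, b in product([False, True], repeat=2):
    print(a, b, parse1(a, b), parse2(a, b))
# Both parses come out True under every valuation.
```

Whether that settles the first part of the exercise is left to the reader.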

This is trivial stuff with which most philosophy students with a course in logic would have no problem whatsoever. But most computer science courses do not require any kind of understanding of basic argument forms. Neither do most engineering courses. Neither, indeed, does the University of York Certificate in System Safety Engineering, nor its M.Sc. in Safety-Critical Systems Engineering.

Here is Mr. Haddon-Cave QC saying that poor argumentation and its uncritical acceptance is largely responsible for the deaths of 14 people and loss of an aircraft. Thank you, Mr. Haddon-Cave! Some of us have been saying such things could happen for years, but you got it on the front page of the newspapers!

And no wonder we are in such a mess. Imagine if we allowed civil engineering firms to do business without anyone in them having any demonstrated training, experience or ability in Newtonian statics.

But traditional philosophy courses in logic will not suffice. The kind of facility with argumentation which is required for Safety Cases is far more than traditional mathematical logic (by which I mean classical propositional and predicate logic). It requires also the ability to understand the quality of inductive arguments (by which I mean arguments involving probabilities), causal arguments (which go over and above classical logic into the domain of what is called modal logic), and the ability to take a piece of running text, represent the form of the argument which it presents, and analyse its quality. This latter skill is trained in many philosophy courses of study, where it is often called “informal logic”, a term which my former Berkeley student colleague John Burgess, in his important new book, justifiably calls an oxymoron. But I do not know of any computer science or engineering courses of study which require such a course or which test their students in any way on their skills in analysing text-based arguments. I will call the three areas of logic I have singled out as follows: classical logic (classical propositional and first-order predicate logic); philosophical logic (the term Burgess uses for non-classical formal reasoning, such as probability-based inductive reasoning, temporal reasoning, and modal reasoning including – if one believes David Lewis et al. – causal reasoning); and text-based logic (“informal logic”).

Now we, as inter alia teachers of system safety, have the time, opportunity, and encouragement to do something about this situation.

I propose that a demonstrated ability in classical logic, some forms of philosophical logic, and text-based logic should be required for any formal qualification in system safety engineering. This likely won’t help us for the next twenty years, unless there is some magical way of making this requirement retroactive, but then it will have an impact.

Examples

This is so far an abstract argument without examples, so maybe I should provide some to back up the manifesto.

I gave a paper analysing TCAS in 2004 to the annual Australian Safety-Critical Systems Club conference then in Brisbane. When I submitted the written version for review (a formality, since the paper was invited), one of the reviewers, an engineer with extensive TCAS background, who had indeed given a paper on TCAS to the same conference, professed confusion as to my first point that there had obviously been a TCAS requirements failure.

The BFU report into the Überlingen midair collision on 1 July 2002 glossed over possible requirements failures, even though a sentence in the report acknowledges one (try to find it!).

Yet Eurocontrol had known of this specific failure for at least two years before the accident and had filed a change request addressing it explicitly with RTCA in 2000. (This change request, known as CP112, in its by-then extended version CP112E, finally made it into the TCAS Version 7.1 standard in April 2008, some 7+ years, numerous incidents, and one fatal midair collision later).

The BFU Überlingen report acknowledged that, had the phenomenon I am calling a requirements failure not been manifested, the aircraft would have missed each other. This was the conclusion of simulations apparently performed at their request by the Eurocontrol ACASA project (the exact results of those simulations have to my knowledge not been made public). That means, explicitly, that according to the Counterfactual Test the requirements failure is a necessary causal factor in the accident. Yet it does not appear in the list of causal factors at the end of the BFU report. This is clearly a causal-logical fallacy in one of the most widely-read causal analyses of our time. It is not the only one in the report. Yet most of the people I talk to accept fallacy-ridden accident analyses as routine. It is time we started applying higher standards.
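The Counterfactual Test itself is simple enough to state in code. Here is a minimal sketch (the toy model and factor names are my illustrative assumptions, not the BFU's or ACASA's analysis): a candidate is a necessary causal factor when the outcome occurred, and negating that factor alone, all else held fixed, makes the outcome disappear.

```python
def counterfactual_test(model, factors, candidate):
    """Counterfactual Test: did the outcome occur, and would it NOT
    have occurred had the candidate factor been absent, with every
    other factor unchanged?"""
    altered = dict(factors)
    altered[candidate] = not factors[candidate]
    return model(factors) and not model(altered)

# Toy model (illustrative only): the collision occurs if the courses
# converge AND the requirements failure manifests.
def collision(f):
    return f["converging"] and f["requirements_failure"]

actual = {"converging": True, "requirements_failure": True}
print(counterfactual_test(collision, actual, "requirements_failure"))  # True
```

On this toy model the requirements failure passes the test, which is exactly the structure of the argument above: the simulations say the aircraft would have missed each other without it, so it belongs in the list of causal factors.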

Some years ago I took apart the Eurocontrol Safety Case for RVSM in European Airspace (see the WWW page on it). It didn’t take long. I was invited by the Eurocontrol RVSM Project Leader Joe Sultana, and the Safety Case author, Dr. Bernd Tiemeyer, to a discussion about my critique at Eurocontrol. Such appropriate and gracious gestures, and an interest in resolving a critique, are still all too rare amongst us engineers.

The gentlemen seemed to want to explain to me how I had misunderstood the report and my conclusions were mistaken. Since my argument was simple, based explicitly on the text, and correct, readers can imagine that they had an uphill battle 🙂 One of their main points, which indeed required further analysis on my part, was that they had used a critical premiss which was the explicit result of research performed in Eurocontrol’s ACASA project. They gave me a huge folder of reports from ACASA – I guess 1,000 pp or so. I was able to review these quite quickly, since my argument was simple and the needed premiss quite clear in its use, and I determined that the work ACASA had performed did not justify the specific conclusion which the RVSM Safety Case had used as a premiss (in fact, I would argue that the RVSM people had reinterpreted the statement: as ACASA made it, it seemed to be a justifiable conclusion of their work, but the RVSM people wanted to take it to say something somewhat different. In other words, the statement as written was ambiguous).

I was at the IET System Safety Conference last week, and Bernd Sieker reported to me that in a talk about ATM issues, someone had identified themselves as the author of the Euro-RVSM Safety Case and that “despite what you may read on certain WWW sites” the Safety Case is correct (exemplary? I am not sure what degree of accolade was asserted). I didn’t see Dr. Tiemeyer at the conference. (Maybe it was the author of the Post-Implementation Safety Case? But then, I didn’t write anything about that because it wasn’t available at the time.) The following question arises: why hasn’t Eurocontrol succeeded in understanding what is wrong with its Safety Case argumentation in all these years? And fixed it? (It probably can’t be fixed in the way Eurocontrol might like, but at least it can be rewritten to draw correct conclusions from the evidence they adduce.) The answer could be because they don’t have anyone with the specific ability and formal training in argument to fix it.

Other examples, including some decisive and ultimately quite expensive ones, lie within the scope of non-disclosure agreements.

