Fault, Failure, Reliability Again

On the System Safety Mailing List we have been discussing software reliability for just over a week. The occasion is that I and others are considering a replacement for the 18-year-old, incomplete, largely unhelpful and arguably misleading guide to the statistical evaluation of software in IEC 61508-7:2010 Annex D. Annex D is only four and a half pages long, but a brief explanation of the mathematics behind it and the issues surrounding its application resulted in a 14-page paper called Practical Statistical Evaluation of Critical Software, which Bev Littlewood and I have submitted for publication. Discussion in closed communities also revealed to me a need to explain the Ur-example of Bernoulli processes, namely the Urn Model, introduced and analysed by Bernoulli in his posthumous manuscript Ars Conjectandi of 1713, as well as its application to software reliability. I have done so in a paper called Software, the Urn Model, and Failure.

This discussion about statistical evaluation of software has shown that there is substantial disagreement about ideas and concepts in the foundations of software science.

On the one hand, there are eminent colleagues who have made substantial careers over many decades, written seminal papers in software science and engineering, and published in and edited the most prestigious journals in software, on the subject of software reliability.

On the other hand, there are groups of engineers who say software cannot fail. They don’t mean that you and I were just dreaming all our struggles with PC operating systems in the ’90s and ’00s, that those annoying things just didn’t happen. They mean that, however you describe those frustrating events, the concept of failure doesn’t apply to software. It is, as Gilbert Ryle would have said, a category mistake.

I knew that some people thought so twenty years ago, but I had no idea that it is still rife in certain areas of software practice until I was informed indirectly through a colleague yesterday. I have also been discussing, privately, with a member of the System Safety List who holds this view. I have encouraged him to take the discussion public, but so far that hasn’t happened.

The Urn Model can be considered a trope introduced by one man 300 years ago and still not well understood today. Yesterday, I noted another 300-year-old trope that was recognised as mistaken nearly a half century later, but still occurs today without the mistake being recognised. That is John Locke’s account of perception, along with Berkeley’s criticism of it, which is regarded universally today as valid. It occurs today as what I call the “modelling versus description” question (I used to call it “modelling versus abstraction”), and I encounter it regularly: last month at a conference in Bristol, in questions after my talk (warning, it’s over 50MB!), and again yesterday in a System Safety List discussion. I don’t know when the trope calling software failure a category mistake got started (can someone advise me of the history?) but it’s as well to observe (again) how pointless it is, as follows.

Whatever the reasons for holding that “software cannot fail” as a conceptual axiom, it should theoretically be easy to deal with. There is a definition of something called software failing in the papers referenced above, and I can obviously say it’s that conception which I am talking about. You can call it “lack of success”, or even flubididub, if you like. The phenomenon exists, and it’s that about which I – and my eminent colleagues whose careers it has been – are talking. Furthermore, I would say it’s obviously useful.

Another approach is to observe that the concept of software failure occurs multiple times in the definitions for IEC 61508. So if you are going to be engineering systems according to IEC 61508 – and many if not most digital-critical-system engineers are going to be doing so – it behooves you to be familiar with that concept, whatever IEC 61508 takes it to be.

There is, however, a caveat: whether the conceptions underlying IEC 61508 are coherent. Whatever you think, it is pretty clear they are not ideal. My PhD student Christoph Goeker calculated a def-use map of the IEC 61508 definitions. It’s just under 3m long and 70cm wide! I think there’s general agreement that something should be done to try to sort this complexity out.

What’s odder about the views of my correspondent is that, while believing “software cannot fail”, he claims software can have faults. To those of us used to the standard engineering conception of a fault as the cause of a failure, this seems completely uninterpretable: if software can’t fail, then ipso facto it can’t have faults.

Furthermore, if you think software can be faulty, but that it can’t fail, then when you want to talk about software reliability, that is, the ability of software to execute conformant to its intended purpose, you somehow have to connect “fault” with that notion of reliability. And that can’t be done. Here’s an example to show it.

Consider deterministic software S with the specification that, on input i, where i is a natural number between 1 and 20 inclusive, it outputs i. And on any other input whatsoever, it outputs X. What software S actually does is, on input i, where i is a natural number between 1 and 19 inclusive, it outputs i. When input 20, it outputs 3. And on any other input whatsoever, it outputs X. So S is reliable – it does what is wanted – on all inputs except 20. And, executing on input 20, pardon me for saying so, it fails.
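
For concreteness, the example can be written down as a minimal Python sketch. The names `is_natural`, `spec`, `s` and `failing_inputs` are hypothetical, chosen only for illustration, and the string "X" stands in for the output X. The sketch compares what S does with what the specification requires over a handful of candidate inputs.

```python
def is_natural(x):
    """True for Python ints (excluding bools); an assumption about what counts as a natural-number input."""
    return isinstance(x, int) and not isinstance(x, bool)


def spec(x):
    """Specified behaviour: echo naturals 1..20, output "X" on anything else."""
    return x if is_natural(x) and 1 <= x <= 20 else "X"


def s(x):
    """Actual behaviour of S: as specified, except that input 20 yields 3."""
    if is_natural(x) and x == 20:
        return 3
    return x if is_natural(x) and 1 <= x <= 19 else "X"


def failing_inputs(candidates):
    """Inputs on which S's output differs from the specified output."""
    return [x for x in candidates if s(x) != spec(x)]


if __name__ == "__main__":
    candidates = list(range(-5, 30)) + ["a", None]
    print(failing_inputs(candidates))
```

On these candidates the only input reported is 20. Whatever one chooses to call the event, it is the thing that happens when `s(20)` is evaluated and the result is not what `spec(20)` requires.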

That failure has a cause, and that cause or causes lie somehow in the logic of the software, which is why IEC 61508 calls software failures “systematic”. And that cause or causes is invariant with S: if you are executing S, they are present, and just the same as they are during any other execution of S.

But the reliability of S, namely how often, or how many times in so many demands, S fails, depends obviously on how many times, how often, you give it “20” as input. If you always give it “20”, S’s reliability is 0%. If you never give it “20”, S’s reliability is 100%. And you can, by feeding it “20” proportionately, make that any percentage you like between 0% and 100%. The reliability of S is obviously dependent on the distribution of inputs. And it is equally obviously not functionally dependent on the fault(s), that is, the internal causes of the failure behavior, because that/those remain constant.
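
The dependence on the input distribution can be made explicit with a short sketch, under the assumption that an input distribution is represented as a dictionary mapping each input to its probability. The names `spec`, `s` and `reliability` are hypothetical, and exact fractions are used only to avoid rounding noise. The code of `s` is the same in every case; only the distribution changes.

```python
from fractions import Fraction


def spec(x):
    """Specified behaviour: echo naturals 1..20, output "X" on anything else."""
    return x if isinstance(x, int) and 1 <= x <= 20 else "X"


def s(x):
    """Actual behaviour of S: as specified, except that input 20 yields 3."""
    if isinstance(x, int) and x == 20:
        return 3
    return x if isinstance(x, int) and 1 <= x <= 19 else "X"


def reliability(profile):
    """Probability that a demand drawn from `profile` is handled as specified.

    `profile` maps inputs to probabilities (an assumed representation).
    """
    total = sum(profile.values())
    ok = sum(p for x, p in profile.items() if s(x) == spec(x))
    return ok / total


if __name__ == "__main__":
    always_20 = {20: Fraction(1)}
    never_20 = {i: Fraction(1, 19) for i in range(1, 20)}
    uniform = {i: Fraction(1, 20) for i in range(1, 21)}
    for name, profile in [("always 20", always_20),
                          ("never 20", never_20),
                          ("uniform 1..20", uniform)]:
        print(name, reliability(profile))
```

With these three distributions the reliability comes out as 0, 1 and 19/20 respectively, although the fault in `s` is identical throughout. In general the figure is one minus the probability assigned to input 20.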

The plea is often heard, and I support it, to take steps to turn software engineering into a true engineering science. That won’t happen if we can’t agree on the basic phenomena concerning success or failure – call it lack of success if you like – of software execution. Nor will it happen if, even when we do agree on the phenomena, we are not able to agree on words to call them by.