Ludi Benner just asked me privately about the feasibility of dumping stack traces from operating SW in flight. I concluded that it is not a very practical idea, for a number of reasons. First, there is a lot of data. Second, you can’t analyse the traces for every flight: there aren’t the human resources for it, and there are no automatic tools which can detect coding errors from stack traces. Third, even if you analysed them in the case of an accident, there has as yet been no accident in which coding error was suspected (although there have been accidents and incidents in which requirements or design failure of computer-based systems was a causal factor), so even had they been available, no one would have needed to look at them.
Looking at stack traces is also primarily a measure for assessing software quality. You can perhaps tell from a stack trace whether the SW was doing what it was required/designed to do, and thus detect coding errors. But in safety-critical systems you are not primarily interested in deviation from requirements in general; you are primarily interested in deviation from the safety requirements.
There is a general method for formulating safety requirements:
(I) identify hazards (however you might define them), and
(II) then formulate a safety requirement per hazard H as S.H = [either avoid hazard H or exit out of hazard H within Q.H time period], and
(III) define the overall safety requirement as ( ∧H S.H ), the conjunction being taken over all hazards H (sketched in code below)
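As a concreteness check, here is a minimal sketch of how one might represent safety requirements of this form in code. It is my own illustration, not part of any established method: the class names, hazard identifiers and deadlines are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass(frozen=True)
class SafetyRequirement:
    """S.H: either avoid hazard H altogether, or exit H within Q.H seconds."""
    hazard_id: str        # H, from the hazard identification of step (I)
    q_h_seconds: float    # Q.H, the maximum tolerable time spent in H (step (II))

def overall_requirement_met(met_per_hazard: Dict[str, bool]) -> bool:
    """Step (III): the overall safety requirement is the conjunction over all hazards H."""
    return all(met_per_hazard.values())

# Hypothetical example: two identified hazards, each with its own deadline Q.H
requirements = [
    SafetyRequirement("H1_overpressure", q_h_seconds=0.05),
    SafetyRequirement("H2_thrust_asymmetry", q_h_seconds=2.0),
]
```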
Nancy Leveson defines hazards as states of the system such that … (Leveson, Safeware, Addison-Wesley 1995, Chapter 9). Others speak of states of the system+environment such that …, or of events such that … (see for example Chapter 4 of my 2001 on-line book, and Chapter 5, and this set of definitions from Causalis).
Let’s use the Leveson definition.
The fundamental insight is this. Suppose you have relatively complete safety requirements (the definition from an earlier blog post). Then you can insert monitoring SW to look at the SW-state and detect hazard states H when they occur (this can be achieved by techniques for run-time verification; I shall call it here logical monitoring), and then trap to SW which either exits or mitigates H, with worst-case execution time (WCET) Q.H.
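To make the insight concrete, here is a minimal sketch of the monitor-and-trap pattern, written as ordinary sequential code. It is my own illustration rather than anything from the run-time-verification literature: the state representation, oracle and handler signatures are hypothetical, and in a real system the monitor would be generated by run-time-verification tooling and the trap would typically have HW support.

```python
from typing import Callable, Dict

State = Dict[str, float]            # abstract SW-state (hypothetical representation)
Oracle = Callable[[State], bool]    # returns True exactly when hazard state H holds
Handler = Callable[[State], State]  # exits or mitigates H; must have WCET <= Q.H

class LogicalMonitor:
    """Inserted monitoring SW: watches the SW-state, detects hazard states,
    and traps to the corresponding exit/mitigation routine."""

    def __init__(self, oracles: Dict[str, Oracle], handlers: Dict[str, Handler]) -> None:
        self.oracles = oracles      # one hazard-detection oracle per hazard H
        self.handlers = handlers    # one exit/mitigation routine per hazard H

    def check(self, state: State) -> State:
        for hazard_id, hazard_holds in self.oracles.items():
            if hazard_holds(state):                       # hazard H detected
                state = self.handlers[hazard_id](state)   # trap: exit/mitigate within Q.H
        return state
```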
For this to work unfailingly, the following conditions have to be fulfilled:
(a) your safety requirements are relatively complete,
(b) the hazard-detection is perfect (a “perfect oracle”), and
(c) the (H-exit or H-mitigation) SW is perfect.
This suffices to ensure that the SW does not engender dangerous behaviour. Assumption (a) ensures that fulfilment of the safety requirements suffices to avoid dangerous behaviour. Assumptions (b) and (c) ensure that the safety requirement associated with each hazard H is fulfilled. The assumption of perfection for the detection SW in (b) and for the avoidance/mitigation software in (c) is critical, as is the condition that, when the perfect oracle detects the presence of H, the trap to the avoidance/mitigation software also works perfectly. However, such traps are HW-based, and a failure of such a trap could occur due to a HW problem. (Of course, the victims are unlikely to care whether a trap failure is classified as HW or SW.)
Logical monitoring has, I propose, developed to the point at which (b) and (c) are practical. John Rushby pointed this out to me in December 2009, directing me to a brief survey of his which I found helpful. I don’t belong to the run-time verification “community”, although I knew about it in general (Manuel Blum and others at Berkeley whom I know had been working on it theoretically a couple of decades ago). So I am proposing it, as it were, on hearsay rather than through personal experience. It seems to me plausible that one can synthesise perfect oracles as well as perfect avoidance/mitigation software.
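For readers who, like me, know run-time verification mainly at second hand, here is a purely illustrative sketch of the kind of oracle such techniques produce: a small stateful monitor for a hypothetical past-time property of the SW-state. The property, variable names and thresholds are all invented for the example.

```python
class HazardOracle:
    """Illustrative monitor for the (hypothetical) hazard state
    'pressure above LIMIT for more than N consecutive samples
    while the relief valve is still commanded closed'."""
    LIMIT = 0.95   # hypothetical normalised pressure limit
    N = 3          # hypothetical number of consecutive over-limit samples tolerated

    def __init__(self) -> None:
        self.over_limit_count = 0   # the only monitor state needed for this property

    def step(self, pressure: float, valve_open: bool) -> bool:
        """Feed one sample of the SW-state; return True iff hazard H holds now."""
        if pressure > self.LIMIT and not valve_open:
            self.over_limit_count += 1
        else:
            self.over_limit_count = 0
        return self.over_limit_count > self.N
```

The point of the run-time-verification work, as I understand it, is that such monitors can be synthesised mechanically from a formal statement of the property (for example a past-time temporal-logic formula) rather than hand-written as here, which is what gives one grounds for believing they might be perfect.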
Such software, added to safety-critical SW, would be possibly-perfect software in the Strigini sense: software which you would like to be perfect, which you have good reason to think is perfect, and where the question is mainly the confidence you have in your judgement that it is. Possibly-perfect software and its use in achieving demonstrably-ultrahigh-reliability software have recently been discussed in Littlewood and Rushby’s forthcoming paper in the IEEE Transactions on Software Engineering, which I think is a landmark paper.
There are then two questions.
One is assessing the level of confidence we could have that such logical-monitoring software is indeed perfect, and how that would affect the level of confidence we have in the exceptionless fulfilment of the safety requirements for the otherwise possibly imperfect SW in which this logical monitoring is inserted. I suspect that techniques such as those exhibited in the Littlewood-Rushby paper are applicable.
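To indicate the shape of such an argument (my gloss, not a summary of the paper): if the primary SW has an assessed probability of failure on demand pfd_A, and the logical-monitoring channel has an assessed probability of non-perfection pnp_B, then a Littlewood-Rushby-style bound on the probability of a dangerous failure per demand is, as I understand it, roughly their product. A trivial arithmetic sketch with made-up numbers:

```python
# Hypothetical figures, purely to show the shape of the argument
pfd_primary = 1e-4    # assessed probability the primary (monitored) SW fails a demand dangerously
pnp_monitor = 1e-5    # assessed probability that the logical-monitoring SW is NOT perfect

# Littlewood-Rushby-style bound (as I understand it) for the monitored architecture
bound = pfd_primary * pnp_monitor
print(f"bound on P(dangerous failure per demand) <= {bound:.1e}")   # -> 1.0e-09
```

Two individually modest claims thus combine into an ultrahigh-dependability claim, which is exactly why the confidence one has in the perfection judgement matters so much.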
The second question is itself twofold:
(i) whether, for every hazard H, it is the case that a safety requirement of the form S.H = [either avoid hazard H or exit out of hazard H within Q.H time period] suffices to avoid all dangerous consequences of H, and
(ii) whether it is possible to produce such avoidance/mitigation SW with WCET less than Q.H
Concerning (i), logical monitoring SW cannot help in avoiding H; it detects when H is present (condition (b) above). So if harmful consequences of H can occur within a shorter time period than it could possibly take to detect, trap and exit H, this approach cannot be guaranteed to fulfil the safety requirements for the SW. However, in such a case I very much suspect that the software should be redesigned to ensure the avoidance of H altogether. Since H is a SW state, I see no reason why this should not generally be possible.
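Concerning the timing side of both (i) and (ii), the engineering obligation reduces to a budget check per hazard: the worst-case detection latency, plus the trap latency, plus the WCET of the exit/mitigation routine, must fit within Q.H. A trivial illustrative check, with hypothetical figures:

```python
def hazard_budget_ok(detect_wcet_s: float, trap_wcet_s: float,
                     mitigate_wcet_s: float, q_h_s: float) -> bool:
    """True iff detection + trap + exit/mitigation always completes within Q.H."""
    return detect_wcet_s + trap_wcet_s + mitigate_wcet_s <= q_h_s

# Hypothetical figures for one hazard H with Q.H = 50 ms
print(hazard_budget_ok(detect_wcet_s=0.002, trap_wcet_s=0.0005,
                       mitigate_wcet_s=0.030, q_h_s=0.050))   # True: the budget fits
```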