Following on from my recent post on certification requirements for commercial aircraft, John Rushby and I have been discussing a paper of his on commercial aircraft software and the DO178B guidelines, presented in the invited session on certification at EMSOFT 2011.
John is concerned with whether DO178B “works”, that is, leads to high-quality code which is fit for purpose, and, if so, why and how. I think that is a hard and important question and I commend his bravery in addressing it squarely (rather than hiding behind a blog format, as I do). I recommend his paper when he publishes it – I imagine it will be on his publications page sometime in October 2011. The paper is not long, but it is dense. Rather as if it were a poem, I had to read it multiple times, carefully. It took me a week to respond. I concluded that I really do prefer Goethe, but then he didn’t talk about avionics SW.
John suggests that DO178B is focused on assuring the correctness of the executable code. I found that surprising; I think both Martyn Thomas and I are concerned that DO178B is focused largely on processes which people thought and still think (often, we claim, without much scientific evidence) correlate with code that is fit for purpose.
I use the term “fit for purpose”, as does Martyn. John suggests that is a British term not used in the US. I prefer it to the term “correct”, so let me translate. There are at least two ways in which code may be said to be “correct”. One is that it fulfils its specification; let us call this correct-1. The second is that the code causes the system to behave in a manner appropriate for the task at hand; let us call this correct-2. I call the correctness-1 properties of code its quality, and its correctness-2 properties its fitness for purpose. But let me say correct-1 and correct-2 for the remainder of this note.
The specification may not subsume the “task at hand” in all cases: quality (correct-1) does not imply fitness for purpose (correct-2). Indeed, there is good evidence dating back some twenty years, for example in work of Robyn Lutz published in 1993, that most failures of correctness-2 in moderate to complex systems are not failures of correctness-1.
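As a toy illustration of the distinction (the “specification” here is invented, not from any real system): suppose a specification says “report the altitude as a flight level, rounded to the nearest hundred feet”. Code meeting that specification is correct-1; whether it is correct-2 depends on whether nearest-rounding was the right thing for the task at hand:

```rust
// Toy illustration of correct-1 vs correct-2, against an invented
// specification: "report the altitude as a flight level, rounded to
// the nearest hundred feet".
fn flight_level(altitude_ft: f64) -> i64 {
    (altitude_ft / 100.0).round() as i64
}

fn main() {
    // correct-1: the code meets its specification exactly.
    assert_eq!(flight_level(29_960.0), 300);
    // But if the task at hand needed rounding *down* (say, to report
    // conservatively against an assigned ceiling), the specification
    // itself is inappropriate: the code is correct-1 but not correct-2.
    println!("reported level: FL{}", flight_level(29_960.0));
}
```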
Martyn and I would agree on the lack of evidence that “clean and tidy” SW development processes entail any concrete property of the resulting code (such as fitness for purpose). Martyn would also suggest, citing work of Andy German of QinetiQ, that higher DALs in DO178B do not necessarily correlate with higher-quality software. Here is what Martyn says, based on information from personal communication with Andy:
Here is some data from the formal analysis of the avionics software of a current American military aircraft that was certified against DO-178B Levels A and B for use in civil airspace.
The following defects were among those reported in the software after certification:
Erroneous signal de-activation
Data not sent or lost
Inadequate defensive programming with respect to untrusted input data
Warnings not sent
Display of misleading data
Stale values inconsistently treated
Undefined array, local data and output parameters
Incorrect data message formats
Ambiguous variable process update
Incorrect initialisation of variables
Inadequate RAM test
Indefinite timeouts after test failure
Timing issues – system runs backwards
Process does not disengage when required
Switches not operated when required
System does not close down after failure
Safety check not conducted within a suitable time frame
Use of exception handling and continuous resets
Invalid aircraft transition states used
Incorrect aircraft direction data
Incorrect magic numbers used
Reliance on a single bit to prevent erroneous operation
The worst module had a defect density greater than 1 defect in 10 lines of code. The best had 1 defect in 250 lines.
The problem, as I see it, is that most software assurance relies on testing, although we have known for at least 40 years that testing can only show the presence of errors and not their absence. Until software assurance is mostly based on mathematically rigorous analysis of the software (which can be done at no increase in cost if the software is developed with this in mind) these unacceptable rates of software defects will continue.
Notice that these are errors in the sense of correctness-1.
I hold it to be significant that a number of these errors could not have occurred had the SW been written in a strongly-typed language and had the compiler correctly implemented the strong typing. This has been an issue for forty years, ever since the Algol project gave up. Strong typing is one of the best-known, demonstrably effective ways to avoid certain well-defined classes of program error. If DO178B is truly focused on the correctness of the implemented code, why doesn’t it require development in a strongly-typed source language, and the use of a demonstrably type-correct compiler?
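To illustrate the point, here is a minimal sketch in Rust, a strongly-typed language (the types and names are invented for illustration, not taken from any real avionics system), of how a type system rejects one well-defined class of error – confusing physical units – at compile time rather than in flight:

```rust
// Distinct types for altitudes in feet vs metres; the compiler rejects
// any attempt to mix them, so unit-confusion errors cannot survive
// compilation. Names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Feet(f64);
#[derive(Debug, Clone, Copy, PartialEq)]
struct Metres(f64);

impl From<Metres> for Feet {
    fn from(m: Metres) -> Feet {
        Feet(m.0 * 3.280_839_895) // the one explicit conversion point
    }
}

// A guard that only accepts altitudes already expressed in feet.
fn check_floor(altitude: Feet, floor: Feet) -> bool {
    altitude.0 >= floor.0
}

fn main() {
    let alt = Metres(1000.0);
    // `check_floor(alt, Feet(3000.0))` would be rejected by the
    // compiler: expected `Feet`, found `Metres`. The conversion must
    // be made explicitly:
    let ok = check_floor(Feet::from(alt), Feet(3000.0));
    println!("above floor: {}", ok);
}
```

With a plain floating-point type for both quantities, the erroneous call would compile and run, and the error would have to be found by testing – if it is found at all.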
Andy also claimed in a paper in Crosstalk 16(11), the Journal of Defence Software Engineering, November 2003 that “no significant difference” had been found with respect to levels of correctness-1 in code developed according to DO178B Design Assurance Level A and Level B. Development according to DAL A is regarded as significantly more resource-intensive than development to DAL B, in part because DAL A requires so-called MC/DC testing (see the helpful tutorial by Kelly Hayhurst, pointed out to me by Mike Holloway), which is quite hard work.
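For readers unfamiliar with MC/DC (modified condition/decision coverage): for each condition in a decision, it requires a pair of tests showing that the condition independently affects the decision’s outcome. A small sketch, with an invented decision, checking that a four-vector test set achieves this for three conditions:

```rust
// An invented guard decision: D = a && (b || c).
fn decision(a: bool, b: bool, c: bool) -> bool {
    a && (b || c)
}

// Does condition `cond` (0 = a, 1 = b, 2 = c) have an "independence
// pair" in `tests`: two vectors differing only in that condition which
// produce different decision outcomes?
fn has_independence_pair(cond: usize, tests: &[[bool; 3]]) -> bool {
    tests.iter().any(|v1| {
        tests.iter().any(|v2| {
            (0..3).all(|i| (v1[i] == v2[i]) != (i == cond))
                && decision(v1[0], v1[1], v1[2]) != decision(v2[0], v2[1], v2[2])
        })
    })
}

fn main() {
    // Candidate MC/DC set: n+1 = 4 vectors suffice for 3 conditions,
    // versus 8 for exhaustive testing of the decision.
    let tests = [
        [true, true, false],
        [false, true, false], // flips only `a` relative to the first
        [true, false, false], // flips only `b` relative to the first
        [true, false, true],  // flips only `c` relative to the third
    ];
    for cond in 0..3 {
        println!(
            "condition {}: independence pair found = {}",
            cond,
            has_independence_pair(cond, &tests)
        );
    }
}
```

The effort in practice lies in constructing such a set for every decision in the code, which for complex boolean expressions is indeed hard work.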
BTW, that edition of Crosstalk also includes a fine article by Brian Dobbing and Alan Burns on the so-called Ravenscar profile, for interprocess communication in Ada which admits straightforward static analysis.
Much of the QinetiQ work was, I understand, performed as part of the big UK certification effort for an aircraft manufactured by Lockheed Martin (BTW, one barely calls them “manufacturers” any more, but rather “system integrators”). John said by way of anecdote that he had indications that internal data of both Boeing and Airbus show “more issues” with DAL B software than with DAL A software.
Assuming these observations are correct, the question would be how two experienced companies obtain improved software quality from the extra requirements for DAL A, while a third experienced company sees no such improvement.
The answer must be that there are hidden factors at work, factors which actually do lead to an improvement in SW quality, which in two companies are associated with the extra effort required for DAL A development, but which in a third company are not.
Since DO178B misses those factors (for otherwise all companies would show improved quality in DAL A development over DAL B development), isn’t it important to find out what they are, and then write them explicitly into DO178C, which is currently in its final stages?
BTW, if you want my view on what an ideal SW safety standard should say (thank you for asking), check out slide 22 of my Ada Connection keynote talk of 21 June 2011.
I might point out that it is much shorter than the 150pp of IEC 61508 Part 3 Version 2, which I mutter about on the previous slide.