I was browsing the invited lectures given under Martin Abadi’s Collège de France lecture series and came across this elegant, simple explanation of so-called Byzantine failures by the gentleman who invented the term, Leslie Lamport. Leslie’s two papers on the subject with Rob Shostak and Marshall Pease in the early 1980s, Reaching Agreement in the Presence of Faults and The Byzantine Generals Problem, are seminal. Kevin Driscoll et al.’s SAFECOMP 2003 paper, Byzantine Fault Tolerance: From Theory to Reality, as well as Kevin’s brilliant keynote talk at SAFECOMP 2010, Murphy Was an Optimist (the slides of which seem no longer to be on the WWW), show how prescient the SRI work was.
I met Leslie at SRI in 1984. Rob had just left, to finish and then sell his PC database SW “Paradox” with Richard Schwarz, starting his second career as a serial entrepreneur. A colleague commented at the time that the market for PC database software seemed already to be saturated, so leaving a good job for that was risky. I guess that’s how some make millions and some don’t! Marshall was still there; he was reputed to be quite a successful stock purchaser, but is no longer with us.
Leslie’s Slide 2 shows what appears to be an Airbus A380, with computers of some sort issuing pitch-control commands (probably primary pitch control; Byzantine failures in the FMGEC software, which includes the autopilot, would not likely be safety-critical). And Slide 4 speaks of an “FAA requirement” that the “probability of catastrophic failure” of an airplane’s computer be less than “10⁻¹⁰ per hour”.
It is common amongst computer scientists who deal with avionics issues to think that the reliability requirement for critical equipment with safety-related behavior is a probabilistic requirement. But it isn’t so. Probabilities of some sort do enter into assessment processes somewhere, but not so directly. It seems to me to be worthwhile to say some words about certification regulations. They can be somewhat abstruse unless you are a certification engineer (even for the regulator! See John Downer’s Trust and Technology: The Social Foundations of Aviation Regulation).
First, an aside about units: they should be “operational hours”, not simply “hours”. Most people probably assume that, correctly. In any case, the difference between “operational hour” and “hour” for most commercial airplanes in continual, regular use is probably only a factor of two to four, averaged over the service life of the airplane. Still, it is best to be precise.
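For those who like to see the arithmetic, here is a rough sanity check of that factor as a few lines of Python. The annual utilisation figures are illustrative guesses of mine, not data:

```python
# Rough sanity check of the "factor of two to four" between calendar hours and
# operational hours. The utilisation figures below are illustrative guesses only.
calendar_hours_per_year = 365 * 24                 # 8760
for flight_hours_per_year in (2200, 3000, 4400):   # guessed annual utilisation
    factor = calendar_hours_per_year / flight_hours_per_year
    print(f"{flight_hours_per_year} op-hours/year -> factor {factor:.1f}")
# prints factors of roughly 4.0, 2.9 and 2.0
```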
Second, there is a figure known as the “10⁻⁹ xxxxx” (where “xxxxx” is variously “requirement”, “condition” or “criterion”, depending). I guess this is what Leslie is referring to, rather than a “10⁻¹⁰” criterion. There is a 10⁻⁹ criterion in the Acceptable Means of Compliance (allied to the qualitative probability “Extremely Improbable”). The general functional-safety standard IEC 61508, which does not apply to commercial aviation, although it is sometimes used for military systems, is written to regard anything claimed below a reliability level of 10⁻⁹ per operational hour as unrealistic (Ron Bell, Chair of the Maintenance Team for 61508 Parts 1-2, personal communication. Also, PBL self-communication: I am on the German national committee).
It is possible, though, that there are automotive systems, typically small electronics boxes fitted to many different common models of car, that might well get of the order of 10¹⁰ operational hours on them (Mike Ellims, personal communication).
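To see how plausible that is, here is a purely illustrative back-of-envelope in Python; every number in it is a guess of mine, not a figure from Mike:

```python
# Illustrative only: how a common automotive electronics box could plausibly
# accumulate of the order of 1e10 operational hours. All numbers are guesses.
units_in_service = 20e6        # one box design fitted across many common car models
driving_hours_per_year = 300   # ~12,000 km/year at ~40 km/h average speed
years_in_service = 8

fleet_hours = units_in_service * driving_hours_per_year * years_in_service
print(f"{fleet_hours:.1e} operational hours")   # 4.8e+10
```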
The 10-9 criterion was looked at hard by John Downer, in his PhD thesis at Cornell The Burden of Proof (I don’t think it has been published yet, which is a shame. I have a copy).
So, on to the main theme.
The certification requirements for large airplanes (i.e., all commercial transports) are contained in a document known in Europe as CS-25, the 2003 and subsequent versions of which are available from the EASA WWW site.
First observation. Contrary to what it looks like from Leslie’s slide, the technical requirement for computers or computer behavior is nil. Computers inherit any conditions on failure behavior solely through the requirements on the pieces of kit which they control, in the sense that there are dangerous-failure requirements on the entire subsystem. And the requirements on the pitch-control subsystems are purely functional, saying what loads they must withstand under which conditions, and how they must dynamically behave. (Check them out for yourself here!) No probabilities: no qualitative probability terms, no quantitative ones. So it is misleading to associate any 10⁻ˣ condition with a requirement.
There is, however, an accompanying document to CS-25 called “Acceptable Means of Compliance” (AMC). That is, in order to demonstrate to the satisfaction of the certification authority that subsystem X does this and withstands that (as the certification requires), it is deemed by the authority acceptable to follow the guidance in the AMC. Of course, you can do it some other way also, if you can find one!
This is a notionally subtle but practically significant difference between what is required and what is accepted as evidence that a requirement is fulfilled. If any system (such as the one Leslie illustrates) brings the airplane into a hazardous or catastrophic state, then it is an airworthiness issue and the problem has to be fixed. Full stop. And that is what is done. However, if the requirement were numerical, say “probability of dangerous failure of 1 in 10⁹ per operating hour”, then one instance, or two instances, or even twenty instances, of a hazardous or catastrophic state would be compatible with that numerical requirement, and the problem would not necessarily need to be fixed, since it could be argued that this very small probability had unfortunately been realised way earlier than expected. This difference is significant for lawyers arguing about the distribution of compensation (or “recovery” as they say), and compensation for loss is a universal principle many thousands of years older than airplanes and their certification.
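The point can be made concrete with a small calculation. If one assumes (my assumption, not anything in the regulations) that dangerous failures arrive as a Poisson process at exactly the claimed rate, then observing an accident is unlikely but by no means impossible, so the observation does not by itself refute the numerical claim:

```python
# Sketch of why observed events cannot straightforwardly refute a probabilistic
# requirement. Assumes (my assumption) that dangerous failures form a Poisson
# process at the claimed rate of 1e-9 per operational hour.
import math

rate = 1e-9            # claimed probability of dangerous failure per operational hour
exposure = 5e7         # fleet operational hours, of the order quoted later for the A320 fleet
lam = rate * exposure  # expected number of events over that exposure = 0.05

for k in (1, 2, 20):
    # P(at least k events) for a Poisson distribution with mean lam
    p_fewer = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    print(f"P(>= {k:2d} events) = {1 - p_fewer:.2e}")
# P(>= 1) is about 4.9e-02: unlikely, but not impossible, so a single accident
# is logically compatible with the numerical claim.
```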
I note with some embarrassment, however, that IEC 61508 makes “probability of dangerous failure of 1 in 10ˣ per operating hour” into a requirement, suffering the disadvantage I just noted of leaving it open, in the circumstance of a dangerous failure, whether the requirement has been met or not. I guess the lawyers can expect some business 🙂
Actually, the whole business of what “probability” means in “probability of dangerous failure” is a can of worms. Let me leave that for another time.
The AMC uses terms for hazard: Minor, Major, Hazardous and Catastrophic. It also uses terms for probability: Probable, Remote, Extremely Remote, and Extremely Improbable. These are technical terms, and when they occur in the requirements they are capitalised. The meaning of “Extremely Improbable” is (historically) “not expected to occur within the service life of the airplane type”; “service life of the airplane type” means here the total number of operational hours of all airplanes of that type throughout the entire use history of the airplane (assuming, of course, that the airplanes are maintained as designed). The meaning of “Extremely Remote” is “…..once….”; the meaning of “Remote” is “…once per individual aircraft, and several times in the service life of the type”; “Probable” is “…..several times in the life of an individual aircraft”.
These definitions come from previous versions of the certification documentation (when it was known as JAR 25) and may be found in a 1982 book by Lloyd and Tye, Systematic Safety, published by the UK CAA. These definitions will have been applicable directly to the certification of the two most popular airplanes flying today, the Boeing 737 series (certification mid-1960s) and the Airbus A319/320/321 series (certification mid-1980s), but not to the certification of, say, the Airbus A380, which is mid-2000s. So let’s also look at later versions of the document.
The 2003 AMC-25 uses these terms for subsystem compliance; for example, AMC 25-19 §6(c) says:
“(3) Extremely Improbable Failure Conditions: Extremely Improbable Failure Conditions are those so unlikely that they are not anticipated to occur during the entire operational life of all aeroplanes of one type, and have a probability of the order of 1 x 10⁻⁹ or less. Catastrophic Failure Conditions must be shown to be Extremely Improbable.”
We see that in the current certification document the qualitative terms are firmly bound to quantitative probability statements.
The reason for this change is that, in the days of Lloyd and Tye, someone did a back-of-envelope calculation and figured that the “service life of the airplane type” could be expected to be somewhat less than ten million hours. It was then! But, for example, Airbus’s safety chief, Yannick Malinge, when giving evidence to a Subcommittee of the Brazilian Parliament in August 2009, pointed out that the A320 fleet had at that time some 55 million operational hours or more (if I remember correctly. I also did a crude calculation of my own then, based on a guess at operational hours per year for a typical model, a uniform build rate since service introduction in 1988, and a 25-year service life of an individual airplane, and came up with a similar figure). So for modern purposes that pre-1980s back-of-envelope calculation is at least an order of magnitude too low.
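That crude calculation can be reconstructed along the following lines. The build rate and annual utilisation here are guesses of mine for illustration, not Airbus figures and not necessarily the numbers I used at the time:

```python
# Reconstruction, in spirit, of the crude fleet-hours estimate described above:
# uniform build rate since service entry in 1988, no retirements yet by 2009
# (assumed 25-year service life of an individual airplane). The build rate and
# annual utilisation are illustrative guesses, not Airbus figures.
service_entry, snapshot_year = 1988, 2009
build_rate = 120          # aircraft delivered per year (guess)
hours_per_year = 2200     # operational hours per aircraft per year (guess)

fleet_hours = 0
for delivery_year in range(service_entry, snapshot_year):
    years_in_service = snapshot_year - delivery_year
    fleet_hours += build_rate * years_in_service * hours_per_year

print(f"{fleet_hours / 1e6:.0f} million operational hours")   # 61 million
# The same order as the 55 million quoted, and well past the roughly ten
# million hours assumed in the pre-1980s back-of-envelope calculation.
```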
Then, following the reasoning in Lloyd and Tye, people apparently thought there would be about 100 airplane subsystems which could be a single point of catastrophic failure, and so the condition that no single-point catastrophic failure should occur in the service life is 1 in 10 million (1 in 10⁷) divided among 100 airplane systems, so one in one billion per airplane system, leading to an average “probability” over the service life of 10⁻⁹ per operational hour.
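Laid out explicitly, using just the figures above, the arithmetic is:

```python
# The back-of-envelope arithmetic behind the 10^-9 figure, as described above.
fleet_life_hours = 1e7          # assumed service life of the type (the pre-1980s estimate)
single_point_subsystems = 100   # assumed number of potential single-point catastrophic subsystems

# At most one catastrophic event expected over the type's service life ...
whole_airplane_rate = 1 / fleet_life_hours                          # 1e-7 per operational hour
# ... shared equally among the candidate subsystems:
per_subsystem_rate = whole_airplane_rate / single_point_subsystems
print(per_subsystem_rate)       # 1e-09 per operational hour
```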
Anyhow, that is where the 10⁻⁹ condition comes from, and nowadays the qualitative term is directly anchored to it, to avoid any calculations over expected fleet lives, since the actual fleet lives have proved to be rather different from those expected at certification time. Nobody expected they were going to sell going on for ten thousand airplanes of these types, but that is what it looks like might happen now!
And there is nothing in the AMC about reliability of computers. There are things about reliability of systems which are driven by computers, for example displays; AMC 25-11 §4(3)(i) says:
“(i) Attitude. Display of attitude in the cockpit is a critical function. Loss of all attitude display, including standby attitude, is a critical failure and must be Extremely Improbable. Loss of primary attitude display for both pilots must be Improbable. Display of hazardously misleading roll or pitch attitude simultaneously on the primary attitude displays for both pilots must be Extremely Improbable.”
So that’s what the regulations say and the acceptable means of compliance suggest you do. For insight into how this works out in practice, read John Downer!
I offer here many heartfelt thanks to Clive Leyman, quondam Chief Aerodynamicist of Concorde, who did his best to put me straight on all this over the last few years (I hope he thinks he succeeded!)