Malware and the August 2008 Madrid Spanair Take-Off Accident


On 20 August 2008, an MD-82 aircraft of the airline Spanair crashed on takeoff (TO) from Madrid-Barajas airport. The high-lift devices on the wing had not been properly configured to give the necessary lift on takeoff, and the aircraft was consequently unable to lift off as planned. See Aviation Safety Net’s report of this accident for more details.

There had been a maintenance issue during a previous attempt at departure, and maintenance personnel had addressed this issue. In effecting the repair, however, they had also disabled the takeoff configuration warning horn, which aurally warns the crew that the high-lift devices are not appropriately configured for takeoff. The crew is required, in the pre-takeoff check list, to check that the aircraft is appropriately configured for takeoff, and it seems that they did not do so at the second departure: they performed some of the items, but not the full list.

Spanair uses a ground-based computer to process aircraft logs for maintenance issues. The fault which caused the accident aircraft to return to the gate had apparently occurred more than once the previous day, and had been logged. But the press has recently reported that malware in this computer delayed the processing of reports, so that maintenance was not aware of the problem the previous day, when they would still have been able to correct it before the fated flight. The press reports have thereby connected this malware with the accident. See, for example, a summary in English of the reports by Daniel Johnson on the University of York Safety Critical Systems Mailing List.

Brian Reynolds commented on these reports that “This is totally bogus” and clarified that he meant that it is “totally bogus” “[t]hat a virus or Trojan in a ground maintenance computer is causally related to this incident.”

Reynolds seems to be denying the claim that malware in a ground-based maintenance computer is causally related to the accident. But he omitted to say what his criterion for causal-relatedness is.

I have one: the concept of necessary causal factor, proposed in 1973 by the philosophical logician David Lewis, who credits the concept to David Hume (his “second definition” of cause). I took over Lewis’s semantics 15 years ago for use in failure analysis.

According to this semi-formal, objective notion of causal factor, there is demonstrably a chain of causal factors leading from the presence of the malware to the accident. According to this concept, Reynolds is provably wrong.

So now let me show this.

Here is the Counterfactual Test:

Let A and B be events or states.

A is a necessary causal factor in the occurrence of B just in case:

If A had not occurred, then B would not have occurred.

This last sentence is called a counterfactual (or contrary-to-fact) conditional. "Conditional" comes from the "if…then…" form; "counterfactual" from the fact that A and B did as a matter of fact happen, and one is supposing what the world would have been like had A not occurred. In order to determine this, I adapt the Lewis semantics: suppose A had not occurred, but the world stayed otherwise as similar as possible to the actual state of affairs that pertained. Did B occur in this possible state of affairs? Most often, we cannot answer with absolute certitude "yes" or "no", but it turns out that we can usually answer "most likely, yes" or "most likely, no". The Counterfactual Test is to ask this question. If the answer is "most likely, no", the Counterfactual Test is "passed" and A is a necessary causal factor of B. If the answer is "most likely, yes", then A is not a necessary causal factor of B. We have found the Counterfactual Test to be very useful in complex engineering failure analyses.
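To make the shape of the test concrete, here is a minimal sketch in Python. The Judgement values and the counterfactual_test function are my own illustrative names, not part of any existing tool.

```python
from enum import Enum
from typing import Optional

class Judgement(Enum):
    """The analyst's answer to: did B still occur with A removed?"""
    MOST_LIKELY_YES = "most likely, yes"   # B would have occurred anyway
    MOST_LIKELY_NO = "most likely, no"     # B would not have occurred
    UNDECIDED = "cannot say"

def counterfactual_test(did_b_still_occur: Judgement) -> Optional[bool]:
    """Return True if A is judged a necessary causal factor of B,
    False if not, and None if no confident judgement can be reached.

    The argument records whether B would still have occurred in the
    nearest possible world in which A did not occur (the adapted
    Lewis semantics described above)."""
    if did_b_still_occur is Judgement.MOST_LIKELY_NO:
        return True    # test passed: A is a necessary causal factor of B
    if did_b_still_occur is Judgement.MOST_LIKELY_YES:
        return False   # B did not depend on A
    return None        # leave the question open
```

For example, for counterfactual (1) below, A is the presence of the malware and B the lack of timely awareness of the fault; judging that B would most likely not have occurred without A, one records Judgement.MOST_LIKELY_NO, and the test returns True.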

To show a causal connection between the presence of malware on the maintenance computer and the accident, here are five instances to check with the Counterfactual Test:

1. Had the malware not been present, the fault causing the logged anomaly would have been noted by maintenance personnel in a timely manner (let us say: at the latest, by the end of the previous day).
2. Had the fault causing the logged anomaly been noted by maintenance personnel in a timely manner, it would have been appropriately repaired before the accident flight.
3. Had the fault been appropriately repaired before the accident flight, the TO-configuration warning would have sounded on the accident flight.
4. Had the TO-config warning sounded during TO on the accident flight, the TO would have been aborted when the warning sounded and the aircraft properly configured before subsequent TO.
5. Had the TO been aborted when the warning sounded, the aircraft would not have crashed as it did.

I consider all of these counterfactuals to be true according to the Lewis semantics. It follows:

1a. The presence of the malware was a necessary causal factor in the lack of timely awareness of the fault.
2a. The lack of timely awareness of the fault is a necessary causal factor in lack of timely repair.
3a. The lack of timely repair is a necessary causal factor in the TO-config warning inhibition.
4a. The TO-config warning inhibition is a necessary causal factor in continuing TO to loss-of-control.
5a. Continuing TO to loss-of-control is a necessary causal factor in the accident.

So, there is a chain of six causal factors, chain-length five, connecting the presence of malware to the accident. QED.
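One can also write the chain down explicitly. The sketch below uses my own shorthand labels for the six factors; each adjacent pair is a link only because the corresponding counterfactual, 1 to 5 above, is judged true.

```python
# Shorthand, illustrative labels for the six factors in the chain argued above.
causal_chain = [
    "malware present on the maintenance computer",
    "no timely awareness of the fault",
    "no timely repair of the fault",
    "TO-config warning inhibited",
    "TO continued to loss of control",
    "the accident",
]

# Adjacent pairs are linked only because the corresponding counterfactual
# (1 to 5 above) is judged true under the Lewis semantics.
links = list(zip(causal_chain, causal_chain[1:]))
assert len(causal_chain) == 6 and len(links) == 5   # six factors, chain-length five

for cause, effect in links:
    print(f"'{cause}' is a necessary causal factor in '{effect}'")
```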

I emphasise, just to avoid misunderstanding, that these are by no means the only causal factors relevant to the accident: that the crew failed adequately to perform the pre-takeoff check list on the accident flight is most certainly a necessary causal factor in the loss of control. The reader is invited to try out the Counterfactual Test to assure him- or herself of this.

Applying the Counterfactual Test rigorously throughout the list of potentially-relevant factors, to see which ones are indeed causally relevant and which not, is the core of our analysis method Why-Because Analysis (WBA). For those interested in seeing relatively quickly how we perform WBAs nowadays, there is available a case study on how to perform a WBA using the SERAS Reporter and SERAS Analyst tools.

Here is some general information concerning our experience with Why-Because Analyses. Typically, depending on the level of detail provided by the investigation, a detailed causal analysis (which we represent in graphical form as a Why-Because Graph) ends up showing a hundred to a couple of hundred individual factors, of which a quarter to a third are "root-causal factors", that is, causal factors which are not regarded as themselves having pertinent causes. So WBA also includes a fair amount of bookkeeping, or "complexity control", or whatever one wants to call it. For example, given a WBG with a couple of hundred items, one would assemble these causal factors into a small number of subgroups, and give these subgroups appropriate titles, to provide an "executive summary" of the analysis. The SERAS Reporter and SERAS Analyst software is available as freeware from Causalis Limited.

We can well expect a full WBA of the Spanair accident to contain between a hundred and a couple of hundred factors.
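For readers who would like to experiment with the bookkeeping just described, here is a minimal sketch in Python (not the SERAS tools) of a Why-Because Graph held as a factor-to-causes mapping, with the root-causal factors read off as those factors for which no pertinent causes are recorded. The factor labels are made up for illustration.

```python
from collections import defaultdict

# A Why-Because Graph, sketched as a mapping from each factor to the set
# of its recorded (pertinent) causes.
wbg: dict[str, set[str]] = defaultdict(set)

def add_cause(effect: str, cause: str) -> None:
    """Record that `cause` passed the Counterfactual Test for `effect`."""
    wbg[effect].add(cause)
    wbg.setdefault(cause, set())   # make sure the cause appears as a node

def root_causal_factors(graph: dict[str, set[str]]) -> set[str]:
    """Factors not regarded as themselves having pertinent causes."""
    return {factor for factor, causes in graph.items() if not causes}

# An illustrative fragment only, with made-up labels:
add_cause("accident", "TO continued to loss of control")
add_cause("TO continued to loss of control", "TO-config warning inhibited")
add_cause("TO continued to loss of control", "pre-takeoff check list not fully performed")
add_cause("TO-config warning inhibited", "fault not repaired in time")

print(root_causal_factors(wbg))
# e.g. {'pre-takeoff check list not fully performed', 'fault not repaired in time'}
```

A real WBG would of course contain the hundred or more factors mentioned above, and the subgrouping into titled clusters would be carried out on top of such a structure.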
