It happened again! On 13 December 2008, a Boeing 767-39H suffered a tailstrike on takeoff at Manchester Airport. A tailstrike can occur on takeoff when the pilots pitch the nose of the aircraft too high in the air before it has lifted off the ground. This typically happens when the aircraft is “rotated”, that is, the nose is pitched up to fly it off the ground, and it doesn’t fly off, so the pilot pitches the nose higher still in order to get it to do so. The tailstrike is the symptom of a very dangerous phenomenon, as follows.
Why wouldn’t the airplane fly off? Well, before flight, various computers and software calculate the speed at which rotation should occur, known in aviation-speak as Vr, from, amongst other things, the total weight of the aircraft at take-off (TOW). If the TOW value is too low, then the calculated Vr will be too low and the aircraft will not fly off at Vr. When the aircraft is rotated, the aerodynamic drag also increases, so it accelerates more slowly. Not only that, but the TOW is also used in calculating the thrust setting of the engines for take-off, and a supposedly lighter aircraft is assigned correspondingly less thrust, so the airplane will have accelerated more slowly to get to the too-low Vr in the first place. So it’s triply bad: you took too long and too much runway to get to your too-low Vr, and the act of rotation hinders you even further from reaching the true Vr at which the aircraft will fly off the runway. It is a very dangerous situation, and accidents have happened. The crucial observation is that the TOW value is calculated from data delivered and typed into computers by humans, and humans can make inadvertent mistakes.
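To put rough numbers on the first of those effects, here is a minimal back-of-the-envelope sketch in Python. It assumes only that, at rotation, the wing must produce lift roughly equal to the aircraft’s weight, so the required speed scales with the square root of the assumed weight. The wing area, lift coefficient and weights are invented for illustration; they are not taken from any real aircraft type or from the incidents discussed here.

```python
import math

# Back-of-the-envelope sketch: how an under-entered TOW drags the computed Vr down.
# All figures are invented for illustration only.

RHO = 1.225        # sea-level air density, kg/m^3
WING_AREA = 437.0  # m^2, illustrative figure for a large twin-aisle
CL_ROTATE = 1.7    # illustrative lift coefficient at the rotation attitude
G = 9.81           # m/s^2

def rough_vr(tow_kg: float) -> float:
    """Speed at which lift (0.5 * rho * V^2 * S * CL) roughly equals weight."""
    return math.sqrt(2 * tow_kg * G / (RHO * WING_AREA * CL_ROTATE))

true_tow = 360_000     # kg, hypothetical actual take-off weight
entered_tow = 260_000  # kg, hypothetical mistyped value, 100 tonnes too low

print(f"Vr from true TOW:    {rough_vr(true_tow):.1f} m/s")     # ~88 m/s, about 171 kt
print(f"Vr from entered TOW: {rough_vr(entered_tow):.1f} m/s")  # ~75 m/s, about 146 kt
```

In this toy model, a 100-tonne underestimate knocks roughly 15% off the computed Vr, so the aircraft is still some 25 knots short of flying speed when the crew begin to rotate.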
This incident was not the latest such event, merely the one I have most recently found out about. In March 2009 a similar incident with an Emirates A340 in Melbourne, with almost three hundred people on board, came very close to being the worst accident Australia has seen. The aircraft got off the ground only after the end of the runway, taking some of the runway-end equipment and part of a shed with it, and returned to land safely. No one was injured.
Other notable occurrences: in June 2002 it happened with an Air Canada Boeing 767 in Frankfurt; in March 2003 with a Singapore Airlines B747 in Auckland; and in October 2004 with an MK Airlines B747 freighter in Halifax, Nova Scotia, in which the aircraft crashed off the end of the runway and the seven people on board died.
All because a too-low TOW was given to the devices which calculate Vr and take-off thrust. So shouldn’t the pilots carefully check the numbers used to calculate TOW before they enter them? Well, of course! Do they do so? Most certainly: all are aware of the dangers. And there are formal procedures, part of the Standard Operating Procedures (SOPs), to help them do so.
So are the crews that do this sloppy? Badly trained? Incompetent? Should they be fired? Commentary on the Melbourne accident on the professional pilots’ internet forum PPRuNe has recently increased since the pilot in command (PIC) was interviewed by an Australian newspaper. One should always be careful when drawing conclusions from such forums: PPRuNe is anonymous, there are a lot of poseurs, a fair number of non-poseurs who are not pilots but are interested, and, I imagine, even pilots who express views other than those they really hold. Those caveats noted, the most frequently expressed opinion castigates the PIC for dereliction of duty.
Is this fair? After all, there are SOPs which, if followed accurately, are supposed to ensure a correct TOW, and he knows the dangers. And isn’t he playing roulette with his passengers’ lives?
No, this is not fair. There are a number of reasons why not. First, as all airline pilots know, and as NASA has recently documented in detail, the amount of distraction in an airline cockpit from outside sources during pre-flight preparations can be enormous. And distraction leads to error, SOPs or no. Second, as my colleague Bernd Sieker and I have recently found out through analysis, typical SOPs, considered as algorithms for getting the right V1, Vr and thrust setting, are not very robust by the standards applied to safety-critical computer programs. And an SOP is after all a sort of program, a human+computer program in this case. Third, this can happen to anyone. A pilot colleague who knows the Air Canada incident crew, and who was extensively involved in setting up his airline’s Flight Operations Quality Assurance (FOQA) program, to which things like flubbing TOW entry centrally belong, tells me the crew are “among [the] finest, most competent supervisory captains, highly respected leaders within the airline”. And finally, someone who accuses a crew of playing roulette with their passengers’ lives forgets that they are equally putting their own lives at risk. Except for those very, very rare cases of murder-suicide (I think there have been only three in the quarter century in which I have been interested), this is always an off-base accusation. Everyone is sitting in the same fuselage.
Fair or unfair, though, is not the most pressing issue. The most pressing issue is: how do we stop this happening again, maybe with 300 people dead rather than just bent metal? Compare: we have had five incidents in seven years, with seven fatalities and a few hundred more people who came very close. What is going to happen in the next fifteen years if we carry on as we have been?
There are three candidate solutions. The first is better training and more attention to SOPs. The second is internal aircraft weighing systems, whereby the aircraft can assess its own weight on the ground. The third is more robust data-entry procedures for TOW calculations.
Consider training. Remember, “it can happen to anyone”. OK, more attention might be paid to this specific task, but this suggestion does not solve the underlying problem of human reliability under distraction.
Consider internal weighing systems. They do exist, but airlines choose not to pay for them. Why not? First, they cost a lot of money. Second, I understand and can well believe that reliability, and the ensuing maintenance cost, is an issue. Most such systems measure the compression of the oleo struts on the landing gear. Theoretically it’s a great idea, but in practice it brings data-integrity problems with it; and there is also the question of what one does when components fail: it is unlikely that the aircraft would thereby be grounded, so one would fall back on the human calculation anyway. That brings us back to the third solution: ensuring the human procedures are robust.
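For concreteness, here is a toy sketch of what an on-board cross-check and its failure mode look like. The sensor layout, figures, tolerance and fallback policy are all invented for illustration; this is not any certified system’s logic. Gross weight is inferred from per-strut load readings and compared with the crew-entered figure, and a failed sensor forces a fall-back to the manual calculation.

```python
from typing import List, Optional

# Toy sketch of an on-board weighing cross-check; all names and figures are
# invented for illustration, not taken from any real system.

def gross_weight_from_struts(strut_loads_kg: List[Optional[float]]) -> Optional[float]:
    """Sum per-strut load readings; return None if any sensor has failed."""
    if any(load is None for load in strut_loads_kg):
        return None                      # one failed sensor invalidates the estimate
    return sum(strut_loads_kg)

def check_tow(entered_tow_kg: float,
              strut_loads_kg: List[Optional[float]],
              tolerance_kg: float = 5_000) -> str:
    measured = gross_weight_from_struts(strut_loads_kg)
    if measured is None:
        # Components have failed; the aircraft is unlikely to be grounded for
        # that, so the crew fall back on the human calculation anyway.
        return "no cross-check available: rely on the manual TOW procedure"
    if abs(measured - entered_tow_kg) > tolerance_kg:
        return f"TOW disagreement: entered {entered_tow_kg:.0f} kg, sensed {measured:.0f} kg"
    return "TOW cross-check OK"

# Hypothetical readings: nose gear plus two main-gear struts, in kg.
print(check_tow(260_000, [20_000.0, 170_000.0, 168_000.0]))  # flags a ~100 t discrepancy
print(check_tow(260_000, [20_000.0, None, 168_000.0]))       # sensor failure: fall back
```

Even in this toy form the two awkward cases are visible: the comparison needs a tolerance, because measured and paper figures will never agree exactly, and the failure branch leads straight back to the human procedure.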
Besides all that, how much would it cost to retrofit the entire fleet of commercial aircraft with internal weighing systems? How likely do we think it is that this will happen? Much of the current fleet is still going to be flying in fifteen years, and what do we think the chances are that someone will buy the farm in a big way, in this way, in that time?
Consider designing more robust procedures. Which I shall now do in a little more detail, because I propose that they are the most practical prophylactic measure. Bernd Sieker and I have written a paper on this, which we have submitted for publication in the technical literature.
Let me focus on getting the right TOW where it should be in the Flight Management Computer (FMC). We call this business an “engineered multi-agent cooperative” function, or EMC function. Let me not worry here about why we call it that. This function is executed by performing various human, automated, and human+automated subsidiary procedures, including data transmission, data exchange (through humans writing things down and typing them in, and through automatic means), calculation (both human and automated), and verification (checking that intermediate numbers have more or less correct values). Hence we can analyse it in exactly the same way we analyse multiprocessor computer programs, which also consist of combinations of sequences of small but precise actions.
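To make “small but precise actions” concrete, here is how one might write such a procedure down before analysing it, much as one would write down a multiprocess program. The step names, their order and their assignment to agents are invented for illustration; they do not reproduce any particular airline’s SOP.

```python
from dataclasses import dataclass
from enum import Enum, auto

# Illustrative only: an explicit, analysable rendering of a TOW data-entry
# procedure as a sequence of small steps performed by identified agents.

class Agent(Enum):
    HUMAN = auto()       # pilot, dispatcher, load controller
    AUTOMATED = auto()   # datalink, performance tool, FMC

@dataclass
class Step:
    description: str
    agent: Agent
    kind: str            # "transmit", "transcribe", "calculate" or "verify"

TOW_PROCEDURE = [
    Step("receive final load sheet", Agent.AUTOMATED, "transmit"),
    Step("copy TOW from load sheet onto take-off data card", Agent.HUMAN, "transcribe"),
    Step("enter TOW into performance tool", Agent.HUMAN, "transcribe"),
    Step("compute V1/Vr/V2 and thrust setting", Agent.AUTOMATED, "calculate"),
    Step("enter speeds and thrust setting into FMC", Agent.HUMAN, "transcribe"),
    Step("cross-check FMC entries against load sheet", Agent.HUMAN, "verify"),
]

# Once the procedure is in this explicit form, each step can be given
# preconditions and postconditions and analysed like program statements.
```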
First I observe that, when one writes the sample SOPs down in a form more similar to how one would write a program (dotting the “i”s and crossing the “t”s), they turn out to be rather more complicated than they appear at first sight. This immediately leads one to suspect that intuition (what Sieker and I call the Cognitive Model, CM) may not necessarily be a good guide to reality (the actual workings of the SOPs, which we call the Procedural Model, PM). There is a third model involved, namely the description of what the function does: “getting the right TOW where it should be in the FMC”. When we make this more precise, this is what we call the Requirement Model (RM).
The goal of demonstrating that the SOPs are adequate to achieve the function is expressed in technical terms by saying that the PM implements the RM; alternatively, that the PM refines the RM. It doesn’t do so as is, of course, without further assumptions, such as for example that the humans in the process aren’t deliberately trying to sabotage the function, or – one assumption which may particularly concern us when thinking of this sequence of accidents – that they don’t make random transcription errors and then read through those errors when cross-checking. Because humans are involved, and will stay involved, in the EMC function, and because the humans involved (the pilots) must supervise the process (amongst other things, they carry legal responsibility), we impose the additional constraints that the PM must implement the CM, and the CM must implement the RM.
There is a branch of computer science, which arose about forty years ago, concerned with checking whether programs achieve their goals. It is called formal verification. Techniques of formal verification can identify exactly where assumptions are needed in order to show that the PM implements the RM. Indeed, one can break down the PM into various subsidiary actions, like so-called procedures in a computer program. One can then consider the PM as composed of these subsidiary actions, and separately each of the subsidiary actions, thus breaking down the task of considering the whole into a number of distinct parts. One considers, for each subsidiary action, the precondition for it to be started: how the world has to look when you start the action; and at the end of it the postcondition: what the action has accomplished when it finishes. Then you can chain them together by ensuring that the postcondition of one action ensures that the precondition of the following action holds. This means of approaching the verification of programs is known as Floyd/Hoare logic, after its discoverers. Floyd/Hoare logic has a venerable and distinguished history in reliable computing. Like calculus, it is a standard technique which is unlikely ever to go out of favor.
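Here is a minimal Floyd/Hoare-style sketch of two chained subsidiary actions, written in Python with assertions standing in for preconditions and postconditions. It is an illustration of the technique, not an excerpt from our paper, and the state names and values are invented.

```python
# Each subsidiary action is a function: the assertions on entry are its
# precondition, the assertions on exit its postcondition, and chaining works
# because each postcondition establishes the next action's precondition.

def transcribe_loadsheet_to_card(state: dict) -> dict:
    # Precondition: the final load sheet has been delivered.
    assert "loadsheet_tow" in state
    # Action: the crew copy the TOW onto the take-off data card.
    state["card_tow"] = state["loadsheet_tow"]
    # Postcondition: the card carries the load-sheet figure.
    assert state["card_tow"] == state["loadsheet_tow"]
    return state

def enter_tow_into_fmc(state: dict, typed_value: float) -> dict:
    # Precondition: established by the previous action's postcondition.
    assert state["card_tow"] == state["loadsheet_tow"]
    # Action: the crew type a value into the FMC.  Nothing in the procedure
    # itself forces typed_value == card_tow; that is precisely where an
    # assumption (error-free transcription, or an independent cross-check)
    # is needed for the proof to go through.
    state["fmc_tow"] = typed_value
    # Desired postcondition: the FMC holds the load-sheet TOW.
    assert state["fmc_tow"] == state["loadsheet_tow"], "transcription error not caught"
    return state

state = transcribe_loadsheet_to_card({"loadsheet_tow": 360_000})
enter_tow_into_fmc(state, typed_value=360_000)    # postcondition holds
# enter_tow_into_fmc(state, typed_value=260_000)  # would fail: the logic points
#                                                 # at exactly the missing assumption
```

The commented-out failing case is the point: the logic does not just say “be careful”, it tells you exactly at which step an assumption or an added check is needed.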
Our paper points to certain obvious problems in the pre-flight TOW data entry. For example, we did some information-flow analysis. Information-flow analysis is a more recent technique, devised by Bernard Carré a quarter-century ago, and used by such ultrareliable-program development systems as Praxis’s SPARK (whose tool also uses Floyd/Hoare logic). We didn’t need to go too deep into our analysis to find some issues which need addressing.
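As a flavour of what information-flow analysis gives you, here is a toy dependency table in Python. The quantities and their dependencies are assumed for illustration, and the notation is neither SPARK’s nor Carré’s; the question it answers is simply which raw inputs each computed quantity ultimately depends on.

```python
# Toy information-flow sketch: trace each computed quantity back to raw inputs.
# Names and dependency structure are assumptions for illustration only.

DEPENDS_ON = {
    "fmc_tow":        {"typed_tow"},                  # one manual entry
    "vr":             {"fmc_tow", "flap", "runway"},
    "v1":             {"fmc_tow", "flap", "runway"},
    "thrust_setting": {"fmc_tow", "temperature"},
}

def raw_sources(quantity: str) -> set:
    """Transitively expand dependencies down to raw inputs."""
    out = set()
    for dep in DEPENDS_ON.get(quantity, set()):
        out |= raw_sources(dep) if dep in DEPENDS_ON else {dep}
    return out

print(sorted(raw_sources("vr")))              # ['flap', 'runway', 'typed_tow']
print(sorted(raw_sources("thrust_setting")))  # ['temperature', 'typed_tow']
# Vr, V1 and the thrust setting all trace back to the same single, manually
# typed TOW value: an information-flow red flag if nothing independently
# checks that entry.
```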
The SOPs (and thereby the CM) from which we worked identified only two sets of values for the parameters used to calculate TOW: those in the preliminary load manifest and those in the final load manifest. So the CM has two sets of values: call them preliminary and final.
However, looking in detail at the SOPs and analysing what would happen if some human were to put in random values at various places, we could see that there were at least five quantities floating around the procedures which were all supposed to be the same as either the preliminary or the final manifest figures, but there was nothing in place to ensure (or “coerce”, as computer scientists prefer to say) that these values actually were the target values. In jargon, there were five independent value sets in the PM which must be coerced into at most two sets in the CM. This is known as a “data integrity” issue, and techniques to solve it abound in the technical literature on fault tolerance. One can address distractions during human tasks (by invoking techniques such as “rollback”) and “finger trouble” (by including independent “sanity checks” at strategic points). And the Floyd/Hoare logic would then tell one whether one has resolved the integrity issue or not.
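Here is what such a coercion looks like as a sanity check, again as an illustrative Python sketch. The copy names are invented; a real SOP has its own natural places for the check and for a rollback to re-entry.

```python
# Illustrative data-integrity sketch: every copy of the final TOW floating
# around the procedure must agree with the final manifest figure before the
# values are allowed to drive the Vr and thrust calculations.  Copy names
# are assumptions for illustration, not taken from any particular SOP.

FINAL_COPIES = ["datacard_tow", "perf_tool_tow", "fmc_perf_tow", "fmc_init_tow"]

def coerce_final_tow(state: dict) -> float:
    """Sanity check: every copy of the final TOW must equal the final manifest."""
    reference = state["final_manifest_tow"]
    divergent = {name: state[name] for name in FINAL_COPIES if state[name] != reference}
    if divergent:
        # Integrity violated: roll back to the data-entry step instead of
        # letting a divergent copy silently drive Vr and thrust.
        raise ValueError("TOW copies disagree with final manifest: " + str(divergent))
    return reference

good = {"final_manifest_tow": 360_000, "datacard_tow": 360_000,
        "perf_tool_tow": 360_000, "fmc_perf_tow": 360_000, "fmc_init_tow": 360_000}
bad = dict(good, fmc_init_tow=260_000)        # one mistyped copy

print(coerce_final_tow(good))                 # 360000
try:
    coerce_final_tow(bad)
except ValueError as err:
    print(err)                                # pinpoints the divergent copy
```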
Sounds simple, doesn’t it? We note that for modern complex computer programs, with many thousands to millions of lines of code, it is not as simple as it looks. But we believe that for EMC functions typical of SOPs it is relatively simple. Many SOPs exhibit a complexity no greater than that of the kinds of examples one finds in textbooks and tutorials on Floyd/Hoare techniques.
There are lots of organisations with expertise in these methods, often with their own highly developed software toolsets. The SRI Computer Science Lab in California and Praxis High-Integrity Systems in the UK are two of the best known. And, of course, the tech-transfer firm Causalis Limited associated with my research group. My point here is not to advertise, but rather to persuade readers that we are advocating an approach to avoiding tailstrike problems that is, for the scale of the application, mature in engineering terms, while being the least expensive of the available options for addressing the issue, and a lot less expensive than the likely consequences of continuing with the present strategy!