The Airbus A330-303 VH-QPA, operating as Qantas flight QF72, experienced uncommanded nose-down pitch commands while in cruise at FL370. Many unrestrained occupants were thrown to the ceiling, and some were seriously injured. The crew declared an emergency and landed as soon as practicable at Learmonth, where the injured were treated and several were hospitalised. It has been known for a while that the accident was caused by data anomalies from an air data inertial reference unit (ADIRU) which were not filtered out by the primary flight control computers (Flight Control Primary Computers, FCPC, also known as PRIM). However, how the anomalous data values were generated has been, and remains, a mystery. The anomaly has occurred three times: twice with the unit on VH-QPA, and once with another unit on another aircraft, also Qantas, also in Western Australia, within a couple of months of this incident.
The fix is apparently to modify the built-in test equipment (BITE) of the ADIRU specifically to look for such anomalies, and to modify the data-filtering algorithms of the FCPCs of the A330.
The Final Report is now available on the ATSB WWW site.
There was a note from Andrew Heasley in Risks 26-67 with a title saying the accident was “Blamed on Software”, pointing to a newspaper article. I find this claim misleading. The problem which arose had nothing to do with anything for which any software engineer would have been responsible.
The fixes were implemented in both SW and HW, but fixes to non-SW problems are very often implemented in SW.
The PRIMs ran a data-assurance algorithm for data received from three different ADIRUs, which are electronic boxes built by a different manufacturer. This data assurance algorithm had a specific vulnerability to spiky angle-of-attack (AoA) data presented in a particular time-sequential manner, which was exploited during the occurrence. The algorithm, which uses AoA data from three ADIRUs, filters out multiple data spikes from a unit which occur within a specific time frame. Spikes on the culprit ADIRU occurred with similar values just over the boundary of this time frame, and were thus taken as veridical by the PRIMs. The resolution algorithms for the AoA data (with that from the other ADIRU units) in the PRIMs let these values through, and the PRIMs reacted accordingly by commanding sudden nose-down pitch.
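The vulnerability can be illustrated with a toy model. The sketch below is not the FCPC algorithm: it borrows the average of AOA 1 and AOA 2 and the 1.2-second memorised value from the executive summary quoted further down, and assumes everything else, namely the sample interval, the deviation threshold, the comparison against the median of the three values, and the rule that the first value after the memorisation period is accepted without a fresh consistency check. All names are hypothetical.

```python
HOLD_S = 1.2          # memorisation period, from the executive summary
THRESHOLD_DEG = 5.0   # assumed deviation threshold (not from the report)


def filtered_aoa(samples, dt=0.1):
    """Toy consistency filter over (aoa1, aoa2, aoa3) samples taken every dt s.

    Returns the AoA value the toy filter passes on at each step.
    """
    hold_steps = round(HOLD_S / dt)
    out = []
    memorised = None    # last value that passed the consistency check
    release_at = None   # index of the first sample after the hold period
    for i, (aoa1, aoa2, aoa3) in enumerate(samples):
        current = (aoa1 + aoa2) / 2
        if release_at is not None and i < release_at:
            out.append(memorised)        # hold: ignore incoming values
            continue
        if release_at is not None and i == release_at:
            release_at = None
            out.append(current)          # ASSUMED flaw: accepted unchecked
            continue
        median = sorted((aoa1, aoa2, aoa3))[1]
        if max(abs(aoa1 - median), abs(aoa2 - median)) > THRESHOLD_DEG:
            release_at = i + hold_steps  # start the memorisation period
            out.append(memorised)
        else:
            memorised = current
            out.append(current)
    return out


normal, spike = (2.0, 2.0, 2.0), (50.0, 2.0, 2.0)

single = [normal] * 30
single[5] = spike                        # one isolated spike on AOA 1

paired = [normal] * 30
paired[5] = spike
paired[17] = spike                       # second spike 1.2 s after the first

print(max(filtered_aoa(single)))         # isolated spike is masked
print(max(filtered_aoa(paired)))         # second spike gets through
```

In this toy model a single spike, or a second spike falling inside the memorisation period, never reaches the output, while a second spike arriving just as the period ends is passed on as if it were good data.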
Responsibility for the design of such algorithms lies clearly with those who are experts on the engineering of electronic data generation and transmission equipment, not with any software engineers.
To give a similar example with which I have recently been involved, it turns out that signals of certain frequencies in AC electric circuits can bypass the Type A and Type B circuit protection equipment (circuit breakers) that are required in most electric circuits (household and industrial) in Germany. A committee on which I sit has recently considered attaching equipment which is, as far as we know, theoretically capable of generating such frequencies to such circuits. It is a similar situation, namely how to handle anomalous signals, but with no SW in sight. Pure electrical engineering.
Concerning my earlier note here on Certification Requirements for Commercial Airplanes, I find it interesting and commendable that the Bureau considered likelihoods of events in their summary (quoted below). However, I don’t believe they formulated it in quite the words I would have liked to have read.
They give reason to classify the event as “hazardous”, and with a fleet operating experience of 28 million flight hours this occurrence fits within the expected value (a technical term) of the operating time within which the effects of a hazardous event may occur (defined to be less than or equal to one occurrence within ten million operating hours), according to the acceptable means to determine compliance with certification criteria (now known as AMC 25). Notice it is not the event itself of which they assess the occurrence – that has occurred three times – but the deleterious effects upon safety of the event, which have only occurred once.
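The arithmetic behind this reading can be sketched in a few lines. The figures are those in the Bureau's summary (one occurrence of the hazardous effect in 28 million fleet flight hours) and the bound quoted above (at most one occurrence per ten million operating hours); the function and variable names are hypothetical, chosen for illustration.

```python
# Bound quoted above for a "hazardous" effect: at most one per 10 million hours.
HAZARDOUS_MAX_RATE = 1 / 10_000_000


def observed_rate(occurrences, operating_hours):
    """Crude point estimate: occurrences per operating hour."""
    return occurrences / operating_hours


def expected_occurrences(rate_per_hour, operating_hours):
    """Expected number of occurrences at a given rate over the given hours."""
    return rate_per_hour * operating_hours


fleet_hours = 28_000_000   # A330/A340 flight hours cited in the summary
effects_seen = 1           # hazardous effects (not ADIRU anomalies) seen so far

rate = observed_rate(effects_seen, fleet_hours)
allowed = expected_occurrences(HAZARDOUS_MAX_RATE, fleet_hours)

print(f"observed rate: {rate:.2e} per hour")
print(f"occurrences the bound allows over the fleet hours: {allowed:.1f}")
print("within bound:", rate <= HAZARDOUS_MAX_RATE)
```

With these figures the bound would tolerate about 2.8 such occurrences over the fleet's history, against the one observed.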
They speak of “certification requirements”. Strictly speaking, this is incorrect. The certification requirements are expressed in CS 25 and do not involve probabilities. The severity classification terms “catastrophic”, “hazardous”, etc. and their associated acceptable/unacceptable frequencies occur in risk-matrix-type form in the Acceptable Means of Compliance document which accompanies the certification requirements (AMC 25), not in the requirements themselves. (I note that these documents were called something slightly different at A330 certification time, 1993.)
The certification requirements themselves are quite clear: the airplane shall behave in such-and-such a manner. If a wing falls off, or a flight control computer sends it into a loop, it is obviously not behaving in that manner, and thus violates the certification requirements. However, it is accepted that one cannot provide proof that such untoward things will never ever happen (Will the sun rise tomorrow? Will your steering wheel come off in your hands? Will your control sidestick come out of its holder in your hand?), so a less strenuous regime based on arguing likelihoods is defined as an “Acceptable Means of Compliance” with the regulations for the purpose of certification.
This is not hair-splitting. It has consequences, in particular in this case, for how anomalies are dealt with, as follows.
If the requirement were that, say, “hazardous effects shall only occur on average once in between 10^7 and 10^9 operating hours”, which is what the AMC says you have to show to demonstrate compliance acceptably, then it would have been open to the manufacturer to do nothing in reaction to the QF72 event: the hazardous effects occurred only within the expected time value of their occurrence. If you think about it, it would also be open to a manufacturer to do nothing until the second occurrence of any hazardous or indeed catastrophic effects, even if the problem occurred first within the early experience of flying the aircraft! This is simply a consequence of the meaning of the probabilistic concepts used.
As things now stand, however, with requirements (which are absolute) separated from acceptable means of compliance (which may be based on occurrence frequency), any in-flight anomalous behavior must be fixed or the airworthiness certificate will be withdrawn. This is because such behavior violates the written requirements, that the aircraft shall not behave that way. To repeat, the conditions on behavior are absolute, not likelihood-based.
And that is how one wants things: The requirements are absolute, but it is accepted that in science and engineering you are often only convinced to some degree, so it is regarded as acceptable to argue your conviction up to a certain degree, and not to have to prove it, which would likely be impossible. But if something does go wrong, you want it fixed right away.
One can argue that any given set of occurrences is compatible with any probability requirement whatever, and thus that probabilistic requirements are inappropriate to determine airworthiness in any case. However, I don’t think such an argument works. Say these three events had occurred within 3 million operating hours, each with damage. One could estimate the likelihood that a piece of equipment fulfilling the condition of an expected value of at most once in 10 million operating hours would exhibit three events within 3 million operating hours. One would conclude that it is unlikely, say with small probability P. It follows that the situation that the aircraft fulfills the acceptable-compliance criterion has the same probability P. The small probability P that the aircraft acceptably complied with certification requirements would provide good reason for withdrawing the airworthiness certificate.
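To put a number on P, one can model occurrences as a Poisson process, which is an assumption made here for illustration and not something stated in the report or in AMC 25. At a rate of one per 10 million hours, the expected count in 3 million hours is 0.3, and P is taken as the probability of seeing three or more events. The function name below is hypothetical.

```python
import math


def prob_at_least(k, rate_per_hour, hours):
    """P(N >= k) for a Poisson count N with mean rate_per_hour * hours."""
    mean = rate_per_hour * hours
    below_k = sum(
        math.exp(-mean) * mean**n / math.factorial(n) for n in range(k)
    )
    return 1.0 - below_k


# Hypothetical scenario from the text: three events in 3 million hours,
# for equipment that just meets the one-per-10-million-hours condition.
p = prob_at_least(3, 1 / 10_000_000, 3_000_000)
print(f"P(at least 3 events) = {p:.4f}")
```

Under that assumption P comes out at roughly 0.0036, well under one percent, for equipment sitting exactly at the bound; equipment with a lower true rate would give a smaller value still.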
Concerning the data anomaly itself stemming from the ADIRU, its cause remains a mystery. The report says:
Some of the potential triggering events examined by the investigation included a software ‘bug’, software corruption, a hardware fault, physical environment factors (such as temperature or vibration), and electromagnetic interference (EMI) from other aircraft systems, other on-board sources, or external sources (such as a naval communication station located near Learmonth). Each of these possibilities was found to be unlikely based on multiple sources of evidence. The other potential triggering event was a single event effect (SEE) resulting from a high-energy atmospheric particle striking one of the integrated circuits within the CPU module. There was insufficient evidence available to determine if an SEE was involved, but the investigation identified SEE as an ongoing risk for airborne equipment.
The report says that the manufacturer is developing a modification to the BITE to detect such failure modes:
Without knowing the exact failure mechanism, there was limited potential for the ADIRU manufacturer to redesign units to prevent the failure mode. However, it will develop a modification to the BITE to improve the probability of detecting the failure mode if it occurs on another unit.
Here is the executive summary. It is well and concisely written. I include the three paragraphs about seat belts and the investigative process for completeness.
Executive Summary
At 0132 Universal Time Coordinated (0932 local time) on 7 October 2008, an Airbus A330-303 aircraft, registered VH-QPA and operated as Qantas flight 72, departed Singapore on a scheduled passenger transport service to Perth, Western Australia. At 0440:26, while the aircraft was in cruise at 37,000 ft, ADIRU 1 started providing intermittent, incorrect values (spikes) on all flight parameters to other aircraft systems. Soon after, the autopilot disconnected and the crew started receiving numerous warning and caution messages (most of them spurious). The other two ADIRUs performed normally during the flight.
At 0442:27, the aircraft suddenly pitched nose down. The FCPCs commanded the pitch-down in response to AOA data spikes from ADIRU 1. Although the pitch-down command lasted less than 2 seconds, the resulting forces were sufficient for almost all the unrestrained occupants to be thrown to the aircraft’s ceiling. At least 110 of the 303 passengers and nine of the 12 crew members were injured; 12 of the occupants were seriously injured and another 39 received hospital medical treatment. The FCPCs commanded a second, less severe pitch-down at 0445:08.
The flight crew’s responses to the emergency were timely and appropriate. Due to the serious injuries and their assessment that there was potential for further pitch-downs, the crew diverted the flight to Learmonth, Western Australia and declared a MAYDAY to air traffic control. The aircraft landed as soon as operationally practicable at 0532, and medical assistance was provided to the injured occupants soon after.
FCPC design limitation
AOA is a critically important flight parameter, and full-authority flight control systems such as those equipping A330/A340 aircraft require accurate AOA data to function properly. The aircraft was fitted with three ADIRUs to provide redundancy and enable fault tolerance, and the FCPCs used the three independent AOA values to check their consistency. In the usual case, when all three AOA values were valid and consistent, the average value of AOA 1 and AOA 2 was used by the FCPCs for their computations. If either AOA 1 or AOA 2 significantly deviated from the other two values, the FCPCs used a memorised value for 1.2 seconds. The FCPC algorithm was very effective, but it could not correctly manage a scenario where there were multiple spikes in either AOA 1 or AOA 2 that were 1.2 seconds apart.
Although there were many injuries on the 7 October 2008 flight, it is very unlikely that the FCPC design limitation could have been associated with a more adverse outcome. Accordingly, the occurrence fitted the classification of a ‘hazardous’ effect rather than a ‘catastrophic’ effect as described by the relevant certification requirements. As the occurrence was the only known case of the design limitation affecting an aircraft’s flightpath in over 28 million flight hours on A330/A340 aircraft, the limitation was within the acceptable probability range defined in the certification requirements for a hazardous effect.
As with other safety-critical systems, the development of the A330/A340 flight control system during 1991 and 1992 had many elements to minimise the risk of a design error. These included peer reviews, a system safety assessment (SSA), and testing and simulations to verify and validate the system requirements. None of these activities identified the design limitation in the FCPC’s AOA algorithm.
The ADIRU failure mode had not been previously encountered, or identified by the ADIRU manufacturer in its safety analysis activities. Overall, the design, verification and validation processes used by the aircraft manufacturer did not fully consider the potential effects of frequent spikes in data from an ADIRU.
ADIRU data-spike failure mode
The data-spike failure mode on the LTN-101 model ADIRU involved intermittent spikes (incorrect values) on air data parameters such as airspeed and AOA being sent to other systems as valid data without a relevant fault message being displayed to the crew. The inertial reference parameters (such as pitch attitude) contained more systematic errors as well as data spikes, and the ADIRU generated a fault message and flagged the output data as invalid. Once the failure mode started, the ADIRU’s abnormal behaviour continued until the unit was shut down. After its power was cycled (turned OFF and ON), the unit performed normally.
There were three known occurrences of the data-spike failure mode. In addition to the 7 October 2008 occurrence, there was an occurrence on 12 September 2006 involving the same ADIRU (serial number 4167) and the same aircraft. The other occurrence on 27 December 2008 involved another of the same operator’s A330 aircraft (VH-QPG) but a different ADIRU (serial number 4122). However, no factors related to the operator’s aircraft configuration, operating practices or maintenance practices were found to be associated with the failure mode.
Many of the data spikes were generated when the ADIRU’s central processor unit (CPU) module intermittently combined the data value from one parameter with the label for another parameter. The exact mechanism that produced this problem could not be determined. However, the failure mode was probably initiated by a single, rare type of trigger event combined with a marginal susceptibility to that type of event within the CPU module’s hardware. The key components of the two affected units were very similar, and overall it was considered likely that only a small number of units exhibited a similar susceptibility.
Some of the potential triggering events examined by the investigation included a software ‘bug’, software corruption, a hardware fault, physical environment factors (such as temperature or vibration), and electromagnetic interference (EMI) from other aircraft systems, other on-board sources, or external sources (such as a naval communication station located near Learmonth). Each of these possibilities was found to be unlikely based on multiple sources of evidence. The other potential triggering event was a single event effect (SEE) resulting from a high-energy atmospheric particle striking one of the integrated circuits within the CPU module. There was insufficient evidence available to determine if an SEE was involved, but the investigation identified SEE as an ongoing risk for airborne equipment.
The LTN-101 had built-in test equipment (BITE) to detect almost all potential problems that could occur with the ADIRU, including potential failure modes identified by the aircraft manufacturer. However, none of the BITE tests were designed to detect the type of problem that occurred with the air data parameters.
The failure mode has only been observed three times in over 128 million hours of unit operation, and the unit met the aircraft manufacturer’s specifications for reliability and undetected failure rates. Without knowing the exact failure mechanism, there was limited potential for the ADIRU manufacturer to redesign units to prevent the failure mode. However, it will develop a modification to the BITE to improve the probability of detecting the failure mode if it occurs on another unit.
Use of seat belts
At least 60 of the aircraft’s passengers were seated without their seat belts fastened at the time of the first pitch-down. Consistent with previous in-flight upset accidents, the injury rate, and injury severity, was substantially greater for those who were not seated or seated without their seat belts fastened.
Passengers are routinely reminded every flight to keep their seat belts fastened during flight whenever they are seated, but it appears some passengers routinely do not follow this advice. This investigation provided some insights into the types of passengers who may be more likely not to wear seat belts, but it also identified that there has been very little research conducted into this topic by the aviation industry.
Investigation process
The Australian Transport Safety Bureau investigation covered a range of complex issues, including some that had rarely been considered in depth by previous aviation investigations. To do this, the investigation required the expertise and cooperation of several external organisations, including the French Bureau d’Enquêtes et d’Analyses pour la sécurité de l’aviation civile, US National Transportation Safety Board, the aircraft and FCPC manufacturer (Airbus), the ADIRU manufacturer (Northrop Grumman Corporation), and the operator.