Further Comment on the IEEE Spectrum article concerning MCAS

Gregory Travis has responded to my comments in the Risks Forum Digest at https://catless.ncl.ac.uk/Risks/31/22#subj23 . He includes a wealth of interesting new information. He only disagrees with one of the points I made concerning the accuracy of his article, namely the categorisation of the frequency of failure of AoA sensors. I said

Travis suggests AoA sensors are unreliable: “..particular angle of attack sensor goes haywire — which happens all the time”. It does not happen “all the time”, or even very often. Peter Lemme writes: Reliability of the AoA sensor was evaluated over a 4-6 year period, with a mean time between unscheduled removals was 93,000 hours. A typical airframe is modeled at about 100,000 hours, so the AoA vane typically last nearly the lifetime of the airplane.

Travis says

Angle of attack sensor failure is common, contrary to assertions
otherwise. The service difficulty database has about 200 entries and that
typically represents 5% of the real-world situation. Frozen water
(heater failure) in the system is a very common failure cause.

Fortunately, we don’t have to determine the meaning of “all the time”, “not very often”, “common”, and so on, because failure rates have already been classified in the airworthiness certification regulations, in order to enable FMEA of critical kit. The technical terms are “probable”, “remote”, “extremely remote”, and “extremely improbable”; their definitions from EASA airworthiness requirements are given in the Appendix below.

Clive Leyman has suggested privately that “downstream cascade effects” are usually to be considered when assessing failures and their rates. MCAS failure is a “downstream cascade effect” of (certain types of – see below) AoA sensor failure. Hence AoA sensor failure is germane to an FMEA assessing MCAS failure conditions (this may have been a separate FMEA, or it may have been included in the STS FMEA, since MCAS is considered to be an STS function, as I noted in my previous post). It is worth considering this aspect rather more carefully.

Lemme suggests a mean time between unscheduled removals of 93,000 hours; Travis suggests 200 service difficulty reports (SDRs) in an unspecified number of flight hours. Another source can be found in the article “Redline: the many human errors that brought down the 737 Max” by Darryl Campbell in The Verge, brought to my attention by John Downer. It discusses the severity classification of MCAS failure and the reliability of AoA sensors in a way in which the other articles I have read so far do not. In particular, it refers to an article on AoA sensors in HeraldNet (no author given), which commences with an incident involving a Lufthansa A320 in 2014, which I commented on at the time for the German news weekly Der Spiegel. HeraldNet tells us

Angle-of-attack sensors have been flagged as having problems more than 50 times on U.S. commercial airplanes over the past five years, although no accidents have occurred over millions of miles flown, according to reports made to the Federal Aviation Administration’s Service Difficulty Reporting database. That makes it a relatively unusual problem, aviation experts said — but also one with magnified importance because of its prominent role in flight software.

So, 50 SDRs (to contrast with Travis’s 200). No flight-hours estimate, but Campbell says

…over the last five years, 50 flights on US commercial airplanes experienced AoA sensor issues, or about one failure for every 1.7 million commercial flight-hours. Sure, that’s a low rate, but it’s still nearly six times above what the FAA allows for “hazardous” systems: they’re only supposed to fail once every 10 million flight-hours.

Campbell is referring here to flight-hours numbers from https://www.transtats.bts.gov/TRAFFIC/, the Bureau of Transportation Statistics of the US Department of Transportation. The frequency of occurrence of AoA sensor “service difficulties” in the last five years is, according to these numbers, 1 in 3.4 × 10⁴ flight hours, which translates to a frequency of 2.9 × 10⁻⁵ per flight hour. That is classified as “probable”, and is almost three times the rate for “remote”, and about three hundred times the rate for “extremely remote”.
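The classification just made can be checked with a short sketch. It takes the figure of one service difficulty per 3.4 × 10⁴ flight hours as given, and compares the resulting rate with the quantitative ranges of AMC-25.1309 quoted in the Appendix. The function name `classify` and the constant names are illustrative only, and reading “of the order of” as an exact threshold is a simplifying assumption.

```python
# Illustrative sketch. The boundaries are the AMC-25.1309 quantitative ranges,
# treated here as exact thresholds (the AMC itself says "of the order of").
PROBABLE_FLOOR = 1e-5          # above this: "probable"
REMOTE_FLOOR = 1e-7            # above this, up to 1e-5: "remote"
EXTREMELY_REMOTE_FLOOR = 1e-9  # above this, up to 1e-7: "extremely remote"


def classify(rate_per_flight_hour: float) -> str:
    """Map an average probability per flight hour to its qualitative term."""
    if rate_per_flight_hour > PROBABLE_FLOOR:
        return "probable"
    if rate_per_flight_hour > REMOTE_FLOOR:
        return "remote"
    if rate_per_flight_hour > EXTREMELY_REMOTE_FLOOR:
        return "extremely remote"
    return "extremely improbable"


hours_per_sdr = 3.4e4         # figure taken from the text above
sdr_rate = 1 / hours_per_sdr  # about 2.9e-5 per flight hour

print(f"{sdr_rate:.1e} per flight hour is '{classify(sdr_rate)}'")
# Multiples of the upper bounds for "remote" and "extremely remote":
print(round(sdr_rate / PROBABLE_FLOOR, 1), round(sdr_rate / REMOTE_FLOOR))
```

The two ratios printed at the end correspond to the “almost three times” and “about three hundred times” comparisons.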

This is within a discussion in which Campbell says that MCAS failure was classified as “hazardous” (this conforms with indirect information in my previous post, which quoted an anonymous colleague as saying that an MCAS failure condition was classified as “major” in level flight and “hazardous” in turns). A hazardous condition is required by CS-25.1309 b.(2) to be “extremely remote”, which translates to “an Average Probability Per Flight Hour of the order of 1 x 10⁻⁷ or less, but greater than of the order of 1 x 10⁻⁹” (see Appendix below). CS-25 and its associated Acceptable Means of Compliance (AMC), Book 2 of the CS-25 document, are the EASA airworthiness regulations, which are similar to but not identical with 14 CFR 25. I use them here because they are available and easily accessible as one document through EASA’s WWW page.

I think we can agree, in the aftermath of two fatal accidents, that at least some MCAS failures are “catastrophic”, not merely “hazardous” (see AMC-25.1309 definition in the Appendix below), so this would be one mistake in the FMEA. Another possible mistake could be that a hazardous MCAS failure condition, say due to AoA sensor failure, fails to be “extremely remote” as required by 14 CFR 25.1309 and CS-25.1309.

The numbers we have so far are too crude for any FMEA-type assessment of MCAS failure. We should be more discriminating, as follows. MCAS fails (“experiences a failure condition”) when
* (a) it triggers when actual AoA is not over the trigger threshold; or
* (b) it does not trigger when actual AoA is over the trigger threshold; or
* (c) something not connected to AoA sensing goes wrong.

MCAS failure due to incorrect AoA input is covered by (a) and (b); (c) is not relevant to this discussion.

AoA failures occur when
* (x) you lose a vane; or
* (y) a vane sticks; or
* (z) something else goes wrong (wiring problems, say).

If you lose a vane, you’re likely in case (a), because the counterweight in the sensor pulls it high and if MCAS is using that sensor it will trigger.

If your vane sticks, you could get to case (b) if you actually exceed trigger AoA when MCAS is active; but if you don’t exceed trigger AoA then MCAS won’t malfunction.

Let’s classify AoA failures as
* (m) reading high; or
* (n) reading low.

Then an (x) will lead to an (m) and likely an (a).
A (y) will lead to an (n), which will only lead to a (b) if the actual AoA exceeds the MCAS AoA trigger threshold when the other conditions for MCAS activation are present.
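The case analysis above can be written out as a small sketch. It encodes only the reasoning in this post, not the actual MCAS logic: the names `SensorFailure`, `Reading` and `mcas_failure_condition` are hypothetical, case (z) is left out, and the sketch assumes that MCAS is reading the failed sensor.

```python
from enum import Enum
from typing import Optional


class SensorFailure(Enum):
    LOST_VANE = "x"
    STUCK_VANE = "y"


class Reading(Enum):
    HIGH = "m"
    LOW = "n"


def reading_after(failure: SensorFailure) -> Reading:
    """As argued in the text: a lost vane reads high, a stuck vane reads low."""
    return Reading.HIGH if failure is SensorFailure.LOST_VANE else Reading.LOW


def mcas_failure_condition(
    failure: SensorFailure,
    actual_aoa_over_threshold: bool,
    other_activation_conditions_met: bool = True,
) -> Optional[str]:
    """Return "a", "b", or None (no MCAS failure condition) for one scenario."""
    if not other_activation_conditions_met:
        return None  # MCAS is not in play, so it can neither mis-trigger nor miss
    reading = reading_after(failure)
    if reading is Reading.HIGH and not actual_aoa_over_threshold:
        return "a"  # triggers although actual AoA is below the threshold
    if reading is Reading.LOW and actual_aoa_over_threshold:
        return "b"  # does not trigger although actual AoA is above the threshold
    return None


for failure in SensorFailure:
    for over in (False, True):
        print(failure.name, over, mcas_failure_condition(failure, over))
```

Of the four scenarios enumerated by the loop, only two produce a failure condition: a lost vane with actual AoA below the threshold gives (a), and a stuck vane with actual AoA above the threshold gives (b).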

So MCAS only fails for some AoA failures. Travis mentions in particular frozen AoA sensor vanes, as occurred for example in the November 2008 Air New Zealand/XL Airways handover test flight. But, as we have just seen, a freezing sensor, case (y) above, does not necessarily lead to an MCAS failure, and, if it does, the result is likely to be relatively benign: MCAS doesn’t trigger when its trigger conditions are fulfilled. An MCAS non-activation leads to inappropriate handling characteristics, we are told, and one might well consider this to be a “major” failure condition rather than “hazardous”.

So let’s run with the “hazardous” MCAS failure-conditions-upon-AoA-sensor-failure being exclusively type (a). A type (a) failure condition follows from class (m) failures of AoA sensing. For the hazardous condition to be extremely remote, we would surely need class (m) failures of AoA sensing to be extremely remote. Going with the failure-frequency figure of 2.9 × 10⁻⁵ per flight hour above, that would mean that only 1 in (about) 300 SDRs, or fewer, could involve class (m) failure conditions if these were to be extremely remote.

Is that so? If so, the failure-rate criterion for “hazardous” checks out. If not, then there would have been an additional FMEA failure to ensure that a determined-“hazardous” condition was “extremely remote”.
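The bound of about 1 in 300 comes from dividing the upper limit for “extremely remote” by the overall SDR rate. A minimal sketch of that division follows. It assumes that every SDR counts as an AoA sensor failure and that every class (m) failure produces a type (a) condition; the variable names are illustrative.

```python
SDR_RATE = 2.9e-5                # per flight hour, the figure used in the text
EXTREMELY_REMOTE_CEILING = 1e-7  # per flight hour, from AMC-25.1309

# Largest share of SDRs that may be class (m) (reading high) while the
# class (m) rate still stays at or below the "extremely remote" ceiling.
max_fraction_class_m = EXTREMELY_REMOTE_CEILING / SDR_RATE
one_in = 1 / max_fraction_class_m

print(f"class (m) share of SDRs must be at most {max_fraction_class_m:.4f}")
print(f"that is, about 1 in {one_in:.0f}")
```

The exact quotient is 1 in 290, which the text rounds to 300.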

Appendix: applicable definitions of severities and frequencies of occurrence. I use the applicable airworthiness requirements from EASA, rather than the FAA equivalent. From AMC-25.1309 §5, Definitions:

c. Average Probability Per Flight Hour. For the purpose of this AMC, is a representation of the number of times the subject Failure Condition is predicted to occur during the entire operating life of all aeroplanes of the type divided by the anticipated total operating hours of all aeroplanes of that type (Note: The Average Probability Per Flight Hour is normally calculated as the probability of a Failure Condition occurring during a typical flight of mean duration divided by that mean duration).

In AMC-25.1309 §6 c (2) we find the classification of failure conditions and the qualitative and quantitative definitions of “probabilities”, that is, frequencies of failure conditions:

a. Classifications.

Failure Conditions may be classified according to the severity of their effects as follows:

(1) No Safety Effect: Failure Conditions that would have no effect on safety; for example, Failure Conditions that would not affect the operational capability of the aeroplane or increase crew workload.

(2) Minor: Failure Conditions which would not significantly reduce aeroplane safety, and which involve crew actions that are well within their capabilities. Minor Failure Conditions may include, for example, a slight reduction in safety margins or functional capabilities, a slight increase in crew workload, such as routine flight plan changes, or some physical discomfort to passengers or cabin crew.

(3) Major: Failure Conditions which would reduce the capability of the aeroplane or the ability of the crew to cope with adverse operating conditions to the extent that there would be, for example, a significant reduction in safety margins or functional capabilities, a significant increase in crew workload or in conditions impairing crew efficiency, or discomfort to the flight crew, or physical distress to passengers or cabin crew, possibly including injuries.

(4) Hazardous: Failure Conditions, which would reduce the capability of the aeroplane or the ability of the crew to cope with adverse operating, conditions to the extent that there would be:

(i) A large reduction in safety margins or functional capabilities;

(ii) Physical distress or excessive workload such that the flight crew cannot be relied upon to perform their tasks accurately or completely; or

(iii) Serious or fatal injury to a relatively small number of the occupants other than the flight crew.

(5) Catastrophic: Failure Conditions, which would result in multiple fatalities, usually with the loss of the aeroplane. (Note: A “Catastrophic” Failure Condition was defined in previous versions of the rule and the advisory material as a Failure Condition which would prevent continued safe flight and landing.)

b. Qualitative Probability Terms.

When using qualitative analyses to determine compliance with CS 25.1309(b), the following descriptions of the probability terms used in CS 25.1309 and this AMC have become commonly accepted as aids to engineering judgement:

(1) Probable Failure Conditions are those anticipated to occur one or more times during the entire operational life of each aeroplane.

(2) Remote Failure Conditions are those unlikely to occur to each aeroplane during its total life, but which may occur several times when considering the total operational life of a number of aeroplanes of the type.

(3) Extremely Remote Failure Conditions are those not anticipated to occur to each aeroplane during its total life but which may occur a few times when considering the total operational life of all aeroplanes of the type.

(4) Extremely Improbable Failure Conditions are those so unlikely that they are not anticipated to occur during the entire operational life of all aeroplanes of one type.

c. Quantitative Probability Terms.

When using quantitative analyses to help determine compliance with CS 25.1309(b), the following descriptions of the probability terms used in this requirement and this AMC have become commonly accepted as aids to engineering judgement. They are expressed in terms of acceptable ranges for the Average Probability Per Flight Hour.

(1) Probability Ranges.

(i) Probable Failure Conditions are those having an Average Probability Per Flight Hour greater than of the order of 1 x 10⁻⁵.

(ii) Remote Failure Conditions are those having an Average Probability Per Flight Hour of the order of 1 x 10⁻⁵ or less, but greater than of the order of 1 x 10⁻⁷.

(iii) Extremely Remote Failure Conditions are those having an Average Probability Per Flight Hour of the order of 1 x 10⁻⁷ or less, but greater than of the order of 1 x 10⁻⁹.

(iv) Extremely Improbable Failure Conditions are those having an Average Probability Per Flight Hour of the order of 1 x 10⁻⁹ or less.
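As a rough illustration of how the quantitative ranges line up with the qualitative terms, the following sketch converts each range boundary into an expected number of occurrences over one airframe life. The 100,000-hour life is the figure quoted from Lemme in the main text, not part of the AMC, and the function name `expected_occurrences` is hypothetical.

```python
# Assumption: a 100,000-hour airframe life, the figure quoted from Lemme in
# the main text. It does not come from AMC-25.1309.
AIRFRAME_LIFE_HOURS = 100_000

# Upper bounds of the quantitative ranges (per flight hour), read as exact values.
UPPER_BOUNDS = {
    "remote": 1e-5,
    "extremely remote": 1e-7,
    "extremely improbable": 1e-9,
}


def expected_occurrences(rate_per_flight_hour: float, hours: float) -> float:
    """Expected number of occurrences at a constant rate over the given hours."""
    return rate_per_flight_hour * hours


for term, rate in UPPER_BOUNDS.items():
    count = expected_occurrences(rate, AIRFRAME_LIFE_HOURS)
    print(f"{term}: at most about {count:g} occurrences per airframe life")
```

On this reading, a rate above 1 x 10⁻⁵ per flight hour corresponds to more than one expected occurrence in the life of each aeroplane, which matches the qualitative description of “probable”.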

 

 
