Just over a decade ago, in July 2002, there was a catastrophic mid-air collision between a Russian passenger aircraft heading westwards and a DHL freighter aircraft heading northwards, near the town of Überlingen on Lake Constance (Bodensee) in southern Germany, close to the Swiss border. I wrote a paper on it about a month later, ACAS and the South German Midair, RVS Technical Note RVS-Occ-02-02, dated 12 August 2002, in which I suggested that there were issues concerning the verification of the algorithms used in TCAS, as well as the assumptions about cockpit decision-making upon which the successful use of TCAS depends.
In May 2004 the final report of the investigating body, the German BFU, was published. It is 114pp long in English, not counting the appendices. There are mistakes in it, one of which I had already anticipated in my August 2002 note. I then wrote a paper based on my 2002 note, Causal Analysis of the ACAS/TCAS Sociotechnical System, which accompanied an Invited Talk I gave at the Ninth Australian Workshop on Safety-Related Programmable Systems in Brisbane, Australia, in 2004. This paper is also available on the Publications page of the RVS WWW site.
Neale Fulton, a colleague at the Australian national research agency CSIRO in Canberra, who has been working on algorithms for proximity/collision avoidance for some years, recently told me of a paper by Peter Brooker in the journal Safety Science 46(10), December 2008, entitled The Überlingen Accident: Macro-Level Safety Lessons, which refers to my work. That was four years ago. Brooker apparently says some things about my work.
I haven’t seen the paper. Gone seem to be the old courtesies by which one forwarded a copy of an academic paper to a colleague whose work was discussed. Our library used to subscribe to the journal, until 2002, but I suppose it became too expensive. It is certainly expensive now: the publisher Elsevier wishes to charge me (or my library) €31.50 for this paper of about 15 pages. As I have said before, I don’t agree with the current commercial politics of many academic publishing houses. Not all authors do as I do and ensure that some version of a published paper appears for free on a WWW site under the auspices of the taxpayer-funded organisations that pay me a salary for this work. I hope Professor Brooker will understand me seasonally donating to charity the €31.50 I have saved by not buying his paper.
Brooker says some odd things about my work. Also, in 2008 the TCAS standard was amended. So it seems time to revisit those considerations.
There is now a TCAS II Minimum Operational Performance Standard, RTCA/DO-185B. There is an FAA Technical Standard Order (TSO), TSO-C119c, and a corresponding EASA ETSO-C119c, for TCAS II Version 7.1, as it is now called, which includes two changes, detailed in Change Proposals CP112E and CP115, as described in this Honeywell white paper. CP112E is directly relevant to the Überlingen accident, as discussed below.
There are three main points which I wish to address again.
First, I pointed out in my 2004/5 paper (Section 3) that use of TCAS played a direct causal role in the accident. To phrase it technically, the use of TCAS was a necessary causal factor in the collision. I proved this by means of the Counterfactual Test. However, amongst the probable causes which the BFU report lays out, this factor is missing. That is a logical mistake.
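For readers who have not met the Counterfactual Test, the reasoning goes like this: a factor is a necessary causal factor of an outcome if, when that factor is removed from the course of events and everything else is kept as it was, the outcome no longer occurs. Here is a minimal sketch of that test in Python. The factor names and the collision predicate are a deliberate caricature invented for illustration; they are not a WBA causal model of the accident.

```python
# Toy illustration of the Counterfactual Test; not a WBA causal model.
# A "world" is the set of factors that held on the day; collision() is a
# crude, invented stand-in for "the accident happens in this world".

def collision(factors):
    # Caricature: the Tu-154 descends on the repeated ATC instruction;
    # the B757 descends because its TCAS issues a descend RA.
    tu154_descends = "atc_descend_instruction" in factors
    b757_descends = "tcas_in_use" in factors
    return tu154_descends and b757_descends

actual_world = {"tcas_in_use", "atc_descend_instruction"}

def necessary_causal_factor(factor, world):
    """Counterfactual Test: remove the factor, keep everything else the
    same, and ask whether the outcome still occurs."""
    assert collision(world)                  # the outcome did in fact occur
    return not collision(world - {factor})   # outcome vanishes => necessary

# In this toy model, without the use of TCAS the B757 has no descend RA,
# the collision does not occur, and the test says "necessary causal factor":
print(necessary_causal_factor("tcas_in_use", actual_world))             # True
# The ATC instruction comes out as a necessary causal factor too:
print(necessary_causal_factor("atc_descend_instruction", actual_world)) # True
```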
I still encounter many technical people in aviation who refuse to accept this observation. I fail to understand why the proof is not routinely accepted. Few seem willing to say in public that use of TCAS was a necessary causal factor in the accident. Maybe politics and wishful thinking triumph over logic once again?
Second, Issue 4.1 of my paper concerns the fact that the Reversal RA mechanism apparently did not operate as it should have. I labelled this a requirements problem: the kit as designed did not behave in the way the requirement intended. People have waffled about this too, but here is the BFU report telling us that the failure to issue a Reversal RA was a necessary causal factor of the collision, according to the Counterfactual Test:
A Eurocontrol specialist team has analysed the accident based on three TCAS simulations. Three different data sources and two different analysing tools for TCAS II were used. It is the BFU’s opinion that the following important insights can be drawn from the Eurocontrol study:
The analysis confirmed that the TA’s and RA’s in both airplanes were triggered according to the design of the CAS-logic
The simulation and the analysis of the alert sequence showed that the initial RA’s would have ensured a safe vertical separation of both airplanes if both crews had followed the instructions accurately.
Moreover, Eurocontrol conducted a further analysis how TCAS II would have reacted in this case with the modification CP 112 which had already been developed prior to the accident. According to the results provided, TCAS would have generated a Reversal RA after the initial RA which would have led to a sufficient vertical separation of both aircraft if the Boeing B757-200 [the DHL freighter] crew would have reacted according to the Reversal RA.
Despite this clear statement, this necessary causal factor did not appear amongst the causes in Section 3 of the BFU report.
In fact, it was known to Eurocontrol in 2000 that Reversal RAs did not function as desired. In engineering-scientific parlance, the design of TCAS did not fulfil its requirements specification. Eurocontrol filed a change notice with the committee, CP 112, to get this fixed. Two years later, two years after the problem was first openly acknowledged, the Überlingen collision occurred. Then there were other near-misses, detailed in the Eurocontrol SIRE+ project. Finally, in 2008, RTCA accepted the amended CP 112 (as CP 112E) along with another Change Proposal (CP 115), resulting in TCAS II Version 7.1 (some issues are detailed in the document Decision criteria for regulatory measures on TCAS II version 7.1 by Stéphan Chabert & Hervé Drévillon).
The anomaly was known in 2000. A major accident in which it was a causal factor occurred in 2002. The change was made in 2008. I think it is a scandal that it took so long to remedy this anomaly and that so many people were killed along the way.
Third, Issue 4.5 of my paper concerned the cognitive state of the operators (the crews) and the decisions they took. I used an analysis method which I called the Rational Cognitive Model (RCM). Intuitively, it works like this. Suppose the operators were replaced by perfect robots with the same cognitive information, programmed with the TCAS operator procedures as well as algorithms to make decisions according to that information and those procedures. What would the robots do? I pointed out that robots piloting the Russian aircraft might well have chosen to descend, as the Russian crew did, a choice for which the crew has been roundly criticised by all and sundry.
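To give a flavour of the robot-pilot intuition, here is a toy sketch in Python. The instruction items and the decision rule are invented for illustration; this is not the RCM formalism of the papers. The point it makes is the one above: if the robot holds an ATC instruction to descend and a TCAS climb RA, and the procedures it has been given do not impose a strict priority between them, then descending comes out as an admissible, rational choice.

```python
# Toy robot-pilot decision in the spirit of the RCM intuition.  The
# instruction items and the rules are invented for this sketch; they are
# not the formal model of the papers.

from dataclasses import dataclass

@dataclass(frozen=True)
class Instruction:
    source: str      # "ATC" or "TCAS"
    manoeuvre: str   # "climb" or "descend"

def admissible_choices(instructions, procedures):
    """Every manoeuvre a rational robot could defend, given the
    instructions it has received and the procedures it was programmed
    with.  A strict priority rule leaves one choice; without one,
    every instructed manoeuvre can be rationally justified."""
    if "RA_overrides_ATC" in procedures:
        ra = {i.manoeuvre for i in instructions if i.source == "TCAS"}
        if ra:
            return ra
    return {i.manoeuvre for i in instructions}

# The information available to the robot flying the Tu-154 in this toy:
# a repeated ATC clearance to descend and a TCAS climb RA.
info_tu154 = [Instruction("ATC", "descend"), Instruction("TCAS", "climb")]

print(admissible_choices(info_tu154, procedures=set()))
# -> both 'descend' and 'climb': without an unambiguous priority rule,
#    descending is a rational choice, which is the point about the crew.

print(admissible_choices(info_tu154, procedures={"RA_overrides_ATC"}))
# -> only 'climb'
```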
I have subsequently looked at various sociotechnical interactions using RCM. A number of them are analysed in Verbal Communication Protocols in Safety-Critical System Operations, a chapter in the Handbook of Technical Communication, Mouton-de Gruyter, 2012. I have also analysed road accidents, including multiple-vehicle pile-ups on motorways in fog, in The Assurance of Cyber-Physical Systems: Auffahr Accidents and Rational Cognitive Model Checking, which was supposed to be a chapter of a book. I applied RCMs subsequently to same-direction road traffic conflicts (as a bicycle rider, and not necessarily a slow one, I have plenty of experience to draw on). The paper is not yet available.
Ten years on, it is instructive to see how far we have come. I suggested that TCAS be verified using Rational Cognitive Model Checking (RCM-checking). RCM-checking consists in enumerating all the configurations which can occur and determining that the desired operator behaviour under decision-making gives the right outcome in each of them. I exhibited in my 2002 note and 2004 paper, and again explicitly in the 2012 Handbook chapter, a situation in which this “right outcome” cannot be assured, namely the Überlingen situation. The 2012 Handbook-chapter formalism makes clear that this is a (small) finite-state-machine calculation, well within the ability of existing model checkers.
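What such a check amounts to can be shown in a few lines. The sketch below is deliberately tiny, and its state space and separation criterion are invented for illustration (a real check would run over the actual TCAS state machine and the actual procedures); the point is only that the enumeration itself is an ordinary finite-state computation.

```python
# Toy exhaustive check in the spirit of RCM-checking: enumerate every
# combination of crew responses to the initial RAs and ask in which
# combinations the outcome is acceptable.  The state space and the
# separation criterion are invented for this sketch.

from itertools import product

RESPONSES = ("follow_ra", "follow_atc", "do_nothing")

# What each response amounts to, per aircraft, in this toy scenario:
# aircraft A holds a climb RA and an ATC instruction to descend; aircraft
# B holds a descend RA (and would also descend if it deferred to ATC).
MOVE = {
    ("A", "follow_ra"): "climb",
    ("A", "follow_atc"): "descend",
    ("A", "do_nothing"): "level",
    ("B", "follow_ra"): "descend",
    ("B", "follow_atc"): "descend",
    ("B", "do_nothing"): "level",
}

def separated(resp_a, resp_b):
    # Crude criterion: the aircraft end up separated iff they manoeuvre
    # differently.  A real check would use the actual geometry and timing.
    return MOVE[("A", resp_a)] != MOVE[("B", resp_b)]

bad_states = [(a, b) for a, b in product(RESPONSES, repeat=2)
              if not separated(a, b)]
print(bad_states)
# ('follow_atc', 'follow_ra') is among them: both aircraft descend, which
# is the Überlingen configuration that a Reversal RA is meant to resolve.
```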
However, verifying a specific scenario for correctness or anomaly is clearly easier than running through all possible scenarios to check. Could current automated model checkers check and verify all such states for a given system such as TCAS? I put this question to John Rushby, who has applied model checking in similar situations: see, say, his paper from 2002 on Mode Confusion and other automation surprises, of which I saw the original contribution in Liège in 1999. John has been at it three years longer than I have, although I did have a go at Ev Palmer’s “Oops” example, also using WBA and PARDIA, in 1995-6. The latest version of John’s work with Ellen Bass, Karen Feigh and Elsa Gunter is from 2011. John suggested that checking large numbers of RCMs (say, more than 50 or so different scenarios) might well be difficult with current model checkers.
I am disappointed at the meagre take-up of these model-checking approaches to algorithms involving cooperative operator behaviour. The technical material involved is not so very hard: every digital engineer nowadays has to deal with FSMs. Maybe part of the problem is that people still do not consider operator procedures to be subject to the same kinds of verification as other algorithms. Maybe this will change as more and more robots come “on-line” to replace humans in various activities. The safety of their interactions is surely governed by the international standard for functional safety of E/E/PE systems, IEC 61508, although for industrial fixed-base robots a new international standard is being developed. IEC 61508 requires assurance measures; maybe this will prompt interest in verification.
There are apparently still intellectual hurdles to overcome. One seems to lie in persuading people that sociotechnical procedures can be verified in the rigorous way it is (sometimes) done in informatics. Another is apparently to persuade them that doing so would yield any advantage. Which brings me to Brooker’s paper. Neale sent me an excerpt. Brooker takes exception to what I suggested should be done, namely:
1. Check and fix the Reversal RA misfit so that design fulfils requirement.
2. Check the interaction between ACAS and Reduced Vertical Separation Minima (RVSM) more thoroughly.
3. Determine precisely in which circumstances ACAS algorithms are correct, and where they fall short.
4. Deconflict requirements and advice to pilots on use of ACAS.
5. Causally analyse the operator interactions using Rational Cognitive Models and decision theory.
6. Analyse carefully what happens when one actor has a false model of system state.
Brooker’s comment on all this: “some of Ladkin’s recommendations may not be very wise”.
Huh?
Brooker explains how he comes to this conclusion by means of an analogy. He discusses in a couple of paragraphs a situation in Ancient Rome, in which bricks would fall off buildings onto or near passers-by. Apparently wives would push their husbands out of the way. He discusses some decision-theoretic aspects of, well, pushing one’s husband out of the way (as opposed, one might think, to pushing him under).
No arguments for relevance of this situation to that of ACAS are proffered.
So I have to look for clues around and about. Brooker says: “Ladkin says that it ‘should be precisely determined in which circumstances ACAS algorithms are correct and in which circumstances they fail’. But the first task is precisely what has been done under ICAO’s auspices for decades (Carpenter, 2004).” I take it from this suggestion that Brooker has little idea of what is involved in verifying algorithms, as that term is understood in informatics. And I take it he is not familiar with my work, despite citing me, or with that of Rushby.
I recommend that people take a look at Fulton’s work on collision-avoidance to see what such algorithm verification might look like. And, for those who are unfamiliar with it, at Rushby’s and my work to see some ways of verifying procedures involving operator decisions.
As I indicated, I think that the poor TCAS/ACAS engineering standards which were causally involved in the deaths of 70-odd people ten years ago are a scandal, as is the fact that it took a further six years to start to fix them. We are on the way to developing techniques which can be used to avoid such poor engineering in the future. I think that work should be encouraged. I don’t see any point in denigrating that endeavour through facile commentary.