Another Glitch, Same Old Moral

Martyn Thomas chaired a committee convened by the UK Royal Academy of Engineering on infrastructure vulnerabilities to GPS disturbances. The committee reported in March 2011 and Martyn was briefly on the front page of UK news media on March 10, 2011 until the Tohoku event happened the day after.

What Martyn’s committee found was astonishing. For example, there were critical infrastructure functions which their builders and operators were convinced had no connection with any GPS functionality, and which nevertheless stopped working when a GPS jammer was activated. The Committee’s report is well worth reading all the way through. Its remit includes all SatNav systems, not just GPS.

Martyn gave a Keynote talk at the 20th Safety-Critical Systems Symposium in Bristol a couple of weeks ago. A Google preview of Martyn’s paper is available, as well as a film of his talk. (The Institution of Engineering and Technology, IET, filmed many of the presentations. You can check out my Keynote on the Fukushima Daiichi accident as well if you like :-) )

It is amazing to me that anyone wouldn’t take Martyn’s observations very seriously indeed.

However, we do appear to have a few journalists who pooh-pooh it, for example Lewis Page, again recently in The Register, after his commentary a year ago upon the report’s release. In the same way, we had an astonishing number of journalists who made public their opinion that Y2K was never a big deal. A very silly point of view. As Martyn points out in his talk, the reason Y2K was not a big deal is that people such as himself worked very hard to eliminate as many as possible of the Y2K vulnerabilities discovered in our critical infrastructure, and were obviously quite successful. He knows what they were, since he was the senior technical advisor for some of that work (for example, UK air traffic services provision), and knows what would have happened had they not been taken care of.

The main social point here is, I think, people who worry versus people who don’t. If we didn’t have people who worried, then we wouldn’t be able to operate because things would be continually going wrong, such as possibly UK air traffic services at the turn of the millennium had NATS not worked very hard to eliminate those vulnerabilities. And on the back of such successful effort there are journalists who say “everything’s OK, isn’t it? Why worry?”. Yes, things are OK. Why worry? Because if some of us didn’t, they wouldn’t be.

Here is an example of a daily vulnerability that bit. It’s also old hat. But it happened to me two days ago, and most of those involved are professional computer scientists with a PhD (or about to obtain one) and decades of experience of such matters.

I have used my e-mail system as a memo system very effectively for the last few decades. My setup is based on IMAP, so it’s what people now call “in the cloud” but used to be called “stored on a server”. Over the years, when a subject or task occurs to me, I have got pretty good at remembering the context in which it occurred and indexing into e-mail (I send quite a few messages just to myself). It works for me very well. For decades.

Until Tuesday. I was writing an email, and the longish memo I was writing started losing characters backwards from where I had been typing, at a repetitive rate similar to that deriving from, say, a stuck delete key. It took a few seconds to realise what was happening. Then I went into the menu-strip at the top of the screen (I use the Apple OS+environment) and tried to quit my mail client (Thunderbird; Apple Mail apparently does not work well with IMAP. I lost all my mail for about a year at one point a few years ago and it took a couple of days to generate a solution from backup. The second time it happened, I switched to Thunderbird). The menu would come down, but disappeared again as I moved the mouse onto it. This happened repeatedly. I tried the same on the Apple main menu (so I could “Force Quit” the mail client) but the same happened there. I tried a hardware shutdown; the OS refused because Thunderbird would not quit, and it advised me to quit Thunderbird and then try again. I have never actually tried to log in as root and am not sure I remember the root password, so trying that and, if successful, getting the process number and performing “kill -9” didn’t seem like a good option given the urgency.

So, hardware kill: press the “off” switch and hold until the machine powers down. Good news for me: this worked.

When it came back up and I fired up the mail client, it showed me that all the messages from Wednesday 15 February at 16:35 (15:35 UTC) until that Tuesday morning, 21 February, were no longer there. A bunch of important interventions had disappeared.

So I asked the faculty computer services to restore the mails from backup. One of the two officers is Jan Sanders, with whom I have worked closely for over a decade; he also works with Causalis (people from SSS2012 may remember him from the booth) and will shortly finish his Ph.D. with me. And he installed and maintains this blogging system. These two people, along with 50-75% more help from assistants, manage the Technology Faculty’s (TechFak) computer systems, which account for over half the data volume per day of the entire university. A couple of years ago, we purchased backup hardware for some €30,000 because the university computer center proved to be unable to provide backup services as needed by some high-data-volume colleagues. The university is trying to centralise as many “routine” computing services as possible, and this situation was and is a major negotiating point over the future organisation of research computing services in the university.

Well, our backup HW+SW didn’t work. Jan + colleagues were unable to extract my e-mail Inbox directory alone. They ended up rebuilding the entire TechFak mail-server IMAP file system on a restore disk, some seven hundred gigabytes or so to be restored from main+incremental backup tapes. The estimate on Tuesday lunchtime was that it would finish on Wednesday morning. But on Wednesday morning, when they came in to work, the job had terminated with an error, having cleanly restored only up to 6 February.

Moral: the cloud is vulnerable in the ways that people concerned with the provision of computing services have known about for a long time. This is not the first time this has happened to me (indeed, it is the third time I have lost quantities of mail in five years). There are obvious ways to avoid specific problems, but there is mostly neither time nor resources to implement and manage all those solutions perfectly all the time. In this case, there were (at least) two failures, and it is clearly impractical for the faculty computing services to check continuously whether they can effectively restore data through two such failures, as well as all the other possible failures that could occur. This is a resource-intensive on-demand function and it is combinatorially impossible to check regularly the execution of all such functions in even a moderately complex system such as e-mail backup.
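
To give a rough feel for the combinatorics, here is a minimal Python sketch that counts how many distinct failure scenarios one would have to rehearse. The component counts (10, 20, 50) are hypothetical figures picked only for illustration, not a measurement of the TechFak system, and the function name failure_combinations is my own invention. It assumes each component either works or fails, which understates the real situation.

```python
from math import comb


def failure_combinations(components: int, max_simultaneous: int) -> int:
    """Count the distinct sets of failed components containing
    between 1 and max_simultaneous members."""
    return sum(comb(components, k) for k in range(1, max_simultaneous + 1))


# Hypothetical component counts, chosen only for illustration.
for n in (10, 20, 50):
    singles = failure_combinations(n, 1)   # one thing fails
    doubles = failure_combinations(n, 2)   # one or two things fail together
    anything = failure_combinations(n, n)  # any non-empty subset: 2**n - 1
    print(f"{n} components: {singles} single failures, "
          f"{doubles} single-or-double, {anything} in total")
```

Even restricting attention to double failures, the number of scenarios grows roughly with the square of the number of components, and the total doubles with every component added. Each scenario corresponds to a restore exercise of the kind that took more than a day above.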

When someone comes up with easy ways to solve any digital-computational vulnerabilities, say to GPS interference, that is less than half the tale. The rest of the tale concerns whether those solutions are implemented, and also continuously and effectively maintained.

There is a lot of superb computer science behind this nowadays. Versions of Leslie Lamport’s Paxos algorithms are enabling Google’s servers to provide us with our daily informational bread (Paxos logically serialises distributed database transactions).

Most journalists and digital-services marketing people have not heard of, let alone understood, the combinatorial impossibility of checking and maintaining all your on-demand functions, or even, more routinely, how the various Paxos variants work and how three-phase commit does not. To find out what is possible and what is not, in other words, you still have to talk to computer scientists with authoritative knowledge. Such as Martyn and his GPS-vulnerability team from the Royal Academy of Engineering. And be wary of what is said in thoughtful articles about “cloud computing” in news media unless it comes from such people.

What actually happened to me? I don’t know. The “stuck delete key” hypothesis seems to me to be implausible (it has worked fine since). And a software glitch in my mail client alone would not explain why the windowing system pull-down menus failed to operate as expected. I am not unfamiliar with forensic analysis of this sort (indeed we do it for major accidents) but this is not the first time an explanation has eluded me and I doubt it will be the last.
