We Had An Accident

On Friday evening, 26th February, we suffered an accident. In what is known as the Swiss Cheese Model, stemming in all but name from Jim Reason, all the holes lined up.

Pictures of the Swiss Cheese Model abound, for example here and here and here. The idea of the Swiss Cheese Model is that there are multiple «layers of protection» against unwanted events (the slices of cheese), each with weaknesses (the holes in the slice), and so for an unwanted event to occur (a straight trajectory through the slices), all the weaknesses must be exploited simultaneously (the holes line up). It is a compelling image, some quarter century old now, cited often by people in on-line forums such as PPRuNe discussing unwanted events such as airplane accidents.

We do Why-Because Analysis, of course. We also get pretty pictures (Why-Because Graphs, WBGs), but the pictures in this case are not all the same. Indeed, we have observed the odd phenomenon that you can tell what accident is being shown, just by looking at the shape of the WBG and leaving out all of the factual information. We haven’t quite got a handle on this cognitive phenomenon yet, but it does suggest the truth of the old adage that, whatever similarities one may see, all accidents are causally unique. (This observation also indicates the limits of the Swiss Cheese Model as an explanatory device, for all one can vary schematically in it is the number of slices.)

So what happened to us on Friday evening? Our mail server fell over. I have been running my own mail server and WWW server independent of faculty resources since 1996, and have thereby had much better service than my colleagues for most of that time (the administration was performed for almost a decade by Marcel Holtmann, for many years the maintainer of the Linux Bluetooth suite, BlueZ, whose Diploma thesis contained a thorough analysis of the first successful Bluetooth exploits). Recently, the faculty services have become more dependable, due to personnel changes, and indeed my Sysadmin and AbnormalDistribution blogger Jan Sanders works for the services group. However, the personpower is very thinly spread. The informatics part of our faculty has grown from six research groups to seventeen and also provides essential services for the CITEC Excellence Cluster and the CoR-Lab robotics institute, with still only two full-time-equivalent positions. Our backups alone are more than the entire rest of the Uni combined! So shouldn’t our faculty find the resources for an appropriate increase in personnel? Sure, but German faculty negotiations resemble the situation so well described by Lamport, Shostak and Pease almost thirty years ago in their canonical papers Reaching Agreement in the Presence of Faults and the more colorful The Byzantine Generals Problem. The third piece of work relevant to such negotiations is Garrett Hardin’s Tragedy of the Commons, passably described in this Wikipedia article.

So what happened with the mail server? It fell over Friday early evening, the chosen time for essential electronic services to croak. Normally Jan restarts it remotely and all is well – happens about once a year. Not this time. Saturday afternoon he removed the hardware, took it back to his office – and couldn’t even get a console to start. First hole in the cheese. There wasn’t any obvious sign of electronic or thermal damage when we looked at it, and both disks fired up quietly when they got juice, and they are RAID-configured, so the data would be there.

We decided over the weekend to migrate the service to the faculty machines. We assumed that messages over the weekend would be stored on the university computer center machines for forwarding when our mail domain name awoke again, because that’s what the MX records say. And the mails I had already received were stored not just on the server, under IMAP, but on my laptop, as well as on the external USB disk I have configured to be a Time Machine backup server. But I haven’t backed up in a month or so, because we had the builders in at home and I stored it away from all the fine dust (some vain hope!). Second hole in the cheese. I use email as a form of notepad, so losing the last month’s worth of notes as well as messages did not appeal to me.
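For anyone wondering why we expected the weekend's mail to be parked rather than lost: a sending server tries the MX hosts for a domain in ascending order of preference number, and if none answers it keeps the message queued and retries later. Here is a minimal Python sketch of that selection logic. The host names and the `is_reachable` callback are hypothetical stand-ins, not our actual configuration, and real mail software also randomises among hosts with equal preference, which this sketch does not.

```python
def delivery_order(mx_records):
    """Return hosts in the order a sending server would try them.

    mx_records is a list of (preference, host) pairs; a lower
    preference number means "try this one first".
    """
    return [host for _, host in sorted(mx_records)]


def pick_target(mx_records, is_reachable):
    """Return the first reachable MX host, or None if all are down.

    is_reachable is a caller-supplied function (hypothetical here)
    that reports whether a host currently accepts mail.
    """
    for host in delivery_order(mx_records):
        if is_reachable(host):
            return host
    return None  # the sender keeps the message queued and retries


# Hypothetical records: our own server first, a backup relay second.
records = [(20, "mx-backup.example.org"), (10, "mail.example.org")]

# With the primary down, mail goes to the backup relay, which holds it.
target = pick_target(records, lambda host: host != "mail.example.org")
```

With the primary unreachable, `target` is the backup relay, which is the behaviour we were counting on over the weekend.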

Bernd Sieker came in specially, Monday morning early, to help rescue the data. He and Jan figured how to get the machine to boot. One of the disks had bad blocks, but the other seemed OK, and it is RAID, so – wait a minute, it’s configured for RAID but there isn’t RAID running on it (we take it somebody ran out of patience and didn’t finish the job). Third hole in the cheese. All the data is on the bad disk.
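A check that would have caught this hole long before Monday morning is easy to sketch. The following assumes Linux software RAID (md), whose state the kernel reports in /proc/mdstat; that is an assumption for illustration, since nothing above says which kind of RAID our machine was meant to run. The function names are my own, and the parser handles only the common layout of that file.

```python
import re


def raid_status(mdstat_text):
    """Parse /proc/mdstat-style text into {array: {state, expected, working}}."""
    arrays = {}
    current = None
    for line in mdstat_text.splitlines():
        # Array header, e.g. "md0 : active raid1 sdb1[1] sda1[0]"
        header = re.match(r"^(md\S*)\s*:\s*(\S+)", line)
        if header:
            current = header.group(1)
            arrays[current] = {"state": header.group(2),
                               "expected": None, "working": None}
            continue
        # Member counts, e.g. "[2/1]" means 2 expected, 1 working
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if counts and current:
            arrays[current]["expected"] = int(counts.group(1))
            arrays[current]["working"] = int(counts.group(2))
    return arrays


def raid_problems(mdstat_text):
    """Return a list of human-readable problems; empty means all looks well."""
    arrays = raid_status(mdstat_text)
    if not arrays:
        return ["no md arrays found: RAID may be configured but not running"]
    problems = []
    for name, info in arrays.items():
        if info["state"] != "active":
            problems.append(f"{name}: state is {info['state']}")
        elif info["expected"] is not None and info["working"] != info["expected"]:
            problems.append(
                f"{name}: {info['working']} of {info['expected']} members working")
    return problems


# Illustrative sample of a degraded mirror, not output from our server.
DEGRADED = """Personalities : [raid1]
md0 : active raid1 sda1[0]
      976630336 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""
problems = raid_problems(DEGRADED)
```

On a real machine one would feed it the contents of /proc/mdstat from a scheduled job and send a message whenever the list is non-empty. The case that bit us corresponds to the first branch: no arrays at all.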

No matter, I have the mails on my laptop and can copy them over to my IMAP account on the faculty machine when it’s up. Jan configures it; I give the relevant server data to my mail client, and we call up the friendly university administrator to change the MX record for our mail server in the local DNS database. Then I call my uni mail box and see, not the thousands of messages in my Inbox I have since July 2009 (the last time I archived it), but one new message. Where have they gone? I look in the local mail storage files on my laptop and see – they are just gone! Just like that! No dialog box «do you really want to do this?» or anything. Just gone. Fourth hole in the cheese. So much for trusting manufacturer SW that is supposed to be archiving and synchronising.

But not quite all the holes lined up. It seems as if the client mail storage on our original mail server is mostly or maybe completely healthy. So we should be able to get those files copied across. And in any case I have everything up to about the last month on the Time Machine. Or so I think. By the time we are finished, it should be four or five person-days of work. Because we thought we had a passable set-up, but it turns out we didn’t. That is, I didn’t – everybody else did but me. And I am supposed to be expert. How embarrassing!

Gee, I hope all this new kit works the way Jan assures me it does…
