Incommunicado


I may perhaps have commented before that I am a firm believer in the pessimistic principle that every upgrade is a downgrade. So when I saw on December 22 a message from our technical staff at the University of Edinburgh saying that on the following day the Unix servers would be taken down "for the installation of security patches and general maintenance", I naturally felt a chill like the coldness of the grave.

Installation of security patches, involving changes that could affect remote access, and "general maintenance", all to be completed five hours before a complete university Christmas shutdown? Well, naturally (for I do basically take this sort of thing as completely normal now), since I left the office on December 22 there has been no access of any kind to the Unix server on which my life mainly depends: ping (a program for probing remote machines to see if they are there) reports the machine as up and running, but secure shell connections do not go through (they do not even time out, they just hang), scp copying hangs as well, and not even web pages will load (my home page is offline). I need to find a way to inform everyone who has academic or administrative contact with me and is waiting for me to confirm things or send them things that I am incommunicado.

People like Cherine Yu and Chu-Ren Huang in Hong Kong; Lars Hinrichs in Texas; Andrew Garrett in California; the LTTC people in Taiwan; Rodney Huddleston in Australia; Harry van der Hulst in Connecticut; Jeff Pelletier in Canada; Markku Filppula in Finland; Nick Enfield in Holland… These people need to hear from me, and are probably emailing me, and email to the machine in question does not trigger a bounce message. Everyone will think I am just AWOL, or treating them rudely. Of course, in principle, since I do have access to the rest of the web, I could just post a mayday message to them all here, letting them know that they should communicate with my Gmail account instead (the account name is pullum). But that would be using a personal privileged access to pervert the whole purpose of Language Log, and it would be wrong.

Of course, nearly all such disasters with information technology eventually get (to some degree) fixed. After a few days of desperate academics panicking and being unable to finish their conference papers for conferences between Christmas and New Year, hard-working technical staff do come in and reboot. So it really should not surprise you too much that as soon as I had finished the above post, literally within seconds of finishing it, I saw one of our hard-working technical staff members walking past the door of my office in our almost-empty building. And of course he had just fixed the problem in question. After six days with our main server off the Internet, the moment I told you about it, the trouble was repaired.

The nerds among you will want to know what had happened. What could be the explanation for a machine that was up and running and connected to a functioning network but nonetheless unable to serve a web page or accept a login?

The answer was that the "general maintenance" had included (in my heart I knew it) an upgrade of the Linux kernel. And the new kernel ran for 36 hours or so, just to lull people into a spurious sense of confidence about it being all right. Then on Christmas Eve, as soon as the last technical staff person had left the building and gone home for the holiday, it went into a state known as kernel panic. It is interesting (did you imagine there was never going to be a touch of linguistics in this post?) that we redeploy terminology relating to emotional and psychological states to obtain a vocabulary for unanticipated concepts like the behavior of the core of an operating system when it has begun to spend so much of its time in a desperate effort to maintain itself internally that it has no ability to perform its functions externally. The operating system had gone into a state that is so reminiscent of neurotic crises and irrational panic attacks that the metaphor seems absolutely perfect — better than mechanical metaphors like "spinning its wheels".

During a kernel panic, low-level systems (brain stem and subcortical activity, to continue the neuropsychological metaphoricity) are functioning, so ping would get a positive response; but more elaborate operations, like running an http daemon to serve up web pages or starting up a login session in response to ssh, cannot be accomplished. (Why email was being received and filed I do not fully understand. Perhaps that counts as low-level enough to be executed, or perhaps messages were being held in a queue on some other server en route.)

The hard-working technical staff member who came in (thank you, Cedric!) knew how to roll the operating system back to use the version it had been using before December 23, and that got things back in order. (As so often happens, it was necessary to downgrade in order to restore functionality.) If psychiatrists knew how to do the same thing with human minds, psychiatry would be a science that could support engineering, and a lot of mental illness would be readily and easily curable. But instead, psychiatry is in a state similar to that of linguistics: we know a fair bit, but so much fundamental knowledge is lacking that compared to many scientific subjects we are just groping in the dark. As are, of course, the hard-working technical staff members who try to keep servers running through Christmas holidays…

Update two days later: Aaron Davies, who objects to my remark about psychiatry-derived metaphors, has written to tell me a bit more about kernel panics. I paraphrase what he wrote.

Kernel panics are generally due either to hardware problems or to memory access errors within kernel-space code like drivers — the same sort of errors that would make a user-space program deliver a segmentation fault. The panic itself is a protection mechanism meant to ensure that a system fail completely rather than continue operating in an unknown state that may create security vulnerabilities, corrupt data, etc.

The looping behavior that we call kernel panic is generated by a piece of code like the following (Aaron took this from kernel source code released as recently as last Wednesday):

        for (i = 0;;) {
                touch_softlockup_watchdog();
                i += panic_blink(i);
                mdelay(1);
                i++;
        }

The default implementation of panic_blink just returns 0. Before the loop starts, there's a page or so of other code that appears to turn off all multitasking and multiprocessing, guaranteeing that nothing else can be happening other than running this pointless piece of code.

That is what Aaron has told me. I like to think of the behavior of a computer running the above loop as being rather like the behavior of a human who adopts a fetal position on the floor and mutters "We're all going to die!" over and over again. But as Aaron points out, it is also very much like having your back wheels lifted off the ground (in a standard rear-wheel-drive vehicle) so that no matter how you accelerate you don't go anywhere.


