2015-06-21

The spectacular kind of hardware failure

Gentle reader, I have attempted several times to pen my thoughts on the epic hack of the US Office of Personnel Management that compromised the security information of pretty much everyone who works for the US government, but I keep losing my vision and hearing a ringing in my ears when I try to do so. So I turn to a lesser-known and differently-awesome fail: the US visa system.

Since a computer failure on the 26th of May - over three weeks ago - the US embassies and consulates worldwide have been basically unable to issue new visas except in very limited circumstances. You haven't heard much about this because it hasn't really affected most US citizens, but believe me it's still a big issue. It seems that they're not expecting the system to be working again until next week at the earliest. Estimates of impacted users are on the order of 200,000-500,000; many people are stuck overseas, unable to return to the USA until their visa renewal is processed.

What happened? The US Department of State has a FAQ but it is fairly bland, just referring to "technical problems with our visa systems" and noting "this is a hardware failure, and we are working to restore system functions".

So a hardware failure took out nearly the entire system for a month. The most common cause of this kind of outage is the failure of a large storage system - either a mechanical failure that prevents access to all the data you wrote on the disks, or a software error that deleted or overwrote most of the data on there. This, of course, is why we have backups - once you discover the problem, you replace the drive (if broken) and then restore your backed-up data from the last known good state. You might then have to apply patches on top to cover data that was written after the backup, but the first step should get you 90%+ of the way there. Of course, this assumes that you have backups and that you are regularly doing test restores to confirm that what you're backing up is still usable.
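For the curious, here is roughly what the checking half of a "test restore" might look like. This is a minimal sketch, not anyone's production tooling: it assumes the backup has already been restored into a scratch directory, and the names `file_digest` and `verify_restore` are made up for illustration. It simply compares SHA-256 checksums of the live files against their restored copies.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_restore(source_dir: Path, restored_dir: Path) -> list[str]:
    """Compare every file under source_dir with its restored copy.

    Hypothetical helper: returns a list of problems, one entry per file
    that is missing from the restore or whose contents differ.
    """
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file():
            problems.append(f"missing: {rel}")
        elif file_digest(src) != file_digest(restored):
            problems.append(f"differs: {rel}")
    return problems
```

An empty list means the restore matched what is on disk at the time of the check. A real system would also have to cope with data that legitimately changed after the backup was taken, which this sketch ignores.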

The alternative failure is of a relatively large machine. If you're running something comparable to the largest databases in the world you're going to be using relatively custom hardware. If it goes "foom", e.g. because its motherboard melts, you're completely stuck until an engineer can come over with the replacement part and fix it. If the part is not replaceable, you're going to have to buy an entirely new machine - and move the old one out, and install the new one, and test it, and hook it up to the existing storage, and run qualification checks... But this should still be on the order of 1 week.

A clue comes from a report of a State Department briefing:

"More than 100 engineers from the government and the private sector [my emphasis] are working around the clock on the problem, said John Kirby, State Department spokesman, at a briefing on Wednesday."

You can't use 100 engineers to replace a piece of hardware. They simply won't fit in your server room. This smells for all the world like a mechanical or software failure affecting a storage system where the data has actually been lost. My money is on backups that weren't actually backing up data, or backing it up in a form that needed substantial manual intervention to restore, e.g. a corrupted database index file which would need every single piece of data to be reindexed. Since they've roped in private sector engineers, those engineers are likely from whoever supplied the hardware in question: Oracle or IBM, at a guess.

The US Visa Office issues around 10 million non-immigrant visas per year, which are fairly simple, and about 500,000 immigrant visas per year which are a lot more involved with photos, other biometrics, large forms and legal papers. Say one of the latter takes up 100MB (a hi-res photo is about 5MB) and one of the former takes up 5MB; then that's a total of about 100TB per year. That's a lot of data to process, particularly if you have to build a verification system from scratch.
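If you want to check my sums, here is the back-of-envelope calculation as a short script. The visa counts and per-record sizes are just the rough guesses from the paragraph above, not official figures, and I'm using decimal units (1 TB = 10^12 bytes).

```python
# Back-of-envelope estimate of yearly visa data volume.
# All inputs are the rough guesses from the text, not official figures.
MB = 10**6
TB = 10**12

non_immigrant_per_year = 10_000_000   # simple applications
immigrant_per_year = 500_000          # photos, biometrics, legal papers

non_immigrant_size = 5 * MB           # assumed size of one simple record
immigrant_size = 100 * MB             # assumed size of one involved record

non_immigrant_total = non_immigrant_per_year * non_immigrant_size
immigrant_total = immigrant_per_year * immigrant_size
total = non_immigrant_total + immigrant_total

print(f"non-immigrant: {non_immigrant_total / TB:.0f} TB")
print(f"immigrant:     {immigrant_total / TB:.0f} TB")
print(f"total:         {total / TB:.0f} TB")
```

Under these guesses the two categories contribute equally, about 50TB each, despite the 20-fold difference in the number of visas.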

I'd love to see a report on this from the Government Accountability Office when the dust settles, but fear that the private sector company concerned will put pressure on to keep the report locked up tight "for reasons of commercial confidentiality and government security". My arse.