2013-10-17

Drop-dead dates

I had the educational privilege, a few years ago, of watching a team in my workplace try to roll out a new business system to replace an existing one which had worked well for a while but had grown gnarled and unmaintainable, and no longer scaled to likely future demands. Well aware of the Second System Effect, they made the new system feature-for-feature compatible, and even had a good stab at bug-for-bug. However, it was a complex problem and they spent many months spinning up a prototype system.

Eventually their manager decided that they needed to run something in production, so they picked a slice of traffic on the existing business system that was representative but not critical, and set a target launch date a week hence. The developers were privately rather twitchy about the prospect, but recognised the pressure their manager was under and were willing to give it a shot. Come switchover day, the new system was enabled - and promptly fell on its face. The developers found the underlying bugs, fixed them and restarted it. It ran a little longer this time, but fell over again within a few hours. They fixed that problem too, but within 12 hours it became clear that performance was steadily degrading over time...

The developers had a miserable time during the subsequent week. I got in pretty early as a rule, but the dev team was always in (and slurping coffee) by the time I arrived, and never left before I got home. Bugs filed against their area steadily accumulated, and the system repeatedly fell down and was restarted with fixes. The team were living on the ragged edge, trying to keep the system up while triaging the bugs, adding tests and monitoring to catch them, and trying to measure and improve performance. This was analogous to changing the wheels on Sebastian Vettel's F1 car mid-lap - one hiccup and either you lose a limb or the car embeds itself in a track barrier. It became clear that the team's testing system had huge gaps, and their monitoring generally couldn't detect failures as they happened - you could more or less infer what had caused a failure by checking the logs, but someone had to mail the team saying "hey, this job didn't work" before anyone knew which logs to check.

After a fortnight of this, with the team having averaged 80-90-hour weeks, their manager sensibly realised that the approach was not sustainable. He announced that they would switch back from the new system to the old one, effective the next day, and immediately shaped expectations by making it clear that they would not attempt another switchover for at least three months. The team breathed a sigh of relief, took a few days off, and re-planned their work.

Once the system was pulled offline, the developers made reasonably rapid progress. They'd accumulated a host of bug reports covering both functionality and performance, and (more importantly) had identified crucial gaps in testing and monitoring. For each bug, they first verified that they could reproduce it in their testing system - which was where they spent the bulk of their development time for several weeks after the turndown - and that the monitoring would detect the condition and alert them appropriately. They triaged the bug reports and worked through them in priority order, built load tests that replicated the system load seen in normal operation, and added metrics and monitoring on system latency. The time spent running in production had provided a wealth of logs and load information, which gave them a yardstick against which to measure performance.
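
The post doesn't describe the team's actual tooling, but the shape of that work is easy to sketch. Below is a minimal, hypothetical Python harness in the same spirit: replay a sample of request paths recorded from production logs against a test instance, time each request, and fail the run if the 95th-percentile latency would have tripped the production alert. The test URL, log format and threshold are all invented for illustration; nothing here is taken from the team's real system.

    # Hypothetical sketch: replay logged production traffic against a test
    # instance and check latency against the threshold the production
    # monitoring would alert on. Names, URL, log format and threshold are
    # invented for illustration.
    import csv
    import statistics
    import time
    import urllib.request

    TEST_INSTANCE = "http://localhost:8080"  # assumed test deployment
    P95_ALERT_MS = 500                       # assumed alerting threshold

    def replay_from_log(log_path):
        """Replay request paths sampled from production logs, timing each one."""
        latencies_ms = []
        with open(log_path, newline="") as f:
            for row in csv.DictReader(f):  # expects a 'path' column
                start = time.monotonic()
                try:
                    urllib.request.urlopen(TEST_INSTANCE + row["path"], timeout=10)
                except Exception as exc:
                    print(f"FAILED {row['path']}: {exc}")
                    continue
                latencies_ms.append((time.monotonic() - start) * 1000)
        return latencies_ms

    def latency_ok(latencies_ms):
        """Return False if the p95 latency would have fired the production alert."""
        p95 = statistics.quantiles(latencies_ms, n=20)[-1]
        print(f"p95 latency: {p95:.1f} ms over {len(latencies_ms)} requests")
        return p95 <= P95_ALERT_MS

    if __name__ == "__main__":
        ok = latency_ok(replay_from_log("production_sample.csv"))
        raise SystemExit(0 if ok else 1)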

After a few months they felt ready to try again, so they spun up the fixed system and loaded in the current data. This went much more smoothly. There were still occasional crashes, but their monitoring alerted them almost instantly so they could stop the system, spend time precisely characterising the problem, fix it, test the fix, deploy the fix and restart. The average time between crashes got longer and longer, the impact of failures got smaller and smaller, and after 6 months or so the system achieved its stated goal of greater scale and performance than its predecessor. However, all this was only possible because of the decision to roll back its initial roll-out.

I was reminded of this today when I saw that informed insiders were estimating the US federal healthcare exchanges as "only 70% complete" and needing "2 weeks to 2 months more work" to be ready. Since there are several tens of millions of potential users who need to register before January 1st, this looks to be a precarious situation. It's doubly precarious when you realise that "70% complete" in a software project is code for "I have no idea when we're going to be done." My personal rule of thumb is that "90% complete" means that you take the number of weeks spent in development so far, and expect the same again until the system is working with the specified reliability.

Megan McArdle, whose coverage of the healthcare exchanges has been consistently superb, makes a compelling case that the Obamacare team needs to set a drop-dead date for a working system, and delay the whole project a year if it's not met:

...given that they didn't even announce that they were taking the system down for more fixes this weekend, I'm also guessing that it's pretty bad. Bad enough that it's time to start talking about a drop-dead date: At what point do we admit that the system just isn't working well enough, roll it back and delay the whole thing for a year?

She's right. If the system is this screwed up at this point, with an immovable deadline of January 1st to enrol a large number of people, any sane project manager would move heaven and earth to defer the roll-out. In the next 6-9 months they could address all the problems that the first roll-out has revealed, taking the time to test both functionality and performance against the traffic levels they now know to expect. There's no practical compulsion to run the exchanges now - the American healthcare system has been screwed up for several decades, the population is used to it, and waiting another year won't make much difference to most voters.

Chance of this happening? Essentially zero. The Democrats have nailed their colours to the mast of the good ship Affordable Care Act, and it's going out this year if it kills them. If they hold it over until next year then the full pain of the ACA's premium hikes will hit just before the mid-term elections, and they will get pummelled. They're hoping that if they launch now then the populace will be acclimatised to the costs by next November. As such, launching this year is a politically non-negotiable constraint. Politics, hard deadlines and under-performing software - a better recipe for Schadenfreude I can't imagine.
