2013-11-28

When perfection is less desirable than excellence

An interesting view into the trade-offs of large-scale computing from LISA 2013:

[Google's] engineers aim to make its products as reliable as possible, but that's not their sole task. If a product is too reliable — which is to say, beyond the five 9s of reliability (99.999 percent) — then that service is "wasting money" in the company’s eyes.
"The point is not to achieve 100 percent availability. The point is to achieve the target availability — 99.999 percent—while moving as fast as you can. If you massively exceed that threshold you are wasting money," Underwood said.
It's interesting that "five nines" seems to be viewed as the desirable limit of reliability. Recall that this means a little over 5 minutes of downtime per year; it seems unlikely that anyone would notice this level of downtime unless it's a 24x7x365 service with hundreds of millions of global users (Gmail, Facebook etc.). If we assume 50M daily users distributed evenly across the planet, and an average of 5 minutes of daily engagement (times when they'd notice the service failing to respond), then about 50M × 5/1,440 ≈ 170,000 users are engaged at any given moment and would notice an outage. Maybe 1% of them would publicly complain (via Twitter etc.), giving around 1,700 tweets, which is around the margins of a detectable level of complaint. Certainly from recorded Gmail outages it seems to be about right. If you have 10% of this number of daily users, you could have four-nines reliability (about 53 minutes of downtime per year) for the same level of complaint.
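The back-of-envelope estimate above can be sketched in a few lines of Python. This is only a rough model under the same assumptions as the paragraph (users spread evenly around the clock, a fixed number of engaged minutes per user per day); the function names are made up for illustration, and the 1% complaint rate is a guess rather than a measured figure.

```python
MINUTES_PER_DAY = 24 * 60
MINUTES_PER_YEAR = 365 * MINUTES_PER_DAY


def downtime_minutes_per_year(nines: int) -> float:
    """Downtime budget per year for an availability of `nines` nines."""
    unavailability = 10.0 ** -nines  # five nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


def users_active_at_once(daily_users: int, engaged_minutes_per_day: float) -> float:
    """Average number of users engaged at any instant (assumes an even spread)."""
    return daily_users * engaged_minutes_per_day / MINUTES_PER_DAY


def affected_user_minutes(daily_users: int, engaged_minutes_per_day: float, nines: int) -> float:
    """User-minutes per year that coincide with downtime: a crude proxy for complaints."""
    return users_active_at_once(daily_users, engaged_minutes_per_day) * downtime_minutes_per_year(nines)


if __name__ == "__main__":
    COMPLAINT_RATE = 0.01  # assumed: 1% of affected users complain publicly

    active = users_active_at_once(50_000_000, 5)
    print(f"five nines: {downtime_minutes_per_year(5):.1f} min/year of downtime")
    print(f"users engaged at any moment: {active:,.0f}")
    print(f"expected public complaints: {active * COMPLAINT_RATE:,.0f}")

    # Ten times fewer users with ten times more downtime gives the same exposure.
    print(f"50M users, five nines: {affected_user_minutes(50_000_000, 5, 5):,.0f} user-minutes")
    print(f"5M users, four nines:  {affected_user_minutes(5_000_000, 5, 4):,.0f} user-minutes")
```

With these inputs the model gives a budget of roughly 5.3 minutes per year at five nines and about 174,000 users engaged at any moment. The last two lines print the same figure, which is the sense in which a service with a tenth of the users can run at four nines for the same level of complaint.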

The really interesting (and no doubt intentionally controversial) comment was on the end of the age of the BOFH:

Underwood, who has a flair for the dramatic, stated: "I think system administration is over, and I think we should stop doing it. It's mostly a bad idea that was necessary for a long time but I think it has become a crutch."
It's not yet obvious that small companies are going to shed BOFHs in order to outsource their system maintenance to "the cloud", no matter how economically appealing this appears. I suspect that having a person physically on-site whom you can shout at when things go wrong is going to be sufficiently psychologically helpful that BOFHs (or at least PFYs) will be with the SME for a while yet. There's also the practical matter of selecting the correct combination of storage, network bandwidth and peak vs average processing power for the business: you have to hire someone who knows how to make this choice, and you can't easily fire them once they've made it for you. Perhaps cloud computing can let CTOs scale back their IT departments, but I'd be surprised if those departments can be completely eliminated.
