2014-02-27

Fixing Healthcare.gov - the inside story

The new issue of Time covers in depth the work of the team who fixed Healthcare.gov. It's a fantastic read, with good access to the small but extremely competent team who drove the fix - go absorb the whole thing.

The data coming out of the story confirms a lot of what I suspected about what was wrong and how it needed to be fixed. Breaking it down into before and after the hit team arrived:

Before

  1. By October 17 the President was seriously contemplating scrapping the site and starting over.
  2. Before this intervention, the existing site's teams weren't actually improving it at all except by chance; the site was in a death spiral.
  3. No one in CMS (or above) was actually checking whether the site would work before launch.
  4. The engineers (not companies) who built the site actually wanted to fix it, but their bosses weren't able to give them the direction to do it.
  5. There was no dashboard (a single view) showing the overall health of the site.
  6. The key problem the site had was being opened up to everyone at once rather than growing steadily in usage.
  7. The site wasn't caching the data it needed in any sensible way, maximising the cost of each user's action; just introducing a simple cache improved the site's capacity by a factor of 4. (A sketch of the idea follows below.)
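For non-programmers wondering what "a simple cache" amounts to, here's a minimal sketch in C. The slot count, timeout and key names are invented for illustration - this is the shape of the idea, not the actual Healthcare.gov fix:

#include <stdio.h>
#include <string.h>
#include <time.h>

#define CACHE_SLOTS       1024
#define CACHE_TTL_SECONDS 60

struct cache_entry {
	char   key[64];
	char   value[256];
	time_t fetched_at;
};

static struct cache_entry cache[CACHE_SLOTS];

/* Stand-in for whatever expensive backend call the site was repeating. */
static void fetch_from_backend(const char *key, char *out, size_t outlen)
{
	snprintf(out, outlen, "result-for-%s", key);
}

static unsigned hash_key(const char *key)
{
	unsigned h = 5381;
	while (*key)
		h = h * 33 + (unsigned char)*key++;
	return h % CACHE_SLOTS;
}

/* Serve from the cache when the entry is fresh; otherwise hit the backend
   once and remember the answer for the next CACHE_TTL_SECONDS. */
void cached_lookup(const char *key, char *out, size_t outlen)
{
	struct cache_entry *e = &cache[hash_key(key)];
	time_t now = time(NULL);

	if (strcmp(e->key, key) != 0 || now - e->fetched_at > CACHE_TTL_SECONDS) {
		fetch_from_backend(key, e->value, sizeof e->value);
		snprintf(e->key, sizeof e->key, "%s", key);
		e->fetched_at = now;
	}
	snprintf(out, outlen, "%s", e->value);
}

int main(void)
{
	char buf[256];
	cached_lookup("plan-1234", buf, sizeof buf);	/* hits the backend  */
	cached_lookup("plan-1234", buf, sizeof buf);	/* served from cache */
	printf("%s\n", buf);
	return 0;
}

The expensive backend only gets hit when an answer is missing or stale; every identical request inside the timeout window is close to free, which is where that kind of capacity multiplier comes from.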
I refer the reader in particular to my blogpost The Curse of Experts where CMS head Marilyn Tavenner was trying to dodge blame.
During the Tuesday hearing, Tavenner rejected the allegation that the CMS mishandled the health-care project, adding that the agency has successfully managed other big initiatives. She said the site and its components underwent continuous testing but erred in underestimating the crush of people who would try to get onto the site in its early days. "In retrospect, we could have done more about load testing," she said.
As the Time article shows, this was anything but the truth about what was actually wrong.

After

  1. There wasn't any real government coordination of the rescue - it was managed by the team itself, with general direction but not specific guidance from the White House CTO (Todd Park).
  2. The rescue squad was a scratch team who hadn't worked together before but was completely aligned in that they really wanted to make the site work, and had the technical chops to know how to make this happen if it was possible.
  3. Fixing the website was never an insurmountable technical problem: as Dickerson noted "It's just a website. We're not going to the moon." It was just that no-one who knew how to fix it had been in a position to fix it.
  4. The actual fixes were complete in about 6 weeks.
  5. One of the most important factors in speeding up the fixes was completely avoiding the allocation of blame for mistakes.
  6. Managers should, in general, shut up during technical discussions: "The ones who should be doing the talking are the people who know the most about an issue, not the ones with the highest rank. If anyone finds themselves sitting passively while managers and executives talk over them with less accurate information, we have gone off the rails, and I would like to know about it."
  7. The team refused to commit to artificial deadlines: they would fix it as fast as they could but would not make promises about when the fixes would be done, refusing to play the predictions game.
  8. Having simple metrics (like error rate and concurrent users on the site) gave the team a good proxy for how they were doing - see the sketch after this list.
  9. Targeted hardware upgrades made a dramatic difference to capacity - the team had measured the bottlenecks and knew what they needed to upgrade and in what order.
  10. Not all problems were fixed: the back-end communications to insurance companies still weren't working, but that was less visible so lower priority.
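As promised in item 8, here's the kind of two-number health summary I have in mind - a sketch with made-up counters, not anything from the real dashboard:

#include <stdio.h>

struct health_sample {
	long requests;		/* requests served in the sampling window */
	long errors;		/* requests that ended in an error        */
	long concurrent_users;	/* users on the site right now            */
};

static double error_rate(const struct health_sample *s)
{
	return s->requests ? (double)s->errors / (double)s->requests : 0.0;
}

int main(void)
{
	/* invented numbers for a single five-minute window */
	struct health_sample s = { 125000, 900, 42000 };

	printf("error rate %.2f%%, concurrent users %ld\n",
	       100.0 * error_rate(&s), s.concurrent_users);
	return 0;
}

Trend those two numbers over time and you can tell within minutes whether the change you just pushed made things better or worse.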

The overall payoff for these six weeks of work was astonishing; on Monday 23rd December the traffic surged in anticipation of a sign-up deadline:

"We'd been experiencing extraordinary traffic in December, but this was a whole new level of extraordinary ... By 9 o'clock traffic was the same as the peak traffic we'd seen in the middle of a busy December day. Then from 9 to 11, the traffic astoundingly doubled. If you looked at the graphs, it looked like a rocket ship." Traffic rose to 65,000 simultaneous users, then to 83,000, the day's high point. The result: 129,000 enrollments on Dec. 23, about five times as many in a single day as what the site had handled in all of October.
Despite this tremendous fix, however, President Obama didn't visit the team to thank them. Perhaps the political fallout from the Healthcare.gov farce was too painful for him.

The best quote that every single government on the planet should read:

[...] one lesson of the fall and rise of HealthCare.gov has to be that the practice of awarding high-tech, high-stakes contracts to companies whose primary skill seems to be getting those contracts rather than delivering on them has to change. "It was only when they were desperate that they turned to us," says Dickerson. "I have no history in government contracting and no future in it ... I don't wear a suit and tie ... They have no use for someone who looks and dresses like me. Maybe this will be a lesson for them. Maybe that will change."
The team who pulled President Obama's chestnuts out of the fire didn't even think they were going to be paid for their work initially; it looks like they did eventually get some money, but nowhere near even standard contracting rates. And yet, money wasn't the motivator for them - they deeply wanted to make Healthcare.gov work. As a result they did an extraordinary job and more or less saved the site from oblivion. This matches my experience from government IT developments: it's reasonable to assume that the government don't care about whether the project works at all, because if they did then they'd run it completely differently. Though if I were President I'd be firing Marilyn Tavenner, cashing in her retirement package and using it to pay bonuses to the team who'd saved my ass.

If you have a terribly important problem to solve, the most reliable way to solve it is to find competent people who will solve it for free because they want it to work. Of course, it's usually quite hard to find these people - and if you can't find them at all, maybe your problem shouldn't be solved in the first place.

Don't give the guests power over the residents

Top legal blog "The Volokh Conspiracy", now at the Washington Post, analyses the recent Ninth Circuit decision that a California high school can lawfully prohibit the wearing of American flag shirts. Eugene Volokh notes that the actions of the principal (Mr. Rodriguez) in banning the wearing of American flag clothing for fear of it causing violent disruption may be constitutional but are not at all a good idea:

This is a classic "heckler's veto" — thugs threatening to attack the speaker, and government officials suppressing the speech to prevent such violence. "Heckler's vetoes" are generally not allowed under First Amendment law; the government should generally protect the speaker and threaten to arrest the thugs, not suppress the speaker’s speech. But under Tinker's "forecast substantial disruption" test, such a heckler's veto is indeed allowed.
I have to confess sympathy for Mr. Rodriguez in his predicament - his job is to ensure order and prevent disruption at his school, and the wearing of the American flag did seem strongly correlated with disruption:
At least one party to this appeal, student M.D., wore American flag clothing to school on Cinco de Mayo 2009. M.D. was approached by a male student who, in the words of the district court, "shoved a Mexican flag at him and said something in Spanish expressing anger at [M.D.'s] clothing."
Now there are plenty of legal Mexican immigrants in the USA, so we shouldn't assume anything about the angry student's immigration status, but if (for instance) a student of Scottish heritage took offense to a Live Oak student wearing an American flag on St. Andrew's Day and threatened him "I'll cae the pins o' ye!" I can't imagine Mr. Rodriguez reacting the same way. The principal does seem to be bowing to the opinions of a category of "guest" students in preference to those who are citizens of the country. (Irish students aren't likely to cause problems on St Patrick's Day because it's celebrated in the USA as much if not more than in Ireland).

If you want to know more about what Live Oak High School is like, take a look at the California Department of Education stats for the school. It has roughly a 50-50 demographic split between white and Hispanic/Latino students. The standardised scoring indicates that white, Asian and black students improved significantly in the past academic year but the Hispanic/Latino students went backwards. It seems that pandering to them isn't doing them any favours academically. Incidentally, I'm dubious about the "Two or more races" stats - only 1 student of mixed heritage out of 858? My arse.

I can do no better than quote Volokh's takeaway:

The school taught its students a simple lesson: If you dislike speech and want it suppressed, then you can get what you want by threatening violence against the speakers. The school will cave in, the speakers will be shut up, and you and your ideology will win. When thuggery pays, the result is more thuggery. Is that the education we want our students to be getting?
If Live Oak High School really wants to help its Hispanic/Latino students, it should insist that they meet the standards expected of all other students.

2014-02-25

Ukraine and Clancy's prescience

At Christmas I read the newest (and final, given the author's passing) Tom Clancy novel Command Authority, in which hero and president Jack Ryan battles to keep Russia from taking over Ukraine by military force; Russia uses agents provocateurs, together with the substantial number of Russian passport holders in east Ukraine, as a pretext for involvement and invasion.

All is passing as Clancy has foreseen:

Russia's large landing ship Nikolai Filchenkov has arrived near the Russia Black Sea Fleet's base at Sevastopol, which Russia has leased from Ukraine since the fall of the Soviet Union in 1991.
The ship is reported to be carrying as many as 200 soldiers and has joined four additional ship carrying an unknown amount of Special Forces troops. Flot.com also reported over the weekend that personnel from the 45th Airborne Special Forces unit and additional divisions had been airlifted into Anapa, a city on Russia's Black Sea coastline. In addition, it is believed that Russia's Sevastopol base contains as many as 26,000 troops already, according to the German Institute for International And Security Affairs.

Clancy will never be remembered as a giant of literature, but you have to give the man his due in anticipating world events. Russia is no doubt terrified of Ukraine pivoting westwards, especially given the billions of dollars invested there:

Moody’s cited Putin as saying Ukrainian borrowers owed Russia about $28 billion, according to a report this month, before the $15 billion of bond purchases. Ratings agencies will "eventually" have to look at this exposure, Grafe said.
With fracking and Canadian oil exerting downwards pressure on fuel prices, one wonders where the Russian siloviki are going to find the money to maintain their lifestyles. For sure V. V. Putin can turn off the taps of Ukraine's gas supply, but that's only likely to make the citizens more angry at Russia.

I'd hope that rolling tanks across the border would be a step too cheeky even for Putin, but who knows what internal pressures are being exerted?

2014-02-23

Apple's SSL bug - better code reviews required

There's a great technical discussion by Adam Langley at Imperial Violet on the inadvertent security hole that Apple introduced to iOS 7 and later versions of OS X. They've released a patch for iOS (which is how people noticed) but are still working on the OS X fix. My sympathies are with Apple despite them being panned for the delay - the fix is straightforward, but building, qualifying, canarying and distributing the desktop fix inevitably takes a while, and if you try to speed up this process then you have a high risk of making things much, much worse.

The effect of the bug is that it allows a certain kind of attack ("man in the middle") which intercepts secure web connections, say from a user on an Apple laptop to their online banking system. An attacker with sufficient access and resources can pretend to the user to be their online banking server, and the user will have no practical way to detect this. However in practice it is very difficult to exploit, and is only really a concern for users who believe that they may be targeted by government agencies or well-funded and persistent private parties; it's unlikely that it will be widely exploited. Modern iOS and Safari users are not a large fraction of internet traffic, even if you only look at HTTPS traffic.

The bug itself is probably only interesting to code nerds such as your humble correspondent, but how it came about is quite telling about how software development works at Apple.

Here's a cut-down version of the offending function:

static OSStatus
SSLVerifySignedServerKeyExchange(SSLContext *ctx, bool isRsa, SSLBuffer signedParams,
                                 uint8_t *signature, UInt16 signatureLen)
{
	OSStatus        err;
	[...]
	if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0)
		goto fail;
	if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
		goto fail;
		goto fail;
	if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
		goto fail;
	[...]

fail:
	SSLFreeBuffer(&signedHashes);
	SSLFreeBuffer(&hashCtx);
	return err;
}
See that third "goto fail;" line in the middle? That's the error. Almost certainly it was the result of a fat-finger in a code editor; it's very unlikely to have been a deliberate change. Because C only ties the single following statement to an "if" unless you add braces, the effect of the third "goto fail;" is very different from the first two. It isn't tied to any condition, so if the program gets past the first two "if" statements successfully (the initial secure connection checks) it unconditionally jumps to the cleanup code at "fail:" and never carries out the third check. The value left in the variable "err" therefore records whether the first two checks completed successfully, not (as required) whether all three did - so the function can report success without ever finishing the verification.
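For the curious, here's roughly how the repaired fragment reads. I've also put braces around each branch - not Apple's original style, but cheap insurance against exactly this kind of fat-finger slipping through again:

	if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0) {
		goto fail;
	}
	if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0) {
		goto fail;
	}
	if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0) {
		goto fail;
	}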

The reason this interests me is that this change made it into an official Apple release without being detected. I claim that if this code change was reviewed by a human being (as it definitely should have been) then anyone paying the slightest attention would have seen the duplicate "goto fail;" line which would have made absolutely no sense. I can fully understand this error not being picked up by automated testing - it's not straightforward to build a test which could cause this particular case to fail - but this is another indicator that Apple are not paying nearly enough attention to developing software reliably. Getting another person to review your code changes is a basic part of the software development process. If it's not being done, or only being conducted in a cursory fashion, you're going to leave your code riddled with bugs. There is no shortage of bad actors waiting for you to make a mistake like this.

I'm really curious about how this got noticed. My money is on someone browsing the code to make an unrelated change, and being drawn to the duplicate line, but that's only speculation.

I've given Apple heat for their sloppy approach to security in the past and I'm concerned that they're not reacting to the clear signs that they have a problem in this area. If code changes to a key security library are not going through human review, they're going to continue to have problems.

2014-02-19

Unionisation is in trouble when data is being fudged

Keen observers of the progress of unionisation in the USA have been following with interest the story of VW's Tennessee workers rejecting unionisation even though VW management was studiously neutral on the issue. The United Auto Workers (UAW) - whose stranglehold in Detroit has worked out so well for that city - tried and failed to persuade the Tennessee workers that their interests were aligned.

Now CBS staff writer Sadhbh Walshe desperately tries to put a pro-union spin on these events. (For those curious about the name, it's pronounced Sive).

In Germany, for instance, auto workers at VW plants get paid an average of $67.14 an hour. That's more than double the average hourly rate for an established unionized worker in Detroit, and it's more than three times what the non unionised workers in Chattanooga can hope to earn.
Wow. Because, of course, the cost of living in German cities is totally comparable to the cost of living in both Detroit and Tennessee. I wonder why Ms. Walshe didn't bring up this point in the article? Amazingly, she doubles down on it in the comments:
I'm not sure what the rates are in other countries, but I do know that the cost of living in the US is comparable to Germany so salaries ought to be comparable too.
Ah, "the US". Because 330 million people spread across 3.8 million square miles and 51 different states is just one entity for comparison purposes, despite Germany being 80 million people and less than 5% of the area. Heck, even in the single state of California the differnce in cost of living between the Bay Area (San Francisco to San Jose) and Fresno is huge. A $40K annual salary would be pretty comfortable in Fresno, but practically poverty-level in San Francisco. Contrast Chattanooga, with a median home price of $114K, with Kassel in Germany (home of a VW plant) where you'd be lucky to get a garden shed for that price.

That German earnings figure is extremely suspect even in isolation; as commenter GregUS notes:

Are you seriously claiming that assembly line auto workers make $67.14/hour? This works out to $140,000 per year, which is about the average salary of a primary care physician in Germany (€110,000/year).
German workers are also nearly twice as productive as US workers, so even if they are paid more it would seem reasonable for VW to require US productivity to rise before paying US workers more.

I particularly like Ms. Walshe's spin on right to work states:

It's true that with Tennessee being a so called "right to work" state, where wages are generally low and poverty is high, $19.50 an hour probably seems like a pretty good salary.
"Right to work" means (in essence) that you are not obliged to join a union or pay union dues if you join a unionized workplace. This would seem a no-brainer to most UK workers, but in fact state law bans this practice in about half the states. The non-right-to-work states are CA, OR, WA, MT, CO, NM, MN, WI, MO, IL, KY, OH, WV, PA, NY, VT, MD, NJ, NY, NH, ME, MA, NH, DE, CT, RI.
Let's compare this with the top 20 states with highest unemployment, in descending order. States marked with an asterisk are non-right-to-work:
RI*, NV, IL*, MI* (just changed to Right to Work), CA*, DC, MS, KY*, TN, AZ, GA, CT*, AR, NJ*, OH*, NY*, OR*, MA*, PA*.
See a pattern here? Of course, I can't claim causation with this superficial evidence - but at the very least, it seems that states without the right-to-work law seem to be more likely to suffer from unemployment.

Sadhbh Walshe is desperate to have us believe that the only thing standing between Tennessee auto workers and prosperity is their need to join the UAW, but it seems that even compulsory unionisation is more likely to cause unemployment than to raise wages.

2014-02-13

A tip on tipping

Oh, for fuck's sake. Arjun Sethi in CiF complains about the poverty of tipped employees in the USA:

Across the country, advocates are organizing, rallying, and speaking out in support of raising the US tipped minimum wage of $2.13.
Yes, you read that right. Tipped workers include parking lot attendants, bellhops, baggage porters, manicurists, and barbers. They also include many people in the restaurant industry – waiters, waitresses, and food deliverers. They haven't seen a raise since 1991.
What hasn't risen since 1991 is the employer-paid portion of their wage, not their actual income. Why aren't they deserting these roles in droves for a (federal, hence minimum across all states) minimum wage of $7.25/hr, say at the local Taco Bell? It might just be because retail prices have been rising steadily, and the typical tip percentage has been rising independently. All that the employer-paid minimum really covers is the slack hours when you're doing nothing; as soon as you get a customer, tips come into play.

If you haven't lived in America, here's a primer on tipping. A reasonable tip for most roles is 15%-20% of the purchase price - and in the past 10 years this has moved closer to 20% than 15%. For bartenders it's $1-$2 per drink (glass of wine, shot of regular spirits, pint or bottle of beer). If you go out to a restaurant for a family meal, say 4 people at a table for 1 hour and a bill of $50-$80 then you'll pay a tip of $10-16. If the waitress only serves one table per hour she'd be way over minimum wage. Realistically she's covering more like 5-10 tables - she'll do really well out of this; not as well as the figures suggest since she has to "tip out" (pay a % of tips) to the busboys and kitchen staff, but still a very comfortable wage. She certainly won't get the high rates during the 2pm-7pm slot, since the customers drop off, but even two occupied tables per hour should keep her quids in.
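If you want to check the arithmetic, here it is as a back-of-the-envelope calculation in C. The tip-out share is my assumption; the other figures are the low ends of the ranges above:

#include <stdio.h>

int main(void)
{
	double tables_per_hour = 5.0;	/* low end of the 5-10 tables range  */
	double bill_per_table  = 50.0;	/* low end of the $50-$80 bill range */
	double tip_rate        = 0.15;	/* low end of the 15%-20% range      */
	double tip_out_share   = 0.25;	/* assumed cut for busboys/kitchen   */
	double base_wage       = 2.13;	/* employer-paid tipped minimum wage */

	double tips_kept = tables_per_hour * bill_per_table * tip_rate
	                   * (1.0 - tip_out_share);

	printf("hourly take: $%.2f in tips + $%.2f base = $%.2f\n",
	       tips_kept, base_wage, tips_kept + base_wage);
	return 0;
}

Even with every number at the pessimistic end, that comes out around $30/hour - a long way north of the $7.25 federal minimum.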

Attendants, porters and bellhops get $1 per bag (ballpark); I don't know much about their baggage carrying rate, but they should clear minimum wage without any problems and should be able to make $12/hour (one customer with one bag per 5 mins) just fine during moderately busy periods. Manicurist charges are about $20/30 mins, so 20% yields $8/hour; this is by no means extravagant, and unlike waitressing is capped at one customer per 30 mins or so, but is still above minimum wage. They get screwed over by slack periods of course. But compare "service" at a restaurant in the UK versus one in Manhattan, Dallas or Minneapolis and you'll see just how well financial incentives work.

If by "parking lot attendants" Sethi means "valet" - you don't tip the person in the booth at the city parking lot - then they do staggeringly well. At a hotel with valet you pay $10-$20 per day and tip $1 min, $5 max per fetch (5 mins). Valet 8 cars per hour with minimal tip and you're still ahead of the game. I've typically tipped in the $2-$3 region, and the valets I've seen are busy for most of the day.

Note that fast food staff aren't covered by this since you don't normally tip at McDonalds, Five Guys, In-N-Out, Wendys, Taco Bell; they get regular minimum wage.

You don't tip gas station attendants; postal workers and gardeners get a one-off Christmas tip. Taxi drivers get 15%-20% and do just fine - try leaving a 10% tip for a Manhattan cabbie and learn inventive new words. And only assholes don't tip at all; I've left a sub-15% tip on maybe 5% of occasions, and only for service that was well below "adequate" and tending to "derisory".

Who is this mendacious clown?

Arjun Sethi is an attorney in Washington, DC, and a frequent commentator on civil rights and social justice related issues. He is a member of the Minimum Wage Coalition, a consortium of more than 40 groups, nonprofits, and faith-based organizations, working to increase the wage.
Wow, certainly no conflict of interest there. I note that he doesn't add "and conspiring to reduce employment" to that description, despite that being the net effect of his lobbying; highly-qualified and privileged asshole that he is.

2014-02-11

Autocompleting Keith Vaz

This was amusing in a puerile sense: MP Keith Vaz backing the Daily Mail in trying to remove offensive Google autocomplete search results:

The search engine was found to be making crude and offensive suggestions when users typed in 'Why is' followed by the name of a city. For example, if a user typed in the phrase 'Why is Bradford...' the site automatically suggested the search 'Why is Bradford so full of P****.' Typing in the phrase with Leicester and Birmingham also produced similar results.
Offensive, certainly; and Leicester MP Keith Vaz wasn't amused:
'I am shocked these search options come up. It is totally unacceptable and I will be raising this matter in Parliament'
Apparently these results come up based on what users select, so presumably a lot of people in the UK have a rather slanted view of Leicester. I had a scout through other cities; the best result I found was "why is norwich inbred", and I was disappointed that "why is milton keynes" didn't autocomplete to "full of roundabouts".

I wondered what Google autocomplete search thought of Keith Vaz:

Oh. Fair enough, I guess.

Microsoft angling to take on Internet searches in China?

An interesting tidbit reported in the Grauniad today; Microsoft seems to be serving PRC government-approved search results to Chinese language users in the USA:

A search on Bing in Chinese for Bo Xilai (薄熙来), the former high-flying Chinese government official now serving life imprisonment for corruption, shows equally different results. The top search result is again Baidu Baike. Wikipedia is the third entry. There are no western reports on the politician on the front page. In English the search is topped by Wikipedia, then by stories from the New York Times, BBC and Financial Times.
I should hasten to point out that it's possible (though not particularly likely) for this ordering of search results to arise from users, not Microsoft company decisions; one significant signal in search engine ranking is the choices of users when presented with a search results page. If 100 users searching for "dancing cats" are presented with the same top ten list of search results on Bing, Bing can gather good relevance data by seeing which of these search results users tend to click on first; better yet, if users click on result #1 and then shortly afterwards result #2, but don't tend to click on result #3, this may indicate that result #1 doesn't work for them but result #2 does, so maybe Bing should serve result #2 first for this search term.
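To make that click-ordering signal concrete, here's a toy sketch in C. The counters, names and numbers are all invented, and a real ranking system uses vastly more signals than this:

#include <stdio.h>

struct result_stats {
	const char *url;
	long clicks;		/* users who clicked this result               */
	long satisfied_clicks;	/* clicks NOT followed by a further click      */
};

static double satisfaction(const struct result_stats *r)
{
	return r->clicks ? (double)r->satisfied_clicks / (double)r->clicks : 0.0;
}

int main(void)
{
	struct result_stats page[3] = {
		{ "result-1.example", 100, 20 },	/* clicked, then abandoned */
		{ "result-2.example",  80, 70 },	/* clicked and users stop  */
		{ "result-3.example",   5,  3 },
	};

	for (int i = 0; i < 3; i++)
		printf("%s: satisfaction %.0f%%\n", page[i].url,
		       100.0 * satisfaction(&page[i]));
	return 0;
}

Fed data like this, a ranker would consider promoting result #2 above result #1.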

Bing uses a lot of different data for its ranking such as user location, language and previous search behaviour; famously, for "long tail" search results it appears to use the choices of Internet Explorer users searching through Google as one signal. However, this particular case strongly suggests that something more is going on. Chinese users in the USA aren't particularly likely to follow the PRC government line on what results are "relevant"; if anything, I'd expect them to slant towards freedom of information since they tend to be richer than the average American and more likely to work in the tech sector. So if they aren't choosing PRC government approved sites for their search results, who is? The simplest answer is "Microsoft".

The motivation is suggested in the article's final paragraph:

Bing accounts for a small percentage of search in China but has been building up its web services in the country. Microsoft is in the middle of hiring 1,000 new employees to build up its services in China.
Search in China is dominated by the home-grown service Baidu; Yahoo is big in Japan but not the PRC, especially after Jerry Yang's experience of yielding user details to the Chinese government, and Google has repeatedly butted heads with PRC censors after refusing to play the government's censorship game, for instance by highlighting to Chinese users search terms that might result in the Great Firewall dropping their connection. Bing is a pretty decent search engine with a big corpus of documents from the non-Chinese part of the world, and could probably (with a following wind) do much better than Baidu on Chinese search results if it can build up a good model of Chinese-language searching; the Chinese-speaking population of America and Europe is probably a great place to test out results. Of course, if Bing wants to tune its algorithms here then it has to present Western users with the same kind of results that Chinese users would see - hence the apparent censorship.

I don't think any reader of my past blogs has much doubt where I stand on Chinese internet censorship, but I have to commend Microsoft for its commercial sense in taking this approach. If you don't care about whether you're doing the right thing for free speech, it's a perfectly reasonable path towards an income stream of yuan from the hundreds of millions of Chinese users. I do wonder though whether they've thought through how it's going to look in the USA when (not if) they are forced to knuckle under the PRC government's agenda.

Update: Microsoft responds: who, us? Not us, guv'nor - it was totally a mistake:

"Bing does not apply China’s legal requirements to searches conducted outside of China," Bing senior director Stefan Weitz said in a statement provided to The Reg.
"Due to an error in our system, we triggered an incorrect results removal notification for some searches noted in the report but the results themselves are and were unaltered outside of China."
Oopsie. I'd be more prepared to give them the benefit of the doubt here if I didn't believe they'd sell their employees' organs to keep their Office monopoly going...

2014-02-10

Aligned incentives to improve uptime and usability

Something I've been chewing over for the last couple of weeks is why we see such a disparity between applications that work really well, scale well and are extremely reliable (e.g. Facebook), and others which are and remain a complete disaster despite huge amounts of money spent developing and supporting them (e.g. Healthcare.gov). I'm going to propose Hopper's Law of Operational Sanity:

Axiom: Your codebase can only be significantly improved for users when your developers feel users' pain.
Corollary: If you're outsourcing support for your application, you don't really care about making it better.
A bold claim; let's try to justify it.

Let's assume that our application in question is used heavily by the public, and that it includes a reasonable feedback mechanism for problems (e.g. a forum, FAQ page, static support pages with a feedback form for more detailed problem reports). We're doing all the monitoring basics; logging success/failure rates, have trend reports around these stats, have a QA/test team and do reasonably frequent releases to production, and have an operations team responsible for monitoring the system and reacting to problems. I claim that this is nowhere near enough for a system that users actually want to use. Why not?

The problem comes from the disconnect between the interests of the engineers developing and testing the application, the interests of the operations team, and the interests of the people using it. Developers are paid based on the features they launch and on the visibility of the bugs they fix. If the operations team is getting alerts every hour for a condition that isn't really important (or, at least, can't be fixed), the development team is unlikely to care - they will just tell the ops team to ignore the alerts. That's fine, but you're desensitising the ops team to alerts and filling their mailboxes with noise. When an actual problem happens and an alert is sent, it'll take the ops team longer to notice; they may even have set up a mailbox rule to file these alerts, and not notice the real one for days.

Similarly, if there are occasional system overload problems the ops team will have to scramble to fix them, generating lots of bustling activity. The developers don't care, because the ops team can deal with them. The ops team are paid for their activity, so don't mind the occasional panic. No-one has an interest in adding code to detect this condition arising and make it easier to handle (e.g. by measuring system capacity and prophylactically alerting if the system goes over 80% of that level, or adding the option to switch the system into a lower-load mode by temporarily blocking expensive functions), despite the fact that this would be of great long-term benefit to the company in reducing failures and their associated cost.
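Here's a sketch in C of the kind of code that nobody has an incentive to write; the capacity figure, the thresholds and the "expensive features" switch are placeholders rather than anything from a real system:

#include <stdbool.h>
#include <stdio.h>

#define MEASURED_CAPACITY_QPS 10000.0	/* from your own load measurements */
#define ALERT_THRESHOLD       0.80	/* warn well before you fall over  */
#define SHED_THRESHOLD        0.95	/* start turning things off        */

static bool expensive_features_enabled = true;

void check_load(double current_qps)
{
	double utilisation = current_qps / MEASURED_CAPACITY_QPS;

	if (utilisation > ALERT_THRESHOLD)
		printf("ALERT: running at %.0f%% of measured capacity\n",
		       100.0 * utilisation);
	if (utilisation > SHED_THRESHOLD && expensive_features_enabled) {
		expensive_features_enabled = false;	/* block the costly functions */
		printf("shedding load: expensive features disabled\n");
	}
}

int main(void)
{
	check_load(8500.0);	/* 85%: the alert fires                  */
	check_load(9700.0);	/* 97%: the alert fires and load is shed */
	return 0;
}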

The gap between the developers and the users is even more straightforward. The developers and the marketing team can come up with all sorts of cool ideas and features for the product, spend months developing and launching them, and still be left with the product being slammed by unsatisfied users. How come, after all this effort developing and testing the product? Well, it could simply be that, according to the old saw, the dogs don't like it. If the developers are relying on feedback from the launched product to know if the users are happy, they've already lost; the development feedback loop will take way too long. Instead they need to know about problems before they launch the feature. How to do that? Dogfood it!

"Dogfooding" is the process by which employees at a company use a pre-release version of the application for their daily work; ideally, nearly all the developers of the app would dogfood. This is very different from system and QA testing because this work is not done with test data and test users; instead, a real person is interacting with the app and trying to make it do what they want. The only difference between this person and a regular user is that the tester has (or should have) an easy way of raising a specific bug on the developers if they run into something that gets in their way.

Scaling is certainly a user-facing problem, and you can't easily test scaling with a dogfood app in isolation; however, if you can measure its resource consumption in isolation then you can get a delta of resource usage between dogfood releases, which should at least highlight whether a new release has a CPU-, memory- or network-devouring bug. This is not a foolproof method, but it's certainly better than the standard approach of running load tests; the only problems load tests will normally detect are gross changes in performance for whatever standardised actions you're feeding into them.
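As a sketch of that release-to-release delta idea - the field names, the numbers and the 20% tolerance are all assumptions for illustration:

#include <stdio.h>

struct release_usage {
	const char *version;
	double cpu_seconds_per_request;
	double rss_megabytes;
};

/* Flag any dogfood release whose resource use jumps by more than 20%
   relative to the previous one. */
static void compare(const struct release_usage *prev,
                    const struct release_usage *next)
{
	double cpu_growth = next->cpu_seconds_per_request /
	                    prev->cpu_seconds_per_request - 1.0;
	double mem_growth = next->rss_megabytes / prev->rss_megabytes - 1.0;

	if (cpu_growth > 0.20 || mem_growth > 0.20)
		printf("%s: resource regression vs %s (cpu %+.0f%%, mem %+.0f%%)\n",
		       next->version, prev->version,
		       100.0 * cpu_growth, 100.0 * mem_growth);
}

int main(void)
{
	struct release_usage r41 = { "dogfood-41", 0.012, 850.0 };
	struct release_usage r42 = { "dogfood-42", 0.019, 870.0 };

	compare(&r41, &r42);	/* flags the CPU-devouring change in release 42 */
	return 0;
}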

If you want your system to work well, the people developing it need to feel the immediate pain of their errors - make them responsible for monitoring their part of the system for at least three months after launch, and only allow it to be handed over to the operations team once it has demonstrated a suitably low level of alerting, with a suitably high fraction of those alerts corresponding to real error conditions. Make the developers use a pre-release version of the system for their daily work; if it's a webmail system then they use it for their mail, if it's a bug database then keep the bug database's own bugs in a pre-release copy of itself. You're going to take a slight hit on productivity, but you'll be surprised how quickly and smoothly you can fail over from a bad version to a known good version after the first few such problems - and if you can do it for developers, you can do it for your production version.

Incidentally, a facet of this law was in evidence in the recent revelations about Edward Snowden's access to NSA data:

Intelligence officials have claimed that Snowden was able to do all this [automatically web-crawling the NSA pages] largely because the Oahu NSA facilities had not gotten the software purchased to prevent insider threats in the wake of WikiLeaks. "He was either very lucky or very strategic" to get the positions he held in Hawaii, one official told the Times. But it's also entirely possible that his activities would have gone unchecked in any case, simply because of his system administrator status.
I suspect it's even simpler than that. No-one in a position to detect and stop Snowden was sufficiently interested in securing the network. They'd certainly implement any system they were asked to and investigate any alert that went off - that's what they were paid for, after all. But no-one was really engaged in pro-actively hunting down and preventing security threats. This is a hard thing to set up, admittedly - how do you pay someone for a security breach failing to happen because of their actions? - but if you can't figure out how to do it then Snowden-like security breaches will keep happening.

All the stable door slamming currently going on will simply make it harder for people to do their jobs, because the formerly loose data access that they benefited from will be taken away in the name of security; the pendulum has swung in the other direction, and now no-one will benefit from highlighting a security change with a good security/access trade-off because everyone is in arse-covering mode and no-one will countenance loosening security for any reason.

Incentives matter; this is as true in software as it is anywhere else. Now we need to start acting like they matter.

Remedial education on citizenship needed for New York's Mayor

New NYC mayor Bill de Blasio managed an outstanding line as part of his State of the City address:

Guess what, Bill? The reason why nearly all those voices go "unheard" is that they are not allowed to vote in federal, state and city elections because - shock!- they are not US citizens and so they have no voting rights because they are guests of the country. This, incidentally, is not restricted to indocumentados; even the "Green Card" holders, documented permanent residents of the USA, cannot vote. This is a perfectly reasonable position; you don't give a dinner guest in your house approval rights over the choice of curtain patterns.

Bill de Blasio's focus on "undocumented" New Yorkers, however, is disingenuous in the extreme - he clearly prioritises the Democratic-sympathising "undocumented" non-citizens over the privately employed (hence Republican-leaning) "documented" non-citizens, despite the indisputable fact that the latter have real Social Security numbers and pay federal, state and city income taxes, unlike the former.

New York, pay attention: de Blasio is far more interested in adding to the Democrat base than in any kind of "suffrage justice". He's a grade A self-interested weasel.

2014-02-07

Special pleading for female hackers considered harmful

Just for the title of her article "Girls and Software", hacker Susan Sons would get pilloried by the feministas. And yet she has done more to explain publicly the actual problems than anyone I know:

Twelve-year-old girls today don't generally get to have the experiences that I did. Parents are warned to keep kids off the computer lest they get lured away by child molesters or worse—become fat! That goes doubly for girls, who then grow up to be liberal arts majors. Then, in their late teens or early twenties, someone who feels the gender skew in technology communities is a problem drags them to a LUG meeting or an IRC channel. Shockingly, this doesn't turn the young women into hackers.
It's not impossible to feel the tug of software later in life and get involved, but it's pretty apparent that childhood involvement is one of the main feeders of the world supply of hackers. Warning children - especially girls - about the dangers of online communities is clearly a good idea, but if you end up keeping them out of those communities altogether then you've just choked the pipeline that produces the next generation of hackers and hence the next generation of software.

You, gentle reader, need to set aside five minutes and read Sons' whole piece - right now. Be you male, female, transgender, hacker or computerphobe, it's one of the best pieces I've seen on this issue. By all that's holy, she nails it:

Open source was my refuge because it was a place where nobody cared what my pedigree was or what I looked like — they cared only about what I did.
Heavens above, don't tell the Guardian columnists. We're never going to get any progress in involving women in software with that kind of attitude, are we? Where are the affirmative-action programs to demand minimum percentile representation?

As flawed-but-interesting arch-hacker ESR notes in the Jargon File:

Racial and ethnic prejudice is notably uncommon and tends to be met with freezing contempt.
When asked, hackers often ascribe their culture's gender- and color-blindness to a positive effect of text-only network channels, and this is doubtless a powerful influence. Also, the ties many hackers have to AI research and SF literature may have helped them to develop an idea of personhood that is inclusive rather than exclusive — after all, if one's imagination readily grants full human rights to future AI programs, robots, dolphins, and extraterrestrial aliens, mere color and gender can't seem very important any more.
I've found the world of hackers to be one of the most welcoming places for transgendered people, who appear to be significantly over-represented in software engineering. As a canonical example I give you Sophie Wilson, who implemented the BASIC interpreter that powered the 1980s home computer phenomenon, the BBC Micro. Sophie (as "Roger" back then) crammed a superb BASIC implementation into a 16KB ROM space, leaving five bytes spare to tag it with "Roger" at the top of the address space. I can't remember offhand if there was another byte spare that would have allowed "Sophie" to fit. Anyway, Sophie was a hacker par excellence, and when she changed from Roger to Sophie no-one batted an eyelid. I remember one post on the comp.sys.acorn Usenet group where a poster tried to poke fun at her gender change, but was shut down in short order by the rest of the group. We didn't care about her gender, just her code.

One of the major problems that Sons isolates is, ironically, the less-well-thought-out attempts to "promote" women in software engineering:

It used to be that I was comfortable standing side by side with men, and no one cared how I looked. Now I find myself having to waste time talking about my gender rather than my technology...otherwise, there are lectures:
  • The "you didn't have a woman on the panel" lecture. I'm on the panel, but I'm told I don't count because of the way I dress: t-shirt, jeans, boots, no make-up.
  • The "you desexualize yourself to fit in; you're oppressed!" lecture. I'm told that deep in my female heart I must really love make-up and fashion. It's not that I'm a geek who doesn't much care how she looks.
It's hard to over-emphasise how screwy these attitudes are. "We want women involved in computer science!" "Well, I'm a woman in computer science." "But you don't look like a typical woman!" "I'm not a typical woman. I'm involved in computer science. I'm a typical hacker." In my experience women in software can go either way in dress style; T-shirt and jeans probably pips skirts and dresses; indeed, the transgender hackers make up a significant fraction of the latter.

That's not to say there aren't serious problems with how women in software are treated - there is no shortage of chauvinism, trolling and plain bad manners, like anywhere else. A particular problem is that, as Sons notes, many hackers are poor at social skills and don't have any real filter for their words and behaviour; when they encounter a woman they may want to be welcoming, but it seldom comes out right. Still, they'll battle to the death to protect their own whether they be male, female, transgender, robot or dolphin.

Want to encourage more women into software? Tell them how they'll be treated on their merits. On the Internet, no-one knows if you're a dog, a cat, a man or a woman. And in the world of hackers, no-one cares.

[Hat tip: The Advice Goddess]

2014-02-06

The efficiency of central planning

Now we're seeing just how well a centrally planned Winter Olympics can be run in a socialist country, for the bargain price of $50 billion so far. Shouldn't the Guardian be shipping Polly Toynbee over to Sochi to report on the benefits of central government planning and purchasing in this event? After all, she's so vocally in favour of it in the UK...

I'm watching @SochiProblems on Twitter with something between amusement and growing horror...

Luckily the Russian sense of humour is coming to the fore:

The Brazilians organising Rio 2016 must be down on their knees thanking Vladimir Vladimirovich Putin for this farce - no matter how badly they mess up, they'll still look like a well-oiled machine compared to Sochi 2014.

2014-02-05

Now CASH are going after beer

Dear little green apples, it seems that our beer is too sugary:

Cardiologist Dr Aseem Malhotra said: ‘The levels of sugar in some of these drinks is quite staggering.
'There's nothing wrong with the occasional drink but unfortunately we are consuming much more than is good for us.' Dr Malhotra said alcohol-related ill health is costing the NHS £3.3 billion a year.
Nine teaspoons of sugar in a pint of real ale, apparently. Whoop de doo. That's about 36 grams, or two and a half tablespoons. And on the (unsourced) £3.3 billion, I note that alcohol duty is £10-11bn so it would seem to be more than covering its alleged costs.

Who is cardiologist Dr Aseem Malhotra? Oh look, he's listed as one of Action on Sugar's medical advisors:

Dr Aseem Malholtra, Cardiologist and Science Director of Action on Sugar
And yet, he doesn't publicise any academic connection, unlike most of the Action on Sugar advisors. How curious. He's a HuffPost contributor who identifies as a "Cardiologist and writer with a special interest in improving the nation's diet". A bit of digging indicates that he's a cardiology registrar at Croydon University Hospital. I'm sure he's a perfectly good registrar but, let's face it, CUH is not known as one of the great medical research establishments. Even his BMJ article is an "Observations" piece, not a research paper.

So CASH/Action on Sugar's science director seems to be a rather second-rank medic. Shallow calling unto shallow, I guess.

2014-02-04

Hard core computing from the last century

A spot of tech nostalgia for us, with Google's hirsute chief engineer, Urs Hölzle, discussing his first day in Google's "data center" 15 years ago:

[...] a megabit cost $1200/month and we had to buy two, an amount we didn't actually reach until the summer of 1999. (At the time, 1 Mbps was roughly equivalent to a million queries per day.)
- You'll see a second line for bandwidth, that was a special deal for crawl bandwidth. Larry had convinced the sales person that they should give it to us for "cheap" because it's all incoming traffic, which didn't require any extra bandwidth for them because Exodus traffic was primarily outbound.
What's interesting here is that the primary criterion for billing was space - square footage taken up on the colocation site's floor. Network was an additional cost as noted above, but Exodus didn't bill its residents for power - the 3 x 20A required for all the servers was just a scrawled note on the invoice. Nowadays power is one of the most fundamental requirements of a data center, and you don't pour the first bit of concrete before you've got your megawattage lined up. Apple goes as far as sticking its own solar power generation around its North Carolina data center. We've come a long way in fifteen years.

You wouldn't be able to get away with a server rack like Google's 1999 design nowadays - just look at the way they cram the hardware into every available space. I've seen one of these racks on display, and you can barely see any daylight through it from front to back. The fire safety inspector would have kittens.

In the comments, Todd Reed calculates that if you tried to run today's YouTube while paying those data rates, you'd be forking over just under $3bn per month...

This just makes the point that the computing world of 15 years ago really was a different generation from today. Google was anticipating that a few megabits per second would be more than enough to keep crawling the entire web and keep up with the addition of content. Let's look at the most content-dense medium of the modern web - Tweets. In 2013 Twitter averaged 5700 Tweets per second. At 140 characters plus maybe 60 characters of timestamp and attribution, that's 200 x 5700 = 1,140,000 characters per second, or about 9 Mbits per second (Mbps). It would have cost Google nearly $11,000 per month just to keep up with Twitter's tweets. Nowadays you can get 20Mbps on your home Internet connection for $75 per month (business class) which should cope comfortably with two Twitters - until they started allowing you to attach images...
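For anyone who wants to check that sum, here it is as a trivial C program using the same assumed figures:

#include <stdio.h>

int main(void)
{
	double tweets_per_second = 5700.0;	/* Twitter's 2013 average          */
	double bytes_per_tweet   = 200.0;	/* text plus timestamp/attribution */
	double dollars_per_mbps  = 1200.0;	/* Exodus pricing, circa 1999      */

	double mbps = tweets_per_second * bytes_per_tweet * 8.0 / 1e6;

	printf("%.1f Mbps -> about $%.0f per month at 1999 rates\n",
	       mbps, mbps * dollars_per_mbps);
	return 0;
}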