Since the US government has made a pig's ear, dog's breakfast and sundry other animal preparations of its health care exchange HealthCare.Gov, I thought I'd exercise some 20/20 hindsight and explain how it should (or at least could) have been done in a way that would not cost hundreds of millions of dollars and would not lead to egg all over the face of Very Important People. I don't feel guilty exercising hindsight, since the architects of this appalling mess didn't seem to worry about exercising any foresight.
Now we crunch the numbers. We expect to be offering this to tens of millions of Americans eventually, maybe up to 100M people in this incarnation. We also know that there is very keen interest in this system, and so many other people could be browsing the site or comparison-shopping with their existing insurance plans even if they don't intend to buy. Let's say that we could expect a total of 50M individual people visiting the site in its first full week of operation. The average number of hits per individual: let's say, 20. We assume 12 hours of usage per day given that it spans America (and ignore Hawaii). That's 50M x 20 = 1bn hits per week, spread over 7 x 12 hours = 302400 seconds, which yields an average hit rate of about 3300 hits per second. You can expect peaks of twice that, and spikes of maybe five times that during e.g. news broadcasts about the system. So you have to handle a peak of roughly 15000 hits per second. That's quite a lot, so let's think about managing it.
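If you want to play with these guesses yourself, here's a minimal sketch of the same back-of-envelope sum in Python. The function and parameter names are mine, and every default is just one of the guesses above, not a measurement. Note that the unrounded spike figure comes out a little above the round 15000 I use in the text.

```python
def traffic_estimate(visitors=50_000_000, hits_per_visitor=20,
                     active_hours_per_day=12, days=7):
    """Back-of-envelope hit rates; all defaults are guesses from the text."""
    total_hits = visitors * hits_per_visitor
    # Only count the hours per day in which we assume people are using the site.
    active_seconds = active_hours_per_day * 3600 * days
    average = total_hits / active_seconds
    return {
        "total_hits": total_hits,
        "active_seconds": active_seconds,
        "average_hps": average,
        "peak_hps": 2 * average,   # "peaks of twice that"
        "spike_hps": 5 * average,  # "spikes of maybe five times that"
    }


if __name__ == "__main__":
    for name, value in traffic_estimate().items():
        print(f"{name}: {value:,.0f}")
```

Change any one input (say, 40 hits per visitor instead of 20) and you can see straight away how far the spike figure moves.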
The first thing I think here is "I don't want to be worrying about hardware scaling issues that other people have already solved." I'm already thinking about running most of this, at least the user-facing portion, on hosted services like Amazon's EC2 or Google's App Engine. Maybe even Microsoft's Azure, if you particularly enjoy pain. All three of these behemoths have a staggering number of computers. You pay for the computers you use; they let you keep requesting capacity and they keep giving it to you. This is ideal for our model of very variable query rates. If we need about one CPU and 1GB of RAM to handle three queries per second of traffic, we'll want to provision about 5000 CPUs (say, 2500 machines) during our first week to handle the spikes, but maybe no more than 500 CPUs during much of the rest of the year.
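The provisioning sum is just as simple. Here's a rough sketch, assuming the three-queries-per-CPU figure above and two CPUs per machine (which is what "5000 CPUs, say 2500 machines" implies). The helper names are made up for illustration, and a real capacity plan would be driven by measured load rather than these guesses.

```python
import math

QUERIES_PER_CPU = 3    # guess from the text: one CPU + 1GB RAM handles ~3 queries/sec
CPUS_PER_MACHINE = 2   # assumption implied by "5000 CPUs (say, 2500 machines)"


def cpus_needed(hits_per_second, queries_per_cpu=QUERIES_PER_CPU):
    """Round up: a fraction of a CPU still means renting a whole one."""
    return math.ceil(hits_per_second / queries_per_cpu)


def machines_needed(cpus, cpus_per_machine=CPUS_PER_MACHINE):
    return math.ceil(cpus / cpus_per_machine)


if __name__ == "__main__":
    spike_cpus = cpus_needed(15000)
    print(spike_cpus, "CPUs,", machines_needed(spike_cpus), "machines for launch week")
    # Working backwards: 500 CPUs covers about 1500 hits per second.
    print(500 * QUERIES_PER_CPU, "hits/sec on the quiet-period fleet")
```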
The next thought I have is "comparison shopping is hard and expensive, let's restrict it to users who we know are eligible". I'd make account creation very simple: sign up with your name, address and email address plus a simple password. Once you've signed up, your account is put in a "pending" state. We then mail a letter to the address you gave, a) confirming the sign-up but masking out some of your email address and b) providing you with a numeric code. You make your account active and able to see plans by logging in and entering your numeric code. If you forget your password in the interim, we send you a recovery link. This is all well-trodden practice. The upshot is that we know - at least, at a reasonable level of assurance - that every user with an active account a) is within our covered area and b) is not just a casual browser.
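To make the pending/active flow concrete, here's a minimal sketch in Python. Everything in it (the Account class, the eight-digit code, the masking format) is my own illustration rather than a spec, and a real system would also hash the stored code, expire it, and rate-limit attempts.

```python
import hmac
import secrets
from dataclasses import dataclass, field


def new_activation_code(digits=8):
    """Random numeric code to print in the letter (length is an assumption)."""
    return "".join(secrets.choice("0123456789") for _ in range(digits))


def mask_email(email):
    """Keep the first character and the domain, e.g. a****@example.com."""
    local, _, domain = email.partition("@")
    return local[:1] + "*" * max(len(local) - 1, 0) + "@" + domain


@dataclass
class Account:
    name: str
    address: str
    email: str
    state: str = "pending"  # every new account starts out pending
    activation_code: str = field(default_factory=new_activation_code)

    def letter_text(self):
        """What goes in the posted letter: masked email plus the code."""
        return (f"Sign-up confirmed for {mask_email(self.email)}. "
                f"Your activation code is {self.activation_code}.")

    def activate(self, entered_code):
        """Flip pending -> active only if the mailed code matches."""
        matches = hmac.compare_digest(entered_code.encode(),
                                      self.activation_code.encode())
        if self.state == "pending" and matches:
            self.state = "active"
        return self.state == "active"

    def can_view_plans(self):
        return self.state == "active"
```

The point of the design is that the expensive part of the site (plan comparison) sits behind can_view_plans(), so only people who have received a letter at a real address ever reach it.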
There's not much you can do about the business rules evaluation to determine plan costs: it's going to be expensive in computation. I'd personally be instrumenting the heck out of this code to spot any quick wins in reducing computation effort. But we've already filtered out the looky-loos to improve the "quality" (likelihood of actually wanting to buy insurance) of users looking at the plans, which helps. Checking the feeds to insurers is also important; put your best testing, integration and QA people on this, since you're dealing with a bunch of foreign systems that will not work as you expect and you need to be seriously defensive.
Now we think about launch. We realise that our website and backends are going to have bugs, and the most likely place for these bugs is in the rules evaluation and feeds to insurers. As such, we want to detect and nail these bugs before they cause widespread problems. What I'd do is, at least 1 month in advance of our planned country-wide launch, launch this site for one of the smaller states - say, Wyoming or Vermont which have populations around 500K - and announce that we will apply a one-off credit of $100 per individual or $200 per family to users from this state purchasing insurance. Ballpark guess: these credits will cost around $10M which is incredibly cheap for a live test. We provision the crap out of our system and wait for the flood of applications, expect things to break, and measure our actual load and resources consumed. We are careful about user account creation - we warn users to expect their account creation letters within 10 days, and deliberately stagger sending them so we have a gradual trickle of users onto the site. We have a natural limit of users on the site due to our address validation. Obviously, we find bugs - we fix them as best we can, and ensure we have a solid suite of regression testing that will catch the bugs if they re-occur in future. The rule is "demonstrate, make a test that fails, fix, ensure the test passes."
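The staggering of letters is easy to sketch. The version below is only an illustration, assuming we spread the mail-outs evenly across a window of days. The function name and the window length are mine, and in practice you'd pick a window short enough that the post still arrives inside the promised 10 days.

```python
from datetime import date, timedelta


def stagger_mailings(account_ids, start, days=10):
    """Assign each pending account a mailing date, round-robin over `days` days.

    Returns a dict mapping mailing date -> list of account ids to post that day.
    """
    schedule = {}
    for i, account_id in enumerate(account_ids):
        mail_on = start + timedelta(days=i % days)
        schedule.setdefault(mail_on, []).append(account_id)
    return schedule


if __name__ == "__main__":
    # Hypothetical batch of 25 sign-ups, mailed over 10 days from 1 September 2013.
    plan = stagger_mailings(range(25), date(2013, 9, 1), days=10)
    for day in sorted(plan):
        print(day.isoformat(), len(plan[day]), "letters")
```

Because nobody can activate an account before their letter lands, the mailing schedule is effectively a throttle on how fast new users reach the expensive parts of the site.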
Once we're happy that we've found all the bugs we can, we open it to another, larger, state and repeat, though this time not offering the credit. We onboard more and more states, each time waiting for the initial surge of users to subside before opening to the next one. The current state-by-state invitation list is prominent on the home page of our site. Our rule of thumb is that we never invite more users than we already have (as a proportion of state population), so we can do no more than approximately double our traffic each time.
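One way to read that rule of thumb is "the next state we invite must not have more people than all the states already invited put together". Here's a small sketch of planning a rollout order under that reading. The state names and populations in the example are placeholders, not real figures, and the greedy choice of the biggest eligible state is my own assumption.

```python
def plan_rollout(populations, first):
    """Order states so each newly invited one is no bigger than the total so far.

    populations: dict of state name -> population (placeholder data in the example).
    first: the small pilot state we launch with.
    """
    invited = [first]
    covered = populations[first]
    remaining = {s: p for s, p in populations.items() if s != first}
    while remaining:
        # Only states that would at most double the invited population qualify.
        eligible = [s for s, p in remaining.items() if p <= covered]
        if not eligible:
            break  # nothing small enough left; the rule would have to be relaxed
        nxt = max(eligible, key=lambda s: remaining[s])
        invited.append(nxt)
        covered += remaining.pop(nxt)
    return invited


if __name__ == "__main__":
    # Entirely hypothetical states and populations, for illustration only.
    example = {"A": 500_000, "B": 600_000, "C": 900_000,
               "D": 2_400_000, "E": 400_000}
    print(plan_rollout(example, first="A"))
```

If the loop stops early, that's a hint that the remaining states are too big to onboard in one go under this rule, and you'd want to invite them in slices (by county, say) instead.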
This is not a "big bang" launch approach. This is because I don't want to create a large crater with the launch.
For the benefit of anyone trying to do something like this, feel free to redistribute and share, even for commercial use.
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Update: also very worth reading Luke Chung's take on this application, which comes from a slightly different perspective but comes up with many similar conclusions on the design, and also makes the excellent usability point:
The primary mistake the designers of the system made was assuming that people would visit the web site, step through the process, see their subsidy, review the options, and select "buy" a policy. That is NOT how the buying process works. It's not the way people use Amazon.com, a bank mortgage site, or other insurance pricing sites for life, auto or homeowner policies. People want to know their options and prices before making a purchase decision, often want to discuss it with others, and take days to be comfortable making a decision. Especially when the deadline is months away. What's the rush?