
Could Better Testing Practices Have Prevented the Healthcare.gov Defects? - Baustin
http://blog.smartbear.com/quality-assurance/could-better-testing-practices-have-prevented-the-healthcare-gov-defects/
======
ck2
The site was designed for only 50,000 simultaneous users.

[http://www.usatoday.com/story/news/nation/2013/10/05/health-...](http://www.usatoday.com/story/news/nation/2013/10/05/health-care-website-repairs/2927597/)

Didn't that site cost millions to make?

Wait, is this blog entry trying to sell their own products?

This is a better round-up of stories about the website:

[http://www.theatlanticwire.com/politics/2013/10/obama-admini...](http://www.theatlanticwire.com/politics/2013/10/obama-administration-opens-about-exchange-glitches/70248/)

~~~
jrochkind1
A bunch of other better articles on it have been posted to HN, but not upvoted
enough to make the front page. I too find this surprising, this sort of highly
visible large scale web app rollout failure seems like a topic HN would like.

Not sure why this one finally made it, heh.

~~~
mgkimsal
Yeah, I've figured there'd be more here about it, but it's sort of like
shooting fish in a barrel at this point. Plus... it will inevitably devolve
into politics. And lastly, it's not actually working yet. We don't have any
postmortem about what's actually going wrong - maybe we'll get that soon (or
never?)

------
jrochkind1
Sure, they could have been prevented. By spending more money. Isn't that
always the answer?

Creating a website that can handle, well, nearly the entire country, out of
the gate, is not trivial. (Facebook etc got to scale up to that level, they
didn't need to do it out of the gate). But it also can be done.

How? Well, you hire real experts in doing this sort of thing. And you give
them enough calendar time, and enough billable hours, to do it right. And you
don't give them your own 'business requirements' which contradict their
expertise on what's necessary to make it work.

Obviously the government did not do one or more of those things, right?

(I will confess that I don't have a high opinion of the likely expertise level
of typical government IT contracting firms.)

~~~
JPKab
Your opinion shouldn't be high. There are talented IT people working at
government contractors, but they aren't going to get put where they are
needed, because contractors aren't given any incentive to perform the work at
a top-notch level.

Gov't contracting is, in economic terms, a rent-seeking business. The best and
brightest in a given contracting firm are dedicated to pursuing new work. I
work for a contractor, and my fun IT projects are all proof of concept items.
Stellar execution simply isn't rewarded. The gov't puts idiots in charge of
things: not idiots in the sense that they aren't smart, but idiots in the
sense that they have no expertise in anything. They are hired for paper
qualifications (ever heard of a Project Management Professional cert? It's
what they hire, and it's useless), and the bright/motivated ones move on to
interesting work, perhaps in other parts of the gov't, or the private sector.
I built a great tool, using all open source technology for the gov't once. I
built it in 3 weeks (it is now officially gov't owned software), and it
replaced an $80K piece of software (commercial off the shelf) that was
inadequate for the task. My reward was complaints that the security people
weren't familiar with Postgres, and an immediate request to migrate to Oracle.
There was no functional reason for this, but I complied after the gov't
shelled out huge sums of money for an Oracle license. Why? Because paper pushing
idiots in the gov't are in charge of IT security, and their credentials for
the job are based entirely on passing certification tests. Their entire
incentivization at their job is to minimize their own workload, rather than
maximize the happiness of their gov't customers.

Do you think anybody who could even gauge whether proper testing was being
performed was in charge of this project at the gov't level? Of course
not. If they had been, it would have been properly tested.

I'm mad as hell about this because I want the ACA to succeed. I don't think
it's ideal, but it's vastly better than the existing system of hospitals as
clinics for the uninsured hordes.

The Federal gov't has an awful hiring process, and until it is fixed, you are
lucky to encounter competent and driven individuals within it. You are foolish
to expect it.

~~~
so_says
_I'm mad as hell about this because I want the ACA to succeed. I don't think
it's ideal, but it's vastly better than the existing system of hospitals as
clinics for the uninsured hordes._

I couldn't agree more. You only get one chance to make a first impression and
they blew it. (sigh)

------
chintan
[http://blog.netizencorp.com/2013/10/06/how-a-scrappy-startup...](http://blog.netizencorp.com/2013/10/06/how-a-scrappy-startup-in-a-dc-garage-revolutionized-healthcare-gov/)

"A scrappy little startup working out of a DC garage that completely
influenced the course of how the web-facing portions of the Affordable Care
Act were to be implemented"

~~~
coldcode
STATIC site content. Not the part being hammered by people signing up. People
keep confusing the two.

~~~
DennisP
The plans should be static site content. You shouldn't have to initiate the
sign-up process just to see what's available.

~~~
shamshiel
Seeing what's available without reliable information on your actual costs
(including subsidies) doesn't accomplish much.

~~~
DennisP
Premiums vary based on region, age, and smoking status. It's not so much
information that you couldn't handle it with static pages.

Subsidy is based on income, and it's simple enough that NPR and others have
already built subsidy calculators. Put that in JavaScript and you've still got
static pages, running calculations client-side.
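
A rough sketch of why this needs no server round trip: the subsidy is a pure
function of a few inputs. It is written in Python only to show the logic; the
same few lines would port directly to browser JavaScript. It assumes a subsidy
of the general form "benchmark premium minus an income-based expected
contribution". The `contribution_rate` parameter is a placeholder, not the
statutory sliding scale, and the function name is made up for illustration.

```python
def monthly_subsidy(annual_income, benchmark_monthly_premium, contribution_rate):
    """Estimate a monthly subsidy with no server-side state.

    contribution_rate is a hypothetical stand-in for the share of income a
    household is expected to pay toward premiums; a real calculator would look
    it up from the published tables.
    """
    expected_monthly_contribution = annual_income * contribution_rate / 12
    # The subsidy covers the gap, and never goes negative.
    return max(0.0, benchmark_monthly_premium - expected_monthly_contribution)
```

Everything the function needs arrives from a form, so the page that hosts it
can be served as a static file.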

------
itbeho
I'm surprised the overall story of this implementation hasn't had more
discussion on HN.

~~~
TallGuyShort
I think it would quickly decay into an unproductive, unpleasant political
debate. I'm quite happy to not be seeing so much of it on here.

~~~
gavinlynch
Weird. That never stops every other topic.

~~~
kbar13
the difference is that there is a government shutdown involved, which is 100%
pure politics, as opposed to topics like the NSA, which aren't necessarily
politics, at least not obviously.

------
bmelton
If anyone has seen my posts in the past, you'll know that I'm not a fan of the
ACA, but disregarding that, let me be the first to say that this _probably_
isn't the fault (or, at least not solely the fault) of those building the web
frontends.

In Maryland, our exchange website is poorly designed and written, just on the
frontend (I obviously can't see the backend), but at the same time, all of the
frontend issues could be fixed with a clever caching scheme.

Where the _real_ bottleneck almost certainly lies is when the system takes
your user submitted data and has to post it into what is surely an old, legacy
federal system so that it can verify your identity. That old legacy system may
have seen upgrades in preparation for this, but probably not -- even so,
there's simply no way to prepare for the onslaught of users accessing
(indirectly) what was certain to have been an isolated, government-only
database in the past.

~~~
mlinksva
> I obviously can't see the backend

You can't see what's running at the moment of course, but why shouldn't you
expect to be able to see the source? Why are new government IT projects closed
source?

~~~
bmelton
Interesting you mention that. I saw this the other day, and was debating
whether or not it was (at least the frontend of) the actual running site:

[https://github.com/CMSgov/healthcare.gov](https://github.com/CMSgov/healthcare.gov)

~~~
integraton
This is only the content site, not the marketplace. See: _"This project does
not include any source code for the Federal Health Insurance Marketplace (the
online systems located under www.healthcare.gov/marketplace)."_

------
saluki
I'm surprised they don't have a separate system for their own staff to use
when doing signups over the phone and in person. From all the articles it
sounds like staff are going through the main healthcare.gov to attempt
signups.

The biggest improvement they could make in the short term is setting up a
light weight version so you can just enter the needed information to view your
plans/options prior to the signup/verification process. Then once you've
decided on a plan go into the signup process.

A lot of traffic is probably just people comparing prices, deductibles, etc
putting unnecessary strain on the signup system.

It seems odd that you can't even view the log in page when you click log in on
the home page.

Definitely looking forward to details about the site, backend, db, hosting,
traffic, etc . . . once everything shakes out.

~~~
nknighthb
> _The biggest improvement they could make in the short term is setting up a
> light weight version so you can just enter the needed information to view
> your plans/options prior to the signup/verification process._

Washington's site did just that. Still blew up. For the first couple of days,
just trying to get available plans didn't actually work.

There aren't that many websites out there that experience the kind of traffic
levels the exchanges have. Those that do have mostly grown (if sometimes quite
quickly) to those levels, not launched with them on day 1.

Competent people with prior experience scaling to these levels are
(over-)employed with high compensation. They didn't work on these sites.

------
protomyth
Setting aside the politics, I really don't think it could have been prevented.

The specs were not exactly realistic on number of users at a time. Heck, even
Apple still gets slammed and they know it's coming.

Plus, the experience just isn't in the DC contracting community. They can do
websites, but not high availability transactional. If we were talking back-
ends, then yes they have high transaction experience, but not with websites.

I will be expecting second (third?) day stories about bad data problems and
failed transactions. This won't be a simple thing.

~~~
RougeFemme
But isn't the DC contracting community involved in other high availability
transactional systems? For example, isn't one of the ACH systems a government
or pseudo-government system?

~~~
protomyth
Oh yes they are, but not anything with a web-front end. They do really well
with mainframe and back-end, just don't expect anything that has to touch a
web stack.

------
jasonpeacock
Regardless of any technological or architectural decisions, simple load
testing before launch would have prevented this.

1\. Load test, discover issues.
2\. Fix issues.
3\. Repeat steps 1-2 until no more issues.
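
The loop above can be sketched in a few lines. This is a toy, not a real load
testing tool: `FakeSignupService` is a hypothetical stand-in for the system
under test, and "fixing" is modeled as simply adding capacity.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeSignupService:
    """Hypothetical system under test: serves at most `capacity` requests at
    once and rejects the rest."""

    def __init__(self, capacity):
        self.capacity = capacity

    def run_burst(self, users):
        """Fire `users` simultaneous requests; return how many were rejected."""
        slots = threading.Semaphore(self.capacity)
        everyone_arrived = threading.Barrier(users)

        def one_request():
            ok = slots.acquire(blocking=False)  # grab a slot or get rejected
            everyone_arrived.wait()             # hold it until all have tried
            if ok:
                slots.release()
            return ok

        with ThreadPoolExecutor(max_workers=users) as pool:
            futures = [pool.submit(one_request) for _ in range(users)]
            return sum(1 for f in futures if not f.result())


def load_test_until_clean(service, target_users, max_rounds=10):
    """Steps 1-3: load test, fix, repeat until a round has no failures."""
    history = []
    for _ in range(max_rounds):
        failures = service.run_burst(target_users)
        history.append(failures)
        if failures == 0:
            break
        service.capacity += failures  # stand-in for whatever the real fix is
    return history
```

With a service that handles 10 concurrent users and a target of 40, the first
round reports 30 failures and the second round is clean. The point is only
that the loop is cheap to run before launch.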

_That_ is the flabbergasting part. Nobody actually tried applying real load
to the system before release? It fell over so easily and quickly that the only
possible conclusions are that no load testing was performed, the results were
ignored, or massively incorrect load numbers were used (all of which are signs
of incompetent management).

~~~
mgkimsal
The official story right now is that they underestimated. Medicaid has had at
max 30k users; they doubled the estimate for hc.gov, but instead of ~50k users
they're seeing 250k. From what I'd read (in a story linked from this hn page)
they did test for load, but nowhere near 200k users.

FWIW, I don't think the issues right now are load so much. I think there are
some insanely bad pieces of logic that are messing things up, and possibly have
corrupted accounts that were created early on. I base this on a few things.

1\. The "username" requirements.

The username is case sensitive. Choose a username that is 6-74 characters long
and must contain a lowercase or capital letter, a number, or one of these
symbols _.@/-

Really? I know some systems use case-sensitive usernames, but given that
you're already forcing some odd characters and whatnot in there, why not
normalize to lowercase? This just feels like it's going to cause more support
problems (mobile safari automatically uppercasing a username that should be
lowercased, etc).

Also... the English description of that username is nowhere near intuitive. "A
lowercase or capital letter". As opposed to what other kind of letter?

2\. Multiple accounts with a single email address.

I've been able to 'successfully' register multiple different username accounts
with the same email address. When I do a password reset based on email (the
couple times it worked) which username was I resetting the password for? I say
'successfully' because no login attempt has ever worked. And now trying to
register yet another username with the same email address doesn't work, but
the error message is so vague ("there was a problem" IIRC) that I can't tell
if that was a factor or not.
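
Both points are cheap to guard against. A minimal sketch, assuming the
published rule means 6-74 characters drawn from letters, digits, and `_.@/-`,
and assuming the intent is one account per email address. Neither assumption
is confirmed, the site's real validation is not public, and the table and
function names here are made up.

```python
import re
import sqlite3

# Assumed reading of the username rule (hypothetical, not the site's validator).
USERNAME_RE = re.compile(r"[a-z0-9_.@/-]{6,74}")


def normalize_username(raw):
    """Trim and lowercase so 'JSmith01' and 'jsmith01' are the same account."""
    name = raw.strip().lower()
    if not USERNAME_RE.fullmatch(name):
        raise ValueError("username must be 6-74 letters, digits, or _.@/-")
    return name


def open_accounts_db():
    """In-memory table where the database itself enforces uniqueness."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        " username TEXT PRIMARY KEY,"
        " email TEXT NOT NULL UNIQUE)"
    )
    return conn


def register(conn, username, email):
    """Return False rather than silently creating a duplicate account."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO accounts (username, email) VALUES (?, ?)",
                (normalize_username(username), email.strip().lower()),
            )
    except sqlite3.IntegrityError:
        return False
    return True
```

With the constraint in the schema, a second registration for the same email
fails at insert time, so a password reset by email can only ever match one
username.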

OK... 2 things. That's about as far as I've been able to see into the system
so far, so that's all I can judge it on. But it seems that they've probably
allowed some logical inconsistencies in my own signups that may be causing
more problems now, and I don't think they are related to load.

Of course, I could be 100% wrong, but it doesn't feel like load is the culprit
right now.

~~~
shamshiel
I immediately get an error page after logging in and trying to view plans.
That could be load but it certainly feels like some kind of logical error to
me.

~~~
mgkimsal
Exactly. They've got the 'waiting room' technique to slow down apps/logins.
The login process, once done, is quite fast in getting me to a blank page.
([https://www.healthcare.gov/marketplace/auth/userprofile](https://www.healthcare.gov/marketplace/auth/userprofile)
page in question)

------
programminggeek
I've dealt with having to scale a site to millions of pageviews and honestly,
if you've never had to deal with a traffic spike, you probably aren't going to
build for the scale issues you will have.

For example, there is a lot of caching you just have to do. Tons of it. Cache
as much as you can. Memcache and Varnish are your friends, use them as much as
you can. Unfortunately, if Healthcare.gov was in a very write-heavy situation,
they are somewhat limited in how much caching will help.
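
As a sketch of the pattern (not of Memcache or Varnish themselves), here is a
read-through cache with a time-to-live. The in-process dict is a stand-in for
an external cache, and `ReadThroughCache` is a made-up name.

```python
import time


class ReadThroughCache:
    """Serve repeated reads from memory; hit the slow loader only on a miss."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader      # the expensive call, e.g. a database query
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested
        self._entries = {}         # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]          # fresh entry: no load on the backend
        value = self._loader(key)  # miss or expired: load once and remember
        self._entries[key] = (now + self._ttl, value)
        return value
```

This only helps reads. A signup is a write and still has to reach the
database, which is the limitation mentioned above.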

One thing they could have done that would have saved A TON of load is not
require users to sign up before giving them a list of available plans. That
whole part didn't need to be database driven at all. The data could have been
stored in redis and they could have used javascript to filter it based on a
form. That would have been ridiculously fast. They also could have prerendered
all the plan list possibilities and stored those in varnish. That also would
have been ridiculously fast. My guess is millions of people just wanted to
check prices and eliminating the database load for those users would have
probably kept things running fast and smooth.
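
A sketch of the prerendering idea, with made-up inputs: the regions, age
bands, URL layout, and premium numbers below are all hypothetical
placeholders. The point is only that the number of combinations is small
enough to enumerate once and serve as static files.

```python
import json
from itertools import product

# Hypothetical filter values; the real lists would come from the plan data.
REGIONS = ["region-1", "region-2"]
AGE_BANDS = ["under-30", "30-49", "50-64"]
SMOKER = [False, True]


def lookup_plans(region, age_band, smoker):
    """Stand-in for the one-time database query (premiums are invented)."""
    base = 200 + 50 * AGE_BANDS.index(age_band)
    premium = base * (1.5 if smoker else 1.0)
    return [{"plan": region + "-bronze", "monthly_premium": premium}]


def prerender_plan_pages(lookup=lookup_plans):
    """Build every plan listing once; return {url_path: json_body}."""
    pages = {}
    for region, age_band, smoker in product(REGIONS, AGE_BANDS, SMOKER):
        status = "smoker" if smoker else "non-smoker"
        path = "/plans/{}/{}/{}.json".format(region, age_band, status)
        pages[path] = json.dumps(lookup(region, age_band, smoker))
    return pages
```

Here 2 regions x 3 age bands x 2 smoking statuses gives 12 pages. Real numbers
would be larger, but still a fixed set that a cache or plain file server can
hand out without touching the database.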

Slow DB queries are the enemy and you don't realize how bad they are until you
are at scale. Sure, it only takes a few seconds on your local machine, but
multiply that times thousands of concurrent users and your DB gets swamped. If
you are using an ORM, it is MUCH harder to track down where in your code that
3-way join that scans every record is happening. Ideally you'd be using
straight SQL and maybe use comments to tag a query. Also tools like newrelic
might help if only because many databases don't offer great visibility of
performance data.
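
A small sketch of the comment-tagging trick, using SQLite only because it
ships with Python. The `tagged` helper, the table, and the tag name are
hypothetical.

```python
import sqlite3


def tagged(sql, tag):
    """Prefix a query with a comment naming the code path that issued it.

    The tag must not contain '*/', which would end the comment early.
    """
    return "/* {} */ {}".format(tag, sql)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signups (id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO signups (state) VALUES (?)", [("MD",), ("WA",), ("MD",)]
)

# The comment travels with the statement text, so a slow query that turns up
# in the database's logs points back to the code that sent it.
query = tagged("SELECT COUNT(*) FROM signups WHERE state = ?",
               "plans.count_by_state")
(md_count,) = conn.execute(query, ("MD",)).fetchone()
```

The database ignores the comment when executing the query, so the tag costs
nothing at run time.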

Sharding your database is something that is probably possible, and in the case
of healthcare.gov, they probably could have had totally separate
infrastructure on a per-state basis that would have made scaling a lot easier
than putting everything on the same database. Also, put reporting and things
that aren't mission critical on a slave database. The last thing you want is a
reporting job bringing down the live site in the background.
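
A sketch of what per-state routing might look like. The host names are
invented, and nothing here is based on how healthcare.gov was actually
deployed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    primary: str  # takes signups (writes)
    replica: str  # takes reporting jobs (reads only)


# Hypothetical hosts, one shard per state.
SHARDS = {
    "MD": Shard("db-md-primary.internal", "db-md-replica.internal"),
    "WA": Shard("db-wa-primary.internal", "db-wa-replica.internal"),
}


def host_for(state, reporting=False):
    """Pick the database host for a request based on state and workload."""
    shard = SHARDS[state.upper()]
    return shard.replica if reporting else shard.primary
```

A heavy report for Maryland then lands on `db-md-replica.internal` and cannot
slow down signups in Maryland or anywhere else.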

Getting good hardware with fast IO is going to save a lot of developer time
required to scale things. Using fast SSDs is probably the easiest win to
speed up your database. Developer time costs a lot more than hardware and
giving yourself cheap headroom up front gives you breathing room on launch.

Performance testing is also something worth doing, but the tricky part is
until you roll out, it is hard to know exactly where the hotspots are going to
be. In this case, new user signup would be the obvious place to test, so they
probably should have tested up to the limits of their servers and tried to
extrapolate an expected number of users and maybe increased that by 1.5x or
something to have some leeway.
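
The arithmetic is simple enough to write down. In this sketch the 1.5x
headroom comes from the paragraph above, the function names are made up, and
the per-server figure in the usage note is an arbitrary example.

```python
import math


def planned_capacity(expected_peak_users, headroom=1.5):
    """Concurrent users to provision for, with leeway over the estimate."""
    return math.ceil(expected_peak_users * headroom)


def servers_needed(target_users, users_per_server):
    """Servers required, given a per-server limit measured in load tests."""
    return math.ceil(target_users / users_per_server)
```

For an estimate of 50,000 concurrent users this gives a target of 75,000, or
38 servers if testing showed each one topping out at 2,000 users. Note that
75,000 would still fall well short of the roughly 250k figure reported
elsewhere in this thread, so the multiplier only helps if the estimate itself
is in the right ballpark.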

On a rollout where you don't know what you are getting into user wise, being
on a cloud where you can scale out fast as demand requires is something worth
doing. They could have saved a lot of bad press and headache by being on the
cloud initially and migrating to less hardware after the initial peak died
down.

There are a ton of little things like that you have to think about if you are
dealing with massive scale. I don't know if the engineers building
healthcare.gov had ever dealt with something like this before, but I'm sure
they're learning these lessons now.

~~~
jrochkind1
> if you've never had to deal with a traffic spike, you probably aren't going
> to build for the scale issues you will have.

One question would be whether they hired people/firms who had had to deal with
that level of scale before.

I'd assume not.

But apparently they hired someone who didn't even know enough to know that
they were not qualified to do it.

~~~
dragonwriter
One side effect of the complexity of government contracting processes
(designed, in principle, to ensure that the government _doesn't_ get taken
advantage of by either internal or external actors) is that the people who get
awarded government contracts are often the people with the most skill in
negotiating the government contracting process, not the people with the most
skill in the problem domain.

~~~
snowwrestler
We know who built Healthcare.gov; it was Development Seed. There were a bunch
of stories about it earlier this year. Development Seed also built and runs
MapBox, so they should have some idea how to run a high volume web service.

~~~
ericras
And the front end built by Development Seed has had no problems. It's been the
marketplace built by another contractor.

------
Baustin
This post seems to have gotten buried very quickly (jumped from top 50 to #172
in a matter of minutes). Anyone have a reason why?

