
Tuesday's Heroku outage post-mortem - bscofield
http://blog.heroku.com/archives/2010/10/27/tuesday_postmortem/
======
erikpukinskis
I think it's fascinating that a single engineer, who they had on staff, was
able to write a patch in one night that improved the performance of their
messaging system 5x...

... _and hadn't already done it_.

This isn't meant as a slight against Heroku at all. They've got an incredible
team of engineers. But imagine if Ricardo had said, "hey, I could write a
patch today that would speed up our messaging system 5x, should I do it?" The
rest of the team would've said "OF COURSE!"

It reminds me of what happens to your brain when you launch a site. Even
before you get feedback, somehow _the knowledge that people can use it_
drastically changes your motivation system. Things that before seemed
important are obviously not. Other things that were invisible before become
the singular focus of your resources.

Maybe we should do fire drills...

* Your requests are suddenly taking 100x as long to complete. Go!

* Your "runway" disappears due to an accounting error and you have 7 days to turn a profit. Go!

* 50% of people visiting the site have no idea how to use it. Go!

How could we achieve the focus and clarity that a crisis brings, without
having the crisis?
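
If it helps to make the first drill concrete: you could even script it. A
minimal sketch, assuming a Python WSGI app; the FIRE_DRILL flag and the
multiplier are invented for illustration, not any real tool:

    import os
    import time

    def latency_drill(app, factor=100):
        """Hypothetical WSGI middleware: when FIRE_DRILL is set in the
        environment, pad every response to roughly factor x its normal
        time."""
        def middleware(environ, start_response):
            start = time.monotonic()
            response = app(environ, start_response)
            if os.environ.get("FIRE_DRILL"):
                elapsed = time.monotonic() - start
                time.sleep(elapsed * (factor - 1))  # pad to ~factor x
            return response
        return middleware

Flip the flag on in staging, watch what the team does, then flip it off.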

~~~
sbov
I understand the gist of your post, but I don't quite understand where you get
1 night from. From what I can tell the 1 night thing was to fix a bug, not
improve performance.

Regarding the 5x increase:

> One of our operations engineers, Ricardo Chimal, Jr., has been working for
> some time on improving the way we target messages between components of our
> platform. We completed internal testing of these changes yesterday and they
> were deployed to our production cloud last night at 19:00 PDT (02:00 UTC).

~~~
erikpukinskis
Ah, you're right. I was reading too quickly. :) The general point stands, but
yeah. Not a good example.

------
danilocampos
I'd like to point out the lesson that other industries can learn from IT
infrastructure companies.

Heroku sells a technical product to a technical audience. They're foundational
to their clients' products. So when something goes down, there's only one
option: explain, in excruciating detail, exactly what happened, why it
happened, and how it's going to be fixed in the future.

Why? Because their clients can smell bullshit better than a purebred
bloodhound. Too much bullshit means it's time to move on.

Beyond being the right thing to do, being accountable is essential to trust.
When you fuck up, it will piss people off. That's just life – everyone makes
mistakes. So you need to be the guy people can point to and say "Okay, there
was a fuck up, it was bad, but look at how hard these guys worked to fix it.
Check out their plans to prevent it in the future."

Luckily, the incentives are aligned here to make this mostly non-negotiable.
When medical malpractice, a financial meltdown, or an oil spill is in play,
the cover-your-ass impulses are much more compelling.

Even in those cases though, I insist we need to encourage a culture where
accountability and transparency are rewarded. Because, for me, accountable
guys are the kind of people I want to do business with.

I dunno much about scaling a Rails server, but for now, at least, I know the
Heroku guys are the sort of people I'd trust.

~~~
moe
_there's only one option: explain, in excruciating detail, exactly what
happened, why it happened, and how it's going to be fixed in the future. Why?
Because their clients can smell bullshit better than a purebred bloodhound.
Too much bullshit means it's time to move on._

Okay. I feel a bit sorry for bashing heroku here, but I'll bite.

If I were a Heroku customer, I'd feel, ahem, a bit washed by their idea of
"excruciating detail".

So their "internal messaging system" triggered a bug in their "distributed
routing mesh". And they applied a "hot patch".

Great. As far as I am concerned, they might as well have written that their
flux-compensator overheated because the pixie-dust exhaust got clogged with
rogue bogomips.

I applaud their willingness to talk to their customers at all. But please...
either explain what was going on in a meaningful way - or just leave it at "we
screwed up and promise to do our best to prevent it from happening again".

~~~
epochwolf
> But please... either explain what was going on in a meaningful way

Some of us like a technical breakdown and the warm, fuzzy reassurance it
brings. If a few people get confused after the first paragraph, that's less
harmful than appearing to bullshit technical users.

~~~
endlessvoid94
He's not saying it was too technical, he's saying it wasn't technical enough.

~~~
moe
Yes, sorry if that was unclear.

In less snarky words: even Facebook told us quite clearly _how_ they screwed
up the other day (the config management issue). In contrast, this Heroku
article was disappointing.

------
absconditus
Will someone at Heroku please describe your QA process?

~~~
JoachimSchipper
I don't care for Heroku, but this is over the top: distributed systems are
_complicated_, to build and especially to test. Even Google gets it wrong:
<http://gmailblog.blogspot.com/2009/09/more-on-todays-gmail-issue.html> is
not entirely dissimilar.

~~~
absconditus
How is it over the top? I am genuinely curious about their QA process. I am
not judging them.

------
smackfu
Are they going to remove "rock-solid" from their front page copy?

~~~
chanks
Heroku had 45 minutes of downtime in August, 28 in September, and 45 in
October so far. (Source:
[http://groups.google.com/group/heroku/browse_thread/thread/f...](http://groups.google.com/group/heroku/browse_thread/thread/fc45c0b5d2a363e))

That's 99.90%, 99.94%, and 99.88% (for the month so far), or simply 99.91% for
the entire period.

So, what would you consider "rock-solid"? Personally, I'll echo what was said
in the other thread - 99.91% is much better than what I could accomplish on my
own, so I'll continue to trust my business to them.
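
For anyone who wants to check the arithmetic, the numbers work out (assuming
October is counted through the 27th):

    # Reported downtime, in minutes, against total minutes per month.
    downtime  = {"August": 45, "September": 28, "October": 45}
    month_len = {"August": 31 * 1440, "September": 30 * 1440,
                 "October": 27 * 1440}

    for month, down in downtime.items():
        print(f"{month}: {100 * (1 - down / month_len[month]):.2f}%")
    # August: 99.90%, September: 99.94%, October: 99.88%

    overall = 100 * (1 - sum(downtime.values()) / sum(month_len.values()))
    print(f"Overall: {overall:.2f}%")  # 99.91%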

~~~
qeorge
They had over an hour of downtime yesterday alone, one 8m outage and one 1h15m
outage. (Source: <http://status.heroku.com>)

Regardless, 99.9% is really not special or even acceptable from a high-dollar
host like Heroku. Cheap shared hosts like HostGator can give you 99.9%.

I'm pulling for them, but they've got some work to do.

~~~
joevandyk
"cheap shared hosts" don't provide you with the same sort of infrastructure
that heroku gives you.

~~~
qeorge
Of course; my point was that 99.9% is expected on even the cheapest setups.
Premium hosting like Heroku should at least be able to deliver uptime
comparable to a cheap shared host like HostGator.

However, Heroku appears to have no SLA, so it's a moot point anyway.

------
GICodeWarrior
Does Heroku use anything like 5 Whys to incrementally address organizational-
type causes?

------
random42
_After isolating the bug, we attempted to roll back to a previous version of
the routing mesh code. While the rollback solved the initial problem, there
was an unexpected incompatibility between the routing mesh and our caching
service._

To me, it seems like they just needed to apply the "hot patch"; instead they
panicked(?) and did a lot of unnecessary version-control gymnastics, which
delayed the bug fix.

~~~
danilocampos
I've written mostly client code, and watched server action from the
sidelines, but jumping straight to the hotfix only seems obvious to me in
retrospect. Rolling back to a known-good state is the safe approach – it just
didn't work in this case because of a surprise incompatibility with another
system.

If you jump straight to the hotfix, you're basically enlisting the entirety of
your userbase to join you in a round of QA, which could be sub-optimal if your
hotfix ends up causing some other unintended consequence.

Right?
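
To make that concrete: the usual middle ground is a staged rollout, shipping
the hotfix to a small slice of traffic before everyone gets it. A toy sketch
(the bucketing scheme and the 5% threshold are my own invention, not anything
Heroku described):

    import hashlib

    def in_canary(user_id: str, percent: float) -> bool:
        """Deterministically bucket a user into the canary group."""
        bucket = hashlib.sha256(user_id.encode()).digest()[0] / 255 * 100
        return bucket < percent

    # Only ~5% of users exercise the hotfixed path; everyone else stays
    # on the known-good code until the fix proves itself.
    for user in ("alice", "bob", "carol"):
        print(user, "->", "hotfix" if in_canary(user, 5.0) else "stable")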

~~~
random42
_Rolling back to a known-good state is the safe approach_

Absolutely.

However, the rollback should be atomic, which means all pieces of the
infrastructure/code should be rolled back to a known-good state.

When I said "gymnastics", I meant rolling back one piece of code, only to
find an incompatibility with other pieces.
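
A minimal sketch of the kind of pre-rollback check I mean (the component
names, versions, and compatibility matrix are all invented for illustration):

    # Which neighbour versions each (component, version) is known to
    # work with.
    COMPATIBLE = {
        ("routing-mesh", "1.8"): {"caching": {"2.3", "2.4"},
                                  "messaging": {"0.9"}},
        ("routing-mesh", "1.9"): {"caching": {"2.4"},
                                  "messaging": {"0.9", "1.0"}},
    }

    def safe_to_roll_back(component, target_version, deployed):
        """True only if every deployed neighbour is known-compatible
        with the version we want to roll back to."""
        required = COMPATIBLE.get((component, target_version), {})
        return all(deployed.get(name) in versions
                   for name, versions in required.items())

    deployed = {"caching": "2.4", "messaging": "1.0"}
    # messaging 1.0 was never tested with routing-mesh 1.8, so don't
    # roll back blind:
    print(safe_to_roll_back("routing-mesh", "1.8", deployed))  # False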

I do not intend to judge them, knowing it's difficult not to panic in an
emergency, but not working the effects out on paper (or not knowing which
versions of software components are inter-compatible) before actually
touching the code looks pretty novice to me, for a company of Heroku's scale.

I really hope this is not an unfair criticism. (Handling emergencies is
difficult.)

~~~
erikpukinskis
The trouble is that they have messaging servers, a routing mesh, and caching
servers, which are all loosely coupled and deployed on separate boxes. They
could take down the dozens of pieces of their infrastructure and roll them
all back to wherever they were on that previous date, but this is not better,
for several reasons:

1) It would take much longer than just rebooting the isolated service. Can
you imagine Google shutting down every one of their millions of boxes,
rolling them back to a previous state, and spinning them up again?

2) They'd still be at risk of incompatibilities with their databases, etc.
The problem with unexpected incompatibilities is that they're unexpected.

~~~
random42
All I am trying to say is that it only makes sense to do an effect analysis
of your changes before actually making them.

I fail to see why they need to revert a piece of code, and then realize,
_OMG... this version of the code does not fit well with the rest of the
architecture, now change it back._

1.) I do not expect Google (or even a small shop, like my place) to revert
any piece of code that is not affected.

I do, HOWEVER, expect to know EXACTLY what changes I am making, and what to
EXPECT after the changes.

(It should not be black magic, for historical code.)

2.) I fail to understand why this should be the case for older code. I can
understand some tricky/edge/minor cases, but whether the architecture,
database, etc. (major compatibility) is compatible or not should be possible
to determine BEFORE making the changes.

I hope I am not over-trivializing the issue, but I still cannot get my head
around the approach.

------
dlevine
I think it's really cool that Heroku is so transparent about their outages. A
lot of companies try to cover them up or blame them on someone else.

It's refreshing to see a company that not only acknowledges their outages, but
even has a list of all past issues and outages. This transparency can only
help them to become better in the future.

