

Heroku's AWS outage post-mortem - mileszs
http://status.heroku.com/incident/151

======
chrishenn
_Our monitoring systems picked up the problems right away. The on-call
engineer quickly determined the magnitude of the problem and woke up the on-
call Incident Commander. The IC contacted AWS, and began waking Heroku
engineers to work on the problem. Once it became clear that this was going to
be a lengthy outage, the Ops team instituted an emergency incident commander
rotation of 8 hours per shift, keeping a fresh mind in charge of the situation
at all times. Our support, data, and other engineering teams also worked around
the clock._

The system they are using (IC, ops, engineer teams, operational periods) is
extremely similar to the Incident Command System. The ICS was developed about
40 years ago for fighting wildfires, but now most government agencies use it
to manage any type of incident.

I've experienced it firsthand and can say it works very well, but I have
never seen it used in this context. The great thing about it is its
expandability: it will work for teams of nearly any size. I'd be interested
to see whether any other technology companies/backend teams are using it.

<http://en.wikipedia.org/wiki/Incident_command_system>

------
ekidd
Kudos to Heroku for taking full responsibility, and for planning to engineer
around these kinds of Amazon problems in the future.

In particular, I'm delighted to hear that they plan to perform continuous
backups on their shared databases:

 _3) CONTINUOUS DATABASE BACKUPS FOR ALL. One reason why we were able to fix
the dedicated databases quicker has to do with the way that we do backups on
them. In the new Heroku PostgreSQL service, we have a continuous backup
mechanism that allows for automated recovery of databases... We are in the
process of rolling out this updated backup system to all of our shared
database servers; it’s already running on some of them and we are aiming to
have it deployed to the remainder of our fleet in the next two weeks._

Combined with multi-region support, this should make Heroku far more resilient
in the future.

~~~
callmeed
Kudos? For nothing but the words "heroku takes 100% of the responsibility
..."?

Sorry, but that's not cutting it for me right now. I pay Heroku $250 a month
and I was down for 60 hours (not 16). Our app isn't even out of private beta
so I fully expected to be paying Heroku $2-3K/month by the end of the year.
Now, I'm not sure I'll stay.

If you're really taking 100% responsibility, then consider pro-rating the
bills of affected paying customers (based on the downtime).

~~~
toast76
Your biggest concern is you want a $20 refund?

They can keep their $20 in my view, as long as they ensure it never happens
again.

~~~
iamjustlooking
The company where we host most of our servers has an SLA that starts paying a
10% monthly refund per 10 minutes of downtime that is their fault. I can't
believe you guys will take downtime of this magnitude and still stick around.

Does Heroku even have an SLA? I can't find it. If they did, maybe they would
have been more proactive about preventing this kind of problem.

~~~
mrkurt
"Downtime that is their fault" is kind of a giant caveat, no? Is it their
fault if they lose transit or power, for instance? With that level of refunds
I suspect "their fault" basically only covers one of them accidentally running
over a server with their car. The problem is, that guaranty isn't getting
anyone anything of value.

I suspect Heroku has SLAs for their bigger customers, but don't really know
for sure. I do think you're overestimating what kind of incentive an SLA is
for a provider, though. SLAs are basically an on-paper way of showing your
commitment to keeping things running and responding to problems. If you don't
have that commitment already, the paper isn't going to change anything.

Pointy haired bosses and lawyers love SLAs, but smart people who shop for this
stuff don't care all that much about them. An SLA isn't going to convince me
to go with one provider over another, nor is lack of an SLA going to make me
avoid a provider I already like and respect.

~~~
iamjustlooking
I don't know about other people's SLAs, but seeing as you're hinging on my
simplified description: my SLA provides 100% uninterrupted transit to the
Internet and 100% uninterrupted electricity, so if the power goes out it is
still 'their fault', but if I rm -rf / it is my fault.

I am not a lawyer or a PHB, but I run a small business with customers who pay
for a service, so if that service goes down I look bad and they are upset.

~~~
mrkurt
Oh, well, 100% uptime for power and bandwidth is pretty standard then. I
figured you were comparing an SLA for the same type of services you'd get from
Heroku and/or EC2.

------
adriand
I'm very impressed by how they take responsibility for this, in their words:
"HEROKU TAKES 100% OF THE RESPONSIBILITY FOR THE DOWNTIME AFFECTING OUR
CUSTOMERS LAST WEEK."

It would be easy, tempting, and heck, even reasonable to assign at least a
portion of the blame to Amazon. Their approach is interesting because their
customers already know that, but are likely to appreciate their forthright
acceptance of responsibility.

It's a good lesson. If I'm being totally honest I'd have to admit that, as a
developer, I sometimes blame external services or events for things that I
have at least partial control over. Perhaps I should adopt Heroku's approach
instead.

~~~
jarin
It's sort of counter-intuitive, but taking responsibility for something (even
if it's not directly your fault) often has the effect of deflecting some of
the anger from your customers/clients/boss/etc.

Personally, I prefer to just get the blame part out of the way by taking
responsibility and concentrate on the important things: fixing the problem and
making sure it doesn't happen again.

I think that deep down, people aren't that concerned with whose fault it was.
They just want to know that someone is going to fix it.

~~~
ghshephard
The reason you don't want to take responsibility is that liability comes
along with it. If Heroku took the position that the AWS outage was force
majeure, then their liability for recompense to their customers would have
been minimized.

By suggesting they take responsibility, they also are in a position where they
have to make good for all of the downtime their customers experienced.

Short term, that will be an expensive decision. Long term, I think it's the
right thing to do. It certainly builds up my confidence level in them.

~~~
kowsik
As a PaaS vendor, they are _supposed_ to abstract away IaaS failures. And
they were _not_ supposed to use a single region to host all their apps. I love
Heroku and will continue to use them as long as I get the option to add
affinity to my dynos and workers so they spread across multiple regions of my
choice. Coupled with anycast DNS support, this will be a very compelling
offering, if they can pull it off. During the outage, all of our scale engines
(<http://blitz.io>) and our CouchDB cluster across the other AWS regions held
up, but since the web tier was down, the whole app went down.

------
watchandwait
The AWS outage is definitely not over. Apparently RDS is built on EBS, and not
all instances have been restored; I can tell you that firsthand.

~~~
watchandwait
UPDATE: we were fully restored after midnight last night. It is a very happy
feeling!

------
waxman
Thank you for taking full responsibility.

Everyone makes mistakes, so what matters is how you deal with them. This was
the right way to respond. Thanks.

------
markbao
I wish Amazon was as good at communication and accountability as Heroku is.

------
chrisbaglieri
"Block storage is not a cloud-friendly technology".

Based on every post-mortem I've read thus far, it's clear that how AWS and its
customers approach EBS will change.

------
bdb
Where is Amazon's?

------
greattypo
It's impressive that they're taking full responsibility, but I'm very
surprised there's no mention of refunds.

~~~
JonWood
Given that Heroku charges based on the time your application is up I wouldn't
be surprised if everyone just gets a bill which doesn't include the time their
sites were offline.

------
dpcan
What the hell? Why is everyone taking responsibility and giving Amazon a free
ride? I'm a firm believer that only victims make excuses, and it's admirable
to take responsibility, and maybe they should have had more redundancy in
place, but the way AWS has been advertised, most of us felt this kind of thing
should never happen even without a 100% uptime guarantee.

So, take 100% of the responsibility, but I wouldn't think any less of Heroku
if they only took 50%.

~~~
dholowiski
I pay Heroku to host my Rails apps, not Amazon. I don't give a flying f__k
what kind of back-end infrastructure they use, as long as people can get to my
app.

It is surprising they don't talk about refunds for the downtime, if they are
taking responsibility. I'd imagine we will see this soon?

~~~
dpcan
So you are saying that if they take 100% of the responsibility, they assume
100% of the liability? This is exactly why I'm suggesting they may have put
their foot in their mouth. SOME of the fault reasonably lies with Amazon, in
my opinion, and I personally would not have cared if they took 50%. That's
all, really.

~~~
RyanKearney
What about this example? You have Acme Corp Datacenters, who sell dedicated
servers to their customers. If Acme Corp has a network outage because their
single Comcast connection went down due to Comcast having some routing errors,
the customers who are affected go to Acme Corp. It isn't the customers' fault
that Acme Corp wasn't prepared to deal with a downed connection and set up a
redundant network.

In this example, think of Amazon as Comcast and Acme Corp as Heroku. Heroku
wasn't prepared to handle this type of failure, so they're at fault.

~~~
tzs
By that argument, the customers of Heroku who weren't prepared to handle
Heroku's failure were at fault. They should have had alternate rails hosting
lined up.

~~~
RyanKearney
Well, it depends on what Heroku's SLA was, if any. If Heroku stated a 100%
uptime guarantee, then Heroku would be to blame for not living up to that 100%
uptime. If Heroku said, "hey, listen, we can't guarantee any amount of uptime,
so be prepared," and someone were to host "mission critical" information on
Heroku, then yeah, it would be the customer's fault.

~~~
railsguy1
Heroku has no SLA.

------
chubs
This is why I love hosting on Heroku: they'll work their butts off to get it
fixed when it's down, and I don't have to lift a finger. However, EBS has long
been known to be a turd; it's a pity they relied on it. Plus, if they had a
way to bring it back up in a different region (e.g. the European AWS
infrastructure) at the flick of a switch, that'd make me less nervous...

------
AffableSpatula
I don't think this is particularly 'honorable' or anything like that; it's
the only sensible stance for them to take.

Let's be realistic about this: for most people using Heroku, the alternative
would have been bare EC2, which could easily have suffered the same fate as on
Heroku.

Everyone should feel positive that they got to spend ~60 hours just sitting
around moaning about being let down, instead of having to sweat their nuts off
attempting to rehabilitate crazy, suicidal infrastructure.

Even taking this downtime into account, Heroku is still cost-effective for me
in a lot of cases.

------
metageek
> _It's a big project, and it will inescapably require pushing more
> configuration options out to users (for example, pointing your DNS at a
> router chosen by geographic homing_

Heroku should save customers this pain by setting up anycast:

[https://secure.wikimedia.org/wikipedia/en/wiki/Anycast#Domai...](https://secure.wikimedia.org/wikipedia/en/wiki/Anycast#Domain_Name_System)

------
awicklander
They gloss over their biggest failure; they weren't communicating or
interacting with their customers _at all_.

* <http://twitter.com/#!/heroku>
* <http://twitter.com/#!/herokustatus>

~~~
daniel02216
They show a history of updates to their status blog and to the herokustatus
Twitter account since April 21; what do you mean by 'they weren't
communicating'?

~~~
runningdogx
From 9:07 to 20:43, the status updates were generic and not very helpful in
answering two questions customers want to know:

1. What exactly is going on?
2. When will it be fixed?

In the middle of a crisis, saying "we're aware of the problem and we're
working hard to fix it" for hours does not really count as communication. It
increases customer aggravation rather than decreasing it. Customers want to
know the answers to the above two questions. They don't care that you know
about the problem and that you're working on it, unless you're not doing those
two things, in which case they will be (and should be) furious; those two
things are expected.

Barring the ability to tell your customers "we will be back up at X:00", I
think the best approach is to share as much information as you can without
getting into proprietary information. That's why I think GP considered their
communication a failure. That's why I consider their communication a failure,
although I've seen this pattern enough from different companies that I don't
hold it against Heroku as long as they learn from it.

~~~
ceejayoz
Both questions were unanswerable. Even Amazon's estimates were wildly off, and
they can actually look at the infrastructure.

Heroku's only way to answer your questions would have been to lie.

------
oomkiller
I'd really love to know some details on the continuous backup stuff. Sounds
cool.

~~~
bbatsell
Not sure why it was dead-ed (possibly a double-post), but here's the answer
from an author of it in case you don't have showdead on:

fdr 1 hour ago [dead] | on: Heroku's AWS outage post-mortem

The mechanism is PostgreSQL continuous archiving.

<http://github.com/heroku/wal-e>

This tool is still quite nascent. It received quite a trial by fire, having
not (before this point) been revealed on a wide scale as a value-added feature
of the service.

~~~
joevandyk
I started using WAL-E a couple days ago for one of my own sites.

WAL-E is a program that PostgreSQL can use to push database changes to S3.

Depending on how you configure PostgreSQL checkpoints, the most data you'd
lose is somewhere between a couple of seconds and a minute. I'd assume Heroku
would make it a couple of seconds. The downside to more frequent backups is
more storage space (each checkpoint (WAL archive) stored on S3 is a minimum of
75k or so, even if there weren't any changes).
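
For anyone curious how this wires into Postgres, here's a rough sketch of the
archiving side. The settings and paths are illustrative guesses based on the
wal-e README, not Heroku's actual configuration:

    # postgresql.conf (9.0-era settings; adjust for your version)
    wal_level = archive
    archive_mode = on
    # hand each completed 16MB segment to WAL-E, which compresses it and
    # pushes it to S3; /etc/wal-e.d/env holds AWS credentials read via envdir
    archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'
    # force a segment switch at least once a minute to bound temporal data
    # loss, at the cost of pushing mostly-empty segments
    archive_timeout = 60

    # plus a periodic base backup, e.g. nightly from cron (example path):
    # envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/9.0/main

WAL-E compresses each segment before uploading, which is presumably where the
~75k floor per archive mentioned above comes from.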

~~~
fdr
A couple of seconds is very aggressive. There is a window for data loss when
the segment is incomplete. The calculus of what this means in a real system is
somewhat complicated, and sketched below.

Although people like to measure the data loss temporally, it'd be more
precise, for the system-minded, to say that it's 16MB of transaction log loss
should the drive die between COMMIT and the WAL-E send. Thus, temporally,
there is a plateauing effect: the more data you push, up to a point, the less
you will lose temporally, because Postgres swaps segments more quickly. If you
push too much, backlogs can occur. If you measure in terms of xact bytes lost,
it's simple: a maximum of 16MB to (32-epsilon)MB, _assuming a trivial backlog
size_, losable between COMMIT; and the archiver send.

A word on backlogs: my experience would suggest you need to be doing very
demanding things (bulk loads, large in-server denormalizations or statement
executions) to produce backlog, given the throughput one sees on EC2. It's
easy to write a monitoring query to watch for this using pg_ls_dir and regular
expressions or similar; a rough example follows below. Nominal operation
doesn't often see backlog; the pipes to S3 are reasonably fat. I hope to more
carefully document ways to limit these backlogs via parallel execution and
adaptive throttling of the block device I/O for the WAL writing. Another idea
I had was to back WAL writes in memory in addition to on disk (RAID-1), so
WAL-E would have a chance to send the last few WAL segments, if any, in the
event of sudden backing block device failure.
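
To make that concrete, one crude version of such a monitoring query (my own
sketch; it assumes the 9.0-era pg_xlog layout and superuser rights for
pg_ls_dir) just counts the .ready marker files, i.e. segments written but not
yet handed off by the archiver:

    -- number of WAL segments waiting to be archived (the backlog)
    SELECT count(*) AS archive_backlog
      FROM pg_ls_dir('pg_xlog/archive_status') AS f
     WHERE f LIKE '%.ready';

If that number grows steadily instead of hovering near zero, the archiver (and
therefore WAL-E) is falling behind the rate at which Postgres produces WAL.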

A dead WAL drive is interesting because it will prevent COMMIT; from executing
successfully, hence the amount of data loss is reduced (because availability
comes to a halt immediately, even if the WAL segment is incomplete). Whereas
if a Postgres cluster disk fails, new transactions might COMMIT (the WAL
continues to be written and no fsync that will block has necessarily been
issued), but you have a good chance of grabbing those segments anyway as
database activity halts, since WAL-E can continue to execute even in the
presence of a failed block device serving the Postgres cluster directory. A
dead WAL drive will nominally allow non-writing SELECT statements to execute,
so availability is generally lost to new writes only, although this may change
on account of crash-safe hint bits (I'm not terribly familiar with the latest
thinking on that design, but I imagine it may have to generate WAL when doing
read-only SELECTs).

Finally, interesting things are possible with synchronous replication and
tools like pg_streamrecv in 9.1, even if pg_streamrecv runs on the same box: I
don't see an obvious reason why it would not be possible to allow for user-
transaction-controlled durability of at least two levels: committed to EBS,
and committed to S3. S3 could effectively act as a synchronous replication
partner.

Fundamentally, putting aside the small archiver asynchronism, EBS with WAL-E
is basically a cache of sorts to speed up recovery. The backing store is
really, in some respect, S3.

~~~
joevandyk
I was thinking that setting archive_timeout to a low number would limit the
temporal data loss.

~~~
fdr
You are right, especially if you aren't pushing much data. Your restore times
would be rather long, though. I hope to implement a prefetching strategy to
make this much, much faster, so one could do that if they absolutely wished.

------
chrisbaglieri
I wish more companies (hell, people) were as forthright, pragmatic, and
sensible as the Heroku gang. Their breakdown of and response to the outage is
_exactly_ what I, as a paying customer, want to hear.

Kudos!

------
mtw
What about also spreading across multiple providers (i.e. also using Rackspace
Cloud)? They'd be less dependent on Amazon issues.

------
trezor
And now reddit is down again (posting/submitting is impossible). Probably yet
another Amazon issue.

In all fairness, I've read that the reddit devs have made lots of boneheaded
mistakes in their general infrastructure design, but it still seems Amazon is
not a very reliable platform to build your stuff on. Platforms built on top of
Amazon's are even less so.

~~~
showerst
I think boneheaded is a strong word. They're solving a problem that very few
sites have to solve (huge traffic with low cacheability) with vastly fewer
resources than the others who do solve it have (FB, Twitter, etc.).

AMZN in general is a pretty solid 'platform' (especially if you're not using
EBS), but because this whole 'cloud' thing is still partially uncharted
territory, there are still holes, and you can't treat it like a normal web
host.

