The system they are using (IC, ops, engineer teams, operational periods) is extremely similar to the Incident Command System. The ICS was developed about 40 years ago for fighting wildfires, but now most government agencies use it to manage any type of incident.
I've experienced it firsthand and can say it works very well, but I have never seen it used in this context. The great thing about it is its expandability---it will work for teams of nearly any size. I'd be interested in seeing if any other technology companies/backend teams are using it.
In particular, I'm delighted to hear that they plan to perform continuous backups on their shared databases:
3) CONTINUOUS DATABASE BACKUPS FOR ALL. One reason why we were able to fix the dedicated databases quicker has to do with the way that we do backups on them. In the new Heroku PostgreSQL service, we have a continuous backup mechanism that allows for automated recovery of databases... We are in the process of rolling out this updated backup system to all of our shared database servers; it’s already running on some of them and we are aiming to have it deployed to the remainder of our fleet in the next two weeks.
Combined with multi-region support, this should make Heroku far more resilient in the future.
Sorry, but that's not cutting it for me right now. I pay Heroku $250 a month and I was down for 60 hours (not 16). Our app isn't even out of private beta so I fully expected to be paying Heroku $2-3K/month by the end of the year. Now, I'm not sure I'll stay.
If you're really taking 100% responsibility, then consider pro-rating the bills of affected paying customers (based on the downtime).
You generally won't get all that for $2K to $3K a month. Sure, you can drop $15K on an expensive database server, and co-locate it somewhere. But that only works until somebody takes a backhoe to your fiber, your RAID controller fails catastrophically, somebody pwns your production server, your sysadmin flakes out, or you discover that your backup scripts have been broken for months.
Realistically, if you're only spending $2-3K per month on hosting and administration, you'll eventually experience one or more of the above, and your site may be down for a day or more.
This isn't to say that I'm happy about Heroku's long downtime. One of my clients was offline almost as long as you were. But I'm pleased that Heroku recognizes just how badly they screwed up, and that they're taking the two most important steps they can to prevent a recurrence: multi-region support, and continuous backups for everyone. Multi-region support may not be sufficient to protect against cascading Amazon outages, but it's a good start.
That's implied, at the very least, by the phrase "Heroku takes 100% of the responsibility."
I would be very surprised if they didn't offer more than that.
They can keep their $20 in my view, as long as they ensure it never happens again.
Does Heroku even have an SLA? I can't find it. If they did, maybe they would have been more proactive about preventing this kind of problem.
I suspect Heroku has SLAs for their bigger customers, but I don't really know for sure. I do think you're overestimating what kind of incentive an SLA is for a provider, though. SLAs are basically an on-paper way of showing your commitment to keeping things running and responding to problems. If you don't have that commitment already, the paper isn't going to change anything.
Pointy haired bosses and lawyers love SLAs, but smart people who shop for this stuff don't care all that much about them. An SLA isn't going to convince me to go with one provider over another, nor is lack of an SLA going to make me avoid a provider I already like and respect.
I am not a lawyer or a PHB, but I run a small business with customers who pay for a service; if that service goes down, I look bad and they are upset.
You must think pretty highly of your app. Why don't you take it out of 'private beta' and let the rest of us look at it?
But being on multiple Availability Zones was supposed to be bulletproof (according to Amazon). Now that we know that wasn't the case, is being hosted on multiple regions going to provide the necessary level of protection?
Is it an over-reaction to say that relying completely on Amazon could now be seen as irresponsible to your users, given the magnitude of this event?
Honestly, I would prefer this kind of mass outage to the alternative. It's cheaper, easier, and I bet you there's still better uptime overall.
However, while it's obviously good PR, and we all appreciate the mea culpa from Heroku, the fact is that they are proposing to migrate to a situation where they are still completely reliant on AWS for their hosting.
I'm just not sure you can really say "We don’t want to ever put our customers through something like this again and we’re working as hard as we can on making sure that we won’t ever have to.", when at the end of the day, you are again relying on a company that has failed you in the past.
Not trying to attack Amazon or Heroku, I'm honestly intrigued by this issue; not to mention the fact that we are facing the exact same decision at work.
If Heroku evolves to an architecture in which they utilize multiple AWS regions (as they mention in lesson #1 of their post-mortem) and if each region has a distinctly partitioned API "control plane," this should result in a materially improved availability situation for Heroku. EC2 Availability Zones guard against machine, power, and building failures. EC2 Regions should theoretically guard against API infrastructure and AWS software code failures.
Heroku need not necessarily ditch their current single-IaaS-provider architecture in order to achieve significantly better control over their service's uptime.
On the other hand, when downtime does occur, the ability for Heroku to prioritize their incident response manpower to first handle paying customers has its limits based on their downstream dependencies. If all the broken bits are within Amazon's black box, Heroku doesn't have much control over prioritization (Amazon fixes your stuff whenever it gets around to fixing your stuff). If Heroku operated over multiple cloud providers, even with the added complexity of such an approach, at least Heroku would have control over choosing which of their most important customers to migrate first to a working cloud, away from a broken and black box cloud.
In the end, I certainly don't see these considerations as simple. It's easy to cry when things go wrong, but I think the level of scalability and availability that has been achieved up to the present is quite noteworthy.
Wait and see it is...
Just as Heroku took responsibility for the unexpected weaknesses their reliance on a single region created, I believe their customers should take responsibility for the unexpected weaknesses our reliance on a single hosting provider has created.
Heroku still has the value of added resiliency, even if it's not 110% bulletproof. Ultimately, we're responsible for the architecture design of our own sites.
It would be easy, tempting, and heck, even reasonable to assign at least a portion of the blame to Amazon. Their approach is interesting because their customers already know that, but are likely to appreciate the forthright acceptance of responsibility.
It's a good lesson. If I'm being totally honest I'd have to admit that, as a developer, I sometimes blame external services or events for things that I have at least partial control over. Perhaps I should adopt Heroku's approach instead.
Sure you can blame AWS because they said multiple availability zones in the same region would work. But at the same time there is an expectation that a site like Heroku is knowledgeable enough and sophisticated enough to intelligently process what AWS says and determine what's appropriate for them.
Personally, I prefer to just get the blame part out of the way by taking responsibility and concentrate on the important things: fixing the problem and making sure it doesn't happen again.
I think that deep down, people aren't that concerned with whose fault it was. They just want to know that someone is going to fix it.
By suggesting they take responsibility, they also are in a position where they have to make good for all of the downtime their customers experienced.
Short term - that will be an expensive decision. Long term, I think it's the right thing to do. It certainly builds up my confidence level in them.
Everyone makes mistakes, so what matters is how you deal with them. This was the right way to respond. Thanks.
Based on every post-mortem I've read thus far, it's clear that how AWS and its customers approach EBS will change.
So, take 100% of the responsibility, but I wouldn't think any less of Heroku if they only took 50%.
It is surprising they don't talk about refunds for the downtime, if they are taking responsibility. I'd imagine we will see this soon?
Sure, it would be easy to blame Amazon - really easy. But as I said I was paying for rails hosting from Heroku, not Amazon.
In this example, think of Amazon as Comcast and Acme Corp as Heroku. Heroku wasn't prepared to handle this type of failure, so they're at fault.
If I were providing a paid service and my dedicated servers went offline for some reason (say, the fiber line gets cut by road maintenance crews), my customers wouldn't care what happened. Perhaps some would offer sympathy, but in the end it's my fault, and my responsibility to offer some kind of contingency plan, which generally includes load balancing over multiple datacenters, etc. The same view should have been taken here: too much reliance was placed on one region (even though Amazon promised it would be safe; that part is their fault ;)). In the end, they should take full blame and learn from their mistakes.
For all we know multiple regions could have an undiscovered common failure point.
Don't get me wrong. Heroku isn't entirely blameless--I had a production app that was down for about 12 hours.
Where does it end? Is it 100% your customers' fault for using your service and not using multiple services that match your service to be redundant?
I'm just making conversation here now, but I feel like Heroku did not have to go this far.
It's a sucky situation, and I feel for Heroku - and I blame AWS personally - but when it comes to the final customer, then as a service provider it's only right to take blame.
I think the blame stops at Heroku who either need to provide more redundancy for incidents like this or let their customers know that if this type of incident arises there is little that they can do.
Let's be realistic about this; for most people using Heroku the alternative would have been bare EC2, which could easily have suffered the same fate as on Heroku.
Everyone should feel positive that they got to spend ~60 hours just sitting around moaning about being let down, instead of having to sweat their nuts off attempting to rehabilitate crazy, suicidal infrastructure.
Even taking this downtime into account, heroku is still cost effective for me in a lot of cases.
Heroku should save their customers this pain by setting up anycast:
1. What exactly is going on?
2. When will it be fixed?
In the middle of a crisis, saying "we're aware of the problem, and we're working hard to fix it," for hours does not really count as communication. It increases customer aggravation rather than decreasing it. Customers want to know answers to the above two questions. They don't care that you know about the problem and that you're working on it, unless you're not doing those two things in which case they will be (and should be) furious; those two things are expected.
Barring the ability to tell your customers "we will be back up at X:00", I think the best approach is to share as much information as you can without getting into proprietary information. That's why I think GP considered their communication a failure. That's why I consider their communication a failure, although I've seen this pattern enough from different companies that I don't hold it against Heroku as long as they learn from it.
Heroku's only way to answer your questions would have been to lie.
The mechanism is PostgreSQL continuous archiving.
This tool is still quite nascent. It received quite a trial by fire, having not yet (before this point) been widely revealed as a value-added feature of the service.
WAL-E is a program that PostgreSQL can use to push database changes to S3.
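For the curious, a minimal sketch of how that wiring looks (the envdir-based invocation follows WAL-E's README; the paths and the backup schedule are illustrative, not Heroku's actual setup):

    # postgresql.conf -- enable continuous archiving (PostgreSQL 9.0 era)
    wal_level = archive
    archive_mode = on
    # hand each finished 16MB WAL segment to WAL-E for upload to S3;
    # envdir (from daemontools) supplies AWS credentials and WALE_S3_PREFIX
    archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'

    # plus a periodic base backup of the cluster directory, e.g. from cron:
    # envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/9.0/main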
Depending on how you configure PostgreSQL checkpoints, the most data you'd lose is somewhere between a couple of seconds and a minute. I'd assume Heroku would make it a couple of seconds. The downside to more frequent backups is more storage space (each checkpoint (WAL archive) stored on S3 is a minimum of 75k or so, even if there weren't any changes).
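Assuming the knob in play is archive_timeout, which forces a WAL segment switch even when the current 16MB segment isn't full, the bound looks something like this (the value is purely illustrative):

    # postgresql.conf
    # force a segment switch (and hence an archive_command run) at least
    # this often, bounding the temporal window of un-archived transactions
    archive_timeout = 60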
Although people like to measure the data loss temporally, it'd be more precise, to the system-minded, to say that it's 16MB of transaction log loss should the drive die between COMMIT and the WAL-E send. Thus, temporally, there is a plateauing effect: the more data you push (up to a point), the less you will lose temporally, because Postgres swaps segments more quickly. If you push too much, backlogs can occur. If you measure in terms of xact bytes lost, it's simple: a maximum of 16MB to (32-epsilon)MB, assuming a trivial backlog size, lose-able between COMMIT; and the archiver send.
A word on backlogs: my experience would suggest you need to be doing very demanding things (bulk loads, large in-server denormalizations, or large statement executions) to produce backlog, given the throughput one sees on EC2. It's easy to write a monitoring query to detect this using pg_ls_dir and regular expressions or similar; a sketch follows. Nominal operation doesn't often see backlog; the pipes to S3 are reasonably fat. I hope to more carefully document ways to limit these backlogs via parallel execution and adaptive throttling of the block device I/O for the WAL writing. Another idea I had was to back WAL writes in memory in addition to on disk (RAID-1) so WAL-E would have a chance to send the last few WAL segments, if any, in the event of sudden backing block device failure.
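A minimal sketch of such a check (it requires superuser, since pg_ls_dir is restricted; the .ready files in archive_status mark segments the archiver hasn't shipped yet, and pg_xlog is the directory name of this era):

    -- count WAL segments awaiting archiving; a growing number means backlog
    SELECT count(*) AS archive_backlog
      FROM pg_ls_dir('pg_xlog/archive_status') AS f
     WHERE f LIKE '%.ready';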
A dead WAL drive is interesting because it will prevent COMMIT; from executing successfully; hence the amount of data loss is reduced (because availability comes to a halt immediately, even if the WAL segment is incomplete). Whereas if a Postgres cluster disk fails, new transactions might COMMIT (the WAL continues to be written, and no fsync that would block has necessarily been issued), but you have a good chance of grabbing those segments anyway as database activity halts, since WAL-E can continue to execute even in the presence of a failed block device serving the postgres cluster directory. A dead WAL drive will nominally allow non-writing SELECT statements to execute, so availability is generally lost to new writes only, although this may change on account of crash-safe hint bits (I'm not terribly familiar with the latest thinking on that design, but I imagine it may have to generate WAL when doing read-only SELECTs).
Finally, interesting things are possible with synchronous replication and tools like pg_streamrecv in 9.1, even if pg_streamrecv runs on the same box: I don't see an obvious reason why it would not be possible to allow for user-transaction-controlled durability of at least two levels: committed to EBS, and committed to S3. S3 could effectively act as a synchronous replication partner.
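9.1's per-transaction synchronous_commit setting already hints at how such user-controlled durability levels might be surfaced; a sketch of the idea (the two-level EBS/S3 distinction is my speculation, not an existing Postgres feature, and the table is hypothetical):

    -- PostgreSQL 9.1 lets each transaction choose its durability level:
    BEGIN;
    SET LOCAL synchronous_commit = on;    -- wait for the sync standby's ack
    INSERT INTO ledger (id, amount) VALUES (1, 100);
    COMMIT;
    -- an analogous pair of levels ("flushed to EBS" vs. "archived to S3")
    -- would make S3 an explicit synchronous replication partner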
Fundamentally, putting aside the small archiver asynchronism, EBS with WAL-E is basically a cache of sorts to speed up recovery. The backing store is really, in some respect, S3.
I'm not deeply familiar with the Postgres versions of that (to my regret), but for the MySQL version you can read something like this:
Better yet, find yourself a copy of High Performance MySQL.
MySQL has long relied on statement-based replication, which can lead to server drift in the case of any nondeterministic query. This is a total killer for extensibility of the database, as well as correctness in general. It also has a row-based replication variant that showed up around 2008 that represents a significant improvement, but the search results for "mysql rbr" might give you pause...
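To illustrate the kind of nondeterminism that makes statement-based replication drift (the table here is hypothetical):

    -- under statement-based replication this statement is replayed
    -- verbatim on each slave:
    UPDATE jobs
       SET token = UUID()   -- UUID() evaluates to a different value on the slave
     WHERE claimed_by IS NULL
     LIMIT 1;               -- without ORDER BY, which row is hit is undefined
    -- master and slaves silently diverge; row-based replication avoids this
    -- by shipping the changed rows themselves rather than the statement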
My guess is this is why Amazon made the sensible choice to back their RDS (MySQL) product with DRBD and synchronous, block-device-level replication: there is no good application-level option for MySQL that is to be trusted. This technique can also be used with PostgreSQL. However, DRBD tends to have a punishing performance impact, is complicated, and is not very suitable for a hot standby unless you write very complex shared-storage database software like Oracle RAC, which is why so much effort went into WAL-streaming hot standby in the PostgreSQL community. The DRBD option is venerable, dating back to the LiveJournal early days as their MySQL HA option, and probably before that (credit to LiveJournal for well documenting their HA setup, including their use of DRBD).
In all fairness, I've read that the reddit devs have made lots of boneheaded mistakes in their general infrastructure design, but it still seems Amazon is not a very reliable platform to build your stuff on. Platforms built on top of Amazon's, even less so.
AMZN in general is a pretty solid 'platform' (especially if you're not using EBS), but because this whole 'cloud' thing is still partially uncharted territory, there are still holes, and you can't treat it like a normal web host.