Heroku's AWS outage post-mortem (heroku.com)
202 points by mileszs on Apr 27, 2011 | 80 comments

Our monitoring systems picked up the problems right away. The on-call engineer quickly determined the magnitude of the problem and woke up the on-call Incident Commander. The IC contacted AWS, and began waking Heroku engineers to work on the problem. Once it became clear that this was going to be a lengthy outage, the Ops team instituted an emergency incident commander rotation of 8 hours per shift, keeping a fresh mind in charge of the situation at all times. Our support, data, and other engineering teams also worked around the clock.

The system they are using (IC, ops, engineer teams, operational periods) is extremely similar to the Incident Command System. The ICS was developed about 40 years ago for fighting wildfires, but now most government agencies use it to manage any type of incident.

I've experienced it first hand and can say it works very well, but I have never seen it used in this context. The great thing about it is its expandability: it will work for teams of nearly any size. I'd be interested to hear whether any other technology companies/backend teams are using it.


Kudos to Heroku for taking full responsibility, and for planning to engineer around these kinds of Amazon problems in the future.

In particular, I'm delighted to hear that they plan to perform continuous backups on their shared databases:

3) CONTINUOUS DATABASE BACKUPS FOR ALL. One reason why we were able to fix the dedicated databases quicker has to do with the way that we do backups on them. In the new Heroku PostgreSQL service, we have a continuous backup mechanism that allows for automated recovery of databases... We are in the process of rolling out this updated backup system to all of our shared database servers; it’s already running on some of them and we are aiming to have it deployed to the remainder of our fleet in the next two weeks.

Combined with multi-region support, this should make Heroku far more resilient in the future.

Kudos? For nothing but the words "heroku takes 100% of the responsibility ..."?

Sorry, but that's not cutting it for me right now. I pay Heroku $250 a month and I was down for 60 hours (not 16). Our app isn't even out of private beta so I fully expected to be paying Heroku $2-3K/month by the end of the year. Now, I'm not sure I'll stay.

If you're really taking 100% responsibility, then consider pro-rating the bills of affected paying customers (based on the downtime).

I've run both cloud and non-cloud applications. In my experience, you won't get 99.95% annual uptime over 5 years without a full-time sysadmin, the ability to provision a complete offsite infrastructure and fail over to it within a few hours, and a backup/restore process that you rigorously test every month or so.

You generally won't get all that for $2K to $3K a month. Sure, you can drop $15K on an expensive database server, and co-locate it somewhere. But that only works until somebody takes a backhoe to your fiber, your RAID controller fails catastrophically, somebody pwns your production server, your sysadmin flakes out, or you discover that your backup scripts have been broken for months.

Realistically, if you're only spending $2-3K per month on hosting and administration, you'll eventually experience one or more of the above, and your site may be down for a day or more.

This isn't to say that I'm happy about Heroku's long downtime. One of my clients was offline almost as long as you were. But I'm pleased that Heroku recognizes just how badly they screwed up, and that they're taking the two most important steps they can to prevent a recurrence: multi-region support, and continuous backups for everyone. Multi-region support may not be sufficient to protect against cascading Amazon outages, but it's a good start.

"If you're really taking 100% responsibility, then consider pro-rating the bills of affected paying customers (based on the downtime)."

That's implied, at the very least, by the phrase "heroku takes 100% of the responsibility. "

I would be very surprised if they didn't offer more than that.

Your biggest concern is you want a $20 refund?

They can keep their $20 in my view, as long as they ensure it never happens again.

The company where we host most of our servers has an SLA that starts paying a 10% monthly refund per 10 minutes of downtime that is their fault. I can't believe you guys will accept this magnitude of downtime and still stick around.

Does Heroku even have an SLA? I can't find it. If they did maybe they would have been more proactive to prevent this kind of problem.

"Downtime that is their fault" is kind of a giant caveat, no? Is it their fault if they lose transit or power, for instance? With that level of refunds, I suspect "their fault" basically only covers one of them accidentally running over a server with their car. The problem is, that guarantee isn't getting anyone anything of value.

I suspect Heroku has SLAs for their bigger customers, but don't really know for sure. I do think you're overestimating what kind of incentive an SLA is for a provider, though. SLAs are basically an on paper way of showing your commitment to keeping things running and responding to problems. If you don't have that commitment already, the paper isn't going to change anything.

Pointy haired bosses and lawyers love SLAs, but smart people who shop for this stuff don't care all that much about them. An SLA isn't going to convince me to go with one provider over another, nor is lack of an SLA going to make me avoid a provider I already like and respect.

I don't know about other people's SLAs, but seeing as you're hinging on my simplified description: my SLA provides 100% uninterrupted transit to the Internet and 100% uninterrupted electricity, so if the power goes out it is still 'their fault', but if I rm -rf / it is my fault.

I am not a lawyer or a PHB but I run a small business that has customers that pay for a service so if that service goes down I look bad and they are upset.

Oh, well 100% uptime for power and bandwidth is pretty standard then, I figured you were comparing an SLA for similar type services as you'd get from Heroku and/or EC2.

Do they have an SLA? I presume they can pass on some of the refunds from Amazon.

You're going to complain about a refund of what, ~ $22, from a service with no SLA for a site in 'private beta'?

You must think pretty highly of your app. Why don't you take it out of 'private beta' and let the rest of us look at it?

Is this reasonable? I'm sure a lot of amazon hosted companies are thinking similar thoughts.

But being on multiple Availability Zones was supposed to be bulletproof (according to Amazon). Now that we know that wasn't the case, is being hosted on multiple regions going to provide the necessary level of protection?

Is it an over-reaction to say that relying completely on Amazon could now be seen as irresponsible to your users, given the magnitude of this event?

The one problem I have with such concerns is this: what other viable options are there? Google AppSpot? Windows Azure? Perhaps. But AWS is flexible, very few stack limitations. The only other alternative I think is to go back to the pre-cloud era, when hosting was much more expensive, and outages were still possible, especially when you couldn't keep up with big traffic spikes.

Honestly, I would prefer this kind of mass outage than the alternative. It's cheaper, easier, and I bet you there's still better uptime overall.

Agreed. On both counts.

However, obviously it's good PR, and we all appreciate the mea culpa from Heroku; the fact is, they are proposing to migrate to a situation where they are still completely reliant on AWS for their hosting.

I'm just not sure you can really say "We don’t want to ever put our customers through something like this again and we’re working as hard as we can on making sure that we won’t ever have to.", when at the end of the day, you are again relying on a company that has failed you in the past.

Not trying to attack Amazon or Heroku, I'm honestly intrigued by this issue; not to mention the fact that we are facing the exact same decision at work.

Regarding Heroku's plan to continue relying on a company, Amazon, that failed them before:

If Heroku evolves to an architecture in which they utilize multiple AWS regions (as they mention in lesson #1 of their post-mortem) and if each region has a distinctly partitioned API "control plane," this should result in a materially improved availability situation for Heroku. EC2 Availability Zones guard against machine, power, and building failures. EC2 Regions should theoretically guard against API infrastructure and AWS software code failures.

Heroku need not necessarily ditch their current single-IaaS-provider architecture in order to achieve significantly better control over their service's uptime.

On the other hand, when downtime does occur, the ability for Heroku to prioritize their incident response manpower to first handle paying customers has its limits based on their downstream dependencies. If all the broken bits are within Amazon's black box, Heroku doesn't have much control over prioritization (Amazon fixes your stuff whenever it gets around to fixing your stuff). If Heroku operated over multiple cloud providers, even with the added complexity of such an approach, at least Heroku would have control over choosing which of their most important customers to migrate first to a working cloud, away from a broken and black box cloud.

In the end, I certainly don't see these considerations as simple. It's easy to cry when things go wrong, but I think the level of scalability and availability that has been achieved up to the present is quite noteworthy.

Interesting that this is essentially all stemming from yet again a communication failure from AWS. Once they have a post-mortem and can explain the multi-AZ issue, we may have a better idea of whether multi-region spread is sufficient redundancy. Or they could completely fail to communicate enough information, and adequately wary customers will be left with no choice but to assume that regions are not sufficiently independent.

Wait and see it is...

I'm currently trying pretty hard to get one support answer out of Google to help prevent 6 hours of downtime next week. Go with the smaller companies who will at least respond when you need it.

Human contact is not exactly one of Google's strengths in the first place though. :)

Go HRD, much better.

Agreed. I'm trying to, but need them to link my old app to the new one for SSL API purposes, but that means a mystical form-filling-waiting-might-happen-might-not. I'll focus on the things that I can change while I wait!

The other option would of course be to go old school, go "non-cloud", and buy their own hardware that they manage, but this is an old man talking.

You are comparing IaaS to PaaS. They are not the same.

Fair enough, point well taken. But I don't think acknowledging that difference changes the crux of the Q&A here. What IaaS does it better than AWS? Seriously. It's not a surprise that Heroku, Engineyard, and various other PaaS options are built on AWS.

We could just as easily say that relying completely on Heroku is irresponsible to our own users.

Just as Heroku took responsibility for the unexpected weaknesses their reliance on a single region created, I believe their customers should take responsibility for the unexpected weaknesses our reliance on a single hosting provider has created.

Heroku still has the value of added resiliency, even if it's not 110% bulletproof. Ultimately, we're responsible for the architecture design of our own sites.

I'm very impressed by how they take responsibility for this, in their words: "HEROKU TAKES 100% OF THE RESPONSIBILITY FOR THE DOWNTIME AFFECTING OUR CUSTOMERS LAST WEEK."

It would be easy, tempting, and heck, even reasonable to assign at least a portion of the blame to Amazon. Their approach is interesting because their customers already know that, but are likely to appreciate their forthright acceptance of responsibility.

It's a good lesson. If I'm being totally honest I'd have to admit that, as a developer, I sometimes blame external services or events for things that I have at least partial control over. Perhaps I should adopt Heroku's approach instead.

Do they have another option? If no site had survived the outage, then they may have been on to something. But with some sites surviving the outage, they just have no excuse.

Sure you can blame AWS because they said multiple availability zones in the same region would work. But at the same time there is an expectation that a site like Heroku is knowledgeable enough and sophisticated enough to intelligently process what AWS says and determine what's appropriate for them.

It's sort of counter-intuitive, but taking responsibility for something (even if it's not directly your fault) often has the effect of deflecting some of the anger from your customers/clients/boss/etc.

Personally, I prefer to just get the blame part out of the way by taking responsibility and concentrate on the important things: fixing the problem and making sure it doesn't happen again.

I think that deep down, people aren't that concerned with whose fault it was. They just want to know that someone is going to fix it.

The reason why you don't want to take responsibility is that liability comes along with it. If Heroku took the position that the AWS outage was force majeure, then their liability for recompense to their customers would have been minimized.

By suggesting they take responsibility, they also are in a position where they have to make good for all of the downtime their customers experienced.

Short term - that will be an expensive decision. Long term, I think it's the right thing to do. It certainly builds up my confidence level in them.

As a PaaS vendor, they are supposed to abstract away from IaaS failures. And they were not supposed to use a single region to host all their apps. I love Heroku and will continue to use them as long as I have the option to add affinity to my dynos and workers to spread across multiple regions of my choice. Coupled with anycast DNS support, this will be a very compelling offering, if they can pull it off. During the outage, all of our scale engines (http://blitz.io) and our CouchDB cluster across the other AWS regions held up, but since the web-tier was down, the whole app went down.

The AWS outage is definitely not over. Apparently RDS is built on EBS and they have not all been restored, I can tell you that first hand.

UPDATE: we were fully restored after midnight last night. It is a very happy feeling!

Thank you for taking full responsibility.

Everyone makes mistakes, so what matters is how you deal with them. This was the right way to respond. Thanks.

I wish Amazon was as good at communication and accountability as Heroku is.

"Block storage is not a cloud-friendly technology".

Based on every post-mortem I've read thus far, it's clear that how AWS and its customers approach EBS will change.

Where is Amazon's?

It's impressive that they're taking full responsibility, but I'm very surprised there's no mention of refunds.

Given that Heroku charges based on the time your application is up I wouldn't be surprised if everyone just gets a bill which doesn't include the time their sites were offline.

What the hell? Why is everyone taking responsibility and giving Amazon a free ride? I'm a firm believer that only victims make excuses, and it's admirable to take responsibility, and maybe they should have more redundancy in place, but the way AWS has been advertised, most of us felt this kind of thing should never happen, even without a 100% uptime guarantee.

So, take 100% of the responsibility, but I wouldn't think any less of heroku if they only took 50%.

I pay Heroku to host my rails apps, not Amazon. I don't give a flying fk what kind of back end infrastructure they use, as long as people can get to my app.

It is surprising they don't talk refunds for the downtime, if they are taking responsibility. I'd imagine we will see this coming soon?

So you are saying if they are taking 100% of the responsibility they assume 100% of the liability? This is exactly why I'm suggesting they may have put their foot in their mouth. SOME of the fault reasonably lies with Amazon in my opinion, and I personally would not have cared if they took 50%. That's all really.

Legally? Yes. Morally? Yes. Ethically? Yes. Did they have to do this? No. Will I stay a customer because they took responsibility, instead of blaming someone else? Yes. Do I wish every company (including Amazon themselves) had the balls that Heroku does? Yes.

Sure, it would be easy to blame Amazon - really easy. But as I said I was paying for rails hosting from Heroku, not Amazon.

What about this example? You have Acme Corp Datacenters, who sell dedicated servers to their customers. If Acme Corp has a network outage because their single Comcast connection went down because Comcast was having some routing errors, the customers who are affected go to Acme Corp. It isn't the customers' fault that Acme Corp wasn't prepared to deal with a downed connection and set up a redundant network.

In this example, think of Amazon as Comcast and Acme Corp as Heroku. Heroku wasn't prepared to handle this type of failure, so they're at fault.

By that argument, the customers of Heroku who weren't prepared to handle Heroku's failure were at fault. They should have had alternate rails hosting lined up.

Amazon is at fault to Heroku. However, Heroku isn't talking to Amazon here, they are talking to their customers. Heroku is at fault to those customers. In turn, those customers are at fault to their users. Heroku can't pass the buck to Amazon, and those customers can't pass the buck to Heroku for their downtime (They can, but it's still their responsibility).

Well, it depends on what Heroku's SLA was, if any. If Heroku stated a 100% uptime guarantee, then Heroku would be to blame for not living up to it. If Heroku said, hey, listen, we can't guarantee any amount of uptime, so be prepared, and someone were to host "mission critical" information on Heroku, then yeah, it would be the customer's fault.

Heroku has no SLA

Because Heroku isn't positioned as a way to manage your EC2 instances, it's a whole hosting solution. How much more confusing would it be for customers if Heroku said that they take responsibility for downtime that's not Amazon's fault? What counts as Amazon's fault? What about customers that know only Rails and don't care about how Heroku works on the backend?

I get what they are doing, and I would probably handle it the same way, but I think it is perfectly OK for them to place some blame here, as their downtime WAS technically because AWS was down. I think they took it too far by taking 100% of the blame. I'm on Heroku's side; I don't think it was fair to themselves to take 100% of the blame.

As a developer who uses AWS, it's complicated, but I agree that Heroku is at 100% fault.

If I were providing a paid service and my dedicated servers went offline due to some reason (let's say the fiber line gets cut by road maintenance crews) - my customers wouldn't care what happened - perhaps some will offer sympathy - but in the end it's my fault, and my responsibility to offer any kind of contingency plan - which generally includes load balancing over multiple datacenters etc. The same view should have been taken here and too much reliance was set on one region (even though Amazon promised it would be safe - this is their fault ;)). In the end, they should take full blame and now learn from their mistakes.

If you replace fiber line with power line, I was in this exact position in real life once. The power went out in our data centre, and our backups all failed: UPSes, diesel generators, etc. The customers who had their servers in our data centre didn't blame the guy who cut the power cable, or the power company. They blamed us, the company who ran the data centre. The power outage wasn't our fault; the catastrophic failure and lack of disaster recovery planning was. Banks and hospitals are not forgiving when you break their SLAs.

Exactly - and I was actually in a position where the Virginia Dept of Transportation were doing roadwork and cut the fiber lines at the data center I was hosting at. ServInt did everything they could to remedy the situation and I remained a loyal customer for a long time after that, but the clients I was hosting really didn't care about that and moved away from me - and I respect that and understand that completely.

The thing is, multiple availability zones are in multiple data centers. We now know that they have a common failure point, but how could anyone have known that before it happened?

For all we know multiple regions could have an undiscovered common failure point.

Don't get me wrong. Heroku isn't entirely blameless--I had a production app that was down for about 12 hours.

I'm just saying it isn't fair to Heroku to take 100% responsibility. Do you now have to go to the customers of YOUR app hosted at Heroku and take 100% of the blame? Or do you tell them it was Heroku? Were YOU using more hosts than just Heroku?

Where does it end? Is it 100% your customers' fault for using your service and not using multiple services that match your service to be redundant?

I'm just making conversation here now, but I feel like Heroku did not have to go this far.

Yes. If I have an app that I host on Heroku and it goes down, it's MY fault for not putting in place the appropriate backups. I would apologize to my customers and offer them a refund for the downtime. I would not blame Heroku, or Amazon. Do you think my (hypothetical) paying customers care if my hosting provider goes down? No Fking way.

As someone who has been in the position of having to explain to my customers that our host went down: yes, they did understand, and because I had shown them through the years how hard I worked for them, not one expected a refund or anything. They knew that part of what they were paying me for was to fix these problems when they inevitably came up. I'm not saying this will work for everyone, but I believe a lot of people are starting to understand the risks of doing business in the cloud, and the truth of the matter is, some patience is required.

If I have a contractor come and redo my kitchen, and I am quoted a certain time frame then I will blame him when things do not happen on time - even if his supplier couldn't deliver the cabinets in time, or the hardware store was out of joint compound.

It's a sucky situation, and I feel for Heroku - and I blame AWS personally - but when it comes to the final customer, then as a service provider it's only right to take blame.

Heroku's value proposition though is that they are taking away the back end worries for their customers so they can just deploy apps and not have to worry about it all. If I have to still worry about redundancy outside of their system then their service suddenly becomes a lot less useful.

I think the blame stops at Heroku who either need to provide more redundancy for incidents like this or let their customers know that if this type of incident arises there is little that they can do.

Exactly except that a problem came up, which it still could in the future even with these changes they are making, and when it does, you can probably rely on them to fix those problems because that is what you are paying for - someone else to run the back end, and that appears to be exactly what they are doing.

This is why I love hosting on Heroku: they'll work their butt off to get it fixed when it's down, and I don't have to lift a finger. However, EBS has long been known to be a turd; it's a pity they relied on it. Plus, if they had a way to bring it back up in a different region (e.g. the euro AWS infrastructure) at the flick of a switch, that'd make me less nervous...

I don't think this is particularly 'honorable' or anything like that; it's the only sensible stance for them to take.

Let's be realistic about this; for most people using Heroku the alternative would have been bare EC2, which could easily have suffered the same fate.

Everyone should feel positive that they got to spend ~60 hours just sitting around moaning about being let down, instead of having to sweat their nuts off attempting to rehabilitate crazy, suicidal infrastructure.

Even taking this downtime into account, heroku is still cost effective for me in a lot of cases.

>It's a big project, and it will inescapably require pushing more configuration options out to users (for example, pointing your DNS at a router chosen by geographic homing

Heroku should save the customers this pain, by setting up anycast:


They gloss over their biggest failure; they weren't communicating or interacting with their customers at all.

* http://twitter.com/#!/heroku
* http://twitter.com/#!/herokustatus

They show a history of updates to their status blog and to the herokustatus Twitter account since April 21, what do you mean by 'they weren't communicating'?

From 9:07 to 20:43, the status updates were generic and not very helpful in answering two questions customers want to know:

1. What exactly is going on?
2. When will it be fixed?

In the middle of a crisis, saying "we're aware of the problem, and we're working hard to fix it," for hours does not really count as communication. It increases customer aggravation rather than decreasing it. Customers want to know answers to the above two questions. They don't care that you know about the problem and that you're working on it, unless you're not doing those two things in which case they will be (and should be) furious; those two things are expected.

Barring the ability to tell your customers "we will be back up at X:00", I think the best approach is to share as much information as you can without getting into proprietary information. That's why I think GP considered their communication a failure. That's why I consider their communication a failure, although I've seen this pattern enough from different companies that I don't hold it against Heroku as long as they learn from it.

Both questions were unanswerable. Even Amazon's estimates were wildly off, and they can actually look at the infrastructure.

Heroku's only way to answer your questions would have been to lie.

I'd really love to know some details on the continuous backup stuff. Sounds cool.

Not sure why it was dead-ed (possibly a double-post), but here's the answer from an author of it in case you don't have showdead on:

fdr 1 hour ago | link | parent [dead] | on: Heroku's AWS outage post-mortem

The mechanism is PostgreSQL continuous archiving.


This tool is still quite nascent. It received quite a trial by fire, having not yet been revealed at wide scale as a value-added feature of the service.

I started using WAL-E a couple days ago for one of my own sites.

WAL-E is a program that postgresql can use to push database changes to S3.

Depending on how you configure postgresql checkpoints, the most data you'd lose is somewhere between a couple of seconds and a minute. I'd assume Heroku would make it a couple of seconds. The downside to more frequent backups is more storage space: each checkpoint (WAL archive) stored on S3 is a minimum of 75k or so, even if there weren't any changes.
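For context, here is a minimal sketch of how this kind of archiving is typically wired into postgresql.conf. The envdir path and the timeout value are illustrative assumptions, not Heroku's actual settings:

```ini
# postgresql.conf -- hypothetical settings, not Heroku's actual configuration
wal_level = archive            # emit enough WAL for archiving (9.0+)
archive_mode = on              # hand completed WAL segments to archive_command
archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'  # push each 16MB segment to S3
archive_timeout = 60           # force a segment switch at least once a minute,
                               # bounding temporal data loss on quiet databases
```

Lowering archive_timeout shrinks the window of un-archived transactions on an idle database, at the cost of archiving more (mostly empty) segments.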

A couple of seconds is very aggressive. There is a window for data loss when the segment is incomplete. The calculus of what this means in a real system is somewhat complicated, and sketched below.

Although people like to measure the data loss temporally, it'd be more precise to the system-minded to say that it's 16MB of transaction log loss should the drive die between COMMIT and WAL-E send. Thus, temporally, there is a plateauing effect: the more data you push up to a point, the less you will lose temporally because Postgres swaps segments more quickly. If you push too much, backlogs can occur. If you measure in terms of xact bytes lost, it's simple: maximum 16MB-(32-epsilon)MB, assuming a trivial backog size, lose-able between COMMIT; and archiver send.

A word on backlogs: my experience would suggest you need to be doing very demanding things (bulk load, large in-server denormalizations or statement executions) to produce backlog given the throughput one sees on EC2. It's easy to write a monitoring query to do this using pg_ls_dir and regular expressions or similar. Nominal operation doesn't often see backlog, the pipes to S3 are reasonably fat. I hope to more carefully document ways to limit these backlogs via parallel execution and adaptive throttling of the block device I/O for the WAL writing. Another idea I had was to back WAL writes in-memory in addition to on-disk (RAID-1) so WAL-E would have a chance to send the last few WAL-segments, if any, in event of sudden backing block device failure.
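To illustrate the monitoring idea mentioned above: PostgreSQL leaves a `<segment>.ready` marker in pg_xlog/archive_status for every completed WAL segment that archive_command has not yet pushed, so counting those markers measures the archiver backlog (the same check a pg_ls_dir-based SQL query performs server-side). A small sketch, with a hypothetical helper name:

```python
import re

# WAL segment names are 24 hex characters; a trailing .ready marks a segment
# that the archiver (e.g. WAL-E) has not yet shipped to S3.
WAL_READY = re.compile(r'^[0-9A-F]{24}\.ready$')

def archive_backlog(status_dir_listing):
    """Count WAL segments still waiting to be archived."""
    return sum(1 for name in status_dir_listing if WAL_READY.match(name))

# Example: two segments already shipped (.done), one still pending (.ready)
listing = [
    '000000010000000000000001.done',
    '000000010000000000000002.done',
    '000000010000000000000003.ready',
]
print(archive_backlog(listing))  # -> 1
```

Alerting when this count grows past a small threshold catches exactly the bulk-load-induced backlogs described above.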

A dead WAL drive is interesting because it will prevent COMMIT; from successful execution, hence, the amount of data loss is reduced (because availability comes to a halt immediately, even if the WAL segment is incomplete). Whereas if a Postgres cluster disk fails new transactions might COMMIT (the WAL continues to write and no fsync that will block has necessarily been issued) but you have a good chance of grabbing those segments anyway as database activity halts, since WAL-E can continue to execute even in the presence of a failed block device serving the postgres cluster directory. A dead WAL drive will nominally allow non-writing SELECT statements to execute, so availability is generally lost to new writes only, although this may change on account of crash-safe hint bits (I'm not terribly familiar with the latest thinking of that design, but I imagine it may have to generate WAL when doing read-only SELECT).

Finally, interesting things are possible with synchronous replication and tools like pg_streamrecv in 9.1, even if pg_streamrecv runs on the same box: I don't see an obvious reason why it would not be possible to allow for user-transaction-controlled durability of at least two levels: committed to EBS, and committed to S3. S3 could effectively act as a synchronous replication partner.

Fundamentally, putting aside the small archiver asynchronism, EBS with WAL-E is basically a cache of sorts to speed up recovery. The backing store is really, in some respect, S3.

I was thinking setting archive_timeout to a low number would limit the temporal data loss.

You are right, especially if you aren't pushing much data. Your restore times may be rather long, though. I hope to implement a prefetching strategy to make this much, much faster, so one could do that if they absolutely wished.

Yes, a double post.

If I'm translating it correctly, this phrase is referring to database replication.

I'm not deeply familiar with the Postgres versions of that (to my regret), but for the MySQL version you can read something like this:


Better yet, find yourself a copy of High Performance MySQL.

WAL archive replay is also used for the PostgreSQL hot standby feature, aka replication. Combined with streaming you can get sub-second latency, but there's no reason you couldn't skip streaming and just use WAL-E to syndicate WAL to thousands of replicas with hot standby enabled (albeit with lag). Use your imagination if you want to write an event-driven, high-performance WAL streaming server. I haven't found the use case yet.
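A minimal sketch of what such a WAL-E-fed standby's recovery.conf might look like (9.0-era syntax; the envdir path is an assumption, and this is not a documented Heroku setup):

```ini
# recovery.conf on a lagging replica -- hypothetical sketch
standby_mode = 'on'
# Pull each requested WAL segment down from S3 rather than streaming from the
# master; replay lags by however long segments take to be archived and fetched.
restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch "%f" "%p"'
```

With hot_standby = on in the replica's postgresql.conf, the box serves read-only queries while it replays, which is the fan-out-to-many-replicas scenario described above.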

MySQL has long relied on statement based replication, which can lead to server drift in the case of any nondeterministic query. This is a total killer for extensibility of the database, as well as correctness in general. It also has a row-based-replication variant that showed up around 2008 that represents a significant improvement, but the search results for "mysql rbr" might give you pause...

My guess is this is why Amazon made the sensible choice to back their RDS (MySQL) product via DRBD and synchronous, block-device-level replication: there is no good application-level option for MySQL that is to be trusted. This technique can also be used with PostgreSQL. However, use of DRBD tends to have a punishing performance impact, is complicated, and is not very suitable for a hot standby unless you write very complex shared-storage database software like Oracle RAC, which is why so much effort went into WAL-streaming hot standby in the PostgreSQL community. The DRBD option is venerable, dating back to the LiveJournal early days as their MySQL HA option, and probably before that (credit to LiveJournal for well documenting their HA setup, including their use of DRBD).

I wish more companies (hell, people) were as forthright, pragmatic, and sensible as the Heroku gang. Their breakdown of and response to the outage is exactly what I, as a paying customer, want to hear.


What about also spreading across multiple providers (i.e. also using Rackspace Cloud)? They'd be less dependent on Amazon issues.

And now reddit is down again (posting/submitting is impossible). Probably yet another Amazon issue.

In all fairness, I've read that the reddit devs have made lots of boneheaded mistakes in their general infrastructure design, but it still seems Amazon is not a very reliable platform to build your stuff on. Platforms built on top of Amazon's, even less so.

I think boneheaded is a strong word. They're solving a problem that very few sites have to solve (huge traffic with low cache-ability) with vastly less resources than the others who do solve them have. (FB, Twitter, etc).

AMZN in general is a pretty solid 'platform' (especially if you're not using EBS), but because this whole 'cloud' thing is still partially uncharted territory, there are still holes, and you can't treat it like a normal web host.

Are you seriously not aware that reddit falls over pretty much all the time on its own?
