Hacker News
Why Reddit was down for 6 hours (reddit.com)
234 points by meghan on Mar 18, 2011 | 95 comments

I know it's not exactly in vogue these days to tout the merits of bare hardware, but after all the VPS hubbub over the last couple of years, the best progression for your website still seems to be:

1. No traction? Just put it anywhere, 'cause frankly, it doesn't matter. Cheapest reputable VPS possible. Let's say, Linode.

2. Scaling out, high concurrency and rapid growth? DEDICATED hardware from a QUALITY service provider: use Rackspace, SoftLayer, et al. Have them rack the servers for you and you'll still get ~3 hour turnarounds on new server orders. That's plenty fast for most kinds of growth. No inventory to deal with, and with deployment automation you're really not doing much "sysadmin-y" work or requiring full-timers who know what Cisco switch to buy.

3. Technology megacorp, top-100 site? Staff up on hardcore net admin and sysadmin types, colocate first, and eventually, take control of/design the entire datacenter.

I simply don't understand why so many of these high-traffic services continue to rely on VPSes for phase 2 instead of managed or unmanaged dedicated hosting. The price/concurrent user is competitive or cheaper for bare metal. Most critically, it's insanely hard to predictably scale out database systems with high write loads when you have unpredictable virtualized (or even networked) I/O performance on your nodes.

reddit actually is a top-100 site, but we don't have nearly the need to host our own datacenter or co-locate. If we do make a move, it will be to #2. I don't want to hire people to be hands-on; I'd rather outsource that and let someone else pay to have spare capacity lying around.

What kind of scale are you at? I mean, about how many 32 GiB RAM / 8-core servers would you need if you were using real hardware?

We have ~130 servers at Amazon right now. We could probably do it with 50-75 or less, depending on how big the boxes are.

Conde would very easily do this for you; they built an entire datacenter in Delaware just for their child companies to use like this.

Yes, they could. It is an option.

Fair enough; 100 was somewhat arbitrary, but there is some point, when you're at Google scale, where cutting down on power costs or achieving absolute minimum latencies between datacenters has a meaningful impact on the bottom line. Draw a line somewhere, and yeah, Reddit is probably on the other side.

Reddit has always amazed me with what they can do with extremely limited resources.

But it looks like that attitude has finally caught up with them, especially since they are down to just 3 technical staff (from 5 last week), and two of them are brand new.

Have some faith. We'll pull through. :)

Generally with you, except for the Rackspace recommendation; I've been burned and pissed off by Rackspace too many times to ever use or recommend them again.

I tend to try to find at least one good local/regional datacenter for a good portion of a server stack. There are huge benefits (IMO) in being able to drive to your servers and have a face-to-face meeting if there are issues, and/or take your toys and go someplace else if there is a massive outage. If you're in Green Bay, options might be limited, but in any semi-major metropolitan area there are usually enough datacenter options that you have multiple choices.

Another thought. If availability is your goal, with the trend toward 'operations as code', I think a small development team can build a system on top of AWS that can automatically respond to arbitrary node/resource/data-center failure. Netflix seems to do this to an extent, with their Chaos Monkey.

That being said, there are situations where you may truly need single-node performance that isn't available on AWS.

Chaos Monkey causes chaos, it does not fix it. :)

But yes, you are right. The goal is to have a system that scales itself. Not an easy task for sure.
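The "system that scales itself" described above is, at its core, a reconcile loop: continuously compare the fleet's actual state to its desired state and replace whatever has failed. A minimal sketch follows; every name in it is a hypothetical stand-in (a real version would wrap a cloud API such as boto's EC2 bindings and a real health check), not anything reddit or Netflix actually runs.

```python
# Toy reconcile loop: detect unhealthy nodes and replace them.
# is_healthy and launch_replacement are hypothetical stand-ins for
# a monitoring probe and a cloud provisioning call.

def is_healthy(node):
    # Stand-in for pinging the node or checking a monitoring endpoint.
    return node.get("healthy", False)

def launch_replacement(role):
    # Stand-in for an API call that boots a fresh instance for this role.
    return {"role": role, "healthy": True}

def reconcile(fleet):
    """Replace every unhealthy node so the fleet converges on its desired state."""
    replaced = []
    for i, node in enumerate(fleet):
        if not is_healthy(node):
            fleet[i] = launch_replacement(node["role"])
            replaced.append(node["role"])
    return replaced

fleet = [{"role": "app", "healthy": True},
         {"role": "db-slave", "healthy": False}]
print(reconcile(fleet))  # ['db-slave']
```

Run periodically (or triggered by a Chaos-Monkey-style failure injection), a loop like this is what turns arbitrary node loss into a non-event.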


A former employee is not quite as nice to Amazon.

Also note the two other former employees replied in agreement: KeyserSosa[1] and raldi[2].

[1] http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_d...

[2] http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_d...

It's so surreal reading their comments as "former employees."

This wasn't the case a few months ago. Good god I'm getting old, now I know how the poor people at plastic.com feel.

> This wasn't the case a few months ago.

For two of them, it wasn't the case a few days ago.

Assuming for a minute that Amazon deserves as much blame as ketralnis is heaping on here, why would the Reddit guys be so reluctant to point this out? Professionalism? Kindheartedness? Even professionalism and mutual respect have limits.

The community loves both the site and the admins, but there are limits to the patience of users, and those limits are being tested by these outages. I would think the Reddit guys would be happy to have a scapegoat to direct the community's rage towards.

Blaming a third party lacks class. The Reddit guys made the decision to rely heavily on EBS, and it came back to bite them. They show a lot of character by taking responsibility for an outage they had very little control over.

Also, it's not a great idea to badmouth someone who is currently providing you with a service and actively working on improving it for you (according to the blog post).

Then again, according to the comments Amazon is mostly actively promising to work on improving it, and has been for a year with little to show for it.

The good news is that it seems they have some leverage on Amazon now.

"Hi Mr. CIO of Amazon. You realize we had a 6 hour outage on one of the largest sites in the US because of you. We don't want to badmouth your service but we can. Care to pay more attention now?"

Reddit's no longer a cash-strapped startup with limited resources and options. It's ultimately their choice to stick with Amazon. If Amazon has been giving them the runaround for more than a whole year, maybe it would've been a smart decision to move to something else.

Happily, the Reddit team is better than the Reddit hivemind with regards to the appropriateness of channeling nerdrage.

I think in general a small group of professionals is going to be a little bit more focused and less knee-jerk-prone than a gigantic mass of similarly-minded people.

Because the whole site is hosted on Amazon, and it would be non-trivial to move off of it. They may also be receiving special pricing due to their size.

You shouldn't bite the hand that... hosts you.

And it is pretty unprofessional. Ketralnis thinks he's defending his buddies but he isn't really helping the situation.

I'm not sure that he really cares if he's helping the situation or not, seeing as he's no longer employed there.

I know first-hand that when a team is working on things, and it's very publicly going wrong, and it's due to things that are out of your control, it's beyond frustrating to have everyone think that it IS your fault due to the public stance.

I'm guessing that the Reddit team that is dealing with this is more than a bit pissed that they can't be more public with what the real reasons are.

Personally, I couldn't care less about the "professionalism" or political BS; I'd rather know the real reasons for the problems so that I could be better informed and not run into the same issues.

I'd rather see more candor and less PR.

On the other hand: "Blaming a third party lacks class. The Reddit guys made the decision to rely heavily on EBS, and it came back to bite them. They show a lot of character by taking responsibility for an outage they had very little control over." What do you think is the best solution to that one?

Well, at this point, perhaps it needed to be said.

Reddit isn't vital to anyone's well being, but it is a service that hundreds of thousands (?) use pretty regularly, so it certainly isn't trivial either.

How is he hurting the situation by calling out a service for what it really is?

Actually, as a former employee, he is in a pretty good position to criticise while keeping that criticism at arms-length from Reddit. Not a bad tactic.

A poor craftsman blames his tools.

Indeed. You should always have a second and third hammer any time you swing, with an automatic failover to complete the job in case the first hammer explodes on contact or spontaneously ceases to exist.

Unless of course you happen to be something akin to a cash-strapped startup, when you don't have those kind of luxuries.

I'm pretty sure he was being sarcastic. Even if he wasn't, that's how I took it, just because it was funnier that way.

I was :) I just hear the "a poor craftsman blames his tools" line often, but I'd bet you'd hear Michelangelo singing a different tune if his chisel hammer broke off at the handle and landed on his toe.

More like a poor craftsman selects the wrong tools for the job.

Amazon claims: "Each storage volume is automatically replicated within the same Availability Zone. This prevents data loss due to failure of any single hardware component".

They make it sound like they are already providing RAID or something similar; however, the fact that things like this happen to Reddit, who have built their own RAID on top of Amazon's already-replicated volumes, shows that reliability is not a good reason to go with AWS.

EBS isn't really RAIDed, it's virtualized block storage with replicas. The issue Reddit experienced wasn't drive failure, though, it was network degradation. The solution is to deploy redundant replicas in different availability zones (and/or regions, if you can). Reddit unfortunately wasn't built for that.

This isn't really any different from an on-premises application. An availability zone by definition implies "shared network hardware". Using multiple is what you do when you want redundancy.

How do you know the issue was network degradation? Is this written up somewhere?

The original Reddit blog post indicates there were latency problems initially. It's not clear what caused the follow-on problems, but the latency may have triggered a bad condition for their replication.

Disk access was the single reason we couldn't go with AWS offerings. The speed/reliability of EBS just isn't where we need it to be for our database servers. I don't blame Amazon for this; it's just a drawback of their choice to go fully shared-tenant and virtual.

EBS storage aside, they are down to 3 guys? yikes

They have been granted a lot more help from Conde Nast. They're in the process of hiring four more developers.


DevOps and all, the ratio is still staggering for the number of hits that reddit gets.

Just one programmer, spladug, on the job for 4 months.

Idle speculation is a bit useless, but it really makes you wonder why so many of their team have left recently, doesn't it? I wonder if the stress of keeping up with the increased load (and subsequent downtime) became too much for some of them.

On that note, I have been meaning to ask HN (even if nothing more than an exercise)...

If you had to run a site like Reddit, what would you do?

Most importantly, I wouldn't let the staffing levels get this low.

At this point in its life, reddit should have 6-12 programmers/system administrators + a few support staff, compared with the 3 they have at the moment.

That way they won't be agonizing over the choice between devoting their resources to keeping reddit running in the short term, or to moving reddit away from EC2 for long term stability.

"Most importantly, I wouldn't let the staffing levels get this low."

How would you pay for that?

Reddit has never been low on money. According to one of their old developers, the staffing issues have always been political, not financial.


Ah yes, the telltale sign of a large bureaucracy: people spending $150 of one kind of money to avoid spending $100 of a different kind of money (in this case, payroll budget versus tech-infrastructure budget).

Well, I'm a sysadmin/hardware guy myself, so obviously, I'd buy my own hardware and run it. Once a month I'd spin up my off-site backups (which would be on EC2) to make sure I had somewhere to go if the shit hit the fan at my co-lo.

Of course, I've spent my life dealing with infrastructure, so running a rack or two (or ten) of servers is going to be a lot cheaper/easier for me than it would be for someone without that experience. The economics for people unwilling to gain that experience will depend entirely on scale; e.g., do they have enough servers that the lower running cost of owning hardware would pay for someone to manage that sort of thing? (And yes, hiring other people has overhead in and of itself.)

Generally, as much as possible, I avoid building complexity. I find that having a single point of failure with a backup that can be manually brought into place (such as an asynchronously replicated database) is quite often more reliable than fancy home-made SAN solutions. In general, you need to be /very careful/ with complex redundant systems. In fact, I approach it somewhat like crypto: as much as possible, I don't build it myself; I use well-used open-source tools with well-known failure modes.

The other thing to think of is failure domains. Sure, think about single points of failure, but more importantly, think about what goes down if that single point fails. For instance, in my current setup, each Xen host is a single point of failure... but one going down won't take down anything else. I've seen other people design similar systems with improvised SAN setups, thinking "oh, if one node dies, I'll boot the guests on another!" The problem is that if that SAN goes down, everyone is toast, while if one of my hosts goes down, we're talking maybe 1/40th of my customers who are out of action.
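The failure-domain arithmetic above can be made concrete. A toy calculation (the host and customer counts are illustrative, chosen to match the "1/40th" figure, not real numbers from anyone's deployment):

```python
# Toy blast-radius comparison: 40 independent hosts with local disk
# versus the same guests sitting behind one shared SAN.

def blast_radius(failed_components, customers_per_component, total_customers):
    """Fraction of customers affected when the given components fail together."""
    return failed_components * customers_per_component / total_customers

hosts = 40
customers = 4000
per_host = customers // hosts  # 100 guests per host

# Local disk: losing one host hits 1/40th of customers.
print(blast_radius(1, per_host, customers))      # 0.025

# Shared SAN: losing the SAN takes every host's storage with it.
print(blast_radius(hosts, per_host, customers))  # 1.0
```

Same single point of failure in both designs; what differs is how much goes down with it.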

I've seen DRBD setups where a guest mirrored itself locally and to a second server... It sounded like a great idea, but the system turned out to be less reliable than my dumb local-storage setup, as weirdness in how DRBD dealt with disk and network issues caused lockups that were much more frequent than the hardware failures that would take down whole nodes in my local-disk setup.

In fact, I have a very strong suspicion, born of hard experience and smoking pagers, of SANs that cost less than mid-sized Bay Area condos. And I'm pretty cheap, so that means local storage for me. There has been a lot of activity in that field lately, so I'm very carefully exploring it again, but I certainly wouldn't count on some homemade SAN being more reliable than local disk.

(to be fair, my experience lies with "expensive SANs, used expensive SANs, and homemade SANs." and my only good experience was with the first of those. I don't have a lot of experience with low-cost commercial SANs.)

Moreover, I'm pretty suspicious of disk-over-the-network schemes, even when the expensive "real" SANs are involved. NFS is the only scheme I really trust; it's older than I am. It has problems, but we know about all of them. Overall, I think NFS handles network blips much better than any of the block-device-over-the-network schemes I've used. And you do see network blips. The network simply isn't as reliable as your SATA cable, and the block subsystem isn't designed to deal with devices that are temporarily unavailable.

I've seen a lot of 'clever' redundant setups built by people who are much smarter than I am... quite often, their setup ends up becoming less reliable than my "dumb" systems.

  > I find that having a single point of failure with a backup
  > that can be manually brought into place (such as an
  > asynchronously replicated database) is quite often more
  > reliable than fancy home-made SAN solutions.
As someone who has invested heavily in implementing HA systems and then seen them be the root of increased downtime, I have come to the same conclusion. A simple fail-over system that results in short downtime and requires only limited manual intervention is in many cases the best route.

Then, something really bad happened. Something which made the earlier outage a comparative walk in the park.

Murphy's law on St. Patrick's Day. Doesn't get any better than that.

I didn't even get a chance to have a Guinness today. :(

At this rate, a nice, tall Guinness is probably just what the doctor ordered.

I always love seeing a good technical post-mortem of what went wrong and how it could be fixed in the future...

I'm currently working on building a backend service that has to scale massively as well, and it has been a fun challenge trying to understand exactly where things can go wrong and how wrong they can go...

Wow...they sound like they are really beating themselves up over it.

I know the community can be demanding, but that just seems stressful.

Great writeup. I'd love to hear other people's experience with regards to workarounds when / if EBS goes down (switching over to RDS for a short time, etc.).

The comment about moving to local storage was interesting. Isn't the local storage on EC2 instances extremely limited (like 10-20GB?)

Assuming "local storage" is synonymous with "instance storage" it's 160GB to 1690GB. http://aws.amazon.com/ec2/instance-types/

So, I assume the instances reddit uses have instance-storage root volumes instead of EBS root volumes. I've always assumed the persistence of EBS AMIs was a plus without a downside. Why would you opt for instance-storage AMIs instead of EBS-root-volume AMIs?

Given that EBS booting only became an option in December 2009, I would not be surprised if Reddit had not migrated their instances to that boot/storage method. They acknowledged that in the last two years they hadn't even had time to move one of their databases from a single EBS volume to striped EBS volumes.

We're currently in the process of replacing every one of our hosts with new OS versions. As we do this we are in fact going to the EBS based instances.

Those instances actually show the same problems, but they aren't too bad, because once you boot them, you don't need the root vol that much (that's what the instance storage is for).

Some Qs:

Q1. I still don't get the use case for db storage on ephemeral storage.

Q2. If EBS is the problem why are you migrating to S3 backed EBS boot vols? The problem with this is still the time in between snapshots even though it will be shortened.

Some Comments: It will only be a matter of time before S3 disks and hardware start dying like EBS...en masse

I talked with Ketralnis several years ago and know how many VMs you were running back then. Pretty sure you're not too far off from that count even today (even if 2x).

You can still virtualize on a good set of dedicated hardware to emulate your current 'network environment' and get up and running in the near term _asap_. Obviously you'd build your way out of that VM environment (with your load) as the days go by. Seriously look into a parallel switchover, though.

If EBS is in fact a huge issue as has been shown, you really may need to start migrating off unless you want dedicated employees monitoring system health on AWS. Eventually if problems continue that is what will happen, with no time left to even develop automation... And why automate on a pile of instability?

Don't forget that adding more VMs with this high a failure rate increases soft management costs and will eventually eat into your development time...

I don't work for Rackspace (I think they're quite expensive), but you guys might benefit from this level of care to focus on the real issues.

> Q1. I still don't get the use case for db storage on ephemeral storage.

We're still not sure either, so we're investigating to see if it makes sense. One possible option will be to have the master on ephemeral disk with a hot backup on EBS so there is no data loss.

Another option is to use ephemeral for the master and all but one slave, so we get hot backups without a slowdown.

Still need to look into it more.

The one we are doing ephemeral on right now is Cassandra, with continuous snapshots to EBS. Everything in there can be recalculated, and with an RF of 3, if we lose one node we can run a repair.

> Q2. If EBS is the problem why are you migrating to S3 backed EBS boot vols? The problem with this is still the time in between snapshots even though it will be shortened.

They are just easier to use. The root volume is rarely accessed after it is booted, so the EBS slowdowns aren't really a problem in that case.

> Some Comments: It will only be a matter of time before S3 disks and hardware start dying like EBS...en masse

I don't think so. It is a totally different product built by a totally different team with a different philosophy. S3 was built for durability above all else.

In response to the rest of your comments, you are absolutely right, there are other options. We will certainly be investigating them.

I meant to say several months ago, not years.

Thanks for the follow-up jedberg, I was just guessing based on what has been publicly stated in the various blog posts over the last year or two. I used the same process for my own S3 -> EBS boot volume migration. That took a few weeks, and I didn't have that many instances to migrate in the first place. Given the large number of instances reddit uses and the surprisingly small staff, one would reasonably expect that the migration would not be done.

Thanks for the extra info! I'm doing a lot of work now with python on EC2 and the reddit write ups + presentations have been a huge help. Thanks again.

> python on EC2

#1 tip: Don't use threading. Python threading + EC2 will not work well. Instead, rely on the OS to do the task switching and run multiple copies.

If you want more info, I did a talk at Pycon about this and other things: redd.it/b5jyy
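The "run multiple copies" advice above sidesteps the GIL, which serializes CPU-bound Python threads onto one core. One way to sketch the idea with the standard library is multiprocessing, which runs separate interpreter processes that the OS schedules across cores (a minimal illustration, not reddit's actual setup, which runs whole separate app copies):

```python
from multiprocessing import Pool

def crunch(n):
    # CPU-bound work; with threads, the GIL would serialize this,
    # but separate processes let the OS spread it across cores.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Four worker processes, four independent chunks of work.
    with Pool(processes=4) as pool:
        results = pool.map(crunch, [100000] * 4)
    print(len(results))  # 4
```

The same shape applies at the service level: N identical worker processes behind a load balancer instead of N threads in one process.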

Duly noted. I started with this talk and have been using it as a guide to scaling edge cases with Python as well as AWS. I thought RAID 10 was overkill before I started digging into the Postgres/EBS mess, but now it seems almost routine enough that Amazon should have it as a configuration option.

Did you get stuck in the Fedora 8 trap as well? It was the 'starter' in 2008/2009 and it took us two years to get off it.

Ubuntu 8.10 for us.

That makes more sense, thanks.

I had two machines running in east-1 last night and one of them went down around the same time reddit did. The other one made it through the night O.K.

EBS problems do seem to be the biggest reliability problem in EC2 right now. The most common symptom is that a machine goes to 100% CPU use and 'locks up'. Stopping the instance and restarting usually solves the problem.

The events also appear to be clustered in time. I've had instances go for a month with no problems, then it happens 6 times in the next 24 hours.

My sites are small, but one of them runs VERY big batch jobs periodically that take up a lot of RAM and CPU. Being able to rent a very powerful machine for a short time to get the batch job done without messing up the site is a big plus.

This is why you don't outsource your bread and butter, people!

If you want to outsource who makes your lunch, fine, but if your whole business is requests in, data out, you do not put the responsibility of storing your data in someone else's hands.

I get it, Amazon EBS is cheap. But at the end of the day you've got to make sure it's your fingers on the pulse of those servers, not someone else's, whose priorities and vigilance may not always line up with yours.

(also the cloud is dumb)

It's all a continuum. You could also build your own servers, or design specialized boards with dedicated processors optimized to your application. Everyone is going to choose a point on the continuum and each will have tradeoffs.

You're still outsourcing if you go with a managed dedicated hosting service, or even if you buy hardware and colocate it. Even if you owned the datacenter and the entire backbone, you're still banking on everyone else not fucking up their end of the connection.

Yeah, but at least you can take direct action when your people fuck up.

> We could make some speculation about the disks possibly losing writes when Postgres flushed commits to disk, but we have no proof to determine what happened.

If you read between the lines, this says that EBS lies about the result of fsync(), which is horrifying.

Most consumer /hard drives/ lie about the result of fsync, as a 'performance optimization'.

It's generally possible to fsync, then cut the power before the data is physically on the disk.

Yeah, but even I don't use consumer hard drives in production. (Honestly, I don't know 100% that the 'enterprise' drives are that much better, but I'd guess they lie less... I switched because consumer drives tend to hang RAIDs when they fail, while 'enterprise' stuff fails clean.)

There's a section in the Postgres docs about fsync and lying hard drives: http://developer.postgresql.org/pgdocs/postgres/wal-reliabil...

There's also a Postgres contrib module called pg_test_fsync that tests various fsync modes on your hard drive; if the results are suspiciously fast, you can strongly suspect that the disk is lying.
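The basic check pg_test_fsync performs can be approximated in a few lines: time a series of write-plus-fsync calls and compare the average against what the physical medium could plausibly deliver. A rough sketch, not a replacement for the real tool:

```python
import os
import time

def fsync_latency(path, iterations=50):
    """Average seconds per write+fsync on the given path."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        start = time.time()
        for _ in range(iterations):
            os.write(fd, b"x" * 512)
            os.fsync(fd)
        return (time.time() - start) / iterations
    finally:
        os.close(fd)
        os.unlink(path)

# A 7200 rpm spindle takes ~8 ms per rotation, so average fsync times far
# below a few milliseconds suggest a write cache somewhere is acknowledging
# the flush before the data is actually durable.
print(fsync_latency("/tmp/fsync_probe"))
```

On a battery-backed RAID controller or an SSD, very fast fsyncs are legitimate; on a bare rotating disk, they are a red flag.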

Enterprise hardware lies about fsync, because even if you lose power, there is a little battery on the RAID controller that's enough to flush the cache to disk or to keep it for hours until the machine is powered up again. When the battery goes bad, the write cache is disabled automatically.

On bigger hardware, like SAN storage arrays, the (redundant) batteries keep the whole thing running for a while after the loss of power.

You could consider that the whole battery backup etc. means that it isn't lying about fsync. It says that the data has been permanently written, and then it makes sure that it has. It might not have been burnt to spinning metal, but the system as a whole will ensure that it's permanent.

"Enterprise" disk is just that; it's just disk. I've not seen a disk with an onboard battery. You can put a RAID controller in front of your disks, with or without battery-backed cache (most RAID cards have an optional battery module). But even many gold-plated 'corporate' systems I've seen omit the battery, as it's usually not cheap.

My personal opinion is that RAID controllers are of little value without battery-backed cache.

I just use MD in front of 'enterprise SATA' drives. As far as I can tell, I can't get a RAID card better than MD without doubling my total cost for storage, and if I were going to double my storage cost, simply buying twice as many spindles would get me better bang for my buck than a hardware RAID card at that price.

Enterprise hard drives might lie about fsync with caching turned on, but they generally let you disable caching and behave correctly.

Of course that completely shafts your performance, but hopefully you were expecting that.

A comment from an ex reddit employee linked elsewhere in this HN discussion (http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_d...) seems to confirm that:

More recently we also discovered that these disks [EBS volumes experiencing performance degradation] will also frequently report that a disk transaction has been committed to hardware but are flat-out lying.

...I was wondering if I was going to be the only one that caught that.

There's not enough information to know if that's even possible or not, or how it might be possible, but as far as I know there aren't any known postgres issues that would cause that.

I would be extremely reluctant to store data on a network block device that sometimes doesn't actually store the data, without raising an error anywhere.

You can actually configure Postgres not to fsync() after each transaction. Instead the data gets written to disk, which on Linux means written to the disk caches in RAM; Linux then flushes it along with the rest of the disk writes (the default interval is anywhere from 5 to 30 seconds on different versions of Linux).
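For reference, the Postgres settings in play here; the setting names are from the Postgres documentation, but the values below are an illustrative sketch, not reddit's configuration. `fsync = off` skips flushes entirely and risks corruption after a crash, while `synchronous_commit = off` keeps the WAL flushes but lets commits return before they complete, trading a bounded window of lost recent transactions for speed:

```
# postgresql.conf (sketch)
fsync = on                   # turning this off risks data corruption after a crash
synchronous_commit = off     # commits return before the WAL flush; a crash can
                             # lose the last few transactions, but not corrupt data
wal_sync_method = fdatasync  # which syscall is used when the WAL is flushed
```

If durability-versus-speed is the trade being made, `synchronous_commit` is the knob to reach for, not `fsync`.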

Why would you want to do that? This isn't MongoDB, the real world cares about durability.

I am not sure why I was downvoted; I wasn't suggesting it as a course of action... it gives a large speed increase, and I was suggesting that this was possibly done as a performance measure.

If you read between the lines, this says that EBS lies about the result of fsync(), which is horrifying.

Jump to conclusions much?

Do you know if reddit had fsync enabled in first place?

If you read between the lines, "if you read between the lines" almost always means "if you make an educated jump to conclusions."

So where is the education in jumping from "we have no proof to determine what happened" to an outrageous claim about EBS?

It's very common to turn off fsync on a database for performance reasons. It's far less common to have a network block device driver (which tend to be designed with intermittent outages in mind) lie about fsync.
