
Why Reddit was down for 6 hours - meghan
http://blog.reddit.com/2011/03/why-reddit-was-down-for-6-of-last-24.html
======
jamwt
I know it's not exactly in vogue these days to tout the merits of bare
hardware, but.. after all the VPS hubbub over the last couple of years, the
best progression for your website still seems to be:

1. No traction? Just put it anywhere, 'cause frankly, it doesn't matter.
Cheapest reputable VPS possible. Let's say, Linode.

2. Scaling out, high concurrency and rapid growth? DEDICATED hardware from a
QUALITY service provider--use rackspace, softlayer et al. Have them rack the
servers for you and you'll still get ~3 hour turnarounds on new server orders.
That's _plenty_ fast for most kinds of growth. No inventory to deal with, and
with deployment automation you're really not doing much "sysadmin-y" work or
requiring full timers that know what Cisco switch to buy.

3. Technology megacorp, top-100 site? Staff up on hardcore net admin and
sysadmin types, colocate first, and eventually, take control of/design the
entire datacenter.

I simply don't understand why so many of these high-traffic services continue
to rely on VPSes for phase 2 instead of managed or unmanaged dedicated
hosting. The price/concurrent user is competitive or cheaper for bare metal.
Most critically, it's insanely hard to predictably scale out database systems
with high write loads when you have unpredictable virtualized (or even
networked) I/O performance on your nodes.
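
A rough way to see that variance is to time repeated write-plus-fsync round
trips on a node. A minimal Python sketch (file name and sample counts are
arbitrary, purely illustrative):

    import os
    import time

    def sample_fsync_latencies(path="latency_probe.dat", rounds=200, block=4096):
        """Time write+fsync round trips; a wide spread means noisy I/O."""
        payload = b"\0" * block
        latencies = []
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        try:
            for _ in range(rounds):
                start = time.time()
                os.write(fd, payload)
                os.fsync(fd)  # push the write down to the (possibly networked) device
                latencies.append(time.time() - start)
        finally:
            os.close(fd)
            os.unlink(path)
        latencies.sort()
        return latencies[0], latencies[len(latencies) // 2], latencies[-1]

    # min / median / max in seconds: on bare metal these cluster tightly, while
    # on virtualized or networked block storage the max can sit orders of
    # magnitude above the median.
    print(sample_fsync_latencies())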

~~~
jedberg
reddit actually is a top 100 site, but we don't have nearly the need to host
our own datacenter or co-locate. If we do make a move, it will be to #2. I
don't want to hire people to be hands on -- I'd rather outsource that and let
someone else pay to have spare capacity lying around.

~~~
lsc
what kind of scale are you at? I mean, about how many 32GiB ram/8 core servers
would you need if you were using real hardware?

~~~
jedberg
We have ~130 servers at Amazon right now. We could probably do it with 50-75
or less, depending on how big the boxes are.

------
naner
[http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_d...](http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_down_for_6_of_the_last_24_hours/c1l6ykx)

A former employee is not quite as nice to Amazon.

~~~
brianwillis
Assuming for a minute that Amazon deserves as much blame as ketralnis is
heaping on here, why would the Reddit guys be so reluctant to point this out?
Professionalism? Kindheartedness? Even professionalism and mutual respect have
limits.

The community loves both the site and the admins, but there are limits to the
patience of users, and those limits are being tested by these outages. I would
think the Reddit guys would be happy to have a scapegoat to direct the
community's rage towards.

~~~
naner
You shouldn't bite the hand that... hosts you.

And it is pretty unprofessional. Ketralnis thinks he's defending his buddies
but he isn't really helping the situation.

~~~
nettdata
I'm not sure that he really cares if he's helping the situation or not, seeing
as he's no longer employed there.

I know first-hand that when a team is working on things, and it's very
publicly going wrong, and it's due to things that are out of your control,
it's beyond frustrating to have everyone think that it IS your fault due to
the public stance.

I'm guessing that the Reddit team that is dealing with this is more than a bit
pissed that they can't be more public with what the real reasons are.

Personally, I couldn't care less about the "professionalism" or political BS;
I'd rather know the real reasons for the problems so that I could be better
informed and not run into the same issues.

I'd rather see more candor and less PR.

~~~
yuhong
On the other hand: "Blaming a third party lacks class. The Reddit guys made
the decision to rely heavily on EBS, and it came back to bite them. They show
a lot of character by taking responsibility for an outage they had very little
control over." What do you think is the best solution to that one?

------
A1kmm
Amazon claims: "Each storage volume is automatically replicated within the
same Availability Zone. This prevents data loss due to failure of any single
hardware component".

They make it sound like they are already providing RAID or something similar;
however, the fact that things like this happen to Reddit, who have built their
own RAID on top of Amazon's already replicated volumes, shows that reliability
is not a good reason to go with AWS.

~~~
parasubvert
EBS isn't really RAIDed, it's virtualized block storage with replicas. The
issue Reddit experienced wasn't drive failure, though, it was network
degradation. The solution is to deploy redundant replicas in different
availability zones (and/or regions, if you can). Reddit unfortunately wasn't
built for that.

This isn't really any different from an on-premises application. An
availability zone by definition implies "shared network hardware". Using
multiple is what you do when you want redundancy.
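
For what it's worth, spreading replicas across zones is just a launch
parameter. A minimal sketch with the boto library (the AMI id, region, and
zone names are placeholders, not anything Reddit actually uses):

    import boto.ec2

    conn = boto.ec2.connect_to_region("us-east-1")

    # Placeholder AMI; pin one replica to each availability zone so a single
    # zone's network degradation doesn't take out every copy.
    for zone in ("us-east-1a", "us-east-1b"):
        conn.run_instances(
            "ami-00000000",
            instance_type="m1.large",
            placement=zone,
        )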

~~~
snewman
How do you know the issue was network degradation? Is this written up
somewhere?

~~~
parasubvert
The original Reddit blog post indicates there were latency problems initially.
It's not clear what caused the follow-on problems, but the latency may have
triggered a bad condition for their replication.

------
kowsik
EBS storage aside, they are down to 3 guys? _yikes_

~~~
mttwrnr
They have been granted a lot more help from Conde Nast. They're in the process
of hiring four more developers.

[http://blog.reddit.com/2011/03/so-long-and-thanks-for-all-po...](http://blog.reddit.com/2011/03/so-long-and-thanks-for-all-postcards.html)

~~~
kowsik
devops and all, the ratio's still staggering for the number of hits that
reddit gets.

------
bryanh
On that note, I have been meaning to ask HN (even if nothing more than an
exercise)...

If you had to run a site like Reddit, what would you do?

~~~
phire
Most importantly, I wouldn't let the staffing levels get this low.

At this point in its life, reddit should have 6-12 programmers/system
administrators + a few support staff, compared with the 3 they have at the
moment.

That way they won't be agonizing over the choice between devoting their
resources to keeping reddit running in the short term, or to moving reddit
away from EC2 for long term stability.

~~~
stingraycharles
"Most importantly, I wouldn't let the staffing levels get this low."

How would you pay for that?

~~~
phire
Reddit has never been low on money. According to one of their old developers,
the staffing issues have always been political, not financial.

[http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_d...](http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_down_for_6_of_the_last_24_hours/c1l7ajp?context=1)

~~~
_delirium
Ah yes, the telltale sign of a large bureaucracy: people spending $150 of one
kind of money to avoid spending $100 of a different kind of money (in this
case, payroll budget versus tech-infrastructure budget).

------
duck
_Then, something really bad happened. Something which made the earlier outage
a comparative walk in the park._

Murphy's law on St. Patrick's Day. Doesn't get any better than that.

~~~
jedberg
I didn't even get a chance to have a Guinness today. :(

~~~
bryanh
At this rate, a nice, tall Guinness is probably just what the doctor ordered.

------
X-Istence
I always love seeing a good technical post-mortem of what went wrong and how
it could be fixed in the future...

I'm currently working on building a backend service that has to scale
massively as well, and it has been a fun challenge trying to understand
exactly where things can go wrong and how wrong they can go...

------
marcamillion
Wow...they sound like they are really beating themselves up over it.

I know the community can be demanding, but that just seems stressful.

------
rgrieselhuber
Great writeup. I'd love to hear other people's experiences with workarounds
when/if EBS goes down (switching over to RDS for a short time, etc.).

The comment about moving to local storage was interesting. Isn't the local
storage on EC2 instances extremely limited (like 10-20GB)?

~~~
ktsmith
Assuming "local storage" is synonymous with "instance storage" it's 160GB to
1690GB. <http://aws.amazon.com/ec2/instance-types/>

~~~
gregburek
So, I assume the instances reddit uses have instance-storage root volumes
instead of EBS root vols. I've always assumed the persistence of EBS AMIs was
a plus without a downside. Why would you opt for instance-storage AMIs instead
of EBS root volume EC2 AMIs?

~~~
ktsmith
Given that EBS booting only became an option in December 2009, I would not be
surprised if Reddit had not migrated their instances to that boot/storage
method. They acknowledged that in the last two years they hadn't even had time
to move one of their databases from a single EBS volume to striped EBS
volumes.

~~~
jedberg
We're currently in the process of replacing every one of our hosts with new
OS versions. As we do this, we are in fact going to the EBS-based instances.

Those instances actually show the same problems, but they aren't too bad,
because once you boot them, you don't need the root vol that much (that's what
the instance storage is for).

~~~
jjm
Some Qs:

Q1. I still don't get the use case for db storage on ephemeral storage.

Q2. If EBS is the problem why are you migrating to S3 backed EBS boot vols?
The problem with this is still the time in between snapshots even though it
will be shortened.

Some Comments: It will only be a matter of time before S3 disks and hardware
start dying like EBS...en masse

I talked with Ketralnis several years ago and know how many VMs you were
running back then. Pretty sure you're not too far off from that count even
today (even if 2x).

You can still virtualize on a good set of dedicated hardware to emulate your
current 'network environment' and get you up and running in the near term
_asap_. Obviously you'd build out of that VM environment (with your load) as
the days go by. Seriously look into a parallel switch-over, though.

If EBS is in fact a huge issue as has been shown, you really may need to start
migrating off unless you want dedicated employees monitoring system health on
AWS. Eventually if problems continue that is what will happen, with no time
left to even develop automation... And why automate on a pile of instability?

Don't forget that every VM you add at this failure rate increases soft
management costs and will eventually eat into your development time...

I don't work for Rackspace (I think they're quite expensive), but you guys
might benefit from this level of care to focus on the real issues.

~~~
jedberg
> Q1. I still don't get the use case for db storage on ephemeral storage.

We're still not sure either, so we're investigating to see if it makes sense.
One possible option will be to have the master on ephemeral disk with a hot
backup on EBS so there is no data loss.

Another option is to use ephemeral for the master and all but one slave, so
we get hot backups without a slowdown.

Still need to look into it more.

The one place we are doing ephemeral right now is Cassandra, with continuous
snapshots to EBS. Everything in there can be recalculated, and with an RF of
3, if we lose one node we can run a repair.
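
(For anyone wondering how "RF of 3, lose one node, run a repair" adds up, it's
just quorum math. A tiny illustrative sketch, nothing reddit-specific:)

    def quorum(replication_factor):
        # Smallest majority of replicas.
        return replication_factor // 2 + 1

    rf = 3
    print(quorum(rf))       # 2: quorum reads/writes need two live replicas
    print(rf - quorum(rf))  # 1: one node can be down and quorum still succeeds;
                            # a repair rebuilds the missing replica afterwards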

> Q2. If EBS is the problem why are you migrating to S3 backed EBS boot vols?
> The problem with this is still the time in between snapshots even though it
> will be shortened.

They are just easier to use. The root volume is rarely accessed after it is
booted, so the EBS slowdowns aren't really a problem in that case.

> Some Comments: It will only be a matter of time before S3 disks and hardware
> start dying like EBS...en masse

I don't think so. It is a totally different product built by a totally
different team with a different philosophy. S3 was built for durability above
all else.

In response to the rest of your comments, you are absolutely right, there are
other options. We will certainly be investigating them.

------
PaulHoule
I had two machines running in east-1 last night and one of them went down
around the same time reddit did. The other one made it through the night O.K.

EBS problems do seem to be the biggest reliability problem in EC2 right now.
The most common symptom is that a machine goes to 100% CPU use and 'locks up'.
Stopping the instance and restarting usually solves the problem.
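
If you end up scripting that recovery, it's a stop then a start rather than a
reboot, since stop/start usually lands the instance on different underlying
hardware. A rough boto sketch, assuming an EBS-backed instance and using a
placeholder instance id:

    import time
    import boto.ec2

    conn = boto.ec2.connect_to_region("us-east-1")
    instance_id = "i-00000000"  # placeholder

    # Stop-then-start (only possible for EBS-backed instances) usually clears
    # the 100% CPU lock-up by moving the instance to fresh hardware.
    conn.stop_instances(instance_ids=[instance_id])
    while conn.get_all_instances([instance_id])[0].instances[0].state != "stopped":
        time.sleep(10)
    conn.start_instances(instance_ids=[instance_id])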

The events also appear to be clustered in time. I've had instances go for a
month with no problems, then it happens 6 times in the next 24 hours.

My sites are small, but one of them runs VERY big batch jobs periodically that
take up a lot of RAM and CPU. Being able to rent a very powerful machine for a
short time to get the batch job done without messing up the site is a big
plus.

------
jwcacces
This is why you don't outsource your bread and butter, people!

If you want to outsource who makes your lunch, fine, but if your whole
business is requests in, data out, you do not put the responsibility of
storing your data in someone else's hands.

I get it, Amazon EBS is cheap. But at the end of the day you've got to make
sure it's your fingers on the pulse of those servers, not someone else whose
priorities and vigilance may not always line up with yours.

(also the cloud is dumb)

~~~
nkohari
You're still outsourcing if you go with a managed dedicated hosting service,
or even if you buy hardware and colocate it. Even if you owned the datacenter
and the entire backbone, you're still banking on everyone else not fucking up
their end of the connection.

~~~
jwcacces
Yeah, but at least you can take direct action when your people fuck up.

------
tedjdziuba
> We could make some speculation about the disks possibly losing writes when
> Postgres flushed commits to disk, but we have no proof to determine what
> happened.

If you read between the lines, this says that EBS lies about the result of
fsync(), which is horrifying.
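
To make the stakes concrete: a durable commit is only as good as the storage
honoring fsync(). A minimal Python sketch of the write-then-fsync pattern a
database's commit path depends on (illustrative only, not Postgres internals):

    import os

    def durable_append(path, record):
        """Append a record and only return once it is supposedly on stable storage."""
        fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        try:
            os.write(fd, record)
            # The caller treats the transaction as committed once this returns.
            # If the storage layer acknowledges the flush without actually
            # persisting the data, a crash silently loses "committed" writes.
            os.fsync(fd)
        finally:
            os.close(fd)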

~~~
jrmg
Most consumer /hard drives/ lie about the result of fsync, as a 'performance
optimization'.

It's generally possible to fsync, then cut the power before the data is
physically on the disk.

~~~
lsc
Yeah, but even _I_ don't use consumer hard drives in production. (Honestly, I
don't know 100% that the 'enterprise' drives are that much better, but I'd
guess they'd lie less... I switched because consumer drives tend to hang RAIDs
when they fail, while 'enterprise' stuff fails clean.)

~~~
CrLf
Enterprise hardware lies about fsync, because even if you lose power, there
is a little battery on the RAID controller that's enough to flush the cache to
disk or to keep it for hours until the machine is powered up again. When the
battery goes bad, the write cache is disabled automatically.

On bigger hardware, like SAN storage arrays, the (redundant) batteries keep
the whole thing running for a while after the loss of power.

~~~
omh
You could consider that the whole battery backup etc. means that it _isn't_
lying about fsync. It says that it's been permanently written, and then it
makes sure that it has. It might not have been burnt to spinning metal, but
the system as a whole will ensure that it's permanent.

