
Jedberg on Reddit's most recent outage - zacharycohn
http://www.reddit.com/r/announcements/comments/gva4t/on_reddits_outage/
======
cookiecaper
Reddit's software is not very impressive. They should really focus on
consolidating to a more manageable setup. I'm aware that big sites often have
complex environments to facilitate scaling, but frankly, reddit is a big
monstrous mess. It'd be a lot simpler to move hosts if they removed some of
those extraneous dependencies and put some simplicity and _sanity_ into their
data storage, instead of the half-Cassandra/half-PgSQL thing they do now.

I wrote a long post about it here:
[http://www.deserettechnology.com/journal/reddit-the-open-sou...](http://www.deserettechnology.com/journal/reddit-the-open-source-software).

------
cygwin98
Actually, there was a heated discussion on this very issue a month ago on
Reddit [1], when AWS started acting up. Someone asked why Reddit couldn't buy
beefy servers and co-lo them. The argument was that given the traffic Reddit
handles, doing it yourself would be technically very difficult, if not
impossible. Looking back, co-lo still seems like the way to go, especially for
big sites like Reddit. Anyway, if Stack Overflow can make it work, why can't
Reddit? The two have comparable traffic.

[1]
[http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_d...](http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_down_for_6_of_the_last_24_hours/c1l76ok)

~~~
kbatten
Whether you co-lo or run in the cloud, it seems to me that you would still
want the same types of redundancy/failover processes in place. The advantage
of EC2 is that you can easily set up servers in multiple places across the
globe (or at least across the country, if you just want to host in the US).

~~~
jshen
And there are many downsides to EC2: greater dependence on a third party,
shitty disk I/O, instances going down more frequently than real servers,
inflexibility with hardware (no SSDs), etc., etc.

There are co-lo providers that have data centers in different places.

------
Andys
I've discovered through a bunch of similar lessons that buying really good
hardware and hosting yourself sometimes requires the least amount of your time
of any option. It provides a stable base on which to build additional
redundancy such as shared network storage or database replication if you wish
(and this is generally how expensive "enterprise" solutions work).

In practice and on cheap hardware, networked storage is flaky and has umpteen
failure modes. Database replication is even worse. Both require babysitting by
developers or sysadmins and hours of repair when things go wrong. What is the
point of outsourcing hardware and scaling to EC2 if you end up with even more
work monitoring and fixing the infrastructure you build on top?

------
benologist
It'll be interesting to see what hosting offers they get .... there are a lot
of benefits to being the hosting company that keeps reddit online, if they
manage to.

~~~
cagenut
It's actually very easy to beat the price/performance of EC2 using real
hardware. The draw of EC2/AWS is in the "everything is a monthly charge and an
API call away" operational instant gratification. If you need to auto-scale by
the hour, there are few other places you can do it. If you can generally
capacity-manage your setup a month or more into the future, then a 1-week
latency on RunInstances isn't actually a problem.

------
c2
Does anyone else think that Reddit's usage of EBS might be the culprit?

Looking at their past outage response:

[http://blog.reddit.com/2010/01/why-did-we-take-reddit-down-f...](http://blog.reddit.com/2010/01/why-did-we-take-reddit-down-for-71.html)

Money quote: "In response, we started upgrading some of our databases to use a
software RAID of EBS disks, which gives drastically increased performance (at
a higher cost of course)."

RAIDing EBS disks seems like a really, really BAD idea. There is a non-trivial
failure rate for any single EBS disk, and if you RAID them together, the
failure rate of the array consequently increases. Am I understanding that
correctly?

If they fix that, could that be a 'silver bullet' to fix these outages?
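A quick back-of-envelope check of that claim. This is a hedged sketch: it assumes a plain RAID 0 stripe (where losing any one volume loses the array) with independent volume failures, and the per-volume failure probability is an illustrative number, not AWS data.

```python
# P(array fails) for a RAID 0 stripe of n volumes: the array is lost if
# ANY single volume fails. p_volume is an assumed per-volume failure
# probability over some time window, purely for illustration.
p_volume = 0.001

for n in (1, 4, 8):
    # At least one of n independent volumes fails.
    p_array = 1 - (1 - p_volume) ** n
    print(n, p_array)
```

An 8-volume stripe is roughly 8x as likely to fail as a single volume, so the performance win comes at a reliability cost unless you layer redundancy (e.g. RAID 10) on top.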

~~~
magicofpi
They certainly think there's a problem with EBS: "Since that last failure, we
have been doing everything we can to move ourselves off of the EBS product.
We're about half way there. All of our Cassandra nodes are now using only
local disk, and we hope to have all of postgres on local disk soon."

------
Natsu
I just love how someone sent them a bacon pizza during the outage:
<http://imgur.com/dunO2>

------
iamjustlooking
Jedberg last year stating that a startup buying physical machines today is
foolish:
[http://www.reddit.com/r/IAmA/comments/a2zte/i_run_reddits_se...](http://www.reddit.com/r/IAmA/comments/a2zte/i_run_reddits_servers_and_do_a_bunch_of_other/c0fm2we)

~~~
tesseract
Reddit is not a startup anymore.

------
alex1
> All of our Cassandra nodes are now using only local disk, and we hope to
> have all of postgres on local disk soon.

Since local disk is temporary and can be lost any second, don't they still
have to use EBS for persistence?

~~~
dialtone
They probably use replication to a different region/zone and implement their
own snapshot/backup mechanism on those replica machines, so that they can use
them to re-create instances when they lose them. PostgreSQL also ships WAL
logs, which you can store in S3 and use to bring up an instance in another
area starting from a known backup point.

EBS is unfortunately a pretty bad product in terms of reliability and
consistency of performance; it's better to design your system without it.
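The log-shipping part can be sketched as a config fragment. This is a hedged sketch: the bucket name is hypothetical, and `%p`/`%f` are PostgreSQL's built-in placeholders for the WAL segment's path and file name.

```
# postgresql.conf -- hedged sketch; the S3 bucket is hypothetical
wal_level = archive
archive_mode = on
archive_command = 'aws s3 cp %p s3://example-wal-archive/%f'
```

With segments archived to S3, a replacement instance can be restored from a base backup and replayed forward through the archived WAL.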

~~~
jjm
Rebuilding by replaying logs out of S3 can be very slow. It should be seen as
a last-ditch effort, not a hot-swap or failover solution (depending on data
size, obviously).

If you can't use any persistent storage, then these machines become pure
processing nodes, in which case I feel it would be better not to design your
system around EC2 at all. :-(

To me, having EBS is what made EC2 a very powerful solution compared with its
competitors. Without durable, consistent EBS there's no point of
differentiation, and it serves purely as non-functional fluff we end up paying
extra for.

~~~
smhinsey
If I were them, I'd look at an approach that was based on either SimpleDB or
S3. I've been working on a couple of prototypes of systems similar to theirs
for my own stuff at work, and I've been toying with a system that uses S3 to
store what I'm calling "absolute fallback" sources of data for conversations
(basically, JSON documents) and SimpleDB as a front-line store.

My general take on this issue is that if you're running your app on EC2 and
your persistence medium is something that's also on EC2, you really have no
ideal high availability scenario. Of course, even in my case, if SimpleDB and
S3 go down, I'm still in trouble, but at least I have the option of throwing
Akamai in front of it.
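The two-tier idea above can be sketched in a few lines. This is a hedged sketch: plain dicts stand in for SimpleDB and S3, and the names are hypothetical, not from any real system.

```python
import json

frontline = {}   # fast, queryable front-line store (SimpleDB in the comment)
fallback = {}    # durable "absolute fallback" JSON documents (S3 in the comment)

def save_conversation(conv_id, conversation):
    # Write to the front-line store AND persist a complete JSON document
    # that is sufficient on its own to rebuild the conversation.
    frontline[conv_id] = conversation
    fallback[conv_id] = json.dumps(conversation)

def recover_from_fallback(conv_id):
    # If the front-line store is lost, rebuild purely from the fallback blob.
    return json.loads(fallback[conv_id])

save_conversation("c1", {"title": "outage", "comments": ["ouch"]})
del frontline["c1"]                  # simulate losing the front-line store
print(recover_from_fallback("c1"))   # the full conversation survives
```

The design choice is that the fallback documents are self-contained, so losing the queryable store degrades features (search, listing) without losing data.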

------
turbojerry
Has anyone else noticed that the Amazon status page says there are still
problems today, but the status history shows no problems yesterday, only on
the 21st?

<http://status.aws.amazon.com/?a>

~~~
turbojerry
They have corrected it now.

------
jjm
At any time an instance may become unresponsive, and so will its local
storage, no?

I don't see how this helps.

~~~
joevandyk
PostgreSQL replication to other instances?

I believe the point is that instance storage is more reliable than EBS.

~~~
jjm
Why would you want to take a chance on that? Every instance should be thought
of as expendable. They can go down at any time.

EBS is really just like any other NAS/iSCSI volume, and it wouldn't be such a
problem if it did what it's supposed to: be consistent in reads, writes, and
durability.

~~~
joevandyk
What do you mean by "take a chance on that"? If the instance goes down, you
can failover quickly to another one that you've been replicating to.

~~~
jjm
If you accept that any of your instances can go down at any time, you have to
accept that all of them can be down at the same time.

You might think you're safe with reserved instances, too, thinking you've
reserved dedicated time with EC2. Well, what happens when the entire network
stack goes down, or the block storage that the reserved instances' rack
depends on goes down? So do all 10 or 20 reserved instances you had up, at
once.

Also, it was this very replication/snapshot-mirroring feature of EBS that
cascaded into network congestion.

Check [http://joyeur.com/2011/04/22/on-cascading-failures-and-amazo...](http://joyeur.com/2011/04/22/on-cascading-failures-and-amazons-elastic-block-store/)
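The correlated-failure point above is worth a back-of-envelope illustration. This is a hedged sketch: the failure probability is an assumed number, not measured EC2 data.

```python
# Under an independence assumption, "all instances down at once" looks
# astronomically unlikely -- but a single shared dependency (a rack's block
# storage, the network stack) collapses it back to the failure probability
# of that one dependency.
p = 0.01   # assumed chance any one component fails in some window
n = 10     # reserved instances

p_all_down_if_independent = p ** n   # 1e-20: "never happens" on paper
p_all_down_shared_dependency = p     # one shared fault takes out all n at once

print(p_all_down_if_independent)
print(p_all_down_shared_dependency)
```

The gap between those two numbers is the whole argument: redundancy only buys you what the failures' independence actually delivers.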

