
Netflix: Lessons We’ve Learned Using AWS - jeffmiller
http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
======
c2
Sounds like the AWS architecture caused Netflix to write better code (read:
more durable, more fault tolerant). Fewer assumptions baked into the code, and
it will be easier to port it to a new data center/cloud architecture if AWS
doesn't meet their needs.

As Netflix continues to scale, these changes will make managing that growth
much easier.

A lot of you seem to take this post as negative toward the AWS architecture. I
take it more as a good collection of common things you need to watch out for
in distributed environments, specifically the dangers of assumptions within
your current infrastructure which may change dramatically as you scale.

------
jemfinch
Their "Chaos Monkey" approach reminds me of an excellent paper on "Crash Only
Software": <http://goo.gl/dqDII>

The best way to test the uncommon case is to make it more common.
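
To make the idea concrete, here is a toy sketch in Python of what a Chaos
Monkey-style killer might look like (the names `chaos_monkey`, `fleet`, and
the instance-id format are mine for illustration, not Netflix's actual
implementation):

```python
import random

def chaos_monkey(instances, kill_fraction=0.1, rng=None):
    """Randomly 'terminate' a fraction of running instances.

    By running this continuously, instance failure stops being the
    uncommon case and becomes something the system handles every day.
    """
    rng = rng or random.Random()
    victims = {i for i in instances if rng.random() < kill_fraction}
    survivors = [i for i in instances if i not in victims]
    return survivors, sorted(victims)

# A fixed seed makes a given rampage reproducible for demonstration:
fleet = [f"i-{n:04d}" for n in range(10)]
survivors, killed = chaos_monkey(fleet, kill_fraction=0.3,
                                 rng=random.Random(42))
```

The real system kills actual cloud instances rather than list entries, of
course; the point is only that the failure path gets exercised constantly.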

~~~
aw3c2
Please do not use URL shorteners when they are not needed. They are
obfuscating.

Actual URL is
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90....](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.9585&rep=rep1&type=pdf)
and that is a direct link to a PDF.

~~~
jemfinch
When pg implements proper hyperlinking that doesn't clutter up my text with an
ugly URL, I'll stop using a shortener. I understand the positives and
negatives of using a shortener, but the positive impact on readability seems
to outweigh the negatives when I'm confronted with an ugly link longer than
the text I use to introduce it. Sorry if that offends you.

I should have noted that my link was a PDF, though. My apologies for that
oversight.

~~~
dpritchett
The HN community has long used footnotes to help with readability [1].

[1]
[http://www.google.com/search?q=%22[1]%20http%22+site:ycombin...](http://www.google.com/search?q=%22\[1\]%20http%22+site:ycombinator.com&safe=active)

------
briandoll
This reads like the 'fallacies of distributed computing' paper
([http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Comput...](http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing)).

While the likelihood of failure (or added latency, impacting upstream changes,
etc.) is greater in a large-scale distributed environment that you do not
control than in your home-grown datacenter, those scenarios are just facts of
life in distributed environments.

An awesome side effect of hosting an app in a cloud environment is that you
must face up to those fallacies immediately or they'll eat you alive.

~~~
wmf
Paying large engineering costs upfront is "awesome"?

~~~
briandoll
Ignoring the fallacies of distributed computing in web applications is akin to
technical debt. You're going to have to pay it eventually, and as the metaphor
suggests, it's cheaper to pay it off sooner rather than later.

~~~
wmf
That's probably true for Netflix, but it may not be for people with a short
runway who want to "fail fast".

~~~
briandoll
There are lots of advantages and disadvantages to hosting in a "cloud"
environment, so I'm not trying to paint it as rainbows and ponies.

If a company has a short runway (i.e. not a lot of cash in the bank), however,
hosting in the cloud means essentially renting capacity that you do not need
to maintain. Sounds good to me.

It's a fun pendulum to watch swing, though, I'll admit. Some companies with
small runways host in the cloud because it's rented capacity without too much
administration. They may grow wildly, until they notice they are paying the
"cloud tax" and could save considerable money by hosting themselves. They grow
some more, see how much money and attention they need to put into
infrastructure and how slowly they can increase capacity for their business.
In order to keep up with growth and maintain focus on their core business,
they move back to the cloud.

------
aristus
I'm pretty sure "session-based memory management" should be "memory-based
session management", ie they kept user session state in memory.

~~~
joevandyk
Maybe they rewrote malloc to make a http call to a session provider which
keeps track of the allocated memory on the machine? :)

------
wccrawford
I want a Chaos Monkey, too!

Actually, that was my first reaction, but after thinking for a moment, that
isn't really a reliable way to test. If you make changes to something, you
don't know for sure if the chaos monkey hit while you were testing a certain
thing or not. Proper unit tests would seem to be a lot more useful.

~~~
mickeyben
How would you write unit tests to test against services not responding and
routing issues between your different instances?

~~~
jdludlow
That's what mock objects are for. You mock the object that would normally call
the remote system, have it artificially fail, and use that to test clients of
that object.
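
A minimal sketch of the idea in Python, using the standard library's
`unittest.mock` (the class and function names here are hypothetical, just to
show the pattern):

```python
from unittest import mock

class RecommendationClient:
    """Stand-in for a client that calls a remote service over the network."""
    def fetch(self, user_id):
        raise NotImplementedError("talks to the network in production")

def homepage_rows(client, user_id):
    """Build homepage rows, falling back to a generic row on failure."""
    try:
        return client.fetch(user_id)
    except ConnectionError:
        return ["top-10-overall"]

# Mock the client so the remote call artificially fails, then verify
# that the caller degrades instead of blowing up:
client = mock.Mock(spec=RecommendationClient)
client.fetch.side_effect = ConnectionError("simulated outage")
rows = homepage_rows(client, user_id=42)
```

Here `rows` comes back as the generic fallback, proving the error path works
without any real network involved.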

~~~
samuel
So you only test the cases you have thought of that can fail (which it is
compulsory to test anyway). The purpose of the Chaos Monkey is to trigger
situations you haven't dreamed about.

OK, it ain't perfect. Things usually don't fail in a uniformly distributed
way, and you can't be sure you won't see a different kind of failure in
production, but it sounds useful (yeah, and pretty cool) nonetheless.

------
Jd
Basically the gist is: You need to be prepared for anything to stop working at
any time.

The tone of this post indicates to me that the criticism and problems
experienced by Netflix with AWS are understated, which I can understand given
their position as a flagship AWS customer, etc.

~~~
SpikeGronim
You need to be prepared for halting failures and variable performance. Moving
from higher-end, dedicated hardware to lower-end, shared hardware means an
increase in the failure rate and in the variance of your latencies. In exchange
you get elastic scaling, you convert capital costs to operational costs (no
more building datacenters), and you hand the "undifferentiated heavy lifting"
to AWS. The posts aren't, IMO, understating problems but demonstrating the
tradeoffs that come with AWS' core value proposition.

------
wglb
Interesting: _The Chaos Monkey’s job is to randomly kill instances_

Another way to say "If it ain't tested, it's broken".

------
kondro
Hardware is always going to fail eventually. Moving to AWS caused Netflix to
write better code to deal with these failures.

Failures were always going to happen, even in their own datacentre. What they
have now is a more fault-tolerant system which should have less downtime
overall.

------
byteclub
If you do decide to adopt your very own pet Chaos Monkey in your next project,
make sure you ARE able to gracefully degrade your service in case of failures.
Otherwise your customers will see the monkey in action, manifested by "we'll
be back shortly" messages. It's easier said than done, since a lot of the time
all of us forget to write (or feel too lazy, or have no idea how to properly
handle) the "else" branches for errors, unavailable services, and unreachable
databases.

Otherwise, good idea. It forces you to think about the perils of a distributed
environment from the very beginning, as opposed to leaving it as an
afterthought.
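
One common shape for those "else" branches is to fall back to cached data
before giving up entirely. A sketch in Python (the function names,
`ConnectionError` as the failure mode, and the movie titles are all
illustrative assumptions):

```python
def get_queue(user_id, load_from_db, cache):
    """Return (queue, status), degrading gracefully when the DB is down."""
    try:
        queue = load_from_db(user_id)
        cache[user_id] = queue               # keep a copy for bad days
        return queue, "fresh"
    except ConnectionError:
        if user_id in cache:
            return cache[user_id], "stale"   # serve possibly-old data
        return [], "unavailable"             # last resort: apology page

def broken_db(user_id):
    raise ConnectionError("database unreachable")

cache = {7: ["Heat", "Ronin"]}
queue, status = get_queue(7, broken_db, cache)    # cached: degrades to stale
queue2, status2 = get_queue(8, broken_db, cache)  # uncached: apology page
```

The user with a cached queue keeps watching slightly stale data; only the
user with nothing cached ever sees the "we'll be back shortly" page.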

------
dochtman
Lesson they have not yet learned: including an HTML title tag in their Blogger
templates.

------
jasonkester
I love the idea of setting up a fully working system on AWS, then repeating
all traffic from your live site over to it to see how it stands up under load.

No need to simulate traffic for testing purposes. Here's our _actual_ traffic.
All of it.

Nice.

------
ergo98
Reading both this entry and the one that explained why they went with AWS, I'm
left confused about why they ever went to AWS in the first place.

~~~
recampbell
Having built a non-trivial production system, I have a love-hate relationship
with AWS. I love the possibilities and potential, but I'm deeply disappointed
when it fails to live up to the hype.

It's just a matter of alternatives. There's frankly no other IaaS out there
that can match the features of EC2 (e.g. EBS, elastic IPs). These guys are
unstoppable; they come out with more features every week.

~~~
wmf
Yeah, EC2 is the best cloud, but it's a shame that it's still missing features
that ESX 2.0 had (and Amazon has shown no interest in fixing that).

~~~
jeffbarr
Wes, what features are you looking for?

~~~
bmurphy
Disk I/O performance that doesn't suck.

Let us add DevPay instances to an ELB.

More RAM.

Elastic private IP addresses.

Change security groups of running instances.

I have a lot more, but those are my big ones.

~~~
jollojou
While it is true that you cannot add new ones or remove existing ones, you can
modify an existing security group of a running instance:
[http://aws.amazon.com/articles/1145?_encoding=UTF8&jiveR...](http://aws.amazon.com/articles/1145?_encoding=UTF8&jiveRedirect=1#13).

------
sabat
I'll bet other companies (e.g. Heroku, Dropbox) that use AWS/EC2 would have
similar things to say.

I did have this one question, being a guy with an IT background: they expected
stability? Really? I always expect host/app/system failure, and am pleasantly
surprised when it doesn't happen.

~~~
Travis
My take on their experience was that they had only sort of abstractly thought
about stability. Then they severely underestimated the variance in
performance. Before moving to Amazon, stability was something you worked at in
the lower levels, e.g., hardware, power, etc. Once you move to Amazon,
stability cannot be controlled at a lower level, so you have to make your
application code more robust.

With your own IT infrastructure, you'd logically put the focus into improving
the reliability on your hardware stack, rather than just focusing on handling
those issues in software.

