
AWS outage brings Netflix down for some devices on Christmas Eve - molecule
http://www.engadget.com/2012/12/24/netflix-outage/
======
rdl
I don't get why Netflix doesn't put enough logic in the client to deal with
this kind of thing in some kind of graceful way.

Netflix, to me, is a big collection of links to stuff in CDNs (which, to a
first approximation, never go down; Akamai is essentially unfadeable, highly
resistant to all forms of outage because it's a trivial replication problem.)

I rarely if ever care about the recommendations engine, new content in the
submission queue, etc.

Yes, there's an authentication problem, but this is also trivial to replicate,
and it's fine if "is a valid subscriber" goes even a month out of date.

Essentially, even if AWS goes down completely, the Netflix client should be
able to get a static list of movies and show them. Maybe that's my queue,
maybe that's the top 10 per genre, whatever.

In times of degraded operation, show me _something_.
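A minimal sketch of the fallback described above, assuming a client that keeps a static title list on hand. The names `STATIC_FALLBACK`, `titles_to_show`, and `broken_backend` are made up for illustration and say nothing about how Netflix's client actually works:

```python
from typing import Callable, List

# Hypothetical static catalogue shipped with (or cached by) the client.
STATIC_FALLBACK = ["Top pick 1", "Top pick 2", "Top pick 3"]


def titles_to_show(fetch_live: Callable[[], List[str]],
                   cached: List[str] = STATIC_FALLBACK) -> List[str]:
    """Return live titles if the backend answers, else the cached list."""
    try:
        titles = fetch_live()
    except Exception:
        # Backend unreachable: degrade instead of showing an error screen.
        return list(cached)
    # An empty response is treated like a failure for display purposes.
    return titles if titles else list(cached)


def broken_backend() -> List[str]:
    """Stand-in for a backend call during an outage."""
    raise ConnectionError("backend unavailable")


print(titles_to_show(broken_backend))  # falls back to the static list
```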

~~~
stusmall
I know they have some degradation of service in their recommendations but I
can't speak for the rest of the service. A coworker was telling me that each
recommendation on your main netflix screen is provided by a different service.
So let's say it's supposed to be feeding me scifi recommendations. If the scifi
service is down, then it will just remove it from the main screen and I will
get a different recommendation category until it's back up.
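The per-row behaviour described here could look roughly like the sketch below. It is only a guess at the general shape, based on the secondhand description above; `build_home_screen` and the category names are hypothetical:

```python
from typing import Callable, List, Tuple

RowSource = Tuple[str, Callable[[], List[str]]]


def build_home_screen(sources: List[RowSource],
                      max_rows: int) -> List[Tuple[str, List[str]]]:
    """Fill up to max_rows rows, skipping any category whose service fails."""
    rows: List[Tuple[str, List[str]]] = []
    for name, fetch in sources:
        if len(rows) == max_rows:
            break
        try:
            titles = fetch()
        except Exception:
            # This category's service is down: skip it so the next
            # category in the list takes its place on the screen.
            continue
        if titles:
            rows.append((name, titles))
    return rows


def scifi_down() -> List[str]:
    """Stand-in for a recommendation service that is currently failing."""
    raise TimeoutError("scifi service unavailable")


sources = [
    ("scifi", scifi_down),
    ("comedy", lambda: ["C1", "C2"]),
    ("drama", lambda: ["D1"]),
]
print(build_home_screen(sources, max_rows=2))
```

With the scifi source failing, the two rows shown are comedy and drama.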

------
ladon86
Our ELB endpoints have been down for over 9 hours. Luckily we were able to
fail over to an alternative solution in under 5 minutes, but that is pretty
mean downtime.

------
patrickgzill
At some point, companies that rely solely on AWS will be seen as having a
"technical debt" - to not be able to engineer around such obstacles will be
seen as a management failure.

~~~
morgo
Yeah I can't wait till people wise up and don't just see "the cloud" as one AZ
in us-east-1.

(It wouldn't have helped in this case, but it's a general annoyance I have.)

------
druiid
I'm wondering if this outage will have Netflix looking at alternatives, either
to running only on AWS or to using AWS at all. While AWS is a great piece of
technology, stable it is not, and Amazon has proven this about as surely as I
can imagine.

Basically after nearly 15 hours of downtime I'd consider that beyond
unacceptable. With physical hardware even on Christmas Day you could have
replacements well before that and spun up... just saying.

------
durpleDrank
You'd think Amazon would have done something since the last big outage. They
just don't care, do they?

~~~
seanp2k2
They do care, but they're also developing software pretty quickly and they
break stuff just like everyone else. Only, when they break, people tend to
notice, and those people tend to spread word around pretty quickly.

When I worked at a hosting company, we'd have at least 5 shared hosting
servers (with ~75-200 websites on each) go down each day, and yet we had a
reputation for good uptime, because comparatively, we DID have good uptime.

I think the problem is that most people think about uptime wrong. Uptime is a
compromise; a trade-off. You can pay for more uptime with diminishing returns
after about 98%. Running one dedicated server colo'd with decent fault
recovery systems in a decent datacenter will probably get you 100% uptime most
of the time, until it doesn't, thus ~98%.

If you're going to attempt to get beyond power failures, you'll need a second
server (or "instance") somewhere. If you need the capacity anyway, this might
not double your costs, but you may still have unused headroom. You could buy
flexible computing power by using shared hosting ("the cloud") or whatever,
but it's the same problem.
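As a rough illustration of what a second server buys you, here is the standard formula for redundant replicas. It assumes failures are fully independent, which is optimistic for real deployments (shared power, network, or provider), and `combined_availability` is a name made up for this sketch:

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of N replicas when any one being up is enough.

    Assumes independent failures: the service is down only when every
    replica is down at the same time.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a fraction between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - single) ** replicas


# One ~98% server versus two of them in independent locations.
print(round(combined_availability(0.98, 1), 4))
print(round(combined_availability(0.98, 2), 4))
```

Under that independence assumption, two 98% servers come out at roughly 99.96%.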

Once you get into a state where you have a global business, customer demands,
supplier issues, vendor lock-in, etc, it becomes a numbers game. You can hire
(more|better|famous) devs and possibly get more uptime. You can test more (and
slow down feature releases) to get better uptime. You can break stuff (and pay
for the recovery + lost face + downtime) to decrease downtime. Everything is a
trade-off, and right now it makes sense to chase about "four nines" of uptime
-- 99.99%.

Four nines allows 4.32 minutes of downtime per 30-day month -- about four
minutes and nineteen seconds in total. This is very possible for many large
services and while it does have cost overhead, it's manageable.
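The arithmetic behind that figure, as a small sketch. It assumes a 30-day month, and `downtime_budget_minutes` is just an illustrative helper name:

```python
def downtime_budget_minutes(availability_percent: float,
                            days: float = 30.0) -> float:
    """Minutes of allowed downtime over `days` at the given availability."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1.0 - availability_percent / 100.0)


for target in (98.0, 99.9, 99.99):
    minutes = downtime_budget_minutes(target)
    print(f"{target}% -> {minutes:.2f} minutes per 30 days")
```

For 99.99% this comes to about 4.32 minutes, while 98% allows 864 minutes (14.4 hours) over the same period.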

TL;DR don't go chasing waterfalls (100% uptime.) It's not practical. It may be
possible, but it depends on how long. 100% uptime for 10 years would be a
pretty damn lofty goal. In my opinion, it's much more important to recover
quickly and gracefully with awesome communication with your customers than it
is to be up 100% of the time. 100% uptime goes unnoticed, but consistently
great customer service does not.

------
emersonrsantos
Not too late to remind: if you use AWS for big things, you should have
worldwide region redundancy.

~~~
druiid
This is only reasonable to a point. You're essentially doubling your costs
because things don't exactly work the 'same' between zones. It's not like
you're putting a bunch of servers in the same platform exactly and spreading
them out around the planet... so you can't have four web servers instead of
two and send traffic to all four (with two of them at another AZ) without
also considering and having a separate DB setup in that AZ, etc... so costs
quickly mount (not to mention it's a much more complex platform suddenly).

------
ck2
Still down 5am EST, so that's over 12 hours.

I guess redbox is still working.

I wonder what percentage of the population will have never owned a DVD player
in the next generation.

If netflix owned their own hardware and could reach out and touch it, would
this have happened?

~~~
Cthulhu_
> I wonder what percentage of the population will have never owned a DVD
> player in the next generation.

Probably the same number of people who have never owned a record player or,
for the younger people, a CD player.

> If netflix owned their own hardware and could reach out and touch it, would
> this have happened?

Yes, outages happen whether you have the hardware in your own hands or have it
hosted by a cloud provider.

------
jessedhillon
Can someone explain why Netflix's service is designed this way? By "this way"
I mean, why are some devices served by some set of servers/services
exclusively, and apparently web browsers and other devices served in some
other way?

I'm not questioning if there's a good reason that it's done this way, but that
reason is just not obvious to me. I would have designed such a system where there
is one endpoint which all clients hit, regardless of platform. But I have
never designed the world's largest video streaming infrastructure.

~~~
moe
Not all devices support all streaming protocols and formats. E.g. their
desktop player uses the proprietary SMPTE 421M codec (Microsoft) which is not
available (or efficiently implemented) on platforms where Silverlight is not
available (read: most of them).

Since many of these other platforms _also_ use proprietary formats they end up
having to maintain a range of different proprietary streaming servers, many of
which presumably need to interface with their respective mothership platform
services (Xbox Live, PSN, Roku).

Oh, and recently they have begun building their own CDN[1] which may add
further to the diversity.

[1] <https://signup.netflix.com/openconnect>

~~~
seanp2k2
See also: <http://gigaom.com/video/netflix-encoding/> and DRM schemes, esp.
those detailed in content contracts (protip: big studios specify codecs,
encryption specifics, etc.)

Different for each device based on how much power the device has. TVs have
nothing for CPU power in most cases.

------
rdl
Is there anything at all "special" about ELB? i.e. is there any reason you
couldn't implement a software load balancer yourself within AWS, using non-
EBS-dependent EC2 instances?

~~~
jwilliams
Not especially. What you do get is:

- SSL for "free"

- If you use Route 53, it's a single DNS lookup (no CNAME hop)

- If you're using a VPC you can use an ELB to face the public Internet.

- (Slightly simpler) auto-scaling.

- Redundancy across AZs

Many/All of these things you could achieve yourself, but if you were using EC2
resources you'd probably find it reasonably expensive.

~~~
1SaltwaterC
The redundancy across AZs is limited to the removal of the zone from the ELB
config if all the instances from that AZ are marked as failed. If the API is
down, tough cookies. All you get is 503 errors for each request hitting the
failed AZ. I asked support to implement proper failover. They said they're
working on it, but no ETA. I haven't seen an announcement about it in the
meantime. Their resolution for closing the ticket was: "script it yourself" with
the core assumption that the API works.

I say this because I don't want people to get a false sense of security when
deploying to multi-AZ behind ELB.
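The removal rule described above (a zone is dropped only when every instance in it is marked failed) can be sketched as pure logic. This deliberately makes no AWS API calls; `zones_to_drop` and the zone names are hypothetical, and any real "script it yourself" version would still depend on the API being reachable:

```python
from typing import Dict, List


def zones_to_drop(instance_health: Dict[str, List[bool]]) -> List[str]:
    """Return zones where every registered instance fails its health check.

    instance_health maps a zone name to one boolean per instance
    (True = healthy). Zones with no registered instances are left alone.
    """
    return sorted(
        zone
        for zone, checks in instance_health.items()
        if checks and not any(checks)
    )


health = {
    "us-east-1a": [False, False],  # everything failed: candidate for removal
    "us-east-1b": [True, False],   # partially healthy: stays in rotation
    "us-east-1c": [],              # nothing registered: ignored
}
print(zones_to_drop(health))
```

Only the zone with no healthy instances is selected; a partially degraded zone keeps receiving traffic, which is the limitation being pointed out.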

------
seldo
Amazon is currently reporting simultaneous issues in ELB, Cloudsearch, and
Elastic Beanstalk. I suspect they use some common underlying service which is
the root of the failure in all three, but I don't know what service that might
be. (The usual suspect, EBS, is not reporting issues right now)

------
xwowsersx
Was down for me on xbox last night, which was super annoying because I loaded
up some great movies from rottentomatoes netflix list and then fired up my
xbox only to discover netflix was down :(

------
cbsmith
Drat. Just got hit by this. PS3 player is totally not working.

~~~
sjs382
Roku too

