

Why Twilio Wasn’t Affected by Today’s AWS Issues - johndbritton
http://www.twilio.com/engineering/2011/04/22/why-twilio-wasnt-affected-by-todays-aws-issues/

======
toast76
504 Gateway Time-out

nginx/0.9.2

You really want to make sure your shit works before you go boasting about how
well it works. :)

EDIT: seems to be working now :P Interesting article once I got over the irony
of it not working.

~~~
emcooke
Ironically, this highlights one of the main issues we discuss in the post!

The Twilio Engineering blog is hosted on an external Wordpress site with a
single IP that our nginx load balancer pool forwards to. Since the load
balancers assume that the external service can fail, they won't tie up
resources blocking access to other parts of the site.
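
As a rough illustration of the failure-isolation idea (a hypothetical Python
sketch, not our actual nginx config):

    import urllib.request

    BLOG_UPSTREAM = "http://blog-host.example.com/"  # hypothetical blog host

    def proxy_blog(path):
        try:
            # Short timeout: if the external blog host is down, fail this
            # one request quickly instead of holding a worker open.
            with urllib.request.urlopen(BLOG_UPSTREAM + path, timeout=2) as r:
                return 200, r.read()
        except Exception:
            # The failure is scoped to the blog; the rest of the site
            # keeps serving normally.
            return 504, b"blog temporarily unavailable"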

Hope you enjoy the post :)

-Evan Twilio.com

~~~
aaronblohowiak
Why not stick a grace-mode Varnish in between? Serving a stale blog is usually
better than no blog!
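
Something like this minimal serve-stale sketch in Python (fetch_from_backend
is a hypothetical stand-in for the origin request):

    import time

    CACHE = {}  # path -> (fetched_at, body)
    TTL = 60    # treat entries as fresh for 60 seconds

    def get(path, fetch_from_backend):
        now = time.time()
        entry = CACHE.get(path)
        if entry and now - entry[0] < TTL:
            return entry[1]                  # fresh cache hit
        try:
            body = fetch_from_backend(path)  # raises if the backend is down
            CACHE[path] = (now, body)
            return body
        except Exception:
            if entry:
                return entry[1]              # stale beats nothing
            raise                            # nothing cached: a real miss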

~~~
emcooke
Yup, good idea. We set up an nginx proxy to cache the page while the blog
hosting provider fixes their server.

------
keltex
Sounds like a humblebrag to me:

[http://online.wsj.com/article/SB1000142405274870457070457627...](http://online.wsj.com/article/SB10001424052748704570704576275320082913808.html)

------
emcooke
We just enabled caching on the nginx proxy to the external site hosting our
Wordpress install for the engineering blog. Hopefully that should help
performance.

-Evan Twilio.com

------
necrodome
This post would be better if they gave more concrete examples of their
infrastructure. I read the whole post and still don't know how they survived,
beyond picking up some general knowledge about distributed system design.

~~~
notauser
They had some good general points though, like fast retries. Which brings me
to one of the worst human-factors mistakes I can think of right now...

The new rent-a-bike scheme in London has POS terminals connected to the
central system via bits of string and/or cellular modems. Every now and again
these links fall over or the central system becomes unresponsive.

If you are attempting to get a bike (with an active card subscription) you
drop your card into the terminal and it prints you a release code that lets
you take a bike.

Unless the system is down... in which case it still reads your card, and then
sits there and shows you a spinner for 5 minutes.

You can't walk away during this time, because if you do and the link comes
back up it'll print a release code which anyone can use to take a £300+ bike
on your account.

If you do stick around and try again? That'll be another 5 minutes which you
could have spent walking to the next bike dispensary.

I think that timeouts are one of those things that you can only tune really
well when you use the system in a live environment and see how well things
work. In this case a higher transaction failure rate would be _vastly_ better
than a five-minute timeout; on other systems, not so much.
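
Roughly the fail-fast behavior I'd want from the terminal (a sketch in
Python; fetch stands in for the hypothetical network call):

    import time

    def get_release_code(fetch, timeout=3.0, retries=1):
        # Short per-attempt budget plus one quick retry: worst case is a
        # few seconds, not five minutes at the terminal.
        for attempt in range(retries + 1):
            try:
                return fetch(timeout=timeout)
            except TimeoutError:
                time.sleep(0.5 * (attempt + 1))  # brief backoff
        # Fail fast and say so, instead of trapping the user with a spinner.
        raise TimeoutError("system unavailable; try another terminal")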

------
egon_
And of course, anyone running services outside the affected region wasn't
affected.

------
ahlatimer
Here's the article from Google's cache, in case it's unreachable for others:
[http://webcache.googleusercontent.com/search?sourceid=chrome...](http://webcache.googleusercontent.com/search?sourceid=chrome&ie=UTF-8&q=cache%3Awww.twilio.com%2Fengineering%2F2011%2F04%2F22%2Fwhy-twilio-wasnt-affected-by-todays-aws-issues%2F)

------
emcooke
Several people have asked for additional details. We just posted a quick
follow-on:

[UPDATE] A central theme of the recent AWS issues has been the Amazon Elastic
Block Store (EBS) service. We use EBS at Twilio, but only for non-critical
and non-latency-sensitive tasks. We've been a slow adopter of EBS for core
parts of our persistence infrastructure because it doesn't satisfy the
"unit-of-failure is a single host" principle: if EBS were to experience a
problem, all dependent services could also experience failures. Instead, we've
focused on using the ephemeral disks present on each EC2 host for persistence.
If an ephemeral disk fails, that failure is scoped to that host. We are
planning a follow-on post describing how we do RAID 0 striping across
ephemeral disks to improve I/O performance.
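
For the curious, the setup boils down to something like this (a rough sketch;
the device names are illustrative and vary by instance type, and this isn't
our exact procedure):

    import subprocess

    EPHEMERAL = ["/dev/xvdb", "/dev/xvdc"]  # illustrative device names

    def make_stripe(devices, md="/dev/md0", mountpoint="/data"):
        # Stripe the ephemeral disks into one RAID 0 array for I/O.
        subprocess.run(["mdadm", "--create", md, "--level=0",
                        "--raid-devices=%d" % len(devices), *devices],
                       check=True)
        subprocess.run(["mkfs.ext4", md], check=True)  # make a filesystem
        subprocess.run(["mount", md, mountpoint], check=True)

    make_stripe(EPHEMERAL)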

------
chopsueyar
Are you able to observe/log the failed instances?

What percentage of the various pools were affected by the outage?

I'm more curious about the hourly rate.

If you have a pool of 30 instances and only 3 are accessible, are you still
being charged for all 30 plus the additional 27 you need to bring up?

------
johndyer
LOL...."504 Gateway Time-out".....nginx must not be one of Twilio's "small
stateless services" (F)(A)(I)(L) ;)

~~~
trotsky
A cursory inspection indicates that their ngix box at least is running on EC2
NoVA. It takes a particular kind of person to want to tempt fate to such a
degree by posting something like that while running on top of what can be best
described as "a fluid situation"

~~~
jdupree
Like saying "My spelling is perfect, my grammer to!"

------
pkteison
I want to see an article about making use of not-perfectly-up-to-date backup
databases in a different region. Why can't reddit dump a copy of their new
articles and comments to the west coast every night, then, if the east coast
dies, fire that up? Sure, it's missing a chunk of the latest day's data, but
that has to beat either being completely down or jumping through the technical
hoops required to keep separate regions in sync across the internet. Then
collect new articles and comments on the backup for a while, and when the east
coast is fixed, merge the new data back over and go about business as usual?

Ditto for any web 2.0 we-are-a-fancy-shared-commenting-blog service, or
anything that is fundamentally time-based aggregation of information. Do
database replication systems just not handle the concept of working with
temporary gaps in the data?
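
A sketch of the nightly dump-and-ship job I have in mind (assuming PostgreSQL
and boto3; the database and bucket names are made up):

    import subprocess
    import boto3

    def nightly_dump(db="news", bucket="warm-standby-us-west",
                     key="nightly.dump"):
        # Dump the whole database; a last-day gap in the standby is accepted.
        subprocess.run(["pg_dump", "-Fc", "-f", "/tmp/nightly.dump", db],
                       check=True)
        # Ship it to a bucket in the other region for the warm standby.
        s3 = boto3.client("s3", region_name="us-west-2")
        s3.upload_file("/tmp/nightly.dump", bucket, key)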

~~~
dangrossman
A lot of AWS stuff can't be transferred between regions. There's no way to
move an EBS snapshot from east to west coast except to copy the thing across
the public internet. Once it's over there on the west coast, to "fire that up"
they have to launch app servers, database servers, cache servers, etc. whose
configurations they had to keep mirrored from their normal region. They need
to get all those backups onto EBS disks without using the same snapshot
features they probably automated in their main region, attach them to the
right instances... For a team with a single sysadmin, it's not as simple as
you make it sound.

~~~
chopsueyar
Amazon needs to buy some railroad rights-of-way.

------
suking
Website not loading... Not sure if article is serious...

~~~
frankdenbow
loads just fine for me...

~~~
suking
Try #7 worked (serious).

------
tayl0r
Awesome post! "504 Gateway Time-out" was the best article I've read in a long
time.

------
pbreit
I'm sure Twilio is hoping this sounds impressive but it sounds pathetic to me.
Services are supposed to stay up after all.

