Ironically, this highlights one of the main issues we discuss in the post!
The Twilio Engineering blog is hosted on an external WordPress site with a single IP that's forwarded from our nginx load balancer pool. Since the load balancers assume that the external service can fail, they won't tie up resources blocking access to other parts of the site.
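For the curious, a minimal sketch of what that looks like is below; the hostname, path and timeout values are illustrative, not our production config:

# Illustrative only: hostname, path and timeouts are made up.
cat > /etc/nginx/blog-proxy.inc <<'EOF'
location /engineering/ {
    proxy_pass            http://external-wordpress.example.com/;
    # Fail fast rather than tying up worker connections when the
    # external blog host is slow or down.
    proxy_connect_timeout 2s;
    proxy_read_timeout    5s;
}
EOF
# ...then `include /etc/nginx/blog-proxy.inc;` inside the relevant
# server {} block and reload:
nginx -s reload

With short timeouts like these, a dead blog host turns into a quick 502/504 on the blog URLs instead of a pile of stuck connections in front of the rest of the site.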
Evan, I just noticed that your service seems to be running on Slicehost, not the AWS colo in Virginia. Is that correct? I got the opposite impression from your post, which seems to imply that Twilio is hosted on AWS, yet managed to weather the storm because of your design decisions.
Ah, I see it now. I just got a POST from one of your servers in the AWS US-West region. Is Twilio also hosted in US-East (the region affected by today's outage), and, if so, would Twilio have stayed up if it hadn't been spread across multiple regions?
dmor, really? It looks like the blog points to AWS, as does your API?
jdyer@aleph:~ [git:master] <ruby-1.9.2>
» host api.twilio.com
api.twilio.com is an alias for public-vip374d1ca4e.prod.twilio.com.
public-vip374d1ca4e.prod.twilio.com is an alias for ec2-174-129-254-101.compute-1.amazonaws.com.
ec2-174-129-254-101.compute-1.amazonaws.com has address 174.129.254.101
----
jdyer@aleph:~ [git:master] <ruby-1.9.2>
» host www.twilio.com
www.twilio.com is an alias for public-vip29c4ab3d.prod.twilio.com.
public-vip29c4ab3d.prod.twilio.com is an alias for ec2-174-129-253-75.compute-1.amazonaws.com.
ec2-174-129-253-75.compute-1.amazonaws.com has address 174.129.253.75
DNS lookups don't tell you anything here. The way a reverse proxy works is that HTTP requests to certain URLs get turned into an HTTP client request by the web server to the third-party provider (for caching, URL rewriting, compression, SSL termination, and getting around firewalls). You can learn about them here: http://en.wikipedia.org/wiki/Reverse_proxy
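If you want something more telling than DNS, look at the response headers instead, though a proxy can mask those too. A quick check (the path is a guess, and with nginx's proxy defaults the upstream's Server header gets replaced by the front end's):

# Headers, not DNS, are the only real hint -- and a reverse proxy can
# hide even those. Path is a guess.
curl -sI http://www.twilio.com/engineering/ | egrep -i '^(server|via|x-cache):'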
True, and I had actually misread dmor's post entirely here; I read the post as stating that Twilio's blog was not reliant on AWS in any way, which would have been a misrepresentation in my mind. However, in hindsight this was not the case, and I will certainly admit when I am wrong.
We just enabled caching on the nginx proxy to the external site hosting our WordPress install for the engineering blog. Hopefully that should help performance.
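For anyone curious, the caching side is just a couple of directives on that same proxy; the sizes and TTLs below are illustrative rather than our exact settings:

# Illustrative values, not the exact production settings.
# The cache zone is defined at the http {} level:
mkdir -p /var/cache/nginx/blog
cat > /etc/nginx/conf.d/blog-cache.conf <<'EOF'
proxy_cache_path /var/cache/nginx/blog levels=1:2 keys_zone=blog:10m max_size=100m;
EOF
# ...and inside the location {} that proxies to the external WordPress host:
#     proxy_cache        blog;
#     proxy_cache_valid  200 302  5m;
#     proxy_cache_valid  any      1m;
nginx -s reload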
This post would be better if they gave more concrete examples of their infrastructure. I read the whole post and still don't know how they survived, beyond the fact that they applied some general distributed-systems design knowledge.
They had some good general points though, like fast retries, which brings me to one of the worst human-factors mistakes I can think of right now...
The new rent-a-bike scheme in London has POS terminals connected to the central system via bits of string and/or cellular modems. Every now and again these links fall over or the central system becomes unresponsive.
If you are attempting to get a bike (with an active card subscription), you drop your card into the terminal and it prints you a release code that lets you take a bike.
Unless the system is down... in which case it still reads your card, and then sits there and shows you a spinner for 5 minutes.
You can't walk away during this time, because if you do and the link comes back up it'll print a release code which anyone can use to take a £300+ bike on your account.
If you do stick around and try again? That'll be another 5 minutes which you could have spent walking to the next bike dispensary.
I think that timeouts are one of those things that you can only tune really well when you use the system in a live environment and see how well things work. In this case a higher transaction failure rate would be vastly better than a 5-minute timeout; on other systems, not so much.
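At the HTTP level, the "fail fast, retry fast" budget is almost trivial to express. For example, with curl (the numbers are purely for illustration and the endpoint is hypothetical):

# Give up on a connection in 2s, cap the whole request at 5s, retry a
# few times with a short delay -- then fail visibly, instead of leaving
# someone staring at a spinner for 5 minutes.
curl --connect-timeout 2 --max-time 5 --retry 3 --retry-delay 1 \
     http://bike-terminal.example.com/release-code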
Several people have asked for additional details. We just posted a quick follow-on:
[UPDATE] A central theme of the recent AWS issues has been the Amazon Elastic Block Store (EBS) service. We use EBS at Twilio, but only for non-critical and non-latency-sensitive tasks. We've been a slow adopter of EBS for core parts of our persistence infrastructure because it doesn't satisfy the "unit of failure is a single host" principle. If EBS were to experience a problem, all dependent services could also experience failures. Instead, we've focused on utilizing the ephemeral disks present on each EC2 host for persistence. If an ephemeral disk fails, that failure is scoped to that host. We are planning a follow-on post describing how we do RAID 0 striping across ephemeral disks to improve I/O performance.
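As a very rough sketch while that post is in the works (device names and filesystem choice here are illustrative, not our exact provisioning steps):

# Sketch only: stripe two ephemeral (instance-store) disks into one md
# device; actual device names depend on the instance type.
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
mkfs.ext3 /dev/md0
mkdir -p /data
mount /dev/md0 /data
# If this host dies, only this host's stripe set dies with it.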
A cursory inspection indicates that their nginx box, at least, is running on EC2 in NoVA. It takes a particular kind of person to want to tempt fate to such a degree by posting something like that while running on top of what can best be described as "a fluid situation".
I want to see an article about making use of not-perfectly-up-to-date backup databases in a different region. Why can't reddit dump a copy of their new articles and comments to the west coast every night, then, if the east coast dies, fire that up? Sure, it's missing a chunk of the latest day's data, but that has to beat either being completely down or jumping through the technical hoops required to keep separate regions in sync across the internet. Then collect new articles and comments on the backup for a while, and when the east coast is fixed, merge the new data back over and get back to business?
Ditto for any web 2.0 we-are-a-fancy-shared-commenting-blog service, or anything that is fundamentally time-based aggregation of information. Do database replication systems just not handle the concept of working with temporary gaps in the data?
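Mechanically it doesn't even need real replication; a dumb nightly cron job along these lines would cover a lot of it (database name, paths and bucket are made up):

# Hypothetical nightly job: dump, compress, ship to a bucket (or a
# standby host) in another region. Restore it only when the primary
# region is down, accept the missing day, merge back later.
DUMP=/tmp/nightly-$(date +%F).sql.gz
pg_dump articles_and_comments | gzip > "$DUMP"
s3cmd put "$DUMP" s3://example-backups-us-west/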
A lot of AWS stuff can't be transferred between regions. There's no way to move an EBS snapshot from the east to the west coast except to copy the thing across the public internet. Once it's over there on the west coast, to "fire that up" they'd have to launch app servers, database servers, cache servers, etc. whose configurations they'd have had to keep mirrored from their normal region. They'd need to get all those backups onto EBS disks without using the same snapshot features they probably automated in their main region, attach them to the right instances... For a team with a single sysadmin, it's not as simple as you make it sound.
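About the only way I know of to "move" a snapshot across regions right now is a raw copy over the wire, something like this (device names and host are hypothetical):

# Attach a volume created from the snapshot to an instance in us-east,
# attach an empty volume of the same size to an instance in us-west,
# then stream the raw device across. Slow, manual, and all over the
# public internet.
dd if=/dev/sdf bs=1M | gzip -c | \
    ssh user@standby.us-west.example.com 'gzip -dc | dd of=/dev/sdf bs=1M'

And that's just the data; the instances and configs on the other side still have to exist and be kept mirrored, as you say.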
nginx/0.9.2
You really want to make sure your shit works before you go boasting about how well it works. :)
EDIT: seems to be working now :P Interesting article once I got over the irony of it not working.