Ironically, this highlights one of the main issues we discuss in the post!
The Twilio Engineering blog is hosted on an external WordPress site with a single IP that's forwarded from our nginx load balancer pool. Since the load balancers assume that the external service can fail, they won't tie up resources blocking access to other parts of the site.
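For the curious, a minimal sketch of what that looks like is below; the hostname, path and timeout values are illustrative, not our production config:

# Illustrative only: hostname, path and timeouts are made up.
cat > /etc/nginx/blog-proxy.inc <<'EOF'
location /engineering/ {
    proxy_pass            http://external-wordpress.example.com/;
    # Fail fast rather than tying up worker connections when the
    # external blog host is slow or down.
    proxy_connect_timeout 2s;
    proxy_read_timeout    5s;
}
EOF
# ...then `include /etc/nginx/blog-proxy.inc;` inside the relevant
# server {} block and reload:
nginx -s reload

With short timeouts like these, a dead blog host turns into a quick 502/504 on the blog URLs instead of a pile of stuck connections in front of the rest of the site.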
Evan, I just noticed that your service seems to be running on Slicehost, not the AWS colo in Virginia. Is that correct? I got the opposite impression from your post, which seems to imply that Twilio is hosted on AWS, yet managed to weather the storm because of your design decisions.
Ah, I see it now. I just got a POST from one of your servers in the AWS US-West region. Is Twilio also hosted in US-East (the region affected by today's outage), and, if so, would Twilio have stayed up if it hadn't been spread across multiple regions?
dmor, really? It looks like the blog points to AWS, as does your API?
jdyer@aleph:~ [git:master] <ruby-1.9.2>
» host api.twilio.com
api.twilio.com is an alias for public-vip374d1ca4e.prod.twilio.com.
public-vip374d1ca4e.prod.twilio.com is an alias for ec2-174-129-254-101.compute-1.amazonaws.com.
ec2-174-129-254-101.compute-1.amazonaws.com has address 174.129.254.101
----
jdyer@aleph:~ [git:master] <ruby-1.9.2>
» host www.twilio.com
www.twilio.com is an alias for public-vip29c4ab3d.prod.twilio.com.
public-vip29c4ab3d.prod.twilio.com is an alias for ec2-174-129-253-75.compute-1.amazonaws.com.
ec2-174-129-253-75.compute-1.amazonaws.com has address 174.129.253.75
DNS lookups don't tell you anything here. The way a reverse proxy works is that HTTP requests to certain URLs get turned into an HTTP client request by the web server to the third-party provider (for caching, URL rewriting, compression, SSL termination, and getting around firewalls). You can learn about them here: http://en.wikipedia.org/wiki/Reverse_proxy
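If you want something more telling than DNS, look at the response headers instead, though a proxy can mask those too. A quick check (the path is a guess, and with nginx's proxy defaults the upstream's Server header gets replaced by the front end's):

# Headers, not DNS, are the only real hint -- and a reverse proxy can
# hide even those. Path is a guess.
curl -sI http://www.twilio.com/engineering/ | egrep -i '^(server|via|x-cache):'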
True, and I had actually misread dmor's post entirely here; I read the post as stating that Twilio's blog was not reliant on AWS in any way, which would have been a misrepresentation in my mind. However, in hindsight this was not the case, and I will certainly admit when I am wrong.
We just enabled caching on the nginx proxy to the external site hosting our WordPress install for the engineering blog. Hopefully that should help performance.
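For anyone curious, the caching side is just a couple of directives on that same proxy; the sizes and TTLs below are illustrative rather than our exact settings:

# Illustrative values, not the exact production settings.
# The cache zone is defined at the http {} level:
mkdir -p /var/cache/nginx/blog
cat > /etc/nginx/conf.d/blog-cache.conf <<'EOF'
proxy_cache_path /var/cache/nginx/blog levels=1:2 keys_zone=blog:10m max_size=100m;
EOF
# ...and inside the location {} that proxies to the external WordPress host:
#     proxy_cache        blog;
#     proxy_cache_valid  200 302  5m;
#     proxy_cache_valid  any      1m;
nginx -s reload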
This post would be better if they gave more concrete examples of their infrastructure. I read the whole post and still don't know how they survived, beyond the fact that they applied some general distributed-systems design knowledge.
They had some good general points though, like fast retries, which brings me to one of the worst human-factors mistakes I can think of right now...
The new rent-a-bike scheme in London has POS terminals connected to the central system via bits of string and/or cellular modems. Every now and again these links fall over or the central system becomes unresponsive.
If you are attempting to get a bike (with an active card subscription), you drop your card into the terminal and it prints you a release code that lets you take a bike.
Unless the system is down... in which case it still reads your card, and then sits there and shows you a spinner for 5 minutes.
You can't walk away during this time, because if you do and the link comes back up it'll print a release code which anyone can use to take a £300+ bike on your account.
If you do stick around and try again? That'll be another 5 minutes which you could have spent walking to the next bike dispensary.
I think that timeouts are one of those things that you can only tune really well when you use the system in a live environment and see how well things work. In this case a higher transaction failure rate would be vastly better than a 5-minute timeout; on other systems, not so much.
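At the HTTP level, the "fail fast, retry fast" budget is almost trivial to express. For example, with curl (the numbers are purely for illustration and the endpoint is hypothetical):

# Give up on a connection in 2s, cap the whole request at 5s, retry a
# few times with a short delay -- then fail visibly, instead of leaving
# someone staring at a spinner for 5 minutes.
curl --connect-timeout 2 --max-time 5 --retry 3 --retry-delay 1 \
     http://bike-terminal.example.com/release-code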
Several people have asked for additional details. We just posted a quick follow-on:
[UPDATE] A central theme of the recent AWS issues has been the Amazon Elastic Block Store (EBS) service. We use EBS at Twilio, but only for non-critical and non-latency-sensitive tasks. We've been a slow adopter of EBS for core parts of our persistence infrastructure because it doesn't satisfy the "unit of failure is a single host" principle. If EBS were to experience a problem, all dependent services could also experience failures. Instead, we've focused on utilizing the ephemeral disks present on each EC2 host for persistence. If an ephemeral disk fails, that failure is scoped to that host. We are planning a follow-on post describing how we do RAID 0 striping across ephemeral disks to improve I/O performance.
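As a very rough sketch while that post is in the works (device names and filesystem choice here are illustrative, not our exact provisioning steps):

# Sketch only: stripe two ephemeral (instance-store) disks into one md
# device; actual device names depend on the instance type.
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
mkfs.ext3 /dev/md0
mkdir -p /data
mount /dev/md0 /data
# If this host dies, only this host's stripe set dies with it.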
A cursory inspection indicates that their nginx box, at least, is running on EC2 in NoVA. It takes a particular kind of person to want to tempt fate to such a degree by posting something like that while running on top of what can best be described as "a fluid situation".
I want to see an article about making use of not-perfectly-up-to-date backup databases in a different region. Why can't reddit dump a copy of their new articles and comments to the west coast every night, then, if the east coast dies, fire that up? Sure, it's missing a chunk of the latest day's data, but that has to beat either being completely down or jumping through the technical hoops required to keep separate regions in sync across the internet. Then collect new articles and comments on the backup for a while, and when the east coast is fixed, merge the new data back over and get back to business?
Ditto for any web 2.0 we-are-a-fancy-shared-commenting-blog service, or anything that is fundamentally time-based aggregation of information. Do database replication systems just not handle the concept of working with temporary gaps in the data?
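Mechanically it doesn't even need real replication; a dumb nightly cron job along these lines would cover a lot of it (database name, paths and bucket are made up):

# Hypothetical nightly job: dump, compress, ship to a bucket (or a
# standby host) in another region. Restore it only when the primary
# region is down, accept the missing day, merge back later.
DUMP=/tmp/nightly-$(date +%F).sql.gz
pg_dump articles_and_comments | gzip > "$DUMP"
s3cmd put "$DUMP" s3://example-backups-us-west/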
A lot of AWS stuff can't be transferred between regions. There's no way to move an EBS snapshot from the east to the west coast except to copy the thing across the public internet. Once it's over there on the west coast, to "fire that up" they'd have to launch app servers, database servers, cache servers, etc. whose configurations they'd have had to keep mirrored from their normal region. They'd need to get all those backups onto EBS disks without using the same snapshot features they probably automated in their main region, attach them to the right instances... For a team with a single sysadmin, it's not as simple as you make it sound.
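About the only way I know of to "move" a snapshot across regions right now is a raw copy over the wire, something like this (device names and host are hypothetical):

# Attach a volume created from the snapshot to an instance in us-east,
# attach an empty volume of the same size to an instance in us-west,
# then stream the raw device across. Slow, manual, and all over the
# public internet.
dd if=/dev/sdf bs=1M | gzip -c | \
    ssh user@standby.us-west.example.com 'gzip -dc | dd of=/dev/sdf bs=1M'

And that's just the data; the instances and configs on the other side still have to exist and be kept mirrored, as you say.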
nginx/0.9.2
You really want to make sure your shit works before you go boasting about how well it works. :)
EDIT: seems to be working now :P Interesting article once I got over the irony of it not working.