We managed to get the majority of the static images and JS running off CloudFront back then, but were always stuck serving HTML from our EC2 boxes (a combination of what was available from AWS at the time and the one-day minimum TTL CloudFront used to enforce). We put a lot of work into optimizing those pages so that they'd be super lightweight and could be served quickly. It was pretty fast, considering that we weren't able to put all of the static-like files into S3/CloudFront.
Now that you can host all of the static bits of your domain through S3/CloudFront, I'd love to give this another shot with a future project. With strongly-named resources (i.e., the resource name is the MD5 of its content) being served with one-year expiration times from Amazon's infrastructure, you could build a blazing-fast dynamic site without having to spin up a single EC2 box.
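The upload side of that is only a few lines; a rough sketch (boto3 and a made-up bucket name, so adjust to taste) of pushing a file up under its MD5 with a one-year Cache-Control:

    import hashlib
    import boto3

    def publish_immutable(path, bucket, content_type):
        # "Strong naming": key the object by the MD5 of its contents, so the
        # URL changes whenever the content changes and the old URL can be
        # cached essentially forever.
        with open(path, 'rb') as f:
            digest = hashlib.md5(f.read()).hexdigest()
        key = 'assets/' + digest
        boto3.client('s3').upload_file(
            path, bucket, key,
            ExtraArgs={
                'ContentType': content_type,
                'ACL': 'public-read',
                # One year is safe because the name is derived from the content.
                'CacheControl': 'public, max-age=31536000',
            })
        return key

    # e.g. publish_immutable('build/app.js', 'my-static-bucket',
    #                        'application/javascript')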
(My original comment was on the dupe here... http://news.ycombinator.com/item?id=4976475)
Specifically, it is both allowed to return 50x errors to requests, and often does. The documentation for the S3 API states that you should immediately retry; that's fine for an API where I can code that logic into my client library, but it is simply an unacceptable solution on the web. Maybe there are one or two exceptions, but I have simply never seen a web browser retry these requests: the result is that you just get broken images, broken stylesheets, or even entire broken pages. Back when Twitter served user avatar pictures directly from S3 the issue was downright endemic (you'd often load a page with 30 small images requested from S3, so every few pages you'd come across a dud).
Sure, it only happens to some small percentage of requests, but for a popular website that can be a lot of people (and even for an unpopular one, every user counts), and it is an error rate orders of magnitude higher than anything I've experienced with my own hosting on EC2. It is also irritating because it is random: when my own hosting fails, it fails in its entirety; I don't have some tiny fraction of requests from users all over the world failing.
Regardless, I actually have an administrative need to stop hosting a specific x.com to www.x.com redirect on some non-AWS hosting I have (the DNS is hosted by Route 53, etc., but I was left with a dinky HTTP server in Kentucky somewhere handling the 301). I figured "well, if it doesn't have to actually request through to an underlying storage system, maybe I won't run into problems; I mean, how hard is it to take a URL and just immediately return a 301?", but after just a few minutes of playing with it I managed to get a test request that was supposed to return a 301 returning a 500 error instead. :(
    HTTP/1.1 500 Internal Server Error
    Content-Type: text/html; charset=utf-8
    Date: Fri, 28 Dec 2012 07:19:24 GMT
(edit: Great, and now someone downvoted me: would you like more evidence that this is a problem?)
S3 has not been delivering. Here are a few reasons:
* S3 only provides read-after-write consistency for non-standard regions: http://aws.amazon.com/s3/faqs/#What_data_consistency_model_d... Since moving to US-West-1, we've had noticeably more latency. Working without read-after-write just isn't an option: users get old data for the first few seconds after data is pushed.
* CORS support is basically broken. S3 doesn't return the proper headers for browsers to understand how objects should be cached: https://forums.aws.amazon.com/thread.jspa?threadID=112772
* Oh, and the editor for CORS data introduces newlines into your config around the AllowedHost entries that BREAK the configuration, so you need to manually delete them every time you make a change (a sketch of setting the config through the API instead follows this list). Don't forget!
* 304 responses strip out cache headers: https://forums.aws.amazon.com/thread.jspa?threadID=104930... Not breaking spec right now, but quite non-standard.
* I swear, I get 403s and other errors at a higher rate than I have from any custom store in the past. But this is purely subjective.
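One way to sidestep the console editor entirely is to push the CORS config through the API; roughly (boto3, with a hypothetical bucket and origin):

    import boto3

    # Hypothetical bucket and origin; the point is that pushing the rules
    # through the API avoids the console editor mangling the allowed-origin
    # entries.
    boto3.client('s3').put_bucket_cors(
        Bucket='my-assets-bucket',
        CORSConfiguration={
            'CORSRules': [{
                'AllowedOrigins': ['https://www.example.com'],
                'AllowedMethods': ['GET', 'HEAD'],
                'AllowedHeaders': ['*'],
                'MaxAgeSeconds': 3000,
            }],
        })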
Based on all this, I really have to agree with saurik that the folks at S3 aren't taking their role as an HTTP API seriously enough. They built an API on HTTP, but not an API that browsers can successfully work with. Things are broken in very tricky ways, and I'd caution anybody using S3 on the front end of their application to consider the alternatives.
I'm moving some things to Google Cloud Storage right now, and it is blazing fast, supports CORS properly, and has read-after-write consistency for the whole service. Rackspace is going to get back to me, but I expect they could do the same (and they have real support).
> CORS support is basically broken. S3 doesn't return the proper headers for browsers to understand how objects should be cached:
The S3 team is working to address this. We're investigating the other issues and always appreciate your feedback.
I cannot replicate the really-high 500 rate anymore on 184.108.40.206 (the node that was particularly bad). However, I'm still concerned about what caused that: Is it likely to happen again? Why did it only happen to that one node? (In essence: help me trust this system ;P.)
However, what I'm most interested in is whether the "static website hosting" endpoint of S3 (the *.s3-website-us-east-1.amazonaws.com URLs) has different semantics than S3 normally does, so that under "normal interaction" scenarios I can rely on "this will do its best to not return a 500 error, retrying if required to get the underlying S3 blob".
Have you tried setting a redirection rule on your bucket so that when the 500 error occurs, S3 will automatically retry the request? You can set a redirection rule in the S3 console, and I think the following rule might work:
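(Sketched here with boto3 and a placeholder bucket name; the console takes the equivalent XML.)

    import boto3

    # Placeholder bucket name; the console accepts the same rule as XML.
    # On a 500 from the website endpoint, S3 answers with a 302 back to the
    # same key, so the browser immediately re-requests it.
    boto3.client('s3').put_bucket_website(
        Bucket='www.example.com',
        WebsiteConfiguration={
            'IndexDocument': {'Suffix': 'index.html'},
            'RoutingRules': [{
                'Condition': {'HttpErrorCodeReturnedEquals': '500'},
                'Redirect': {'HttpRedirectCode': '302'},
            }],
        })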
This will redirect all 500s back to the same location, effectively retrying the request. This should cover the random 500 case, though I'm not sure it will work 100% of the time.
(That said, I have very little personal experience with CloudFront, as in my experience it is more expensive for fewer features and fewer POPs than using a "real CDN" like CDNetworks, or even Akamai.)
For this specific circumstance, I'm not certain at all what CloudFront's behavior will be; it seems like the "redirect" concept is a property of the "static website hosting" feature of S3, not part of the underlying bucket, and CloudFront "normally" (in quotes, as I just mean the default origin options it provides) directly accesses the bucket.
I therefore imagine that if I simply set a custom origin to the ___.s3-website-us-east-1.amazonaws.com URL provided by the S3 static hosting feature I will get the right behavior (where CloudFront forwards and caches the 301 responses), but then I have no clue whether it will correctly retry the 500 error responses.
That said, I will point out that I am not even certain whether CloudFront retries 500 responses from its origin anyway: with a small error rate combined with a cache, if you expect (as I somewhat did) the potential fix to be S3-specific, you might simply never really "catch" an actual failing request in a test scenario.
It could then be that CloudFront retries all 50x failures (in which case if I set it up with a custom origin to the S3 static hosting URL you'd still get the retry behavior), but I somehow doubt that it does that (and just earlier I saw two requests in a row to S3 fail for these 301 redirects, so it might not even help).
Other "ALIAS" providers can't do real CloudFront apex support either. Their intermediate resolvers end up caching the CloudFront records without varying per client subnet.
(As for non-Amazon DNS with server-side aliasing support, it wouldn't be that bad, for this kind of use case: you are already taking a latency hit by returning the 301, and direct links will never target these URLs as they will have the canonical www. hostname, so if you just end up with an edge node near a geo-ip DNS server near the original user, it will be approximately good enough.)
No redirect, one S3 bucket, and they both serve up the same assets from CloudFront. And HTTPS works on CloudFront too.
Definitely impressed with how quickly it went. I muddled through setting up AWS DNS, S3 and CF through a bunch of blog articles. But it was well worth the time investment.
Likely I'll just write a post on my experiences as well once everything is said and done. I haven't had time over the holiday to figure out the AWS bucket policy, but my overall plan is to have a node-webkit shim that has a markdown editor for editing posts.
It should be relatively easy and would be a complete win for me, blog-post-wise, especially since my "drafts" would live in S3 themselves.
I host 1.5 million images on S3 and I've never had issues with 500 errors. I've also done extensive testing of S3 under load and it's simply amazing.
Frankly S3 is one of the few bits of infrastructure I DO trust.
I would definitely agree: it doesn't fail under load; for a while I was seriously using S3 as a NoSQL database with a custom row-mapper and query system (not well generalized at all) that I built.
However, this particular aspect is a known part of S3 that has been around since the beginning: that it is allowed to fail your request with a 500 error, and that you need to retry.
If you read through the S3 forums you will find people often commenting on this, you will find code in every major client library to handle it, and it is explicitly documented by Amazon.
"Best Practices for Using Amazon S3", emphasis mine:
> 500-series errors indicate that a request didn't succeed, but may be retried. Though infrequent, *these errors are to be expected as part of normal interaction with the service* and should be explicitly handled with an exponential backoff algorithm (ideally one that utilizes jitter). One such algorithm can be found at...
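(That sort of loop is trivial inside a client library; a minimal sketch, using boto3 with placeholder bucket/key names, would be roughly the following. Boto already does something similar internally; a browser, of course, will never do this for you.)

    import random
    import time

    import boto3
    from botocore.exceptions import ClientError

    def get_with_backoff(bucket, key, attempts=5):
        # Retry 500-series errors with exponential backoff plus jitter, as the
        # best-practices doc suggests. Bucket and key here are placeholders.
        s3 = boto3.client('s3')
        for attempt in range(attempts):
            try:
                return s3.get_object(Bucket=bucket, Key=key)['Body'].read()
            except ClientError as err:
                status = err.response['ResponseMetadata']['HTTPStatusCode']
                if status < 500 or attempt == attempts - 1:
                    raise
                time.sleep(random.uniform(0, 0.1 * 2 ** attempt))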
Regardless, the 2% failure rate on that one S3 IP endpoint is definitely a little high, so I filed a support ticket (I pay for the AWS Business support level) with a list of request ids and correlation codes that returned a 500 error during my "static hosting + redirect" test today. I'll respond back here if I hear anything useful from them.
Seems like the pros outweigh the cons but I'm probably missing something.
Very valid point, saurik, thanks for pointing out the extent of the problem. It is a dilemma. Seems kind of silly to have to run an instance just so the first hit goes through reliably for visitors, goddammit Amazon.
Edit: Wait a minute, maybe this could be solved with custom error pages which I think they support. :P
It sounds like Jeff has tracked down the specific issue in your case, so things are looking up :)
My buckets are in Ireland and I haven't seen 5xx problems. I'm not a heavy user, though.
I have one bucket that has 3,148,859,832 objects in it <- I got that number from the AWS Account Activity for S3, StandardStorage / StorageObjectCount metric. I apparently make 1-2 million GET requests off of it per hour. Yesterday, Amazon returned a 500 error to me 35 times, or 1-2 per hour.
That's about a 1 in a million chance of failure, but if you are serving 4 billion images out of S3 (assuming you mean # requests and not # objects), then that means that 4,000 of your requests failed with a 500 error. That's 4,000 people out there who didn't get to see their image today.
So, seriously: are you certain that didn't happen? That out of the billions of people you are serving images to off of Twitpic, that you don't have some small percentage of unhappy people getting 500 errors? Again: it is a small chance of failure, but when it happens the browser won't retry.
As I said: "it only happens to some small percentage of requests, but for a popular website that can be a lot of people (and even for an unpopular one, every user counts)" <- websites like ours serve tens to hundreds of millions of users billions of requests... one-in-a-million actually happens.
(edit: Also, I will note that you seem to be using CloudFront to serve the images from S3, which might be a very different ballgame than serving directly out of S3; for all we know, CloudFront's special knowledge of S3 might cause it to automatically retry 500 errors; for that matter, the "website" feature of S3 could be doing this as well, but I have yet to get word from Amazon on whether that's the case... just pulling directly from the bucket using the normal REST API endpoint does, however, return 500 errors in the way they document.)
We keep access logs to look for errors. The error rate is marginal.
B) The only reason I opted for "# requests" instead of "# objects" is because it let me put a hard figure on "number of people dissatisfied if you have a one in a million error rate". Let's say you are doing 4 billion image requests per hour (the time scale is actually irrelevant): at a 0.0001% error rate (which is what I get from S3), 4,000 users per hour are getting an error.
The amount of features that AWS release is astonishing given their size. Great work.
I'm not surprised that they aren't bothering with all that yet
You know, just like Amazon gift card balances applied to your consumer Amazon account.
* Transaction fees that people have to pay on their credit card
* Budgeting purposes
* Expense handling if you're buying the company services on a personal card (this happens a lot more than you'd expect)
* Company shared accounts where the finance/purchasing department wants to do the old invoice / payment dance - finance department gets invoice for $5000, pays it in the background, developers get on with their work.
* Applying a cap on your spending
* Lack of credit cards in many parts of the world
I'm sure there's dozens of reasons why you'd want to prepay, and it's clearly something Amazon want to support, judging by their staff's comments on the page.
"In this step, you will configure both buckets for website hosting. First, you will configure example.com as a website and then you'll configure www.example.com to redirect all requests to the example.com bucket."
FTA: "In the Amazon Route 53 Management Console, create two records for your domain. Create an A (alias) record in the domain's DNS hosted zone, mark it as an Alias, then choose the value that corresponds to your root domain name.
Create a CNAME record and set the value to the S3 website endpoint for the first bucket."
I've been using DNSimple for quite a while to host root domains on Heroku.
That's how the Obama fundraising website was hosted.
S3 doesn't have an EBS dependency, and has been pretty rock-solid for half a decade now.
Well, let's hope adding domain hosting to Amazon doesn't lead to further downtime.
S3 and CloudFront have supported custom subdomains for years.