
I serve over 4 billion images out of S3 and don't have any issues with 50x errors.



As I've stated elsewhere in this thread, this is documented behavior from S3. I also have billions of objects in S3, and I definitely get back 500 errors. I'm sorry, but even the CTO of Twitpic is not in a position to say "we push infinitely more data than you, so we know better", at least not for S3 ;P.

Honestly, I have to ask: would you know if some tiny percentage of your requests failed with a 500 error? I bet the answer is "no", as the chance that you wrote some JavaScript to look for a condition you probably didn't realize could happen is almost zero. I'd love to be surprised, however ;P.

(That said, as you are hosting "images", at least you could detect it with JavaScript and fix it, so one could thereby imagine a realistic reason why this would not be a serious problem for you; however, I'd argue that you are then treating S3 as an API, not as a static web hosting system.)
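(For concreteness, a rough sketch of what that client-side detect-and-retry could look like; nothing Twitpic-specific is assumed here, just the generic <img> error-handler approach, and it assumes the handler gets attached before the image finishes failing:)

    // Sketch: retry a failed <img> load once, on the assumption that the
    // failure was a transient S3 500 rather than a genuinely missing object.
    document.querySelectorAll('img').forEach(function (img) {
      img.addEventListener('error', function () {
        if (!img.dataset.retried) {
          img.dataset.retried = 'true';
          // Re-request the URL with a cache-busting param so the browser
          // actually issues a new fetch instead of reusing the failed one.
          img.src = img.src + (img.src.indexOf('?') === -1 ? '?' : '&') +
                    'retry=' + Date.now();
        }
      });
    });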

I have one bucket that has 3,148,859,832 objects in it <- I got that number from the AWS Account Activity for S3, StandardStorage / StorageObjectCount metric. I apparently make 1-2 million GET requests off of it per hour. Yesterday, Amazon returned a 500 error to me 35 times, or 1-2 per hour.

That's about a 1 in a million chance of failure, but if you are serving 4 billion images out of S3 (assuming you mean # requests and not # objects), then that means that 4,000 of your requests failed with a 500 error. That's 4,000 people out there who didn't get to see their image today.
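(Back-of-the-envelope, taking ~1.5 million GETs/hour as the midpoint of that range:)

    35 errors / (1,500,000 req/hr * 24 hr) = 35 / 36,000,000 ≈ 1 in a million
    4,000,000,000 requests * (1 / 1,000,000) = 4,000 failed requests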

So, seriously: are you certain that didn't happen? That out of the billions of people you are serving images to off of Twitpic, that you don't have some small percentage of unhappy people getting 500 errors? Again: it is a small chance of failure, but when it happens the browser won't retry.

As I said: "it only happens to some small percentage of requests, but for a popular website that can be a lot of people (and even for an unpopular one, every user counts)" <- websites like ours serve tens to hundreds of millions of users billions of requests... one-in-a-million actually happens.

(edit: Also, I will note that you seem to be using CloudFront to serve the images from S3, which might be a very different ballgame than serving directly out of S3; for all we know, CloudFront's special knowledge of S3 might cause it to automatically retry 500 errors; for that matter, the "website" feature of S3 could be doing this as well, but I have yet to get word from Amazon on whether that's the case... just pulling directly from the bucket using the normal REST API endpoint does, however, return 500 errors in the way they document.)


12:54:47 * saurik ('s [third] sentence still managed to feel a little more confrontational than he wanted, even with the ;P at the end; he was going for more of a funny feel)

13:40:15 < REDACTED> heh yeah


4 billion objects, not requests. Way more requests.

We keep access logs to look for errors. The error rate is marginal.


A) Do you define "marginal" as "one in a million"? ;P

B) The only reason I opted for "# requests" instead of "# objects" is that it let me put a hard figure on "number of people dissatisfied if you have a one in a million error rate". Let's say you are doing 4 billion image requests per hour (the time scale is actually irrelevant): at a 0.0001% error rate (which is what I get from S3), 4,000 users per hour are getting an error.

C) ... you aren't doing S3 static web hosting if you are keeping access logs, as the only parties that know about the request are the user's web browser and the S3 server. You can attempt to detect the error in JavaScript on the client, but you can't keep an access log. If you are logging requests made by your server, then the error rate is irrelevant, as you can just retry the operation.
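(And if your server is the one fetching from S3, the retry really is trivial; here is a rough Node-style sketch against the bucket's REST endpoint, where the bucket and key names are made up for illustration:)

    // Sketch: retry a GET on a 5xx, which is exactly what the browser will
    // never do for you on a static-hosted site.
    async function getWithRetry(url, attempts = 3) {
      for (let i = 0; i < attempts; i++) {
        const res = await fetch(url);
        if (res.status < 500) return res; // success, or a real 4xx: stop retrying
        await new Promise(r => setTimeout(r, 100 * (i + 1))); // small backoff
      }
      throw new Error('still failing after ' + attempts + ' attempts');
    }

    getWithRetry('https://example-bucket.s3.amazonaws.com/some-image.jpg')
      .then(res => console.log(res.status));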




