
I trust S3 a lot (in fact, there was a time I had >1% of all objects in S3; I have since deleted a very large percentage, but I believe I still have well over a billion objects stored).

I would definitely agree: it doesn't fail under load; for a while I was seriously using S3 as a NoSQL database with a custom row-mapper and query system (not well generalized at all) that I built.

However, this particular aspect is a known part of S3 that has been around since the beginning: that it is allowed to fail your request with a 500 error, and that you need to retry.

This is something people comment on regularly in the S3 forums, something every major client library contains code to handle, and something Amazon explicitly documents.

"Best Practices for Using Amazon S3", emphasis mine:

> 500-series errors indicate that a request didn't succeed, but may be retried. Though infrequent, these errors are to be expected as part of normal interaction with the service and should be explicitly handled with an exponential backoff algorithm (ideally one that utilizes jitter). One such algorithm can be found at...

-- http://aws.amazon.com/articles/1904
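In code, that boils down to something like the sketch below (my own illustration of retry-with-"full jitter", not the algorithm from that article; doRequest is just a placeholder for whatever S3 call you are making):

    // Sketch: retry 500-series responses with exponential backoff + full jitter.
    // `doRequest` is a placeholder for whatever S3 request you are issuing;
    // it is assumed to resolve to something with a `status` code.
    async function withRetries(doRequest, maxAttempts = 5, baseDelayMs = 100) {
      let res;
      for (let attempt = 0; attempt < maxAttempts; attempt++) {
        res = await doRequest();
        if (res.status < 500) return res; // success or client error: don't retry
        // sleep a random amount up to an exponentially growing cap ("full jitter")
        const cap = baseDelayMs * 2 ** attempt;
        await new Promise(resolve => setTimeout(resolve, Math.random() * cap));
      }
      return res; // out of attempts: surface the last 500 to the caller
    }

The important parts are that the cap grows exponentially and that the delay is randomized, so a burst of clients doesn't retry in lockstep.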

Regardless, the 2% failure rate on that one S3 IP endpoint is definitely a little high, so I filed a support ticket (I pay for the AWS Business support level) with a list of request IDs and correlation codes that returned a 500 error during my "static hosting + redirect" test today. I'll report back here if I hear anything useful from them.



2% failure rates are excessive, agreed - but why is the requirement to retry on 500 so off-putting? Virtually all APIs have this occur on some level, and you do the exponential backoff song and dance.

What am I missing that makes this such a showstopper with browsers? You can still do the backing off client-side with a line or two of JavaScript.

Seems like the pros outweigh the cons but I'm probably missing something.


You are still thinking about this as an API, with, for example, JavaScript and some AJAX. The use case here is zone apex static website hosting: if you go to http://mycompany.com/ and get a 500 error, the user is just going to be staring at an error screen... there will be no JavaScript, and the browser will not retry. As I explicitly said multiple times: for an API that is a perfectly reasonable thing to have, but for static website hosting it just doesn't fly.


Oh I see what you mean, you're concerned about the first bytes to the browser being faulty. Well, that 2% error rate is spread out across the total number of requests, so the likelihood of a user getting a 500 on his first hit should be significantly less than 2%. (But it does seem like it will still be way too high.)

Very valid point, saurik, thanks for pointing out the extent of the problem. It is a dilemma. Seems kind of silly to have to run an instance just so the first hit goes through reliably for visitors, goddammit Amazon.

Edit: Wait a minute, maybe this could be solved with custom error pages which I think they support. :P


I... what? That's not how percentages work.


No, it isn't, but that's not what I meant either. I'm assuming that not all requests are equivalent because of the nature of S3.


You're going to need to explain how the requests would differ. If anything I'd expect image files to be more cache-friendly and have fewer visible failures than the critical HTML files. An image might face the 2% failure chance only once or twice, followed by fifty error-free cache loads, while an HTML page faces the 2% failure chance on every single click.


The likelihood of a user getting a 500 on his first hit is the same as the user getting a 500 on his 100th hit - 2%.


Interesting. I guess in the case of static web hosting you could use onerror to deal with failed frontend requests and smooth out the broken images from the user's perspective. Though, as I say, it hasn't been a problem for me.
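Something roughly like this, for example (just a sketch; the retry query parameter is only there to make the browser actually re-request the file rather than give up):

    // Sketch: retry broken <img> loads a couple of times via their onerror handler.
    document.querySelectorAll('img').forEach(function (img) {
      var attempts = 0;
      img.onerror = function () {
        if (attempts++ < 2) {
          // cache-bust so the browser issues a fresh request
          img.src = img.src.split('?')[0] + '?retry=' + attempts;
        }
      };
    });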


Yeah, for images you can probably deal with that; but if your JavaScript doesn't load because the script request itself returned a 500, or the entire website doesn't load because of a 500 error... well, you're screwed. The use case here is zone-apex, whole-site static website hosting (whether just the canonicalizing redirects or the final web page: same issue).


I've just discovered that you can use the same technique for script and style tags too (though I'd rather not have to).
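For scripts, it ends up looking something like this (a sketch; '/app.js' is just a stand-in path, and the same pattern can be applied to dynamically created link tags for stylesheets):

    // Sketch: load a script and retry once if the request fails (e.g. a 500).
    function loadScript(src, retried) {
      var s = document.createElement('script');
      s.src = src;
      s.onerror = function () {
        if (!retried) loadScript(src + '?retry=1', true);
      };
      document.head.appendChild(s);
    }
    loadScript('/app.js'); // '/app.js' is a hypothetical path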

It sounds like Jeff has tracked down the specific issue in your case, so things are looking up :)



