
> Failure at every level has to be simulated pretty often to understand how to handle it

Keep in mind, S3 "fails" all the time. We regularly make millions of S3 requests at my work. Usually we see a failure rate of about 1 in 240K requests (mostly GETs), returned as 500 errors. However, if you're really hammering a single S3 node in the hash ring (e.g. from a Spark job), the failure rate climbs to around 1 in 10K, including SocketExceptions where the routed IP is dead.

Your code always needs to expect such services to fail, and set proper timeouts, backoffs, retries, queues, and dead letter queues.
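As a rough illustration of the retry and backoff part, here is a minimal Python sketch. It is not tied to any S3 client: `TransientError`, `call_with_retries`, and the `dead_letters` queue are hypothetical names made up for this example, and the delay values are arbitrary. Timeouts are assumed to be configured on whatever client `operation` wraps.

```python
import random
import time
from collections import deque


class TransientError(Exception):
    """Hypothetical stand-in for a 500 response or a socket error."""


def call_with_retries(operation, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep, dead_letters=None):
    """Run operation(), retrying transient failures with backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError as exc:
            if attempt == max_attempts:
                # Out of attempts: park the failed work for later inspection
                # (a plain in-memory deque here, standing in for a real DLQ).
                if dead_letters is not None:
                    dead_letters.append((operation, exc))
                raise
            # Exponential backoff with full jitter: sleep a random amount
            # between 0 and the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
```

Usage would look something like `call_with_retries(lambda: fetch(key), dead_letters=deque())`, where `fetch` is your own function that raises `TransientError` on a 500 or socket failure. The jitter is there so that many workers retrying at once (the Spark case) don't all hit the same node again at the same moment.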

If you retry the GET, does it usually succeed?

Yes, a retry has always worked (at least in us-west-2).

Sometimes it's a 404 for an object written 1 sec prior, other times it's an S3 node that died mid-request. A retry gets you to a different node.
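The 404 case is the odd one, since a "not found" normally isn't worth retrying. A small sketch of treating it as retryable for objects you know were just written, assuming a hypothetical `fetch(key)` function that raises a hypothetical `NotFound` on a 404 (the attempt count and delay are made-up values):

```python
import time


class NotFound(Exception):
    """Hypothetical stand-in for a 404 response."""


def get_recent_object(fetch, key, attempts=3, delay=0.5, sleep=time.sleep):
    """Fetch a just-written object, retrying a bounded number of 404s."""
    for attempt in range(attempts):
        try:
            return fetch(key)
        except NotFound:
            # Give up after the last attempt; the object may truly be missing.
            if attempt == attempts - 1:
                raise
            sleep(delay)
```

Only use this where the caller has reason to believe the object exists, otherwise every genuine miss costs the full retry budget.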
