So that is 6 hours of complete outage in the roughly 22 months since its beta opened. The lifetime outage is about 6/(30 * 22 * 24) = 0.00038 = 0.038%! I think building a system with 99.962% uptime is a pretty impressive achievement, especially for the poor engineers who woke up at 2am in Seattle to figure out what went wrong and get it back online. I think it is pretty cool.
Compare that with one of our PCs/Macs crashing. Even if I could rush to a Circuit City/JR store to get a replacement hard drive, I would probably spend the same amount of time just reviving my system, assuming I have a good habit of backing it up. If I don't, I will need to reinstall the operating system and applications, and I guess the downtime may be 24 to 48 hours.
So for a person without a good backup habit, if it takes 24 hours to get the system back once in 22 months, the uptime will be 99.848%!
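The arithmetic above can be checked with a short Python sketch. It uses the same 30-day month approximation as the estimate; the function and variable names are made up for illustration.

```python
HOURS_PER_MONTH = 30 * 24  # 30-day month approximation, as in the estimate above


def uptime_percent(downtime_hours: float, months: float) -> float:
    """Return uptime as a percentage of the whole period."""
    total_hours = months * HOURS_PER_MONTH
    return 100.0 * (1.0 - downtime_hours / total_hours)


s3_uptime = uptime_percent(6, 22)   # 6 hours of S3 outage in 22 months
pc_uptime = uptime_percent(24, 22)  # 24 hours to rebuild a PC without a backup

print(f"S3: {s3_uptime:.3f}%  PC: {pc_uptime:.3f}%")
```

Changing the second call to 48 hours gives the pessimistic end of the 24 to 48 hour guess.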
Yes, as a lousy programmer and lousy administrator myself, I am pretty impressed that a new system supporting HTTP DELETE and PUT under heavy load can reach three 9's. I am too ignorant to know of any WebDAV-based system that is as reliable under the same load. I am also amazed that the downtime record of my PCs/Macs is far worse than that of an immature S3! And I wonder when those systems will be as reliable as telecom's five 9's (about 5 minutes of downtime per year).
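For reference, here is a minimal sketch of how a number of 9's translates into a yearly downtime budget. It assumes a 365-day year, and the helper name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability with the given number of 9's."""
    unavailability = 10.0 ** -nines  # e.g. three 9's (99.9%) -> 0.001
    return MINUTES_PER_YEAR * unavailability


for n in (3, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes/year")
```

Three 9's works out to a bit under 9 hours a year, and five 9's to a little over 5 minutes, which matches the telecom figure quoted above.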
We use them commercially too, and I am very happy. For a system that is under development (and presumably iterating internally), support has been nothing short of fantastic.
"S3 is still more reliable than a couple of dedicated servers, though "
Maybe. I've had colo or dedicated servers since 2000, and the last time I had one fail in any way was in 2001. I move servers every 2-4 years to newer, faster hardware, but even so, my current uptime is longer than S3 has existed.
Phew, back up. Although the fact that it was possible for the entire network to go down is quite worrying.
S3 actually has an SLA (http://aws.amazon.com/s3-sla). If I'm reading that right, if S3 is completely down for more than about 40 mins in Feb (which it was - about 90 mins by my count) then we should get a 10% discount for this month. Is that right?
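A rough way to sanity-check the "about 40 mins" figure is sketched below. It assumes the SLA threshold is 99.9% monthly uptime (which is what 40 minutes implies), that downtime is counted as plain wall-clock minutes, and that February has 29 days; the SLA's own definition of uptime may well differ, and all names here are hypothetical.

```python
SLA_PERCENT = 99.9  # assumed monthly uptime threshold, inferred from the ~40 min figure


def downtime_budget_minutes(days_in_month: int, sla_percent: float = SLA_PERCENT) -> float:
    """Minutes of downtime allowed in a month before uptime drops below the threshold."""
    minutes_in_month = days_in_month * 24 * 60
    return minutes_in_month * (1.0 - sla_percent / 100.0)


def monthly_uptime_percent(downtime_minutes: float, days_in_month: int) -> float:
    """Uptime for the month, treating downtime as simple wall-clock minutes."""
    minutes_in_month = days_in_month * 24 * 60
    return 100.0 * (1.0 - downtime_minutes / minutes_in_month)


budget = downtime_budget_minutes(29)      # assumes a 29-day February
uptime = monthly_uptime_percent(90, 29)   # the ~90 minutes counted above
breached = uptime < SLA_PERCENT

print(f"budget: {budget:.2f} min, uptime: {uptime:.3f}%, below threshold: {breached}")
```

With a 28-day February the budget comes out about a minute and a half lower, so the conclusion does not depend on the leap-year assumption.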
Kathrin of The Amazon Web Services Team has posted some more specific details on the failure here. In summary, it seems their authentication service was overloaded.
Our site seems to be running fine (EC2/S3). We actually have all files currently on EC2 and backed up to S3 (we haven't checked to see if the backup is still working yet).
From the S3 logs on our EC2 instance (which saw no interruption): the first failure we had was at 4:25 this morning, with no success until 7:08, then mixed results, and now full success since 8:55.
It also didn't go down all the times the 37Signals web apps I pay for or the hosted FogBugz installation my company uses have gone down.
Desktop apps rock! :)