[Edit: That being said, there's also the statistic that every 100ms of latency costs Amazon 1% of sales. Imagine what 20+ minutes of "latency" would do. https://news.ycombinator.com/item?id=273900]
Brick-and-mortar stores found that out ages ago. They were closed for large parts of the day and night, and the customers just came back the next day.
If people didn't leave Tumblr and Twitter, with their constant massive outages (at some point in their lives), then why would they leave Amazon, a huge established player, over a few hours of outage?
Contrast that with going to Amazon and finding their site down or performing poorly: I have never gone to a physical store to buy something instead. If I was already going to order it online, I was already resigned to waiting a day or two for it to arrive.
So I seriously doubt Amazon has "orders of magnitude more users" than Twitter. For Tumblr, maybe, but I doubt that too (maybe just one order of magnitude).
Besides, all that's orthogonal to my point. Unless you mean that the reason Twitter and Tumblr have fewer users than Amazon is that people left those services due to outages.
When you get into the hundreds of millions of users, you've run out of early adopters and end up with users who are a little more demanding about uptime.
Dividing income by time doesn't necessarily give you the loss, especially since this seems to have no weighting for time of day or season. I doubt an outage right now has anywhere near the same effect it would have during the lunch break two weeks before Christmas.
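To put some numbers on that, here's a quick back-of-envelope sketch in Perl. All the figures (daily revenue, traffic weights) are made up for illustration; the point is just how far the naive revenue-divided-by-time estimate can drift once you weight by how busy the outage window actually is.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # All numbers below are invented for illustration.
    my $daily_revenue = 100_000_000;    # assumed daily revenue
    my $outage_hours  = 0.5;

    # Naive estimate: spread revenue evenly over 24 hours.
    my $naive_loss = $daily_revenue * $outage_hours / 24;

    # Weighted estimate: scale by how busy the window is relative to
    # average. Assumed weights: a 3am lull vs. a pre-Christmas peak.
    my %traffic_weight = ( overnight_lull => 0.2, holiday_peak => 4.0 );

    printf "naive estimate: \$%.0f\n", $naive_loss;
    for my $window (sort keys %traffic_weight) {
        printf "%-15s \$%.0f\n", $window,
            $naive_loss * $traffic_weight{$window};
    }

By that (invented) weighting, the same half-hour outage costs twenty times more during the holiday peak than during the overnight lull.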
Conversely, when AWS had issues, Amazon.com was not impacted.
Amazon.com != AWS. I'm curious to know whether AWS and Amazon.com innovations impact each other, and which one leads. I'd rather it be Amazon.com.
Anyone with an ounce of server knowledge knows it's impossible to keep a website up 100% of the time, so downtime at Amazon is understandable. But maybe the average Joe Manager is deciding between Rackspace and AWS and happens to visit amazon.com during this downtime. "If Amazon can't even keep their bread and butter running, how can I trust them with something like AWS?" he might say.
As far as I know Google has 100% uptime, so it's not impossible. It may not be 100% for every geographical location, but that's partly because of things Google can neither control nor make redundant.
If they can't keep their own server up, how can you trust them with yours?
An unfair argument, perhaps, but one that impacts them all the same.
It went down once in the last couple weeks as well, if I remember right.
Most of those purchases are surely made later; perhaps they lose some impulse buys, though.
Not all of Amazon runs on AWS, though. They use a service-oriented architecture, and many of the services do run on AWS.
A few years ago I built an automated test system in Perl, complete with a message bus and a message-listener container for running tasks on various servers. One of the automated tests I wrote had a component that would periodically (at random intervals) kill processes, unmount shared filesystems, offline interfaces, etc. to force failovers, then verify that all processes and resources had failed over, all tasks were reassigned to other nodes, and no jobs were dropped or stalled.
It is really the only way to ensure you've covered your bases: beating the shit out of your system repeatedly. It uncovered a bunch of big holes and some very obscure ones too, and once we got those fixed it ran pretty much flawlessly.
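For anyone curious, a minimal sketch of what that fault-injection component might look like. This is not the original code; the process name, mount point, and interface below are placeholders.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Each fault is a coderef; the targets are placeholders.
    my @faults = (
        sub { system('pkill', '-9', '-f', 'worker-daemon') },  # kill a process
        sub { system('umount', '-l', '/mnt/shared') },         # unmount shared fs
        sub { system('ip', 'link', 'set', 'eth1', 'down') },   # offline an interface
    );

    while (1) {
        sleep(30 + int(rand(570)));          # random interval: 30s to 10min
        $faults[ int(rand(@faults)) ]->();   # inject one random fault
        # ...then poll the cluster: verify every resource failed over and
        # every task was reassigned, with no jobs dropped or stalled.
    }

The random intervals matter: faults on a fixed schedule tend to only exercise whatever states the system happens to be in at those moments.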
I wonder if they lose money for a brief outage, or if people just delay their purchases. I seem to remember them graphing this somewhere.
I didn't submit this as a story because I didn't think anyone would care, given the recent call not to post downtimes. Given the #1 spot the story has now, it seems I should have. So do people care or not?
I get HTTP/1.1 Service Unavailable on the first two requests.
I got a 500 with the message: "We're very sorry, but we're having trouble doing what you just asked us to do. Please give us another chance--click the Back button on your browser and try your request again. Or start from the beginning on our homepage."
It's perfectly possible for AWS to keep running just fine, while Amazon the website bursts into flames.
I just got my shopping cart to load, but it took quite a long time. Maybe they're getting DOSed.
Nevertheless, they still have an application architecture which sits above the AWS substrate. It's perfectly feasible for them to have seriously fucked up a deployment that runs on top of AWS, which may be functioning just fine (and at least all of my services running out of us-east seem to be up and running).
Having seen a lot of the code that Amazon runs on, and having seen first-hand the scale that it runs on, I'll say this: it's not perfect, but it's remarkably well-engineered, and a hell of a lot better than most snarky HNers could do.
I know that almost all of my downtime comes from overengineering things. And I don't need to "patch my kernel" because my OS doesn't have kernel holes once a week. Linux isn't the only Unix OS out there.
Today, a lot of sysadmins believe that "LAMP" is a synonym for webserver, and consequently there are a bunch of webservers serving static content on a machine with way too many moving parts. Complexity is bad.
"Things should be made as simple as possible, but not any simpler." -- Albert Einstein
OP responded to "Amazon.com is down" with "this is a lesson in over-engineering" - which it isn't, because Amazon.com is most certainly not overengineered for its purpose (I've seen the code with my own two eyes).
Your response is "not everyone needs extensively engineered systems", which is true, but is a non sequitur from the previous posts.
...or your internet connection
You never patch or reboot your magic box either?
When very little changes and very little happens, uptime's a lot easier to accrue.