Amazon was Down (amazon.com)
101 points by crc321 on April 22, 2013 | 96 comments

A while ago, someone claiming to have worked at Amazon said that downtime doesn't really affect things as much as you'd think. He said most people simply just come back later.


[Edit: That being said, there's also the statistic that every 100ms of latency costs Amazon 1%. Imagine what 20+ minutes of "latency" would do. https://news.ycombinator.com/item?id=273900]

>A while ago, someone claiming to have worked at Amazon said that downtime doesn't really affect things as much as you'd think. He said most people simply just come back later.

Brick and mortar stores found that out ages ago. They were closed for large parts of the day/night and the customers just came back the next day.

If people didn't leave Tumblr and Twitter, with their constant massive outages (at some point in their life), then why would they leave Amazon, a huge established player, for a few hours' outage?

Funny enough, I've gone to a brick and mortar store to find it was closed, then went back home and bought the item on Amazon. I wanted it right then, but since I would have to wait until the next day anyway I just ordered it online.

Contrast that with going to Amazon and finding their site is down or performing poorly; I have never gone to the store to buy something instead. If I was already going to be ordering it online, I was already resigned to waiting a day or two for it to arrive.

I doubt many would leave Amazon permanently because of a service outage, but it probably does cost them something in impulse purchases, if the impulsive desire for the item passes during the downtime.

No, you would think that, but actually if you read the comment referenced above, what they actually found is that outages seem to have little effect on revenues. That's why it's so surprising, really. The implication seems to be that their customers don't actually spend a whole lot on impulse buys.

Tumblr and Twitter have orders of magnitude fewer users than Amazon.

Seeing that Twitter has 500 million users, an order of magnitude more would be in the range of 5 billion, and a second order of magnitude would be around 50 billion.

So I seriously doubt Amazon has "orders of magnitude more users" than Twitter. For Tumblr, maybe, but I doubt that too (maybe just one order of magnitude for it).

Besides, all that's orthogonal to my point. Except if you mean that the reason that Twitter and Tumblr have less users than Amazon is that they left those services due to outages.

Twitter has much better uptime than it used to, too.

When you get in the hundreds of millions of users, you've run out of early adopters and end up with users who are a little more demanding about uptime.

I imagine the difference between latency and downtime is that latency tends to occur every time you visit, while by definition, downtime is more rare. In other words, latency provides a bad experience, while downtime provides no experience.

The Oatmeal actually has a great comparison of how people react to latency vs how they react to downtime: http://theoatmeal.com/comics/no_internet

The latency vs. cost curve will not be linear. Beyond some point, a further increase in latency won't affect cost much.

Also, latency (a constant lack of resources) and down time (an extraordinary lack of resources) are two very different things. I wouldn't be surprised if some down time had little impact on sales, whereas latency has a lot.

True! Latency is also location-dependent, so with latency the network and the health of other external 'resources' also matter, whereas downtime is only a server issue. So they are two very different things.

This is true by my experience. I went online last night to buy a replacement garden hose. Was surprised Amazon was down. After confirming it wasn't just me (thanks isitdownforeveryoneorjustme.com), I gave up and ordered it this morning.

This is just a small complaint coming many hours after this link was posted. You linked to perhaps the most top-level URL Amazon has available for a temporary outage. This means a couple of things. 1) Hours later, the outage is over and I'm just hitting the home page, with no specific information about what you were reporting. 2) This specific link is, as far as I know, now no longer available for other stories. That may not matter in the long run, but it bears mentioning.

At an estimated loss of $31,000 per minute http://news.cnet.com/8301-10784_3-9962010-7.html?tag=nefd.to... I'm blown away that I see Amazon goes down so often. That certainly, in my mind, doesn't bode well for the brand of AWS.

I can't imagine that's a real loss, likely only deferred sales. What are you going to do if you can't buy your thing at Amazon? Drive somewhere? I think you'll try again later.

Dividing income by time doesn't necessarily give you the loss, especially since this seems to have no weighting for time of day and season. I doubt an outage right now has anywhere near the same effect it would have during lunch break two weeks before Christmas.
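To make the objection concrete, here is a rough sketch of how the naive figure shrinks once you weight for traffic and for sales that are merely deferred. Only the $31,000 per minute comes from the linked estimate; the 20-minute duration, the traffic weight, and the lost fraction are made-up illustrative values, and `estimated_loss` is a hypothetical helper, not anything Amazon has published.

```python
def estimated_loss(minutes_down, sales_per_minute,
                   traffic_weight=1.0, lost_fraction=1.0):
    """Rough outage cost.

    traffic_weight: traffic during the outage relative to the yearly average
                    (hypothetical; 1.0 means an average minute).
    lost_fraction:  share of affected sales that never come back
                    (hypothetical; 1.0 means nobody returns later).
    """
    return minutes_down * sales_per_minute * traffic_weight * lost_fraction


# Naive estimate: every minute is average and every sale is lost for good.
naive = estimated_loss(20, 31_000)

# Same outage at a quiet hour, assuming 90% of buyers simply come back later.
adjusted = estimated_loss(20, 31_000, traffic_weight=0.5, lost_fraction=0.1)

print(f"naive: ${naive:,.0f}, adjusted: ${adjusted:,.0f}")
```

With these invented weights the adjusted figure is a twentieth of the naive one, which is the whole point: the headline number depends almost entirely on assumptions nobody outside Amazon can check.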

Heroku is up: http://cl.ly/image/0B0U1K3Z342R.

Conversely, when AWS had issues, Amazon.com was not impacted.

Amazon.com != AWS. I'm curious to know when AWS or Amazon.com innovations impact each other, or which one leads. I'd rather it be Amazon.com.

I think the gp was saying that the brand (as in, the perception of AWS) will suffer, not the actual services.

Anyone with an ounce of server knowledge would know it's impossible to keep a website up for 100% of the time, so downtime at Amazon is understandable, but maybe the average Joe Manager is deciding between Rackspace and AWS and happens to visit amazon.com during this downtime. "If Amazon can't even keep their bread-and-butter running, how can I trust them with something like AWS?" he might say.

> Anyone with an ounce of server knowledge would know it's impossible to keep a website up for 100% of the time

As far as I know Google has 100% uptime, so it's not impossible. May not be 100% for every geographical location but that's partly because of things Google cannot control nor make redundant.

Even if Amazon.com != AWS, it is still bad for the Amazon Brand which encompasses AWS.

If they can't keep their own server up, how can you trust them with yours?

An unfair argument, perhaps, but one that impacts them all the same.

Apparently, Amazon.com switched to AWS in 2011 http://www.quora.com/Amazon/Does-Amazon-com-use-Amazon-AWS.

I was at the AWS Summit in NYC last week and Werner claims that they only completed the transition of retail to AWS fairly recently: http://www.youtube.com/watch?v=oo1W92Teqx4

Does Amazon really go down that often? Is there any data on how often, and for how long, Amazon goes down? I wonder how it compares to other sites that get the same amount of traffic.

I think it's the second time this year - http://money.cnn.com/2013/01/31/technology/amazon-down/index...

It probably doesn't, but you notice every time it does.

It went down once in the last couple weeks as well, if I remember right.

FWIW, I did not notice that.

It does not go down that often, and when it does it's measured in minutes. This latest outage was maybe ~10 minutes total...

Is it a real cost (and how can you know that?) or just a naive extrapolation, sales_per_hour * hours_outage?

Most of those purchases are surely made later, though perhaps they lose some impulse buys.

$31K in '08. What is it today?

A very quick calculation (using AMZN's $61b net sales in 2012) yields about $116k per minute.
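For anyone who wants to check the arithmetic, the quick calculation is just annual sales divided by minutes in a year. This sketch assumes sales are spread evenly over every minute, which is exactly the simplification other comments here object to; the function name is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores that 2012 was a leap year


def naive_sales_per_minute(annual_net_sales: float) -> float:
    """Spread annual sales evenly over every minute of the year."""
    return annual_net_sales / MINUTES_PER_YEAR


# $61b net sales for 2012, the figure quoted above
per_minute = naive_sales_per_minute(61e9)
print(f"about ${per_minute / 1000:.0f}k per minute")
```

Using 366 days for the leap year moves the result by well under one percent, so the "about $116k" figure holds either way.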

One thing that would contribute to extra cost is a large amount of advertising that they are paying per click that ends up leading to a page being down.

Amazon.com retail website does not run on AWS.

The Amazon.com website has actually been running on AWS for the last 2-3 years.

It sure didn't when I was there three years ago, but maybe they moved quickly.

Not true...

It runs 100% on AWS.

Not all of amazon runs on AWS though, since they use a service oriented architecture, but many of the services also run on AWS.

Judging from alpb's post, they either currently work there, or did in the past. "retail website" being the key phrase used frequently internally to Amazon.

AWS is working at unprecedented scale and is definitely pushing the envelope. This sort of stuff is inevitable and I don't think it's inherently a bad thing.

Somewhat off-topic: my (limited) experience with Amazon Prime video suggests it's significantly less reliable than Netflix or iTunes (neither of which are stupendously reliable, but I'd say Netflix is by far the most reliable of the three). Hulu might actually be worse than Amazon Prime.

I don't know if you saw this posted on HN, but Netflix test their system really well. They use the so-called Chaos Monkey [1], which shuts down random servers on a whim. This allows them to detect and get rid of dependencies, i.e. tolerate failures in other parts of the system.

[1] http://techblog.netflix.com/2011/07/netflix-simian-army.html

Wish I had thought of chaos monkey, much cooler than whatever I called my version of it.

A few years ago I built an automated test system in perl, complete with message bus and message listener container for running tasks on various servers. One of the automated tests I wrote had a component that would periodically (at random intervals) kill processes, unmount shared filesystems, offline interfaces, etc. to cause failovers, to verify that all processes and resources were failed over, and all tasks were reassigned to other nodes and no jobs were dropped or stalled.

It is really the only way to ensure you've covered your bases - beating the shit out of your system repetitively. It uncovered a bunch of big holes and some very obscure ones too, and once we got those fixed it ran pretty much flawlessly.
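A toy version of that kind of test, for illustration only. This is an in-memory simulation, not the Perl system described above and not Netflix's Chaos Monkey: the `Cluster` class, the round-robin failover rule, and the node and job names are all invented. The idea it shows is the same, though: kill something at random, then assert that every task is still assigned somewhere.

```python
import random


class Cluster:
    """Toy in-memory cluster: nodes hold tasks, and failover moves them."""

    def __init__(self, node_names, tasks):
        self.nodes = {name: set() for name in node_names}
        for i, task in enumerate(tasks):
            # Round-robin initial placement
            self.nodes[node_names[i % len(node_names)]].add(task)

    def kill(self, name):
        """Simulate a node failure and reassign its tasks to the survivors."""
        orphaned = self.nodes.pop(name)
        survivors = sorted(self.nodes)
        if not survivors:
            raise RuntimeError("no nodes left to fail over to")
        for i, task in enumerate(sorted(orphaned)):
            self.nodes[survivors[i % len(survivors)]].add(task)

    def running_tasks(self):
        return set().union(*self.nodes.values())


def chaos_test(cluster, expected_tasks, rounds, rng):
    """Kill random nodes; after each kill, check that no task was dropped."""
    for _ in range(rounds):
        if len(cluster.nodes) <= 1:
            break  # always keep at least one node alive
        victim = rng.choice(sorted(cluster.nodes))
        cluster.kill(victim)
        assert cluster.running_tasks() == expected_tasks, (
            f"tasks lost after killing {victim}"
        )
    return True


tasks = {f"job-{n}" for n in range(10)}
cluster = Cluster(["a", "b", "c", "d"], sorted(tasks))
ok = chaos_test(cluster, tasks, rounds=3, rng=random.Random(0))
print("all tasks survived:", ok, "| nodes left:", sorted(cluster.nodes))
```

A real harness would kill actual processes and unmount real filesystems; the seeded `random.Random` here just keeps the toy run repeatable.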

The interesting problem is that the underlying AWS system seems to come up with more and more interesting failure modes due to system complexity, that the testing could never catch. Like Netflix had a major outage recently on Christmas Eve 2012.


Isn't Netflix hosted on AWS?

Netflix is indeed running on AWS. Some details about their back-end here:


They also constantly kill machines (and replace them with fresh instances of the image) that participate in key load-balanced activities.

My experience is the opposite and I use both services regularly. At least one fairly recent event corroborates this [1].

1: http://gigaom.com/2012/12/25/christmas-eve-aws-outage-stings...

Wow. Something this big means I'll bet it's a networking issue.

I wonder if they lose money for a brief outage, or if people just delay their purchases. I seem to remember them graphing this somewhere.

Do we really need a front-page post every time a well-known site has a hiccup? It's bad enough getting it every time github does. What are you hoping for here? A thread full of me-toos?

Had downtime yesterday evening (12 hours ago) as well in the Netherlands. People from Germany were able to load the website (the .com version; .de worked at all times), and after two hours I was able to as well. Upon trying to add something to my cart it returned the same error 500 though, so that was still down the last time I checked (about 10 hours ago). I'm not sure if or when this was resolved.

I didn't submit this as a story because I didn't think anyone would care, given the recent call not to post downtimes. Given the #1 spot the story has now, it seems I should have. So do people care or not?

http://www.amazon.ca/ is OK.

I get Http/1.1 Service Unavailable on first two requests.

I got 500 with the message "We're very sorry, but we're having trouble doing what you just asked us to do. Please give us another chance--click the Back button on your browser and try your request again. Or start from the beginning on our homepage. "


aws != amazon.

It's perfectly possible for AWS to keep running just fine, while Amazon the website bursts into flames.

Does Amazon not run their website on AWS? I assumed (incorrectly, apparently) that AWS was originally built to allow Amazon to scale their own services. Is it really a separate product that they don't use themselves?

Just because the servers themselves are up doesn't mean that the software running on the servers is up (or that it's capable of handling the load).

I just got my shopping cart to load, but it took quite a long time. Maybe they're getting DOSed.

Just because Amazon is down doesn't mean the infrastructure is the reason. They did build AWS out of the technology they used to build Amazon, but it's unclear whether they use it directly or use an isolated set of services.

Amazon uses AWS to host Amazon.com (retail), although you can imagine that they dodge the billing structure and have quite a number of resources dedicated away from the main AWS fleets.

Originally, that's true, Amazon didn't run on AWS afaik. But I believe they do now.

Nevertheless, they still have application architecture which sits above the aws substrate. It's perfectly feasible for them to have seriously fucked up a deployment that runs on top of AWS, which may be functioning just fine (and at least all of my services running out of us-east seem to be up and running).

Just because a site is on AWS does not mean it cannot go down for its own reasons. There are more failures possible than infrastructure.

I thought the same. Werner Vogels, their CTO, said at the NYC cloud event that they moved amazon.com to it in 2010, and amazon.com international in 2011.

AFAIK the "Amazon runs on AWS" meme was originally a line of pure marketing tripe.

Incorrect, Amazon.com runs on AWS: https://news.ycombinator.com/item?id=5588012

AWS tech maybe, but you can be pretty assured that the data centers (or at least networks) are almost completely segregated.

Nope, it's all mixed outside of software segregation.

Umm. If AWS is working fine, why should the status page show anything of note?


I doubt that works when the site is up either. Many servers these days reject pings.

Doing fine here, both amazon.com and the AWS console.


I got a 500 error on amazon.com. Now just a timeout.

Can't get into AWS management console either.

I can't get to the management console or status.aws.amazon.com

Sounds like someone that is oncall is going to have a bad night. Amazon.de and .co.uk are up.

There are two types of websites, those that have suffered downtime, and those that will.

Seems like only the US market got goosebumps. UK looks fine and up.

And it's back up for me.

aws console seems fine to me. the amazon.com is down for sure.

Yeah same here. Amazon is throwing a "Http/1.1 Service Unavailable" but AWS console is fine.

Just successfully launched an instance, but I can't get my Prime video!

Amazon.com is back up

Use Windows Azure!

Back online

Browsing the site is nearly impossible at the moment.

It's a lesson in overengineering. At this point my $5 Pentium 3 server has a greater uptime than Amazon.

Your $5 Pentium 3 server isn't the largest retail website on the internet making $61 billion a year.

Having seen a lot of the code that Amazon runs on, and having seen first-hand the scale that it runs on, I'll say this: it's not perfect, but it's remarkably well-engineered, and a hell of a lot better than most snarky HNers could do.

But that's the point. Most people don't need anything that well-engineered. Compared to more traditional hosting solutions from quality providers, AWS has terrible uptime and at a much higher cost for the same amount of resources. Two VPS'es from two different providers in a simple failover configuration with an anycast DNS solution would be simpler, cheaper, and much more reliable.
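The intuition behind the two-provider setup can be put into numbers, with heavy caveats. This sketch assumes the two hosts fail independently and that failover is instant and flawless, both of which are optimistic; the 99.9% availability is a made-up figure, not a measurement of any provider.

```python
def combined_availability(*availabilities: float) -> float:
    """Probability that at least one host is up, assuming independent failures."""
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down


HOURS_PER_YEAR = 365 * 24

single = 0.999  # hypothetical availability of one VPS
pair = combined_availability(single, single)

downtime_single_h = (1 - single) * HOURS_PER_YEAR
downtime_pair_h = (1 - pair) * HOURS_PER_YEAR

print(f"one host:  {downtime_single_h:.2f} hours down per year")
print(f"two hosts: {downtime_pair_h * 3600:.0f} seconds down per year")
```

In practice the failover mechanism and the DNS layer are themselves things that can break, so the real gain is smaller than this idealised figure suggests.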

Wow, apparently that last comment really hit a nerve, as several people decided to downvote it, but not a single person actually refuted any of what I said. I was under the impression that downvotes were more to be used against trolling or flamebaiting, and not just opinions that people disagreed with. Considering everything I said is quite easy to verify as being true, this downvoting just strikes me as kind of intellectually dishonest. I expected better from HN.

Yeah, I think there's a misunderstanding somewhere. Some people think I believe a $5 computer could handle amazon.com's traffic, which is clearly preposterous.

I know that almost all of my downtime comes from when I overengineer things. And I don't need to "patch my kernel" because my OS doesn't have kernel holes once a week. Linux isn't the only Unix OS out there.

Today, a lot of sysadmins believe that "LAMP" is a synonym for webserver, and consequently there are a bunch of webservers serving static content on a machine with way too many moving parts. Complexity is bad.

"Things should be made as simple as possible, but not any simpler." -- Albert Einstein

I think the downvotes are because your post is somewhat off topic.

OP responded to "Amazon.com is down" with "this is a lesson in over-engineering" - which it isn't, because Amazon.com is most certainly not overengineered for its purpose (I've seen the code with my own two eyes).

Your response is "not everyone needs extensively engineered systems", which is true, but is a non sequitur from the previous posts.

Except your Pentium 3 server doesn't have to handle over 100 million unique visitors per quarter. :P

Unless your server needs to handle comparable amounts of traffic, it's not the same thing.

It doesn't. My server is appropriately engineered to its task.

Is that the sound of the power supply going on your Pentium 3...

...or your internet connection

You never patch or reboot your magic box either?

How is your hooptie server at all comparable to the largest online retailer?

When very little changes and very little happens, uptime's a lot easier to accrue.
