I got that bit, thank you :-) but I didn't understand why there might be a web page on Amazon named after bozos. Some sort of test page? Or a joke that has completely passed me by?
Funny. I followed the "down=yes" link and got the "Service unavailable" page, as expected.
Then I clicked on the "down=no" link for fun, and the page partially loaded for me. I refreshed it and got the whole front page loaded. And then one more time and got the "Service unavailable" again...
During my time as an engineer working on Amazon.com, we occasionally experienced outages of various lengths. One of the surprising details about these outages is that they really didn't result in any revenue loss. That is, it appeared that customers would simply wait until the website was available again to make their purchase. I would be surprised if that effect doesn't still happen today, especially with the availability of Amazon on a variety of platforms (e.g. customers are comfortable ordering from their phones when they can't get to the website from their desktop computers).
That's a really interesting observation and substantiates a suspicion I've had: people generally have a good idea in mind of what they're going to buy and about when they're going to buy it. If at a particular moment the opportunity doesn't present itself, they'll simply delay the purchase until it's possible.
This would apply more to purchases from a specific, exceptional vendor than to those which can be made from multiple providers. Say my usual lunch spot is closed or out of an item; I can just walk down the street to somewhere else (or a drugstore, etc.). However, if you're selling hard-to-find exclusive items, or we've got an established relationship and the item isn't something I need right now, I'll simply get it later.
On the macro scale, it makes me suspect that shorter interruptions to service don't have a significant regional financial impact.
Definitely. Selling on Amazon's marketplace, riding on the back of their reputation, we can hit sales numbers we can get nowhere near with our own website sales in the US market.
From a customer standpoint, their A-to-z Guarantee probably helps a lot.
Yeah, Amazon's got trust, an established relationship, excellent customer service/satisfaction, and a good payment management history, all of which are crucial in online commerce (and for all of that I still prefer not to use them, for what it's worth).
Musing over my own post, I can think of instances where the same would not be true. A financial trading platform in particular -- trades are already occurring at volume, and lost time would be lost trades.
That's exactly right. I'd wager that millions of people have bought from Amazon but no other online store. And I doubt any other online store selling physical goods in the US can beat them by that metric. It's not that people won't shop elsewhere, but that they literally aren't set up to do so. One-click shopping is a great way to keep grandma loyal to the store, because her son probably set it up, and he's not home now.
It was actually someone from Amazon themselves, reporting on A/B testing they did, who gave us the figure that 100ms of latency cost them 1% in sales. People might wait for Amazon to come back, but if it's slow to navigate back and forth to compare brands, or slow to return search results, or slow to just render the page, you can easily imagine losing some percentage of people, or just some percentage of what you could have sold to them.
That's coming from Google, who makes money off page views. Amazon has a different model, so I could believe the latency sensitivity of its business is different.
There is an important distinction here between latency and availability. It's possible Amazon could be very sensitive to small changes in latency, while availability isn't that big a deal. After all, a lot of retailers do fine with nightly outages lasting 12 hours or more.
I thought for sure I'd have missed it and this would be one of those reports where the service was back up before the story gained traction, but as of 12:07 PM US Pacific time I cannot navigate to Amazon's home page.
The amazing thing about this, for me, is being reminded that only a few years ago even the biggest sites had fairly frequent multi-hour outages; these days it's pretty rare for this sort of thing to happen, particularly on a retail or otherwise directly money-generating site.
I used to work for Amazon. Unless this was willful I doubt anyone is going to get fired for it. After all, you'd be firing the guy who's least likely to make such a mistake ever again.
Firing people for mistakes is so silly that no organization that puts any real thought into human resources would do such a thing. Even Apple didn't fire the guy who left a prototype at the bar.
In a factory in 1965, maybe, but no good employer is going to fire someone for making a mistake, no matter how costly.
Depends on the factory, and I doubt it's much more or less likely in the 21st century than it was in 1965. In all times and places, you have enlightened and unenlightened people. In all times and places you have good and bad leaders.
"In every time, in every place, the deeds of men remain the same."
The story of the aircraft mechanic who never mis-serviced a plane again after almost killing the pilot he worked for is cute, but screwing up big-time doesn't turn a (presumably) sloppy engineer into one who never messes up again.
I'm of the mind right now that the word "cute" really shouldn't be used in any other context than physical description. It's just condescending and rude.
The problem with your attitude is that it's based upon a premise that is almost never true: that screwups are caused by incompetence, and that they have singular (or overwhelmingly singular) sources.
Neither of these assumptions bears out in reality, and certainly not in our industry.
The vast majority of downtime events trace back to systemic failures, not a freak event, and are more often catalyzed by momentary lapses than long-standing incompetence. Do we penalize the tech who clicked the wrong link on a dashboard, or the guy who wrote the dashboard such that a critical action contains no safeties or confirmations? Or do we penalize the manager for not having any established documentation on protocols surrounding triggering critical actions?
The only reasonable stance here is to collectively take responsibility for the failure. It may feel good to hang someone out to dry, but in all likelihood their failure was only the final link in a long chain of failures that extended well beyond themselves.
You root cause what led to the event (going deeper than "a tech clicked on the wrong thing"), and you fix the root cause, and you move on.
To some extent I agree with you, but it is a slippery slope. If we always deny that one person really is a problem, we may retain a truly bad employee while building excessive safeguards that hinder productivity for others. In my experience this possibility is all too real.
A team of good people should learn from their mistakes and reduce hazards along the way. But bumper bowling is no fun for experienced players. It's a balance, and it does tend to shift as a company grows.
On the plus side, if they're willing to share, I bet this will be a very interesting postmortem. Presumably Amazon.com is one of the more bulletproof web properties in the world. Whatever could have occurred to take it down for nearly an hour (at this point) can only be interesting!
I can't compare to other web properties, but when I worked at Amazon, the store going down was a regular event. Something broke almost daily, though it was rare for the whole store to go down (e.g. you might not be able to search, or checkout might be down, etc.).
The store went through periods of relative stability and relative instability, and in the periods where it was not doing so well, it (or a major piece of functionality) would go down in some key area at least once a week, sometimes multiple times a week during the holidays.
While it's been several years and I'm sure they've improved reliability, the sheer mass of the store made it very slow to evolve. And as an ex-Amazonian, sometimes I go and check for bugs that were issues back in the day; several of them have come back over the years, which is not surprising given that the entire group working on the parts I worked on was disbanded after so many people were driven off by bad management. (A one-two punch in that case: a bad manager backed by another bad manager, neither of whom had any technical knowledge.)
At the time I worked there, large swaths of code in the store had no team responsible for them, because the team had been disbanded in one of the regular shuffles of employees. Amazon had a tendency to get a team together to do a feature, launch it, get the PR and the stock bump, then disband the team and put them on other projects. Of course some of these things stuck around if they were successful, but there was a lot of cruft from past efforts: local restaurant menus, the movie-times system, various "social shopping" features (a perennial favorite to try again and again). Hell, they used to have catalogs for mail-order merchants: scanned paper catalogs!
At the time, they were claiming that "AWS is what we built the Amazon store on!" (which was totally false; S3 was engineered completely separately from the store, and to its credit, since obidos and gurupa were crap. The only thing the store shared with AWS for at least the first several years was being hosted in some of the same datacenters.)
At least at the time I worked there, I'd call it a mess held together by the code equivalents of duct tape and baling wire.
One of the things Amazon excels at is customer service, so when these problems would impact the customer, their bacon was often saved by customer support fixing the problem manually (e.g. messed-up orders, etc.).
Granted, operating at Amazon's scale is no trivial matter. But Amazon is a retailer and a stock-marketing company (e.g. one of their primary products is Amazon stock) more than it is an engineering company.
I'm kinda amazed that people perceive them as a "tech giant" along with Google and Facebook. Shows the power of a good (actually, GREAT) side business like AWS. They get the credit for building something good and scalable with AWS, but of course it was a separate team led by a senior executive with enough political clout to shelter that team.
Amazon is a weird company, and it has lots of parts. Even at, say, Microsoft there can be a huge amount of variation from division to division and team to team in how things are run, the corporate-culture micro-climate, etc. At Amazon this is even more true: each team is substantially on its own, and while there is a certain amount of overarching corporate culture, every group is different and some groups successfully buck the trend.
They have one of the biggest software-driven logistics systems in the US, one of the biggest robotics deployments in the warehouse, AND they developed AWS on the IT side. Amazon's software is largely behind the curtain, but they are definitely a tech giant.
Except for the part where Gurupa enables scores of developers to build web apps that make hundreds of service calls yet emit results faster than the website we're using right now.
Not returning memory is different from a memory leak. Not returning memory means the memory footprint stays at its peak. A memory leak is a bug in the program that causes memory use to grow without bound. mzscheme certainly doesn't leak memory. HN leaks memory.
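To make that distinction concrete, here's a minimal sketch. It's in Python rather than mzscheme or Arc, purely for illustration, and the function names and loop counts are made up:

```python
import gc

# Case 1: peak footprint without a leak. A big temporary structure is built
# and dropped; the objects are reclaimed, but the process may hold on to the
# pages from its peak usage, so RSS can stay near the peak. Repeating this
# does NOT grow memory without bound.
def peak_but_bounded():
    big = [object() for _ in range(1_000_000)]
    del big
    gc.collect()  # objects are gone; the allocator may still keep the pages

# Case 2: a real leak. Live references accumulate forever, so memory use
# grows with every call and can never be reclaimed.
_leaked = []
def leaks():
    _leaked.append([object() for _ in range(1_000_000)])

for _ in range(10):
    peak_but_bounded()   # footprint plateaus at roughly one peak's worth
    # leaks()            # uncomment to watch the footprint grow each iteration
```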
As for why, it's easy: taxes. Much like Walmart rents its stores from an LLC it owns to write off the taxes and bring down the liability of its largest revenue sector, Amazon can write off their server costs since they can "rent" them from AWS LLC. While AWS makes a good chunk of change, it has nothing on amazon.com, so by making AWS its own entity (and even better for them that it's publicly available), they get a gigantic tax write-off, and AWS makes the capital expenditures, saving them taxes. All in all, the shell game must save Amazon millions, just like it does for Walmart.
It's actually surprising it isn't down more often—internally, everyone has write access to prod and the rule is that if you deploy something to prod you need to be able to roll it back.* Apparently, though, someone has failed on the second item.
* Or so I was told in a job interview with the big A a few years back.
I find this extremely hard to believe. (Not calling you a liar, but I think something must have gotten lost in translation).
The possibility for theft and fraud would be so massive if every dev at Amazon had write access to production that I find it nearly impossible to believe this is true.
Developers probably have access to most production systems. Credit card processing and source of truth on orders that get shipped are most likely segregated. (actually PCI dictates that physical and data access controls be in place so only essential employees can access card data)
Who cares if I can access the credit card processing system if I can insert random code elsewhere in the system that redirects you to my phishing page whenever you enter credit card information?
Given that you would be an Amazon employee with a solid audit trail leading back to you in that scenario, I'd say it's pretty likely you'd be caught and prosecuted rather quickly.
Yes, I and my coworkers could've sold the realtime trades of a petroleum multinational to the highest bidder, including ones that hadn't happened yet. That would've been easy, and would've been worth hundreds of millions to someone. Not getting caught and having your life ruined -- that was scary and would've been hard. Now, if I were working for a sovereign power, like China, and my life was there anyhow, then pulling stuff like that in the US wouldn't be so hard.
The bits that need high security such as production databases have extra layers of access and tracking. But most devs can push changes to the retail website.
One of the reasons I left Amazon was that I was given the job of deploying code regularly (about weekly) at 1am or so. One evening there was a problem due to another team's work, so it escalated and we spent 6 hours dealing with it. We rolled the change back right away, but for contractual reasons their code had to be fixed and deployed, and there was an interdependency. (Fortunately, it wasn't my team's mistake, but I had to be there to help test it, etc.) So, it's finally working at 7am, and I stuck around for 30 minutes to make sure it kept working before going to sleep around 7:45am.
I emailed my boss about it, and of course he was getting emails the whole while as the ticket's status was changing.
Still, the fact that I showed up at 10:15 for the 10AM meeting that morning was "unacceptable" and I got chewed out. (~2 hours sleep!)
I made the mistake of thinking that my HR rep might be someone to talk to about this, because I wasn't sure how to make it clear to my boss that it was kinda unreasonable (especially since I'd told him I'd be late for the meeting)... and that's when I found out that everything I told her was written up in an email and sent to him... resulting in my getting chewed out yet again, this time for going to HR!
The lesson: as a programmer, never work for a boss who can't program, or at least, be very wary of it!
I have to say, it sounds to me like the lesson isn't about bosses who can't program so much as "don't have a terrible boss". There are plenty of fields I know nothing about, but if I were managing people in one of them, I would expect that on 2 hours of sleep they wouldn't be effective, and I also wouldn't expect them to work both night and day shifts. It's common sense.
45 minutes of downtime so far, we're seeing mostly 503 responses with an occasional 200 getting through. We've seen a few other smaller outages for amazon.com in the past but this is definitely the longest in at least the last 3-4 years. Details at http://reports.panopta.com/amazon/server/96291
Interesting for all those people chasing "five nines": if 45 minutes today is their only downtime for the year, their annual uptime for 2013 will be just about 99.991%, i.e. barely four nines.
"5 nines" works out to about 5 minutes of downtime a year: very challenging to achieve.
For reference, 4 nines is about an hour, and 6 nines is only ~30 seconds of downtime a year!
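For anyone who wants to check those figures, here's the quick arithmetic as a sketch (a non-leap year is assumed; the 45-minute figure is the outage length reported upthread):

```python
# Allowed downtime per (non-leap) year for an availability target of N nines.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def allowed_downtime_minutes(nines: int) -> float:
    availability = 1 - 10 ** -nines
    return MINUTES_PER_YEAR * (1 - availability)

for n in (3, 4, 5, 6):
    print(f"{n} nines: ~{allowed_downtime_minutes(n):.1f} minutes/year")
# 3 nines: ~525.6   4 nines: ~52.6   5 nines: ~5.3   6 nines: ~0.5 (about 30 seconds)

# And today's 45 minutes, if it's the only downtime all year:
print(f"uptime: {100 * (1 - 45 / MINUTES_PER_YEAR):.4f}%")  # ~99.9914%, barely four nines
```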
It's true that their profit margins can be vanishingly thin, but that doesn't mean they don't make money.
For some classes of items, they can sell at cost and still make money, because their operations are allegedly so good that they can turn over the inventory before their own payment to the supplier is due.
For example, say Amazon buys a book today and payment is due to the publisher in 30 days. They sell the book tomorrow at cost. Now they get to sit on the full price of the book for the rest of the month. In fact, take that money and buy another book, and sell it right away too. Keep that up, and you have a very big pool of money always sitting in your bank account. Money that can be profitably invested in other activities.
Why would a publisher give them 30 days to pay? Because they're Amazon. It's good to be big.
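A rough sketch of the float being described, with completely made-up round numbers (the daily dollar figure and the payment terms are assumptions for illustration, not Amazon's actual numbers):

```python
# With a negative cash-conversion cycle, cash from sales sits in the bank for
# roughly (days_payable - days_in_inventory) days before the supplier is paid.
daily_cost_of_goods = 100_000_000   # hypothetical: $100M/day owed to suppliers
days_payable        = 30            # supplier payment due in 30 days
days_in_inventory   = 1             # the book sells the day after it's bought

float_days = days_payable - days_in_inventory
cash_float = daily_cost_of_goods * float_days
print(f"~${cash_float / 1e9:.1f}B sitting in the bank at any given time")  # ~$2.9B
```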
They are optimizing for market share and innovation rather than profits. It's a world domination thing.
I think it's awesome. Imagine if Google had run a bunch of low-rent punch-the-monkey display ads early on. It would have killed them. Facebook vs MySpace is another good example of what happens when you focus on long-term value creation versus short-term profit taking.
With its low profit margins, Amazon leaves virtually no room for a smaller competitor to dislodge it from its share of the market.
Say you want a bite of the tablet market dominated by Apple: that's comparatively easy; make a somewhat decent tablet for cheap and there you have it.
If you want a bite of an Amazon-dominated market, well, good luck with that, and while you're at it, hope that Amazon isn't planning to get into the market you're in.
It seems their strategy relies on tiny margins. Maybe with a different set of circumstances Amazon would change their stance, but I don't think ramping up prices is currently part of their plans.
From Amazon's perspective, it's the only way to exist in the long term. If they keep costs and margins low, they don't give their competitors much breathing room to challenge them on price.
That's assuming their revenue is spread out evenly over every day and every minute, which is not the case. Think Christmas, deals, weekends, time of day, etc.
It also assumes that everybody who goes to buy something from Amazon while it's down ends up buying somewhere else, as opposed to simply waiting until later that day.
That's less than I expected! Back at the height of their popularity, I remember hearing that AOL could lose a lot of money every second their servers weren't showing ads.
Is there a name for the fallacy of ignoring marginal effects at the tail end of a probability distribution? I see it here incredibly frequently.
There will almost certainly be some number of people who would have stopped by Amazon right now and made some impulse purchases. At the scale Amazon operates, the amount of inconvenience it takes to push off the marginal purchase is almost certainly minuscule (see the frequent reports on how milliseconds of page-load time affect the likelihood of purchase).
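A hedged back-of-the-envelope version of that point, with invented round numbers just to show the shape of the calculation (none of these figures are real):

```python
# Marginal impulse purchases lost during an hour-long outage, with made-up inputs.
visitors_per_hour = 2_000_000   # hypothetical
impulse_rate      = 0.01        # fraction who would have bought something on impulse
never_return_rate = 0.2         # fraction of those who won't come back to buy later
avg_order_value   = 40.0        # dollars, hypothetical

lost = visitors_per_hour * impulse_rate * never_return_rate * avg_order_value
print(f"~${lost:,.0f} of marginal revenue lost per hour of downtime")  # ~$160,000
```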
I can attest to this being the case. I mentioned it elsewhere in the thread, but we're in e-commerce, and when walmart.com went down around Black Friday last year we saw a 20-megabit jump in traffic until their site came back up... and we're only one e-commerce provider out of many.
It's January, so not much is being lost here. We're looking at the slowest shopping season of the year. My unscientific estimate, based on previous experience in retail, would suggest sales are probably 1/50th of peak Thanksgiving/Xmas volume.
Still, downtime is money, even if it isn't a world-changing amount of it.
Makes sense: you're one reporter with little more information than everyone else before you know what the story is. Of course the thousands of people on HN are going to beat you to a story like this.
It appears to be just the homepage, but all deep links are unauthenticated. That is: if you were logged in before the site started misbehaving and you use a deep link, you're not logged in on the page that loads.
I remember being at a talk by Dr. Vogels last year, and him mentioning that *most* of the Amazon.com North America services had moved over to AWS by September 2011; many other services outside of NA were yet to move.
Running on AWS doesn't protect you from problems in the applications you're running on AWS. You can `rm -rf /` an EC2 instance and have plenty of problems.
I've been having odd behavior with DynamoDB all day. I wonder if it's related. The AWS Health Dashboard says things are fine, but I'm not so sure: http://status.aws.amazon.com/
Been a while since I've seen the amazon homepage down. Wow.
I know from the e-commerce side that when walmart.com went down last year we saw a traffic increase (enough to actually attribute it to the Walmart outage). I wonder if it'll happen here.
Not all abandoned sales are a loss for Amazon. Many customers will simply come back and buy another day because they don't know about the competition.
Free month of Thinkful to the team that's supposed to be keeping it up. Could have ordered something and had it delivered in the time it took to get back up...
My server with an Intel Atom, Windows 7, and a 100KB/s upload connection didn't go down from hitting the Hacker News homepage. It's laughable that any other website does. For Amazon, traffic from sites like Hacker News must be completely negligible.
Not sure what kind of statistics would be good and not violate the NDA =), but our internal rubyhackers mailing list is one of the busiest lists I'm on. I myself am not a Ruby fan, so I don't know any details.