My bet's on some unforeseen bottleneck that affects search and static pages. Almost everything within Amazon is crazy scalable, but there are some bits where you scale them up and their behaviour changes radically. For instance, a service's cache misses might skyrocket as customers get distributed over a wider set of servers, causing service response times to increase just a little bit on average, tipping a dependent service over into more frequent timeouts, causing its downstream service to blow a timeout-percentage 'software fuse' and stop using that service... etc etc.
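To make the 'software fuse' idea concrete, here is a minimal sketch of a timeout-percentage fuse. The class name, window size and thresholds are all made up for illustration; this is not how any Amazon service actually implements it, and it leaves out the reset/retry logic a real fuse would need.

```python
from collections import deque


class TimeoutFuse:
    """Trips when the share of timed-out calls in a sliding window gets too high.

    Hypothetical sketch: names and default thresholds are assumptions.
    """

    def __init__(self, window=100, max_timeout_ratio=0.2, min_calls=10):
        self.results = deque(maxlen=window)  # True means the call timed out
        self.max_timeout_ratio = max_timeout_ratio
        self.min_calls = min_calls  # do not judge on too few samples
        self.blown = False

    def record(self, timed_out):
        """Record one call outcome and blow the fuse if the ratio is exceeded."""
        self.results.append(bool(timed_out))
        if len(self.results) >= self.min_calls:
            ratio = sum(self.results) / len(self.results)
            if ratio > self.max_timeout_ratio:
                self.blown = True

    def allow_request(self):
        """Callers check this before using the dependency."""
        return not self.blown


def simulate(latencies_ms, timeout_ms, fuse):
    """Feed latencies through the fuse; return how many calls were attempted."""
    attempted = 0
    for latency in latencies_ms:
        if not fuse.allow_request():
            break  # fuse blown: the caller stops using the dependency
        attempted += 1
        fuse.record(latency > timeout_ms)
    return attempted
```

With a 100 ms timeout, a dependency that answers in 90 ms never trips the fuse, but nudging every fourth response to 105 ms is enough to blow it after a dozen calls. That is the "just a little bit slower on average" effect described above.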
Given that each of those services (and many more possibly-related ones) will have an on-call engineer paged into a conference call when the manure hits the rotating ventilation apparatus, there are going to be a lot of unhappy people cancelling their weekend plans right now. I definitely don't miss that aspect of the job!
> there are going to be a lot of unhappy people cancelling their weekend plans right now.
The week just started... are you saying that you can already anticipate the war room for an event like this to extend through this coming weekend?
OR there wasn't backpressure on a cascading failover so as services failed they increasingly failed to more and more overloaded systems
OR there WAS backpressure and it was the luck of the draw whether you were queued into an error page or got good data
OR the autoscaling couldn't keep up with the onsale window. This used to happen in ticketing a lot. Ticketmaster has a talk somewhere where they talk about warming the scaling load and server cache in anticipation of big ticketing onsales. The time it took to autoscale was just too long.
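For anyone unfamiliar with the backpressure option in the list above, a rough sketch: the service accepts work up to a fixed queue depth and rejects the rest immediately instead of piling it onto an already overloaded system. The class and method names are hypothetical, and the queue depth is arbitrary.

```python
import queue


class BoundedWorkQueue:
    """Accepts work up to a fixed depth and rejects the rest immediately.

    Hypothetical sketch of load shedding; not any specific system's design.
    """

    def __init__(self, max_depth):
        self._queue = queue.Queue(maxsize=max_depth)
        self.rejected = 0  # count of requests shed because the queue was full

    def submit(self, request):
        """Return True if queued, False if shed (caller shows an error page)."""
        try:
            self._queue.put_nowait(request)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def drain(self, n):
        """Simulate a worker completing up to n queued requests."""
        done = []
        for _ in range(n):
            try:
                done.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return done
```

With a depth of 3 and five requests arriving at once, the first three are queued and the last two are rejected. From the customer's side that is exactly the "luck of the draw" between an error page and good data.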
> Amazon used about 80X the capacity of their entire AWS public cloud
which is probably closer to "the capacity that is available via AWS is a tiny, tiny fraction of their overall computing power.. therefore adding it back in when things are falling over doesn't actually solve any problems."
1. he misspoke and meant "80%" of the AWS capacity, which I agree seems implausible.
2. Amazon does not run on AWS because Amazon is 80x more than all of AWS infrastructure. This also seems implausible because of Netflix. In fact, there's an article out there that said AWS exceeded Amazon's capacity within 1 quarter!
I still don't understand what that has to do with autoscaling exactly
Does Amazon expect a fix to be deployed ASAP after the immediate crisis is averted?
But, yes, unforeseen rate limits and size limits can cause many hilarious things to happen. I've seen a few good ones in my time. In particular, when somebody sets an upper size on an in-memory table and commits it thinking "Well, I added a few orders of magnitude safety margin - that should be enough for anybody", that's probably going to become an incident at some point in the distant future ;) With luck, the throttling or failure behaviour will only affect a few people, and it'll be spotted by looking at traffic graphs and noticing a very slightly elevated rate of service errors. If you're unlucky, though, when you hit the limit the whole service slows down, locks up, or just plain crashes, and something like this happens...
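A small sketch of the "lucky" version of that in-memory table: the size limit is explicit, hitting it fails only the request that needed a new entry, and a counter makes the limit visible on a graph before it becomes an incident. All names here are invented for illustration.

```python
class TableFull(Exception):
    """Raised when the table is at capacity and the key is new."""


class BoundedTable:
    """In-memory table with a hard entry limit that fails loudly but locally.

    Hypothetical sketch; a real service would export the counters as metrics.
    """

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = {}
        self.rejected_inserts = 0  # shows up as a slightly elevated error rate

    def put(self, key, value):
        """Insert or update; only a brand-new key can hit the limit."""
        if key not in self._data and len(self._data) >= self.max_entries:
            self.rejected_inserts += 1
            raise TableFull(f"table limit of {self.max_entries} entries reached")
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def utilization(self):
        """Fraction of capacity in use; alert well before this reaches 1.0."""
        return len(self._data) / self.max_entries
```

The point of `utilization()` is that somebody can alarm on it at, say, 70%, long before the "few orders of magnitude" of safety margin run out.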
"We are at the fair, I have 18 customers lined up but I can't take orders. Can you fix this now or should I start taking them on paper?"
You know the joke that atheists don't believe in god until the plane nosedives? That was the moment for me :)
"Oh, I... I need to see the error message before I can say anything. We don't have any errors logged. Hmm... Would you allow me to remote into your device?"
"Can't do that. We lost the internet connection here, sorry."
I can personally point to two friends who I consider top notch engineers and designers that have left Amazon because of its toxic culture. I'm sure I'm not the only one with these anecdotal examples; we've all heard the stories. At the end of the day, years of poor work/life balance, overly aggressive management and a frugal approach to everything make for a weak argument for A players to stick around.
Could this be an example of crumbling engineering standards at Amazon?
>Something no one has mentioned yet, could it be that the engineering force at Amazon is no longer what it used to be?
In many regards, yes. The bar had to be lowered to meet the demands of growth. We've also taken in a lot of hires from companies that have brought their culture and friends with them. The culture at Amazon is not what it was even 2 years ago. It is in many places day 2.
No one also seems to notice that Amazon retail often suffers widespread issues like this. We can count on SEV1's happening during peak as things blow up badly. This has happened several years in a row, and sadly the themes are pretty much the same across all: forgot to scale (yes...really) or some stupid system bottleneck. It doesn't help that Amazon retail has a good amount of its workforce based in India and seemingly disconnected from the Seattle based leadership.
For those unfamiliar with internal Amazon lingo, a big deal is made of it always being "day 1"
Not only that, but I've noticed that Amazon has started using way more contractors (not actual firms, but more Mechanical Turk/UpWork like contractors, if not just straight up misclassification) and agencies within the past year or so. I can't say that I'm surprised that Amazon is having technical issues now that they've grown in numbers but not in technical chops.
On a side note, it is a great time to become a recruiter in Seattle with all these agencies popping up. /s
Do you sense any attrition like that?
Could you expand on this a bit, please?
Mental health is enough of an epidemic that $company_with_exponential_scale is most definitely going to pick up a fair chunk of people with various issues, cognitive included.
Chances are this is even part of why the bar is a bit lower.
(Said as someone with mild high-functioning autism)
Do they really need 1 million servers? Many of my friends who work at other tech companies need so few servers in comparison, even with significantly high traffic, that it just screams massive inefficiencies...which seems wrong.
But I've never worked at Amazon so I wouldn't know.
But we didn’t handle free-form text search like Amazon. I can imagine that would necessitate a huge scaling up of compute and data.
and when has amazon ever been an engineering force? i have always felt the website and service experience is a relic of the 2000s. more often than not, i get the answer “our system can’t do that” from customer service.
I think Amazon has taken on an outsized image to many people that just isn't true. We have good engineers in many organizations, but we don't pay enough, have the right strategy, or take care of individuals well enough to lure the kind of great folks you find at other big tech companies. In many ways, Amazon is a retailer that does technology because it found a way to make money from it. The DNA is still MBAs/finance and retail.
It's true Amazon has some great engineers, but it is not very engineering-centric. I remember a senior engineer in Retail once comparing it to a plumbing system kept together with bandages.
Bottom line is Amazon is a product culture not an engineering culture and that makes it really easy to leave for Google or unicorns that really appreciate tech debt tradeoffs.
In simple terms, bar raisers are current Amazon employees that come in during the interview process to analyse candidates. They do this alongside their own full-time job, assessing as many as 10 candidates a week and spending 2-3 hours on each one.
In other words 20-30 hours per week on top of the full-time job? That doesn't sound quite right.
10 candidates in a week may happen at some kind of event, but then it isn’t 2-3 hours per candidate, and in that case you’d effectively be taking a couple of days off from your normal job.
I love the Amazon customer service. They’ve managed to crack a difficult problem and execute enough that other Giants haven’t come close to it yet.
GCP and Azure tail AWS by quite a bit. Amazon online retail is a Google search engine level monopoly now.
So Amazon can do a lot of things wrong, but I’d have to say they get the important parts right.
Not even particularly close. Amazon doesn't even have a majority of online sales, although it's getting close. They seem like a much bigger force than they are because of growth.
They got some parts right, some parts awfully wrong, and some are just irrelevant now. They make money, they are cheap + convenient, and that's usually what people focus on. They are not sophisticated, they are not great designers, etc.
A PM who has never opened an IDE in his or her life and who is only familiar with “coding” concepts through Wikipedia. They read books by Malcolm Gladwell, Daniel Kahneman, and Nassim Taleb and majored in one of the humanities. When they’re shown the webpage that the geeks created which loads in 2 seconds, they tell the lead developer that they want the loading time cut down to 1 second and the font to be changed.
I understand that it’s not always easily fixable. But honestly this looks pretty bad. Could be anything from bad code to a flakey cable.
I got out of high volume websites - too much marketing, unpaid overtime, and horrible work-life balance.
It is not a binary state of absolute destitution or top notch brilliance; it's a trend that can move one way or another and will show itself in more frequent outages, poorly rolled out products, lazy design, etc.
The plumbing can keep on working for a few years even after it begins to erode. Historically there are many examples of this.
Amazon has always been toxic and frugal, so obviously it didn't interfere with whatever mythical software quality it had in the past.
And today's failures might be because past engineers built unstable unmaintainable systems and then ran away.
And of course amazon is 10, 100, 1000 times larger tech system than it used to be.
Which in turn tells me they didn't test the failure case. Now, Amazon is a huge and complicated beast so I don't want to imply this was a "dumb" mistake, but (assuming I'm correct) it is a failure that risked making MORE failure, so it's not demand alone to blame.
As someone that's been going through quite a bit of depression because Amazon was the best offer I got (as opposed to Google or Facebook because I'm still bad at coding interviews) I'm afraid this is the straw that's going to break the camel's back for me. This is exactly what I was afraid of and exactly what I have waiting for me when I join next week.
There's no point anymore.
Why do I see literally everyone else have their dreams come true while mine don't?
Let me tell you: it will be OK, and the other posts about making the best of your situation are true. Amazon is an incredible learning opportunity if you stay open to it. I'm still working there a year later and I'm building far higher impact projects than my friends at Goog and FB because like another poster said, Amazon lets SDE1s work on just about everything. The growth potential is tremendous if you show you're competent and willing to learn.
Chin up, you're in a good position :)
Nowadays they have to be impressed by the project too...and mine's front-end development on an internal service nobody uses.
they have a lot of really amazing and smart people. but like anything in life, it's what you make of it. I'd say put your best into the position and try to learn as much as you can from others. no good reason to do less
There are a lot of amazing and smart people, like you said, but there is also a lot of stress, heartache, and trouble if you don't keep your ear to the ground and build a strong network of people to give you an early warning. Don't keep your head down and concentrate on tech and building cool stuff: Amazon can be way too political for pure techies to thrive without strong protection from management.
You need to realize that you're doing great: folks would kill to get a job at Amazon, the scale and the challenges are mind bending compared to what most other companies deal with, and the interview is grueling and you made it.
Technology is a word that describes something that doesn't work yet and Amazon thinks that you are a person that can help tip the balance.
Jokes aside, I admire the work of the team(s) responsible for Amazon's web site. I use it so often and encounter glitches so rarely that it really stands out when something does go wrong.
Now the question part: would Amazon ever secretly run Amazon.com in a multi-cloud setup, balancing between AWS, GCE, Azure, etc?
Nowadays, I hear it's quite different, and much of AWS is more rapidly dogfooded.
The requirements for prime day/black friday/cyber monday were mind-boggling.
There are plenty of good reasons why you wouldn’t run Amazon’s commerce on EC2, but I don’t think cloud availability is one of them.
One of Amazon Retail's big internal goals this year is "finally get everyone off of Oracle".
Bezos opposed the creation of AWS. Almost everything that was early AWS (EC2, S3) was done over the objection of Seattle leadership. Look where the teams were based.
After they shipped AWS, it took eighty zillion years for Retail to use anything AWS offered beyond S3, and, as mentioned above, it's not like they're 100% on DDB or RDS now: they have dependencies on freakin' _Oracle_ all over the place.
I mean, this is not a criticism of Bezos or Amazon at all, but at the end of the day, even if Bezos is a supergenius (and I see no reason to doubt that he is. Although I often wonder how he keeps himself motivated to continue to work so hard on building a Walmart competitor with the precious few years he has left on Earth given that he could do literally anything with his time), Amazon is still a company made of tens of thousands of people. It's not a 4 dimensional chess-ballet. It's got probably 95% of the same chaos and disorder that every big organization made of humans has. It just turns out that cutting it 5% in the right markets has incredible returns.
I can't see Amazon ever using external cloud hosting for anything except the most trivial of tasks. They're absolutely, utterly paranoid about any sort of confidential information, and I think even with encryption the perceived risk would be too high.
In my business I can put a hardware site
2 x intel gold 5115 10 core + 64 GB RAM
1 nvme @512G + soft raid1 @4TB magnetic
1 10G, 2 1G ether
storage or NAS with 60TB @RAID5 + 2x quad core low end xeon + 32 GB RAM = 16k
1G edge/core mngd switches + 10G SAN/LAN mngd switches
Endian firewalls + threat appliances
Colo with 2 year lease and 25 amps @208v
1g port speed and committed throughput > 100/mbps
= 16K yearly
68K one time cost for depreciating assets we maintain, provision and secure + 16K yearly recurring cost.
Or I can go AWS and modify my processing model, security expectations and service infra and spend 25K a year + 15K 1x migration cost.
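Taking the totals above at face value (68K one-time plus 16K yearly for the colo build, 15K one-time plus 25K yearly for AWS), the break-even point can be worked out with a few lines. This is only arithmetic on the quoted figures; it assumes flat yearly costs and ignores hardware refresh, staff time and any need to scale, and the function names are made up.

```python
def cumulative_cost(one_time, yearly, years):
    """Total spend after a number of years, assuming a flat yearly cost."""
    return one_time + yearly * years


def break_even_years(colo_one_time, colo_yearly, cloud_one_time, cloud_yearly):
    """Years until the colo's cumulative cost matches the cloud's.

    Assumes the colo costs more up front; returns None if the colo never
    catches up (its yearly cost is not lower than the cloud's).
    """
    yearly_saving = cloud_yearly - colo_yearly
    if yearly_saving <= 0:
        return None
    return (colo_one_time - cloud_one_time) / yearly_saving


# Figures quoted in the comment above, in dollars.
COLO = {"one_time": 68_000, "yearly": 16_000}
CLOUD = {"one_time": 15_000, "yearly": 25_000}

years = break_even_years(
    COLO["one_time"], COLO["yearly"], CLOUD["one_time"], CLOUD["yearly"]
)
```

On these numbers the colo only pulls ahead a little before the six-year mark (53K extra up front divided by 9K saved per year), which is longer than the 2 year colo lease mentioned above.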
And that's assuming you don't have any needs to quickly scale up or down and you are limited to 1 colo instead of the ability to expand to multiple regions like with AWS.
And that's not even taking into account the cost of the brain power to make sure your hardware stays up and running.
Doesn't sound like rolling your own stuff in a colo is a very good idea in this case. But that's job security if you are the sys admin I guess.
Although, as I said upthread, I agree that AWS is very likely ideal for this particular deployment size, let me try to dispel this oft-repeated myth.
Modern server hardware takes almost no "brain power" (or effort of any kind) to keep up and running.
We aren't living in the days of the early dot-com boom where Linux-on-Intel in the datacenter could mean flimsy cases, barely rack-mountable, with nary a redundant part to be seen.
Applying some up front "brain power", one can even choose and configure hardware in such a way as to provide things like server-level redundancy, if that's important and/or preferable to intra-server redundancy (think Hadoop), or the ability to abandon mechanical disks in place instead of ever having to replace one.
I am generally a strong proponent of using one's own hardware in a colo or on-premises, instead of or in addition to the cloud (primarily for "base" workload).
However, if the entirety of your needs can fit into a single rack, even I will advocate for AWS, since "convenience" is, perhaps, not strong enough a word.
I do think your server and storage prices are around $25k too high, but that's easy to do buying brand name and/or not negotiating with multiple vendors on price (which is particularly tough at low volume unless you're a startup with a credible growth story). That's assuming such an expensive CPU (in comparison to so little RAM) isn't foolishly profligate, along with the other hardware choices. Of course, this underscores the point (on which we agree) that, as a rule, it's just not worth that much time and effort for so little.
I'll take your word on the AWS pricing, as it's fairly predictable, if very tedious to perform the prediction. The main "gotchas" I've found people run into are forgetting to add in EBS costs for EC2 instance types without (or without comparable) local storage and underestimating data transfer costs.
No large vendors used in this example - thinkmate or aberdeen supermicro re-brands for due diligence and warranty.
No, I wouldn't suggest more chassis, as that's almost always more expensive (it's tough to break even on that $1k minimum buy-in on a server).
I believe your workload needs the resources you say. It just happens to be a remarkably rare ratio, hence my remark.
> No large vendors used in this example - thinkmate or aberdeen supermicro re-brands for due diligence and warranty.
The vendor doesn't have to be large to jack up the price. Any re-brand is super suspicious. To me, a large part of the point of a commodity server product is the reliability is predictable (and therefore easy enough to engineer for/around). Paying extra for "diligence", warranty, or hardware support is just flushing money down the toilet.
A fee for custom assembly and/or a basic smoke test is fine, but it had better be a flat rate per server and on the order of $100. Technician labor isn't that expensive.
Larger or "enterprise" vendors are merely the extreme version of this, with upwards of a 10x premium on something like storage arrays, especially if one includes
I agree with your cautions around supermicro resale but the warranty support and build diligence are absolutely necessary for a small business. Having a good business relationship with a trusted provider of hardware that always performs the first time is priceless.
I admit that, having an affinity for startups, rather than more traditional small businesses, I have a greater affinity for risk. Ironically, perhaps, I'm usually the voice of risk-aversion with respect to IT infrastructure, so I don't believe it affects my overall understanding.
I recently pointed out to an interviewer who was trying to convince me that it was worth spending half a megabuck on a petabyte from Netapp because it was "business critical" instead of 1/10th that amount for DIY, that, just like the DIY solution, Netapp does not indemnify the business against loss. One isn't buying insurance, only a bunch of technology.
Sure, "works the first time" is worth something. Is it worth the cost of a whole, complete, extra server on an order of qty 6? If the infant-mortality rate on servers is anywhere approaching 1-in-6 and they're being shipped somewhere that the replacement time and/or cost would be prohibitive, I'd still probably rather just order 7 servers instead.
That's my main problem with paying a vendor for "reliability": it's a very fuzzy, hand-wavy assurance. Paying for reliability with more hardware has data and statistics behind it, which is an engineering solution.
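To put numbers on the "order 7 instead of 6" argument, here is a quick binomial sketch. It assumes each server fails on arrival independently with the same probability, which is a simplification (a bad batch would correlate failures), and the function name is made up.

```python
from math import comb


def prob_at_least(needed, ordered, failure_rate):
    """Probability that at least `needed` of `ordered` servers arrive working.

    Assumes independent, identically likely failures (a simplification).
    """
    ok = 1.0 - failure_rate
    return sum(
        comb(ordered, k) * ok**k * failure_rate ** (ordered - k)
        for k in range(needed, ordered + 1)
    )


# Deliberately pessimistic 1-in-6 infant-mortality rate from the comment above.
p_order_six = prob_at_least(6, 6, 1 / 6)
p_order_seven = prob_at_least(6, 7, 1 / 6)
```

At that pessimistic failure rate, ordering exactly 6 gives roughly a one-in-three chance of having 6 working servers on day one, while ordering 7 roughly doubles that to about two in three. With a realistic failure rate of a few percent, the spare covers nearly every case, which is the kind of statement a vendor's "reliability" premium cannot be checked against.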
I'd hope for a more substantive reply, if anything.
Even if they're not sifting through your server data, they can possibly try to get a competitive advantage by analyzing things like usage data as someone above pointed out.
(I might have a detail wrong, this and a hundred other great telecom anecdotes are in The Master Switch.)
Always nice to make sure no parties you depend on have conflicting interests.
The asymmetry in trust is partly because those cloud providers are huge, and partly because those things are part of their core competency.
I'm not sure how secretive it would be. AWS 'bids' on Amazon.com's business and Amazon.com is under no obligation to use AWS as its cloud service provider.
Hell, even us regular joes can do that with services like Packet.Net or Vultr.
That's similar to all of the locks on all Walmart stores inexplicably getting stuck in the locked state for 2 hours at 6AM on Black Friday.
This happens every time AWS has an outage as well. Reddit is down, better sell AMZN.
I'm sure there is some impact, but it's nowhere near the inconvenience of being locked out of a physical store.
Tons of people that shop sales will just spend their money somewhere else if they miss a sale.
Presumably the price sensitive customer doesn’t just go elsewhere, right?
Maybe it's just me and my confirmation bias at work, but it seems that the core value proposition that Amazon provided -- high value, low margins on products -- has been eroding before our eyes.
Seems so much like the transition Microsoft made... too much focus on "synergies" and leveraging... not enough on keeping the bilges dry and the engine running.
It's funny... Fred Brooks wrote about this in 1975... and we're still making the same mistakes forty years later. There are real limitations to how quickly any organization can grow. Even awesome companies who are excellent at building organizations -- places like Amazon and Microsoft -- can't organize this law of software development away.
The companies that just keep the bilges dry and the engine running are the ones that we love, but they’re gone because they got made irrelevant. Or they got absorbed into something larger. Microsoft has a bunch of failed initiatives (Windows Phone, Zune) plus a bunch of successful ones (Azure, Xbox, Office 365).
If you’re up for classic books like Brooks check out The Innovator's Dilemma. You have to try to expand in a hundred different directions because you don’t know which one of those hundred directions will be relevant next decade, and you have to be unafraid of cannibalizing your core business because if you don’t eat yourself then someone else will eat you instead.
I think the hard part is to walk the line between stagnation & over-expansion. This is a dilemma we all face in organizations large and small as well as individually.
Building systems and processes (a.k.a Habits) that allow us to assume and integrate some new set of "stuff" without having to think about them (so we can move on to the next new "stuff") are what sets these companies apart; Amazon has been brilliant at this; however, from my perspective, looking at all the high-rise office buildings going up, it seems like maybe Icarus has flown too high...
But again, I could be building a narrative to fit my preconceived ideas... I'm definitely no authority here. Maybe this is just another blip... I think more troubling to me is the overall degradation in quality of the things I have formerly taken for granted in Amazon -- the quality of the products and ratings.
Summarizing the book here would be a bit of a disservice—but one of the points of the book is that there are economic reasons why companies focus on their most profitable core products, and there are economic reasons why that kind of focus can result in the company collapsing when the market moves forward. This isn’t some kind of imperative—the book isn’t saying, ”therefore, you should create a new company.” It’s more descriptive, “this is how big, successful tech companies can suddenly fail.”
The book has 2 chapters on that single point. And repeats it everywhere else.
On synergies, the OP post was about it, not about disruption.
Do they just do brute-force massive scale out?
Amazon's US market is big, but my understanding is that number of online users in China (> 400mil) exceeds the population of the US (~325mil), which makes me wonder if the folks there think about data architecture a little differently than we do.
Also, I just read that as of 2017, there are 700mil internet users in China, 90% on mobile. The scale there is just staggering.
I remember people were praising them as such when this outage happened: https://aws.amazon.com/message/41926/
Generally, the rule at Amazon is that any particular f*-up is forgivable... once. (Especially if you can show that you had preventative measures, documented procedures and redundancy in place.)
That said, there will be finger pointing and blame because you're dealing with human beings.
These engineers work at a world class company and are paid vast sums of money to not fuck things up. They live way better off than the majority of the country and their mere presence makes life more expensive and stressful for communities around them.
To suggest they cannot go a mere 48 hours or less without sleep on one of their company’s most hyped days is out of touch.
People need sleep to think straight, and no amount of money or responsibility is going to change that.
You're not saving lives, you're selling books and cat litter on the internet.
Disclaimer: My knowledge is based solely off of public reporting and first hand experiences of SWEs and TAMs no longer at Amazon/AWS.
For example, the pen test authorization request form can only be filled out from the root account.
If you're affected by this, please accept my unofficial thanks for your patience and understanding. (If you're a coworker in retail, good luck getting things up and running!) :-)
This whole thing about businesses "giving jobs" is ludicrous. American brainwashing. Businesses NEED workers. Not the other way around. Humans have always existed and survived for hundreds of thousands of years without "jobs". Businesses cannot survive without workers. Businesses would simply cease to exist without workers. People can still grow their own veggies and meat. Businesses can't generate profits without workers.
Not as an alternative means of even survival-level support if they don't have access to sufficient suitable real estate to do so.
Meanwhile Jeff Bezos' fortune is approximately $1,000 per American household.
Amazon needs to share its wealth much more, at least among its workers and also independent authors using it, and so on, and realize that it needs to support the ecosystem that allowed it, including things like more freedom and liberty for people, not less. This vast accumulation of wealth is bad news, even for capitalism itself, when you get right down to it. Bezos has a huge chance to make a stellar example to the world here, but so far...ummm...i mean why not?
It can't be that companies have crippled unions so that they can treat wage and working conditions as a one-sided negotiation. Surely not...