Internal documents show how Amazon scrambled to fix Prime Day glitches (cnbc.com)
53 points by FactolSarin on July 23, 2018 | 15 comments



Has "Sable" been spoken about externally before? I found a few details about it on a LinkedIn profile: https://www.linkedin.com/in/kokamd/.

"Sable has been hugely successful in the company: it provides storage and computation environment to over 850 teams, running 2450 unique applications, hosts 10,000 data-sets comprising of 2.05PB of (replicated) data across all retail regions. It handles about 2.2 trillion transactions per day with average client-side latencies of 3ms. In the first half of 2016, there were two outages of the cluster in North America due to operator errors, and this brought down multiple lines of business, not just Amazon.com. Leading the project to reduce SABLE blast radius. This involves segregating the data-sets and providing seamless fail-over to alternate storage solutions (for example DynamoDB) for data that is very critical such as Amazon catalog."

"SABLE is Amazon's e-commerce storage and computation platform. It hosts some of the most important applications running on Amazon.com. Altogether there are about 400 teams (including Shopping Cart, Items, Prices) that use SABLE within Amazon.com for high performance storage, caching, and new business object derivation services.

Part of the team that built SABLE from ground up. SABLE uses BDB (Berkley Database) for persistence and Libevent for non-blocking I/O. My specific feature contributions include: repartitioning support (add/reduce fleet size without any impact to live production traffic), SABLE reactors (computation platform for propagating prices, availability to the website). I optimized SABLE reactors by eliminating worklogs (which is the bookkeeping mechanism used to keep track of the amount of work that needs to be done in the system) which resulted in 30% reduction in our fleet size.

As a manager, I had five directs, and our team launched: Quality of Service for SABLE reactors (this helped in faster prices propagation to the website), Shopping Cart on SABLE platform (increased the availability of Shopping Cart application, more than what Oracle could provide)."


There are listings for Sable development on amazon.jobs: https://www.amazon.jobs/en/search?base_query=sable

> Our NoSQL storage platform processes more than 1 trillion transactions per day to serve Amazon country-specific and private-label websites and internal Amazon systems.


> "I'm confident we'll deliver an even better experience next year," he wrote in the email.

"Even better."

Ugh, I detest this kind of PR-written language. Acknowledge failure.

"We failed, and we're sorry we let down our customers. We'll try our best to provide a shopping experience that's up to our standards, as well as our customers' expectations, next year."

This is not that difficult.


How would you know?


Because it's surprisingly easy to empathize with. It really comes down to the fundamental pattern of choosing between doing the right thing and doing the easy thing. The right thing isn't nearly as difficult to do as it sounds.


(Note: my knowledge here is out of date as of 2015.)

Sable is relatively ancient and badly in need of deprecation. (If I remember correctly, as of 2011 it was sort of in a "please consider alternate storage solutions" mode.) It is the closest to a single point of failure I can think of at Amazon.

Hopefully this is the kick in the ass they need to move to something more robust.


It's amazing to me that CNBC was able to pull such a long piece out of this. The image they used of Bezos is funny too, as if he were ashamed or something. I'm guessing he had a particularly good Prime Day, since Amazon stock hit a new high.


So many jokes to be made about Amazon having to contact their AWS representative to get their limits raised, etc.


I would say the exact opposite. Shouldn't Amazon (retail) have to use the same escalation mechanism for AWS as any other user of AWS? If the escalation/support mechanism is good enough for Amazon Retail, then that's a strong selling proposition for anyone considering AWS for their mission-critical operations. And, on the flip side, if it isn't good enough for Amazon Retail, then that creates tremendous feedback pressure to fix it.


I can confirm.

My team at Amazon was a customer of AWS just like every other team at Amazon is. Every department/subsidiary at Amazon charges one another for services provided. We all operated as cost/profit centers.

In fact, as far as Amazon policies go, this may be the only one that actually satisfies every Leadership Principle [0].

[0] https://www.amazon.jobs/principles


I've seen this in other industries.

When I was in television, Westinghouse would charge the television stations it owned for satellite time. The result was that big news from small places often only got to the national or regional desks by Greyhound Package Express. And by then, the news was old, so it rarely made air on the national news.

It even happened on a local scale at some stations. The station's own promotions department had a budget and had to purchase commercial time on its own station. (Different company, not Westinghouse.)


Interesting - I wonder if there is a PM at Amazon making deals behind the scenes:

"Sure - I'll approve Steve's transfer to your department, as long as you give me 10K free compute hours for your API. Too pricey? I guess you don't want him then."


Not sure what you mean. Running Amazon retail on AWS demonstrates AWS' maturity and Amazon's commitment to having the same infrastructure for internal and external users, a bold and long-term-efficient strategy that many cloud providers are struggling to achieve.


Sable is run on non-AWS infra to prevent a single point of failure for the entire AWS side of the internet. They likely still have an auto-scaling system specifically for Sable, but without the large pool of AWS servers that could otherwise have been used to auto-scale.


Except Sable is not part of AWS, which is why they couldn't just scale up more servers.



