I’m also surprised at the general architecture of Kinesis. What appears to be their own hand-rolled gossip protocol (which looks clearly terrible compared to Raft or Paxos: a thread per cluster member? Everyone talking to everyone? An hour to reach consensus?), plus the front-end servers being stateful at all, breaks a lot of good design principles.
The problem with growing as fast as Amazon has is that their talent bar couldn’t keep up. I can’t imagine this design being okay 10 years ago when I was there.
I see where you're coming from with this, but you really have to wonder. It sounds more like the original architects made implicit assumptions regarding scale that, likely due to original architects and engineers moving on, were not re-evaluated by the current engineers on Kinesis as Kinesis grew. While it may take an hour now for the front-end cache to sync, I find it highly unlikely that it needed that much time when Kinesis first launched.
The process failure here is organizational, one where Amazon failed to audit its current systems in a complete and current manner such that sufficient attention and resources could be paid to a re-architecture of a critical service before it caused the service to fail. Even now, vertically scaling the front-end cache fleet is just a band-aid - eventually, that won't be possible anymore. Sadly, the postmortem doesn't seem to identify the organizational failure that was the true root cause of the outage.
AWS has more than enough learnings to avoid these "events". The problem is the whole culture is focused on delivering new stuff instead of preventing problems and improving existing systems.
Some folks made these decisions with the best intentions. I'm pretty sure they all got promoted and then moved on. Now, the people who inherited these systems have no incentive to fix them, because at most you'll receive a pat on the back. I'm also pretty sure that those teams talked about the shortcomings of the current architecture. The TODOs must be present somewhere in the backlog.
I don't think this is an AWS-specific problem, but we have to start treating people who prevent problems like the heroes they are. Everyone congratulates you when you put out a fire. No one gives a damn if you prevent the fire in the first place.
This is sort of comforting to hear, in that it shows Google's problems have reached Amazon too: no tech behemoth is immune to prioritizing promotion and glitz over the maintenance grind.
> No one gives a damn if you prevent the fire in the first place.
I would personally put emphasis on the "sort of" clause.
If bigcos with tons of resources can't align incentives to build solid software instead of deliver new features, who among us has a chance?
This gives the rest of us a chance to one day have our day in the sun. We can deliver solid software, precisely because we are not big. And then, perhaps one day we may have the choice of growing into a bigco ourselves, or staying small.
The same disease exists at other FAANGs and large tech companies. Nobody ever gets a promo for maintenance work and being on a sustained engineering team is seen as a career dead-end.
It's not all that hard. AWS heavily focuses on Service Oriented Architecture approaches, with specific knowledge/responsibility domains for each service. It's a proven, scalable pattern. The APIs will often be fairly straightforward behind the front end. With clear lines of responsibility between components, you'll almost never have to worry about what other services are doing. Just fix what you've got right in front of you. This is an area where TLA+ can come in handy too. Build a formal proof and rebuild your service based on it.
I joined Glacier 9 months after launch, and it was in the band-aid stage. In cloud services your first years will roughly look like:
1) Band-aids and emergency refactoring. Customers never do what you expect or predict, no matter how you price your service to encourage particular behaviour. The first year is very much keep-the-lights-on focused: fixing bugs and applying band-aids where needed. In AWS, it's likely they'll target a price decrease for re:Invent instead of new features.
2) Scalability, first new feature work. Traffic will hopefully be picking up for your service by now, and you'll start to see where things may need a bit of help to scale. You'll start working on the first bits of innovation for the platform. This is a key stage because it'll start to show you where you've potentially painted yourself into a corner. (AWS will be looking for some bold feature to tout at re:Invent.)
3) Refactoring, feature work starts in earnest. You'll have learned where your issues are. Product managers, market research, leadership, etc. will have ideas about what new features to work on, and you'll have much more of a roadmap for your service. New features will be tied in to the first refactoring efforts needed to scale to customer workloads, and to save you from that corner you've painted yourself into.
Year 3 is where some of the fun kicks in. The more senior engineers will be driving the refactoring work; they know what was done and why, and can likely see how things need to be. A design proposal gets created and refined over a few weeks of presentations to the team and direct leadership. It's a broad-spectrum review based around constructive criticism. Engineers will challenge every assumption and design decision. There's no expectation of perfection. Good enough is good enough; you just need to be able to justify the decisions you made.
New components will be built from the design, and plans for roll out will be worked on. In Glacier's case one mechanism we'd use was to signal to a service that $x % of requests should use a new code path. You'd test behind a whitelist, and then very very slowly ramp up public traffic in one smaller region towards the new code path while tracking metrics until you hit 100%, repeat the process on a large region slightly faster, before turning it on everywhere. For other bits we'd figure out ways to run things in shadow mode. Same requests hitting old and new code, with the new code neutered. Then compare the results.
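A minimal sketch of how that kind of percentage dial plus shadow mode can work (all names are hypothetical; the actual Glacier mechanism isn't public):

```python
import hashlib

def use_new_path(request_id: str, rollout_pct: float) -> bool:
    """Deterministically bucket a request so the same request always takes
    the same path, with roughly rollout_pct percent landing on the new code."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10000
    return bucket < rollout_pct * 100  # rollout_pct in [0, 100]

def handle(request_id, payload, old_path, new_path, rollout_pct=0.0, shadow=False):
    if shadow:
        # Shadow mode: serve from the old path, run the new ("neutered")
        # path on the same request, and compare results.
        result = old_path(payload)
        try:
            if new_path(payload) != result:
                pass  # emit a mismatch metric here in a real system
        except Exception:
            pass      # new-path failures must never affect live traffic
        return result
    path = new_path if use_new_path(request_id, rollout_pct) else old_path
    return path(payload)
```

Ramping up is then just raising `rollout_pct` while watching metrics, starting in the smaller region first, exactly as described above.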
side note: One of the key values engineers get evaluated on before reaching "Principal Engineer" level is "respect what has gone before". No one sets out to build something crap. You likely weren't involved in the original decisions, you don't necessarily know what the thinking was behind various things. Respect that those before you built something as best as suited the known constraints at the time. The same applies forwards. Respect what is before you now, and be aware that in 3-5 years someone will be refactoring what you're about to create. The document you present to your team now will help the next engineers when they come to refactor later on down the line. Things like TLA+ models will be invaluable here too.
The not-invented-here (or by-me) syndrome is probably also at play here.
Ah, Chesterton’s Fence strikes again!
Dat soundz like a bank and not a cloud provider.
The first stage in making something reliable, sustainable, and as easy to run as possible is to understand the problem, and understand what you're trying to achieve. You shouldn't be writing any code until you've got that figured out, other than possibly to make sure you understand something you're going to propose.
It's good software engineering, following practices learned, overhauled, and refined over decades, that have a solid track record of success. It's especially vital where you're working on something like AWS, Azure, etc. cloud services.
If you leap feet-first into solving a problem, you'll just end up with something that is unnecessarily painful down the road, even in the near term. It's often quicker to do the proposal, get it reviewed, and then produce the product than it is to dive in and discover all the gotchas as you go along. The process doesn't take too long, either.
Every service in AWS will follow similar practices, and engineers do it often enough that whipping up a proposal becomes second nature and takes very little time. Just writing the proposal is valuable in and of itself because it forces you to think through your plan carefully, and it's rare for engineers not to discover something that needs clarifying when they write their plan down. (Side note: all of this paperwork is also invaluable evidence for any promotion they may be wanting, arguably as much as actually releasing the thing to production.) Writing the proposal shouldn't take more than a day, and you'd only need a couple of meetings a few days apart to do the initial review and final review. Depending on the scope of what came up in the initial review, the final review may be a quick rubber-stamp exercise or not even necessary at all.
Where I am now, we've got an additional cross-company group of experienced engineers who can also be called on to review these proposals. They're almost always interesting sessions because they bring in engineers with a fresh perspective, rather than ones with preconceived notions based on how things currently are.
An anecdote I've shared in greater detail here before: years ago we had a service component that needed to be created from scratch and had to be done right. There was no margin for error; if we'd made a mistake, it would have been disastrous to the service. Given what it was, two engineers learned TLA+, wrote a formal proof, found bugs, and iterated until they got it fixed. Producing the Java code from that TLA+ model proved to be fairly trivial because it almost became a fill-in-the-blanks exercise. Once it got to production, it just worked. It cut what was expected to be a 6-month creation and careful rollout process down to just 4 months, even including time to run things in shadow mode worldwide for a while with very careful monitoring. That component never went wrong, and the operational work for it was just occasional tuning of parameters that had already been identified as needing to be tunable during design review.
In an ideal world, we'd be able to do something like what Lockheed Martin did for the Space Shuttle: https://www.fastcompany.com/28121/they-write-right-stuff, but good enough is good enough, and there are ultimately diminishing returns on effort vs. value gained.
The thread per frontend member definitely sounds like a problematic early design choice. It wouldn't be the first time I heard of an AWS issue due to "too many threads". Unlike gRPC, the internal RPC framework defaults to a thread per request rather than an async model. The async way was pretty painful and error prone.
Although, for front-end servers which just do auth, routing, etc., why is P2P gossip necessary for building the shard map? Possibly because retrieving configuration information directly from the vending service could be a bottleneck - but then why not gossip with a subset of peers, rather than with every peer plus the vending service, which is the source of truth?
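To illustrate the suggestion, here's a toy push-pull gossip round over a random subset of peers (the names and the versioned shard-map shape are assumptions for illustration, not Kinesis internals):

```python
import random

def gossip_round(my_map, peer_maps, fanout=3):
    """One round of epidemic gossip: instead of contacting every peer,
    exchange the shard map with a small random subset. Updates still
    reach all N nodes in O(log N) rounds in expectation."""
    for peer_map in random.sample(peer_maps, min(fanout, len(peer_maps))):
        # Pull: adopt any shard entries where the peer has a newer version.
        for shard, (version, owner) in list(peer_map.items()):
            if shard not in my_map or my_map[shard][0] < version:
                my_map[shard] = (version, owner)
        # Push: give the peer anything newer that we have.
        for shard, (version, owner) in list(my_map.items()):
            if shard not in peer_map or peer_map[shard][0] < version:
                peer_map[shard] = (version, owner)
    return my_map
```

With a fixed fanout, per-node connection count stays constant as the fleet grows, instead of growing linearly as it does in a full mesh.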
Seems like a relic of years gone by: https://patents.justia.com/patent/9838240
Why is it a "relic of years gone by"? Consul uses a similar, though more advanced technique. Consul may not be as widely used as etcd, but I don't think most would consider it a relic.
Kinesis uses chain replication, a dead-simple fault-tolerant storage algorithm: machines form a chain, data flows from head to tail in one direction, writes always start at the head, reads are served at the tail, and new nodes always join at the tail, but nodes can be kicked out at any position.
The membership management of the chain nodes is done through a Paxos-based consensus service like Chubby or ZooKeeper. Allan (the best engineer I've personally worked with so far, way better than anyone else I've encountered) wrote that system. The quality of the Java code shows itself at first glance. Not to mention his humility and openness in sharing his knowledge during early design meetings.
I'm not sure what protocol is actually used now, but I would be surprised if it's different, given the protocol's simplicity and performance.
It was chosen for future expansion. Kinesis was envisioned to be a much larger-scale Kafka + Storm (Storm was the stream-processing framework popular in 2012; it has since fallen out of favor).
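A toy sketch of the chain replication scheme described above (synchronous method calls stand in for network forwarding; real chain replication handles node failure and reconfiguration via the consensus-backed membership service):

```python
class ChainNode:
    def __init__(self):
        self.store = {}
        self.next = None  # successor in the chain, None at the tail

    def write(self, key, value):
        self.store[key] = value
        if self.next is not None:
            self.next.write(key, value)  # forward toward the tail

class Chain:
    """Writes enter at the head and propagate node-by-node; reads are
    served at the tail, so anything readable is fully replicated."""
    def __init__(self, length):
        self.nodes = [ChainNode() for _ in range(length)]
        for node, successor in zip(self.nodes, self.nodes[1:]):
            node.next = successor

    def write(self, key, value):
        self.nodes[0].write(key, value)       # writes start at the head

    def read(self, key):
        return self.nodes[-1].store.get(key)  # reads served at the tail
```

The appeal is exactly what the comment says: reads at the tail are trivially consistent, because a record only reaches the tail once every node in the chain has it.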
edit: They also had the most disorganized and decentralized interview approach of all the FAANG companies I talked with. Which isn't growing pains this far in; it's just bad management and process.
I interviewed as a new grad SWE and the process was totally straightforward, and way lower friction (albeit much less human interaction, which made it feel even more impersonal) than almost everywhere else I applied: initial online screen, online programming task, and then a video call with an engineer where you explained your answer to the programming task.
edit: You could blame the recruiter but every other company had a well oiled machine for their recruiters. So even if they provided only generic information there was still a standard process for what they provided.
My personal observation having known quite a few Amazon SWEs and interviewed them.
The bad rep is only for the junior roles. SWEs who work at AWS and are high L5+ are pretty solid.
Disclaimer: Am Amazon employee
Raft and Paxos are not gossip protocols - they are consensus protocols.
I came to the realization about a year ago, that there are definitely talent tiers and unless you are working super hard at recruiting and paying top dollar, the cliff edge approaches fast and is very very steep.
Assuming they have, say, 5000 front-end instances, that's 5000 file descriptors being used on each server just for this, before you even talk about whatever threads the application needs.
It’s not surprising that they bumped into ulimits, though as part of OS provisioning, you typically have those tuned for workload.
More concerning is the 5000 × 5000 open TCP sessions across their network needed to support this architecture. This has to be a lot of fun on any stateful firewall they might cross.
Just because you haven’t encountered it doesn’t mean it’s not there, it’s probably just properly tuned and balanced for the load.
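The back-of-envelope numbers work out like this (the fleet size is the commenter's hypothetical; the session total here counts each unordered pair of servers once, so a stateful firewall tracking both directions would see even more state):

```python
fleet_size = 5000  # hypothetical front-end fleet size from the comment

# One thread (and one TCP connection / file descriptor) per *other*
# member of the fleet, on every server, in a full mesh.
threads_per_server = fleet_size - 1

# Total sessions across the network: one per unordered pair of servers,
# i.e. quadratic growth in fleet size.
total_sessions = fleet_size * (fleet_size - 1) // 2

print(threads_per_server)  # 4999 -- easily past a default ulimit
print(total_sessions)      # 12497500
```

The quadratic term is why adding capacity made things worse: every new server adds a thread and a connection to every existing server.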
Now, they don’t know how it behaves, so they’re afraid to take corrective actions in production.
They built that before ensuring that they logged the result of each failed system call. The prioritization seems odd, but most places look at logging as a cost center, and the work of improving it as drudgery, even though it’s far more important than shiny things like automatic response to failures, and also takes a person with more experience to do properly.
I don't trust anything outside core services on AWS. Regardless of whether the rumor I heard is true, it's clear they appreciate quantity over quality.
If we're talking about the same thing then I think casting stones just because it is based on MySQL is severely misguided. MySQL has decades of optimizations and this particular system at Amazon has solved scaling problems and brought reliability to countless services without ever being the direct cause of an outage (to the best of my knowledge).
Indeed, MySQL is not without its flaws but many of these are related to its quirks in transactions and replication which this system completely solves. The cherry on top is that you have a rock solid database with a familiar querying language and a massive knowledge base to get help from when needed. Oh, and did I mention this system supports multiple storage engines besides just MySQL/InnoDB?
I for one wish we would open source this system though there are a ton of hurdles both technical and not. I think it would do wonders for the greater tech community by providing a much better option as your needs grow beyond a single node system. It has certainly served Amazon well in that role and I've heard Facebook and YouTube have similar systems based on MySQL.
To further address your comment about Amazon/AWS lacking quality: this system is the epitome of our values of pragmatism and focusing our efforts on innovating where we can make the biggest impact. Hand rolling your own storage engines is fun and all but countless others have already spent decades doing so for marginal gains.
Why all the mysterious cloak-and-dagger?
The more important takeaway is that building on top of MySQL/InnoDB is perfectly fine and that is what I tried to emphasize.
FB built a similar system to maintain their graph: https://blog.yugabyte.com/facebooks-user-db-is-it-sql-or-nos...
It’s a ton of tiny DBs that look like one massive eventually consistent DB
The relevant resiliency pattern in this case would be what they refer to as cell-based architecture, where within an AZ services are broken down into smaller independent cells to minimize the blast radius.
They specifically mention in the write-up that this was a gap they plan to address, the "backend" portion of Kinesis was already cellularized but that step had not yet been completed on the "frontend".
Cellularization in combination with workload partitioning would have helped, e.g. don't run CloudWatch, Cognito, and customer workloads on the same set of cells.
It is also important to note that cellularization only helps in this case if they limit code deployment to a limited number of cells at a time.
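A toy sketch of that kind of partitioning, with dedicated cells for the biggest internal callers and everyone else hashed across general-purpose cells (all names are hypothetical):

```python
import hashlib

# Hypothetical cell layout: a bad deploy or poison workload in one
# general cell only hits a fraction of customers, and the big internal
# callers can't take customer cells down with them.
DEDICATED = {"cloudwatch": "cell-cw", "cognito": "cell-cognito"}
GENERAL_CELLS = ["cell-0", "cell-1", "cell-2", "cell-3"]

def cell_for(caller_id: str) -> str:
    """Stable assignment: the same caller always lands in the same cell."""
    if caller_id in DEDICATED:
        return DEDICATED[caller_id]
    h = int.from_bytes(hashlib.sha256(caller_id.encode()).digest()[:4], "big")
    return GENERAL_CELLS[h % len(GENERAL_CELLS)]
```

Deployments then roll cell-by-cell, which is the other half of the point above: a bad change should stop at the first cell's alarms.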
This YouTube video of a re:Invent presentation does a great job of explaining it. The cell-based stuff starts around minute 20.
I definitely recommend checking out the video. Even if you have seen it before, rewatching it in the context of this post-mortem really makes it hit home.
Googlers would be quick to point out that Borg does this natively across all their services: https://news.ycombinator.com/item?id=19393926
Nearly all AWS services are regional in scope, and for many (if not most) services, they are scaled at a cellular level within a region. Accounts are assigned to specific cells within that region.
There are very, very few services that are global in scope, and it is strongly discouraged to create cross-regional dependencies -- not just as applied to our customers, but to ourselves as well. IAM and Route 53 are notable exceptions, but they offer read replicas in every region and are eventually consistent: if the primary region has a failure, you might not be able to make changes to your configuration, but the other regions will operate on read-only replicas.
This incident was regional in scope: us-east-1 was the only impacted region. As far as I know, no other region was impacted by this event. So customers operating in other regions were largely unaffected. (If you know otherwise, please correct me.)
As a Solutions Architect, I regularly warn customers that running in multiple Availability Zones is not enough. Availability Zones protect you from many kinds of physical infrastructure failures, but not necessarily from regional service failures. So it is super important to run in multiple regions as well: not necessarily active-active, but at least in a standby mode (i.e. "pilot light") so that customers can shed traffic from the failing region and continue to run their workloads.
However, Cognito is very region specific and there is currently no way to run in active-active or even in standby mode. The problem is user accounts; you can't sync them to another region and you can't back-up/restore them (with passwords). Until AWS comes up with some way to run Cognito in a cross-region fashion, we are pretty much stuck in a single region and vulnerable to this type of outage in the future.
Speaking about multi-region services. What do you think about Google now offering all three major building pieces as multi-regional?
They have multi-regional buckets, LB with a single anycast IP, and a document DB (Firebase). Pub/Sub can route automatically to the nearest region. Nothing like this is available in Amazon; there are only DIY building blocks.
When I talk about cross regional dependency, I talk about an architectural decision that can lead to a cascading failure in region B, which is healthy by all accounts, when there is a failure in region A.
AWS has services that allow for regional replication and failover. DynamoDB, RDS, and S3 all offer cross region replication. And Global Accelerator provides an anycast IP that can front regional services and fail over in the event of an incident.
Alternatively, global load balancing with Route 53 remains a viable, mature option as well. Health checks and failover are fully supported.
I, as many people have, discovered this when something broke in one of the golden regions. In my case cloudfront and ACM.
Realistically you can’t trust one provider at all if you have high availability requirements.
The justification is apparently that the cloud is taking all this responsibility away from people but from personal experience running two cages of kit at two datacenters the TCO was lower and the reliability and availability higher. Possibly the largest cost is navigating Harry-Potter-esque pricing and automation laws. The only gain is scaling past those two cages.
Edit: I should point out however that an advantage of the cloud is actually being able to click a couple of buttons and get rid of two cages worth of DC equipment instantly if your product or idea doesn't work out!
The hard part with multi-cloud is, you're just increasing your risk of being impacted by someone's failure. Sure if you're all-in on AWS and AWS goes down, you're all-out. But if you're on [AWS, GCP] and GCP goes down, you're down anyway. Even though AWS is up, you're down because Google went down. And if you're on [AWS, GCP, Azure] and Azure goes down, it doesn't matter than AWS and GCP are up... you're down because Azure is down. The only way around that is architecting your business to run with only one of those vendors, which means you're paying 3x more than you need to 99.99999% of the time.
The probability that one of [AWS, Azure, GCP] is down is way higher than the probability that just one of them is down. And the probability that your two cages in your datacenter is down is way higher than the probability that any one of the hyperscalers is down.
This would be a poor decision. If you assume AWS, GCP, and Azure would fail independently, you can pay 1.5x. Each of the 3 services would be scaled to take 50% of your traffic. If any one fails, you would then still be able to handle 100%. This is a common way to structure applications. Assuming independence means that more replicas result in less overprovisioning. 1 replica means needing to provision 2x. Having 5 independent replicas means, you need to provision 1.25x to be resilient against one failure as each replica will be scaled at 25%.
In general, N replicas need N/(N-1) over provisioning to be resilient against one replica failing.
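The formula as code, checked against the numbers above:

```python
def overprovision_factor(n: int) -> float:
    """Capacity multiplier per setup of n independent replicas, so the
    remaining n - 1 can absorb 100% of traffic when one replica fails."""
    return n / (n - 1)

assert overprovision_factor(2) == 2.0   # one spare replica: pay double
assert overprovision_factor(3) == 1.5   # the 3-provider split above
assert overprovision_factor(5) == 1.25  # more replicas, less headroom each
```

The catch, as the rest of the thread points out, is that this only prices in capacity; it assumes the replicas fail independently and that failover between them actually works.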
Or do you disagree that planning for a total failure of one and running redundant workloads on other vendors increases your costs 99.99999% of the time? Because that's a fairly standard SLA from each of the major vendors. Let's even reduce it to EC2's SLA, 99.99%. So 99.99% of the time you're paying 3x as much as you need to be paying just to maintain your services an extra four hours per year. Again, you can disagree with that but that doesn't make it incorrect.
Some businesses might need that extra four hours, the cost of the extra services might be cheaper than the cost of four hours of downtime per year. But you're not going to find many businesses like that. Either you're running completely redundant workloads, paying 3x as much for an extra 4 hours per year, or you're going to be taken offline when any one of the three go down independently of each other.
Single providers go down, yes. And three providers go down three times as often as one. Either you're massively overspending or you're tripling your risk of downtime. If multi-cloud worked, you'd be hearing people talking about it and their success stories would fill the front page of Hacker News. They don't, because it doesn't.
I’ve never heard of an untested failover mechanism that worked. Most places are afraid to invoke such a thing, even during a major outage.
Being afraid of failures is a massive sign of problems. I’ve worked in those sorts of places before.
Is that a reference to the difficulty of calculating the cost of visiting all the rides at Universal? That's my best guess...
"Well this pricing rule only works on a Tuesday lunch time if you're standing on one leg with a sausage under each arm and a traffic cone on your head"
And there are a million of those to navigate.
Then, to be fair:
> We have a back-up means of updating the Service Health Dashboard that has minimal service dependencies. While this worked as expected, we encountered several delays during the earlier part of the event in posting to the Service Health Dashboard with this tool, as it is a more manual and less familiar tool for our support operators. To ensure customers were getting timely updates, the support team used the Personal Health Dashboard to notify impacted customers if they were impacted by the service issues.
I'm curious if anyone here actually got one of these.
Through reading Reddit and HN during this event I learned that most people apparently aren’t even aware of the existence of the PHD and rely solely on the global status page, despite the fact that there is a giant “View my PHD” button at the very top of the global status page, and additionally there is a notification icon on the header of every AWS console page that lights up and links you directly to the PHD whenever there is an issue.
The PHD is always where you should look first. It is, by design, updated long before the global status page is.
If you don’t know what the PHD is, a big button pointing to it won’t do anything. People ignore big boxes of irrelevant stuff all the time.
AWS user of ~8 years and I’ve never heard of the PHD nor this sequencing of updating it first.
Is it really? I get the value of eating your own dogfood, it improves things a lot.
But your status page? It's such a high-importance, low-difficulty thing to build that dogfooding it gives you a small amount of benefit in the good case (dogfood something bigger/more complex instead), and a large drawback when things go wrong (like when your infrastructure goes down, so does your status page). So what's the point?
At 9:39 AM PST, we were able to confirm a root cause [...] the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration.
...[adding] new capacity [to the front-end fleet] had caused all of the servers in the [front-end] fleet to exceed the maximum number of threads allowed by an operating system configuration [number of threads spawned is directly proportional to number of servers in the fleet]. As this limit was being exceeded, cache construction was failing to complete and front-end servers were ending up with useless shard-maps that left them unable to route requests to back-end clusters.
...moving to larger CPU and memory servers [and thus fewer front-end servers]. Having fewer servers means that each server maintains fewer threads.
...making a number of changes to radically improve the cold-start time for the front-end fleet.
...moving the front-end server [shard-map] cache [that takes a long time to build, up to an hour sometimes?] to a dedicated fleet.
...move a few large AWS services, like CloudWatch, to a separate, partitioned front-end fleet.
...accelerate the cellularization of the front-end fleet to match what we’ve done with the back-end.
 https://www.youtube.com/watch?v=swQbA4zub20 and https://assets.amazon.science/c4/11/de2606884b63bf4d95190a3c...
> Amazon Cognito uses Kinesis Data Streams [...] this information streaming is designed to be best effort. Data is buffered locally, allowing the service to cope with latency or short periods of unavailability of the Kinesis Data Stream service. Unfortunately, the prolonged issue with Kinesis Data Streams triggered a latent bug in this buffering code that caused the Cognito webservers to begin to block on the backlogged Kinesis Data Stream buffers.
> And second, Lambda saw impact. Lambda function invocations currently require publishing metric data to CloudWatch as part of invocation. Lambda metric agents are designed to buffer metric data locally for a period of time if CloudWatch is unavailable. Starting at 6:15 AM PST, this buffering of metric data grew to the point that it caused memory contention on the underlying service hosts used for Lambda function invocations, resulting in increased error rates.
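The class of bug described in both excerpts, and the usual fix, can be sketched with a bounded drop-oldest buffer (a hypothetical sketch, not Cognito's or Lambda's actual code):

```python
from collections import deque

class MetricBuffer:
    """Best-effort local buffer: when the downstream stream is unavailable,
    keep at most max_items and drop the oldest, rather than growing without
    bound (memory contention) or blocking the caller (stalled webservers)."""
    def __init__(self, max_items: int):
        self.buf = deque(maxlen=max_items)  # deque evicts the oldest on append
        self.dropped = 0

    def put(self, record):
        if len(self.buf) == self.buf.maxlen:
            self.dropped += 1               # count drops; emit as a metric
        self.buf.append(record)

    def drain(self, publish):
        """Flush buffered records once the downstream recovers."""
        while self.buf:
            publish(self.buf.popleft())
```

If the data stream really is "designed to be best effort", dropping old records under prolonged unavailability is the correct trade; the latent bug was that the buffer ended up blocking (or bloating) instead.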
That should be tested at least quarterly (but preferably automatically with every build).
If Amazon did that, this outage would have been reduced to 10 mins, rather than the 12+ hours that some super slow rolling restarts took...
But if you’re running a DB or a storage system, 10 mins is a blink of an eye. Storage systems in particular can run a few hundred TB per node and moving that data to another node can take over an hour.
In this case, the frontends have a shard map which is definitely not stateless. This is typically okay if you have a fast load operation which blocks other traffic until shard map is fully loaded
It basically boils down to "We must be able to restore the minimum necessary parts of a full backup in under 10 minutes".
Take wikipedia as an example. I'd expect them to be able to restore a backup of the latest version of all pages in 10 minutes. It's 20GB of data, and I assume it's sharded at least 10 ways. That means each instance will have to grab 2GB from the backups. Very do-able.
As a service gets bigger, you typically scale horizontally, so the problem doesn't get harder.
Restoring all the old page versions and re enabling editing might take longer, but that's less critical functionality.
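Since the shards restore in parallel, the arithmetic works out comfortably (the throughput figure is an assumption for illustration):

```python
dataset_gb = 20         # latest-version pages, per the comment
shards = 10             # assumed shard count
restore_mb_per_s = 100  # assumed per-instance restore throughput

per_shard_gb = dataset_gb / shards            # each instance's slice
seconds = per_shard_gb * 1024 / restore_mb_per_s

print(per_shard_gb, round(seconds))  # 2.0 GB per shard, ~20 s in parallel
```

And because restore time depends on the per-shard slice, not the total, adding shards as the dataset grows keeps the target achievable, which is the horizontal-scaling point above.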
Translation: The eng team knew that they had accumulated tech debt by cutting a corner here in order to meet one of Amazon's typical and insane "just get the feature out the door" timelines. Eng warned management about it, and management decided to take the risk and lean on on-call to pull heroics to just fix any issues as they come up. Most of the time yanking a team out of bed in the middle of the night works, so that's the modus operandi at Amazon. This time, the actual problem was more fundamental and wasn't effectively addressable with middle-of-the-night heroics.
Management rolled the "just page everyone and hope they can fix it" dice yet again, as they usually do, and this time they got snake eyes.
I guarantee you that the "cellularization" of the front-end fleet wasn't actually under way, but the teams were instead completely consumed with whatever the next typical and insane "just get the feature out the door" thing was at AWS. The eng team was never going to get around to cellularizing the front-end fleet because they were given no time or incentive to do so by management. During/after this incident, I wouldn't be surprised if management didn't yell at the eng team, "Wait, you KNEW this was a problem, and you're not done yet?!?" Without recognizing that THEY are the ones actually culpable for failing to prioritize payments on tech debt vs. "new shiny" feature work, which is typical of Amazon product development culture.
I've worked with enough former AWS engineers to know what goes on there, and there's a really good reason why anybody who CAN move on from AWS will happily walk away from their 3rd- and 4th-year stock vest schedules (when the majority of your promised amount of your sign-on RSUs actually starts to vest) to flee to a company that fosters a healthy product development and engineering culture.
(Not to mention that, this time, a whole bunch of people's Thanksgiving plans were preempted with the demand to get a full investigation and post-mortem written up, including the public post, ASAP. Was that really necessary? Couldn't it have waited until next Wednesday or something?)
Yes, this is exactly how product development works at many (if not most) places within Amazon for engineers. It can be this toxic.
Disclaimer: Amazon engineer
This goes a bit more in-depth:
I’m wondering how many people Amazon fired over this incident - that seems to be their goto answer to everything.
Is it because operating system configuration is managed by a different team within the organization?
An auto scaling irony for AWS! We seem to be back to the late 1990s :)
First of all, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon Kinesis, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.
Then move on to explain...