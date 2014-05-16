It remains amazing to me that even with all the layers of automation, the root cause of most serious deployment problems remains some variant of a fat-fingered user.
I may have to upgrade that to take the mighty power of Cloud (TM) into account, though. Billions and trillions of fuck ups per second are now well within reach!
This is one of the things that happens with Windows: getting a server up is so easy that people believe they don't have to understand what's under the hood, and then we get a lot of misconfiguration and operational issues.
For instance: https://thenextweb.com/shareables/2014/05/16/emory-universit...
Note: Automation is great, you just can't be sloppy with it. EVER.
If the computer knows exactly what actions would be a mistake, why can't it just do the correct actions (those that aren't a mistake) automatically?
That's good UI design in tools with powerful destructive capabilities. You make the UI for doing lots of things different enough from the UI for the few things you do routinely that there's no mistaking them.
no, because developers still produce code with bugs.
You could feedback a clarification, but if that happens too often nobody will double check it after they have seen it over and over.
Anyway, to me this firstly sounds like a "tool" or command that was too powerful with not enough safeguards. Who knows, the command might even be ambiguous.
Typos happen; if the system didn't stop them, that's a design issue or an accepted risk.
" At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended."
"Fire you? I just spent $10 million training you!"
> While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years.
These sorts of things make me understand why the Netflix "Chaos Gorilla" style of operating is so important. As they say in this post:
> We build our systems with the assumption that things will occasionally fail
Failure at every level has to be simulated pretty often to understand how to handle it, and it is a really difficult problem to solve well.
[1] https://github.com/Netflix/chaosmonkey
http://techblog.netflix.com/2011/07/netflix-simian-army.html
> Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone. We want to verify that our services automatically re-balance to the functional availability zones without user-visible impact or manual intervention.
Ensuring that your status dashboard doesn't depend on the thing it's monitoring is probably the first thing you should think about when designing your status system. This doesn't fill me with confidence about how the rest of the system is designed, frankly...
Since restarting the entire fleet would incur downtime of all relevant S3 operations, it's unlikely that it was something ever intentionally done in production (and they may or may not have run that scenario in other environments).
Source: I used to run several large scale services at Amazon.
It sounds like the weakness in the process is that the tool they were using permitted destructive operations like that. The passage that stuck out to me: "in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level."
At the organizational level, I guess it wasn't rated as all that likely that someone would try to remove capacity that would take a subsystem below its minimum. Building in a safeguard now makes sense as this new data point probably indicates that the likelihood of accidental deletion is higher than they had estimated.
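To make the quoted fix concrete, here is a minimal sketch of a removal tool that refuses to go below a minimum and only removes capacity in small batches. Everything in it (Subsystem, plan_removal, max_per_step, the numbers) is hypothetical and invented for illustration; it is not how Amazon's tooling works.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Subsystem:
    # Hypothetical model of a fleet: how many servers are active and
    # how many the subsystem needs to keep serving traffic.
    name: str
    active_servers: int
    min_required: int


class CapacityError(Exception):
    """Raised when a removal request would violate a safety limit."""


def plan_removal(subsystem: Subsystem, requested: int, max_per_step: int = 2) -> List[int]:
    """Return batch sizes for removing `requested` servers, or raise."""
    if requested <= 0:
        raise CapacityError("requested must be positive")
    remaining = subsystem.active_servers - requested
    # Safeguard 1: never plan a removal that dips below the minimum.
    if remaining < subsystem.min_required:
        raise CapacityError(
            f"{subsystem.name}: removing {requested} leaves {remaining}, "
            f"below minimum {subsystem.min_required}"
        )
    # Safeguard 2: remove slowly, at most max_per_step servers per batch.
    batches = []
    left = requested
    while left > 0:
        step = min(max_per_step, left)
        batches.append(step)
        left -= step
    return batches
```

With a 10-server subsystem that needs 6, asking for 3 yields batches of 2 and 1, while asking for 5 raises CapacityError before anything is touched. A mistyped input then fails loudly instead of quietly removing too much.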
I've always wondered why ops hasn't adopted some of the best practices that have been around for years to avoid fat-finger errors. Like why don't we have systems where doing something dangerous requires two separate people to run the command, or there's an approval step, or whatever.
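A rough sketch of what such an approval step could look like. DangerousCommand and ApprovalRequired are made-up names for illustration, not any real ops tool; the only rules enforced are that the command won't run unapproved and that the approver must be someone other than the requester.

```python
class ApprovalRequired(Exception):
    """Raised when a command is run or approved without a valid second person."""


class DangerousCommand:
    # Hypothetical wrapper: `action` is any zero-argument callable
    # that performs the destructive operation.
    def __init__(self, description, requested_by, action):
        self.description = description
        self.requested_by = requested_by
        self.action = action
        self.approved_by = None

    def approve(self, approver):
        # The second set of eyes must belong to a different person.
        if approver == self.requested_by:
            raise ApprovalRequired("requester cannot approve their own command")
        self.approved_by = approver

    def run(self):
        if self.approved_by is None:
            raise ApprovalRequired(f"'{self.description}' needs a second approver")
        return self.action()
```

A real system would also need authentication and an audit log; this only shows the shape of the check.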
From a software development perspective, it makes sense to reuse S3 and rely on it internally if you need object storage, but from an ops perspective, it means that S3 is now a single point of failure and that SES's reliability will always be capped by S3's reliability. From a customer perspective, the hard dependency between SES and S3 is not obvious and is disappointing.
The whole internet was talking about S3 when the AWS status dashboard did not show any outage, but very few people mentioned other services such as SES. Next time we encounter errors with SES, should we check for hints of S3 outage before everything else? Should we also check for EC2 outage?
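The "capped by S3's reliability" point is easy to see with back-of-the-envelope arithmetic. This sketch assumes failures are independent and that a service is only up when all of its hard dependencies are up; the availability figures are invented, not published AWS numbers.

```python
def serial_availability(*parts: float) -> float:
    """Availability of a chain where every part must be up at once.

    Assumes independent failures, which is a simplification.
    """
    total = 1.0
    for p in parts:
        total *= p
    return total


# Hypothetical figures: a service that is 99.95% available on its own,
# with a hard dependency on a store that is 99.9% available.
combined = serial_availability(0.9995, 0.999)
print(f"{combined:.7f}")
```

Under those assumptions the combined figure comes out lower than either input, so the dependent service can never be better than the thing it depends on.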
It would be easy for an arrogant organisation to fire or otherwise punish the person who made the mistake. I hope Amazon don't fall into that trap and instead focus on learning from what happened, closing the book and moving on.
The only thing I read in there and go "hmmm" is that it took quite that long for the S3 service to recover, and that the status page wasn't hosted somewhere that doesn't have an S3 dependency. That's just a plain "doh" moment :)
That was CEO Robert Allen's response when the AT&T network collapsed [1] on January 15, 1990
He was asked who made the mistake.
I can't imagine any CEO nowadays making a similar statement.
[1] http://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collaps...
This is analogous to "we needed to fsck, and nobody realized how long that would take".
> Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
If I had to guess who could prevent mistakes like this from propagating, it would be AWS. It points to just how easy it is to make these errors. I am sure that the SRE who made this mistake is amazing and competent and just had one bad moment.
While I hope that AWS would be as understanding as Gitlab, I doubt the outcome is the same.
the most important one is that Spanner runs on Google’s private network. Unlike most wide-area networks, and especially the public internet, Google controls the entire network and thus can ensure redundancy of hardware and paths, and can also control upgrades and operations in general. Fibers will still be cut, and equipment will fail, but the overall system remains quite robust.
It also took years of operational improvements to get to this point. For much of the last decade, Google has improved its redundancy, its fault containment and, above all, its processes for evolution. We found that the network contributed less than 10% of Spanner’s already rare outages.
But when it fails it's going to be epic!
[1] https://cloudplatform.googleblog.com/2017/02/inside-Cloud-Sp...
[1] http://articles.latimes.com/1999/oct/01/news/mn-17288
I'd be interested to understand why a cold restart was needed in the first place. That seems like kind of a big deal. I can understand many reasons why it might be necessary, but that seems like one of the issues that's important to address.
CEOs all over the world just realized that they can't only depend on S3, and they might have to double up on their infrastructure and have a parallel env. on Azure or Google as well.
There are process-fixes for this, such as requiring a two-person rule when at a production shell and modifying tooling to detect potentially unintentional commands (e.g. a SQL UPDATE without a WHERE) - but given what I know about Amazon's internal practices (i.e. the brutality) it wouldn't surprise me if they did terminate the unfortunate operator - not because they want to, but because AWS simply has too many large-scale customers who would demand immediate action like that.
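For the "SQL UPDATE without a WHERE" case, a crude pre-flight check might look like the following. It is a regex heuristic, not a SQL parser: looks_unbounded is a hypothetical helper, and a WHERE that only appears inside a string literal or subquery would fool it.

```python
import re

# Statements that can rewrite or delete every row in a table.
_RISKY = re.compile(r"^\s*(update|delete)\b", re.IGNORECASE)
_WHERE = re.compile(r"\bwhere\b", re.IGNORECASE)


def looks_unbounded(statement: str) -> bool:
    """Return True for an UPDATE or DELETE with no WHERE keyword anywhere."""
    return bool(_RISKY.match(statement)) and not _WHERE.search(statement)


for sql in (
    "UPDATE users SET active = 0",
    "UPDATE users SET active = 0 WHERE id = 7",
    "delete from logs",
    "SELECT * FROM users",
):
    flag = "BLOCK" if looks_unbounded(sql) else "ok"
    print(f"{flag:5} {sql}")
```

A production shell wrapper could refuse flagged statements outright or route them through a second approver.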
SHUT DOWN S3? ARE YOU SURE? (y/N) :
SHUT DOWN S3? ARE YOU SURE? ("I'm absolutely positive."/n) :
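Joking aside, the second prompt is roughly the "type it back" pattern. A minimal sketch, where confirm_destructive and the target name are hypothetical: the operator has to retype the exact target, so a reflexive "y" does nothing.

```python
def confirm_destructive(target: str, read=input) -> bool:
    """Ask the operator to type the target name back; anything else aborts.

    `read` defaults to the built-in input() and can be swapped out in tests.
    """
    answer = read(f"This will shut down {target}. Type the name to confirm: ")
    return answer.strip() == target
```

Usage would be something like `if confirm_destructive("s3-index"): ...`. It won't stop someone who retypes the wrong target on purpose, but it does break the habit of hitting "y" without reading.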
Every mistake was used as a learning opportunity to ensure that the same and similar mistakes can't be repeated.
On a more serious note, if you've never done something like this, you haven't had enough interesting projects.
I've had a decent career and I still managed to:
* re-deploy the current application version in all our data centers, instead of the new version, in a period when our deployment wasn't a 0-downtime one
* rename all the Jenkins jobs on the server to the same name, thus deleting hundreds of Jenkins jobs in one fell swoop
"Let him who is without sin cast the first stone" and all that :)
1. No organization anywhere is a paragon of excellence, and everyone can benefit from improvement.
2. Every organization is made up of humans just like you. With all that entails.
Some things which seem blatantly obvious after the fact are easily overlooked when the pressure to deliver is high and other issues are taking precedence.
I completely agree with your statement. In fact, when I do interviews, one of my favorite and most insightful questions to ask is, basically, "tell me about a time you screwed the pooch." If they don't have a story and they worked in ops, then it can suggest they didn't really do much. The really sharp ones I've interviewed have a good story or two (and can tell it in excruciating detail. =)
* At a prior company I once tried appending to the list of NFS exports, but dropped the "no-root-squash" option, and instantly denied write permissions to our entire VMware farm. You can imagine what then happened to all of the VMs for this mission critical customer. =P
He took off in his piston engine plane, only to lose power during the climb and was forced to make a crash landing. It turned out the airplane was fueled with jet fuel instead of regular gasoline (the ground crewman mistakenly thought the plane was a turbo prop).
Instead of yelling at or firing the ground crewman, Hoover had this to say[2]:
"There isn't a man alive who hasn't made a mistake.
But I'm positive you'll never make this mistake again.
That's why I want to make sure that you’re the only one
to refuel my plane tomorrow. I won't let anyone else
on the field touch it."
[2] http://www.squawkpoint.com/2014/01/criticism/
We have geo-distributed systems. Load balancing and automatic failover. We agonize over edge cases that might cause issues. We build robust systems.
At the end of the day reliability -- a lot like security -- is most affected by the human factor.
This is the bit that'd worry me most; you'd think they'd be testing this.
>We understand that the SHD provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions.
"From the beginning of this event until 11:37AM PST, we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3. Instead, we used the AWS Twitter feed (@AWSCloud) and SHD banner text to communicate status until we were able to update the individual services’ status on the SHD. We understand that the SHD provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions."
* "authorized S3 team member" -- how did this team member acquire these elevated privs?
* Running playbooks is done by one member without a second set of eyes or approval?
* "we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years"
The good news:
* "The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately."
The truly embarrassing thing that everyone has known about for years is the status page:
* "we were unable to update the individual services’ status on the AWS Service Health Dashboard "
When there is a wildly-popular Chrome plugin to fix your page ("Real AWS Status") you would think a company as responsive as AWS would have fixed this years ago.
) If it's a playbook for something with minimal intended impact sure. The issue is that the tooling had larger capabilities than should be.
) Yes that seems like a major, major problem.
This is not in the post.
