When doing something really critical (such as playing with the master database late at night) ALWAYS work with a checklist. Write down WHAT you are going to do, and if possible, talk to a coworker about it so you can vocalize the steps. If there is no coworker, talk to your rubber ducky or stapler on your desk. This will help you catch mistakes. Then, when the entire plan looks sensible, go through the steps one by one. Don't deviate from the plan. Don't get distracted and start switching between terminal windows. While making the checklist, ask yourself whether what you're doing is A) absolutely necessary and B) at risk of making things worse. Even when the angry emails are piling up, you can't allow that pressure to cloud your judgment.
Every startup has moments when last-minute panic-patching of a critical part of the server infrastructure is needed, but if you use a checklist you're not likely to mess up badly, even when tired.
1) Patio11 touches on a very good lesson, in passing, in an article about Japanese business:
While raw programming ability might not be highly valued at many Japanese companies, and engineers are often not in positions of authority, there is nonetheless a commitment to excellence in the practice of engineering. I am an enormously better engineer for having had three years to learn under the more senior engineers at my former employer. We had binders upon binders full of checklists for doing things like e.g. server maintenance, and despite how chafing the process-for-the-sake-of-process sometimes felt, I stole much of it for running my own company. (For example, one simple rule is “One is not allowed to execute commands on production which one has not written into a procedural document, executed on the staging environment, and recorded the expected output of each command into the procedural document, with a defined fallback plan to terminate the procedure if the results of the command do not match expectations.” This feels crazy to a lot of engineers who think “I’ll just SSH in and fix that in a jiffy” and yet that level of care radically reduces the number of self-inflicted outages you’ll have.)
2) I once heard organisational 'red tape' described as 'the scar tissue of process failures' and it is absolutely true and I deeply regret not recording the source of it. Whenever you wonder why there's some tiresome, overly onerous process in place that is slowing you down, consider why it may have been put in place - chances are, there was a process failure that resulted in Bad Things. When you wonder why big orgs are glacially slow compared to more nimble startup competitors, understand that those startups have yet to experience the Bad Things that the big org has probably already endured. Like scar tissue, the processes they develop reduce their agility and performance but also serve to protect the wounds they experienced.
Why am I saying that? Because some of the Japanese companies I've worked with were the exact opposite of that. To be sure, lip service was duly paid to the aforementioned "commitment to excellence", and every release procedure had its own operational manual, sometimes 300 steps long, repeated manually for every one of 100-200 servers.
Configuration updates? Sure, let's log in to every server and vi the config file. How do we keep excellence? Just diff against the previous version and verify (with your eyes, that is) that the result is the same as in your manual. After every "cd" you had to do a pwd and make sure you had moved to the directory you meant to. After every cp you diffed against the original file to make sure the copy was identical.
Releases obviously took all day, or often all night, and engineers were stressed and fatigued by the Sisyphean manual with its 300 steps of red tape. They invariably made silly mistakes, because this is what you get when you use human beings as a glorified tty+diff. We had release issues and service outages all the time.
We've fortunately managed to move to modern DevOps practices, with a lot of top-down effort. But please don't tell me every Japanese company magically delivers top quality. Some of them do, some of them don't, even in the same industry. Insane levels of bureaucracy could be found across the board, but whether that bureaucracy actually encourages or deters quality is an entirely different story.
My opinion: script it. Always. It doesn't matter if it's ansible, bash, puppet, python, whatever, just make sure it's not an ad-hoc command. Test the script on a server that can be sacrificed. Keep testing for as long as there is a single glitch left. Then run it in production.
It's to eliminate typos and to have a "log" of what was actually done.
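Something like this, as a minimal sketch (the paths, hostnames, and config file here are made up for illustration, not taken from any real setup):

    #!/usr/bin/env bash
    # Sketch only: push one reviewed config file to a list of servers and
    # verify the result, keeping a log of what was actually done.
    set -euo pipefail

    CONFIG=app.conf                               # the reviewed config
    LOG="rollout-$(date +%Y%m%d-%H%M%S).log"

    mapfile -t hosts < hosts.txt                  # one hostname per line
    for host in "${hosts[@]}"; do
        echo "== $host ==" | tee -a "$LOG"
        scp "$CONFIG" "$host:/etc/myapp/app.conf" 2>&1 | tee -a "$LOG"
        # Verify: diff the remote copy against the local one. Any difference
        # makes diff exit non-zero, and set -e / pipefail aborts the whole run
        # right there instead of silently continuing to the next server.
        ssh -n "$host" "cat /etc/myapp/app.conf" | diff - "$CONFIG" | tee -a "$LOG"
    done

The point isn't this particular script; it's that the loop, the verification, and the log all exist before anyone touches production.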
For things that you can't script, you write abstracted processes that force the executor to write down the things that could cause Bad Things to happen, and use that writing down stage to verify that it's not going to cause a Bad Thing. That forces people to pause and consider what they're doing, which is 80% of the effort towards preventing these issues.
e.g.: Forcing YP to write down which database they were scorching would've triggered an 'oh fuck' moment. Having a process that avoided naming databases 'db1' and 'db2' would've prevented it. Etc. etc. etc.
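A sketch of what forcing that write-down can look like inside a destructive script (the hostname and the prompt wording are hypothetical, not anyone's real procedure):

    #!/usr/bin/env bash
    # Sketch: refuse to do anything destructive until the operator has typed
    # out the exact name of the target they think they're pointing at.
    set -euo pipefail

    TARGET="${1:?usage: $0 <database-hostname>}"

    read -r -p "Type the full hostname of the database you are about to wipe: " confirm
    if [[ "$confirm" != "$TARGET" ]]; then
        echo "Confirmation did not match '$TARGET'. Aborting." >&2
        exit 1
    fi

    echo "Proceeding against $TARGET ..."
    # destructive command would go here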
But there was a tremendous organic resistance to that from the very same "culture of excellence in engineering". "How can we be sure it works if it's automated?"
"It's safer to manually review the log"
"How can you automate something like email tests are or web tests?"
"It's no worth automating this procedure, we only release this app once a year, and it only takes 5 hours".
Expect to hear these kinds of claims when engineers have had the equation "menial work == diligence == excellence" pummeled into them for generations.
This way, when your script fails, you can recover quickly.
edit: I split this with my parent reply to try to make the two separate points clearer
What you can say about Japan, is that since technology-wise it tends to be behind the US (of course this too is a gross generalization), you can expect most non-startups to use bureaucracy over automation (and modern DevOps practices in general) to regulate production operation quality. The unfortunate side here is that bureaucracy is much more fragile and when it fails, it tends to fail spectacularly.
The googlable nugget is actually "organizational scar tissue" (it caught my attention too). It's from Jason Fried. On twitter:
and also apparently in "Rework", quoted with more context here:
Another explanation for big orgs vs teeny start-ups: what level of failure is acceptable? For a teeny start-up, a few hours of being down is not so important. For (say) a bank, being down for a few hours might be mentioned in the national newspapers.
A good employment environment is one where you may ask why a process exists and receive valid justifications in return, but where the idea of not following it, however onerous it is, never crosses your mind. I acknowledge I'm really lucky to work in an industry that doesn't fall too far from that target.
Pushing the metaphor a bit too far.
"..the accident is considered a prime example of successful crew resource management due to the large number of survivors and the manner in which the flight crew handled the emergency and landed the airplane without conventional control."
I highly recommend it
Even better to avoid all that and make it idiot-proof. I'd rather not be in the situation where my only protection is rigmarole. But sure, as a last resort (just like frameworks) - much better than nothing. Typically in programming, frameworks are premature. A simple API tends to suffice, with checked (or typechecked) inputs and outputs. But sure, if for some reason you can't make that, and you need complex interactions with dynamically generated code, a variable number of order-dependent parameters, black-box "magic" base types, stringly-typed unchecked mini-languages, multiple sequentially dependent calls into the same thing, or any other tricky API you can't (or won't) easily detect misuse of... then a framework is the least-bad amongst bad options.
I speak their language (it's German, though people from Switzerland speak a pretty strong dialect) and they discuss highly technical and serious stuff, but their language is just so adorable when they mix English and German. I always thought this communication was English-only nowadays?
not just pilots ... doctors, nurses, etc.
Now, when you can't have tested and debugged software, yeah, formal procedures are the second best thing. Just don't get complacent there.
"Building Secure Cultures" by Leigh Honeywell
(checklist part starting from about 16:00)
I found this one which does:
Remember they're driving 1970's technology, redundant everything, and they all grew up flying "steam gauges", where the culture includes tapping on the glass to make sure the needle didn't stick. They want to compare every sensor output for sanity so they can disregard one if needed.
This vid is also the source of cockpit audio for the FSim shuttle simulator game if you like this stuff.
PS - here's what happens when they got a bad instrument and didn't catch it: http://www.avweb.com/news/safety/183035-1.html
Pilot: Your radar's good. My radar's good.
Commander: I agree.
Pilot: You're going just a little bit high.
Commander: I agree.
There's also the distinction between specific orders ("you are to") and statements of intent. Knowing the commander's intent, and his commander's intent (known as 1 Up and 2 Up), enables Mission Command, the concept of giving latitude to subordinates to achieve the mission in the best way possible within the confines and direction given to them.
Finally there's the ritual of it, NATO forces expect to operate within multi-national structures where English won't be a first language. There is a "NATO sequence of orders" which should be roughly followed. It means everyone knows the structure of what is coming up and when in the brief - so you don't get people asking questions about equipment when the limitations are being explained, they know that comes later. Opening, especially from a junior officer, with "my intention is" is essentially like having a schema definition at the start of a document - it defines the structure of what is coming for those who are going to be parsing it.
"You should have KNOWN that erudite command was going to fail."
"You should have known that our one-off program had issues you did not account for."
"You should have known that the backups were not properly tested elsewhere for known good state."
"You should have ....."
In reality, some disasters were caused by idiotic things like "rm -rf /opt/somedir ." You just hosed the system, or a large part of it, very quickly. And we could say that your malfeasance of including the "." started wiping / immediately. But we can also say that rm should be aliased to prevent accidents like that, or that rm should do some minimal sanity checking on critical directories before executing.
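A crude sketch of that kind of sanity check (the protected list is just an example; a real wrapper would need a lot more thought):

    #!/usr/bin/env bash
    # Sketch of a "saferm" wrapper (e.g. saved as /usr/local/bin/saferm and
    # aliased over rm): refuse to touch a handful of critical paths, ".", or
    # "..", and otherwise hand off to the real rm.
    protected=(/ /bin /boot /etc /home /opt /usr /var)

    for arg in "$@"; do
        clean="${arg%/}"                 # normalise a trailing slash
        [[ -z "$clean" ]] && clean=/
        for p in "${protected[@]}"; do
            if [[ "$clean" == "$p" || "$clean" == "." || "$clean" == ".." ]]; then
                echo "saferm: refusing to delete '$arg'" >&2
                exit 1
            fi
        done
    done

    exec /bin/rm "$@"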
People can, and will, mess up. These computers are nice in that they can have logic that self-corrects, or at least loudly alerts on errors.
I'm not saying that's the case here, because it does seem that GitLab has systemic deficiencies. But "never" and "always" are such strong statements.
Prosecutor: Mr. Accused, here is evidence that you murdered that other person.
Accused: It's not my fault, but the system's.
Judge: Oh, ok. You are a free man.
Let's assume a bad actor in a company. It still doesn't help improve the situation to allow blame to rest with the bad actor. Definitely, there should be penalties applied (likely the termination of their position), but it doesn't help your company at all to stop there.
Did they delete data? Why is there no secure backup system in place to recover that data? Why was there such lax security in place to allow them to delete the data in the first place? Why are we hiring people who will go rogue and delete data? Did they "turn" after working here for a while because of toxic culture, processes, etc?
Hell, if the law worked this way, we might actually have less crime because we'd look further into the causes of crime and work to address them instead of simply punishing the offenders.
Then, always remember to delete /trash a few days later.
sudo apt install trash-cli
alias rm="echo This is not the command you want to use"
Proposed tweaks: symbolic link into a /var/trashlist directory, where the name of the symbolic link is "<timestamp>-<random stub>-<original basename>". Timestamp first so we can stop once we hit the first too-recent timestamp, random stub to make the name unique if two different files in different directories are deleted at the same timestamp, original file name for inspection.
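A rough bash sketch of that scheme (the directory names and retention period are only examples):

    #!/usr/bin/env bash
    # Sketch of the scheme above: "trash" moves the file aside and records a
    # symlink under /var/trashlist named <timestamp>-<random>-<basename>, so a
    # cleanup job can walk the links in order and stop at the first too-recent one.
    set -euo pipefail
    TRASH=/var/trash          # where trashed files actually live
    LIST=/var/trashlist       # symlinks used as the index

    trash() {                 # usage: trash /var/log/app/huge.log
        local f name
        for f in "$@"; do
            name="$(date +%s)-$RANDOM-$(basename "$f")"
            mv "$f" "$TRASH/$name"
            ln -s "$TRASH/$name" "$LIST/$name"
        done
    }

    purge_older_than() {      # usage (e.g. from cron): purge_older_than 3
        local days=$1 cutoff link ts
        cutoff=$(( $(date +%s) - days * 86400 ))
        shopt -s nullglob
        for link in "$LIST"/*; do
            ts=${link##*/}; ts=${ts%%-*}
            (( ts < cutoff )) || break   # names sort by timestamp, so stop here
            rm -f "$(readlink "$link")" "$link"
        done
    }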
You're at 90%, and you have an app error spewing a few MB of logs per minute. Your on-call engineer /bin/rm's the logs, and instead of going to 30%, you're still at 91%, only the files are gone. Your engineer (rightfully) thinks "maybe the files are still on disk, and there's a filehandle holding the space", so instead of checking /proc to confirm, he bounces the service. Disk stays full, but you've incurred a few seconds of app downtime for no reason, and your engineer is still confused as shit. Your cron job won't kick in for hours? Days? In the meantime, you're still running out of disk, and sooner or later you'll hit 100% and have a real outage.
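For what it's worth, the check the engineer skipped takes seconds (the <pid> and <n> bits below are placeholders):

    # List files that are deleted but still held open (link count < 1):
    sudo lsof +L1
    # Or per process: deleted files show up marked "(deleted)" under /proc:
    ls -l /proc/<pid>/fd | grep deleted
    # Truncating through the fd releases the space without bouncing the service:
    sudo truncate -s 0 /proc/<pid>/fd/<n>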
Cron job is a stupid hack. It doesn't solve any problems that aren't better solved a dozen other ways.
No documentation, no checklists? That's the source of your problem, not a trash command which moves files rather than deleting them.
Docs and checklists are fine, but at 2am when the jr is on call, you're asking for problems by making rm == mv
The best thing to do is to never operate with two terminals simultaneously when one of them is a production env; better to log in and out as needed, or at least minimise it.
It led to a "Very Bad Day". I found out about it after reading the post mortem.
And/or a system like etckeeper to help keep a log on top of an fs that doesn't keep one for you:
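Rough usage on a Debian-ish box with the default git backend (package name and details may differ elsewhere):

    sudo apt install etckeeper                        # sets up a repo in /etc on install
    sudo etckeeper commit "manual tweak before release"   # snapshot a hand edit
    cd /etc && sudo git log --stat                    # see what changed, and when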
The military does a very similar thing. An Army company commander usually has an RTO (radiotelephone operator) to handle talking on the radio. This frees the commander to make real-time decisions and respond quickly to the situation on the ground, and frees him/her from having to spend time explaining things. A really good RTO will function as the voice of the commander, anticipating what he/she needs to get from the person on the other end of the radio. This is a great characteristic of a good operations engineer, too. While the person doing triage is addressing the problem, the communicator is roping in other resources that might be needed and communicating the current situation out to the rest of the team/company.
Another thing to note is that the RTO will occasionally state a callsign followed by "ACTUAL". That means that whoever they are speaking for is actually speaking, and not the RTO on behalf of the CO.
"I meant the other server."
My worst nightmares are nothing but a blip on a developer's mind. I have lost it all. I have lost it all on multiple servers. On multiple volumes.
No one should ever experience that. Ugh, developing, sometimes, is just so frustrating.
The Checklist Manifesto https://smile.amazon.com/Checklist-Manifesto-How-Things-Righ...
I work in EMS when not in IT, and even bringing a trauma or cardiac arrest patient into the Emergency Room, there is still time to review and consider.
If there's time in Emergency Medicine, there's time in IT.
In this script I added a checklist of things to "check" before running it. It has worked in my favour every time I run it.
Here's what it looks like: https://github.com/jontelang/Various-scripts/blob/master/Git...
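The general pattern, as a rough sketch (not the actual script above; the checklist items are placeholders):

    #!/usr/bin/env bash
    # Sketch of a pre-flight checklist baked into a script: the operator has to
    # answer each item before anything happens.
    set -euo pipefail

    checklist=(
        "Am I on the STAGING box, not production?"
        "Is there a fresh, verified backup?"
        "Has someone else read this plan?"
    )

    for item in "${checklist[@]}"; do
        read -r -p "$item [y/N] " answer
        [[ "$answer" == "y" ]] || { echo "Checklist not satisfied, aborting."; exit 1; }
    done

    echo "Checklist passed, continuing..."
    # actual work starts here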
Take your time and work when you're at your best.
I got bitten by this in the past, luckily nothing that could not be reversed, just 2-3 hours lost. I can imagine how YP's stomach must have felt when he realised what happened.
Still, I had no idea about checklists and so many people here seem to be pretty familiar with the concept :-)
Even more so if you tag-team things; it works so much better.
There isn't a silver bullet anyway; it's layer upon layer of operational best practices that makes you resilient against such issues.