
This is painful to read. It's easy to say that they should have tested their backups better, and so on, but there is another lesson here, one that's far more important and easily missed.

When doing something really critical (such as playing with the master database late at night) ALWAYS work with a checklist. Write down WHAT you are going to do, and if possible, talk to a coworker about it so you can vocalize the steps. If there is no coworker, talk to your rubber ducky or stapler on your desk. This will help you catch mistakes. Then when the entire plan looks sensible, go through the steps one by one. Don't deviate from the plan. Don't get distracted and start switching between terminal windows. While making the checklist ask yourself if what you're doing is A) absolutely necessary and B) risks making things worse. Even when the angry emails are piling up you can't allow that pressure to cloud your judgment.

Every startup has moments when last-minute panic-patching of a critical part of the server infrastructure is needed, but if you use a checklist you're not likely to mess up badly, even when tired.

Two points:

1) Patio11 touches on a very good lesson, in passing, in an article about Japanese business[1]:

While raw programming ability might not be highly valued at many Japanese companies, and engineers are often not in positions of authority, there is nonetheless a commitment to excellence in the practice of engineering. I am an enormously better engineer for having had three years to learn under the more senior engineers at my former employer. We had binders upon binders full of checklists for doing things like e.g. server maintenance, and despite how chafing the process-for-the-sake-of-process sometimes felt, I stole much of it for running my own company. (For example, one simple rule is “One is not allowed to execute commands on production which one has not written into a procedural document, executed on the staging environment, and recorded the expected output of each command into the procedural document, with a defined fallback plan to terminate the procedure if the results of the command do not match expectations.” This feels crazy to a lot of engineers who think “I’ll just SSH in and fix that in a jiffy” and yet that level of care radically reduces the number of self-inflicted outages you’ll have.)

2) I once heard organisational 'red tape' described as 'the scar tissue of process failures' and it is absolutely true and I deeply regret not recording the source of it. Whenever you wonder why there's some tiresome, overly onerous process in place that is slowing you down, consider why it may have been put in place - chances are, there was a process failure that resulted in Bad Things. When you wonder why big orgs are glacially slow compared to more nimble startup competitors, understand that those startups have yet to experience the Bad Things that the big org has probably already endured. Like scar tissue, the processes they develop reduce their agility and performance but also serve to protect the wounds they experienced.

[1] http://www.kalzumeus.com/2014/11/07/doing-business-in-japan/
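For anyone who wants to try the rule patio11 describes (write each command down, record its expected output, stop if reality disagrees), here is a minimal sketch in Python. The names `Step`, `ProcedureAborted` and `run_procedure` are made up for illustration, and exact string matching on stdout is an assumption; a real procedure would probably need looser matching and an explicit rollback step.

```python
import subprocess
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    description: str        # what the procedural document says this step does
    command: List[str]      # the exact command, written down in advance
    expected_output: str    # output recorded when the step ran on staging


class ProcedureAborted(Exception):
    """Raised when a step does not match the recorded expectation."""


def run_procedure(steps):
    """Run steps in order and stop at the first one that deviates."""
    log = []
    for number, step in enumerate(steps, start=1):
        result = subprocess.run(step.command, capture_output=True, text=True)
        actual = result.stdout.strip()
        # Keep a record of what was really executed and what came back.
        log.append((number, step.description, actual))
        if result.returncode != 0 or actual != step.expected_output:
            # Fallback plan in this sketch: terminate, run nothing further.
            raise ProcedureAborted(
                f"step {number} ({step.description}): "
                f"expected {step.expected_output!r}, got {actual!r}"
            )
    return log
```

The returned log doubles as the record of what was actually done, and nothing after a mismatching step ever runs.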

One thing that I learned the hard way about "Japanese companies" - despite western conceptions, every company in Japan has its own unique culture (and takes pride in having its own culture!). What's more, departments and divisions inside the same company often work in very different ways.

Why am I saying that? Because some of the Japanese companies I've worked with were the exact opposite of that. To be sure, lip service was duly paid to the aforementioned "commitment to excellence", and every release procedure had its own operational manual, sometimes 300 steps long. Repeated manually for every server. Out of 100-200.

Configuration updates? Sure, let's log in to every server and vi the config file. How do we keep excellence? Just diff with prev and verify (with your eyes that is) that the result is the same as in your manual. After every "cd" you had to do a pwd and make sure that you moved to the directory you meant to. After every cp you diffed against the original file to make sure.

Releases obviously took all day or often all night, and engineers were stressed and fatigued by the Sisyphean manual with its 300 steps of red tape. They invariably made silly mistakes, because this is what you get when you use human beings as a glorified tty+diff. We had release issues and service outages all the time.

We've fortunately managed to move away to modern DevOps practices with a lot of top down effort. But please don't tell me every Japanese company magically delivers top quality. Some of them do, some of them don't, even in the same industry. Insane levels of bureaucracy could be found all across the board, but whether that bureaucracy actually encourages or deters quality is an entirely different story.

Unfortunately I had similar experiences as well; incredibly manual processes, frighteningly long manual procedure descriptions instead of scripted solutions.

My opinion: script it. Always. It doesn't matter if it's ansible, bash, puppet, python, whatever, just make sure it's not an ad-hoc command. Test the script on a server which can be sacrificed. Keep testing as long as there is a single glitch. Then run it in production.

It's to eliminate typos and to have a "log" to see what actually had been done.

Oh, absolutely. Where something can be scripted, script it. Why? Because scripting is a process development. You write something, validate it and then remove the human error element.

For things that you can't script, you write abstracted processes that force the executor to write down the things that could cause Bad Things to happen, and use that writing down stage to verify that it's not going to cause a Bad Thing. That forces people to pause and consider what they're doing, which is 80% of the effort towards preventing these issues.

eg: Forcing YP to write down which database they were scorching would've triggered an 'oh fuck' moment. Having a process that dodged naming databases as 'db1' and 'db2' would've prevented it. etc. etc. etc.

Which is what we did, obviously and nobody is allowed to run anything manually while SSHing to production.

But there was a tremendous organic resistance to that from the very same "culture of excellence in engineering". "How can we be sure it works if it's automated?" "It's safer to manually review the log" "How can you automate something like email tests or web tests?" "It's not worth automating this procedure, we only release this app once a year, and it only takes 5 hours". Expect to hear these kinds of claims when engineers have had the equality "menial work == diligence == excellence" pummeled into them for generations.

Also, script disaster recovery too. Script it when creating your backup procedure (not at the time of disaster), use the script to test your procedure, and do it often.

This way, when your script fails, you can recover quickly.
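A rough idea of what "use the script to test your procedure" can look like, as a Python sketch. It assumes the thing being backed up is a plain directory and the backup is a gzipped tarball; the function names are invented for this example, and a database would need its own dump and restore commands instead.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def checksums(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def backup(source: Path, archive: Path) -> None:
    """Write every regular file under source into a gzipped tar archive."""
    with tarfile.open(archive, "w:gz") as tar:
        for p in sorted(source.rglob("*")):
            if p.is_file():
                tar.add(p, arcname=str(p.relative_to(source)))


def restore(archive: Path, target: Path) -> None:
    """Unpack the archive into target (the archive is one we made ourselves)."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target)


def backup_is_restorable(source: Path) -> bool:
    """Back up, restore into a scratch directory, and compare checksums."""
    with tempfile.TemporaryDirectory() as scratch:
        archive = Path(scratch) / "backup.tar.gz"
        restored = Path(scratch) / "restored"
        restored.mkdir()
        backup(source, archive)
        restore(archive, restored)
        return checksums(source) == checksums(restored)
```

Running `backup_is_restorable` on a schedule exercises the same `restore` code path you would reach for during a disaster, which is the point. Note that this sketch ignores empty directories, permissions and ownership.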

(Only mentioning since I wrote the above quote: I agree with the general thrust of this comment.)

For the public record, by quoting you I wasn't implying that you agreed with my #2 either. I just felt I gained a lot out of both points, that they both resonated with my experiences, and that they both articulated the lessons I'd learnt in my career.

edit: I split this with my parent reply to try to make the two separate points clearer

Oh no worries at all; I just like continuing to beat my "Japan is a big country with a diversity of practices and attitudes in it" drum. (It is under-beaten both inside and outside of Japan.)

I'd second that any day, even though I'm probably guilty as any at generalizing.

Agreed. I wasn't trying to make the point about "Japanese companies" (and I've edited my other post replying to patio to split the two comments I was making so that they're clearer) but rather about the process aspect. I am sure that Japanese companies, just like Western companies, come in a wide range of competencies. Clearly patio worked for a great one, however, and those lessons apply to companies all over the world. That's why I quoted it.

Definitely. Though I admit I prefer automation over red tape where possible.

What you can say about Japan, is that since technology-wise it tends to be behind the US (of course this too is a gross generalization), you can expect most non-startups to use bureaucracy over automation (and modern DevOps practices in general) to regulate production operation quality. The unfortunate side here is that bureaucracy is much more fragile and when it fails, it tends to fail spectacularly.

> regret not recording the source of it

The googlable nugget is actually "organizational scar tissue" (it caught my attention too). It's from Jason Fried. On twitter:


and also apparently in "Rework", quoted with more context here:


> When you wonder why big orgs are glacially slow compared to more nimble startup competitors, understand that those startups have yet to experience the Bad Things that the big org has probably already endured.

Another explanation for big orgs vs teeny start ups: What level of failure is acceptable? For a teeny start up, a few hours being down is not so important. For (say) a bank, being down for a few hours might be mentioned in the national newspapers.

That is certainly true of some of the red tape, but in no way is it true for the majority. A lot of "process" is created because the people in charge of the process need to validate their existence.

Undoubtedly that's a problem, I agree. Competent management helps minimise that, though. And I would not agree that the majority is existence justification. Then again, I'll disclaim that by saying I work in a Mech Eng. role, where process for safety's sake has been established and ingrained into the culture for literal centuries.

A good employment environment is one where you may ask why a process exists and receive valid justifications therein, but where the idea of not following it, no matter how bad, never crosses your mind. I acknowledge I'm really lucky to work in an industry that doesn't fall too far from that target.

I just wish big companies were more willing to remove this scar tissue.

Or even not scarify themselves before potential Big Bad happens.

Pushing the metaphor a bit too far.

If you get the chance to observe pilots operating in the cockpit, I'd recommend it. Every important procedure (even though the pilot has it memorized) is done with a checklist. Important actions are verbally announced and confirmed: "You have the controls" "I have the controls". Much of flight training deals with situational awareness and eliminating distractions in the cockpit. Crew Resource Management[1].

1: https://en.wikipedia.org/wiki/Crew_resource_management

There is a neat video[1] where a Swiss flight has to make an emergency landing and just happens to have a film crew in the cockpit.

[1] https://www.youtube.com/watch?v=rEf35NtlBLg

Here's a great documentary [0] by Errol Morris about the United Flight 232 crash in 1989 [1].

"..the accident is considered a prime example of successful crew resource management due to the large number of survivors and the manner in which the flight crew handled the emergency and landed the airplane without conventional control."

I highly recommend it

[0] https://www.youtube.com/watch?v=2M9TQs-fQR0

[1] https://en.wikipedia.org/wiki/United_Airlines_Flight_232

Another recent example is Qantas QF32, which had an engine explode (fire then catastrophic/uncontained turbine failure); the A380 landed with one good engine and two degraded engines. The entire cockpit crew of 5 pilots did a brilliant job in landing the jet.

It's amazing how decisive both pilots are despite the large amount of process going on.

It's like frameworks in programming: it frees brain cycles to focus on what's important

I kind of agree with the conclusion, but I'd look at that from the opposite direction: i.e. just like frameworks, checklists provide a way to avoid some mistakes in repetitive, boring practices. But you'll never avoid them all, and you're going to need a lot of red tape.

Even better to avoid all that and make it idiot proof. I'd rather not be in the situation where my only protection is rigmarole. But sure, as a last resort (just like frameworks) - much better than nothing. Typically in programming, frameworks are premature. A simple api tends to suffice, with checked (or typechecked) inputs and outputs. But sure, if for some reason you can't make that, and you need complex interactions with dynamically generated code, variable number of order-dependent parameters, black-box "magic" base-types, stringly-typed unchecked mini-languages, multiple sequentially dependent calls into the same thing, or any other tricky api you can't (or won't) easily detect misuse for... then a framework is the least-bad amongst bad options.

Thanks for that, this is amazing.

I speak their language (it's German, but people from Switzerland speak a pretty strong dialect) and they discuss highly technical and serious stuff, but their language is just so adorable when they mix the English and the German. I always thought this communication is English only, nowadays?

As documented in this great book: https://www.amazon.com/Checklist-Manifesto-How-Things-Right/...

not just pilots ... doctors, nurses, etc.

This reminds me of Japanese train crews and factory workers who use hand signals and audible call-outs to reinforce checklists with muscle memory.

Every time you create a checklist for developers, it's a mild kind of failure. Human procedures fail too, and we should rely on those as little as possible. Instead of checklists, we should have tested and debugged software.

Now, when you can't have tested and debugged software, yeah, formal procedures are the second best thing. Just don't get complacent there.

I've been lucky to be in a cockpit before and during takeoff. It was an amazing experience to listen to the pilots going through the checklist and agreeing on procedures in case of engine failure etc. Before anything was actually done, they trained me to put on a pilot's mask etc. and to follow procedures if something should happen. Nothing did happen, of course.

Great operations teams have incident response procedures based on checklists and clear communication channels like that as well. If this interests you I recommend David Mytton's talk at dotScale 2015: https://www.youtube.com/watch?v=4qGcTOQRvEU

Another talk about using checklists from StrangeLoop 2016:

"Building Secure Cultures" by Leigh Honeywell https://www.youtube.com/watch?v=2BvVZU4IPKc

(checklist part starting from about 16:00)

This[1] chap's story is very relevant. Taking those procedures and applying them to the medical industry. I swear I heard this on a podcast but I'm pretty sure the only thing I'd have heard it on is This American Life and I can find no mention of it.

1: http://chfg.org/chair/martin-bromiley/

I've been told that submarines operate in the same fashion.

Yes, good tip from "Turn the Ship Around" by David Marquet is to use the "I intend to" model. For every action you are going to undertake that is critical, first announce your intentions and give enough time for reactions from others before following through.

I saw that Space Shuttle landing video that was kicking around recently. In that they also had explicit "I agree" responses to any observation like "You're a bit below flight path". Quick, positive acknowledgment of anomalous events or deviations. Seemed really ... sane.

do you happen to have the link for that video (and related discussion if posted on HN)?

I didn't hear any "I agree" in that one.

I found this one which does:


Does the phrase "I'll show you" also have a predefined special meaning? He seems to repeat it quite often.

I heard that as "I show you xx". One guy is flying with his head up, the other one talking most is monitoring with his head in the instruments, and helping the pilot flying getting confident data.

Remember they're driving 1970's technology, redundant everything, and they all grew up flying "steam gauges", where the culture includes tapping on the glass to make sure the needle didn't stick. They want to compare every sensor output for sanity so they can disregard one if needed.

This vid is also the source of cockpit audio for the FSim shuttle simulator game if you like this stuff.

So "I show you X" stands for "I see X on the instruments" (as opposed to just stating X as a fact)?

Yeah. Oh and Houston did the same thing, calling out the 180 and the 90 on the HAC - heading alignment circle. Just helping them out with their radar indication.

PS - here's what happens when they got a bad instrument and didn't catch it: http://www.avweb.com/news/safety/183035-1.html

What an awesome video; saving for future reference. We have much to learn from this in IT.

Missed this reply, sorry! It was in this one I think towards the end: https://youtu.be/Jb4prVsXkZU

There's a mixture of positive confirmation and criticism being shared. eg around the sixteen minute mark:

    Pilot: Your radar's good. My radar's good.
    Commander: I agree.
Then 16:37:

    Pilot: You're going just a little bit high.
    Commander: I agree.

"I intend to" is actually a phrased deliberately used a lot during military mission briefings. I always wondered why, I guess it's deliberate.

The military uses "intention" rather than "I will" because they understand that no plan survives contact with the enemy. It is also higher level so that when exhausted, stressed subordinates find themselves in life threatening circumstances the most important thing they need to remember is the intent. If they forget steps 1-5 of the plan but recall the intent they can't go that badly wrong in using their initiative.

There's also the distinction between specific orders - "you are to". Knowing the commander's intent, and his commander's intent (known as 1 Up and 2 Up) enables Mission Command, the concept of giving latitude to subordinates to achieve the mission in the best way possible within the confines and direction given to them.

Finally there's the ritual of it, NATO forces expect to operate within multi-national structures where English won't be a first language. There is a "NATO sequence of orders" which should be roughly followed. It means everyone knows the structure of what is coming up and when in the brief - so you don't get people asking questions about equipment when the limitations are being explained, they know that comes later. Opening, especially from a junior officer, with "my intention is" is essentially like having a schema definition at the start of a document - it defines the structure of what is coming for those who are going to be parsing it.

Much better answer! I recognise a lot of that.

Yup, it's never the fault of a person, always of the system. Once we get this resolved we'll definitely look at ways to prevent anything like it in the future.

The reasoning for that was to prevent reductio ad absurdum and "It's YOUR fault".

"You should have KNOWN that erudite command was going to fail."

"You should have known that our one-off program had issues you did not account for."

"You should have known that the backups were not properly tested elsewhere for known good state."

"You should have ....."

In reality, some disasters were caused by idiotic things like "rm -rf /opt/somedir ." You just hosed the system, or a large part of it quickly. And we could say that your malfeasance of including the "." started wiping / immediately. But we can also say that rm should be aliased to prevent accidents like that, or that rm should do some minimal sanity checking on critical directories before executing.

People can, and will mess up. These computers are nice, in that they can have logic that can self-correct, or at least loudly alert errors.

Wishing you the best of luck partly out of sympathy, and partly as an active user wanting my account operative again :)

At some point you have to blame the person. If the person did it wilfully, deliberately, ignoring the checks and safeguards. Or because there's already so much system that having any more of it would place severe restrictions on everyday tasks.

I'm not saying that's the case here, because it does seem that GitLab has systemic deficiencies. But "never" and "always" are such strong statements.

Prosecutor: Mr. Accused, here is evidence that you murdered that other person. Accused: It's not my fault, but the system's. Judge: Oh, ok. You are a free man.

The law and a "no fault" postmortem are entirely different things and you conflating the two doesn't help the discussion.

Let's assume a bad actor in a company. It still doesn't help improve the situation to allow blame to rest with the bad actor. Definitely, there should be penalties applied (likely the termination of their position), but it doesn't help your company at all to stop there.

Did they delete data? Why is there no secure backup system in place to recover that data? Why was there such lax security in place to allow them to delete the data in the first place? Why are we hiring people who will go rogue and delete data? Did they "turn" after working here for a while because of toxic culture, processes, etc?

Hell, if the law worked this way, we might actually have less crime because we'd look further into the causes of crime and work to address them instead of simply punishing the offenders.

On a production server this delicate, (eg: production db), I'd replace /bin/rm with a script that moves stuff to /trash.

Then, always remember to delete /trash a few days later.

This goes on my machines:

    sudo apt install trash-cli
    alias rm="echo This is not the command you want to use"
This way I managed to unlearn my rm trigger-happiness and use trash instead. I had too many incidents already... ':|

I used to use the same strategy, but it's important to know that on some configurations, there are issues with the gnome trash specifications/tools:

this bug has been open for, uh, 7 years.

Didn't know trash-cli, thanks for sharing.

Even better, move stuff to /trash/$(date -I), and have your cleanup script regularly clean out any directory from before a week ago.
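Something like this, as a Python sketch of the dated-trash idea. The function names, the collision suffix and the seven-day default are all my own assumptions. Note that `shutil.move` falls back to copy-then-delete when the trash lives on a different filesystem, so for large files you want the trash directory on the same volume.

```python
import shutil
from datetime import date, timedelta
from pathlib import Path


def move_to_trash(path: Path, trash_root: Path, today: date) -> Path:
    """Move path into trash_root/<YYYY-MM-DD>/ instead of deleting it."""
    bucket = trash_root / today.isoformat()
    bucket.mkdir(parents=True, exist_ok=True)
    destination = bucket / path.name
    counter = 1
    # Avoid clobbering an earlier deletion with the same name on the same day.
    while destination.exists():
        destination = bucket / f"{path.name}.{counter}"
        counter += 1
    shutil.move(str(path), str(destination))
    return destination


def purge_old_trash(trash_root: Path, today: date, keep_days: int = 7) -> list:
    """Permanently delete dated buckets older than keep_days; return their names."""
    cutoff = today - timedelta(days=keep_days)
    removed = []
    for bucket in sorted(trash_root.iterdir()):
        try:
            bucket_date = date.fromisoformat(bucket.name)
        except ValueError:
            continue  # not one of our dated buckets, leave it alone
        if bucket.is_dir() and bucket_date < cutoff:
            shutil.rmtree(bucket)
            removed.append(bucket.name)
    return removed
```

`purge_old_trash` is the part you would run from your cleanup job; passing `today` in explicitly makes it easy to test without waiting a week.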

This sounds good until you consider that many systems are utilizing multiple drives. When someone is expecting to delete a large file and it ends up on a different drive, problems could arise.

Renaming a file should not move its data, right? So rename the file into .file.$(date -Iseconds).trash (but make sure that no legitimate files are ever named in this pattern). Then put that file path into a global /var/trashlist. To cleanup, you just check that file for expired trash and make the final deletion.

Beware race conditions when writing to /var/trashlist (assuming you mean "a file with one path per line.")

Proposed tweaks: symbolic link into /var/trashlist directory, where the name of the symbolic link is "<timestamp>-<random stub>-<original basename>". Timestamp first so we can stop once we hit the first too-recent timestamp, random stub to keep the name unique if two files with the same basename in different directories are deleted at the same timestamp, original file name for inspection.

It will when moving across disks.

Nice, but in practice I've found that when working with replication on big production DBs I usually don't have the space to hold multiple copies of a data dir that I want to delete, and copying elsewhere takes too long. Not a blocker to using it by any means as at least this adds a natural sanity-checking stop.

And then your disk fills because your new hire doesn't know about trash and you have an outage and remove it

Which, as mentioned is a systemic problem that has to be solved by training. And/or you can set up cron jobs to do the cleaning. Or have some conditional script triggers.

Cron job with a find command fixes that.

Cron job doesn't fix anything.

You're at 90%, and you have an app error spewing a few MB of logs per minute. Your on-call engineer /bin/rm's the logs, and instead of going to 30%, you're still at 91%, only the files are gone. Your engineer (rightfully) thinks "maybe the files are still on disk, and there's a filehandle holding the space", so instead of checking /proc to confirm, he bounces the service. Disk stays full, but you've incurred a few seconds of app downtime for no reason, and your engineer is still confused as shit. Your cron job won't kick in for hours? Days? In the meantime, you're still running out of disk, and sooner or later you'll hit 100% and have a real outage.

Cron job is a stupid hack. It doesn't solve any problems that aren't better solved a dozen other ways.

That engineer should have read the documentation. Failing that a `du -a / | sort -n -r` would immediately show the conclusion you jumped to was wrong. Randomly bouncing services on a production machine is a cowboy move.

No documentation, no checklists? That's the source of your problem, not a trash command which moves files rather than deleting them.

Changing the meaning of a decades old command is the problem - du / can take hours to run on some systems, and randomly bouncing shit in prod is the type of cowboy shit that happens in a panic (see gitlab live doc).

Docs and checklists are fine, but at 2am when the jr is on call, you're asking for problems by making rm == mv

Maybe add a bash function to check the path and ask the magic question: "Do you really want to delete the dir XYZ in root@domain.com?" .... but then again when you're in panic mode you might either misread the host or hit 'y' without really reading what's in front of you.

The best thing to do is to never operate with 2 terminals simultaneously, when one of them is a production env, better login/logout or at least minimise it.

At my last place, the terminal on prod servers had a red background, HA servers was amber and staging servers green - worked a treat.

Perhaps, Do you really want to delete the dir XYZ with 300000000000 bytes in it?

The problem in this case was that YP confused the host. YP thought he was operating in db2 (the host that went out of sync) and not db1 (the host that held the data), a message that doesn't display the current host wouldn't help in this case.
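Putting the suggestions in this subthread together, here is a hedged Python sketch of a prompt that shows the host and the size and refuses a reflexive 'y' by making you type the hostname. All names are hypothetical, and it still won't save you if you believe you are on the other host, but at least you have to type the name of the box you are about to hurt.

```python
import socket
from pathlib import Path


def directory_size(path: Path) -> int:
    """Total size in bytes of all regular files under path."""
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())


def confirmation_prompt(path: Path, hostname: str) -> str:
    """Build a prompt that names the host and the amount of data at stake."""
    return (
        f"About to delete {path} ({directory_size(path)} bytes) "
        f"on host {hostname}. Type the hostname to continue: "
    )


def confirmed(path: Path, ask=input, hostname=None) -> bool:
    """Return True only if the operator types the current hostname exactly."""
    hostname = hostname or socket.gethostname()
    answer = ask(confirmation_prompt(path, hostname))
    return answer.strip() == hostname
```

Someone who thinks they are on db2 would type "db2", which does not match db1, and the delete never happens.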

Indeed. That's happened on our systems as well. Someone issued a reboot on what they thought was a non-critical host, and instead they did it on some very essential host. The host came up in a bad state, and critical services did not start (kerberos....).

It led to a "Very Bad Day". I found out about it after reading the post mortem.

One day, I almost accidentally shut off our master DB so I could update it. It's funny.. I leave these types of tasks for later in the day because I'm going to be tired and they're easy & quick tasks. But that almost backfired on me that day; I read the hostname a few times before it fully hit me that I wasn't where I was supposed to be.

I always set my tmux window name to "LIVE" or something similar when I ssh into a live host.

Sounds like what you want is a nilfs2 or similar file-system...


And/or a system like etckeeper to help keep a log on top of an fs that doesn't keep one for you:


On some machines I've actually masked `rm` with an echo to remind me not to use it, and I would delete with `rrmm`. That would give me pause to ensure that I really mean to remove what I'm removing, and more importantly, that I didn't type `rm` when I meant `mv` (which I actually have done by accident).

We have 2 people on rotation and anything like this you pair with someone else and talk through it. As we like to say "It's only a mistake if we both make it."

Absolutely. Same goes for any ops response. You need two people: one to triage the issue and another one to communicate with external stakeholders and to help the one doing the triage.

The military does a very similar thing. An Army company commander usually has a RTO (radiotelephone operator) to handle talking on the radio. This frees the commander to make real-time decisions and respond quickly to the situation on the ground, and frees him/her from having to spend time explaining things. A really good RTO will function as the voice of the commander, anticipating what he/she needs to get from the person on the other end of the radio. This is a great characteristic of a good operations engineer, too. While the person doing triage is addressing the problem, the communicator is roping in other resources that might be needed and communicating the current situation out to the rest of the team/company.

Indeed, and a very good way to proceed.

Another thing to note is that the RTO will occasionally state someone and "ACTUAL". That means that whoever they are speaking for is actually speaking, and not the RTO on behalf of the CO.

That is interesting: You're saying that on any given weekend you have 2 people on rotation who work in parallel when/if there's a crisis?

"What? You told me to do that!"

"I meant the other server."

The best Ops people I have worked with (looking at you Dennis and Alan) repeat everything back that I say. More than once I have caught mistakes in my approach simply by hearing someone else repeat back exactly what I just said.

The best ops people have made all the mistakes they're helping you avoid.

Amen brother.

My worst nightmares are nothing but a blip on a developer's mind. I have lost it all. I have lost it all on multiple servers. On multiple volumes.

No one should ever experience that. Ugh, developing, sometimes, is just so frustrating.

I think this applies to pretty much any field : http://thecodelesscode.com/case/100

BBC's Horizon has a really good episode about checklists and how they're used to prevent mistakes in hospitals, and how they're being adopted in other environments in light of that success. It's called How To Avoid Mistakes In Surgery for the interested.

Here's the book on this topic-

The Checklist Manifesto https://smile.amazon.com/Checklist-Manifesto-How-Things-Righ...

There's the article that this book is based on The Checklist http://www.newyorker.com/magazine/2007/12/10/the-checklist

A must read if you don't want to make mistakes, it's given to all new Square employees.

So glad someone brought up this book. It's a good read and full of sound advice.

There are very, very few situations in life where there's not enough time to take a time-out, sitrep, or checklist.

I work in EMS when not in IT, and even bringing a trauma or cardiac arrest patient into the Emergency Room, there is still time to review and consider.

If there's time in Emergency Medicine, there's time in IT.

I worked in software to help manage this for a while. There are still checklists but they are produced ahead of schedule. Every instrument a nurse takes off the tray is counted and then checked at the end for instance.

Delivery room experience: before stitching my wife, the OB counted the pieces of gauze out loud with the nurse watching. They verbally confirmed the total with each other. A matching count and verbal confirmation were performed after the stitching. It inspired confidence seeing them perform this protocol.

With gauze in particular I think every nurse has a story of "that time we removed the septic gauze" with colorful descriptions of the accompanying smell.

Something related, I made a script that helps me to clean up my git repository of already-merged branches (I tend to not delete them until after a release cycle).

In this script I added a checklist of things to "check" before running it. It has worked in my favour every time I run it.

Here's what it looks like https://github.com/jontelang/Various-scripts/blob/master/Git...
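For readers who don't want to click through, the general shape of a checklist gate in front of a destructive script might look like the Python sketch below. This is not the linked script, just an illustration; the function name and the "yes" convention are assumptions.

```python
def run_checklist(items, ask=input) -> bool:
    """Ask the operator to confirm each item; stop at the first non-'yes'."""
    for item in items:
        answer = ask(f"[ ] {item} (yes/no): ")
        if answer.strip().lower() != "yes":
            return False
    return True
```

You would call it at the top of the script, for example `run_checklist(["Release has shipped", "Working tree is clean"])`, and exit without touching anything when it returns False. The two items shown are placeholders, not the ones from the linked script.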

Another lesson is this one which I also learned the hard way: don't work long hours or late at night unless it is absolutely necessary. It doesn't sound like it was totally necessary for YP to be pushing this hard, and pushing that hard is what leads to these kinds of errors.

Take your time and work when you're at your best.

> Don't get distracted and start switching between terminal windows.

I got bitten by this in the past, luckily nothing that could not be reversed, just 2-3 hours lost. I can imagine how YP's stomach must have felt when he realised what happened.

Still, I had no idea about checklists and so many people here seem to be pretty familiar with the concept :-)

The number of times describing something to a co-worker or manager out loud has allowed me to catch a problem or caveat is innumerable.

Even more so if you tag-team things, so much better.

Checklists are great if you have them and often you should create one even during an outage but other times you waste time. It's really hard to balance fixing things quickly and not breaking more.

There isn't a silver bullet anyway; it's layer upon layer of operational best practices that makes you resilient against such issues.

You missed the most important part: write down the backout plan.

And a logbook
