1. Only what can be done manually should be automated.
Before automating something, a sysadmin must be able to carry out the same operation manually. If they do not understand how it works without relying on someone else's scripts, they are not qualified to administer it.
2. What has been automated should always be doable manually.
After something has been automated, a new sysadmin should be able to read the automation code and redo the steps manually. Maybe not at the full scale of repeating every action manually on every server, but enough that, if the automation tools are irrecoverably lost, they can be rebuilt from knowledge.
This also serves as a test for adequate documentation in the automation code. A new sysadmin should not have trouble figuring out why a certain thing is done a certain way in the scripts.
3. What happens, even when automated, should not be hidden from the sysadmin.
Even in full auto, what's going on is always visible, so a sysadmin can jump in (i.e., go full-manual) at any step. This was one of our earliest rules, because we (my teammates and I) wanted to be able to debug without friction if anything went wrong. Also, it looks cool(er).
As a guiding principle behind how we automate operations, we assert that no amount of automation is a replacement for a sysadmin's skills; it replaces only manual, error-prone effort.
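As a sketch of what rule 3 can look like in practice (all names here are hypothetical; this is not the actual tool), every remote command is echoed before it runs, and a step mode lets the sysadmin pause, skip, or abort and continue by hand:

    import subprocess

    def run_step(host, cmd, step_mode=False):
        # Echo every remote command before it runs, so nothing is hidden.
        print(f"[{host}] $ {cmd}")
        if step_mode:
            # Let the sysadmin take over at any point; anything else runs the step.
            answer = input("run / skip / abort? [r/s/a] ").strip().lower()
            if answer == "s":
                return
            if answer == "a":
                raise SystemExit("operator aborted; continue manually")
        subprocess.run(["ssh", host, cmd], check=True)

    # run_step("db1.example.com", "systemctl restart postgresql", step_mode=True)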
Until you have to, 10 years later, for whatever reason. And suddenly you find you've forgotten, and have to look things up.
Besides, hiding the how does contribute a tiny little bit towards the forgetting. Not much, but still. So we keep asciinema recordings of the tool performing some of the important tasks, so they can serve as a quick look-up, while the code serves as the fully detailed look-up.
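Recording and replaying such a session is a one-liner each (the filename here is just an example):

    asciinema rec deploy.cast    # record the tool performing the task
    asciinema play deploy.cast   # replay it later as a quick look-up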
All operations during development are carried out manually, unless (and until, of course) already supported by automation. When automating new operations, a sysadmin must conduct the steps manually as well, and take care of considerations a developer might not have been aware of.
> What I didn't see mentioned is that when the automation fails for whatever reason, typically it is a five-alarm situation where minutes count.
Generally, this is easily handled by redundancy, HA, and snapshot rollbacks. Of course anything that's a potential five-alarm situation has redundancy, right? :P
No, that's not right. When automation fails, it's typically not in five-alarm situations. During five-alarm situations, knowing how to do things manually won't help, because if you could do them manually, the automation would already work.
The tools we use for this are just: a programming language (Python, looking to add some Hy), a rich SSH library in that language, and some wrappers we wrote around simple tools and techs (rsync, TLS, git, sqitch, etc.).
The wrappers themselves are (currently) 1261 lines of code (according to sloccount).
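The wrappers themselves aren't shown here, but a minimal sketch of the flavor, assuming paramiko as the "rich SSH library" (an assumption; the comment doesn't name one):

    import paramiko

    def ssh_run(host, cmd, user="root"):
        # Run one command over SSH; return its output, raise on failure.
        client = paramiko.SSHClient()
        client.load_system_host_keys()
        client.connect(host, username=user)
        try:
            stdin, stdout, stderr = client.exec_command(cmd)
            rc = stdout.channel.recv_exit_status()
            if rc != 0:
                raise RuntimeError(f"{cmd!r} on {host} failed: {stderr.read().decode()}")
            return stdout.read().decode()
        finally:
            client.close()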
We arrived at this after building our automation in Ansible, and then Salt, and learning from all the troubles and frictions we went through with either. They're both fine tools themselves, great for many tasks we'll never need to do, but some of their design decisions are at odds with what we needed.
...and this is why I like Ansible: easy-to-follow scripts that do all the leg-work, and are reproducible at the command line...
So I'd love to hear about some of the hurdles you encountered, so I can avoid them in the future.
BTW, I do agree with you. I vastly prefer a 'CLI fall-back' level of control, and that's the only way to keep core skills relevant. Another example of this:
Do it once manually, then use automation to replicate it 1000 times.
1. There are small bits of repetition in some of our steps that cannot be trivially abstracted in Python without sacrificing the simple aesthetic of imperative Python code, but they are perfect for macros (see the sketch after this list).
2. One of the components involves quite a bit of interaction with the user, based on some data. The data itself changes infrequently, and the interaction code is generated from the data, but the generator isn't as pretty (due to the nature of the data) as it could be in a Lisp.
These are not existentially important to the tool, but we've had bugs because of both. Something about the ugliness makes people not look at it too closely. While the bugs were small and easily spotted by an attentive eye, it'd certainly be nicer if the likelihood of future creation of similar bugs could be reduced.
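To make item 1 concrete, a hypothetical Python example of the kind of repetition meant: every step repeats the same announce/run boilerplate. A helper that takes the commands as data would hide the imperative flow; a macro could fold the boilerplate while keeping the code's shape. (The certificate commands below are invented for illustration.)

    import subprocess

    def run(host, cmd):
        subprocess.run(["ssh", host, cmd], check=True)

    # The same announce/run/verify pattern, repeated in every step:
    def deploy_schema(host):
        print(f"[{host}] deploying schema")
        run(host, "sqitch deploy")
        print(f"[{host}] verifying schema")
        run(host, "sqitch verify")

    def rotate_certs(host):
        print(f"[{host}] rotating certs")
        run(host, "update-certs")   # hypothetical command
        print(f"[{host}] verifying certs")
        run(host, "check-certs")    # hypothetical command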
That seems at odds with rule number 3, because sysadmins must have a fundamental understanding of the systems/network/OS, plus in-depth knowledge of your custom homegrown config manager and of the language it's implemented in, in order to debug.
Happy to hear it works for you though. Personally I'm willing to accept the shortcomings of an existing config management system like Ansible because it's a system that is already understood by a large percentage of sysadmins and the core program does not require any maintenance time/effort from my team.
Ultimately, it turned out:
1. We had already grown a custom configuration manager within the tool.
Both tools support many ways of doing the same thing, so we had to pick one way and constrain ourselves to it (e.g., master-minion vs. salt-ssh vs. masterless). This happened quite a bit, and, as usual, with enough use a pattern emerged. Plus, some ways simply did not exist and had to be built.
2. We had already learned large portions of the tool.
Ansible and Salt are simple tools when used for simple tasks. When using either for not-so-simple tasks, one invariably runs into portions of their code/behaviour one doesn't expect.
3. Any sysadmin we'd hire would need to know the configuration tool in-depth, anyway.
And, to our surprise, the vast majority knew only the basics, if that. Since we are, on principle, opposed to using something in an important capacity without understanding it well enough, and we needed to be certain any sysadmin we'd entrust with the responsibility did know their tool in-depth, we learned the tools in-depth, ourselves.
4. The tools are NOT simple.
When we'd learned Ansible and Salt ourselves, we found they were actually quite complex. That made sense; they have to take care of so many conditions, variations, and different situations.
5. Any sysadmin we'd hire would need to know programming and a programming language.
We already had extensions in these tools, custom modules written in a real programming language. And, in this day and age, anyone with a responsibility as important as managing our production servers has to be able to program anyway.
Our current ops tool is only 1.2k (library) + 1.3k (config mgmt, sans static) LOC; the config mgmt is plain Python, and learning it is vastly simpler than learning how Ansible or Salt work or how to write Ansible or Salt modules. The in-depth knowledge required is much smaller too (since we don't have to take care of all the many ways Ansible and Salt could be used, or of the platform differences they need to worry about): just Python, SSH, SFTP, Rsync, Git, Sqitch, TLS, and the OS we have chosen, almost all of which one needs to know anyway.
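For a sense of how thin such wrappers can stay, a minimal sketch of a subprocess-based rsync wrapper (names and paths are assumptions; the actual code isn't shown in the thread):

    import subprocess

    def rsync(src, dest, delete=False):
        # Thin wrapper: mirror `src` to `dest` ("host:path") over SSH.
        cmd = ["rsync", "-az"]
        if delete:
            cmd.append("--delete")
        subprocess.run(cmd + [src, dest], check=True)

    # rsync("/srv/app/", "web1.example.com:/srv/app/", delete=True)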
Seems kinda weak.
I'm sorry I wasn't clearer in my original comment. We do not have manual fallbacks. Everything is automated, with the capability of going full manual if ever necessary, at any point. The rules are simply to ensure that this ability can always be relied upon, even if never needed.
At the least, automating the unknown (the first pitfall mentioned in TFA) is a dangerous "unknown unknown" risk, and the first rule tries to protect against that. Then there's the risk of being burned by "Unknown automation" (the second pitfall mentioned in TFA), which is why we have the second rule. And finally, simply pressing a button and being told everything has been done for you, without being shown how (or worse, even that) it was done, is just boring (not to mention how this supports the third pitfall mentioned in TFA).
"Have a pilot in the cockpit who can fly the plane even if you trust your autopilot very much"
I find it's better to save my brain power for learning things like what I need to take into account when configuring a database, for example, or what variables affect network performance and what options I have for tuning, etc.
When I reach for tools to solve problems, the most I usually remember is "I can use X to solve this class of problems" and the how is available from the manual or with a Google search.
Although if you really do need to learn regex, I highly recommend https://regexone.com
At least for me, the way it was taught was super-helpful in retaining the syntax knowledge.
People have huge reservations about parsing non-regular languages (like XML) with regular expressions, because XML could have an attribute containing a string with more XML inside it, or something similar. But the point is mostly moot: regex is mainly used for partial matches, and the possibility of a string inside the document matching whatever you are looking for is minuscule, and will probably show up in tests anyway.
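For example, a partial match like the following (a generic illustration, not from the thread) extracts what you want without parsing the document at all:

    import re

    html = '<a href="https://example.com/report.pdf">Q3 report</a>'
    match = re.search(r'href="([^"]+)"', html)
    if match:
        print(match.group(1))  # https://example.com/report.pdf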
Of course this creates the pitfalls of automation, but I think it's unavoidable if you desire reliability and efficiency. Bypassing automation and typing in commands from memory or from man files, even if it's a one-off experiment with a fresh server, is going in the wrong direction to address the pitfalls of automation. If you're worried about not understanding the configuration commands your automation uses, a better way would be to call a meeting with some junior engineers (or a rubber duck) and talk through the configuration process line-by-line.
And in the process of doing this, people stop remembering the original languages and you are now stuck in config language land where you still have to look up syntax on how to do things anyway.
So from your perspective, the problem is not "not remembering how" or "not knowing how", it is "not organising processes". Am I getting that right?
From my perspective, he doesn't have a problem at all. I would even dispute that he doesn't know the "how." If his servers are configured via chef, then chef is the how. Going deeper and looking at the OS-specific mechanisms chef uses is necessary for debugging when the behavior doesn't seem to match the documentation, just like if you're doing it by hand, iptables is the how, and the system calls or library calls it makes are what you'd look at if you couldn't understand its behavior from the man page and other docs.
In other words, you shouldn't worry if you're forgetting something because you never need to use it. When you need to use it, then you will learn it again and count it as an advantage that you did know it at some point in the past.
Well, this is a little weird, because the default sshd_config file tells you how to do it in the comments. So basically you just need to be able to figure out where the sshd config file is.
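On most Linux distributions that file is /etc/ssh/sshd_config, and the shipped defaults appear as comments you can copy and edit, e.g.:

    #Port 22
    #PermitRootLogin prohibit-password
    #PasswordAuthentication yes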
At a previous job we had a wiki with all that stuff. It was quite nice.
A sysadmin should know how to troubleshoot, how to use tools, and (more importantly) how to learn how to use the tools.
I don't want to discount the value of someone who still remembers how to handle seemingly arcane tasks like iptables. Experience is worth its weight in gold, and memory definitely contributes to its value, but there's more to it than that. It is the capacity and willingness to learn in any situation, however, where a talented sysadmin really shines.
If this guy had to work on a server for a week without any config manager, I'm sure he'd feel right at home again.
As opposed to a guy who has never touched iptables. That guy is in it for the long haul.
Maybe a better title is "I forgot the commands (and some directory locations) I use to manage a server."
(I'm just a home-server admin ;) but that also means significant stuff only happens every 3-5 years at a new LTS or with some major problem, so notes to self are important.)
Such notes to self not only help when things go wrong, they also make for easy repetition of stuff years later. When I reinstall, I first back up the entire old root with all configs and thus all my notes (I could limit myself to /etc probably).
It's basically a collection of bash scripts that pull all the config files from your servers. You can view the results as text, or they have a server that helps with long term trending and comparison.
For example, I used this to figure out which servers were using specific ssl certs when I renewed them.
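That kind of lookup stays simple once the configs are mirrored locally; a sketch (the directory layout is hypothetical, not the actual scripts):

    import glob

    # Find which servers' nginx configs reference a given cert file,
    # assuming pulled configs live under configs/<server>/...
    cert_name = "example.com.crt"
    for path in glob.glob("configs/*/etc/nginx/**/*.conf", recursive=True):
        with open(path, errors="replace") as f:
            if cert_name in f.read():
                print(path.split("/")[1], path)  # server name, matching file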
Ops today is largely responsible for maintaining build / deploy pipelines, orchestration systems, and ensuring SLAs are met via SRE activities.
Most functioning DevOps teams I've worked with recently have added a more generalist role for a person who is a mile wide and a foot deep. It's more of a hybrid sysadmin / development skill set ranging across base OS and package management, logging, scripting / automation, networking, access control, security, a dozen programming languages, and whatever ITIL / EA platforms you have to interface with. These folks are a godsend in issue resolution, as they know where the skeletons are buried. They can also pinpoint your top 5 tech-debt issues off the top of their head.
The best version of this person also has some BA skills — they work really well as a demand management / intake person because they usually understand the end-to-end architecture — especially the code behind the integration interfaces — better than anyone on the team. They allow developers to focus on code, ops people to focus on production ops, and architects to focus at the right level of abstraction.
Someone's Puppet and Ansible scripts for some tasks are no comparison to the C standard, and do not come with the breadth, stability, or support of C.
The point was that it's common to forget details on a different level of abstraction from the one you normally operate at. It doesn't matter if it's an ANSI standard or a hand-rolled DSL for a very niche use case.
I don’t think it’s a big problem nowadays. As long as you know how to quickly get the information you need, it’s probably not that much more inefficient.
90% of stuff I don't know off the top of my head I do like this:
1) connect to a server/repository which already uses that in some form
2) copy/paste relevant parts.
3) add new parts by using existing parts as a reference
4) fill in holes with man page/docs/internet search engine
If I had to start from scratch I'd probably fail the basic syntax with pretty much anything.
The important thing is to know that you forgot implementation details, but know where to look them up.
Part of me feels that installing software or running basic configs should still be "collective knowledge" for a sysadmin, and I'm even forgetting how to do that.
If that's true he'll pick it up soon enough. The muscle memory will start flowing again.
If he wants something to panic over, try moving to k8s. It effectively includes a Puppet-like thingy done a completely different way, and worse, it builds a very different computing model. Use that for 10 years, then try to re-adjust your thinking back to a single-server model that you maintain by editing config files, and you will be in for a rough ride.
I focus on how to quickly learn (or re-learn) how to do it.
As soon as the calculator was invented (okay, and made cheap and portable), this skill was useless. Then you had to set up problems so that the calculator could solve them. (Similarly with the computer).
This is a natural and good progression. All the best practices should be learned and mastered by people building the tool, and you should just be proficient at the tool.
Partly this is to fix things when the higher abstraction breaks.
But TFA's point is to manipulate the system without bringing in the cost of the higher abstraction.
I was reminded of that by the realization that I'd forgotten how to manually convert between binary and hexadecimal strings in C.
I'm still scared.
But, above all, I want to stop doing things in C.
The solution is really, really simple: don't hire people without an RHCE, or who haven't had one in the recent past. Feel free to substitute a comparable certification for the RHCE, though those are few and far between.
This is one that I genuinely don't know how to solve; I do not know how you evaluate somebody whose skillset you don't fully understand. But certifications aren't the answer.
I don't understand how this is possible. Did you validate their RHCE at redhat.com? It's not a multiple-choice test; it's real systems (not connected to the internet, no Google for you), and you don't pass if you can't configure/repair them and the services required. The requirements aren't toy examples; they tend to be complicated, involving things you want to avoid due to difficulty. Most people (>50%) fail the exam, according to Red Hat's statistics.
Have them "drive" you through the resolution of one of the issues that you expect could happen. See how they behave and what they try - they may not necessarily have the complete correct answer, but you will see if they know where to look, the basic concepts, and so on.
I've interviewed people capable of writing ansible scripts, but with no real ability to dive into issues and fix them and/or investigate what's broken. Many would not even tell me "we could just provision a new VM to replace the one that went down, right?"...
Because I'm hiring somebody to do something I know well--this is my bailiwick! In that post, I was trying to approach it from the perspective of someone who...isn't...a former devops consultant and miles deep in this stuff. ;)
Are these actually things sysadmins know without googling or peeking into manpages?
I believe you should see a problem there.
Thankfully you are quite wrong there; lots of us keep running the services that were designed to be decentralized.
This is how I do it too, except replace Google with DuckDuckGo.