I’ve written Java programs that write a bash script that runs under cloud-init. It’s a lot less code than Puppet, it “just works”, and you never hear “you can’t get there from here” or spend three days figuring out how to configure something you could configure in three minutes if Puppet weren’t involved.
This will be an unpopular answer, but none. Karpenter for AWS, Rackspace Spot, or microk8s up… Stock Debian + Kubernetes means I never have to configure a host system again. Even RAID these days is one line of config: “instanceStorePolicy: RAID0”.
My Raspberry Pis at home are maybe the only exception, in which case I don’t need to harden anything as they’re isolated on my network and run trivial things like a webcam.
I’ve been a Linux sysadmin my entire professional life, and I’ll never go back to VIMing conf files ever again. It’s wonderful.
That said, if you do have to do this, Packer and Ansible still exist.
Typically kustomize and plain YAML - I like “infrastructure as data”. For IaC though, I’ve been enjoying Pulumi, which has been a hit with my TypeScript-loving developer friends.
I recently wrote something like this, as I was replacing my home media center/server box with new hardware, and wanted to start fresh instead of imaging the drive and copying it over.
At first I considered that it would be a good opportunity to learn something like Ansible, but after looking at the getting started docs, I realized I didn't feel like taking the time. So, a bash script!
I wrote a script that assumed I'd just installed a Debian base image. The script installs a couple needed things with apt-get, adds some extra config to /etc/apt, and then does a full upgrade, as well as installing other needed packages.
Next it creates some daemon users & groups, copies in a file system overlay (mostly config files), and then has a list of things to download and install to /opt (stuff that isn't available from a Debian repository). There are a few things I run on it that come in Docker containers, so those are set up (with systemd service files) as well.
Finally it installs duplicity and restores app data that gets backed up nightly to S3 on my original media server box, and sets things up so those backups will happen on the new box as well.
Ultimately I'm not thrilled with it: some of it is fragile (like the 'sed' commands that edit existing config files), and I of course made mistakes as I was writing it, so I also had to take care that the script was idempotent and could handle being run multiple times on the same box without erroring or re-doing what had already been done. I imagine/assume something like Ansible (or Chef or Puppet or whatever) would handle this for me.
But it worked, and was fairly low-effort compared to learning a new provisioning system, so it's fine. Maybe I'll learn Ansible another day.
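For the curious, the guards mostly end up looking like this (a rough sketch with illustrative paths, package names, and user names, not the actual script):

    # apt-get install is already safe to repeat
    apt-get install -y rsync duplicity

    # append a config line only if it isn't there yet, instead of sed-ing blindly
    grep -qxF 'Storage=persistent' /etc/systemd/journald.conf ||
        echo 'Storage=persistent' >> /etc/systemd/journald.conf

    # create the daemon user only on the first run
    id -u mediasrv >/dev/null 2>&1 ||
        useradd --system --home /opt/mediasrv --shell /usr/sbin/nologin mediasrv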
I have a NixOS config in GitHub that will configure all of my servers to be identical when you symlink it to /etc/nixos/ and run nixos-rebuild switch.
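Roughly like this (the repo URL and paths are placeholders):

    git clone https://github.com/example/nixos-config ~/nixos-config
    sudo mv /etc/nixos /etc/nixos.bak        # keep the generated config around
    sudo ln -s ~/nixos-config /etc/nixos
    sudo nixos-rebuild switch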
For one startup, the MVP included deploying some Linux-based overseas factory stations, under initially impossible conditions.
For reasons that were really good in context, the software aspects of our station builds were based on a set of step-by-step written manual instructions, to configure the Linux distro to "bootstrapping" state for running the huge configuration shell script.
The huge shell script was written to be idempotent. So it could do both the initial heavy configuration and then the same or later versions of it re-run to make any engineering-change adjustments. Including re-run safely remotely, after the machine was deployed on the other side of the globe, in an active production line. If you're comfortable with Bash or Python, a script like this can be easier than shoehorning the problem into the structure of some declarative tool.
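The overall shape was roughly this (a heavily simplified, hedged sketch; the function names, packages, and service unit are placeholders, not the real station code):

    #!/usr/bin/env bash
    set -euo pipefail

    CONFIG_VERSION=42                       # bumped with each engineering change
    STAMP=/var/lib/station/config-version   # records what was last applied

    ensure_packages() {
        # already-installed packages are simply skipped on re-runs
        DEBIAN_FRONTEND=noninteractive apt-get install -y chrony nginx
    }

    ensure_services() {
        # enable --now is a no-op if the unit is already enabled and running
        systemctl enable --now station-agent.service
    }

    main() {
        ensure_packages
        ensure_services
        mkdir -p "$(dirname "$STAMP")"
        echo "$CONFIG_VERSION" > "$STAMP"
    }

    main "$@"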
(The really good reasons would sound crazy out of context, and you wouldn't think it worked, but we did ultimately have perfect software uptime. And, yes, there were plans (before the Covid chill on our customers and VC) to do a lot of things differently in our next-gen station, including putting most of it in a container.)
Besides bespoke scripts (Bash, Python, Make), I've also used declarative off-the-shelf tools that do a lot of heavy lifting, including doing things with Terraform. (For example, Terraform to allocate a set of AWS EC2, storage, and networking, to simulate a real metal server environment, for developing some infrastructure software.) But, even with those tools, there's always some kind of documentation of it, and often also bespoke scripts.
Regarding remembering all the steps and getting them right... One of the tricks to documentation for things like "pets" servers is to have a canonical internal wiki into which pretty much all useful info that's not otherwise in Git goes. For example, if you go to the AWS Console to add or change something, have the right wiki page open, and update it, including copy&pastes, as you go. (Also, cross-link that with any issue-tracking for the task.) If you keep wiki access low-friction and high-value, it should save you at least as much time as it costs, and be even more valuable to others. (Put loosely, I've seen situations in which a project or a person's job hinges on whether or not one key sentence was captured in a wiki.)
Sometimes, this documentation can later be turned into a script.
Following these lightweight doc conventions, even my personal laptop has a wiki page on how to reconstruct its configuration, and it's kept up-to-date for years.
nixos-anywhere and some flakes is all i need! sometimes for new setups I'll run nixos-anywhere and write the config directly on the machine, but that's usually just for sketching out ideas. I'll reprovision with a module after I've worked out the kinks.
I just transitioned my home and homelab PCs to NixOS + nixos-anywhere + flakes and I can't imagine going back to Ansible or Puppet or even docker compose.
you know that it won't drift. you can generate images from the same config you apply to a host. you can stamp out the same config across N hosts. you can completely reconfigure a host without reprovisioning it from scratch. you can run slim VMs by mounting the nix store as read only.
it would be great to run an entire company off nixos this way
I recently started using devbox over homebrew for package installs on my Mac and this has gotten me more interested in nix - however, neither of the VPS providers we use offer NixOS as an OS option, is there a way to achieve that declarative setup on Ubuntu?
I had high hopes, but it's obviously designed for less paranoid consumers, since even executing the verb "plan" is met with "ya, fam, but I'm gonna need to be root to tell you what I'm gonna do":
`lix-installer` needs to run as `root`, attempting to escalate now via `sudo`...
TRACE execute:execute: lix_installer::cli: Execvp'ing `"sudo"` with args `["sudo", "env", "SHELL=/usr/local/bin/bash", "./lix-installer", "-vvvv", "--logger=full", "plan"]`
I guess I'll have to patch the source to have it tell me why it wants unrestricted access just to tell me what it is going to do
I’ve written a huge number of shell and Python scripts to deal with these sorts of things. But I recently started learning Go and had a lightbulb moment: why not write a cross-platform tool that does this? I really like Neovim’s approach of using Lua for configs, both for the core app as well as for plugins, so I too am taking that approach.
(Yes, I’m aware that Ansible and Nix exist. I like them. This is really just a hobby project for learning Go).
So far it’s going great. The Neovim-inspired approach of essentially viewing applications and other incidental configurations (like ssh keys) as “plugins to your core system” feels like the right way to solve this problem. And given that Lua can exist as a simple YAML/TOML/JSON-esque collection of fields, but can also drop into proper functions or even quasi-OOP behavior, writing your own “presets” for a given machine config is also trivial. And Go is proving to be a fantastic language for running it all. Really excited to see where this project leads.
These days I use Ansible to take care of all of that. I can’t share the scripts due to company restrictions, but it has probably built ~10,000 servers in the last year.
I first run a bunch of checks to try and make sure the build will be successful. Then create the instance, where cloud-init does a few basics to allow the rest to work. Once the instance is up, the connection in Ansible flips from localhost to the server, mounts the drives, installs everything that needs to be installed, does whatever configuration is needed, and adds records in whatever systems of record need to be updated. The whole process takes about 10 minutes or so (for a single server), depending on some external dependencies. The time increases as the count goes up. That’s probably something we could solve for, but it hasn’t been a big deal so far.
It's also handy for keeping systems in a desired state, but I use it a lot more for "clone this when you boot" times hundreds of devices. Error recovery is left as an exercise for the reader, but it's mostly a matter of configuring log egress early and having sane health check policies.
I’ll have to look into this a little to see if it might be useful for some other things we have, if we’re even allowed to use it. We’re running everything through AAP, which is useful for audits. We do have a home-grown tool for installing and maintaining the state of a lot of the things we need to set up. We install that during setup with Ansible and wait for it to finish its initial setup to make sure there are no errors before we turn it over to the customer.
The other issue we have is our change management process. We need to provide a single update with the status of everything when it’s done. So our process can only go as fast as the slowest server. We could change this, but it would be a political nightmare I don’t want to deal with.
These are all meant for RasPi embedded controls, so they don't handle a lot of security related things that aren't relevant for just a Pi on a private network without ports open.
I set the password with the flasher utility, then have my app server just use Linux authentication so I have fewer things to mess with and more that can be done with standard tools.
Unfortunately MQTT can't do that and the PKI model is hard to set up fully automatically, but almost all routers have guest networks and such, so relying on WPA3 is fine for non-critical stuff.
If I need remote access, I use Zrok.io and avoid having to manage certs myself.
Love zrok.io, I work on its parent, OpenZiti. It makes me wonder: OpenZiti makes PKI much simpler while providing the secure overlay, and we even used our SDKs to demonstrate zero trust overlay networking built into MQTT - https://github.com/ekoby/mqziti... could that be useful for your use case?
On mobile at the moment, can't share - but I have a great library of Ansible roles.
Strongly recommended for those looking for inspiration. Give "ansible-doc --list" a spin! It shows modules, and it can filter for existing roles as well.
Happy to share! I'll save the sales pitch, but the module/role library is truly a good source of inspiration despite the language.
The code itself isn't that important - declarative YAML, but best practices/patterns can be an art. I don't actually have much published that wouldn't tie my identities together :P
Avoid using Ansible's 'shell' and 'command' modules to wrap shell commands... outside of information gathering, stick with the modules specific to the work. Also avoid overusing 'when' - you may often prefer 'handlers' :)
I have built server deployments with PowerShell DSC. Honestly, it was not stable enough for my liking at the time, and you were on your own in terms of support. I never went back, and work decided to go with a mix of VM extensions and Chef.
For my Jellyfin server, I have a bunch of PowerShell scripts, mostly to reproduce the config of the components around Jellyfin (NGINX, the cert for public access, TV listings, automated content scraping, auto-update, library management, etc.). I am building a PowerShell module which wraps the Jellyfin REST API to help with all this. Lastly, the tasks which run on a schedule self-register with Windows Task Scheduler.
Not ready yet, but a Tcl-powered agent that manages *BSD jails.
I currently have a working design where my host has a jail, which hosts a jail within it, which in turn hosts another jail, which then hosts two more jails for DNS and WWW.
or, otherwise:
Host -> Infrastructure -> NetServices -> Netbox which hosts DNS and WWW
This agent sits on all jails and reports back stats such as networking, with auto throttling. Because a jail is just a ZFS zvol and a config away, it has the ability to create them too.
All sitting cosily within a Tk GUI. On the roadmap is plugging it into NaviServer and, with a dash of vanilla JavaScript, turning it into a web frontend.
I use Ansible to setup my servers, and also all my workstations.
The first step is to run a bootstrap script, then run a tailor made playbook for each situation.
Those specific playbooks haven't been made public, but I wrote something[0] last year about how to setup an Android development environment using Ansible, and as part of that shared my bootstrap script.
> The first step is to run a bootstrap script, then run a tailor made playbook for each situation.
Up to you, but just in case you weren't aware: Ansible has a `raw:` task which enables one to run anything that the `connection:` transport tolerates, which can go a long way toward getting rid of any extraneous shell scripting.
Further toward that goal, I found that PyPy is "untar and go" for every distribution that I've cared about, so even getting python on the machine can similarly be bootstrap friendly
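Concretely, something like this (the exact version and paths are illustrative):

    # fetch a portable PyPy build and point the automation at it
    PYPY=pypy3.10-v7.3.16-linux64
    curl -fsSL "https://downloads.python.org/pypy/${PYPY}.tar.bz2" | tar xjf - -C /opt
    ln -s "/opt/${PYPY}/bin/pypy3" /usr/local/bin/python3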
Like others have said, cloud-init is the way. For the few instances that may not have the right tooling baked into their base image or where it’s not fully supported by the orchestrator (like Proxmox containers), I use https://github.com/rcarmo/ground-init (which I first wrote to set up my laptop and local machines and lets me re-use cloud-init files).
As for general principles, if one’s packer, containerfile, cloud-init, bash, ansible, or puppet is much more than “install packages, set keys and a small amount of per-environment config, and start services,” it is likely that local packages are not being used effectively.
Thanks, will take a look at that curl thing. We are still using this and it's been working for us for ~15 years (python 2, ported to python 3); it's just an example of taking https://fabfile.org to the extreme, and still not the best way to do it. We only have ~50 servers, so it's not a massive fleet. The convenience of typing `fab <env> <role> <task>` to do things under control is still better than nothing :)
No, I have a text file with a series of steps. I'm not one of those hackers who spends 48 hours writing a script that will save me 5 minutes of work per year.
This is probably an unpopular opinion, but I have a bunch of install scripts that build some programs from source. Even some basic things like vim. The reason is that there's some customization I want to do, and a lot of it is that I often can't trust the package manager: anything from not knowing whether I'll get python3 (or even python) support in vim, to the fact that Ubuntu 20 had fd-find and batcat while Ubuntu 22 has fd and bat. The other side of this is that I don't always have full control over the machine, so I'll just install things into `"${HOME%/}"/.local/{bin,builds,include,lib,share}`. There are rarely "one-size-fits-all" solutions, but if we know what decisions we will want to make, we can leverage that.
This also ends up having a multiplying effect: if I want to use Ansible I can call these scripts directly (which I often find easier than using Ansible itself...). Then usually I can have the system set itself up, or at least get 90% of the way there, and while it doesn't save wall time, it saves my time. I can also add specific options for the distro at hand or for certain constraints (those are easy to probe for quickly, and the probing can live in a common file that other scripts source). The scripting method is helpful because you're applying the same design patterns as in programming: encapsulating functions, and building in flexibility, modularity, and readability (I also highly suggest putting notes in these scripts. Not for others, for you. The more you automate, the quicker you'll forget the awesome tricks you found, but you'll be more likely to remember where to revisit). Because, like you said, "it's very repetitive." When you see that, you know there's a great opportunity to leverage your programming skills.
I purposefully try to make these scripts require minimal tooling. My main dependencies are `curl`, `grep`, and `sed`, which I can generally rely on having on a fresh system. That's really all you need (though you should use these to grab things like `make`).
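As a hedged example of what one of those scripts boils down to (the version is illustrative, and it assumes make, a compiler, and the ncurses headers were already pulled in):

    set -euo pipefail
    PREFIX="${HOME%/}/.local"
    VERSION="9.1.0000"    # check the vim releases page for a current tag

    mkdir -p "$PREFIX"/{bin,builds,include,lib,share}
    cd "$PREFIX/builds"
    curl -fsSL "https://github.com/vim/vim/archive/refs/tags/v${VERSION}.tar.gz" | tar xzf -
    cd "vim-${VERSION}"
    ./configure --prefix="$PREFIX" --enable-python3interp=yes
    make -j"$(nproc)" && make install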
Pro tip: make a template maker. I have one for a GitHub source. While I can normally just source my common file, there is some benefit to having everything self-contained. You won't be able to write the whole script this way, but you definitely can get all the boilerplate out of the way, which is at least 80% of it, and hey, maybe you could get an LLM to do another 10% for you. Though I haven't found one that's really that good at bash. (I suspect that this is primarily due to the metric ton of shit bash scripts, since the average one is beyond terrible.) Idk why bash scripting is a "lost art" but it is not that hard (for the love of god, use functions).
I also suggest writing some systemd and cron skeletons. These can save a lot of time and really help as you find mistakes or want to add extra system hardening. I do find that common implementations don't have the containerization I want (since you're mentioning hardening). You can't always trust the ArchWiki, but it is usually mostly right. An example might be Fail2Ban, where I don't like to put logs in /var/log/fail2ban/ instead of /var/log/fail2ban.log{,.{1..N}}.
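My service skeleton is roughly this (the unit name, user, and paths are placeholders; the hardening directives are stock systemd options worth carrying in every skeleton):

    # write a hardened service skeleton, then load and start it
    printf '%s\n' \
      '[Unit]' \
      'Description=myapp (skeleton)' \
      'After=network-online.target' \
      'Wants=network-online.target' \
      '' \
      '[Service]' \
      'User=myapp' \
      'ExecStart=/opt/myapp/bin/myapp' \
      'Restart=on-failure' \
      'NoNewPrivileges=true' \
      'PrivateTmp=true' \
      'ProtectSystem=strict' \
      'ProtectHome=true' \
      'ProtectKernelTunables=true' \
      '' \
      '[Install]' \
      'WantedBy=multi-user.target' \
      > /etc/systemd/system/myapp.service
    systemctl daemon-reload
    systemctl enable --now myapp.service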
I'd share, but I don't want to dox myself here; I'm happy to share some bash tips or other quick hacks though.
(To be clear, not everything is or should be installed from source. You don't have infinite time, and I don't use Gentoo.)
Sounds like what you have works for you, seems like it would be quite brittle to me though.
> Idk why bash scripting is a "lost art"
I suspect that has to do with the fact that writing or reading it is often an exercise in futility, acting as an efficient rabbit-hole time trap.
Bash scripting is supposed to act as glue, and it does so extremely poorly. It relies on program writers coding their interfaces correctly to maintain determinism (which many have done very poorly). It makes no guarantees, or warnings about such errors, it makes it quite easy to make these mistakes, and the respective maintainers of the various utilities it may depend on have said these bugs are working as intended, nofix.
Take a look at ldd sometime, I'm sure you'll notice the output has three different types of determinism problems which prevent passing the output to any automation (and having it work thereafter correctly), without first patching the software.
That particular bug was reported in 2016, it was partially fixed in 2018 by PaX, but the maintainers wouldn't pull the fix (from what I read), so PaX forked it. The bugs still exist there today afaik.
For glue to work, you have to be able to make certain guarantees and be able to correctly transform the output in various ways easily. Visibility in all processes is extremely important as well and changes made on one version should continue to work on later versions.
Unfortunately, by externalizing most of the basic functionality to various core utilities, you potentially get different behavior every time you update, and as mentioned it fails resiliency tests; instead becoming brittle. There is also very little visibility without having deep knowledge of the ecosystem.
Ironically, what was described in the Monad Manifesto got this better than anything else I've seen since.
Technically you shouldn't, but some people don't consider this a loss of credibility as a whole.
The moment Ubuntu started deceptively poisoning their repo with pre-packaged fixup scripts that would violate security policy by re-enabling snap if it was disabled, and then installing the related snap packages (without prompts) instead, was the moment I stopped using Ubuntu for anything professional or production grade.
The package manager is apt, visibility is through apt. If I wanted a snap package, I'd use snap to install it. If you use apt-get upgrade, you shouldn't have to worry about third-party idiocy violating security policy without disclosure or notice and automatically fixing disabled services, and then installing the snap packages.
We don't use snap for a reason, it is unnecessary and broad attack surface that is unneeded.