
Erase your darlings: immutable infrastructure for mutable systems - grhmc
https://grahamc.com/blog/erase-your-darlings
======
kreetx
Since my move to NixOS I've considered reinstalling my system every morning -
since the entire system config is a few .nix files (+ home-manager). Haven't
quite found the determinism yet. Perhaps I should do it tomorrow morning. :)

edit: If I get three more upvotes I'll do it tomorrow.

edit: two more.

edit: one more.

edit: all filled up, it's going to happen!

edit: Some more context. Although I have a laptop (a 13" 2015 mbp) I don't
bring it to work anymore, but have two desktops, one at home and one for work.
All three run the same Nix configuration, shared through a git repo. Whenever
I discover a program I need, I add it to the config, and after switching
machines I run `sudo nixos-rebuild switch` so the thing I added becomes
available on the current machine, too. All this just works (tm) and I'm fairly
confident the re-install will be painless.
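The shared-repo setup described above can be sketched as a minimal
configuration.nix; the package list and username here are made up for
illustration, not kreetx's actual config:

```nix
# Hypothetical /etc/nixos/configuration.nix shared via git across machines;
# only hardware-configuration.nix differs per machine.
{ config, pkgs, ... }:
{
  imports = [ ./hardware-configuration.nix ];

  # Adding a package here and running `nixos-rebuild switch` makes it
  # available on whichever machine pulled the repo.
  environment.systemPackages = with pkgs; [ git ripgrep firefox ];

  users.users.kreetx.isNormalUser = true;  # placeholder username
}
```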

~~~
delusional
I wonder if this could be used instead of image-based computer management in
enterprise settings. Basically, the team managing the systems has some base
settings, employees have their own overlay (to account for personal preference
in tools or directory structure), and then you just reinstall the entire
machine at the start of every day.

It seems really complex though, so I'm not entirely convinced it's a good
idea.

~~~
GordonS
How would this work from a bandwidth and caching perspective? I'm thinking it
would be problematic if hundreds of workstations need to download gigs of
software at 9am every morning.

~~~
kreetx
At least with Nix you can set up local caches, which would save most of the
bandwidth. But you could also just keep the /nix/store folder, as it's an
immutable store for all the packages.
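A sketch of what pointing workstations at such a local cache could look like
in a NixOS config; the cache URL and signing key are placeholders, and the
exact option names vary across NixOS releases:

```nix
# Hypothetical snippet: prefer a LAN binary cache, fall back to the
# official one. The key string is a placeholder, not a real key.
{
  nix.binaryCaches = [
    "http://cache.example.lan"
    "https://cache.nixos.org"
  ];
  nix.binaryCachePublicKeys = [
    "cache.example.lan-1:<public-key>"
  ];
}
```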

------
jedberg
I would love to get to a point where my laptop can be managed as immutable
infrastructure.

All the big chunks of data are already isolated into redundant partitions, but
it’s the system config that’s tough.

I have a time machine backup, but that’s still not the same as being able to
say “I’m gonna wipe my hard drive and start over today”.

So does anyone have good suggestions on maintaining a MacOS laptop in an
immutable way?

~~~
joshspankit
In the world of commercial software that’s essentially impossible.

Any immutable infrastructure lights up license keys and demo restrictions like
a spotlight.

Take, for example, Apple’s desktop OS: You used to be able to drag-drop an
application to install it. To uninstall you would delete it. Simple. Easy.
Stateless. They talked about it a lot and so did Mac evangelizers. But it also
meant you could walk into an Apple store, connect a USB drive in your sleeve
to a demo computer, drag, drop, and walk away with full versions of very
expensive software. So, the OS fell prey to the same stateful pitfalls as
Windows: places to hide keys, system hooks, etc, etc, etc.

Your best bet these days is likely to manage all software config through a
system management tool, keep your data backed up in Time Machine, and
“reprovision” your laptop every X days or months.

~~~
jedberg
> Your best bet these days is likely to manage all software config through a
> system management tool,

Got any good examples? This is usually where I get stuck and can't find a good
solution that works well with MacOS.

~~~
joshspankit
I can’t remember the particulars right now, but there’s a small team that
remotely manages the macbooks of Google employees. I suspect that their tools
would be a great fit here.

~~~
lstamour
They started using Puppet against Mac dev machines, then switched to internal
tools, I believe. But it was one of the first times it occurred to me that we
should apply DevOps and SRE practices to user machines and user workloads
where possible. The trouble is that the tooling isn't that mature yet; we
can't assume real-time data or always-connected machines; we don't have a
herd, because users only have one machine with them at any time; and remote
state deletion only works when you know you're not deleting critical state,
which in turn requires better tooling and a greater understanding of user
application state persistence than most are willing to invest time in.

This is what makes Chromebooks so easy to maintain though: web apps and
sandboxed Android apps can all easily sync to the cloud and compartmentalize
their data.

~~~
sneak
I would pay real money for a Linux distro that works as well as ChromeOS or
macOS but doesn't have all the phone-home behavior endemic to both (yes, Macs
phone home like mad even with iCloud and all the analytics off).

------
CaveTech
This is basically the setup we've been using with vagrant (and in production)
for years.

Vagrant launches a bare-bones VM. Local files mount on /vagrant, and
chef-zero uses that local mount to provision the necessary systems and
configuration.

On every `vagrant reload`, this process repeats. Chef's idempotent nature
means that any manual drift is automatically repaired.

In this setup there's no difference between `vagrant reload` and `vagrant
destroy && vagrant up`.
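A hypothetical Vagrantfile matching the setup described; the box name and
cookbook layout are assumptions, not CaveTech's actual config:

```ruby
# Bare-bones box, project files auto-mounted on /vagrant,
# chef-zero provisioning from cookbooks inside that mount.
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/bionic64"      # assumed base box

  config.vm.provision "chef_zero" do |chef|
    chef.cookbooks_path = "cookbooks"    # relative to the /vagrant mount
    chef.nodes_path     = "nodes"
    chef.add_recipe "base"               # hypothetical site cookbook
  end
end
```

Because the provisioner re-runs on every `vagrant reload`, converged state is
the same whether the VM was reloaded or rebuilt from scratch.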

It's possible to package this so that it's simultaneously "Infrastructure as
Code" while also satisfying "Immutable Infrastructure". Our stab at this is
now 6 years old and we're surely not the first to do it.

------
solatic
As someone who loves NixOS and runs it on my daily-driver laptop -

I can't see running NixOS in production.

We're running 100% Kubernetes, including for databases and other stateful
workloads. Kubernetes implements the author's pattern just fine - any OS state
is defined within the container image, and any application state is defined
within a Persistent Volume. Unfortunately, NixOS doesn't have a good story yet
for service management (Disnix isn't nearly as featureful as the Kubernetes
scheduler and doesn't see nearly the same activity / community buy-in as Nix /
NixOS) let alone ensuring that networked storage is re-attached to the
particular node that runs the service in the same reliable manner that
Kubernetes offers.

IMO the way forward for Nix / NixOS in production is to:

a) develop a container runtime that would allow a Kubernetes node to run pods
that specify Nix expressions directly in the image field, instead of the
current workaround of creating Docker containers from Nix expressions and
dealing with the overhead of external registries

b) improve the experience of running Kubernetes on NixOS such that ease of
installation more closely approaches that offered by managed Kubernetes
providers.

~~~
ris
> develop a container runtime that would allow a Kubernetes node to run pods
> that specify Nix expressions directly in the image field

Do you really want to give kubernetes the added responsibility of building
your images?

~~~
solatic
Kubernetes isn't building the image, really, it's just passing the Nix
expression directly to the container runtime that Nix would provide. This is
more or less how Nix works already, as the Nix tooling takes Nix expressions
and builds derivations which are stored in the Nix store.
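The "current workaround" mentioned above can be sketched with nixpkgs'
dockerTools helper; the image contents here are illustrative:

```nix
# Build an OCI/Docker image from a Nix expression. The resulting tarball
# still has to be loaded into a runtime or pushed to an external registry,
# which is the overhead the parent comment wants to eliminate.
{ pkgs ? import <nixpkgs> {} }:

pkgs.dockerTools.buildImage {
  name = "hello";
  config.Cmd = [ "${pkgs.hello}/bin/hello" ];
}
```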

------
sargun
I feel pretty strongly against the idea of immutable infrastructure when your
"infrastructure" is shared systems running other people's software, but this
article isn't about that.

The beef I have with this article is the idea of:

> New computer smell

> Getting a new computer is this moment of cleanliness. The keycaps don’t have
> oils on them, the screen is perfect, and the hard drive is fresh and
> unspoiled — for about an hour or so.

In my observation (and in datasets that I have access to), computer systems
tend to follow the "infant mortality" curve. This means that if they run for a
little bit, they're likely to run for a long time (and in addition, if you
have many of them, they tend to die around the same time). My conjecture is
that many computer systems have initialization routines which are not as
thoroughly tested as the normal operating state of the system. Because of
this, we tend to run into more issues in "immutable" systems than we
otherwise would in "mutable" systems.

------
ScottBurson
Horror story: [https://thedailywtf.com/articles/Designed-For-Reliability](https://thedailywtf.com/articles/Designed-For-Reliability)

------
outworlder
So this is mostly about Nix, but I've used (in production!) CoreOS, which
implemented some of the same concepts.

You couldn't just update anything easily. Well, anything is possible, but
CoreOS made it very hard to do things the wrong way, with its read-only
system partitions.

But it made upgrades really easy. And you had a second, backup system
partition to boot from if an update messed things up.

We had to move back to a 'standard' Linux distribution, and now all those old
habits are creeping back. It takes a lot of discipline (and enforcement) to
avoid applying 'fixes' that eventually get forgotten.

~~~
yjftsjthsd-h
> We had to move back to a 'standard' Linux distribution and now all those old
> habits are creeping up

That sounds very relevant: Why did you have to go back?

~~~
ecnahc515
Most likely because Container Linux (previously CoreOS) is nearing
end-of-life. Your options are to move to Fedora CoreOS or Flatcar Linux, so
it's likely they decided to go with a more vanilla distro instead of migrating
to one of the more similar options.

~~~
yjftsjthsd-h
That's actually a fairly encouraging failure mode, given that it means there's
nothing inherently wrong with the approach, just that particular
implementation.

------
ghuntley
Here's my dotfiles for some of my nixos servers and home computers.

[https://github.com/ghuntley/dotfiles-nixos](https://github.com/ghuntley/dotfiles-nixos)

Steal away and enjoy.

------
downerending
For years I've been doing this in a lightweight way with a shell script (and
a cache of auxiliary data files). I suspect many others do this as well.

The basic idea is to image the machine with some lean/vanilla image, then run
the script to put the system into the desired state. Kind of like Puppet,
except far easier to understand and change.

Properly done, one can reimage pretty much at will, which is nice if there are
a lot of people making local 'root' changes on boxes.

edit: typo
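A toy sketch of the idempotent-step style such a setup script depends on;
`ensure_line` is an invented helper, not something from the comment:

```shell
#!/bin/sh
# ensure_line FILE LINE: append LINE to FILE only if it is not already
# present, so re-running the whole setup script converges on the desired
# state instead of piling up duplicates after every reimage.
ensure_line() {
  grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

ensure_line demo.conf "PermitRootLogin no"
ensure_line demo.conf "PermitRootLogin no"   # second run is a no-op
```

Every step in the script is written this way, so the script can run against a
fresh image or an already-configured box with the same result.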

~~~
eeZah7Ux
> The basic idea is to image with some lean/vanilla image, then run the script
> to put the system into the desired state.

Spot on. Much better than most CM tools.

The Nix setup is pretty hackish and does not track files changed by running
applications.

What we really need is automatic version history on whole filesystems.

~~~
pas
ZFS (and btrfs) snapshot diff does that easily. Also any overlay filesystem.
But after a point combing through the diff becomes an insurmountable task. Nix
forces the user to declare what goes where up front. (And yes, it's not
exactly a user friendly syntax/style. systemd has support for
WorkingDirectory, RootDirectory, RootImage, RuntimeDirectory, StateDirectory,
CacheDirectory, LogsDirectory and ConfigurationDirectory, but so far most unit
files don't take advantage of that.)
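For reference, the systemd directives listed above would appear in a
hypothetical unit file like this (the comments note systemd's default paths;
the service name is made up):

```ini
# Hypothetical myapp.service fragment: systemd creates and owns each of
# these directories for the service, so state, cache, logs and config all
# have declared homes instead of being scattered across the filesystem.
[Service]
ExecStart=/usr/bin/myapp
# /var/lib/myapp
StateDirectory=myapp
# /var/cache/myapp
CacheDirectory=myapp
# /var/log/myapp
LogsDirectory=myapp
# /etc/myapp
ConfigurationDirectory=myapp
# /run/myapp
RuntimeDirectory=myapp
```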

~~~
downerending
This is nice to have, but it doesn't really distinguish between the "logical"
differences and the "physical" differences.

As a trivial example, one step in one's setup script might be "install package
X". That can end up creating/modifying a lot of files, but many of those
differences aren't necessarily meaningful or something one would want to carry
into future reimages. And as things progress over time, those diffs might even
be "wrong".

I liken it to whittling vs CNC. Either can be the right way to go, but usually
at scale we end up doing better with CNC. And the best CNC program is a
compact one.

~~~
pas
That's the advantage of Nix's top-down approach. You declare everything at as
high a level as you want/can/wish. You specify the package plus your
customization for it.

It's not that much different than a one-liner that installs the package plus
uses echo to write your config file. The magic is the "forced" reproducibility
by default.

~~~
downerending
Conceptually, I think nix is great. Most of us can appreciate the value of a
perfectly reproducible system, etc.

That said, unfortunately nix is a practical option for about zero percent of
commercial shops. Sigh.

I did actually work at a place that tried it once. Summarizing, the
introduction failed through some combination of politics and the perception
that nix would introduce a lot of complexity vs a standard vendor distro.

------
infogulch
As a tiny effort to tame the immense mutability of Windows, at least for the
question of "what programs are installed at which version?", I'm using a
combination of _Chocolatey_ (an unofficial package manager for windows,
[https://chocolatey.org/](https://chocolatey.org/)) and _Git_.

I maintain a 'packages.config' XML file in a git repo that lists all the
packages I've decided I want installed, with their versions. I intentionally
don't list dependencies of those packages in this file, so I can remove a
package later without having to trawl through a huge list of potentially
unnecessary packages once I've changed what I want. I don't (typically) manage
the version numbers listed in packages.config by hand; I have a few scripts to
help me do that:

install.ps1 has two uses: 1. install on a new system by cloning (or
downloading the zip of) the repo and running the script, which will download
and install the exact versions of the software that is declared in
packages.config. 2. If I want a new package I can edit the packages.config to
add the line then run install.ps1 (this second capability could use some UX
work).

update.ps1 lists all of the packages that are currently out of date, gives you
the option to update all of them, and then (regardless of which option you
chose) rewrites packages.config with the currently installed versions. This
allows you to use git to identify and manage the version differences between
multiple windows installations and also upgrade all of them easily.

This doesn't configure any of the applications, but at least they're installed
and they're the same version when you move from one computer to another. You
can also quickly identify version differences between installations which
makes debugging application version problems much easier. And you also don't
forget what you have installed.
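A packages.config in this scheme might look like the following; the package
ids and versions are examples, not infogulch's actual list:

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Declarative list of desired packages; install.ps1 installs exactly
     these versions, update.ps1 rewrites the versions after upgrades. -->
<packages>
  <package id="git" version="2.23.0" />
  <package id="7zip" version="19.0" />
  <package id="vscode" version="1.39.2" />
</packages>
```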

------
greymeister
Haha, nuke and pave as a first resort rather than a last resort.

~~~
rconti
Over the years I went from making fun of Windows Admins rebooting to fix
everything, to, say, 12 years ago rebooting my unix boxes _any_ time I made a
change.

Virtually by definition if I was changing something, it wasn't in production
at the moment, and I just learned my life was so much easier if I made sure my
changes were really committed to state _at the moment I made the change_
rather than learning it the hard way at 2am 6 months down the road, and
desperately trying to remember what I had "fixed" and why.

Granted, these were still pets, but at least they were well-trained pets.

------
imhoguy
How would that work with suspend/hibernation, when one doesn't reboot for
weeks?

I rarely shut down my Linux desktop. I also keep desktop VMs with project
context suspended, to just reopen them the next day or in a month and be
right where I stopped work.

~~~
pas
Why wouldn't it work with suspend? NixOS works because it's brutally up-front
about its declarativeness. Since you have to specify in your config everything
that isn't what the defaults give you, you'll have all your own changes in
your config file (or files).

It's great because it separates /etc into vendor-provided defaults and your
own customizations. It's not so great because it's not all that automatic; you
need to script it.

systemd/Lennart also explored this topic a bit:
[https://www.youtube.com/watch?v=pL0AMLiwPj8](https://www.youtube.com/watch?v=pL0AMLiwPj8)

[https://www.freedesktop.org/software/systemd/man/systemd-vol...](https://www.freedesktop.org/software/systemd/man/systemd-volatile-root.service.html)

------
zeveb
Sounds pretty neat. I wonder if the same is achievable with Guix, which is
based on Nix. Now I am gonna have to spend some of my precious free time
noodling around the Guix docs!

------
nojvek
Wouldn’t using k8s, which needs Docker images built from Dockerfiles, achieve
this?

------
jackcviers3
Git. Vagrantfile. LastPass. G Suite. I can work anywhere if I've got a VM.

------
ggm
wipes the bits they don't consider persisting, but designs a persistence
model in ZFS to say "ok: next time, keep this bit"

takes discipline, but interesting.

------
ghuntley
Check out the NixOS for existing sysadmins workshop over at
[https://github.com/ghuntley/workshops/tree/master/nixos-work...](https://github.com/ghuntley/workshops/tree/master/nixos-workshop)

If you want a TL;DR overview of NixOS then start here:
[https://github.com/ghuntley/workshops/tree/master/nixos-work...](https://github.com/ghuntley/workshops/tree/master/nixos-workshop/modules/01-introduction-to-nixos)

~~~
ghuntley
For an advanced example of overriding nixpkgs and using pinning/override
layers (i.e. some mandate or reason to run a super old version of grpc):
[https://github.com/digital-asset/daml/blob/master/nix/nixpkg...](https://github.com/digital-asset/daml/blob/master/nix/nixpkgs.nix#L10)

------
ashishb
Why not use Docker containers, which do this by default for the stateless
portion of your infrastructure?

~~~
ghuntley
Docker containers are not reproducible, and building the containers takes
bloody forever with poor cacheability.

Look at almost every Dockerfile:

    FROM ubuntu

    RUN apt-get update  # (or apt-get install) BOOM — no longer possible
                        # to reproduce the build

~~~
GordonS
Sure, if you always rebuild your container images - but surely a more typical
approach is to build them and put them in a container registry?

Of course, they would be rebuilt when base layers change, but if you really
want exactly the same image, you reference it by a digest, which will give you
a point-in-time image.
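Pinning by digest might look like this in a Dockerfile; the digest shown is a
placeholder, not a real one:

```dockerfile
# A tag like "ubuntu:18.04" is mutable and can point at different images
# over time; a digest is an immutable point-in-time reference.
FROM ubuntu@sha256:<digest-of-the-exact-image>
```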

------
crimsonalucard
I love immutability; it makes things ten times easier. But this isn't really
immutability, is it? It looks like an infrastructure refresh on deploy. If you
are refreshing something, by definition it's probably not immutable. I guess
the author means getting rid of the mutation of state.

I'm just waiting for the day they can make the database immutable. It'll
probably look something like git.

~~~
Nuzzerino
> I'm just waiting for the day they can make the database immutable.

Wouldn't event stores with CQRS effectively do this?

~~~
crimsonalucard
No. You still need a regular database due to read performance. If I want to
read the "latest" data, a search on an event store for a piece of data that
wasn't changed in the past year would take an inordinate amount of time.

Additionally what if I want to look up a "row" of data that was modified 5
times on 5 different "columns" at various dates across 3 years? That's an
aggregation job across 3 years of event data.

For event sourcing you still need to turn the "event" into an actual operation
and record that database operation in a classic database.

Event stores just make the "event" the source of truth. It doesn't get rid of
regular databases.
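A toy sketch of that distinction: the append-only log is the source of truth,
but answering "latest value" reads requires a separate projection, which is
essentially what the classic database holds for you (the field names here are
invented):

```python
# Append-only event log: never mutated, only extended.
events = [
    {"entity": "user-1", "field": "email", "value": "old@example.com"},
    {"entity": "user-1", "field": "name",  "value": "Ada"},
    {"entity": "user-1", "field": "email", "value": "new@example.com"},
]

def project(events):
    """Fold the full log into a 'latest state' read model.

    Without this materialized view, every read of the latest value
    would have to scan arbitrarily far back through the event history.
    """
    state = {}
    for e in events:
        state.setdefault(e["entity"], {})[e["field"]] = e["value"]
    return state

print(project(events)["user-1"]["email"])  # new@example.com
```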

Traditionally, other services read a single entity database/service and use
that as a source of truth. Now a single button click records data across
several databases and several services. It's not necessarily a better
architecture, just different/buzz-wordy, and definitely more complicated.

~~~
waheoo
Look up Rich Hickey's "database as a value" talk on Datomic, which
essentially does this.

It bottlenecks writes through an ACID layer, and makes reads against any
single database connection (the value) immutable.

~~~
Scarbutt
Querying history(values of attributes in time) is very very slow in datomic.
It is meant for auditing/troubleshooting, not to be used for your application
domain.

