What I prefer to do is use Terraform to create immutable infrastructure from code. CoreOS and most Linux variants can be configured at boot time (cloud-config, Ignition, etc) to start and run a certain workload. Ideally, all of your workloads would be containerised, so there's no need for configuration drift, or for any management software to be running on the box. If you need to update something, create the next version of your immutable machine and replace the existing ones.
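The pattern can be sketched in Terraform. Everything here is illustrative (hypothetical names, AWS assumed, networking trimmed), but it shows the shape: a versioned launch template whose user-data starts a container at boot, behind an autoscaling group that rolls instances rather than mutating them.

```hcl
# Sketch only: names, AMI variable, and registry are hypothetical,
# and required networking arguments are omitted for brevity.
resource "aws_launch_template" "app" {
  name_prefix   = "app-v42-"   # bump the version, never edit in place
  image_id      = var.base_ami # prebuilt, immutable image
  instance_type = "t3.small"

  # cloud-init runs once at first boot and starts the containerised workload
  user_data = base64encode(<<-EOF
    #cloud-config
    runcmd:
      - docker run -d --restart=always registry.example.com/app:v42
  EOF
  )
}

resource "aws_autoscaling_group" "app" {
  desired_capacity = 3
  min_size         = 3
  max_size         = 6
  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }
  # To update, publish a new launch template version; new instances
  # replace the old ones, which are terminated, never patched.
}
```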
(Context: my team manages dozens of clusters, each with a score of services across thousands of physical hosts. Every minute of every day, multiple things are being scaled up or down, tuned, rearranged to deal with hardware faults or upgrades, new features rolled out, etc. Far from being immutable, this infrastructure is remarkably fluid because that's the only way to run things at such scale.)
Beware of Chesterton's Fence. Just because you haven't learned the reasons for something doesn't mean it's wrong, and the new shiny often re-introduces problems that were already solved (along with some of its own) because of that attitude.
My understanding of immutable infrastructure is the same as immutable data structures: once you create something, you don't mess with it. If you need a different something, you create a new one and destroy the old one.
That doesn't mean that the whole picture isn't changing all the time. Indeed, I think immutability makes systems overall more fluid, because it's easier to reason about changes. Mutability adds a lot of complexity, and when mutable things interact, the number of corner cases grows very quickly. In those circumstances, people can easily learn to fear change, which drastically reduces fluidity.
It seemed like a fine idea to me. The best way to be sure that everything can be rebuilt is to regularly rebuild everything. It also solves some security problems, simplifies maintenance, and allows people to be braver around updates.
And just to be clear, I'm very willing to believe that your particular legacy setup isn't a good match for cattle-not-pets practices. But I think that's different than saying it's impossible for anybody to bring an immutable approach to things like storage.
To give a really silly example, adding a node to a cluster is a configuration change. It wouldn't make sense to destroy the cluster and recreate it to add a new node. There are lots of examples like this where if you took the idea of immutable infrastructure to the extreme it would result in really large wastes of effort.
On the other hand, while my suggestion of destroying and recreating the cluster to add a node sounds ridiculous, I'm sure there are circumstances in which it's not only understandable but necessary, due to some aspect of the system.

If your biggest objection to an idea is that you can make up a silly thing that sounds like it might be related, I'm not understanding why we need to have this discussion. I'd like to focus on real issues, thanks.
I also think it's a cool idea to destroy the entire cluster just to add a node, and it sounds ridiculous but also like there's some circumstances where it makes perfect sense.
If your systems are immutable, they can run read-only. Tripwire, the integrity checker, popularized this in the nineties: you could run it off a CD-ROM. Today, immutable infrastructure is VMs/containers that can be run off a SAN or a read-only pass-through file system. It means snapshots are completely and immediately replicable. When you need to deploy, you take a base image/container, install code onto it, run tests to ensure it isn't broken, and replicate it as many times as you need, in a read-only state. This approach also has an interesting property: because the system is read-only (as in, exported to or mounted by the instance read-only), it is extremely difficult to do nasty things to it after a break-in. If it is difficult to create files, it is difficult to stage exploits.
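For the container case, this is roughly what a read-only deployment looks like in a compose file. A sketch, assuming hypothetical image and service names and an app that writes only to stdout and /tmp:

```yaml
# Illustrative compose fragment; image name and paths are made up.
services:
  api:
    image: registry.example.com/api:v42
    read_only: true           # root filesystem mounted read-only
    tmpfs:
      - /tmp                  # scratch space that never survives the instance
    volumes:
      - ./config:/etc/api:ro  # configuration mounted read-only too
```

With the root filesystem read-only, an attacker who gets in has nowhere persistent to drop files, which is the exploit-staging point made above.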
That's the only kind of infrastructure where configuration management on the instances themselves is not needed.
As for the conflicts, I have to say I loathe the way the more dynamic part of configuration works. It might be the most ill-conceived and poorly implemented system I've seen in 30+ years of working in the industry. Granted, it does basically work, but at the cost of wasting thousands of engineers' time every day. The conflicts occur because (a) it abuses source control as its underlying mechanism and (b) it generates the actual configs (what gets shipped to the affected machines) from the user-provided versions in a non-deterministic way, which causes spurious differences. All of its goals - auditability, validation, canaries, caching, etc. - could be achieved without such aggravation if the initial design hadn't been so mind-bogglingly stupid.
But I digress. Sorry not sorry. ;) To answer your question, my personal solution is to take advantage of the fact that I'm on the US east coast and commit most of my changes before everybody else gets active.
State is always the enemy in technology.
I can't even imagine managing hundreds of servers whose state is unpredictable at any moment and they can't be terminated and replaced with a fresh instance for fear of losing something.
I work in data storage. Am I the enemy, then? ;)
> can't even imagine managing hundreds of servers whose state is unpredictable at any moment
Be careful not to conflate immutability with predictability. The state of these servers is predictable. All of the information necessary to reconstruct them is on a single continuous timeline in source control. But that doesn't mean they're immutable because the head of that timeline is moving very rapidly.
> can't be terminated and replaced with a fresh instance for fear of losing something.
No, there's (almost) no danger of losing any data because everything's erasure-coded at a level of redundancy that most people find surprising until they learn the reasons (e.g. large-scale electrical outages). But there's definitely a danger of losing availability. You can't just cold-restart a whole service that's running on thousands of hosts and being used continuously by even more thousands without a lot of screaming. Rolling changes are an absolute requirement. Some take minutes. Some take hours. Some take days. Many of these services have run continuously for years, barely resembling the code or config they had when they first started, and their users wouldn't have it any other way. It might be hard to imagine, but it's an every-day reality for my team.
You’re the prison guard.
I don't trust predictability. Drift is always a nightmare. Nothing is ever as predictable as you would like it to be.
>You can't just cold-restart a whole service that's running on thousands of hosts and being used continuously by even more thousands without a lot of screaming.
If it's architected well you can :)
Except that state and its manipulation is usually the primary value in technology.
> I can't even imagine managing hundreds of servers whose state is unpredictable at any moment and they can't be terminated and replaced with a fresh instance for fear of losing something.
Yes, that sounds awful. That's why we have backups and, if necessary, redundancy and high availability.
Exactly and thats why you put state in data stores and keep your servers immutable.
Are Ansible/Puppet/Chef any good for managing the hardware you mentioned?
Ansible is used in networking, with many vendors having official modules:
There are even frameworks if you want to write things in 'raw' Python as well.
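For a sense of what that looks like, here's a minimal playbook sketch. It assumes the `cisco.ios` collection is installed and an inventory group called "switches" exists; both are illustrative:

```yaml
# Hypothetical inventory group and NTP address; cisco.ios.ios_config
# is a real vendor module for pushing config lines.
- hosts: switches
  gather_facts: false
  connection: network_cli
  tasks:
    - name: Ensure NTP server is configured
      cisco.ios.ios_config:
        lines:
          - ntp server 10.0.0.1
```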
Yes. Well, OK, maybe not good but better than ad hoc.
Avoid managing the underlying OS as much as possible. Use vanilla or prebuilt images to deploy your containers on: CoreOS, Amazon's new Bottlerocket (maybe). Or use a service like Fargate when possible. All configuration should be declarative to avoid errors.
If you need to build images, tools like Packer are great. AWS has a recommended "golden AMI pipeline" pattern and a new Image Builder service if you can't use community images.
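A minimal Packer (HCL2) sketch of a golden-image build, assuming AWS and illustrative names; a real pipeline would add validation and tagging:

```hcl
# Sketch only: region, base AMI variable, and the provisioner step
# are placeholders.
source "amazon-ebs" "golden" {
  region        = "us-east-1"
  instance_type = "t3.small"
  source_ami    = var.base_ami   # vendor or community base image
  ssh_username  = "ec2-user"
  ami_name      = "golden-${formatdate("YYYYMMDDhhmmss", timestamp())}"
}

build {
  sources = ["source.amazon-ebs.golden"]

  provisioner "shell" {
    # Bake updates into the image instead of patching live hosts.
    inline = ["sudo yum -y update"]
  }
}
```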
I'm speaking imperatively, but read these as my own directives. I work for a company that consults and actively helps Fortune 500s migrate to the cloud. So some of what I'm saying is not possible, or harder, on-prem, and I recognize that.
If I had to, I still like Chef, with Puppet a second favorite, mostly because of familiarity. Ansible can be used with either of these. And tools like Serverspec to validate your images. I don't really use any of this anymore, though.
You always have a configuration management system.
Terraform has some limitations. For example, one apply cannot both deploy hosts and the containers that run on them. And there is no usable support for rolling deployments [2, 3]. So I've ended up with a 4-stage deployment: host-set1, containers on host-set1, host-set2, containers on host-set2.
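One way to drive that staging is with targeted applies; module names here are hypothetical, and this is a sketch of the workflow rather than a recommendation:

```sh
# Each stage is a separate targeted apply, because a single plan
# cannot see hosts that do not exist yet.
terraform apply -target=module.hosts_set1
terraform apply -target=module.containers_set1
terraform apply -target=module.hosts_set2
terraform apply -target=module.containers_set2
terraform apply   # final full apply to confirm no drift remains
```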
I also use Terraform to deploy the servers to my laptop during development. Docker for Mac works well.
Someday, Kubernetes will get some usable documentation on how to do normal things. Then I will use it for deploying containers, load balancers, and persistent volumes. For now, it's too big of a complexity jump over plain Docker.
"Amazon/Google/Azure takes care of that" is not the answer, unless your comment is predicated on a world where compute can only be rented from big corps... and their methods of managing underlying infrastructure are sacred secrets for which we are to unworthy to comprehend.
For example, I deployed an app backend to DigitalOcean with a load balancer, 2x replicated API server, 2x replicated worker process, managed database, file storage, and static website. Terraform is tracking 114 resources for each deployment.
It seems that automatic removal of resources you stop declaring is poorly supported by Ansible, Chef, Puppet, Salt, etc. One can explicitly add items to a "to_remove" list, but this is error-prone.
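For Ansible, the explicit-removal pattern looks something like this sketch (package names and the list variable are illustrative). The error-prone part is that nothing enforces the list stays in sync with what was once installed:

```yaml
# Illustrative playbook: removal must be declared by hand, the
# inverse of Terraform's state-driven deletion.
- hosts: all
  vars:
    to_remove:
      - oldtool
      - legacy-agent
  tasks:
    - name: Remove packages that are no longer wanted
      ansible.builtin.package:
        name: "{{ to_remove }}"
        state: absent
```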
Terraform has many limitations and problems, but I have found no better tool.
(although that's of course not strictly "old stuff")
Rather than doing that I take a light touch approach with ansible that will suffice until we can dockerize (which would require the same work, but then it’s a dev project vs now when it’s just a devops thing)
Can you mount all your volumes read-only and run your entire stack? If you cannot, then you do not have immutable infrastructure. You simply happen to agree that no one writes anything useful, which will absolutely fail with time, because someone, somewhere is going to start storing state on a "stateless" system, giving you a cow #378 called 'Betsy'.
1. You deploy a completely fresh instance/container, instead of in-place updates
2. You don't actively push changes on a running instance/container
Of course you might have stuff written to disk, such as logs, temp files, etc. But it should be non-essential data, and potentially pushed to a central place in near real-time.
There's no point in trying to manage a database or similar resources this way.
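The read-only test above is easy to run against a single container; the image name here is made up. Non-essential writes go to a throwaway tmpfs, and logs go to stdout, where the logging driver can ship them centrally:

```sh
# If the service can't survive this, it is quietly stateful.
docker run --read-only --tmpfs /tmp registry.example.com/api:v42
```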