I'm curious why people use configuration management software in 2020. All of that seems like the old way of approaching problems to me.

What I prefer to do is use Terraform to create immutable infrastructure from code. CoreOS and most Linux variants can be configured at boot time (cloud-config, Ignition, etc) to start and run a certain workload. Ideally, all of your workloads would be containerised, so there's no need for configuration drift, or for any management software to be running on the box. If you need to update something, create the next version of your immutable machine and replace the existing ones.

"Immutable infrastructure" what a laugh. In a large deployment, configuration somewhere is always changing - preferably without restarting tasks because they're constantly loaded. We have (most) configuration under source control, and during the west-coast work day it is practically impossible to commit a change without hitting conflicts and having to rebase. Then there are machines not running production workloads, such as development machines or employees' laptops, which still need to have their configuration managed. Are you going to "immutable infrastructure" everyone's laptops?

(Context: my team manages dozens of clusters, each with a score of services across thousands of physical hosts. Every minute of every day, multiple things are being scaled up or down, tuned, rearranged to deal with hardware faults or upgrades, new features rolled out, etc. Far from being immutable, this infrastructure is remarkably fluid because that's the only way to run things at such scale.)

Beware of Chesterton's Fence. Just because you haven't learned the reasons for something doesn't mean it's wrong, and the new shiny often re-introduces problems that were already solved (along with some of its own) because of that attitude.

Are you sure you two are talking about the same thing?

My understanding of immutable infrastructure is the same as immutable data structures: once you create something, you don't mess with it. If you need a different something, you create a new one and destroy the old one.

That doesn't mean that the whole picture isn't changing all the time. Indeed, I think immutability makes systems overall more fluid, because it's easier to reason about changes. Mutability adds a lot of complexity, and when mutable things interact, the number of corner cases grows very quickly. In those circumstances, people can easily learn to fear change, which drastically reduces fluidity.
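To make the data-structure analogy concrete, here is a minimal Python sketch. The `Server` class, its fields, and the image tags are hypothetical, chosen only for illustration: a change produces a new value and the old one is never edited in place.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Server:
    """Hypothetical description of a server; frozen means it cannot be edited."""
    name: str
    image: str  # e.g. an AMI ID or container image tag


def roll_forward(old: Server, new_image: str) -> Server:
    """Build a fresh Server from new_image instead of mutating the old one."""
    return replace(old, image=new_image)


old = Server(name="web-1", image="app:v1")
new = roll_forward(old, "app:v2")
# `old` is untouched and can be destroyed once `new` is serving traffic.
```

Assigning to `old.image` raises `dataclasses.FrozenInstanceError`, which is the in-memory equivalent of not being allowed to tinker with a running box.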

Yup. We do this. When our servers need a change, we change the AMI for example, and then re-deployment just replaces everything. Most servers survive a day, or a few hours.

Makes sense to me. I was talking with a group of CTOs a couple years back. One of them mentioned that they had things set up so that any machine more than 30 days old was automatically murdered, and others chimed in with similar takes.

It seemed like a fine idea to me. The best way to be sure that everything can be rebuilt is to regularly rebuild everything. It also solves some security problems, simplifies maintenance, and allows people to be braver around updates.
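As a rough illustration of the age-limit policy described above, here is a Python sketch. The fleet dictionary, host names, and the `expired` helper are assumptions made up for the example; a real setup would read launch times from the cloud provider's API.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=30)  # the 30-day limit mentioned above


def expired(instances, now, max_age=MAX_AGE):
    """Return the names of instances whose age exceeds max_age."""
    return [name for name, launched in instances.items() if now - launched > max_age]


# Hypothetical fleet: instance name -> launch time.
now = datetime(2020, 6, 1, tzinfo=timezone.utc)
fleet = {
    "web-1": now - timedelta(days=45),
    "web-2": now - timedelta(days=3),
}
to_replace = expired(fleet, now)  # only web-1 is past the limit
```

Whatever consumes `to_replace` would then rebuild those machines from the current image, which is what keeps the rebuild path exercised.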

Configuration Management is still present in this process, it's just moved from the live system to the image build step.

Probably the most insightful comment in this entire thread. Thank you. In many cases, an "image" is just a snapshot of what configuration management (perhaps not called such but still) gives you. As with compiled programming languages, though, doing it at build time makes future change significantly slower and more expensive. Supposedly this is for the sake of consistency and reproducibility, but since those are achievable by other means it's a false tradeoff. In real deployments, this just turns configuration drift into container sprawl.

Is this still as painful as it used to be? AMI building took ages, so iteration ("deployment") speed is really awful.

Personally that's why I avoid Packer (or other AMI builders) and keep very tightly focussed machines set up by the cloud-init type process.

So, once you create a multi-thousand-node storage cluster, if you need to change some configuration, replace the whole thing? Even if you replace onto the same machines - because that's where the data is - that's an unacceptable loss of availability. Maybe that works for a "stateless" service, but for those who actually solve persistence instead of passing the buck it just won't fly.

Could you say more about why your particular service can't tolerate rolling replacement of nodes? You're going to have to rebuild nodes eventually, so it seems to me that you might as well get good at it.

And just to be clear, I'm very willing to believe that your particular legacy setup isn't a good match for cattle-not-pets practices. But I think that's different than saying it's impossible for anybody to bring an immutable approach to things like storage.

The person you're replying to didn't say "replace every node," they said "replace the whole thing."

To give a really silly example, adding a node to a cluster is a configuration change. It wouldn't make sense to destroy the cluster and recreate it to add a new node. There are lots of examples like this where if you took the idea of immutable infrastructure to the extreme it would result in really large wastes of effort.

Could you please point me at prominent advocates of immutable infrastructure who propose destroying whole clusters to add a node? Because from what I've seen, that's a total misunderstanding.

As I said, it's a silly example just to highlight an extreme. In between there are more fluid examples. I don't think it's that ridiculous to propose destroying and recreating the cluster in its entirety when you're deploying a new node image. However as you say I'm not sure anyone would advocate that except in specific circumstances.

On the other hand, while my suggestion of doing it to add a node sounds ridiculous I'm sure there are circumstances in which it's not only understandable but necessary, due to some aspect of the system.

I'm saying it's not even an extreme, in that I don't believe what people are calling "immutable infrastructure" includes that.

If your biggest objection to an idea is that you can make up a silly thing that sounds like it might be related, I'm not understanding why we need to have this discussion. I'd like to focus on real issues, thanks.

I'm not objecting categorically to anything. I think that immutable infrastructure is a spectrum, and depending on your needs you may have just about everything immutably configured, or almost nothing. I just don't think it's so black and white as "you should always use immutable infrastructure."

I also think it's a cool idea to destroy the entire cluster just to add a node, and it sounds ridiculous but also like there's some circumstances where it makes perfect sense.

Again, do you have a citation for the notion that it's a spectrum? The original post that coined the term doesn't talk about it that way, and neither do the other resources I found in a quick search. As I see it, it's binary: when you need to change something on a server, you either tinker with the existing server or you replace the server with a fresh-built one that conforms to the new desire.

Wow, look at those goalposts go! If you make enough exceptions to allow incremental change, then "immutable" gets watered down to total meaninglessness. That's not an interesting conversation. This conversation is about configuration management, which is still needed in a "weakly immutable" world.

Again, could you please point me at notable advocates of immutable infrastructure proposing the approach you take such exception to? And note that I'm not proposing any exceptions.

Presumably you replace the parts that changed and keep the parts that didn't.

Interesting to say you've "solve[d] persistence" when you seem to be limited by it here. Is there a particular reason your services can't be architected in a less stateful, more 12-factor way?

Kick the persistence can down the road some more? Sure, why not? But sooner or later, somebody has to write something to disk (or flash or whatever that doesn't disappear when the power's off). A system that stores data is inherently stateful. Yes, you can restart services that provide access or auxiliary services (e.g. repair) but the entire purpose of the service as a whole is to retain state. It's the foundation on top of which all the slackers get to be stateless themselves.

The vast majority of people simply redefine the terms to fit whatever they are selling.

If your systems are immutable, they can run read-only. In the nineties, Tripwire, the integrity checker, popularized this: you could run it off a CD-ROM. Today, immutable infrastructure means VMs/containers that can be run off a SAN or a pass-through file system that is read-only. It means snapshots are completely and immediately replicable. When you need to deploy, you take a base image/container, install code onto it, run tests to ensure that it is not broken, and replicate it as many times as you need, in a read-only state. This approach also has an interesting property: because the system is read-only (as in exported to the instance read-only, or mounted by the instance read-only), it is extremely difficult to do nasty things to it after a break-in. If it is difficult to create files, it is difficult to stage exploits.

That's the only kind of infrastructure where configuration management on the instances themselves is not needed.

What sort of stack do you all use then to manage these clusters? Have you found any solutions to your conflicts?

The hosts are managed via chef, the jobs/tasks running on those hosts by something roughly equivalent to k8s.

As for the conflicts, I have to say I loathe the way the more dynamic part of configuration works. It might be the most ill conceived and poorly implemented system I've seen in 30+ years of working in the industry. Granted, it does basically work, but at the cost of wasting thousands of engineers' time every day. The conflicts occur because (a) it abuses source control as its underlying mechanism and (b) it generates the actual configs (what gets shipped to the affected machines) from the user-provided versions in a non-deterministic way which causes spurious differences. All of its goals - auditability, validation, canaries, caching, etc. - could be achieved without such aggravation if the initial design hadn't been so mind-bogglingly stupid.

But I digress. Sorry not sorry. ;) To answer your question, my personal solution is to take advantage of the fact that I'm on the US east coast and commit most of my changes before everybody else gets active.
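The spurious differences from non-deterministic config generation mentioned above are easy to picture with a small Python sketch. This is only an illustration with made-up config keys, not the system being described: if the generator serializes with a stable key order, logically equal inputs always produce byte-identical output, so source control sees no diff.

```python
import hashlib
import json


def render(config: dict) -> str:
    """Serialize config with sorted keys so equal inputs give identical text."""
    return json.dumps(config, sort_keys=True, indent=2) + "\n"


# Same data, different insertion order (hypothetical keys).
a = {"replicas": 3, "region": "us-east", "flags": {"b": 1, "a": 2}}
b = {"flags": {"a": 2, "b": 1}, "region": "us-east", "replicas": 3}

same = render(a) == render(b)  # deterministic output: no spurious diff
digest = hashlib.sha256(render(a).encode()).hexdigest()  # stable fingerprint
```

Without `sort_keys=True`, the two dictionaries would serialize in insertion order and show up as a change even though nothing meaningful differs.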

Sure. It's more that your CI/CD is lacking.

Sometimes you have to work with what you're given in a brownfield env, and a config management tool is useful in that case, but it's possible that you are working with a less than ideal architecture with less than ideal time/money to make changes.

State is always the enemy in technology.

I can't even imagine managing hundreds of servers whose state is unpredictable at any moment and they can't be terminated and replaced with a fresh instance for fear of losing something.

> State is always the enemy in technology.

I work in data storage. Am I the enemy, then? ;)

> can't even imagine managing hundreds of servers whose state is unpredictable at any moment

Be careful not to conflate immutability with predictability. The state of these servers is predictable. All of the information necessary to reconstruct them is on a single continuous timeline in source control. But that doesn't mean they're immutable because the head of that timeline is moving very rapidly.

> can't be terminated and replaced with a fresh instance for fear of losing something.

No, there's (almost) no danger of losing any data because everything's erasure-coded at a level of redundancy that most people find surprising until they learn the reasons (e.g. large-scale electrical outages). But there's definitely a danger of losing availability. You can't just cold-restart a whole service that's running on thousands of hosts and being used continuously by even more thousands without a lot of screaming. Rolling changes are an absolute requirement. Some take minutes. Some take hours. Some take days. Many of these services have run continuously for years, barely resembling the code or config they had when they first started, and their users wouldn't have it any other way. It might be hard to imagine, but it's an every-day reality for my team.
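For readers unfamiliar with rolling changes, the core idea can be sketched in a few lines of Python. The host names and the `max_unavailable` parameter are assumptions for the example, and a real rollout would also wait for health checks between batches.

```python
def rolling_batches(hosts, max_unavailable):
    """Yield successive batches so at most max_unavailable hosts change at once."""
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    for start in range(0, len(hosts), max_unavailable):
        yield hosts[start:start + max_unavailable]


# Hypothetical five-node service, changing two nodes at a time.
hosts = [f"node-{i}" for i in range(5)]
plan = list(rolling_batches(hosts, max_unavailable=2))
```

The smaller `max_unavailable` is relative to the fleet, the longer the change takes, which is one reason some rollouts run for hours or days.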

> I work in data storage. Am I the enemy, then? ;)

You’re the prison guard.

> Be careful not to conflate immutability with predictability.

I don't trust predictability. Drift is always a nightmare. Nothing is ever as predictable as you would like it to be.

> You can't just cold-restart a whole service that's running on thousands of hosts and being used continuously by even more thousands without a lot of screaming.

If it's architected well you can :)

> State is always the enemy in technology.

Except that state and its manipulation is usually the primary value in technology.

> I can't even imagine managing hundreds of servers whose state is unpredictable at any moment and they can't be terminated and replaced with a fresh instance for fear of losing something.

Yes, that sounds awful. That's why we have backups and, if necessary, redundancy and high availability.

> Except that state and its manipulation is usually the primary value in technology.

Exactly, and that's why you put state in data stores and keep your servers immutable.

But... How do you configure the hosts your containers are running on? How do you configure your storage (NAS/SAN)? How do you configure your routers and switches? ...

The original question didn't have much context, and I guess my answer assumed someone would be using a cloud provider as opposed to anything on premise.

Are Ansible/Puppet/Chef any good for managing the hardware you mentioned?

> Are Ansible/Puppet/Chef any good for managing the hardware you mentioned?

Ansible is used in networking, with many vendors having official modules:

* https://github.com/PaloAltoNetworks/ansible-pan

* https://github.com/aristanetworks/ansible-cvp

* https://www.juniper.net/documentation/en_US/junos-ansible/in...

* https://github.com/F5Networks/f5-ansible

There are even frameworks if you want to write things in 'raw' Python as well.

> Are Ansible/Puppet/Chef any good for managing the hardware you mentioned?

Yes. Well, OK, maybe not good but better than ad hoc.

Build the underlying VMs with Packer. Or use cloud-init as the parent mentioned - I think it has a bunch of knobs.

I'm going to agree with you. In 2020 (and really the last few years), configuration management is outdated. IaC (infrastructure as code) is the current approach. Containerize everything you can, and use Terraform, CloudFormation, or Azure DevOps.

Avoid managing the underlying OS as much as possible. Use vanilla or prebuilt images to deploy these containers on: CoreOS, Amazon's new Bottlerocket (maybe). Or use a service like Fargate when possible. All configuration should be declarative to avoid errors.

If you need to build images, tools like Packer are great. AWS has a recommended "golden AMI pipeline" pattern and a new Image Builder service if you can't use community images.

I'm speaking imperatively, but read these as my own directives. I work for a company that consults and actively helps Fortune 500s migrate to the cloud. So some of what I'm saying is not possible or harder on prem, and I recognize that.

If I had to, I still like Chef, with Puppet my second favorite, mostly because of familiarity. Ansible can be used with either of these. And there are tools like Serverspec to validate your images. I don't really use any of this anymore though.

But you still need to configure things, even if they are immutable at runtime. And you need to manage that configuration over time in some systematic way.

You always have a configuration management system.

I'm using Terraform to deploy Docker containers. Terraform's docker_container resource has a lovely 'upload' feature which one can use to upload files into the container. I make Terraform load server config files (or use multi-line strings in the .tf file), perform variable replacement, then destroy Docker containers and recreate them with updated config files. All persistent data is stored in directories bind-mounted into the Docker container.
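The render-then-recreate workflow described above can be sketched in Python. This is not how Terraform or its Docker provider is implemented; the template text, variable names, and helper functions are hypothetical and only show the idea of substituting variables and replacing a container when the rendered config changes.

```python
import hashlib
from string import Template


def render(template_text: str, variables: dict) -> str:
    """Fill $placeholders in a config template; raises KeyError if one is missing."""
    return Template(template_text).substitute(variables)


def fingerprint(rendered: str) -> str:
    """Content hash used to decide whether the running container is stale."""
    return hashlib.sha256(rendered.encode()).hexdigest()


def needs_recreate(running_fingerprint: str, rendered: str) -> bool:
    """True when the rendered config differs from what the container was built with."""
    return fingerprint(rendered) != running_fingerprint


# Hypothetical server config with two variables.
template = "listen $port\nupstream $backend\n"
v1 = render(template, {"port": 8080, "backend": "api:9000"})
v2 = render(template, {"port": 8080, "backend": "api:9001"})
running = fingerprint(v1)  # what the current container was created with
```

Here `needs_recreate(running, v1)` is false and `needs_recreate(running, v2)` is true, so only a real config change triggers a destroy and recreate, while the bind-mounted data directories survive.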

Terraform has some limitations. For example, one deployment cannot deploy hosts and their containers [1]. And there is no usable support for rolling deployments [2, 3]. So I've ended up with a 4-stage deployment: host-set1, containers on host-set1, host-set2, containers on host-set2.

I also use Terraform to deploy the servers to my laptop during development. Docker for Mac works well.

Someday, Kubernetes will get some usable documentation on how to do normal things [4]. Then I will use it for deploying containers, load balancers, and persistent volumes. For now, it's too big of a complexity jump over plain Docker.

[1] https://github.com/hashicorp/terraform/issues/2430

[2] https://github.com/hashicorp/terraform/issues/23735#issuecom...

[3] https://github.com/hashicorp/terraform/issues?q=is%3Aissue+%...

[4] https://github.com/kubernetes/website/issues/19139

Not all of us have the luxury of our projects being greenfield.

The question asks what I would consider to be the right approach for 2020, and also what my team is doing. This is the design pattern I've been following for 5 years, but obviously your mileage may vary, it won't work for everyone, etc.

What about the system that runs the containers?

"Amazon/Google/Azure takes care of that" is not the answer, unless your comment is predicated on a world where compute can only be rented from big corps... and their methods of managing underlying infrastructure are sacred secrets which we are too unworthy to comprehend.

Terraform keeps track of resources it creates. One can remove resources (VMs, managed databases, persistent volumes, DNS records) from the config file and Terraform will cleanly delete them. This is a crucial feature for most deployments.

For example, I deployed an app backend to DigitalOcean with a load balancer, 2x replicated API server, 2x replicated worker process, managed database, file storage, and static website. Terraform is tracking 114 resources for each deployment.

It seems that automatic removal is poorly supported by Ansible, Chef, Puppet, Salt, etc. One can explicitly add items to a "to_remove" list, but this is error-prone.

Terraform has many limitations and problems, but I have found no better tool.
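The value of tracking created resources can be shown with a simplified Python sketch. This is not Terraform's actual planning algorithm, and the resource names are invented. The point is only that once the tool remembers what it created, removals fall out of a set difference instead of a hand-maintained "to_remove" list.

```python
def plan(tracked: set, desired: set) -> dict:
    """Compare what was previously created with what the config now declares."""
    return {
        "create": sorted(desired - tracked),
        "delete": sorted(tracked - desired),
        "keep": sorted(tracked & desired),
    }


# Hypothetical state: the volume was removed from the config file.
tracked = {"droplet.api-1", "droplet.api-2", "dns.www", "volume.uploads"}
desired = {"droplet.api-1", "droplet.api-2", "dns.www"}
changes = plan(tracked, desired)
```

In this example `changes["delete"]` contains only `volume.uploads`, without anyone having to remember to list it explicitly.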

I mostly don't for new stuff (all in on Docker/ECS), however we have a lot of old stuff and things in the process of being migrated where it makes sense. There's also always the odd bird thing you use that needs to run on a regular host.

(Genuinely curious) what old stuff do you think doesn't make sense to be set up immutably? and what odd stuff needs to run on a regular host?

Example: How do you do "immutable" management of Mac OS machines? Taking what's typically described as such there, you've just turned a 30s deploy of software into a multi-hour "let's reimage the entire machine"?

(although that's of course not strictly "old stuff")

Were Macs in scope of the original post? I assumed it was server side stuff, rather than office hardware. For that, though, I'd use Jamf (Pro) or some other MDM option.

Even if you exclude "office hardware" from configuration management, our Mac OS build and test farm is "servers" I'd say. Not everyone running servers is doing so to run an online service on a platform of their choice.

Not the GP, but some proprietary software requires license activation and you only get a certain (small) number of activates/deactivates.

I would have to suss out dependency chains on stuff that others built.

Rather than doing that I take a light touch approach with ansible that will suffice until we can dockerize (which would require the same work, but then it’s a dev project vs now when it’s just a devops thing)

For mutable infra that holds state. IMO not all infra is going to end up in k8s, and some still needs to be self-hosted.

> What I prefer to do is use Terraform to create immutable infrastructure from code.

Can you mount all your volumes read-only and run all of your stack? If you cannot, then you do not have immutable infrastructure. You simply happen to agree that no one writes anything useful, which with time will absolutely fail because someone, somewhere is going to start storing state on a stateless system, giving you "a cow #378 called 'Betsy'".

In the current state of infrastructure, an accepted definition of "immutable infrastructure" is that:

1. You deploy a completely fresh instance/container, instead of in-place updates.
2. You don't actively push changes on a running instance/container.

Of course you might have stuff written to disk, such as logs, temp files, etc. But it should be non-essential data, and potentially pushed to a central place in near real-time.

Interesting. How would you do that if your deployment is, say, a couple of new tables in a 50TB Oracle database?

It only works with stateless resources.

There's no point in trying to manage a database or similar resources this way.

So, Linux-only? ;)

Yes, but to be clear, some of those containers have been .NET Core containers (running in Kubernetes) for me. I appreciate that not having Windows in an estate isn't common to all setups.
