Hacker News
Ask HN: Which configuration management software would/should you use in 2020?
257 points by uaas on March 14, 2020 | 211 comments
What is your team using at work? What should be used at scale (FAANG, or similar)? What are you planning to switch to?

Not FAANG but for small to medium "cloud native" businesses I like to use this approach with minimal dependencies:

Managed Kubernetes cluster such as GKE for each environment, setup in cloud provider UI since this is not done often. If you automate it with terraform chances are next time you run it, the cloud provider has subtly changed some options and your automation is out-of-date.

Cluster services repository with Helm charts for ingress controller, centralized logging and monitoring, etc. Use a values-${env}.yaml for environment differences. Deploy with CI service such as Jenkins.

Configuration repository for each application with Helm Chart. If it's an app with one service or all services in a single repo this can go in the same repo. If it's an app with services across multiple repos, create a new repo. Use a values-${env}.yaml for environment differences. Deploy with CI service such as Jenkins.
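To make the values-${env}.yaml pattern concrete, a hypothetical production override might look like this (chart structure and all names are made up):

```yaml
# values-prod.yaml -- layered on top of the chart's defaults at deploy time,
# e.g.: helm upgrade --install myapp ./chart -f values-prod.yaml
replicaCount: 3
image:
  tag: "1.4.2"
ingress:
  host: app.example.com
resources:
  requests:
    cpu: 500m
    memory: 512Mi
```

The CI job just picks the right file per environment, so the chart itself stays identical across dev/stage/prod.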

Store secrets in cloud secrets manager and interpolate to Kubernetes secrets at deploy time.
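As a sketch of what that deploy-time interpolation can look like with Helm and GCP Secret Manager (secret and value names here are hypothetical):

```yaml
# templates/secret.yaml in the chart. The actual value never lives in git;
# the CI pipeline fetches it at deploy time and passes it in, e.g.:
#   helm upgrade --install myapp ./chart \
#     --set dbPassword="$(gcloud secrets versions access latest --secret=db-password)"
apiVersion: v1
kind: Secret
metadata:
  name: myapp-db
type: Opaque
stringData:
  DB_PASSWORD: {{ .Values.dbPassword | quote }}
```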

Cloud provider keeps the cluster and VMs up-to-date, CI pipelines do the builds and deployments. No terraform/ansible/other required. Again, this only works for "cloud native" models.

Our setup is quite similar to this. Some differences: each environment is represented as a Helm parent chart, with each application being a child chart. Each environment chart has its own repo where values.yaml supplies environment-specific overrides for each application. Each application has its own repo where the Helm charts and application source both reside.

Similar to what I've done.

Applications have a unique source repo. Each repo has a build dir. The build dir contains a subdir for the Docker build, terraform configs (if needed) for dependent infrastructure, and a helm chart for deploys.

I have a few things that don't fit the microservice pattern. They are terraform first (root of repo is TF code) and they have a build dir to define the next steps (mostly packer).

As I'm writing this, I think I need to change that and make the top level a README.md file and use the build dir pattern to be consistent.

Yeah, in a decent architecture the only place state is located is in the datastore layer.

The goal is to make servers disposable, able to be destroyed and created at will, so configuration management becomes kind of a legacy technology at that point.

Yes, but created from what? This can be called config mgmt too.

For traditional datastore, I usually do:

Dev/QA/similar: either containerize and back it with a persistent volume, or use a managed DB service such as RDS or Cloud SQL and create a schema per environment. Include a deployment pipeline argument to reset to a known state. The CI pipeline can be tuned to handle dynamic environments in either case.

Stage/Prod: Use managed DB service such as RDS or Cloud SQL.

The time and cost to automate a DB upgrade with every edge case considered is huge. Rarely makes sense for small/medium business.

Nitpick: I really don't suggest a divergence in the DB/stack-of-choice between Dev/QA/Stage/Prod. I've chased so many issues that were dismissed in the planning process as "yeah, that's an edge case and most likely won't happen".

The reasons I've seen for doing so are usually penny-wise, pound-foolish. Penny-wise in saving a few dollars (conceptually) on a spreadsheet for per-env/per-cycle, while neglecting the long-tail consequence of your labor factor just growing, potentially forever, without regard for total cost of ownership.

Sorry didn't mean to rant. Hope this helps.

Are you using Jenkins or Jenkins X for the deployment?

Normal Jenkins, with a shared library [1] to track service versions and inject the latest deployable version in the deployment pipeline.

[1] https://github.com/boxboat/dockhand

I still prefer the open source edition of https://puppet.com/ to manage larger, diverse environments - which may include not just servers, but workstations, network appliances and so on. It's well established with lots of quite portable modules. But it can also be a bit on the slower side and comes with a steeper learning curve than some of the others.

https://www.ansible.com/ is surely a good solution for bootstrapping Linux cloud machines and can be quite flexible. I personally feel like its usage of YAML manifests instead of a domain-specific language can make complex playbooks harder to read and to maintain.

If all you do is to deploy containers on a managed Kubernetes or a similar platform, you might get away with some solution to YAML templating (jsonnet et al) and some shell glue.

I am keeping an eye on https://github.com/purpleidea/mgmt which is a newer contender with many interesting features, but it lacks more complex examples.

Others like saltstack and chef still see some usage as far as I know, but I've got no personal experience with them.

Ansible is amazing for configuration management, much better than Puppet. Storing the config in YAML makes it super easy to read and maintain, also much better than Puppet's method.

As you mention, puppet has a steep learning curve, whereas Ansible has a very shallow one. It’s easy to get running in a few minutes!

We use both Puppet and Ansible at work, and it's constant complaints and delays with Puppet, whereas with Ansible it's few complaints and no delays.

> We use both Puppet and Ansible at work, and it's constant complaints and delays with Puppet, whereas with Ansible it's few complaints and no delays.

That's probably because you are not running masterless, which means your puppet master is a bottleneck.

The master is part of the bottleneck, but a lot of the complaints are trying to get it to do what it says. But a big benefit of puppet is the master feature, so if that's taken out, why puppet?

Puppetmaster took off because it was conceptually easy to understand for people who were used to managing servers by hand.

I would argue that masterless puppet is a superior pattern, both for scaling and for creating hierarchical structures.

I used to use Puppet back when they were ruby based. I dropped them once they switched to Java, not interested in pushing Java onto every host when it's not in our stack.

It's still good in Enterprise land where taking the time to work out the declarative style and dependency chains is worth it (and you have the people to put on it and the CAB process to review infrastructure changes). For a small to mid sized company I find it gets in the way of iterating fast. I spent waaaay too much time there either fighting the tooling or having to work out dependency chains. Redhat and I-think-AWS-but-I-might-be-thinking-of-Chef also have tooling in this space.

I'll take Chef or Ansible's imperative approach in the environments I work in any day (Mostly ansible playbooks for baking hosts only, I've never been entirely comfortable with having one Ansible Tower/Chef Server/Puppetmaster/etc be authoritative over everything, too large a failure pattern if security controls fail). But again, I'm working in many younger small environments and not large mature ones.

Most of this is also irrelevant for us as we're all in on Docker/ECS for anything new. Config management plays a limited role there over having your tasks/services checked into the individual repos.

Just for reference, the clients are all still Ruby based. It's only the web servers for the puppet masters (the parsing code is still JRuby) and PuppetDB that are written in Clojure, which runs on the JVM.

I favor Ansible for 2 main reasons:

- If you have SSH access, you can use it. No matter what environment or company you work for, there’s no agent to install and no need to get approval to use the tool. It’s easy to build up a reproducible library of your shell habits that works locally or remotely, where each step can avoid being repeated in case there’s a need to rerun things.

- If you get into an environment where performance across many machines is more important, you can switch to pull-based execution. Because of that, I see very little advantage to any of the other tools that outweighs the advantages of Ansible.
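A minimal sketch of that agentless workflow: one playbook run over plain SSH, where each task is idempotent so reruns only do outstanding work (host and package names are made up):

```yaml
# site.yml -- run with: ansible-playbook -i "web1.example.com," site.yml
- hosts: all
  become: true
  tasks:
    - name: Install nginx (no-op on reruns if already installed)
      package:
        name: nginx
        state: present

    - name: Ensure nginx is enabled and running
      service:
        name: nginx
        state: started
        enabled: true
```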

Try Puppet Bolt. Workflow is similar to Ansible. No pesky master or certificate setup, no preparation needed, just an inventory file and SSH to the remote server. You get the entire ecosystem of Puppet modules and the Puppet language scales well when your configuration becomes larger.
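For anyone curious, a rough sketch of what that looks like (check the Bolt docs for the exact inventory schema in your version; hostnames here are invented):

```yaml
# inventory.yaml
targets:
  - web1.example.com
  - web2.example.com
config:
  transport: ssh
  ssh:
    user: deploy
```

Then something like `bolt command run 'uptime' --targets all --inventoryfile inventory.yaml` for ad hoc tasks, or applying a Puppet manifest over SSH, with no agent or certificate dance.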

Does it need the Puppet agent installed on the remote server?

No, you only need Puppet Bolt on your computer and SSH to the remote host.

> If you have SSH access, you can use it. No matter what environment or company you work for, there’s no agent to install

I don't get why this is always brought up as a major advantage when discussing CM. Ansible actually installs its Python runtime on target systems. Once I had a server with a full root disk, and Ansible failed to work because there was no space left to copy tons of its Python code.

Ansible doesn't install a runtime on the target machine, it temporarily copies over the scripts that do the work and removes them after the run is complete. These are a few kilobytes typically.

No configuration system is likely to work with a full root partition, though.

Agree. Although I personally prefer Ansible to the alternatives, the one thing I don't like is that it does require Python to be installed on the target hosts for most modules. That's not a problem usually, but every now and then it is. Also, sometimes additional Python packages are required, and managing those in an automated manner is usually a hassle....

I've had this idea for a config management system that compiles all provisioning code to POSIX shell. One of these years I'll finish it... :-)

> it does require python to be installed on the target hosts for most modules

I recently learned that ansible supports binary modules, in addition to the OOtB support for modules written in any already-configured scripting language (including shell): https://docs.ansible.com/ansible/2.9/dev_guide/developing_pr...

However, while I know you meant "all the provided modules," it seems they are headed toward a less "batteries included" style and more (pypi|maven|npm|rubygems) style of "the community will sort it out" mechanism of distribution: https://docs.ansible.com/ansible/2.9/user_guide/collections_...

Which I welcome wholeheartedly because landing even the simplest fixes to ansible modules is currently a very laborious and time intensive operation

Have you tried using gather_facts: false and the raw module to set up Python?

Example : https://yourlabs.io/oss/bigsudo/blob/master/bigsudo/bootstra...

Bigsudo ansible wrapper always plays that no matter what and as a result ansible never fails because of missing python.
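The pattern, roughly (the package names assume a Debian-family target; adjust for your distro):

```yaml
# bootstrap.yml -- runs with zero Python on the target, since raw uses bare SSH
- hosts: all
  gather_facts: false
  tasks:
    - name: Bootstrap Python before any regular module runs
      raw: test -e /usr/bin/python3 || (apt-get update && apt-get install -y python3)
      changed_when: false

    - name: Now fact gathering and normal modules work
      setup:
```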

> Ansible doesn't install a runtime on the target machine, it temporarily copies over the scripts that do the work

So yeah, it does copy its "runtime". It cannot work purely by running commands over SSH. It needs to copy "its agent" to target, and worse, it does this every time you run it.

> there was no space left to copy tons of its Python code.

What can anything do without any space left?

ssh myhost rm -r /var/cache/... for example works quite well. It's not like the server becomes a piece of useless metal once you run out of space.

I'm curious why people use configuration management software in 2020. All of that seems like the old way of approaching problems to me.

What I prefer to do is use Terraform to create immutable infrastructure from code. CoreOS and most Linux variants can be configured at boot time (cloud-config, Ignition, etc) to start and run a certain workload. Ideally, all of your workloads would be containerised, so there's no need for configuration drift, or for any management software to be running on the box. If you need to update something, create the next version of your immutable machine and replace the existing ones.
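As a sketch, the boot-time configuration can be a cloud-init user-data file baked into the Terraform launch template, so replacing the instance replaces the config (image, registry, and paths are hypothetical, and this assumes Docker is preinstalled in the image):

```yaml
#cloud-config
# Written once at first boot; the instance is never reconfigured in place.
write_files:
  - path: /etc/myapp/env
    content: |
      APP_ENV=prod
runcmd:
  - docker run -d --restart=always --env-file /etc/myapp/env registry.example.com/myapp:1.4.2
```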

"Immutable infrastructure" what a laugh. In a large deployment, configuration somewhere is always changing - preferably without restarting tasks because they're constantly loaded. We have (most) configuration under source control, and during the west-coast work day it is practically impossible to commit a change without hitting conflicts and having to rebase. Then there are machines not running production workloads, such as development machines or employees' laptops, which still need to have their configuration managed. Are you going to "immutable infrastructure" everyone's laptops?

(Context: my team manages dozens of clusters, each with a score of services across thousands of physical hosts. Every minute of every day, multiple things are being scaled up or down, tuned, rearranged to deal with hardware faults or upgrades, new features rolled out, etc. Far from being immutable, this infrastructure is remarkably fluid because that's the only way to run things at such scale.)

Beware of Chesterton's Fence. Just because you haven't learned the reasons for something doesn't mean it's wrong, and the new shiny often re-introduces problems that were already solved (along with some of its own) because of that attitude.

Are you sure you two are talking about the same thing?

My understanding of immutable infrastructure is the same as immutable data structures: once you create something, you don't mess with it. If you need a different something, you create a new one and destroy the old one.

That doesn't mean that the whole picture isn't changing all the time. Indeed, I think immutability makes systems overall more fluid, because it's easier to reason about changes. Mutability adds a lot of complexity, and when mutable things interact, the number of corner cases grows very quickly. In those circumstances, people can easily learn to fear change, which drastically reduces fluidity.

Yup. We do this. When our servers need a change, we change the AMI for example, and then re-deployment just replaces everything. Most servers survive a day, or a few hours.

Makes sense to me. I was talking with a group of CTOs a couple years back. One of them mentioned that they had things set up so that any machine more than 30 days old was automatically murdered, and others chimed in with similar takes.

It seemed like a fine idea to me. The best way to be sure that everything can be rebuilt is to regularly rebuild everything. It also solves some security problems, simplifies maintenance, and allows people to be braver around updates.

Configuration Management is still present in this process, it's just moved from the live system to the image build step.

Probably the most insightful comment in this entire thread. Thank you. In many cases, an "image" is just a snapshot of what configuration management (perhaps not called such but still) gives you. As with compiled programming languages, though, doing it at build time makes future change significantly slower and more expensive. Supposedly this is for the sake of consistency and reproducibility, but since those are achievable by other means it's a false tradeoff. In real deployments, this just turns configuration drift into container sprawl.

Is this still as painful as it used to be? AMI building took ages, so iteration ("deployment") speed is really awful.

Personally that's why I avoid Packer (or other AMI builders) and keep very tightly focussed machines set up by the cloud-init type process.

So, once you create a multi-thousand-node storage cluster, if you need to change some configuration, replace the whole thing? Even if you replace onto the same machines - because that's where the data is - that's an unacceptable loss of availability. Maybe that works for a "stateless" service, but for those who actually solve persistence instead of passing the buck it just won't fly.

Could you say more about why your particular service can't tolerate rolling replacement of nodes? You're going to have to rebuild nodes eventually, so it seems to me that you might as well get good at it.

And just to be clear, I'm very willing to believe that your particular legacy setup isn't a good match for cattle-not-pets practices. But I think that's different than saying it's impossible for anybody to bring an immutable approach to things like storage.

The person you're replying to didn't say "replace every node," they said "replace the whole thing."

To give a really silly example, adding a node to a cluster is a configuration change. It wouldn't make sense to destroy the cluster and recreate it to add a new node. There are lots of examples like this where if you took the idea of immutable infrastructure to the extreme it would result in really large wastes of effort.

Could you please point me at prominent advocates of immutable infrastructure who propose destroying whole clusters to add a node? Because from what I've seen, that's a total misunderstanding.

As I said, it's a silly example just to highlight an extreme. In between there are more fluid examples. I don't think it's that ridiculous to propose destroying and recreating the cluster in its entirety when you're deploying a new node image. However as you say I'm not sure anyone would advocate that except in specific circumstances.

On the other hand, while my suggestion of doing it to add a node sounds ridiculous I'm sure there are circumstances in which it's not only understandable but necessary, due to some aspect of the system.

I'm saying it's not even an extreme, in that I don't believe what people are calling "immutable infrastructure" includes that.

If your biggest objection to an idea is that you can make up a silly thing that sounds like it might be related, I'm not understanding why we need to have this discussion. I'd like to focus on real issues, thanks.

I'm not objecting categorically to anything. I think that immutable infrastructure is a spectrum, and depending on your needs you may have just about everything immutably configured, or almost nothing. I just don't think it's so black and white as "you should always use immutable infrastructure."

I also think it's a cool idea to destroy the entire cluster just to add a node, and it sounds ridiculous but also like there's some circumstances where it makes perfect sense.

Again, do you have a citation for the notion that it's a spectrum? The original post that coined the term doesn't talk about it that way, and neither do the other resources I found in a quick search. As I see it, it's binary: when you need to change something on a server, you either tinker with the existing server or you replace the server with a fresh-built one that conforms to the new desire.

Wow, look at those goalposts go! If you make enough exceptions to allow incremental change, then "immutable" gets watered down to total meaninglessness. That's not an interesting conversation. This conversation is about configuration management, which is still needed in a "weakly immutable" world.

Again, could you please point me at notable advocates of immutable infrastructure proposing the approach you take such exception to? And note that I'm not proposing any exceptions.

Presumably you replace the parts that changed and keep the parts that didn't.

Interesting to say you've "solve[d] persistence" when you seem to be limited by it here. Is there a particular reason your services can't be architected in a less stateful, more 12-factor way?

Kick the persistence can down the road some more? Sure, why not? But sooner or later, somebody has to write something to disk (or flash or whatever that doesn't disappear when the power's off). A system that stores data is inherently stateful. Yes, you can restart services that provide access or auxiliary services (e.g. repair) but the entire purpose of the service as a whole is to retain state. It's the foundation on top of which all the slackers get to be stateless themselves.

The vast majority of people simply redefine the terms to fit whatever they are selling.

If your systems are immutable they can run read-only. In the nineties Tripwire, the integrity checker, popularized it; you could run it off CD-ROM. Today immutable infrastructure is VMs/containers that can be run off a SAN or a pass-through file system that is read-only. It means snapshots are completely and immediately replicable. When you need to deploy, you take a base image/container, install code onto it, run tests to ensure that it is not broken, and replicate it as many times as you need, in a read-only state. This approach also has an interesting property: because the system is read-only (as in exported to the instance read-only/mounted by the instance read-only) it is extremely difficult to do nasty things to it after a break-in - if it is difficult to create files, it is difficult to stage exploits.

That's the only kind of infrastructure where configuration management on the instances themselves is not needed.

What sort of stack do you all use then to manage these clusters? Have you found any solutions to your conflicts?

The hosts are managed via chef, the jobs/tasks running on those hosts by something roughly equivalent to k8s.

As for the conflicts, I have to say I loathe the way the more dynamic part of configuration works. It might be the most ill conceived and poorly implemented system I've seen in 30+ years of working in the industry. Granted, it does basically work, but at the cost of wasting thousands of engineers' time every day. The conflicts occur because (a) it abuses source control as its underlying mechanism and (b) it generates the actual configs (what gets shipped to the affected machines) from the user-provided versions in a non-deterministic way which causes spurious differences. All of its goals - auditability, validation, canaries, caching, etc. - could be achieved without such aggravation if the initial design hadn't been so mind-bogglingly stupid.

But I digress. Sorry not sorry. ;) To answer your question, my personal solution is to take advantage of the fact that I'm on the US east coast and commit most of my changes before everybody else gets active.

Sure. It's more that your CI/CD is lacking.

Sometimes you have to work with what you're given in a brownfield env, and a config management tool is useful in that case, but it's possible that you are working with a less-than-ideal architecture and less-than-ideal time/money to make changes.

State is always the enemy in technology.

I can't even imagine managing hundreds of servers whose state is unpredictable at any moment and they can't be terminated and replaced with a fresh instance for fear of losing something.

> State is always the enemy in technology.

I work in data storage. Am I the enemy, then? ;)

> can't even imagine managing hundreds of servers whose state is unpredictable at any moment

Be careful not to conflate immutability with predictability. The state of these servers is predictable. All of the information necessary to reconstruct them is on a single continuous timeline in source control. But that doesn't mean they're immutable because the head of that timeline is moving very rapidly.

> can't be terminated and replaced with a fresh instance for fear of losing something.

No, there's (almost) no danger of losing any data because everything's erasure-coded at a level of redundancy that most people find surprising until they learn the reasons (e.g. large-scale electrical outages). But there's definitely a danger of losing availability. You can't just cold-restart a whole service that's running on thousands of hosts and being used continuously by even more thousands without a lot of screaming. Rolling changes are an absolute requirement. Some take minutes. Some take hours. Some take days. Many of these services have run continuously for years, barely resembling the code or config they had when they first started, and their users wouldn't have it any other way. It might be hard to imagine, but it's an every-day reality for my team.

> I work in data storage. Am I the enemy, then? ;)

You’re the prison guard.

> Be careful not to conflate immutability with predictability.

I don't trust predictability. Drift is always a nightmare. Nothing is ever as predictable as you would like it to be.

>You can't just cold-restart a whole service that's running on thousands of hosts and being used continuously by even more thousands without a lot of screaming.

If it's architected well you can :)

> State is always the enemy in technology.

Except that state and its manipulation is usually the primary value in technology.

> I can't even imagine managing hundreds of servers whose state is unpredictable at any moment and they can't be terminated and replaced with a fresh instance for fear of losing something.

Yes, that sounds awful. That's why we have backups and, if necessary, redundancy and high availability.

> Except that state and its manipulation is usually the primary value in technology.

Exactly and thats why you put state in data stores and keep your servers immutable.

But... how do you configure the hosts your containers run on? How do you configure your storage (NAS/SAN)? How do you configure your routers and switches? ...

The original question didn't have much context, and I guess my answer assumed someone would be using a cloud provider as opposed to anything on premise.

Are Ansible/Puppet/Chef any good for managing the hardware you mentioned?

> Are Ansible/Puppet/Chef any good for managing the hardware you mentioned?

Ansible is used in networking, with many vendors having official modules:

* https://github.com/PaloAltoNetworks/ansible-pan

* https://github.com/aristanetworks/ansible-cvp

* https://www.juniper.net/documentation/en_US/junos-ansible/in...

* https://github.com/F5Networks/f5-ansible

There are even frameworks if you want to write things in 'raw' Python as well.

> Are Ansible/Puppet/Chef any good for managing the hardware you mentioned?

Yes. Well, OK, maybe not good but better than ad hoc.

Build the underlying VMs with Packer. Or use cloud-init as the parent mentioned - I think it has a bunch of knobs.

I'm going to agree with you. In 2020 (and really the last few years), configuration management is outdated. IaC (infrastructure as code) is the current approach. Containerize everything you can; use Terraform, CloudFormation, or Azure DevOps.

Avoid managing the underlying OS as much as possible. Use vanilla or prebuilt images to deploy these containers on: CoreOS, Amazon's new Bottlerocket (maybe). Or use a service like Fargate when possible. All configuration should be declarative to avoid errors.

If you need to build images, tools like Packer are great. AWS has a recommended "golden AMI pipeline" pattern and a new Image Builder service if you can't use community images.

I'm speaking imperatively but read these as my own directives. I work for a company that consults and actively helps fortune 500's migrate to the cloud. So some of what I'm saying is not possible or harder on prem and I recognize that.

If I had to, I still like Chef, with Puppet a second favorite, mostly because of familiarity. Ansible can be used with either of these. And tools like Serverspec to validate your images. I don't really use any of this anymore, though.

But you still need to configure things, even if they are immutable at runtime. And you need to manage that configuration over time in some systematic way.

You always have a configuration management system.

I'm using Terraform to deploy Docker containers. Terraform's docker_container resource has a lovely 'upload' feature which one can use to upload files into the container. I make Terraform load server config files (or use multi-line strings in the .tf file), perform variable replacement, then destroy Docker containers and recreate them with updated config files. All persistent data is stored in directories bind-mounted into the Docker container.

Terraform has some limitations. For example, one deployment cannot deploy hosts and their containers [1]. And there is no usable support for rolling deployments [2, 3]. So I've ended up with a 4-stage deployment: host-set1, containers on host-set1, host-set2, containers on host-set2.

I also use Terraform to deploy the servers to my laptop during development. Docker for Mac works well.

Someday, Kubernetes will get some usable documentation on how to do normal things [4]. Then I will use it for deploying containers, load balancers, and persistent volumes. For now, it's too big of a complexity jump over plain Docker.

[1] https://github.com/hashicorp/terraform/issues/2430

[2] https://github.com/hashicorp/terraform/issues/23735#issuecom...

[3] https://github.com/hashicorp/terraform/issues?q=is%3Aissue+%...

[4] https://github.com/kubernetes/website/issues/19139

Not all of us have the luxury of our projects being greenfield.

The question asks what I would consider to be the right approach for 2020, and also what my team is doing. This is the design pattern I've been following for 5 years, but obviously your mileage may vary, it won't work for everyone, etc.

What about the system that runs the containers?

"Amazon/Google/Azure takes care of that" is not the answer, unless your comment is predicated on a world where compute can only be rented from big corps... and their methods of managing underlying infrastructure are sacred secrets for which we are to unworthy to comprehend.

Terraform keeps track of resources it creates. One can remove resources (VMs, managed databases, persistent volumes, DNS records) from the config file and Terraform will cleanly delete them. This is a crucial feature for most deployments.

For example, I deployed an app backend to DigitalOcean with a load balancer, 2x replicated API server, 2x replicated worker process, managed database, file storage, and static website. Terraform is tracking 114 resources for each deployment.

It seems that automatic removal is poorly supported by Ansible, Chef, Puppet, Salt, etc. One can explicitly add items to a "to_remove" list, but this is error-prone.

Terraform has many limitations and problems, but I have found no better tool.

I mostly don't for new stuff (all in on Docker/ECS), however we have a lot of old stuff and things in the process of being migrated where it makes sense. There's also always the odd bird thing you use that needs to run on a regular host.

(Genuinely curious) what old stuff do you think doesn't make sense to be set up immutably? and what odd stuff needs to run on a regular host?

Example: How do you do "immutable" management of Mac OS machines? Taking what's typically described as such there, you've just turned a 30s deploy of software into a multi-hour "lets reimage the entire machine"?

(although that's of course not strictly "old stuff")

Were Macs in scope of the original post? I assumed it was server side stuff, rather than office hardware. For that, though, I'd use Jamf (Pro) or some other MDM option.

Even if you exclude "office hardware" from configuration management, our Mac OS build and test farm is "servers" I'd say. Not everyone running servers is doing so to run an online service on a platform of their choice.

Not the GP, but some proprietary software requires license activation and you only get a certain (small) number of activates/deactivates.

I would have to suss out dependency chains on stuff that others built.

Rather than doing that I take a light touch approach with ansible that will suffice until we can dockerize (which would require the same work, but then it’s a dev project vs now when it’s just a devops thing)

For mutable infra that holds state. IMO not all infra is gonna end up in k8s, and some still needs to be self-hosted.

> What I prefer to do is use Terraform to create immutable infrastructure from code.

Can you mount all your volumes read-only and still run your whole stack? If you cannot, then you do not have immutable infrastructure. You simply have an agreement that no one writes anything useful, which over time will absolutely fail, because someone, somewhere is going to start storing state on a "stateless" system, giving you "a cow #378 called 'Betsy'".

In the current state of infrastructure, an accepted definition of "immutable infrastructure" is that:

1. You deploy a completely fresh instance/container instead of doing in-place updates

2. You don't actively push changes to a running instance/container

Of course you might have stuff written to disk, such as logs, temp files, etc. But it should be non-essential data, and potentially pushed to a central place in near real-time.
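Those two rules can be sketched in a few lines of Python (names and image format hypothetical):

```python
# Rule 1: a deploy always creates brand-new instances from an image.
def deploy(image: str, count: int) -> list:
    return [f"{image}#{i}" for i in range(count)]

# Rule 2: there is deliberately no way to mutate a running instance;
# a "change" is a new image plus a fresh deploy replacing the old set.
def update(old_instances: list, new_image: str) -> list:
    return deploy(new_image, len(old_instances))  # old set is terminated, not patched
```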

Interesting. How would you do that if your deployment is, say, a couple of new tables in a 50TB Oracle database?

It only works with stateless resources.

There's no point in trying to manage a database or similar resources this way.

So, Linux-only? ;)

Yes, but to be clear, some of those containers have been .Net Core containers (running in Kubernetes) for me. I appreciate not having Windows in an estate isn't common to all setups.

Surprised more people here are not using Salt. Having used both Salt and Ansible, I much prefer Salt, especially when working with larger teams.

When working solo I use Guix, both Guix and Nix are _seriously_ amazing.

Salt has much nicer configs and feels superior to Ansible. The main disadvantage I had with Salt was the need for a salt-master server. I read that this is no longer needed, but I have not tried it myself. Keeping secrets outside of the repo was not a trivial task; Ansible has an easy way to encrypt secrets.

Wow really? I tried to learn Salt and it was way too complex. Comparatively Ansible was amazing to learn.

Made worse by the same thing that plagues Homebrew: making up cutesy vocabulary for things. I've been working with computers for a long time, Salt; you can feel free to use Adult Words when describing a thing to me.

Shameless plug: I've created a GUI for Salt:


Feedback welcome!

What's salt? Any link? I found something called SaltStack but that appears to be enterprise security software.


Salt (also known as SaltStack) was right.

> Salt is a new approach to infrastructure management built on a dynamic communication bus. Salt can be used for data-driven orchestration, remote execution for any infrastructure, configuration management for any app stack, and much more.


I'm ultra confused about their marketing btw. Their website doesn't even say it's open source. You have to sign up to "try it now". It's like they don't want customers? Or are people who want to understand what they're buying not the target market, somehow?

For reference, this appears to be the Salt primer: https://docs.saltstack.com/en/getstarted/system/index.html

Salt the company's marketing is atrocious, Salt the software is great.

I think there is a small subset of users where open source is actually a buying decision

Sure, but understanding what the thing is, that's part of the buying decision right? I have no clue what https://www.saltstack.com/ is about.

How do I get from this:

> Drive IT security into the 21st century. Amplify the impact of your entire SecOps team with global orchestration and automation that remediates security issues in minutes, not weeks.

to "it's a provisioning tool for servers, like ansible but faster"?

The docs landing page gave me an easier-to-grok picture of what it does, purely based on seeing what the doc headers are - https://docs.saltstack.com/en/latest/

SaltStack is actually a configuration management system, but they're re-characterizing themselves as security management software.

They had a very large presence at RSAC 2020. I was very confused by it; they are not security software. However, I suppose there are no "configuration management" trade shows.

Config Management Camp is the main conference for config management: https://cfgmgmtcamp.eu/

I haven't looked into what they added specifically for the space, but I think a configuration management company at a security trade show makes sense: Configuration management is a very useful tool for various security goals.

I use Ansible, mostly because it works pretty well for deployments (on traditional, non-dockerized applications), and then I can just gradually put more configuration under management.

So it's a very good tool to gradually get a legacy system under configuration management and thus source control.

My default tends to be Ansible because it is really versatile and lightweight on the systems being managed. That versatility can bite you though because it's easy to use it as a good solution and miss a great one. Also, heaven help you if you need to make a change on 1000s of hosts quickly.

I also use (in order of frequency): Terraform, Invoke (sometimes there is no substitute for a full programming language like Python), SaltStack (1000s of machines in a heterogeneous environment).

If I were going to deploy a new app on k8s today, I would probably use something like https://github.com/fluxcd/flux.

I haven't really had a pleasant time with the tooling around the serverless ecosystem yet once you get beyond hello worlds and canned code examples.

> Also, heaven help you if you need to make a change on 1000s of hosts quickly.

Why? I would have seen that as Ansible's strong point.

It gets terribly slow and eats up literally tens of gigabytes of RAM. Extensions like Mitogen can help, though.


Re: performance: That's fair. I didn't realize it scaled that badly.

Re: mitogen: Thanks! I saw that once, a long time ago, but couldn't find it again. I'll have to try it; vanilla ansible is fine for me so far, but I'm hardly going to ignore a speed boost that looks basically free to implement.

Mitogen looks cool. I hadn't seen it, thanks.

Are you referring to the "Invoke" Python library http://www.pyinvoke.org ? Would you please explain a bit about how you use it for deployments?

I might be a fanboy of type safety and a quick feedback loop, but I cannot imagine a better configuration management system than straight configuration as code, e.g. in Go: https://github.com/bwplotka/mimic

I really don't see why so many weird, unreadable languages like jsonnet or CUE were created when there is already a type-safe, script-like language (Go compiles in milliseconds, and there is even a go run command) with full-fledged IDE autocompletion support, abstractions and templating capabilities, mature dependency management, and much more. Please tell me why we keep inventing thousands of weird things when we already have tools that help with configuration! (:
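mimic does this in Go; the same idea can be sketched in Python (the Deployment type here is hypothetical, not a real mimic or k8s API): typed values generate the final config, so the type checker and IDE catch mistakes that templated text would not.

```python
# Config-as-code sketch: a typed object renders to plain JSON output.
import json
from dataclasses import dataclass, asdict

@dataclass
class Deployment:
    name: str
    image: str
    replicas: int = 2  # a typo here is a type error, not a bad rollout

def render(d: Deployment) -> str:
    return json.dumps({"kind": "Deployment", "spec": asdict(d)}, indent=2)
```

The whole template layer collapses into ordinary function calls and dataclasses, which any IDE can autocomplete and refactor.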

I agree. I wish we could just use EDN and Clojure, but your DevOps guy is not writing Go or Clojure code.

They are also not doing code reviews to enforce security policies.

If you have DevOps guys who are also software developers, more power to you, but if I approach my DevOps team with:

"Hey, just code your scripts in this Turing-complete language", they will ask me "what's your username again?" BOFH-style ;)

I find this post very condescending. You are not better than your devops guy at managing infrastructure because you can code in Clojure or GoLang or whatever other programming language.

Holy gods yes! Please let me use a real programming language instead of an unholy mixture of yaml and jinja. Clojure would be such a dream!

Hah, true.

I totally see this being difficult. Getting DevOps/Ops to actually do code reviews and versioned, type-safe configuration is hard, but once you accomplish it, the profits are really worth it!

Please consider that you're a principal engineer with a BS and a Master's. And you've achieved all those things quite quickly! You're on the far end of a bell curve.

A full programming language is the natural choice for people who are full programmers. But for people who aren't, they're intimidating and add a lot of complexity. Templating systems are much more approachable for people who have a lot of experience configuring things via big blobs of text.

As a programmer, I would personally rather express everything in a programming language, so I get your perspective here. But it isn't an accident that there are so many ops-focused systems that are different takes on just automating the things people were previously doing manually.

I totally understand that, but why not aim high? Why do we say "you have experience configuring stuff, so we will just give you some extended JSON with templating; you won't be able to code"? I think this is a bad approach (: We should always aim high and mentor those with less experience to use programming languages for this. They don't need to know complex algorithms, distributed systems, or performance optimization. It's really just smarter templating, which can actually be easier to use! (:

> Templating systems are much more approachable for people who have a lot of experience configuring things via big blobs of text.

I have mixed feelings about this statement. How is reading or using jsonnet easier? I am a principal engineer and I am struggling to work with it; how can less experienced people deal with it efficiently? (:

They can deal with it more efficiently because its model is closer to their current mental model, and so requires less cognitive load to achieve initial results.

Gabriel talks about this under the label "worse is better". [1] I agree your preferred approach is better in the long term and at scale. But that only matters in the long term, and only if scale is eventually achieved. Tool adoption is generally a series of short-term decisions, and most projects start small.

I agree mentoring people is great, but neither you nor I have time to mentor all the people just configuring things into becoming good programmers.

[1] https://en.wikipedia.org/wiki/Worse_is_better

Interesting view, thanks. (:

Secretaries at Multics sites not only used Emacs in the 70s -- they customized it with Lisp. They were only ever told they were customizing their editor, not programming it, so they never got intimidated!

Haha, that's actually an amazing fact, lol.

Let's create a CONFIGURATOR language (and secretly hide Go underneath)!

Go is pretty dumbed down as languages go.

I take that as... yes? That it would be good for configuration, since it is simple? (:

The tool you link recommends "kubectl apply, ansible, puppet, chef, terraform" to actually apply the changes, at least 3 of those I'd classify as configuration management. Generating the configuration is only a small part of it, and the traditional tools typically have some way to do that too because they were designed to be used by non-/almost-non-coders too.

Sounds like you may be interested in Pulumi: https://www.pulumi.com/

Thank you! I've seen that, and I don't fully like it. I am not interested in deploying the configuration. I believe that generating configuration, versioning it, and baking it should be a totally separate process from deploying, rolling out, reverting, etc.

That's why IMO we should separate those. (:

We are writing Pulumi in Node (far more mature than their Go offering, though that has recently improved) and version our releases into private NPM packages. That isn't strictly necessary, as every update Pulumi pushes gets versioned internally along with associated code changes, and is always accessible in the stack's history.

If you’re a K8s dev, they recently announced the ability to output Helm3 files rather than deploy directly.

I am a k8s dev, but I also think Helm is an antipattern. Still, thanks for sharing your experience; who knows, maybe it's worth looking at Pulumi again. (:

I misspoke - it generates straight YAML, not Helm: https://www.pulumi.com/blog/kubernetes-yaml-generation/


Hashicorp tools are quite solid, and give you a lot for free. Ansible can automate host-level changes in places where hashicorp cannot reach. There shouldn't be many such places.

Alternatively, if you have the option of choosing the whole stack, Nix/NixOS and their deployment tools.

I would recommend staying away from large systems like k8s.

Here's what we're using which I'm pretty happy with:

0. Self-hosted Gitlab and Gitlab CI.

1. Chef. I'd hardly mention it because its use is so minimal, but we have it set up for our base images for the nitpicky stuff like connecting to LDAP/AD.

2. Terraform for setting up base resources (network, storage, allocating infrastructure VMs for Grafana).

3. Kubernetes. We use a bare minimum of manually maintained configuration files; basically only for the long-lived services hosted in cluster plus the resources they need (ie: databases + persistent volumes), ACL configuration.

4. Spinnaker for managing deployments into Kubernetes. It really simplifies a lot of the day-to-day headaches; we have it poll our Gitlab container repository and deploy automatically when new containers are available. Works tremendously well and is super responsive.

Nix (nixos, nixops) is worth looking into if you want a full solution and can dedicate the time and energy.

Also Morph, which is like NixOps, but stateless:


Morph is lovely because it ends up being a very thin layer over the existing Nix toolkit. All it does is deploy your NixOS config to a remote machine.

Cool, thanks for this

Yes, I love the immutability there!

We use Ansible with Packer to create immutable OS images for VMs.

Or Dockerfile/compose for container images.

Cloud resources are managed by Terraform/Terragrunt.

I think this is the ideal scenario for Ansible: one-time configuration of throwaway environments, basically as a more hygienic and structured alternative to shell scripts.

My experience trying to manage longer lived systems like robot computers over time with Ansible has been that it quickly becomes a nightmare as your playbook grows cruft to try to account for the various states the target may be coming from.

Could you say more about why ansible is better than shell scripts for one-time configuration? In my mind, ansible's big advantage over shell scripts is that it has good support for making incremental changes to the configuration of existing resources. In a situation like packer, where the configuration script only gets run once, I prefer the conciseness of a shell script.

I see the incremental piece as a dev-time bonus rather than something to try to leverage much in production— it lets you iterate more quickly against an already-there target, but that target is still basically pristine in that any accumulated state is well understood. But that's very much not the case if you're trying to do an Ansible-driven incremental change against a machine that was deployed weeks or months earlier.

Even in the run-once case, though, I think there's a benefit to Ansible's role-based approach to modularization. And again for the dev scenario, it's much easier to run only portions of a playbook than it is to run portions of a shell script.

And finally, the diagnostics and overall failure story are obviously way better for Ansible, too.

Now, all this said, I do still go back and forth. For example, literally right now in another window I'm working a small wrapper that prepares clean environments to build patched Ubuntu kernels in— and it's all just debootstrap, systemd-nspawn, and a bunch of shell script glue.

That's a very good point: I also found that the core feature of configuration management - idempotency - actually becomes mostly useless in this case, as Ansible applies the playbook only once.

I still use it as it allows more portability across OS releases and families (as in easier migration), but it also increases the complexity when creating a new task/role/playbook.

In that sense, Dockerfiles with shell-based RUN commands are much easier to manage.
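The idempotency being traded away here can be sketched in a few lines of Python (the task is hypothetical): applying the same desired state a second time must report no change, which is what makes repeated config-management runs safe and a run-once Dockerfile RUN indifferent to it.

```python
# An idempotent "task": ensure a given line exists in a config file's contents.
def ensure_line(lines: list, wanted: str):
    if wanted in lines:
        return lines, False          # already converged: report no change
    return lines + [wanted], True    # converged now: report a change
```

Run it twice against the same input and the second call is a no-op; a plain shell `echo >> file` appends a duplicate line on every run instead.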

Another advantage of config management over shell might be a better integration with the underlying cloud provider. For instance Ansible supports AWS SSM parameter store, which allows me to use dynamic definitions of some configuration data (RDS database endpoints, for instance) or secrets (no need for Ansible vault)

+1 for packer and ansible.

I haven't used either (yet) but Dhall or Cue lang should be on your list of candidates IMO.


(To me things like puppet or ansible seem like thin layers over shell and ssh, whereas both Dhall and Cue seem to innovate in ways that are more, uh, je ne sais quoi ;-) YMMV)

Just started using Cue, it is fantastic. It was built for this problem

You can never go wrong with bash. You should not put secrets in metadata, and you should not have IAM profiles with overreaching privileges. For any IAM profile you use (or the equivalent on Azure or GCP), always consider what somebody could do with it if they got access to it.

Straight-up Docker and docker-compose is another good idea, and Terraform and possibly HashiCorp Vault are high on the list too. Ansible, Chef, and Puppet are all pretty esoteric; I thought Chef was great until I just got good with bash and GNU parallel.

Salt, because it's declarative and runs on Linux, Windows, and macOS.

I have been using Ansible for over four years now, my current use case has around 1k VMs and a handful of baremetal in a couple of different datacenters running 100s of services.

No orchestration either, FWIW; we usually have Ansible configuring Docker to run and pulling the images...

As for the future I have been meaning to explore Terraform and some Orchestration platforms (Nomad).

I would go with Ansible for side projects/smaller tasks, and use Puppet at large.

Any reasons for those choices?

Ansible is just extremely easy to begin with, and comfortable to use since it is an agentless solution using SSH. As for Puppet, well, it largely depends on your team. Is it a devops one or a strictly dev one? Puppet seems to be the perfect balance for us (devops mostly, but devs can touch it with confidence too).

Shameless plug for a thing I maintain, which is in the config management space but a little bit different from the usual tools: https://github.com/sipb/config-package-dev#config-package-de...

config-package-dev is a tool for building site-specific Debian packages that override the config files in other Debian packages. It's useful when you have machines that are easy to reimage / you have some image-based infrastructure, but you do want to do local development too, since it integrates with the dpkg database properly and prevents upgraded distro packages from clobbering your config.

My current team uses it - and started using it before I joined the company (I didn't know we were using it when I joined, and they didn't know I was applying, I discovered this after starting on another team and eventually moved to this team). I take that as a sign that it's objectively useful and I'm not biased :) We also use some amount of CFEngine, and we're generally shifting towards config-package-dev for sitewide configuration / things that apply to a group of machines (e.g. "all developer VMs") and CFEngine or Ansible for machine-specific configuration. Our infrastructure is large but not quite FAANG-scale, and includes a mix of bare metal, private cloud and self-run Kubernetes, and public cloud.

I've previously used it for

- configuring Kerberos, AFS, email, LDAP, etc. for a university, both for university-run computer labs where we owned the machines and could reimage them easily and for personal machines that we didn't want to sysadmin and only wanted to install some defaults

- building an Ubuntu-based appliance where we shipped all updates to customers as image-based updates (a la CrOS or Bottlerocket) but we'd tinker with in-place changes and upgrades on our test machines to keep the edit/deploy/test cycle fast

Thanks for posting this. I’ve rolled my own version of this in the past and was very happy with the end results.

Ansible for dev boxes or smaller deployments. For large-scale deployments CFEngine3. When deployed within a cloud environment one doesn't even need a master node for CFE3 but the agents can just pull the latest config state from some object storage.

If you want massively parallel remote script execution, nothing beats GNU parallel or xargs + "ssh user@host bash < yourscript.sh".
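The same fan-out can be sketched in Python with only the standard library (host list, user, and script path are hypothetical):

```python
# Run one local script on many hosts concurrently over ssh, roughly like:
#   parallel 'ssh user@{} bash < yourscript.sh' ::: host1 host2 ...
import subprocess
from concurrent.futures import ThreadPoolExecutor

def ssh_command(host: str, user: str = "root") -> list:
    # Equivalent of: ssh root@host bash  (script fed on stdin)
    return ["ssh", f"{user}@{host}", "bash"]

def run_script(host: str, script: str = "yourscript.sh") -> int:
    with open(script) as f:
        return subprocess.run(ssh_command(host), stdin=f).returncode

def run_everywhere(hosts: list) -> dict:
    # Fan out over hosts with a bounded thread pool, GNU parallel style.
    with ThreadPoolExecutor(max_workers=32) as pool:
        return dict(zip(hosts, pool.map(run_script, hosts)))
```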

All of the configuration management tools (Ansible, Puppet, Chef, Salt, etc.) are bloated.

We already have a perfectly fine shell. Why do we need crappy, ugly DSLs or weird YAML?

These days, newbies write Ansible playbooks without even basic Unix shell and command knowledge. What the hell?

I like ssh + pure posix shell approach like

Show HN: Posixcube, a shell script automation framework alternative to Ansible https://news.ycombinator.com/item?id=13378852

I typically use terraform and ansible. tf creates/manages the infrastructure and then ansible completes any configuration.

This is the approach we take. We don't track states or do continuous config management either as we're all in on cattle > pets (and we don't typically have the time to maintain terraforms properly enough to do anything but cut new environments). Something gets sick? Shoot it and stand up another one.

Funnily enough, I wrote up my take on this not too long ago:


Don't be distracted by FAANG scale. It's not relevant to most software and is usually dictated by what they started using and then deployed lots of engineering time to make work.

My suggestion is to figure out how you will manage your database server and monitoring for it. If you can do that, almost everything else can fall into line as needed.

Why did you leave out DigitalOcean and Terraform?

Terraform isn't equivalent to Puppet, Chef, or Salt. It's a tool for specifying cloud deployments, not a configuration management system.

DigitalOcean might be fine. So is Arch Linux. But if someone just wants to get on with what they're interested in with a minimum of fuss over time, it wouldn't be the right recommendation.

I've prototyped ansible for rolling out ssl certs to a handful of unfortunately rather heterogeneous Linux boxes - and it worked pretty well for that.

I still think there's too much setup to get started - but I'm somewhat convinced Ansible does a better job than a bunch of bespoke shell would (partly because Ansible comes with some "primitives"/concepts such as "make sure this version of this file is in this location on that server", which is quick to get wrong across heterogeneous distributions).

We're moving towards managed kubernetes (for applications currently largely deployed with Docker and docker-compose on individual vms).

I do think the "make an appliance;run an appliance;replace the appliance" life cycle makes a lot of sense - I'm not sure if k8s does yet.

I think we could be quite happy on a docker swarm style setup - but apparently everything but k8s is being killed or at least left for dead by various upstream.

And k8s might be expensive to run in the cloud (a VM per pod?) - but it comes with abstractions we (everyone) need.

Trying to offload to SaaS that which makes sense as SaaS - primarily managed DB (we're trying out ElephantSQL) - and some file storage (PDF files hundreds of MB in size).

For bespoke servers we lean a bit on etckeeper in order to at least keep track of changes. If we were to invest in something beyond k8s (it's such a big hammer that one becomes a bit reluctant to put it down once picked up...) I'd probably look at GNU Guix.

Fabric https://www.fabfile.org/ (just one step above shell scripts, using Python). We're on 1.x, as the 2.x stuff is still missing things. The key is having a structure almost like Ansible, where you kind of have "playbooks" and "roles" (we had this structure going before Ansible)... probably have to move off of it soon, though.

Could you please explain why you think you'll have to move out of it soon?

Works for smaller teams and smaller # of hosts. I would say that it would start getting harder with 5-6 people and > 100 hosts. But for small stuff, it is the most awesome thing in the world. I had a structure I used a long time ago (https://github.com/chrisgo/fabric-example) but have broken it up now differently in the last 2 years (looks more like Ansible)

... and peer pressure (which is probably not a good reason)

I use OPS https://ops.city which uses the nanos unikernel https://github.com/nanovms/nanos and since I work on it would appreciate any suggestions/comments/etc. on how to make it better.

I'll tell you the one tool I DON'T use. Cloudformation. I've touched it a grand total of once and it burned me so hard I set a company policy to never use it again.

It's like terraform, except you can't review things for mistakes until it's already in the process of nuking something. Which is terrible when you're inheriting an environment.

Isn't that what changesets are for?

And a set of environments along the lines of, at least: Dev, Test, Preview, Production.

This is a correct answer.

I enjoy using mage (https://github.com/magefile/mage). I like having a full language at my disposal for configuring things rather than yaml or json or whatever else.

I like Chef for similar reasons. It's just Ruby code.

I operate a couple of Elixir apps and so far a simple Makefile with a couple of shell scripts has been enough. This simplicity is due to the fact that the only external dependency is a database server, everything else (language runtime, web server, caching, job scheduling, etc.) is baked in the Elixir release. One unfortunate annoyance though is that Elixir releases are not portable and can't be cross-compiled (e.g. building on latest Ubuntu and deploying to Debian stable won't work) so we have to build them in a container matching the target OS version. So to be really honest I should mention that Docker is also part of our deployment stack, although we don't run it on production hosts.

How do you handle multiple servers? Eg for fallback, vertical scaling, whatever

Easy, flexible, ansible but not super fast (ssh) Still pretty easy but very fast saltstack (zmq)

Terraform for everything 'outside' your runtime (VM, container), SaltStack for everything 'inside' (VMs and containers) and for appliances (where Terraform has no provider available) as well.

I think we've developed multiple layers in our infrastructure (Cloud Infra - AWS, GCP.., Paas - Kubernetes, ECS.., Service mesh - Istio, linkerd.., application containers..). So it depends on how many layers you have and how you want to manage a particular layer. Companies at `any` scale can get away with just using Google App Engine (Snap) or have 5+ layers in their infrastructure.

I find Jenkins X really interesting for my applications. It seems to solve a lot of issues related to CI/CD and automation in Kubernetes. However, it still lacks multi-cluster support.

Hey, there! I'm a product manager working on Jenkins X. We are at work right now on multi-cluster support, actually.

I'd love to talk to you about it in more detail and get you involved in the experiments around it - feel free to email me at ejones@cloudbees.com if you'd like to be involved or chat more about it.

I'm pretty happy using both Puppet and ansible. I use Puppet for configuring hosts and rolling out configuration changes (because immutable infrastructure isn't a thing you can just do; there's overhead and it does not fit all problems) and ansible for orchestrating actions such as upgrades. They work well together.

I very much dislike ansible's YAML-based language and would hate to use it for configuration management beyond tiny systems, but it's pretty decent as a replacement for clusterssh and custom scripts.

I'm using Puppet for everything, including nearly immutable infrastructure (if you can't mount your disks read-only and run that way, you don't have immutable infrastructure).

Puppet maintains the base image with the core system.

Special systems are recreated by applying system specific classes to a base image.

Application software is installed via packages with git commit-ids being versions.

Nothing is upgraded; rather, new instances are rolled out and the old instances are destroyed.

This also ensures that we always know that we can recreate our entire infrastructure because we do that for rapidly changing systems several times a day and for all systems at least monthly.

This makes our operational workflow match the disaster recovery one, which is a godsend.

Ansible Ansible Ansible for me!

I’ve tried Puppet and SaltStack, and I constantly find they are harder and more complex than Ansible. I can get something going in Ansible in short order.

Ansible really is my hammer.

We use terraform to describe cloud infrastructure, check all k8s configmaps and secrets into source control (using sops to securely store secrets in git).

Curious about why you're using SOPS [1] instead of say, hashicorp vault or AWS/GCP's integrated keystores or git-crypt, etc?

[1] https://github.com/mozilla/sops

I found vault a pain to setup and maintain. I think security solutions should be as simple as possible so people use them and understand them.

What I like about sops is that we still leverage the AWS keystore to store our master key, but we store the encrypted secrets in git. This is helpful as we have a history of rotations (great for rollbacks, audits, etc.). Additionally, sops doesn't encrypt YAML keys, so one can tell what a secret is but not its value. One also edits files using the sops utility, so the odds of accidentally committing plaintext secrets to source control are lower.
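That "keys stay readable, values don't" property can be sketched in a few lines (the `encrypt` callable is a stand-in for sops' real KMS-backed encryption, not its actual format):

```python
# Recursively encrypt only the values of a mapping, leaving keys readable,
# which is roughly the shape of what sops does to a YAML/JSON file.
def encrypt_values(doc: dict, encrypt=lambda v: f"ENC[{v}]") -> dict:
    return {
        k: encrypt_values(v, encrypt) if isinstance(v, dict) else encrypt(v)
        for k, v in doc.items()
    }
```

Anyone reading the committed file or a diff can see that a `db.password` exists and was rotated, but never its plaintext value.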

I won't talk much about FAANG scale, because that is hyper specialized.

A small startup shouldn't use any configuration management (assuming configuration management means software like Puppet, Chef, Salt, and Ansible). That is because small startups shouldn't be running anything on VMs (or bare metal). There are so many fully managed solutions out there. There is no reason to be running on VM, SSHing to servers, etc. App Engine, Heroku, GKE, Cloud Run, whatever.

Once you get to the point where you need to run VMs (or bare metal), there are many options. A lot of systems are going to a more image + container based solution. Think something like Container-Optimized OS[1] or Bottlerocket[2], where most of the file system is read-only, it is updated by swapping images (no package updates), and everything runs in containers.

If you are actually interested in config management, I'll give my opinions, and a bit of history. I've used all four of the current major config management systems (Puppet, Chef, Salt, and Ansible).

Puppet was the first of the bunch, it had its issues, but it was better than previous config managements systems. Twitter was one of the first big tech companies to use Puppet, and AFAIK they still do.

Chef was next; it was created by people who did Puppet consulting for a living. It follows a very similar model to Puppet and solves most of the problems with Puppet, while introducing some problems of its own (mainly complexity in getting started). In my opinion Chef is a clear win over Puppet, and I don't think there is a good reason to pick Puppet anymore. One of the biggest advantages is that the config language is an actual programming language (Ruby). All the other systems started with a language that was missing things like loops, and they have slowly grafted on programming language features. It is so much nicer to use an actual programming language. Facebook is a huge Chef user.

Salt was next. It was created by someone who wanted to run commands on a bunch of servers, and it grew into a configuration management system. The underlying architecture of Salt is very nice; it is basically a bunch of nodes communicating over a message bus. Salt has different "renderers"[3], which determine the language you write the config in, including ones that use a real programming language (Python). I'll come back to Salt in a minute.

Ansible... it is very popular. This is going to sound harsh, but I'm just going to say it. I think it is popular with people who don't know how to use configuration management systems. You know how the Flask framework started as an April Fool's joke[4]? The author created something with what he thought were obviously bad ideas, but people liked some of them. Ansible is so obviously bad, at its core, that I actually went and read the first dozen Git commits to see if there were any signs that it was an April Fool's joke.

There was a time a few years ago when Ansible's website said things like "agentless", "masterless", "fast", "secure", "just YAML". They are all a joke.

Ansible isn't agentless. It has a network agent that you have to install and configure (SSH). Yes, to do it correctly you have to actually configure SSH, a user, keys, etc. It also has a runtime agent that you have to install (Python). You have to install Python, and all the Python dependencies your Ansible code needs. Then it has the actual code of the agent, which it copies to the machine each time it runs, which is stupidly inefficient. It is actually easier to install and configure the agents of all the other config management systems than it is to properly install, configure, and secure Ansible's agent(s).

Masterless isn't a good thing, and a proper Ansible setup wouldn't be masterless. The way Ansible is designed is that developers run the Ansible code from their laptops. That means anyone making code changes needs to be able to SSH to every single server in production, with root permissions. And it also risks them running code that hasn't been committed to Git or approved. Any reasonable Ansible setup will have a server from which it runs: Tower, a CI system, etc.

Fast. Ha! I benchmarked it against Salt, wrote the same code in both, that managed the exact same things. Using local execution so Ansible wouldn't have an SSH disadvantage. Ansible was 9 times slower for a run with no changes (which is important because 99.9% of runs have no or few changes). It is even slower in real life. Why is it so slow? Well, SSH is part of it. SSH is wonderful, but it isn't a high performance RPC system. But an even bigger part of the slowness is the insane code execution. You'd think that when you use the `package` or `apt` modules to ensure a package is installed, that it would internally call some `package.installed` function/method. And that the arguments you pass are passed to the function. That is what all the other configuration management systems do. But not Ansible. No, it execs a script, passing the arguments as args to the script. That means every time you want to ensure a package is still installed (it is, you just want to make sure it is), Ansible execs a whole new Python VM to run the "function". It is incredibly inefficient.
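To get a feel for why exec-per-module hurts, here's a toy benchmark (my own sketch, not Ansible's or Salt's actual code): it compares calling a no-op "module function" in-process with spawning a fresh Python interpreter per call, which is roughly what module-as-script execution costs.

```python
import subprocess
import sys
import time

# In-process: call a "module function" directly, the way agent-based
# systems invoke each resource's idempotence check.
def package_installed(name):
    return name  # stand-in for a cheap "already installed?" check

start = time.perf_counter()
for _ in range(20):
    package_installed("nginx")
in_process = time.perf_counter() - start

# Out-of-process: spawn a fresh Python interpreter per "module call",
# roughly what copying and exec'ing a module script costs each time.
start = time.perf_counter()
for _ in range(20):
    subprocess.run([sys.executable, "-c", "pass"], check=True)
per_exec = time.perf_counter() - start

print(f"in-process: {in_process:.6f}s, new interpreter per call: {per_exec:.4f}s")
```

Even with an empty script body, interpreter startup dominates; the real gap is larger once each module also has to import its dependencies.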

Secure. Having a network that allows anyone to SSH to any machine in production and get root isn't the first step I'd take in making servers secure.

It isn't just YAML. It is a programming language that happens to sort of look like YAML. It has its own loop and variable syntax, in YAML. Then it has Jinja templating on top of that. "Just YAML" isn't a feature. To do config management correctly you need actual programming language features, so use an actual programming language.
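For example, a typical Ansible task (illustrative; `base_packages` is a made-up variable name) mixes YAML structure, a loop construct, a conditional, and Jinja expressions — that's a programming language wearing a YAML costume:

```yaml
- name: Ensure base packages are present
  apt:
    name: "{{ item }}"        # Jinja templating
    state: present
  loop: "{{ base_packages }}" # loop syntax, in YAML
  when: ansible_os_family == "Debian"  # conditional, as a string
```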

If I had to pick one again, I'd pick Salt. Specifically I'd use Salt with PyObjects[5] and PillarStack[6].
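For a flavor of what PyObjects states look like (a sketch based on the Salt docs; this is a Salt state file, so it needs a Salt master/minion to actually run):

```python
#!pyobjects
# Salt state written with the PyObjects renderer: plain Python,
# so loops, variables, and functions are just... Python.
with Pkg.installed("nginx"):
    Service.running("nginx", enable=True)

for user in ("alice", "bob"):  # hypothetical usernames
    User.present(user, shell="/bin/bash")
```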

But I'll reiterate, you shouldn't start with a config management system. Start with something fully managed. Once you need a config management system, take the time to do it correctly. Like it should be a six week project, not a thing you do in an hour. Chef and Salt will take more time to get started, but if setup correctly they will be much better than any Ansible setup. If you don't have the time or knowledge to do Chef or Salt correctly, then you don't have the time or knowledge to manage VMs correctly, so don't.

[1] https://cloud.google.com/container-optimized-os

[2] https://aws.amazon.com/bottlerocket/

[3] https://docs.saltstack.com/en/latest/ref/renderers/

[4] https://en.wikipedia.org/wiki/Flask_(web_framework)#History

[5] https://docs.saltstack.com/en/latest/ref/renderers/all/salt....

[6] https://docs.saltstack.com/en/master/ref/pillar/all/salt.pil...

> One of the biggest advantages is that the config language is an actual programming language (Ruby). All the other systems started with language that was missing things like loops, and they have slowly grafted on programming language features. It is so much nicer to use an actual programming language.

This is the most important thing I've learned from using all of the mentioned options, as well as older options like cfengine. Please make it easy to work in an actual programming language instead of an unholy mix of yaml and jinja! I think Ruby excels for this, as well as Clojure or a Scheme, because it's so easy to write "internal" DSLs.
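As a toy illustration of the internal-DSL idea (a hypothetical API, sketched in Python for brevity — Ruby blocks make this style even more natural): resources are declared by calling plain functions that append to a catalog, so loops and variables come from the host language instead of YAML-plus-Jinja.

```python
# Hypothetical mini-DSL: declarations build a catalog that an
# (imaginary) engine would later converge against the real system.
catalog = []

def package(name, state="installed"):
    catalog.append(("package", name, state))

def service(name, running=True):
    catalog.append(("service", name, running))

# Real language loops -- no special loop syntax needed.
for pkg in ["nginx", "postgresql", "redis"]:
    package(pkg)

service("nginx")
print(catalog)
```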

If you already know and/or use Ruby, use Chef.

It is silly to ask "what should be used at FAANG scale", because either you are working at a FAANG and you are using what they use, or you are very unlikely to ever be at that scale -- and somewhere along the journey to getting there, you will either find or write the system that you need.

It’s not a silly question if you want to learn. Just because you don’t need it doesn’t mean it isn’t worth learning about.

Some FAANGS I've heard about roll all their own config management tools.

Odd reply to my statement that FAANGs roll their own config management tools. Are you smoking a lot of weed during this quarantine? You can google LinkedIn's version of Docker, 'Locker', Facebook's Hydra, etc. A lot of these big companies roll their own tools because consumer tools don't scale well enough. LinkedIn built Kafka for just this reason as well.

Also, odd and a little creepy that you're stalking my comments but I hope you enjoyed my brain droppings. ;)

For anyone here who isn't yet using an end-to-end setup like Terraform, Ansible, Puppet, etc. and has more basic needs around managing environment variables and application properties, I highly recommend https://configrd.io.

I used to use Chef, but I really didn’t like it. For small projects now, I just use a set of shell scripts, where each installs and/or configures one thing. Pair it with a Phoenix server pattern. It has treated me very well the last two years

What about Nix?

puppet is pretty good in my experience

For most teams: Docker or Ansible all the things.

For teams that have a large IaaS footprint: Chef (agent-less actually adds complexity in this environment.)

Ansible where possible, Chef when I have to (for legacy reasons, usually), and Terraform/Docker/Packer when given the option.

I've been working with both Ansible and Puppet on a daily basis for the last 6 years.

Ansible for:

- I absolutely love and adore Ansible - it's extremely easy and much, much more pleasant to read. Sometimes Ansible feels like poetry to me.
- Ad-hoc sysadmin work - I do not mean the "ansible" command, but the actual style of work where you need to do something right here and right now.
- Prototyping, dev and staging environment setup, experiments.

Puppet for polished production. Puppet has a robust and stable ecosystem and infrastructure. It has been a client-server model from the beginning. It is easier to create and put into production a library of all your Puppet modules. It has Hiera for central config values and secrets management. At the same time, I hate Puppet's resource relations. Puppet's architecture feels like something developed in 1991: an ugly monolithic monster, and extremely heavy.

Terraform for actual low-level infrastructure management. And I don't like to put whole high-level host configuration into IaC! IaC has minimal host configuration capabilities: set the hostname, set the IP, register with Puppet or call Ansible - only a few lines in user-data or a bash script on boot, which then calls the actual configuration management!
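A minimal user-data sketch along those lines (hostnames and paths are made up; assumes a Debian-family image with the Puppet packages available — the point is that the bootstrap does almost nothing and hands off immediately):

```shell
#!/bin/bash
# Minimal bootstrap: set identity, install the agent, then hand off
# to real configuration management. Everything else lives in Puppet.
set -euo pipefail

hostnamectl set-hostname "web-01.internal.example.com"

apt-get update -y
apt-get install -y puppet-agent

# Point the agent at the master and let it take over from here.
/opt/puppetlabs/bin/puppet config set server puppet.internal.example.com --section main
/opt/puppetlabs/bin/puppet agent --test || true
```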

GitLab CI - switched from Jenkins. Concourse CI looks extremely interesting! Also reviewing some GitOps frameworks. Kubernetes: bare metal runs self-built, Puppet-based vanilla k8s; also kops and EKS for AWS. Applications in k8s are managed via Helm.

Compared to Puppet, Ansible is less enterprisey; it is more of a hipster tool. I would like to replace Puppet with Ansible. But maybe I need help from all of YOU who have voted for Ansible. How do you achieve Puppet's level of management with Ansible? How do you achieve a client-server setup with Ansible - somehow I don't see lots of people using ansible-pull (without using Tower!)? Do you create a cronjob with ansible-pull on node boot? :D Or is your whole Ansible usage limited to running ansible-playbook from your console manually? OK, maybe you sometimes put it in the last step of your CI/CD pipeline ;) Node classification and review? Central config value management for everything?

I use HashiCorp Vault and lots of other things too. Some questions are rhetorical. I've just expressed my mistrust in Ansible, which doesn't feel complete. :(

How do you manage a fleet of 1000, 500 or even 200 hosts with Ansible? When, after provisioning, you need to review your fleet, count groups, list groups, check states. Ah, you want to suggest Consul for that role? :)

Kubernetes for the win. It will replace config management diversity. It gives you node discovery, state review, and much much much more.

We're using Terraform for infrastructure and Ansible for deployments with great success.

Shameless self-plug: ChangeGear. We’re cheapest in-class for medium-sized companies.

At G scale you could never afford to run something as grossly wasteful as chef. It would be cheaper to have several full-time engineers maintaining a dedicated on-host config service daemon and associated tools, than it would be for some ruby script to cron itself every 15 minutes.

That's strange, because the closest company to Google scale is Facebook, and they actually use Chef in production[1][2][3] on hundreds of thousands of servers.

[1] https://www.chef.io/customers/facebook/

[2] https://engineering.fb.com/core-data/facebook-configuration-...

[3] https://github.com/facebook/chef-cookbooks/

100s of 1000s. Adorable.

Salt + Serverless.

I'm also really interested in what companies at scale are using. Anyone here from FAANG?

Facebook uses Chef to manage its base hosts [0] and its own container system [1] to manage most workloads.

[0] https://www.chef.io/customers/facebook/ [1] https://engineering.fb.com/data-center-engineering/tupperwar...

docker-compose + custom stuff + reduce all dependency on tooling

Here also docker-compose. Easy to separate tenants using same stack (nginx+django+postgres+minio).

Question though: how do you manage the possible rebooting-containers loop after a host reboot? I had to throw in more memory to prevent this, but it feels like an (expensive|unnecessary) workaround. Has anyone figured out how to let multiple containers start one after another (while not in one docker-compose.yaml)?

docker-compose is typically not recommended for production, and if you do end up using it - do check out the production guide - https://docs.docker.com/compose/production/
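On the startup-ordering question above: the Compose file format lets `depends_on` wait on a health check, which avoids the restart storm after a host reboot (service names and images here are illustrative; the long `depends_on` form needs a Compose version that supports `condition`):

```yaml
services:
  db:
    image: postgres:12
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      retries: 10
  app:
    image: example/django-app   # hypothetical image
    restart: unless-stopped
    depends_on:
      db:
        condition: service_healthy  # wait for the healthcheck, not just "started"
```

Across separate compose files this long form doesn't help, but `restart: unless-stopped` plus an entrypoint that polls its dependencies before starting achieves roughly the same effect.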

Kind of surprised there isn't really a consistent answer for this. Just skimming through these answers.
