Ansible Techniques I Wish I’d Known Earlier (zwischenzugs.com)
346 points by oedmarap on Aug 27, 2021 | 94 comments

Ansible is abysmal. I don't know why anyone still chooses it. It's a mess of yaml and what feels like a million yaml files that is always extremely hard to follow. Honestly writing some python, or bash is a lot easier to follow, read, and understand. The only thing it has going for it is the inventory system. I wish ansible would die already.

The tips mentioned in the article are helpful, but still beyond the initial development of roles, maintaining roles is a pain no one can alleviate. It is the perl of automation. I wish people would stop using it. I've yet to start a job where people haven't regretted going towards ansible.

The learning curve to Ansible is low, and it doesn't depend on any additional software running on the server.

I dislike yaml, but I like Ansible. I think it hits the sweet spot for small and medium size enterprises, who are looking for a middle point between manually provisioning servers and containerisation.

>bash and python

Neither of those is idempotent.

It seems likely you are coming at this from the perspective of "when I find an Ansible playbook on GitHub, it's hard to follow what's going on".

There is a certain truth to that, but much of its complexity is probably related to said playbook developers trying to abstract away specifics and create a playbook that works on a variety of distributions / configuration options.

Ansible can be as simple as a single file that proceeds sequentially like a bash script.

Most of the Ansible roles I come across written by my team are not idempotent either; it's a huge lie that Ansible is idempotent. It's idempotent if you put the effort into making it so, but instead I see tons of shell or command module invocations without prerequisite checks to see if the work should be done. Most devs I see using Ansible treat it like a shell script written in YAML, and for that purpose it sucks.
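To make the difference concrete, here's a sketch (the module names are standard Ansible built-ins; the package is just an example):

```yaml
# Shell-in-YAML style: runs apt-get on every play and always reports "changed"
- name: Install nginx (not idempotent)
  ansible.builtin.shell: apt-get install -y nginx

# Module style: checks the current state first and acts only when needed
- name: Install nginx (idempotent)
  ansible.builtin.apt:
    name: nginx
    state: present
```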

I can appreciate that I learned the term idempotency from working with Ansible, but I think they failed to really make it a feature of the language. It could have been a stronger default, explicit, like Unsafe in Rust. As it stands, writing idempotent Ansible takes as much discipline and intentionality as any other language, and can prevent integrating roles that others have written without such discipline. There is value in it, but not in respect to idempotency as any sort of inherent principle (at least outside of the std lib, which is of course pretty great about it).

This. The built-in tasks only get you so far. If you have anything custom you have to create the task for it yourself, with idempotency in mind, at which point you might as well have the exact same logic in a ./install.py file.

Even with the standard library of predefined tasks, most devs don't put in the effort to learn them, since "ifconfig | grep" and other commands are baked into their muscle memory; they would rather write a quick and dirty bash step than figure out the best-practice Ansible equivalent. In the end the result is a sequential shell script written in YAML with no guard rails keeping it idempotent, riddled with Jinja template substitutions that make it a nightmare to follow.

A truly declarative system would be designed more like makefiles, really enforcing idempotence. With steps in playbooks being executed in sequence, it's too easy to fall back to script-thinking and side effects. Sadly I'm not aware of any.

> >bash and python

> Neither of those is idempotent.

There is nothing magical about Ansible that makes it idempotent by default either.

The simple stuff like having specific versions of a package installed that you might be thinking of is just using Python under the hood. If you get into more complicated stuff, especially around custom infrastructure abstractions you use internally, guess what, you will have to write a bunch of Python and effectively call it from Ansible. And you will have to put in extra effort to make sure it's idempotent.

At which point you start thinking...why not just write a Python script instead? At least that gives you all of the flexibility.

> It seems likely you are coming at this from the perspective of "when I find an Ansible playbook on GitHub, it's hard to follow what's going on".

No, I've worked with ansible over 3 jobs, for the past 4 years. It hasn't been hard to convince colleagues to move away from Ansible once I show the competition. Sadly the migration process is rarely easy.

If you don't mind me asking, what's the competition? Do you mean you have a bespoke bash stack or you're using chef, capistrano, etc.?

I, too, am wondering what you recommend be used instead.

Yes, it can get hairy but I just have a hard time seeing an adhoc mess of python and bash doing the equivalent amount of lifting without becoming an even bigger hairball. And the thing is, you can also just run those same bash and python scripts using Ansible if you really want to, just to keep them orchestrated properly.

> you can also just run those same bash and python scripts using Ansible if you really want to, just to keep them orchestrated properly

Most of the time (i.e. if your Bash script generates a known file), Ansible also gives you idempotency for free, thanks to the 'creates' parameter.
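For example (the script and file paths here are made up):

```yaml
- name: Run one-time certificate setup
  ansible.builtin.shell: /opt/app/generate-certs.sh
  args:
    creates: /etc/app/certs/server.pem   # task is skipped if this file already exists
```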

Wait, Ansible is sold as being idempotent? Really?

What about it isn't idempotent for you? Are you falling into the anti-pattern of using shell/command all over the place instead of real modules?

We have playbooks installing and managing our databases, web servers, VoIP PBX, DNS servers, backend services, and on and on. All completely idempotent and safe to run at any time against any host. No special effort whatsoever was required to make that safe to do.

Perhaps I'm not using it for its intended use case, but I've run into many issues where built-ins simply don't do the job. I'm using it for setting up my local personal machines, including a reMarkable tablet.

If Ansible is a pain, what's your view on Chef? :)

I've found Ansible good for quick one-liners across multiple nodes :)

Having used Puppet, Chef, Ansible, custom shell and Python scripts, and Terraform provisioners, I'll say there aren't really any "good" automation tools.

They each have their pros and a long list of cons.

The good thing is that a lot of garbage that I used to do with those tools has been replaced with K8s. Some Terraform and Ansible still survive for self-managed VMs, but that's about it.

Many of the automation tools aren't great, but some are better than others. I'd take Terraform and Puppet or Chef over Ansible any day. I'd much rather have something that is declarative and easy to make idempotent over something that by design is procedural and much harder to make idempotent. Even bash is easier to make idempotent. Don't forget the yaml hell that Ansible loves to put you in. If there were a tool with a planning and apply phase for server management like Terraform, I'd take that (for pets only). Ansible's dry run barely works, since it requires additional effort to get working properly, often impossible without faking data. It's been a while since I've used Chef or Puppet and I can't remember if their dry run or the like is useful.

Having said that, in an ideal scenario you use auto scaling combined with cloud-init to avoid having to manage live servers. Cattle, not pets. That way you can provision images using Chef/Puppet, and then have them deployed automatically. It also gives you the opportunity to create self-healing architecture that requires less babysitting.

K8s certainly forces you to think closer to that architecture and that is good.

What is your take on cfengine? Just curious...

Do you have any suggestions for a good SDK or library for writing those automations in Python? The main reason I used Ansible was always its rich built-in tasks. I tried writing my automations in Python, but even very simple tasks needed too much effort compared to Ansible, so I had to go back.

I wouldn't. I said that to make the point that Ansible is bad. It is procedural and hard to follow. Bash and Python don't have any host management features (i.e. inventories), so you'd have to implement that yourself, which won't be easy.

I suggest looking into the alternatives, such as Chef, Puppet, SaltStack, etc. They allow for declarative configurations and make it a lot easier to make your configurations idempotent. On top of that, do the infrastructure itself (AWS, GCP, etc.) in Terraform (or Pulumi if you want it in a programming language). Terraform has its issues, but it is certainly a lot better at managing infrastructure than Ansible.

Ansible and Puppet are tools that do very different things. I mostly agree with you, but Ansible does work decently as an orchestrator.

You want to use tools like Puppet and Terraform to define the state of your systems, and Ansible to run operations on those systems, because not everything is stateless; trying to upgrade a database with Puppet or Terraform will be painful, but Ansible won't have trouble.

I do wish it had static typing (Puppet took 4 major releases to finally get it and its type system is its best feature over alternatives) and less YAML, but it is what it is...

Nowhere near as complete as Ansible or even in the same space, but helps to write remote SSH automations:


Saltstack. It is super easy to write your own python modules and everything else, such as a transport, module distribution, delivery from git and management, is already there for you.

Having used all of the configuration management tools available, and I do mean all of them, I have found Ansible to be the most flexible, the easiest to execute arbitrarily, and the easiest to understand and package alongside applications. I would love to know what you're using that's better than Ansible and doesn't rhyme with poohbernetes.

I wholeheartedly agree. I've used CFEngine, BladeLogic, Chef, Puppet, Ansible, and Salt in various infrastructure management tasks in the UNIX/Linux space. They're all shitty tools, don't get me wrong; they all have giant gaps in functionality, especially when it comes to physical storage management. However, Ansible is maybe the least shitty. The cleanest and easiest environment to manage, by far, was NIS/NIS+ with properly managed and groomed tables, and bash/ksh over automounter/NFS. I watched this work idempotently, and easily, across arch platforms and various versions of operating systems (some around since before I was born) without anyone on our small team breaking a sweat. Unfortunately, this architecture requires vigilance and a level of control that seems to have slipped away from the operations groups in recent years.

Github.com/Triplea-game: I wrote bash scripts to do deployments as a way to avoid Ansible. The bash scripts were not well received (who likes bash scripts?). They were converted to Ansible and became simpler. There was then an uptick of people engaging with the deployment code. IMO it's tough to do Ansible well and keep it simple. Restricting how it is used, and where to find what, goes a long way, I think.

> It's a mess of yaml and what feels like a million yaml files that is always extremely hard to follow.

Don't look at kubernetes then...

I somewhat agree. That's why I chose Bash when teaching people deployment. It's better to go back to basics; then, once you know how things work, you can choose (or write?) a higher-level tool you like.

I've been using Ansible for some years already. The biggest game changer for me was Ansible Molecule [0].

I integrated it into the repo where I store my code, and it just runs all the test cases for my code. It has saved me a bunch of hours of investigation in test environments before I even released the code. Highly recommend trying it, even for a small project.

Apart from that, play with Ansible strategies if you have more than one server to apply the role to. It really might save you some minutes of runtime.

[0]: https://molecule.readthedocs.io/en/latest/

This is really great advice. Chef was doing this years ago with Test Kitchen and InSpec (and Serverspec). Most community cookbooks came with extensive test suites, which in turn made it easy to encourage people to write tests for in-house cookbooks. There seems to be a real lack of testing in open-source Ansible playbooks a lot of the time, and likewise when I see it used in organizations.

Same, it saves me so much time knowing my code will work before I have to do a “full run” on a site.yml

After using Ansible for a while at a company I got massively turned off by the fact the version in apt-get doesn't work and if you go to ansible.com there isn't an obvious download button, there's only a button to register for a goddamn free event. There should be a big fucking red "DOWNLOAD" button at the center of ansible.com, and not a goddamn GDPR intrusion and free event. And then if I search Google for "ansible download" I land at a documentation website with 50 billion links.

No. F that. Back to shell scripts. Git clone and run "./install.sh". Most of the things I do are containerized anyway, and the heavy lifting is done by Dockerfile not Ansible, so the install.sh isn't actually that involved.

Hey Ansible if you're reading this, head over to https://google.com/chrome/. Make your front page look like that with a big obvious download button and no popups.

That's interesting, that you're talking about Docker and that most things in your environment are dockerized, but didn't check the option to run Ansible in a Docker container. There's a publicly available image on Docker Hub which works without issues.

PS. Just out of curiosity I tried "ansible install" in Google and DDG, and the very first link points to an installation guide [0]. The guide gives very complete installation instructions, with multiple options for various systems. You can literally choose whatever fits your environment best.

[0]: https://docs.ansible.com/ansible/latest/installation_guide/i...

> After using Ansible for a while at a company I got massively turned off by the fact the version in apt-get doesn't work

FWIW: Works just fine here on Debian (using it since Debian 8, now on Debian 11).

"The version in apt-get", uh-huh.

Do you feel they’re referring to anything else but a binary package repository belonging to a versioned (Debian-style) distro? It’s not helpful to be snarky.

It's a damn shame it took until this article for me (and, it appears, some others) to learn about the console and the debugger; those seem like massive time savers! My own biggest helper has been to use "ANSIBLE_STDOUT_CALLBACK=yaml" wherever "ansible-playbook" is called (https://jpmens.net/2021/03/12/alter-ansible-s-output-on-debu...). It makes the verbose output much more readable and is especially handy for tasks that can spew thousands of lines at once (e.g. "apt update/upgrade").
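For anyone wanting to try it ad hoc (the playbook name is just an example):

```shell
ANSIBLE_STDOUT_CALLBACK=yaml ansible-playbook -v site.yml
```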

Thanks, that setting helps a lot!

Whoever prefers configuration files over environment variables can get the same effect by adding to their ~/.ansible.cfg:

    [defaults]
    stdout_callback = yaml

One needs to look out for their config file being overridden [1] if another ansible.cfg exists. In my case my ~/.ansible.cfg got overridden by a local ./ansible.cfg (as per the order the config files are read [2]) in a project directory.

In my case I work around it by _merging_ all potential config files into one at runtime using `crudini`, as demonstrated by rsguhr [3].

[1] https://github.com/ansible/proposals/issues/35

[2] https://docs.ansible.com/ansible/2.4/intro_configuration.htm...

[3] https://gitlab.com/-/snippets/1851171

fwiw, I myself only found out about that trick some months ago...

Another one for the list is `--start-at-task` and `--list-tasks`, for tasks that are statically imported. If you have a task bomb out, you can pick up from where you left off. It doesn't work in all playbooks, e.g. if facts need to be set earlier, but it's still very useful and saves tons of time. Obvious, but I wish I'd known that one earlier too.
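A sketch of the workflow (the playbook and task names are made up):

```shell
# List the task names you can resume from
ansible-playbook site.yml --list-tasks

# Resume from the task that bombed out (statically imported tasks only)
ansible-playbook site.yml --start-at-task "Install packages"
```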

At work we use Puppet and I have to say that although the DSL is much more expressive and nicer to use than YAML, Puppet's tooling sucks in comparison. There isn't really any way to test your Puppet code that I know of, you just have to apply it and check if it all looks good. Puppet doesn't have a set of good practices established either. Different companies use completely different repository structures, compiler settings, linter settings. Modules for managing common software like Docker are low quality and riddled with issues.

Ansible and Puppet claim to be declarative but it's a lie. Configuration management really needs a rethink. I know Nix exists but that has a huge barrier to entry and companies don't want to use something that isn't battle tested. In my view, the ideal config management tool would be fast, testable, and provably idempotent. It would also be as far away as possible from the Python/Ruby swamp that Ansible and Puppet are stuck in.

Puppet is declarative, isn't it? You declare states and relationships, and it computes plan of action.

Ansible is not declarative at all though, and it's not even funny how almost every fan says it is. It's just a sequence of actions.

Sadly, no. Puppet masquerades as being declarative just like Ansible does. It's very easy to get into a situation where your Puppet class is just a fancy shell script which works the first time and then fails because a file doesn't exist or something. There is nothing to lint for this or to protect against this.

I know Puppet is a mess. What you say is that you spill lot of exec's into a class? You can avoid that, can't you? If you just declare "resources" and connect them via relationships, then it seems pretty "declarative" to me, i.e. there's no sequence of actions in Puppet DSL and when you run puppet it only does what's necessary. While I find it truly "declarative", I still think Puppet and its ecosystem just suck, for lot of other reasons.

I don't think Ansible even tries to hide the fact that it's just an imperative sequence of actions. I wonder where the people saying the opposite are coming from. It's just an imperative language.

There are so many cases where a hacky Exec or five becomes necessary. For example, I am not aware of any Puppet module that does disk partitioning + formatting. Our solution was to write a class with a few Execs to do that. But there are so many edge cases where this can go wrong if it's not truly idempotent. An accidental disk wipe across our fleet would be a disaster, so what did we do? We basically wrapped the class in a big if guard that checks whether the disk is already partitioned. That's not idempotent at all, of course, or declarative - it's just a script.

I can think of another horrifying example. We extract archives onto the machine into a specific target directory, and all files in that directory should have a specific owner and group. So we have a File resource that states the target directory exists, ordered before the Archive resource. But we need to recursively chown after extraction too, because tarballs preserve the original owner and group on extraction. We can't add a File resource with the same path that states the owner and group after extraction, because duplicate resources aren't allowed. The only solution in this case is a hacky Exec. And that isn't declarative either, because you'd have to check the owner/group of all files recursively to know whether it's in the right state or not. We just have this Exec run if the Archive resource refreshes. What if something else adds a file to this directory with the wrong owner/group? It wouldn't be corrected. Again, not declarative.

And god help you if a machine in the fleet runs out of disk space and resources only partially apply. The state will be completely fucked up and you'll have to manually nuke directories to correct it. Because Puppet only pretends to be declarative.

/rant :(

At a previous job we used beaker tests as part of our pipeline to test puppet modules & environments before we would merge to prod. One of the things this tests is that the module applies and doesn't try to change anything / error out on subsequent runs.

The documentation sucks though, IMO; every guide you find will show a completely different way of doing it.

The `--step` flag will definitely be useful for me, didn't know about that.

My one tip I'll add is to use the `--diff` argument when running a playbook to print out the specific changes made for each step. For debugging, use `--diff --check` to see all the changes that will be made - useful if it's been a while since you last ran your playbook over a host.
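In command form (playbook name made up):

```shell
# Preview: show what would change, and the diffs, without touching anything
ansible-playbook site.yml --check --diff

# Real run, still printing a diff for each change
ansible-playbook site.yml --diff
```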

The Ansible --diff and --check options were the main reason I started using it in 2013; the other reason was that it's agentless, so I didn't need a special VM to run it on and I could control when it might change things.

Yes, usually working with "--diff --check" flags as well, and did not know about "--step" either. Will incorporate that into my workflow for sure.

I have used Ansible for some years. I recently switched jobs to a company with no automation on VMs, and since we are migrating our workload to managed services in the cloud (managed Kubernetes), I have to say I'm relieved I don't have to deal with Ansible again. It's probably very efficient in many ways, but just copying a file/directory with the correct options is a bit of a pain in Ansible. Not having to worry about VMs anymore is a huge win for me at least. /OffTopicEnd

> just copying a file/directory with the correct options is a bit of a pain in Ansible.

How so? Do you mean you had trouble remembering what arguments to pass to the copy module? (I'm not sure what other options there are to be a pain)

I'm not the one you're replying to, but one problem I've had with Ansible directory copying is that it's terribly slow.

Using the "state" parameter for files and directories is also rather odd. You specify "state: absent" to delete, "state: directory" to create a directory. It makes sense considering the declarative nature of Ansible, but it's unfamiliar to those used to writing shell scripts.

Yes, ansible in general is painfully slow; I guess I assumed "painful" meant hard to write

Over the years I've tweeted Ansible tips and Ansible gotchas, for my own memory and others'.

https://twitter.com/search?q=from%3Amoreati%20ansible%20tip&... https://twitter.com/search?q=from%3Amoreati%20ansible%20gotc...

Gotta try that console and debugger, those weren't even on my radar!

These are both great. I still have an issue with console for things requiring a sudo access, for some reason, but it is still very handy!

Another trick I use routinely is this, to display a given variable across all hosts:

    ANSIBLE_STDOUT_CALLBACK=json poetry run ansible all -m debug -a var=vault_ssl_cert | jq ".plays[0].tasks[0].hosts"

Nice, I might steal that for part II...

Unfortunately, console appears to be effectively ignored by the developers and it's pretty obvious when you're in it. There's just a lot of things that you can't do. It's a shame, too, because ansible is particularly frustrating to write for me coming from chef where everything is written in pure ruby and the REPL is basically just IRB with some extra stuff. Having an interactive environment to see how the data I'm handling is being transformed would be helpful, but ansible-console just doesn't do what I need most of the time.

Me neither, but they are certainly something I can use.

I still lack something that will tell me where a variable is sourced from.

I wish I had read this a couple of weeks ago. I am new to Ansible but needed to use it to migrate a legacy system. I’m not really accustomed to long build times and kept running out of things to do in the interim while I was waiting for it to break.

Maybe a stupid redundant slightly offtopic question, but I'll ask anyways:

What is the best way to couple Ansible with some git repository to automatically apply changes while still being able to trace and debug?

Tower? Open sourced by Red Hat as AWX, it is a central Ansible hub for exactly that purpose.

Tower/AWX are bad products and they should feel bad. It's actually easier to run Ansible via any CI system than it is to use Tower. I get what the devs are trying to do, but Tower wants to own so much of the process that it's basically not Ansible anymore. You pretty much have to write your code specifically for Tower (which means a total rewrite in most shops, because who starts with Tower?).

That type of criticism is a bit broad. While it has all the hallmarks of a commercial product trying to fit every type of customer, on the other hand the whole point is having a centralized architecture for Ansible. This is something desirable as soon as you outgrow using it for ad-hoc jobs, and there is a payoff compared to just scripting in bash or Dockerfiles. Of course the code has to be adapted for it, but it's not anything like a complete rewrite.

Ansible doesn’t have the right model for painless auto application, because it’s only idempotent in some cases and not others (e.g. removing config).

It’s a good tool to set up a server from scratch to a desired state, but not to go between two versions of desired state. So a git-based deployment workflow is doomed for failure as soon as you start actually using the history (or removing config).

We've been using the git module to pull down the desired version of the repo and then using synchronize to copy it over. That could be used to overwrite whatever was previously being copied from the git repo. So, can you explain where that doesn't work for a git based deployment? It's a pattern we use a lot. (A slight tangent though, systemd-tmpfiles can ruin your day if it partially deletes the repo, because that makes the git module crap out.)

If I understood your use case correctly, I think you’re deploying an application with it? I meant more about server configuration things with Ansible configuration which is version controlled and auto-applied on push.

In that case, say you are using Ansible to set up a daemon A on a server. Now you want to remove the setup for daemon A and move to daemon B. Just changing the configuration is not enough (since nothing stops daemon A and cleans up its configuration). And after you're in the new state (let's say daemon B was just an experiment), if you decide to revert the commit that added daemon B and push, nothing will stop daemon B.

I'm still not seeing the difficulty with doing that. For basically any daemon that you set up using distro packages or systemd service units (which, even though I'm not strictly in love with systemd, is what you should do for any service daemons you set up), it's just a matter of telling the service/systemd module, and probably the copy or template module that you used to create the config files, that they should be "enabled: no" or "state: absent" respectively. That will disable and remove the config of whatever service you've set up.

What I will also do sometimes is create an array variable of services to enable and services to disable, and then all that's required to switch is moving a daemon's name from one list to the other.

So, unless you're doing something outside of the modules, like copying a bunch of stuff with the shell module and launching these putative daemons without a supervisor like systemd, it's pretty trivial to tell Ansible to reverse whatever it did. And if you are just launching processes from a bash script or something: why? It can get more complicated if it's e.g. a dev box and people are using ssh to go in and tinker directly, but generally Ansible will still just remove/create whatever needs to be there if it needs to be changed, or do nothing. Depending on how robust you want to make Ansible against people doing unexpected things, you can use trap doors like "meta: end_host" to bail out on errors if people are causing problems with manual edits, but that's better solved as a people/process problem than by trying to make Ansible handle all possible unexpected situations.
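A rough sketch of that list-driven pattern (service names and variable names made up; `ansible.builtin.systemd` is the real module):

```yaml
# group_vars/all.yml (hypothetical)
services_enabled:
  - nginx
  - postgresql
services_disabled:
  - apache2

# tasks: switching a daemon is just moving its name between the two lists
- name: Start and enable wanted services
  ansible.builtin.systemd:
    name: "{{ item }}"
    state: started
    enabled: true
  loop: "{{ services_enabled }}"

- name: Stop and disable unwanted services
  ansible.builtin.systemd:
    name: "{{ item }}"
    state: stopped
    enabled: false
  loop: "{{ services_disabled }}"
```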

In the end it's also usually just easier to use Ansible to bake a new VM or container configuration and deploy that instead of mutating an existing one over and over again.

If you move daemons from enabled to disabled, then you’ve taken an extra step. The difficulty is not in remembering to do it once, and daemons are a simple example. The difficulty is in always remembering, especially for trickier changes. But sure, if everyone always remembers, then it’s fine :)

I don’t think reversing what you just did is as trivial as it should be. For a git-based workflow to work well, it should be as simple as reverting a commit.

Setting up a VM from scratch is where Ansible is great. A CI/CD that recreates the VM and applies the config would work really well with automatic git deployment, but that’s generally not how people use it.

People have been chanting "cattle, not pets" since before Ansible existed, so if people cargo-cult it without using it in the way it's useful, no tool out there is really going to protect them from a lack of due diligence. We don't judge a kitchen knife by the people who try to grip it by the blade. Dependencies outside of an app's git repo will always exist for anything beyond the most trivial application.

Tooling can assist process and provide guard rails and reminders, but it's never going to replace remembering some important things. Layers of process that ensure a single person forgetting won't cause disaster are necessary no matter what. Maybe when general AI exists we won't have to worry about remembering things, but then at best most of us will be out of jobs, and at worst we'll be running from kill bots.

Don't get me wrong, there are plenty of design decisions in Ansible that regularly annoy me, but I've never expected it to make me less forgetful or substitute for people cooperating and communicating, and I think expecting any tool to do that is a road to disappointment.

This definitely isn't the best way because it's far from painless, but we use GitHub Actions to run some Ansible playbooks. When there's a problem that needs debugging we add a tmate[0] session with mxschmitt's action[1] so we can step through the playbook, etc.

[0]: https://tmate.io/

[1]: https://github.com/mxschmitt/action-tmate

Or Jenkins pipeline steps, which might include setting up the pipenv, with the setup of Ansible in your git as a requirements file. So it starts with Groovy and switches to Ansible, etc.

Build/Buy a real CI/CD system for this TBH. Once you hit a certain level of complexity, this sort of deployment strategy will mostly just amplify pain.

For small deploys you could set up a cron job that does `git pull && ansible-playbook`, and set up the cron to email failures. This can all be set up by the Ansible playbook itself and bootstrapped by running over ssh instead of locally.

For more than a few servers, other answers are probably better

> set a cron that does `git pull && ansible-playbook`

AKA ansible-pull, which can greatly simplify the auth story since the machine only needs outbound access and auth to the repo (if required): https://docs.ansible.com/ansible/2.10/cli/ansible-pull.html
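A minimal sketch (the repo URL and schedule are made up; local.yml is ansible-pull's default playbook name):

```shell
# Crontab entry: every 30 minutes, pull the repo and apply local.yml against
# the machine itself; failures land in the log (or cron's MAILTO).
*/30 * * * * ansible-pull -U https://git.example.com/infra.git local.yml >> /var/log/ansible-pull.log 2>&1
```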

Didn't realize that was built into Ansible. Depending on how ephemeral your hosts are, you could scale this to quite a few hosts and use host-reported health checks to determine whether things are working correctly, e.g. HTTP POST a "this went ok" to some web server or metrics/monitoring service.

ansible-console is a treasure! I wish I saw that article years ago...
