
Ansible vs. Chef (2015) - fanf2
http://tjheeta.github.io/2015/04/15/ansible-vs-chef/
======
mdeeks
Ansible has scaled really poorly to thousands of hosts for us. Things we have
run into:

\- Running a job against a single host will finish in 3 minutes... running
that exact same job against thousands will take well over an hour and max out
your machine.

\- Running against more than around 3k hosts will somehow consume all 60GB of
RAM and trigger the oom-killer

\- CPU usage on the ansible runner is absurd for a large number of hosts.
We're currently using a c4.8xlarge (our biggest box) just to run deploy jobs
and have them finish in a reasonable amount of time (10-15 minutes).

Slicing up our inventory into chunks and running them on different servers
sucks big time and is pretty hacky. How do I combine the results? Can't do
orchestration like "Run X on these roles first, then run Y on these roles when
you're done".

Most likely what I'm going to do is have a single server execute ansible doing
only the following in async (aka CPU friendly) mode:

\- Upload a current copy of ansible to S3

\- Upload the configs to the target machines with ONLY the secrets that role
needs in plain text. (I'm not putting my vault secret on every box!)

\- Have the servers pull it down and execute in --connection=local mode.

\- Wait until each remote finishes
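The flow above can be sketched end-to-end with local stand-ins (temp dirs play the part of the S3 bucket and the target host; all names here are hypothetical, not mdeeks's actual setup):

```shell
#!/bin/sh
# Local sketch of the pull-based deploy flow described above.
set -e
STORE=$(mktemp -d)     # stands in for the S3 bucket
REMOTE=$(mktemp -d)    # stands in for a target machine
BUNDLE=$(mktemp -d)    # playbook plus only that role's secrets

printf -- '- hosts: localhost\n  tasks: []\n' > "$BUNDLE/site.yml"

# "Upload a current copy" to the store
tar -C "$BUNDLE" -czf "$STORE/bundle.tgz" .

# On each remote: pull it down and unpack; a real host would then run
#   ansible-playbook -i localhost, --connection=local site.yml
tar -C "$REMOTE" -xzf "$STORE/bundle.tgz"
grep 'hosts:' "$REMOTE/site.yml"
```

The controller then only polls for completion instead of holding thousands of SSH sessions open.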

All that said, I LOVE LOVE writing stuff in Ansible. It is so easy to read,
follow, and understand. I picked up most of it in a day or two just by reading
their "Best Practices" page. Getting it to work at scale hurts though :(

~~~
pyre
> It is so easy to read, follow, and understand

I never liked that the variable namespace is global, so there isn't any way
for a module to be self-contained. If you execute Module 1 and then Module 2,
Module 1 can set a variable that inadvertently affects Module 2. The
"recommended" way around this is to prefix all variable names, but this
becomes unwieldy very quickly as your variable names grow in size. The
"nicer" way to do it would be to have a dict/hash of variables, but that
makes top-level overrides difficult: there is no way to override
"hash_name.variable_name", you basically have to override the whole
"hash_name" variable or nothing at all.

I found it difficult sometimes to reason about how these variables would work,
and if I needed to add something to defaults.yml or variables.yml in a module.
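A minimal sketch of the collision and the two workarounds (role and variable names hypothetical):

```yaml
# roles/nginx/defaults/main.yml
port: 8080              # generic name: any other role's "port" collides

# the recommended prefix convention avoids the clash, but gets verbose:
nginx_port: 8080
nginx_worker_processes: 4

# the "nicer" dict style groups things...
nginx:
  port: 8080
  worker_processes: 4
# ...but overriding nginx.port from the inventory replaces the whole
# "nginx" dict (under the default hash_behaviour), not just the one key
```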

~~~
LukeHoersten
Agreed. In fact variables are scoped by "play" not by "role" which is almost
weirder. For those who don't know the organizational structure from largest to
smallest is playbook->plays->roles->modules/tasks. Variable scope is at the
"play" level.

More than that, roles are meant to be distributable components on Ansible
Galaxy, the service they run. Galaxy gets almost no use because modularity
and reusability are broken: you have no idea how a role's author namespaced
their variables, and collisions happen all the time. Why we must manually
manage scope with naming conventions when the computer could do it
automatically with real scoping is beyond me.

This is Ansible's biggest downside in my opinion. I've talked with the core
devs about it on IRC and they (bcoca) agreed but thought it too late to make
such a pervasive change as introducing role-level scope.

As some of the other posts have mentioned, I still love Ansible despite this
shortcoming.

~~~
andrewvc
Can't agree enough here. The core devs act like it isn't a problem. The lack
of encapsulation makes reusability impossible.

~~~
krakensden
I remember being really down on Ansible Galaxy, and whenever someone tried to
use a community role, I'd pull it down, audit it, and make them fork it,
because it was inevitably dangerous, not thought out, and with no tests. Now
I'm back in a Chef shop, and Chef has a ton of tooling for re-usability, has
put loads of thought and effort into the problem, and there are tons of
cookbooks, many maintained by Chef, Inc. The problem isn't really any better
though- the official and officially blessed cookbooks are _still_ terrible,
broken and unsafe in all sorts of obvious ways. I'd rant more, but I'd have to
get specific and mean about actual people.

It's the problem space, honestly. It all depends on the guardrails your
workflow provides, and there's never enough in common.

Anyway, there's a reason everyone loves golden-images.

~~~
LukeHoersten
Good point. The Ansible community who happened to be talking to me on IRC
about this basically said that at least at this point, you have to look at all
your role code no matter what anyway. Meaning scope wasn't the core issue. But
lack of variable scope even hurts my own Ansible config abstractions.

------
_qc3o
If you're using yaml and templates then you've already lost. The only tool in
this game that is not braindead is chef. Sometimes you need imperative things
and conditional logic with iteration. If you don't have a real programming
language then the contortions you have to go through get really old really
fast.

As for the deployment patterns. If you're in the cloud then you should be
baking AMIs (or equivalent in your cloud provider) and shipping your
configuration the same way you ship your application code, as native packages
like .deb or .rpm. If you jumped on the docker bandwagon then your hosts are
basically there to look pretty and host the containers which means you have
some other way of getting configuration to your servers, i.e. etcd, consul,
etc. so the problems brought up in this post don't exist in that setting. You
are also probably using some kind of container orchestration system like
kubernetes so again the problem of orchestration and deployment is offloaded
to some other system. The only problem you have in that setting is doing a
rolling deploy of containers and halting when things go wrong.

I think the only place any of these tools make sense now is some private
on-premise cloud. Every other place has already moved on.

~~~
smw
Yes! Please, for the love of all that is holy, please quit writing tools that
make me write code in something that isn't a general purpose programming
language! Didn't we learn from Ant and xml?

I really want to see a clojure/clojurescript based config management system --
it would be so pleasant to write EDN/sexps for basic config, and yet have it
be a real language when you need to do something hard.

Edit:

Forgot to mention, pallet [1] is something like that, but unfortunately it
appears to be mostly dead.

[1]: [http://palletops.com/](http://palletops.com/)

~~~
crdoconnor
> Yes! Please, for the love of all that is holy, please quit writing tools
> that make me write code in something that isn't a general purpose
> programming language! Didn't we learn from Ant and xml?

Yes, we learned to use _less_ powerful languages where it was appropriate
because they're more readable and less susceptible to technical debt.

This principle, in other words :
[https://en.wikipedia.org/wiki/Rule_of_least_power](https://en.wikipedia.org/wiki/Rule_of_least_power)

Ant XML was _as powerful as Java_ - it was turing complete and terribly
designed to boot. That was its primary failing.

Likewise, using turing complete PHP to generate HTML was never as clean as
using a less powerful templating language (like jinja2) to generate the HTML.
Separation of concerns with a language barrier is a good thing.

If all of this means nothing to you, you've probably created some huge code
messes in your time.

~~~
Singletoned
> Likewise, using turing complete PHP to generate HTML was never as clean as
> using a less powerful templating language (like jinja2) to generate the
> HTML.

I strongly agree with your point, but Jinja2 is Turing Complete as well (it's
still preferable to PHP though).

~~~
crdoconnor
I don't really know enough computer science to validate this idea, but I can
sense that there are different levels of "power" among turing complete
languages (and also among non-turing complete languages). And Jinja2 <
Python/PHP, despite all three being turing complete.

Metaprogramming / C++ style templating, for instance, goes above and beyond
the power provided by regular turing complete programming constructs, and
while that means that you can do cool stuff with them you couldn't easily do
otherwise, they're a massive headache to reason about, debug, and keep free
from technical debt.

Similarly, when you take blocks of code and "lower the power" to make it
declarative instead of imperative (e.g. using list comprehensions instead of
for loops) it almost inevitably ends up cleaner.
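A toy Python illustration of that power step-down:

```python
# Imperative: the loop may do anything at each step, so a reader has to
# trace it to confirm it only filters and squares.
squares = []
for n in range(10):
    if n % 2 == 0:
        squares.append(n * n)

# Declarative: the comprehension's shape promises "build one list, no
# other side effects" -- less power, easier to audit.
squares_decl = [n * n for n in range(10) if n % 2 == 0]

print(squares == squares_decl)  # → True
```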

------
danielvf
I have done several small deployments (under two dozen servers) with Chef,
Puppet, and Ansible. I've also evaluated Salt and worked with other people's
Bconfig systems. Ansible is the best for my situation, hands down. I no longer
have any of the others in use.

Ansible is easy to reason about - it's never surprised me once in use. You
have about an order of magnitude less to learn when compared with chef or
bconfig.

Also, for setups with small target VMs, it's incredibly handy to not have to
install a bunch of stuff on each server and make sure it doesn't conflict
with anything else.

But mostly it's that Ansible can be understood enough without devoting a
couple weeks of your life to it. And you can come back later and understand
what you have written.

------
falcolas
> it works reasonably well on the scale of thousands of hosts

I could see this if you're working from one really powerful machine... no,
that won't work, it's constrained by SSH, not hardware specs.

I could see this if you're calling Ansible on another host... no, then you
have to copy everything out to the sub hosts, who have to copy everything out
to their sub hosts... A scalability nightmare.

You can use redis as a distributed store of truth and... wait, what? Now
there's a blog post worth reading. Show us how to scale Ansible with real-
world examples using redis and autoscaling groups. Please.

Got it. I can see this working reasonably well if you're willing to wait 10
hours for a deploy to complete. Personally, I'm not.

> If you want to have 1000 forks, that will cost about 30 GB of memory

Ansible is not, nor has it ever been, limited by available memory. It's
limited by the number of concurrent SSH sessions it can handle while copying
every module to be executed out to each host.

There's plenty of reasons and ways to use Ansible for deploying code. Some of
the post has accurate and reasonable information, but the scaling portion is
pure fantasy right now.

~~~
dozzie
I had around a thousand hosts and one small virtual machine, and my SSH call
to each and every one (running something trivial, like `uptime') took less
than a minute _in total_. Though this was a custom script that used Erlang's
built-in SSH client, and it had some unpleasant trouble when hitting the
limit on file descriptors (1024, bumped later to 4096 just because of that).

~~~
falcolas
It's less about the SSH overhead than about copying Python scripts to the
target hosts and then executing them, one at a time.

~~~
drakenot
I thought Ansible had different execution strategies and you could make the
hosts not wait for each other if you wanted. Wouldn't that speed things up a
good deal?

------
dmourati
If someone will rewrite this article as Ansible vs. Puppet I will not only buy
you a beer but I will throw a parade in your honor.

~~~
movedx
Puppet has pretty much all of the same problems.

Deploying Ansible:

\- SSH

\- SSH keys

\- Git

\- CI

\- ... fin?

Deploying Puppet:

\- Install the agent on every system (using Ansible, ironically)

\- Install a set of masters (need HA)

\- Install a message bus for mcollective (if this is still even a thing?)

\- Upgrade the agent on ALL of your hosts as and when there's an update or a
security patch

\- Have an agent/process running on ALL of your hosts with root access to the
box, able to execute any command a hacker manages to inject

\- ... oh, you've installed Ansible? I'll stop now then.

~~~
quicksilver03
The only way I deploy Puppet today is in masterless mode, having the manifests
on a private BitBucket repository. You still have an agent to deploy, but at
least it doesn't run all the time, and there's no master and no mcollective to
complicate things.

~~~
brazzledazzle
Despite some warts I like Puppet a lot because it's simple while still
providing some flexibility to use Ruby. I prefer Ruby over Python by a huge
margin and while that would seem to make Chef ideal, at the end of the day
Puppet is more approachable for people with zero to minimal experience
developing.

------
heavenlyhash
This might be off-topic (or it might be a breath of fresh air if you're tired
of configuration managers) --

I've been toying with the idea of making a trolling-but-no-really deployment
framework called tarpipe. All it does is take some files on your host, get
'em to a $place in one step, and run hook.sh. Oooooptionally, it does some
dir moves and symlinks to keep a prior state backed up, and service
stop|start on either side of the mv/ln, to minimize downtime.

Usage could be `tarpipe ssh user@host` or `tarpipe <(echo "cd keke && bash
-c")` just as easily.

It goes without saying that this simply wouldn't be comparable to Ansible,
Chef, or other CM because it's _too simple_. It doesn't help manage state if
it escapes $cwd. But if your application can curb its enthusiasm to a
directory... boy is it simple if that's true.

I already do this on the daily to crashland my bashrc and dotfiles on any new
remote host. Maybe this kind of explicitly zero-dep deploy would be useful for
more situations.

Would anyone have a use for that?

\---

EDIT: What the heck: I prototyped it:
[https://gist.github.com/heavenlyhash/b575092aa84ce9f3e1d2](https://gist.github.com/heavenlyhash/b575092aa84ce9f3e1d2)

~~~
mattzito
I'm sure it'll work for some people, but configuration management is one of
those things where the devil isn't IN the details, the devil IS the details.

What happens if you have big binary files to move around? How do you ensure
atomicity? What happens when a run dies midway through? Can you make sure
operations are idempotent?

That's even before we get into variables, state, coordination/orchestration,
and so on.

I spent a lot of time (a troubling amount, really) in the automation space,
and while it's true that simple problems are easy to solve, simple problems
quickly become hard problems. Then you're building state machines, binary
distribution systems, and now all of a sudden you have an enterprise
workflow/configuration management system.

I swear, half the products in this space started out exactly as yours, and the
author just kept finding edge cases to fix until someone gave them $10m in
series A funding and it was suddenly a business.

~~~
heavenlyhash
Absolutely. But there's a really wildly underserved niche where I have
_fewer_ computers than Netflix and I really just wanna push, not have a push
framework engine manager tower of turtles.

Sometimes I need kubernetes and orchestration and teleporting state snapshots
and oh my.

Sometimes I just need three files on the dang VPS box.

(I see your point, of course. I wouldn't want to build orchestration in bash.
Rather, I'd really like to see at least one tool that looks at all this, and
goes "huh. We're gonna KISS. And no, it's never going to orchestrate 200
machines. That's okay.")

~~~
empthought
That's Ansible as far as I can tell. Roles, group_vars, dynamic inventory,
etc. are all optional; pushing a few files to some VPSes is just a few lines
in a YAML file.

~~~
heavenlyhash
And a sprawling python dependency on each side?

You can play the "python's everywhere" card now if you like, but I'll look at
you funny.

How many megs, exactly, is the smallest thing you can have ansible push to an
empty slave server with no prior contact?

~~~
empthought
I thought you were talking about simplicity/ease of use, not network bandwidth
efficiency.

If you're counting megabytes in a land where low-end mobile phones have 16
gigabytes of storage and 100Mbps wireless, then you probably have a boutique
enough use case that you _should_ roll your own.

------
LukeHoersten
Some things I really like about Ansible are:

\- Super simple declarative yaml configs

\- Agentless. You needed to have SSH working anyway so Ansible just uses that.
With ssh pipelining it's so fast.

\- The community support is huge and extensive.

\- They have a module for everything, and development is constant and active,
much of it coming from the community.

\- Hardware and networking equipment can be provisioned just the same as a VM
or OS image.

The list goes on. Definitely give Ansible a try.
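For reference, the pipelining mentioned above is a one-line ansible.cfg switch (it does require `requiretty` to be disabled in sudoers on the targets):

```ini
# ansible.cfg
[ssh_connection]
# send modules over the open SSH session instead of sftp-ing a temp
# file for every task
pipelining = True
```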

~~~
vacri
I've previously used Puppet in agentless mode and now Ansible. Ansible is
_much_ easier to troubleshoot - it just executes things in order until you
reach the thing that breaks. Puppet... fails in ways that you sometimes can't
even tell with debug on.

Not to mention that Puppet totally freaks out if you have the exact same step
configured the exact same way in two different modules and refuses to run
('just have package X installed, no further config'). Apparently you're
supposed to write a whole new module at that point.

My guess is that Puppet would work better in large, complex environments as a
sort of live-managed system, but down at my small scales (tens of servers),
Ansible is far easier to develop and manage, I find.

------
bfrog
I guess maybe I'm one of the few, while I've used ansible, chef, and now
salt... I'd much prefer something like nixops + nix, if it were more mature.

~~~
iso-8859-1
Where is the lack of maturity exactly? Nix itself is pretty stable. I don't
know about NixOps, but I suspect you may be talking about Nixpkgs, not NixOps
itself?

~~~
k__
Someone who is using Nix in production told me NixOps wasn't ready for more
complex deployments (the example mentioned was EC2 inside a VPC).

------
gtirloni
We've been using Ansible to configure baremetal/VM servers as well as to
build Docker images. The advantage of using it with Docker is that we get to
use something more powerful and readable than what's possible with a
Dockerfile (which tends to quickly turn into spaghetti/unreadable code).

If the application being containerized adheres to some principles, like
taking configuration options from environment variables or some other
discovery method, then Dockerfiles are simple enough. But if you have to do
anything more than a few steps, we've found it more manageable to encapsulate
the knowledge in an Ansible role.

We also build our own base images from scratch using Packer (which we love)
and Jenkins (not so much love there). That being said, having the required
packages for running Ansible inside the image does bloat things a little bit.

If Dockerfiles had more powerful and elegant constructs we could stop using
Ansible and remove a layer of abstraction. That would be great.

------
jv22222
If you are thinking about working with either of these (or puppet), please
check out salt stack; it is a pleasure to work with and equally as powerful.

~~~
saltuser
I've used Salt extensively, and found it unbelievably buggy. I'm not sure the
company behind it tests it in any way, and you can expect terrible bugs in
every point release.

Their bug tracker (which is 3500 open issues deep) has to be treated as
documentation, and the real documentation is often a work of fiction.

~~~
rdtsc
I had bought into it originally because it claimed some Windows support. It
did have it. But it was very limited and bug ridden. I really wanted to like
it, but it just fell down at every step. Worked ok-ish on Linux systems
though.

------
3lux
If you'd like something simpler than both, have a look at:
[https://pressly.github.io/sup](https://pressly.github.io/sup)

~~~
danieltillett
This looks very nice.

------
vegardx
I'm having a little trouble taking this article seriously when it treats
ancient versions of Chef as a problem. If you are still on a < 0.9 version of
Chef you are doing something very wrong. You already have configuration
management; it's not really an argument against updating to a somewhat stable
release. Chef isn't just a pull-based system: you have things like push-jobs
to go the other way around.

ChefDK is amazing. You get all the tools to do things right. Foodcritic will
make you write good code, fast. It is hard-wired for environments. It has a
rigid, well-documented precedence order for everything. I mean everything.

I work with Ansible, Puppet and Chef. Ansible is cute, perfect for one-off
configurations, but it doesn't do anything chef-solo couldn't do if you
wanted it agentless. Puppet and Chef do more or less the same things; I just
find Puppet extremely slow to work with (hiera, r10k, different environments,
etc.), hard to debug, and in general just slow.

------
themckman
So, after working with Ansible over the last 7 months, I've come to the
conclusion that it's not SO bad (it is still bad) if you have some sort of
frontend to it that builds your playbooks for you and then calls out to
`ansible-playbook`, instead of trying to write your playbooks from scratch in
YAML.

We have a big CLI application, written in Python, that does a lot of this for
us. It provides clean, specific interfaces to our most common tasks and has
wrappers around the base tools (`ansible` and `ansible-playbook`) that add
some useful features we found missing from the base tooling (want to set a
nested variable without resorting to passing a JSON dictionary at the command
line? `-V foo.bar.baz value`).

Even though our CLI is written in Python, we still prefer to build and write
playbooks to temp files before calling out to `ansible-playbook`, rather than
use the Python API, just due to the amount of logic baked into the module and
playbook runners that we'd have to replicate ourselves. Ansible 2.0 looks to
provide a more powerful Python API (and a better example of how to use it in
the docs), but porting our stuff now isn't really a priority.

As I started writing this, I thought this would be a much more scathing
comment, as I find myself terribly frustrated on a regular basis, but, as I
said, it's not SO bad using a frontend written in a language I enjoy working
with. As another commenter wrote, I really feel that Lisps/Schemes (or some
other language with a powerful macro system) would make the ultimate ops/CM
languages... Guix and Pallet seem to be the only games in town in that
regard.
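The `-V foo.bar.baz value` flag described above reduces to a small transform; this is a guessed sketch of its shape, not their actual code:

```python
import json

def dotted_to_extra_vars(path, value):
    """Turn ('foo.bar.baz', 'qux') into the nested dict that
    `ansible-playbook --extra-vars` accepts as a JSON string."""
    nested = value
    for key in reversed(path.split(".")):
        nested = {key: nested}  # wrap from the innermost key outward
    return nested

payload = json.dumps(dotted_to_extra_vars("foo.bar.baz", "qux"))
print(payload)  # → {"foo": {"bar": {"baz": "qux"}}}
# then: ansible-playbook site.yml --extra-vars "$payload"
```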

~~~
davexunit
I have begun work on a remote deployment system for GNU Guix, which will be
our Ansible/Chef/etc. equivalent. We already have the tools for performing
full-system configuration management on the local machine and offloading
builds from one machine to another, so it's a matter of gluing the two
together. We'll offload full-system builds to a cluster of machines and then
instantiate the configuration(s).

------
rdtsc
From what I know, but this is probably outdated knowledge -- Ansible is what
you'd use if you already use shell scripts + ssh. Then you upgrade to Ansible.
Instead of shell scripts you get nicer things.

However if you have something complicated that needs templating, 1000s of
servers, secrets/passwords to distribute, then something like Salt or Chef is
the way to go.

Personally I want to see where NixOS/Disnix/Guix will end up. On paper those
seem very powerful and I like the idea behind them.

------
glasz
ansible ftw. i really regret having started with chef a few years back.

------
GauntletWizard
This is pure bunk. Let's go down the list, one by one:

Maintenance: Sure, chef has a server component. So does ansible, if you're
using it the way he suggests (with a host periodically running ansible
playbooks on all hosts). Ansible has no client component to upgrade, though,
so that's a win, right? It totally is, until Ansible doesn't work on a host
and you can't figure out why, and the error logs you get are useless because
some of Ansible's many assumptions about the host's initial state are
incorrect. Chef can be managed by a standard package manager, which costs
nothing on the client side, and allows far, far better assumptions to be made.

For the record, I eventually gave up on the chef server, replicated the
playbooks to each machine (using a cronjob and git), and ran chef-solo.

Speed: Ansible pipelining speeds it up, significantly. You can almost get one
command a second! Chef runs on host. It is ruby, and goes slow, but I have
programmed a lot of chef and run a lot of Ansible, and my average chef run was
under 30 seconds, and I've yet to have Ansible run any playbook in under a
minute. Some of this is from atrocious default behavior, like requiring all
hosts to complete a step before moving on to the next step on any hosts, or
the fact that it spends nearly 10 seconds of CPU time on each machine
'gathering facts' at the beginning of its playbooks, even if none of those
facts are ever used.
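For what it's worth, that fact-gathering step can be switched off per play when nothing uses the facts:

```yaml
- hosts: webservers     # hypothetical group
  gather_facts: no      # skip the ~10s setup/fact-collection phase
  tasks:
    - command: /bin/true
```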

Fact caching: This is a solution to the aforementioned problems with Ansible.
It may make sense in the chef-server context, but I don't have a whole lot of
experience with it.

Tags: This is probably a matter of personal preference, but I prefer to give
the set of things that need to be done, and have the tree descend downwards
based on dependencies from there. The author clearly prefers to specify with
tasks when they should happen, and for each host a set of initial
circumstances. This one can be argued til you're blue in the face. I make the
point that there's a clear tree that can be built of dependencies under my
scheme.

Push vs Pull: There's no maintenance cost to upgrading? What is this dude on?
When you change Ansible revisions, you have to do just as much work adapting
as from chef revision to chef revision. Ansible has always been highly in
flux, and not great about not changing default behavior.

Pulls still have to be triggered, but they can (and should) be triggered on-
host, in a cronjob. Your monitoring system should alert you when the chef run
is out of date, though, honestly, if it is failing on just some of your hosts
you need to clean up and unify your infrastructure.

Raw numbers: Ansible makes one large machine do all the work. Chef costs you
a tiny amount on each machine. One of these scales. One of these does not.

Search and inventory: Oh gods, if you're using Ansible for inventory
management, please don't. If you're using chef for inventory management,
please don't. Neither is a reasonable tool for the job.

Orchestration: Neither chef nor ansible is an appropriate tool for dealing
with your application's data model. Full stop. Actually, full stop. There's
nothing else of value further down this article. Please don't take any of its
advice.

------
finchisko
What are the cons of, instead of using Ansible (or others), just doing a
simple clone of a virtual disk image from some configured master? Some things
will still need to be configured (IP addresses, ...), but a lot can be
shared. One can also use a layered FS to keep the shared things in an RO
layer and the clone-specific changes in another RW layer (basically what
Docker does). I know Ansible has somehow solved the problem of running the
same config scripts multiple times, but is it 100% bulletproof?
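On the last question: Ansible's rerun safety is per-module idempotence plus manual guards, not a global guarantee. A sketch (paths hypothetical):

```yaml
- hosts: all
  tasks:
    # "state=present" is a no-op on a second run if nginx is installed
    - apt: name=nginx state=present

    # raw shell steps get no such check; "creates" is the manual guard
    - shell: /opt/setup.sh && touch /tmp/setup.done
      args:
        creates: /tmp/setup.done
```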

------
movedx
All of the responses to this post so far: "This sieve is useless! It won't
hold my coffee and I get burnt every time I use it!!"

------
jdubs
"but I prefer using git submodules" \- nope.

Great tutorial!

~~~
pwelch
It's weird the author uses git submodules. Ansible has a requirements file you
can use for dependency management.

[http://docs.ansible.com/ansible/galaxy.html#advanced-control-over-role-requirements-files](http://docs.ansible.com/ansible/galaxy.html#advanced-control-over-role-requirements-files)

------
awjr
It may be me but I rather like [http://kubernetes.io/](http://kubernetes.io/)
as this takes the approach that you are deploying an application not a set of
disparate services.

~~~
meritt
These aren't exactly comparable with Kubernetes. While there's overlap between
the ultimate goals, we're fundamentally talking about two very different
things.

    
    
    Kubernetes, Mesos, Swarm, Fleet, Marathon belong in one category (orchestration tools).

    Ansible, Chef, Puppet, Salt belong in another category (configuration management).

~~~
iso-8859-1
What does Google use for configuration management? And how do I do
configuration management with Nix?

~~~
bbrazil
[http://flabbergast.org/](http://flabbergast.org/) is the closest, and in a
similar vein you'd have
[https://github.com/google/jsonnet](https://github.com/google/jsonnet) and
then Nix.

Configuration management is a very different thing to what most of
ansible/chef/puppet/salt are commonly used for - it's possible to do it but
it's a bit of an uphill battle. The core problem of config management is not
pushing out things, but rather how do you represent configs with their myriad
of interacting multi-level exceptions while also allowing
tens/hundreds/thousands of operational changes to be ongoing at the same time.

Ansible for example is good for about 3-4 levels of exceptions, which isn't
that much considering region and environment (dev vs prod) are two of those.

~~~
arde
Flabbergast seems to have the right idea, but its lack of examples is a
problem. Its only example is based around Apache Aurora (who on earth uses
that, and what it even is, I don't know, and I don't really want to learn it
just to understand an example). This hasn't changed since it was first
announced, so I don't think I'll take another look at it for quite a while
unless I come by some reasonable howto.

------
Florin_Andrei
This is very useful, thanks.

