
This is definitely cool, but the really hard problem isn't in these simple, easily scripted cases. The really hard problem to solve is managing all the complexity of similar-but-different hosts.

This article could just as easily have taken the complete opposite view of Ansible by saying things like "parallel ssh sessions don't scale; strong encryption costs too much CPU time; push can never work reliably, therefore pull is the only viable model; etc. etc."

I feel one of Ansible's strongest points to champion is the low barrier to entry. It takes minimal understanding to get going; compare that with at least a month hands-on with CFEngine, or perhaps two weeks with Puppet, before you would consider yourself proficient. With Ansible it's 20 minutes or so.




We've got quite a few users managing hundreds and thousands of hosts, and I'm not seeing these kinds of complaints. If we were getting them, I'd feel it, but we aren't :)

One of the things many people want to do is rolling updates, too, and Ansible is remarkably good at them, having a language that is really good at talking to load balancers and monitoring and saying, "of these 500 servers, update 50 at a time, and keep my uptime". Folks like AppDynamics are using this to update all of their infrastructure every 15 minutes, continuously, and it's pretty cool stuff.
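
To make that concrete, here's a rough sketch of what a rolling update play can look like (the host group, the load balancer host, and the lb_remove/lb_add scripts are made up for illustration; any pre_tasks/post_tasks that talk to your balancer would do):

    - hosts: webservers
      serial: 50                       # update 50 of the 500 at a time
      pre_tasks:
        - name: pull this node out of the load balancer pool
          command: /usr/local/bin/lb_remove {{ inventory_hostname }}
          delegate_to: lbhost
      tasks:
        - name: deploy the new application version
          yum: name=mywebapp state=latest
        - name: restart the app
          service: name=mywebapp state=restarted
      post_tasks:
        - name: put the node back into the pool
          command: /usr/local/bin/lb_add {{ inventory_hostname }}
          delegate_to: lbhost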

For those folks that do want to do the 'Facebook scale' stuff, ansible-pull is a really good option. One of the features in our upcoming product is a nice callback that enables this while still preserving centralized reporting.
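
For reference, a minimal pull setup is just ansible-pull on a cron schedule on each node, pointed at a git repository of playbooks (the repo URL, interval, and paths here are placeholders; by default ansible-pull looks for a local.yml in the checkout):

    # e.g. /etc/cron.d/ansible-pull on each managed node
    */30 * * * * root ansible-pull -U https://git.example.com/ops/playbooks.git -d /var/lib/ansible/local >> /var/log/ansible-pull.log 2>&1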

Happy to have the conversation, but I've definitely never heard the CPU time complaint. I think the one thing we see is that a lot of users are happy Ansible is not running when it is not managing anything, rather than having daemons sucking CPU/RAM/etc., and folks are actually getting a little better performance from avoiding the thundering-herd agent problems.


I just did some consulting helping another team improve their Hadoop cluster performance, and the first thing I noticed was that all 40 boxes in the cluster were burning a CPU core on a puppet agent process that had been running at 100% CPU for months.


That's one of the nicer things about the agentless setup: when Ansible is not managing something, there is nothing eating CPU or RAM, and you don't have a problem with your agents falling over (sshd is rock solid), so you get out of the 'managing the management' problem as well as the 'management is affecting my workload performance' problem.

In particular with virtualization, running a heavy agent on every instance can add up (reports of the Ruby virtual machine eating 400MB come up occasionally).


How does Ansible effectively scale to thousands of hosts using ssh? My experience is that you can only run a few hundred ssh sessions at a time with reasonable performance, and that's on beefy hardware to begin with.


Several different options.

Many folks are actually not doing repeated config management every 30 minutes. Though I realize that may be heresy to some Chef/Puppet shops, there's also a school of thought that changes should be intentional, so there is often a difference in workflow.

LOTS of folks are doing rolling updates, because rolling updates are useful.

Many folks are also using Ansible in pull mode.

You could also set up multiple 'workers' to push out changes, using something like "--limit" to target explicit groups from different machines.
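
For example (the group names are made up), each worker box could be pointed at its own slice of the inventory:

    # worker 1 handles one datacenter, worker 2 another
    ansible-playbook site.yml --limit dc1_webservers
    ansible-playbook site.yml --limit dc2_webservers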

If you feed Ansible --forks 50, it's going to talk to 50 hosts at a time and then move on to the next 50 (it uses Python's multiprocessing module). If you also set "serial: 50", that's a rolling update of 50 nodes at a time, which keeps your cluster up because you don't take all 1000 nodes down at once.
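
In command form, something along these lines (the inventory file name and numbers are illustrative):

    # talk to hosts 50 at a time; with "serial: 50" in the play,
    # only 50 nodes are ever mid-update at once
    ansible-playbook site.yml -i production_inventory --forks 50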

This is really more of a push-versus-pull architecture thing; while it presents some tradeoffs, it's also the exact mechanism that allows the rolling update support, and the ability to base one task on the result of another, to work so well.

Ansible also has a 'fireball' mode, which uses SSH for the initial connection and key exchange and then encrypts the rest of the traffic. It's backed by a short-lived daemon that doesn't stay running; when it is done, it just expires.
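
As a rough sketch of how fireball mode was typically invoked (exact syntax varied across Ansible versions, and the tasks here are placeholders): one play starts the daemon over SSH, then later plays switch transports:

    - hosts: all
      connection: ssh         # initial connection / key exchange over SSH
      tasks:
        - action: fireball    # starts the short-lived fireball daemon

    - hosts: all
      connection: fireball    # remaining traffic uses the fireball transport
      tasks:
        - command: /usr/bin/do_something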


> Many folks are actually not doing repeated config management every 30 minutes. Though I realize that may be heresy to some Chef/Puppet shops, there's also a school of thought that changes should be intentional, so there is often a difference in workflow.

I think this is a false dichotomy. Those who believe runs should be performed frequently often implement this to revert manual changes performed by people operating contrary to site policy.


Not so sure; I've heard that quite a few times. The rack-it-and-don't-reconfigure-until-I-want-to-change-something use case seems quite common, but I suspect it's more often found in better-organized ops teams where you don't have dozens of different people logging in and not following the process. There is of course --check mode in Ansible for testing whether changes need to be made, as is common in these types of systems. Thankfully, both approaches work, and you can definitely still set things up on cron/Jenkins/etc. if you want.
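
For instance, a dry run that reports what would change without changing anything:

    ansible-playbook site.yml --check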


You can actually manage a deployment methodology this way (update-x-number-of-hosts kinds of things) pretty handily using MCollective with Puppet. You can even script your own plugins for it to do basically whatever orchestration flow you want. Pretty cool toolkit, and I use it myself.
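
As a rough example (assuming the MCollective puppet agent plugin is installed; options may differ between versions), batching runs across hosts looks something like:

    # trigger a puppet run on all matching hosts, 10 at a time,
    # pausing 60 seconds between batches
    mco puppet runonce --batch 10 --batch-sleep 60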


I keep hearing these arguments about scale. I haven't ever experienced such issues. Now yes, if you're running Facebook infrastructure you may run into issues, but honestly, who here manages more than a few hundred hosts? I can't find the post that was making the point about encryption cost and parallel SSH sessions, but it'd be helpful to get actual hard numbers backing these claims, and to see how they compare to Puppet's or Chef's approach.

I agree about the complexity of similar-but-different hosts, and that's actually something we're setting out to solve with devo.ps (we'll see if we pull it off).


To some extent I agree; most will never scale, so why worry about it? X times out of Y you'll come out ahead by just ploughing on and not considering the future. Even if we are thinking up front, we can rationalise that it's just a gamble, often one worth taking!

It definitely doesn't take Facebook-sized infrastructure to outgrow a technology, though. What if the gamble doesn't pay off? What if you planned to scale the central NFS server dependency by just adding an extra NFS server, but have now found there's no rack space left / no budget / a purchasing delay / insufficient network capacity / cooling capacity / power capacity / some other unplanned problem?

For the SSH question, I couldn't reliably get more than 250 concurrent connections outbound from one circa-2008 blade server. From memory that would have had a dual-core CPU at around 2.4GHz with 8GB of RAM using PAE, as it would have been a 32-bit kernel (our spec; the cores will have been 64-bit). They were multiply-connected at chassis level on Myrinet fabric in one DC and InfiniBand in another, and the resource being exhausted was CPU.

These days all the blade servers are gone, but we see an absolute explosion of virtual machines, so it's a similar and still relevant problem in many ways.


It's not a matter of taking the gamble that you'll stay small. It's that there are different kinds of big; only one time out of a billion is there a single player (Google, Facebook, Amazon) that reaches an insane scale. By then, you'll have the resources to solve whatever issues you hit.

As Michael (the creator of the project) said here, it scales just fine even for serious players (thousands of instances). He has actual hard data and concrete use cases to back it up.

So far, all I've heard from detractors is "OMG, Chef is so hardcore, Facebook uses it". Well, they use some of it. They'd probably do just fine with Ansible instead, provided they put half the brain power they put into Chef into making it work for their infrastructure.

You may get very big, just likely not Facebook big.


It is worth pointing out that Ansible does not need a concurrent connection to every server to manage each machine; you can address groups as large as you want and control parallelism with the --forks parameter.


I really don't intend this as a brag, but I just wanted to point out that I learned Puppet yesterday. No, I mean, I really learned it yesterday. I started with my company's Vagrant-managed VM and their existing Puppet architecture, and armed with that I learned (from no previous experience) how to write Puppet manifests, modules, 'define's, etc.

My only point is that it doesn't take two weeks to learn Puppet. I'm not saying that Ansible is worse or anything like that, rather I just wanted to contribute another data point.


I think the issue is not so much how long it takes someone to learn a tool, but the repeated cost you pay from using it on a day-to-day basis. (I'd still be super impressed if you had storeconfigs and the spaceship operator nailed in a day!)

For instance, I worked for a major computer vendor doing an OpenStack deployment and watched a simple deployment there suck up 20 developers for six months, with all of that time going into writing automation content.

Repeated hammering out of dependency-ordering issues, coupled with the non-fail-fast behavior and having to trace down where variables came from, turned us into automation tool jockeys, so we couldn't focus on architecture and development. In the end the project barely had deployments extending beyond 5 nodes because of all the complexity.

Ansible already existed at this point, but that experience provided major fuel for me to double down on expanding it. The goal here is not just the basic language primitives, but making it really easy to find things as you grow a large deployment, and making it really easy to skim/audit even if you aren't a really smart programmer.

That all being said, Puppet deserves major credit for pioneering a lot of concepts and improving on CFEngine.

Ansible aims to be a cleaner config tool, but it also focuses on application deployment and higher-level orchestration on top, so you get some capabilities not found in those other toolchains (like really slick rolling update support).


Thanks for the helpful comment and also for Ansible in general! I'll take a good, close look at it - you might find me in your IRC channel sometime soon as I poke around Ansible, trying to see if it might be a good idea to port things over.

I know we were looking at Boxen as a way to roll out environments to our dev machines, and we are hoping that our existing Puppet configs will work well with that effort (since Boxen uses Puppet). Do you think it's at all possible that there will be some sort of adapter to allow Boxen to use Ansible? I have no idea if that would be a good idea or not (I haven't looked into either Boxen or Ansible enough yet), but that's the sort of thing that would likely help steer our decision process.


Sure thing -- Jesse Keating has been working on a side project called 'ansibox', which is effectively about taking Boxen-like ideas and applying them to Ansible. He's 'jlk' in #ansible and you should stop by and say hey. It's new, but it has the same kind of 'choose what you want and we'll make it for you' workflow.


You're tinkering with an existing, working architecture and a company-maintained Vagrant VM?

Of course it's going well. But you should make sure to buy all the people who built all these well-engineered things a tasty drink at the next company get-together, because rest assured: Not every Puppet setup is easy to tinker with.


I absolutely do, and having a working environment is a huge help! However, I found the Puppet docs to be more than sufficient, and in fact I spent the day learning Puppet not through our existing code (which actually still needs a fair bit of work) but rather using the learning VM that Puppet provides on their website.


I, on the other hand, tried to learn Puppet over the course of a weekend with a brand-new laptop and Boxen, just after it was released. There was a nightmare of broken dependencies that still aren't resolved months later (librarian-puppet won't install the apache module because its version has an -rc in it, which isn't a valid Ruby version, blah blah blah...).

Now I know that using VMs/Vagrant is critical to a sane server orchestration development workflow though, so there's that.


Interesting! We do use librarian-puppet, but we do not use apache. It's possible that we just managed to step around that land mine essentially by accident. Did you run into any other issues?



