I'm not a web developer but I have a side-project that runs on a cobbled-together EC2 instance. The server state is, in theory, documented in a set of shell scripts and virtualenv requirements files.
I know that I should be doing this in a more robust way, but whenever I try to read up on configuration management tools like Puppet and Chef, they're all described in comparative terms - Puppet does X better than Vagrant, which does Y better than Chef, etc. I quickly lose patience and get back to digging myself into a deeper technical hole.
Is there a non-recursive explanation of what these tools are able to do and where someone like me should start?
These tools allow you to specify a state and then specify which servers should have that state. A 'state', loosely speaking, is a collection of definitions of 'X should be Y'.
* Package 'libapache2-mod-php' should be installed.
* File '/etc/apache2/sites-available/customersite' should be the contents of this file we have on the master server.
* File '/etc/apache2/sites-enabled/customersite' should be a symlink to the above.
* Directory '/var/www/customersite' should be the contents of this git repository.
* Command 'apache2ctl graceful' should be run if any of the above change.
With these five rules, you can turn a default install of Ubuntu into a webserver providing client files in a matter of seconds.
Once you have those rules written, you then apply them. You can wildcard, so that hosts matching 'www*.yourhostingco.com' get these rules.
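In Salt's case, the rules and the targeting are two small files. This is a hedged sketch - file names, paths, and the exact state options are illustrative, not from a real deployment:

```yaml
# /srv/salt/customersite.sls -- the "X should be Y" rules
libapache2-mod-php:
  pkg.installed

/etc/apache2/sites-available/customersite:
  file.managed:
    - source: salt://customersite/vhost.conf

/etc/apache2/sites-enabled/customersite:
  file.symlink:
    - target: /etc/apache2/sites-available/customersite

apache2ctl graceful:
  cmd.wait:
    - watch:
      - file: /etc/apache2/sites-available/customersite
```

```yaml
# /srv/salt/top.sls -- which hosts get that state
base:
  'www*.yourhostingco.com':
    - customersite
```

With that in place, running state.highstate (from the master, or on a schedule) brings any matching minion to the described state.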
Now, when you want to add a new webserver to your cluster, you install a new Ubuntu instance, you register it with the Salt master server, and then trigger a state update, and you're done. No SCP'ing files, no manual git checkouts, no copy-paste-edit configuration management.
Then, once you've got all that, you can get into templating. You can do templating both on the contents of files (e.g. for memcached config, insert the machine's internal IP address instead of having one config per server) as well as the rules themselves to avoid having a ton of boilerplate (for module in 'list','of','python','modules', install the module via pip with these rules).
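In Salt's case both kinds of templating are Jinja. A hedged sketch of the rule-templating half (the module list is made up):

```yaml
{% for module in ['flask', 'requests', 'gunicorn'] %}
{{ module }}:
  pip.installed
{% endfor %}
```

For the file-contents half, a managed config template can pull per-machine data from grains - something like {{ grains['fqdn_ip4'][0] }} in a memcached config, instead of hard-coding one IP per server.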
One of my favourite benefits of something like Salt is that if you use Salt to do all of your configuration and just back up your Salt config to GitHub or wherever, you always have a record of how you did things. You don't get the incomplete documentation or missing steps that happen with most approaches.
As a sysadmin managing a cluster of systems, it can be life-changing.
But the problem that all of the configuration managers share - and the reason they all suck - is that the state you specify has no guarantee of being the end state. A much better alternative would be to manage state in chunks and test those chunks: e.g. an http-server transaction would require responding on port 80, and if it doesn't, fail or roll it back. Rolling back could use namespaces, jails, etc.
I would not say that is a reason why all configuration managers suck. The feature you'd like to see would add significant overhead, complexity, and potential drawbacks for a capability that not everyone needs yet.
Meanwhile, a disciplined system of designing and documenting tests and rollback procedures before committing changes will accomplish this without the need for additional software.
My guess is that if you are not disciplined enough to design tests and rollback procedures on your own, then you are probably going to just switch off or work around those features if they are built into the software.
I don't think Mitchell really needs to worry much at this point about marketing Vagrant. Everyone in the DevOps space is already plenty well familiar with it (mostly because it's incredibly awesome). Also, TFA explains what it is: “Vagrant is an established project that wraps around a number of existing virtualization providers to allow for extremely quick provisioning of disposable, consistent environments.”
TFA, or at least the quote you have selected, does not explain what it is. It states (in quite ambiguous terms) what the combination of Vagrant and "existing virtualization providers" can achieve, but not what Vagrant's role is.
From http://www.vagrantup.com: "Create a single file for your project to describe the type of machine you want, the software that needs to be installed, and the way you want to access the machine. Store this file with your project code"
That, and http://docs.vagrantup.com/v2/why-vagrant/, sounds quite similar to a configuration management tool to me (I believe you that it's not; I'm just saying I wouldn't have understood that). It seems geared to launching instances/VMs too, perhaps. Maybe it's my lack of domain knowledge, and maybe I have unfair expectations, but I think the sales pitch should be accessible to a developer who's not solely in the "DevOps space".
Accessibility is good because I'm sure it is incredibly awesome and may well be useful to a broader set of people who aren't yet in the loop.
With Vagrant, I can create a Vagrantfile. This says: "create a VM using this basic distribution, this networking configuration, these forwarded ports, and this hardware exposed". It defines the basic machine.
I can then use "vagrant up". This creates a new VM using VirtualBox, VMware, or whatever I have installed (I think there are even EC2 providers now!). It configures that VM's hardware as above.
Then, vagrant can use one of many tools in the configuration space -- Chef, Puppet, Salt -- to configure the software in the VM.
I guess the distinction lies there: Vagrant configures the hardware; Salt et al. configure the software.
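A hedged sketch of that split in a Vagrantfile (the box name, paths, and manifest names are illustrative): the top half defines the machine, the provision block hands off to configuration management.

```ruby
# Vagrantfile (illustrative) -- machine definition, then hand-off
Vagrant.configure("2") do |config|
  config.vm.box = "precise64"                          # base distribution
  config.vm.network :forwarded_port, guest: 80, host: 8080
  config.vm.provider :virtualbox do |vb|
    vb.customize ["modifyvm", :id, "--memory", "1024"] # the "hardware"
  end
  # The software side is delegated to a configuration management tool:
  config.vm.provision :puppet do |puppet|
    puppet.manifests_path = "manifests"
    puppet.manifest_file  = "site.pp"
  end
end
```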
The quote that SandB0x posted (it comes from vagrantup.com) says that it can be used for defining "the software that needs to be installed". So is the distinction Vagrant : h/w + s/w, Salt et al : s/w?
It's more that you use configuration management for the software, and then Vagrant does the hardware and hands control over to your configuration management for the software. Vagrant can also do the software itself, yes, but the way of doing so is running shell scripts, so it's far preferable to use a configuration management tool (the main benefit being idempotence).
The quote seemed pretty clear to me, but I suppose that could be because I already know what it does. Vagrant wraps VM software (VirtualBox, VMWare, etc — the "existing virtualization providers" in the quote) to create disposable development environments. It also allows you to connect with configuration management tools like Chef or Puppet, which then configure that disposable environment to mirror your production environment. It doesn't really do configuration on its own (unless you count running shell commands), so it isn't a configuration management tool itself. It does one thing well.
I agree that we should be evangelizing DevOps and bringing more people into the loop. I was simply saying that Vagrant itself doesn't really need any marketing. It's hard to read much of anything about the field without seeing it mentioned.
Thank you, your explanation finally made me clearly understand what these tools are for. I constantly hear about them, have read their landing pages, and even listened to a talk on Vagrant + Puppet, but still didn't understand what it is exactly that these tools do (and do differently).
Vagrant doesn't need more marketing, but it -- and the other tools -- do need to explain clearly what it does (and does not) do in a way that non-DevOps folks can understand. Perhaps the easiest way to do that is with a simple use case.
I couldn't agree more that most of the tools in this area could benefit from some better explanations and best practices. The development is happening at such a rapid pace that it becomes very difficult for a newcomer to figure out what's going on. I talked to someone at Opscode about it yesterday and he agreed, and said that it's something they are working on, so hopefully this will get better soon.
I just now (in the past few days) started learning about this stuff. I've been doing the "Learning Puppet" tutorials.
The most important thing about it is that you describe the state of the server, and Puppet applies it for you. It's idempotent (you might know this term from REST): you can run Puppet multiple times and the end result will be the same.
Normally (without config management) you would write scripts that do the configuration. If you ran them twice, things could go wrong, or they would overwrite the stuff from last time. Scripts are descriptions of what steps to take. Puppet manifests are descriptions of what state you want the server to be in.
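The difference is easy to see in a small Python sketch (the file path and the config line are arbitrary examples, not from any real tool):

```python
def ensure_line(path, line):
    """Idempotent, manifest-style: the end state is 'file contains this
    line once', no matter how many times you run it."""
    try:
        with open(path) as f:
            if line in f.read().splitlines():
                return  # already in the desired state; do nothing
    except FileNotFoundError:
        pass
    with open(path, "a") as f:
        f.write(line + "\n")

def append_line(path, line):
    """Non-idempotent, script-style: each run changes the file."""
    with open(path, "a") as f:
        f.write(line + "\n")
```

Run ensure_line three times and you get one line; run append_line three times and you get three.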
OK, the declarative approach and idempotency sound great. But consider this one use case: I want to stop a service, upgrade it to a new version, and restart that service. The service is an application I wrote that runs in Apache Tomcat.
1: Stop Service
2: Upgrade Service
3: Start Service
Right now I have just made three non-declarative statements about my tomcat server. I am describing a process that runs through a series of steps to reach a new conclusion.
So puppet has these beautiful (sounding) properties. But declaratively describing the state of my servers isn't a real problem I have.
I need to deal with the way that my servers change. I need to upgrade services, including starting and stopping. Puppet absolutely fails at this. Any attempt to describe a process is antagonistic to puppet at a fundamental level.
Whenever people seek a solution to these problems they always get fed that line. That puppet is declarative and idempotent, which is fine. But we should be _very_ clear that puppet does a spectacularly bad job of solving a very ordinary problem that I have, and which I would expect almost everyone else working with large numbers of servers to have too.
If you explicitly need to stop/start tomcat beforehand (I believe this resource does a restart after deploying your app) you can do that with the "service" resource built into Chef - though doing so may mean that you're restarting every time you converge the node, which is probably not what you're looking for.
Your ultimate goal is to ensure that the service is at the latest version.
There are numerous dependencies to that task - one of which may be stopping the service before upgrading. There are dependencies to stopping the service - maybe telling the load-balancers or your monitoring system. When the service has been upgraded, you can trigger notifications that other items within your config can depend upon (for example, when an upgrade is complete, restart the service, when the service restarts, add it back to the monitoring system etc).
All of these can be modelled in these config-management tools, as explicit dependencies within your manifest file - as opposed to implicit and ad-hoc dependencies within your script.
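A hedged Puppet sketch of that chain (the resource names 'myapp' and 'tomcat7' are placeholders): the restart fires only when the package actually changed, not on every run.

```puppet
package { 'myapp':
  ensure => latest,
  notify => Service['tomcat7'],   # an upgrade triggers a restart
}

service { 'tomcat7':
  ensure  => running,
  require => Package['myapp'],    # never start before the upgrade
}
```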
But consider this one use case. I want to stop a service, upgrade it to a new version, restart that service. The service is an application which I wrote that runs in apache tomcat.
Perhaps I'm missing some obvious detail but I would think you would just update your repository with the new version of your application and/or Tomcat package and that would cause the service to restart on the next Puppet run, after the new version has been installed.
The last web company I worked for had an app that required a little extra care and feeding to deploy, and we just didn't use configuration management for that. There were plenty of other uses for config management (changes to rewrites in httpd.conf, for example). We used bash scripts for handling deployments, which primarily relied on distributed ssh and rsync to perform upgrades. (I believe this is the fabric use case, though it didn't exist at the time and I've never used it)
The app was already documented, source-controlled, version-controlled, and tested by developers and QA. There wasn't much benefit to adding anything to the configuration management repository unless there were platform/infrastructure changes.
This is a pretty reductionist explanation, and it misses a lot of the benefits of state management. You make it sound like you just write a bunch of lines of shell script and you can sub in some variables, which is not at all their major strength.
The benefit of Salt, which it has over Ansible, is that it doesn't have to use SSH (which, as mentioned in the article, doesn't scale well to large deploys), and it can do a great deal more things than just 'run shell scripts from a file'.
If someone were to read only your comment they wouldn't see the benefit of these systems at all, which is unfortunate.
Founder of https://commando.io here. We are a web-based interface for doing remote executions via SSH. Our approach is a beautiful web interface with no external dependencies (agents). Salt and Ansible are amazing, but they still have a bit of a learning curve and setup "costs". Our target market also tends to be less technically proficient, and thus not willing to dive into more robust solutions. Think cPanel for operations. SandB0x, if you'd like to try out Commando.io, let me know, and I'll get you a private beta invite.
You can certainly create an AMI preconfigured with all of your libraries, packages, and services ready to go, but the issues arise over time with package updates, system updates, one-off configuration changes... the list goes on.
The actual configuration of your system will drift further and further away from that templated AMI, leaving you to constantly have to build a new template every time you deploy a new machine, or manually make all of those differential changes to your new system.
Config management can be tailored to rapidly update the configurations of certain classes of systems far more often than the constant cycle of blowing away machines, updating templates, redeploying, and so on would allow.
That doesn't really solve the problem, it just moves it.
If AMIs are your deployment method of choice, you still need to build a repeatable process for applying updates and changes to your previous base AMI, testing that it still works under all your realistic configurations, and then gracefully deploying it to your production infrastructure.
Which is all doable, but not trivial. And the processes you build will be very Amazon-centric.
Having done this a dozen or more times... can you package your AMI for your developers to use? Can you remake your image exactly the same way it was when you first created it, should it be lost in the cloud? (100% reliability isn't something AWS provides.)
Provisioning tools let you create it, and incrementally update your image in a way that lets you redo it from scratch at any time.
That alone is the reason why I like the idea, not necessarily the resulting applications that have been created so far for the task.
Shell scripts could do the same, and have for years, if you're only interested in from-scratch setups.
Because a lot of people don't use their brains and realize that Linux package management / image creation has been a solved problem since forever (I've been doing this since Red Hat kickstart circa 2001, and I'm a young guy).
However, there are some common sense tiers:
3rd party dependencies that changes infrequently = put in ami
Common ops daemons and boot scripts = put in ami
your application code and your configs = deploy via python/ssh client side
Also, one simply has to use ec2 instance tags to name things.
The result: you can have bit-for-bit identical instances fired up in no time, with high confidence, without needing a full OS package mirror. And without, of course, a ridiculously over-engineered configuration server framework, DSL, SPOF, security holes...
I asked myself this same question several years ago. The basic advantages chef and puppet (and salt, etc.) offer over client-side python alone are:
A thin abstraction layer for basic system/platform information (facter/ohai). System information that you'd obtain using various tools like dmidecode, uname, df, /proc, and ip is all gathered and made available in a single data structure. Overkill? Maybe, but it makes for much cleaner scripts.
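A minimal Python sketch of the same idea - scattered system details gathered into one structure (the key names here are made up, not facter's or ohai's):

```python
import platform
import socket

def gather_facts():
    """Collect assorted system details into a single dict, the way
    facter/ohai expose them to recipes and templates."""
    return {
        "hostname": socket.gethostname(),
        "os": platform.system(),
        "kernel_release": platform.release(),
        "architecture": platform.machine(),
    }
```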
Finally, a framework for organizing your configuration in a standardized way, with most of it in one place. With chef, for example, the idea is to put all the configuration for a single application in one "cookbook" and then if you have two different servers that both use that application in different ways, you define two different roles that pass different parameters to the basic recipes. Obviously the extent to which this is a benefit depends somewhat on your needs and personal preferences.
But I do agree that both chef and puppet are over-engineered. Puppet's DSL and chef's server are mostly overkill that I don't have much use for. But the DSL can be dealt with and the chef server is optional. There doesn't have to be a single point of failure.
We use Chef for a distributed team. They download a very basic VM, run Chef, and are off to the races in minutes. When an update or config changes, they update from VCS and everything is configured and good to go. This is especially awesome for more UI-focused devs who aren't as comfortable with Unixy tools.
I've used Fabric, Chef, Puppet, and Ansible, and have settled on Ansible; it's a sort of middle ground between Fabric and Chef that does more than just run commands on servers but doesn't require me to buy into a whole elaborate universe of configuration management servers and whatnots. Ansible is great.
The ZeroMQ stuff makes sense if you're pushing configurations inside a data center, but it's a dealbreaker for us having things hosted externally.
It's ridiculous how much more pleasant it is to use Ansible than Puppet or Chef. Its invention solves a big pain for me: As a veteran user of Puppet I'm a firm believer in using a tool like Puppet, but Puppet-and-Chef are overdesigned for small jobs (and, arguably, for most other jobs as well) so actually recommending them to a beginner has always felt like this:
A: "I just set up a cloud instance by running some shell commands by hand."
B: "You shouldn't do that, because of X and Y and Z. You should learn Puppet or Chef."
A: "Wait... did you just tell me to go spend thirty hours banging my head against solid objects, in exchange for nebulous benefits that I can't even perceive yet?"
After working in Chef for several years and fighting with both gem-rot and managing colliding chef-client and application ruby+rubygems+gem environments with RVM... I was ready for something else. Also, our group just started working alongside another group that does not use cfg management; I wanted something that would be unobtrusive. At first I thought Salt would work, but the minionless mode was not really recommended. You lose a lot of the power of Salt when you do that. Ansible, however, is designed to be minionless. After using it for the last few months, I am (frankly) in love. The last time I felt butterflies like this for a tool was when I grokked git.
Look, I can maintain my own ansible configurations (inventories, playbooks and roles), and the other groups need not be the wiser. They don't need to worry that servers are going to be reconfigured out from under them, since I am running my configurations explicitly. All they see is that I ssh'd in and did a bunch of stuff really fast. To work with my other groupmates, we just keep the files in git (just flatfiles, yes: just flatfiles). With git I have an audit log of everything that has changed (hosts that have moved environments, configurations that have been updated, etc). And like git it is as distributed as you want it to be. Want to run it from a central server only? Fine. Want to have your admins all run it from their workstations? Fine. Go for it.
Even if the server is ancient or weird (i.e. no Python), I can still manage it with the raw module.
Ansible gives me everything I want, with no fuss. It is so basic that I can do things the way I want to.
I've recently been working to get a set of Puppet scripts running alongside Vagrant so both server-side and client-side devs can get working on projects faster.
Personally, I've found dealing with Ruby and the various versions rather irritating. I kind of wish I had started with Salt or Ansible because I know far more Python than I do Ruby. I started down this bunny trail before Vagrant supported Ansible and/or Salt. I guess it's not too late to start over... hahaha. Maybe I will after the initial release.
I guess the only advantage of Puppet was that I'm able to do most of what I want using major modules...
One important difference: you don't have to install Ansible on the nodes you're managing, just Python >= 2.4 with a JSON module installed (included by default in 2.6 and later, available through simplejson for earlier versions).
Also, Ansible does not require you to mess around with dependency lists to ensure that packages/files are installed in the right order - the order is built into the yaml config file. You don't lose any capability, you just gain (and this is my personal opinion) ease of understanding the order that operations will occur in.
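A hedged sketch of what that looks like in a playbook (the hosts group, package, and paths are illustrative); the tasks simply run top to bottom, in file order:

```yaml
- hosts: webservers
  tasks:
    - name: install apache
      apt: name=apache2 state=present
    - name: deploy site config
      copy: src=customersite.conf dest=/etc/apache2/sites-available/customersite
    - name: enable site
      file: src=/etc/apache2/sites-available/customersite
            dest=/etc/apache2/sites-enabled/customersite
            state=link
```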
How many people have reviewed Paramiko? In particular, how about that ECDSA patch to Paramiko that you'll need to access modern Ubuntu or Fedora (and, before long, RHEL/CentOS)? What about python-ecdsa (which Paramiko's provisional support for modern Fedora and Ubuntu's default configs is based on)? This entry from its README seems pretty frightening:
This library does not protect against timing attacks.
Do not allow attackers to measure how long it takes you
to generate a keypair or sign a message. This library
depends upon a strong source of random numbers. Do not
use it on a system where os.urandom() is weak.
I'm not saying Paramiko (or its patch sets) are insecure, just pointing out that the same arguments can be made against the libraries and code that Ansible is based on.
ZeroMQ used to be full of assert() calls that would fail on bad data and crash whatever was listening. It was trivially easy to cause problems for someone's ZMQ socket-listening service. It's been fixed, but the perception lingers on.
If you care about network security, why are you exposing SSH ports publicly anyhow? Put them behind a VPN.
There are ways to configure Salt masterless, or behind VPCs, or with syndics, which perhaps offer an enhanced security model. But the defaults are for regular use cases, and for most of those the defaults are fine.
SSH is secure. The configuration of SSH, and the system that it's running on, its operating system, accounts and applications, might not be. It all comes down to how secure you want to be.
There is a longer conversation that can be had to explain this. There are lots of mistakes one can make in managing a box with SSH that will easily lead to an account compromise, not even considering universal methods like 0-days, stack attacks, mitm, phishing, etc. The simplest defense to all these problems is a hardened VPN device on a separate network segment with separate authentication and strictly defined ACLs.
If you just run a personal VPS, I wouldn't worry about it. But if you ever start handling customer data, get serious about security and don't run SSH in the open.
Keep in mind that most of the time, penetration happens because of operator error. Most common vulnerabilities are patchable or preventable in one way or another.
SSH contains many features which may make it vulnerable to attack. If configured incorrectly it will expose the machine, and depending on the machine this could expose a big portion of your network. I could write 5 pages on abusing SSH options for entry.
After SSH itself, there's the operating system. PAM has had holes for years and yet people everywhere rely on it for their authentication. Their OS might not be patched up, and probably has had no hardening of the kernel or userland against typical attacks.
The box is also probably on a shared network segment, meaning your attack surface is now a whole lot bigger: many machines and user sessions to hijack.
On the other hand, a VPN on a separate network segment provides protection for all these components, if you do it right. Usually the VPN should be an appliance of some kind, with service by a company who spends its time hardening the device and patching the software regularly. This makes sure the configuration of the software is correct, the system itself is resistant to attack, and authentication/authorization is managed outside the machine, which means limited access to the network for specific user sessions.
You cut the attack surface down by separating their network access away from a specific network and shared server. You also cut down on mistakes in managing said service because everything is a managed service, not a config file edited by one or more admins.
Also consider that SSH just doesn't have the features of a standard VPN that make it useful on a network level. Its tunneling feature is a joke; using ppp is the closest way to provide an actual remote vpn tunnel. Persistent connections are nonexistent. Network access control is nonexistent. Pushing routing, DNS and other network information is nonexistent. It is not meant to be a network-level tunnel, it's a single-session user tunnel for remote hosts, not networks. The extra options and features for more network access are hacks.
If you wanted to emulate a VPN using SSH, you could:
* put a machine on a separate network segment
* apply Grsec and other patches
* harden the filesystem and stop all services other than SSH
* disable user logins
* disable all features of ssh
* enable all strict ciphers, modes, options for ssh
* add ssh config options to run pppd as soon as the user logs in
* configure PAM to use RADIUS or SQL to auth with a remote database
* use iptables and some kind of packet-marking module to track network sessions by user id
* create a custom application that configures iptables rules per user
* constantly update all patches
That would get you close to a real, normal VPN. Unfortunately, it's also a metric shit-ton more code and services to do the same thing as one service, and considering you'd be setting it up for the first time, there's probably gonna be some mistakes made.
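For a couple of the middle steps, a hedged sshd_config sketch (option availability varies by OpenSSH version, and this is nowhere near the whole hardening list):

```
PermitRootLogin no
PasswordAuthentication no
AllowTcpForwarding no        # disable the port-forwarding features
X11Forwarding no
Ciphers aes256-ctr           # strict cipher/MAC selection
MACs hmac-sha2-256
ForceCommand /usr/sbin/pppd  # run pppd as soon as the user logs in
```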
SSH is not a VPN, and a VPN is not SSH. Like I said before, if you're just remoting into your personal box, I don't think you'll have issues, because nobody would care to break in. Bigger networks are a different matter.
Keep in mind what we are discussing here; a lot of that was off-topic. I never said SSH and a VPN are the same, or that SSH should replace a VPN - it's the other way around.
Now, it seems like you're saying that a VPN is more secure because it's often separated and more hardened. I guess that kind of makes sense, but isn't that really an argument for a single, hardened entry point into the network? It could be anything... And of course, this has its drawbacks as well =|
It's an argument for better network and host security.
If you have a building you want to secure, you could make a lobby with a man-trap and a guard at the front desk and security keys for the lobby and on each floor. Or you could put a padlock on the back door. You decide whether it's worth the risk.
In terms of what to use to remote into your private VPS, just use SSH. A VPN would be overkill.
SSH authentication works over TCP, which isn't realistically spoofable: to open a connection (which you need first, before you can even fail to authenticate), you need to be able to receive the response packets to your spoofed sender address.
a) This was suggested as a method to combat random trolling from botnets, so this sort of response is highly unlikely unless you're already getting DDoS'd, and b) Fail2Ban allows you to whitelist some IPs that won't be banned, if you really need to protect against that sort of attack.
1: At which point you'd most likely need to be a lot more involved anyway, and you should probably be running something more serious in front of your server...
Salt/Puppet/whatever - I ignore them all. Why? I have put a lot of thought into this area.
IMHO, the overwhelming problem with salt/cfengine/puppet style solutions (which I will refer to as 'post-facto configuration tinkerers', or PFCTs) is that they potentially accrue vast amounts of undocumented/invisible state, thereby creating what I refer to as configuration drift.
IMHO, a cleaner solution is to deploy configuration changes from scratch, by deploying clean-slate instances with those changes made. In addition, versioning one's environment in this way creates an identifiable point against which to execute automated tests. (This class of solution I refer to as 'Clean-slate, Identifiable Environments' or CSIES.) Examples are Amazon AMI's, and any other kind of versioned/identified VMs.
PFCTs' deployment paradigm tends to be relatively slow and error-prone. CSIEs tend to be fast and atomic. PFCTs are headed for the dustbin of history. They are temporary hacks that clearly grew from old-school sysadmins' will to script. CSIEs embrace modern-day devops as more holistic entities that embrace virtualization and recognize the integrity of the environment as critical to preventing ridiculous numbers of environment-induced, service-level issues that are an expensive tangent to service development, testing, and deployment. Thus, I would argue that what we are looking at with PFCTs is a failed paradigm, and with CSIEs, a real and current opportunity for something far more elegant.
(Disclaimer: Haven't tried ansible or vagrant first hand, but they do seem to be PFCT's to me.)
Computer systems in the wild are stochastic - that's the fundamental assumption that leads to the design of tools that you call "PFCTs" here.
This type of design leads to tooling which is relatively slow and often obtuse, but there are well-reasoned (and researched, start maybe with ) assumptions behind those tools. In particular: systems modeled by these tools are designed to run for years, include a large variety of hardware and software, to be worked on by a lot of different people, and to be able to grow to massive size.
The "clean slate" idea is a good one, and has been in use for a long time (the idea of gold images goes way, way back). But that's for initial system state, not dynamic system modeling. The cfengine family of tools grew out of limitations in the "clean slate" approach for real-world system models.
If your concern is "how can I deploy my Rails site to EC2 quickly and reliably," you'll have different goals and assumptions on the tools you need, than if your concern is "how can I grow a resilient infrastructure". (edited to add: both are completely legitimate goals)
All that aside, I definitely agree that the tools still suck :)
Try telling a client who wants 100% uptime that! Computer systems can be many things - it could be (and probably has been) argued that we as programmers fundamentally aim to reduce random qualities and increase stability within our programs.
there are well-reasoned (and researched) assumptions behind these tools
I can see where PFCTs came from. I took a look at the paper (which is exclusively about policy-driven management systems), but I don't think it invalidates my points re: paradigm failure. Taking the same policy-based management systems and applying them to a CSIE environment (say, with corosync/pacemaker) is one way to combine their benefits. (That's what we do on our own infrastructure, actually.)
If your concern is...
Right, differing concerns. You also have different life expectancies for instances, purposes (dev/staging/prod), availability expectations, etc. But these are tangents! "Developing good software comes down to consistently carrying out fundamental practices regardless of the particular technology." - Paul Duvall.
We attempt to remove an entire class of issues related to deployment environment by changing our platform engagement paradigm to one that is less procedural/'stochastic' to something that is more atomic/reliable. At the same time, this facilitates a clean segregation (and versioning!) of deployment environments versus service development (which may target one or more CSIE).
> Try telling a client who wants 100% uptime that!
A client who asks for 100% uptime will end up disappointed.
> We attempt to remove an entire class of issues related to deployment environment by changing our platform engagement paradigm from one that is procedural/'stochastic' to one that is more atomic/reliable.
I'm afraid I don't get the difference. You've argued that configuration drift is a major problem with configuration management tools, but I don't see any benefit of deploying full system images that couldn't equally be achieved with stepped upgrades applied by CM. The problem there is allowing config changes to be made directly to production servers instead of going via the deployment tool, not anything fundamental to which deployment style has been used.
With puppet configuration (for instance) in version control, it's not a problem to test against a known, versioned, identifiable and auditable environment. As long as you're not applying config changes to live servers, the switchover to a new environment is equally atomic either way.
Given that the trade-off is building, storing and deploying full system images, I don't think there's a fundamental advantage to full-system deployment that can't be matched by a thought-out application of conventional configuration management.
A client who asks for 100% uptime will end up disappointed.
[...] I don't think there's a fundamental advantage to full-system deployment that can't be matched by a thought-out application of conventional configuration management.
OK, on the face of it, this is a fair line of reasoning. If we assume, however, that we are looking for ... replication (say, for the purposes of regression testing, etc.) then we really do need to know that the entity in question is the same as the last time it was .. err .. generated/instantiated. With the process you propose, there is clearly higher risk here. That is a paradigm weakness.
I think you're right that Salt/Puppet and to a lesser extent Chef take the wrong approach, but you make some confusing comments that make me suspect you might not understand what these existing approaches are about.
> IMHO, the overwhelming problem with salt/cfengine/puppet style solutions (which I will refer to as 'post-facto configuration tinkerers', or PFCT's) is that they potentially accrue vast amounts of undocumented/invisible state
I think you mean "post-facto" as in: "run after everything is done"? This is not the way that people would advocate Puppet should be used. Puppet should be used from the start, not added as an afterthought once you are done.
> IMHO, a cleaner solution is to deploy configuration changes from scratch, by deploying clean-slate instances with those changes made
This isn't a cleaner solution, this is almost the solution you get when you use Puppet. With puppet the development workflow is like this:
- Spin up a vagrant VM
- Run your manifests against this VM to test them
- To check in, run your manifests against your staging environment
- To deploy, spin up new clean production VMs and run your puppet manifests against them
- Use a reverse proxy to route all traffic to new production VMs. Terminate old production VMs
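The steps above might look something like this in practice (the manifest path and proxy step are illustrative assumptions, not from the original comment):

```shell
# Hypothetical walkthrough of the Puppet workflow above; the manifest
# path /vagrant/manifests/site.pp is assumed, not prescribed.
vagrant up                                                            # 1. throwaway dev VM
vagrant ssh -c 'sudo puppet apply --noop /vagrant/manifests/site.pp'  # 2. dry run first
vagrant ssh -c 'sudo puppet apply /vagrant/manifests/site.pp'         # 2. real test run
# 3./4. the same `puppet apply` is then run against staging, and finally
#       against freshly provisioned production VMs
# 5. repoint the reverse proxy at the new VMs and terminate the old ones
```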
> PFCT's deployment paradigm tends to be relative slow and error prone.
This much is true. Puppet's slow speed is particularly galling, but maybe that's just because I use it at work.
> isn't a cleaner solution, this is almost the solution you get when you use Puppet
The difference between deploying an instance of a stored environment and generating that environment from some prior state is the generative process, which can fail or change in unexpected ways due to network conditions and other factors.
More importantly, PFCTs enable and to some extent encourage modification of generated environments remotely, en-masse, without any significant capacity to ensure that individual instances within a group have not subtly shifted in configuration. This is what I meant by configuration drift.
CSIEs, by contrast, are essentially the complete product of the entire generation process, thus ensuring that future instantiations are identical. A subtle difference, but an important one.
1) You have not dealt with large enough data, since you advocate just creating VM copies or snapshots. Try that on 10 or 100 TB of data.
2) You haven't thought about what and how those initial CSIE configurations are generated. Do you hand-tweak everything, make && make install all the software onto a particular installation of a particular OS, and then just spawn those? That approach should go to the dustbin of history. You now essentially have a black box that someone somewhere tweaked, with no recipe for how to repeat it. If that person left the company, it might be tricky to understand what was installed, where, and at what version.
If you have "configuration" drift, the fix belongs in the configuration declaration; people shouldn't be hand-editing and messing with individual production servers. If network operations fail in the middle, then the configuration management system needs better transaction management (maybe use OS packages instead of just ./configure && make && make install) so that if an operation fails, it is rolled back.
you advocate just creating VM copies or snapshots. Try that on 10 or 100 TB of data
There are many ways to take an image of an environment, not only VMs or snapshots. But if your system image includes 10-100TB, it could be argued that the problem of size really lies in earlier design decisions.
You haven't thought about what and how those initial CSIE configurations are generated.
On the contrary, generation should be automated. In the same way that a service to deploy to such an environment is maintained as an individual service project, the environment itself is similarly maintained, labelled, tested and versioned as a platform definition.
> In the same way that a service to deploy to such an environment is maintained as an individual service project, the environment itself is similarly maintained, labelled, tested and versioned as a platform definition.
A mix of the two. Use salt/puppet/chef etc. to bootstrap a known OS base image to a stable production platform VM, for example. Then spawn clones of that. I would do that, and I can see how it would work very well with testing.
The difference betwen deploying an instance of a stored environment and generating that environment from some prior state is the generative process, which can fail or change in unexpected ways due to network conditions and other factors.
CSIEs are still a generative process. The difference with what you call PFCT is that the generative process isn't swept under the rug and codified into a versioned image unless it's really necessary for performance reasons.
The result is that it's easy to maintain a clear distinction between machine state and human instructions. For a trivial example: a list of packages that humans decided are necessary for the system versus the final output of 'dpkg -l' after all dependencies have been resolved.
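The distinction between human intent and machine state can be made concrete with a quick comparison (file names and package names here are made up for illustration; on Debian the "actual" list would come from `dpkg-query`):

```shell
# Compare a hand-maintained package list against what the package
# manager actually reports installed, to spot undeclared drift.
printf 'nginx\nmemcached\n' | sort > declared.txt       # what humans asked for
printf 'nginx\nmemcached\nlibc6\n' | sort > actual.txt  # what dpkg would report
comm -13 declared.txt actual.txt   # lines only in actual.txt: installed but never declared
```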
With chef/puppet/etc. the code used to generate instances represents a human-created description of what the environment is supposed to look like, with as much version-control and referenced documentation as is necessary. With a versioned-image approach, all you have is the one-dimensional history of the image in question.
I fully advocate the description of build steps for environments, just as PFCTs encourage. However, the use of PFCTs to prepare and manage environments seems .. suboptimal, in terms of potential for issues. I suppose a PFCT could be useful as a means to automate the generation of environments, but ... IMHO ... it should not be used for the live instantiation/configuration of real infrastructure (which should be more atomic, from some versioned/known quantity). A subtle difference, but important.
I forgot to mention this before, it is strange that you credit yourself with defining this term when it has been well defined for some time in ops.
> More importantly, PFCTs enable and to some extent encourage modification of generated environments remotely, en-masse, without any significant capacity to ensure that individual instances within a group have not subtly shifted in configuration.
But this adds a significant weight over and above the "generative process" of running manifests. Yes, running manifests against your VMs can "fail or change in unexpected ways due to xyz" - don't do it against VMs that are currently in production! I'm not sure you've ended up with anything less error prone and you're still going to need a way to get from a fresh VM image and your output images - which is where Puppet would come in.
I'd really rather not make the entire disk image my build artefact, for fairly obvious reasons (ie: size).
You might like this, which is written by a colleague of mine, except that it is not in "opposition" to Puppet/Chef/etc:
Perhaps. The problem is, if you use a PFCT on a bunch of hosts and something subtle changes that can cause issues, the granularity of the PFCT doesn't necessarily equate to that required for detecting the cause. With a CSIE-style atomic approach to deployment and a properly segregated monitoring system, you can, say, 'roll back' to the last known good version. PFCTs leak state, and will not always allow you this reverse pathway (random examples might include kernel feature or compiler version migration).
I also think that rebuilding from scratch is way better than some idempotent CM solution. I work at small scale - just need to have some services working on a vps and also manage my linux box.
This is why I think that docker will be a great solution to me. Spinning lxc container from existing image is really fast - I have made some tests recently. Now I'm setting up a system where I will have portable containers for specific tasks - a nodejs container for a webapp, a postgres db, tor machine for anonymous work, etc. All of them will be built using Ansible and versioned using git.
Then I can put them all on a vps and redeploy independently and just use one machine.
Having a Clean-Slate Identifiable Environment is well and good from a usage standpoint, but from a creation & maintainability standpoint any bucket of bits qualifies as an identifiable environment.
The very purpose of most of these tools is to make state visible, to be able to see what has been poured in. That visibility contributes to extensibility, the ability to take that known configuration starting place and to be able to branch and create new configurations.
If you have infrastructure in mind for how your CSIEs are constructed, I'm all ears, but I'm envisioning more sh scripts posted on the corporate wiki as your CSIE implementation.
any bucket of bits qualifies as an identifiable environment
The very purpose of most of these tools is to make state visible
Right, but they do a poor (ie. post-facto, limited granularity) job of it. This is why IMHO their paradigm is inelegant.
infrastructure in mind for how your CSIEs are constructed
We use scripts with exit values that execute within the target environment to bring it from a base configuration (eg. some AMI or some distro) through to the desired config, plus validation tests. I think most people's approaches will be similar. PFCTs essentially provide this, and could be used for this step without issue.
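A minimal sketch of such a build script, assuming a Debian-flavoured base image (the package, file paths, and validation step are all placeholders):

```shell
#!/bin/sh
# Bring a base image forward to the desired config, failing loudly at
# the first broken step so the resulting environment is never half-built.
set -e                                   # any non-zero exit aborts the build
apt-get update
apt-get install -y nginx                 # placeholder package
install -m 0644 files/nginx.conf /etc/nginx/nginx.conf
# validation test: reject the build if the deployed config does not parse
nginx -t
echo "environment build OK"
```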
I may have jumped the gun. I was assuming PXE to boot an imaging environment, which copies the system image to local media so you don't have a runtime dependency on the boot infrastructure. In that case, speed-wise, you've not gained anything because you've got two boot cycles and a system copy before your new server is ready.
If you're not copying the system image and simply network booting a remote image, then that doesn't apply, and yes, that can be fast.
a cleaner solution is to deploy configuration changes from scratch, by deploying clean-slate instances with those changes made.
All of us who have built big cloud-server clusters have dreamed of this plan at least once. But there are big practical problems.
Relaunching infrastructure is easy in theory, but from time to time it becomes very difficult. There is nothing like being blocked on a critical upgrade because your Amazon region has temporarily run out of your size of instance, or because the control layer is having a bad day, or because you've accidentally hit your instance limit in the middle of a deployment, or...
A much bigger issue is that bandwidth is finite, so "big" data is hard to move. This is a matter of physical law. It's all well and good to declare that you're never going to apply a MySQL patch in place: You're just going to launch a new instance with the new version and then switch over. But however fast you manage to launch the new instance (and you will be hard put to launch an instance faster than you can apply a patch and restart a daemon...) you will be limited by the need to copy over the data. Have you ever tried copying half a terabyte of data over a network in an emergency while the customer is on the phone? It is very annoying. Because it is often physically impossible to do it quickly: Cloud infrastructure isn't generally built for that, and when it is it costs money that your customer will not want to spend for the luxury of faster, cleaner patch-application.
A solution to this is to use cloud storage like EBS. Now your data sits in EBS and you just detach its drive and reattach it to a new instance. That actually works okay, provided you're happy with the bandwidth and reliability of EBS, which lots of people aren't – and, as those people will cheekily point out, you have now solved the "relaunches are slow" problem by replacing it with an "everything is uniformly slow" problem. Moreover, detaching and reattaching EBS volumes isn't instantaneous either. You have to cleanly shut down and cleanly detach and cleanly restart, and there's like 12 states to that process, and all of them occasionally fail, and if you don't want your service to go down for thirty seconds every time you apply a patch you need a ton of engineering.
Which brings us to the other problem: Complexity. Most programmers are not running replicated services with three-nines-reliable failover that never breaks replication. But even if you are, because you've got the budget for excellent infrastructure and a great team, it will always - for values of "always" measured in several more years, anyway - be more complicated and risky to fail over a critical production service than to apply, say, a security patch to 'vi' in place on a running server. 'vi' is not in your critical path. If you accidentally break 'vi' on a live server (and you won't, because vi is older than dirt and solid as a rock), you will have a good laugh and roll it back. Why risk a needless failover, which always has a chance of failure, when you could just apply the damn patch and thereby mitigate risk?
At Google scale that argument probably stops applying. But most people don't run at that scale and it will take decades to migrate everyone to a system that does, if that even happens.
So, "dustbin of history", maybe, someday, but in the long run we are all retired, and I will be retired before our dream becomes reality. ;)
The bulk of your comment - your second, third and fourth paragraphs - focus on issues of speed, bandwidth and reliability in a third party hosting/cloud-based architecture, which are a design-time tradeoff, so I don't see them as strictly relevant (though anecdotally informative).
Your fifth paragraph describes problems related to operations process, which are entirely avoidable.
I may be missing your point here, so forgive me if this seems ignorant, but I do not see how something like Puppet accrues this "invisible state".
I'm in the midst of a large puppet enterprise deployment at my day job, and it seems they've taken great pains to prevent any drift from happening. You get a dashboard webapp that shows you every puppet run on every machine, and a large overview that gives you states like "nonresponsive / changed / unchanged / pending / error".
When a puppet run makes a change to a system, that run is marked as changing the server. If you're getting lots of "changed" runs on a system, you immediately know to go look at that server because something is making that box deviate from your defined baseline.
You write your configurations (manifests), add them by name to the webapp, add those to groups, and then add machines to the groups, which define what set of manifests to apply.
On my admittedly newbish level, it would seem this system is completely immune from any drift-over-time, provided you bother to glance at the dashboard occasionally. We're going to put it in our NOC as soon as the deployment is done :)
To be honest nobody else uses the term... I just made it up! But I've been wanting to further publicly elucidate my thinking in this area for some time... maybe soon I'll get around to more. I would say the responses to this post have been somewhat confirmatory (I'm not barking up an entirely wrong tree; others can see the logic of my thinking to some extent).
I personally believe that this area is going to expand rapidly, building off of the trajectory begun by the present first generation of cloud infrastructure. For someone who wanted to practice thinking in this area, I would say a good exercise is to limit yourself purely to third-party and cloud-based infrastructure, but demand high performance and global (multiple cloud provider) availability, and challenge yourself to write a multi-service system, including a deployment tool that automates your solution and actually produces maintainable infrastructure. If you follow through, you will understand the problems.
Perhaps what we want is a build system (a la make) that builds your software and constructs a virtual machine image for it, including the performing of all your tests within the running virtual machine.
You're thinking along the right path. But you can't fully control the environment on all third party hosts/cloud providers, so how do you ensure that the code works on all target infrastructure? There needs to be a unifying abstraction to tie them together, something static enough to maintain a fixed target for service authors.
I would hazard that, for a given set of axes in the multivariate formula we'll call "control" for short, you definitely can be fully in control. The question is: "does what I'm doing fit within the bounding space I can exert full control over?"
Salt as an 'ecosystem' has capabilities that extend further into this area. Salt Cloud and the Salt Bootstrap script are two of the things that, I feel, invalidate some of your argument: tools like Salt (I won't comment on the Chef & Puppet ecosystems) are capable of operating much closer to your aim than you give them credit for.
For starters, I'd quit before letting a boss tell me their precious snowflake AMI was 'more stable' and that I shouldn't waste any time trying to ensure I can recreate it should someone delete the image. In my own workflow, Salt is step 1 in any system. I build a standalone salt configuration that can, from a bare OS image on first boot, initialize salt and then pull the system forward to the desired state. Then, for cloud roll-out, the next step is to take an image of that VM/container, which I can then replicate much faster.
And one of my long-term plans is to hammer Buildbot, Salt, Salt Cloud, and a lot of time into a system that gives end-to-end control. Now if only I could code a decent UI for it... putting Bootstrap on months of work would feel cheap, lol.
You definitely can be fully in control. The question is "does what I'm doing fit within the bounding space I can exert full control over?"
Right. It's probably fair to say that lots of present-era PaaS doesn't give you a large bounding space. Perhaps cloud providers always limit your space. Your own hardware can even provide limitations. But within that which you control, it's extremely important to version, package and test configuration sets .. or you wind up with a wide class of tangential issues.
Salt Cloud and the Salt Bootstrap script [are] capable of operating much closer to your aim than you give them credit for.
You may be right.
... waste any time trying to ensure I can recreate it...
If you can't automate the generation of your environment, yet you want to maintain the systems that you produce, and they are of reasonable complexity, then IMHO you are asking for trouble in the long run. This is something that took me a while to learn.
"Chef works atop ssh, which – while the gold standard for cryptographically secure systems management – is computationally expensive to the point where most master servers fall over under the weight of 700-1500 clients. Salt’s approach was far simpler."
Does that assertion about Chef somehow not apply to Ansible?
On the use case:
"I have this command I want to run across 1,000 servers. I want the command to run on all of those systems within a five second window. It failed on three of them, and I need to know which three."
Well, ansible by default runs with paramiko, which is a Python implementation of the SSH protocol. It will also keep connections open for multiple commands. It also has a pull mode, and a fireball mode which uses 0mq:
However, you're not forced to use this. In the beginning, you can just seed your CentOS or Debian with a Kickstart or seed file and then run your initial thing with ansible simply over ssh (using all the goodies: ssh-agent, passwordless ssh, etc.).
One huge plus for ansible is also that it uses YAML, which is rather simple. I've been following both projects for >1 year and it seems that recently ansible has picked up a lot and will probably win the "race" (IMO).
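For a sense of what that YAML looks like, here is a minimal playbook sketch (hostnames, package, and file names are all illustrative, not from the thread):

```yaml
# Hypothetical Ansible playbook: install nginx, drop in a site config,
# and restart the service only when that config actually changes.
- hosts: webservers
  become: yes
  tasks:
    - name: Install nginx
      apt: name=nginx state=present
    - name: Deploy site config
      copy: src=files/mysite.conf dest=/etc/nginx/conf.d/mysite.conf
      notify: restart nginx
  handlers:
    - name: restart nginx
      service: name=nginx state=restarted
```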
Salt also uses yaml for its configuration backend (by default). You can also write your state in Python if you prefer, with all the power that that brings (including pulling data from databases, remote API calls, or whatever you like).
I'm going to go out on a limb here and say that the 700-1500 clients limitation is a non-problem for the vast majority. It's like the C10K "problem" all over again. Newsflash: most shops don't have more than a few dozen servers, if that, and the few ones that do must have already done their homework.
Policies (read "perl scripts" + "file templates") are fetched from a central location, which could be a git repository, a SSH server, an rsync export, or similar. Then they're compiled and executed locally.
Surprisingly powerful, plus you get the power of Perl + CPAN.
While I would prefer ansible as well, the author does show why ansible is problematic for some people:
> Chef works atop ssh, which – while the gold standard for cryptographically secure systems management – is computationally expensive to the point where most master servers fall over under the weight of 700-1500 clients
That said, unless you must have 700+ simultaneous slave connections, you should probably make life easier on yourself and choose ansible.
And if you don't like Python, you usually don't have to touch it. The tasks you run on the server can be any kind of executable. They read some JSON from stdin, do their action and write some JSON to stdout. Whether that program is a Python/Perl/Ruby/Lua script or a Go binary doesn't matter.
I was frustrated with Puppet when I first started. All I wanted was a VM to install a few things so I could do some development and not have to worry about managing my VM.
It turned out to be a rabbit hole. As soon as I thought I learned just enough to get it running, something else popped up that stopped me.
That's why I created PuPHPet. So far the reception has been fairly positive.
At one point in my learning, I got fed up and tried Salt. I couldn't get the Salt hello-world running. I followed the directions to a T. If your tutorial is incorrect, or hard to follow just to get the most basic version up and running, it will turn people away.
I think salt is neato, but I also find it very frustrating to use! (Possibly through no fault of salt itself - I feel like I must be missing something.)
I am generally able to SSH into a box and get things configured the way I need. However, I have huge amounts of trouble translating that into salt scripts.
Consider logrotate. Here is the only documentation I can find on the topic. From this, I have no idea what to put in init.sls to make sure a given log file is rotated correctly. It seems this would work on the cmdline, but not necessarily in a salt script.
And that's just for logrotate! My uswgi + nginx configuration - translating that into salt - I don't know where to begin.
How do I make sure things get installed in a certain order? (Answer seems to be having 10 directives, for 10 packages, each depending on another, to enforce order.)
Is there anything that more closely mirrors what I actually do when configuring the box? SSH in, set certain values, etc? I guess I could write a shell script (or use fabric) but then I seem to have lost the point of configuration management.
As one of the people on that list who may well respond there, I'll reply here as well so it's on the record.
You will be very well off if you read and 'digest' the Salt docs on States first, before moving on to modules, pillars, grains, custom returners, etc.
What you probably need to do with logrotate is take the configuration that you normally set up on your servers, then add it to your salt system. So top.sls calls 'logrotate', running 'logrotate/init.sls', and that has a definition that says "I want logrotate installed, I want it running as a service, and by the way take the file 'logrotate/config.conf' and shove it in /etc/logrotate.d/ as <correct filename>; P.S. if I change that file, restart logrotate".
With States & the requisite declarations to enable salt to know what order things need to be in, you shouldn't have much trouble adding a simple service like logrotate along with a specific config file to use for that service.
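Sketched as a state file, that description might look like this (the destination filename is an assumption, and note that treating logrotate as a long-running service mirrors the comment above; on many distros it is actually cron-driven):

```yaml
# logrotate/init.sls - hypothetical sketch of the state described above
logrotate:
  pkg:
    - installed
  service:
    - running
    - require:
      - pkg: logrotate
    - watch:                       # restart if the managed config changes
      - file: logrotate-config

logrotate-config:
  file.managed:
    - name: /etc/logrotate.d/myapp
    - source: salt://logrotate/config.conf
```

The `require` and `watch` requisites are also the answer to the ordering question: salt orders states by their declared dependencies rather than by position in the file.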
I'm still looking for a configuration management system that doesn't assume that the first step towards managing servers is to add a new "master" server. From the thread, ansible looks promising. In the meantime I'll keep using chef-solo until opscode kills it.
Thanks. It may even be that a salt master is lightweight enough to be worth configuring. But I should note there's a difference between "can run" and "intended to run."
It's as much a semantic thing as a technical thing. Instead of thinking about an unconfigured node as a "minion" awaiting orders and provisions from central command, I prefer to think of a node like a stem cell, fully capable of differentiating itself based on signals that it receives. You need a way to update the DNA and a way to send the signals, that's it.
This may seem like a meaningless difference, since there is still value in centralized services (package repositories, security, reporting, monitoring). But it's still a subtly different focus and over time yields different results.
For my part I think the distributed, organic "stem cell" way of thinking will win out over "master/minion" in the long run.
The thing that bugs me about salt is the almost complete lack of testing/coverage. They had tons of egg-face releases for crypto bugs, upgrade issues, things a basic test suite would have solved. I'd rather not trust my production environments to something that's a roll of the dice of whether its working, secure, or upgradable on a given release.
I used Puppet for a few years (and created a few modules for it https://github.com/puppetmodules). I switched to Salt a year ago. My main motivations were its simplicity (YAML+jinja), lower memory consumption, source code that is easier both to read and to contribute to, and its support for both push and pull based architectures.
Chef works atop ssh, which – while the gold standard for cryptographically secure systems management – is computationally expensive to the point where most master servers fall over under the weight of 700-1500 clients.
It doesn't have to be this way. The situation where one host repeatedly needs to talk to hundreds via SSH is precisely where the SSH ControlMaster socket shines. This saves you a ton of overhead by not having to start up and tear down the session every time you want to issue a command via SSH.
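ControlMaster is enabled with a few lines of client config; the values below are reasonable defaults rather than recommendations:

```
# ~/.ssh/config
# Multiplex repeated SSH sessions to the same host over one authenticated
# connection, skipping the costly per-command session setup.
Host *
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 10m
```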
I often use this trick on busy Nagios servers that execute many active checks via SSH -- it works well.
Agreed. I work with Chef and it is my tool of choice but don't think that Puppet sucks.
The important thing is someone uses some form of Configuration Management. If people find Salt easier than Chef/Puppet/CFEngine/Ansible then great. At least they have something to build/scale their infrastructure.
I forgot to mention that just installing Ruby is a HUGE PITA on anything but the most common OSes. It took me 3h last week to get it installed on a CentOS box. And I don't even want to try to get it running on our Solaris hosts...
Point is: Python is the number one scripting language (after bash) for sysadmins just like Perl used to be.
Yes, I needed 1.9.3 for this silly software. And I had a slightly special setup, so rvm failed to compile. I'm also very overwhelmed by rvm, gem, bundler, etc... Python has pip, easy_install (old) and virtualenv, which are just easier for me to understand. Ruby is too much magic and tries to do everything automatically (IMO).
Red Hat recently released Ruby 1.9.3 packages as part of "software collections"; I assume CentOS is rebuilding these and making them available the same as they do for other Red Hat Enterprise Linux packages.
As a developer that primarily works on ruby, I can see it if he wanted 1.9.3. Most of the references you'll find on that tell you to use @environment_wrapper_of_the_month, and those are deep rabbit holes for people who know nothing about the language.
rbenv is a great way to get whatever version of ruby you want up and going fast. Install a couple dependencies, clone the git repo, and run the install.sh. I went from never using it to running on CentOS 6 in a half hour.
> It took me 3h last week to get it installed on a CentOS box
This matches my experience with a significant amount of software on CentOS. Since CentOS is just a rebuild of RHEL, and RHEL is extremely conservative when it comes to new software, CentOS tends to be out of date at release and get progressively worse.
I support Ruby apps so Ruby is on every server anyways.
With that said, you can use the Omnibus Chef Installer now, which includes a copy of Ruby just for Chef. Good for servers where you don't need Ruby, or small servers that would take a while to compile a newer Ruby.
One doesn't need much ruby to get going with puppet. You need to learn the declarative puppet DSL, which is arguably an even bigger barrier. Automatic list expansion, for example, is a nice feature. But you can't have independent or nested loop variables, as you would with an imperative loop. So you wind up having to break each list into its own definition and then combine them all. The sysadmins I worked with were definitely not keen on having to think that way on a regular basis just to get some things done.
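A small sketch of what "automatic list expansion" means in the puppet DSL (package names are illustrative):

```puppet
# One resource declaration fans out over the whole list:
package { ['git', 'vim', 'htop']:
  ensure => installed,
}
# But there is no imperative loop, so pairing two independent lists
# (say, each package with its own version) means writing a define and
# declaring it once per combination instead of iterating.
```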
That said, I also disagree that puppet "sucks." It's good at what it claims to be good at so long as you can deal with its quirks.
This article makes a claim (...Puppet...Suck(s)), but does not even attempt to explain what it is that sucks.
What, specifically, "sucks" about Puppet and Chef, and what is so much "simpler" about Salt or Ansible? As an Ops guy who has been running Puppet since 2008 (and Chef most recently) against hundreds of servers, I don't see the simplicity reflected in the documentation, nor do I find Puppet or Chef particularly complicated.
(Ok, Chef's attributes system is a bit confusing at first, but it is hugely powerful.)
> Chef works atop ssh, which – while the gold standard for cryptographically secure systems management – is computationally expensive to the point where most master servers fall over under the weight of 700-1500 clients. Salt’s approach was far simpler.
I think I'm with you (without the experience): I find "it works over ssh" a lot simpler than "we wrote a custom protocol on 0mq." Simplicity apparently has lots of interpretations. I couldn't care less if ssh performs well enough to support a trillion connections. In practice, you only need a handful, usually one.
Maybe Salt is fantastic. I'm not really in a position to judge. The article made it sound interesting to me, but I'm not sure attacking Chef/Puppet was really necessary, especially since it wasn't really expounded on.
There are words written about "automatic configuration" in that link, but I don't see any guidance or information on what those configuration tools are. The focus is certainly not on automatic configuration: the focus is on the use and re-use of images, on taking images, doing something to them, and getting new images.
The notion seems deeply flawed to me: using an image as a precondition for making an image becomes, over time, an intractable mess, and it requires very careful supporting documentation to keep the scheme from devolving into a bunch of "buckets of bits" with no transparency into the work that made them that way.
The #1 thing that I enjoy about automated tooling is that I can take a bare OS and spin up a complete new system in a matter of minutes, and I get to watch that entire process happen before my eyes. There's no mystery, no external dependencies, no existing work I'm riding off of: everything that happens is visible to me in an immediate way.
There's value and use in immutable images, but please decouple your image-making from past images: no one wants to root around to figure out what twelve horrible things you did to install Java 9 on image instances back whenever, nor are they going to have any fun reproducing it on the twenty-nine active variants of that ancestor image when there's a security fix to be done.
I use SaltStack for managing a render farm consisting of 73 Ubuntu nodes. My requirements are rather simple, really: most states just install some packages, put configuration files into place (sometimes using a template) and enable/start services. However, I can't recall a single problem when setting everything up. SaltStack is clean, simple, and just works.
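For anyone curious what states like that look like, here's a minimal sketch (package, file, and service names are placeholders) of the install-package / manage-config / run-service pattern:

```yaml
# /srv/salt/nginx/init.sls -- hypothetical example state
nginx:
  pkg.installed: []
  service.running:
    - enable: True
    - watch:
      - file: /etc/nginx/nginx.conf

/etc/nginx/nginx.conf:
  file.managed:
    - source: salt://nginx/nginx.conf.jinja
    - template: jinja
    - require:
      - pkg: nginx
```

The `watch` requisite restarts the service whenever the managed file changes, which covers most of the "put config in place, reload" workflow.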
This is timely! I've just started writing a set of salt states to capture the set up of my new dell xps 13 (sputnik) so I never have to go through the pain of setting up xmonad, emacs and various other development environment stuff again.
What I really like about salt is that everything is in one place and all goes towards building the same data structure that everything runs off.
Both Salt and Ansible look interesting. It's much easier to define system state using Ansible or Salt than Puppet.
However, I am not sure how one would use Ansible where VMs get launched dynamically (a private cloud/virtualization fabric where devs can instantiate systems) and then receive their configuration without any manual steps.
For example, one can create kickstart/VM-images which get a hostname based on a certain regex pattern and register with a Puppet master; the Puppet master auto-signs certs matching this specific hostname pattern, and the client nodes then receive their catalog. This is a really useful pattern wherein systems pull their configuration state almost immediately after boot. It requires manual setup only when writing the kickstart/VM-image profile and the Puppet master configuration.
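The auto-signing half of that setup is just a whitelist on the master; a sketch (domain name made up, and the path varies by version/distro):

```
# /etc/puppet/autosign.conf on the Puppet master:
# CSRs whose certname matches this glob are signed without manual approval
*.internal.example.com
```

The obvious caveat is that anything able to present a matching certname gets a signed cert, so the pattern should only cover hosts you actually control.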
Ansible's SSH keys setup requires manual intervention, however, I think it can be automated using pre-defined keys in kickstart/VM-images. Haven't tried it yet though...
Yes, having predefined keys in your VM images does the trick, and is exactly what we do for (almost) zero-intervention deployments of our servers in my particular environment.
We tend to destroy and recreate servers more often than we scale out, so we haven't bothered to remove the manual step of adding the server's hostname to the ansible inventory_hosts file. However, that's easily automatable...
Ansible will _execute_ your inventory_hosts file if it's executable, and IIRC it just needs to return a JSON or YML data structure representing all your servers and the groups they're in. So, as long as you have a library which can query your infrastructure (e.g. boto for EC2 etc) it's not hard to automate this.
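A minimal sketch of such an executable inventory script, assuming the `--list`/`--host` convention; the hostnames and groups are made up, and a real version would query your infrastructure (e.g. boto against EC2) instead of returning a static dict:

```python
#!/usr/bin/env python
# Sketch of a dynamic Ansible inventory. Ansible invokes the script
# with --list and expects JSON mapping group names to their hosts.
import json
import sys

def build_inventory():
    # Stand-in for a real query against your cloud provider's API.
    return {
        "webservers": {"hosts": ["web1.example.com", "web2.example.com"]},
        "dbservers": {"hosts": ["db1.example.com"]},
    }

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "--host":
        # Per-host variables; empty in this sketch.
        print(json.dumps({}))
    else:
        print(json.dumps(build_inventory()))
```

Mark the file executable, point Ansible's inventory at it, and the group structure is rebuilt on every run.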
What I prefer about Ansible above all others, besides its simplicity, is that its use case scales up and out. By that I mean Ansible can be used for platform/app-stack provisioning while OS/infra sysadmins use another tool. Too often an agent-based approach causes a conflict between OS sysadmins and the platform/app team regarding ownership/sharing. I want to offer self-service as much as possible.
Further, most config management tools are monolithic in that they want to manage all servers as though a single team/overlord manages them all, rather than various independent sysadmin teams. With various independent teams it's just too much hassle trying to share roles appropriately or set up separate masters/agents.
Ansible does not have these issues.
We're actually a Windows-centric shop and have been actively evaluating configuration management solutions for Windows-based virtual machines. Initially, we were only looking at Puppet, Chef, and a commercial product called uProvision along with Vagrant. I was surprised to find that Salt had a real community behind it.
Our greatest challenge has been coming up with a tool which can manage images for both VMWare and Microsoft Hyper-V. This article introduced a web integration between Salt and libvirt called Salt-virt. Has anyone tried this interface for managing images? Does it work better than the young integration between Vagrant and libvirt?
For anyone else looking, I just found Foreman. This seems to do exactly what we're looking for, but it uses Puppet instead of Salt. Even if it requires a Linux server, making our solution more complicated, it appears to meet our needs very nicely.
I use fabric for everything just because I don't have time to learn another one of these technologies. This Salt article seems great - but at the end of the day (and I may be way off base here) all I want to do is install a given version of a piece of software on my server. I don't want to create a recipe (Chef), or learn another configuration format (sounds like I would need to do this with Salt), etc. My fabric file really seems to do only three things: use pip to install shit that is Python (I use Django), use apt-get to install anything that is Ubuntu-specific, and make wget calls to various pieces of software, pull them down, and build them from source. Until there is an easy way for me to do this without needing to learn yet another technology, I will continue to use fabric (or, until the job of doing this gets so big I can hire a dev ops guy that actually already knows it, but I am not there yet :) ). Sorry for the rant, it's just every time I see these articles I wish I had time to learn the technology, and then I realize I don't.
So - is it just me, or does there seem to be a big/huge learning curve for all of these dev ops technologies?
YAML is a hair above "properly indenting my templates" as far as complexity goes. You write Django templates, you can handle YAML ;-)
As current 'devops guy' on a django project myself, salt works wonderfully. Salt has states available that let you setup all that software, create the virtualenv you need (including telling it you want to use the requirements.txt that you pulled down with your django project source code - Salt gives me my own little Heroku :D ) and for anything left in those wgets you can throw a block of salt cmd.run calls using specified ordering to enable them to run neatly in the sequence you desire.
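For reference, a sketch of what that virtualenv-plus-ordering pattern can look like (paths and names are placeholders):

```yaml
python-virtualenv:
  pkg.installed: []

/srv/venvs/myapp:
  virtualenv.managed:
    - requirements: /srv/myapp/requirements.txt
    - require:
      - pkg: python-virtualenv

build-assets:
  cmd.run:
    - name: make assets
    - cwd: /srv/myapp
    - require:
      - virtualenv: /srv/venvs/myapp
```

The `require` requisites chain the steps so the commands run in a fixed order, which is what replaces those sequential wget/build steps.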
These technologies come into their own when you are managing large fleets of servers; but by the same token, you need a lot of sophistication to make large fleets of heterogeneous servers work. When you have five servers these tools seem baroque and overly complicated, but when you have five thousand servers...
> MCollective (which Puppet Labs acquired several years ago) was (and remains!) fiendishly complex to set up.
I didn't find MCollective hard at all - you just install some debs, a message queue server (Stomp was easiest at the time - it's now deprecated, but surely is not much different to RabbitMQ?) and it Just Worked for me. And there was a great screencast.
Did it get far more complicated since I used it last?
To be honest, I've never had an MCollective deployment Just Work(tm). It's always taken some serious debugging to figure out what the heck went wrong this time.
RabbitMQ works but is slightly problematic since the authors have a morbid penchant for not wanting to support anything but Apache ActiveMQ. Ask a question about MCollective and RabbitMQ and the answer you get is 'switch to ActiveMQ'.
Well, RabbitMQ is kind of a pain to get MCollective working with. Additionally, the Puppet modules for RabbitMQ and MCollective don't really work that well together (read: I had to rewrite the ones I found to get them working).
I'm starting work at Puppetlabs in exactly a week's time as part of a brand new "module team" and I personally promise you here and in the open that I am going to tackle the puppetlabs-rabbitmq module and fix it so that it actually works and isn't an abandoned wasteland.
Come to that I'm hoping we can start building out a full set of mcollective modules to replace the existing ones that will be fully supported and kept up to date so that getting mcollective running will be as easy as including a class and waiting.
A huge part of this job is ensuring community patches get merged in and contributors get treated as I would like to be treated when contributing to a project. I hope we can reverse your experience with modules within a few months (I took this job because I've been in exactly your position, grabbing official modules and having them not work at all!)
PowerShell works great for executing commands on arbitrary servers (which sounds like the basis of Salt), but it'd be great to declaratively say "I want the server in this state" like the config management side of Salt. I assume there is a tool built atop PowerShell like this somewhere?
From skimming the top level of comments, it seems most people don't like these tools. Fair enough.
That said, on-topic, I just wanted to say that having tried Puppet, Chef and Salt, I've found Salt the easiest to use. Straightforward installation (no messing with Ruby versions/rvm/etc.), really simple setup (systemctl start salt-master; systemctl start salt-minion; salt-key -L; salt-key -a yourbox; done), and the YAML-based configuration syntax has been a breeze to work with.
Really quite pleased with it; it's made getting a few of my hairer boxes under control much easier than I expected (and much easier than I found with Chef or Puppet).
If someone can't get over the Ansible startup costs, then they have no business managing a system... I can't speak for Salt (last time I tried to use it was in 2010, and it was atrocious then and I haven't gone back).
What a bonehead I am. This is what happens when you say to yourself, "that was about 2 years ago" and you base it on 2012. My only excuse is that I was working on a system config & management product and that I'm having a busy day... I upvoted you for correcting my record.
Good to see tools that work as a system configuration framework and also allow command execution.
[ControlTier](http://www.controltier.org/) had (don't think it's actively developed now) options to execute general system commands, configure systems and application deployment. But it was fairly complex and required [ant](http://ant.apache.org/) skills.
For years I have been trying to spread the gospel of bcfg2 because, while not perfect, I thought it was a more complete system than Puppet or Chef. Bcfg2, however, has some big warts of its own, and it never really caught on.
In the past few months I've been slowly converting to SaltStack, and it really is everything I ever dreamed of for a CM system. Fast, easy, real-time. Lovin' it.
If bcfg2 was a complete system, it never caught on because the documentation was entirely missing. Every time I looked at it, I blocked on actually getting anything done because I couldn't find an equivalent to these reference manuals:
I use Salt to run commands across our EC2 environment to do things like restart Varnish, clear logs, and run updates. Paired with Unison for file synchronization it works well when your auto-scaling kicks in and you need your new AMI to be synched from staging.
You are wrong on all counts here. Salt supports the Jinja2 template engine, so you can template your states. You can define custom states in Python. In the root configuration file (top.sls) you target configuration based on hostname or grains (machine-specific information).
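For example (the hostname glob and state names here are made up), top.sls targeting looks like this:

```yaml
# top.sls: target by hostname glob, or by grain
base:
  'www*.example.com':
    - webserver
  'os:Ubuntu':
    - match: grain
    - common
```

And because Jinja is rendered before the YAML is parsed, a state file itself can be templated to avoid boilerplate:

```yaml
# webserver/init.sls: a Jinja loop generating repetitive states
{% for pkg in ['nginx', 'memcached'] %}
{{ pkg }}:
  pkg.installed: []
{% endfor %}
```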