
Salt/Puppet/whatever. I ignore them all. Why? I have put a lot of thought into this area.

IMHO, the overwhelming problem with salt/cfengine/puppet style solutions (which I will refer to as 'post-facto configuration tinkerers', or PFCT's) is that they potentially accrue vast amounts of undocumented/invisible state, thereby creating what I refer to as configuration drift.

IMHO, a cleaner solution is to deploy configuration changes from scratch, by deploying clean-slate instances with those changes made. In addition, versioning one's environment in this way creates an identifiable point against which to execute automated tests. (This class of solution I refer to as 'Clean-slate, Identifiable Environments' or CSIEs.) Examples are Amazon AMIs and any other kind of versioned/identified VM.
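
To make that concrete, a minimal sketch of what the deploy step might look like under this model, assuming boto3 and a pre-baked image; the image ID, tag names and instance type below are illustrative placeholders:

    import boto3

    ec2 = boto3.resource("ec2")

    # The versioned, clean-slate environment: a pre-baked image identified
    # by an explicit version label (both values are hypothetical).
    ENV_VERSION = "platform-v42"
    IMAGE_ID = "ami-0123456789abcdef0"

    # Deploying a configuration change means launching fresh instances from
    # the new image, never mutating the ones already running.
    instances = ec2.create_instances(
        ImageId=IMAGE_ID,
        MinCount=2,
        MaxCount=2,
        InstanceType="t3.micro",
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "env_version", "Value": ENV_VERSION}],
        }],
    )
    for instance in instances:
        instance.wait_until_running()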

PFCTs' deployment paradigm tends to be relatively slow and error-prone. CSIEs tend to be fast and atomic. PFCTs are headed for the dustbin of history. They are temporary hacks that clearly grew from old-school sysadmins' will to script. CSIEs embrace modern-day devops as more holistic entities that embrace virtualization and recognize the integrity of the environment as critical to preventing the ridiculous number of environment-induced, service-level issues that are an expensive tangent to service development, testing and deployment. Thus, I would argue that what we are looking at with PFCTs is a failed paradigm, and with CSIEs, a real and current opportunity for something far more elegant.

(Disclaimer: I haven't tried Ansible or Vagrant first-hand, but they do seem to be PFCTs to me.)




Computer systems in the wild are stochastic - that's the fundamental assumption that leads to the design of tools that you call "PFCTs" here.

This type of design leads to tooling which is relatively slow and often obtuse, but there are well-reasoned (and researched; start maybe with [1]) assumptions behind those tools. In particular: systems modeled by these tools are designed to run for years, to include a large variety of hardware and software, to be worked on by a lot of different people, and to be able to grow to massive size.

The "clean slate" idea is a good one, and has been in use for a long time (the idea of gold images goes way, way back). But that's for initial system state, not dynamic system modeling. The cfengine family of tools grew out of limitations in the "clean slate" approach for real-world system models.

If your concern is "how can I deploy my Rails site to EC2 quickly and reliably," you'll have different goals and assumptions on the tools you need, than if your concern is "how can I grow a resilient infrastructure". (edited to add: both are completely legitimate goals)

All that aside, I definitely agree that the tools still suck :)

[1] http://cfengine.com/markburgess/papers/sysadmtheory3.pdf


Computer systems in the wild are stochastic

Try telling a client who wants 100% uptime that! Computer systems can be many things - it could be (and probably has been) argued that we as programmers fundamentally aim to reduce random qualities and increase stability within our programs.

there are well-reasoned (and researched) assumptions behind these tools

I can see where PFCT's came from. I took a look at the paper (which is exclusively about policy-driven management systems), but don't think this invalidates my points, re: paradigm failure. Taking the same policy based management systems and applying them to a CSIE environment (say, with corosync/pacemaker) is one way to combine their benefits. (That's what we do on our own infrastructure, actually.)

If your concern is...

Right, differing concerns. You also have different life expectancies for instances, purposes (dev/staging/prod), availability expectations, etc. But these are tangents! "Developing good software comes down to consistently carrying out fundamental practices regardless of the particular technology." - Paul Duvall

We attempt to remove an entire class of issues related to the deployment environment by changing our platform engagement paradigm from one that is procedural/'stochastic' to one that is more atomic/reliable. At the same time, this facilitates a clean segregation (and versioning!) of deployment environments versus service development (which may target one or more CSIEs).


>> Computer systems in the wild are stochastic

> Try telling a client who wants 100% uptime that!

A client who asks for 100% uptime will end up disappointed.

> We attempt to remove an entire class of issues related to the deployment environment by changing our platform engagement paradigm from one that is procedural/'stochastic' to one that is more atomic/reliable.

I'm afraid I don't get the difference. You've argued that configuration drift is a major problem with configuration management tools, but I don't see why the same discipline that works for deploying full system images couldn't equally be applied to stepped upgrades made by CM. Allowing config changes to be made directly to production servers instead of going via the deployment tool is the problem there, not anything fundamental to which deployment style has been used.

With puppet configuration (for instance) in version control, it's not a problem to test against a known, versioned, identifiable and auditable environment. As long as you're not applying config changes to live servers, the switchover to a new environment is equally atomic either way.

Given that the trade-off is building, storing and deploying full system images, I don't think there's a fundamental advantage to full-system deployment that can't be matched by a thought-out application of conventional configuration management.


A client who asks for 100% uptime will end up disappointed.

Sure.

[...] I don't think there's a fundamental advantage to full-system deployment that can't be matched by a thought-out application of conventional configuration management.

OK, on the face of it, this is a fair line of reasoning. If we assume, however, that we are looking for replication (say, for the purposes of regression testing, etc.), then we really do need to know that the entity in question is the same as the last time it was generated/instantiated. With the process you propose, there is clearly higher risk here. That is a paradigm weakness.


I think you're right that Salt/Puppet and to a lesser extent Chef take the wrong approach, but you make some confusing comments that make me suspect you might not understand what these existing approaches are about.

> IMHO, the overwhelming problem with salt/cfengine/puppet style solutions (which I will refer to as 'post-facto configuration tinkerers', or PFCT's) is that they potentially accrue vast amounts of undocumented/invisible state

I think you mean "post-facto" as in: "run after everything is done"? This is not the way that people would advocate Puppet should be used. Puppet should be used from the start, not added as an afterthought once you are done.

> IMHO, a cleaner solution is to deploy configuration changes from scratch, by deploying clean-slate instances with those changes made

This isn't a cleaner solution; this is almost the solution you get when you use Puppet. With Puppet, the development workflow is like this:

- Spin up a Vagrant VM

- Run your manifests against this VM to test them

- To check in, run your manifests against your staging environment

- To deploy, spin up new clean production VMs and run your Puppet manifests against them

- Use a reverse proxy to route all traffic to the new production VMs. Terminate the old production VMs
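
A rough sketch of that final switchover step, assuming boto3 and a classic ELB in front of the fleet; the load balancer name and instance IDs are placeholders, and connection draining is glossed over:

    import boto3

    elb = boto3.client("elb")
    ec2 = boto3.client("ec2")

    def switch_over(lb_name, new_ids, old_ids):
        """Route traffic to the freshly provisioned VMs, then retire the old ones."""
        # Attach the new, freshly provisioned instances to the load balancer.
        elb.register_instances_with_load_balancer(
            LoadBalancerName=lb_name,
            Instances=[{"InstanceId": i} for i in new_ids],
        )
        # Detach the old instances so no new traffic reaches them...
        elb.deregister_instances_from_load_balancer(
            LoadBalancerName=lb_name,
            Instances=[{"InstanceId": i} for i in old_ids],
        )
        # ...and terminate them once drained (a real setup would wait out
        # the connection-draining period before doing this).
        ec2.terminate_instances(InstanceIds=old_ids)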

> PFCTs' deployment paradigm tends to be relatively slow and error-prone.

This much is true. Puppet's slow speed is particularly galling, but maybe that's just because I use it at work.


> isn't a cleaner solution, this is almost the solution you get when you use Puppet

The difference between deploying an instance of a stored environment and generating that environment from some prior state is the generative process, which can fail or change in unexpected ways due to network conditions and other factors.

More importantly, PFCTs enable and to some extent encourage modification of generated environments remotely, en-masse, without any significant capacity to ensure that individual instances within a group have not subtly shifted in configuration. This is what I meant by configuration drift.

CSIEs, by contrast, are essentially the complete product of the entire generation process, thus ensuring that future instantiations are identical. A subtle difference, but an important one.
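
For illustration, a minimal check of that property, assuming boto3 and instances tagged with an environment version; the image ID and tag values are placeholders:

    import boto3

    ec2 = boto3.client("ec2")
    EXPECTED_IMAGE = "ami-0123456789abcdef0"   # the identified environment (hypothetical)

    # Every running instance in the group should have been instantiated from
    # the same identified image; anything else is, by definition, drift.
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "instance-state-name", "Values": ["running"]},
            {"Name": "tag:env_version", "Values": ["platform-v42"]},
        ],
    )
    for reservation in resp["Reservations"]:
        for inst in reservation["Instances"]:
            if inst["ImageId"] != EXPECTED_IMAGE:
                print("unexpected image on", inst["InstanceId"], inst["ImageId"])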


It seems that

1) You have not dealt with large enough data, since you advocate just creating VM copies or snapshots. Try that on 10 or 100 TB of data.

2) You haven't thought about what and how those initial CSIE configurations are generated. Do you hand-tweak everything, make && make install all the software onto a particular installation of a particular OS, then just spawn those? It seems like that should go to the dustbin of history. You now essentially have a black box that someone somewhere tweaked, and you have no recipe for how to repeat it. If that person left the company, it might be tricky to understand what was installed, where, and at what version.

If you have "configuration" drift, there needs to be a fix to the configuration declaration, and people shouldn't be hand-editing and messing with individual production servers. If network operations fail in the middle, then the configuration management system needs to have better transaction management (maybe use OS packages instead of just ./configure && make && make install) so that if the operation fails, it is rolled back.
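
To sketch what I mean by transaction management - purely illustrative, not how any particular CM tool implements it - back up the old state and restore it if the change fails:

    import shutil
    from contextlib import contextmanager

    @contextmanager
    def transactional(path):
        """Back up a config file and restore it if the change inside the block fails."""
        backup = path + ".bak"
        shutil.copy2(path, backup)
        try:
            yield path
        except Exception:
            shutil.copy2(backup, path)   # roll back to the known-good version
            raise

    # Usage: if writing the new config (or anything else in the block) fails,
    # the original file comes back and the error still propagates.
    with transactional("/etc/myservice.conf") as conf:   # hypothetical path
        with open(conf, "w") as f:
            f.write("listen_port = 8080\n")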


you advocate just creating VM copies or snapshots. Try that on 10 or 100 TB of data

There are many ways to take an image of an environment, not only VMs or snapshots. But if your system image includes 10-100TB, it could be argued that the problem of size really lies in earlier design decisions.

You haven't thought about what and how those initial CSIE configurations are generated.

On the contrary, generation should be automated. In the same way that a service to deploy to such an environment is maintained as an individual service project, the environment itself is similarly maintained, labelled, tested and versioned as a platform definition.


> In the same way that a service to deploy to such an environment is maintained as an individual service project, the environment itself is similarly maintained, labelled, tested and versioned as a platform definition.

A mix of the two. Use Salt/Puppet/Chef etc. to bootstrap a known OS base image into a stable production platform VM, for example, then spawn clones of that. That's what I would do, and I can see how it would work very well with testing.


Agreed. I think PFCTs are good for the generative part, not for the rest. They should really be part of a build process, not live infrastructure.


> if your system image includes 10-100TB, it could be argued that the problem of size really lies in earlier design decisions.

Without disagreeing with your conclusion about the design process, it's useful to note that this situation simply isn't a problem for a conventional configuration management tool.

> On the contrary, generation should be automated.

One could argue that Puppet and Chef are ideal tools for performing that automation.


this situation simply isn't a problem for a conventional configuration management tool

Sure. But loads of other stuff is. The weight of tradeoffs is clearly against PFCTs here.

One could argue that Puppet and Chef are ideal tools for performing that automation.

Absolutely agree - but not within live infrastructure. Only build.


The difference between deploying an instance of a stored environment and generating that environment from some prior state is the generative process, which can fail or change in unexpected ways due to network conditions and other factors.

CSIEs are still a generative process. The difference with what you call a PFCT is that the generative process isn't swept under the rug and codified into a versioned image unless it's really necessary for performance reasons.

The result is that it's easy to maintain a clear distinction between machine state and human instructions. For a trivial example: a list of packages that humans decided are necessary for the system versus the final output of 'dpkg -l' after all dependencies have been resolved.
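
As a toy sketch of that distinction, you can diff the human-maintained list against what dpkg actually reports; the file name and format here are made up for illustration:

    import subprocess

    # Human intent: the packages we decided the system needs.
    with open("packages.txt") as f:
        wanted = {line.strip() for line in f
                  if line.strip() and not line.startswith("#")}

    # Machine state: everything dpkg says is installed, dependencies included.
    out = subprocess.run(
        ["dpkg-query", "-W", "-f", r"${Package}\n"],
        capture_output=True, text=True, check=True,
    ).stdout
    installed = set(out.split())

    print("declared but not installed:", sorted(wanted - installed))
    print("installed but never declared (incl. dependencies):",
          sorted(installed - wanted))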

With chef/puppet/etc. the code used to generate instances represents a human-created description of what the environment is supposed to look like, with as much version-control and referenced documentation as is necessary. With a versioned-image approach, all you have is the one-dimensional history of the image in question.


I fully advocate the description of build steps for environments, just as PFCTs encourage. However, the use of PFCTs to prepare and manage environments seems .. suboptimal, in terms of potential for issues. I suppose a PFCT could be useful as a means to automate the generation of environments, but ... IMHO ... it should not be used for the live instantiation/configuration of real infrastructure (which should be more atomic, from some versioned/known quantity). A subtle difference, but important.


> This is what I meant by configuration drift.

I forgot to mention this before, but it is strange that you credit yourself with defining this term when it has been well defined for some time in ops.

> More importantly, PFCTs enable and to some extent encourage modification of generated environments remotely, en-masse, without any significant capacity to ensure that individual instances within a group have not subtly shifted in configuration.

But this adds significant weight over and above the "generative process" of running manifests. Yes, running manifests against your VMs can "fail or change in unexpected ways due to xyz" - so don't do it against VMs that are currently in production! I'm not sure you've ended up with anything less error-prone, and you're still going to need a way to get from a fresh VM image to your output images - which is where Puppet would come in.

I'd really rather not make the entire disk image my build artefact, for fairly obvious reasons (ie: size).

You might like this, which is written by a colleague of mine, except that it is not in "opposition" to Puppet/Chef/etc:

http://martinfowler.com/bliki/ImmutableServer.html


"Immutable Server" seems like an oxymoron.

Frequent refreshes are great, but your system is only doing something useful once it has "mutated" (i.e. accepted external data to operate on).

The tradeoff system designers have to make is frequency of refreshes vs the cost of transferring interesting data to that server.

Seems like your organization has just coined a new synonym for "gold master"


You can do it the other way round: have your data on direct-attached storage and network mount the root filesystem.


We do well with DRBD as an alternative here .. adequate network redundancy without performance penalties.


"Adequate" is an interesting word :-) We don't consider DRBD in anything other than synchronous replication mode to be reliable, which puts a fair performance penalty on it.


If used properly they should have an identical result.


Perhaps. The problem is, if you use a PFCT on a bunch of hosts and something subtle changes that can cause issues, the granularity of the PFCT doesn't necessarily equate to that required for detecting the cause. With a CSIE-style atomic approach to deployment and a properly segregated monitoring system, you can, say, 'roll back' to the last known good version. PFCTs leak state, and will not always allow you this reverse pathway (random examples might include kernel feature or compiler version migrations).


I also think that rebuilding from scratch is way better than some idempotent CM solution. I work at small scale - I just need to have some services working on a VPS and also manage my Linux box. This is why I think that Docker will be a great solution for me. Spinning up an LXC container from an existing image is really fast - I have done some tests recently. Now I'm setting up a system where I will have portable containers for specific tasks - a nodejs container for a webapp, a postgres db, a Tor machine for anonymous work, etc. All of them will be built using Ansible and versioned using git. Then I can put them all on a VPS and redeploy them independently, using just one machine.


Having a Clean-Slate Identifiable Environment is well and good from a usage standpoint, but from a creation & maintainability standpoint any bucket of bits qualifies as an identifiable environment.

The very purpose of most of these tools is to make state visible, to be able to see what has been poured in. That visibility contributes to extensibility, the ability to take that known configuration starting place and to be able to branch and create new configurations.

If you have infrastructure in mind for how your CSIEs are constructed, I'm all ears, but I'm envisioning something more like sh scripts posted on the corporate wiki as your CSIE implementation.


any bucket of bits qualifies as an identifiable environment

Sure.

The very purpose of most of these tools is to make state visible

Right, but they do a poor (ie. post-facto, limited granularity) job of it. This is why IMHO their paradigm is inelegant.

infrastructure in mind for how your CSIEs are constructed

We use scripts with exit values that execute within the target environment to bring it from a base configuration (eg. some AMI or some distro) through to the desired config, plus validation tests. I think most people's approaches will be similar. PFCTs essentially provide this, and could be used for this step without issue.
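
A bare-bones sketch of the kind of driver I mean; the step scripts, paths and labelling scheme below are purely illustrative:

    import datetime
    import subprocess
    import sys

    BUILD_STEPS = [                     # run inside the target environment, in order
        "steps/10-base-packages.sh",
        "steps/20-app-runtime.sh",
        "steps/30-service-config.sh",
    ]
    VALIDATION = "tests/validate-environment.sh"

    def run(script):
        print("==>", script)
        return subprocess.call(["/bin/sh", script])

    for step in BUILD_STEPS:
        if run(step) != 0:
            sys.exit("build step failed: %s" % step)   # never version a half-built environment

    if run(VALIDATION) != 0:
        sys.exit("validation failed; environment not versioned")

    # Only a fully built, validated environment gets an identifiable version label.
    label = "platform-" + datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S")
    with open("ENVIRONMENT_VERSION", "w") as f:
        f.write(label + "\n")
    print("built", label)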


CSIE sounds great for a certain class of users.

For me, however, it seems like it would be impossible. My team has to manage 10,000+ physical servers. We can't possibly wipe and provision from scratch every time we change configurations.

In other words, if your servers are ON Amazon's platform, use CSIE. If you are RUNNING the Amazon servers, you need something else.


PXE boot is a great way to automatically re-provision physical servers.


If one of your complaints about configuration management systems was that they're slow, this is not the answer.


PXE is fast. I'm not sure what your experiences are, but I'd like to hear them. I've never had an issue with speed.


I may have jumped the gun. I was assuming PXE to boot an imaging environment, which copies the system image to local media so you don't have a runtime dependency on the boot infrastructure. In that case, speed-wise, you've not gained anything because you've got two boot cycles and a system copy before your new server is ready.

If you're not copying the system image and simply network booting a remote image, then that doesn't apply, and yes, that can be fast.


a cleaner solution is to deploy configuration changes from scratch, by deploying clean-slate instances with those changes made.

All of us who have built big cloud-server clusters have dreamed of this plan at least once. But there are big practical problems.

Relaunching infrastructure is easy in theory, but from time to time it becomes very difficult. There is nothing like being blocked on a critical upgrade because your Amazon region has temporarily run out of your size of instance, or because the control layer is having a bad day, or because you've accidentally hit your instance limit in the middle of a deployment, or...

A much bigger issue is that bandwidth is finite, so "big" data is hard to move. This is a matter of physical law. It's all well and good to declare that you're never going to apply a MySQL patch in place: You're just going to launch a new instance with the new version and then switch over. But however fast you manage to launch the new instance (and you will be hard put to launch an instance faster than you can apply a patch and restart a daemon...) you will be limited by the need to copy over the data. Have you ever tried copying half a terabyte of data over a network in an emergency while the customer is on the phone? It is very annoying. Because it is often physically impossible to do it quickly: Cloud infrastructure isn't generally built for that, and when it is it costs money that your customer will not want to spend for the luxury of faster, cleaner patch-application.

A solution to this is to use cloud storage like EBS. Now your data sits in EBS and you just detach its drive and reattach it to a new instance. That actually works okay, provided you're happy with the bandwidth and reliability of EBS, which lots of people aren't – and, as those people will cheekily point out, you have now solved the "relaunches are slow" problem by replacing it with an "everything is uniformly slow" problem. Moreover, detaching and reattaching EBS volumes isn't instantaneous either. You have to cleanly shut down and cleanly detach and cleanly restart, and there's like 12 states to that process, and all of them occasionally fail, and if you don't want your service to go down for thirty seconds every time you apply a patch you need a ton of engineering.
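
To make the number of states concrete, a hedged sketch of just the happy path with boto3; the IDs are placeholders, and every wait below is a spot where things occasionally stall or fail:

    import boto3

    ec2 = boto3.client("ec2")

    def move_volume(volume_id, old_instance, new_instance, device="/dev/sdf"):
        """Stop the old instance, move its data volume, attach it to the new one."""
        # Cleanly stop the old instance before touching the volume.
        ec2.stop_instances(InstanceIds=[old_instance])
        ec2.get_waiter("instance_stopped").wait(InstanceIds=[old_instance])

        # Detach, and wait until the volume is really free.
        ec2.detach_volume(VolumeId=volume_id, InstanceId=old_instance)
        ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

        # Attach to the freshly launched instance, and wait again.
        ec2.attach_volume(VolumeId=volume_id, InstanceId=new_instance, Device=device)
        ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])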

Which brings us to the other problem: Complexity. Most programmers are not running replicated services with three-nines-reliable failover that never breaks replication. But even if you are, because you've got the budget for excellent infrastructure and a great team, it will always - for values of "always" measured in several more years, anyway - be more complicated and risky to fail over a critical production service than to apply, say, a security patch to 'vi' in place on a running server. 'vi' is not in your critical path. If you accidentally break 'vi' on a live server (and you won't, because vi is older than dirt and solid as a rock), you will have a good laugh and roll it back. Why risk a needless failover, which always has a chance of failure, when you could just apply the damn patch and thereby mitigate risk?

At Google scale that argument probably stops applying. But most people don't run at that scale and it will take decades to migrate everyone to a system that does, if that even happens.

So, "dustbin of history", maybe, someday, but in the long run we are all retired, and I will be retired before our dream becomes reality. ;)


The bulk of your comment - your second, third and fourth paragraphs - focuses on issues of speed, bandwidth and reliability in a third-party hosting/cloud-based architecture, which are a design-time tradeoff, so I don't see them as strictly relevant (though anecdotally informative).

Your fifth paragraph describes problems related to operations process, which are entirely avoidable.


Well, okay. Give my regards to Saint Peter and all the angels!


In the parlance of our times: "Do you even lift?"


I may be missing your point here, so forgive me if this seems ignorant, but I do not see how something like Puppet accrues this "invisible state".

I'm in the midst of a large puppet enterprise deployment at my day job, and it seems they've taken great pains to prevent any drift from happening. You get a dashboard webapp that shows you every puppet run on every machine, and a large overview that gives you states like "nonresponsive / changed / unchanged / pending / error".

When a puppet run makes a change to a system, that run is marked as changing the server. If you're getting lots of "changed" runs on a system, you immediately know to go look at that server because something is making that box deviate from your defined baseline.

You write your configurations (manifests), add them by name to the webapp, add those to groups, and then add machines to the groups, which define what set of manifests to apply.

On my admittedly newbish level, it would seem this system is completely immune from any drift-over-time, provided you bother to glance at the dashboard occasionally. We're going to put it in our NOC as soon as the deployment is done :)


Do you have any resources where one might learn about CSIEs? Where to get started, available tools, best practices, etc.


To be honest, nobody else uses the term... I just made it up! But I've been wanting to further publicly elucidate my thinking in this area for some time... maybe soon I'll get around to more. I would say the responses to this post have been sort of confirmatory (I'm not barking up an entirely wrong tree; others can see the logic of my thinking to some extent).

I personally believe that this area is going to expand rapidly, building off the trajectory begun by the present first generation of cloud infrastructure. For someone who wants to practice thinking in this area, I would say a good exercise is to limit yourself purely to third-party and cloud-based infrastructure, but demand high performance and global (multiple cloud provider) availability, and challenge yourself to write a multi-service system, including a deployment tool that automates your solution and actually produces maintainable infrastructure. If you follow through, you will understand the problems.


Perhaps what we want is a build system (a la make) that builds your software and constructs a virtual machine image for it, including running all of your tests within the running virtual machine.


You're thinking along the right path. But you can't fully control the environment on all third party hosts/cloud providers, so how do you ensure that the code works on all target infrastructure? There needs to be a unifying abstraction to tie them together, something static enough to maintain a fixed target for service authors.


I would hazard that, for a given set of axes in the multivariate formula we will call "control" for short, you definitely can be fully in control. The question is: does what I'm doing fit within the bounding space I can exert full control over?

Salt as an 'ecosystem' has capabilities that extend further into this area. Salt Cloud and the Salt Bootstrap script are two of the things that I feel invalidate some of your argument: tools like Salt (I won't comment on the Chef & Puppet ecosystems) are capable of operating much closer to your aim than you give them credit for.

For starters, I'd quit before letting a boss tell me their precious snowflake AMI was 'more stable' and that I shouldn't waste any time trying to ensure I can recreate it should someone delete the image. In my own workflow, Salt is step 1 in any system. I build a stand-alone Salt configuration that can, from a bare OS image on first boot, initialize Salt and then pull the system forward to the desired state. Then, for cloud roll-out, the next step is to take an image of that VM/container, which I can then replicate much faster.

And one of my long-term plans is to hammer Buildbot, Salt, Salt Cloud, and a lot of time into a system that gives end-to-end control. Now if only I could code a decent UI for it... putting Bootstrap on months of work would feel cheap, lol.


You definitely can be fully in control. The question is: does what I'm doing fit within the bounding space I can exert full control over?

Right. It's probably fair to say that lots of present-era PaaS doesn't give you a large bounding space. Perhaps cloud providers always limit your space. Your own hardware can even provide limitations. But within that which you control, it's extremely important to version, package and test configuration sets .. or you wind up with a wide class of tangential issues.

Salt Cloud and the Salt Bootstrap script [are] capable of operating much closer to your aim than you give them credit for.

You may be right.

... waste any time trying to ensure I can recreate it...

If you can't automate the generation of your environment, but you want to maintain the systems that you produce, and they are of reasonable complexity, then IMHO you are asking for trouble in the long run. This is something that took me a while to learn.


this sounds like what Martin Fowler calls "immutable servers": http://martinfowler.com/bliki/ImmutableServer.html


Yeah, this is pretty much the exact same idea.


Allow me to introduce you guys to NixOS, and its ops counterpart, NixOps.

http://nixos.org/nixos/ https://github.com/NixOS/nixops

It's a young project but it already solves most of these issues.


This project looks interesting but I would be wary of any single tool whose scope includes the kernel configuration, boot loader, and userspace and network configuration management.



