The architecture of declarative configuration management (nelhage.com)
82 points by zbentley 27 days ago | 20 comments

Building your own version of something is surely self-indulgent wheel-reinventing, but that's what I'm currently doing with distributed configuration management.

It’s certainly been helpful in terms of understanding the boundaries between parts of the system, as this post also describes. The desire to auto-configure everything is strong — one day you’ll have a VLAN hard-coded into the config, but the next day you’ll be trying to programmatically distribute VLAN ids based on function instead. The day after that, VLANs themselves are an artifact generated from a higher-level separation in your human-readable config. What was once a list of hosts with an attached VLAN id is now a group of hosts with a declared function that just happens to be programmatically assigned a VLAN id, but only as an implementation detail.

The same happens with IP address management — your root configuration moves closer and closer to being a document describing what you want to do, and less about how to go about doing it (which is implemented in your custom augmentations to the engine instead.)
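A minimal sketch of that progression (group names, VLAN numbering, and the supernet are all made up for illustration): the human-readable document only declares hosts and their function, and VLAN ids and addresses fall out of the engine as implementation details.

```python
import ipaddress

# High-level, human-readable document: groups of hosts with a declared
# function. VLAN ids and IP addresses are deliberately absent.
config = {
    "web": ["web1", "web2"],
    "db": ["db1"],
}

# Engine detail: functions map to VLAN ids deterministically, and each
# VLAN gets a /24 carved out of a site-wide supernet.
SUPERNET = ipaddress.ip_network("10.0.0.0/16")

def materialize(config):
    subnets = SUPERNET.subnets(new_prefix=24)
    out = {}
    for vlan_id, (function, hosts) in enumerate(sorted(config.items()), start=100):
        subnet = next(subnets)
        addresses = subnet.hosts()
        next(addresses)  # reserve .1 for the gateway
        for host in hosts:
            out[host] = {
                "function": function,
                "vlan": vlan_id,
                "address": str(next(addresses)),
            }
    return out

print(materialize(config))
```

Renaming a group or adding a host changes the document; the VLAN and addressing scheme stays inside the engine.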

When you can justify it as an exercise in understanding a system, and you have time for it, building your own tool chain is incredibly rewarding.

> We could imagine resolving this tension if Terraform had two different convergence engines... The “create a new environment” engine, which always creates from scratch every resource it was given. This would excel at spinning up fresh environments as quickly as possible, since it would have to perform a minimum of introspection or logic and would just issue a series of “Create()” calls.

This just doesn't make sense; introspection usually allows you to apply changes more quickly. For example, it takes seconds to describe and update an existing AWS ELB; it takes minutes to delete and create a new one.

If you really want to forgo analysis and reuse of existing infrastructure, just do

    terraform destroy
    terraform apply
> Importantly, however, it by design will never issue a destructive operation, and will error out on changes that cannot be executed non-disruptively.

The notion of a "destructive operation" is not clear cut. Is it destructive to remove a file from S3? To update a file in S3? To delete a tag on an S3 bucket? To update a tag on an S3 bucket?

You can just manage this with permissions; that way you can specify exactly what is and isn't an allowable operation. In fact, this is best practice as it protects against bugs or misuse of the tool. Since Terraform already defaults to non-destructive, adding infrastructure-level permissions would cause it to work exactly as described.
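To make the permissions idea concrete, here is a sketch of an IAM-style policy document, assembled in Python and dumped to JSON, that denies destructive S3 calls for whatever credentials the tool runs under. The action list is illustrative, not an exhaustive inventory of destructive operations.

```python
import json

# IAM-style policy sketch: an explicit Deny beats any Allow, so even a
# buggy or misused tool cannot issue these calls. The actions shown are
# just examples of what one org might classify as "destructive".
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": ["s3:DeleteObject", "s3:DeleteBucket"],
            "Resource": "*",
        }
    ],
}

print(json.dumps(policy, indent=2))
```

The useful property is exactly the one argued for above: the definition of "destructive" lives in the permission layer, where each team can draw the line differently, rather than being hard-wired into the convergence engine.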

A better example of customizable convergence would be the lifecycle management options Terraform already has, such as create_before_destroy, which ensures the new resource exists before the old one is deleted.

I'm working on something called mgmt: https://github.com/purpleidea/mgmt/

It runs as a distributed system, and is reactive to events, both in the engine and in the language (a FRP DSL) which allows you to build really fast, cool, closed-loop systems.
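The closed-loop idea, in the most generic terms (this is a toy sketch of event-driven convergence, not mgmt's actual engine or DSL): instead of re-running convergence on a schedule, the engine reconverges whenever a change event arrives.

```python
import queue

# Toy closed-loop reconciler: events arrive on a queue (in a real
# engine these would come from inotify, package hooks, etc.) and each
# one triggers convergence for the affected resource. Here we just
# record what we would have reconverged.
STOP = object()
applied = []

def apply(event):
    # Stand-in for re-running convergence on the affected resource.
    applied.append(event)

events = queue.Queue()
for e in ["file-changed:/etc/foo", "pkg-installed:nginx", STOP]:
    events.put(e)

while True:
    event = events.get()
    if event is STOP:
        break
    apply(event)

print(applied)
```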

Have a look!

Seems odd to talk about declarative configuration management and not mention Nix or Guix.

I was going to comment the same thing. On the NixOS About Page[0], the first main section is "Declarative system configuration model"

[0] https://nixos.org/nixos/about.html

It sounds like you want something closer to a prolog language in which you could specify the rules for the engines to respect...

Exactly, and structural and functional constraints over properties and rules.

I understand the need to reinvent the wheel, but most of these efforts feel to me like customizations that most declarative languages can provide, albeit possibly in a non-intuitive syntax.

yeah. I think the syntax and the fit of the syntax with the domain matter. Tooling too.

I want to spend time building something like that, but I doubt there is really a market need/want.

Salt, for some reason, is not discussed in the article. It's declarative.

He's spot on about separating "configuration generation" from convergence. There is no reason for the two to be the same system, the same tool. As he says, Kubernetes is only concerned with the latter, whereas Puppet, Chef, and Terraform (insofar as Terraform uses HCL) conflate the two.

And for all the talk of "declarative", there is no reason why the configuration generation stage cannot be imperative, a la Pulumi. It is the desired end state - the catalog that's being generated - that is declarative.
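The split can be sketched in a few lines (toy resource model, made up for illustration): the generation stage, however it runs, ends in a desired-state catalog; a separate convergence step diffs that catalog against observed state and emits a plan.

```python
# Toy convergence step: diff a desired catalog (however it was
# generated -- declaratively or imperatively) against observed state
# and emit a plan of create/update/delete operations.
def plan(desired: dict, actual: dict) -> list:
    ops = []
    for name, spec in desired.items():
        if name not in actual:
            ops.append(("create", name))
        elif actual[name] != spec:
            ops.append(("update", name))
    for name in actual:
        if name not in desired:
            ops.append(("delete", name))
    return ops

desired = {"lb": {"port": 443}, "app": {"replicas": 3}}
actual = {"lb": {"port": 80}, "old-app": {"replicas": 1}}
print(plan(desired, actual))
```

Nothing in `plan()` knows or cares how `desired` was produced, which is the whole point of the separation.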

I mostly agree, with the caveat that in my experience, if the configuration generation stage is entirely imperative it is harder to reason about it. That might not be a problem for low-complexity setups, but can get quite important (and bad) in some more involved cases.

My experience too. I strongly believe there is room for some kind of tool to help with this process, whether it be a library, DSL, or framework. Something lightweight that places some order on the problem of generating configuration, nothing more.

Otherwise writing raw python and dumping to JSON (or using python client libraries for whatever you're targeting, e.g. kubernetes), quickly becomes an unmaintainable mess.
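One lightweight middle ground is small typed builders instead of hand-assembled dicts. A hedged sketch: the emitted shape follows the Kubernetes Deployment schema, but this is only an illustration, not a real client library.

```python
import json
from dataclasses import dataclass

# Minimal typed builder: the structure and defaults live in one place,
# and call sites stay short and reviewable. The rendered dict mimics a
# Kubernetes Deployment manifest.
@dataclass
class Deployment:
    name: str
    image: str
    replicas: int = 1

    def render(self) -> dict:
        return {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": {"name": self.name},
            "spec": {
                "replicas": self.replicas,
                "selector": {"matchLabels": {"app": self.name}},
                "template": {
                    "metadata": {"labels": {"app": self.name}},
                    "spec": {
                        "containers": [
                            {"name": self.name, "image": self.image}
                        ]
                    },
                },
            },
        }

print(json.dumps(Deployment("web", "nginx:1.25", replicas=3).render()))
```

Even this small amount of structure stops the "dict soup" from spreading: the schema knowledge is written once, not repeated at every call site.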

I suppose that functional languages might be a good fit this problem, then. Nix and Guix come to mind.

Yes, agreed. I don't think either Nix or Guix is there yet in terms of usability (not that most current alternatives are much better, mind). But I could see a wrapping layer on top of either of them working quite well. It's difficult to come up with abstractions for the kind of complexities we're dealing with nowadays. I'm hopeful someone will eventually, though...

The "pluggable convergence engines" is what we've built in Gyro[1] for this very reason. We wanted to have more control over how changes are made in production.

An example is doing blue/green deployments where you want to build a new web/application layer, pause to validate it (or run some external validation), then switch to that layer and delete the old layer. All while having the ability to quickly roll back at any stage. In Gyro, we allow for this with workflows[2].

There are many other areas we allow to be extended. The language itself can be extended with directives[3]. In fact, some of the core features, like loops[4] and conditionals, are just that: extensions.

It's also possible to implement the article's concept of "non-destructive prod" by implementing a plugin that hooks into the convergence engine's events (we call it the diff engine) and prevents deletions[5].

We envision folks using all these extension points to do creative things. For example, it's possible to write a directive such as "@protect: true" that can be applied to any resource and would prevent it from ever being destroyed using the extension points described above.
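The hook shape can be sketched generically (hypothetical names, not Gyro's actual plugin API): the diff engine consults registered hooks before executing each planned operation, and a protect hook vetoes deletions of flagged resources.

```python
# Sketch of the "@protect" idea: a hook consulted by the diff engine
# before each planned operation. Names and resource shape are
# hypothetical.
class ProtectedResourceError(Exception):
    pass

def protect_hook(op, resource):
    if op == "delete" and resource.get("protect"):
        raise ProtectedResourceError(resource["name"])

def execute(plan, hooks):
    applied = []
    for op, resource in plan:
        for hook in hooks:
            hook(op, resource)
        applied.append((op, resource["name"]))
    return applied

plan = [("create", {"name": "lb", "protect": False}),
        ("delete", {"name": "db", "protect": True})]
try:
    execute(plan, [protect_hook])
except ProtectedResourceError as e:
    print("blocked deletion of", e)
```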

[1] https://github.com/perfectsense/gyro [2] https://gyro.dev/guides/workflows [3] https://gyro.dev/extending/directive/ [4] https://gyro.dev/guides/language/control-structures.html [5] https://github.com/perfectsense/gyro/blob/master/core/src/ma...

That is why immutable infra is becoming popular: you can easily destroy and rebuild the whole thing.

And for a prod env, what we are discussing sounds like update behavior to me. In CloudFormation, you can choose among different update policies.

Compared with each cloud's provisioning engine (CloudFormation, Google Cloud Deployment Manager, Azure Resource Manager), Terraform is lacking a lot of features. So unless you are dealing with a private cloud, using the cloud's default provisioning service is a no-brainer.

CloudFormation often lacks support for resources that are new or less popular. Terraform is much better at this, and supports most resources from the start as far as I know.

In CloudFormation, you can customize your resource types with AWS Lambda. It is like writing providers in Terraform.

Plus, CloudFormation is a free, managed service. You do not need to maintain it, and if it goes wrong, you can yell at AWS. Unless you bought Terraform Enterprise, Terraform is still a pain to maintain: another possible failure point in your system.

Considering the staggering breadth of AWS, it is amazing how fast and completely Terraform supports AWS.

Only a couple of times have I actually encountered unsupported features (1. global WAF, 2. non-canned S3 bucket policy).

My experience is that the production operations engine is hard to get right, because your target environment can drift from the desired configuration for reasons you did not anticipate.
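A small sketch of what catching that drift looks like at the field level (the resource shape is hypothetical): compare a resource's desired spec with what is actually observed and report each divergent field.

```python
# Field-level drift report: for one resource, compare desired spec
# against observed state and return {field: (desired, observed)} for
# every field that diverges, including fields added out-of-band.
def drift(desired: dict, observed: dict) -> dict:
    keys = set(desired) | set(observed)
    return {k: (desired.get(k), observed.get(k))
            for k in keys
            if desired.get(k) != observed.get(k)}

desired = {"instance_type": "m5.large", "tags": {"env": "prod"}}
# Someone resized the instance and added a tag by hand:
observed = {"instance_type": "m5.xlarge",
            "tags": {"env": "prod", "debug": "true"}}
print(drift(desired, observed))
```

The hard part the comment points at is not computing this diff; it is deciding what the engine should do about divergence it was never designed to expect.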
