Hacker News
Infrastructure as Code, Part One (crate.io)
145 points by based2 65 days ago | 61 comments

What I find unfortunate about infrastructure-as-code tooling is that a lot of the tooling isn't actually using code, but instead uses esoteric configuration languages. Indeed, the article refers to Terraform with its custom syntax.

Imho tools should use actual code (whether it's TypeScript or Kotlin or whatever) instead of reinventing constructs like loops and string interpolation.

Thankfully these tools are getting more popular, because frankly I can't stand configuring another Kubernetes or GCP resource using a huge block of copy/pasted YAML.

Wanting IAC to be procedural instead of declarative is like saying you want SQL to be procedural.

You’re not telling AWS how to do something. You’re telling it what you want the end state to be and letting it figure out what needs to be created, updated, deleted, or replaced, in what dependency order, and what can be run in parallel when you create or update the template.
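To make the "end state" idea concrete, here is a minimal sketch of how a declarative tool could derive actions by comparing current and desired state. This is not CloudFormation's actual algorithm; the `plan` function and the resource names are hypothetical, and dependency ordering is left out.

```python
def plan(current, desired):
    """Compare two {name: properties} mappings and list the actions
    needed to move `current` to `desired`."""
    actions = []
    # Declared but not yet existing: create.
    for name in sorted(desired.keys() - current.keys()):
        actions.append(("create", name))
    # Existing and declared, but with different properties: update.
    for name in sorted(desired.keys() & current.keys()):
        if desired[name] != current[name]:
            actions.append(("update", name))
    # Existing but no longer declared: delete.
    for name in sorted(current.keys() - desired.keys()):
        actions.append(("delete", name))
    return actions


# Hypothetical resources, for illustration only.
current = {"bucket-logs": {"versioning": False}, "queue-old": {}}
desired = {"bucket-logs": {"versioning": True}, "topic-events": {}}

print(plan(current, desired))
```

The user only ever edits `desired`; the list of actions is computed, which is the sense in which the template is declarative.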

There are linters and editors for CloudFormation that help you with autocomplete (?) and warn you when you are incorrectly specifying a resource type. You can even add your own custom definitions to the linters for custom resource types that you create.

CloudFormation, just like generic SQL, doesn’t by itself have the concept of loops. CloudFormation does support custom transforms and macros that you can create in any language, and you can write programs that generate CF in many languages using the CDK, where it will perform validation. I haven’t used it, so I am being really hand-wavy.

I agree with this in general, but I think there’s room for more programmatic ways of calculating what the desired state is in tools like Terraform. There are already a lot of built-ins in terraform that are deterministic but procedural ways of calculating what the state is, without infecting the declarative nature of the reconciliation process.

We also were frustrated at the verboseness and vagaries of the terraform syntax.

We ended up preprocessing them using jinja2, injecting our variables via its context, and using its cleaner syntax for expressing conditions, loops, etc. Now we have the best of both worlds.

... until your jinja templating breaks for some obscure reason, or the underlying terraform resource upgrades (they follow the provider api after all), or if you need to do some state surgery and partition one terraform state into multiple ones, or until you want to import resources into the generated terraform or....

Apologies, I don't mean to be a downer. But... you're making your present life easier but making the code harder to maintain for future maintainers of your codebase.

What's the more maintainable alternative? Is writing repeated constructs etc out individually really more maintainable, or are you referring to other ways of getting these things reduced?

I highly recommend taking a look at the AWS CDK. It's a relatively new project, still in beta, and hasn't gotten a lot of attention yet. It allows you to use real programming languages to generate CloudFormation templates. In my eyes this is the best of both worlds.

Just plain terraform modules, for now.

I’m eagerly awaiting 0.12, so the language will have features similar to Jinja templating.

If that's what you want to achieve, you can generate the terraform files from any external system. Basically template it.
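As a rough sketch of generating from an external system: Terraform also accepts configuration in JSON syntax (files ending in `.tf.json`), so a script can build the document with ordinary loops and serialize it. The `render_buckets` helper, the bucket names, and the `example` prefix below are made up for illustration.

```python
import json


def render_buckets(names, prefix="example"):
    """Build a Terraform JSON document declaring one aws_s3_bucket per name."""
    buckets = {name: {"bucket": f"{prefix}-{name}"} for name in names}
    return {"resource": {"aws_s3_bucket": buckets}}


# Hypothetical bucket names; a plain list replaces copy/pasted blocks.
document = render_buckets(["logs", "assets", "backups"])
print(json.dumps(document, indent=2, sort_keys=True))
```

Writing the printed output to a file such as `main.tf.json` would hand it to Terraform like any other configuration, with the caveat raised elsewhere in this thread that generated files make imports and state surgery harder to reason about.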

That's what sparkleformation does for cloudformation stacks.

When there is live traffic or data in the infrastructure you’re modifying, the procedure matters a great deal.

Just as in databases, you end up playing lots of “read the query plan, restate the query to try and get a better one” games.

What I've seen is that when you get a real programming language for IaC you'll end up with a half assed, poorly implemented and documented version of a declarative DSL that is organization specific.

I saw it with early versions of Chef, and I've seen it with Apache Aurora's Python-based job specs for Mesos, etc.

I'd rather work around the limitations of a DSL that is declarative and limited but is consistent across orgs than having to retrain myself if I move to a new job.

What's nice about ansible, if you can grok how the inventory system works, custom modules and plugins are super simple. So, OP could indeed program in the manner they wish, with a very nice/simple abstraction on top.

Same for Salt Stack I'd say. Just a yaml wrapper around execution modules which themselves are either wrappers around cli's or pure python versions of them.

I think this points more to the fact that it’s a normal software project, and you need competent developers as on any software project. The DSL route was likely meant to dumb things down for operations people with little software experience, but in doing so they created declarative models that don’t model reality, and the fit is so awkward that it reduces productivity on problems that have been solved for decades.

Operations is the wrong place to do engineering; either that, or operations needs to level up significantly, since they cause a lot of delays and damage to companies with these poorly designed tools.

Using Turing-complete languages for configs is a mistake.

Configs are hard, and they're different enough from normal software (for example DRY doesn't apply in the same circumstances) that using the same tools is a bad idea.

I've seen what happens when swes use swe languages on configs. They get unintelligible. And then I have to clean up the messes.

To be clear though, using Json or toml for configs is also often a mistake.

The problem is that the declarative model doesn’t fit, so in Terraform for example they’ve added branching and for loops even, so you’re making a language at that point. But rather than keeping the logic and the data structure separate, they’re mixed up with each other. This is a solved problem in many languages, Java especially.

I think a valid question would be what the right framework factoring or design should be, but pure declarative isn’t it, and half-declarative tools like Terraform are just repeating past mistakes.

> for example DRY doesn't apply in the same circumstances

Doesn't it, though? Dumb configs tend to suffer very much because of the impossibility to apply DRY directly, and people end up using/writing config generators just to ensure consistency of values.

Hell, isn't half of the job of an IaC tool to be such a config generator, papering over lack of capabilities of configuration languages?

>Doesn't it, though?

Theoretically, maybe. In practice, I don't think so. Mostly because code is much less prone to change than configs are. Like if you have some encapsulated set of behavior in code, it's often easy (and not particularly painful) to do something like

    self.thing = x if self.other else default

This is (normally) fine. Sometimes this will expand to be 2-3 different things. So you might get

    if self.other:
        self.val = ...
    elif self.different_situation:
        self.val = ...
    else:
        self.val = default

And this avoids setting self.val in all of the child classes, it's "nice and DRY".

Because code-code doesn't change that much, this kind of weird gross encapsulated behavior is (usually) ok. It's still a code-smell, but you don't get burned by it. But when you are dealing with configs, they do change, and you want to be extra explicit, because (as you probably know), the number of exceptions and special cases will inevitably grow, and you'll end up with implicit leaf-level configuration hidden away in your so-called encapsulated stuff, but without end-user visibility into what is actually happening.

One solution to this is to completely ignore DRY and ensure consistency via unit tests or static analyzers or something, but that also sucks, possibly more than doing some denormalization within your config language itself.

My experience says the right way to do this is to restrict yourself to not using any conditionals other than perhaps get-with-default, and doing everything that would be done conditionally via inheritance or composition. Remains to be seen if I think I made the right decision 5 years from now.
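A small sketch of what that restriction might look like in a Python-based config, assuming plain dicts as the config objects. The environment names, keys, and the `setting` helper are hypothetical; variation is expressed by composing a base config with explicit overrides, and the only branching is a lookup with a default.

```python
# Shared defaults, stated once.
BASE = {"replicas": 2, "log_level": "info"}

# Each environment is the base plus explicit, visible overrides.
PRODUCTION = {**BASE, "replicas": 6}
STAGING = {**BASE, "log_level": "debug"}


def setting(config, key, default=None):
    """The only 'conditional' allowed: look a key up, falling back to a default."""
    return config.get(key, default)


print(setting(PRODUCTION, "replicas"))
print(setting(STAGING, "timeout_s", 30))
```

Every special case shows up as a line in the environment's own definition rather than inside an `if` buried in shared code, which is the end-user visibility the comment is after.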

These inheritance constructs are either missing or too rigid in the current tooling to be able to use those patterns, unfortunately. Java in a lot of ways seems like an ideal language, especially with AOP, where for example you could use exactly the same code and swap out the cloud provider.

I agree that most languages are bad for this. I think Java is also too verbose to be a good config language. My experience has been with a Python dsl but I'm open to other options too.

I actually see it less as a configs language issue and more that you want a program that encapsulates the patterns and logic for your app to run. So the language itself is a bit less of an issue other than it needs to support the model you’re creating.

CF supports macros and custom transforms that can be written in any language.

> The DSL route was likely to dumb things down for operations people with little software experience,

I find this a strange argument. Traditional operations people will still resist using these techniques, and will use pre-canned modules and resources if they're forced to use something.

The real reason seems to be because of the declarative nature of IaC rather than concerns about adoption.

I'm trying to write something snappy, but honestly you're just rude. You're shitting on my profession, and from here it looks like you've only ever operated extremely simple production environments.

(Those examples, the real-world ones, were from competent software developers who did product development, not operations people.)

I’ve seen both at sizable scales, the logic patterns are the same, the complexity almost always comes from incorrect modeling of the problem (which devops tools perpetuate). Once a problem is solved correctly, take K8s for example, then the problem becomes much simpler and operations has a well thought out model to work in.

If you’re in technology your profession is to solve problems, it’s not to entrench poor decisions in a company, which in some sense everyone is at fault for, but this area affects a lot of a company and the attitudes are pretty anti-improvement if it goes out of someone’s skill set. The end result is that the software will be moved out of operations, I see this more and more. Operations can’t improve just by magic, they need to embrace the past 50 years of computing knowledge too.

> Operations can’t improve just by magic, they need to embrace the past 50 years of computing knowledge too.

Oh. Really? /s

That's literally what DevOps is all about. I'm sorry, but infrastructure isn't as simple to manage in code as "regular" software. The infrastructure APIs mutate or behave in unexpected ways too often. State drift is common. There is active work in figuring out better solutions. But to say that operations have not embraced computing knowledge is nonsense.

Here at Pulumi, this is something we are working hard on!

For all the many benefits we've seen with Infrastructure as Code to date, the tools are still fairly primitive - copy/paste is the norm for reuse, testing is rare or non-existent, productivity during infrastructure development is low, continuous integration and delivery are largely ad-hoc, and there are very few higher-level libraries available to abstract away the complex details of today's modern cloud platforms. Net - it feels like we're still programming cloud infrastructure at the assembly-language level.

I'm really excited about the opportunity to bring more software engineering rigor into the infrastructure as code space. At Pulumi we believe using existing programming languages is a key enabler of this. Pulumi is still a desired state model like other Infrastructure as Code offerings (so you can still preview changes and make minimal deltas to existing infrastructure) - but you can write code to construct that desired state. As a result, you get for loops and conditionals, you get types and error checking, you get IDE productivity tooling, you can create abstractions and interfaces around components, you can write tests, you can confidently refactor your infrastructure code, you can deliver and version components via robust package managers, and you can integrate naturally into CI/CD workflows.

Pulumi isn't the only tool in this space - we're seeing things like Atomist bringing this same model to delivery pipelines, and AWS CDK bringing this model to the CloudFormation space. I'm excited about where these tools will take the Infrastructure as Code ecosystem in the coming years.

[disclaimer - CTO at https://pulumi.io so clearly biased on this topic :-)]

Pulumi definitely looks very nice and I hope you manage to get some traction in the k8s space. The amount of stacked tooling (like skaffold) currently being used in that space is papering over the real issue that raw YAML simply isn't working.

So good luck to you!

Agreed completely. The current state is like using a language with no debugger, no modern knowledge of code reuse, and no real IDE integration. Terraform is especially poor on a large project, since its parser doesn’t report line numbers on errors, it has no forward references or ways to resolve interdependence between objects, and it leaves it up to the user to manually converge its state. Not to mention you have to rewrite everything for every cloud, and it’s riddled with bugs for common uses.

Pulumi is going down a better route in that they’re using a normal programming language with a normal tool set; however, I’m skeptical of how their engine is designed with RPC calls for language bindings, since again it makes debugging more complex as opposed to just a normal SDK. They also don’t have debugging enabled yet.

I do not think imperative languages are better, because they suck at parallelization. And creating infrastructure elements in parallel is crucial.

What I see as optimal is a tool like make, which calls other command-line tools, resolves dependencies, and processes errors. But not make itself: something with sane syntax and error handling.
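A minimal sketch of the dependency-resolution half of such a make-like runner, using `graphlib` from the Python standard library (3.9+). The step names are hypothetical, and actually invoking command-line tools for each step (for example via `subprocess`) and handling their errors is left out.

```python
from graphlib import TopologicalSorter


def parallel_batches(dependencies):
    """Group steps into batches; every step in a batch has all of its
    dependencies satisfied by earlier batches, so a batch could run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)  # mark the whole batch as finished
    return batches


# Hypothetical infrastructure steps: each maps to the steps it depends on.
steps = {
    "network": set(),
    "database": {"network"},
    "cache": {"network"},
    "app": {"database", "cache"},
}
print(parallel_batches(steps))
```

Here `database` and `cache` land in the same batch because both only need `network`, which is the kind of parallelism the comment above is asking for.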

All these new fancy all-Go tools creep me out. Infrastructure tooling should not live in the domain of one programming language. Extensibility should be language agnostic. If I want to write some Perl/Python/Bash script to support some very non standard part of my infrastructure, I should be able to. If I want to plug a vendor-specific utility into the deployment pipeline, I should be able to. And it should be easy.

> All these new fancy all-Go tools creep me out.

It certainly makes one's ability to look into the machinery and know what is going on much harder.

> If I want to write some Perl/Python/Bash script to support some very non standard part of my infrastructure, I should be able to

That very feature is why I love ansible: if there is some quirk, or even an unreleased module (they only ship new features in major releases, I recently learned), then you can copy the upstream file, or a modified version of the existing one, or even a whole new module, into the `library` or `lookup_plugins` directory of your playbook and you're back in business. No fighting with golang anything. You can also write ansible modules in any language you like.

>Imho tools should use actual code (whether it's TypeScript or Kotlin or whatever) instead of reinventing constructs like loops and string interpolation.

The counterpoint to this is things like Gulp or Gradle which become a nightmare after a couple years of multiple developers and coding styles appending things here and there. Now rather than just spending a few hours learning a basic config DSL, I have to build up a mental execution model every time I want to add a build step.

I don't mind using code to generate what the status should look like, as long as the code doesn't actually mutate the state.

Similar to React and its virtual DOM.

> Imho tools should use actual code (whether it's TypeScript or Kotlin or whatever) instead of reinventing constructs like loops and string interpolation.

I disagree with this one. I believe describing an infrastructure should be fully declarative and the tool should decide how it needs to create resources based on the description. This way, the infrastructure code almost can't have bugs, but the tool can be fixed for everybody.

I've moved my bash based GCP vm deployment script to typescript, via shelljs: https://www.npmjs.com/package/shelljs

ShellJs works pretty nice. Not only is my script now cross platform, but doing conditional logic and user prompting is a lot easier in code than bash.

The only "issue" I've found is that ShellJs is quite barebones. I wrote a wrapper over it to do everything, such as nice question prompting and colored output.

Could you name some of those code-based tools?

One of the oldest examples of this: https://github.com/infochimps-labs/ironfan. But yeah, Chef. Examples: https://github.com/infochimps-labs/ironfan-pantry/blob/maste....

I used to use this heavily back in 2013, 2014. Infochimps got picked up by eBay, afair. Hence why this was never developed further.

Start with hitting cloud provider APIs directly, then looking for value-add on top of that.

This was called Chef, and every time I've encountered a Chef infrastructure it's been a complete mess.

So true! Declarative tooling works great until you have a complex situation, and the abstraction leaks.

The way I think of it is that there's too much personality in this technology.

Nice piece. Looking forward to Part II.

What I am missing (often, in these types of articles as well as in actual production environments) is the fact that if you develop (infrastructure) code, you also need to test your (infrastructure) code. Which means you need actual infrastructure to test on.

In my case, this means network equipment, storage equipment and actual physical servers.

If you're in a cloud, this means you need a separate account on a separate credit card, and you start from there to build up the infra that Dev and Ops can start deploying their infra-as-code on.

And this test-infrastructure is not the test environments other teams run their tests on.

If that is not available, automating your infrastructure can be dangerous at best. Since you cannot properly test your code. And your code will rot.

I found that Kubernetes + minikube (or a variant of that) is a fairly straightforward way to handle this. Teams / developers can easily set up a local testing environment, product owners can QA that way, etcetera.

This of course depends on your level of lock-in with various cloud environments.

IaC tools often handle more than kubernetes, but agreed that k8s is a fantastic way to get reproducible behavior which is absolutely imperative for testing.

This is kinda why I love Google Cloud and don't see myself moving to another cloud provider until they match GKE. I want all developers to throw everything into GKE, and Operations manages only the VPCs, firewalls, etc. Developers get complete ownership over compute (and networking within the cluster) while broader network management can still be managed by an operations team.

Yup - that works pretty well. And gives developers some insight into what is required to get stuff working.

It does assume no hardware or complex networking needs to be handled.

And there is the point of observability. When there is a proper testing ground for developers that is as-production, it enables developers to dig into and mess with logging, tracing, debugging of all sorts.

This adds value by providing developers insight into what a reliability engineer (or whatever they call sysadmins these days) needs to provide whatever service it is that the developers' code is part of.

Why would you need a separate credit card? It’s easy enough to set up separate accounts under an Organization with shared billing and with rules that work across accounts.

Because I want to limit the impact of maxing out a credit card to one environment.

And I want engineers to be able to futz about with all cloud services available, without having to worry about any negative impact on production.

And finally: What happens when $cloud_provider makes changes to the accounts interface and you want to mess around with those new features, without hitting production?

Give your future-self a break, and make sure you can futz around on any and every layer.

Another common practice is using separate domain names. Don't use 'dev.news.ycombinator.com'. Instead, use 'news.ycombinator.dev'. This frees you up for messing around with the API of the DNS provider. And when switching DNS provider, test whatever automation you have in place for this.

> Because I want to limit the impact of maxing out a credit card to one environment.

Just because you maxed out the credit card doesn’t mean that you don’t still owe the money if you go over. That’s what billing alerts are for.

> And I want engineers to be able to futz about with all cloud services available, without having to worry about any negative impact on production.

> And finally: What happens when $cloud_provider makes changes to the accounts interface and you want to mess around with those new features, without hitting production?

> Give your future-self a break, and make sure you can futz around on any and every layer.

That’s what the separate accounts are for but you don’t need a separate card and you still should be using an organization.

> This frees you up for messing around with the API of the DNS provider. And when switching DNS provider, test whatever automation you have in place for this.

Why isn’t your DNS provider AWS with Route 53 where the separate domains would be in separate accounts with separate permissions and separate access keys/roles per account?

All true. Life is so simple all of a sudden. Thanks!

Part 2 is linked at the bottom :) looking forward to part _3_!

Nice article! What Crate advocates here might be common sense to a lot of people here on HN, but I can assure you that there are lots of people in leading positions out there who have no clue about this paradigm. Hopefully some of them will read this article.

I might be one of those people that doesn't completely grok IaC yet. I understand the necessity of configuration tools such as Ansible, I wouldn't live without it, but in the case of my company (~15 apps, 20 servers) I still don't understand the use case of orchestration tools such as Terraform.

All our servers configuration is managed through Ansible, apps are containerised and run on Kubernetes (on CoreOS, so even less configuration required), apps are deployed automatically with CI scripts.

Why would I need to describe the hardware/infrastructure as code? I create VPSes manually, once in a blue moon, and they just get added to our K8S cluster, using the same Ansible template. It takes 2 minutes from VPS creation to adding the new node to the cluster. It'd be nice to describe in code the exact hardware provisioning, but that can be done easily with a couple lines in the internal documentation. And actually, having heterogeneous server sizes might be a feature, to be able to schedule less demanding apps on less powerful and cheaper servers.

What benefit would Terraform give me? Not being snarky, I just don't know how to fit it in our process.

- Scalability: Since you're a small shop, it doesn't seem a huge deal to have to set up networking and other scaffolding manually. But as your company grows and more teams join, will you provision these things for every team by hand? What if they want to experiment with different networking configurations?

- Disaster Recovery: Imagine for whatever reason your VPC is completely destroyed and all your VMs are gone. You still have data backed up (you do backups, don't you? :)). Your company will inevitably be down for some time, but how long would it take to restore everything? Terraform can reduce this from days/weeks to minutes; all configuration is there in code and is reproducible. Even if you don't/can't use terraform itself, you still have captured all your infra config information in code and not just in a Google Doc/Evernote/Post-it notes.

- Audit Trail: You want to empower developers to make changes to infrastructure without opening a ticket and asking you to do it. But if they don't open a ticket, it might make your compliance story much harder. If they do open a ticket, you now have a huge ticket backlog so you hire more engineers .... kinda relates to the point about scalability. Using Terraform, and enforcing infra changes through terraform, you have a super simple audit story, and will know exactly who made the changes, when they were made, for what reason etc.

- Infra Convergence: It's a fact of life that you need to make temporary infra changes for fixing fires, hotfixes, super important custom customer requests, etc. If you allow your infrastructure to be cluttered with these one-off changes, it will be messy, developers won't give a shit when making new changes, and often everyone will have permissions over everything. Using terraform, every time you make changes to infra, you discover these manual changes and either revert them (if they're one-offs) or commit them to code (if they're meant to be permanent).

I'm a new user of terraform, and it certainly has its problems (it's still 0.* for chrissake). But... it has been a very useful tool when used correctly and with discipline.

Terraform really shines when you have a complex environment that depends on various external services. I don't know enough about your architecture to comment on the benefits for you, but for a generic web services environment on, say, AWS, you have network configuration (VPC, subnets, security groups), EC2, load balancers, S3 buckets, RDS instances (managed databases), DNS, and possibly a number of other services which need to be set up for everything to work. Explicitly defining what services your product depends on and precisely how they're to be configured is extremely valuable.

You certainly can write Ansible playbooks to create all of those resources if that works better for your team, but generally it's better to draw a line between configuring your VM/container/server and provisioning all the infrastructure it depends on.

There is a lot more to modern infrastructure than VMs and Docker containers. These are all of the resources you can create with CloudFormation:


> You spend the rest of the day investigating what went wrong. Eventually, you figure it out. Somebody logged on to the third machine last week and manually updated some of the software. These changes were never propagated to the other servers, or back to the staging environment, or dev environments.

I'm sorry -- how many people are still doing this in 2019?

Many, especially for tweaks after installation.

I’m sorry, but infrastructure as code is fake news.

Most popular tools offer infrastructure as config with some horrible and limited scripting language (in some cases even JSP doesn’t look like a horrible idea; hello, Helm). The declarative vs. imperative holy war is getting a bit old (like any other tech war).

The best language is “the one you know” which is why I think pulumi looks like the best answer so far.
