I've been side-stepping this whole conversation by using as little infrastructure as possible. I'd absolutely be digging into IAC abstractions if I had a circus on my hands and no say over why it had to be a circus.
Going into a "cloud native" stance and continuing to micromanage containers, VMs, databases, message buses, reverse proxies, etc. seems absolutely ridiculous to me. We're now using exactly 2 major cloud components per region: a Hyperscale SQL database and a FaaS runner, both on serverless, consumption-based plans. There are zero VMs or containers in our new architecture. We certainly use things like DNS, AAD, VNets, etc., but those are mostly created incidentally by way of the primary offerings, and we only ever have to create them 3 times and they're done forever and ever - Dev cloud, Prod cloud, DR cloud. And yes - we are "mono cloud", because any notion of all of Azure/AWS/GCP going down globally and not also dragging the rest of the internet with it is fantasy to me (and our customers).
When you literally have one database to worry about for the entire universe, you stop thinking in terms of automation and start thinking in terms of strategic nuclear exchange. Granted, one big thing to screw up is a big liability, but only if you don't take extra precautions around process/procedure/backup/communication/etc.
The benefit of doing more with less also makes conversations around disaster recovery and compliance so much easier. Our DR strategy is async log replication of our one database. I really like the abstraction of putting 100% of the business into one place and having it magically show up on the other side of the flood event.
How about this for a litmus test: If your proposed solution architecture is so complicated that you would be driven to IAC abstractions to manage it, perhaps we need to re-evaluate the expectations of the business relative to the technology.
> Going into a "cloud native" stance and continuing to micromanage containers, VMs, databases, message buses, reverse proxies, etc. seems absolutely ridiculous to me.
Honestly you're just paying the cloud provider to manage these things behind the scenes for you. Which is fine but also has its own risks. If you can keep your product that simple for the business then that's pretty incredible.
I do suspect that's not a common situation though, at least in my experience.
This is a nice perspective from the developer of a single application, but as a platform developer, I'm usually dealing with using IaC tooling to set up multi-tenant environments. I can't just deploy one database, because there may be 50 different teams working on 50 different sets of problems: some of them basic research, some of them products, some of them purely exploratory. There are often legal restrictions on who is even supposed to be able to make a network connection to a particular database, so simply using the roles and users built into the DBMS engine itself isn't good enough to achieve the required separation, not to mention the databases need to be encrypted at rest with different keys. This often needs to be done across separate accounts within the same cloud provider for budgeting and accounting purposes as well, so teams couldn't just share a resource even if it were otherwise okay for them to potentially step on each other's work.
My thoughts have been going into another direction entirely:
- We need to get rid of YAML. Not only because it's a horrible file format but also because it lacks proper variables, proper type safety, proper imports, proper anything. To this day, usage & declaration search in YAML-defined infrastructure still often amounts to a repo-wide string search. Why are we putting up with this?
- The purely declarative approach to infrastructure feels wrong. For instance, if you've ever had to work on Gitlab pipelines, chances are that already on day 1 you started banging your head against the wall because you realized that what you wanted to implement is not currently possible – at least not without jumping through a ton of hoops – and there's already an open ticket from 2020 in Gitlab's issue tracker. I used to think, how could the Gitlab devs possibly forget to think of that one really obvious use case?! But I've come to realize that it's not really their fault: if you create any declarative language, you as the language creator have to define what all those declarations are supposed to mean and what the machine is supposed to do when it encounters them. Behind every declaration lies a piece of imperative code. Unfortunately, this means you'll need to think of all potential use cases of your language and your declarations, including combinations and permutations thereof. (There's a reason why it's taken so long for CSS to solve even the most basic use cases.) Meanwhile, imperative languages simply let the user decide what they want. They are much more flexible and powerful. I realize I'm not saying anything new here, but it often feels as if DevOps people have forgotten about the benefits of high-level programming languages. Now, this is not to say we should start defining all our infrastructure in Java, but let's at least allow for a little bit of imperativeness and expressiveness!
I have a similar view to yours: as soon as you need variables, imports, functions or any other type of logic, the existing "data-only" formats break down. Over time people either invent new configuration languages that enable logic (e.g. CUE or Jsonnet), or they try to bolt some limited version of these primitives onto their configuration.
My personal take is that at some point you are better off just using a full programming language like TypeScript. We created TySON https://github.com/jetpack-io/tyson to experiment with that idea.
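To make the idea concrete, here's a small sketch (not TySON's actual syntax or API, just an illustration of the general approach) of what configuration written in TypeScript buys you over YAML: real variables, type checking, and functions instead of templating hacks.

```typescript
// A sketch of configuration-as-TypeScript. All names here are invented
// for illustration; this is not TySON's actual format.

interface ServiceConfig {
  name: string;
  replicas: number;
  env: Record<string, string>;
}

// A variable, reused instead of copy-pasted strings:
const region = "us-east-1";

// A function, instead of a templating hack:
function makeService(name: string, replicas: number): ServiceConfig {
  return { name, replicas, env: { REGION: region, LOG_LEVEL: "info" } };
}

// The "config file" is just an exported, type-checked value. A typo in
// a field name or a string where a number belongs fails at compile time,
// and "find usages" in your editor replaces repo-wide string search.
export const config = {
  api: makeService("api", 3),
  worker: makeService("worker", 1),
};
```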
Thanks for your comment! This is now the second time I'm coming across Jetpack.io (the first time was when I found your devbox project) and this time, too, I come away thinking that you're magically reading my mind. :) Thank you for your work!
May I ask you, what exactly is Jetpack.io? It sounds like a blend between startup and open-source organization, given the prominent links to Github & Discord on your home page, the lack of a hiring page etc. I mean, I had to browse your website quite a while to find out you're actually selling a product(?) :)
Anyway, back to the topic at hand: The TySON README says
> The goal is to make it possible for all major programming languages to read configuration written in TypeScript using native libraries. That is, a go program should be able to read TySON using a go library, a rust program should be able to read TySON using a rust library, and so on.
YESSSS. In my wet dreams I sometimes even go one step further: How great would it be if every language could import constants[0] from any other language?
I was only introduced to terraform a few months ago, but my own takeaway so far closely mirrors yours.
One thing that's sorely needed, especially for beginners, is something like a schema. Something that would provide editors with typical language features, but especially autocomplete. Maybe protobuf would work but I've also heard about some language called Cue that may be worth exploring as well.
I also feel that declarative is the wrong approach. Building infrastructure is inherently imperative and making the build process apparent in the code would go a long way towards readability. I'd love to be able to read through the terraform modules like I'm reading a story about how the system gets built.
Without suggesting it solves most (if any) of your complaints, https://yglu.io is significantly less horrible than text templating YAML files like everybody else seems to want to do.
1. A low-level, open-ended language for describing infrastructure; it should have absolutely no facilities for abstraction, should be human-legible and machine-readable (so based on JSON, probably), and should be applicable to everything from configuring physical hosts and switches up to containers.
2. For each kind of infrastructure, a tool which can apply that language to the infrastructure; one for AWS, one for physical hosts, one for Kubernetes i suppose, etc.
3. Tools and libraries for producing documents in that language from more expressive, concise sources; could be a YAML-to-language compiler, could be a classic Ruby DSL, could be a Python API, could be this guy's CSS idea, could be a GPT prompt, whatever.
Mostly, i want options for the last part to include libraries in sensible programming languages. Then i can just write real code, with full abstractive power, and the possibility of unit tests etc, to define my infrastructure, run it, and feed the output to the applier tool. No more enterprise YAML engineering. No more trying to shoehorn abstraction into Jinja2 templates. Just normal code.
Because the code produces the language, rather than operating on resource directly, writing a new library / DSL / whatever, based on a cool new model which will solve everyone's problems, becomes very easy. You don't have to build a whole IaC tool from scratch.
It also means you have an obvious and simple checkpoint to apply diffing, linting, security checks, etc. Not on the input code, but on the resulting document.
And it means you have one place you can always look to determine the ground truth of what is going on.
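A minimal sketch of the split described above: high-level code with full abstractive power emits a dumb, low-level document, and separate tooling then diffs/lints/applies that document. The resource shape and field names here are invented purely for illustration.

```typescript
// The low-level "language": deliberately abstraction-free JSON-ish records.
interface Resource {
  type: string;
  name: string;
  properties: Record<string, unknown>;
}

// "Real code" generates the document: loops, functions, unit-testable logic.
function webTier(count: number): Resource[] {
  return Array.from({ length: count }, (_, i) => ({
    type: "host",
    name: `web-${i}`,
    properties: { cpu: 2, memoryGb: 4 },
  }));
}

const document: Resource[] = [
  ...webTier(3),
  {
    type: "loadbalancer",
    name: "web-lb",
    properties: { targets: ["web-0", "web-1", "web-2"] },
  },
];

// Checks run on the resulting document, not on the code that produced it,
// so any generator (DSL, Python API, GPT prompt) gets them for free:
const lintErrors = document.filter((r) => !/^[a-z0-9-]+$/.test(r.name));

// The applier tool would consume this serialized ground truth:
console.log(JSON.stringify(document, null, 2));
```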
So not what i described at all then. The whole idea is to have a low-level language you can generate with high-level code written in a real programming language.
> 1. A low-level, open-ended language for describing infrastructure; it should have absolutely no facilities for abstraction, should be human-legible and machine-readable (so based on JSON, probably), and should be applicable to everything from configuring physical hosts and switches up to containers.
Sorry, but if it doesn't have abstraction, it's not readable. You get smothered in details. No forest, just trees. People will just create a framework that generates the file, like Sass does with CSS.
I think what is considered low level has to be qualified here. In my opinion I should be able to request virtual cores and memory the same way I do in an OS. In C I can malloc, calloc, or mmap, or brk if I'm on a Linux system. If I'm doing concurrent things in C, I have to specifically request a thread and then manage it. If I can do these things, why can't I have a library where I can make those requests to my cloud provider from my code?
To me, low level would be individually requesting resources as I need them, with abstractions that let me ignore physical vs. virtual distinctions. To me, this would be fairly readable. C-like code requesting and then allocating memory and threads just makes sense to me, and I don't really care whether they're part of a larger machine. But I also think of low level from my code's perspective, not from my infrastructure's perspective.
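A purely hypothetical sketch of what that "malloc for the cloud" could look like as a library. None of these functions exist in any real provider SDK; the point is only the shape of the API: request resources individually, get a handle back, release it when done.

```typescript
// Hypothetical "malloc-style" cloud allocation API. Everything here is
// invented for illustration; no real provider exposes this interface.

interface Allocation {
  id: number;
  cores: number;
  memoryMb: number;
}

class CloudAllocator {
  private nextId = 0;
  private live = new Map<number, Allocation>();

  // Analogous to malloc(): ask for resources, get an opaque handle back.
  // Where they physically land is the provider's problem, not the caller's.
  alloc(cores: number, memoryMb: number): Allocation {
    const a: Allocation = { id: this.nextId++, cores, memoryMb };
    this.live.set(a.id, a);
    return a;
  }

  // Analogous to free(): release the handle when done.
  free(a: Allocation): void {
    this.live.delete(a.id);
  }

  totalCores(): number {
    let sum = 0;
    for (const a of this.live.values()) sum += a.cores;
    return sum;
  }
}

const cloud = new CloudAllocator();
const worker = cloud.alloc(2, 4096); // "give me 2 cores and 4 GB"
```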
I'm curious what you think would need to be targeted for something to be both cloud native and low level? I rarely hear people discuss this type of classification.
I think the part that upsets me is that he specifically calls out the problem CSS was meant to fix, but the current usage of CSS he presents doesn't fix that problem so much as invert it. Instead of having to scour code for instances of an attribute you wish to change, you're scouring output for potential negative consequences of a 1px margin shift you added to a style used everywhere.
It seems like this will just amplify mistakes when a lowly dev tries to increase the available RAM of their resource and instead doubles the entire RAM allotment of a resource type for the entire enterprise.
Author of the article here. I totally agree with you on this. There are definitely tradeoffs. However I think the risk of this approach is arguably better than the risk of having every team implementing their own IaC from scratch.
With the CSS example, you can have a bunch of very junior developers using Bootstrap and Tailwind with minimal knowledge of CSS and get great results, precisely because they don't actually have to change the CSS classes inside of the CSS framework. Junior devs don't make "1px margin shifts" as often because the framework has good margins out of the box. Additionally, if needed, any "1px margin shifts" they do make could be required by a linter or other code policy to happen as a new class, with visibility and limited impact, rather than as a change to an existing underlying CSS framework class.
The same could be true for our infrastructure as code if we had similar prepackaged configuration mix-ins. It actually lowers the risk of junior developers making "1px margin shifts" because the IaC framework has sensible config mix-ins out of the box. Most modern IaC just hands the full API surface area to a junior dev, perhaps provides a few examples, and then hopes that they do the right thing with it.
I think modern frameworks like AWS Cloud Development Kit are doing the right thing by implementing higher level methods like `database.connections.allowFrom(service)`, which automatically configures minimal access security groups with the default port of a stack construct, etc. This prevents juniors from making mistakes like opening up every port to every IP address on the internet. However, I think we need the underlying infrastructure as code language to also gain an understanding of mixin methods and traits that can be reused like that, and it needs to be more general purpose.
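A simplified model (plain TypeScript, not actual CDK internals) of why a high-level call like `database.connections.allowFrom(service)` is safer than hand-written rules: the minimal security-group rule is derived from the resources themselves, so a junior dev never types a CIDR like `0.0.0.0/0` or a port number at all.

```typescript
// Toy model of CDK-style "connections". Class and field names are
// invented for illustration; the real CDK API surface is richer.

interface SecurityGroupRule {
  fromSecurityGroup: string;
  port: number;
}

class Connectable {
  rules: SecurityGroupRule[] = [];

  constructor(
    public securityGroupId: string,
    public defaultPort: number,
  ) {}

  // The high-level method derives the narrowest possible rule:
  // only the peer's security group, only this resource's default port.
  allowFrom(peer: Connectable): void {
    this.rules.push({
      fromSecurityGroup: peer.securityGroupId,
      port: this.defaultPort,
    });
  }
}

const service = new Connectable("sg-service", 8080);
const database = new Connectable("sg-db", 5432);

// One readable line instead of a hand-authored ingress rule:
database.allowFrom(service);
```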
The "CSS" analogy is a starting point for introducing people to this idea of building reusable resource configuration bundles and resource mutation methods that can be plugged into multiple infrastructure resources and reused, instead of having every IaC stack be its own special hand crafted thing.
This objection reminds me of the ones made against AOP. With great power comes great responsibility. The increased efficiency means you can make the greater investment when it's appropriate and avoid far greater expenditures. Also, tools like Eclipse's XRef view become required.
And in an environment where staff are a cost optimization and not everyone is a 10x coder or engineer, there will be individuals in power who are not as responsible as you would like.
So it's better to have a system where you limit one individual's ability to cause harm rather than amplify it.
In an environment where staff are a cost optimization, tech debt and things like CI/CD are luxuries. Much of it will be half baked or poorly implemented. Automation of unit tests and validation are going to be afterthoughts at best.
Don't get me wrong CSS isn't perfect, but it has done a really good job of scaling through difficult problems as HTML, and browsers, and user expectations have grown over the decades. The tooling and CSS frameworks have gotten really good.
I think HTML + CSS is an example of a declarative system where you can start out not knowing very much about it, just drop in Bootstrap or Tailwind, and start getting great results by using prebuilt CSS classes from someone else.
This is what is missing in most modern infrastructure as code. Sure, you can start out with prebuilt IaC templates from someone else, but these templates are basically like getting handed a big chunk of HTML that has inline styles on it. It might render great in the browser and look great, but it's hard to read, it's hard to understand why it works, and you'll have trouble adding your changes to it without breaking things.
What I'd like to do is decouple the semantic aspects of infrastructure from the specific configuration aspect, similar to how HTML + CSS lets developers write their semantic markup with semantic CSS class names, and then have a CSS framework provide the exact styles that make it look pretty.
Infrastructure as code needs a similar standard library of semantic configuration mix-ins that you can apply to your infrastructure as layered mutations to produce the final result. There are many tools out there approaching this challenge right now from different ways, and I think the future of infrastructure as code is going to look quite different from what most people are doing today, more like HTML + CSS, or imperative code, than flat YAML and static structures that must list out all their own properties.
I think if you use CSS as a source of lessons learned after the mistakes were implemented and cemented, then yes. If you aim for where CSS was intended, and make decisions which do not compromise that direction, then it's fine goal.
The layering of distinctly-defined concerns contained in separate files which collectively project a merged specification to an IaC tool is a good idea, I think.
What the author appears to miss is that many existing IAC tools permit exactly this.
CDK, CDKTF and Pulumi all use general-purpose programming languages, so reusing parameter objects in the way that is described is trivial – indeed, it is so close to second nature that I would not even think to write it down. It's not uncommon to share functions that build such parameter objects via libraries in the package ecosystem of your choice.
I agree that IaC needs a rethink, but that is more to do with the fact that declarative systems simply cannot model the facts on the ground without being substantially more complex.
I'm the author of the article. I actually shared a prototype towards the end of the article, of the idea implemented in CDK. I agree that you can do a lot of this already in CDK, CDKTF, and Pulumi. I just don't think most people are actually doing it (yet).
I've been using CDK since early beta, and have actively contributed to the project. But most people that I'm seeing using it today are just wrapping up new higher abstractions with a simpler but limited API. I think that is an okay start, but I want to encourage people to think more about the infrastructure as being made up of traits/classes/adjectives that are mixed together to form the final product. The same way we have class inheritance in object oriented programming, or CSS classes in HTML.
Eventually the dream is to be able to provide a library of standard infrastructure as code mix-ins that can be applied to your cloud architecture. For example imagine if you could apply a generic "Graviton" trait to a CloudFormation or Terraform or Pulumi or CDK stack and it would automatically configure the appropriate properties on your EC2 instances, and your RDS database, and your Fargate tasks, and all your other compute. With CDK's built-in container and image builds it could even run your Dockerfile based build inside of the matching architecture as well, all based on a single trait that you add to your stack.
There are a wide variety of these types of "traits" that you might be able to build and add.
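A standalone sketch of that "Graviton trait" idea in plain TypeScript (CDK Aspects work in a similar visit-every-node style, but this is an illustrative model, not CDK code): one trait visits every resource in a stack and rewrites x86 instance types to ARM equivalents, leaving everything else untouched. The instance-type mapping below is a hypothetical subset for illustration.

```typescript
// Minimal trait/visitor model. Resource shape and mapping are invented.

interface Resource {
  type: string;
  instanceType?: string;
}

interface Trait {
  visit(resource: Resource): void;
}

// Hypothetical x86 -> Graviton instance mapping, for illustration only.
const GRAVITON_EQUIVALENT: Record<string, string> = {
  "m5.large": "m6g.large",
  "r5.xlarge": "r6g.xlarge",
};

class GravitonTrait implements Trait {
  visit(resource: Resource): void {
    const current = resource.instanceType;
    if (current && GRAVITON_EQUIVALENT[current]) {
      resource.instanceType = GRAVITON_EQUIVALENT[current];
    }
  }
}

// Applying a trait is just walking the stack's resources with it:
function applyTrait(stack: Resource[], trait: Trait): void {
  for (const r of stack) trait.visit(r);
}

const stack: Resource[] = [
  { type: "ec2", instanceType: "m5.large" },
  { type: "rds", instanceType: "r5.xlarge" },
  { type: "s3" }, // untouched: nothing to rewrite
];
applyTrait(stack, new GravitonTrait());
```

The appeal is that the stack definition never mentions architectures at all; one added trait reconfigures every compute resource consistently.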
Most people are still clicking around the console. A new paradigm is not required to allow them to use what already exists, just education.
The question of WHY the mixins are required is interesting to me - they point to the abject failure of cloud providers to keep up with their customers.
As I work at a cloud provider, I have a different perspective on the mix-ins. The problem is that at scale there is no one customer type. Some customers absolutely require a specific way of configuring things, while other customers require exactly the opposite. Cloud providers are always trying to add more features to please all people; it's a fundamental side effect of growth. And an often unstated aspect is that the most custom features are often extremely high-value ones, added for customers worth large amounts of money to the business, even though the other 90% of small users and startups may not need them.
I think mix-ins can be a way to make the growth in features accessible while not overwhelming. Out of the box mix-ins can provide 80% of the value for 80% of the people and the long tail of 20% customers with the super custom needs that are worth high dollar amounts can still build their own custom mix-ins that target their specific use cases.
Adding a "real" programming language makes certain things easier, such as abstraction, but IMO they are too powerful for the task at hand. Do we really want an infrastructure description to be able to execute arbitrary code?
Yes, that’s exactly what we want. Things like Terraform also permit this via provisioners, and CloudFormation permits it via execution of lambda functions. Almost any non trivial infrastructure requires it.
With Terraform you can statically analyze the infrastructure definition with some guarantees of determinism etc. Arbitrary execution is allowed, as you say, but only in well-contained places, such as local_exec.
How can this work if, say, TypeScript is used as the definition language?
> I believe that infrastructure as code languages and tool assisted generators that we currently use are good, and they are taking steps in the right direction, but most of them are trying to patch over underlying complexity in a way that is fundamentally unscalable.
Sure, I can get behind this. Yesterday I was trying to figure out how to give a name to EC2 instances generated by AWS-managed autoscaler group that’s created by a node group resource. Simple, right? should just be able to add a Name = $tag field to the node group somewhere to apply to the generated ec2’s?
well, not quite. What you actually need is a separate autoscaling_group_tag resource.
Well, that resource needs a reference to an autoscaling group ARN. But I don't manage an autoscaling group, my node group does, so in the end I have to figure out how to reference it with an expression like `aws_eks_node_group.this.resources[0].autoscaling_groups[0].name`.
well, not quite, you may need a try block around that, and maybe some lifecycle rules to get around weird race conditions.
so yea. I’m not complaining about HCL or terraform. I find it much better than the alternatives. but lots of times my reaction to stuff like this is “there’s no way it has to be like this.”
Yeah. I’m following this discussion and not really finding myself able to relate very well.
We’re using IAC almost exclusively; loads of Ansible particularly. Everything is essentially a Kubernetes manifest or a playbook in Ansible (which runs on Kubernetes). We exited the “all in on one cloud and all its services” methodology and it made our lives _vastly_ less complex. We don’t really need the kinds of complexity that brings. We picked two tools as close to the metal as was reasonably portable between any given VM or bare metal stack, and deploy everything else on top of that. It has made life _so_ much easier.
Of course we still use some best in class services, but we avoid proprietary services that cannot be at least functionally replaced inside a week, unless we’re hosting them ourselves.
> This is a bold statement I know. But I do not believe that infrastructure as code can ever get significantly simpler in its current form
Everything can be made easier to use. Pick the subset of functionality you care about and package it up as a library or module for other teams to use. This was how I paid the bills for years.
He's suggesting something closer to Pulumi than a declarative (Cloudformation, Terraform), but with more of an inheritance model to apply blanket attributes to the targeted resources. This is possible with Pulumi but requires a lot of boilerplate and some monkeypatching.
Author here. This is correct. I have used Pulumi and I love CDK. I think the models in Pulumi and CDK are both great, but they tend to be either too low level or too high level.
For example, Pulumi has an `awsx.ecs.FargateService` class that can deploy a container to AWS Fargate, but the API is limited and you don't really have the ability to remix it into other scenarios. You can attach a load balancer to a service as ingress, but what if you want to attach an API Gateway, or just set up a Route53 record for direct public IP ingress?
CDK has a few more variations in what you can do, but it tends to be even worse because it has class names like `ApplicationLoadBalancedEc2Service` and `NetworkLoadBalancedFargateService`. Switching your infrastructure requires switching to an entirely different top-level class when you define your infrastructure.
What I am proposing with the extensions interface is that we build more of a class-inheritance-style model, kind of like CSS, to let people mix in alternative behaviors, alternate configurations, or even attached resources in an additive way, on top of the existing infrastructure as code model.
All types of mutations, from configuration tweaks to attached resources, are just extensions that you attach to your existing infrastructure with the `ServiceDescription.add(Extension)` method, kind of like layering another CSS class onto your DOM element. I think for large and complex infrastructure this will be a more scalable, decoupled, and readable approach.
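A standalone sketch of that shape (names loosely modeled on the `ServiceDescription.add(Extension)` idea above, but this is illustrative code, not the real library): each extension is a layered mutation over a shared config object, applied in order, like stacking CSS classes on a DOM element.

```typescript
// Toy model of extensions-as-layered-mutations. All names and default
// values are invented for illustration.

interface ServiceConfig {
  cpu: number;
  memoryMb: number;
  publicIngress: boolean;
}

// An extension is just a mutation applied to the service's config.
type Extension = (config: ServiceConfig) => void;

class ServiceDescription {
  config: ServiceConfig = { cpu: 256, memoryMb: 512, publicIngress: false };

  add(extension: Extension): this {
    extension(this.config);
    return this; // chainable, so extensions layer like CSS classes
  }
}

// Reusable "classes" that any service can mix in:
const HighMemory: Extension = (c) => {
  c.memoryMb = 4096;
};
const PublicFacing: Extension = (c) => {
  c.publicIngress = true;
};

const service = new ServiceDescription().add(HighMemory).add(PublicFacing);
```

Each extension stays small and independently testable, and the service definition reads as a list of adjectives rather than a wall of properties.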
I have deep concerns regarding imperative programming w.r.t infrastructure as code. It's mostly developer-experience related: as a developer I just want to tell AWS/GCP/Azure to "give me three boxes behind a load balancer in two regions". I know it's currently popular to hate YAML (or HCL) but it works and at a glance I can get a feel of what my infrastructure should look like. Therefore IMHO declarative should be the preferred standard for infrastructure.
From an infrastructure management standpoint I completely agree with your take - do exactly this enough times and you get skew, compliance problems, pets instead of cattle. Hence you would want a CDK to facilitate those enterprise needs. Any sort of composability that doesn't break and has sane defaults would be ideal in a CDK. The problem I have with current CDKs is the amount of boilerplate to get this reproducibility, and then on top of that I have to write it in a non-declarative way.
I like where your head is at but I do not think AWSCDK extensions are the answer. I think cdktf and Pulumi are a little bit closer to the right ("my right"!) answer but still have a ways to go.
"This is what I want from infrastructure as code and I have yet to see it."
If you sit down with the terraform specifications for an AWS instance, a GCP instance, and an Azure instance, and start trying to write that harmonization, you will rapidly discover why for yourself. Even just trying to specify a network setup and putting an instance on the public internet is impossible to harmonize, without making something so lowest-common-denominator it is almost useless, let alone anything complicated.
Exactly. If you want portability across clouds, Terraform ain't it. The only way I know of to achieve that right now (for any reasonably complex architecture) is:
- Minimal amount of Terraform to deploy Kubernetes (which is different for each cloud provider)
> the structure and syntax for AWS is entirely different from Azure is entirely different from GCP
This!
We're essentially writing locked-in vendorscript. What we want is an actual infrastructure language. One that lets us write once and deploy anywhere (nods to Java).
That would also allow us to standardize the way additional tooling (monitoring, logging, etc.) hooks into everything. It would allow us to easily deploy to new environment types as they become available (I keep hearing about the wonders of WASM). It would allow standardized ways of doing ops testing.
You could also just use Java (or Kotlin Scripting, which is a bit more ergonomic for such use cases).
I know it's super unpopular in these parts but a lot of what the article asks for is satisfied by conventional statically typed OOP (with inheritance), and other features common in full blown programming languages. You don't have to actually do the setup imperatively, but you can construct the data structures describing what you want with such tools and let some engine bring your real infrastructure into compliance.
That's why the author is experimenting with TypeScript.
Obviously, full languages allow you to eliminate repetition in many different ways. Java interfaces with default methods are very close to 'traits', Kotlin has the same features with real properties and convenient syntaxes. It also supports builder DSLs which would be ideal for this.
The main reason to use such languages is that you get full IDE features out of the box, for instance, "show me everything that uses this trait/interface/abstract base class" is one click away. Refactoring is easy.
But there are other reasons. You will often want to define some behavior that can be composed, but which leaves "slots" that the user has to fill out. Other times you need to say "something just like that, but adjusted a bit". OOP type systems are good at this sort of thing.
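The "slots" point can be sketched with an ordinary abstract class (plain TypeScript here, though the comment's examples were Java/Kotlin; all names below are invented): the base class carries the composed behavior, leaves a hole the user must fill, and a subclass can say "just like that, but adjusted a bit".

```typescript
// Template-method sketch: shared behavior with user-filled "slots".

abstract class BaseDeployment {
  // Shared, composed behavior built from the slots below:
  manifest(): { replicas: number; healthCheckPath: string } {
    return {
      replicas: this.replicas(),
      healthCheckPath: this.healthCheckPath(),
    };
  }

  // The "slot" every user has to fill out:
  abstract replicas(): number;

  // A sensible default that can be adjusted when needed:
  healthCheckPath(): string {
    return "/healthz";
  }
}

class ApiDeployment extends BaseDeployment {
  replicas(): number {
    return 3;
  }
}

// "Something just like that, but adjusted a bit":
class LegacyDeployment extends BaseDeployment {
  replicas(): number {
    return 1;
  }
  healthCheckPath(): string {
    return "/status";
  }
}
```

Forgetting to fill a slot is a compile error, and the IDE can show every deployment that overrides the default in one click.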
The DevOps world seems to continually go through cycles where someone says, full languages are too powerful, we need a declarative subset! And then it's too limited and leads to too much repetition, so you start getting stuff like if statements encoded into YAML and other nonsense. And then people get dissatisfied and try to invent another minimal declarative markup, and the cycle repeats.
You can do that through abstraction. You “include” your Terraform Azure Provider or Terraform AWS Provider. At the end of the day, your module needs to know what it’s interacting with but not the higher level of abstraction. We have done it at my work to make it cloud agnostic just in case we need to go to another CSP
This should be done even if you don't care about being agnostic. Lay out your declarations in the format you'd like to interact with, then transform them into whatever they need to be.
Meaning we made our own wrapper for the terraform azure provider. The higher level does not expose those technical details. It just says virtual machine.
OAM has a model of components (things like containerized workloads, databases, queues), traits (scaling behavior, ingress) and in the latest draft, policies that apply across the entire application (high availability, security policy).
It's all a little disjointed and seems to have lost steam. KubeVela is powering along, but it's the only implementation, and IMO it's highly opinionated about how you do deploys and works well for Alibaba and perhaps not for others. But it has some interesting ideas.
By and large, the problem with on-prem here is that a lot of the items in any of these "infrastructure as code" tools abstract away a ton of things a human will have to do, unless you are using "on prem" to mean something more akin to "private overprovisioned cloud."
Adding a new machine to your setup? When is it delivered? Who is physically connecting it to power and a network rack? Adding additional storage? Same questions, basically.
All of the software that goes onto these machines can still be done this way, of course. But that is bog standard deployment scripts. Well, it was. Not sure when we changed that.
Yeah there are people who are talking about blah blah blah cloud complexity and how on-prem is some panacea. What they're really getting at is "why go through all this trouble to setup autoscaling groups and load balancers when I'm perfectly fine running things on a single Pentium II whitebox sitting in my closet?" It doesn't seem to occur to them that although that might work _for them_ the whole reason we go through this labor is we need MORE than that.
The split between parameterized classes and logic sounds a bit like the split between Puppet and Hiera. The idea was probably a good one, but something about the implementation made people go overboard with it.
I feel IaC really peaked around Puppet 3 and Chef 1. IaC should be simple enough that people actually use it, and trivial to write providers for. People tend to glue much-too-large libraries onto their IaC platforms and end up with a maintenance mess, which is what kills it in the long run. However, both of the above projects went corporate and grew legs and arms and a billion other features, of which nobody will ever use more than a subset. Most people migrated to Ansible, which kept more of the open-source project culture and was simpler in design.
Now people seem to use a little of this, a little of that. Some Ansible, some Terraform, some other stuff. They don't know what they're missing when the entire stack is built from the ground up out of templated components defined in a common declarative language. Some people seem to really like Nix, which I haven't used professionally, but from what I've seen it inherits the same type of design. There was an experimental project called cfg which worked in real time using hooks such as inotify, which was promising; if there were a Kubernetes distribution made like that, it would be really easy to manage components that didn't belong to a host.
IaC is a silly term. Infra is mostly hardware, which by definition is something other than software.
But hardware needs to be configured. And load balancers, firewalls, clusters etc. are perfectly suited for the declarative style of the Ansible/Puppet/Chef type of tools. That is what people usually mean by "IaC", as silly as it may be.
Those tools really shine when used end-to-end. The definition of an application can contain which ports need to be open towards backends, which database users exist, and the health check parameters for the load balancer. The system can then declaratively sort out the respective templates, and configuration really lives in one place. Shared secrets are defined exactly once, and rotation is deterministic across the entire environment.
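The end-to-end idea can be sketched like this (all names here are hypothetical, not any real tool's API): one application definition that every downstream template consumes, so the port and health check are declared exactly once.

```python
# Hypothetical sketch: a single app definition drives every template that needs it.
APP = {
    "name": "billing",
    "backend_port": 8443,
    "db_user": "billing_rw",
    "health_check": {"path": "/healthz", "interval_s": 10},
}

def firewall_rules(app):
    # The firewall template derives its open ports from the app definition.
    return [{"allow": "lb", "to": app["name"], "port": app["backend_port"]}]

def lb_config(app):
    # The load balancer template derives its health check the same way.
    return {"pool": app["name"], "port": app["backend_port"],
            "check": app["health_check"]["path"]}

print(firewall_rules(APP))
print(lb_config(APP))
```

Change `backend_port` in one place and the firewall rule and load balancer pool both follow; that is the "configuration lives in one place" property.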
I believe it is alive and well, but it was never a big contender in the space. I personally have no experience with it, for professional reasons and a slight unease about home-rolled crypto. I believe it is quite similar in concept to Ansible/Puppet/Chef, with a high-level declaration of resources and provider implementations in a "real" language. Perhaps someone else can chime in!
My problem with the current IaC systems is the state storage. It should not be needed! Instead, the IaC tool should introspect the systems it's managing and build the necessary state on the fly.
Say I have resource A with property X=1 defined in IaC. Someone comes along and modifies it to X=2 outside of the tool. Your way, the IaC tool would see that change and think it was naturally part of the desired state, whereas stored state will catch the drift. And before anyone says "well, don't modify outside of IaC," I say 1) that's often impractical, and 2) sometimes automation can modify resources outside of IaC beyond your control.
Also, dynamically creating state creates all sorts of concurrency issues, which is another nice thing about stored state, you can put a lock on it.
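The drift scenario can be made concrete with a toy three-way comparison (a sketch of the idea, not any real tool's logic): with a stored snapshot from the last apply, the tool can distinguish an out-of-band edit from a pending change in the code.

```python
desired = {"X": 1}   # what the IaC code declares
stored  = {"X": 1}   # what the tool recorded at the last apply
actual  = {"X": 2}   # someone modified the resource out of band

def classify(key):
    if actual[key] != stored[key]:
        return "drift"            # changed outside IaC since the last apply
    if desired[key] != stored[key]:
        return "pending change"   # the code changed; the next apply updates it
    return "in sync"

print(classify("X"))  # → drift
```

Without `stored`, the tool only sees `desired` and `actual`, and cannot tell whether the difference is drift to revert or an intentional change it has not applied yet.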
First, you can guard against it. Just periodically re-run the infra code in the "dry-run" mode (from a CI/CD system) and scream if you see any differences.
Second, this is still fine. Don't make changes outside of the IAC control. And if you do make them, retro-fix the IAC files until there is no diff with the actual state.
Third, IAC should have an option to ignore some changes.
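The periodic dry-run guard, plus the "ignore some changes" escape hatch, could look something like this sketch (`fetch_actual` is a hypothetical stand-in for introspecting your provider's API):

```python
DESIRED = {"instance_size": "m5.large", "tag_owner": "platform",
           "last_patched": "2024-01-01"}
IGNORED = {"last_patched"}  # tolerated drift, analogous to ignore_changes

def fetch_actual():
    # Stand-in for querying the real environment via the provider API.
    return {"instance_size": "m5.xlarge", "tag_owner": "platform",
            "last_patched": "2024-06-01"}

def drift_report(desired, actual, ignored):
    # Report every non-ignored key where reality disagrees with the code.
    return {k: (desired[k], actual.get(k))
            for k in desired
            if k not in ignored and actual.get(k) != desired[k]}

diff = drift_report(DESIRED, fetch_actual(), IGNORED)
if diff:
    print("DRIFT:", diff)  # in CI, exit non-zero here and alert
```

In practice `terraform plan -detailed-exitcode` from a scheduled CI job does exactly this kind of check: exit code 2 means a diff exists.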
> Also, dynamically creating state creates all sorts of concurrency issues, which is another nice thing about stored state, you can put a lock on it.
In my experience, this is not a big issue in practice. Production deployments should be done through some kind of CI/CD, and it naturally serializes builds.
However, nothing stops you from adding locking without doing the full state management.
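Locking without full state management can be as small as an atomically created lock marker. This sketch uses a local lock file for illustration; in a real setup you'd want something shared between runners (a DynamoDB item, a blob lease, etc.):

```python
import os
import tempfile

LOCK = os.path.join(tempfile.gettempdir(), "iac-apply.lock")
if os.path.exists(LOCK):   # clear a stale lock left by a crashed run
    os.remove(LOCK)

def acquire():
    try:
        # O_CREAT | O_EXCL makes creation atomic: it fails if the file exists.
        fd = os.open(LOCK, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False  # someone else is mid-apply

def release():
    os.remove(LOCK)

assert acquire()        # the first apply takes the lock
assert not acquire()    # a concurrent apply is refused until release()
release()
```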
> Second, this is still fine. Don't make changes outside of the IAC control. And if you do make them, retro-fix the IAC files until there is no diff with the actual state.
This doesn't work in practice. Some aspects of the business want to tweak things and it should be reasonably guaranteed that the automated side never touches it.
Terraform state gives this assurance because it won't destroy resources not under its state.
> Some aspects of the business want to tweak things and it should be reasonably guaranteed that the automated side never touches it.
What would a legitimate case for this be?
It seems to me like any changes either must be done via IAC -- and tracked in source control, PR'd, tested in non-prod, etc -- or a missing feature.
If there's a legitimate case for modifying something not in IAC, it should be supported -- this is what I mean by "missing feature". The app and/or IAC should have code for that feature.
Modifying IAC-deployed settings is akin to someone hacking the binary of an executable from a software vendor while still expecting the vendor to support that modified executable. Not gonna happen.
There are 2 different use cases. One is where you want your IaC configuration to be the source of truth - any changes made outside of IaC are drift and should be fixed. The other is where you want to take the changes that are made OOB and update your IaC configuration to mirror them - in this case you use the IaC config to document the state of your live infra.
The IAC tool could just as easily recognize the change as a part of the current state instead of the desired, and revert the drift. Whereas stored state would likely miss the drift without another process to compare and update the stored state against the actual.
That is how Puppet works. Introspect the current state, compare with the desired state, fix as needed. It mostly works, but in reality it will never reach the point of introspecting literally all of the current state. So there are always ways to subtly break things without the tool noticing. (E.g., a file object that ensures the correct path, contents, ownership, and mode, but doesn’t check xattrs or ACL. [That’s hypothetical, not how the actual Puppet file module works.])
A lot of people will argue that state helps protect against drift, but the real reason I find that you have to have state is to store values that won't be returned a second time and still construct and connect the graph of resources in the IaC templates. For example, if you declare the need for an RDS database and connect its output credentials into another application, you'll need state in order for the applies to work a second time because you'll never be able to retrieve the values from the target provider again.
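Here is the write-once value problem in concrete form (a toy sketch; `create_database` stands in for an API like RDS's, where the generated password appears only in the creation response):

```python
import secrets

def create_database(name):
    # Like many cloud APIs: the generated password is returned ONCE,
    # at creation time, and can never be read back afterwards.
    return {"id": f"db-{name}", "password": secrets.token_hex(8)}

state = {}  # the IaC tool's stored state

def apply(name):
    if name not in state:
        state[name] = create_database(name)  # first apply: capture the secret
    return state[name]                       # later applies: read from state

first = apply("orders")
second = apply("orders")
assert first["password"] == second["password"]  # only possible via stored state
```

Without the state file, the second apply has no way to recover the password to wire into the consuming application, regardless of how good its introspection is.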
Yeah, and now all your creds are available for everyone to see. Instead use IAM authentication for RDS, or if it's impossible, store creds in SSM or Secrets Manager.
Yeah, it's not fully transactional, but it will work fine in practice.
Is this a trick to make infrastructure/devops engineers learn TypeScript?
But hey, looking at the origins of OOP and its main uses back then (simulations and UI), maybe this is exactly what one needs to describe and set up infrastructure, and there have been various projects going in that direction.
Make message passing truly async, throw in garbage collection, make it dynamic (i.e., creating a new instance of an object triggers some sort of deployment) and voilà: your traversable, introspectable object graph is now a representation of your infrastructure.
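A sketch of the shape of that idea (entirely hypothetical, just to illustrate): instantiating an object "deploys" it as a side effect, and the resulting object graph is the live infrastructure.

```python
DEPLOYED = []  # stand-in for the real provisioning side effect

class Resource:
    def __init__(self, name, depends_on=()):
        self.name = name
        self.depends_on = list(depends_on)
        DEPLOYED.append(name)  # constructing the object "deploys" it

    def graph(self):
        # The object graph is traversable and introspectable.
        return {self.name: [d.name for d in self.depends_on]}

net = Resource("vnet")
db = Resource("database", depends_on=[net])
print(db.graph())   # → {'database': ['vnet']}
print(DEPLOYED)     # → ['vnet', 'database']
```

Garbage collection would then map naturally to teardown: when nothing references a resource object anymore, its real-world counterpart gets decommissioned.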
When I need to provision anything I have a powershell script that interacts with Azure CLI.
My script sets up a new resource group for every service we create, logging, key vault, webapp/functions, and if needed some kind of data storage or queuing.
In my powershell script I can via a variable indicate which environment I want to spin up: dev, staging, prod.
I have one yaml file which is for my build and a build trigger which points to the above powershell script with the given environment.
All environments (dev, staging, prod) are set up manually, with manual user assignments for deployment access etc.
It's really lightweight but I also believe it's lightweight because we run a small services setup where each service takes care of its own provisioning.
Terraform and YAML are so verbose, but that's not the most problematic part. You can't execute those files from your local machine.
> Terraform and YAML are so verbose, but that's not the most problematic part. You can't execute those files from your local machine.
Have you ever actually used terraform? You execute it from your own local computer, or from CI/CD. It runs in a compute resource you own, not the cloud provider.
> You can't execute those files from your local machine.
You can execute Terraform from your local machine just as easily as a PowerShell script. I dare say you could even make it work with a shebang if you wanted (though I've never tried that).
Looks to me like a vendor specific language with limited capabilities compared to others. Just use Pulumi TypeScript (self hosted).
I'm not affiliated with Pulumi, it's just that unjustified vendor lock-in infuriates me.
We use it in our company to provision all our cloud resources. Granted, you can't create app registrations and such with it (yet) like you mentioned, and there are rough spots, but I think that is quite far from "not very useful".
You can execute it also on your computer, and when an individual runs it, it really should not use a service principal, as those are intended for IAM of automated systems, not people.
I run terraform against my Azure sandbox from my computer with nothing more than azure CLI credentials that were stored after I logged in with az.
The CSS like IaC language idea was not what I was expecting. It seems ok, but I was hoping for something deeper. What I mean is that I have always felt like there is tension or a mismatch between IaC and the underlying services in a more general sense. I’ve used CDK, Pulumi, Terraform, and CloudFormation and you can argue the merits of each. But they all kind of suck in the sense that you are programming a machine that was not really designed to be programmed. Sure AWS and all the rest have APIs to call and IaC is a decent abstraction over those APIs, but imagine if instead they exposed some lower level interface designed to execute IaC programs natively. I feel like that is the ultimate path to IaC that feels like actual programming.
> Centrally updatable: Sometimes best practice or corporate policy changes over time. You can update what LowCost or SecurityPolicy means later on, in one place, and that change will reapply to all resources that used it.
It sounds great but it's not. This is essentially the Fragile Base Class problem. You may _think_ that updating one of these traits in a single place will be safe and do what you want, but it may be disastrous for whoever is using it. And you're not going to find out until you deploy it.
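A toy illustration of the hazard (all names hypothetical): every resource that composed the shared trait silently changes when the central definition does.

```python
# A shared trait, updatable "in one place".
SECURITY_POLICY = {"min_tls": "1.2", "public_access": False}

def storage_account(name):
    # Dozens of resources fold the shared trait into their own settings.
    return {"name": name, **SECURITY_POLICY}

before = storage_account("logs")

# Central policy update: sounds safe, reapplies everywhere...
SECURITY_POLICY["min_tls"] = "1.3"

after = storage_account("logs")
# ...but any consumer whose clients can't speak TLS 1.3 just broke, and
# nobody finds out until the change is deployed.
print(before["min_tls"], "->", after["min_tls"])  # → 1.2 -> 1.3
```

That is the fragile base class problem in miniature: the trait's author can't see all of its consumers, so a "safe" central change has unbounded blast radius.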
There's only so much you can do by building on the current towering abstractions offered by GCP, AWS, etc. One of the main problems is just the slowness of it all.
Dark is a good example of something that sidesteps this stuff by more fine grained integration of infrastructure and app code.
A CSS-like language for IaC is literally the last thing I would have expected someone to suggest.
It’s an interesting idea. My initial reaction was “you can take my HCL from my cold dead hands” but I can’t seriously argue that Terraform is perfect and that I enjoy writing so much boilerplate.
Interesting comparison with the web browser stack! In that sense CDK (generating CloudFormation) is more like PHP (generating HTML). I wonder when virtual DOMs, AJAX, and CSS preprocessors get introduced to IaC.
Is there any similar solution with a self-hosting focus, for managing my own comprehensive data center infrastructure plus cloud? Routers/switches, load balancers, firewalls, CGN, bare-metal servers, VMs, containers, applications, etc.
@cyberax said, "My problem with the current IaC systems is the state storage. It should not be needed! Instead, the IaC tool should introspect the systems it's managing and build the necessary state on the fly."
@firesteelrain said, "you can do that through abstraction. You "include" your Terraform Azure Provider or Terraform AWS Provider. At the end of the day, your module needs to know what it’s interacting with but not the higher level of abstraction. We have done it at my work to make it cloud agnostic just in case we need to go to another CSP"
Single ops eng in a 3-person startup here. Ops eng is only one of my hats right now :) I found Crossplane to be a solid tool for managing cloud infra. My assertion is that "the only multi-cloud is k8s," and Crossplane's solution is "everything is a CRD." They have an extensive abstraction hierarchy over the base providers (GCP, TF, Azure, AWS, etc.), so it's feasible to do what firesteelrain did. My client requirements range from "you must deploy into our tenant (could be any provider)" to "host this for us."
I can set up my particular pile of yaml and say "deploy a k8s cluster, loadbalancers, ingress, deployments, service accounts (both provider and k8s), managed certs, backend configs, workload identity mgmt, IAP" in one shot. I use kustomize to stitch any new, isolated environment together. So far, it's been a help to have a single API style (k8s, yaml) to interact with and declaratively define everything. ArgoCD manages my deployments and provides great visibility into active yaml state and event logs.
I have not fully tested this across providers yet, but that's what crossplane promises with composite resource definitions, claims and compositions. I'm curious if any other crossplane users have feedback on what to expect when I go to abstract the next cloud provider.
cyberax's note on state management is what led me away from TF. You still have to manage state somewhere, and Crossplane's idea was: k8s is already really good at knowing what exists and what should exist, so let k8s do it. I thought that was clever enough to go with, and I haven't been disappointed so far.
The model extends the k8s ecosystem, and allows you to keep going even into things like db schema mgmt. Check out Atlas k8s operator for schema migrations- testing that next...
I also like that I can start very simple, everything about my app defined in one repo- then as systems scale I can easily pull out things like "networking" or "data pipeline" and have them operating in their own deployment repo. Everything has a common pattern for IAC. Witchcraft.
The thing that I think this could run up against is that in HTML+CSS it is fairly common to take an element and apply a whole bunch of properties in coordination with each other. That is, I'm going to set similar margins and paddings and fonts and many other properties on each element, and there are a lot of broad similarities. This is where CSS variables come in; even if I'm applying a color to a lot of elements I'm probably pulling from a much smaller palette and if I change one of them I want to change all.
Cloud template definitions also have a lot of settings, but from what I can see, they are all different, all the time, for lots of good reasons. If I'm deploying a lot of different kinds of EC2 instances, I've got a whole bunch of settings that are going to be different for each type. Abstracting is a much different problem as a result. And it isn't just this moment in time, it's the evolution of the system over time, too. In code, overabstracting happens sometimes. In cloud architecture it is an all-the-time thing. It is amazingly easy to over-abstract into "hey this is our all-in-one EC2 template" and then whoops, one day I want to change the instance size for only one of my types of nodes, and now I either need to un-abstract that or add yet another parameter to my all-in-one EC2 template.
The inner platform effect is very easy to stumble into in the infrastructure code as a result, where you have your "all-in-one" template for resource X that, in the end, just ends up offering every single setting the original resource did anyhow.
By contrast, I've pondered the "focus on the links rather than the nodes" idea a few times, and there may be something there. However the big problem I see is that I like rolling up to a resource and having one place where either all the configuration is, or where there is a clear path for me to get to that point. Sticking with an instance just to keep things relatable, if I try to define an instance in terms of its relationship to the network, to the disk system, to the queues that it uses and the lambda it talks to and the autoscaling group it is a part of, now its configuration is distributed everywhere.
One possible solution I've often pondered is modifying the underlying configuration management system to keep track of where things come from, e.g., if you have a string that represents the name of the system you're creating, but it is travelling through 5 distinct modules on its way to the final destination, it would be great if there was a way of looking at the final resource and saying "where exactly did that name come from?" and it would tell you the file name and line number, or the set of such things that went into it. Then at least you could query the state of a resource, and rather than just getting a pile of values, you'd be able to see where they are coming from, dig into all the things that went into all the decisions, that might free you to do link-based configuration rather than node-based configuration. But you'd probably need an interactive explorer; if for instance the various links can configure the size of the underlying disk and you take the max() of the various sizes (or the sum or whatever), you'd need to be able to look at everything that went into the max and all the sources of those values; it's more complicated than just tracking atomic values through the system.
I've often wished for this even in just my small little configs I manage compared to some of you, and it is possible that this would be enough of an advantage to stand out in the crowd right now.
(I think the "track where values came from and how they were used in computation" could be retrofitted onto existing systems. "Focus on links rather than nodes" will require something new; perhaps something that could leverage an existing system but would require a new language at a minimum.)
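The provenance idea could be prototyped as a wrapper that carries origin metadata through computations (a sketch with made-up file/line labels; a real retrofit would hook the config language's evaluator):

```python
class Tracked:
    """A value that remembers every source location that contributed to it."""
    def __init__(self, value, sources):
        self.value = value
        self.sources = list(sources)  # e.g. "file:line" strings

def tracked_max(*vals):
    # Aggregations keep ALL contributing sources, not just the winner's,
    # so you can inspect everything that went into the decision.
    winner = max(vals, key=lambda v: v.value)
    all_sources = [s for v in vals for s in v.sources]
    return Tracked(winner.value, all_sources)

a = Tracked(100, ["modules/db.tf:12"])    # hypothetical module paths
b = Tracked(250, ["modules/app.tf:40"])
disk = tracked_max(a, b)
print(disk.value, disk.sources)
# → 250 ['modules/db.tf:12', 'modules/app.tf:40']
```

Querying `disk.sources` answers "where exactly did that value come from?" including the losing inputs to the `max`, which is the harder case the comment above calls out.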