If you do any non-trivial devops work on cloud providers, it's immediately obvious why this is nonsensical.
Let's take the most basic example: auto-generated ids. Many resources in AWS, GCP, etc. have auto-generated ids (use tags, you say, but many resources don't support tags, or the tags are already claimed by some other system). Now, when terraform creates that resource you have to modify the config to contain the id. But if you have any sense, terraform runs as part of a CI system that lets others review your code before merging, deploy to staging, etc.
So now does the terraform process need to make an automatic git push? What if there's a conflict? Does it make a PR that has to be manually merged? All of this is much more complicated than just having one JSON file in S3.
I have actually managed resources with Ansible where you have this problem and it's worse. And this is just _one_ thing.
Is Terraform's state story perfect? No. There are definitely annoyances, and one thing I'd love to see is a way to declaratively handle imports, renames, etc. when you need to, but it's better than the alternative.
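Worth noting: recent Terraform releases have grown some of this. moved blocks (1.1+) handle renames declaratively, and import blocks (1.5+) handle imports. A sketch, with hypothetical resource addresses:

    # declarative rename: no more manual "terraform state mv"
    moved {
      from = aws_instance.app
      to   = module.app.aws_instance.this
    }

    # declarative import: adopt an existing resource at plan/apply time
    import {
      to = aws_instance.debug
      id = "i-0123456789abcdef0"  # hypothetical instance id
    }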
Interesting read. It seems to focus on the neck of the woods the author is exposed to.
Terraform isn't only about cloud APIs. It lets me manage Keycloak realms, SSH keys, cloud resources, Postgres databases, git repos, IMAP accounts, and so on.
The state is there to be treated as the source of truth. It answers the question "do I have what I want to have?". With that state, it is possible to cross-reference various resource types without having to load the current state of the world on every run. I'm surprised the author did not see that as a performance issue.
Imagine having to load all Route53 records, all buckets, EC2 instances, IAM roles, … on every execution, and imagine you have 200+ machines. That's what we used to do with Puppet and Chef, no?
It turns out that always fetching the view of the world from an API is pretty expensive and quickly exhausts API rate limits.
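A sketch of the difference (resource names made up). With state, a cross-reference resolves from the recorded id; stateless, every run would have to rediscover the VPC with a tag query:

    resource "aws_vpc" "main" {
      cidr_block = "10.0.0.0/16"
    }

    resource "aws_subnet" "a" {
      vpc_id     = aws_vpc.main.id   # id comes straight from state, no search
      cidr_block = "10.0.1.0/24"
    }

    # the stateless alternative: an API lookup on every single run
    data "aws_vpc" "by_tag" {
      filter {
        name   = "tag:Name"
        values = ["main"]
      }
    }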
It's a pretty weak article, without any effort to suggest how it could work without state.
Trends come and go. I too use terraform now because that is what my customers use.
Previously we used Ansible for the same thing, in a largeish environment, in production. It had the obvious benefit of being declarative and stateless.
The other comments here seem to focus on how that works badly together with manual state changes made externally from the system. The answer is that it requires another way of working where the state is the git repo. The question is not what to do when someone spawns extra test nodes, but why they did not do it in a version controlled manner.
Perhaps as a reaction to the discipline required, something about Kubernetes attracted a lot of people used to manipulating state by manual interaction. Every installation I have seen has a web interface in use, whereas not as many have in the Ansible world. I fully expect this pendulum to swing back and become more declarative and version controlled again.
Personally I think the custom DSLs are a bigger issue. I spend a lot of time wrangling TF to get reusable, configurable modules.
The more I use TF, the more I think it would be better to remove _all_ dynamic features and use a real language to generate TF configs as flat, static files.
Yes, conditionals and loops in TF are limited and have various annoying and surprising edge cases. But if something is hard to do with the Terraform DSL it's usually a good idea to reconsider if it is really something that one should be doing.
We want infrastructure automation to be boring and just work.
The risk with general purpose programming languages is that people will always find a way to outsmart themselves. Yes, sure, you can use the testing tool chain of the language of your choosing. But it's not like we have figured out how to write software without bugs, despite all the awesomeness of modern languages.
Yeah, there is definitely something to be said for having a limited DSL. I feel, though, that the language as-is has a bit too much in the way of dynamic features that kinda half work and then give you completely useless error messages. But tbf, it has gotten a lot better with recent versions.
But to explain where I'm coming from, this is something I ran into recently. This breaks and returns null if http is not set:

    lookup(each.value.http, "port", 8080)

So I need to do this to make it work:

    coalesce(lookup(each.value.http, "port"), 8080)
Not a huge deal, but just one of these many things over the years where I think don't reinvent the wheel.
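For what it's worth, try() may be the cleaner escape hatch here; a sketch, assuming each.value.http can be null:

    # try() falls back to the default even when each.value.http itself is null
    port = try(each.value.http.port, 8080)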
If you use Terraform, you should only update your infrastructure through Terraform and persist the state in a shared place (e.g. S3 with versioning). I see people having a hard time when they mix the AWS CLI, the AWS console, and Terraform.
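A minimal sketch of such a backend (bucket and table names hypothetical):

    terraform {
      backend "s3" {
        bucket         = "my-team-tf-state"       # versioned S3 bucket
        key            = "prod/terraform.tfstate"
        region         = "eu-west-1"
        dynamodb_table = "tf-locks"               # optional: state locking
      }
    }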
This is my big pain point with Terraform. Sometimes I forget to make updates only through Terraform. Especially when it's trivial to do the change through the UI but not necessarily in the config.
I wish Terraform could automatically detect the change and convert it into code.
hello? how is this even a question? i have a hard time believing that idempotent infrastructure mutations are not the right move almost always.
shameless plug[1], i’ve been exploring an aws specific approach to infrastructure that is stateless and idempotent for exactly these reasons.
slow, finicky, stateful deploys are about as awful as it gets. add a pinch of lowest common denominator among all providers, and that’s a tough pill to swallow.
“level triggered” vs “edge triggered” is the Kubernetes framing of stateless vs stateful. It’s one of the reasons any cluster configuration drift eventually converges back to the desired state.
If you want stateless, then you can use Ansible and its providers. Enjoy spawning new instances every time you change your infrastructure, rather than having existing ones change.
You can also use Ansible with Jinja2 templating to generate your Cloudformation templates.
Describe loops and other properties in Ansible and have it build out a complex Cloudformation template for it to deploy. Use an Ansible variable for your stack name so you can update/delete an existing stack and you're good to go. Could even break it down by environment, so dev = smaller instances vs stage/prod etc.
This looked suspicious when I started reading, and when I reached the comparison to Puppet and Ansible, my feeling that the author lacks understanding of TF was reinforced.
You couldn't do everything that Terraform does today without a local state, but perhaps that would be a good thing? Call it "Terraform strict mode".
As every application developer knows, duplication of state is a primary source of bugs. To combat this, React/Flux type of architectures became extremely popular where state flows in one direction only. They dictate that, no you can't just cheat a little and use jQuery to modify some element, it _will_ get bulldozed on the next render. And a lot of Terraform headaches do come from this analogous reconciliation of what really is (our cloud env), what we want (our TF code) and this intermediate state of what TF thinks the cloud state is.
So, by saying that you cannot have resources outside those defined by TF, a massive simplification with far-reaching consequences becomes possible.
How I imagine the experience would be:
- You could say that your Dev env is a shitshow of manually created resources, only partially TFed, and always will be. But your Production and Staging envs are opted in to "strict mode". This means that if there is a conflict you only have two options: import the offender or destroy it. The critical mindset change is that this is a good thing and will save us a lot of tears later on.
- Caching is an orthogonal concern. Terraform mixes these two together to its detriment, but the nature of a cache is such that you can safely blow it away and perhaps the next reconciliation will be slow, but it will be accurate. I also don't believe it would actually be that slow, tools like Cloudcraft map the entire metadata of your account in seconds.
- I find the excuse that some resources don't support tags intellectually lazy. Off the top of my head, thinking about it for a minute: could you tag a parent resource with the child metadata you need (sketched below)? E.g. individual DNS records don't have tags, OK, tag the Zone with childA=value. Same thing with tag length limits; you can work around it, concatenate values or whatever. However, in a truly strict mode you wouldn't even need metadata in tags, because the TF code describes the entire target environment.
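A sketch of that workaround, with hypothetical names and values; the zone carries metadata for its untaggable child records:

    resource "aws_route53_zone" "main" {
      name = "example.com"

      tags = {
        child_www = "A:203.0.113.10"
        child_api = "CNAME:api.internal.example.com"
      }
    }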
I wish Terraform would entertain such a strict, stateless mode. Unfortunately it will probably take another tool, because the problem is not so much technical as it is an entire mindset change.
I like the idea of stateless (or at least not having to manage the state, as in the case of CF), but I'd hate to see the API call times if one needed to search by tags for every CRUD operation. Particularly in a multi-account setup.
But a generic “none” backend as described in the OP would simply be impossible. The support for diffing desired vs actual state must be implemented in the resource provider, which in turn needs support from the cloud API. Ubiquitous labels, namespaces and global query performance seem to be the primary blockers today, judging by most other comments here.
It would be an interesting experiment if someone attempted to make such a provider. Also interesting: do existing providers avoid state internally when possible, or do they just use it because it is available?
Looking at the challenges with kubectl apply --prune, which already has all the enablers laid out, it would require a big effort.
Define everything in something like CDK (... I use Ruby to generate the tf.json), generate the code, import everything you can without error, and apply the rest.
Performance will be _bad_ but that will completely eliminate state problems.
I think the reason terraform did this is that if you have a medium-large deployment that you plan/apply often, you'll probably run face first into cloud management API limits. It's not just bad performance.
You're absolutely right. I started using Terraform around 2016 and it was pretty common to get barked at by the AWS API if you were repeatedly running plans, even with state. I bet the cloud providers have had to make significant infra changes to support the growing number of customers using TF.
There are still plenty of services in AWS that have painfully low service limits for API calls. The first one that bit me was creating parameters in SSM Parameter Store using CloudFormation.
CF tries to create as many resources in parallel as possible based on the dependency graph it creates.
The only way around it was making each Parameter resource dependent on the other (using DependsOn) to force the resources to be created sequentially.
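The same trick translated into Terraform terms, as a sketch (parameter names hypothetical):

    resource "aws_ssm_parameter" "first" {
      name  = "/app/first"
      type  = "String"
      value = "one"
    }

    resource "aws_ssm_parameter" "second" {
      name  = "/app/second"
      type  = "String"
      value = "two"

      # artificial dependency forces sequential creation, dodging the rate limit
      depends_on = [aws_ssm_parameter.first]
    }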
Before the CF vs TF holy wars began, this is an API limitation that you would hit regardless of your IAC.
cdktf doesn’t really help with state problems in my opinion. Under the hood terraform is still storing state, and if you’re working on a team you’ll need to share that state, e.g. via S3. If you don’t have any state and try to make a change, terraform will try to recreate resources and fail. Importing is a pain.
cdktf is great, but I would also rather do away with the state. I’ve gotten into too many problems that were only resolved by deleting everything and starting over.
Oh I mean using the synth stage to write ephemeral code to disk to be applied against “to be imported state.” Disclaimer: this is a terrible idea if you aren’t using state.
I use this today to keep very “wide” modules synced
Terraform has to have state, for the reasons outlined by other commenters.
But, it is undoubtedly a pain in the arse in practice. Invariably someone else is doing a change on a branch but has already applied it to the development infra using the shared state bucket, which then pollutes it for everyone else: you don't have those changes, so your applies will now want to undo them.
As a first timer with this stuff, I think a key lesson learned is to balkanise your terraform quite heavily - we have a tfstate per environment but that is nowhere near granular enough, slicing it into smaller pieces would obviate many of those 'pollution' problems.
We use a single TF state per Google project. Instead of slicing the state into multiple substates, we use modules as much as possible and use "-target=module.module_name" to test our branches.
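Roughly like this (module names hypothetical):

    # one state for the whole project, carved into modules
    module "network" {
      source = "./modules/network"
    }

    module "app" {
      source = "./modules/app"
    }

    # a branch that only touches the app is then tested with:
    #   terraform plan -target=module.app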
We might take another look, but back when we launched an internal platform, it was literally impossible to import manually created resources into CloudFormation. Terraform was the only game in town. I also keep hearing about long delays in CF supporting new features; the TF provider is usually well maintained.
I wish either of them had a more graceful plugin story, though.
I haven't found a good way to handle resources that were created outside TF, like an EC2 instance spun up in staging to debug an issue and then forgotten about. They don't exist in TF state or config, so TF simply ignores them.
I've tried terraformer[1], but I can't tell how well maintained that is (it failed to get my creds, I had to modify the code to fix it, and then it crashed with an obscure error).
I think the author of this article misunderstands what Terraform is for and what it does.
The author compares Terraform to Ansible and Puppet, but these are not analogous tools. If you compare to Pulumi, you'll notice it also has state.
Keeping track of what is live and what is expected to be live is much of the utility of Terraform.
If the author is getting some Terraform state pain (e.g., from drift caused by a team making changes to infrastructure and forgetting to represent those changes in code), they should try using Terraform Cloud or building a CI pipeline for their infrastructure.
"None" state store definitely would be good for one of our case - automatic preview deploys for pull requests.
The stateful approach works well 99.9% of the time, but rare errors, like being unable to destroy a resource because of cloud provider issues, mean you need manual intervention (e.g. a resource can't be created because it already exists but isn't managed by the state).
Even rare errors are not rare when you have 20+ devs :-)
That needs manual fixes (importing state, killing resources, etc.), but not all devs have the access rights or knowledge to do this.
>Also, if Terraform configuration is refactored, for example, to wrap a bunch of frequently copy-pasted resources into a module, state must be manually reconciled before proceeding.
Given the ability to import resources this actually wouldn't be that big of a lift for some providers like AWS.
Wouldn't take much to hack something together to test this out, either... parse the TF for resources, look up what is used for their IDs, run TF import with the discovered IDs from the service provider, and then your local state is up to date; run your plan/apply and blow away the state when you finish. But this is super gross IMO :)
Would say that half of the point of Terraform is that you have a canonical expected view of the infrastructure, with the state as a log to assert it in each environment. It forces you to do all changes code-first. Terraform only solves the provisioning part and is easier to work with than Ansible. One still needs something like Ansible for configuration, though.
something that can help reduce the need for state is aggressively partitioning systems into separate aws accounts.
then you can KNOW that no other random infrastructure should exist in an account.
terraform definitely does help coordinate the bunk beds of room mates. wouldn’t want them to accidentally discard each other’s pillows as they move in and out of the shared space.
I think the best thesis for this is decentralisation. The stateless approach only works if this one tf file (or set of files) manage(s) the whole world.
As soon as you get to multiple teams managing (say) AWS infra, you can't infer that a resource present in infra but not in tf file means a resource should be deleted.
I am a big fan of Azure ARM/Bicep. I can’t imagine needing to deal with stateful IaC. Perhaps Azure doesn’t need to do that because it’s able to provide guarantees about their own platform or something.
The actual infrastructure and its configuration is the state. It does a diff to see what needs changing.
Azure has some pretty strict internal requirements around resource creation that make this possible. Every service must provide a "Resource Provider"[0] API that supports idempotent creation operations and standardized query patterns for read operations. Whether you create a resource through an ARM template, in the portal, or via the CLI, it's going through the same unified API surface.
So, there is definitely still state, it's just stored centrally.
I am not sure how feasible the proposal is.
Certainly cloudformation is too archaic without Jinja & co.
Terraform doesn't need extras, but provider dependencies can be painful.
Yep and the configs are state too. These tools (Chef, Puppet, Ansible, TF, etc) are state replication engines; they replicate the state described in the config into other systems.
The simple solution is to have tf launch an EKS cluster with route 53 external-dns, then the ingresses can be defined as k8s resources which tf also manages :D
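A sketch of that layering, with hypothetical hostnames and service names; the ingress is a Kubernetes resource but still declared and tracked by Terraform:

    resource "kubernetes_ingress_v1" "app" {
      metadata {
        name = "app"
        annotations = {
          # external-dns picks this up and manages the Route53 record
          "external-dns.alpha.kubernetes.io/hostname" = "app.example.com"
        }
      }

      spec {
        rule {
          http {
            path {
              path = "/"
              backend {
                service {
                  name = "app"
                  port {
                    number = 80
                  }
                }
              }
            }
          }
        }
      }
    }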
You could work without state in some way, but the state is actually a feature that helps you.
Imagine you go down the Ansible route and you write idempotent Ansible code. Then you could argue: “See. I don’t need state. My code is idempotent. I just use this playbook to apply.”
Now think of deleting resources.
You could have a delete playbook, maybe. Then how would you choose whether to create or delete stuff?
Maybe a colleague gives you a ticket: “please delete”
Now there are various scenarios:
You do your thing (on your laptop) and tell everybody else in your team to not run the CreatePlaybook as you have to run the DeletePlaybook first for this one machine. Afterwards you delete the actual machine/resources from your ansible repository and tell the team: “please use the newest main branch”.
So: Here is your equivalent to terraform state, the coordination effort on your side since you are the only person who currently “knows” what’s going on with deletion/applying.
So your next idea is: “no problem, I’ll make a pipeline which runs the playbook on your behalf”. And the pipeline will maybe update the Git repo accordingly in some way after the run (since Ansible needs to know, during the run, what to delete).
Everybody can see the pipeline. The pipeline ensures the playbooks are applied sequentially, coordinating with your colleagues.
Next problem: how does the pipeline know which playbook and resources to run on?
You create a selection box for your hosts and for the playbook yaml to trigger the pipeline.
=> there is your state.
Your ticket information is transferred, via your manual labor, into the correct items in the selection box. To reconstruct what went on, you now have to look at the Ansible code, the pipeline parameters, and the pipeline logs.
More examples are: you only want to update some stuff with your ansible playbook. Therefore you might introduce tags on the resources and the playbook knows how to handle those tags. The extreme case might be:
tag:state:present, tag:state:absent.
Then you can run a single playbook which can call the deletion and installation playbook for you and everybody is happy that you have everything visible in Git.
Problem here: decommissioning becomes a two-step process in git. First a commit which sets state:absent, then a pipeline run, then another git commit to delete the code.
So what I am saying is:
You can do all of this with Ansible, for sure. But you will have state somewhere:
in a ticket, a pipeline log, in git, in a wiki.
I am not saying Ansible should not be used; it makes sense for configuring things. (I personally would wrap a Terraform hull around my Ansible code and call it from there, just to have Terraform handle the locking for my playbooks.)
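A sketch of that hull (playbook path hypothetical):

    # Terraform provides the state and locking; Ansible does the actual work
    resource "null_resource" "site_playbook" {
      triggers = {
        playbook = filesha256("${path.module}/site.yml")  # rerun on changes
      }

      provisioner "local-exec" {
        command = "ansible-playbook ${path.module}/site.yml"
      }
    }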
But just watch out for the hidden state in your workflows and better make it explicit.
This is why people love GitOps for traceability.
You can do all of this by hand and document your process in a wiki for your colleagues, so that they know what “tag:state:absent” means for them.
Or you can rely on somebody who has done this for you already and maintains documentation and what not.
Terraform is slowly losing relevance. You’re making a good point, but they haven’t been open to change for years and are content to become another Sun Microsystems. Better idea: advocate the ‘state only if needed’ approach to Pulumi.
Completely agree. Thanks for writing this up. Specifically because I've always thought this, but also don't value my opinion on the topic. I was first exposed to devops a few years ago, and as a result was learning docker, k8s, terraform, salt, &c. at the same time. (I hated it, and now I'm happy writing C++ again)
What I could never wrap my head around was why the heck the tools had to expose so much complexity. I want the description of my infrastructure to be a single, or set of text files, written preferably in JSON or YAML like everything else, and I want to run a command that behaves in the same way as GNU make.
That seems like madness to me: YAML? Really? Is this some variant of K8s Stockholm syndrome?
It's always seemed to me that one of the virtues of Terraform was that the language was just declarative, and rich enough to express what you needed without becoming a Turing tarpit.