Terraform best practices for reliability at any scale (substrate.tools)
322 points by holoway on Aug 4, 2023 | 152 comments



Here's my #1 tip, most important:

Try to keep your stateful resources / services in different "stacks" than your stateful things.

Absolutely 100% completely obvious, maybe too obvious? Because none of these guides ever mention it.

If you have state in a stack it becomes 10x more expensive and difficult to "replace your way out of trouble" (aka destroy and recreate as a last resort). You want as much as possible in stateless, disposable stacks. DON'T put that customer-exposed bucket or DB in the same state/stack as app-server resources.

I don't care about your folder structure, I care about what % of the infra I can reliably burn to the ground and replace using pipelines and no manual actions.


You mean keep stateless separate from stateful?

Everyone else seems to be reading over the typo or I'm more confused than I thought.


Yes


Is a "stack" here a (root) folder on which you'd do a "terraform apply"? I've never know what to call those, surely they aren't "modules".

And, so, you're saying: try to have a separate deployment (stack then?) that contains the state, so you can wipe away everything else if you want to, without having to manage the state?


It's not exactly about the folder; the IaC from a single folder / project can be instantiated in multiple places. Each time you do that, it has a unique state file, so I usually hear it referred to as a "state". In CFN you can similarly deploy the same thing lots of times and each instantiation is called a "stack", so stack/state tend to get used interchangeably.

And yes, that's a succinct rephrasing.

When you first use IaC it may seem logical to put your DB and app server in the same "thing" (stack or state file), but now that thing is "pet-like" and you have to take care of it forever. You can't safely have a "destroy" action in your pipeline as a last resort.

If you put the stateful stuff in a separate stack you can freely modify the things in the stateless one with much less worry.
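As a rough sketch of the split (directory names are hypothetical), something like:

    terraform/app1/stateful/    # RDS, S3 buckets, etc. - precious, own state file, rarely destroyed
    terraform/app1/stateless/   # app servers, load balancers, etc. - own state file, safe to destroy/recreate

Each directory gets its own state file, so a "terraform destroy" on the stateless stack can never touch the database.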


Can you elaborate on this? I've never heard of this IAC structure and I'm trying to figure out what the benefit/cons are. Maybe it's just Friday and I'm checked out already.

If you run a terraform apply and only update microservices, but you also have your DBs/stateful things in the same stack/app, you're still only updating the microservices, so how would this affect the DB/stateful resources at all?

On the opposite end - I feel like there would be scenarios where I needed to update the stateful AND stateless services with the same terraform apply. Maybe I'm adding a new cluster and adding a DB region/replica/security group, and that new cluster needs to point at the new DB region.

In your scenario I would have updated microservices trying to reach out to a db in a region that doesn't exist yet because I have to terraform apply two different stacks. How would you deal with a depends_on?

Maybe I'm misunderstanding this.


(Hi, I’m one of the authors of the article at the root of this thread.)

Considering your hypothetical stateless microservice change in the same root module as stateful services, problems arise when _someone else_ has merged changes that concern the stateful services, leaving you little room to apply your changes individually.

It’s also worth remembering that, even if a stateless service and a stateful service are managed in the same root module, applying changes is absolutely not atomic. Applying tightly coupled changes to two services “at the same time” is likely to result in brief service interruptions, even if everything returns to normal as soon as the whole changeset is applied.


Not who you replied to, but I was wondering if I could get your take on my use case.

I work on the systems test automation side of things. I use TF to deploy EC2 clusters, including the creation of VPC, subnet, SG, LBs, etc. Once done, I tear down the whole thing.

But from what I'm hearing from yourself and others in this thread, it sounds like I could (/should) be breaking those out separately, even though my use case isn't dev/test/prod(/prod2,/prod3) or even multi-regional.

To rephrase, it sounds like it might be useful for me to create some separation, e.g., tear down the EC2 instances in a given VPC/subnet while I ponder what to do next, while leaving the other AWS resources intact. Maybe even deploy another subnet to the same VPC, but with a different test intention. I know I can simply specify a different AMI, run tf apply, and get one kind of desired deployment change.

Bigger picture: when I need to run one test, I'll copy a dir from our GH repo, edit the tfvars, and kick things off. Another test (in parallel, even), I'll do the same but to a fresh dir. (Wash, rinse, repeat.) And I suspect you're already cringing :) but I get why. It makes me cringe. Ideally I'd be working with a single source of truth (local GH fork).

There's also the consideration of possibly making edits to the stack while I deploy/destroy, while at the same time wanting stability with another deployment for a day or two. I suppose that would require having 2 copies of the remote GH repo. Which is several copies fewer than I'm working with these days.

Fwiw, I've already got "rewrite the stack" on my todo list, but can't get to it for probably another 2 months. So I'm eagerly collecting any "strive to do X, avoid Y like the plague" tips and recommendations prior to diving into that effort.


OK, I think we're talking about two separate things here - you're referencing a root module and not a "stack", where a stack is a full service/application that uses multiple modules to deploy. Your DB module, EKS module, etc. All independent modules, not combined into one singular module. Say it's sitting in the /terraform/app1/services/db(&)app folders type of scenario.

I think you're talking about putting stateful and stateless objects inside of a single module. So you've got /terraform/modules/mybigapp/main.tf that has your microservice + database inside of it.

If I'm right and that's what you mean, that's really interesting; I don't think I've ever seen or done that, but now I'm curious. I'm pretty sure I've never created an "app1" module with all of its resources.

Am I totally off here?


I stuck with my typical term, root module, synonymous with how folks are using “stack” and “state” in various parts of this thread.

A module is any directory with Terraform code in it. A root module is one that has a configuration for how to store and lock Terraform state, providers, and possibly tfvars files. Both modules and root modules may reference other modules. You run `terraform init|plan|apply` in root modules.
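For illustration, a rough sketch of what such a root module might contain (all names here are hypothetical):

    # e.g. terraform/app1/staging/main.tf
    terraform {
      backend "s3" {
        bucket         = "acme-terraform-state"           # hypothetical bucket
        key            = "app1/staging/terraform.tfstate"
        region         = "us-east-1"
        dynamodb_table = "terraform-locks"                # state locking
      }
    }

    provider "aws" {
      region = "us-east-1"
    }

    # Child modules are instantiated from here.
    module "service" {
      source = "../../modules/service"   # hypothetical child module
    }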

I think my comment makes sense in that if you mix two services into the same root module (directly or via any amount of modules-instantiating-other-modules) you can end up with changes from two people affecting both services that you can’t easily sever.

Happy to clarify further if I’m still not addressing your original comment.


@rcowley -- I'm going to preface this with: I'm a Staff SRE at an adtech corp that does billions, and I have been a k8s and terraform contributor since 2015 (k8s 1.1; I forget the tf versions). I don't mean this to brag; I just want to set expectations about my experience, since I'm a random name on HN who you'd never know.

I think calling a service/stack (or whatever, app, etc.) a "root module" is a very, very confusing thing to do. Terraform has actual micro objects called modules. We work with them every day. I get how you could consider an entire chunk of terraform code that calls various modules a "root module".. but I think this is just going to lead to absolute confusion for anyone not familiar with your terminology. I don't know every TF conversation, but I can't think of a single time where I've heard root module in that context. Very good chance I've just missed those conversations and am ignorant of them.

I'm currently hiring SRE 2s and 3s, so I've been interviewing lots of terraform writers, and one of my tech questions is to ask someone what makes them decide to write a terraform module and what type of modules they've written - it's always ALBs, EKS, dbs, etc., components that independently go into creating a service/stack. I've definitely not heard anyone mention that they write "root modules" in the sense of an entire service/stack.

I don't mean you're right or wrong; maybe more people are aware of that verbiage than I am. I just wanted to mention that in my personal case I think it's confusing, so I would assume that there are a lot of people in my shoes who would also be confused by it.


Root module is the official terminology used in HashiCorp's own documentation. That's actually the term I'm most familiar with in my own experience.

https://developer.hashicorp.com/terraform/language/modules/s...

> Every Terraform configuration has at least one module, known as its root module, which consists of the resources defined in the .tf files in the main working directory.


My 2¢ as also a (very minor) terraform core and numerous providers & modules contributor, and user (also since 2015, I think): I've never heard of 'terraform stacks' before this thread, but 'root module' makes perfect sense to me:

1) without the context of where the state is/its contents, or estimating based on the resources/style/what's variable vs. hardcoded, a 'stack' (if you like) is indistinguishable from a module that's to be used in someone else's 'stack';

2) `path.root`


lol, that's interesting. I feel like all of the TACOS (Spacelift, env0, Atlantis) refer to stacks.

Thanks for your response! It's great to hear corroboration.


https://developer.hashicorp.com/terraform/language/modules#t... is what Terraform calls the module in the current directory, to distinguish it from child modules you might introduce.


OK, I may absolutely have the dumb today; I appreciate the response. The way this is worded, because of this line - "Modules are the main way to package and reuse resource configurations with Terraform." - it reads like: "I have 10 Golang apps, they all at a minimum use these same services, this is our 'golang root module'." But some services might have more or fewer modules, i.e. service A uses Redis, service B uses Kafka without Redis.

So in this verbiage, is every single "stack/app" a "root module" and if one of them has a different database/whatever module it's just using different child modules and the child modules are the big differentiator?

Just to kind of prove the root-module argument I'm making here, this post in here is confused on calling a "stack" a module as well https://news.ycombinator.com/item?id=37005949


Glad we cleared up our terminology! I agree that “root module” risks ambiguity, just like you point out.

I just realized I never responded to the very last point in your original comment. I don’t have, and I don’t think Terraform has, a complete solution to dependencies between root modules. Fortunately, data sources will at least fail if they don’t find what they’re looking for. For me, these failures never come up in production since I’m never using `terraform destroy` there. It does come up in pre-production testing and that’s an area that seems rich with patterns and products on top of Terraform that are controlling order and dependence between root modules.

PS thanks for your work on Terraform and Kubernetes.


Use 'terraform destroy' during CI phase. That is your pre-prod.


Root-module: contains resources, sub-modules, incl. remote module calls.

Stack: deployable with hardcoded tfstate and tfvars configs.


> it's always ALBs, EKS, dbs, etc. components indepedently that go into creating a service/stack

More importantly, modules embody the DRY principle. We host them in our private Terraform registry and share them between teams.


You’re not wrong. You can achieve the same results using modules well.

It’s common for an entire environment for a whole biz unit or even company to be a single “stack”.

This pretty much only works if the "terraform apply" is centrally orchestrated, i.e. GitOps is the only way to trigger the Terraform run.


Agree, it's a complicated conversation because the tools support many different ways of working - and most of them can be made to work with different tradeoffs.

A team that, e.g., owns its own microservice IaC can absolutely maintain a setup long term with app & DB in the same state; it just requires care and love (it's easier in some ways, it just doesn't scale along certain dimensions).

Maybe you have other controls / factors that can make it work for you.

But my experience is that as you split things across more teams or have more complicated IaC "supply chains" (e.g. teams supplying modules to each other, or lots of people working on different bits of the same thing) you need to look at ways to make things more foolproof, easier to support, and give yourself as much "wiggle room" as possible for upgrades. At this point having state split out is very helpful (almost essential).

Because the terminology seems to be tripping up the conversation, I'd be inclined to phrase it like this: a single "terraform apply" should touch either a precious, stateful stack or a disposable, stateless stack, and these should be clearly delineated. Ideally the stateful stacks should be as small as possible; as much stuff as possible should be in the stateless stacks.


The way I see it is we only want to use TF for setting up "base infrastructure". This is things like our VPC networks, cloud routers, firewall rules, and finally our Kubernetes clusters.

We still allow devs to use TF, but it's only for using cloud services they depend on, like say a SQL DB or something.

I think you and I are 100% on the same page, but I think the word state (in the way you are using it) is causing a bit of confusion, at least for me. A terraform stack will always have state; that's the point. In addition to the current "tfstate" there is a set of parameters that must be used with whatever modules in order to arrive at the tfstate. That's the state that causes problems, not the tfstate so much, at least in my experience, as these often are _not_ version controlled.

This is why it's critical to only allow terraform to be applied using automation. I mean don't make it a company policy, make it IAM policy so it's literally not possible.


That's right, stacks can be instantiated across repos even depending upon the organization (both meanings of 'organization' are valid here).


what about prevent_destroy lifecycle?
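(For reference, a minimal sketch of that lifecycle block, with a hypothetical resource:

    resource "aws_db_instance" "main" {
      # ...
      lifecycle {
        prevent_destroy = true
      }
    }

It makes Terraform error out on any plan that would delete that resource, which helps, but it doesn't give you the separate blast radius that splitting state does.)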


I have the same issue. They're all modules, but the ones at the tip of the directory tree (right at the end of your env/region/stack) are called root modules. Which makes no sense because the term "root" always implies that they are at the beginning, not the tippy-toe end. So I call mine "stacks". But as another answer suggested, "states" is also fine. Even though the actual state isn't inside that directory, it's probably in an object store.

At the end of the day I don't care what other people call them.


I have adopted the term "Root Module" vs "Submodule" because those line up with terraform's own definitions, but I agree that they're terribly, terribly named.


The "stack" nomenclature used here is jarring since it is unrepresented in Terraform HCL literature.

A CDK stack, (assuming that's what is used here), would be loosely equivalent to a Terraform HCL module.


Makes sense, but how do you connect the two so e.g. credentials from one are surfaced in the other?


Use Data Sources to reference resources in a different state: https://developer.hashicorp.com/terraform/language/data-sour...
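For example, a minimal sketch of looking up a resource that another stack created (the tag value is hypothetical):

    data "aws_vpc" "shared" {
      tags = {
        Name = "shared-network"
      }
    }

    resource "aws_security_group" "app" {
      name   = "app"
      vpc_id = data.aws_vpc.shared.id
    }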


Ah - I didn't think about that. Wrapping a remote call with a data source. Thanks.


IMO you shouldn't be storing credentials in shared state, as suggested by the other comments, since that means that the principals able to read the state to deploy their service can also read the credentials for other services bundled in that state file. This could be the case if one had broken down the root modules into scopes/services like the linked page suggests.

It is reasonable to assume that if you are using Terraform to manage your infra, your infra likely has access to a secrets manager from your infra vendor, e.g., AWS. Instead I'd recommend using a Terraform data source to pull a credential from the secrets manager by name -- and the name doesn't even necessarily have to be communicated through Terraform state. Then the credential can be fed directly into where it is needed, e.g., a resource like a Kubernetes Secret. One can even skip this whole thing if the service can use the secrets manager API itself. Finally, access to the credentials themselves would be locked down with IAM/RBAC.
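A minimal sketch of that pattern (secret name and namespace are hypothetical):

    data "aws_secretsmanager_secret_version" "db" {
      secret_id = "app1/db-password"
    }

    resource "kubernetes_secret" "db" {
      metadata {
        name      = "db-credentials"
        namespace = "app1"
      }
      data = {
        password = data.aws_secretsmanager_secret_version.db.secret_string
      }
    }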


terraform_remote_state

The root module can have outputs just like any other module. These outputs can be accessed from other stacks from the backend.

And if you use CDKTF the references are handled transparently.
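A minimal sketch of that (backend details are hypothetical):

    data "terraform_remote_state" "network" {
      backend = "s3"
      config = {
        bucket = "acme-terraform-state"
        key    = "network/terraform.tfstate"
        region = "us-east-1"
      }
    }

    # then reference an output exported by the other root module:
    # data.terraform_remote_state.network.outputs.vpc_id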


Thanks - another good option! I seem to get slight shivers down my spine when I read terraform_remote_state, but I never actually used it, so I might be half-remembering an old opinionated blog post.


I've never used terraform, but I have used CloudFormation and AWS CDK. It's been a while though, is there a clear indication on the major cloud provider docs which resources are stateful? Or is it always obvious?


Difficult question, as people mean different things when they say state. One example might be a relatively simple AWS Lambda. Most people would say that's easily stateless.

But, what if that Lambda depends on a VPC with some specific networking config to allow it to connect to some partner company private network? And, it's difficult to recreate that VPC without service disruption for a variety of reasons that are out of your control. Well, now you have state because you need to track which existing VPC the Lambda needs if you tear the Lambda down and recreate it.


Usually people use "state" to mean that the resource accumulates data. E.g., a VPC is stateless, but the s3 bucket containing flow logs for the VPC is not.

Tearing down and recreating a VPC can be disruptive, but that doesn't make it a stateful resource.


In my example, the "state" part is someone storing the specific VPC reference to re-apply to a newly re-created lambda. Rather than recreating both the VPC and the Lambda. Many people would refer to that as a bad practice because it's carrying state. e.g., the kind of stuff that terraform puts in .tfstate files.

There's definitely a purist camp that advocates completely stateless infra. Though there's also what you're describing, where some people consider state to only refer to things like databases.


Just stopped here because I had said XYZZY way too often in the last three hours xD


I’ve been using Terragrunt [0] for the past three years to manage loosely coupled stacks of Terraform configurations. It allows you to compose separate configurations almost as easily as you compose modules within a configuration. It's got its own learning curve, but it's a solid tool to have in the toolbox.

Gruntwork is a really cool company that makes other tools in this space like Terratest [1]. Every module I write comes with Terratest-powered integration tests. Nothing more satisfying than pushing a change, watching the pipeline run the tests, and then seeing it automatically release a new version that I know works (or at least that what I tested works).

[0] https://terragrunt.gruntwork.io/

[1] https://terratest.gruntwork.io/


They seem very insistent on keeping things DRY but not explaining why. Does Terraform tend to cause water leaks?


Terraform is supposed to let you write modular, reusable code. But because it's a limited DSL that lacks many "proper language" features (and occasionally breaks the rule of least surprise), there are several major impediments to fully data-driven Terraform. These ultimately result in copy/paste code, or tools like Terragrunt which essentially wrap Terraform and perform the copy/pasta behind your back by generating that code for you.

Some minor examples:

- calling a module multiple times using `for_each` to iterate over data works, except if the module contains a "provider" block (see the sketch below)

- if you are deploying two sets of resources by iterating over data, terraform can detect dependency cycles where there are not any
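A minimal sketch of the first point (the module path is hypothetical); this only works because the child module contains no provider blocks:

    module "bucket" {
      for_each = toset(["logs", "artifacts"])
      source   = "./modules/bucket"
      name     = each.value
    }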


DRY = Don’t Repeat Yourself.


While combining the word "best" with "Terraform" in a sentence is more than likely to result in an oxymoron, it is counter-productive not to attempt to organize and utilize Terraform as elegantly and DRY as possible. We interact with stacks (which we typically call projects) via Terragrunt and have a very large surface of modules, as we do have a fair amount of infrastructure pieces. We also try to expose Terraform infrastructure changes through Atlantis; though bulky, GitHub does provide a reasonable means to discuss and manage changes made by multiple teams. The use of modules also helps us encapsulate infrastructure, and state problems are rare with these approaches, but the data sprawl inherent to Terraform is very unwieldy regardless of so-called "best" practices. The language features are weak and awkward and directly encourage repetition and specification bloat. We have had some success using Data Sources to push logic outside of Terraform and provide much-needed sanity when interacting with very verbose infrastructure such as Lake Formation.


We're using Terragrunt with hundreds of AWS accounts and thousands of Terraform deployments/states.

I'll never want to do this without Terragrunt again. The suggested method of referencing remote states, and writing out the backends will fall apart instantly at that scale. It's just way too brittle and unwieldy.

Terragrunt with some good defaults that get included everywhere, separated states for modules (which makes partial applies a breeze), and autogenerated backend configs (let Terragrunt inject them for you, with templated values) is the way to go.


We use a setup where we have multiple repos with Terraform configuration and thus multiple Terraform states. We then use Terraform remote state to link everything together. I am talking about 10-20 repos and states. Orthogonal to that, we use multiple workspaces to describe the infra in different environments.

The problems I have personally experienced with this approach are:

- if you update one of the root Terraform states, you need to execute a Terraform apply for every repo that depends on that Terraform state; developers do not do that because either they forget or they do know but are too lazy and subsequently are surprised that things are broken

- if you use workspaces for maintaining the infra in different environments, and certain components are only needed in specific environments, then the Terraform code becomes pretty ugly (using count which makes a single thing suddenly a list of things, which you then have to account for in the outputs which becomes very verbose)
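For example, the count pattern described in the second point looks roughly like this (the resource choice is hypothetical):

    # Only create the topic in prod; it now becomes a list.
    resource "aws_sns_topic" "alerts" {
      count = terraform.workspace == "prod" ? 1 : 0
      name  = "alerts"
    }

    # Outputs then have to account for the possibly-empty list.
    output "alerts_topic_arn" {
      value = length(aws_sns_topic.alerts) > 0 ? aws_sns_topic.alerts[0].arn : null
    }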

Is Terragrunt something that would help us? I do not know Terragrunt, and a quick look at the website did not make that clear for me.


Have you spent any time with Pulumi?

I've kind of found that Terraform is dying and encourages a lot of bad practices, but everyone goes along with them because of HCL, and because it's transferable since most companies are just using TF.


> I've kind of found terraform is dying

I don't think it's dying. The hype has worn off. Everybody uses it. It's very mature. There's a module for everything.

It's just not new and sexy anymore IMO.


Terraform has evolved from exciting hype to stable utility, in my opinion.


Guess this can be revisited now with their licensing


I did have a slight chuckle at the news just a few days after I made this statement.


Do you need to chain multiple Terragrunt executions to first bring the Kubernetes cluster up and then the containers, or does Terragrunt fix that?


Yes. With Terragrunt you can do a `terragrunt run-all apply`, and data can be passed from one state/module to the next by wiring an `output` into a `variable` in each module. Terragrunt knows how to run them in the right order, so you can bootstrap your EKS cluster by having one module which bootstraps the account, then another which bootstraps EKS, then one that configures the cluster and installs your "base pods", and then later everything else.
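A minimal sketch of wiring one unit's output into the next (paths and output names are hypothetical):

    # terragrunt.hcl in the cluster-config unit
    dependency "eks" {
      config_path = "../eks"

      # lets `plan` work before ../eks has ever been applied
      mock_outputs = {
        cluster_name = "placeholder"
      }
    }

    inputs = {
      cluster_name = dependency.eks.outputs.cluster_name
    }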


Genuine question for DevOps people:

Other than the fact it seems to be an industry standard so it's good for your job prospects, what are the benefits to Terraform over CloudFormation/CDK or whatever the equivalent is for your particular cloud provider?

Most companies/people pick a provider and then stick with it, and it doesn't seem like there's much portability between configurations if you do decide to switch providers later down the line, so I'm not sure what the benefits are. I haven't delved into Terraform yet, but I tried doing a project in Pulumi once and felt by the end of it that I might as well have just written it in AWS CDK directly.


After you have waited 20 minutes for CloudFormation to fail and tell you that it can't delete a resource (but won't tell you why), and this is the third time it has happened in a week, you start looking at alternatives.


This 1000%. Also recently discovered Google Deployment Manager is shit for the exact same reasons. I honestly don't get it.


> your particular cloud provider

Just this week, I wrote a Terraform module that uses the GCP, Kubernetes, and Cloudflare providers to allow us to bring up a single business-need (that will be needed, hopefully, many times in the future) that spans those three layers of the stack. 200 lines of Terraform written in an afternoon replaced a janky 2,000 line over-engineered Python microservice (including much retry and failure-handling logic that Terraform gives you for free) whose original author (not DevOps) moved on to better pastures.

CDK is fine if you're all-in on AWS. It has its tradeoffs compared to Terraform. Pick the right tool for the job.


I worked closely with the folks that wrote our platform's IaC, first in CDK, then in Terraform. I wrote a bit of CDK and zero TF myself, but here are some of the reasons we switched:

A big plus is that Terraform works outside of AWS land.

CDK is a nightmare to work with. You're writing with programming-language syntax, which tempts you to write dynamic stuff - but everything still compiles down to declarative CFN, which just makes the ergonomics feel limited. The L2 and L3 constructs have a lot of implicit defaults that came back to bite us later.

With CDK you get synth and deploy, which felt like a black box. Minor changes would do the same 8 minute long deploy process as large infrastructure refactors. Switching to TF significantly sped up our builds for minor commits. There might be a better way to do this with CDK (maybe deploying separate apps for each part of our infrastructure) and we may have just missed it.


Terraform, and by extension HCL, is more powerful and flexible. It can be used across clouds. It has providers for all kinds of things, like kubernetes. It can be abstracted and modularized. It supports cool features like workspaces and junk, depending on how you want to use it.

Also recently I was forced to use Google Cloud Deployment Manager scripts for some legacy project we were migrating to Terraform, and I was shocked at how buggy and useless it was. Failed to create resources for no discernible reason, couldn't update existing resources with dependencies, couldn't delete resources, was just unfathomably shit all around. Finished the Terraform migration earlier this morning and everything went off without a hitch, plus we got more coverage for stuff Deployment Manager doesn't support. It's also organized much nicer now, with versioned modules and what-have-you.

Cloudformation is ugly and again, surprisingly isn't well supported by AWS. I don't understand how it's possible, but terraform providers seem to be more up to date with products and APIs. Maybe that's just me but I've seen others complain about the same thing.


Isn’t google cloud deployment just bash calls to the google cloud cli disguised as declarations by way of yaml?


Considering the fact that you can inject python code blocks, I kind of doubt it. It also makes API calls that populates a dashboard with an inventory of resources created, so it seems to be more of an api wrapper like other IaC solutions.


- In the event that you are working with different cloud providers, Terraform is one thing to learn that then applies to all of them, as opposed to learning each provider's bespoke infra-as-code offering. Most companies stick to one PaaS/IaaS, but individual personnel ain't necessarily as limited over the courses of their careers.

- Not all cloud providers have an infra-as-code offering of their own in the first place (especially true with traditional server hosts), whereas pretty much every provider with some sort of API most likely has a Terraform provider implemented for it.

- Terraform providers include more than just PaaS/IaaS providers / server hosts; for example, my current job includes provisioning Datadog metrics and PagerDuty alerts alongside applications' AWS infra in the same per-app Terraform codebase, and a previous job entailed configuring Keycloak instances/realms/roles/etc. via Terraform.


Also pretty neat that there's a Terraform provider for Kubernetes native resources.


Personally it's the only way to work with Kubernetes. Terraform's change control is excellent. Cannot go back to helm upgrade and hoping it does what you want.


I've got a lot of opinions here, but the only one I'll share is that HCL knocks the socks off of JSON and YAML. JSON is too rigid. YAML is too nested. HCL gets this just right.

Venturing away from opinions, the provider ecosystem with terraform enables some wonderful design options. For example, I have a module template that takes some basic container configs (e.g. ports, healthchecks) and a GitHub URL, then stands the service up on ECS and configures CI in the linked repo. CF can't do that.


I'm 10 years into working with AWS. I strongly prefer CloudFormation; just separate things smartly between stacks. It has export/import for stack outputs too. Just look at the "root module" mess in this discussion and you'll get why.


For me personally, I chose terraform because it can work with AWS and a heap of other 3rd party services and software (Cloudflare, PostgreSQL, Keycloak, Kubernetes/Helm, Github, Azure)


I have used both terraform and cloudformation substantially and they each have pros and cons. One thing terraform has over cloudformation is its rapid support for new services and features. AWS has done an awful job ensuring that cloudformation support is part of each team's definition of "done" for each release. It just doesn't get the support it really needs from AWS.


CloudFormation is the ugly stepchild of AWS. It has bugs that have languished for years.


Or weird missing functionality like some resources being impossible to tag from the stacks, or new services/options lagging months behind when TF implemented support for them.


Companies choose providers and tend to stick with them, but people don't always stick with companies. If I know TF there is a decent chance my skills will be applicable when I change companies.

Also some big corps run their own internal datacenters and have cloud-like interfaces with them. You can write TF providers for that (it's not going to be as nice as the public cloud ones, but still nice). Then you can utilize Terraform's multi-provider functionality to have one project manage deployments on multiple clouds, including on-prem.

Terraform's multi-provider functionality is also useful for non-AWS/Azure/GCP providers such as Cloudflare. As far as I know CDK does not support that.


> Other than the fact it seems to be an industry standard so it's good for your job prospects, what are the benefits to Terraform over CloudFormation/CDK or whatever the equivalent is for your particular cloud provider?

For me the killer feature is that both plan and apply show the actual diff of changes vs running infrastructure. It makes understanding effects of changes much easier.


Agreed, Terraform does a good job of this. But CloudFormation & CDK can also do this via Change Sets and CDK diff.

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGui...

https://blog.mikaeels.com/what-does-the-aws-cdk-diff-command...


Providers I regularly use, even mixed in a single project. There are others I could use if they were available.

AWS, GitHub, Opsgenie, Okta, Scalr, TLS, DNS.


You forgot the greatest escape hatch of all: null


Third-party integrations and the universality/reusability across multiple products and familiarity of HCL are big for me.


> Most companies/people pick a provider and then stick with it and it doesn't seem like there's much portability between configurations if you do decide to switch providers later down the line so I'm not sure what the benefits are.

This smells like kubernetes

> Terraform over CloudFormation/CDK

They both work. It's more about which providers you need.


We don't just have AWS resources. Our CI pipelines are managed by terraform [0], they communicate with GitHub [1]. I like that it's declarative and limited, it stops people trying to do "clever shit" with our infra, which is complicated enough as it is.

[0] https://buildkite.com/blog/manage-your-ci-cd-resources-as-co...

[1] https://registry.terraform.io/providers/integrations/github/...


It's subtle and so difficult to see the differences at smaller scales. If you're going to provision a handful of EC2 instances, all the tools work fine.

I think HCL is an under appreciated aspect of Terraform. It was kinda awful for a while, but it's gotten a lot better and much easier to work with. It hits a sweet spot between data languages like JSON and YAML and fully-general programming languages like Python.

Take CloudFormation. The "native" language is JSON, and they've added YAML support for better ergonomics. But JSON is just not expressive enough. You end up with "pseudoparameters" and "function calls" layered on top. Attribute names doubling as type declarations, deeply nested layers of structure, and incredible amounts of repetitious complexity just to be able to express all the details needed to handle even moderate amounts of infrastructure.

So, OK, AWS recognizes this and they provide CDK so you can wring out all the repetition using a real programming language - pick your favourite one, a bunch are supported. That helps some, but now you've got the worst of both worlds. It's not "just JSON" anymore. You need a full programming environment. The CDK, let's say the Python version, has to run on the right interpreter. It has a lot of package dependencies, and you'll probably want to run it in a virtualenv, or maybe a container. And it's got the full power of Python, so you might have sources of non-determinism that give you subtle errors and bugs. Maybe it's daylight saving gotchas or hidden dependencies on data that it pulls in from the net. This can sound paranoid, but these things do start to bite if you have enough scale and enough time.

And then, all that Python code is just a front end to the JSON, so you get some insulation from it, but sometimes you're going to have to reason about the JSON it's producing.

HCL, despite its warts, avoids the problems with these extremes. It's enough of a programming language that you can just use named, typed variables to deal with configuration, instead of all the { "Fn::GetAtt" : ["ObjectName", "AttName"] } nonsense that CloudFormation will put you through. And the ability to create modules that can call each other is sooo important for wringing out all the repetition that these configurations seem to generate.
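For instance, roughly the same attribute reference in each (just a sketch; the resource names are hypothetical):

    CloudFormation JSON:  { "Fn::GetAtt": ["MyBucket", "Arn"] }
    Terraform HCL:        aws_s3_bucket.my_bucket.arn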

On the other hand, it's not fully general, so you don't have to deal with things like loops, recursion, and so on. This lack of power in the language enables more power in the tools. Things like the plan/apply distinction, automatically tracking dependencies between resources, targeting specific resources, move blocks etc. would be difficult or impossible with a language as powerful as Python.

HCL isn't the only language in this space - see CUE and Dhall, for example - but it's undoubtedly the most widely used. And it makes a real difference in practice.


This was a good read but really if you already follow the common best practices of IAC/terraform/aws multi-account I don't think you're going to learn much.

The comments in here kind of made me think I was going to hop in and take away some huge wins I hadn't considered. But I have been working with Terraform and AWS for a very long time.

If you're unfamiliar with AWS multi-account best practices this is a good read.

https://aws.amazon.com/organizations/getting-started/best-pr...


I remember periodically coming across services/platforms that purport to make setting up secure AWS accounts / infra configuration easier and default secure — anyone know what I may be thinking of?


Actually the article here is one of those options - https://substrate.tools/

I don't know how integrating this into an environment where you already have tons of AWS accounts would go but it's interesting. Thankfully I only have to make new accounts when we greenfield a service and that's maybe a yearly thing.


Everybody in here is recommending Terragrunt, but I'm not sure what value it provides over regular Terraform.

After using it for a few months, all of the features I found in Terragrunt are in Terraform.


This is my impression as well. As far as I've understood, terragrunt was made back when terraform was missing a lot of key features (I think it maybe didn't even have modules yet) but when I was asked to evaluate it recently for a client I couldn't find a single reason to justify adding another tool.


The primary thing terragrunt was designed to do was let you dynamically render providers.

Terraform still does not let you do this.

It becomes very problematic when using providers that are region specific, amongst other scenarios.

That being said I don’t like the extra complexity terragrunt adds and instead choose to adopt a hierarchical structure that solves most of the problems being able to dynamically render providers would solve.

Each module is stored in its own git repo.

Top layer or root module contains one tf file that is ONLY imports with no parameters.

The modules being imported are called “tenant modules”. A tenant module contains instantiations of providers and modules with parameters.

The modules imported by the tenant modules are the ones that actually stand up the infrastructure.

Variables are used, but no external parameters files are used at any level (except for testing).

All of the modules are versioned with git tagged releases so the correct version can easily be imported.

Couple this with a single remote state provider in the root module and throw it in a CI/CD pipeline and you have a gitops driven infrastructure as code pipeline.


What do you mean by 'dynamically render providers'?

I assume you're aware you can instantiate multiple versions (different params) of a provider and pass them to child modules, e.g. you can instantiate a module once for each of a several AWS regions/accounts?

Do you mean that something like the region/account param would be set on the basis of a computed value from some other resource (because we created the account, say, or listed all regions satisfying some filter with a data source)?


Let's say you need to configure a provider for 3 different regions. The natural way you want to do this is store the regions in a variable, and iterate over the variable using "for_each".

What you are referring to are called "provider aliases" and they work well for _avoiding_ using terragrunt (not very DRY though).
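A minimal sketch of the alias workaround (the module path is hypothetical):

    provider "aws" {
      alias  = "use1"
      region = "us-east-1"
    }

    provider "aws" {
      alias  = "euw1"
      region = "eu-west-1"
    }

    module "vpc_use1" {
      source    = "./modules/vpc"
      providers = { aws = aws.use1 }
    }

    module "vpc_euw1" {
      source    = "./modules/vpc"
      providers = { aws = aws.euw1 }
    }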


providers do not support `for_each` or `count`


We migrated from Terragrunt to Terraform as we thought the same thing. I'm of half a mind to go back.

Managing multiple environments is much easier in TG. State management in TF is kneecapped by the lack of variable support in backend blocks. I can only assume it's to encourage people into using terraform cloud for state management.


Terragrunt shines in cases where you have independent sets of Terraform state, especially if they are dependencies/dependents of one another.

For example, say you're using Terraform to manage AWS resources, and you've provisioned an Active Directory forest that you in turn want to manage with Terraform via the AD provider. Terraform providers can't dynamically pull things like needed credentials from existing state, so you end up needing two separate Terraform states: one for AWS (which outputs the management credentials for the AD servers you've provisioned) and one for AD (which accepts those credentials as config during 'terraform init').

Terragrunt can do this in an automated way within a single codebase, redefining providers and handling dependency/dependent relationships. I don't know of a way to do it in pure Terraform that doesn't entail manual intervention.


Ideally you decouple this and store the creds in a key vault or whatever; this way you have to explicitly grant the service principal access to the KV secret. Decoupling usually fixes other issues as well, such as expiring creds from service A to service B, which then gets coded into Terraform to refresh.


Some of this seems like old advice; instead of having directories per environment, you should be using workspaces to keep your environments consistent so you don't forget to add your new service to prod.


(Hi, I’m one of the authors of the article at the root of this thread.)

I’ve gone back and forth on workspaces versus more root modules. On balance, I like having more root modules because I can orient myself just by my working directory instead of both my working directory and workspace. Plus, I feel better about stuffing more dimensions of separation into a directory tree than into workspace names. YMMV.


Do you always store modules in the same repo as the terraform itself?

Why not put them in separate repos that can be tagged and versioned, and then referenced like below?

source = "git::https://bitbucket.org/foocompany/module_name.git?ref=v1.2"


Not who you asked, but there are different kinds of modules. I like versioning for reusable ones, but when it comes to root modules they tend to glue together a couple of those, so I just keep them in the same repo, with some terragrunt to mate them with configuration of combination of environment, region etc.


What do you think about multiple backends? It seems to be working well for me to have a single root module but with a separate backend configuration per environment.


Multiple backends are unwieldy if you're using terraform at the command line, but they beat workspaces handily for discoverability. They're a fine option if you're applying through CI though, as the drudgery of utilizing them is handled effortlessly by the robots.
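A minimal sketch of the partial-configuration pattern the robots can drive (file and bucket names are hypothetical):

    # staging.s3.tfbackend
    bucket = "acme-terraform-state"
    key    = "myapp/staging/terraform.tfstate"
    region = "us-east-1"

    # with an empty backend "s3" {} block in the root module, then:
    terraform init -backend-config=staging.s3.tfbackend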


That does work well for environments because typically you’d run exactly the same code, maybe with different cluster sizes or instance types, in each environment. But it doesn’t work well for isolating two services where the code is significantly or even entirely different.


Sorry if this comes across weird or snotty it’s not supposed to.

But I’m coming at this from a GCP lens and got half way through the article about how the recommended unit of isolation in the AWS environment is entirely different AWS accounts and I’m kind of hung up on that. Is that really a thing people tend to do often? Doesn’t it get super unwieldy? How does billing work? What about identity? I have so many questions.

EDIT: Despite the fact that the root resource in both GCP and AWS is an organization, when I heard “account” I mistook that to be AWS terminology for an organization.


The way this works with AWS is similar to you making a GCP project.

At the top level you have an organization account, which is where billing occurs.

From this org account you create accounts for the following (typically):

1. Security - AKA the account your USERS are in
2. Ops - The account your monitoring, etc. are in

From here where a lot of people seem to deviate (I've been interviewing level 2-3 SREs for the last 3 weeks and have heard all about different AWS structures that I don't like) is how to break up your applications into their own accounts for a low blast radius.

What I DO, and is well known as being the best practice, is to create an AWS account for each environment of each application.

App1-sandbox
App1-staging
App1-production

Then your terraform is also structured by application/environment/service. Each environment and application has its own state in S3 and DynamoDB.

And so on.

Is this unwieldy? I have 40-50 AWS accounts and no, it's not unwieldy at all IMO. Cross-account IAM and trust relationships are set up very early on and they don't need to be modified much, if at all, until you create another AWS account. Creating a new AWS account is kind of annoying, though. I need to automate that process better.

https://aws.amazon.com/organizations/getting-started/best-pr...


Cool, that was a genuinely fascinating window into AWS for me. Thank you for sharing


FWIW I loathe AWS IAM and miss GCPs organization.


FWIW I loathe GCP IAM and miss AWS IAM, CloudFormation, and not having to talk to any one single person or piece of software about "please enable this foundational API in your Project"


Yeah, I started with AWS and then spent a year on GCP and next time I'd much rather do GCP. It felt much more manageable and supportive to me.


A rare opinion but one I share wholeheartedly. I started my career at Google Cloud but spent the rest of it working with AWS. AWS always feels like an uphill struggle, lots of micro management and resources that need to be duct-taped together. I'm lucky to have recently landed a Google Cloud gig and my God, things are so much easier and smoother now. It just seems better designed and integrated to me, albeit much fewer services to choose from if you don't buy into their ecosystem.


I’m quite into learning a lot of cloud native security stuff and I have to say my first impression was that it seemed so much harder to think about creating a secure environment using AWS IAM. I couldn’t tell if it was just a case of familiarity or not.


I'm sure it's because of its age and them kind of creating their version of IAM from scratch (someone correct me if they copied this structure from elsewhere), but you have to do a lot of goofy, obtuse work with IAM automation. There are times I have to go into the console/CLI and grab some sort of specific UID for an object instead of using its name; things like that just make it annoying. Sometimes you can't use an account name and have to use the org ID... I could go on. You just kind of deal with it.

I haven't worked on GCP since maybe 2016-17 so I'm not sure how it's going over there anymore.


It really does sound like an entirely different level of complexity.

GCP native API is basically the same thing as knative in most ways. Just a bunch of various services and resources that you all call and authenticate and even often provision the same way.

As an example, since we are talking about infrastructure management: at its "smoothest" level of integration there is a service you can use (or host yourself on Kubernetes if that's your thing for some reason) where, like any other Kubernetes resource, I just "declare" what I want.

So now I’m not messing around with complicated Terraform logic at all (Google got really good with automation, I don’t think there is anything close to an equivalent for this is there?). I just declare say a BigQuery resource or a Project (AWS Account equivalent) resource and the service will do all the hard work of making sure that’s the state my account is in at any given point.

I can also stick policy controls around it like I would with K8s so only certain people can create certain resources under certain conditions.

It’s really easy to just stick that into a git repo and still do all of the IAC stuff mentioned in this article but it’s also easy to do the cross environment stuff and manage the roll out between each of them.

Overall, it’s very predictable, the IAM is really intuitive but also incredibly granular so it’s very easy to model things on top of and to feel fairly confident that I’m not accidentally doing something stupid so I really like it from that point of view.

My number one bit of advice for GCP is see how easily you can architect your way into using Cloud Run as much as possible unless you have some really wild use case. You can get to a really sophisticated set up with only a tiny team. Followed by read Google’s API guidelines (aip.dev) to understand how to build things in a way where you’re going to continuing having a good time.


How do you deploy your Apps? We exclusively use EKS and having one account per env and app seems like quite an overhead when I think about managing / updating EKS clusters for each one. It also comes with an overhead of base applications that need to run in each cluster by default (like cert-manager, externaldns etc).

Right now we’re using one account per env but also see downsides and thought of going the next step to do one account per env and tribe/team.


Each app/env has a pipeline that will trigger a tf apply in its directory with its assumed AWS role and deploy an env after someone gives it a manual approval after looking at the terraform apply/plan output. So it will start at /terraform/app1/staging, then once healthchecks succeed another manual approval job for /terraform/app1/production will wait to be approved to deploy.

For our EKS apps we do helm rollouts, but most of our services are on ECS so it's mostly just updating a task definition and forcing a redeployment of containers.

Each EKS cluster is set up exactly the same aside from the usual things like vpc and ips and things of that nature that switch between them. They all get a set of "base" apps like log chutes and cert manager and all that as soon as they're deployed.

Our app environments don't communicate with one another at all. The only relationship between them is our IAM accounts in our security account can assume access into them as admin/etc.


GCP and AWS both support sharing a single VPC across multiple GCP projects / AWS accounts. See e.g.: https://aws.amazon.com/blogs/networking-and-content-delivery...

It's not for the faint of heart though. You need to allocate subnets to individual applications (with relevant capacity planning concerns) plus support is sometimes spotty (e.g. EKS doesn't support it last time I checked). Not worth doing until you have several teams trying to use the same VPC and stepping on each others' toes.


> Creating a new AWS account is kind of annoying, though. I need to automate that process better.

You can do that with terraform...


You can, but using Terraform to provision resources inside those accounts entails pulling generated/defined credentials from the org-level TF state and feeding that into the provider config for each app-env-level TF state. Vanilla Terraform doesn't support that very well (or at all, last I checked), but either some CI/CD pipeline creativity or Terragrunt (or both!) can work around it.


terraform_remote_state


Another option is to use AWS Control Tower: https://docs.aws.amazon.com/controltower/latest/userguide/af...


Can providers use the output from terraform_remote_state to set e.g. credentials? Last I checked, datasources get sourced during terraform plan/apply whereas provider configs need to be known as early as terraform init.


> But I’m coming at this from a GCP lens and got half way through the article about how the recommended unit of isolation in the AWS environment is entirely different AWS accounts and I’m kind of hung up on that. Is that really a thing people tend to do often? Doesn’t it get super unwieldy?

There are AWS systems above the account level for managing it (Organizations), so it's not quite as bad as it might naively seem, but, yes, it's more unwieldy than GCP's projects.


Oh thank god, that’s much better than I naively thought. Thanks for the heads up.


You can have sub-accounts that roll up billing to a main account. Still messy, but probably cleaner (security policy wise) and possibly safer (fewer production impacting accidental config changes?) than having a single giant account with _lots_ of things mixed together.


As the resident pedant, one cannot have "sub-accounts" in AWS. One can 100% have Accounts that are a member of an AWS Organization, which itself has a Management Account that does as you describe as the "main account", but there is no parent-child relationship between Accounts, only OUs and Accounts or OUs and other OUs (which the Organization, itself, counts as an OU)


Thanks for the corrections. :)


Some AWS services have per-account (not per-resource) size and rate limits. Keeping resources in separate AWS accounts gets them out of each others’ blast radii.

AWS IAM doesn’t do capability inheritance; if I can write a policy at all I could grant any privilege to any resource in that policy, even privileges I don’t personally have. It’s easier for each groups of admins to have a separate AWS account than to put everyone in security boundaries that try to wall them off.


This doesn’t have anything to do with AWS, this has to do with terraform not allowing you to dynamically instantiate providers.

The same thing happens with the Kubernetes provider when you try to use it with multiple GKE clusters.


You should be able to organize accounts hierarchically using AWS Organizations, which allows to have cost centers and centralized billing (and some imposed policies over all accounts).


It's extremely common and recommended.

Billing works by having a billing aws account that all other accounts are in a sense "children" of.


This should be mandatory reading for anyone doing IaC, using TF and AWS or not, less for how you do it, more for what and why.

// shout out to AWS CAB alums


The article recommends to split up your state files for various advantages, but also expands into how to manage it later in a custom way.

I agree with the splitting, but based on many home-grown automation systems I've seen around this, I'd really recommend you use one of the specialized CI/CD systems that are built around automating these kinds of workflows. Once you reach the "many state files" phase, you'll save a lot of engineering time this way.

They'll take care of, among others, running the right state files, in the right order, with the right parameters. But they'll also take care of many other things you need to run Terraform at scale and with big amounts of engineers (happy to expand but don't want to kitchen-sink this comment).

Disclaimer: Take this with a sensible grain of salt, as I work at Spacelift[0] - one of the TACOS (and of course the one I'll shamelessly link and recommend!).

But really, don't use tools like Jenkins for this as you scale, it'll likely hurt you in the long run.

[0]: https://spacelift.io


I'm sure that you have no control over this, but I really wish Spacelift would increase the cost of its cloud tier and lower the cost of Ent. I'm in the anti-goldilocks zone. Ent seems priced for large teams, when I practically fit into the cloud offering save for a few missing required features.

Great product though from what I've experienced.


Disclaimer: Co-founder of Terrateam.

For Terrateam[0], we have probably 70% of the enterprise offering but at around 1/10th the price. If there are any features that are deal breakers, feel free to reach out to me and we'll see what we can do. That being said, Spacelift is a much more luxurious piece of software than us. We are very utilitarian, but we have to rationalize that low price point somehow.

[0] https://terrateam.io


Hi swozey. Spacelift sales leader here. Let's have a conversation and I'll work with you to find the goldilocks zone that you are looking for. Grab a demo with us and mention this post and my name "Ryan". We can dive into the features you require.


Sorry to hear that! Pricing is hard.

If you haven't yet, please try talking to our sales team. There's usually a way to make all sides happy with some custom agreements - after all, we'd love for you to be able to use our product as much as you need.


It seems to me, this is trying really hard to shoehorn Terraform into managing at scale. For multi-account, multi-org, multi-region, multi-cloud deployments is Terraform really supposed to be the state of the art? How do you even get visibility into the various deployment workflows?


What’s the alternative?


The solution at the end almost looks like the manual setup of terragrunt which we are using to manage lots of base infra in many different accounts.

What would be interesting here would be to see how they actually reference the outputs from one layer in the next layer. That is something that is not even solved nicely in Terragrunt, and one of the major annoyances for me there. Using dependencies and the mock_outputs option creates lots of noise in the plan outputs, as the dependencies are only completely resolved when Terragrunt applies all the modules.

But it seems I also missed a few additions to terraform - so probably there are better ways to take outputs from one terraform run into another one.


> At scale, many Terraform state files are better than one. But how do you draw the boundaries and decide which resources belong in which state files? What are the best practices for organizing Terraform state files to maximize reliability, minimize the blast-radius of changes, and align with the design of cloud providers?

1000% agree. I put together my version of standing up remote state in AWS into a Github repo. https://github.com/aryounce/terraform-aws-bootstrap

Our use of Terraform splits state exactly as described primarily to keep the state refresh times reasonable.


Aside from reducing the blast radius of any Terraform state (split by envs, then by teams as you grow), I highly recommend using cdktf with Python for Terraform projects. Huge timesaver


I don't know.

I kind of think using a language with native JSON support and structural type system would be best.

HCL also just works.


> Using the -target option to terraform plan is discouraged (the Terraform documentation says, “this is for exceptional use only”). Anyway, it’s likely to lead to confusing infrastructure states if changes are applied incrementally with ad hoc, situational boundaries.

We've been using -target for years, and while I understand very well the reasons it's discouraged, it is pretty much the only way you can have your cake and eat it too with respect to having "one large terraform project" and not running into terraform refreshes that make your eyes bleed.

You end up having to really understand your module structure to use it, but it let us develop some very elegant workflows around tasks like patching.
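For context, targeted runs look roughly like this (the module address is hypothetical):

    terraform plan  -target=module.app_servers
    terraform apply -target=module.app_servers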

We developed a ruby code base that utilizes rake and the hcl2json tool to automate terraform-based infrastructure workflows, using various libraries to handle and validate that applications are happy with what terraform is doing to their servers while it works.

This gives us flexibility to run automations safely against a terraform code base that has been evolving since version 0.3 or so, before most of the mistakes were made often enough to come up with the best practices we have today.


I settled on 1 subfolder, 1 “stack” (stage/app, for example dev/login/frontend). This gives us fast deploy times and an easy and painless way to destroy/re-create. Databases could go in a separate folder/state if we had any.

The point of Terraform is to have configuration in version control not to have a giant unmanageable state file.


I'm surprised the blog advocates embedded modules instead of remote ones stored in separate git repos. Remote modules allow you to tag and version them, and therefore progressively update modules.


The problem with TF is that it lures people into trying to be smart and beat the system, after which things often become a personal challenge instead of a business requirement. A true nightmare for the next person in line. Also ... every declarative language dreams of becoming a programming language ...


This is a great read, but I always seem to run into cases where I need to define something like a security group and then reference it when deploying ec2 instances. I'd love to decouple to reduce my plan time, but I haven't figured a way out as of yet.

To be fair, I haven't used terraform -chdir yet.



-chdir is useful, but has nothing to do with this (it's literally just a cd before running the command).

As sibling said, use data source to read remote state outputs.


not sure what you mean here.

Do you use the same security group for all of your instances?

i am usually creating a security group per group of related ec2 instances.


you can pull it in via a data source, but then of course this creates a coupling between multiple modules/state files.
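A minimal sketch of that lookup (the tag value the other stack sets is hypothetical):

    data "aws_security_group" "app" {
      tags = {
        Name = "app-sg"
      }
    }

    resource "aws_instance" "web" {
      ami                    = var.ami_id      # hypothetical variable
      instance_type          = "t3.micro"
      vpc_security_group_ids = [data.aws_security_group.app.id]
    }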


If many different "states" are used for one big architecture, how are the boundaries between them managed?

Under one state, Terraform will spot issues here during `plan`, but with many states issues will only appear after `apply`.


We’ve had a pretty good experience with Terraspace at work, which is an opinionated framework/layout for Terraform. It supports hooks and splitting state between regions and accounts.


Why doesn't Hashicorp provide official best practices like this?


This isn't a rated E for everyone practice.

Hashicorp is focused on CI/CD/cloud/workspace-driven workflows over monorepo `chdir`-driven ones.


What does the phrase "stamp out" mean in this context?


Rapidly create exact duplicates I think.



