
Terraform Gotchas and How We Work Around Them - kalmar
http://heap.engineering/terraform-gotchas/
======
luhn
> Always write your plan -out, and apply that plan

I have in my dotfiles:

    
    
        alias tfplan='terraform plan -out=.tfplan -refresh=false'
        alias tffreshplan='terraform plan -out=.tfplan'
        alias tfapply='terraform apply .tfplan; rm .tfplan'
    

That way I never accidentally `terraform apply` without creating a plan first.
I also have it not refresh the state by default, which is mostly unnecessary
and speeds up the planning significantly.
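
For anyone who wants the same habit outside of shell aliases, here is a rough Python sketch of the idea. The function names and the `run` hook are made up for illustration, and it assumes `terraform` is on the PATH; the `-out` and `-refresh=false` flags are the ones from the aliases above.

```python
import os
import subprocess

PLAN_FILE = ".tfplan"  # same file name the aliases above use


def plan_command(refresh=False, plan_file=PLAN_FILE):
    """Build the argv for `terraform plan`, always writing the plan to a file."""
    cmd = ["terraform", "plan", f"-out={plan_file}"]
    if not refresh:
        # Skip the state refresh, like the tfplan alias.
        cmd.append("-refresh=false")
    return cmd


def apply_command(plan_file=PLAN_FILE):
    """Build the argv for applying a saved plan, never a bare apply."""
    return ["terraform", "apply", plan_file]


def apply_saved_plan(plan_file=PLAN_FILE, run=subprocess.call):
    """Apply the saved plan, then delete it so a stale plan is never reused."""
    if not os.path.exists(plan_file):
        raise FileNotFoundError(f"no saved plan at {plan_file}; run a plan first")
    try:
        return run(apply_command(plan_file))
    finally:
        # Mirrors `terraform apply .tfplan; rm .tfplan`: remove regardless of outcome.
        os.remove(plan_file)
```

Refusing to apply when no plan file exists is what gives the "never accidentally apply" guarantee; the `run` parameter is only there so the wrapper can be exercised without a real Terraform binary.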

~~~
sethvargo
Hey all - Seth here from HashiCorp (the company that makes Terraform). The
next version of Terraform (0.10) natively adopts very similar behavior,
presenting a plan before applying as an added safety step. You can read more
in the 0.10 upgrade guide. At the time of this writing, 0.10 is not yet
released, but compiling Terraform from source at master will inherit this
behavior.

[https://github.com/hashicorp/terraform/blob/master/website/s...](https://github.com/hashicorp/terraform/blob/master/website/source/upgrade-guides/0-10.html.markdown)

~~~
scrollaway
I feel bad about not following up on it yet but my comment on TF#13276 sums up
the issues I have with Terraform after using it for a little under a year now.

[https://github.com/hashicorp/terraform/issues/13276](https://github.com/hashicorp/terraform/issues/13276)

I hope you all can work on improving the definitions, because many of them
really are a chore compared to setting things up in the AWS dashboard, at the
moment (security groups for example).

------
7ewis
Terraform has interested me for a while, and I've been meaning to give it a
try, but haven't had a chance just yet.

From what I have seen so far though, there isn't _really_ that much
difference/benefit over CloudFormation. We currently have 95% of our resources
in AWS with about 4% in Azure, and 1% in Google Cloud. It's great that
Terraform is 'multi-cloud' but it still seems like you have to write .tf's
catered to each cloud, you can't just lift and shift to another cloud by
copying and pasting a file?

People say the 'plan' feature is one of the advantages over CFN, but as far as
I can tell, CFN now offers the same feature... it tells you what's going to
change when you upload a new stack.

I sound like a CFN advocate now, but I genuinely don't have _that_ much
experience with it, and really do want to give Terraform a chance. Convince
me?

 _Oh, and since CFN started supporting YAML it looks easier to write too_

~~~
kjhosein
I've been thru the CFN v TF question. We came up with a list of benefits of TF
over CFN. (Yes, I know - one-sided, but we wanted to document the decision
with a bit more substance than "oh it's just better")

    
    
      * Ability to separate data (variables/parameters) from configs.
      * Easier to read (well at least pre-YAML CFN). 
      * Allows comments in the code.
      * Version control changes (diffs) are easier to read.
      * Multi-Cloud support. Works against AWS, Google Compute, Azure, Docker, more.
      * Multi-provider in general: can provision resources across multiple different cloud providers at once.
      * Can write modules in TF that can be reused in multiple different configs.
      * Tracks state via a version-controllable data file.
      * 'terraform plan' is essentially a no-op mode to see what changes would occur without actual running or making changes.
      * Actively developed.

~~~
solidsnack9000
Cloud Formation has good support for use from Python, Ruby, Node and the JVM
(with template generators, to help out). If you're writing JSON directly, yes,
some of the points above -- the first four, and the seventh -- are an issue;
but if you use Python you get all the benefits of it being "real code" and
"just a library" (unlike Terraform).

~~~
hamandcheese
This is a situation where I think not being "real code" is a feature of
terraform. You declaratively represent your infrastructure rather than
generate it with real code.

~~~
solidsnack9000
Over time, I have come to view "declarative infrastructure" as unrealistic.
It's right 90% of the time, but not 100% of the time -- kind of like using
only CSS and HTML. One should use markup whenever possible; but not everything
on a page is truly "declarative". Occasionally one needs to script an input
field or a transition.

One example of this is scripting the handoff process that's part of a
blue/green deploy. In practice you'll want to look at organization defined
metrics. There are libraries to do this -- either internal to your
organization, or provided by a metrics vendor -- and scripting the process
looks like this:

    
    
        (1) Setup new environment.
        (2) Divert some traffic.
        (3) Check metrics.
        (4) If metrics are okay:
            (4.1) Post message internally (IRC/Slack).
            (4.2) Divert all traffic.
            (4.3) Set up timed task to tear down old environment (in a day, hour, &c.).
        (5) If metrics are not okay:
            (5.1) Post message internally (IRC/Slack), maybe to different people.
            (5.2) Stop diverting traffic.
            (5.3) Tear down new environment.
    

A large part of the work here _is_ declarative: (1) by itself is a big piece
of it, and is fully declarative, as is the teardown in (4.3) and (5.3).
However, the need for control flow in this and many other cases means that,
without a library, one must drive Terraform by templating and shelling out.
Not being "real code" pushes one in the direction, not of greater
declarativeness (libraries can certainly have declarative interfaces, like
Troposphere does), but of worse code.
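
To make the control flow concrete, the steps above could be sketched in Python roughly like this, with the second branch taken when the metrics check fails. Everything here is hypothetical: `env`, `metrics_ok` and `notify` stand in for whatever infrastructure library, metrics client and chat hook an organization actually has, and the 10% initial traffic split is an arbitrary choice.

```python
def blue_green_handoff(env, metrics_ok, notify):
    """Run the handoff steps; returns True if the new environment was kept.

    `env` is any object with setup(), divert(fraction), schedule_old_teardown()
    and teardown_new() methods; `metrics_ok` and `notify` are plain callables.
    All of these are hypothetical hooks, not a real library API.
    """
    env.setup()                      # (1) the declarative part
    env.divert(0.1)                  # (2) send some traffic to the new environment
    if metrics_ok():                 # (3) check organization-defined metrics
        notify("handoff succeeded")  # (4.1)
        env.divert(1.0)              # (4.2) divert all traffic
        env.schedule_old_teardown()  # (4.3) timed teardown of the old environment
        return True
    notify("handoff failed, rolling back")  # (5.1)
    env.divert(0.0)                  # (5.2) stop diverting traffic
    env.teardown_new()               # (5.3)
    return False
```

The declarative pieces stay behind `env.setup()` and the teardown calls; only the branching lives in ordinary code.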

Many complex and powerful features are exposed to a modern business through
libraries -- AI, payments, telephony -- and software defined infrastructure
can be, too. The benefits of "infrastructure as code" won't be realized until
that happens.

------
lobster_johnson
I've wanted, and tried and failed, to adopt Terraform several times now. What
always gets in my way is that we _already_ have all our infrastructure in
place, and Terraform's import capabilities are too limited.

For example, the last time I used it, a few months ago, it was not able to
import almost any of our Google Cloud stuff, and I discovered that import
support is only provided for some resources. There's a third-party tool called
Terraforming, but it apparently only works with AWS.

I'm quite disheartened that the world is lagging this far behind. The only
competitor I've found is Salt, and I found its orchestration support to be a
bit of a mess. And just as with Terraform, the code is constantly lagging
behind the providers.

The one provider I'd have expected to be on the forefront of orchestration is
Google, and in a different multiverse their engineers are swarming around
Terraform to make sure it has top-notch, official, first-class support, but
alas, not in this one.

Are there any competitors that provide a smoother experience?

~~~
danawillow
Hey there- Dana from Google here, I lead the efforts around Terraform from our
side. You'll be happy to know that in the last 2 months alone we've added
import for:

    
    
      - google_bigquery_dataset
      - google_bigquery_table
      - google_compute_address
      - google_compute_disk
      - google_compute_global_address
      - google_compute_route
      - google_compute_network
      - google_dns_managed_zone
      - google_sql_user
      - google_storage_bucket
    

, with more to come shortly!

We only have one open issue around import, so if there are other resources
you'd like to see imported feel free to file an issue:
[https://github.com/terraform-providers/terraform-provider-go...](https://github.com/terraform-providers/terraform-provider-google/issues)
(just moved to a new repo a few days ago, and we're still in
the process of getting existing issues moved over). A big factor in our
prioritization of what to work on is based around issues filed (and thumbs ups
on those issues), so that's a great way to get in touch with the team.

If you have any other questions around the Terraform+GCP experience, feel free
to ask us in the #terraform channel in the GCP Slack
([https://gcp-slack.appspot.com/](https://gcp-slack.appspot.com/) if you aren't already
there). Best of luck, and do reach out if you need anything!

~~~
lobster_johnson
Thanks! I'll take yet another look at Terraform, then. Fourth time's the charm,
or something.

My lasting fear is that even if it has 90% of the support, there will always
be one thing, or one edge case or bug, that will become an annoying blocker.
Using a tool like TF means becoming dependent on it to a large extent.

------
ian_d
I've been using terraform for a couple of months now (love it), but honestly
our biggest pain was just project organization. It looks like a lot of people
make a file per-resource type (elb.tf, ec2.tf, rds.tf) but we thought that
would be a lot of bloat. We opted to have a file per system (dev_db.tf,
dev_ecs_asg.tf, dev_haproxy.tf, etc) and everything related to that particular
system is contained in a single file (security groups, dns entries,
roles/profiles, etc). But it's still in one flat directory per environment. (I
know tf has introduced environments, but we haven't switched over yet.)

I know you can hack this together with modules, but it seems like
environment/project organization would be easier if _terraform just recursed
subdirectories_. Right? I've seen a couple of issues for it, but I don't
believe I've seen a concrete reason why it's a no-go.

------
philsnow
There's another similar issue with how EC2 security group rules are encoded:
you can encode them either as ingress/egress stanzas on an
"aws_security_group" resource, or you can attach rules to a security group
resource with separate "aws_security_group_rule". You can't mix the two
approaches on a single security group resource.

We adopted the ingress/egress stanza on security group resource approach.

If we ever wanted to change to the other approach (as described in the
article), I don't think I would do state surgery by hand or even use
"terraform state mv". I would:

    
    
        1. change terraforming to generate .tf files and tfstates the way I want
        2. remove the security groups from my config and my state
        3. use terraforming to regenerate the .tf files and tfstate

~~~
mitchellh
Hey there. Disclaimer: one of the creators of Terraform

I wanted to apologize that this is super confusing. All the scenarios where
this exists (there are many) are historical. We originally went with the
"nested" approach and now prefer the "split" approach for good reasons shown
to us by users. But we kept both for backwards compatibility reasons. We have
no good mechanism to enforce a migration at the moment. There are a couple
ways we can resolve this technically in the future. For now, we should
probably make sure the docs are annotated in all situations of the limitations
of nested vs. standalone. I'll mention this to the team!

------
pavement
Oh, geeze. This is about: terraform.io

~~~
PhasmaFelis
Yeah, a title change would be nice. I was anticipating something much more
interesting.

~~~
paulddraper
And I wanna hoverboard.

------
johnmarcus
I absolutely can't stand how destructive terraform is by nature. We have
switched to Ansible, which has an excellent AWS module, and never looked back.

~~~
smt88
Why Ansible rather than Elastic Beanstalk or Cloud Formation?

~~~
an27
It's vendor-agnostic? And CF is super slow and limited to a small number of
resources (not sure if it's a 100 or a 1000).

~~~
paulddraper
Definitely 100. And every little thing is a resource.

~~~
80x25
[http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuid...](http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cloudformation-limits.html)
Am I missing something? Nothing in these docs mentions a 100-resource limit.

~~~
paulddraper
It used to be 100. Looks like it's 200 now.

~~~
80x25
That's 200 resources per CF template. That's a massive CF template :)

------
iofiiiiiiiii
I am just now implementing Packer and Vagrant in our devops workflows.
Terraform is next on the list.

So far, it leaves me rather anxious - Packer and Vagrant appear to offer the
bare minimum of usable functionality, with any advanced scenario bumping into
(sometimes intentional) walls.

For example, it takes me 15-20 minutes to transfer a 50 MB file to a Windows
VM being created by Packer. The GitHub issue, filed nearly 2 years ago, is
closed with a comment that this is by design:
[https://github.com/hashicorp/packer/issues/2648#issuecomment...](https://github.com/hashicorp/packer/issues/2648#issuecomment-307354697)

Yet there is a PowerShell command that uses the same communication mechanism
that can somehow do it in a matter of seconds. Of course, I cannot use this
PowerShell command because Packer does not give me a variable with a machine's
IP address because... it is improper somehow?
[https://github.com/hashicorp/packer/issues/4993](https://github.com/hashicorp/packer/issues/4993)

What the hell, Hashicorp...

I have a list of 10+ issues I have found so far and I am only starting to use
these tools. From the activity in GitHub, they seem to be abandonware.

Maybe if I submitted PRs they might be accepted (then again, maybe not:
[https://github.com/hashicorp/packer/pulls](https://github.com/hashicorp/packer/pulls))
but I expect more from software than just accepting PRs - I expect its authors
to actually develop it and to show an interest in improving it.

There is unfortunately nothing better out there. I admit, I am forced to use
these products even though I do not find them satisfactory and the authors do
not seem helpful.

If I had to start all over again with my current knowledge, I might perhaps
just write my own scripting and skip Packer/Vagrant altogether. The value they
offer with VM management comes with the downside of being left in the mud and
having the system work against you when you try something nontrivial.

I am scared of what I will find when I touch Terraform. As I write this, I
think I will first see whether I can just script it manually.

------
kalmar
Hey author here! Happy to answer any questions etc :-)

~~~
toomuchtodo
No questions, just a suggestion: implement the part where the terraform plan
is added as a comment in the PR. We set this up at my current employer and it
makes the review process much quicker (also, commenting on lines in the ~plan~
terraform code changes is the bee's knees).

Don't have the apply be automatic after a review is approved; terraform
applies occasionally go sideways and need human intervention (remember:
rollbacks are not automatic). A human should always kick off the apply and
monitor state change activity.
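
A rough sketch of the comment-building half in Python, for illustration only. It assumes the human-readable plan text has already been captured by the CI job; `plan_comment` and the `max_chars` limit are invented names and values, and actually posting the comment depends on the code-review tool, so that part is left out.

```python
FENCE = "`" * 3  # built at runtime so this example nests cleanly in Markdown


def plan_comment(plan_text, max_chars=60000):
    """Wrap human-readable plan output in a Markdown comment body.

    `max_chars` is an assumed safety limit, not a documented constant;
    check the comment size limit of your code-review tool.
    """
    truncated = len(plan_text) > max_chars
    body = plan_text[:max_chars]
    lines = ["Terraform plan for this change:", "", FENCE, body.rstrip("\n"), FENCE]
    if truncated:
        lines += ["", "(plan output truncated)"]
    lines += ["", "Apply is manual: a human runs it and watches the result."]
    return "\n".join(lines)
```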

~~~
captn3m0
Questions from our team:

- Are you commenting with the output of show on the planfile to get a
human-readable version?

- Line by line commenting on comments?

- Do you have state-splits? Do you run plan on each individually for every
PR?

~~~
toomuchtodo
May I email you answers to these?

~~~
captn3m0
Yes, that works. (email in profile)

------
nunez
here's something i got bit by more recently re: terraform plan -out and
tooling using Terraform's Golang API.

Handling package dependencies with Go is not straightforward. There are several
ways of doing it, and none are native to Golang.

Additionally, Go doesn't support getting versions of packages by tag or
branch.

This bit me hard when I tried to update Palantir's TFJSON utility (turns
tfplan binaries into json) so I could do unit testing of my Terraform plans
with rspec.

The utility depended on v0.7.4 of terraform, but Terraform maintains a plan
format constant that defines which plans can be used by what versions. They
changed the plan format between 0.7.4 and 0.9.8 without bumping that constant,
so when I tried running tfjson against plans created by the latter version, I
got a weird non-matching datatype error that took a while to figure out. (I
eventually had to vimdiff the hex outputs of plans created by both versions to
figure that out.)

Additionally, HashiCorp made a significant change to the way they handled
providers between 0.9.8 and 0.10.0 that justified bumping the plan format
version AGAIN. The catch: 0.10.0 isn't released yet, despite that being the
code in their master branch.

I figured that updating tfjson's vendored terraform library to 0.9.8 would
solve it. I first did a go get to fetch the latest TF codebase and used gvt to
vendor it. That's when I discovered that plans generated by 0.9.8 are no
longer compatible. After discovering that go get can't fetch packages by tag
(Hashicorp tags their release commits) because Google believes in stable
HEADs, I had to find a tool that could support fetching packages by tags.
Govendor did that, so I used that.

It takes FOREVER to fetch all of the subpackages used by terraform. I couldn't
do it during a three hour flight. Rubygems has its problems, but fetching deps
isn't one of them. And even when I thought I fetched the entire source tree at
v0.9.8, I would still get errors about missing types or missing packages.

I'm hopeful that I'll eventually find a solution, but it's a dog compared to
using Gemfile.lock.

------
mental_
I thought terragrunt was a must have for that kind of deployment.

~~~
brazzledazzle
I'm curious about that as well. I was told by coworkers that Hashicorp added
support for DynamoDB which rendered terragrunt redundant but I haven't had
time to look into it.

~~~
Florin_Andrei

    terraform {
      backend "s3" {
        region         = "us-west-1"
        bucket         = "foo-tf-us-west-1"
        key            = "foobar.tfstate"
        dynamodb_table = "tf-lock"
      }
    }

[https://www.terraform.io/docs/backends/types/s3.html#dynamod...](https://www.terraform.io/docs/backends/types/s3.html#dynamodb_table)

~~~
tjbiddle
Awesome! Any idea when this was added? I feel like this wasn't in the
documentation a week or two ago; everything had still said "Use Hashicorp
Atlas for remote state locking".

~~~
Florin_Andrei
I've been using it for weeks if not months.

Their documentation might be lagging occasionally. It's a small team tackling
a big challenge.

------
Artemis2
This sums up our experience with Terraform perfectly:

> Most outages are caused by human error and configuration changes, and
> applying Terraform changes is a terrifying mix of the two.

Terraform is a great tool nonetheless. Just like Heap, we have code reviews
for the configuration itself, and a CI pipeline for validating it. This
pipeline is quite superficial (`terraform validate` mostly does syntax
checking), so we too are working on using centralized state to run `terraform
plan` for reviews.
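
One way such a CI step might look, sketched in Python. The function names and the `run` hook are illustrative, and it assumes `terraform` is on the PATH with access to the shared state. It relies on `terraform plan -detailed-exitcode`, which exits with 0 when there is nothing to change, 1 on error, and 2 when the plan contains changes.

```python
import subprocess


def classify_plan_exit(code):
    """Map the exit status of `terraform plan -detailed-exitcode` to a label."""
    if code == 0:
        return "no-changes"
    if code == 2:
        return "changes"
    return "error"  # 1, or anything unexpected, is treated as a failure


def ci_plan(run=subprocess.call):
    """Validate, then plan; returns 'no-changes', 'changes' or 'error'."""
    if run(["terraform", "validate"]) != 0:
        return "error"
    return classify_plan_exit(run(["terraform", "plan", "-detailed-exitcode"]))
```

A pipeline could then fail the build on "error" and flag the review for closer attention on "changes".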

------
sevagh
>Terraform state surgery

Did you try to use `terraform state mv`? I've found that command useful
(albeit for much less than thousands of resources).

~~~
kalmar
I don't think it would work in this case, as the `ebs_block_device` block
isn't a resource. In fact, the TF state doesn't even have the volume IDs for
them!

An alternative to doing this was `terraform import` on all the volumes, then
defining attachments, and hoping it all worked when you run `terraform plan`.
I don't 100% remember now why we didn't do that.
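
For the curious, the import half of that alternative could be scripted along these lines. This is a sketch, not what we ran: the resource names and volume IDs are invented, and it only builds `terraform import ADDRESS ID` commands for `aws_ebs_volume` resources. Matching `aws_ebs_volume` and `aws_volume_attachment` blocks would still have to be written in the config, and the resulting plan checked by hand.

```python
def import_commands(volumes):
    """Build one `terraform import` argv per volume.

    `volumes` maps a resource name of your choosing to an EBS volume ID.
    """
    return [
        ["terraform", "import", f"aws_ebs_volume.{name}", volume_id]
        for name, volume_id in sorted(volumes.items())
    ]


# Hypothetical names and IDs, for illustration only.
for cmd in import_commands({"data_a": "vol-0aaa", "data_b": "vol-0bbb"}):
    print(" ".join(cmd))
```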

------
nategri
So... _ahem_ what other idiots came here expecting a post about
troubleshooting Martian habitability?

::Sulks off dejectedly::

~~~
irfanka
At least one :)

~~~
jjtheblunt
at least two :)

~~~
pugworthy
at least many!

