
Two years with CloudFormation: lessons learned - kiyanwang
https://sanderknape.com/2018/08/two-years-with-cloudformation-lessons-learned/
======
jillesvangurp
We've been using CF for a few years as well. IMHO it is very complicated to
manage and getting stuff working involves a lot of trial and error. Also you
end up waiting a lot. Waiting for things to spin up, waiting for things to
become available, waiting for things to roll back, etc. On top of that the
failure modes can be ugly and hard to figure out.

My recommendation is to treat CF as a single point of failure. Once it gets in
a broken state, you may have to destroy your stack and rebuild it. Even if it
is fixable on paper, being able to just nuke a stack and replace it is a very
good thing. This has happened to us multiple times and having a plan helps.

So what I do with Elasticsearch, for example, is use 3 CF stacks (one for each
AZ). This lets me do things like rolling restarts in a sane way, simply by
replacing the stacks one by one, instead of doing some flaky deep integration
into CF to make it orchestrate a rolling restart without destroying my cluster
state.
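
A minimal sketch of that stack-by-stack replacement, assuming two hypothetical
callables that are not part of any AWS API: `replace_stack` (tears down and
recreates one per-AZ stack) and `cluster_is_healthy` (for example, a check that
the cluster has recovered). The stack names are made up as well.

```python
from typing import Callable, Iterable, List


def rolling_replace(
    stacks: Iterable[str],
    replace_stack: Callable[[str], None],
    cluster_is_healthy: Callable[[], bool],
) -> List[str]:
    """Replace stacks one at a time, refusing to continue on an unhealthy cluster."""
    replaced = []
    for name in stacks:
        # Never take a second AZ down while the cluster is still degraded.
        if not cluster_is_healthy():
            raise RuntimeError(f"cluster unhealthy before replacing {name}")
        replace_stack(name)
        replaced.append(name)
    if not cluster_is_healthy():
        raise RuntimeError("cluster unhealthy after final replacement")
    return replaced


if __name__ == "__main__":
    done = rolling_replace(
        ["es-az-a", "es-az-b", "es-az-c"],       # hypothetical stack names
        replace_stack=lambda name: print("replacing", name),
        cluster_is_healthy=lambda: True,          # stand-in for a real check
    )
    print(done)
```

In practice the health check would likely poll with a timeout rather than
answer immediately; that part is left out here.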

If I were to build this again, I'd probably use terraform. Also, I'm looking
forward to moving most of our stuff to kubernetes.

~~~
stingraycharles
> My recommendation is to treat CF as a single point of failure. Once it gets
> in a broken state, you may have to destroy your stack and rebuild

One of the more common scenarios where CF gets into a broken state is:

1) create new S3 bucket + something else (e.g. some Elastic Beanstalk env
update)

2) something else fails, causing rollback

3) S3 bucket already contained data (e.g. your failed Elastic Beanstalk env
update caused it to write data)

4) CF refuses to destroy the S3 bucket, entering a "rollback failed" state

In this case, manually wiping the S3 bucket works well enough. But generally,
it appears that CF only kind of works when the updates you're making are
really small and incremental.

Sometimes it gets totally corrupted and you need to nuke stuff, per your
advice. This automatically leads me to the following suggestion: leave
mission-critical data out of CloudFormation. Specifically, stuff like RDS
databases which you absolutely never ever want to have destroyed: just provide
the endpoint as an input to your CF template.
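
A rough sketch of what "endpoint as an input" could look like, written as a
Python dict that serializes to a CF template. The parameter name
`DatabaseEndpoint`, the application name and the `DB_HOST` variable are
assumptions for illustration, and the Beanstalk environment is only a
fragment (platform settings are omitted).

```python
import json


def app_template() -> dict:
    """Template that receives the database endpoint instead of creating the DB."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            "DatabaseEndpoint": {
                "Type": "String",
                "Description": "Hostname of a database managed outside this stack",
            }
        },
        "Resources": {
            "AppEnvironment": {
                "Type": "AWS::ElasticBeanstalk::Environment",
                "Properties": {
                    "ApplicationName": "my-app",  # hypothetical name
                    # Platform/solution stack settings omitted for brevity.
                    "OptionSettings": [
                        {
                            "Namespace": "aws:elasticbeanstalk:application:environment",
                            "OptionName": "DB_HOST",  # hypothetical env var
                            "Value": {"Ref": "DatabaseEndpoint"},
                        }
                    ],
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(app_template(), indent=2))
```

Since no database resource appears in the template, nuking and recreating the
stack cannot touch the database.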

~~~
SanderKnape
You can set the DeletionPolicy attribute to "Retain" to work around this S3
issue. CloudFormation will successfully roll back without attempting to remove
the S3 bucket. You can then remove it manually yourself before trying to
deploy again.

Check out the docs here:
[https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGui...](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-deletionpolicy.html)
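
For reference, a minimal sketch of where the attribute goes, built as a Python
dict and dumped to JSON. The logical ID `DataBucket` is made up; note that
`DeletionPolicy` sits next to `Type`, not inside `Properties`.

```python
import json

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "DataBucket": {  # hypothetical logical ID
            "Type": "AWS::S3::Bucket",
            # Resource attribute: keep the bucket when the resource is
            # removed from the stack or the stack is rolled back/deleted.
            "DeletionPolicy": "Retain",
        }
    },
}

if __name__ == "__main__":
    print(json.dumps(template, indent=2))
```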

~~~
cle
You _better_ do so before deploying again, because the roll forward will break
since the resource already exists.

This is a major pitfall when using DeletionPolicy=Retain with named resources.
It breaks seamless rollbacks/rollforwards. If you roll back, then in order to
deploy again you need to either delete the named resources with
DeletionPolicy=Retain that were rolled back, or update your template to rename
them all. It is such a huge pain.
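
One way to spot this combination before it bites is a small template check.
This is only a sketch: the `NAME_PROPERTIES` set is an illustrative,
non-exhaustive assumption about which properties pin a physical name.

```python
# Properties that fix a resource's physical name (illustrative, not exhaustive).
NAME_PROPERTIES = {"BucketName", "TableName", "QueueName", "TopicName"}


def retained_named_resources(template: dict) -> list:
    """Return logical IDs that are both explicitly named and retained."""
    hits = []
    for logical_id, resource in template.get("Resources", {}).items():
        properties = resource.get("Properties", {})
        is_retained = resource.get("DeletionPolicy") == "Retain"
        is_named = bool(NAME_PROPERTIES.intersection(properties))
        if is_retained and is_named:
            hits.append(logical_id)
    return sorted(hits)


if __name__ == "__main__":
    example = {
        "Resources": {
            "Logs": {
                "Type": "AWS::S3::Bucket",
                "DeletionPolicy": "Retain",
                "Properties": {"BucketName": "my-logs"},  # made-up name
            },
            "Scratch": {"Type": "AWS::S3::Bucket", "DeletionPolicy": "Retain"},
        }
    }
    print(retained_named_resources(example))
```

Resources without an explicit name get a generated physical name from
CloudFormation, so a redeploy after a rollback does not collide with the
retained one.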

~~~
SanderKnape
True, but it beats the alternative where CloudFormation deletes objects that
you didn't want deleted. The underlying issue is that the S3 objects are
outside of the CloudFormation scope, thus it takes no risk and doesn't delete
your objects.

A nice feature would be a "ForceDelete" DeletionPolicy where it would delete
the objects. You could even set this initially when creating a stack, and
change it to "Retain" later when the stack is stable.

Totally agree btw that it's a huge pain initially, though once you know it
it's also not that hard to work around.

~~~
cle
My preferred behavior would be for CFN to not barf when rolling forward. In
other words, to be able to assume control over a resource that already exists.

------
kokey
The only thing that spending time with Cloudformation teaches me is how much
it makes me prefer doing things with Terraform. I think Cloudformation is
considerably better than nothing and it was great when there were no
alternatives, but that was a while ago.

~~~
auslander
Terraform is terrible compared to Cloudformation. Its selling point is
multi-cloud support, but you'll never get it: clouds are too different.

- A good CF template is 10x less code for the same solution.

- No corrupted state problems.

- Native tool, supporting all properties of resources.

Writing good CF templates takes good AWS knowledge and systems thinking: you
group resources that belong together, and it actually _teaches_ you good
architecting.

~~~
kokey
I think Terraform's multi-cloud support is a bit better than Cloudformation's.
Jokes aside, I don't think the multi-cloud part is really the biggest selling
point. The biggest selling points, for me, are:

- Much better than Cloudformation at telling you what it's going to change
before you apply the changes, and the ability to record those changes (much
better than those dreaded 'conditional' changes).

- The ability to import changes if you found some that were done outside of
Terraform. It's not perfect, or easy, but mostly doable.

- The ability to look at the code, the state file and the plan to get a good
representation of what's actually deployed.

Those three are more significant than they look, and together they make sure
you:

- Don't get into a situation where automation is broken and you can only
recover by rebuilding the stack.

- Don't get unexpected downtime because a change replaces a resource
unexpectedly.

- Can track, record and manage changes in easy-to-read diffs and plans.

~~~
deboflo
The changesets feature of cloudformation allows users to do most of what you
mention here. Also take a look at resource deletion policies and Lambda custom
resources.
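
As a sketch of how a change set can be reviewed before executing it, the
function below scans a dict shaped like a DescribeChangeSet response (as
returned by boto3's `describe_change_set`) and flags removals and possible
replacements. The sample data is invented and no AWS call is made here.

```python
def risky_changes(change_set: dict) -> list:
    """List (logical_id, action, replacement) for removals and replacements."""
    risky = []
    for change in change_set.get("Changes", []):
        resource_change = change.get("ResourceChange", {})
        action = resource_change.get("Action")
        # Replacement is reported as the strings "True", "False" or "Conditional".
        replacement = resource_change.get("Replacement", "False")
        if action == "Remove" or replacement in ("True", "Conditional"):
            risky.append(
                (resource_change.get("LogicalResourceId"), action, replacement)
            )
    return risky


if __name__ == "__main__":
    sample = {  # invented example data
        "Changes": [
            {"Type": "Resource", "ResourceChange": {
                "Action": "Modify", "LogicalResourceId": "Database",
                "ResourceType": "AWS::RDS::DBInstance", "Replacement": "True"}},
            {"Type": "Resource", "ResourceChange": {
                "Action": "Modify", "LogicalResourceId": "Function",
                "ResourceType": "AWS::Lambda::Function", "Replacement": "False"}},
        ]
    }
    for item in risky_changes(sample):
        print(item)
```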

~~~
Rapzid
Unless they've fixed it, though, it didn't work well in certain situations,
like with nested stacks, and it often doesn't provide nearly the same level of
detail as to what EXACTLY is changing and why.

------
renke1
CloudFormation is pretty cool. In a rather short amount of time I was able to
create a reproducible deployment (based on any commit in my Git repo) that
deploys a Lambda, makes it accessible via API Gateway, creates a DynamoDB
table for storage, sets up a Cognito user pool for user management, creates a
CloudFront distribution that securely serves my SPA and the API Gateway, and
lastly adds a record to my domain such that it is accessible at
`${commit}.mydomain.com`.

~~~
technics256
Awesome. Care to share or point to best resources to learn? Thanks!

------
olafalo
IMO the limits of CloudFormation are a bigger pain point than they're made out
to be here. The limit of 200 resources per stack is easy to hit, and so is the
450KB template size limit (well, it's possible at least). It's frustrating to
need to spread a single service across three stacks because it has a lot of
API Gateway endpoints. The real answer is nested stacks, but those still count
towards the (raisable) total stack limit of 200.
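
A quick pre-deploy check against the two limits mentioned above could look
like the sketch below. The numbers are the ones quoted in this comment (they
may differ from current AWS limits), and the measured size depends on how the
template is serialized.

```python
import json

MAX_RESOURCES = 200               # per-stack resource limit quoted above
MAX_TEMPLATE_BYTES = 450 * 1024   # 450KB template size limit quoted above


def check_limits(template: dict) -> list:
    """Return human-readable problems; an empty list means within limits."""
    problems = []
    resource_count = len(template.get("Resources", {}))
    size = len(json.dumps(template).encode("utf-8"))
    if resource_count > MAX_RESOURCES:
        problems.append(
            f"{resource_count} resources exceeds limit of {MAX_RESOURCES}"
        )
    if size > MAX_TEMPLATE_BYTES:
        problems.append(f"{size} bytes exceeds limit of {MAX_TEMPLATE_BYTES}")
    return problems


if __name__ == "__main__":
    big = {"Resources": {f"Topic{i}": {"Type": "AWS::SNS::Topic"} for i in range(201)}}
    print(check_limits(big))
```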

~~~
kesor
Have you even read the article?

Using exported resources and avoiding nested stacks makes it impossible to
reach these numbers.

~~~
deboflo
Exactly. Stack imports/exports were launched to address the shortcomings of
nested stacks. Avoid using nested stacks in favor of stack imports/exports.
Actually, avoid both nested stacks and stack imports/exports in favor of SSM
parameters.
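
A sketch of the SSM approach, as two templates built in Python: one stack
publishes a value with an `AWS::SSM::Parameter` resource, the other reads it
through a parameter of type `AWS::SSM::Parameter::Value<String>`. The path
`/network/vpc-id` and all logical IDs are made up for illustration.

```python
import json

PARAMETER_NAME = "/network/vpc-id"  # hypothetical SSM parameter path


def producer_template() -> dict:
    """Network stack: creates a VPC and publishes its ID to SSM."""
    return {
        "Resources": {
            "Vpc": {
                "Type": "AWS::EC2::VPC",
                "Properties": {"CidrBlock": "10.0.0.0/16"},
            },
            "VpcIdParameter": {
                "Type": "AWS::SSM::Parameter",
                "Properties": {
                    "Name": PARAMETER_NAME,
                    "Type": "String",
                    "Value": {"Ref": "Vpc"},  # Ref on a VPC yields its ID
                },
            },
        }
    }


def consumer_template() -> dict:
    """App stack: resolves the VPC ID from SSM at deploy time."""
    return {
        "Parameters": {
            "VpcId": {
                "Type": "AWS::SSM::Parameter::Value<String>",
                "Default": PARAMETER_NAME,
            }
        },
        "Resources": {
            "AppSecurityGroup": {
                "Type": "AWS::EC2::SecurityGroup",
                "Properties": {
                    "GroupDescription": "App security group",
                    "VpcId": {"Ref": "VpcId"},
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(producer_template(), indent=2))
    print(json.dumps(consumer_template(), indent=2))
```

Unlike an export, nothing in CloudFormation ties the two stacks together here,
so either one can be replaced without the other blocking it.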

------
Cpoll
I recommend using Troposphere instead of vanilla CF. It's a Python library
that generates CF templates. It doesn't abstract out anything, so the
structure ends up looking very similar to a json or yml template, but with all
the conveniences of working with objects in Python.
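
To illustrate the general idea of generating templates from Python, here is a
sketch that deliberately does not use Troposphere's API: it builds plain dicts
with the standard library only. The helper names and queue names are invented.

```python
import json


def queue(visibility_timeout: int) -> dict:
    """One SQS queue resource with the given visibility timeout (seconds)."""
    return {
        "Type": "AWS::SQS::Queue",
        "Properties": {"VisibilityTimeout": visibility_timeout},
    }


def build_template(timeouts: dict) -> dict:
    """Generate one queue per entry, e.g. {"orders": 30} -> OrdersQueue."""
    resources = {
        f"{name.title()}Queue": queue(timeout)
        for name, timeout in timeouts.items()
    }
    return {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}


if __name__ == "__main__":
    template = build_template({"orders": 30, "invoices": 120})
    print(json.dumps(template, indent=2))
```

The resulting structure mirrors a hand-written template; the gain is loops,
functions and ordinary testing around it.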

The biggest gripe I have with CF is that it's impossible to introduce existing
components into a CloudFormation stack, so any legacy infrastructure has to
remain manually managed.

------
jonthepirate
Terraform does everything CloudFormation does but in a simpler way where you
have more control over what's happening.

~~~
zaphar
Cloudformation has better atomicity guarantees though. It's not perfect but in
general if a change to a stack fails it will get rolled back to a known good
state. Terraform doesn't give you the same guarantee. You'll have to push a
rollback or fix yourself leaving your AWS resources in a potentially broken
state while you do.

------
bevel
The biggest inconsistency I see with CloudFormation is in its support for
smaller AWS offerings. If I need to build a VPC with some EC2 capacity, it
works well. If I want to create a load balancer and use R53 to do DNS-based
certificate validation with their in-house SSL provider, I'm out of luck.

It looks like internal products need to work with CloudFormation to enable
support, and AWS doesn't have a consistent model here. It seems that they are
fine with some products cutting corners and not offering support (like
DNS-based certificate validation).

Inconsistency within AWS isn't all that surprising.

~~~
gazoakley
Terraform can do that:
[https://www.terraform.io/docs/providers/aws/r/acm_certificat...](https://www.terraform.io/docs/providers/aws/r/acm_certificate_validation.html)

That said, one irritating omission I've had to deal with is not being able to
add email subscriptions to SNS topics. The underlying AWS API is a bit odd - I
don't think it provides an ARN until the subscription is confirmed.

------
alecbenzer
How does CloudFormation compare with using something like ansible to manage
AWS environments?

~~~
unkoman
Cloudformation is infrastructure management, not configuration management.
Both Ansible and Cloudformation can be used for both in different ways, but
usually you have your configuration management (such as docker containers) in
one step of your pipeline and cloudformation templates in another. That way
you can test your infrastructure (by deploying cloudformation templates and
tearing them down) as well as your code without them being too dependent on
each other.

~~~
jbergknoff
This infrastructure/configuration distinction is very hazy when it comes to
services like Lambda or Fargate, where you just specify your code artifact and
there's essentially nothing more to do. It's not clear that it's a net benefit
to introduce additional tooling beyond CloudFormation/Terraform for deploying
to these services. It's certainly not strictly necessary.

------
Illniyar
Is there anyone here who has used both Amazon's CF and Azure's ARM and can
comment on the benefits and problems of each?

When I used CF a few years back (when it started) it was a pain (for those
things it actually supported). I'm now using Azure, and ARM's integration with
Azure's cloud seems better to me.

~~~
ghayes
I’d also love to hear any experiences people have with Google’s Deployment
Manager. For me, the product felt like it had many of the flaws that had
previously turned me away from CloudFormation (specifically, no full support
for beta or alpha features and questions about inconsistent states during
failures). I decided to go with Terraform since it feels like the industry
standard and had full support for even quite new GCP features.

------
djstein
I was unaware of the ability to create custom CF resources! This is great. I
will try to make a config to create an AWS Aurora Serverless RDS instance. It
went GA Friday, and the team says CF support won’t be available until the end
of the month.
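
For anyone else new to custom resources, a bare-bones sketch of a
Lambda-backed handler follows. It only shows the request/response handshake
(the handler must PUT a JSON result to the pre-signed `ResponseURL` in the
event); the actual provisioning is a placeholder, and the `send` argument and
the `custom-` physical ID scheme are assumptions made for illustration.

```python
import json
import urllib.request


def build_response(event, status, physical_id, data=None, reason=""):
    """Assemble the response body CloudFormation expects from a custom resource."""
    return {
        "Status": status,  # "SUCCESS" or "FAILED"
        "Reason": reason,  # should explain the failure when Status is FAILED
        "PhysicalResourceId": physical_id,
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data or {},
    }


def send_response(event, body):
    """PUT the response to the pre-signed URL supplied in the event."""
    payload = json.dumps(body).encode("utf-8")
    request = urllib.request.Request(
        event["ResponseURL"],
        data=payload,
        method="PUT",
        headers={"Content-Type": ""},
    )
    with urllib.request.urlopen(request) as reply:
        return reply.status


def handler(event, context, send=send_response):
    """Lambda entry point; `send` is injectable so the logic can be tested."""
    physical_id = event.get("PhysicalResourceId", "custom-" + event["RequestId"])
    try:
        # Placeholder: real work for event["RequestType"]
        # ("Create", "Update" or "Delete") would go here.
        body = build_response(
            event, "SUCCESS", physical_id, {"RequestType": event["RequestType"]}
        )
    except Exception as exc:
        body = build_response(event, "FAILED", physical_id, reason=str(exc))
    send(event, body)
    return body
```

Always sending a response, even on failure, matters: otherwise the stack
operation just hangs until it times out.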

------
kesor
Excellent advice! I would also advise creating a couple of scripts that upload
to S3 and run the update-stack commands automagically. Every piece of advice
in the article is gold though.
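
Such a script could be as small as the sketch below, which shells out to the
AWS CLI. The bucket, key and stack names are placeholders, and it defaults to
a dry run that only prints the commands.

```python
import subprocess


def deploy_commands(template_path, bucket, key, stack_name):
    """Build the two AWS CLI invocations: upload, then update-stack."""
    template_url = f"https://{bucket}.s3.amazonaws.com/{key}"
    return [
        ["aws", "s3", "cp", template_path, f"s3://{bucket}/{key}"],
        [
            "aws", "cloudformation", "update-stack",
            "--stack-name", stack_name,
            "--template-url", template_url,
        ],
    ]


def deploy(template_path, bucket, key, stack_name, dry_run=True):
    """Run (or, in a dry run, just print) the commands in order."""
    commands = deploy_commands(template_path, bucket, key, stack_name)
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops at the first failing command.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Placeholder names; pass dry_run=False to actually execute.
    deploy("template.yaml", "my-cfn-templates", "app/template.yaml", "my-app")
```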

