It seems like the assembly language of the AWS ecosystem.
Does anyone else have a favourite hammer for this particular nail? I'd love to have something better than our home-baked solution, but I'm yet to find anything which doesn't introduce other flaws, such as an incomplete implementation (missing parameters or resource types) or ultimately making a leaky abstraction on top of CloudFormation somehow.
I ended up brewing a reasonably straightforward solution using Python as a (minimal) DSL which emits JSON. Its primary purpose is to support the whole of the CFN ecosystem (not just implement some small part of EC2, for instance) while also not trying to be too clever.
It has about 50-100 lines of Python which implement helper functions such as ref(), join() and load_user_data(), and not much else. There is an almost 1-to-1 correspondence between the generated CFN configuration and the Python source. As a bonus, it checks for a few common mistakes like broken refs or parameters which aren't used.
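To give a rough idea of the shape of such a DSL, here is a minimal sketch. It is not the actual code described above: the bodies of ref() and join(), the check() and find_refs() helpers, and the example template are all assumptions made up for illustration, and load_user_data() is left out because it would need to read a file.

```python
import json


def ref(name):
    """Emit a CloudFormation Ref to a parameter or resource."""
    return {"Ref": name}


def join(delimiter, *parts):
    """Emit a CloudFormation Fn::Join over the given parts."""
    return {"Fn::Join": [delimiter, list(parts)]}


def find_refs(node):
    """Recursively collect every name used in a {"Ref": ...} node."""
    if isinstance(node, dict):
        if set(node) == {"Ref"} and isinstance(node["Ref"], str):
            return {node["Ref"]}
        found = set()
        for value in node.values():
            found |= find_refs(value)
        return found
    if isinstance(node, list):
        found = set()
        for item in node:
            found |= find_refs(item)
        return found
    return set()


def check(template):
    """Return (broken_refs, unused_parameters) for a template dict (hypothetical helper)."""
    params = set(template.get("Parameters", {}))
    resources = set(template.get("Resources", {}))
    used = find_refs(template.get("Resources", {})) | find_refs(template.get("Outputs", {}))
    # Pseudo parameters such as AWS::StackName are always defined, so skip them.
    broken = {
        name for name in used
        if name not in (params | resources) and not name.startswith("AWS::")
    }
    unused = params - used
    return sorted(broken), sorted(unused)


# Illustrative template: the Python source mirrors the JSON almost 1-to-1.
template = {
    "Parameters": {
        "InstanceType": {"Type": "String", "Default": "t2.micro"},
        "KeyName": {"Type": "String"},  # never referenced: an unused parameter
    },
    "Resources": {
        "WebServer": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "InstanceType": ref("InstanceType"),
                "SecurityGroups": [ref("WebSecurityGroup")],  # not defined: a broken ref
                "Tags": [{"Key": "Name", "Value": join("-", ref("AWS::StackName"), "web")}],
            },
        },
    },
}

broken, unused = check(template)
print(json.dumps(template, indent=2))
print("broken refs:", broken)
print("unused parameters:", unused)
```

Because the template is just a Python dict, any resource type or property CloudFormation supports can be written without the DSL needing to know about it, which is the point of keeping the helper layer thin.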
I have heard that similar solutions have been reinvented in a few places, including the BBC. But I'm yet to see a good public solution!
There are validation problems even within specific tools. Just look at the RDS setup alone: a ton of options that are often mutually exclusive. It's brutal. And don't get me started with security groups.
At Monsanto, we built our own toolset and open sourced it. It is all Scala, so it might be a bit of a learning curve for many folks, but there's an actual attempt in there at making sure that, if you write something invalid, the tool blows up before it gets to AWS, instead of AWS realizing that something went wrong and having to roll everything back.
Writing reusable code in Terraform is an exercise in frustration due to the extreme clumsiness of HCL (which, I understand, was used because "YAML is complicated"--well, that's true, but YAML isn't a good solution either, you're HashiCorp, you wrote Vagrant, you already know how to do this!). The application architecture is reckless and full of race conditions; your state will be hosed if one resource errors out at the wrong time, while other resources are being successfully updated--the resources that return successfully after the failed resource will on many occasions fail to be persisted to state. What's more, application testing seems to be at best an afterthought: there have been regressions in the providers that will break your existing states.
I would under no circumstances use Terraform if I didn't have clients who had selected it before I was working with them. If in AWS, I would use CloudFormation, with a tool like Cfer (which is excellent, reliable code) or SparkleFormation (which is more full-featured but I hope you never need to debug it) to provision my stuff.
(Full disclosure: I'm building a much, much better provisioner for multi-provider cloud infrastructure. Neither of the projects I recommend is mine; mine's not done yet.)
 - https://github.com/seanedwards/cfer
 - http://www.sparkleformation.io/
It's older than CloudFormation and Terraform (born 2010). It can manage anything that someone's written a driver for. So far that includes AWS, Azure, vSphere/vCloud, OpenStack, VirtualBox, Google Compute Engine, Apache CloudStack and there might be others I missed.
It stores state in a database. It is able to recover from mismatches between the state of the world and the desired state. Cloud Foundry users have been using it for years to deploy and update CF installations. Pivotal Web Services (I work for Pivotal, in a different division) has been upgrading to the most recent CF release every few weeks, live, without much fuss, for years.
For any kind of heavily stateful infrastructure, BOSH is a strong candidate.
Honestly curious, can you point to one or two?
I have a sneaking hunch that the continuing problems with template-file resources (complaining that the "rendered" attribute doesn't exist in dependent resources) are related to this, but can't prove that; my clients don't pay me to debug Terraform, but to get their stuff working, and that doesn't leave much time to get in-depth with it now that I've decided not to use it for my own purposes anymore.
One of the convenient things about software that doesn't exist is that it doesn't have any bugs.
Let your software speak for itself when it exists; until then, this seems an undeserved critique of software, and a team, that is solving problems every day.
HCL has improved dramatically, and now that template strings are a thing, most of my variable interpolation issues are solved. However, you still can't specify lists as input variables, so you frequently have to resort to joining and splitting strings. It's hackish and, worse, changing one value in the list will invalidate all other resources that use the variable.
Race conditions and dependency cycles are still a problem, particularly with auto scaling groups and launch configurations -- I have to migrate them in two steps (create then destroy) to avoid a conflict. Same with EBS volumes: I ended up scripting my instance to attach the volume by itself, otherwise there are ordering issues when destroying and replacing.
There are also missing features, such as the ability to create ElastiCache Redis clusters and CloudFormation resources.
I'm still glad that I went with Terraform though. It takes a good amount of time to get around the limitations and bugs, which can be really frustrating, but when it works, it works beautifully.
Terraform is relatively new and improving rapidly. It has its problems, but it's light-years beyond CloudFormation. It's clear that Amazon doesn't place a high priority on making CloudFormation easy to use, or to support new features. The right approach to any problems with Terraform is not to spread FUD about it like below, but to contribute code fixes.
Oh, HCL is fine, you say so authoritatively? Well then do me a solid and show me an if statement, show me a for loop. Because you're not building nontrivial, reusable infrastructural modules without logic. I know. I've tried. I've committed, between different projects and clients, somewhere around ten thousand lines of Terraform and probably half are copy-paste garbage because HCL is so crippled a tool.
It hurts me to say this at a deep and visceral level: Terraform's interpolation syntax makes freaking Ansible and its "no, really, it's totally cool, string templates for logic are awesome" look good.
> The right approach to any problems with Terraform is not to spread FUD about it like below, but to contribute code fixes.
Spread FUD? Oh, no no no, you can take your assertions of FUD and insert them somewhere uncomfortable, thank you very much. I wrote Terraframe specifically to contribute back to the Terraform community, to make it better, and stopped (to create a different project) because I was stymied. By no documentation, by HCL <-> JSON not actually working, and by no interest from the developers in any sort of dialogue about actually fulfilling the promises they themselves assert for their software. Between this and bugs that a trivial testing framework should catch (Why are you validating AWS resource names differently between point releases? Why are you changing that validation to be wrong? Why are you breaking my existing states when you've done this? Why did your tests not catch this before you pushed this out to your entire userbase?) I cannot take the project seriously as a tool for being used in infrastructure I care about. Because I don't trust them to take Terraform seriously, either.
 - https://github.com/eropple/terraframe
There's still a lot to do to make it ideal for public consumption (like writing docs and freezing the API), but it'll get there sometime soon. PRs are most welcome.
What I love most about Terraform is that we can include the output of terraform plan in pull requests that make infrastructure changes. Then our continuous deployment process runs plan again and requires an identical output before running apply. This both makes it easier for team members to review changes and ensures that we don't accidentally destroy infrastructure, which is really easy to do with a lot of these infrastructure-as-code tools.
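The "require an identical plan before apply" gate can be sketched roughly like this. It is only an illustration: plan_fingerprint() and safe_to_apply() are hypothetical names, the whitespace normalisation is an assumption, and the sample strings are placeholders rather than real terraform plan output.

```python
import hashlib


def plan_fingerprint(plan_text):
    """Hash a plan's text after normalising line endings and trailing whitespace."""
    lines = [line.rstrip() for line in plan_text.strip().splitlines()]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def safe_to_apply(reviewed_plan, fresh_plan):
    """Allow apply only if the fresh plan matches the one reviewed in the pull request."""
    return plan_fingerprint(reviewed_plan) == plan_fingerprint(fresh_plan)


# Placeholder plan texts (not real terraform output).
reviewed = "+ aws_instance.web\n~ aws_security_group.web\n"
fresh_same = "+ aws_instance.web  \r\n~ aws_security_group.web\r\n"
fresh_drifted = reviewed + "- aws_db_instance.main\n"

print(safe_to_apply(reviewed, fresh_same))     # same changes, different whitespace
print(safe_to_apply(reviewed, fresh_drifted))  # an extra destroy appeared since review
```

In a pipeline, the deployment job would stop when safe_to_apply() returns False, so that a change nobody reviewed (such as the extra destroy in the second case) never reaches apply.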
The other thing that Terraform has going for it over CloudFormation is for hybrid cloud deployments, since it can provision infrastructure in vSphere and OpenStack as well as AWS.
Stuff we've thought of but haven't gotten around to yet:
- Build relatively simple tooling around terraform and Consul to acquire a lock before running apply...we haven't gone to that length yet since only our continuous deployment environment has credentials to mutate production and it runs builds of the infrastructure project sequentially.
- Watching the Consul key where the tfstate is stored for changes to kick off sanity checks to ensure that everything is still healthy.
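The locking idea in the first item could look something like the sketch below. It deliberately does not call Consul: InMemoryKV is a made-up stand-in with a put-if-absent operation, and apply_lock(), LockHeld and the key name are hypothetical. A real version would need a store that makes the acquire step atomic across machines.

```python
from contextlib import contextmanager


class InMemoryKV:
    """Stand-in key/value store with put-if-absent (hypothetical, not Consul's API)."""

    def __init__(self):
        self._data = {}

    def put_if_absent(self, key, value):
        """Store value and return True only if the key is not already set."""
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def delete_if_owner(self, key, value):
        """Remove the key only if it still holds the caller's value."""
        if self._data.get(key) == value:
            del self._data[key]


class LockHeld(Exception):
    """Raised when another run already holds the apply lock."""


@contextmanager
def apply_lock(kv, key, owner):
    """Hold the lock for the duration of the with-block, releasing it even on error."""
    if not kv.put_if_absent(key, owner):
        raise LockHeld(f"{key} is already held")
    try:
        yield
    finally:
        kv.delete_if_owner(key, owner)


kv = InMemoryKV()
log = []
with apply_lock(kv, "locks/terraform-apply", "build-1"):
    log.append("build-1 applying")  # terraform apply would run here
    try:
        with apply_lock(kv, "locks/terraform-apply", "build-2"):
            log.append("build-2 applying")
    except LockHeld:
        log.append("build-2 refused")
with apply_lock(kv, "locks/terraform-apply", "build-3"):
    log.append("build-3 applying")
print(log)
```

Releasing in a finally block matters here: if apply fails partway, the next run should still be able to take the lock rather than waiting on a stale one.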
They're both so flexible that there are probably other ways in which they'd work well together that we haven't thought of yet.
Also https://github.com/russellballestrini/botoform looks to be a newer solution in this space
Terraform is another option and then there's the model we're actively moving towards at work: using Ansible to abstract and completely replace calls to CloudFormation with a combination of existing and bespoke modules to dynamically spin up the infrastructure we need.
We are using very advanced CloudFormation in the open source Convox platform.
I have touched every corner of CF including lots of Custom Resources.
Right now we are using the golang template tools and tests to generate our templates.
But I have lots of needs and ideas for improving this. A CF template compiler and simulator should be possible, giving us all tons of confidence in making template changes and therefore any infrastructure update.
I have some sketches that I haven't published yet.
And I strongly believe CF is the best tool in this space if you're all in on AWS. Let Amazon be responsible for operating a transactional infrastructure mutation service. It's ridiculously hard to do this right.
If you want to brainstorm some ideas send me a message :)
Also, you might want to update your HN profile with contact info.
Terraform is awesome if you want an OSS project to manage multiple cloud vendors.
But I think that infrastructure change management is a really hard problem and the state of the art solution is how AWS runs CloudFormation as a managed service.
Once it's set up properly, it's amazing watching what CloudFormation can do. It can update 20 instances to roll out a new AMI, and then roll the whole operation back on demand or if a failure happens. All with no application downtime in the cluster!
This obviously doesn't necessarily handle teardown very well, and it tends to be copying boilerplate and modifying it, but I find it the most straightforward thing, and simple, if a little verbose.