I like CloudFormation. Unfortunately it is very unwieldy to write CloudFormation templates directly, and we're not about to start using the AWS CFN GUI editor!

It seems like the assembly language of the AWS ecosystem.

Does anyone else have a favourite hammer for this particular nail? I'd love to have something better than our home-baked solution, but I'm yet to find anything which doesn't introduce other flaws, such as an incomplete implementation (missing parameters or resource types) or ultimately making a leaky abstraction on top of CloudFormation somehow.

I ended up brewing a reasonably straightforward solution using Python as a (minimal) DSL which emits JSON. Its primary purpose is to support the whole of the CFN ecosystem (not just implement some small part of EC2, for instance) while also not trying to be too clever.

It has about 50-100 lines of Python which implement helper functions such as ref(), join() and load_user_data(), and not much else. There is an almost 1-to-1 correspondence between the generated CFN configuration and the Python source. As a bonus, it checks for a few common mistakes like broken refs or parameters which aren't used.
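For anyone curious what that looks like, here is a minimal sketch of the idea, not the actual home-baked tool. The names ref() and join() come from the description above; find_refs(), check() and the sample template are made up for illustration, and the check only follows Ref (not Fn::GetAtt or other intrinsic functions).

```python
import json


def ref(name):
    """Emit a CloudFormation Ref to a parameter or resource."""
    return {"Ref": name}


def join(delimiter, parts):
    """Emit a CloudFormation Fn::Join."""
    return {"Fn::Join": [delimiter, list(parts)]}


def find_refs(node):
    """Yield every name referenced via {"Ref": ...} anywhere in the template."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "Ref" and isinstance(value, str):
                yield value
            else:
                yield from find_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_refs(item)


def check(template):
    """Return a list of problems: broken refs and unused parameters."""
    params = set(template.get("Parameters", {}))
    resources = set(template.get("Resources", {}))
    used = set(find_refs(template))
    problems = []
    for name in sorted(used - params - resources):
        if not name.startswith("AWS::"):  # pseudo parameters such as AWS::StackName
            problems.append("broken ref: " + name)
    for name in sorted(params - used):
        problems.append("unused parameter: " + name)
    return problems


# Hypothetical template, written as plain Python data.
template = {
    "Parameters": {
        "InstanceType": {"Type": "String"},
        "KeyName": {"Type": "String"},  # deliberately never referenced
    },
    "Resources": {
        "WebServer": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "InstanceType": ref("InstanceType"),
                "SecurityGroups": [ref("WebSecurityGroup")],  # deliberately undefined
                "Tags": [
                    {"Key": "Name", "Value": join("-", [ref("AWS::StackName"), "web"])}
                ],
            },
        }
    },
}

problems = check(template)  # run the sanity checks before emitting anything
rendered = json.dumps(template, indent=2, sort_keys=True)  # the CFN JSON
```

Because the template is ordinary Python data, the emitted JSON stays close to 1-to-1 with the source, and the checks are just a walk over that data.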

I have heard that similar solutions have been reinvented in a few places, including the BBC. But I'm yet to see a good public solution!

A big problem with many tools out there, and with CloudFormation in general, is that validation is a mess. And the bigger the template, the bigger the problem, even with existing tools.

There are validation problems even within specific tools. Just look at the RDS setup alone: a ton of options that are often mutually exclusive. It's brutal. And don't get me started on security groups.

At Monsanto, we built our own toolset, and open sourced it. It is all Scala, so it might be a bit of a learning curve for many folks, but there's an actual attempt in there at making sure that if you can write it, it's valid: the tool blows up before the template gets to AWS, instead of AWS realizing halfway through that something went wrong and having to roll everything back.
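To make the "blow up before it gets to AWS" idea concrete, here is a rough sketch in Python rather than Scala. It is not the toolset mentioned above. The EXCLUSIVE table and validate() are hypothetical, and the two RDS rules are illustrative assumptions, so check the CloudFormation docs for the real constraints.

```python
# Illustrative rules only: each set lists properties that should not appear
# together on one resource. These entries are assumptions for the example,
# not an authoritative list.
EXCLUSIVE = {
    "AWS::RDS::DBInstance": [
        {"DBSnapshotIdentifier", "SourceDBInstanceIdentifier"},
        {"DBSnapshotIdentifier", "MasterUsername"},
    ],
}


def validate(template):
    """Return one error per resource that sets mutually exclusive properties."""
    errors = []
    for name, resource in sorted(template.get("Resources", {}).items()):
        props = set(resource.get("Properties", {}))
        for group in EXCLUSIVE.get(resource.get("Type"), []):
            clash = sorted(group & props)
            if len(clash) > 1:
                errors.append(f"{name}: {' and '.join(clash)} are mutually exclusive")
    return errors


# Hypothetical template that trips the second rule.
template = {
    "Resources": {
        "Db": {
            "Type": "AWS::RDS::DBInstance",
            "Properties": {
                "DBInstanceClass": "db.t2.micro",
                "DBSnapshotIdentifier": "my-snapshot",
                "MasterUsername": "admin",
            },
        }
    }
}

errors = validate(template)
```

A typed DSL can reject this at compile time. A table of rules like this only catches it at generation time, but that is still long before a stack update starts and rolls back.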


If you'd consider something other than CloudFormation, there is also Hashicorp's Terraform. It has an AWS provider (https://terraform.io/docs/providers/aws/index.html) which creates resources and maintains the state in a file that you can store in version control (https://terraform.io/docs/state/index.html).

Terraform, as an idea, is brilliant. Mitchell and company isolated a hugely important need and tried to fill it, and I give them all the credit in the world for that. Cross-platform cloud provisioning? Gimme. But I cannot in good conscience not relate what a disastrous experience Terraform has been for me at both jobs and clients.

Writing reusable code in Terraform is an exercise in frustration due to the extreme clumsiness of HCL (which, I understand, was used because "YAML is complicated"--well, that's true, but YAML isn't a good solution either, you're HashiCorp, you wrote Vagrant, you already know how to do this!). The application architecture is reckless and full of race conditions; your state will be hosed if one resource errors out at the wrong time, while other resources are being successfully updated--the resources that return successfully after the failed resource will on many occasions fail to be persisted to state. What's more, application testing seems to be at best an afterthought: there have been regressions in the providers that will break your existing states.

I would under no circumstances use Terraform if I didn't have clients who had selected it before I was working with them. If in AWS, I would use CloudFormation, with a tool like Cfer[1] (which is excellent, reliable code) or SparkleFormation[2] (which is more full-featured but I hope you never need to debug it) to provision my stuff.

(Full disclosure: I'm building a much, much better provisioner for multi-provider cloud infrastructure. Neither of the projects I recommend are mine; mine's not done yet.)

[1] - https://github.com/seanedwards/cfer

[2] - http://www.sparkleformation.io/

If you're writing your own, you might also look to BOSH[1] for inspiration.

It's older than CloudFormation and Terraform (born 2010). It can manage anything that someone's written a driver for. So far that includes AWS, Azure, vSphere/vCloud, OpenStack, VirtualBox, Google Compute Engine, Apache CloudStack and there might be others I missed.

It stores state in a database. It is able to recover from mismatches between the state of the world and the desired state. Cloud Foundry users have been using it for years to deploy and update CF installations. Pivotal Web Services (I work for Pivotal, in a different division) has been upgrading to the most recent CF release every few weeks, live, without much fuss, for years.

For any kind of heavily stateful infrastructure, BOSH is a strong candidate.

[1] https://bosh.io/

Augh, how did BOSH slip my mind? I've never used it in production, but I've used it to roll out a CF environment for testing and was impressed enough to dig into it a little more (most of a year ago now, I think your mention of CF was actually what kicked that off). From (admittedly limited) experience I'm not crazy about its developer-facing feel, but I appreciate the significant and responsible effort in it.

> The application architecture is reckless and full of race conditions

Honestly curious, can you point to one or two?

I don't have a toy example offhand, but the resource failure case I mentioned is one. If resources A, B, and C are in flight at the same time and A fails, Terraform in some as-yet-undiagnosed circumstances will not record state changes caused by the still-in-progress work on B and C. This happens a lot with SNS queues, IIRC, because SNS queue operations on the AWS API take a relatively long time to resolve. So if, say, you mistyped an attribute for an EC2 instance, it can fail out and Terraform will happily forget that it created an SNS queue for you.
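For illustration only, here is a Python sketch of the general shape of that failure mode. It is not Terraform's code, and the real cause is undiagnosed as far as I know. create(), apply_lossy() and apply_safe() are hypothetical names, and create() is a stand-in for a cloud API call.

```python
from concurrent.futures import ThreadPoolExecutor


def create(name):
    """Stand-in for a cloud API call; resource "A" is the one that fails."""
    if name == "A":
        raise RuntimeError("A: invalid attribute")
    return name + "-id"


def apply_lossy(names):
    """Stops at the first error, so later successes never reach the state."""
    state = {}
    with ThreadPoolExecutor() as pool:
        futures = [(n, pool.submit(create, n)) for n in names]
        try:
            for n, f in futures:
                state[n] = f.result()  # raises on A before B and C are recorded
        except RuntimeError:
            pass  # B and C still finish when the pool shuts down, unrecorded
    return state


def apply_safe(names):
    """Records every success before reporting the failures."""
    state, errors = {}, {}
    with ThreadPoolExecutor() as pool:
        futures = [(n, pool.submit(create, n)) for n in names]
        for n, f in futures:
            try:
                state[n] = f.result()
            except RuntimeError as exc:
                errors[n] = str(exc)
    return state, errors


lossy_state = apply_lossy(["A", "B", "C"])
safe_state, safe_errors = apply_safe(["A", "B", "C"])
```

In the lossy version B and C exist in the cloud but not in the state, which is the symptom described above. The safe version persists whatever succeeded and reports A separately.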

I have a sneaking hunch that the continuing problems with template-file resources (complaining that the "rendered" attribute doesn't exist in dependent resources) are related to this, but can't prove that; my clients don't pay me to debug Terraform, but to get their stuff working, and that doesn't leave much time to get in-depth with it now that I've decided not to use it for my own purposes anymore.

> Full disclosure: I'm building a much, much better provisioner for multi-provider cloud infrastructure. Neither of the projects I recommend are mine; mine's not done yet.

One of the convenient things about software that doesn't exist is that it doesn't have any bugs.

Let your software speak for itself when it exists; until then, this seems an undeserved critique of software, and a team, that is solving problems every day.

This is a crazy sentiment. I was just considering checking out Terraform, and I'm really glad to have read the previous commenter's experience.

I've been using Terraform for over a year, maintaining a standard 3 AZ load balanced production cluster.

HCL has improved dramatically, and now that template strings are a thing, most of my variable interpolation issues are solved. However, you still can't specify lists as input variables, so you frequently have to resort to joining and splitting strings. It's hackish and, worse, changing one value in the list will invalidate all other resources that use the variable.

Race conditions and dependency cycles are still a problem, particularly with auto scaling groups and launch configurations -- I have to migrate them in two steps (create then destroy) to avoid a conflict. Same with EBS volumes: I ended up scripting my instance to attach the volume by itself, otherwise there are ordering issues when destroying and replacing.

There are also missing features, such as the ability to create ElastiCache Redis clusters and CloudFormation resources.

I'm still glad that I went with Terraform though. It takes a good amount of time to get around the limitations and bugs, which can be really frustrating, but when it works, it works beautifully.

Strongly agree. Support for new AWS features hits Terraform much faster than CloudFormation (still waiting for CloudFormation support for AWS's managed Elasticsearch service that was unveiled two months ago--Terraform got it right away). Some of the critiques below are true... HCL is fine, but Terraform's interpolation syntax has a long way to go. That said, CloudFormation's JSON is way more painful to deal with. As for the other problems, they go back mainly to someone using a tool they don't fully understand. Yes, you can get into some odd states in rare cases, but Terraform gives you the ability to rapidly build and tear down your infrastructure over and over if necessary to work out details, and you have fine-grained control over which pieces are built how. Not only that, but you can inspect the logs to see what's happening, and if there are bugs, you can fix them yourself because the tool is open source and free. CloudFormation gives zero visibility and no fine-grained control; it's completely opaque, and where it's broken, you can't fix it.

Terraform is relatively new and improving rapidly. It has its problems, but it's light-years beyond CloudFormation. It's clear that Amazon doesn't place a high priority on making CloudFormation easy to use, or to support new features. The right approach to any problems with Terraform is not to spread FUD about it like below, but to contribute code fixes.

> HCL is fine, but Terraform's interpolation syntax has a long way to go

Oh, HCL is fine, you say so authoritatively? Well then do me a solid and show me an if statement, show me a for loop. Because you're not building nontrivial, reusable infrastructural modules without logic. I know. I've tried. I've committed, between different projects and clients, somewhere around ten thousand lines of Terraform and probably half are copy-paste garbage because HCL is so crippled a tool.

It hurts me to say this at a deep and visceral level: Terraform's interpolation syntax makes freaking Ansible and its "no, really, it's totally cool, string templates for logic are awesome" look good.

> The right approach to any problems with Terraform is not to spread FUD about it like below, but to contribute code fixes.

Spread FUD? Oh, no no no, you can take your assertions of FUD and insert them somewhere uncomfortable, thank you very much. I wrote Terraframe[1] specifically to contribute back to the Terraform community, to make it better, and stopped (to create a different project) because I was stymied. By no documentation, by HCL <-> JSON not actually working, and by no interest from the developers in any sort of dialogue about actually fulfilling the promises they themselves assert for their software. Between this and bugs that a trivial testing framework should catch (Why are you validating AWS resource names differently between point releases? Why are you changing that validation to be wrong? Why are you breaking my existing states when you've done this? Why did your tests not catch this before you pushed this out to your entire userbase?) I cannot take the project seriously as a tool for being used in infrastructure I care about. Because I don't trust them to take Terraform seriously, either.

[1] - https://github.com/eropple/terraframe

I believe you can hack an if statement by doing a length, substring and equality comparison to make it equivalent.

I'm a big Terraform fan, but I really don't like HCL and its limitations. I ended up writing a PHP "SDK" of sorts that generates JSON that Terraform consumes [1]. It uses the AWS SDK for some things (like listing all available AZs in a VPC), and provides some macros. I made this for use at work, and it powers a few production sites for a large company.

There's still a lot to do to make it ideal for public consumption (like writing docs and freezing the API), but it'll get there sometime soon. PRs are most welcome.

[1] https://github.com/ameir/terraform-php

I second the Terraform suggestion...my team loves it. But we've found storing state in version control to be clunky. Storing state remotely in Consul has been less problematic for us, though S3 would also work for those that don't have a running Consul cluster.

What I love most about Terraform is that we can include the output of terraform plan in pull requests that make infrastructure changes. Then our continuous deployment process runs plan again and requires an identical output before running apply. This both makes it easier for team members to review changes and ensures that we don't accidentally destroy infrastructure, which is really easy to do with a lot of these infrastructure-as-code tools.
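The gate itself can be tiny. Here is a rough Python sketch of the "identical plan or no apply" check, not our actual pipeline. normalize() and gated_apply() are hypothetical names, and run_plan and run_apply are placeholder callables that in a real setup would shell out to terraform plan and terraform apply.

```python
def normalize(plan_text):
    """Drop blank lines and trailing whitespace so cosmetic noise is ignored."""
    return [line.rstrip() for line in plan_text.splitlines() if line.strip()]


def gated_apply(reviewed_plan, run_plan, run_apply):
    """Run apply only when a fresh plan matches the one reviewed in the PR."""
    fresh_plan = run_plan()
    if normalize(fresh_plan) != normalize(reviewed_plan):
        raise RuntimeError("plan drifted since review; refusing to apply")
    return run_apply()


# Hypothetical plan text as it was pasted into the pull request.
reviewed = '+ aws_instance.web\n    ami: "ami-123"\n'

# A fresh plan that differs only in whitespace still passes the gate.
result = gated_apply(reviewed, lambda: reviewed + "\n", lambda: "applied")
```

If anything substantive changed between review and deploy, the comparison fails and apply never runs.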

The other thing that Terraform has going for it over CloudFormation is for hybrid cloud deployments, since it can provision infrastructure in vSphere and OpenStack as well as AWS.

Can you go into how you're using consul with terraform?

We're using Consul to store the state remotely (see: https://terraform.io/docs/commands/remote-config.html). In a nutshell, it just stores the JSON it would have stored in the tfstate file in a key in Consul instead. In addition to being easily available in a shared location, this allows you to leverage Consul's features (ACLs, watches, etc) to improve the process of making infrastructure changes.

Stuff we've thought of but haven't gotten around to yet:

- Build relatively simple tooling around terraform and Consul to acquire a lock before running apply...we haven't gone to that length yet since only our continuous deployment environment has credentials to mutate production and it runs builds of the infrastructure project sequentially.

- Watching the Consul key where the tfstate is stored for changes to kick off sanity checks to ensure that everything is still healthy.
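The locking idea could look roughly like this. It is a sketch under assumptions, not something we run. InMemoryKV is a made-up stand-in so the example works without a Consul cluster, and a real version would use Consul's session-based acquire and release on a KV key. The key and owner names are hypothetical.

```python
from contextlib import contextmanager


class InMemoryKV:
    """Stand-in for a key-value store with an atomic acquire, e.g. Consul KV."""

    def __init__(self):
        self._holders = {}

    def acquire(self, key, owner):
        # setdefault stores owner only if nobody holds the key yet
        return self._holders.setdefault(key, owner) == owner

    def release(self, key, owner):
        if self._holders.get(key) == owner:
            del self._holders[key]


@contextmanager
def apply_lock(kv, key, owner):
    """Hold the lock for the duration of an apply and always release it."""
    if not kv.acquire(key, owner):
        raise RuntimeError("lock " + key + " is held by someone else")
    try:
        yield
    finally:
        kv.release(key, owner)


kv = InMemoryKV()
with apply_lock(kv, "terraform/prod/lock", "ci-build-1"):
    # A second build cannot take the lock while the first one is applying.
    blocked = not kv.acquire("terraform/prod/lock", "ci-build-2")
# Once the first build is done, the lock is free again.
free_again = kv.acquire("terraform/prod/lock", "ci-build-2")
```

The apply itself would go inside the with block, so a crash mid-apply still releases the lock.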

They're both so flexible that there's probably other ways in which they'd work well together that we haven't thought of yet.

https://github.com/cloudtools/troposphere is a good option here

Also https://github.com/russellballestrini/botoform looks to be a newer solution in this space

Terraform is another option and then there's the model we're actively moving towards at work: using Ansible to abstract and completely replace calls to CloudFormation with a combination of existing and bespoke modules to dynamically spin up the infrastructure we need.

Troposphere (https://github.com/cloudtools/troposphere) is a mature Python CloudFormation solution that sounds similar to your home brewed one.

Yes. It is mature and very active too. AWS keeps adding services and making them available in CloudFormation, and the Troposphere community is very quick to implement them.

CloudFormation is not very elegant. Things become complicated with the concepts of regional and zonal resources (the same AMI is represented by a different ID in each region, etc.). Try Google Cloud's Deployment Manager. It is functionally similar but far easier to use (everything is a global resource), and its templates are Jinja-based (you get to write fors and ifs and evaluate lists and dictionaries inside templates).

I am asking this question often.

We are using very advanced CloudFormation in the open source Convox platform.


I have touched every corner of CF including lots of Custom Resources.

Right now we are using the golang template tools and tests to generate our templates.

But I have lots of needs and ideas for improving this. A CF template compiler and simulator should be possible, giving us all tons of confidence in making template changes and therefore any infrastructure update.

I have some sketches that I haven't published yet.

And I strongly believe CF is the best tool in this space if you're all in on AWS. Let Amazon be responsible for operating a transactional infrastructure mutation service. It's ridiculously hard to do this right.

If you want to brainstorm some ideas send me a message :)

One of the things that I thought would be neat is an open source CloudFormation that could work for multiple cloud vendors, possibly using a driver pattern.

Also, you might want to update your HN profile with contact info.

Thanks, I updated my profile with contact info.

Terraform is awesome if you want an OSS project to manage multiple cloud vendors.

But I think that infrastructure change management is a really hard problem and the state of the art solution is how AWS runs CloudFormation as a managed service.

Once it's set up properly, it's amazing watching what CloudFormation can do. It can update 20 instances to roll out a new AMI, and then roll the whole operation back on demand or if a failure happens. All with no application downtime in the cluster!

I built a JSON templating system[0] (based on Handlebars) that is project based. Basically, you create new JSON files for each project and the system generates a CFN template that contains everything that project needs (EC2 instances, RDS instances, security groups, etc.) It's still a work in progress but I'm using it in production in my day job and I'm pretty happy with how it works. More help is always accepted. :)

[0] https://github.com/rnhurt/CFNBuilder

Create the infrastructure manually and then use the CloudFormer tool to generate a template based on the already created infrastructure. You can then edit the generated template to make it more maintainable, and you have a nice reusable template.

Some people like SparkleFormation (http://www.sparkleformation.io/). I'll warn you, though, that it's not a good example of how to program in Ruby. It abuses method_missing to the point that it makes your implementations difficult to debug.

I also don't like their made-up terms such as "dynamics", etc. The documentation is pretty confusing as well.

Using a DSL is tempting. I've found the AWS CLI best, and a lot of the time I think it's easier just to write a Ruby script using the SDK.

This obviously doesn't necessarily handle teardown very well, and it tends to be copying boilerplate and modifying it, but I find it the most straightforward thing, and simple, if a little verbose.

There's already a Python module called troposphere.
