AWS CloudFormation now supports blue/green deployments for Amazon ECS (amazon.com)
174 points by unigiri 48 days ago | 79 comments

The CloudFormation team is small because the people who run it are not fun to work with and not willing to bring on talent with their own opinions. It's a weird irony: dug in and, overall, detrimental.

The only option for teams that need CI/CD for their cloud resources and want to use CloudFormation for most of it is to have a side-chain process for handling resources that can't be expressed with CloudFormation. (Not at all insurmountable, but it shouldn't be necessary.)

> The only option for teams that need CI/CD for their cloud resources and want to use CloudFormation for most of it is to have a side-chain process for handling resources that can't be expressed with CloudFormation.

CF is extensible through Lambda (which AWS uses itself for the Serverless Application Model), which, if you are going to use CF for this, is probably what you ought to do to let you use it for everything, rather than having a "side-chain process".
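For anyone unfamiliar, a Lambda-backed custom resource is a small template sketch like the following (resource type and function names here are hypothetical, not from the original comment):

```yaml
Resources:
  # A Lambda function (definition elided) receives Create/Update/Delete
  # events from CloudFormation and reports success or failure back to the
  # pre-signed ResponseURL included in each event.
  UnsupportedFeature:
    Type: Custom::UnsupportedFeature      # hypothetical custom type name
    Properties:
      ServiceToken: !GetAtt ProvisionerFunction.Arn
      DesiredSetting: some-value          # arbitrary properties passed through to the handler
```

Anything the AWS SDK can do, the handler can do, which is how teams paper over missing first-party CF support.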

roughly how many is "small"? 10? 20?

Very small. And to be clear, this is not supposed to happen at Amazon or AWS; it's "disagree and commit" vs. "two-pizza team".

It'll shake out eventually, but not soon.

Start telling your teams to do what I said above and if anyone tells you "but we want to use one tool" tell them they must grow out of that.

I've never understood CloudFormation's lag time in supporting new service functionality.

This launched nearly 3 years ago. https://aws.amazon.com/blogs/compute/bluegreen-deployments-w...

Having spoken with core teams like the IAM and CloudFormation teams at length, this appears to be an internal AWS organizational issue. Those teams are not responsible for other services' integrations with them, so they're at the mercy of those teams' priorities.

But honestly, I think the reason that CloudFormation support isn't as widespread or a top-level priority is that it simply exposes the poor architecture and behavior of many of AWS's second-tier services and teams. There are many services that simply do not behave well when managed by CloudFormation but are also completely janky on their own, and I'm betting it's far easier to cover up for poor architecture in the console than to expose all the services' dirty laundry with a CloudFormation integration.

Additionally, there are a lot of service teams that probably don't have a lot of customers using Cloudformation, so don't prioritize it or half-ass it completely. I'm looking at you DMS, and your terrible turd of a Cloudformation integration.

I'd say nearly the same thing about IAM and service teams' inability to implement it well. I still do not understand why AWS has not mandated that all services support both tag- and resource-based policies and predictable IAM semantics (looking at you, Glue, with your little FU of love called the write action "glue:GetMapping").

CloudFormation and IAM are, to me, two of the most killer services from AWS, neither of which I've seen replicated at other providers.

Ex-AWS here. I had the fun of digging into the rabbit hole of IAM and its convoluted logic. It's definitely possible to do what you said, but it's super easy to make mistakes, and the internal documentation is lacking. It took me multiple trips talking to people to deliver the integration we wanted.

It's also very old, with some odd decisions in there - I can't go into the specifics. And it's practically impossible for the IAM team to deprecate those impossible corners.

I am not surprised, it being one of the oldest AWS services. What I do love about IAM is that, with the work the Automated Reasoning Group is doing with Zelkova, it's really a dream to be able to test IAM policies before deploying them. I really hope their work trickles back to the service teams so that they too can leverage it to see their way out of those dark corners in IAM :)

It's one of the worst products on AWS. It's so bad, that companies would rather spend engineer's time to avoid it. That's why there are hundreds of products that replicate its functionality.

Has the GUI been fixed to be somewhat useful? Did they migrate from their god awful JSON crap? Can I embed simple infrastructure logic, like automatically adding a group of nodes to a Route53 zone?

Because while "ready" for new features includes API support, it does not include CloudFormation support. The problem is entirely AWS politics and culture.

> The problem is entirely AWS politics and culture.

Someone higher up should enforce it. I've been on teams where, if something isn't in CloudFormation, it doesn't exist, and that attitude is totally understandable; having to do some operations by hand seriously hurts IaC efforts.

Who builds the CloudFormation support? I could understand not baking it in on an initial feature release, but it should be there 'soon'.

Same with Config.

I've been told that each product team is supposed to add CloudFormation support themselves. For some reason, it's not treated as a high priority and often lands on the laps of the core CloudFormation team.

It may turn out that the tools the CloudFormation team provides don't make integration easy, especially if the operation takes longer than 15 minutes (meaning they can't implement support via a single Lambda invocation as an under-the-hood custom resource).

You can also invoke a custom resource via SNS and then the SNS topic is subscribed to an API endpoint.
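To sketch what that looks like (type and topic names are illustrative): instead of a Lambda ARN, ServiceToken can point at an SNS topic, and whatever is subscribed to the topic does the work and posts the result back, which sidesteps the Lambda timeout for long-running operations.

```yaml
Resources:
  LongRunningResource:
    Type: Custom::LongRunningTask        # hypothetical custom type name
    Properties:
      # SNS topic ARN instead of a Lambda ARN; the subscriber (e.g. an
      # HTTPS endpoint) performs the operation and PUTs the result to the
      # pre-signed ResponseURL in the notification payload.
      ServiceToken: !Ref ProvisionerTopic
```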

But it's hard to believe that if the team responsible for the blue/green deployment functionality developed an API endpoint to do it, they couldn't just hand it to the CF team to call. At the end of the day, that's all CF does: call APIs based on the different lifestyle events, as far as how it actually creates resources.

>lifestyle events

Intentional or not, I'm going to start using this. :P

Ughhh. And it’s outside of the edit window...

You would have thought that product teams would love using CloudFormation to create and tear down test resources as it already has code to handle all related resources that they want to interact with (networking objects, DNS, etc)

I understand the logic of not waiting for the CloudFormation team to do their work because it could introduce delays but it makes it a considerably less useful tool.

When considering Terraform vs CloudFormation I certainly took that into account and that's why I don't use CloudFormation for anything.

I wonder if there are any killer features of CloudFormation I missed when I looked into this (years ago now).

You can now write your own resource providers [0], which seems similar to TF. It's pretty decent. The only issue is you can't easily migrate from a custom provider to an official one once they release it (without deleting and recreating the resource), which can be a dealbreaker.

[0] https://aws.amazon.com/about-aws/whats-new/2019/11/now-exten...

Not so much a killer feature, but I find that some actions in newer services are only possible to automate through Cloudformation (looking at you Workspaces), because of the time they take to open up the API.

For Workspaces it took them something like 2-3 years to release to the public API what you could do with Cloudformation

- AWS Business Support Plan

- The ability to configure everything about a lambda and export the template.

- Serverless Application Model tooling

- Quick Create Links - you can give out a link to developers that let them create infrastructure by using CF and just enter parameters. You can restrict them to only being able to create a stack that you specify.

Somehow Terraform manages to roll out new features before they are even public while Amazon's own framework lags by years. I guess the CloudFormation team must be starved for resources.

I believe AWS updates their SDK first, and as Terraform roughly wraps that SDK, as soon as it is available they can update the provider

I wonder why none of the big cloud providers has bought Hashicorp.

It seems like a really obvious move to me?

I don’t have revenue breakdown on AWS orgs to back this up but I suspect CloudFormation isn’t one of the money printing products so it’s hard to justify investments as large as acquiring Hashicorp.

This is despite the obvious role of CloudFormation as a productivity multiplier inside the AWS world

That’s an interesting point, but Hashicorp does a fair bit more than just Terraform.

Why? Hashicorp provides them with tooling at no extra cost. Majority of that is free to the end user. Having Hashicorp separate is great for the cloud providers, since they don't have to manage or spend money on Terraform... but still can contribute the code.

Because they could focus extra resources on making the Terraform providers for their own products really good. They could have really tight, quick communication channels between the teams building their features and the Terraform team. Much harder to do when you're in separate companies.

Of course, AWS could be doing the same for CloudFormation, but apparently they're not bothered.

I think it's a difference in prioritization and how CloudFormation is treated as a separate product as opposed to a fundamental part of releasing features. The other two major cloud providers treat templates as a core part of the release process (now).

As I understand it, the three cloud providers differ in this way:

1. Microsoft Azure releases template/API first - no features can be released unless an Azure Resource Manager template can access it. I think they use ARM templates for their integration testing, as well, but I'm not sure. (If you're an Azure 'softie, please chime in!).

2. Google Cloud Platform releases APIs first - no features are released without being extensively tested via API integrations. They do have a CloudFormation/Azure-RM-style template offering now called Deployment Manager, so that may change. (This means that Terraform-like tools can target features quickly; I have no experience with Deployment Manager, but I suspect it does as well.)

3. Amazon AWS releases to the web console first. You might say, "Whoa, what about Bezos' service oriented architecture email?"[1] As far as I know, that's true for internal features. But all user features are exposed through the web console, then wrappers around those APIs are exposed, then, sometimes years later, CloudFormation gains the ability to target those APIs.

For this reason, I trust Microsoft and Google Cloud a lot more to maintain parity between what I can do in Terraform/Pulumi/native template tools and in their web consoles.

[1] - https://gist.github.com/chitchcock/1281611, per Bezos: "All teams will henceforth expose their data and functionality through service interfaces."

It looks like this uses a new, not yet documented, top level attribute in a template: Hook.

At least, I haven't found any documentation for it other than the template they say to copy in the linked user guide for this feature. Definitely not supported in CDK.
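For reference, the sample template in the linked user guide uses a top-level Hooks section along these lines (abridged from the guide's sample; this is a sketch, not the full set of required properties):

```yaml
Hooks:
  CodeDeployBlueGreenHook:
    Type: AWS::CodeDeploy::BlueGreen
    Properties:
      TrafficRoutingConfig:
        Type: AllAtOnce                 # or time-based canary/linear shifting
      Applications:
        - Target:
            Type: AWS::ECS::Service
            LogicalID: ECSDemoService   # logical ID of the ECS service in the same template
```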

What’s CDK in this context?

AWS Cloud Development Kit

The ECS default deployment model is quite frankly a disaster. I've written many Python/Go scripts over the years to wrangle it into a sensible form for CI/CD.

Good to see that they're working on it, but I don't know why they don't fix the underlying paradigm instead of making it Cloudformation exclusive.

I would love to know what the problem is. We do dozens of deployments every week with an ALB + ECS + Fargate setup. We upload a new container image, create a new task definition, and launch as many tasks as desired (so if we want 2 containers running we launch 2, for a total of 4). The ALB calls the /health endpoints on the new containers, and if they pass the health checks it drains connections to the old containers and stops the tasks. This has worked seamlessly for a long time now, without any downtime during deployments.

EDIT: I should mention that we are using the AWS CDK for all of this. All it does is register a new task definition as the default for a service, and ECS/ALB does the rest.
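In raw CloudFormation terms, the rollover described above boils down to the service's deployment configuration plus the ALB wiring, roughly like this (names are illustrative, not from the commenter's stack):

```yaml
Resources:
  Service:
    Type: AWS::ECS::Service
    Properties:
      Cluster: !Ref Cluster
      TaskDefinition: !Ref TaskDefinition   # pointing at a new revision triggers the rollover
      DesiredCount: 2
      DeploymentConfiguration:
        MinimumHealthyPercent: 100          # keep the 2 old tasks serving...
        MaximumPercent: 200                 # ...while 2 new ones start (4 total)
      LoadBalancers:
        - ContainerName: app
          ContainerPort: 8080
          TargetGroupArn: !Ref TargetGroup  # ALB health-checks the new tasks, then drains the old
```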

I would like to know the same, we have moved almost everything to fargate and ecs and have had zero issues.

I find ECS, particularly ECS on EC2, to be really painful for small deployments.

I just want a cluster that scales in and out to the amount of memory my tasks need; I'm not very concerned about CPU. Until capacity providers, it was more or less impossible to do so without having a bunch of excess capacity provisioned. Even with capacity providers, I can't seem to get a cluster to scale in to 0 instances when no tasks are needed.

I have one ECS service that requires an EBS volume mount, which means that when the task definition is updated, I need the service to stop, so that the new task can mount the same volume. This deployment model is essentially impossible without implementing a custom deployment strategy.

Fargate makes things a bit easier, and now that you can use EFS volumes with it it might make more sense. But overall, everything in ECS just seems poorly designed, clunky and hacky, like many AWS products.

You can’t do EFS on Fargate in CloudFormation yet though.

True. But you can do it with a custom resource.


You also can't use capacity providers with CloudFormation yet. Likely due to the fact that you also can't delete capacity providers, nor change a cluster's default capacity provider.

The default model works for most use cases, but it's not really flexible for non-standard cases and it fails completely at larger scale. If you have a few dozen containers taking 100,000 requests per second and you want to slow warm them with a segment of traffic for example, it's much easier to do on Kube than on ECS. ECS is also possible but it's just not as transparent or easy to work with.

Same here -- I perform the exact same kind of deployment you mentioned using CloudFormation. My only grief is poor rollback detection / control.

That's not a blue/green deployment...

I don't think they are saying it was, unless I'm misreading. They're talking about standard ECS deploys. To add my anecdata, I do lots of ECS deploys via Terraform into production and it works pretty seamlessly.

I don't have a problem with deploys with CF. But it only lets you configure a minimum healthy percentage, which is good enough if you only need to validate that an instance is in an acceptable state via a health check.

Exactly this - I think internally there is a tussle on what is the strategic way forward.

There are so many - Elastic Beanstalk, ECS, EKS, ECS-on-Fargate, EKS-on-Fargate... and of course the huge marketing push for Serverless.

They could have taken the sensible way out and built EKS as the foundation of everything - it makes total sense given the massive ecosystem around Kubernetes. ECS and Fargate should be killed off.

https://cdk8s.io/ replaces Elastic Beanstalk... but still basically runs on top of EKS.

The pairing of CDK8S and EKS is fundamentally enough for all the use cases that AWS basically sells.

CDK8S doesn't replace Elastic Beanstalk, because the folks using EB are going to continue to use EB. If it ain't broke...

ECS predates EKS and is better integrated than EKS with their other services. If you want a painless container experience on AWS, ECS on Fargate is what you'll want to use today. So both existing folks who use ECS today and new customers are going to keep using ECS.

Folks who like K8, perhaps from prior experience, are going to use EKS despite it having some rough corners.

See, Amazon is perfectly happy to support everything under the sun as long as there are paying users. See their database offerings for very much the same strategy; it would be silly to say that they should drop DB X since they now offer DB Y.

The different proponents who use ECS, Beanstalk, etc. are really talking about the UX, because ultimately the orchestration stack is not visible to you.

And my whole point is that AWS is on the path to fix kubernetes UX and give you exactly the mental model you want...but run it on k8s.

Think of it as ECS, Beanstalk, Fargate on top of EKS.

You won't be asked to adopt the complexity of k8s

AWS doesn’t turn off barely used services, ECS has lots of happy customers that don’t want to deal with Kubernetes.

Same goes for Fargate, except that it's now offered by EKS too. What possible reason would they have to turn it off?

Happy ECS on EC2 and ECS on fargate customer. I'd be really surprised if they turn these off - they just work in a very simple way to get stuff running. I have some "scheduled tasks" in containers, they run on fargate - works great and so easy to update.

CDK8S will actually abstract away the complexity of kubernetes. I don't think you will find the complexity any worse than ECS

I think it is a good strategy for AWS to invest in all kind of different solutions for Container technology and see which one wins. Betting on just one horse might end up being too high of a risk. Google has GKE, Cloud Run, Anthos and directly run containers on Compute Engine

k8s is not for everyone. It has a high administration and complexity burden. It would be a mistake to make k8s a requirement for core AWS services. Fargate is a low-level compute platform - it is parallel to ec2 and lambda. It does not compete with kubernetes (as can be seen by the eks-on-fargate offering).

There is probably some tension between ECS and k8s - AWS built a container orchestration platform based on what they think the world (and Amazon) needs and then k8s became madly popular. And it's not clear that AWS was wrong because k8s is essentially too complex for many use cases. It makes total sense that they would support both fully.

Vanilla K8s is not meant to be directly used by everyone. Watch https://www.youtube.com/watch?v=ZqQTEdHVaCw for insight on how K8s team expect the project to move forward. Frameworks like OpenDeis and Knative are meant to be used by developers.

IMO K8s will become akin to Linux Kernel. Almost no one uses bare mainline Linux kernels, you choose a distro based on your needs. Companies like RedHat and Canonical will pop up and provide their own packaged "distros" of K8s and you will choose which philosophy best suits your needs.

We haven't had any problems, but our workload is mostly async background work. Our CI pushes master when it passes, running a simple script to push to ECR with a git SHA tag, update the task to use the latest image, and then update the service to the latest task definition. It takes 2 lines of bash to update the task definition, and 2 lines to run the AWS commands.

Yeah for the standard vanilla case it's ok. As soon as you hit a certain scale or need to do anything non-standard it becomes much harder to work with.

Really? At my last company we had something like a 200-line PowerShell script for all of that.

You can do all of it by pushing the container to ECR, tagging it with your build number, and running CloudFormation. You pass the build number in as a parameter and you specify your image as

  Image: !Sub "image:${Tag}"
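Fleshed out a little (the repository URI here is a placeholder, not from the original comment), the parameter wiring looks like:

```yaml
Parameters:
  Tag:
    Type: String          # the CI build number, passed in at deploy time

Resources:
  TaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: app
      ContainerDefinitions:
        - Name: app
          # !Sub interpolates the Tag parameter into the image reference
          Image: !Sub "123456789012.dkr.ecr.us-east-1.amazonaws.com/app:${Tag}"
          Memory: 512
```

Updating the stack with a new Tag value registers a new task definition revision, and ECS rolls the service over to it.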

Could you elaborate a bit on the problem of "ECS default deployment model"?

It's just very rigid and not easy to extend. As soon as you hit a certain scale or need to do something slightly non-standard it's a mess.

Really? We’ve been using it for 3 years and it works fine. Make sure you pay attention to min/max task settings.

There are basically three ways to do it:

Min less than max, max is 100.

Max greater than min, min is 100.

Some combination of the two.

For the first you take down running tasks first and then backfill with new (green) tasks.

For the second you add green tasks and once stable take down blue.

Make sure that if you only have two tasks, min/max move in 50% increments: i.e., you can't scale up/down to 125%/75%, but you can to 150%/50%.
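The two pure strategies above map onto the service's DeploymentConfiguration like this (a sketch for a 2-task service):

```yaml
# Strategy 1: stop old tasks first, then backfill (brief capacity dip)
DeploymentConfiguration:
  MinimumHealthyPercent: 50    # with 2 tasks, one may stop before its replacement is up
  MaximumPercent: 100

# Strategy 2: start new tasks first, then drain old (brief over-provisioning)
DeploymentConfiguration:
  MinimumHealthyPercent: 100   # never dip below the desired count
  MaximumPercent: 150          # with 2 tasks, one extra may start before an old one stops
```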

The list of restrictions in the user guide seems painful - cannot be used in the same template as nested stacks, cannot start a blue green deploy if other infrastructure would be modified at the same time, cannot import properties from other stacks, cannot export output properties. I was interested, but can't imagine using it with these restrictions.

Thinking about this further, the benefit of these restrictions is in avoiding conflicts where dependent infrastructure is updated in place before the blue/green has completed. The lack of inputs and outputs still seems killer, though.

I don't think I've ever met someone who actually uses CloudFormation unless a template was provided to them to set up something like a Lambda.

Why would you when terraform exists?

I use CloudFormation currently, and it gets the job done just fine.

It has some nice things that terraform just doesn't have out of the box:

- It's a managed service: I don't need to set up a state bucket (or similar); I upload my template and it handles it. CI itself is not running CloudFormation, it just tells CF to do its thing. I could achieve this with other services with TF, but it's not something I have seen out of the box.

- Drift Detection

- Stack Sets, the ability to easily deploy a CloudFormation stack to multiple AWS accounts

- How sharing outputs works in CF between deployed stacks. There are protections in place: if another stack relies on a value your stack creates, you cannot modify or delete that value without first changing the other stack. I wish AWS could instead trigger a downstream change in any of those stacks, but it does not. (This is not for nested stacks, but for two completely separate stacks that just happen to need to share something.)

However, even without these, I struggle to see the point of going with Terraform if we are all-in on AWS anyway. Especially when there are templates that we may want to use, so why not keep everything in the same place?
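For anyone who hasn't used the cross-stack sharing I mentioned, it's done with exports and imports, roughly like this (stack and export names are illustrative):

```yaml
# --- network stack ---
Outputs:
  VpcId:
    Value: !Ref Vpc
    Export:
      # While any stack imports this, the export cannot be changed or deleted
      Name: network-VpcId

# --- application stack ---
Resources:
  AppSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: App security group
      VpcId: !ImportValue network-VpcId
```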

Stack Sets are a killer feature, and so much easier than trying to do the same thing in Terraform.

CloudFormation Drift Detection is very limited currently: it only supports a small subset of resources that you can create with CloudFormation. If your needs are covered by it, great, but it doesn't take much to go beyond its bounds. Also, it detects drift, but won't correct it.

Terraform both detects and corrects drift on almost every resource, on every application. Sometimes there are limitations. This does result in extra work, as you can't just ignore drift without capturing that in code (via ignore_changes), but to me this is absolutely a desirable thing.

We use CloudFormation to actually create the state resources to store Terraform state.

Terraform can share outputs between different states using the terraform_remote_state data source, though there isn't any restriction on what can be done there (no requirement to update other stacks/states).

Despite being an AWS-only shop, we've found the multi-provider support in Terraform to be really useful. For example, as part of our environment configuration, we store resources in Consul, create Pingdom checks, etc all from the same set of Terraform code.

Oh yeah, drift detection is very limited. However being able to run it periodically and trigger an alarm from a central service (going back to my first point) without needing access to the repo is a major plus.

With Terraform (unless there is functionality I am unaware of) you need the repo where the Terraform is stored to be able to perform that check. Then you need to set up the system that does the periodic checks (yes, I know a cron in CI is barely any work if that's your desire).

Ultimately, like a lot of things, CloudFormation vs. Terraform is a long list of tradeoffs. I just would not personally write off CloudFormation, depending on what I actually need.

If you are on GCP or Azure you definitely want to use Terraform, because the alternative is either a POS or doesn't exist.

TF actually doesn't have much of an edge over CF. When TF was released it fixed many pain points of CF, but since then CF has fixed those problems.

The supposedly killer feature that you can use TF with multiple clouds is not that killer, since each cloud's architecture is different, so you can't just reuse your code.

With CF you have extra benefits:

- built in, no extra tool necessary

- integral part of AWS, so no need to worry about being discontinued

- available within AWS support

- no need to worry about synchronizing state between people

- import / export functionality, you can expose various variables, which following stacks can build on top of

- ability to implement custom resources with Lambda

- uses YAML or JSON, which means you can use existing tooling to generate config programmatically.

Haven't used it yet, but it appears that CDK is built on top of that, and that's much more powerful than Terraform.

Terraform doesn't support rollbacks (handy for application roll-outs), and if your shop is heavily invested in AWS, CF is a perfectly fine tool. I've maintained and stood up tens of thousands of lines of both TF and CF, and they both have their strengths and weaknesses.

Sure it does? Revert your code and apply. It’s not atomic but I’m guessing neither is CF.

I’m not sure I’d use either of these tools to roll out new code though.

That's more of a GitOps-style intervention requiring a source change. The workflow to revert a change could certainly be done, but it is by design not a first-class construct in Terraform providers (not every provider supports every feature, such as importing of resources). Responding to a CloudWatch (heck, Prometheus, Grafana, ELK, etc.) alarm saying your error rate went up because of a Route53 change is not out of the box with Terraform (it would probably take custom providers or null resources). And as a CLI application (granted, it's run more like an RPC-style architecture), there's no obvious way to signal different failure levels and to respond to different failure modes of different resources.

CF roll-backs are on by default and will revert changes. Sometimes it can fail and be a real pain, but it’s overall been more of a help for myself than harm.

Aren't rollbacks handled by the VCS that you check your Terraform files into?

Sorry I'm not familiar with Cloudformation but that's how I would approach it with Terraform.

CF can detect an issue while deploying and then automatically go back to the previous state (it is configurable, but that's the default behavior).

are you not mixing infrastructure deployment and application deployment?

Terraform is not the tool for the latter; an Ansible stack would be better suited for it.

Once upon a time, people would deliver applications baked into AMIs as a deployment approach. Using a combination of CloudFormation metadata, CloudWatch alarms, cfn-init, and other tools, applications could be deployed end to end, immutably, with a single tool. Many people are stuck with this rather coupled approach because trying to pull it apart would take more effort than it is worth to the business.

In a lot of situations, infrastructure deployment and configuration management are deeply coupled in the AWS ecosystem (DB changes; Serverless is technically changing your infrastructure definition as the application changes).


then terraform is probably not the right tool and cloudformation is a better suited solution

It's not an either/or situation for tools, thankfully. The author of Terragrunt has even advocated in a blog post for using Terraform to deploy CloudFormation stacks to make changes to Auto Scaling groups and perform blue/green deployment approaches.

I've been wondering: doesn't CDK depend on CloudFormation, as it basically converts code to CF files? (Similar to Troposphere, but with more languages supported.)

If that's the case, I don't see how CDK stands a chance against Terraform or Pulumi.

And I'm not talking about multi cloud support, but just about the fact that an AWS-managed product, CloudFormation, keeps lagging behind for YEARS for such a core feature of a core AWS service.

I’m almost surprised this welcome feature was implemented at all, as ECS development appears to have slowed in favor of EKS (Kubernetes).

Also too bad that it’s still CloudFormation.
