I understand that the main goal here seemed to be avoiding time spent in ECS rollouts, but this solution seems to be sacrificing many of the guarantees that the rollout process is designed to provide.
The root problem is explicitly called out (slow ECS deployments), and is tied to rate limiting of the ECS `start-task` API call. The post mentions the hard cap on the number of tasks per call, but I'm curious whether the actual _rate limit_ could have been increased on the AWS side. I.e., 400 calls would still be needed, but they could be pushed through much faster.
Plaid's rollback job still works the same way for the service using this new deployment, so thankfully, nothing new for engineers at Plaid to learn there.
We also have metrics in Prometheus to indicate which versions of code are running so we can easily verify what is deployed.
WRT the rate limit: we have a great relationship with AWS, and Plaid pushes hard on limits, which AWS is usually happy to increase for us. This one, however, was a hard limit that couldn't be raised at the time, though I'm sure they're working on it.
Version 1.0.0 in prod, serving requests
You want to deploy 1.0.1
You spin up 1.0.1, leaving traffic pointed at 1.0.0
What mechanism actually shifts the traffic from the 1.0.0 instances to the 1.0.1 instances, waiting for all traffic to stop on 1.0.0 before bringing the instances down without causing abrupt connection hangups?
Workflow and previous versions of Deis, as well as Dokku and friends, also depend heavily on Heroku's open-source contributions, so it's somewhat misguided to say that Heroku stays proprietary. They have been leading the pack on open-source buildpack implementations (also in Cloud Native form, which you can find at buildpacks.io), and they set the original bar for a developer-friendly PaaS experience, which is still basically unrivaled even in the current era of Kubernetes product explosion.
We often ask about Workflow, "why isn't this particular technology several orders of magnitude more popular?" and struggle to explain it, since it makes perfect sense to us, and whenever people try it, they usually love it.
But Deis saw the writing on the wall and shifted gears to become some of the early Kubernetes experts when it started to become clear who was going to win. They invented and open-sourced Helm, the K8s package manager utility, which is still very popular and remains one of the leading ways to package things for K8s. And then Microsoft scooped them up.
I think Heroku did a number of things better in terms of a deployment/dev story, but getting past the pricing was super hard.
Then, by the time you have a budget, you're used to other systems, so there's a learning curve to come back, and the switching costs are higher.
I don't know about that. ECS works fine at large scale, but it's not going to replace 4,000 tasks immediately, and doing so could be a huge shock to your customers if you tried. (You have to take the time to gracefully drain existing connections, etc.) Nor did the author want to implement a blue-green style deployment, which would use AWS's elasticity to its fullest potential, citing unspecified cost concerns.
It's not clear to me that Kubernetes would fare significantly better here. It's not without its own performance bottlenecks, and the issues around safely draining connections and blue/green deployments would remain the same there. The issues aren't really tied to the particular deployment orchestrator so much as to the fact that 4,000 containers is a pretty large number by any measure for a single unit of deployment.
> The rate at which we can start tasks restricts the parallelism of our deploy. Despite us setting the MaximumPercent parameter to 200%, the ECS start-task API call has a hard limit of 10 tasks per call, and it is rate-limited. We need to call it 400 times to place all our containers in production.
That call limit needs to scale with the cluster size
Overall task replacement time was about 15 minutes -- which is pretty respectable, in my view. Obviously not as fast as doing a hot code swap, though. Tradeoffs abound.
I'm really curious about this one. What bottlenecks have you faced so far?
4,000 node processes? Wow. Just wow. Surely there is a better solution for this. Maybe this is one of those weird edge cases where you would craft a bespoke infrastructure tool, rather than trying to hack a bunch of off-the-shelf products.
Node event loop blockages are the primary reason we have so many processes running. We have enough integrations and iterate on them quickly enough that our infrastructure essentially treats them as untrusted/breakable. We want to avoid ReDoS-style bugs from affecting more than the current request, so we handle one request per process. A little inelegant, but we've still been able to horizontally scale the system, and frankly the extra infrastructure cost hasn't been enough to be worth the effort to change it.
To get around the start-task rate limit, we've tried running multiple identical containers per ECS task. However, they need to be marked as "essential" in CloudFormation to make sure our capacity doesn't degrade on container exits, and this means that one container exiting will also exit other containers in the same task.
Multiple processes per container is another interesting approach. We've used Node subprocesses in the past, but we found them tricky for reasons that are unrelated to deploy speed.
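For readers curious what one-request-per-process isolation might look like in practice, here's a minimal sketch. The `handleInSubprocess` helper and the inlined `node -e` handler are illustrative assumptions, not Plaid's actual setup:

```typescript
import { execFile } from "node:child_process";

// Hypothetical sketch: run each request's handler in its own short-lived
// Node process, so an event-loop blockage (e.g. a catastrophic regex) is
// confined to that single request. The handler is inlined via `node -e` to
// keep the sketch self-contained; a real system would fork a worker script.
function handleInSubprocess(input: string, timeoutMs: number): Promise<string> {
  return new Promise((resolve, reject) => {
    execFile(
      "node",
      // With `-e`, extra arguments appear in the child's process.argv[1...].
      ["-e", "process.stdout.write(process.argv[1].toUpperCase())", input],
      { timeout: timeoutMs }, // the child is killed if it blocks past the deadline
      (err, stdout) => (err ? reject(err) : resolve(stdout))
    );
  });
}
```

If the child wedges its event loop, only that one request fails when the timeout fires; the parent process keeps serving.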
One thing we've really liked about rolling our own approach is that we decide when to declare a deploy complete. ECS is pretty conservative about not declaring a deploy complete until the final container has finished draining, which can take minutes for some of our requests. With our fast deploys, we declare a deploy complete when the final container running old code stops accepting new requests, which is significantly sooner. This makes follow-on deploys and rollbacks much smoother.
What are these "integrations"? Could you farm them out to a flock of subprocesses, so you needed fewer top-level Node processes? What are they written in? If Node, how do you even get them to block?
And, of course, have you considered rewriting it in ... carrier lost
"Never returning" is an interesting problem in Node. We have gRPC request timeouts, so the request fails on the client side after a set amount of time. Our Node server gets a cancellation event when this happens, but there's no guarantee that the in-flight request ever stops processing or gives up its resources, even if it's not blocking the event loop. E.g. our request handler could have a while(true) loop that continually does an async network request, and even though each gRPC request eventually times out, we would eventually have a swarm of zombie while(true) loops that are operating. To address this problem, we thread a lightweight context object into our framework which can essentially call an isCancelled() method. Before doing most low-level async operations (e.g. network requests), we check isCancelled() and throw if the gRPC request has timed out.
The integrations are all written in TypeScript. As mentioned in another comment, we've attempted multiple processes per container and multiple containers per task. Right now, we see the bigger win as being able to fix "all of the above" and run multiple requests per Node process, but it'll take some legwork to get there :)
(Actually, a K8s collection of Pods is usually a ReplicaSet, fronted by a Deployment, from the perspective of making sure they run. The Deployment is related to a Service by labels and selectors; a K8s Service only decides which pods it will route traffic to. An ECS service is more like a Kubernetes Deployment, and what Kubernetes calls a Service (basically a load balancer) is just called a load balancer in ECS. I guess you manage the ALB's configuration for your service group somehow, manually or otherwise, or you can just create a new ELB each time and pay again ...)
I had never heard of a Docker Service, but from the looks of it, the Deployment (or Task), Service, ReplicaSet, and Pods are all rolled into a single abstraction, the "service" in Docker Compose.
Likely for the same reason they call it ECS and not EDS. Making everything Docker-specific could be a bad idea if more container tech becomes popular and they want to allow you to start Frobnicator tasks as well as Docker tasks on the same platform.
It's inevitable that early systems are going to bend and break at various points of the scaling process, and the interesting question is how to navigate that – fight fires for awhile or fix things immediately, hack together a stopgap solution or do the full fix, etc. We thought ours might be an interesting data point to share.
Does Plaid still operate via screen scraping? I'm a little perplexed as to why banks don't have easy to use APIs, especially given recent regulation. It seems against their best interests to allow a third party to screen scrape and provide a service which the banks themselves could easily reproduce.
What am I missing? Is a bank with an easy to use API not a sound business decision from the bank's perspective?
I know Monzo (challenger bank in UK) has/had an API, though I haven't heard of anyone using it.
Basically, it's the same reason that phone companies would never have allowed MVNOs to exist without legal regulation forcing them. The MVNOs outcompete the infrastructure-building phone companies, because MVNOs don't have to build infrastructure!
I don't use ECS at the moment but this is a well laid out post on how to avoid some performance issues that could have a huge impact.
EDIT: Downvoted for expressing appreciation for someone taking the time to note lessons learned?.. OK.
That's what I don't get. All the design patterns are those of Unix. You boot the kernel with a bootloader. Then you have the kernel with all the system's params (call it ECS). Each process is a child of the root process. And when you learn, by whatever means, that your app's source code has changed, you pull that code and start running it while the old version is still live. Once the fork running the new code returns a proper response code, you kill the old one and make the new app live; otherwise, you stay live with the old version.
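The fork-then-swap scheme above can be sketched in a few lines. This is a toy, assuming an inline `node -e` child as a stand-in for the real app and a stdout line as the readiness signal:

```typescript
import { ChildProcess, spawn } from "node:child_process";

// Hypothetical hot-swap sketch: start the new version alongside the old,
// wait for it to report readiness, then retire the old one.
function startVersion(tag: string): Promise<ChildProcess> {
  return new Promise((resolve, reject) => {
    const child = spawn("node", [
      "-e",
      // The stand-in app prints "ready" (its "proper response code")
      // and then idles, the way a server would.
      `console.log("ready ${tag}"); setInterval(() => {}, 1000);`,
    ]);
    child.stdout!.once("data", () => resolve(child)); // readiness signal
    child.once("error", reject);
  });
}

async function swap(oldProc: ChildProcess | null): Promise<ChildProcess> {
  const next = await startVersion("new");
  oldProc?.kill("SIGTERM"); // the old version goes down only after the new one is live
  return next;
}
```

A real implementation would health-check over HTTP and drain connections instead of killing immediately, but the ordering (new up, then old down) is the point.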
Man, startups have no idea how good they have it. It took a solid week to deploy a change at AWS.
From reading other comments, I wonder whether you (Plaid) tried batching the tasks into N containers. Like, if a task had 50 containers, you'd reduce the start-task call rate limiting by 50x...
The essential vs non-essential container designation is a little confusing. The standard use case for multi-container tasks seems to be that all containers are marked as essential, i.e. they essentially represent different services that are operating in concert on the same machine. This is definitely not the situation we're in, where each container is totally independent.
So it seems like we'd be a perfect use case for non-essential containers. However, (1) at least one container _must_ be marked as essential, and (2) non-essential containers which exit don't get restarted or replaced. This means we would still have a limited bleed-over effect (if the essential container exits, the other ones do too), and more importantly, we can't guarantee that our capacity will be robust to process exits.
That's why you have one or two essential watchdog containers which relaunch the workers. You could keep a large number of them in an "idle, but hot" state to allow for bursts.
Is it a mix of waiting for request handlers to terminate upon receiving a SIGTERM, then ending the current process (timing out after a while)? Does Kubernetes handle this kind of thing (waiting for a given process to stop before trashing the VM), or is there another layer or tool for that?
Because eventually, something is going to die while a request is in-flight. So your batch processing needs to be able to recover from the "somebody tripped on the power cord" scenario.
Note that there's an ECS_CONTAINER_STOP_TIMEOUT parameter in ECS that sets a hard upper limit on how long containers have to exit before getting SIGKILLed, and it defaults to 30 seconds. If you want to allow requests to drain for longer than that, you'll need to update that parameter.
(For services with fast request processing time, the draining process is often a lot simpler. You can just route traffic away from the server, then wait a short period of time before telling the process to exit. It's not quite as precise, but works well for many use cases and requires no bookkeeping.)
First, maintain an active request counter that you increment when you accept a request and decrement when the request has been fully delivered (including output buffer drained).
Second, implement a TERM signal handler. The handler shuts down the listening socket (so no new connections will be accepted), starts a countdown timer, and waits for the active-request counter to reach 0. When either the countdown timer expires or the active-request counter reaches 0, exit.
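In Node, that two-part scheme might look roughly like this (the port choice and 30-second drain timeout are placeholder values, not recommendations):

```typescript
import { createServer } from "node:http";

// First part: active-request counter, incremented on accept and decremented
// once the response has fully flushed.
let activeRequests = 0;

const server = createServer((_req, res) => {
  activeRequests++;
  // "finish" fires once the response has been fully handed off,
  // i.e. the output buffer has drained.
  res.on("finish", () => { activeRequests--; });
  res.end("ok\n"); // real request handling would go here
});

// Second part: TERM handler that stops accepting new connections, then
// exits when the counter hits 0 or the countdown expires.
const DRAIN_TIMEOUT_MS = 30_000;

process.on("SIGTERM", () => {
  server.close(); // shuts the listening socket; in-flight requests continue
  const deadline = Date.now() + DRAIN_TIMEOUT_MS;
  const poll = setInterval(() => {
    if (activeRequests === 0 || Date.now() >= deadline) {
      clearInterval(poll);
      process.exit(0);
    }
  }, 100);
});

server.listen(0); // ephemeral port for the sketch; a real server pins one
```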
There seem to be some other intermittent network issues as well. I just moved some stuff for a client off EKS and back onto ECS because it wasn't worth troubleshooting.
That said, I am not a fan of ECS at all. The scheduler is slow, and it really does take way too long to even start acting on new deployments.
Making a switch to EKS doesn't mean you're stuck with EKS forever, and it can set you up nicely to make the transition to a more complex/customized cluster implementation/deployment (likely still on AWS; think kops) easier if you find it's necessary for your use case in the future.
This is sort of how Capistrano handled deployments: change a symlink pointing at the release and its deps, then signal the process to reload.