Hacker News new | past | comments | ask | show | jobs | submit login
How we reduced deployment times by 95% (plaid.com)
84 points by bjacokes on Aug 28, 2019 | hide | past | favorite | 57 comments

If I'm reading this right, then this approach takes away any real safety in terms of deployment. There would be no easy rollback mechanism, and no real assurances that the new code version will actually run.

I understand that the main goal here seemed to be avoiding time spent in ECS rollouts, but this solution seems to be sacrificing many of the guarantees that the rollout process is designed to provide.

The root problem is explicitly called out (slow ECS deployments), and is tied to rate limiting of the ECS `start-task` API call. The post mentions the hard cap on the number of tasks per call, but I'm curious if the actual _rate limit_ could have been increased on the AWS side. Ie, 400 calls would still be needed, but they could be pushed through much faster.

Hey, great questions, some of the same questions we (the SRE/Platform Team at Plaid) had.

Plaid's rollback job still works the same way for the service using this new deployment, so thankfully, nothing new for engineers at Plaid to learn there.

We also have metrics in Prometheus to indicate which versions of code are running so we can easily verify what is deployed.

WRT the rate limit, we have a great relationship with AWS and Plaid pushes hard on limits which most often AWS is happy to increase for us, but this was a hard limit that could not be raised at the time, but I'm sure they are working on it.


Version 1.0.0 in prod, serving requests

You want to deploy 1.0.1

You spin up 1.0.1, leaving traffic pointed at 1.0.0

What mechanism actually shifts the traffic from the 1.0.0 instances to the 1.0.1 instances, waiting for all traffic to stop on 1.0.0 before bringing the instances down without causing abrupt connection hangups?

Whenever I see these posts I feel like Heroku narrowly missed out on shaping the rest of the cloud just by staying proprietary and expensive.

Compare to Deis, who built a Heroku analogue that you can run on your own compute resources. It was wildly popular for a little while, and then when it turned out there wasn't really a market for it, they simply had to leave it behind. (Today, that's known as Hephy Workflow, full disclosure I'm one of the maintainers of this FOSS project.)

Workflow and previous versions of Deis, as well as Dokku and friends also depend heavily on Open Source contributions of Heroku, so it's somewhat misguided to say that Heroku stays proprietary. They have been "leading the pack" when it comes to Open Source implementations of buildpacks, also Cloud Native form which you can find at buildpacks.io—and they set the original bar for a developer-friendly PaaS experience, which is still basically unrivaled even in the current era of Kubernetes product explosion.

We often ask about Workflow: "why isn't this particular technology several orders of magnitude more popular" and struggle to explain it, since it makes perfect sense to us and of course whenever people try it, they usually love it.

But Deis saw the writing on the wall, shifted gears to be some of the early Kubernetes experts when it started to become clear who was going to win; invented and open sourced "Helm" the K8s package manager utility, which is still very popular by comparison, one of the leading ways to package stuff for K8s. And then Microsoft scooped them up.


I think Heroku did a number of things almost better in terms of a deployment / dev story? But getting past the pricing was super hard.

The pricing is way worth it if you do the math on the number of engineers you won't need anymore

I don't disagree - but when you start your budget is so small that you worry overly much about price "if you get big".

Then when you have a budget you are used to other systems -> so it's a learning curve to get back and switching costs are higher.

or even better if you consider the Product value of what those engineers _could_ be producing instead.

My read seems to be: don't use ECS at large scale or you'll need some really convoluted hacks.

(Disclaimer: I work for AWS, but any opinions expressed here are solely my own and not necessarily the company's.)

I don't know about that. ECS works fine at large scale, but it's not going to replace 4000 tasks immediately. And doing so could potentially be a huge shock to your customers if you tried. (You have to take the time to gracefully drain existing connections, etc.) Nor did the author desire to implement a blue-green type deployment which would use AWS's elasticity to its fullest potential out of unspecified cost concerns.

It's not clear to me that Kubernetes would fare significantly better here. It's not without its own particular performance bottlenecks; and the issues about safely draining connections and blue/green deployments would remain the same there. The issues aren't really related to the particular deployment orchestrator, as much as the fact that 4000 containers is pretty large number by any measure for a single unit of deployment.

Actually OP gave the exact reason ECS does not work for them:

> The rate at which we can start tasks restricts the parallelism of our deploy. Despite us setting the MaximumPercent parameter to 200%, the ECS start-task API call has a hard limit of 10 tasks per call, and it is rate-limited. We need to call it 400 times to place all our containers in production.

That call limit needs to scale with the cluster size

That sounds different from the usual way customers perform an ECS deployment, which is to replace the task definition of the ECS Service with a new one and let the control plane manage it. Only a single API call is needed, and the control plane can launch replacement tasks pretty quickly itself.

We don't actually call the startTask API ourselves, but when we tell an ECS service (using cloudformation) to use a new task definition (which is basically just a new ECR image tag), ECS calls startTask and various other APIs on our behalf.

That is true. FWIW, as an experiment I built a 50-node ECS cluster and launched 4000 tasks on it. I couldn't build a single service with 4000 tasks due to soft limits, so I built 4 services with 1000 tasks each (nginx, 256MB, 128 CPU units). No load balancer.

Overall task replacement time was about 15 minutes -- which is pretty respectable, in my view. Obviously not as fast as doing a hot code swap, though. Tradeoffs abound.

Can I have a question about the bottlenecks you have mentioned in case of Kubernetes ?

Im really curious about this one. What were the bottlenecks you have faced till now.

Why exactly is it an expensive task? Where's the bottleneck? I worked on this problem before, so curious how it's handled (or not) at AWS.

You could also read this as "don't run 4,000 of anything".

4,000 node processes? Wow. Just wow. Surely there is a better solution for this. Maybe this is one of those weird edge cases where you would craft a bespoke infrastructure tool, rather than trying to hack a bunch of off-the-shelf products.

Glad you pointed this out! We could have gone into a lot more detail on the reasons that we've arrived at our current system architecture, but it would've distracted from the root problem we were solving in the post. Happy to go into it a bit here.

Node event loop blockages are the primary reason we have so many processes running. We have enough integrations and iterate on them quickly enough that our infrastructure essentially treats them as untrusted/breakable. We want to avoid ReDoS-style bugs from affecting more than the current request, so we handle one request per process. A little inelegant, but we've still been able to horizontally scale the system, and frankly the extra infrastructure cost hasn't been enough to be worth the effort to change it.

To get around the start-task rate limit, we've tried running multiple identical containers per ECS task. However, they need to be marked as "essential" in CloudFormation to make sure our capacity doesn't degrade on container exits, and this means that one container exiting will also exit other containers in the same task.

Multiple processes per container is another interesting approach. We've used Node subprocesses in the past, but we found them tricky for reasons that are unrelated to deploy speed.

One thing we've really liked about rolling our own approach is that we decide when to declare a deploy complete. ECS is pretty conservative about not declaring a deploy complete until the final container has finished draining, which can take minutes for some of our requests. With our fast deploys, we declare a deploy complete when the final container running old code stops accepting new requests, which is significantly sooner. This makes follow-on deploys and rollbacks much smoother.

Specifically, the problem is that your integrations may block, right? If they crashed or never returned, Node would deal with that okay, wouldn't it?

What are these "integrations"? Could you farm them out to a flock of subprocesses, so you needed fewer top-level Node processes? What are they written in? If Node, how do you even get them to block?

And, of course, have you considered rewriting it in ... carrier lost

If we handled N concurrent requests per process, a crash would definitely have the bleed-over effect we're trying to avoid, where a problem in a single request caused the other N-1 to fail. "Crash" meaning primarily OOM or unhandled rejections (which we exit on). And yes, the other big problem is blocking the event loop with an inefficient regexp, massive sort operation, clumsy use of Ramda, etc. This sort of blockage causes our ECS health checks to fail, so the container eventually gets killed.

"Never returning" is an interesting problem in Node. We have gRPC request timeouts, so the request fails on the client side after a set amount of time. Our Node server gets a cancellation event when this happens, but there's no guarantee that the in-flight request ever stops processing or gives up its resources, even if it's not blocking the event loop. E.g. our request handler could have a while(true) loop that continually does an async network request, and even though each gRPC request eventually times out, we would eventually have a swarm of zombie while(true) loops that are operating. To address this problem, we thread a lightweight context object into our framework which can essentially call an isCancelled() method. Before doing most low-level async operations (e.g. network requests), we check isCancelled() and throw if the gRPC request has timed out.

The integrations are all written in Typescript. As mentioned in another comment, we've attempted multiple processes per container and multiple containers per task. Right now, we see the bigger win as being able to fix "all of the above" and run multiple requests per Node process, but it'll take some legwork to get there :)

Great insights, thanks for replying!

You don’t even need a bespoke tool. Just make each container run 4–8 processes by means on something like PM2. That cuts your CI container count down to 500. No reason a container should mean 1 process.

Good topic for the next hackathon.

I'd expect any decent infrastructure tool to be able to handle 4000 processes/containers. Kubernetes, for example, claims to support 300000 containers across 5000 nodes. Mesos supported 50k+ from what I understand. Hadoop scales to over 10000 nodes.

ECS can be good for really small things but, even when I worked in public radio, we discovered that ECS is pretty hot garbage without writing a bunch of stuff to deal with its quirks. Scaling worked kinda wonky, and ECS has its own jargon for terminology that already exists with Docker in general, making it unnecessarily confusing. Why they call a container a "task" when the term "container" already exists is beyond me. We ended up using Rancher to manage Kubernetes.

ECS tasks are not containers. Tasks are a collection of one or more containers. They're the equivalent of a Pod in Kubernetes.

In Docker terminology, that would be a service?

In Kubernetes and ECS, a Service is already used to name the abstraction that represents a collection of Pods/Tasks.

(Actually K8s collection of Pods is usually a ReplicaSet, fronted by Deployment, from the perspective of making sure they run. The Deployment is related to a Service by labels and selectors, K8s service only decides what pods it will route traffic to. An ECS service is more like a Kubernetes deployment, and I guess what Kubernetes calls a Service, basically a load balancer, is just called a load balancer, and I guess you manage the ALB's configuration together somehow for your service group, manually or somehow other, or you can just create a new ELB each time, and pay again ...)

I had never heard of a Docker Service, but from the looks of it, the Deployment (or Task), Service, ReplicaSet and Pods are all rolled into one single abstraction, the "Service" in Docker Compose.

Ah, gotcha.

We use ECS via Fargate and I'm mostly happy. Deployments are pretty slow, but that's our only qualm. We PoCed EKS, but it has no integration with CloudWatch or IAM for all intents and purposes. They had roadmap items to address those things, but you absolutely have to reinvent a lot of wheels with Kubernetes compared to the things you get for free with ECS/Fargate.

> Why they call a container a "task" when the term "container" already exists

Likely for the same reason they call it ECS and not EDS. Making everything docker-specific could be a bad idea is more container tech becomes popular and they want to allow you to start Frobnicator tasks as well as Docker tasks on the same platform.

In a way you're right, which is why we are exploring kubernetes. At the same time, ECS has worked really well for us as we've grown to this point. It's worth noting that we have some other motivating reasons to switch off ECS beyond deploy limitations, such as poor observability of ECS agent logs.

It's inevitable that early systems are going to bend and break at various points of the scaling process, and the interesting question is how to navigate that – fight fires for awhile or fix things immediately, hack together a stopgap solution or do the full fix, etc. We thought ours might be an interesting data point to share.

Slightly off topic:

Does Plaid still operate via screen scraping? I'm a little perplexed as to why banks don't have easy to use APIs, especially given recent regulation. It seems against their best interests to allow a third party to screen scrape and provide a service which the banks themselves could easily reproduce.

What am I missing? Is a bank with an easy to use API not a sound business decision from the bank's perspective?

I know Monzo (challenger bank in UK) has/had an API, though I haven't heard of anyone using it.

All a bank is, is a UX on top of low-level financial APIs like ACH. If the bank then exposed an API, you could just use it to build another bank (without all the effort the original bank went through) and so compete with them on lower margins (because you don't have nearly the capital costs to recover that they do.)

Basically, it's the same reason that phone companies would never have allowed MVNOs to exist without legal regulation forcing them. The MVNOs outcompete the infrastructure-building phone companies, because MVNOs don't have to build infrastructure!

Thanks for sharing these lessons!

I don't use ECS at the moment but this is a well laid out post on how to avoid some performance issues that could have a huge impact.

EDIT: Downvoted for expressing appreciation for someone taking the time to note lessons learned?.. OK.

- How did you guys scale that much w/o a bootloader before?

That's what I don't get. All the design patterns are those of Unix. You boot the kernel with a ... bootloader. Then you've the kernel with all the system's params (call it ECS). Then each process is a child of the root process. And when you get by whatever mean the news that your app's source code has changed, you pull that code and start running it, while still having the old one live. Once the fork of the new code returns a proper response code, you kill the old one and set the new app live, otherwise you stay live with the old version.

> Engineers would spend at least 30 minutes building, deploying, and monitoring their changes through multiple staging and production environments, which consumed a lot of valuable engineering time

Man, startups have no idea how good they have it. It took a solid week to deploy a change at AWS.

> The rate at which we can start tasks restricts the parallelism of our deploy. Despite us setting the MaximumPercent parameter to 200%, the ECS start-task API call has a hard limit of 10 tasks per call, and it is rate-limited. We need to call it 400 times to place all our containers in production.

From reading other comments it makes me wonder if you (Plaid) tried batching the tasks into N containers? Like if a task 50 containers, then you'd reduce the task call rate limiting by 50x...

Yeah, I mentioned this in a comment in a different thread. The duplicate containers in the task definition need to be marked as "essential" in CloudFormation to make sure our capacity doesn't degrade on container exits, and this means that one container exiting will also exit other containers in the same task. So we have a bleed-over effect where OOM in one request could cause N-1 other requests to fail.

The essential vs non-essential container designation is a little confusing. The standard use case for multi-container tasks seems to be that all containers are marked as essential, i.e. they essentially represent different services that are operating in concert on the same machine. This is definitely not the situation we're in, where each container is totally independent.

So it seems like we'd be a perfect use case for non-essential containers. However, (1) at least one container _must_ be marked as essential, and (2) non-essential containers which exit don't get restarted or replaced. This means we would still have a limited bleed-over effect (if the essential container exits, the other ones do too), and more importantly, we can't guarantee that our capacity will be robust to process exits.

> non-essential containers which exit don't get restarted or replaced

That's why you have one or two essential watchdog containers which relaunch the workers. You keep a large number of them in an "idle, but hot" status to allow for bursts?

I'm a little confused by this approach; are there any non-essential containers in your suggested architecture? This sounds like the watchdog container is just a parent process that launches a bunch of subprocesses, which is definitely a workable solution, although not the one we decided to use. If there are primitives for an essential container to inspect container state and relaunch other containers in the task, that'd be great to know about.

Google "checkpoint restart". HPC community has had these tools for years, many in userspace. Can't wait to see a Java or C# shop doing the same hot boots.

Java had targeted hot reloads (going further than full reboot) for quite a while. See Jrebel for example.

Side question : what’s the current best practice for ensuring that a server ( node or anything) isn’t currently processing some important information before you shut it down ?

Is it a mix of waiting for request handlers to terminate upon receiving a sigterm then end the current process (and timeouting after a while) ? Does kubernetes handles those kind of things (waiting for a given process to stop before trashing the vm) or is there another layer or tool to do so ?

While graceful shutdown is important, I think the higher priority should be ensuring that you can gracefully recover.

Because eventually, something is going to die while a request is in-flight. So your batch processing needs to be able to recover from the "somebody tripped on the power cord" scenario.

When we start processing a request, we take the JS Promise that represents the result of that operation and put it into an array in our internal "server state" object. When the server gets a shutdown request, it basically awaits Promise.all(this.inFlightRequests) before exiting. Depending on the LB/routing layer on top of your service, you might need to do this in a loop in case work comes in after you do your first await. In other languages you can join a thread pool, wait on a wait group, or use whatever other synchronization tools are provided.

Note that there's an ECS_CONTAINER_STOP_TIMEOUT parameter in ECS that sets a hard upper limit on how long containers have to exit before getting SIGKILLed, and it defaults to 30 seconds. If you want to allow requests to drain for longer than that, you'll need to update that parameter.

(For services with fast request processing time, the draining process is often a lot simpler. You can just route traffic away from the server, then wait a short period of time before telling the process to exit. It's not quite as precise, but works well for many use cases and requires no bookkeeping.)

Remove the instance from whatever routes requests to it, without killing it -- wait some predefined amount of time -- kill it.

I've built services that can gracefully shut down, and my typical design is as follows:

First, maintain an active request counter that you increment when you accept a request and decrement when the request has been fully delivered (including output buffer drained).

Second, implement a TERM signal handler. The handler shuts down the listening socket (so no new connections will be accepted), starts a countdown timer, and waits for the active-request counter to reach 0. When either the countdown timer expires or the active-request counter reaches 0, exit.

Going to EKS would take less time than exploring hacks.

EKS has had networking layer issues in the past from personal experience and I've heard from good engineers that their CNI layer code (which is open source) is not very good. That would make me concerned on how well EKS (which is different from Kubernetes itself) can scale and what edge cases you're liable to hit.

Yeah, I've had the same issues. Eventually a node would run out of some network resource which would make scheduling containers onto it impossible until it was rebooted.

There seem to be some other intermittent network issues as well. I just moved some stuff for a client off EKS and back onto ECS because it wasn't worth troubleshooting.

That said, I am not a fan of ECS at all. The scheduler is slow and it does really just take way too long for it to even start acting on new deployments.

EKS may be a different implementation in some components (CNI), and have some set defaults (EBS for PVs, etc), but at it's core it is still a relatively vanilla Kubernetes.

Making a switch to EKS doesn't mean that you are stuck with EKS forever, and can set you up nicely to make the transition to a more complex/customized cluster implementation/deployment (likely still on AWS, think kops) easier if you find it's necessary for your use case in the future.

Pretty cool! Actionhero uses the ‘require cache’ trick in development mode to hot-reload your changes as you go. It’s risky in that even though you’ve change the required file, you may not have recreated all you objects again. For that reason Actinhero doesn’t allow this is NodeEnv is anything besides development.

Cool! I’m curious if this is something that nodemon/pm2 could do as task runners. You could call “npm update” and then hup your process...

This is sort of how Capistrano handled deployments, changing a symlink to all project deps and then signaling the process to reload

After all these years, how is deploying solely on AWS still worse than Heroku & Render?

Just FYI, you're "We're Hiring!" link is broken.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact