
How we reduced deployment times by 95% - bjacokes
https://blog.plaid.com/how-we-reduced-deployment-times-by-95/
======
nrmitchi
If I'm reading this right, then this approach takes away any real safety in
terms of deployment. There would be no easy rollback mechanism, and no real
assurances that the new code version will actually run.

I understand that the main goal here seemed to be avoiding time spent in ECS
rollouts, but this solution seems to be sacrificing many of the guarantees
that the rollout process is designed to provide.

The root problem is explicitly called out (slow ECS deployments), and is tied
to rate limiting of the ECS `start-task` API call. The post mentions the hard
cap on the number of tasks per call, but I'm curious whether the actual _rate
limit_ could have been increased on the AWS side. I.e., 400 calls would still
be needed, but they could be pushed through much faster.

~~~
ahmcb
Hey, great questions; some of the same ones we (the SRE/Platform team at
Plaid) had.

Plaid's rollback job still works the same way for services using this new
deployment, so thankfully there's nothing new for engineers at Plaid to learn
there.

We also have metrics in Prometheus to indicate which versions of code are
running so we can easily verify what is deployed.

WRT the rate limit: we have a great relationship with AWS, and Plaid pushes
hard on limits, which AWS is most often happy to increase for us. This one,
however, was a hard limit that could not be raised at the time, though I'm
sure they're working on it.

~~~
MuffinFlavored
State:

Version 1.0.0 in prod, serving requests

You want to deploy 1.0.1

You spin up 1.0.1, leaving traffic pointed at 1.0.0

What mechanism actually shifts the traffic from the 1.0.0 instances to the
1.0.1 instances, waiting for all traffic to stop on 1.0.0 before bringing the
instances down without causing abrupt connection hangups?
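
With an ALB in front, I'd assume the usual answer is target group
registration plus connection draining, i.e. something like this boto3 sketch
(the target group ARN and instance IDs are made up):

    import boto3

    elbv2 = boto3.client("elbv2")
    TG_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/prod/abc123"  # made up
    new_targets = [{"Id": "i-0new1"}, {"Id": "i-0new2"}]  # 1.0.1 instances (made up)
    old_targets = [{"Id": "i-0old1"}, {"Id": "i-0old2"}]  # 1.0.0 instances (made up)

    # Start sending traffic to 1.0.1.
    elbv2.register_targets(TargetGroupArn=TG_ARN, Targets=new_targets)

    # Stop sending NEW connections to 1.0.0; existing connections get the
    # target group's deregistration delay (connection draining) to finish.
    elbv2.deregister_targets(TargetGroupArn=TG_ARN, Targets=old_targets)
    elbv2.get_waiter("target_deregistered").wait(TargetGroupArn=TG_ARN, Targets=old_targets)

    # Only now is it safe to terminate the 1.0.0 instances.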

------
benologist
Whenever I see these posts I feel like Heroku narrowly missed out on shaping
the rest of the cloud just by staying proprietary and expensive.

~~~
privateSFacct
Agreed.

I think Heroku did a number of things better in terms of the deployment / dev
story, but getting past the pricing was super hard.

~~~
shay_ker
The pricing is well worth it if you do the math on the number of engineers
you won't need anymore.

~~~
maerF0x0
or even better if you consider the Product value of what those engineers
_could_ be producing instead.

------
marcinzm
My read: don't use ECS at large scale, or you'll need some really convoluted
hacks.

~~~
otterley
(Disclaimer: I work for AWS, but any opinions expressed here are solely my own
and not necessarily the company's.)

I don't know about that. ECS works fine at large scale, but it's not going to
replace 4000 tasks immediately, and doing so could be a huge shock to your
customers if you tried. (You have to take the time to gracefully drain
existing connections, etc.) Nor did the author want, due to unspecified cost
concerns, to implement a blue/green-style deployment, which would have used
AWS's elasticity to its fullest potential.

It's not clear to me that Kubernetes would fare significantly better here.
It's not without its own particular performance bottlenecks, and the issues
around safely draining connections and blue/green deployments would remain
the same there. The issues aren't really related to the particular deployment
orchestrator so much as to the fact that 4000 containers is a pretty large
number, by any measure, for a single unit of deployment.

~~~
maerF0x0
Actually OP gave the exact reason ECS does not work for them:

> The rate at which we can start tasks restricts the parallelism of our
> deploy. Despite us setting the MaximumPercent parameter to 200%, the ECS
> start-task API call has a hard limit of 10 tasks per call, and it is rate-
> limited. We need to call it 400 times to place all our containers in
> production.

That call limit needs to scale with the cluster size.
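
For concreteness, a sketch of what that constraint looks like in boto3 (the
cluster name, task definition, and instance ARNs are placeholders):

    import time
    import boto3

    ecs = boto3.client("ecs")
    instance_arns = ["arn:aws:ecs:..."] * 4000  # placeholder container instance ARNs

    # StartTask accepts at most 10 container instances per call, so placing
    # 4000 tasks means ~400 rate-limited calls like this one.
    for i in range(0, len(instance_arns), 10):
        ecs.start_task(
            cluster="prod",            # placeholder cluster name
            taskDefinition="api:42",   # placeholder task definition revision
            containerInstances=instance_arns[i:i + 10],
        )
        time.sleep(0.5)  # crude client-side pacing against the rate limit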

~~~
otterley
That sounds different from the usual way customers perform an ECS deployment,
which is to replace the task definition of the ECS Service with a new one and
let the control plane manage it. Only a single API call is needed, and the
control plane can launch replacement tasks pretty quickly itself.
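
Roughly, in boto3 (cluster, service, and revision names assumed):

    import boto3

    ecs = boto3.client("ecs")

    # One call: point the service at the new task definition revision and let
    # the ECS control plane roll the tasks over itself.
    ecs.update_service(
        cluster="prod",
        service="api",
        taskDefinition="api:43",
    )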

~~~
ahmcb
We don't actually call the startTask API ourselves, but when we tell an ECS
service (via CloudFormation) to use a new task definition (which is basically
just a new ECR image tag), ECS calls startTask and various other APIs on our
behalf.

~~~
otterley
That is true. FWIW, as an experiment I built a 50-node ECS cluster and
launched 4000 tasks on it. I couldn't build a single service with 4000 tasks
due to soft limits, so I built 4 services with 1000 tasks each (nginx, 256MB,
128 CPU units). No load balancer.

Overall task replacement time was about 15 minutes -- which is pretty
respectable, in my view. Obviously not as fast as doing a hot code swap,
though. Tradeoffs abound.

------
testuser5559191
Slightly off topic:

Does Plaid still operate via screen scraping? I'm a little perplexed as to why
banks don't have easy to use APIs, especially given recent regulation. It
seems against their best interests to allow a third party to screen scrape and
provide a service which the banks themselves could easily reproduce.

What am I missing? Is a bank with an easy to use API not a sound business
decision from the bank's perspective?

I know Monzo (challenger bank in UK) has/had an API, though I haven't heard of
anyone using it.

~~~
derefr
All a bank is, is a UX on top of low-level financial APIs like ACH. If the
bank then exposed an API, you could just use it to build another bank (without
all the effort the original bank went through) and so compete with them on
lower margins (because you don't have nearly the capital costs to recover that
they do.)

Basically, it's the same reason that phone companies would never have allowed
MVNOs to exist without legal regulation forcing them. The MVNOs outcompete the
infrastructure-building phone companies, because MVNOs don't have to build
infrastructure!

------
sailfast
Thanks for sharing these lessons!

I don't use ECS at the moment but this is a well laid out post on how to avoid
some performance issues that could have a huge impact.

EDIT: Downvoted for expressing appreciation for someone taking the time to
note lessons learned? OK.

------
fcolas
- How did you guys scale that much w/o a bootloader before?

That's what I don't get. All the design patterns are those of Unix. You boot
the kernel with a... bootloader. Then you have the kernel with all the
system's params (call it ECS). Then each process is a child of the root
process. And when you learn, by whatever means, that your app's source code
has changed, you pull that code and start running it while still keeping the
old one live. Once the fork running the new code returns a proper response
code, you kill the old one and set the new app live; otherwise you stay live
on the old version.
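
Something like this sketch, in Python (the ports, paths, and the promote step
are made up):

    import subprocess
    import time
    import urllib.request

    # Run the new code next to the old, health-check it, then retire the old.
    old = subprocess.Popen(["./app", "--port", "8080"])  # stands in for the live version
    new = subprocess.Popen(["./app", "--port", "8081"])  # the freshly pulled code
    time.sleep(2)  # give the new process a moment to boot

    try:
        with urllib.request.urlopen("http://localhost:8081/health", timeout=5) as resp:
            healthy = resp.status == 200
    except OSError:
        healthy = False

    if healthy:
        promote_to_live(8081)  # hypothetical: repoint the router at the new port
        old.terminate()
    else:
        new.terminate()  # stay live on the old version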

------
swiftcoder
> Engineers would spend at least 30 minutes building, deploying, and
> monitoring their changes through multiple staging and production
> environments, which consumed a lot of valuable engineering time

Man, startups have no idea how good they have it. It took a solid week to
deploy a change at AWS.

------
maerF0x0
> The rate at which we can start tasks restricts the parallelism of our
> deploy. Despite us setting the MaximumPercent parameter to 200%, the ECS
> start-task API call has a hard limit of 10 tasks per call, and it is rate-
> limited. We need to call it 400 times to place all our containers in
> production.

From reading other comments, I wonder whether you (Plaid) tried batching N
containers into each task? Like, if a task had 50 containers, you'd reduce
the start-task rate limiting by 50x...

~~~
bjacokes
Yeah, I mentioned this in a comment in a different thread. The duplicate
containers in the task definition need to be marked as "essential" in
CloudFormation to make sure our capacity doesn't degrade on container exits,
and this means that one container exiting will also exit other containers in
the same task. So we have a bleed-over effect where OOM in one request could
cause N-1 other requests to fail.

The essential vs non-essential container designation is a little confusing.
The standard use case for multi-container tasks seems to be that all
containers are marked as essential, i.e. they essentially represent different
services that are operating in concert on the same machine. This is definitely
not the situation we're in, where each container is totally independent.

So it seems like we'd be a perfect use case for non-essential containers.
However, (1) at least one container _must_ be marked as essential, and (2)
non-essential containers which exit don't get restarted or replaced. This
means we would still have a limited bleed-over effect (if the essential
container exits, the other ones do too), and more importantly, we can't
guarantee that our capacity will be robust to process exits.
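
For concreteness, the batched task definition being discussed would look
something like this boto3 sketch (the image and sizes are placeholders), with
the bleed-over baked in:

    import boto3

    ecs = boto3.client("ecs")

    # 50 copies of the same container in one task definition. essential=True
    # on all of them keeps capacity from silently degrading on exits, but it
    # also means one container's OOM takes down its 49 siblings in the task.
    worker = {
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/api:1.0.1",  # made up
        "memory": 256,
        "cpu": 128,
        "essential": True,
    }
    ecs.register_task_definition(
        family="api-batched",
        containerDefinitions=[dict(worker, name=f"worker-{i}") for i in range(50)],
    )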

~~~
ahazred8ta
> non-essential containers which exit don't get restarted or replaced

That's why you have one or two essential watchdog containers which relaunch
the workers. You keep a large number of them in an "idle, but hot" status to
allow for bursts?

~~~
bjacokes
I'm a little confused by this approach; are there any non-essential containers
in your suggested architecture? This sounds like the watchdog container is
just a parent process that launches a bunch of subprocesses, which is
definitely a workable solution, although not the one we decided to use. If
there are primitives for an essential container to inspect container state and
relaunch other containers in the task, that'd be great to know about.

------
crb002
Google "checkpoint restart". The HPC community has had these tools for years,
many in userspace. Can't wait to see a Java or C# shop doing the same hot
boots.

~~~
viraptor
Java has had targeted hot reloads (going further than a full reboot) for
quite a while. See JRebel, for example.

------
bsaul
Side question: what's the current best practice for ensuring that a server
(node or anything) isn't in the middle of processing something important
before you shut it down?

Is it a mix of waiting for request handlers to terminate upon receiving a
SIGTERM, then ending the current process (and timing out after a while)? Does
Kubernetes handle this kind of thing (waiting for a given process to stop
before trashing the VM), or is there another layer or tool to do so?
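
In other words, is it something like this sketch (Python; the handler and the
timeout value are my own assumptions)? Kubernetes does help here: it sends
SIGTERM, waits terminationGracePeriodSeconds, then SIGKILLs.

    import signal
    import sys
    import threading
    import time

    DRAIN_TIMEOUT = 30  # seconds to wait for in-flight requests (assumed)
    shutting_down = threading.Event()
    in_flight = 0
    lock = threading.Condition()

    def handle_request(request):
        global in_flight
        with lock:
            if shutting_down.is_set():
                raise RuntimeError("draining")  # the LB should retry elsewhere
            in_flight += 1
        try:
            process(request)  # hypothetical application handler
        finally:
            with lock:
                in_flight -= 1
                lock.notify_all()

    def on_sigterm(signum, frame):
        shutting_down.set()  # stop accepting new work
        deadline = time.monotonic() + DRAIN_TIMEOUT
        with lock:
            while in_flight > 0 and time.monotonic() < deadline:
                lock.wait(timeout=1)
        sys.exit(0)  # in-flight work finished (or we gave up waiting)

    signal.signal(signal.SIGTERM, on_sigterm)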

~~~
rohansingh
While graceful shutdown is important, I think the higher priority should be
ensuring that you can gracefully recover.

Because eventually, something is going to die while a request is in-flight. So
your batch processing needs to be able to recover from the "somebody tripped
on the power cord" scenario.

------
cagataygurturk
Going to EKS would take less time than exploring hacks.

~~~
marcinzm
From personal experience, EKS has had networking-layer issues in the past,
and I've heard from good engineers that their CNI code (which is open source)
is not very good. That makes me concerned about how well EKS (which is
different from Kubernetes itself) can scale and what edge cases you're liable
to hit.

~~~
rohansingh
Yeah, I've had the same issues. Eventually a node would run out of some
network resource, which made scheduling containers onto it impossible until
it was rebooted.

There seem to be some other intermittent network issues as well. I just moved
some stuff for a client off EKS and back onto ECS because it wasn't worth
troubleshooting.

That said, I am not a fan of ECS at all. The scheduler is slow, and it really
does just take way too long to even start acting on new deployments.

------
evantahler
Pretty cool! Actionhero uses the ‘require cache’ trick in development mode to
hot-reload your changes as you go. It’s risky in that even though you’ve
changed the required file, you may not have recreated all your objects again.
For that reason, Actionhero doesn’t allow this if NODE_ENV is anything
besides development.
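
A rough Python analogue of the same idea, and the same risk, via
importlib.reload:

    import importlib

    import handlers  # hypothetical module being edited in development

    importlib.reload(handlers)  # re-executes the module in place
    # Same risk as the require-cache trick: objects built from the OLD
    # module's classes are not recreated, so stale instances linger on.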

------
evantahler
Cool! I’m curious if this is something that nodemon/pm2 could do as task
runners. You could call “npm update” and then HUP your process...

This is sort of how Capistrano handled deployments: changing a symlink to the
project deps and then signaling the process to reload.
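
Roughly, as a Python sketch (the paths and PID are made up):

    import os
    import signal

    # Atomically repoint the "current" symlink at the new release, then tell
    # the running process to reload from it.
    os.symlink("/srv/app/releases/20200101", "/srv/app/current.tmp")
    os.replace("/srv/app/current.tmp", "/srv/app/current")  # rename(2) is atomic
    os.kill(4242, signal.SIGHUP)  # 4242: hypothetical PID of the app server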

------
shay_ker
After all these years, how is deploying solely on AWS still worse than Heroku
& Render?

------
mylampisawesome
Just FYI, your "We're Hiring!" link is broken.

