
Rainbow Deployments with Kubernetes - bdimcheff
http://brandon.dimcheff.com/2018/02/rainbow-deploys-with-kubernetes/
======
andrewaylett
It feels like it should be possible to fix the reconnect experience,
especially in the planned termination of underlying container case: if you ask
the client to reconnect, rather than abruptly disconnecting them then they
could possibly even wait until their new session was fully established before
dropping the old one.

That doesn't take away from my appreciation of the pattern, though: I'm very
much in favour of rolling releases forwards, rather than being limited to two
colours.

~~~
bdimcheff
We actually have this functionality as well: we can send a signal to the
process that causes it to display a message asking the user to reload to
upgrade. I've seen a similar feature in Slack and Riot as well.

Making this handoff automatic is definitely possible as well, though we do
want people to reload occasionally to get new client-side code.

~~~
specialist
I worked on an Autobahn-like thingie. The server side could tell clients to
reconnect. Useful for draining servers, etc.

I haven't checked to see if autobahn has this functionality.

[https://crossbar.io/autobahn/](https://crossbar.io/autobahn/)

------
wjossey
Question for the OP-

I haven't ever worked on chat services, so this may not be reasonable. Would
it be possible to use some other termination endpoint that sits in front of
the service, one that lets you maintain persistent connections to the clients
but makes for more transparent swaps of backend services?

For example, could you leverage nginx or haproxy as the "termination" point
for the chat connection, with those proxying back to the kubernetes service
endpoint, which then proxies back to the real backend service? Then, when you
swap out the backend service, nginx/haproxy would start forwarding to the new
service transparently while still maintaining the long-lived connection with
the client.

If this were doable, it would mean you'd only have to drain when you needed
to swap out the proxy layer, which is likely a less frequent task and thus
gives you more agility with your backend services.

~~~
jkarneges
This is essentially how Pushpin ([http://pushpin.org](http://pushpin.org))
works. It can hold a raw WebSocket connection open with a client, but it
speaks HTTP to the backend server, and the backend can be restarted without
the client noticing.

~~~
codegladiator
There's an nginx module for this functionality:

[https://nchan.io/](https://nchan.io/)

I have used it before. Super easy to set up, even with kubernetes.

~~~
jkarneges
Interesting, I didn't think Nchan would be able to drive a raw WebSocket
session with the client, but upon closer look it seems like it might be
possible using the auth and message forwarding hooks. Very cool.

------
ninkendo
Seems like the kind of thing that a Deployment should be able to manage on its
own... some kind of DrainPolicy object maybe?

Also, if the previous ReplicaSet a Deployment is rolling past has several
pods, maybe only some of them need to stay alive (maybe some drain sooner than
others.)

Perhaps the whole endeavor should just be to make Pod drainage a bit more
explicit than just terminationGracePeriodSeconds... perhaps letting a pod
signal with a positive confirmation that it's shutting down (letting
connections drain) and the rest of the k8s controllers can just leave it alone
until it terminates itself.

Although really, I think a combination of setting
terminationGracePeriodSeconds to (effectively) unlimited and having a health
check that ensures the pod doesn't get wedged and miss the termination signal
(by checking that a pod status of "shutting down" corresponds to some property
of the container, like a health endpoint saying a shutdown is in progress)
should mean nothing else needs to be done. Basically, color me skeptical when
they say:

"We used service-loadbalancer to stick sessions to backends and we turned up
the terminationGracePeriodSeconds to several hours. This appeared to work at
first, but it turned out that we lost a lot of connections before the client
closed the connection. We decided that we were probably relying on behavior
that wasn’t guaranteed anyways, so we scrapped this plan."

(This also depends on the container obeying the standard SIGTERM contract to
properly drain connections but not accept new ones, which is pretty standard
in most web servers nowadays.)
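
To make the grace-period idea concrete, here's a rough pod-spec sketch (the
names, port, and numbers are placeholders, and the /healthz endpoint is
something the app itself would have to provide):

    # Hypothetical pod template: an "effectively unlimited" grace period, plus
    # a health endpoint that tooling can use to confirm the container actually
    # saw SIGTERM and is draining rather than wedged.
    spec:
      terminationGracePeriodSeconds: 172800   # ~48h
      containers:
      - name: chat
        image: example/chat:latest            # placeholder image
        readinessProbe:
          httpGet:
            path: /healthz                    # assumed to report drain status
            port: 8080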

~~~
bdimcheff
Yeah, I don't know why the terminationGracePeriodSeconds hacks didn't work.
It could have been a different, unrelated factor that we didn't discover, and
it could just as easily have been service-loadbalancer/haproxy's fault rather
than the termination grace period itself. I'm happy to be proven wrong there.

~~~
smarterclayton
Not 100% sure about your scenario, but if you set a preStop hook to an exec
action, you can arbitrarily delay shutdown inside the grace period, because
the kubelet won't terminate the container until preStop returns.

So if you set a 5 hour grace period, and a preStop hook that invokes a script
that doesn’t return until all connections are closed (but which tells the
container process not to accept new ones) you can control the drain rate.
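
Roughly what that looks like in the pod spec (the script path and numbers are
placeholders):

    # Sketch only: the kubelet runs the preStop exec handler before sending
    # SIGTERM, and won't kill the container until the handler returns or the
    # grace period (here 5 hours) expires.
    spec:
      terminationGracePeriodSeconds: 18000
      containers:
      - name: chat
        image: example/chat:latest
        lifecycle:
          preStop:
            exec:
              # Hypothetical script: tells the app to stop accepting new
              # connections, then blocks until existing ones have drained.
              command: ["/usr/local/bin/drain-and-wait.sh"]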

There are some app-level smarts required: the app has to reject new
connections, and any proxies have to rebalance you. Haproxy does this in most
cases, but the service proxy won't (in iptables mode).

If that's not the behavior you're seeing, please open a bug on Kube and assign
me (this is something I maintain).

~~~
bdimcheff
Yeah I think that there is still some potential in the terminationGracePeriod
strategy, but we found this other way that worked reliably and stopped
exploring that path. If I can repro the issue I'll let you know.

One extra thing I remember being somewhat problematic: when a pod was
Terminating, it'd get removed from the Endpoints, so any tooling that used the
API info to keep an eye on connections was basically unusable at that point.

------
derekperkins
This is a great use case for kube-metacontroller, which was introduced in the
Day 2 keynote at KubeCon. With minimal work, you can replicate a Deployment or
StatefulSet, but with custom update strategies.

Live demo:
[https://youtu.be/1kjgwXP_N7A?t=10m46s](https://youtu.be/1kjgwXP_N7A?t=10m46s)
Code:
[https://github.com/kstmp/metacontroller](https://github.com/kstmp/metacontroller)

------
discordianfish
Nice pattern! You could throw in an HPA to automatically scale deployments
that aren't in use down to zero.

------
gouggoug
I'm not sure what problem the author is solving. I might be misunderstanding
something.

The author points out that the issue with blue/green/any-color deploys is
that they need 16 pods per color at all times (which in their case would end
up being 128 pods) and 24-48 hours for each connection pool to drain.

But how is using a SHA instead of a color any different? Unless I'm missing
something, if running 128 pods and 24-48 hours of draining is the issue, then
using SHAs instead of colors doesn't solve either of those problems.

You'll still need 16 pods and 24-48 hours per SHA deploy, and you're actually
making it worse by not using fixed colors, since you have quite a lot more
SHAs at your disposal.

~~~
thomaslangston
It appears the issue was running pods for deployment colors that aren't in
use when they only deploy once a week, and this solves it because the old
deployments get cleaned up regularly. It does nothing for the overhead of
needing lots of pods to support a high number of deploys per week.

You could do the same with $Color; it just seems clearer because people often
think of $Color as static deployment infrastructure, whereas they're used to
SHAs pointing at branches that get cleaned up naturally.

------
erikrothoff
This was really interesting. I'm thinking about moving to Kubernetes and have
wondered how to gracefully deal with websocket connections.

I'm curious though: if the rollout happened over a couple of hours, for
example, why would reconnections be that big of a problem? We host 10,000+
websocket connections on a $20 VPS, and the Go server hosting them crashes
from time to time. A surge of 10,000 reconnections immediately afterwards has
never lasted more than a minute or so, so why is it so bad? Are moments of
peak load really that big of a deal?

~~~
markbnj
(work with OP on the same team) Basically there are a lot of other things that
happen when a websocket connection is established and we don't necessarily
have the capacity to handle that volume in a complete reconnect scenario,
especially if the system is already near the daily load peak. We have hopes
that autoscaling some things in the future will make it possible to handle
peaks like this more gracefully.

------
jsjohnst
> So far we haven’t found a good way of detecting a lightly used deployment,
> so we’ve been cleaning them up manually every once in a while.

Am I missing something, or wouldn't it be as "simple" as connecting to the
running container, running netstat, and conditionally killing the pod based
on the number of connections? I bet you thought of that, so I'm curious why it
didn't work for you.

------
bdimcheff
One thing I didn't put in here that's also turned out to be useful: We can
prerelease things relatively easily this way too. Each deployment has a git
sha, and we can have a canary/beta/dogfood version that points at an entirely
different sha.
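
For illustration (the label keys and shas here are made up, not our actual
manifests), the beta Service just selects a different sha than the production
one:

    # Production Service points at the current release's Deployment...
    apiVersion: v1
    kind: Service
    metadata:
      name: chat
    spec:
      selector:
        app: chat
        sha: 5f3a9c      # current production sha
      ports:
      - port: 80
        targetPort: 8080
    ---
    # ...while a beta/dogfood Service selects a Deployment built from a newer sha.
    apiVersion: v1
    kind: Service
    metadata:
      name: chat-beta
    spec:
      selector:
        app: chat
        sha: 9b1d7e      # prerelease sha
      ports:
      - port: 80
        targetPort: 8080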

------
_asummers
> We still have one unsolved issue with this deployment strategy: how to clean
> up the old deployments when they’re no longer serving (much) traffic.

Could probably solve this with a readiness probe / health check of sorts that
is smart enough to know what low usage means.
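
Something along these lines, as a sketch (the port, the threshold, and the
assumption that ss is available in the image are all hypothetical); something
external would still need to watch the Ready condition and do the actual
cleanup:

    # Hypothetical exec readinessProbe that fails once the pod is "lightly
    # used", i.e. fewer than 10 established connections on the websocket port.
    readinessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        - test "$(ss -tn state established sport = :8080 | tail -n +2 | wc -l)" -ge 10
      periodSeconds: 60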

~~~
bdimcheff
Yeah I think if restartPolicy were changeable at runtime, we could simply have
the pods exit once their connections are drained enough. If we were to exit
using the current strategy, they'd just be restarted by kube.

------
vitalus
Curious about the 24h-48h burndown...could it potentially be longer for you
guys or is there some mechanism in place to force disconnection (and thus risk
a spike) after some TTL?

~~~
bdimcheff
We force reconnects eventually, there just aren't that many people affected at
that point. There's a very long tail of people keeping their browsers open for
days, but it's only a handful of people.

------
drdrey
So... just a blue/green deployment with a 24h delay before deleting the old
cluster?

~~~
sulam
I don't think so actually, it seems like they are having a series of old SHAs
hanging out, not just one new and one old. I did have the reaction you did
initially though and had to do some reading between the lines to come to the
conclusion that this is not what they're doing, so you could be right!

------
45h34jh53k4j
Using the 6 hex digits of the git commit hash for color is genius. I really
like this pattern!

~~~
jefurii
Dang, now I want to figure out how to print my git-log commit hashes in colors
based on the hashes themselves...

------
xir78
TL;DR

You can drain stuff by changing a Service's selector while leaving the
Deployment alone. Instead of changing a Deployment and doing a rolling update,
create a new Deployment and repoint the Service. Existing connections will
remain until you delete the old Deployment.
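
In manifest terms, a sketch of the pattern (names, labels, and shas are
illustrative, not the article's exact ones):

    # Each release gets its own Deployment, labeled with its git sha.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: chat-9b1d7e
    spec:
      replicas: 16
      selector:
        matchLabels:
          app: chat
          sha: 9b1d7e
      template:
        metadata:
          labels:
            app: chat
            sha: 9b1d7e
        spec:
          containers:
          - name: chat
            image: example/chat:9b1d7e
    ---
    # Releasing means repointing the Service's selector at the new sha.
    # Pods from older Deployments keep their existing connections until
    # those Deployments are finally deleted.
    apiVersion: v1
    kind: Service
    metadata:
      name: chat
    spec:
      selector:
        app: chat
        sha: 9b1d7e     # previously 5f3a9c
      ports:
      - port: 80
        targetPort: 8080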

