Killing Containers at Scale (repl.it)
119 points by flurly on Feb 4, 2021 | 20 comments



Yeah, sending SIGKILL to pid1 of a container will insta-kill the whole thing. But the network resource cleanup also includes removing iptables rules, so it's not something you really want to skip entirely. It feels like the network cleanup should be a background job rather than something that blocks the kill (after all, it's not critical to stopping the container, and most resources will be freed when the network namespace dies). Also (if you're not already aware of this), note that if you use "docker exec" or share pid namespaces between containers, this trick will no longer work, because your saved pid is not pid1 of the namespace (runc has a "kill everything in the container" mode to work around this -- so I'd suggest using "runc kill SIGKILL" rather than doing it manually, but it's probably not that important).
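For concreteness, here's a minimal Go sketch of the trick being discussed (illustrative only, not Repl.it's actual code; the container name is made up): look up the container's init pid via "docker inspect" and SIGKILL it directly, so the caller never blocks on dockerd's synchronous teardown.

    package main

    import (
        "fmt"
        "os/exec"
        "strconv"
        "strings"
        "syscall"
    )

    // initPID asks dockerd for the host pid of the container's init process.
    func initPID(containerID string) (int, error) {
        out, err := exec.Command("docker", "inspect",
            "--format", "{{.State.Pid}}", containerID).Output()
        if err != nil {
            return 0, err
        }
        return strconv.Atoi(strings.TrimSpace(string(out)))
    }

    // instaKill signals the container's init directly. As noted above, this
    // assumes the saved pid really is pid1 of the pid namespace (no shared
    // namespaces); otherwise "runc kill --all" is the safer tool.
    func instaKill(containerID string) error {
        pid, err := initPID(containerID)
        if err != nil {
            return fmt.Errorf("inspect %s: %w", containerID, err)
        }
        return syscall.Kill(pid, syscall.SIGKILL)
    }

    func main() {
        if err := instaKill("my-container"); err != nil {
            fmt.Println("kill failed:", err)
        }
    }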

Have you opened a bug report in Docker upstream to see if they can improve the situation (perhaps by putting networking cleanup jobs to a background goroutine)?


Indeed, for our use case we want an insta-kill so we can free up our global lock on the container. The docker daemon will still get notified of the container's death and run its usual cleanup; this just gives us an opportunity to free up the global lock before getting stuck waiting on that cleanup.

> Have you opened a bug report in Docker upstream to see if they can improve the situation (perhaps by putting networking cleanup jobs to a background goroutine)?

I have not yet, but I plan to! We figured we'd work around it ourselves instead of waiting for a potential fix upstream.


Yup, I forgot to mention that dockerd will get the death event and clean up like it would if pid1 died normally. My comment was more about what an upstream patch would look like. And I'm glad to hear you plan on opening an upstream issue about it. Sadly it's quite common for folks to work around issues we could've fixed upstream but were never told about, so I'm glad you're bucking that trend. :D


Unfortunately, the trend is the way it is because Docker has historically demonstrated contempt towards the tickets people have raised, and many issues people suffer from have never been fixed.

Those of us who need to get things done have taken to just working around it and moving on.

If this attitude has changed, it should be communicated publicly so people can reset and come back to Docker with such issues.


That has historically not been reliable in any version of docker. A proper solution would probably require restructuring the whole thing.


If the VM is going to die, why bother killing it at all? Just "make it safe" and let GCE kill it.

Could you not just send a signal to the running process telling it that it's going to die and should hang up, save state, or do whatever else is needed? If an ephemeral VM ends in a bad state, does anyone care?


> Sadly this conman is shutting down and rejects the WebSocket connection!

Sorry if this is obvious: why is the request proxied to a conman that is shutting down in the first place? Wouldn't it be more robust to work around this instead? Is it perhaps because there is no way to tell when the VM is shutting down?


Hi! Author of the post here.

One of the most important invariants that we have to maintain is that there is only 1 container running per repl at any given time. We could determine if the machine is shutting down and not proxy the connection, but we wouldn't have a place to proxy it to since we can't be sure that the existing container has finished shutting down. So either way we end up returning an error and the client has to wait until the old container has been destroyed so a new one can be spawned.
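For illustration, here's a rough Go sketch of that invariant (not Repl.it's actual code; all names are made up): a per-repl registry that refuses to hand out a new container while the previous one is still shutting down, which is why the proxy has nothing better to do than return an error during that window.

    package main

    import (
        "errors"
        "fmt"
        "sync"
    )

    type state int

    const (
        running state = iota
        shuttingDown
    )

    var (
        mu    sync.Mutex
        repls = map[string]state{} // replID -> container state
    )

    // acquire succeeds only when no container exists for the repl at all; a
    // repl whose old container is still dying has to wait for markGone.
    func acquire(replID string) error {
        mu.Lock()
        defer mu.Unlock()
        if st, ok := repls[replID]; ok {
            if st == shuttingDown {
                return errors.New("previous container still shutting down")
            }
            return errors.New("container already running for this repl")
        }
        repls[replID] = running
        return nil
    }

    // markShuttingDown is called when the kill is issued; markGone only once
    // dockerd reports the container fully cleaned up.
    func markShuttingDown(id string) { mu.Lock(); repls[id] = shuttingDown; mu.Unlock() }
    func markGone(id string)         { mu.Lock(); delete(repls, id); mu.Unlock() }

    func main() {
        fmt.Println(acquire("repl-123")) // <nil>
        markShuttingDown("repl-123")
        fmt.Println(acquire("repl-123")) // previous container still shutting down
        markGone("repl-123")
        fmt.Println(acquire("repl-123")) // <nil>
    }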


Disclosure: I work on Google Cloud (and Preemptible VMs).

You’re kind of racing against the clock though. You could instead have a “load balancing” style layer in front, so that while there is no usable session, at least the person’s connection is just “hanging”.

Feel free to send me some email, as we’re looking to make this experience better (both generally and in GKE, specifically).
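To sketch the "load balancing" layer idea in Go (purely illustrative; backendFor is a hypothetical lookup, not an actual Repl.it or GCP API): the front layer polls for a usable conman for a bounded time while the client's connection simply hangs, instead of erroring immediately.

    package main

    import (
        "context"
        "errors"
        "fmt"
        "time"
    )

    // backendFor is assumed to return the address of a usable conman for the
    // repl, or an error while the old container is still being torn down.
    func backendFor(replID string) (string, error) {
        return "", errors.New("no usable conman yet") // placeholder
    }

    // waitForBackend keeps the caller "hanging" (up to the context deadline)
    // until a backend shows up, so the user sees a slow connect instead of
    // an error page.
    func waitForBackend(ctx context.Context, replID string) (string, error) {
        tick := time.NewTicker(250 * time.Millisecond)
        defer tick.Stop()
        for {
            if addr, err := backendFor(replID); err == nil {
                return addr, nil
            }
            select {
            case <-ctx.Done():
                return "", ctx.Err()
            case <-tick.C:
            }
        }
    }

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
        defer cancel()
        _, err := waitForBackend(ctx, "repl-123")
        fmt.Println(err) // deadline exceeded, since no backend ever appears in this demo
    }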


> One of the most important invariants that we have to maintain is that there is only 1 container running per repl at any given time.

Perhaps when a host is preempted, instead of killing containers at all, you could just add an iptables rule to black-hole all network traffic for the host. Then they are as good as dead (and the host will be forcibly killed soon anyways).
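A hedged sketch of what that could look like (illustrative only; which chains actually matter depends on how the containers are networked, e.g. bridged container traffic usually traverses FORWARD):

    package main

    import (
        "fmt"
        "os/exec"
    )

    // blackhole inserts unconditional DROP rules at the head of the FORWARD
    // and INPUT chains so containers stop talking to the outside world,
    // without any process being killed. Requires root; host-originated
    // traffic (OUTPUT) is left alone here so the box can still report in.
    func blackhole() error {
        for _, chain := range []string{"FORWARD", "INPUT"} {
            out, err := exec.Command("iptables", "-I", chain, "-j", "DROP").CombinedOutput()
            if err != nil {
                return fmt.Errorf("iptables -I %s: %v: %s", chain, err, out)
            }
        }
        return nil
    }

    func main() {
        if err := blackhole(); err != nil {
            fmt.Println(err)
        }
    }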


You still want to send SIGTERM or equivalent to the processes so that they can flush their state if needed. Just abandoning ship is often less good than using a few seconds to cleanly shut down (if only to exit and send a clear message that you have).
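From inside the container, that's just a SIGTERM handler; a minimal Go sketch (the flush step is a stand-in for whatever state actually matters):

    package main

    import (
        "fmt"
        "os"
        "os/signal"
        "syscall"
    )

    func main() {
        term := make(chan os.Signal, 1)
        signal.Notify(term, syscall.SIGTERM)

        <-term // the host/orchestrator says we are about to die

        // Flush and fsync whatever in-flight state matters, well within the
        // grace period, so a later hard kill never catches us mid-write.
        fmt.Println("flushing state and exiting cleanly")
        os.Exit(0)
    }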


Right, but in the article they said they don’t care about graceful shutdown.


I forgot they said that (I read this when it was posted a couple days ago), thanks for noting that.

But I still would encourage even a 1s “clean” shutdown. You don’t need to wait for any of this fancy cleanup, but it’s really nice to finish your writes.

Fun story: for Preemptible VMs we started (in Alpha / EAP) with no soft shutdown at all, just an immediate power-off, to see whether people could handle it. Turns out that if your box is running apt-get upgrade (or anything like that) at the time, you can easily corrupt your boot disk. So we went back and forth between "a few seconds" (5, 15) and "about 30, which is how long a GCE instance takes to boot". That's how we ended up with 30: that way we wouldn't more than double regular instance creation times at the tail. Nowadays we boot to ssh in 15 seconds!

If you don't reuse your state, none of this matters. But I'd guess that even just getting to the point of RST'ing the connections is valuable (so that the clients know to take action, rather than wait a while).


We had shutdown hooks on our preemptible VMs, but we often had cases (at least weekly) where it looked like they failed to run (failing to unregister from the cluster). Any explanation?


Do you mean on GKE or directly on GCE? It sounds like you mean GKE (“failed to unregister from cluster”).

We're looking to fix up GKE's graceful node shutdown, because it's currently "racy" and doesn't actually respect the grace period properly (system pods / processes can be shut down before user pods have finished, causing you to lose logging or, say, the kubelet).


Yes, GKE containers, sorry about the confusion... sometimes it looks like a node has disappeared without much shutdown work.


Your solution should implement a retry-on-failure rule before returning an error to the client. This is standard in proxies like Envoy and others.


We do have an internal retry mechanism for certain kinds of errors, but after too many tries it will return an error to the client. Our clients already have to include a robust retry mechanism anyway, because we have to deal with any sort of network instability.


Would a simple retry mechanism work here? Send a request to conman, wait at most X seconds (with N retries) before marking the conman entry as inactive, and clean up inactive conmans in the background.

This way your client waits at most X seconds before being reassigned to a warm conman.
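Something like this, as a rough Go sketch (sendToConman is a hypothetical stand-in for the real proxy call; the timings are arbitrary):

    package main

    import (
        "context"
        "errors"
        "fmt"
        "time"
    )

    func sendToConman(ctx context.Context) error {
        return errors.New("conman is shutting down") // placeholder failure
    }

    // proxyWithRetry tries the conman up to maxTries times within an overall
    // deadline; only after that would the entry be marked inactive and the
    // error surfaced to the client.
    func proxyWithRetry(parent context.Context, maxTries int, overall time.Duration) error {
        ctx, cancel := context.WithTimeout(parent, overall) // wait at most X seconds
        defer cancel()

        var lastErr error
        for i := 0; i < maxTries; i++ {
            if lastErr = sendToConman(ctx); lastErr == nil {
                return nil
            }
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(200 * time.Millisecond): // small backoff between tries
            }
        }
        return lastErr
    }

    func main() {
        fmt.Println(proxyWithRetry(context.Background(), 5, 3*time.Second))
    }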


I'm suddenly reminded of this base image: https://github.com/phusion/baseimage-docker



