Docker and the PID 1 zombie reaping problem (phusion.nl)
119 points by ohcoder on Jan 20, 2015 | 69 comments

I know this will cost me a lot of karma, but surely this is the point where you say, let's just use a hypervisor?

Yes, I know it's much easier to just use Docker, but at the point that you have to write a new init, doesn't that strike you as a sign that it's not the easy option anymore?

Things like "make sure we don't lose syslog" make me shudder.

Yes, supposedly there is an argument that Docker is faster. However, as a build engineer I've seen little evidence of that.

We use VMware with throwaway build disks. This means that a dev can destroy a machine and, once rebooted, it's back to a known default state.

The best part is that we have collectd on the VMware host and in the guest to collect metrics on CPU, memory and IO.

Because we have decent disks and lots of memory, we don't hit any limitations. For heavy builds we have NFS-rooted servers with non-persistent disks: much faster, and it requires almost no thinking about.

This is actually why we all need to support systemd. It is already an init system, and a sophisticated one; now that it has integrated container support it's only a matter of time before systemd's pid 1 assumes zombie reaping duties for the host and all container images.

People who criticize systemd fail to realize that it provides solutions for many problems under Linux, and even problems with the Unix model itself, and has widespread support from companies, distributions, and other stakeholders.

Yes, it provides a supervisord-type mechanism, and that's great, as are dependencies, but they are outweighed by the following:

The horrid shite that is the journald. whoever thought it was a brilliant idea to smoosh grep tail and vi together is someone who's never had to debug anything hung over.

Seriously, why in living fuck do I have to use your shitty tool to look at system logs? The whole point of Linux is that you are supposed to be able to use anything to open anything. Sure, it's a stupid idea to use most tools, but that's not your decision.

Seriously do I have to learn a completely new syntax to find all things that happened at 10:23:04 three days ago? Everywhere else I can use vim/nano/emacs/awk/grep/sed.

But here? No, I'm too stupid to understand those tools, here, use this one. We've thought of every use case; if you're having trouble it's because you're either a hater, idiot, sysadmin, or too old.

you can tell that the people who write this shit are not the poor fuckers who have to tune/support/debug/migrate or use it.

So no, I'm not going to "get behind systemd" until journal is fit for purpose.

If you're running a syslogd then the logs also get stored there, so you can continue to use the tools you are familiar with if that is more important to you than possibly considering different tools that may work better.

Much as I like the inference that I've not tried journal, or that I'm immune to new ideas, thanks, thanks very much.

Look, my job is to implement new ideas. I run a datacenter that has 15pb on tier 1 storage, and 28,000 CPUs.

The one thing I care about most is this: getting home to see my children.

The only time I look at logs is if something terrible has gone wrong. Everything else is done by sourcing metrics directly from the components (collectd + graphite).

We do ship logs, and yes, we have Elasticsearch, but that's far too heavy and far too high-latency to be useful.

So the only real time that I look at logs is when something is horrifically broken, or we can't figure out from the graphs what has gone wrong.

At that time, I do not want to have to remember new syntax to get at what should be text files. (Yes, signed logs, but let's be honest, that's why you ship logs.)

If I got more functionality, then yes I'll be onboard.

However, I can't use bash to talk to the journal. It only has a C API, which smacks of it being designed for phones and super-integrated HPC, not the 95% use case.

Because it uses binary blobs (and yes, I do love binary, just not for logs), it's a pain in the arse to inspect using a scripting language.

if journalctl _COMM=sshd | tail | grep "something"; then

Really nice. Also, what the fuck is with starting a switch with an _? Underscore indicates hidden function. Seriously, why is it different to every other log inspection tool?

It's not a switch, it's a match argument, and the underscore prefix indicates that the field is "trusted": the kernel retrieved the value and the logging application is unable to provide "fake" data for the field in case it has been compromised.
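
To make the field naming concrete, here is a minimal Python sketch that sorts the keys of one journal entry by prefix. It assumes the entry came from `journalctl -o json`, which emits one JSON object per line. The sample line and the helper name `split_journal_fields` are made up for illustration. Double-underscore keys such as `__REALTIME_TIMESTAMP` are kept apart because they describe the entry itself rather than the process that logged it.

```python
import json


def split_journal_fields(entry):
    """Split one journal entry (a dict) into trusted, address and application fields."""
    trusted, address, app = {}, {}, {}
    for name, value in entry.items():
        if name.startswith("__"):
            address[name] = value   # entry metadata, e.g. __REALTIME_TIMESTAMP
        elif name.startswith("_"):
            trusted[name] = value   # added by the journal, e.g. _COMM, _PID
        else:
            app[name] = value       # supplied by the logging program itself
    return trusted, address, app


# Hypothetical sample line, shaped like one line of `journalctl -o json` output.
SAMPLE_LINE = (
    '{"__REALTIME_TIMESTAMP": "1421749384000000", "_COMM": "sshd", '
    '"_PID": "812", "PRIORITY": "6", "MESSAGE": "Accepted publickey for alice"}'
)
```

Only the single-underscore group is what a match such as `_COMM=sshd` relies on; anything in the application group could have been written by the program being inspected.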

And people who blindly push systemd fail to realize that we've already solved these problems. A more complex system doesn't solve it.

So instead of a 350 line Python script, you think systemd is better? I disagree.

A more complex system? How is solving this problem universally a bad thing?

Is having _one_ way to do things that works everywhere so bad? Having a container manager do things I wouldn't expect, like have issues with zombies or dropping syslog messages, makes me never want to touch it.

My dear chap, one way does not work everywhere. If that were the case we'd all have agreed to just use vim, or emacs, or sublime, or nedit, or nano, or ed......

Vim has largely won vs. emacs, and gone on to the next bracket against ST. Emacs is out of the tourney. I can't think of a single hacker under 30 who isn't a total Lisphead, who wants to even touch emacs.

Docker does different things than VM's, but they work together perfectly. I run all my Docker hosts on a VMWare infrastructure, which gives me best of both worlds. Yes - for some things, VMWare can't be beaten - but that's the nice thing: they are not mutually exclusive. Both are very handy tools in my toolbox, and neither is a one-stop solution. Docker sure adds complexity - but also a huge amount of flexibility. Once you understand the power of it all, and how to think with Docker (which isn't that obvious) - it's pretty damn good. Not that there are no downsides or things that could be improved, but nothing is perfect.

And then:

1) The problem is hugely exaggerated. It only applies to badly written software. And make sure we don't lose syslog? Seriously? just mount /dev/log into your container (it's a unix socket after all) and you're done.

2) Arguably faster? Try booting up 15 instances of app X. With docker, that's achieved in less than a second. On VMWare this has a huge memory and CPU overhead, not to mention boot-up times.

3) Try versioning VM's. Good luck. I have 100% reproducible, versioned images generated and tested in our CI system, that I can easily deploy in a VM on my laptop where I will have 100% the same setup as it will in test/qa/prod. To deploy the new version, start new instance, point load balancer/proxy/whatever to the new instance (happens automagically btw). Everything goes well? Stop old instance. Something goes wrong? Point proxy to old instance, and kill the badly-behaving. If everything went well: zero downtime. If you detect something goes wrong, it's nearly instant to switch back to the previous version. Also, testing this is a lot easier than with VM's. Sure you could do that with VM's too - but the overhead is massive, and takes a lot more time.
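
On point 1, a minimal sketch of why bind-mounting works: the syslog endpoint is just a Unix socket path, so a program inside the container can log to it as long as that path exists there. This assumes the host socket was made available with something like `docker run -v /dev/log:/dev/log ...`. `make_syslog_logger` is a hypothetical helper; `SysLogHandler` comes from Python's standard library.

```python
import logging
import logging.handlers


def make_syslog_logger(name, socket_path="/dev/log"):
    """Return a logger that writes to the syslog Unix socket at socket_path.

    Raises OSError if nothing is listening at that path, which is what
    happens in a container where the socket was not mounted.
    """
    handler = logging.handlers.SysLogHandler(address=socket_path)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger


# Usage inside the container (assumes /dev/log is present):
#     log = make_syslog_logger("myapp")
#     log.info("hello from inside the container")
```

Whether the messages then end up where you expect still depends on the syslog daemon on the other end of the socket.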

My ideal setup would be a hypervisor infrastructure with a bunch of VM's running only Docker applications.

> My ideal setup would be a hypervisor infrastructure with a bunch of VM's running only Docker applications.

You should take a look at https://github.com/docker/machine and https://github.com/docker/swarm , they were designed with exactly that goal in mind!

Sadly, it still lags behind even KVM, it can't hot migrate, which means zero downtime maintenance is impossible, along with transparent HA.

Care to share your CI tech stack? Specially the deployment bit automagically updating the balancer. We are internally building that kind of setup at the moment and some insight from experienced people like you would be very helpful. Thanks.

CI is Jenkins, which uses the Docker plugin for creating slaves on-the-fly. The plugin has its share of problems, so I don't use it to build the images themselves - I do this using a generalized bash script, which also pushes them to our private repo.

For automatically updating the load balancer, check out docker-gen (https://github.com/jwilder/docker-gen). I hacked together a setup that detects the latest version and points to it when a container comes up or goes down. It assumes a lot of things specific to my setup though (versioning of the containers, env vars, ...)

This is not talking about writing a new init in the sense of Systemd. All the init program needs to do is spawn the one process that you were going to start anyway, and adopt orphan processes.
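
A minimal Python sketch of those two duties, spawning the real process and reaping whatever else exits. It is an illustration only, not the article's my_init: `run_as_init` is a made-up name, and signals are not forwarded to the child here.

```python
import os


def run_as_init(argv):
    """Start argv as the main child, reap every child that exits, and
    return the main child's exit code once it terminates."""
    main_pid = os.fork()
    if main_pid == 0:
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed
    while True:
        # wait() returns for *any* child. When this process is PID 1,
        # that includes orphans the kernel has re-parented to it.
        pid, status = os.wait()
        if pid == main_pid:
            if os.WIFSIGNALED(status):
                return 128 + os.WTERMSIG(status)
            return os.WEXITSTATUS(status)
```

As a container entrypoint it would be called along the lines of `sys.exit(run_as_init(sys.argv[1:]))`. The loop stops as soon as the main child is gone, which matches the idea that the container lives exactly as long as that one process.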

You could literally use bash as the init process if you do not care about shutting down cleanly (or are willing to have a more complicated shutdown procedure than 'docker stop').

FWIW, and I say this as a big Docker fan, I had the exact same reaction. Docker, to me, is for immutable stuff where I don't care about the end results of that particular instance of the software. Turning containers into thinner VMs doesn't do much for me.

A hypervisor has a ~150-250MB memory overhead over linux containers + namespaces because it boots a full kernel.

There are other differences but if all you want from your system is reproducibility then VM images work fine of course.

Hi, Docker author. Just wanted to say that we are aware of the problem, understand the underlying mechanics, and plan to fix it. In the meantime application-level fixes like this can definitely help.

One slight difficulty is that some applications do want to be pid1. While 99% of the ecosystem wants the problem of being pid1 to be handled natively by Docker, the remaining 1% (people who write init systems, or platforms tied to a particular init system) emphatically wants the opposite. So we need to make sure both modes of operations are available, and deal with a dilemma: do we change the default behavior, potentially breaking images expecting to be run as pid1? Or do we keep a default which breaks 99% of applications, requiring an explicit setting to opt into the correct behavior?

Anybody with opinions on the matter is welcome to voice them on #docker-dev / freenode, or the dev mailing list, we'd love to hear from you.

Having a config option is a good idea. I vote for changing the default behavior because Docker has so much growth potential left. Leaving the default as-is is problematic for the majority of people, both current and future. I think that breaking it is acceptable because a lot of the people who use Docker in production -- especially those that are expecting to be PID 1 -- are early adopters and will forgive you for the slight breakage.

Yes, change it now, before it's everywhere. The longer you wait, the worse the problem becomes.

As long as you provide a choice on what to run as PID1, and that init system immediately/automatically runs the ENTRYPOINT / CMD / run command I think it is fine to change the behaviour to the correct one. Please don't make the mistake to hardcode PID1, and worst of all not to systemd ... as systemd causes enough trouble in docker containers already (try debian host, fedora:20 docker and upgrade if you can).

Don't worry, we believe in small composable tools, for example we are planning to break up Docker into more discrete binaries. So we're not about to bundle systemd into every Docker installation ;)

Are there any progressing issues we can follow?

I think phusion's base image solution is overkill (its own init written in Python 3, and it forces users to use runit). It would be great if I could use systemd inside a container (I am a CentOS 7 user, and using systemd is quite easy for most packages, e.g.: yum install httpd; systemctl start httpd), but it requires --privileged. Now I am considering s6 as a solution based on this article: http://blog.tutum.co/2014/12/02/docker-and-s6-my-new-favorit...

But I think it would be best if Docker solved this problem itself; then I could freely use my familiar tools like monit.

How would this then be solved? Letting the .dockerinit process handle this?

And to be honest - the problem is exaggerated and not worth fixing on Docker level. Programs launching child processes have to handle SIGCHLD properly anyway - PID 1 or not. This is only a problem with badly written software.

I'm not sure writing yet another pid 1 tool is such a good idea. I have been using Docker in production for a few months now and I use supervisord as pid 1. I am able to hook it to most app servers (Python, Rails, etc.) as well as rsyslogd. Logging is a big pain on Docker, and you should think of your logging strategy before the pid 1. How are you rotating your logs, and is your pid 1 restarting your app on failure? For us, supervisord does all that.

We run syslog and logrotate inside the container (using a volume) and it works fairly well for us. Plus supervisord comes with a pretty kickass tool - supervisorctl - and the conf file is fairly well documented.

Optimizing the RAM/CPU usage of supervisord and rsyslogd is not very critical - they don't use too much.

I actually learned a lot from baseimage, but I don't know why they went with runit rather than supervisord.

Would really, really, REALLY love to learn to do this with systemd.

EDIT: ahh.. forgot that their script was based on runit. My mistake, but I don't think I will change over from supervisord anytime soon.

Supervisord docs say explicitly that it is not meant to be used as PID 1. Does it still perform all the necessary functions?

yes - we use django, rails, postgres, redis, syslogd and cron. In general it works very well - we use puma for the rails server and gunicorn for python and we are able to control them very well.

Surely writing a minimal init in C (or Go if you like) is better than using Python... it is not a lot of code, and frankly I would trust it not to die and kill your container more. Plus Python has a chunk of dependencies you don't need in a minimal container.

Lots of system tools are written in Python nowadays. If you use a Debian, Ubuntu or CentOS base image then there's a good chance that Python is preinstalled because some system tool depends on it.

It was also much easier to write the tool in Python than in C. Writing it in C would have taken much longer, and at the end of the day we wanted to get work done instead of worrying about a few MBs, which only take 1 second to download and which cost $0.001 of disk space.

It really does not need much code in C, here is a simple init https://github.com/arachsys/init/blob/master/init.c for example.

In particular you want to avoid memory allocation so it cannot fail even in low memory situations, which you cannot guarantee with Python. That means you might die reaping children with low memory which would kill your container.

Writing it in Python also adds a lot of dependencies and bloat to the docker image which don't really need to be there.

Particularly when there are simple init scripts like [1] which could be easily adapted to fill both the reaping and restart problem.

[1] http://git.suckless.org/sinit/tree/sinit.c

It's trivial to write this program in C and it is a waste of memory to load up 50 MB of Python just to accomplish this simple task.

Actually, the Python process eats less than 5 MB of RAM. Most other stuff on the system eats more RAM.

Writing my_init took much more time than you think, and I do not regret choosing Python. My_init not only handles process reaping, but also environment variable handling and a bunch of other stuff. Writing text parsers in C is a huge pain (libc has no regexps that are as easy to use as Python's).

I do not oppose rewriting it in C, but given limited manpower and the tiny benefits it's not high on my todo list either. If you are worried about resource usage, then I would very much welcome contributions for rewriting it in C.

Why not use already existing init? runit seems small enough: http://smarden.org/runit/

Or if you want small have a look here for minimal PID 1 "So how should init be done right?" http://ewontfix.com/14/

Baseimage-docker does use Runit. my_init calls Runit.

Why not use http://smarden.org/runit/runit-init.8.html and run my_init from runit?

Fair enough, but for the record it's at least 10 MB (try starting up Python and import your list of modules).

The environment variable handling and other stuff could perhaps be handled in a separate bash-script (or Python), such that the scope of the actual reaper is sufficiently small (to use a lower-level language).

5-10MB is a big nothing, unless you're talking about running hundreds of images that consume 20-30mb each. in which case you would have some custom image structure to minimize everything anyway. But then you'd want to consider the infrastructure development cost vs just buying more iron.

There comes a point where you have to take a look at the diminishing returns and say "That's good enough."

Indeed: I solved this minor issue with a 23-line C program that compiles to an 8766-byte binary. It's really not a big deal.

(Mine doesn't pass on signals, but that's not important for us.)
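
For comparison, a sketch of the piece that is left out there: relaying signals such as the SIGTERM sent by `docker stop` to the main child while still reaping everything. This is written in Python for brevity and is not the 23-line program mentioned above. `run_forwarding_signals` is a made-up name, and a signal that arrives before the fork is simply dropped.

```python
import os
import signal


def run_forwarding_signals(argv, forwarded=(signal.SIGTERM, signal.SIGINT)):
    """Run argv as a child, relay the listed signals to it, reap all children,
    and return the child's exit code (128 + signal number if it was killed)."""
    child = 0  # stays 0 until the fork has happened

    def relay(signum, _frame):
        # Pass the signal on to the main child, if it is still around.
        if child:
            try:
                os.kill(child, signum)
            except ProcessLookupError:
                pass

    previous = {sig: signal.signal(sig, relay) for sig in forwarded}
    try:
        child = os.fork()
        if child == 0:
            try:
                os.execvp(argv[0], argv)  # exec resets caught signals to default
            finally:
                os._exit(127)  # only reached if exec failed
        while True:
            pid, status = os.wait()  # also reaps adopted orphans
            if pid == child:
                if os.WIFSIGNALED(status):
                    return 128 + os.WTERMSIG(status)
                return os.WEXITSTATUS(status)
    finally:
        # Put the original handlers back before returning.
        for sig, handler in previous.items():
            signal.signal(sig, handler if handler is not None else signal.SIG_DFL)
```

Without the relay, a SIGTERM aimed at PID 1 never reaches the application, so a clean shutdown depends on the application being told some other way.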

I think needing to run an init system with Docker is a solution to a problem you've created for yourself. From reading this post and other posts on your blog about Docker it sounds like you are `docker start`-ing containers that have been `docker stop`-ped. In my experience, using `docker start` isn't a best practice and I design my containers in a way that I won't need to restart a stopped container.

If you use Docker containers as a form of lightweight immutable infrastructure you won't care about zombie processes because you won't be restarting stopped containers. Instead you just `docker run` every time you want to run a new container. The example of a corrupt file is moot because your containers shouldn't be writing to the file system with the intention of that file being around for long. This also applies to databases, which I'm not fond of running in containers at all (setting aside highly distributed databases, perhaps).

The problem isn't when your main docker process is dying (in this case I believe dockerd itself reaps the process), the problem is when your service spawns child processes as part of normal operation.

As an example, suppose we have a simple web server that executes CGI scripts written in bash (I know this is kinda contrived but it illustrates my point). I do a request to the server, it runs a bash script. The script runs grep. While grep is executing, the web server decides bash has taken too long and SIGKILLs it. Then grep is adopted by your PID1, in this case your webserver. But the webserver doesn't know it's there, and so never calls wait(). You now have a zombie process that will last until you stop the container.
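
The last step of that scenario can be reproduced in a few lines. This is a Linux-only sketch that reads the state letter from `/proc/<pid>/stat`; the helper names are made up. It forks a child that exits immediately and looks at it before and after the parent calls wait.

```python
import os
import time


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the
    process no longer exists. Linux-only."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The state letter is the first field after the ")" closing the command name.
    return data.rsplit(")", 1)[1].split()[0]


def zombie_demo():
    """Fork a child that exits at once; return its state before and after wait()."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)
    # Give the kernel a moment to process the child's exit.
    deadline = time.monotonic() + 5
    while proc_state(pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
    before = proc_state(pid)   # child has exited but nobody has waited for it
    os.waitpid(pid, 0)         # the reaping step a PID 1 is expected to perform
    after = proc_state(pid)    # the process table entry is gone
    return before, after
```

A parent that never reaches the `waitpid` line leaves the child in the first state for as long as the parent itself keeps running.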

FWIW I have written a very, very small /sbin/init replacement for docker containers. It runs one script on startup and another on shutdown, supplied by the container author: https://github.com/jre/dinit

The terminology used to describe children processes can often seem comical. I recall my systems programming classes and the professor would shout "The parent must kill all the children!" -- was good for a few chuckles.

More seriously, the issues raised in this article are real, and I do believe the author has done a fine job breaking down the UNIX process hierarchy and defined responsibilities.

I'd say it's a shame, as this is yet one more important stability/scalability/security/performance item the Docker team has not thought of (or maybe they're busy writing more ecosystem apps), but I do believe this is a container problem at large, seeing that the official App Container Spec has not addressed it.

For those criticizing the need for "yet another pid 1", well, there doesn't seem to be any way around it that can guarantee the same outcome. Yes, it's understandable having reservations about Python being the tool of choice for such an init system, and perhaps C would be a better choice allowing more fine-grained control and guarantees. My guess is the article author was most familiar with Python and wanted to have a POC for the write-up, which is satisfactory.

as I posted on reddit:

A simpler and better solution for this problem is: do not let apps run in background (double-fork/detach) in containers. Most applications that background by default can override this with a command-line flag. And certainly don't run programs that execute other programs that background. Keep it simple, and try to limit yourself to 1 process per container. Sure there might be situations where this cannot be avoided, but in 99% of the situations - this is not a problem. Yes you have to be aware of this potential problem, but I am currently running tons of containers, of which only a few run supervisord (which handles reaping too) to run multiple processes within one container because it was the sane thing to do. Stale zombie processes on my docker hosts? None. These precautions in my opinion are only required for badly behaving software. And yes - there is software like that.
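
What the double-fork/detach pattern does to the process tree can be shown with a short sketch. It assumes a Unix-like system, and `orphan_parent_pid` is a made-up name. The middle process exits right away, and the grandchild then reports who its parent has become.

```python
import os


def orphan_parent_pid():
    """Double-fork, let the middle process exit, and return
    (middle_pid, parent PID the grandchild sees afterwards)."""
    result_r, result_w = os.pipe()
    go_r, go_w = os.pipe()
    middle = os.fork()
    if middle == 0:
        if os.fork() == 0:
            # Grandchild: block until told that the middle process is gone.
            os.close(go_w)
            os.read(go_r, 1)
            os.write(result_w, str(os.getppid()).encode())
            os._exit(0)
        os._exit(0)  # middle process exits immediately, orphaning the grandchild
    os.close(result_w)
    os.close(go_r)
    os.waitpid(middle, 0)   # the middle process is reaped by its real parent
    os.write(go_w, b"x")    # now release the grandchild
    os.close(go_w)
    new_parent = int(os.read(result_r, 32).decode())
    os.close(result_r)
    return middle, new_parent
```

The second value is no longer the middle process. It is whichever process the kernel picked as the new parent, typically PID 1 of the container, and that process is the one that has to call wait when the grandchild eventually exits.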

Is there a ticket in the docker project? I would like to follow the progress on this one.

Everything I could find was this ticket: https://github.com/docker/docker/issues/5305

Isn't this effectively a non-issue? It's literally just a tiny entry in a table that is held no? Or does it just cause Docker to behave oddly?

There are two issues. First is Murphy's law. One zombie process is probably harmless, but if the number grows then you'll hit kernel limits and the container cannot create further processes.

Second, zombie processes that don't go away can also interfere with software that checks for the existence of processes. For example, [the Phusion Passenger application server](https://www.phusionpassenger.com/) manages processes. It restarts processes when they crash. Crash detection is implemented by parsing the output of `ps`, and by sending a 0 signal to the process ID. Zombie processes are displayed in `ps` and respond to the 0 signal, so Phusion Passenger thinks the process is still alive even though it has terminated.

> Crash detection is implemented by parsing the output of `ps`

Would there be anything to be gained by actually rooting around in /proc yourselves, rather than relying on ps?

Ps is implemented by parsing /proc, so the results you get are the same.

Having said that, /proc/xxx/status does contain a flag that tells you whether a process is a zombie or not. So on Linux, Phusion Passenger parses that file as an extra check. But not all software does this.
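
A rough Python sketch of such a two-step check, assuming Linux. It is not Phusion Passenger's actual code, and the function names are invented for illustration: signal 0 answers "does this PID exist", and the `State:` line in `/proc/<pid>/status` answers "is it only a zombie".

```python
import os


def signal_zero_ok(pid):
    """True if kill(pid, 0) finds the process. Zombies count as found."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but belongs to another user
    return True


def is_zombie(pid):
    """True if /proc/<pid>/status reports state Z. Linux-only."""
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("State:"):
                    return line.split()[1] == "Z"
    except (FileNotFoundError, ProcessLookupError):
        pass
    return False


def is_really_alive(pid):
    """Existence check that does not mistake a zombie for a running process."""
    return signal_zero_ok(pid) and not is_zombie(pid)
```

Software that stops at the first function is the kind that gets fooled when zombies pile up.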

Fair enough!

How does this compare to minit or runit ?

Why write your own mini-init instead of using something like supervisord?

Supervisord is not an init and must not be run as pid1, that's explicitly mentioned in its documentation[0]:

> It shares some of the same goals of programs like launchd, daemontools, and runit. Unlike some of these programs, it is not meant to be run as a substitute for init as “process id 1”.

There are micro-inits you could use instead though, s6 should be runnable as pid1 for instance.

[0] http://supervisord.org/index.html?highlight=init

I'm a little confused here. Especially considering that the docker site lists supervisord explicitly with instructions on how to use it ....

see: https://docs.docker.com/articles/using_supervisord/

There's no problem with running supervisord, inside or outside of docker, there's a problem running supervisord as an init (pid1). You can run openrc or systemd or s6 or runit or minit or whatever and use supervisord on top of that, you just shouldn't use supervisord directly as pid1.

Sysv heavyweight?

It's got a reasonable amount of stuff in it. The Debian version of sysvinit (v2.88) is about 10,000 lines of C code. Probably not a huge deal, but it does more than just reaping zombie processes, if that's really all you want.

The actual init command is only about 3000 lines, the rest is for auxiliary programs that you probably wouldn't need for Docker. And 10k lines is still tiny compared to Python :)

Compared to solutions like runit, it is.

>“As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes.”

To the author->

Do you know how many processes a kernel can handle ?

  root@lisa2:~# cat /proc/sys/kernel/pid_max
  32768


This is actually a flexible limit;

  # echo 1000000 >/proc/sys/kernel/pid_max
  # cat /proc/sys/kernel/pid_max

But why would you run your containers with a ticking time bomb?

Suppose a service has a timed routine that spawns a child every 10 seconds. Unfortunately, this child has a bug and starts leaking zombies to PID1. Congratulations, with a pid_max of 32768 you now have approx 91 hours until your entire system stops being able to spawn new processes.
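
A back-of-the-envelope check of that timescale, assuming pid_max is the only limit in play and every leaked zombie pins exactly one PID. Other limits, such as a per-user process limit, can be lower, so treat the result as an upper bound rather than a prediction. The function name is made up.

```python
def hours_until_pids_exhausted(pid_limit, seconds_per_zombie, already_used=0):
    """Hours until leaked zombies occupy every remaining PID slot."""
    remaining = max(pid_limit - already_used, 0)
    return remaining * seconds_per_zombie / 3600


# pid_max value quoted earlier in the thread, one leaked zombie every 10 seconds
print(round(hours_until_pids_exhausted(32768, 10), 1))    # 91.0
# the raised limit from the echo example
print(round(hours_until_pids_exhausted(1000000, 10), 1))  # 2777.8
```

Raising pid_max buys time in proportion, but it does not change the fact that the count only goes up until something calls wait.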

You could have a nuclear war near the datacenter where you host that broken software & container. I think it has a better probability to exist in a real world.

Murphy's law.

Do you know how many values a 32-bit integer can represent? Surely no one ever has to worry about overflows?

This is nice and all, but the way the authors wrote this article gives off this feeling of hubris that they've uncovered some critical yet technically obscure problem.

They haven't. This is Unix programming 101 right here. Even if the Docker community has been generally ignorant of this, the problem is not any less glaring or basic.

I'm not sure if my_init is really an "init system", though. I'd call it an "init daemon", since it handles the bare minimum of waiting on its children. A system would imply greater complexity.

Author of the article here. Of course it's a basic Unix programming 101... from our points of view! You and I think it's basic because we're so knowledgeable in this area. But this does not hold for the rest of the world. 99 out of 100 people that I've spoken to have absolutely no idea about this problem. I used to link to the Wikipedia article and thought that people would get it. They didn't. People started getting it once I gave 15 minute talks about the problem and describing things with diagrams. Lots of people gave me the suggestion that I should blog about it, and so I did.

There is no relative point of view here. This is simply an egregious educational failure where application developers have failed to understand the underpinnings of how their platform handles processes at a high level.

Then again, this is how the industry works, I suppose. You rediscover an old paradigm or basic property of your system that has been abstracted out, present it as new, deliver it in a shiny package and reap the fame.

Sorry, I was mostly ticked at how you presented this issue in such an alarmist and "problem-reaction-solution" tone. It's just humorous.
