Yes, I know it's much easier to just use Docker, but at the point where you have to write a new init, doesn't that strike you as a sign that you're using something that's not easy anymore?
Things like "make sure we don't lose syslog" make me shudder.
Yes, supposedly there is an argument that Docker is faster. However, as a build engineer I've seen little evidence of that.
We use VMware with throwaway build disks. This means that a dev can destroy a machine; once rebooted, it's back to a known default state.
The best part is that we have collectd on the VMware host and in the guest to collect metrics on CPU, memory, and IO.
Because we have decent disks and lots of memory, we don't hit any limitations. For heavy builds we have NFS-rooted servers with non-persistent disks. Much faster, and requires almost no thinking about.
People who criticize systemd fail to realize that it provides solutions for many problems under Linux, and even for problems with the Unix model itself, and has widespread support from companies, distributions, and other stakeholders.
The horrid shite that is journald. Whoever thought it was a brilliant idea to smoosh grep, tail, and vi together is someone who's never had to debug anything hungover.
Seriously, why in living fuck do I have to use your shitty tool to look at system logs? The whole point of Linux is that you're supposed to be able to use anything to open anything. Sure, it's a stupid idea to use most tools, but that's not your decision.
Seriously, do I have to learn a completely new syntax to find everything that happened at 10:23:04 three days ago? Everywhere else I can use vim/nano/emacs/awk/grep/sed.
But here? No, I'm too stupid to understand those tools; here, use this one. We've thought of every use case, and if you're having trouble it's because you're either a hater, an idiot, a sysadmin, or too old.
You can tell that the people who write this shit are not the poor fuckers who have to tune, support, debug, migrate, or use it.
So no, I'm not going to "get behind systemd" until journal is fit for purpose.
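For the record, the journalctl incantations for that kind of query look roughly like this (the timestamps are made up, but the flags are real):

```shell
# Everything logged in a one-minute window around 10:23:04 three days back
journalctl --since "2015-01-10 10:23:00" --until "2015-01-10 10:24:00"

# Filter by originating command, then hand off to the familiar tools
journalctl _COMM=sshd -o cat | grep "something" | tail

# Or export the journal as JSON/plain text so awk/sed/vim can have at it
journalctl -o json --no-pager > /tmp/journal.json
```

Whether remembering `--since`/`--until` beats grepping a text file is, of course, exactly the disagreement here.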
Look, my job is to implement new ideas. I run a datacenter that has 15 PB on tier-1 storage and 28,000 CPUs.
The one thing I care about most is this: getting home to see my children.
The only time I look at logs is if something terrible has gone wrong. Everything else is done by sourcing metrics directly from the components (collectd + Graphite).
We do ship logs, and yes, we have Elasticsearch, but that's far too heavy and far too high-latency to be useful.
So the only real time that I look at logs is when something is horrifically broken, or we can't figure out from the graphs what has gone wrong.
At that time, I do not want to have to remember new syntax to get at what should be text files. (Yes, signed logs, but let's be honest, that's why you ship logs.)
If I got more functionality, then yes I'll be onboard.
However, I can't use bash to talk to the journal; it only has a C API, which smacks of it being designed for phones and super-integrated HPC, not the 95% use case.
Because it uses binary blobs (and yes, I do love binary, just not for logs), it's a pain in the arse to inspect using a scripting language.
`if journalctl _COMM=sshd | tail | grep -q "something"; then`
really nice. Also, what the fuck is with starting a switch with an _? An underscore indicates a hidden function. Seriously, why is it different from every other log inspection tool?
So instead of a 350-line Python script, you think systemd is better? I disagree.
Is having _one_ way to do things that works everywhere so bad? Having a container manager do things I wouldn't expect, like have issues with zombies or dropping syslog messages, makes me never want to touch it.
1) The problem is hugely exaggerated. It only applies to badly written software. And "make sure we don't lose syslog"? Seriously? Just mount /dev/log into your container (it's a Unix socket, after all) and you're done.
2) Arguably faster? Try booting up 15 instances of app X. With Docker, that's achieved in less than a second. On VMware this has a huge memory and CPU overhead, not to mention boot-up times.
3) Try versioning VMs. Good luck. I have 100% reproducible, versioned images generated and tested in our CI system, which I can easily deploy in a VM on my laptop where I will have 100% the same setup as in test/qa/prod. To deploy the new version: start the new instance, then point the load balancer/proxy/whatever to it (happens automagically, btw). Everything goes well? Stop the old instance. Something goes wrong? Point the proxy back to the old instance and kill the badly behaving one. If everything went well: zero downtime. If you detect something going wrong, it's nearly instant to switch back to the previous version. Also, testing this is a lot easier than with VMs. Sure, you could do that with VMs too, but the overhead is massive and it takes a lot more time.
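On the /dev/log point: since it's just a socket on the filesystem, a bind mount is all it takes. A sketch ("myimage" is a placeholder for your own image):

```shell
# Mount the host's syslog socket into the container; anything the
# containerized processes write to /dev/log reaches the host's syslogd.
docker run -v /dev/log:/dev/log myimage \
    logger "hello from inside the container"
```

`logger(1)` writes to /dev/log, so the message should show up in the host's syslog.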
My ideal setup would be a hypervisor infrastructure with a bunch of VMs running only Docker applications.
You should take a look at https://github.com/docker/machine and https://github.com/docker/swarm , they were designed with exactly that goal in mind!
For automatically updating the load balancer, check out docker-gen (https://github.com/jwilder/docker-gen). I hacked together a setup that detects the latest version and points to it when a container comes up or goes down. It assumes a lot of things specific to my setup though (versioning of the containers, env vars, ...).
You could literally use bash as the init process if you do not care about shutting down cleanly (or are willing to have a more complicated shutdown procedure than `docker stop`).
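A minimal sketch of that idea (the `sleep` stands in for a real service; nothing here is Docker-specific):

```shell
#!/bin/bash
# Hypothetical bash-as-PID-1 sketch. Launch the "service" (a sleep here),
# forward SIGTERM so `docker stop` still works, and let plain `wait`
# reap every child, orphans and all, before exiting.
sleep 0.2 &
main=$!

trap 'kill -TERM "$main" 2>/dev/null' TERM INT

wait    # blocks until all children have been reaped
echo "all children reaped"
```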
There are other differences but if all you want from your system is reproducibility then VM images work fine of course.
One slight difficulty is that some applications do want to be PID 1. While 99% of the ecosystem wants the problem of being PID 1 to be handled natively by Docker, the remaining 1% (people who write init systems, or platforms tied to a particular init system) emphatically wants the opposite. So we need to make sure both modes of operation are available, and deal with a dilemma: do we change the default behavior, potentially breaking images expecting to be run as PID 1? Or do we keep a default which breaks 99% of applications, requiring an explicit setting to opt into the correct behavior?
Anybody with opinions on the matter is welcome to voice them on #docker-dev / freenode, or the dev mailing list, we'd love to hear from you.
I think Phusion's base image solution is overkill (its own init written in Python 3, and it forces users to use runit). It would be great if I could use systemd inside a container (I am a CentOS 7 user; using systemd is quite easy for most packages, e.g.: yum install httpd; systemctl start httpd), but it requires --privileged. Now I am considering s6 as a solution, based on this article: http://blog.tutum.co/2014/12/02/docker-and-s6-my-new-favorit...
But I think it would be best if Docker solved this problem itself; then I could freely use my familiar tools like monit.
And to be honest, the problem is exaggerated and not worth fixing at the Docker level. Programs launching child processes have to handle SIGCHLD properly anyway, PID 1 or not. This is only a problem with badly written software.
We run syslog and logrotate inside the container (using a volume) and it works fairly well for us. Plus supervisord comes with a pretty kickass tool - supervisorctl - and the conf file is fairly well documented.
Optimizing the RAM/CPU usage of supervisord and rsyslogd is not very critical - they don't use much.
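For anyone curious, a minimal supervisord.conf along those lines looks something like this (program names and paths are placeholders for your own services):

```ini
[supervisord]
nodaemon=true                   ; stay in the foreground so Docker sees us

[program:rsyslog]
command=/usr/sbin/rsyslogd -n   ; -n = don't daemonize
autorestart=true

[program:myapp]
command=/usr/local/bin/myapp    ; placeholder for the real service
autorestart=true
stdout_logfile=/var/log/myapp.log
```

Then `supervisorctl status` inside the container shows both programs at a glance.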
I actually learned a lot from baseimage - but don't know why they went with runit rather than supervisord.
Would really, really, REALLY love to learn to do this with systemd.
EDIT: ahh.. forgot that their script was based on runit. My mistake - but I don't think I will change over from supervisord anytime soon.
It was also much easier to write the tool in Python than in C. Writing it in C would have taken much longer, and at the end of the day we wanted to get work done instead of worrying about a few MBs, which only take 1 second to download and which cost $0.001 of disk space.
In particular, you want to avoid memory allocation so it cannot fail even in low-memory situations, which you cannot guarantee with Python. Otherwise your init might die while reaping children under memory pressure, which would kill your container.
Particularly when there are simple init scripts like  which could easily be adapted to cover both the reaping and restart problems.
Writing my_init took much more time than you think, and I do not regret choosing Python. my_init not only handles process reaping, but also environment variable handling and a bunch of other stuff. Writing text parsers in C is a huge pain (libc has no regexps that are as easy to use as Python's).
I do not oppose rewriting it in C, but given limited manpower and the tiny benefits it's not high on my todo list either. If you are worried about resource usage, then I would very much welcome contributions for rewriting it in C.
Or if you want something small, have a look here for a minimal PID 1: "So how should init be done right?" http://ewontfix.com/14/
The environment variable handling and other stuff could perhaps be handled in a separate bash-script (or Python), such that the scope of the actual reaper is sufficiently small (to use a lower-level language).
There comes a point where you have to take a look at the diminishing returns and say "That's good enough."
(Mine doesn't pass on signals, but that's not important for us.)
If you use Docker containers as a form of lightweight immutable infrastructure you won't care about zombie processes because you won't be restarting stopped containers. Instead you just `docker run` every time you want to run a new container. The example of a corrupt file is moot because your containers shouldn't be writing to the file system with the intention of that file being around for long. This also applies to databases, which I'm not fond of running in containers at all (setting aside highly distributed databases, perhaps).
As an example, suppose we have a simple web server that executes CGI scripts written in bash (I know this is kinda contrived, but it illustrates my point). I do a request to the server, and it runs a bash script. The script runs grep. While grep is executing, the web server decides bash has taken too long and SIGKILLs it. Then grep is adopted by your PID 1, in this case your web server. But the web server doesn't know it's there, and so never calls wait(). You now have a zombie process that will last until you stop the container.
More seriously, the issues raised in this article are real, and I do believe the author has done a fine job breaking down the UNIX process hierarchy and defined responsibilities.
I'd say it's a shame, as this is yet one more important stability/scalability/security/performance item the Docker team has not thought of (or maybe they're busy writing more ecosystem apps), but I do believe this is a container problem at large, seeing that the official App Container Spec has not addressed it either.
For those criticizing the need for "yet another pid 1", well, there doesn't seem to be any way around it that can guarantee the same outcome. Yes, it's understandable having reservations about Python being the tool of choice for such an init system, and perhaps C would be a better choice allowing more fine-grained control and guarantees. My guess is the article author was most familiar with Python and wanted to have a POC for the write-up, which is satisfactory.
A simpler and better solution for this problem is: do not let apps run in background (double-fork/detach) in containers. Most applications that background by default can override this with a command-line flag.
And certainly don't run programs that execute other programs that background. Keep it simple, and try to limit yourself to 1 process per container. Sure there might be situations where this cannot be avoided, but in 99% of the situations - this is not a problem.
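A few real-world examples of such foreground flags (assuming these particular daemons; check your own service's docs):

```shell
# Most daemons have a "stay in the foreground" switch:
nginx -g 'daemon off;'          # nginx
httpd -DFOREGROUND              # Apache httpd
/usr/sbin/sshd -D               # OpenSSH
redis-server --daemonize no     # Redis
```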
Yes, you have to be aware of this potential problem, but I am currently running tons of containers, of which only a few run supervisord (which handles reaping too) to run multiple processes within one container, because that was the sane thing to do. Stale zombie processes on my Docker hosts? None. These precautions, in my opinion, are only required for badly behaving software. And yes, there is software like that.
Everything I could find was this ticket:
Second, zombie processes that don't go away can also interfere with software that check for the existence of processes. For example, [the Phusion Passenger application server](https://www.phusionpassenger.com/) manages processes. It restarts processes when they crash. Crash detection is implemented by parsing the output of `ps`, and by sending a 0 signal to the process ID. Zombie processes are displayed in `ps` and respond to the 0 signal, so Phusion Passenger thinks the process is still alive even though it has terminated.
Would there be anything to be gained by actually rooting around in /proc yourselves, rather than relying on ps?
Having said that, /proc/xxx/status does contain a flag that tells you whether a process is a zombie or not. So on Linux, Phusion Passenger parses that file as an extra check. But not all software does this.
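That check is easy to reproduce from a shell on Linux. A sketch: the `sh -c` below forks a short-lived child and then execs into `sleep`, which never calls wait(), so the child is left as a zombie we can spot in /proc:

```shell
# Leave a zombie behind: the parent becomes `sleep 2` (same PID) and
# never reaps its exited child.
sh -c 'sleep 0.1 & exec sleep 2' &
holder=$!
sleep 0.5

# Walk /proc, find the holder's children, and print their state line.
for status in /proc/[0-9]*/status; do
  if grep -q "^PPid:[[:space:]]*$holder\$" "$status" 2>/dev/null; then
    grep '^State:' "$status"      # the zombie shows up as: State: Z (zombie)
  fi
done
wait    # clean up before exiting
```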
> It shares some of the same goals of programs like launchd, daemontools, and runit. Unlike some of these programs, it is not meant to be run as a substitute for init as “process id 1”.
There are micro-inits you could use instead though, s6 should be runnable as pid1 for instance.
To the author:
Do you know how many processes a kernel can handle?
root@lisa2:~# cat /proc/sys/kernel/pid_max
# echo 1000000 >/proc/sys/kernel/pid_max
# cat /proc/sys/kernel/pid_max
Do you know how many values a 32-bit integer can represent? Surely no one ever has to worry about overflows?
They haven't. This is Unix programming 101 right here. Even if the Docker community has been generally ignorant of this, the problem is not any less glaring or basic.
I'm not sure if my_init is really an "init system", though. I'd call it an "init daemon", since it handles the bare minimum of waiting on its children. A system would imply greater complexity.
Then again, this is how the industry works, I suppose. You rediscover an old paradigm or basic property of your system that has been abstracted out, present it as new, deliver it in a shiny package and reap the fame.
Sorry, I was mostly ticked at how you presented this issue in such an alarmist and "problem-reaction-solution" tone. It's just humorous.