> One of these is killing "zombie" processes that have been abandoned by their calling session.
That's funny terminology, isn't it? Killing a process usually means sending it a signal, typically TERM or KILL, that causes it to exit. But a zombie process is one that has already exited but hasn't yet been waited for by its parent, where its parent is either the process that spawned it or, if that process has died, the process with PID 1. This is usually referred to as reaping the zombie process, not killing it. AFAIK, a signal sent to a zombie process is simply ignored.
Or do the quotes around zombie imply a different meaning, such as "zombie-like"?
No, it's a zombie in the normal sense, the killing here is not sending it a signal but reaping zombie processes (in the sense of personified death reaping souls) by waiting on it.
Things would probably be clearer if the quotes were around "killing" rather than "zombie"; perhaps the interviewer/writer was unfamiliar with the terminology.
> Poettering says that PID 1 has special requirements. One of these is killing "zombie" processes that have been abandoned by their calling session. This is a real problem for Docker since the application runs as PID 1 and does not handle the zombie processes. For example, containers running the Oracle database can end up with thousands of zombie processes.
Why does Poettering keep claiming this when he's the one who submitted the patch that adds the PR_SET_CHILD_SUBREAPER prctl(2) [0] functionality?
That doesn't have anything to do with Poettering's quote.
PR_SET_CHILD_SUBREAPER moves the ownership of an orphaned process to whichever process was selected rather than the default PID 1, and that only works for descendants of the subreaper.
The problem pointed out by the quote is that normal software doesn't go around checking whether it has zombie children and waiting on them, so in a container with random software S set as PID 1 and creating subprocesses, zombies may accumulate until resources are exhausted[0].
PR_SET_CHILD_SUBREAPER is a way to cause that problem on a system with a proper init (or to test that your init works properly without needing to boot into it)
Yes it does. He's claiming that systemd should manage the container processes as pid1, because systemd will then clean up the zombies. But anything that reaps zombies can be pid1 -- systemd isn't special in this regard. And even if you did use something that didn't reap zombies as pid1, you could leverage PR_SET_CHILD_SUBREAPER as some other non-pid1 process to grab zombies for descendants it spawns.
If you do use PR_SET_CHILD_SUBREAPER, then you need to reap whatever gets reparented to you; if you don't do this then the process table will eventually fill up with zombies. He is correct that few programs do that, but there's nothing that requires that to be done by pid1 if all the processes within the container are spawned by something that provides that functionality and uses PR_SET_CHILD_SUBREAPER.
> He's claiming that systemd should manage the container processes as pid1, because systemd will then clean up the zombies.
The part that's quoted only notes that PID 1 is responsible for reaping orphaned zombies, that Random P. Application Process most likely doesn't do that, and that this causes problems.
> But anything that reaps zombies can be pid1 -- systemd isn't special in this regard.
The part you've quoted doesn't try to claim otherwise.
> And even if you did use something that didn't reap zombies as pid1, you could leverage PR_SET_CHILD_SUBREAPER as some other non-pid1 process to grab zombies for descendants it spawns.
That's a completely inane claim; the whole point of the article is the issue of people starting their application process as PID 1. What are you suggesting, that applications should be modified to spawn an init which would use PR_SET_CHILD_SUBREAPER and to which they would delegate spawning subprocesses? That's utter lunacy. Have some decency and regard for basic sanity and the context in which the quote appears.
> If you do use PR_SET_CHILD_SUBREAPER, then you need to reap whatever gets reparented to you; if you don't do this then the process table will eventually fill up with zombies. He is correct that few programs do that, but there's nothing that requires that to be done by pid1 if all the processes within the container are spawned by something that provides that functionality and uses PR_SET_CHILD_SUBREAPER.
Are you just making that hare-brained bullshit up on the spot so that you don't have to admit your original comment was wrong?
What's the point of spawning a broken PID 1 just so you can spawn a process that uses PR_SET_CHILD_SUBREAPER and does the actual reaping correctly? Just spawn that as PID 1 in the first place, FFS.
What do you mean "the thing that spins up the actual container"? The root process for the container? It's already PID1. The external process creating the container? It's sitting outside the container and "below" PID1, what could that do that'd be of any use?
I guess he's saying that you can't just take any random binary and run it in a Docker container, because if that binary spawns a lot of children but does not wait for them, then you'll have a lot of zombies.
Docker could run a minimal pid1 in each container to address this. Though if this had been a big issue I guess this would have been already fixed.
Naturally, a proof of concept of the problem would be great. (Let's say a Dockerfile.)
It has been a reasonably big issue. E.g. I kept seeing zombies with Consul for a while until we realised that every single Consul Docker container on Dockerhub just had Consul run as pid 1 in the container (this was a while ago, no idea if that's still the case), without realising that Consul health checks could then end up as zombies if you weren't very careful about how you wrote them (typical example: spawning curl from a shell script, with a timeout on the health check that was shorter than any timeouts on the curl requests).
It's usually fairly simple to fix. E.g. for Consul above, I raised it with the Consul guys and they said they'd look at adding waiting on children as a precaution (it's just a couple of lines); people building containers could also introduce a minimal init, or you can write your health checks to guard against it. But it happens all over the place, and people are often unaware and so not on the lookout for it, and it may not be immediately obvious.
The reason I raised it as an issue for Consul, for example, even though it wasn't really their fault but an issue with the containers, is that people need to be aware of the problem when packaging the containers: they need to be aware that a given application may spawn children, and that it may not wait for them. Even a lot of people aware of the zombie issue end up packaging software that they didn't realise was spawning child processes that could end up as zombies (in this case, it took running it in a container without a proper pid 1, using health checks, which not everyone will do, and writing the health checks in a particular way in order to notice the effects).
Thankfully there are a number of tiny inits. E.g. there's suckless sinit [1], Tini [2], and here's a tiny proof-of-concept Go init [3] I wrote (though frankly, sinit or Tini compiled with musl will give you a much smaller binary), as what little you actually need to do is very trivial.
Seeing how even the trivial pid1 "scripts" solve the problem, it's truly baffling why Docker doesn't have a --with-reaper flag.
Also thanks for the Consul example, it makes it much, much easier to see the issue and argue for a general solution. (So not every random app/project/service/daemon has to implement pid 1 functionality.)
> Seeing how even the trivial pid1 "scripts" solve the problem, it's truly baffling why Docker doesn't have a --with-reaper flag.
That doesn't fix the issue, since you need to know about the issue and accept that it exists; at that point you can just as easily use one of the micro-inits available.
The alternative is to enable it by default, but now you've broken BC for the weirdo who actually expects orphan processes to be adopted by the root process they're starting.
Yes, the problem is that we would need to change the default behavior of Docker, which many people and scripts expect to be stable. It's a case of interface stability vs. functionality improvement. So far interface stability has won. I personally think it would be better to change the default, but anything that breaks an interface, even a subtle implicit one, has the burden of arguing a solution, thinking through migration issues, submitting patches... So far I have seen a lot of drive-by criticisms and dismissal of the need to even discuss the tradeoff (see for example this lovely fellow: https://lwn.net/Articles/677419/). But I have not seen anyone stepping up to do the work.
I'm not worried about that. When operations people have problems they are rather quick to search and try solutions. But baking it into the Dockerfile is much more portable and automatic (from the operations point of view).
Also see https://github.com/Yelp/dumb-init, a 20K statically built executable, perfect for resource-constrained containers that have to deal with reaping arbitrary children.
Just to clarify: even with a proper init, if a process spawns children and doesn't wait on them, you still have zombies until the parent either dies (allowing init to inherit the zombie, at which point it waits on it), or the parent waits. This is the reason behind the double-fork trick.
Seems like the way Fedora is packaging systemd for 24 is going to move systemd-nspawn to a level of maturity that will likely surpass some of the clunky issues folks have with running docker.
rkt isn't built around systemd. It does use it internally and integrates well with it.
Inside of rkt there is an internal logical separation between the tool that sets up the container filesystems and the one that executes them. We call those things stages[1].
Now inside of rkt we have a few different "stage1" options today:
- systemd: this means that your container has a real init system
- clear containers: execute the container inside of a virtual machine with lkvm.[2]
- direct execution w/ fly: no init system is involved for special privileged containers.[3]
If someone wanted to contribute a stage1 that used a different init system that would be great. But, today systemd works fine and is generally an implementation detail. We also get some bonuses by using systemd on systemd systems like machinectl integration, and journald integration.
Also, I should note that rkt should work on non-systemd systems as well. Again, because, systemd is an internal detail.
Why the systemd hate? Because it's a big monolithic project that takes over your system? You do realize that Docker is much more monolithic and opinionated than systemd, right?
When you ask questions, especially leading ones, it causes a good deal of confusion around the topic at hand. The reasons behind this are complex, but they have something to do with our tendency to double bind each other.
Someone has the right to say why something is "disqualified" for them, even if it is devoid of context. What is awesome here is that the leading expert for this topic is replying directly to the negative (empty) opinion and actually presents a (rich) alternate opinion.
How does you asking unanswerable questions contribute to resolving the conversation to something we can all learn from?
Regardless of their nature, questions are definitely a burden. However, I think the way some questions are put can cause a disproportionate amount of burden when they contain hidden meanings or agendas.
If someone has issues being direct and uses techniques to "hide" how they feel about something in a question, they effectively load the question with intent. I think sometimes those questions can be viral in nature, causing angry memes like what they mention in "This Video Will Make You Angry": https://www.youtube.com/watch?v=rE3j_RHkqJc
Logic would dictate that we should learn to avoid questions which cause excessive amounts of processing with little return in their answers. A simple way to filter on these is to ask if the question conflicts itself when answered in a given way.
Yes, but I will never have to use docker if I don't want to. For that matter, docker doesn't try to be cron, it doesn't want to handle mounting, it didn't subsume udev, and it doesn't encourage other projects to link against it and drop all compatibility with non-Linux systems. Systemd does, did, and is doing that right now.
Install deps, configure it, done. Looks straightforward, if a tad undocumented.
The thing is, the inside of a Docker container is by definition indistinguishable from a regular Linux install. Discourse isn't dependent on Docker, not the same way GNOME is dependent on systemd: it just encourages you to use it.
I'm curious what you mean by debug. If you mean monitor all of our apps, send metrics, health checks, and logs over the wire, I'm sure that is independent.
What would docker allow you over the unikernel especially given the best practice push for docker images to only run one thing in a container?
IMO with unikernels, Xen (i.e. the hypervisor) is the container holder instead of Docker.
With a Docker container, I can exec into it and run strace, ltrace, gdb, etc. With a unikernel it all depends on what you have built into the unikernel. That might provide everything I need, or not. The issue is that we will need a lot of tooling to put unikernels on a sufficiently equal footing vs. being able to run decades' worth of Linux tools directly in the containers.
The issue I have with that is that the tooling you mention, while stable and mature, is actively being replaced by cloud tools, because you really can't just debug a single machine in production when you have a cluster... not to mention it is production, so debug symbols might not even be available.
I understand your point about the maturity of the tooling, but I see it as a serious failure if you have to log into a machine in production and run gdb or, IMO, any tool. Your app can and should provide healthchecks/monitoring so that you can see if there is a problem (this includes even a thread stack dump).
I'm probably just biased and jaded as I have had some serious technical debt lost to Docker. It just feels like a VM on top of a VM on top of a VM of continuous things to break/learn... I want baremetal :)
> you really can't just debug a single machine in production when you have a cluster
Somehow I ended up debugging, tracing, monitoring and even hotpatching individual machines in the cluster. Yeah the easy problems will show up in the monitoring and logs. The harder ones won't.
That must have been a pain in the butt :) And for sure you're right, there are always exceptions.
I guess I haven't run into those issues, probably because I run JVMs, but I suppose if you have native code, or an interpreter using native code, I can see how it would be helpful to just SSH in and figure out what the issue is.
Now that I recall, I have actually had to SSH in a bunch of times because of Rackspace network interfaces randomly failing, so I am a big hypocrite :)
It wasn't a huge pain. It was Erlang, so I could do those very easily. But it still had to be done by logging into a few machines and poking around. I can't imagine if that was somehow a bunch of C code combined with a kernel running in the same memory space.
Opinionated, yes. Monolithic, no. Huge mess of everything that deeply integrates in any system — of course not, your containers don't need to know anything about Docker and host system, you are absolutely free in choices. It's even possible to run (gasp!) multiple services with supervision inside Docker.
That's the first time I've ever heard anybody argue that Docker isn't monolithic. It does everything inside its single daemon executable. Compare against Rocket, which doesn't use a daemon and uses separate executables for different tasks and stages.
Docker is monolithic, probably more so than it should be, but designed quite well, allowing for things like triton to exist. Systemd is monolithic, but does far more than docker, and really more than it should.
Rocket is not limited to Linux containers. It can also run in VMs. Incidentally there are also solutions to run Docker containers in VMs (Rocket itself should be able to, but there are others as well, like hyper.sh).
That is true, however jails lose a lot of power without VIMAGE. VIMAGE is not enabled by default yet, and it's pretty unstable (I use it). I really wish VIMAGE would mature, but we're not there yet.
Technology isn't that important here; it's what stems from it that is: adoption, infrastructure, tools, community.
There's also a price for doing it differently. And believe you me, using BSD nowadays is the definition of doing things differently. What for? What am I getting for losing my time and reinventing tools that are already available and much more polished?
Dockerfiles can be replicated. Docker Hub can be replicated. Missing things can be compiled from sources, probably. But all that costs time for little to no gain.
You do realise that "different" platforms can still be popular enough to have a lot of the same ecosystem. Hell, Docker itself used to be the "alternative", and not even that long ago.
FreeBSD might only have a fraction of the community that Linux does, but that's still a pretty large number of developers and sysadmins in real world terms.
There are also some interesting systems built on top of jails. I'm starting to use iocage, which uses both ZFS and jails to manage and create systems with great ease. Better or worse than Docker? It's different. It does some things Docker does not, and vice versa. I'm currently in the evaluation stage, and it's likely I'll be using it to replace hand-built jails running PostgreSQL, builds, and other single-purpose tasks.
Well if someone is competent enough to create a Docker image then it's not a great stretch to assume many of them would also be competent enough to create a jail. And FreeBSD is just as easy to use as Linux (actually, I generally find it easier to administrate than Linux since things are more rigorously laid out. But a lot of that is also down to my own personal preference).
At the end of the day, both Jails and Docker are well documented. So even the people only interested in blindly copying and pasting commands should be able do the basics.
The real problem FreeBSD and jails face isn't support or accessibility but dumb prejudice. Much like why many Windows users think Linux is difficult: if you spend your whole time shouting that your way of doing things is the best, then you're never going to give anything else a fair chance.
You're missing the point. Even if jails is the most elegant, easy, and powerful technology in the world, it's still on BSD. People are not going to switch to BSD just because of jails. Sure, they can use both, but why?
Docker and linux containers in general made things easier and more accessible for many. Switching from that to jails doesn't make sense.
> You're missing the point. Even if jails is the most elegant, easy and powerful technology in the world it's still on BSD. People are not going to switch to BSD just because of jails.
We're not just talking about jails though. The OP was discussing systemd + containers. Switching to FreeBSD to escape systemd isn't that weird an idea, since most of the same Linux software will also run on FreeBSD. In fact I'm seriously considering switching my Debian 7 (Wheezy) servers over to FreeBSD rather than upgrading to Debian 8 (Jessie) and having to deal with systemd. Anecdotally, I've read of other people considering switching away from Linux as well.
You might like systemd and Docker. That's great. But that's your personal preference. You shouldn't be so surprised that other people might prefer to run the same software on a different platform.
On a tangential note: I also have a bunch of existing systemd systems - RHEL servers, ArchLinux desktops, etc. - that I'm very happy with and intend to keep running Linux. I make this point just to emphasise that I'm not anti-Linux nor a FreeBSD fanboy, just someone who's platform agnostic.
> Sure, they can use both, but why?
Why not? This isn't a sports team where you're expected to only support one product. It's quite possible to use multiple different technologies based on whatever fits a specific purpose better.
Have you used both jails and Docker? Saying docker is a much more polished option doesn't match my experience. Yes, it has more features, but with these features come more bugs. Some of them are minor annoyances. Some are of the "our infrastructure is fucked until we switch our storage driver" kind.
I'm not comparing Docker and jails, I'm talking about things around. Stackoverflow.com coverage, blogs posts and guides, github repos, publicly available images, etc.
> And believe you me, using BSD nowadays is the definition of doing things differently.
... although people like Florian Haas argue that the way that one does things on the BSDs is actually the way that makes sense to do things with Docker, as well, and the way that you think to be "different" is actually the sensible way overall.
Choosing BSD instead of Linux is "doing things differently". I'm talking about popularity and what it means in this whole comment tree, not about technical merits.
That's basically the argument to use Windows and Windows-based technology and not Linux. Everything you can do on any of the UNIX boxes, you can do with Windows. It might be different, but it is still a more popular / supported platform.
Since I'm not a Windows fan, I find value in doing it differently, and so do Linux fans. I think you will find that FreeBSD and SmartOS users find the cost in time brings a large enough gain to satisfy their business requirements.
> That's basically the argument to use Windows and Windows-based technology and not Linux.
Not today it isn't. Years, maybe decades ago it could be.
What Linux containers do is help to remove the barrier that various distributions introduced, it makes things more accessible and it's more lightweight than using virtualization. Centos, Alpine, Ubuntu, whatever. As long as it is in a container I can work with it. I can even run some Windows binaries with Wine inside a container. rkt is largely compatible with Docker infrastructure, so that too is fine.
But what jimktrains2 is suggesting is the complete opposite of that: it reduces options.
> But what jimktrains2 is suggesting is the complete opposite of that: it reduces options.
Personally I'd say that familiarity with more than one platform increases one's options.
But ultimately, having other solutions on the market is a good thing. Not only because no one solution is the best on every metric (be it stability, security, speed, memory usage, or any specific requirement), and not only because different solutions appeal to different personalities, but mostly because different solutions might solve a problem in a unique way that the competing solutions may not have considered, often in ways that can be ported back and thus benefit the competitors and the wider community.
So I wouldn't be so quick to dismiss anything that's outside your field of expertise.
What you're talking about is a luxury I can't afford. That's what I mean when I talk about the price. There are plenty of far more important technologies that I would rather learn.
I can't afford to learn two technologies that do about the same thing, of which one is significantly less popular, might not have tools the other has, and runs on a significantly less popular OS.
That's actually a fair point and one I completely sympathise with. But your original argument very much sounded like you were suggesting that people in general shouldn't bother with FreeBSD, jails, nor any other technologies which weren't dominant in their respective field. Which is why we disagreed.
They don't have to embrace the change; they have other options. What bothers me is that they choose to shoot down anything and everything that has systemd in it, out in public. Just leave the fight, embrace your choice, allow others to have their choice, and don't pee on their parade.
Exactly, that is your choice. I am concerned about the drama around systemd by people that don't like systemd always peeing on anything and everything that mentions systemd.
It is the old rpm vs deb, vim vs emacs, python vs perl drama of the past.
No, it isn't. Nobody is going to try to nuke my emacs install, and install vi on my machine instead. If vi started deleting emacs, or using non-utf8 encodings on all text for some reason, or otherwise made using emacs impossible, the vi developers would fix it, or the community would say, "What the FUCK!?" and probably fork it.
The concept of the community is an abstraction and in this case a bad one. There is no community. There are a million different individuals and within that thousands of communities each composed of some subset of those individuals.
There is no reason each subset, or each individual even, shouldn't have their own opinion and base their actions upon it.
The very foundation of Open Source / Free Software Movement is 100% community. The very foundation of Closed Source is "There is no community." Community isn't an abstraction but is what has built Linux.
To quote RMS (whom I disagree with most of the time but highly respect):
>Tens of millions of people around the world now use free software; the public schools of some regions of India and Spain now teach all students to use the free GNU/Linux operating system. Most of these users, however, have never heard of the ethical reasons for which we developed this system and built the free software community, because nowadays this system and community are more often spoken of as “open source”, attributing them to a different philosophy in which these freedoms are hardly mentioned. http://www.gnu.org/philosophy/open-source-misses-the-point.e...
> There is no reason each subset or each individual even shouldn't have their own opinion and base their actions upon it
100% my point: move to your choice and don't pee on systemd every time it is brought up. Your opinion is different than the majority's in regards to systemd; you can use those options without having to discount everyone else's choice.
I don't always run containerized applications, but when I do, I prefer them completely systemd-free, thank you.
Sometimes I wonder if systemd is actually part of a big plan to move everyone to microservices and containers and maybe even unikernels — anything, just anything without this abomination.
Can you explain your position to me? I can understand somebody who dislikes systemd and dislikes docker. I can understand somebody who likes both systemd and docker. But disliking systemd but liking docker? That I don't understand. Any effective criticism of systemd that I've heard generally can also be applied to docker.
Like yours: "I wonder if systemd is actually a part of big plan of moving everyone to microservices and containers and maybe even unikernels" works even better if you replace systemd with docker.
Docker is just a toolkit for composing and networking layered OS images. It improves isolation and adheres to simple principles (immutable containers, restarting instead of attempting to recover, etc.). It structures things better. Inter-container communication is deliberately simple (env variables and, recently, networking).
Systemd spits on isolation, it embraces integration of everything. Supervision, logging, communication, IO, configuration, state management — everything goes through systemd. Everything is binary and opaque. Docker is transparent.
Your criticism of systemd still applies to docker. "Supervision, logging, communication, IO, configuration, state management — everything goes through docker"
If I use systemd I have to type 'journalctl' to get at my logs, or I can use a plugin to move them somewhere else. If I use docker I have to type 'docker logs', or I can use a plugin to move them somewhere else. Etc., etc.
P.S. Agree completely with your praise of Docker. I'm firmly in the 'love both systemd and docker and wish they got along' camp.
Wrong. Docker manages containers, and only containers. 'docker logs' shows you the logs from your containers. Docker never tried to make me run my non-container logs through 'docker logs'. You know what else docker never tried to be? cron. Or udev. Or consolekit. Or init. It just tries to manage your containers.
I think this is a little bit off. You're looking at them from two different perspectives.
Docker wants to manage your containers. And in that regard it is one monolithic daemon that manages everything about your containers.
Systemd wants to manage your computer and things related to init. It is a bunch of modular, but strongly integrated, pieces that manage everything about your init and process management.
>>It is a bunch of modular, but strongly integrated, pieces that manage everything about your init and process management.
OH REALLY? Well, can I just run systemd-udevd without systemd? How about journald? No? Well then, if everything depends on one massive daemon, it isn't very modular, is it?
Not soon. You can't run journald without systemd now, and you won't be able to run systemd-udevd without systemd as soon as kdbus gets merged, which the systemd devs are pushing for heavily.
Your information is out of date with respect to systemd-udevd, per https://news.ycombinator.com/item?id=10518933 and others; and your assertion about journald is simply wrong unless something has changed very recently.
Isolation and integration are not opposites. You can increase isolation (through e.g. judicious application of cgroups, which systemd encourages and makes use of) while also increasing integration (e.g. sharing APIs).
Docker is to me far more problematic when it comes to integration. It is trying to make everything go through itself, without providing a fraction of the scheduling and management capabilities that systemd does. One of the motivations for Rocket is exactly that it allows for a far less monolithic experience than Docker, even when integrated with systemd.
Integration is all well and good, I suppose. Until you put all your integration into a single process, which will kill your system if it ever crashes. Then I will have some words to say about you. Especially if you ever, ever, EVER, EVER, EVER, use assert, in ANY situation you could ever potentially recover from.
I disagree with your position on assertions and much prefer Bryan Cantrill's, because it results in bugs actually getting fixed:
> Hope is not a strategy, including for your software. If your state has become corrupt, it is incumbent upon you to die and donate your body to science, where it can be debugged.
Well, yes, in that case you should die, but take steps to make sure that your state doesn't become corrupt in the first place. Like, for instance, doing less, and making your service less complex.