Your Docker image might be broken without you knowing it (phusion.github.io)
147 points by jballanc on Feb 18, 2014 | 90 comments

The comments about the init process are true. It makes sense to run a proper PID1 system here such as runit.

I'd argue with the rest of the post. The problem is that Phusion makes the common mistake of thinking of containers as faster VMs. That's fine, this is where almost everyone starts when first looking at the Docker paradigm.

A good rule of thumb is: If you feel like your container should have Cron[1] or SSH[2], you are trying to build a VM not a container.

VMs are something that you run a few of on a particular computer. Containers are something that you will run thousands or tens of thousands of on a single server. They are a lot more lightweight and loading them up with VM cruft doesn't help there.

[1] Cron: use the cron of the outer machine with docker run
[2] SSH: use lxc-attach

I am not sure whether it is a common mistake to see containers as faster VMs. The way we use Docker at Phusion is on one side as sandboxed environments to run our integration tests in, and on the other side as isolated containers to run long running services in (like apps, databases) to lessen administrative overhead. Both use cases work best in an environment that's as much like a fully functional OS as possible.

Anyway, what's wrong with using containers as ultra light weight VMs? Isn't the whole idea of the recent VM upmarch to use VMs as application containers?

As long as you keep in mind that the security of LXC hasn't been thoroughly battle-tested yet, I think it's a fine idea to use Docker to build light weight VMs.

The reason that you want a VM is because it reduces the attack surface for a process, and sandboxes it effectively. The administrative overhead gets cut down because you can more easily snapshot it, and it reduces crosstalk between components. It removes access to data that other processes should not have access to, in a way that is tedious and difficult to do with file system permissions. It removes the ability to communicate with services running on localhost, increasing security that way. And so on.

Containers do the same thing, without the overhead of a VM. The advantage of a VM in administrative overhead isn't because it's a VM, but because it has restricted access. Tossing in all the other crap that a VM has into a container because "it's like a VM" is simply cargo cult programming.

No one is tossing crap into containers because "it's like a VM". Maybe you should pay attention before calling other people cargo cult programmers.

Also, security is not the main reason people use VMs; if it was, virtualization would not have been so popular.

>The reason that you want a VM is because it reduces the attack surface for a process, and sandboxes it effectively.

Well, that's why you want a container too.

From an operations standpoint VMs are just heavyweight containers.

The difference is that although containers may have the same function as VMs, they are also crazy fast. I don't see anything wrong with using them as VMs.

I disagree with running cron outside the Docker container. One of the reasons for using Docker is to lower deployment pain. The moment you use cron on the host machine you've introduced yet another moving part, and yet another dependency that must be installed on the host.

I also disagree that there is a mistake here involving thinking of containers as faster VMs. Yes, you can think of them as applications, but the fact is still that there is a whole operating system running inside the container, and that many apps rely on cron and other stuff. Given that crond is so small and lightweight, and that a lot of people don't know what depends on cron and what not, I think it's better to turn it on by default. If you know for sure that you don't use cron, you can still turn it off.

Remember, the goal of baseimage-docker is to provide a base image that is correct for most people, especially people who are not intimately familiar with the Unix system model.

Lxc-attach, although it works, has several problems:

* You are doing things outside Docker so you won't be able to track it (logs, attach, etc). Also, Docker might use LXC right now, but there is no guarantee it will do so forever. For example, what if you're using Docker on OS X? No lxc-attach there.

* It does not allow you to limit access. What if you want to give a person only access to a specific container? You can do that with SSH through the use of keys.

* lxc-attach has caveats with --elevated-privileges, documented in the man page.

Just run another container that runs cron.

Regarding lxc-attach: this is a command that Docker should expose to allow network operations. Even with baseimage-docker you won't be able to track it because all you can do is attach/log sshd. Also, how do you manage all these SSH keys and ports on a single host?

Well, they are both: a container is a weaker VM as far as capabilities go, but it also has more abilities than a single isolated OS process.

There is no need to blackbox the concept unnecessarily. It is an LXC container; read about LXC containers, their capabilities and restrictions. If you are building it into the core of your infrastructure, don't trust simplistic mental analogies from HN.

Then Docker works on top of that. Read about what Docker adds or restricts on top of the base LXC containers.

I also think Docker is a little too wishy washy in their marketing and there are enough people who get confused about what exactly is going on underneath.

From the slides "More technical explanations: Run everywhere, regardless of kernel version [+], regardless of host distro, physical or virtual, cloud or not [o]

[+] kernel version must be 2.6.32+ [o] container and host architecture must match. "

No mention of LXC containers, and that is a "technical" description. Why not just put in a mention of the underlying core technology? Sure, it might change tomorrow, but then just update the description.

They do mention LXC on their site. It isn't necessary to harp on it everywhere. I don't see that people are confused about what is going on.

I thought it was clear that Docker intended to abstract some of the platform-specific details like using LXC, and focus on making a standard interface for encapsulating and managing services.

I find it misleading to talk like Docker is just LXC, or a shell script for running LXC, with no contribution of its own.

> I don't see that people are confused about what is going on.

I can see it in these discussions and in personal conversations. Yeah one can say "their fault didn't read docs".

> They do mention LXC on their site.

Where? Click on "Learn More". Switch through slides, get to "Technical Details", nothing there. Click on "The whole story". In a 10-20 page "super-detailed" technical explanation it is mentioned twice in passing.

> I thought it was clear that Docker intended to abstract some of the platform-specific details like using LXC

Clear to you and me, not clear to quite a few people.

> and focus on making a standard interface for encapsulating and managing services.

And that is my beef with it, selling complete black boxes and then have people wonder if there are unicorns inside, is it a VM, is it a container... If they just mentioned exactly what their technology is based on, it would make it easier.

Are they afraid they'll get sued by LXC creators? I don't understand why go to great lengths beating around the bush about it.

> I find it misleading to talk like Docker is just LXC

I think it is even more misleading to not talk about the core underlying technology and then say what they add on top of it.

> And that is my beef with it, selling complete black boxes and then have people wonder if there are unicorns inside, is it a VM, is it a container... If they just mentioned exactly what their technology is based on, it would make it easier.

Docker isn't "selling" a technology, though, they're selling a container standard. Right now, the containers people are making are all Linux-ELF-x64 ABI containers, so they're run using LXC. Future containers, though, could run on Solaris zones, or BSD jails, or OSX sandboxes, or full VMs instead. And they'd still be "Docker containers."

That's the thing Docker is trying to get people interested in--pluggable units of backend-irrelevant computation. LXC is just how that's implemented for the Linux-ABI containers, which are their MVP.

>Containers are something that you will run thousands or tens of thousands on a single server.

Citation needed. Who'd run "thousands of containers" on a single server? Some VPS service? I don't think Docker is meant to be used the way those services use Xen and the like.

Fascinating. My first inclination, when I started running Docker, was to run /sbin/init and launch a full systemd and all services.

I even asked on ServerFault (i.e., StackOverflow for servers) about it and was told, quite aggressively, that running a full OS is wrong:


Addressed individually:

1. Reaping orphans inside the container.

Yup. If your app's parent process crashes, its child processes may now be orphans. However in this case your monitoring should also restart the entire container.

2. Logging.

Assuming you run your Docker image in a .service file (which is what CoreOS uses as standard), systemd-journald on the host will log everything as coming from whatever your unit (.service) name is. So if you `systemctl start myapp`, output and errors will show up in `journalctl -u myapp` in the parent OS.

3. Scheduled tasks.

For things like logrotate, it really depends whether you're handling logs inside or outside the container. Again, I'd use systemd-journald in CoreOS, rather than individual containers, for logs, so they'd be rotated in CoreOS. For other scheduled tasks it depends.

4. SSHd

It depends. SSH isn't the only way to access a container, you can run `lxc-attach` or similar from the host to go directly to a container.

I do mention CoreOS here because that's what I use, but RHEL 7 beta, recent Fedoras, and upcoming Debian/Ubuntus would all operate similarly.

Regarding reaping orphans: orphans do not necessarily imply that something crashed or that something went wrong. Orphans are a very normal part of system operation. Consider an app that daemonizes by double forking. Double forking is a common technique with the intention of making the process become adopted by init. It expects PID 1 to be able to handle that correctly.
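To make the double-fork point concrete, here is a minimal Python sketch, assuming a Linux/POSIX system. It is not taken from the article, and the function name `double_fork_and_report` is made up for illustration. The intermediate process exits right away, and the grandchild reports which process the kernel re-parented it to (normally PID 1, or a subreaper if one is configured).

```python
import os
import time


def double_fork_and_report():
    """Return (intermediate_pid, parent pid the grandchild sees after re-parenting)."""
    read_fd, write_fd = os.pipe()
    intermediate_pid = os.fork()
    if intermediate_pid == 0:
        # Intermediate process: fork the long-lived worker, then exit at once.
        os.close(read_fd)
        my_pid = os.getpid()
        if os.fork() == 0:
            # Grandchild (the "daemon"): wait until the intermediate process
            # has exited and the kernel has handed us to a new parent.
            while os.getppid() == my_pid:
                time.sleep(0.01)
            os.write(write_fd, str(os.getppid()).encode())
            os._exit(0)
        os._exit(0)

    # Original parent: reap the intermediate process, then read the report.
    os.close(write_fd)
    os.waitpid(intermediate_pid, 0)
    new_parent = int(os.read(read_fd, 64).decode())
    os.close(read_fd)
    return intermediate_pid, new_parent
```

The original parent never waits on the grandchild, and it cannot: once re-parenting has happened, reaping that process is the job of whoever adopted it. Inside a container, that adopter is whatever runs as PID 1.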

Regarding logging: that only holds for output that is written to the terminal. There is a lot of software out there that never logs to the terminal and logs only to syslog.

As for all the other stuff: it is up for debate whether they should be handled inside or outside the container. The right solution probably varies on a case-by-case basis. Baseimage-docker's point is not to tell everyone that they must do things the way it does. It's to provide a good and useful default, for the people who have no idea that these issues exist. If you are aware of these issues and you think that it makes sense to do things in another way, go ahead.

> Consider an app that daemonizes by double forking. Double forking is a common technique with the intention of making the process become adopted by init.

But that's daemonization - if you don't daemonize (because your container is itself a daemon) you won't need it.

Ack re: syslog. The stuff I'm writing now uses stdout, and I let systemd do the work. If I had some proprietary app that needed a local syslog then I'd run that in the container. But I'd more likely ask the vendor if I could just disable the syslog requirement and either get stdout or journald for things like multiple field support etc.

The reason I'm writing is because the post gives the impression that everyone is likely to be 'doing it wrong'. There are definitely some things to consider, but I don't think that impression is accurate.

Correct, but how do you know your app never spawns anything that may daemonize? If you wrote your app yourself and know all your dependencies and how everything behaves, then fine. But what if you're simply packaging someone else's app? What if you're building a database server container? Have you read every single source line to be sure that it will never behave that way and leave behind zombie processes?

That is what baseimage-docker is about. It's about providing a sane, safe default that behaves correctly. There are also other ways to make your system behave correctly of course.

As for "giving the impression that everything else is wrong": that is not the intent of the article. The title is only to catch the reader's attention and to encourage the reader to continue reading and to understand the edge cases. The title says other things MIGHT be wrong, it does not say that they ARE wrong. So let me make it clear: any solution which solves the edge cases and issues described in the article is correct. Baseimage-docker is one solution, the one provided by us. It is not the only possible solution.

Good point and accepted re: database example.

I've only been working with Docker for a couple of months, and I find this discussion really interesting. The goal of trying to get containers to behave more like a full system across various lifecycle events is somewhat orthogonal to my own aims, which have been to get my containers as close to stateless as I can.

Like some other posters here I view containers less as a lightweight VM, and more as a process sandbox. In the context of a scalable architecture I would like a container to represent a single abstract component, which can be spun up (perhaps in response to autoscaling events), grabs its config, connects to the appropriate resources, streams its logs/events out to sinks, reads and writes files from external volumes, and runs until it faults or you shut it down.

Ideally there would be nothing inside the container at shutdown that you care about. After shutdown the container, and potentially the instance it was running on, disappear. Spinning up another one is a matter of launching a new container from a reference image.

So far, in cases where I have needed daemons running in the container, I have pointed my CMD at a launch script that starts the appropriate services, and then launches the application components, typically using supervisord. That has worked fine, but I admit to not understanding the PID1 issue well enough up to this point.

Baseimage-docker does not imply that your container becomes stateful. Using services like cron and SSH does not imply statefulness.

I also think that the container should be as stateless as possible. When state is necessary, it can be saved in a bind-mounted directory.

The point of baseimage-docker is to ensure that the system works correctly. See its description about the role of PID 1. It has got nothing to do with the statefulness discussion.

Agreed, and a fair point. Statefulness is not a consequence of using Baseimage-docker, and I didn't mean to suggest it was. A clearer way to put it is perhaps to say that aiming for a container that is as simple and stateless as possible makes the "problems" outlined in the OP seem less compelling to me.

Take the syslog example. If I am starting processes that log to syslog, and I want syslog running because I care about those messages, then I should be taking steps to ship those messages out of the container, otherwise I am creating state that has to be preserved across system lifecycle events to be of any use. If I am pursuing a stateless container then I will not be blindly running things that create state without deciding how to handle it. Along those lines if you are pursuing this kind of design you want to have a good handle on everything that's running, and what state it produces. I don't know everything that's running and producing output in a full Linux installation. I'm sure I could figure it out, but it seems to me that Docker's minimalistic approach makes it easier to draw lines around this stuff.

The OP implied that you could design what looks like a solid container and that it might yet be broken in ways that aren't obvious. I'm very eager to know if that's really the case, as I am considering deploying some production components using the tool.

So far the system services argument doesn't seem very compelling to me. I haven't run into any issues launching services from scripts at container start. Examples would be logstash, redis, supervisord, etc. It could be very convenient to have an image already configured with a proper init system, but I am not sure that it is fixing anything that is broken.

I don't have enough experience to get deeply into the PID1 issue. All I can say is that I haven't run into any problems. I can't say, for example, whether everything is shutting down cleanly in all cases, but the way I build my containers I don't care that much. Unless I go back in for specific diagnostic reasons a container only gets started once.

Correct, fully agreed with what you said about syslog. But that's not the problem that baseimage-docker is trying to solve. Suppose that you're building a Docker container, and something fails. Nothing on stdout and stderr. You decide to look in /var/log/syslog, but there's nothing there either. You scratch your head. If only you knew that /var/log/syslog only works if the syslog daemon is running. That sort of thing is what baseimage-docker solves. Whether you want to ship logs outside the container, that's up to you.
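As an illustration of the "nothing in syslog" failure mode, here is a small Python sketch of an application that only uses syslog when a syslog socket is actually present, and otherwise falls back to stderr so messages are not lost silently. This is an assumption-laden example, not something baseimage-docker does: the function name `make_log_handler` is hypothetical, and `/dev/log` is simply the conventional socket path on Linux.

```python
import logging
import logging.handlers
import os
import stat
import sys


def make_log_handler(socket_path="/dev/log"):
    """Return a syslog handler if a socket exists at socket_path, else a stderr handler."""
    try:
        is_socket = stat.S_ISSOCK(os.stat(socket_path).st_mode)
    except OSError:
        # The path does not exist or cannot be inspected: no syslog daemon here.
        is_socket = False
    if is_socket:
        return logging.handlers.SysLogHandler(address=socket_path)
    # Fall back to stderr, which `docker logs` can capture.
    return logging.StreamHandler(sys.stderr)
```

A program would attach the result with `logging.getLogger().addHandler(make_log_handler())`. The check only tests that the socket file exists; it does not prove that a daemon is reading from it.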

Right now I am building a web app in a Docker container. The web app is written in Rails, hosted by Nginx and Phusion Passenger. To make setup as easy as possible for users, the container also contains PostgreSQL. I run Nginx+Passenger and PostgreSQL at the same time by hooking them both on runit. The init system in baseimage-docker ensures that a 'docker stop' properly shuts down both Nginx and PostgreSQL.

Cross-distro support notwithstanding, why not just skip Docker, LXC and VMs? Instead, use cgroups on bare-metal to make processes behave. On that note, forget bridging, use SR-IOV virtual functions with VLANs for QoS and _Profit_.

Edit: It seems this comment has been voted down. I think perhaps this is seen as irrelevant, but I would disagree, because Docker uses LXC and masks its function in much the same way as LXC uses cgroups and masks their function. cgroups can be used to achieve similar goals without these many layers of abstraction. In this way, I believe this comment to be relevant to the discussion of full vs. application containers on Linux. There are certainly many reasons for using containers, but one of the leading reasons is process limits (e.g. RAM, network namespace). Limiting process usage of those resources, using only cgroups, is quite easy in comparison to all Phusion has gone through here to achieve something with similar (though admittedly different) aims. Example: http://www.andrewklau.com//controlling-glusterfsd-cpu-outbre...

Edit 2: I would also appreciate constructive criticism. That is, I've been downvoted without useful feedback. Specific feedback as to what is wrong with my comment would enable me to contribute more constructively to this discussion. Without such feedback, I believe the downvote can be seen as a simple and tribal "go away".

Constructive criticism, about how this sounds: it isn't clear that what you propose is actually more valuable than using Docker. It sounds like it's complex and requires a lot of manual intervention. It doesn't sound like your alternative covers Docker's use cases.

Your idea may need to be more fleshed out, but at a minimum it needs to be explained in a way that makes it clear why most users of Docker would see a significant benefit to use your approach instead.

> There are certainly many reasons for using containers, but one of the leading reasons is process limits

On the other hand, the main reason people tend to use Docker, as far as I know, is not anything to do with quotas or limits; it's guaranteed reproducibility of deploys. (The same thing you get on Heroku with "slugs", etc.)

The layers of abstraction in Docker result in something that's a lot simpler to use than managing cgroups manually. Also, I suspect people care more about the cross-distro support and configuration isolation that Docker provides than resource management.

My experience with cgroups is that it's incredibly difficult to get them to do what you want them to do. But systemd seems to be changing that, so maybe their use will get more mainstream soon.

I think those many layers of abstraction are something many see as a feature, not a problem. I certainly appreciate the clean abstraction Docker provides to LXC.

You really should not run ssh in your containers. If you have a ton of containers then key management and security updates of SSH will be a pain. There are two tools that can easily help out:

- nsenter lets you pick and choose what namespaces you enter. Say the host OS has tcpdump but your container doesn't. Then you can use nsenter to enter the network namespace but not the mount namespace: sudo nsenter -t 772 -n tcpdump -i lo

- lxc-attach will let you run a command inside of an existing container. This is lxc specific I believe and probably not a great long term solution. But, most people have it installed.

I disagree with the premise that using Docker to run individual processes is "wrong". Phusion is doing a disservice by suggesting as such. There ARE use-cases where such a base-image is useful, but I believe these should be the uncommon case, not the common one. Even then, if running multiple processes in a container is needed, it's preferable to use Docker-in-Docker.

I suppose part of the problem is that the two benefits of Docker and containerization are frequently confused. Docker provides portability and build bundling, but ALSO provides loose process isolation. You should want to take advantage of that process isolation and by doing so, should want to run SSH or cron in their own containers, not in a single container with your application process. If your application has multiple processes, each should have its own container. These containers can be linked and share volumes, devices, namespaces, etc. Granted, some of the functionality one might desire for this model is still missing or in development, but much of it is there already and that's the model I hope Docker will follow.

It might also be to some degree a matter of legacy versus green-field applications. For instance, I've been deploying OpenStack's 'devstack' developer environment (which forks dozens of binaries) inside of a single Docker container. In this case, the Phusion base-image might make sense. However, the proper way of using Docker would be to run dozens of containers, each running a single service.

The reason I don't do this is because the OpenStack development/testing tools provide this forking and enforce this model, using 'screen' as a pseudo-init process. From the Docker perspective, this is a legacy application. I could and probably will change those development tools to create multiple containers, but until then, it's easiest to stick to a single container.

> I disagree with the premise that using Docker to run individual processes is "wrong". Phusion is doing a disservice by suggesting as such.

This is not the premise of the article. The premise is that someone goes 'from ubuntu; apt-get install memcache; cmd ["memcached"]' and thinks everything is going to be alright, when in reality they've just set up a rather buggy system.

If you're absolutely certain your app is going to be fine running as the sole (PID1) process in the container, then this article has no problem with that. It just says that if you're going to run something you've got from apt-get, then chances are, your system is going to have to be a little more like a Debian system.

Yeah, Phusion oversells their case when they say this is the "right way" to do it. It's one way to do it, and this methodology probably addresses customer support issues they are having. Many of their customers likely misunderstand what's actually going on in a container by default.

I'd rather see a more balanced approach that shows a range of options, without opining so much about how Docker containers should or should not be used. Better to fit the solution to a particular use case.

It will work but things are addressed on the wrong level in my opinion.

syslog: each container now has its own logs to handle. If you want them to be persistent/forwarded it might be better if all containers could share the /dev/log device of the host (not sure of the implications though).

ssh: lxc-attach. Docker should expose that.

zombies: it's a bug in the program to not wait(2) on child processes.

cron: make a separate container that runs cron.

init crashes: bug in the program again. It's possible to use the host's init system to restart a container if necessary.

lxc-attach: see https://news.ycombinator.com/item?id=7258242 about why I think SSH is more appropriate.

Zombies: this is not about child processes created by the program. It's about child processes created by child processes! For example what if your app spawns another app that daemonizes by double forking? Your PID 1 has to reap all adopted child processes, not just the ones it spawned.

Then it's a bug in the child process. Turtles all the way down. Also, double-forking is a hack that should burn in hell.

EDIT to the reply below: It's still a design issue but I agree that it's not always practical to change existing software. A small PID1 wrapper that reaps zombie processes and execs the target program would be a good middle-ground.

It's not. Most apps rightfully expect that they're not PID 1, and that the real PID 1 takes care of that sort of stuff. Only in a container does it happen often that your totally-not-designed-to-be-a-PID-1 app, actually is PID 1.

What if you're creating a PostgreSQL container, and your init script spawns a daemon, after which it exec()s the PostgreSQL server process as PID 1? The daemon then spawns a few processes that fork a few times. PostgreSQL only waitpid()s on its own postmaster worker processes and so those other processes become zombies. Are you telling me that PostgreSQL is broken and that you have to patch PostgreSQL?

I think using a proper init system, and running PostgreSQL under it, is a much saner view on things. The small wrapper that you mentioned is exactly the /sbin/my_init provided by baseimage-docker.
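The "small wrapper" idea discussed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it is not the actual `/sbin/my_init` from baseimage-docker, the function name `run_as_init` is hypothetical, and signal forwarding is deliberately left out. It starts the real program as a child and then waits on every child that exits, which includes orphans that were re-parented to it because it is PID 1.

```python
import os


def run_as_init(argv):
    """Start argv as a child, reap any child that exits, return argv's exit code."""
    main_pid = os.fork()
    if main_pid == 0:
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed

    while True:
        # os.wait() returns whichever child exits next, whether we spawned it
        # or the kernel handed it to us after its own parent died.
        pid, status = os.wait()
        if pid == main_pid:
            if os.WIFEXITED(status):
                return os.WEXITSTATUS(status)
            return 128 + os.WTERMSIG(status)
        # Any other pid was an adopted orphan; the wait() call above already
        # reaped it, so it no longer lingers as a zombie.
```

The design choice here is to stop as soon as the main program exits. A fuller init would also forward signals and shut down any remaining processes before returning.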

It may be a matter of opinion but advocating to run cron, sshd, and so on in your containers, let alone in every single one by providing a base image to do that seems plain wrong.

Let's take an example. You have Nginx, a web app, and a database. You can put everything in the same container or not. If you choose to put everything in different containers, you will be able to use tools at the Docker level to manage them (e.g. replace one of those processes).

And the fundamental idea is that we expect to have plenty of Docker images around that you can pick and play with, and those Docker-level tools will be able to manage all those things.

Now if you put everything in the same container, you're back to square one, reinventing the tools to manage those individual processes. You can say that you don't need to re-invent anything, because you're used to full-fledged operating systems. Still, if you have a nice story to deploy containers on multiple hosts, to send logs across those hosts, and so on, the road will be more straightforward when you decide to use multiple hosts.

This is about uniformity. I want processes (and containers around them), and hosts, that's it. I don't want additional levels. I don't want processes, arbitrarily grouped inside some VMs (or containers), and hosts. Two levels instead of three.

Right, cron and sshd are open for debate, but at the very least you have to make your PID 1 behave correctly by reaping adopted child processes. That is a major part of baseimage-docker.

Baseimage-docker is not advocating putting everything in the same container. It's advocating putting all the necessary, low-level services in the same container. What if your app happens to use a library that needs to schedule things to run periodically using cron? To me it doesn't make sense to split that cron job to another container. The app might physically consist of multiple processes and components, but I think it should logically behave as a single unit.

For stuff like Nginx and the database, it's not so clear what is the right thing to do. It depends on your use case. I don't think that putting those major services in the same container is always correct (though it might be), but I also don't think that splitting them out to Docker containers is always the right thing to do.

You say that putting stuff in the same container puts us back to square one. I think splitting them puts us back to square one. Your base OS already runs all your processes as single units. You have to worry about each one of them separately, resulting in lots of moving parts that all increase deployment complexity. The beauty of Docker should be that you can group things. If you don't group things then why would you be using Docker? You might as well apt-get install your app and have it run as a normal daemon.

One use case where it really really makes sense to put everything in the same container: when distributing an app to end users who have little to no system administration knowledge. For example, what if you want to distribute the Discourse forum software? It depends on Rails, Nginx and PostgreSQL. Users are already having a lot of trouble installing Ruby, running 'bundle install', setting up Nginx and setting up PostgreSQL. Imagine if they can just 'docker run discourse' and it immediately listens on port 80, or whatever port they prefer, with the database and everything already taken care of for them.

I guess we both understand things well enough to know that limits to draw are not rigid. That being said, here is my take on what you say.

An app should logically behave as a single unit. I would say that's true, and that unit is a cluster of containers. Docker is not yet ready as a tool to manage clusters of containers, but I believe it will be. In the meantime tools like Fig or Gaudi are exploring the design space.

You say that having everything separate is back to square one, because you have to manage things separately. My opinion is to develop tools to manage clusters of containers, not to cram things to fit in a single one (I'm not being harsh, sorry if it sounds like it). If you use Docker to group things (at the container level, instead of at the cluster-of-containers level), what should we do if I want to share something with you (a program)? I can be nice and provide a Dockerfile, but you would still have to put it in your existing "logical single unit", thus losing the benefits of, e.g., dependency isolation.

The distribution case for end users is a good one, where the limits will depend on what you really want. For instance if you don't expect people to expand your app by adding additional processes, why not. But I think it is still a workaround for the cluster-level tool I keep talking about.

I am using Docker to create a cluster of containers (for https://reesd.com). Since the infamous cluster-level tool of my dream doesn't exist yet, I'm still relying on Bash scripts (because I feel like exploring my possibilities and don't want to start writing a solidified tool). The script is pretty simple: a bunch of `docker run -d`, saving container IDs and IPs around (this could be replaced by `docker inspect` and such).

Well that script is done so it can be run next to itself multiple times. So I can have multiple instances of the whole Reesd service on my laptop. To deploy it, I run the same script, possibly next to the live one. I have additional scripts to e.g. replace one specific container (say, the web app). So really when talking about uniformity, I want to be able to run Reesd on my laptop, or on multiple machines, and possibly side-by-side, using the same Docker features.
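The kind of script described above could look roughly like the following Python sketch. It is a guess at the general shape, not the poster's actual Bash script: the function names, the `prefix` naming scheme, and the image names are all hypothetical. It shells out to `docker run -d` and `docker inspect`, and takes a `prefix` so that several copies of the same cluster can run side by side.

```python
import subprocess


def run_detached(image, name, runner=subprocess.check_output):
    """Start `image` in the background and return the container ID docker prints."""
    output = runner(["docker", "run", "-d", "--name", name, image])
    return output.decode().strip()


def container_ip(container_id, runner=subprocess.check_output):
    """Ask docker for the container's IP address on the default bridge network."""
    fmt = "{{ .NetworkSettings.IPAddress }}"
    output = runner(["docker", "inspect", "--format", fmt, container_id])
    return output.decode().strip()


def start_cluster(images, prefix, runner=subprocess.check_output):
    """Start one container per (role, image) pair; return {role: (id, ip)}."""
    cluster = {}
    for role, image in images.items():
        container_id = run_detached(image, f"{prefix}-{role}", runner)
        cluster[role] = (container_id, container_ip(container_id, runner))
    return cluster
```

Calling `start_cluster({"web": "example/web", "db": "example/db"}, "staging")` next to an existing `"live"` cluster would start a second, independently named set of containers. The `runner` parameter exists only so the command construction can be exercised without a Docker daemon.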

A possibility that I haven't tried regarding your last paragraph is the docker-in-docker feature.

Very interesting viewpoint. Yes, if Docker performs cluster management right then that would change a lot of things. I see that the CoreOS guys released Fleet today, possibly in response to this article. I'll have a look at this later. https://news.ycombinator.com/item?id=7260596

Why not just use

CMD ["/sbin/init"]

And start your app through an init.d script?

The article says "upstart" is designed to be run on real hardware and not a virtualised system. If that is true, then perhaps there is value in baseimage-docker, but details are lacking.

So why don't you try it and see whether it works?

One of the things /sbin/init does is checking and mounting your filesystems. But you can't do that in an unprivileged Docker container because you don't have direct hardware access. This is only one example of where things go wrong. The entire init process is full of these kinds of code where it is assumed that there is direct hardware access.

Even when your container is started with -privileged, you still can't do that. The host OS is already controlling the hardware.

Also, /sbin/init usually does not like having SIGTERM sent to it, which is what 'docker stop' does. Depending on the implementation, /sbin/init either terminates uncleanly (causing the entire container to be killed uncleanly) or ignores the signal outright (causing the 'docker stop' timeout to kick in, also causing the container to be killed uncleanly).

It depends on the init system, however.

systemd makes an effort to ensure that running /sbin/init inside of a container works and can be detected by the software and services underneath it[1]. In general this means that if you take a copy of Arch or Fedora and try to run it inside of a container it works properly without any hacks.

For your own services you can also start to do the right things by using the virtualization detection code[2] that is built in. The most immediately useful one being: ConditionVirtualization=container and !container. With these directives you can tell your services to run or not run depending on whether you are in a container or on real hardware.

[1]: http://www.freedesktop.org/wiki/Software/systemd/ContainerIn... [2] http://www.freedesktop.org/software/systemd/man/systemd.unit...

ConditionVirtualization=container seems like a "with great power comes great ability to screw up in subtle and horrible ways" sort of feature, wherein when you need it, you really need it, but most of the time, a different approach will be vastly preferable.

Absolutely. This was in the context of the parent talking about doing things without certain privileges or skipping unnecessary steps.

Same with Gentoo's homebrew init system OpenRC, which has had support for running in various kinds of containers for a while now.

Or just read the source code. http://svn.savannah.nongnu.org/viewvc/*checkout*/sysvinit/tr...

Init doesn't care if the OS it's running on is physical or virtual at all. That's why stock init works on both virtual and physical machines. Init doesn't have to mount anything to work. In fact, it's common in embedded environments that init does basically nothing but reap children and start a single bash script.

For sysvinit, it's designed to never fail because it's supposed to keep the entire operating system functioning. But it can and does exit; how do you think your system shuts down? Sysvinit will respond to several different signals, but not SIGTERM. If you want it to stop you can use 'telinit 0' or 'telinit a/b/c' (which probably only works for users with privilege), send it SIGPWR or SIGINT, or use the /dev/initctl control channel.

Maybe that is how the init you linked works, but the docker instructions take you through the basics of setting up an Ubuntu image, where upstart is used (not sysvinit) and if you stray off the path, or try to upgrade the base image to a newer release, you find very quickly that upstart does not tolerate running in a container very well, and that a lot of things expected it to be running, you'd have to work around it. The OP is targeting Ubuntu users (or users of the 'base' image distributed in the docker registry.)

Not to mention, Docker places its own hooks in /sbin/init (still true?) so if you had an init that you wanted to run, you had better put it somewhere else inside of your image, because the file will be overridden when your app container starts.

This is a good discussion that starts in about the same place: https://github.com/dotcloud/docker/issues/223

Oh jesus that's a huge mess. So I guess it turns out Docker was designed to do a bunch of wacky things under the hood because they never expected users to use their tool in different ways. It looks like as a fix some people are running full-blown copies of Ubuntu under Docker (because somehow that's better than OpenVZ??)

I'm not sure how you infer this from what I said?

Even Ubuntu has plans to ditch upstart long-term, following Debian's lead to adopt systemd. Nobody is running "full-blown copies of Ubuntu" under docker, because full-blown would imply "with upstart" and using your distribution's built-in /sbin/init is precluded by the fact that docker overrides it, and also that Upstart doesn't work well in a container.

So people are running parts of Ubuntu without understanding that sometimes those parts had expectations of being child processes of upstart, and that upstart would always be running, when in docker-land that's not a reasonable expectation (and OP addresses this by adding in a more reasonable init, which probably handles most of these concerns.)

I just built the baseimage-docker, trying to get into it with ssh now, I can't imagine giving anyone a container and not letting them ssh into it, and I've added ssh to containers before in a way that I knew was non-supported or broken, so it's good to have this example. It looks like they are indeed doing it right.

I think that some time since the issue I linked you from ~11 months ago, they stopped overriding /sbin/init and started putting their hooks in /.dockerinit instead, since many folks tend to put something they care about in /sbin/init.

Here's the reference: https://github.com/dotcloud/docker/pull/898

So yes, there is some mess associated with changing your way of business over to doing containers, but a lot of these problems were solved one way or another over six months ago.

We're left arguing about the people who didn't catch the solution, don't know the expected way of doing things, haven't RTFM'ed or just aren't interested in making permanent solutions out of their docker containers. That is one of the strengths; docker lowers the cost of deployment (in test) when you can take an image that solves only the problem you care about and deploy it into a disposable container.

You will always get the occasional "I want to put this in docker" from someone who maybe isn't understanding, and of course some times that person is your boss. Then sometimes what they're asking for is perfectly reasonable like "let's use it for an SSH forwarding endpoint". Docker (and CoreOS) certainly put some hurdles in front of generally easy and perfectly normal ideas, for better or for worse.

Your first challenge in this case will be to get your container listening on a port 22 of some public interface somewhere, since containers are designed not to be exposed like that until you RTFM; you know whose fault it will be when something is badly configured after you just went and skipped ahead to the section on forwarding ports without spending time on the rest of the instructions at all.

They are definitely curating the GitHub Issues database, you have to give them that.

And as long as we're linking to the source code of init apps:


It isn't the core init system, the /sbin/init executable, itself that's causing problems. It's all the scripts that are part of the init system as a whole. Scripts that try to run fsck and mount volumes and populate /dev and stuff.

As for signals: ok it's great that it responds to signals, but that's not the point. `docker stop` sends SIGTERM. So your PID 1 must respond to SIGTERM and forward the signal to all other processes, period.
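To make the SIGTERM point concrete, here is a rough sketch of a wrapper that runs one command and forwards SIGTERM (and SIGINT) to it. It is not my_init's actual code. The function name is made up, and it forwards to a single direct child rather than to every process in the container.

```python
import signal
import subprocess
import sys
from typing import List


def run_and_forward(argv: List[str]) -> int:
    """Run argv as a child and forward SIGTERM/SIGINT to it.

    Returns the child's exit status (negative if a signal killed it),
    so the wrapper exits only after the child has stopped.
    """
    child = subprocess.Popen(argv)

    def forward(signum, _frame):
        # Pass the signal on instead of dying and orphaning the child.
        child.send_signal(signum)

    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, forward)

    return child.wait()


if __name__ == "__main__":
    status = run_and_forward(sys.argv[1:])
    # Shell convention: report "killed by signal N" as 128 + N.
    sys.exit(status if status >= 0 else 128 - status)
```

With something like this as the container's command, `docker stop` would reach the real service as a SIGTERM instead of hitting the timeout and a hard kill.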

That's an easy fix. Provide an empty inittab and a false mtab (or link it to /proc/mounts). Almost every stock install image for every distro does this.

Sounds like Docker has a design flaw :) You could alternately just write a shell script that sends SIGPWR or uses /dev/initctl and then use `docker wait`. It would be better if Docker included support for running a custom executable to stop your container. Maybe they'll add support for it once enough people run into problems like this.

I take back my comment. I am not familiar with Docker; I was confusing it with a raw lxc container, and when I setup an Ubuntu lxc image, I didn't have to do anything special for the init. That's probably because the Ubuntu image had worked out the kinks already.

No need to take it back, it was a fine question and good feedback.

Old systemv init would work for this purpose, but not upstart.

Docker is a container for running processes, or a process. Containers should be disposable and transient. I have begun to think of it in terms similar to OOP. Images are your Classes. Containers are your class Instances. When you are done with an instance, you discard it and make a new instance. So don't go shoving all kinds of crap into the instance like crons and sshd that don't belong there. Most devs don't expect to have their code be free of memory leaks when it comes to interpreted languages. And docker containers don't need to worry about child processes being stopped - they should just be disposed of and you make a new container from your image. Keeping containers around would be like trying to pickle a python class instance perpetually that has references to who knows what... Just make a new instance when you need it. And just make a new container when you need one. I use named containers and a Makefile that stops and deletes existing containers with the same name before starting a new one.
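The "stop and delete the old container, then start a new one" step can be sketched in Python instead of a Makefile. The container name, image name, and the `recreate` helper are hypothetical, and the `--name` spelling may differ on older Docker releases.

```python
import subprocess
from typing import Callable, List


def recreate(name: str, image: str,
             run: Callable[[List[str]], int] = subprocess.call) -> int:
    """Discard any existing container called `name`, then start a fresh one.

    Returns the exit code of the final `docker run`.
    """
    # stop/rm fail harmlessly when no such container exists, so their
    # exit codes are ignored on purpose.
    run(["docker", "stop", name])
    run(["docker", "rm", name])
    return run(["docker", "run", "-d", "--name", name, image])
```

The `run` parameter exists only so the command sequence can be checked without a Docker daemon; in normal use `recreate("web", "example/web")` would shell out via `subprocess.call`.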

To me, that does not make any sense. Your program executable is already like a Class. A normal process is already like instances of your classes. If all you want is OOP, then why are you using Docker? Your Unix system has been doing that for 30+ years!

The benefit that docker brings me is the ability to share containers with my team and to deploy those same containers to the server. The reason that I started thinking of it in terms of programming, like OOP, was to help me get my head around how to correctly use docker on a server. As I start to understand docker, finding the correct place for configuration, data, crons, debugging, and execution are all important and the closest paradigm that I can easily apply to it is OOP, even though it is not a perfect match. OOP can also help visualize how one image can inherit from another image and override both methods and configuration of that image so that instances will behave in a different manner.

I think there are a lot of assumptions being made here, particularly about your base image.

The Ubuntu base image (or how it's built) can be found https://github.com/tianon/docker-brew-ubuntu

Some excellent examples of how to use them with /sbin/init can be found https://github.com/tianon/dockerfiles/tree/master/sbin-init/...

Not everyone who uses Docker uses CRON, nor considers them long-term containers; rather short term process containers.

Docker is growing and how we use docker will change so be flexible and realize what you considered useful yesterday may not be required tomorrow. We will have to re-learn best practices and keep learning after that.

Note, the Ubuntu image isn't made by Ubuntu. Maybe Phusion should host their own Ubuntu image, just for the sake of it.

Is there an "explain Docker to me like I'm 5" post?

This seems like the old "I have problems with managing everything I need for my app so I'll just run docker containers. Now I have 2 problems"

The Linux kernel has some features that make it possible to isolate a process from all (most) system resources, without actually running it inside a VM.

Docker is a tool that makes it easy to launch such isolated processes. You just specify what the filesystem environment should be for the new process, and what process to run in a small file and off it goes.

In theory this could make provisioning easier, having each application come in a Docker container that satisfies its own system level dependencies, and does its service level dependencies over connections to other containers/external hosts.

The slideshow on this page gives a decent overview:


To understand Docker [0] in a complete way you first have to understand a lot of other concepts.

You need to understand what "linux containers" [1] are. To understand that you need to understand cgroups [2]. To understand cgroups you have to understand process groups/sessions [3] and namespace isolation [4], as well as how the kernel implements cgroups. [5] [6]

You need to understand what chroot [7] is. (it just prepends a path to all pathname lookups for a process and its children)
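As a toy model of that "prepends a path" description (not how the kernel implements chroot), the lookup can be sketched like this. The function name and the `/srv/jail` path are made up, symlinks are ignored, and relative paths are treated as if the working directory were the new root.

```python
import posixpath


def resolve_in_root(root: str, path: str) -> str:
    """Map a path as seen inside the chroot to the path on the host.

    `..` is clamped at the new root, so a process cannot walk out of
    `root` just by adding more `..` components.
    """
    # Normalise against "/" first: normpath collapses "/.." back to "/".
    inside = posixpath.normpath("/" + path.lstrip("/"))
    if inside == "/":
        return root.rstrip("/") or "/"
    return posixpath.join(root, inside.lstrip("/"))
```

For example, `resolve_in_root("/srv/jail", "/etc/passwd")` gives `/srv/jail/etc/passwd`, and so does `resolve_in_root("/srv/jail", "/../../etc/passwd")`.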

You need to understand what aufs [8] [9] is. To help understand aufs, you need to understand how union filesystems [10] work, how copy-on-write [11] works, and how virtual filesystems work [12] [13] [14].

You need to understand what rsync is [15]. To understand that you need to understand delta compression [16], and how rolling checksums [17] are used.
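For the rolling checksum part, here is a small sketch of an rsync-style weak checksum. It is a simplified illustration with hypothetical names, not rsync's code, and the real algorithm pairs this weak sum with a stronger hash to confirm matches. The point is that sliding the window by one byte costs constant time instead of re-summing the whole block.

```python
from typing import Tuple

M = 1 << 16  # both sums are kept modulo 2**16


def weak_checksum(block: bytes) -> Tuple[int, int]:
    """Compute (a, b) from scratch for one block.

    a is the plain byte sum; b weights each byte by its distance from
    the end of the block (first byte gets len(block), last byte gets 1).
    """
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * byte for i, byte in enumerate(block)) % M
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int,
         block_len: int) -> Tuple[int, int]:
    """Slide the window one byte: drop out_byte, append in_byte."""
    a_new = (a - out_byte + in_byte) % M
    b_new = (b - block_len * out_byte + a_new) % M
    return a_new, b_new
```

Rolling across a buffer and comparing each result with `weak_checksum` of the same window is an easy way to convince yourself the update rule is right.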

You should also understand things like how bash works, how processes signal each other and return status on exit, how file descriptors (like stdin/stdout/stderr) work, and other basic UNIXy concepts, to understand how other parts of docker work.

Docker is a frankenstein amalgamation of all these things, working together to allow you to basically run an arbitrary command in a way that is as isolated from your operating system as is practical while still remaining "light weight". Other solutions have other benefits or tradeoffs as evidenced here [18].

[0] http://www.activestate.com/blog/2013/06/solomon-hykes-explai... [1] https://en.wikipedia.org/wiki/LXC [2] http://blog.dotcloud.com/kernel-secrets-from-the-paas-garage... [3] https://en.wikipedia.org/wiki/Process_group [4] https://en.wikipedia.org/wiki/Namespace_isolation#NAMESPACE-... [5] https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt [6] http://blog.dotcloud.com/under-the-hood-linux-kernels-on-dot... [7] https://lwn.net/Articles/252794/ [8] http://aufs.sourceforge.net/aufs.html [9] http://blog.dotcloud.com/kernel-secrets-from-the-paas-garage... [10] https://en.wikipedia.org/wiki/Union_filesystem [11] https://en.wikipedia.org/wiki/Copy-on-write [12] https://en.wikipedia.org/wiki/Virtual_file_system [13] http://lwn.net/Articles/57369/ [14] http://www.win.tue.nl/~aeb/linux/lk/lk-8.html [15] https://rsync.samba.org/how-rsync-works.html [16] https://en.wikipedia.org/wiki/Delta_encoding [17] https://en.wikipedia.org/wiki/Rolling_hash [18] https://en.wikipedia.org/wiki/Operating_system-level_virtual...

If you happen to be a python dev:

Docker is basically like virtualenv, but for everything instead of just for one aspect of one language

So if you run anything other than Ubuntu inside Docker, this is useless because the steps to build your own aren't outlined.

I find Docker to be horribly counter-intuitive and ass-backwards anyway, so not much harm done there as people are in general better off with something else entirely (plain lxc, libvirt, virtualbox, xen, openvz...). I recommend to steer away from it at least until 1.0 is out.

EDIT: I put it in my .plan to build a better BusyBox image aimed at running statically compiled programs with minimal baggage, but I'm not sure when I'll get a round tuit*

*: http://i.ebayimg.com/00/s/NDgwWDY0MA==/z/z-4AAOxyUrZSr82N/$_...

Why do you think it isn't outlined? The website explains exactly what the modifications are, what they do, and what they are good for. The Dockerfile is on Github for everyone to see. The website makes explicit mention that the init system is /sbin/my_init, for which the full source is available on Github. It's trivial to take the my_init script and integrate it into your non-Ubuntu container. You can even write your own init system based on the website description if you so choose.

I think it is not related to Docker itself, but to the fact that it is using all-purpose Linux distributions. I'm pretty sure that very soon we will see an explosion of new distros addressing exactly these problems and built explicitly for running inside containers.

I used to think that that was what CoreOS was, but I have since become confused.

CoreOS is designed to host containers, not be a base container image.

How does this play with the CoreOS premise where each docker should be hosting a single process managed intelligently through something like systemd?

Under this model I'd expect that systemd's pgroup support should help with zombie processes and generally take over many of the services that baseimage-docker is suggesting here. As others have mentioned in this thread, there's a fairly large difference of opinion between running containers like fast VMs or like thin layers around single processes. Does baseimage-docker make sense only in the latter?

baseimage-docker is meant to make it easier to make a correct environment for the processes you run in it, so perhaps make it be more like a fast VM.

From what we've seen the CoreOS people and perhaps the Docker people as well like to see Docker more as a thin layer around processes, being managed by external services.

Off-topic, but I'd thought I'd screwed up my DNS for a moment and this article redirected to the silly side-project I've been working on: ipaidthemost.com.

I guess we borrowed the same template?

I'm pretty suspicious of using runit instead of Upstart: nobody tests Ubuntu with runit, and you're liable to get in trouble if you depend on some other service running on the machine. Although clearly it works well enough for them.

I also sort of suspect that the closer you are to running a full distribution in your containers, the less benefit you're getting from the containers.

Baseimage-docker uses runit exactly to not run a full distribution in the container. Upstart tries to boot a full Ubuntu. A full Ubuntu is not necessary inside the container. Therefore, baseimage-docker provides a custom init system that boots only the minimal subset of Ubuntu that is necessary for it to run correctly in Docker.

I was super stoked to read this, and went diving to borrow some of their work for my own Docker usage. However, I'm confused by their choice of Python for the my_init script. The site claims they chose runit because it is more lightweight than supervisord, a Python tool of similar merit. Making the init process depend on Python seems to negate that advantage.

It's not only Python that makes supervisord relatively heavy compared to runit. It's also the amount of code in supervisord (and its dependencies). my_init is only a single file, less than 300 lines, with minimal dependencies.

Baseimage-docker is also in a "minimal viable product" phase. We're still trying to tweak things until they're right. For example my_init recently received some features which are important in certain use cases; features which would have been much slower to implement in C.

In the future we may optimize things by rewriting my_init in C. Right now it's laziness on our part.

Based on the points raised in your article, I ended up poking my own Docker images to get runit working on them. It looks like runit is A) pretty snazzy, and B) capable of being the init daemon. Am I missing some trade-off or issue there, that prevents it from being PID1 in your scenario?

Yes. Runit does not correctly reap adopted child processes, which is why we run runit under my_init. My_init (or an alternative that behaves like my_init) is absolutely necessary for correct operation.

They just described implementing an OpenVZ VM.

"Note that the shell script must run the daemon without letting it daemonize/fork it. Usually, daemons provide a command line flag or a config file option for that."

fghack is an anti-backgrounding tool. http://cr.yp.to/daemontools/fghack.html

I've been trying to get some tools to run in a docker for a few days now. So far the problems have been that there isn't a convincing HOME folder and user, and that the locale isn't set (only explodes if there are unicode filenames, but there are plenty of those e.g. for SSL certs).

Does this script sort out those kinds of things?

No, you have to take care of environment variables yourself. Unfortunately, Docker does not inherit environment variables from the base image, so in every Dockerfile you have to ensure that the right environment variables are set.

That's not "unfortunate" -- it is by design. You use containers to minimize dependencies on the host.

What does minimizing dependencies on the host have anything to do with not inheriting environment variables from the base image?

Note: I'm not talking about inheriting environment variables from the host!

I don't understand the PID1 case. You are running a single process, why do you have to collect zombies?

In fact, I understand none of these points. This seems all very hard to relate to. These are containers and not VMs. Most of that stuff should run in a separate container.

Your single process might spawn child processes that double fork, resulting in zombies. Unless you've read every single line of source code in the app, plus every single line of source code in all its dependencies (and all dependencies of all dependencies), you really can't be sure that that won't happen. And when it does happen, your system is not behaving correctly.
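A rough sketch of the reaping half of the job, assuming a Python init along the lines of my_init (this is not its actual code, and the function name is made up). Orphaned processes get reparented to PID 1, and they stay zombies after exiting until PID 1 waits on them:

```python
import os
from typing import List, Tuple


def reap_children() -> List[Tuple[int, int]]:
    """Collect every child that has already exited, without blocking.

    Returns (pid, raw wait status) pairs. An init process would call
    this whenever it gets SIGCHLD, so adopted orphans never linger as
    zombies.
    """
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append((pid, status))
    return reaped
```

The `-1` matters: it waits for any child, including ones this process never started itself but inherited after their original parent exited.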

And what if your single process spawns a child process that encounters an error, and logs only to syslog? If your syslog daemon is not running, you will never know that there has been an error. Again, if you've read every single line and know that this does not happen, then that's fine. But the point of baseimage-docker is to provide a good and safe default so that these edge cases are already taken care of for you.

A lot of stuff isn't a single process. Like Phusion Passenger.

just use the ubuntu-upstart stackbrew image.. compatible with all the packages etc..
