Size is such a tiny concern. I'm surprised people make such a big deal about it. When all of your images use the same base, it's only a one-time cost anyway.
And there are FAR more important concerns:
- Are the packages in your base system well maintained and updated with security fixes?
- Does your base system have longevity? Will it still be maintained a few years from now?
- Does it handle all of the special corner cases that Docker causes?
Sorry - but the phusion images are unnecessarily bloated. Their existence has been defended by 'fixing' many so-called problems that are actually no problem at all - or at least shouldn't be a problem if you know what the hell you're doing. No, well-written software won't spawn zombie processes - sorry. Reaping dead child processes is something pretty basic if you're using fork().
And then - a logger daemon. Guess mounting /dev/log into a container is too complex if you care about this?
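It's one bind-mount (a sketch; "myapp" is a placeholder image name):
# hand the host's syslog socket to the container, so the app's syslog() calls reach the host's logger
docker run -d --name myapp -v /dev/log:/dev/log myapp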
Logrotate - sure, useful - but if you care about logs and aren't sending them to your logger daemon or /dev/null, you probably want to store them externally - in a volume or mounted host directory - and have a separate container taking care of that.
The ssh server... Containers are not VMs; if you have to log in to a container running in production, you're doing something wrong - unless that container's only job is running SSH (which can be useful, for example, for Jenkins build slaves).
Cron - again - same thing: run in a separate container and give access to the exact things your cronjob needs.
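Roughly like this (the image and volume names are hypothetical, and "crond -f" assumes a busybox-style crond in the image):
# a tiny container whose only process is crond, with access to nothing but the app's data volume
docker run -d --name myapp-cron --volumes-from myapp mycron-image crond -f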
That is for me the essential thing about containers: separate everything. But sure, you could treat containers as a special VM only for one service - nobody is going to stop you. I however prefer isolating every single process and explicitly telling it how to communicate with other processes. It's sane from many perspectives: security, maintainability, flexibility and speed.
A container is whatever you want it to be. Single process? Sure. Full OS? Sure. Somewhere in between? Sure.
Containers are not new technology, and they were not invented by Docker or Linux. An artificially-constrained view of what a container is (or should be) that's driven by one tool's marketing (Docker) isn't helpful.
Sorry, but it's not only Docker using 'containers' that way. I'm no fan of systemd for various other reasons - but that is one thing it does correctly: use namespaces aka 'containers' to separate processes.
It simply makes no sense to add unnecessary overhead and complexity to something that is essentially very lightweight. If you want a full-blown OS, a VM is much better suited to that, and modern hypervisors come with a ton of bells and whistles to help you manage full-OS environments.
LXC uses containers in the same manner as VMs. There are still reasons to use a container over a VM. To name a big one: application density. There's a Canonical page about it I can dig up if you want, which claims you can get ~14 times the OS density with LXC containers compared to KVM VMs. That allows you to provide a high degree of separation while still allowing you to use more traditional tools to manage it.
Not everyone is of the caliber that tends to browse HN. Not everyone adapts to new technology as quickly as people around here tend to, especially if that new technology requires a huge upheaval in the way that things have been done for the last 10 or 15 years. Using containers the same way we do VMs provides a lot of the benefits of containers without requiring a drastic change from other departments.
I've had up to 512 nested LXC containers running quagga for BGP & OSPF to simulate "the internet". My machine is an i7 laptop and this used around 8-10 GB of RAM to run.
FYI, the GitHub repo for "The Internet" setup came from the 2014 NSEC conference, where it was used so that participants had a large internet-routing simulation available to test security.
They're supposedly coming along quite nicely with the security of containers. Can you run docker containers in userspace? It's been a while since I did much with it; I know LXC can with a fair bit of customization. That would do a lot to help with security, and if you're following good containerization principles you should be able to set up a really finicky IDS that shuts down containers on even the slightest hint of a breach.
> Modern KVM has a comparable density to containers (except for memory)
It does, but the memory can make a big difference if you're running microservices. Guesstimating, there's probably about a 200MB difference in memory usage between a good container image and a VM. With microservices that can grow quite a bit. Say 4 microservices, needing at least 2 instances of each for redundancy, and you're already looking at a difference of 1.6GB of memory. If you need to massively scale those, that's 0.8GB of memory for every host you add, not including any efficiency gains from applications running on containers rather than VMs (which are going to be largely negligible unless we're talking massive scale).
You can create either privileged or unprivileged LXC containers. Creating unprivileged containers only requires a very simple configuration that takes 60 seconds to do.
Also, note that with LXD/LXC the default container is now unprivileged. And with LXD/LXC the command syntax is now simplified even more than it was with traditional LXC, with the added power of being able to orchestrate and manage LXC containers either remotely or locally.
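For reference, a minimal sketch of the classic unprivileged setup on LXC 1.x (the uid/gid ranges are example values and network setup is omitted), plus the LXD equivalent:
# give your user a subordinate uid/gid range (example values)
sudo usermod --add-subuids 100000-165535 --add-subgids 100000-165535 $USER
# map container root onto that range in the per-user default config
echo "lxc.id_map = u 0 100000 65536" >> ~/.config/lxc/default.conf
echo "lxc.id_map = g 0 100000 65536" >> ~/.config/lxc/default.conf
# with LXD, the container you launch is unprivileged by default
lxc launch ubuntu:14.04 demo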
Yes, and it increases the attack surface even more in some scenarios. Now, an unprivileged user can create new namespaces and do all sorts of things which were previously limited to root.
With "clear containers" (very minimal KVM VMs), you get the overhead down to <20MB:
LXC (www.linuxcontainers.org) supports AppArmor, SELinux and Seccomp, and - what's probably the only way of making a container actually safe - user namespaces, which LXC has supported since the 1.0 release in 2014.
Yeah that's cool, but my main point is that images which make use of the stable Debian package system and are actively maintained are a better approach than an image that makes use of more obscure technology that could be abandoned - or, worse, maintaining your own container infrastructure.
> No, well-written software won't spawn zombie processes - sorry.
And yet it happens.
> The ssh server... Containers are not VMs; if you have to log in to a container running in production, you're doing something wrong
The SSH server is incredibly useful for diagnosing problems in production, so I for one applaud it (although it's not really necessary anymore with docker exec).
> Cron - again - same thing: run in a separate container and give access to the exact things your cronjob needs.
Or just run it in-container to keep your service clusters together.
> That is for me the essential thing about containers: separate everything.
It's a question of degree. Where you draw the line is almost always a personal, aesthetic choice.
I can understand that argument. It's an edge case, and for me the way to go is building a sane Dockerfile on top of Alpine that runs applications through s6 (or runit), and having developers use that for their applications. Isn't this what phusion baked in?
>The SSH server is incredibly useful [...] (although it's not really necessary anymore with docker exec).
It's an additional attack vector and, by your own admission, it's useless. docker exec has been baked into docker for over a year.
>Or just run [cron] in-container to keep your service clusters together.
Per-container cron sounds painful. Then you have to deal with keeping every container's system time in sync with the host (yes, they can deviate). Not only that, if you have a periodic cron job that runs an app to update some database value, scaling becomes bottlenecked and race conditions (and data races) can get introduced. You are prevented from running multiple instances of one application to alleviate load because the container has the side-effect of running some scheduled job. Cron should be separate.
One can also choose the degree to which they want to throw out good practices that prevent them from repeating others' mistakes.
Have you ever seen a container's system time deviate from a host? This makes sense with boot2docker since it runs in a VM but I can't think of a reason this would happen in a container.
>> No, well-written software won't spawn zombie processes - sorry.
> And yet it happens.
Strange - I have been running software in docker in production for almost 2 years, on 6 docker hosts running a ton of containers these days, and yes, a lot of this software spawns child processes.
In all this time I have never seen zombie processes, with one major exception: Phusion Passenger, which we use to run our Redmine instance. If you run this under supervisord as the 'init' process, you indeed notice the init process cleaning up "zombie processes" at startup like this:
So that case for me is the exception, and I do use an init process (supervisord) there to run only Apache with Passenger. Note that using Apache with PHP, or plain Apache, does not leak zombie processes.
Some things you really can't split into one-process-per-container. Like how WAL-E needs to run alongside the Postgres daemon (or at least, I was unable to get it to run otherwise). You might argue you shouldn't run Postgres in a Docker container, but that's just one example of IPC you can't delegate to shared files / TCP ports.
The real problem with splitting things into a bunch of containers is that the story around container orchestration is still poor. Kubernetes is the leader here, but running a production-ready cluster takes some work (besides Google Container Engine, there are some nice turn-key solutions for spinning up a cluster on AWS but they come with short-lived certificates and rigid CloudFormation scripts which create separate VPCs; so you have to setup your own PKI and tweak CloudFormation scripts).
I see no reason why it couldn't run in a separate container. You'd probably have to mount the postgres socket directory and the WAL archive dir into it, and that could be tricky - true. But containers are just a tool. Some things are not suitable to run in containers; don't try to shoe-horn everything into them.
Other than that, there's no problem running postgres itself in a container - as long as your data is stored in a volume that ends up bind-mounted on the local disk, and not on the layered filesystem - otherwise performance will suffer badly.
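A minimal sketch of both points, with illustrative paths and image names (the official postgres image keeps its data under /var/lib/postgresql/data; "my-wal-e" is a hypothetical sidecar image):
# keep the data directory on a bind-mounted host path, not the layered filesystem
docker run -d --name pg -v /srv/pgdata:/var/lib/postgresql/data postgres:9.4
# hypothetical WAL-E companion: share the postgres volumes plus a WAL archive directory
docker run -d --name pg-wal-e --volumes-from pg -v /srv/wal-archive:/wal-archive my-wal-e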
And yes - orchestration, especially at small scale, is still a sore point. All the tools like kubernetes seem to focus on large scale and running multiple instances of the same containers - which is not what I and many people need. Something like docker-compose, but in daemon form, would be nice.
Personally, I've run into weird issues sharing sockets and other files that need to be read+write on both containers. One thing is you have to set up the users very carefully/similarly in both containers, due to file ownership issues with bind mounts (UIDs have to align in both containers).
Agreed about not shoehorning things into containers. Redis, for instance, should be run with custom kernel parameters (transparent huge pages disabled), so it doesn't fit well in the container paradigm, since containers share the same kernel.
Agree in general, but you can overdo it with splitting services up. E.g. would you really run an extra container just for a cronjob that runs once a night to e-mail some data from a database? Especially if you run on a platform where you essentially pay per container, that seems like a waste.
Most of the things I described assume you have full control over your host's OS.
For stuff like you mention - you should maybe reconsider not using containers if you're on a pay-per-container platform? They are just a tool, and certainly don't fit every single use-case. Also - paying per container seems like a silly thing to do - since containers can be very short-lived. Resource-based billing would be a better fit - although that could be tricky to measure I guess.
I'm currently toying with IBM Bluemix (mostly because they have a relatively big free tier) and they have resource-based billing, but since you can't make containers arbitrarily small and you pay for RAM reserved for a container, it is effectively per container. So even if you only need 1 GB for 30 min every night, you either build something that starts a worker container on a schedule, or you pay for resources you don't use 98% of the time. I guess other platforms are similar.
But of course, if you can afford to use that in production it probably doesn't matter very much, and you might choose a different platform if it bugs you. Just came to mind because I just was wondering how to split stuff up.
Size of programs, in terms of disk, memory, cpu time, and network usage, is bloated by multiple orders of magnitude by all the confused people who think the only thing that matters is "developer productivity". Maybe 20% is worth sacrificing, maybe 50%, but 100x? 1000x? It all adds up.
One really easy and relevant example, sizes of docker images for running memcached:
(that last one is my own, the other two are the two most popular on docker hub).
As another example, a co-worker recently was working with some (out-of-tree) gstreamer plugins, and the most convenient way to do so was with a docker image in which all the major gstreamer dependencies, the latest version of gstreamer, and the out-of-tree plugins were built from source. The offered image was over 10GB and 30 layers, took quite a while to download, and a surprising number of seconds to run. With just a few tweaks it was reduced to 1.1GB and a handful of layers which runs in less than a second. It was just a total lack of care for efficiency that made it 10x less efficient in every way, enough to actually reduce developer productivity.
> the confused people who think the only thing that matters is "developer productivity".
Developers, especially good developers (or hell, even just competent ones), are more than worth the effort put into improving their productivity, and the good ones will usually intuitively have a grasp of the XKCD time trade-off graph and reduce or eliminate delays themselves given the chance.
That being said, even in this day and age of extremely cheap cycles, non-volatile and volatile storage, and insane throughput, making something like VM/chroot images smaller can lead to higher productivity in that you can spin them up faster, or spin up tons more in parallel than you would normally think of. Having that option can help shape alternate modes of development and open up possibilities previously undreamt of ("spin up 1000 docker images? Can't do that because they each need 200MB of RAM and I only have 32GB of RAM").
Size of cruft aside, there's value in discussing whether such cruft should exist.
It's normal for common tools to be SUID root - it's necessary for operation on a normal machine. Do you really need 30+ SUID binaries inside your Docker container built for one thing?
Docker seems to present an ideal situation for stripping such potential exploit vectors back.
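For example, a sketch of how you might audit and then strip them near the end of a Dockerfile build (not anyone's official recipe):
# list everything setuid/setgid in the image...
RUN find / -xdev -perm /6000 -type f -exec ls -l {} \; 2>/dev/null || true
# ...then drop the bits your single-purpose container doesn't need
RUN find / -xdev -perm /6000 -type f -exec chmod a-s {} \; 2>/dev/null || true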
One really easy one: write a shell script to do most of the image building (run by the Dockerfile), instead of adding a bunch of RUN directives in the Dockerfile, especially if you clean up intermediate files with a "make clean" or something. Each directive in the Dockerfile adds a layer, which adds container setup overhead, and also "locks in" all filesystem space usage at that point.
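A minimal sketch of the pattern (file names, base image and packages are illustrative):
# Dockerfile: one COPY plus one RUN, so the cleanup inside build.sh actually reclaims layer space
FROM debian:jessie
COPY build.sh /tmp/build.sh
RUN /tmp/build.sh && rm -f /tmp/build.sh
# build.sh (illustrative contents):
#   set -e
#   apt-get update && apt-get install -y build-essential
#   ...configure && make && make install...
#   make clean && apt-get purge -y build-essential && rm -rf /var/lib/apt/lists/*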
I'd be careful about drawing conclusions from those tests. We know that the number of bits in a container does not directly influence how much RAM it consumes. Therefore, there must be something the images are doing that consumes memory which is not happening in the "smaller" images. The key would be to find the culpable process or daemon(s).
It could well be due to things like shared libraries. A larger distro will have more options enabled, causing more shared libraries to be linked into the same running processes, and thus more shared libraries to be fully loaded into memory.
A smaller distro might even statically compile most things - Alpine does. If you dynamically link shared libraries, the whole library is loaded into memory to serve the process. If you statically link, only the actually used part of the library is included in the binary.
Statically linked binaries can't share the library in memory between each other like dynamically linked binaries can, but if all your processes are running in separate containers, they won't share those libraries anyway (unless they're all in a VM and the rarely used "samepage merging" is enabled for the VM).
Finally ... simplicity has knock-on effects. Making things simpler and smaller (not easier), and reducing the number of moving parts in the implementation, makes cleaning up more stuff easier.
That's not really how it works. Both executables and shared libraries are mapped into the virtual address space of the process, then only stuff that is actually used will be faulted (read) into physical memory. At page granularity, so yes, there is some bloat due to unused functionality, but it's not as bad as requiring the entire thing to be loaded into memory.
That's an awful lot of conjecture. I'd wager that most of what you would actually be running in a container would not have its memory usage significantly affected by the presence or absence of optional shared libs. I'm with the parent on this; such claims warrant research.
Not really, it was an educated guess, and then a description of how binaries and libraries work on modern unix systems.
Here's a quick demo based on the trivial example of the memcached docker images I mentioned in another thread:
vagrant@dockerdev:/host/scratch/janus-gateway$ sudo docker run --name=mc_big --detach --publish=11212:11211 --user=nobody sylvainlasnier/memcached /usr/bin/memcached -v -m 64 -c 1024
67c0e406245d341450c5da9ef03cbf60a8752433a4ace7471e2a478db9a62e07
vagrant@dockerdev:/host/scratch/janus-gateway$ sudo docker run --name=mc_small --detach --publish=11213:11211 --user=nobody ploxiln/memcached /bin/memcached -v -m 64 -c 1024
11037b69acfbc0de7601831634751cd342a7bafe9a25749285bc2c2803cc1768
vagrant@dockerdev:/host/scratch/janus-gateway$ top c -b -n1 | grep 'COMMAND\|memcached'
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5984 nobody 20 0 316960 1192 768 S 0.0 0.1 0:00.02 /usr/bin/memcached -v -m 64 -c 1024
6091 nobody 20 0 305256 780 412 S 0.0 0.0 0:00.00 /bin/memcached -v -m 64 -c 1024
Notice the significant difference in RES (resident set size) and SHR (shared memory). Less trivial processes will have more shared libraries and bigger differences here. Multiply this kind of result times all the contained processes. It adds up.
Sorry, I was responding to your post in the context of logician's "an important concern" assertion. You and jabl are correct technically of course.
Within the context of "an important concern", though, the difference in RES and SHR between the two is ~330KB. I suspect most people wouldn't find that significant, particularly given memcached's common use cases.
Using Gentoo stable in production right now. I'm in charge of how long a package is supported now. All execs get a brand new gentoo machine built with binaries compiled by myself.
You wouldn't believe how fast you can get a gentoo machine up and running compared to other distros. Build for a minimum common architecture (all Intel binaries are based on Sandy Bridge, all ARM on Rockchip RK3088), and installing on a new computer is little more than untarring a bunch of binaries to /. My record is 5 minutes for a full KDE Plasma 5.5 software stack.
I explicitly did not mention Gentoo - I know a bunch of people who run it in production. But, for anyone considering this: if you're running Gentoo, you're essentially building your own distro, which has massive advantages but is also a huge effort. You're now in charge of security updates, maintenance and QA. What if you leave the company? There are many Debian or Red Hat admins, but good luck finding a Gentoo expert.
We use Alpine Linux for our applications and I like it, and I too shudder at it being used for the entire production system. As a sysadmin, you can still administer the LTS distro that hosts the docker containers and whatever other pieces of the stack you interact with. Alpine Linux containers, like any other container, should host an instance of an application (maybe not even that, depending on how complex the application is) -- not the entire production server, not SSH keys, not iptables, firewall rules, etc.
Both. Debian is the gold standard of long term support, and Ubuntu is a stable company that builds upon this. And that's why I approve of phusion/baseimage being based off it.
RHEL (and by extension CentOS) 7 provides Ruby 2.0. And a 3.10 kernel even. If you're running docker with CentOS, this is what you're likely to use.
RHEL/CentOS 6 provides Ruby 1.8.7 and a 2.6.32 kernel. It can be made to run with docker, but it's unsupported and it won't be easy.
RHEL/CentOS 5 provides Ruby 1.8.5. The 2.6.18 kernel it comes with won't even run go binaries such as docker, much less lxc. Yes this is ancient. It was released in 2007 and it will be supported until 2017.
And there are FAR more important concerns:
- Are the packages in your base system well maintained and updated with security fixes?
- Does your base system have longevity? Will it still be maintained a few years from now?
- Does it handle all of the special corner cases that Docker causes?
That's why I use https://github.com/phusion/baseimage-docker