Kelsey Hightower actually published something on this topic called "Building Docker Images for Static Go Binaries".
The Nix package manager offers a potential way to know the complete dependency tree. If you're not familiar with it, a Nix expression to build a package takes a set of inputs (specific binary packages of, e.g., make, gcc, bash, libc, libxml2) and produces a binary output (depending only on those inputs). The run-time dependencies can be a smaller set than the build-time dependencies, and are deduced by, for example, observing shared-library linking.
I've been using it (outside Docker) for various Ruby apps, and I can't say it's been easy, but a large part of the pain has been Rubygems' inability to encode dependencies on C-libraries (e.g. libxml-ruby depends on libxml2).
There have been attempts at provisioning Docker containers with Nix.
Of course, if you are using Nix, some part of Docker's isolation becomes redundant (Nix isolates multiple versions of things on the filesystem using plain old directories, so it's trivial to run ten different versions of Ruby side by side, for example).
I've heard whisperings on the wind of research being done on monitoring what files a Docker container uses, and then removing everything that the container doesn't need to run the app. I agree that this is the future - I shouldn't have apt-get, curl, etc. taking up space in my final image if I don't need them - but how do you tell a "good" file from a "bad" one? (Just thinking out loud here - what if my app depends on ImageMagick, libffmpeg, etc.?) Nix looks pretty cool, I suppose.
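A crude way to poke at this yourself (just a sketch, not the research mentioned above; the app command and paths are made up): run the app once under strace and collect the set of files it actually opens, then compare that against what's installed in the image.

# Trace which files the app actually opens at runtime (strace must be available)
strace -f -e trace=open,openat -o /tmp/opened-files.log ruby app.rb

# List the unique paths that were opened successfully
grep -v ' = -1 ' /tmp/opened-files.log | grep -o '"[^"]*"' | sort -u

Anything that never shows up across your test runs is a candidate for removal, though this only catches files touched on the code paths you actually exercised.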
I've managed to get most images to basically the size of the base image + my app.
This process is sort of the reverse of building a single binary and adding it to a minimal image. I like that approach but it's not always straightforward w/ some applications.
Mostly the problem I've run into is figuring out what to remove without b0rking the containerized app.
For instance, here is a gist I whacked together in a few minutes that will build you a runnable Ruby interpreter with NOTHING else installed but its required shared libraries. (Note this would not be fun to get, say, Nokogiri working in without knowing what you are doing.)
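For anyone curious, the general shape of that approach is roughly this (a sketch, not the gist itself; paths are examples only, and a real Ruby image would also need the stdlib files, which is part of why it gets fiddly):

# Copy the interpreter binary plus whatever ldd reports into a bare rootfs
mkdir -p rootfs/usr/bin
cp /usr/bin/ruby rootfs/usr/bin/

for lib in $(ldd /usr/bin/ruby | grep -o '/[^ ]*'); do
    mkdir -p "rootfs$(dirname "$lib")"
    cp "$lib" "rootfs$(dirname "$lib")/"
done

# Import the directory as a minimal image
tar -C rootfs -c . | docker import - minimal-ruby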
I think that would be like packr for Java, already discussed here.
I wonder if there is something like this for other languages/platforms like python/ruby/node.
Nix is a package management system which knows the full (yes completely full) transitive dependency tree of every package it installs, so you can have an absolutely minimal set of software in your container if you use nix-docker.
Because it's already doing version isolation in the package manager, you can also mount the software from the host into a shim container, which is more the "nix way" of doing things.
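A rough sketch of what that host-mounting approach might look like (the image name and store path are placeholders, not real artifacts):

# Shim container that reuses the host's Nix store read-only
docker run -v /nix/store:/nix/store:ro my-shim-image \
    /nix/store/<hash>-ruby-2.1/bin/ruby app.rb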
I think there's got to be a middle-ground - small (maybe O(tens of MB) max) but full-featured enough to have a simple shell and the ability to get debug tools.
How small of a Debian or Fedora image could we get if we REALLY tried? 50 MB? 30 MB?
| docker container |
| w/ static bin |
| <----- nsenter + bash/debug bits (on host machine)
So, nope, docker exec just wouldn't work.
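For the curious, that host-side debugging looks roughly like this (a sketch; the container name is a placeholder and nsenter needs root on the host):

# Find the container's init PID on the host
PID=$(docker inspect --format '{{.State.Pid}}' my-container)

# Join, say, the container's network namespace while keeping the host's
# filesystem, so the host's bash/curl/strace are all still available
sudo nsenter --target "$PID" --net /bin/bash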
- you need a specific version (e.g. redis is pretty old in the Debian repositories).
- you need to compile with specific options.
- you will need to npm install (or equivalent) some modules which compile to binary.
But you can avoid all of these by building your own DEB/RPM packages and installing those into the container.
This might make the container less "whitebox", in that the Dockerfile no longer contains the full steps to reproduce the built image from public sources. But having an internal package repository makes a lot of sense, and not just for building small Docker images. Keeping your own package repository helps make your server builds more reproducible in general, and provides clean mechanisms for performing updates on your own software as well as third-party packages.
(edited for formatting)
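As a sketch of what that can look like in a Dockerfile (the repository URL, package name, and version are hypothetical):

FROM debian:wheezy

# Point apt at an internal repository hosting your own pre-built .debs
RUN echo "deb http://apt.internal.example.com/debian wheezy main" \
    > /etc/apt/sources.list.d/internal.list

# Install the pre-built, pinned package instead of compiling inside the container
RUN apt-get update \
    && apt-get install -y --force-yes myapp-redis=2.8.19-1~internal1 \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*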
To elaborate for people not very familiar with Gentoo: it solves what a lot of the discussion in this thread seems to be about - having complete control of the dependency chain based on how you choose to build your software. Using nginx as an example, enabling mod_security would pull in and build its own dependencies (which can also be limited through its compile options). Strip man pages? Done. Change the libc library? No problem (if the packages support it).
The work that needs to be done is expanding the toolset to the point where you say "I want this in docker, plx" and anything else (dependencies aside) basically goes out the window. The current attempts build upon a small set of packages for convenience and then remove "safe" stuff. When time allows, I'd like to be much more aggressive about what's considered safe.
I'm personally also very interested in progrium's work on bundling busybox with opkg (https://github.com/progrium/busybox), but I still think that docker containers should not be built from within - which is why cross-compiling from Gentoo to create a minimal docker image is the way to go.
And using Docker's filesystem features for that. I'm still quite new to Docker and don't know if it is easy to do.
But I'll have a look at gentoo-bb, I think it is exactly what I need!
To reconcile these issues I propose a two-phase build of the Docker image. First you set up a regular Docker image based on Debian or whatnot which contains all the tools you need to build/set up the application. Then, inside that container, you build the final image based on an empty image (e.g. http://docs.docker.com/articles/baseimages/#creating-a-simpl... ), adding only the files that are really needed at runtime.
I'd love to see more research in the area.
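A crude sketch of one way to approximate that (image names, paths, and make targets are invented; this collects the runtime files via a mounted volume rather than literally building the image from inside the container):

# Phase 1: build in a full-featured container, installing only runtime files into ./out
docker run -v "$PWD:/src" -v "$PWD/out:/out" debian:wheezy \
    sh -c 'cd /src && apt-get update && apt-get install -y build-essential && make && make install DESTDIR=/out'

# Phase 2: turn the collected files into an image containing nothing else
tar -C out -c . | docker import - myapp:minimal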
A much better concept would be: share layers, not images, based on verified base images with preinstalled SaltStack. This effectively boils down to sharing good and up-to-date provisioning scripts.
There are some more conceptual problems with the whole Docker idea that are rooted in "need-to-productize-quickly" thinking and make everything seem immature and not really thought out. Very basic problems that pop up with orchestration and networking should have been solved before releasing the product; now millions of half-assed "products" step into that gap, and the result is a bizarre level of overcomplication of any infrastructure that was not possible before with virtualization alone - and still there are important things that "will be contributed in the future by somebody, hopefully".
Docker should not be a product in itself with its own "market"; the basic Docker ideas should be added to already existing concepts and inherit already existing infrastructure. The Docker execution model should be a standard feature of any Linux distribution, with a standardized container model (and some security added!), and the existing packaging infrastructure should be extended to handle what is needed to support it, including userspace updates and provisioning or on-the-fly rebuilds, so people can concentrate on writing provisioning scripts instead of fighting another layer of system-config BS. Getting rid of the VM is great, but building even more complicated overhead on top is totally absurd. Meanwhile, something like Vagrant is a great thing to learn from.
I would hold that making Docker images easy to use, as transparent as possible, reliable, versatile, and easy to use (did I mention that already? oops) are far more important priorities.
Admittedly, I use docker primarily for development/testing purposes and my use-cases are a bit different than the average production use-case, however, having a large toolbox easily accessible for me to use (yes, including the ability to ssh into the docker container) is invaluable to me.
I may be missing something here, but racing to make docker images "as small as possible" feels like a bit of premature optimisation.
However, I've seen (and am guilty of) a few of the pitfalls in the parent post. When you suddenly add 100 MB+ unnecessarily to a Docker image, it can have some nasty ramifications for dev speed and for deployments. A classic way to accidentally and dramatically increase the weight of a Docker image is to apt-get install build-essential.
What kinds of ramifications, you might ask? Well, I live in Australia. I don't know if the Docker Registry has a CDN POP here, but that extra 100 MB tends to take a solid minute or two longer for me to pull down.
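One way to dodge that particular trap (just a sketch; it assumes a Node app with native modules, and the package list is illustrative) is to install the toolchain, build, and purge it again within a single RUN instruction, so the extra weight never lands in any image layer:

RUN apt-get update \
    && apt-get install -y build-essential \
    && npm install \
    && apt-get purge -y build-essential \
    && apt-get autoremove -y \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*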
If you dump an OS in a container you are treating it like a lightweight VM (and that might be fine in some/many cases).
If however you restrict it to exactly what you need and its runtime dependencies + absolutely nothing more, then suddenly it's something else entirely - it's process isolation, better yet it's -portable- process isolation.
If it doesn't hinder runtime performance, +/- 100 MB of disk space is fairly benign. I understand how smaller images would be useful, but for my use case it doesn't help much.
On the other hand, pinning every package that you install would end up being pretty verbose.
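For illustration of what that verbosity looks like (the version strings here are invented, not real pins to copy):

RUN apt-get update && apt-get install -y \
    curl=7.38.0-4+deb8u2 \
    nginx=1.6.2-5 \
    redis-server=2:2.8.17-1

Every one of those strings has to be bumped by hand whenever you want a newer package, which is exactly the maintenance burden being described.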
I see a few reasons to build your own from the Dockerfile:
- 1) You don't trust the image and want to build your own.
- 2) You want to build something slightly different.
- 3) You want an up-to-date version.
So you'll generally want those fixes, unless they really broke something, which I'd guess would be somewhat rare.
RUN curl -SLO "http://nodejs.org/dist/v$NODE_VERSION/node-v$NODE_VERSION-linux-x64.tar.gz" \
&& tar -xzf "node-v$NODE_VERSION-linux-x64.tar.gz" -C /usr/local --strip-components=1 \
&& rm "node-v$NODE_VERSION-linux-x64.tar.gz"
RUN set -e; \
curl -SLO "http://nodejs.org/dist/v$NODE_VERSION/node-v$NODE_VERSION-linux-x64.tar.gz"; \
tar -xzf "node-v$NODE_VERSION-linux-x64.tar.gz" -C /usr/local --strip-components=1; \
rm "node-v$NODE_VERSION-linux-x64.tar.gz"
That said, obviously there's going to be different preferences...I was more wondering if there was some semantic difference.
Then if the lines are long (as they are here) they get cut off and it's much harder to notice that they're there at all.
I was trying to use the example given...
Here's another, smaller example:
# install wget without artifacts
RUN set -e; \
apt-get update; \
apt-get install -y wget; \
apt-get clean; \
rm -rf /var/lib/apt/lists/*
Personally I also don't think the meaning is as clear, and you now need to maintain the top line if you cut'n'paste. I suppose it's a matter of taste.
It'll replace a bunch of unnecessary shell scripts that are currently required to get similar functionality working.
This will remove the .debs apt just downloaded and installed that are being cached in /var/cache/apt/archives, saving you a little disk space.
On one hand, I consider this a good default behavior for building docker containers, so I'm glad it's there. On the other hand, I didn't know about it until I investigated. It's strong evidence for the great point @amouat made regarding base images being blackboxes in most (all?) cases.
I wanted to use public Docker images, spin them up on AWS, and quickly sanity-check/vet an app to see if it's something I'd want to use internally or recommend to customers.
I realized there was nothing of the sort and started xdocker.io - an open source initiative.
Currently we support Security Monkey and Ice (both from Netflix).
Just love Docker, and I'm learning quite a few tricks along the way.
This article just helps us do our job better by following best practices for building Docker images.
I would also appreciate it if experts on this could help us screen the Dockerfiles we have created and share their feedback with us.
There are lots of proposals sitting in GitHub; it'd be great to get more feedback.
I would suggest the only truly whitebox images are the ones that can be recreated from GitHub (or similar) repositories.
This is true for any packages you use, not just Docker. For example, Rails gems: if you don't pin them, your app will break at some point on an update. Always manually update packages and test before deploying.
This must be a bug? Why should it legitimately behave this way?