
Building Good Docker Images - jaswilder
http://jonathan.bergknoff.com/journal/building-good-docker-images
======
WestCoastJustin
Google takes this a step further and creates single binary containers with the
minimal OS bits needed [1, 2]. Personally, I think this is where we need to be
headed vs running a full blown ubuntu/debian/centos OS inside the container.
Three benefits: 1) no OS to manage, e.g. no _apt-get update_ or configuration
management; 2) the container has less of an attack surface (think shellshock --
the container does not have bash, wget, curl, etc.); 3) they are lightweight.
The issue is: how do we (container creators) figure out the dependency tree
for the app? Sure, this might be easier for Go binaries, but what about complex
apps like rails and mysql? It is a major pain to figure this out, so we just
use an OS, and it takes all the thinking out of it.

Kelsey Hightower actually published something on this topic called "Building
Docker Images for Static Go Binaries" [3].

[1] [https://registry.hub.docker.com/u/google/nodejs-hello/](https://registry.hub.docker.com/u/google/nodejs-hello/)

[2] [https://github.com/thockin/serve_hostname](https://github.com/thockin/serve_hostname)

[3] [https://medium.com/@kelseyhightower/optimizing-docker-images-for-static-binaries-b5696e26eb07](https://medium.com/@kelseyhightower/optimizing-docker-images-for-static-binaries-b5696e26eb07)
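
A minimal sketch of that approach, assuming a statically linked Go binary (the `app` name is a placeholder):

    # build on the host first with: CGO_ENABLED=0 go build -o app
    FROM scratch
    ADD app /app
    ENTRYPOINT ["/app"]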

~~~
zenlikethat
Perhaps I'm missing something, but I don't see how [1] can be used to create
"single binary containers with the minimal OS bits needed"? It is from
[https://registry.hub.docker.com/u/google/nodejs/dockerfile/](https://registry.hub.docker.com/u/google/nodejs/dockerfile/)
and uses the full Debian stack that you discuss including apt-get etc.

I've heard whisperings on the wind of research being done with respect to
monitoring what files a Docker container uses, and then removing everything
that the container doesn't need to run the app. I agree that this is the
future - I shouldn't have apt-get, curl, etc. taking up space in my final
image if I don't need them - but how do you tell a "good" file from a "bad" one?
(Just thinking out loud here - what if my app depends on imagemagick,
libffmpeg etc.?) Nix looks pretty cool I suppose.
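
One low-tech approximation of that monitoring idea (just a sketch; `myapp` is a placeholder) is to trace the app and collect the files it opens:

    # follow forks and log every open/openat the app makes
    strace -f -e trace=open,openat -o /tmp/files.log ./myapp
    # pull out the unique quoted path arguments (approximate)
    grep -o '"[^"]*"' /tmp/files.log | tr -d '"' | sort -u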

~~~
jaswilder
I wrote [https://github.com/jwilder/docker-squash](https://github.com/jwilder/docker-squash) to remove things that I know
I don't need in the final image such as curl, wget, temp files, various
packages, etc..

I've managed to get most images to basically the size of the base image + my
app.

This process is sort of the reverse of building a single binary and adding it
to a minimal image. I like that approach but it's not always straightforward
w/ some applications.
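
For reference, running it looks something like this (tag names are placeholders):

    # export the image, squash its layers into one, and load it back
    docker save myimage | docker-squash -t myimage:squashed | docker load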

~~~
zenlikethat
Ah yes, I've played with docker-squash and like it. I wish there were a built-in
docker solution for squashing layers (perhaps any contiguous string of
instructions starting with ~ could be squashed into one layer?).

Mostly the problem I've run into is figuring out what to remove without
b0rking the containerized app.

------
ajdecon
I really think there are no good reasons to include build tools in a docker
image. The author lists three possible reasons:

- you need a specific version (e.g. redis is pretty old in the Debian
repositories).

- you need to compile with specific options.

- you will need to npm install (or equivalent) some modules which compile to
binary.

But you can avoid all of these by building your own DEB/RPM packages and
installing those into the container.
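
A sketch of what that looks like in a Dockerfile (the repo URL and pinned version are placeholders for your internal ones):

    # point apt at the internal package repository, then install the custom build
    RUN echo "deb http://packages.internal.example.com/debian wheezy main" \
          > /etc/apt/sources.list.d/internal.list \
        && apt-get update \
        && apt-get install -y redis-server=2.8.17-1~internal \
        && apt-get clean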

This might make the container less "whitebox", in that the Dockerfile no
longer contains the full steps to reproduce the built image from public
sources. But having an internal package repository makes a lot of sense, and
not just for building small Docker images. Keeping your own package repository
helps make your server builds more reproducible in general, and provides clean
mechanisms for performing updates on your own software as well as third-party
packages.

(edited for formatting)

~~~
aye
I wish I could upvote this a few times.

------
jbergstroem
I think this is a case where gentoo can really shine. Although the tooling might
not be there just yet, the linux meta-distribution allows you to build a
strict set of dependencies based on what you need and nothing else. There have
already been pretty successful attempts at this, such as
[https://github.com/edannenberg/gentoo-bb](https://github.com/edannenberg/gentoo-bb) (63MB custom nginx sound OK to
you?) or
[https://github.com/wking/dockerfile](https://github.com/wking/dockerfile).

edit: To elaborate for people not very familiar with gentoo: it solves what a
lot of the discussion in this thread seems to be about - having complete
control of the dependency chain based on how you choose to build your
software. Using nginx as an example, enabling mod_security would pull in and
build its own dependencies (which can also be limited via compile options).
Strip man pages? Done. Change libc? No problem (if the packages support it).
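
For those unfamiliar with Portage, that control looks roughly like this (the module flag name is illustrative, not exact):

    # enable the ModSecurity module for nginx only
    echo "www-servers/nginx nginx_modules_http_security" >> /etc/portage/package.use/nginx
    # strip man pages, docs and info pages from everything emerge installs
    FEATURES="noman nodoc noinfo" emerge www-servers/nginx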

The work that needs to be done is expanding the toolset to a point where you
say "I want this in docker, plx" and anything else (dependencies disregarded)
basically goes out the window. The current attempts build upon a small set of
packages for convenience, then remove "safe" stuff. When time allows, I'd like
to be much more aggressive in terms of what's considered safe. :edit

I'm personally also very interested in progrium's work with bundling busybox
with opkg
([https://github.com/progrium/busybox](https://github.com/progrium/busybox)),
but I still think that docker containers should not be built from within -
which is why cross-compiling from gentoo to create a minimal docker image is
the way to go.

~~~
pierreozoux
Thanks for the pointer. I've been thinking about this kind of scheme:

- start from a vanilla gentoo

- install portage

- emerge my package

- at the end, diff the filesystem and apply the result to the vanilla gentoo.

And use docker's filesystem features for that. I'm still quite new to Docker
and don't know if it is easy to do.

But I'll have a look to gentoo-bb, I think it is exactly what I need!

------
zokier
This gave me a (probably non-novel) idea: "double-layered" Docker image
creation. One thing that rubs me the wrong way is how Docker images contain
stuff like apt (and all the related supporting stuff) when they don't really
need them (at runtime). On the other hand you need to install/compile/setup
the environment somewhere, and relying on the host system would break any
hopes of reproducibility.

To reconcile these issues I propose two-phase building of the Docker image.
First you set up a regular Docker image based on Debian or whatnot which
contains all the tools you need to build/set up the application. Then inside
that container you build the final image based on an empty image (e.g.
[http://docs.docker.com/articles/baseimages/#creating-a-simple-base-image-using-scratch](http://docs.docker.com/articles/baseimages/#creating-a-simple-base-image-using-scratch)),
adding only the files that are really needed at runtime.
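
A rough sketch of the two phases, driven from the host (directory and file names are hypothetical):

    # phase 1: build the app inside a fat builder image
    docker build -t myapp-builder ./builder
    # export only the runtime files the build produced into /out
    docker run --rm myapp-builder tar -C /out -cf - . > release/runtime.tar
    # phase 2: release/Dockerfile is just
    #   FROM scratch
    #   ADD runtime.tar /
    #   ENTRYPOINT ["/app"]
    docker build -t myapp ./release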

~~~
23david
Yep, cool idea. This is being addressed by the (not yet merged) docker nested
build feature:

[https://github.com/docker/docker/pull/8021](https://github.com/docker/docker/pull/8021)

------
dockerhubby
The Docker hub is - after such a short time - an even darker place than the
wordpress plugin registry, and is already a source of security problems and a
useless waste of bandwidth, time and effort. Besides too many amateurs
publishing BS, the real problem is that the company behind it is not taking
responsibility for assuring the quality of their containerized-app-store. This
does not have to end in censorship, like our beloved Big Brother Apple
practices - some automated checks on each uploaded image could be a way to go,
plus a team of reviewers that approves everything uploaded. Of course, the
amount of information attached to any image is currently a joke; the whole hub
is a one-day-of-work prototype that never should have been published in such a
premature state. But now it's too late, so there is no other way than burning
it down and restarting with some more thinking this time.

A much better concept would be: share layers, not images, based on verified
base images with preinstalled saltstack. This effectively boils down to
sharing good, up-to-date provisioning scripts.

There are some more conceptual problems with the whole docker idea that are
rooted in "need-to-productize-quick" thinking and make everything seem
immature and not really thought out. Very basic problems that pop up with
orchestration and networking should have been solved before releasing the
product; now millions of half-assed "products" step into that gap, and the
result is a bizarre level of overcomplication of any infrastructure that was
not possible before with virtualization alone. And still there are important
things that "will be contributed in the future by somebody, hopefully".

Docker should not be a product itself with its own "market"; the basic docker
ideas should be added to already existing concepts and inherit already
existing infrastructure. The docker execution model should be a standard
feature of any linux distribution, with a standardized container model (with
some security added!), and the existing packaging infrastructure should be
extended to handle what is needed to support it, including userspace updates
and provisioning or on-the-fly rebuilds, so people can concentrate on writing
provisioning scripts instead of fighting another layer of system config BS.
Getting rid of the VM is great, but building even more complicated overhead on
top is totally absurd. Meanwhile, something like Vagrant is a great thing to
learn from.

------
deeviant
Is minimizing the size of a docker image really the top priority?

I would hold that making docker images easy to use, as transparent as
possible, reliable, versatile and easy to use (did I mention that already?
oops) are far more important priorities.

Admittedly, I use docker primarily for development/testing purposes and my
use-cases are a bit different than the average production use-case, however,
having a large toolbox easily accessible for me to use (yes, including the
ability to ssh into the docker container) is invaluable to me.

I may be missing something here, but racing to make docker images "as small as
possible" feels like a bit of premature optimisation.

~~~
samcday
If we were talking about shaving a couple of bytes off a layer, I'd agree that
attention might be better focused elsewhere.

However, I've seen (and am guilty of) a few of the pitfalls in the parent
post. When you suddenly add 100mb+ unnecessarily to a Docker image, it can
have some nasty ramifications for dev speed and also deployments. A classic
way to accidentally and dramatically increase the weight of a Docker image is
to apt-get install build-essential.

What kinds of ramifications you might ask? Well, I live in Australia. I don't
know if Docker Registry has a CDN POP here, but that extra 100mb tends to take
a solid minute or two longer for me to pull down.

------
pit
"Pin package versions" \-- yes. One of the things that has been bugging me
about Docker is that if you begin every Dockerfile with an `apt-get -y
update`, you never know what you're going to end up with.

On the other hand, pinning _every_ package that you install would end up being
pretty verbose.
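
For the record, pinning in a Dockerfile looks like this (the package and version string are just examples):

    # apt-cache madison nginx   <- lists the versions available in the configured repos
    RUN apt-get update && apt-get install -y nginx=1.6.2-5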

~~~
sp332
Why do an update if you prefer your packages to be pinned?

~~~
michaelmior
Depending on what your base image is, the pinned versions you want to install
may not be available.

------
23david
One additional tip for readability: replace the &&s with a single `set -e` at
the top of any RUN command that combines more than one command.

before:

      RUN curl -SLO "http://nodejs.org/dist/v$NODE_VERSION/node-v$NODE_VERSION-linux-x64.tar.gz" \
        && tar -xzf "node-v$NODE_VERSION-linux-x64.tar.gz" -C /usr/local --strip-components=1 \
        && rm "node-v$NODE_VERSION-linux-x64.tar.gz"

after:

      RUN set -e; \
        curl -SLO "http://nodejs.org/dist/v$NODE_VERSION/node-v$NODE_VERSION-linux-x64.tar.gz"; \
        tar -xzf "node-v$NODE_VERSION-linux-x64.tar.gz" -C /usr/local --strip-components=1; \
        rm "node-v$NODE_VERSION-linux-x64.tar.gz"

~~~
digisign
Hmm, needs additional semi-colons.

~~~
23david
where?

I was trying to use the example given... Here's another, smaller example:

      # install wget without artifacts
      RUN set -e; \
        apt-get update; \
        apt-get install -y wget; \
        apt-get clean; \
        rm -rf /var/lib/apt/lists/*

~~~
e12e
I think parent meant you trade two &&s for one ;.

Personally I also don't think the meaning is as clear, and you now need to
maintain the top line if you cut'n'paste. I suppose it's a matter of taste.

------
peterwwillis
Docker images are like if some high school kids decided to sell mass-produced
beer. Instead of writing down a recipe with the ingredients, measurements,
temperatures, elevation, times, and distributor-sourced quality-assured
materials, the kids go to random stores and buy whatever they think they need
to make beer in huuuuge quantities. They make huge batches of beer so that
they won't need to make it again for months or years. Then the next time they
brew a batch, the beer tastes completely different, and they say, "Oh, we
might need to write some of this stuff down, and make sure it's the same as
last time."

------
banmeagainplz
This patch is essential for creating truly minimal images with Docker.
[https://github.com/docker/docker/pull/8021](https://github.com/docker/docker/pull/8021)

~~~
23david
I hope the nested build functionality gets merged soon. From reading the
latest proposal minutes, I wasn't sure if it had been shelved for later?

[https://github.com/docker/irc-minutes/blob/master/docker-dev/2014-10-02--17-30.md#nested-build---proppy](https://github.com/docker/irc-minutes/blob/master/docker-dev/2014-10-02--17-30.md#nested-build---proppy)

It'll replace a bunch of unnecessary shell scripts that are currently required
to get similar functionality working.

------
robinson-wall
Something the article doesn't touch on in its pursuit of a smaller image: when
you run "apt-get install" or "apt-get upgrade", you should add "&& apt-get
clean" to the same RUN command.

This will remove the .debs apt just downloaded and installed, which get cached
in /var/cache/apt/archives, saving you a little disk space.
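
For example (nginx is just a placeholder package):

    # the .deb cache is emptied in the same layer that filled it
    RUN apt-get update && apt-get install -y nginx \
        && apt-get clean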

~~~
jbergknoff
Thanks for pointing this out. In the "debian:wheezy" docker image, "apt-get
clean" didn't seem to have any effect. I dug around a bit and found that, in
this image, apt is configured not to cache downloaded packages (via
/etc/apt/apt.conf.d/docker-clean).

On one hand, I consider this a good default behavior for building docker
containers, so I'm glad it's there. On the other hand, I didn't know about it
until I investigated. It's strong evidence for the great point @amouat made
about base images being black boxes in most (all?) cases.

------
sudhi_xervmon
This is a great topic. I hope this docker image thingy matures a bit more, and
quickly.

I wanted to take public docker images, spin them up on AWS, and quickly
sanity-check/vet whether an app is something I'd want to use internally or
recommend to customers.

I realized there was nothing of the sort, and started xdocker.io -- an open
source initiative.

Currently we support Security Monkey and Ice (both from Netflix).

Just love docker, and I'm learning quite a few tricks along the way.

This article helps us do our job better by following the best practices for
building docker images.

I would also appreciate it if experts on this could help us screen the docker
files we have created and share feedback:
[https://github.com/XDocker/Dockerfiles](https://github.com/XDocker/Dockerfiles).

------
nickstinemates
Great post. It shows there's a bit to do in Docker to make transient data a
bit more user-friendly.

There are lots of proposals sitting in GitHub; it'd be great to get more
feedback on them.

------
thinkingkong
Lately we've been using shell scripts to do a lot of boilerplate container
preparation. We copy the script in and run it at the beginning of the
container build. The nice thing about doing it this way is that you keep your
Dockerfile a little cleaner, you end up with fewer layers, and that layer only
rebuilds when you change the source file.
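
A sketch of the pattern (the script name is hypothetical):

    # one consolidated layer instead of a dozen RUN lines;
    # rebuilt only when prepare.sh itself changes
    COPY prepare.sh /tmp/prepare.sh
    RUN /tmp/prepare.sh && rm /tmp/prepare.sh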

------
amouat
The author mentions a good image is a "whitebox" if it publishes its
Dockerfile on the Hub. Unfortunately this isn't really enough; many (even
most) Dockerfiles depend on scripts and data files which aren't hosted on the
Hub.

I would suggest the only truly whitebox images are the ones that can be
recreated from github (or similar) repositories.

~~~
jbergknoff
You're right that a published Dockerfile isn't enough to really know what went
into the image, but I think it's the closest thing we have at the moment. I
would love to see some tools for building truly minimal docker images from
scratch.

------
bradleyland
Curious whether the sizes quoted are Debian netinst vs. Ubuntu Server, or a
standard Debian install vs. Ubuntu Server? We base all our installs (regular
VMs, not Docker) on Debian netinst because we get to choose exactly what goes
onto the server.

------
driverdan
> Pin package versions

This is true for any packages you use, not just with Docker. For example,
Rails gems: if you don't pin them, your app _will_ break at some point on an
update. Always update packages manually and test before deploying.

------
ianlevesque
"Thus it seems that if you leave a file on disk between steps in your
Dockerfile, the space will not be reclaimed when you delete the file."

This must be a bug? Why should it legitimately behave this way?

~~~
cschneid
From what I understand, docker caches the state of its world during each step.
So you add a file, it caches, remove the file, it can't free that space from
the layered filesystem. But if you create & remove in the same `RUN` command,
it never gets persisted as a step, so doesn't take up space.
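
Concretely (the URL is just an example):

    # two steps: big.tar.gz is deleted, but its bytes live on in the first layer
    RUN curl -O http://example.com/big.tar.gz
    RUN rm big.tar.gz

    # one step: the file is gone before the layer is committed
    RUN curl -O http://example.com/big.tar.gz && rm big.tar.gz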

------
zenlikethat
OP makes a good point about buildpack-deps being quite huge. My only complaint
about the official images is that they take a LONG time to pull.

------
tbronchain
Thanks for writing such stuff. Too many images available today are just a pain
in the ass to use!

------
rasur
This is gold, if only for the info about temp/transitory files and the
resultant image size.

