A common mistake that's not covered in this article: not performing your add & remove operations in the same RUN command. Doing them separately creates two separate layers, which inflates the image size.
This creates two image layers - the first layer has all the added foo, including any intermediate artifacts. Then the second layer removes the intermediate artifacts, but that's saved as a diff against the previous layer:
RUN ./install-foo
RUN ./cleanup-foo
Instead, you need to do them in the same RUN command:
RUN ./install-foo && ./cleanup-foo
This creates a single layer which has only the foo artifacts you need.
This is why the official Dockerfile best practices show[1] the apt cache being cleaned up in the same RUN command:
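Roughly this pattern - the package names here are just placeholders:
RUN apt-get update && apt-get install -y --no-install-recommends \
    package-foo package-bar \
    && rm -rf /var/lib/apt/lists/*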
The downside of trying to jam all of your commands into one gigantic RUN invocation is that if it isn't correct, or you need to troubleshoot it, you can wind up waiting 10-20 minutes for the build to finish after every single-line change.
You lose all the layer caching benefits and it has to re-do the entire build.
Just a heads up for anyone that's not suffered through this before.
I’m confused why they haven’t implemented a COMMIT instruction.
It’s so common to have people chain “command && command && command && command” to group things into a single layer. Surely it would be better to put something like “AUTOCOMMIT off” at the start of the Dockerfile and then “COMMIT” whenever you want to explicitly close the current layer. It seems much simpler than everybody hacking around it with shell side-effects.
Do keep in mind that you might want a set of trusted TLS certificates and the timezone database. Missing either leads to annoying runtime errors, when you don't trust https://api.example.com or try to return a time to a user in their preferred time zone. Distroless includes these.
Downside - squash makes for longer pulls from the image repository, which can matter for large images or slow connections (you keep build layers but now have no caching for consumers of the image). There are various tricks to be pulled that don't use squash - I've had the most luck putting multiple commands into a buildkit stage, then mounting the results of that stage and copying the output in (either by manually getting the list of files, or using rsync to figure it out for me).
But then you end up with just one layer, so you lose out on any caching and sharing you might have gotten. Whether this matters is of course very context dependent, but there are times when it'll cost you space.
Doesn't that squash all your layers though? That defeats the whole purpose of there being layers. Now, instead of a huge total size of which you push only a fraction, your total is lower but you're pushing all of it every time. Same goes for disk space if you're running multiple instances or many images with shared lineage.
You don't have to do this anymore, the buildkit frontend for docker has a new feature that supports multiline heredoc strings for commands: https://www.docker.com/blog/introduction-to-heredocs-in-dock... It's a game changer but unfortunately barely mentioned anywhere.
Multistage builds are a better solution for this. Write as many steps as required in the build image and copy only what’s needed into the runtime image in a single COPY command
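A minimal sketch of that pattern, assuming a Go service whose source lives in ./cmd/server (the names and tags are illustrative):
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /out/server ./cmd/server
# --- runtime stage: only the binary is shipped ---
FROM gcr.io/distroless/static-debian11
COPY --from=build /out/server /server
CMD ["/server"]
Only the final stage ends up in the pushed image; the toolchain and intermediate artifacts stay behind in the build stage.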
Is it an option to put all the setup and configuration into a script? So the Dockerfile becomes effectively just:
RUN ./setup.sh
I have seen that in some cases as a way to reduce layer count while avoiding complex hard to read RUN commands. Also seen it as a way to share common setup across multiple Docker images:
RUN ./common_setup_for_all_images.sh
RUN ./custom_setup_for_this_image.sh
However this approach of doing most of the work in scripts does not seem common, so I'm wondering if there is a downside to doing that.
The downside of this is the same as the upside: it stuffs all that logic into one layer. If the result of your setup script changes at all, then the cache for that entire layer and all later layers is busted. This may or may not be what you want.
Say setup.sh installs static assets, libraries, and your application code: then any time a static asset is updated, a library is changed, or your application code changes, the digest of the Docker layer for `RUN ./setup.sh` will change. Your team will then have to re-download the result of all three of those sub-scripts next time they `docker pull`.
However, if you found that static assets changed less often than libraries, which changed less often than your application code, then splitting setup.sh into three correspondingly-ordered `RUN` statements would put the result of each sub-script its own layer. Then, if just your application code changed, you and your team wouldn't need to re-download the library and static asset layers.
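Sketching that out (the three script names are hypothetical):
# ordered from least to most frequently changing
RUN ./install-static-assets.sh
RUN ./install-libraries.sh
RUN ./install-app.sh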
I do this for all of the CI images I maintain. Additionally, it leaves evidence of the setup in the container itself. Usually I have a couple of these scripts (installing distro-provided deps, building each group of other deps, etc.).
During development, or with any image you need to update rather often, you usually don't want to lose all of Docker's caching by putting everything into one giant RUN directive. This is a case where premature optimization strikes hard. Don't merge RUN directives from the start. First build your image in a non-optimized way, saving loads of build time by making use of Docker's build cache.
Personally I would not merge steps that have nothing to do with each other, unless I am sure they are basically set in stone forever.
With public and widely popular base images, which are not changed once they are released, the choices might be weighed differently, as all the people who build on top of your image will want fast downloads and a small resulting image.
Simply put: don't make your development more annoying than necessary by introducing long wait times for building Docker images.
In particular, cache mounts (RUN --mount=type=cache) can help with the package manager cache size issue, and heredocs are a game-changer for inline scripts. Forget all that && nonsense; write clean multiline RUN commands:
RUN <<EOF
apt-get update
apt-get install -y foo bar baz
etc...
EOF
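The cache mount mentioned above combines with this too - a sketch for an apt-based image (note the official Debian/Ubuntu images auto-clean the apt cache via /etc/apt/apt.conf.d/docker-clean, so you may need to disable that for the cache mount to pay off):
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked <<EOF
apt-get update
apt-get install -y foo bar baz
EOF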
All of this works in the plain old desktop Docker you already have installed; you just need to use the buildx command (BuildKit engine) and reference the Docker Labs buildkit frontend image above. Unfortunately it's barely mentioned in the docs or anywhere else other than their blog right now.
I guess you don't even need to use `docker buildx`, just `export DOCKER_BUILDKIT=1` and go with it (great to enable globally in a CI system).
Heredocs make these multi-lines so much cleaner, awesome.
There are other base images from Google that are smaller than the ones discussed here and come in handy when deploying applications that run as a single binary.
> Distroless images are very small. The smallest distroless image, gcr.io/distroless/static-debian11, is around 2 MiB. That's about 50% of the size of alpine (~5 MiB), and less than 2% of the size of debian (124 MiB).
Distroless images are tiny, but sometimes the fact that they don't have anything on them other than the application binary makes them harder to interact with, especially when troubleshooting or profiling. We recently moved a lot of our stuff back to vanilla Debian for this reason. We figured that the extra 100MB wouldn't make that big of a difference when pulling for our Kubernetes clusters. YMMV.
I found this to be an issue as well, but there are a few ways around this for when you need to debug something. The most useful approach I found was to launch a new container from a standard image (like Ubuntu) which shares the same process namespace, for example:
docker run --rm -it --pid=container:distroless-app ubuntu:20.04
You can then see processes in the 'distroless-app' container from the new container, and then you can install as many debugging tools as you like without affecting the original container.
Alternatively, distroless has debug images you could use as a base instead, which are probably still smaller than many other base images:
I've found myself exec-ing into containers a lot less often recently. Kubernetes has ephemeral containers for debugging. This is of limited use to me; the problem is usually lower level (container engine or networking malfunctioning) or higher level (app is broke, and there is no command "fix-app" included in Debian). For the problems that are lower level, it's simplest to resolve by just ssh-ing to the node (great for a targeted tcpdump). For the problems that are higher level, it's easier to just integrate things into your app (I would die without net/http/pprof in Go apps, for example).
I was an early adopter of distroless, though, so I'm probably just used to not having a shell in the container. If you use it everyday I'm sure it must be helpful in some way. My philosophy is as soon as you start having a shell on your cattle, it becomes a pet, though. Easy to leave one-off fixes around that are auto-reverted when you reschedule your deployment or whatever. This has never happened to me but I do worry about it. I'd also say that if you are uncomfortable about how "exec" lets people do anything in a container, you'd probably be even more uncomfortable giving them root on the node itself. And of course it's very easy to break things at that level as well.
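For anyone who hasn't tried ephemeral containers, the flow is roughly this (pod and container names are placeholders):
kubectl debug -it my-app-pod --image=ubuntu:20.04 --target=app
The --target flag puts the debug container in the same process namespace as the distroless one, much like the --pid=container: trick above.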
Also if you are running k8s, and use the same base image for your app containers, you amortize this cost as you only need to pull the base layers once per node. So in practice you won’t pull that 100mb many times.
(This benefit compounds the more frequently you rebuild your app containers.)
Doesn't that only work if you used the exact same base? If I build 2 images from debian:11 but one of them used debian:11 last month and one uses debian:11 today, I thought they end up not sharing a base layer because they're resolving debian:11 to different hashes and actually using the base image by exact image ID.
Base images like alpine/debian/ubuntu get used by a lot of third party containers too so if you have multiple containers running on the same device they may in practice be very small until the base image gets an upgrade.
I think this is something that people miss a lot when trying to optimize their Docker builds: the tension between optimizing for most of your builds vs. optimizing for one specific build. Not easy.
There are some tools that allow you to copy debug tools into a container when needed. I think all that needs to be in the container is tar, and it runs `kubectl exec ... tar` in the container. This allows you to get in when needed but still keep your production attack surface low.
Either way, as long as all your containers share the same base layer it doesn't really matter, since they will be deduplicated.
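For example, `kubectl cp` is built on top of tar inside the target container, so a statically linked busybox can be dropped in on demand (names and paths are illustrative):
kubectl cp ./busybox-static my-app-pod:/tmp/busybox
kubectl exec -it my-app-pod -- /tmp/busybox sh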
The way I imagine this is best solved is by keeping a compressed set of tools on your host and then mounting those tools into a volume for your container.
So if you have N containers on a host you only end up with one set of tooling across all of them, and it's compressed until you need it.
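A rough sketch of that idea, skipping the compression step (the host path and image name are made up):
# one copy of the tooling on the host, shared read-only by every container
docker run -d --name app -v /opt/debug-tools:/tools:ro myimage
# when something goes wrong:
docker exec -it app /tools/busybox sh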
You can decouple your test tooling from your images/containers, which has a number of benefits. One that's perhaps understated is reducing attacker capabilities in the container.
With log4j some of the payloads were essentially just calling out to various binaries on Linux. If you don't have those they die instantly.
It got removed from the README at some point, but the smallest distroless image, gcr.io/distroless/static is 786KB compressed -- 1/3 the size of this image of shipping containers[0], and small enough to fit on a 3.5" floppy disk.
So the percentage makes it look impressive, but... you're saving no more than 5MB. Don't get me wrong, I like smaller images, but I feel like "smaller than Alpine" is getting into -funroll-loops territory of over-optimizing.
Every now and then I break out dive and take a look at container images. Almost without fail I'll find something we can improve.
The UX is great for the tool, gives me absolutely everything I need to see, in such a clear fashion, and with virtually no learning curve at all for using it.
Since it captures exact dependencies, it becomes easier to put just what you need in the image. Prior to Nix, my team (many years ago) built a Redis image that was about 15MB in size by tracking the used files and removing unused files. Nix does that reliably.
We use Nix + Bazel. Nix builds the base image with Python and whatever else we want. Bazel layers our actual Python app on top of it. No dockerfiles at all.
It sounds so cool, but then I don’t get out of the base image before you’re writing your own Python launcher in a heredoc in a shell script in a docker image builder in a nix derivation[0]? Curiosity compels me to ask: how did all that become necessary?
It mostly grew out of using Nix to fetch a python interpreter for builds and tests. By default, Bazel will use whichever python binary is on the host (if any), which can lead to discrepancies between build hosts and various developer machines.
The main difference between Dockerfiles and something like Nix is that the former is run "internally" and the latter "externally".
For example, a Dockerfile containing 'my-package-manager install foo' will create an image with foo and my-package-manager (which usually involves an entire OS, and at least a shell, etc.). An image built with Nix will only contain foo and its dependencies.
Note that it's actually quite easy to make container images "externally", using just `tar`, `jq` and `sha256sum`. The nice thing about using Nix for this (rather than, e.g. Make) is the tracking of dependencies, all the way down to the particular libc, etc.
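A very rough sketch of the "external" approach, building a single-layer, docker-load-able image by hand (the binary name and tag are made up, and the image spec details are simplified - treat this as an outline rather than a reference):
# build a root filesystem however you like (here: one static binary)
mkdir -p rootfs/bin && cp my-static-binary rootfs/bin/app
tar -C rootfs -cf layer.tar .
diff_id="sha256:$(sha256sum layer.tar | cut -d' ' -f1)"
jq -n --arg d "$diff_id" \
  '{architecture:"amd64", os:"linux", config:{Cmd:["/bin/app"]},
    rootfs:{type:"layers", diff_ids:[$d]}}' > config.json
jq -n '[{Config:"config.json", RepoTags:["handmade:latest"], Layers:["layer.tar"]}]' > manifest.json
tar -cf image.tar layer.tar config.json manifest.json
docker load -i image.tar
Tools like Nix's dockerTools do essentially this, but compute the full dependency closure for you.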
For my two cents, if your image requires anything not vanilla, you may be better off stomaching the larger Ubuntu image.
Lots of edge cases around specific libraries come up that you don't expect. I spent hours tearing my hair out trying to get Selenium and python working on an alpine image that worked out-of-the-box on the Ubuntu image.
That’s rolling your own distro. We could do that but it’s not really our job. It also prevents the libraries from being shared between images, unless you build one base layer and use it for everything in your org (third parties won’t).
Once you start adding stuff, I think Alpine gets worse. For example, there’s a libgmp issue that’s in the latest Alpine versions since November. It’s fixed upstream but hasn’t been pulled into Alpine.
musl DNS stub resolver is "broken" unfortunately (it doesn't do TCP, which is a problem usually when you want to deploy something into a highly dynamic DNS-configured environment, eg. k8s)
Do libraries just sitting there on disk do any damage?
Also, are you going to update those libraries as soon as a security issue arises? Debian/Ubuntu and friends have teams dedicated to that type of thing.
This is not a valid comparison. You're comparing bare metal virtual machines wherein you are responsible for all of the software running on the VM, with a bundled set of tarballs containing binaries you probably cannot reproduce.
Many, many vendors provide docker images but no Dockerfile. And even if you had the Dockerfile you might not have access to the environment in which it needs to be run.
Docker is successful in part because it punts library versioning and security patches and distro maintenance to a third party. Not only do you not have to worry about these things (but you should!) now you might not be able to even do anything if you wanted to.
> Docker is successful in part because it punts library versioning and security patches and distro maintenance to a third party. Not only do you not have to worry about these things (but you should!) now you might not be able to even do anything if you wanted to.
This is a very restricted view.
Besides this article is about building your own images, not using existing ones.
I found this is not actually an "Alpine" issue but a musl issue. Lots of stuff, like locale support, does not work with musl.
I do like the compact size of Alpine, but if you are not developing with musl underneath, there seem to be lots of surprises.
True. I had a somewhat similar experience with the official Alpine-based Python images. They are supposedly leaner than the Debian-based ones, but any advantage is cancelled out if you need any PyPI packages that use native libraries.
Now you suddenly need to include a compiler toolchain in the image and compile the native interface every time you build the image.
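One way to keep the toolchain out of the final image is a multi-stage build that compiles wheels once and installs them into a clean stage - a sketch, assuming a requirements.txt and the python:3.11-alpine tag:
FROM python:3.11-alpine AS build
RUN apk add --no-cache build-base libffi-dev
COPY requirements.txt .
RUN pip wheel --wheel-dir /wheels -r requirements.txt
# --- runtime stage: no compiler here ---
FROM python:3.11-alpine
COPY --from=build /wheels /wheels
RUN pip install --no-cache-dir /wheels/*.whl
The build stage still compiles, but only when requirements.txt changes; packages that link against shared libraries may still need the runtime libs (apk add ...) in the final stage, just not the compiler.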
I start all my projects based on Alpine (alpine-node, for example). I'll sometimes need to install a few libraries like ImageMagick, but if that list starts to grow, I'll just use Ubuntu.
A very common mistake I see (though not related to image size per se) when running Node apps is to do CMD ["npm", "run", "start"]. This is wasteful of memory, first of all, as npm runs as the parent process and forks node to run the main script. The bigger problem is that the npm process does not forward signals down to its child, so SIGINT and SIGTERM are not passed from npm to node, which means your server may not be gracefully closing connections.
Node.js has both a Best Practices [0] and a tutorial [1] that instruct to use CMD ["node", "main.js"]. In short: do not run NPM as main process; instead, run Node directly.
This way, the Node process itself will run as PID 1 of the container (instead of just being a child process of NPM).
The same can be found in other collections of best practices such as [2].
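In Dockerfile terms that boils down to something like this (the base tag and main.js are placeholders):
FROM node:20-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
# run node directly as PID 1 so it receives SIGTERM/SIGINT
CMD ["node", "main.js"]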
What I do is a bit more complex: an entrypoint.sh which ends up running
exec node main.js "$@"
Docs then tell users to use "docker run --init"; this flag will tell Docker to use the Tini minimal init system as PID 1, which handles system SIGnals appropriately.
I'm not a Node/NPM person, but I imagine they had in mind the equivalent of whatever is expected from npm. I expect some nodejs command to invoke the service directly
Edit: Consequently this should make the container logs a bit more useful, beyond better signal handling/respect
I don't avoid large images because of their size, I avoid them because it's an indicator that I'm packaging much more than is necessary. If I package a lot more than is necessary then perhaps I do not understand my dependencies well enough or my container is doing too much.
Starting with: Use the ones that are supposed to be small. Ubuntu does this by default, I think, but debian:stable-slim is 30 MB (down from the non-slim 52MB), node has slim and alpine tags, etc. If you want to do more intensive changes that's fine, but start with the nearly-zero-effort one first.
EDIT: Also, where is the author getting these numbers? They've got a chart that shows Debian at 124MB, but just clicking that link lands you at a page listing it at 52MB.
The article doesn't seem to do much... in the 'why'. I'm inundated with how, though.
I've been on both sides of this argument, and I really think it's a case-by-case thing.
A highly compliant environment? As minimal as possible. A hobbyist/developer that wants to debug? Go as big of an image as you want.
It shouldn't be an expensive operation to update your image base and deploy a new one, regardless of size.
Network/resource constraints (should) be becoming less of an issue. In a lot of cases, a local registry cache is all you need.
I worry partly about how much time is spent on this quest, or secondary effects.
Has the situation with name resolution been dealt with in musl?
For example, something like /etc/hosts overrides not taking proper precedence (or working at all). To be sure, that's not a great thing to use - but it does, and leads to a lot of head scratching
> A highly compliant environment? As minimal as possible. A hobbyist/developer that wants to debug? Go as big of an image as you want.
Hah, I go the other way; at work hardware is cheap and the company wants me to ship yesterday, so sure I'll ship the big image now and hope to optimize later. At home, I'm on a slow internet connection and old hardware and I have no deadlines, so I'm going to carefully cut down what I pull and what I build.
Haha, definitely understandable! The constraints one operates in always differ, so that's why I really try to stay flexible (or forgiving) in this situation.
Our development teams at work have a lot of [vulnerability scanning] trouble from bundling things they don't need. In that light, I suggest keeping things small - but that's the 'later' part you alluded towards :)
You might not need to care about image size at all if your image can be packaged as stargz.
stargz is a gamechanger for startup time.
kubernetes and podman support it, and docker support is likely coming. It lazy loads the filesystem on start-up, making network requests for things as needed and therefore can often start up large images very fast.
I like this article, and there is a ton of nuance in the image and how you should choose the appropriate one. I also like how they cover only copying the files you actually need, particularly with things like vendor or node_modules, you might be better off just doing a volume mount instead of copying it over to the entire image.
The only thing they didn't seem to cover is consider your target. My general policy is dev images are almost always going to be whatever lets me do one of the following:
- Easily install the tool I need
- All things being equal, if multiple image base OSes satisfy the above, I go with Alpine, because it's the smallest
One thing I've noticed is simple purpose built images are faster, even when there are a lot of them (big docker-compose user myself for this reason) rather than stuffing a lot of services inside of a single container or even "fewer" containers
> I also like how they cover only copying the files you actually need, particularly with things like vendor or node_modules, you might be better off just doing a volume mount instead of copying it over to the entire image.
I'd highly suggest not to do that. If you do this, you directly throw away reproducibility, since you can't simply revert back to an older image if something stops working - you need to also check the node_modules directory. You also can't simply run old images or be sure that you have the same setup on your local machine as in production, since you also need to copy the state. Not to mention problems that might appear when your servers have differing versions of the folder or the headaches when needing to upgrade it together with your image.
Reducing your image size is important, but this way you'll lose a lot of what Docker actually offers. It might make sense in some specific cases, but you should be very aware of the drawbacks.
I like this article, and there is a ton of nuisance in the image and how you should choose the appropriate one.
By chance, did you mean nuance? Because while I can agree it you can quickly get into some messy weeds optimizing an image...hearing someone call it a "nuisance" made me chuckle this afternoon
I always feel helpless with Python containers - it seems there isn't much savings ever eked out of multi-stage builds and the other strategies that are typically suggested. Docker container size really has made compiled languages more attractive to me.
There is some strange allure for spending time crafting Dockerfiles. IMO it's over glorified - for most situations the juice is not worth the squeeze.
As a process for getting stuff done, a standard buildpack will get you a better result than a manual Dockerfile for all but the most extreme end of advanced users. Even for those users, they are typically advanced in a single domain (e.g. image layering, but not security). While buildpacks are not available for all use cases, when available I can't see a reason to use a manual Dockerfile for prod packaging
For our team of 20+ people, we actively discourage Dockerfiles for production usage. There are just too many things to be an expert on; packers get us a pretty decent (not perfect) result. Once we add the packer to the build toolchain it becomes a single command to get an image that has most security considerations factored in, layer and cache optimization done far better than a human, etc. No need for 20+ people to be trained to be a packaging expert, no need to hire additional build engineers that become a global bottleneck, etc. I also love that our ops team could, if they needed, write their own buildpack to participate in the packaging process and we could slot it in without a huge amount of pain
Somewhat tangentially related to the topic of this post: does anyone know any good tech for keeping an image "warm". For instance, I like to spin up separate containers for my tests vs development so they can be "one single process" focused, but it is not always practical (due to system resources on my local dev machine) to just keep my test runner in "watch" mode, so I spin it down and have to spin it back up, and there's always some delay - even when cached. Is there a way to keep this "hot" but not run a process as a result? I generally try to do watch mode for tests, but with webdev I got alot of file watchers running, and this can cause a lot of overhead with my containers (on macOS for what its worth)
You could launch the container itself with sleep. (docker run --entrypoint /bin/sh [image] -c "sleep inf") Then start the dev watch thing with 'docker exec', and when you don't need it anymore you can kill it. (Eg. via htop)
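Something like this, assuming the watch command is `npm run test:watch` (adjust to whatever your runner is):
# keep a "warm" container around doing nothing
docker run -d --name tests --entrypoint /bin/sh myimage -c "sleep inf"
# run the watcher inside it whenever you want
docker exec -it tests npm run test:watch
# tear it down when you're done
docker rm -f tests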
With uwsgi you can control which file to watch. I usually just set it to watch the index.py so when I want to restart it, I just switch to that and save the file.
Do you mean container? So you'd like to have your long running dev container, and a separate test container that keeps running but you only use it every now and then, right? Because you neither want to include the test stuff in your dev container, nor use file watchers for the tests?
Then while I don't know your exact environment and flow, could you start the container with `docker run ... sh -c "while true; do sleep 1; done"` to "keep it warm" and then `docker exec ...` to run the tests?
When I want to run a containerized service I just look for the dockerhub image or github repo that requires the least effort to get running. In these cases is it very common to write dockerfiles and try to optimize them?
Using buildah you have pretty much complete control over layers. You can mount the image in filesystem, manipulate it in as many steps as you want, and then commit it (thus explicitly create layer) when you want.
On the flip side, it's a shell script that calls various buildah sub-commands rather than a nicer declarative DSL. Also you don't get the implicit cache reuse behaviour of Dockerfiles, since everything runs anew on the next invocation. You would have to implement your own scheme for that; IIRC breaking the script into segments for each commit, writing the image id to a file at the end of each segment, and combining that with make worked for me.
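For reference, a bare-bones version of such a script looks roughly like this (image names and paths are made up):
ctr=$(buildah from docker.io/library/debian:bookworm-slim)
# as many steps as you like before committing a layer
buildah run "$ctr" -- apt-get update
buildah run "$ctr" -- apt-get install -y --no-install-recommends ca-certificates
buildah copy "$ctr" ./app /app
buildah config --cmd '/app/run' "$ctr"
buildah commit "$ctr" my-image:latest
buildah rm "$ctr"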
[1] https://docs.docker.com/develop/develop-images/dockerfile_be...