Specifically, they don't really express how Docker packaging is a process that integrates what you build, where you build it, and how you build it, not just the Dockerfile.
1. Caching is great... but it can also lead to insecure images because you don't get system package updates if you're only ever building off a cached image. Solution: rebuild once a week from scratch. (https://pythonspeed.com/articles/docker-cache-insecure-image...)
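For the weekly rebuild, something like this in a cron job does it (the image name is a stand-in):

    # --pull re-downloads the base image and --no-cache ignores all cached
    # layers, so you pick up the latest system packages from the base
    docker build --pull --no-cache -t myapp:latest .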
2. Multi-stage builds give you smaller images, but if you don't use them right they result in breaking caching completely, destroying all the speed and size benefits you get from layer caching. Solution: you need to tag and push the build-stage images too, and then pull them before the build, if you want caching to work. (Long version, this is a bit tricky to get right: https://pythonspeed.com/articles/faster-multi-stage-builds/)
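The pattern looks roughly like this, assuming your Dockerfile names its first stage `build-stage` and `myrepo/myapp` stands in for your registry path:

    # seed the cache from the last build (|| true so the first run doesn't fail)
    docker pull myrepo/myapp:build-stage || true
    docker pull myrepo/myapp:latest || true
    # rebuild the compile stage, then the final stage, reusing pulled layers
    docker build --target build-stage --cache-from myrepo/myapp:build-stage -t myrepo/myapp:build-stage .
    docker build --cache-from myrepo/myapp:build-stage --cache-from myrepo/myapp:latest -t myrepo/myapp:latest .
    # push both so the next builder can pull them
    docker push myrepo/myapp:build-stage
    docker push myrepo/myapp:latest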
To be fair, the official Dockerfile best-practices documentation does mention this (although it also gives contradictory advice).
    # FROM and WORKDIR lines reconstructed; base image tags are assumptions
    FROM golang:1.12
    WORKDIR /opt/src/github.com/project1/myprog
    COPY . .
    RUN go get -d -v ./...
    RUN CGO_ENABLED=0 GOOS=linux go build -a -mod=vendor -o myprog .

    FROM debian:stable
    RUN useradd -u 5002 user1

    FROM scratch
    COPY --from=0 /opt/src/github.com/project1/myprog/myprog .
    COPY --from=1 /etc/passwd /etc/passwd
The tradeoff in any packaging mechanism is always "pin everything to the byte" for maximum security and minimum updateability vs. "blindly update to latest without thinking". We normally develop with the second and deploy using the first and docker is no different.
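In Docker terms, the two poles look like this (the digest is a placeholder for whatever you actually tested against):

    # pinned to the byte: a digest is immutable, nothing changes underneath you
    FROM debian@sha256:...
    # blindly latest: whatever the tag points at when you happen to build
    FROM debian:latest

Most teams end up in between: a stable tag, rebuilt on a schedule.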
The presumption here is that you're running on a stable base image, Debian Stable or Ubuntu LTS or something. Updates to system packages are therefore basically focused on security fixes and severe bug fixes (data corruption in a database, say).
Even if you're pinning everything very specifically (and you probably should), you still want to at some point get these (severe-bugfix-only) updates on a regular basis for security and operational reasons.
Blind reliance on caching prevents this from happening.
Typically security patches trigger new releases with minor/patch version number bumps, which are then installed by getting the latest version of the package. That's pretty much the SOP of any Linux distribution.
I’ve noticed the official docker images don’t seem to do this. E.g. the official “java” images seem to be uploaded and then are never changed, the only way to get a newer underlying base system is to upgrade to a newer version tag release. Is this true of all the official images, I wonder?
Best combo is to pin to a specific tag that you periodically update to the latest stable release, and also allow overriding via a build arg. Anyone who wants the bleeding edge, say for a CI server, can run a build with “latest” as the tag arg.
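A sketch of that pattern (the default tag is just an example):

    # ARG before FROM has been supported since Docker 17.05
    ARG BASE_TAG=10.16.0
    FROM node:${BASE_TAG}

Then a CI server that wants the bleeding edge runs:

    docker build --build-arg BASE_TAG=latest .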
Here's the official Node.js image from a couple years ago, for example...

    $ sudo docker inspect node:6.1 | grep 'Z"'
Each image has a manifest with all the source repos, tags, what commit to pull from, what arches are supported, etc.
As long as the tag is listed in that manifest, it is automatically rebuilt when the base image changes.
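For reference, an entry in that library manifest looks roughly like this (all values here are illustrative, not copied from the real file):

    Tags: 8.16.0, 8.16, 8
    Architectures: amd64, arm32v7, arm64v8, ppc64le, s390x
    GitRepo: https://github.com/nodejs/docker-node.git
    GitCommit: <commit to build from>
    Directory: 8/jessie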
Only the newest version of node 8, 8.16, is listed in the manifest file. In other words, if I had an image based off node 8.15, it's never going to be updated.
So it's not a matter of just rebuilding regularly: if you aren't updating your Dockerfiles to use newer language versions, you also aren't going to get system updates.
Edit: I think I do see your point, which is that if you are completely up to date on language versions, clearing the build cache every once in a while may still pick up a system update if an upstream image changed in between releases of a new language tag.
With BuildKit, the cache for all the intermediate stages is tracked. You can push the whole cache with buildx, or inline the intermediate cache metadata with buildx or v19.03. `--cache-from` will pull in matched layers automatically on build. You can also export the cache to a shared volume if that suits you better than a registry.
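With buildx, that looks something like this (the registry ref is a placeholder):

    docker buildx build \
      --cache-from type=registry,ref=myrepo/myapp:buildcache \
      --cache-to type=registry,ref=myrepo/myapp:buildcache,mode=max \
      -t myrepo/myapp:latest .

mode=max exports layers for all intermediate stages, not just the final image; `--cache-to type=local,dest=...` writes the cache to a directory instead of a registry.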
Is there a reason why this would be bad? Clearly you'd have to clean out old cache layers regularly, but I'm more concerned with some layers taking a very long time to build; the caching seems required.
Though I suppose if you set up your CI system to store the caches locally then you get caching, and it's more efficient as you're not downloading layers. So maybe that's just the "right" way to do it, regardless. /shrug
Eventually this will be unnecessary given (currently experimental) secret-sharing features Docker is adding. But for now pushing everything by default would be a security risk for some.
There's a good summary here: https://medium.com/@tonistiigi/build-secrets-and-ssh-forward.... The tl;dr is you can now write lines like "RUN --mount=type=ssh git clone ssh://git@domain/project.git" (or any command that uses SSH) in a Dockerfile to use your host machine's SSH agent during "docker build". You do currently need to specify experimental syntax in the Dockerfile, and set the DOCKER_BUILDKIT environment variable to 1, and pass "--ssh default" to "docker build", but it's a great workflow improvement IMO.
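Putting it all together, a minimal sketch (with "domain" and "project" as the same placeholders used above):

    # syntax=docker/dockerfile:experimental
    FROM alpine
    RUN apk add --no-cache git openssh-client
    # trust the host key so the clone doesn't prompt interactively
    RUN mkdir -p -m 0700 ~/.ssh && ssh-keyscan domain >> ~/.ssh/known_hosts
    # the host's SSH agent socket is forwarded only for this command
    RUN --mount=type=ssh git clone ssh://git@domain/project.git

Built with:

    DOCKER_BUILDKIT=1 docker build --ssh default .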
* You can mount build-time secrets in safely with `--mount=type=secret`, instead of passing them in. (Multistage builds do alleviate the problems with passing secrets in, but not completely.) There's a sketch after this list.
* BuildKit automatically parallelizes build stages. (Of course!)
* Mount apt cache dirs in at build time with `--mount=type=cache` so that you don't have to apt-get update every single time, and you don't have to clear apt caches either.
And lots more.
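For the secret and apt-cache points above, something like this (the secret id, package, and fetch script are arbitrary examples):

    # syntax=docker/dockerfile:experimental
    FROM ubuntu:18.04
    # stop apt from auto-deleting downloaded .debs, so the cache mount helps
    RUN rm -f /etc/apt/apt.conf.d/docker-clean
    # cache mounts persist across builds but never end up in a layer
    RUN --mount=type=cache,target=/var/cache/apt \
        --mount=type=cache,target=/var/lib/apt \
        apt-get update && apt-get install -y build-essential
    # the secret file exists only during this RUN, never in the image
    RUN --mount=type=secret,id=mytoken \
        MYTOKEN="$(cat /run/secrets/mytoken)" ./fetch-private-deps.sh  # hypothetical script

Built with:

    DOCKER_BUILDKIT=1 docker build --secret id=mytoken,src=token.txt .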
Notice that this is mostly about bringing capabilities Docker already has at run time to build time.
It was a deep hack into the entirety of the Docker build chain. This way it was probably possible to publish experimental work-in-progress build features to the world at a faster pace than the official Docker release cycle.
And as pointed out above, the online-only requirement has been lifted already.
The situation is perfectly understandable, and it's to be commended that people offered to do this work for the betterment of all.
Also, if you dare to enable user namespaces for Docker (because, well, security again), multi-stage builds fail (https://github.com/moby/moby/issues/34645)
Perhaps worth a mention in this blogpost?
The post is focused on Java apps but, for example, there is a distinction between runtime and SDK images in .NET Core. If you want to build in Docker, you have to pull the heavier SDK image. If you copy the pre-built binaries into the image, you can use the runtime image. I guess there could be similar situations on other platforms too.
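For .NET Core that split looks roughly like this (image tags and app name are illustrative):

    # the heavier SDK image does the compiling
    FROM mcr.microsoft.com/dotnet/core/sdk:2.2 AS build
    WORKDIR /src
    COPY . .
    RUN dotnet publish -c Release -o /app

    # the slimmer runtime-only image is what ships
    FROM mcr.microsoft.com/dotnet/core/runtime:2.2
    WORKDIR /app
    COPY --from=build /app .
    ENTRYPOINT ["dotnet", "MyApp.dll"]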
Other than that, it looks like a decent guide. Thanks to the author.
    docker run image test
The appeal of docker is completely & reproducibly owning production (what runs on your laptop runs on prod), and that also applies to the build (what builds on your laptop builds on prod). Not to mention the add-on benefits that you can now use standard build agents across every tech stack and project, no need to customize them or keep them up to date, etc.
In addition, you can build a separate container from a specific part of your multi-stage build (for example, if you want to build more apps based on the SDK step, or run tests which require debugging). So from one Dockerfile you can have multiple images or tags to use for different parts of your pipeline. The resulting production image is still based on the same origin code, so you have more confidence that what's going to production was what was tested in the pipeline.
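For example (assuming the Dockerfile declares a stage named `build` and a .NET app, as above):

    # build only up to the SDK/build stage and run the tests inside it
    docker build --target build -t myapp:test .
    docker run --rm myapp:test dotnet test
    # the full build of the same Dockerfile produces the production image
    docker build -t myapp:prod .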
Furthermore, devs can iterate on this Dockerfile locally, rather than trying to replicate the CI pipeline in an ad-hoc way. The more of your pipeline you stuff into a Dockerfile, the less you have to focus on "building your pipeline".
The way I read it, they're more talking about when you're developing locally everyone should be building the application inside of a container rather than on their personal machines with differing set ups.
Perhaps I'm coming at this from a wrong angle.
We set system time as well.
I wish Dockerfiles would just fade away into the background, and be replaced by something more similar to an archiver but with better integration with repositories and versioning metadata.
My personal approach for Python applications' Docker packaging (https://pythonspeed.com/products/pythoncontainer/) was similar to yours: wrap the build process in a script. I wrote it in Python rather than bash, so it's more maintainable, but a Ruby shop might want to write theirs in Ruby, etc.
It is a fact that everything that happens in a Dockerfile execution via `docker build` can be done with `docker run`, executing commands, and ending with `docker commit`. Even the caching mechanism can be replicated (except with more control, in my opinion). It is also a fact that `docker run` has more capabilities than `docker build`, such as being able to mount things in.
And yes, you can spin up an HTTP server to serve the cached packages and set all this up before using docker build, but hey, I can do it with docker run + docker commit much quicker... (I wrote "you almost can not create" and not "you can not create", so it is possible with a Dockerfile, but it is not trivial.)
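A hand-rolled "build step" with run/commit looks like this (names and paths are arbitrary):

    # run one step with a host directory mounted in, which plain
    # `docker build` (without BuildKit) cannot do
    docker run --name step1 -v "$PWD/aptcache:/var/cache/apt" \
        ubuntu:18.04 sh -c 'apt-get update && apt-get install -y curl'
    # snapshot the container filesystem as the next base image;
    # note that commit excludes the mounted volume's contents
    docker commit step1 myimage:step1
    docker rm step1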
Why on earth would anyone do that? That's simply wrong on so many levels, and is far from being standard practice.
Why do you believe this point is relevant? If you can mount a drive then you can access its contents on the host, and if you can access files on the host then those contents are also accessible in a Docker build.
No one said that. I stated the fact that you can access files from the host during a docker build, thus it's irrelevant if you can mount drives or not. Just copy the files into your build stage and that's it.
But the nicer packaging and simpler model make that new layer of abstraction useful.
This was surprising to me. I thought I could `docker pull` the layers from the registry and only re-build what had changed on my machine. But no, this doesn't work.
The reason is that the docker client archives the source files, including all the file attributes like uid, gid, mtime, etc. Between two computers those are bound to be different.
It's also good to speed up builds by configuring them to cache dependencies in a separate layer before the build happens, or by using the host machine's cached .m2/.npm folders as a volume; however, that might not work with pipelines etc. that build the Docker containers.
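The first trick looks like this for a Node app (the same idea works with pom.xml and Maven):

    # copy just the manifests first, so this layer's cache only busts
    # when dependencies change, not on every source edit
    COPY package.json package-lock.json ./
    RUN npm ci
    # now copy the rest of the source and build
    COPY . .
    RUN npm run build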
When you use the USER directive to set up a working user in Dockerfile, be explicit about UID and use the numeric UID.
Using an alphanumeric username in the USER directive works 99% of the time, but you can get some surprising artifacts if the container runtime hits this rare bug. There are likely several other great ways this can go wrong, as well.
Somewhere deep in the manual it is suggested that you should only use a numeric UID, even though USER also accepts a username. And if you look at container images built by OpenShift and other pro docker-ers, you will see they always do this with a numeric UID.
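In other words, prefer the second form:

    RUN useradd -u 5002 user1
    # fragile: the name is resolved via /etc/passwd at runtime
    USER user1
    # robust: the numeric UID needs no lookup at all
    USER 5002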