In my experience docker-slim[0] is the way to go for creating minimal and secure Docker images.
I wasted a lot of time in the past trying to ship with Alpine base images and statically compiling complicated software. All the performance, compatibility, and package availability headaches this brings are not worth it when docker-slim does a better job of removing the OS from your images while letting you use any base image you want.
The tradeoff is that you give up image layering to some extent, and it might take a while to get dead-file elimination exactly right if your software loads a lot of files dynamically (you can instruct docker-slim to include certain paths and probe your executable during the build).
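For reference, a typical invocation looks something like this rough sketch (the image names are placeholders, and the flags are the ones I know from recent docker-slim releases; check `docker-slim build --help` for your version):

    # minify an existing image, probing it over HTTP while it runs,
    # and force-keep paths the tracer might otherwise miss
    docker-slim build \
      --target myorg/myapp:latest \
      --http-probe \
      --include-path /app/config \
      --include-path /usr/share/zoneinfo \
      --tag myorg/myapp:slim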
If docker-slim is not your thing, “distroless” base images [1] are also pretty good. You can do your build with the same distro and then, in a multi-stage Docker build, copy the artifacts into a distroless base image.
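A minimal multi-stage sketch of that pattern (the toolchain, build command, and paths are placeholders for whatever actually produces your binary):

    FROM debian:bullseye AS build
    # placeholder toolchain; swap in whatever your build actually needs
    RUN apt-get update && apt-get install -y --no-install-recommends build-essential
    WORKDIR /src
    COPY . .
    RUN make build   # placeholder for whatever produces /src/bin/server

    FROM gcr.io/distroless/base-debian11
    COPY --from=build /src/bin/server /server
    ENTRYPOINT ["/server"]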
I used Alpine religiously for years, until the build problems became too big: mostly long build times and packages removed on major version updates.
Now I first try with Alpine, and if there is the slightest hint of a problem, I move over to debian-slim. So things like Nginx are still on Alpine for me, while anything related to Python no longer is.
At first I thought your mention of docker-slim was an error for debian-slim, but I followed the link and am glad to have learned something useful.
Node.js application images:
    from ubuntu:14.04 - 432MB => 14MB (minified by 30.85X)
    from debian:jessie - 406MB => 25.1MB (minified by 16.21X)
    from node:alpine - 66.7MB => 34.7MB (minified by 1.92X)
    from node:distroless - 72.7MB => 39.7MB (minified by 1.83X)
Why are the minified Alpine images bigger than Ubuntu/Debian? Are a bunch of binaries using static linking and inflating the image? Or something else?
You can try using the xray command that will give you clues (docker-slim isn't just minification :-)). The diff capability in the Slim web portal is even more useful (built on top of xray).
I have quit using Alpine. It caused too many issues in production. For some workloads, in Go for instance, you can use scratch directly. But slim and distroless are my preferred base images.
Personally I hate scratch images because once you lose busybox, you lose the ability to exec into containers. It's a great escape hatch for troubleshooting.
There are a few options that help here. With host access, I tend to just use nsenter most of the time to do different troubleshooting. It can be a bit of a pain doing network troubleshooting though since the resolv.conf will be different without the fs namespace.
And kubernetes has debug containers and the like now.
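A rough sketch of the nsenter approach, assuming host access and Docker (the container name is a placeholder):

    # PID of the container's main process, as seen from the host
    PID=$(docker inspect --format '{{.State.Pid}}' mycontainer)

    # enter only the network namespace: your host tools (ss, ip, tcpdump)
    # keep working, but note they read the host's /etc/resolv.conf
    sudo nsenter --target "$PID" --net ss -tlnp

    # pull in more namespaces (pid, mount, uts, ipc) as needed
    sudo nsenter --target "$PID" --net --pid --mount sh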
The software required for running a Snap Store instance is proprietary [0], and there are no free software implementations as far as I know. Also, the default client code hardcodes [1] Canonical snap store, so you have to patch and maintain your own version of snapd if you want to self-host.
Snapd also hardcodes auto-updates that are also impossible to turn off without patching and maintaining your own version of snapd / blocking the outgoing connections to Canonical servers, so snapd is also horrible for server environments. To top that, the developers have this "I know what's good for you, you don't" attitude [2] that so much reminds me of You Know Who.
Yep. I am trying my best to boycott Canonical and their closed source Snap which is akin to Apple Store or Play Store, but for desktop... for... LINUX. Goes against the philosophy in every way imaginable.
It's really sad especially given how Canonical introduced so many people to Linux through Ubuntu. I understand they need to monetize to survive but I wish it wasn't like this. I miss the "Ubuntu One" service, a simple Dropbox like alternative that you pay for. Completely optional and server side. Integrated into the UI.
That being said, I was wondering how many people actually find the Snap system and ecosystem useful. Reverse engineering snapd (which is licensed under GPLv3) and snap app format in order to create a compatible server would be a fun project.
They don't give you the choice. Applications are only available as either a snap or a deb, with many you might want to containerise (e.g. Chromium for CI) being snaps.
I don't believe they can work in unprivileged Docker containers.
You cannot use snap inside Docker or any other OCI container. First of all, snap is a containerised package format as well, so it doesn't make much sense; but more importantly, it requires systemd, and as far as I know, if systemd isn't PID 1 the snap daemon won't run and its CLI will just report that it can't run.
If you're using plain Go, you get static binaries for free. If you're linking against a C library you might be out of luck. You can try setting CGO_ENABLED=0 in your environment, but I've had mixed success in practice.
I’ve only had success with CGO_ENABLED=0. If you depend on a C library, then obviously it won't work, but mercifully few Go programs depend on C libraries (not having straight C ABI compatibility is a real blessing, since virtually none of the ecosystem depends on C and its jerry-rigged build tooling).
Well this is not what I'm seeing. I need to add -ldflags "-linkmode external -extldflags -static" to the go build command otherwise the binary doesn't run inside a scratch image.
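For the record, the two variants I've seen work look roughly like this (whether you need the second depends on whether cgo is in play):

    # pure-Go code: disabling cgo is usually enough for a static binary
    CGO_ENABLED=0 go build -o app .

    # cgo in the mix: force the external linker to link statically
    go build -ldflags '-linkmode external -extldflags "-static"' -o app .

    # sanity check before dropping it into a scratch image
    file app   # should report "statically linked"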
Layers support in docker-slim is something that's on the todo list (exact time is tbd).
The recent set of the engine enhancements reduce the need for the explicit include flags. Also new application-specific include capabilities are wip (already added a number of new flags for node.js, next.js and nuxt.js applications).
Could anyone ELI5 how exactly docker-slim achieves minifications of up to 30× and more? I've read the README but it still seems like black magic to me.
Basically, what docker-slim does is check what your program is opening/using (using a mechanism similar to strace), and whatever it does open/use is then copied to a new image. How can it get those kinds of numbers? It removes the parts of the rootfs that are not required, which is basically your base image's standard files like /etc, /home, /usr, /root, and so on. It also removes all development dependencies, source code, and other cruft you might have copied in for use during the build or similar.
While absolutely genius, it would be more awesome if we could shift this to the left. And have dpkg or apt, or something new, only fetch and place those binaries that are needed.
While in theory possible, it would require a whole lot of change.
Also, like kyle said, apt and the like (all current package managers) generally can't work on a file-by-file basis; a package is installed as-is, all files included. So this would require a new package manager (or an old one adapted), with each package being a single file with no dependencies. It would also require new base images if we kept using Dockerfiles, and you would need to know every library every binary in your program needs, while still leaving in the cruft some packages ship, like READMEs, LICENSE files and so on.
The benefit of docker-slim is that it is in many ways just a line in your already defined CI/CD pipeline; it's just another step, likely working with the technologies you already use in that pipeline.
It'll be possible to do something like that in the future, where docker-slim will generate a spec that describes what your system and app package managers need to install. Using the standard package managers will be tricky for partial package installs, though, because it's pretty common that the app doesn't need the whole package. Even now docker-slim gives you a list of the files your app needs, but the info is too low-level to be usable by the standard package management tools.
While this is the go-to approach, I find it really hard later to debug problems in images that don't even have a shell or a minimum set of tools to help debug a problem.
I recommend avoiding this kind of thinking, which leads to bundling all sorts of stuff in every single container. My philosophy is that images should contain as little as possible.
To debug a container, a better way is to enter the container's kernel namespaces using a tool such as nsenter [1]. Then you can use all your favourite tools, but still access the container as if you're inside it. Of course, this means accessing it on the same host that it's running.
If you're on Kubernetes, debug containers [2] are currently in beta, and should be much nicer to work with, as you can do just "kubectl debug" to start working with an existing pod.
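A quick sketch of what that looks like (pod, container, and node names are placeholders; the exact flags may still change while the feature is in beta):

    # attach an ephemeral debug container, with your own tooling,
    # to a running pod, targeting an existing container's namespaces
    kubectl debug -it mypod --image=busybox:1.35 --target=mycontainer

    # or debug a node by starting a privileged pod on it
    kubectl debug node/mynode -it --image=busybox:1.35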
Related, is there any built-in facility that would log file system accesses that failed on the resulting image? Seems like that would give the answer in a substantial fraction of cases.
You can use --include-shell to include a shell, but my recommendation is to always keep a non-slimmed image somewhere so you can refer to it and do some debugging in either. That also allows you to use something like slim.ai's (the company behind docker-slim) web-based SaaS features, which let you see the differences between two images and browse the file system in a nice web-based file tree. Some bugs may stem from removing something that is necessary for the app to work (sometimes caused by not loading a library during the "slimming" process); for those types of errors, you need to know how your app runs more than anything.
It's a tradeoff.
The problem is that sometimes commands like cp don't output an error in case of failed permissions, and if there isn't any tool in the image (e.g. ls), you can't find out why it fails.
Typically I just use Go in a scratch base image where possible, since it’s super easy to compile a static binary. Drop in some certs and /etc/passwd from another image and your 2.5MB image is good to go.
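Roughly what that looks like as a Dockerfile (paths and binary name are placeholders; this assumes the certs and /etc/passwd from the build image are good enough for your case):

    FROM golang:1.18 AS build
    WORKDIR /src
    COPY . .
    RUN CGO_ENABLED=0 go build -o /app .

    FROM scratch
    COPY --from=build /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/
    COPY --from=build /etc/passwd /etc/passwd
    COPY --from=build /app /app
    USER nobody
    ENTRYPOINT ["/app"]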
Using `COPY --chmod` is not the correct solution for this. It works, of course, but it isn't very logical from a Dockerfile readability standpoint. The real issue is the incorrect use of multi-stage builds. In multi-stage builds you define additional build stages where you prepare your binaries (e.g. compiling them) and copy them to the final runtime stage, so your final stage remains clean of the temporary files created by your build steps. Based on your comment, in your current build stage you run curl, extract, etc., but you don't actually finish preparing the binary by setting the executable bit. Instead, you copy the half-prepared binary to the runtime stage and then try to continue your modifications there. It's similar to skipping the extraction step, copying the zip instead, extracting it in the runtime stage, and ending up with both the zip and the final binary in your exported image.
Another red flag is that you run `apt-get` after copying the binary to the runtime stage (because you still want to tweak the binary there). That means any time the source for the binary changes, the `apt` commands need to run again and are not cached. If you just add the executable bit in your build stage, you can reorder them so the `COPY` comes after the `RUN`.
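A sketch of what that looks like when the binary is fully prepared in the build stage (the URL, paths, and runtime base are placeholders):

    FROM debian:bullseye-slim AS download
    RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates
    # finish preparing the artifact here, executable bit included
    RUN curl -fsSL -o /tmp/tool https://example.com/tool.bin \
     && chmod +x /tmp/tool

    FROM gcr.io/distroless/base-debian11
    # the runtime stage only ever receives the finished artifact
    COPY --from=download /tmp/tool /usr/local/bin/tool
    ENTRYPOINT ["/usr/local/bin/tool"]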
You are correct - I should be running chmod in the download stage and that is what I did before realizing `--chmod` existed. However, `--chmod` is still a valid solution.
The reason I did not stop with running chmod in the first stage is because this seemed like a common problem - what if I was ADDing a binary or a shell script directly from a remote source and I did not have a download stage?
I'm sure there are better ways to write that Dockerfile - I'm by no means an expert. It just so happens that I noticed this problem when the Dockerfile (it was from a different project. I was modifying it) was in this state and I had nothing better to do than ~yak shave~ investigate why the image size was a bit larger than I expected :)
So many gotchas like this in Dockerfiles. I think the issue stems from it being such a leaky abstraction. To use it correctly you need to know how Docker works internally, as well as Linux, inside and out.
The default choices are baffling in docker, it really is a worse-is-better kind of tool.
Has anyone worked on a replacement for dockerfiles? I know buildah is an alternative to docker build, but it just uses the same file format
Sure, there are, but they all have enough of a learning curve that they don't seem to take hold with the masses.
Nix, Guix, Bazel, Habitat, and others all solve this problem more elegantly. There are some big folks out there quite quietly using Nix to solve:
* reproducible builds
* shared remote/CI builds
* trivial cross-arch support
* minimal container images
* complete knowledge of all SW dependencies and what-is-live-where
* "image" signing and verification
I know docker and k8s well and it's kind of silly how much simpler the stack could be made if even 1% of the effort spent working around Docker were spent by folks investing in tools that are principally sound instead of just looking easy at first glance.
Miss me with the complaints about syntax. It's just like Rust. Any pain of learning is very quickly forgotten by the unbridled pace at which you can move. And besides, it's nothing compared to (looks at calendar) 5 years of "Top 10 Docker Pitfalls!" as everyone tries to pretend the teetering pile of Go is making their tech debt go away.
I never thought I'd come around to being someone wary of the word "container", as someone who sorta made it betting on them. There is so little care for actually managing and understanding the depth of one's software stack, well, we have this. (Pouring one out for yet another Dockerfile with apt-get commands in it.)
Docker provides a solution for balls of mud. You now have a more reproducible ball of mud!
Bazel and company require you to clean up your ball of mud first. So your payoff is further away (and can sometimes be theoretical)
Ultimately it’s less about Docker and more about tooling supporting reproducibility (apt but with version pinning please), but in the meantime Docker does get you somewhere and solve real problems without having to mess around with stuff too much.
And of course the “now you have a single file that you can run stuff with after building the image ”. I don’t believe stuff like Nix offers that
Can you generate a tar file that you can “just run“ (or something to that effect)? My impression was that Nix works more like a package installer, but deterministic
With the new (nominally experimental) CLI, use `nix bundle --bundler github:NixOS/bundlers#toArx` (or equivalently just `nix bundle`) to build a self-extracting shell script[1], `...#toDockerImage` to build a Docker image, etc.[2,3], though there’s no direct AppImage support that I can see (would be helpful to eliminate the startup overhead caused by self-extraction).
If you want a QEMU VM for a complete system rather than a set of files for a single application, use `nixos-rebuild build-vm`, though that is intended more for testing than for deployment.
The Docker bundler seems to be using more general Docker-compatible infrastructure in Nixpkgs[4].
There might be quicker ways to do this, but with one extra line a derivation exports a docker image which can in turn be turned to a tar with one more line.
Nix's image building is pretty neat. You can control how many layers you want, which I currently maximize so that docker pulls from AWS ECR are a lot faster
Uhm, can't get Nix to build a crossSystem on MacBook M1, it fails compiling cross GCC. I wouldn't say it's trivial. Maybe the Nix expressions look trivial, but getting them to actually evaluate is not.
> it's kind of silly how much simpler the stack could be made if even 1% of the effort spent working around Docker were spent by folks investing in tools that are principally sound instead of just looking easy at first glance.
This is the phrasing I was groping around for. Thank you
> Miss me with the complaints about syntax. It's just like Rust.
Yeah and it's competing against Dockerfiles, which I suppose in this analogy is like Python or bash with fewer footguns; syntax and parts of the functional paradigm are absolutely putting nix at a usability/onboarding disadvantage to docker.
You can also use buildah commands without the whole dockerfile abstraction. As a structured alternative there's also an option to build container images from nix expressions.
I've been using Packer with the Docker post-processor. I’ve had to give up multi-stage builds but being able to ditch Dockerfiles and simply write a shell script without a thousand &&\’s is more than enough reason to keep me using it.
> I’ve had to give up multi-stage builds but being able to ditch Dockerfiles and simply write a shell script without a thousand &&\’s is more than enough reason to keep me using it.
I don't understand your point. If all you want to do is set a container image by running a shell script, why don't you just run the shell script in your Dockerfile?
Or better yet, prepare your artifacts before, and then build the Docker image by just copying your files.
It sounds like you decided to take the scenic route of Docker instead of just taking the happy path.
I’m glad you were able to infer so much from so little, but what it actually sounds like is that you don’t know how helpful it is to build with a system like Packer. As others have pointed out, Dockerfiles are full of gotchas, the incomprehensible mess that they become due to the limited format and the need for workarounds is only half the reason I use packer now. If you think Dockerfiles produce a “happy” path, then good for you, but you might first fix the COPY command and make sure it works for multiple files, with args, and ONBUILD, or any of the other warts sitting around in the issue list. We’re all waiting.
Meanwhile, I write packer files in HCL - a saner language and a saner format - without worrying about the way files are copied. Of course, it’s not perfect but I’d choose any of the other suggestions here before going back to Dockerfiles based on your optimism and the knowledge - that I already had but virtually every author of a Dockerfile ignores - that I can RUN a script. Thanks, but no thanks.
I run Linux as I always have. Building and running are super simple.
I feel like Docker was created more or less to let Mac devs do Linux things. Wastefully. And without a lot of reason, tbh. And of course, they don't generally even understand Linux.
Why would a technology built on top of cgroups, a feature only available in the linux kernel, be created to "let mac devs do linux things"? In fact, running docker on Mac was painful in the early days with boot2docker.
Just my experience on my team. The Linux guys were already building and running things locally. So the sales pitch, so to speak, from our team was the Mac guys saying 'hey, now we can build and test locally!', whereas the Linux guys just kinda found it a slight annoyance.
Things have certainly changed with the rise of kube, ecr, and the such. But in the time of doing standard deploys into static vms, it didn't make a ton of sense.
I encourage you to investigate where Docker came from, and the rise of containerization in general. The notion that you have is rather misinformed and anachronistic. Competing against standard deploys onto VMs, especially using proprietary software, is exactly why containerization gained a foothold.
Whatever this anecdote your team told you about Mac guys, this just has nothing to do with docker's, and containers in general, rise to fame. It wouldn't be until much later when Mac users were starting to rely on tools like Vagrant for development environments where docker was seen as an alternative to that. If your team were real linux guys, they probably would have already known about lxc, as well as all the other technologies that lead up to it: jails, solaris containers, and vserver, so seeing this as "some annoying mac thing" is especially puzzling to me.
You know, I try to be reasonable so, you're right - my initial comment was way too broad and dismissive.
I told a personal tale about adoption (not creation), which isn't exactly fair to the creators.
It's a slightly different and perhaps jaded view when a perfectly solid workflow is upended, and when asking why you get responses like 'consistent OS and dependencies', which our VMs already had, and 'we can run it locally', which half of us already did.
Admittedly, there is a lot of value in a consistent and repeatable environment specification (vs. bespoke everywhere), being able to do so without needing to spin up VMs, and yes, running linuxy things on Mac and Win, among other things.
Slight nitpick, but `apt-get update && apt-get install -y openssl dumb-init iproute2 ca-certificates` in the dockerfile is not the recommended approach.
That command itself means that the Docker image is no longer reproducible. You cannot rebuild it (with any code changes for your service) and be guaranteed to get the same thing as what might already be in production, due to changes in the packages.
Always better to go with the base image, add your packages to the base and then use that new image as the base image for your application.
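A sketch of that split, with a date-tagged base image (the registry name, tag, and app paths are placeholders):

    # base/Dockerfile - rebuilt on a schedule, pushed as e.g. registry.example.com/base:2022-03-26
    FROM ubuntu:21.10
    RUN apt-get update \
     && apt-get install -y --no-install-recommends openssl dumb-init iproute2 ca-certificates \
     && rm -rf /var/lib/apt/lists/*

    # app/Dockerfile - pinned to a known-good base, so rebuilds don't re-run apt
    FROM registry.example.com/base:2022-03-26
    COPY ./app /app
    CMD ["dumb-init", "/app/start"]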
> It's a tradeoff between making container images reproducible, and not shipping security vulnerabilities.
You can regenerate your base images every day or more often and have consistent containers created from an image. Freshly generated image can be tested in a pipeline to avoid issues and you won't hit issues like inability to scale due to misbehaving new containers.
> You can regenerate your base images every day or more often and have consistent containers created from an image.
That solves nothing, as it just moves the unreproducibility to a base image at the cost of extra complexity. Arguably that can even make the problem worse as you just add a delta between updates where there is none if you just run apt get upgrade.
> Freshly generated image can be tested in a pipeline to avoid issues and you won't hit issues like inability to scale due to misbehaving new containers.
You already get that from container images you build after running apt get upgrade.
`apt` runs during the creation of 1-3 VM images per architecture and not during creation of dozens of container images based on each VM image.
When we have VM images upon which all our usual Docker images were successfully built, we trust it more than `FROM busybox/alpine/ubuntu` with following Docker builds. I've detailed the process in a neighboring comment[1] but you're right that it doesn't suit all workflows.
For AMIs (and other VM images) it might make more sense. With containers? Not so much. And with a distributed socket image caching layer it makes even less sense.
We have a maximum image age of 60 days at work. You gotta rebase at a minimum of 60 days or when something blows up. Keeps everyone honest and honestly not that bad. New sprint new image then promotion. And with a container repository and it being internal does reproducibility really matter? Just pull an older version if push comes to shove.
I don't know (I know) why people aren't moving to platforms like lambda to avoid NIH-ing system security patching operations. We can still run mini monoliths without massive architectural change if we don't get too distracted by FaaS microservice hype
When your workloads are unpredictable and spike suddenly such that you can't scale quickly enough to avoid having a bunch of spare capacity waiting around and have HA requirements. In this scenario more is spent on avoiding variable spend to achieve a "flat" rate
In 20 years of writing software, I have never seen an amount of legitimate influx of traffic that can swamp a whole pool of servers faster than it can scale. I’m not saying it can’t happen, I’ve just not worked on any code or infrastructure that couldn’t keep up with the demands of scale. Is there an industry this regularly happens in where this is a recurring issue?
I write software that a billion users see every day, so maybe I’m jaded by the sheer scale and challenges of writing code at scale that I just can’t imagine these types of problems.
You are looking at your own experiences I guess. In edtech it is common for large classrooms to suddenly come online and do things in tight coordination and no predictive scaling isn’t predictable enough for this problem. You can also look at ecommerce, Black Friday type events to see how capacity planning can easily require runway on spare capacity before scaling can react several minutes in.
Do you think EC2 capacity on AWS is on average kept in high utilization? Everyone runs (non truly elastic resources) with headroom to varying degrees
Ah, yeah, I’m only familiar with the industries I’ve worked in and never worked in edtech. That’s a pretty good example of any industry that gets sudden, unpredictable load.
> I don't know (I know) why people aren't moving to platforms like lambda to avoid NIH-ing system security patching operations.
Perhaps because people do their homework and just by reading the sales brochure they understand that lambdas are only cost-effective as handlers of low-frequency events, and they drag in extra costs by requiring support services to handle basic features like logging, tracing, and even handling basic http requests.
Predictability has nothing to do with it. Volume is the key factor, specially its impact on cost.
> arrogant of you to say adopters haven’t done their homework
Those who mindlessly advocate lambdas as a blanket solution quite clearly didn't even read the marketing brochure. Otherwise they would be quite aware of how absurd their suggestion is.
Basically you recreate your personal base image (with the apt-get commands) every X days, so you have the latest security patches. And then you use the latest of those base images for your application. That way you have a completely reproducible docker image (since you know which base image was used) without skipping on the security aspect.
> Basically you recreate your personal base image (with the apt-get commands) every X days, so you have the latest security patches.
How exactly does that a) assure reproducibility if you use a custom unreproducible base image, b) improve your security over daily builds with container images built by running apt get upgrade?
In the end that just needlessly adds complexity for the sake of it, to arrive at a system that's neither reproducible nor equally secure.
If I build an image using the Dockerfile in the blog post 10 days later, there is no guarantee that my application would work. The packages in Ubuntu's repositories might be updated to new versions that are buggy/no longer compatible with my application.
OP's suggestion is to build a separate image with required packages, tag it with something like "mybaseimage:25032022" and use it as my base image in the Dockerfile. This way, no matter when I rebuild the Dockerfile, my application will always work. You can rebuild the base image and application's image every X days to apply security patches and such. This also means I now have to maintain two images instead of one.
Another option is to use an image tag like "ubuntu:impish-20220316" (instead of "ubuntu:21.10") as base image and pin the versions of the packages you are installing via apt.
I personally don't do this since core packages in Ubuntu's repositories rarely introduce breaking changes in the same version. Of course, this depends on package maintainers, so YMMV.
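For anyone who does want to pin, a sketch of what that looks like (the package version strings here are placeholders; look up the real ones with `apt-cache madison <pkg>`):

    # pin the base to a dated tag (or even a digest: FROM ubuntu@sha256:...)
    FROM ubuntu:impish-20220316
    RUN apt-get update \
     && apt-get install -y --no-install-recommends \
          openssl=1.1.1l-1ubuntu1.3 \
          ca-certificates=20210119ubuntu1 \
     && rm -rf /var/lib/apt/lists/*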
Whether you have a separate base or not, it relies on you keeping an old image.
The advantage a separate base has is allowing you to continue to update your code on top of it, even while the new bases are broken.
You could still do that without it though, just by forking out of the single image at the appropriate layer. Not as easy, but how often does it happen?
> If I build an image using the Dockerfile in the blog post 10 days later (...)
To start off, if you intend to run the same container image for 10 days straight, you have far more pressing problems than reproducibility.
Personally I know of zero professional projects whose production CI/CD pipelines don't deploy multiple times per day, or at worst weekly in the rare cases where there are zero commits.
> OP's suggestion is to build a separate image with required packages, tag it with something like "mybaseimage:25032022" and use it as my base image in the Dockerfile.
Again, that adds absolutely nothing to just pulling the latest base image, running apt-get upgrade, and tagging/adding metadata.
Eh, that’s a heavy handed and not great way of ensuring reproducibility.
The smart way of doing it would be to:
1. Use the direct SHA reference to the upstream “Ubuntu” image you want.
2. Have a system (Dependabot, renovate) to update that periodically
3. When building, use “cache from” and “cache to” to push the image cache somewhere you can access
And… that’s it. You’ll be able to rebuild any image that is still cached in your cache registry. Just re-use an older upstream Ubuntu SHA reference and change some code, and the apt commands will be cached.
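A sketch of the build command for step 3 (the registry refs are placeholders; this assumes BuildKit/buildx):

    docker buildx build \
      --cache-from type=registry,ref=registry.example.com/myapp:buildcache \
      --cache-to type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
      -t registry.example.com/myapp:latest \
      --push .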
I'm applying security patches, necessary updates and similar during system image creation (VM image - for example AWS AMI - the one later referred in Dockerfile's FROM). Hashicorp's Packer[1] comes in handy. System images are built and later tested in an automated fashion with no human involvement.
Testing phase involves building Docker image from fresh system image, creating container(s) from new Docker image and testing resulting systems, applications and services. If everything goes well, the system image (not Docker image) replaces previously used system image (one without current security patches).
We have somewhat dynamic and frequent Docker images creation. Subsequent builds based on the same system image are consistent and don't cause problems like inability to scale. Docker does not mess with the system prepared by Packer - doesn't run apt, download from 3rd party remote hosts but only issues commands resulting in consistent results.
This way we no longer have issues like inability to scale using new Docker images and humans are rarely bothered outside testing phase issues. No problems with containers though, as no untested stuff is pushed to registries.
I mean, HN is the land of "offload this to a SaaS" and when we can actually offload something to a distro, like "guarantee that an upgrade in the same distro version is just security patches and won't break anything", it is recommended to avoid doing it?
Security assfarts will yell at you for either approach. It'll just be different breeds yelling at you depending which route you go, and which one most recently bit people on the ass.
That's a bold claim. Do you have any references to support it? The examples in Docker's documentation use apt-get directly and I don't see any recommendation to use a base image as you describe.[1][2]
With Debian, there are snapshot images[3] which seem like a better approach for making apt-get reproducible. You'd simply have to change the "FROM" line in the Dockerfile to something like "FROM debian/snapshot:stable-20220316" (where 20220316 is the date of the image you are trying to reproduce, helpfully given in /etc/apt/sources.list).
With the approach you describe, you would have to carefully manage the base images: tag them, record which one was used to create each application image, and keep them around in order to reproduce older application images.
I'm sure there are situations where the approach you describe is useful (e.g. with other package managers, especially ones that don't have a notion of lockfiles), but it adds complexity and I don't think it's necessarily justified in the case of apt-get (at least on Debian).
But the base images seem to not be stable themselves. The article's example of ubuntu:21.10 was released on Mar 18 2022 as of today (Mar 26) [0]. So if the base image is not fixed, the reproducibility is already gone.
I noticed you start to find every operating system quirk ever when you start writing Dockerfiles. I've run into so many strange things just converting a simple predictable shell script into a Dockerfile.
> I noticed you start to find every operating system quirk ever when you start writing Dockerfiles.
I've been using Dockerfiles extensively for years and I'm yet to find anything that fits the definition of an OS quirk.
The quirkiest thing I've noticed in Dockerfiles is the ADD vs COPY thing.
> I've run into so many strange things just converting a simple predictable shell script into a Dockerfile.
What exactly are you trying to do setting up a Dockerfile that requires a full blown shell script?
A Dockerfile should have little more beyond updating/installing system packages with a package manager, and copying files into the container image. First you run a build to get your artifacts ready for packaging, and afterwards you package those artifacts by running your Dockerfile.
I don't know what qualifies as a quirk, but I sometimes had to think about stuff like "who is PID 1?", "do I have the CAP_WHATEVER capability?", "do I need to 'reap' subprocesses?" and other stuff that 'just works' when you have a decent system with a decent init process and all the other things.
Why not run your existing predictable script in a single RUN command? What need to be converted?
The mistake many make is seeing Dockerfiles as a 1:1 mapping of a shell script with RUN prefixed on every line. It's not; you should only split a RUN if you have a good reason, like adding a new COPY in between for layer-caching reasons or switching user.
With the --mount options from BuildKit (e.g. type=cache, type=bind) you can ensure the apt cache does not get layered, and you can mount any big temporary files from the host instead of copying them first.
Also, each RUN, COPY, ADD, etc. creates a new layer in the image, so you should put related commands in the same RUN. The example in the blog post violates this principle, which is the main reason for the unexpected size.
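A sketch of the cache-mount idea mentioned above (requires BuildKit; on Debian/Ubuntu base images you may also need to drop the docker-clean apt config so the cache actually persists):

    # syntax=docker/dockerfile:1
    FROM ubuntu:21.10
    RUN rm -f /etc/apt/apt.conf.d/docker-clean
    RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
        --mount=type=cache,target=/var/lib/apt,sharing=locked \
        apt-get update && apt-get install -y --no-install-recommends ca-certificates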
I initially just wanted to replicate my production env locally; then, after that experience, I wanted to replicate my Docker env in production so that would also be predictable.
A very common mistake I see (though not related to image size per se) when running Node apps is to do CMD ["npm", "run", "start"]. This wastes memory, since npm runs as the parent process and forks node to run the main script. The bigger problem is that the npm process does not pass signals down to its child, so SIGINT and SIGTERM are not forwarded from npm to node, which means your server may not be gracefully closing connections.
To avoid any potential issue with signal propagation a good practice is to always use a lightweight init system such as dumb-init [1]. One could assume that the node process would register signal handlers for all possible signals, but I prefer to not have to make this assumption and use an init system instead.
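A sketch of that setup (server.js is a placeholder for your entrypoint script; dumb-init is installed from the distro repo here):

    FROM node:16-slim
    RUN apt-get update && apt-get install -y --no-install-recommends dumb-init \
     && rm -rf /var/lib/apt/lists/*
    WORKDIR /app
    COPY . .
    # run node directly (not "npm run start") under a minimal init, so
    # SIGTERM/SIGINT reach the server and children get reaped
    ENTRYPOINT ["dumb-init", "--"]
    CMD ["node", "server.js"]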
If I exec into a container that runs npm and run top you’ll see npm (parent) using res memory and the node process (child) itself using memory. I’m pretty sure the npm memory is just wasted.
Can echo this. A colleague’s node container was maxing out CPU and just removing PM2 and running node directly solved the problem. That was easier than debugging why PM2 was having such a hard time.
To be fair, it was a straight up conversion of an old VM in vagrant and Docker was looked at as a one to one replacement before learning otherwise.
pm2 is great when running on servers, but using pm2 in containers feels wrong and again wasteful. Just invoke your script. If it crashes, fine Kubernetes or Docker handles that. Logs, handled by k8s. Monitoring I use DataDog.
What's the point of pm2? Every time I've seen it it's just been part of a messy misconfigured system and whatever it's actually doing could've been accomplished entirely with a tiny systemd unit running node directly.
Funny, I had a very similar experience at a client of mine last month. They were using Apache Spark images and installing all kind of python libraries on top of them. The biggest contributors to image size were:
- miniconda (~2GB)
- a final RUN chown -R statement (~750MB)
We reduced the image size (and with it the Spark cluster's footprint) considerably by playing around with dependencies in order to stick with plain pip, and by using COPY --chown.
That only works for Java. Nearly all of the stuff I work on involves native executables as well as the need to setup the OS environment in the container (libraries, user accounts, etc)
Works for lots of examples where you are packaging a statically-ish linked thing into a container, could be node/golang/python/java etc. Def lots of other scenarios where it doesn’t work, but sometimes you can push that to a base image built differently. Keep the majority of things simple, less footguns.
(Edit: I realize most of those aren’t statically linked, better description might be “things copied straight into container, not installed”)
I think the OP is confusing the runtime and image format a bit here. At runtime OverlayFS can use metadata-only copy up to describe changed files, but the container image is still defined as a sequence of layers where each layer is a tar file. There's no special handling for metadata-only changes of a file from a parent layer. As the OCI image spec puts it [1]:
> Additions and Modifications are represented the same in the changeset tar archive.
I really thought I'd missed something with COPY --chmod, so I'm glad you mentioned it's new. COPY's preservation of flags simplified one of my projects, because I started setting flags on shell scripts in the repo instead of at build time.
Classic deploys with system-provided "insulation" (GNU/Linux cgroups (firejail/bubblewrap), FreeBSD Capsicum, etc.) reduce the size and the overhead far more...
What is the point of these technology inventions? The desire to hoard more bananas than the monkey can eat will make this planet uninhabitable one day.
My understanding (based on other comments in the thread - I'm no Docker internals expert) is that it's about the size of Docker image files, which contain a tarball (or similar) of the files that each layer adds or modifies on top of its base layer. There's no way for them to say "same as this other file, just with permissions changed". That has always seemed to me like a bad design decision on Docker's part, because there's a lot of room for deduplication within images that just cannot be done due to the format they chose. Why not have the layers reference individual files by content hash + metadata? If there are a lot of small, unusual files, you could just bundle them with the image, sort of like how Git packs objects together for efficiency, but still retains the identity of each.
You folks do all realize that almost all of this is trying to work around the fact that Ulrich Drepper is cramming dynamically-linked glibc up our uh, software stack, right?
Linus doesn’t break userland. A tarball is a deployment strategy if someone isn’t dicking with /usr/lib under you.
Everyone is talking about workarounds when this should be fixed in the file system. This is just dumb. Changing metadata shouldn't require the entire file to be copied lol.
> If you are wondering why a metadata update would make OverlayFS duplicate the entire file, it is for security reasons. You can enable “metadata only copy up”[0] feature which will only copy the metadata instead of the whole file.
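If you want to poke at the feature itself, a rough sketch (the paths are placeholders; whether your container runtime's overlay2 setup picks this up depends on the kernel module default):

    # check the overlay module's default for metadata-only copy-up
    cat /sys/module/overlay/parameters/metacopy

    # or pass the option explicitly on a manual overlayfs mount
    mount -t overlay overlay \
      -o lowerdir=/lower,upperdir=/upper,workdir=/work,metacopy=on \
      /merged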
[0] https://github.com/docker-slim/docker-slim
[1] https://github.com/GoogleContainerTools/distroless