FWIW, if you wanted to get the static from-scratch Go image a lot smaller, try running the binary through UPX. It typically shrinks compiled Go binaries by absurd amounts: https://upx.github.io/
Also, https://busybox.net is a 1.0MB static binary that provides all of the standard POSIX tools (sh, grep, tail, etc.) and actually comes with an httpd implementation. So if you wanted a more useful container at about 1MB, you could use it (see the sketch below).
Edit: here's a previous thread on UPX that has 80 comments talking about the tradeoffs it makes. You should definitely consider these if you intend to use UPX: https://news.ycombinator.com/item?id=15456980
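For the busybox route, a minimal sketch of what that could look like (the image tag, port, and paths are illustrative; -f/-p/-h are real busybox httpd flags):

FROM busybox:1.31
COPY ./index.html /www/index.html
# -f keeps httpd in the foreground, -p sets the port, -h sets the web root
CMD ["httpd", "-f", "-p", "8080", "-h", "/www"]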
Keep in mind that UPX needs to decompress the whole executable into memory and thus cannot make use of OS-level memory sharing. Normally, when you load the same executable twice, its base image only consumes memory once. But with UPX, you can't take advantage of that. So in the end, it may use more resources than you thought!
I was initially a bit skeptical that the page cache would work with all the layering Docker does. But OverlayFS, perhaps unsurprisingly, does indeed support it: [1] "OverlayFS supports page cache sharing. Multiple containers accessing the same file share a single page cache entry for that file. This makes the overlay and overlay2 drivers efficient with memory and a good option for high-density use cases such as PaaS." TIL
UPX creates anonymous pages that contain the decompressed payload, right? Those aren’t file-backed pages so how would the filesystem even know those pages exist?
That’s correct and I guess only page deduplication might help in that case. What I meant in my post was just to state that docker can use the page cache despite the massive indirection caused by overlays. I was wondering if that was even possible and it indeed is.
The base would consume memory "up to once": only the pages you actually touch get loaded, and they are all reclaimable under memory pressure. Memory used by UPX for decompression is guaranteed to be used, and is not reclaimable!
So in fact, it isn't that it might use more resources, it's guaranteed to use more resources.
The only benefit of UPX is saving disk space, which is unlikely to be super useful to most people these days.
The other use case is to distribute images more quickly over the network and improve image pull times, at the expense of slightly more memory use.
Similar to how saving disk is less useful to most people these days, RAM on servers in big Docker/Kubernetes clusters is also unlikely to be much of a concern for binaries well under 100MB, as long as you're not running a ton of copies on each server. In particular, this tradeoff seems most useful for static Go binaries, which don't benefit from shared system libraries and aren't likely to have many copies per server. Of course, that depends on the use case, and I definitely do not recommend blindly using UPX without considering the tradeoffs.
It would be amazing, though, if UPX could be integrated with Docker's layer compression such that the binary is extracted to the filesystem before running. But given how it all works, that's pretty challenging to implement, and the network is only getting faster...
Are you sure UPX actually beats the standard layer compression which is already provided?
After applying UPX, the result should be nearly incompressible, so if UPX performs worse than layer compression then it might be better for network usage to not use it at all.
In this scenario where you are unpacking it in advance, that eliminates the main benefit of UPX which is that it is self-extracting.
Definitely important to note. I wonder if UPX could solve this by exec'ing a memfd_create'd copy in some mode. I also wish UPX came with a trivial extraction tool so you could use it for distribution but not at runtime.
UPX does significantly more compression on binaries than gzip can on the layers. It's not just simple compression, it makes a lot of modifications to the binary to make it more compressible.
Perhaps this is what you were asking for in the GP, but it would be interesting then if UPX could apply those transformations without actually doing the final compression step itself. Perhaps the win isn't as great without the ability to smuggle specific hints to the compressor, but it would be interesting to see.
On the other hand, if it could be demonstrated that modest increases to binary size correlate with significantly improved compressibility, then maybe that's something that should just be considered by the compilers themselves.
Is UPX actually that much slower than extracting a gzipped docker layer? I didn't realize UPX decompression was much slower than other compression algorithms
Beware that at least historically some antivirus software would wrongly classify any UPX-packed binary as malware. Don’t know if this is still the case but something to keep in mind.
It added 18.5 kB on top of busybox, so, not as good as the 6kB ASM server code. With busybox included, the whole image is 1.18 MiB, or 751 kB compressed.
I figured there were a bunch of ways to shrink that one, and I had a C-based image that was in between my Go and ASM images, but decided to cut them for the sake of brevity!
It definitely is, although a cool thing about busybox is that you can rebuild it with any applets disabled. I'm not sure just how small it can get though.
But yeah, the tradeoff of that 1MB is that you get a very functional shell environment vs just an http server. It's impressive just how small it is, given everything that it is capable of.
> It definitely is, although a cool thing about busybox is that you can rebuild it with any applets disabled. I'm not sure just how small it can get though.
They do distribute statically linked individual binaries [0]. The one for just httpd and nothing else is about 85kB. But it is a lot more than just a static file server - it comes with CGI support.
> But yeah, the tradeoff of that 1MB is that you get a very functional shell environment vs just an http server. It's impressive just how small it is, given everything that it is capable of.
Maybe.. I suspect a well-designed Forth image could pack similar capability in less than 100kB. So it might seem low, but not a miracle.
Certainly, I think this is more about getting over the developer pain threshold. In terms of computer speed vs human perception, I don't see either one mattering too much (compared to a 900MB base for node-debian on Docker, although I guess there is some incremental stuff for that?).
If you're going for size, why not just sidecar the "standard POSIX-compliant tools", but only the ones you'd actually want? For example, bash instead of ash/dash. You don't need those tools all the time, right?
I mostly meant to sidecar the shell and treat it as a debugging utility. While I get that ash being smaller is a point in busybox's favor, if size stops being a concern because the shell is in an inactive sidecar, then you can use a shell that has fewer surprising edge cases for users, like bash.
One thing I'm wondering -- why are people obsessed with container size? A lot of enterprise is Java shops which deal with huge dependencies, huge distributions, and generally pretty heavy software -- with internet speeds what they are today, it's not hard to throw 100-500MB around internal networks (building the code closer to the deployment area is also a solution). Even gigabyte size containers are not impossible to deploy these days with the right caching (lord help you if you invalidate a bunch of layers though).
In a world of xTB thumbdrives, why are we so worried about containers being <1MB? Security is one thing, minimalism is another, but I do sometimes wonder about this.
Anyway, if you want to build minimal containers and want to do it with a bit more backing than this one (pretty well written) blog post, probably also check out distroless:
I personally find most of the time alpine-based containers are more than small enough and good enough with nice ergonomics (sometimes you just want to shell in to the container), but maybe it makes sense to build containers with and without debugging/inspection tools (a `project:vx.x.x` and a `project:vx.x.x-debug`) and switch them out when problems arise.
I'm working at a shop where we have to use extremely large images (10 - 40 GiB) on a platform of 60+ medium sized blades (4 - 12 cores, 32 - 256 GiB RAM, 200 - 1000 GiB SSDs). Startup time for running such huge images often is 5 - 10 minutes. They contain prepackaged test data so that they can be run in parallel against newly built versions of our software - there are about 30ish of these images.
You not only store these images once in the registry, but on all container hosts that they might run on. And you need to transmit them, if they aren't cached there, yet. And you may want to redeploy new images several times a day, as they get updated, new ones added, etc.
To be fair, the above is not the use case containers were originally intended for. And if your 10 GiB container image only runs on 3 or 5 nodes and only gets updated once a week, you obviously don't have to worry about the overhead too much.
But at scale (beyond 10 nodes) and/or on very fast development cycles (= more than one deployment per day), size IMHO starts to matter.
And the CI with many-GB Docker images is very painful... Turning on layer caching usually makes the process even slower as it needs to pull and unpack the previous image before starting, and if you turn it off you're downloading a ton of deps on every build.
If you separate the heavy stuff into a base-image you still have to load it on the beginning of CI, which without beefy machines with local SSD caching can take a loooong time.
I can totally see this starting to matter, but I do want to point out that it hasn't ground your business to a halt just yet and is very very unlikely to. From what I've seen most ops pain comes from a lack of automation when deploying -- as in if the process is 5-10 or even 15 minutes, no one really cares (outside of an emergency) if it's completely automated. Do you find that is true with your work as well?
I can certainly see that it can be a painful problem (10-40GiB is huge), but I expect that for most smaller shops that aren't running on their own servers/colo, image sizes like these were never even an option (I can't imagine trying to upload 40GB across the public internet to launch one instance!).
Because every manual step means things can go wrong. You might also need to look it up if you don't do it often. All these increase the overhead of manual steps compared to clock time.
For this particular use case: We previously used VMs and snapshots for that workload. The problems we encountered were:
- snapshots aren't really intended to be portable (we used ESXi and also KVM on LVM-backed volumes), hence we had to write and maintain tooling for a "snapshot repository", versioning for these, and distributing them to the target nodes - that did all work, but was even slower (startup times, which can include shipping and setting up the snapshot, were 15 - 60 minutes)
- a VM will duplicate all of the services and the kernel, so all of that has to be started as well, whereas the containers only start the services under test.
- Using VMs makes it much more difficult for our developers and QA to retrieve a particular image and replicate a failing test on a given version of the tested software locally, especially considering the wide range of client OSs we see in our not-that-large group (macOS, various Linux distros, even Windows)
A general observation on trade-offs: with container layering and COW you get fast development, but you have to pay the performance bill when you download and apply an image for the first time. Similarly, taking a snapshot of an LV under LVM or of an ESXi VM is fast, but applying the snapshot is slow.
We had therefore at one point considered using Ceph RBD and their COW clone snapshots[0]. It would let us do "cheap" restores of snapshots. Our initial tests showed that the network bandwidth requirements[1] would have needed some serious infrastructure re-engineering in order to keep up with our I/O expectations. And again, the containers slot in nicely with commonly available local resources and allow working offline, to a degree.
What if you had a way to use Docker to create your VM images? Then you'd have no registry, be able to mount the VM storage over the network (on-demand fetching), and boot quickly.
I don't know about you, but I work on a laptop, I work on multiple projects, and each project might use dozens of containers for builds, testing, releases, etc. If every container were a gigabyte in size, the majority of the laptop's internal storage would be wasted on them, never mind the time wasted pushing and pulling updates.
I do too (my laptop is pretty beefy though), and yes you're right this would be a problem, if containers weren't built to combat this with Copy on Write filesystems.
That said, dependencies do take up a lot of space, but machines these days also come with a lot of space. People also run prune and develop caching strategies (ex. NPM) as a result. It's not like every container will be a GB in size, as it's not reasonable to have every container have that much content, but how many people are out there using 950MB dependency-filled base containers with 50MB of actual relevant application code? Most people learned the run-it-on-alpine lesson like 5+ years ago at this point.
My new laptop has 3T of disk. I have the old feeling of what can I possibly do with it all. I have two copies of all my photos and text files etc. and dual boot Arch and Ubuntu but still it is so Empty. I don’t think even a very aggressive go get campaign can fill it.
Do you run `docker system prune`/`docker image prune` and related commands from time to time? I assume you already know about it but just in case.
Just recently I found that I had 100s of GBs used up on my hard drive because I'd installed and used podman for a while, and its image store was separate and just... sitting there, long after I'd moved back to docker for reasons.
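If anyone wants the quick version of finding and reclaiming that space (these are standard Docker CLI commands):

# see how much space images, containers, volumes and build cache are using
docker system df
# remove stopped containers, unused networks, dangling images and build cache
docker system prune
# also remove any image not referenced by a container, not just dangling ones
docker image prune -a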
A lot of companies that sell products (non-open-source ones anyway) actually try to bloat the binaries and deployments to make them look more significant. Multi-gigabyte installers are not that weird for some products, where I found (for fun, as corp-wise we cannot really care) the relevant software we are paying for to be a few MB, and it doesn't need or even touch the rest of the package in 99% of cases. But it seems/feels like you are paying those $100,000s for some serious, big software.
When software was still delivered in physical boxes, I noticed the same thing and I even found products in the past where the .exe had megabytes of random data attached to it to make it look bigger.
With Docker images it's all a lot easier to bloat things up and 'hide things'. I used to pack up our main product in a chroot tgz (this was the early 2000s) with an installer, so I did not have to actually think about different Linux distros or installers. I know for a fact that some people bought our product over competitors because the installer was so large that it looked like better value - people told me this when I asked them afterwards why they did not go for the competition. And this was not intentional; it just had a full Debian install in a chroot.
This was something I never considered, but reminds me of the Larry Ellison tactic of never naming anything v1, or the concept that because something costs the most, it therefore must be the best.
Well, Oracle was (not sure if they still are) one of these companies. Some of their products we used in the past were artificially bloated and spread over multiple CDs etc. to make it all look bigger.
Particularly the flinging of $100 bills at peasants and how RDBMS vendors set pricing for their products. If you have an implicit awareness of Oracle and their tactics, you will find this an amusing piece of nostalgia from when people were starting to understand just how to build efficient and nice web sites in the 1997 time frame.
I loved Greenspun. So funny but so clear on how to make stuff work efficiently. The first threaded program I wrote professionally was an extension to AOLserver to let the Tcl layer speak the AOL FLAP protocol back to the backend. Got it working in a few days but couldn't get three simultaneous threads working for like a month. My boss at the time was like (once I finished it) "don't worry, software is late." My first project in the no-threads server framework took six hours start to finish.
At scale every tiny bit helps. Especially now that Edge computing is becoming a reality, it's much better to pull and run a 1MB container image than a 100MB one on limited hardware and bandwidth
Agreed -- edge computing is a different world, and the difference between 100kb-1mb-10mb-100mb is huge. What I was discussing is definitely not a fit at that scale/use-case -- I'd argue that most infrastructure using containers (and most people who see this blog post) is not built for this scale/use-case, and I'm not sure containers even make sense in the edge computing paradigm just yet.
Plus, figuring out optimal sizing for Docker images is in general complicated and often counterintuitive. A company I worked with mandated that every image must be written from scratch to reduce size on disk and save on transfer bandwidth/time. The end result though was that there were no sharable layers, and so every server had to redownload every image wherever there was the tiniest update. Moving to a set of common 1GB+ base images with all dependencies included made the entire system smaller and faster.
oh god I fell in love with everything related to Alpine Linux. Found out about it when I started playing around with docker but then ported all my home servers and Raspberry Pis to Alpine. It's so damn small and has little overhead. Need docker? `apk add docker`, done!
I've written guides on how to boot Alpine (even with GUI) over PXE [1] and how to set up a perfect file server that runs from a RAMdisk USB Thumb drive with full disk encryption [2]
You seem to have jumped the gun a bit. The question was "why are people obsessed with container size?" while you're answering some question like "What's a good and small container OS?"
Yes, kind of, but also not really -- this answer was delightful to read because Alpine Linux literally was the first answer to the first concerns with container size. Before Docker multi-stage builds existed (people were doing the builder pattern ad hoc before that anyway), the usual way to get drastic reductions in container size was to run your container on Alpine Linux.
Base Ubuntu/Fedora used to be absolutely huge, and no one was smart/patient enough to pick out all the dynamic libs that you needed with your JAR or script to go into production so you usually just shipped the whole fat image.
Yes yes, but you also seem to be missing the point. Here you are answering the question "How did Alpine appear and what came before it?" but again missing the original question which was "why are people obsessed with container size?"
https://blog.haschek.at/2020/the-perfect-file-server.html is made obsolete by SmartOS, which does the exact same thing, in addition to offering Triton (enterprise web GUI for virtual systems), OpenZFS (end-to-end data protection, ease of administration), zones (full-blown, running-at-the-speed-of-metal virtual UNIX servers), DTrace (for production-safe, real-time, deep machine state instrumentation and inspection), Bardiche (for building firewalls), FireEngine / Crossbow (high-performance TCP/IP stack for building virtual switches and routers) and imgadm(1M) / vmadm(1M) (for virtual server provisioning and software management), and last but not least, a reference implementation of the NFS V4 protocol.
or by proxmox or by any other OS. That's not what it's about. If it comes with a web gui, it's already bloated for me. This project was about the minimal perfect setup for my needs and I only need SSH access
SmartOS does not come with a web GUI; Triton is completely optional and isn't required at all, but if one is running 100,000 servers or more, it's there as a gratis option if one needs one.
As far as I am aware, and please correct me if I am wrong, Proxmox does not have nearly the same list of capabilities as SmartOS, which does all of that while running from a ~650 MB, read-only RAMDisk.
It's too bad that in the path to minimal disk size they decided to simply disregard legal obligations and they ended up producing something that violates the open source licenses of the contents... so unusable for any product.
Alpine is one of the unsung heroes of the container world. It's insane how much value Natanael Copa has created & shepherded over the years (decades?). Recently came upon an interview with him and it was the first time I saw the creator behind Alpine Linux[0]. Similarly, the musl libc project, just chugging along, giving most projects that build on top of it an easy out for portable static binaries.
Thank you for those links, I am about to gobble up those blog posts -- I recently went on a benchmarking kick[1] and diskless alpine instantly struck me as the perfect server setup. ECC memory + running from ram would give me full use (to put in RAID/whatever else) of the NVMe drives, it's something I'm going to try out as soon as I get a chance to.
I am so interested in the infrastructure space. I know of only a handful of hosting providers that will give me PXE-level access (so I could use something like tinkerbell[2]):
- OVH [3][4]
- Vultr[5]
- LeaseWeb[6]
Unfortunately my personal favorite hosting provider, Hetzner[7] (I fell in love the moment I came across the robot marketplace) does not offer it yet, though I've automated going through their rescue system at this point so it's OK.
I am working in the Manufacturing Execution Systems area and the internal sites are spread around the globe on all continents; speed does matter. There are dozens of places hundreds of kilometers away from anywhere with Gigabit Internet (for WAN). Taking backups off-site is faster by physically shipping hard drives than by copying over the WAN, so I don't like large software.
In one place in Africa, driving to another site 3 hours away gets you network speeds 10x better, so we do this from time to time.
Container size generally impacts startup/deploy times. A 500MB container still takes almost a second to copy over a gigabit connection, and typically you need to copy it at least twice: from build server to repository, and from repository to wherever it's going to run.
A Gigabit Ethernet connection means 1Gbps minus some overhead, so the effective rate is around 117-120 MB/sec and the 500 MB container takes at least 4 seconds to copy. 10 Gbps gets you to less than a second, but that's usually only available between servers, not from regular laptops.
I wonder the same thing. I can see the value in shrinking a 1GB container by 90%, but I suspect there are diminishing returns by the time you're shrinking a 5MB container below 1MB. In most organisations, there are probably bigger opportunities for improving security, performance, etc.
It's worth remembering the MBs add up in multiple places. Sure, it may not be worth for 5MB. But once you're in tens/hundreds, they add up in storage (how many old versions do you keep), transfer times (how quickly can a fresh instance pull all tasks and how quickly can a developer (at home rather than in us-east-1) start with a project), transfer costs (how many times does a CI node need to download/upload the last layer).
Saving 10% in size may actually translate to a non-trivial cost saving - both obvious and cost-as-in-paid-waiting-time.
Also, all those dependencies are security risks. It's a lot quicker and easier to scan a container with just one Go executable than even the most benign distribution of core utils and so on. And if your containers are just one binary and you can read the logs, then you don't need to go into the container any more than you need to go into your binary. Then your k8s node or other container-runner host is just like your old host running a bunch of processes, but with much better process isolation and filesystem isolation.
There are some teams where this will be the best use of their time, and I'd be well out of my depth suggesting to them otherwise.
But the vast majority of teams in my experience have spent ages farting around with these kinds of micro-optimisations, whilst there were far better opportunities to improve the infrastructure. And in the process, they actually made it more fragile and harder to debug.
Simply put, this is a niche skill with niche value. Most people trying to do this probably can't and shouldn't do it.
> building the code closer to the deployment area is also a solution
This is exactly what we do. Once the first build of our application is installed on our customer's servers, it rebuilds itself from source each time. This is also really good at firewalling CI/CD compromises away from your customers, as long as you don't merge cryptolockers/miners into your private github repos.
I found that most people focus on size and forget about reuse.
Sure, each layer adds size to your final image, but you can reuse them, so that you don't need to transfer the complete image but just the layers on top of the one that changed.
Flatten everything and you need to ship the whole thing every time.
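A hedged sketch of what that reuse looks like in practice (base image, package, and binary names are illustrative): put the heavy, rarely-changing layers first so that a typical code change only re-ships the small top layer.

FROM debian:buster-slim
# heavy dependency layer: shared between releases, pulled once per host
RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates && rm -rf /var/lib/apt/lists/*
# small app layer: the only layer that changes (and gets transferred) on a normal release
COPY ./myapp /usr/local/bin/myapp
CMD ["myapp"]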
I love blog posts like these: they don't go after real-world problems directly but rather explore what's even possible.
paulfurtado in this comment section mentioned UPX[0] as a way to get binaries even smaller. Funnily enough UPX even managed to decrease the size of the asmttpd web server for me:
### build stage ###
FROM ubuntu:18.04 as builder
RUN apt update
RUN apt install -y make yasm as31 nasm binutils git
RUN git clone https://github.com/nemasu/asmttpd.git asmttpd
RUN mv asmttpd/* .
RUN rm -rd asmttpd
RUN make release
# i have the upx binary on my machine in this case
COPY ./upx-3.96-amd64_linux/upx upx
RUN ./upx --brute asmttpd
### run stage ###
FROM scratch
COPY --from=builder /asmttpd /asmttpd
COPY ./index.html /web_root/index.html
CMD ["/asmttpd", "/web_root", "8080"]
With a 4 byte index.html my total image size is now 5.3kB!
Might it be possible to extend asmttpd to serve gzipped files directly (possibly breaking compliance with RFCs by forcing compression)? I.e. send index.html.gz?
To provide a real-life use-case for such a tiny container image: With Kubernetes one may want to run the service container as a random or at least non-root user id and store data in an attached volume. That volume will usually be attached as owned by root. So often an init-container is used, which runs once, before the service container is started, to change the permissions.
Here is an image with a statically linked chown binary from the busybox tools covering this use case: https://hub.docker.com/r/privatebin/chown (I am the author of that image)
Benefits of using a small image are less attack surface and faster operation (less data to download, lower startup time, less memory, etc.).
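For anyone who hasn't seen the pattern, a minimal sketch of such an init-container (names, uid, and paths are illustrative, and it uses a stock busybox image for the chown; the linked image plays the same role with a far smaller footprint):

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  securityContext:
    runAsUser: 65534              # run the service container as a non-root uid
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: example-data   # the attached volume, initially owned by root
  initContainers:
    - name: fix-permissions
      image: busybox
      command: ["chown", "-R", "65534:65534", "/data"]
      securityContext:
        runAsUser: 0              # the init container needs root to chown
      volumeMounts:
        - name: data
          mountPath: /data
  containers:
    - name: service
      image: example/service:latest   # illustrative
      volumeMounts:
        - name: data
          mountPath: /data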
My (somewhat silly) use case for the tiny container was trying to cram 10,000 pods onto a k8s cluster without breaking the bank: https://www.youtube.com/watch?v=1y2nRNexRVk
And don't forget to create the right resource and namespace constraints so that the one binary doesn't gobble up memory endlessly, use up the filesystem, or put your infrastructure completely at risk if/when it's compromised -- whether you do that with systemd units, raw cgroup/namespace finagling, or regular tried-and-true Linux user-based resource segregation.
Containers are not a security measure. They're acceptable for some resource constraint considerations, but you should not ever use them because of security concerns. If that's something to worry about, use VMs at the very least.
I'm going to have to disagree with that statement. Containers (as implemented by Docker/containerD/CRI-O) do add security controls. Specifically, separate namespaces, restrictions on capabilities, seccomp filters and apparmor profiles.
If you can easily break out of any containerized environment, I'd suggest you register for bug bounty programmes and make quite a lot of money. You can also get a reward from Jessie Frazelle by escaping from https://contained.af/
Obviously containers have a larger attack surface than say a VM hypervisor, but security is not an absolute and both containers and VMs have suffered from breakout issues in the past. No security measure in isolation provides perfect protection.
> Containers are not a security measure. They're acceptable for some resource constraint considerations, but you should not ever use them because of security concerns. If that's something to worry about, use VMs at the very least.
This gets repeated really commonly, but it borders on wrong more and more every year. At the lowest level it's not even completely correct: containers are resource + namespace isolation -- cgroups and namespaces -- and isolation plus kernel-assisted resource hiding/access control is absolutely a facet of good defence-in-depth. Enhanced security is not a core feature of containerization as a whole, but it can absolutely help with your security posture.
And to whatever degree that statement is wrong, it's even more so because it fails to consider the container runtime. Container runtimes these days (containerd is the best one IMO) include ways to run containers at different levels of isolation (as VMs and as micro-kernels), so a move to containers can absolutely help security posture, because now you don't have to pull out packer/ansible/etc to build your VM; you can just take the same container you were running and change the runtime. Time to ship/launch improved security features is important, and containers can reduce that burden for teams.
Containers are a process isolation and resource control tool -- they can absolutely aid in security posture. Would you say the same about BSD Jails?
Just so there's some more value here, here are some projects that are somewhat close to cutting edge (though it's been a while) in this space that I like:
I’ll double down on what the GP said: containers provide less security than people imagine, and contrary to your comments Linux control groups provide no meaningful resource isolation.
> containers provide less security than people imagine
Well sure, this heavily depends on the people you're talking to and what their imaginations are like. If it's the "containers are lightweight VMs" crowd then sure -- that view is fundamentally wrong.
> Linux control groups provide no meaningful resource isolation.
So this is a pretty bold claim -- as far as I can see, processes certainly get OOM-killed when they use more resources than their cgroup allows.
Do you have a link you could share? Is some fundamental part of cgroups (v1? v2?) broken in some way I haven't heard of until now, such that everyone has patched around it and it does what it says on the tin despite, not because of, the code in the kernel?
The incompressible resources, including memory, have the best isolation story, but it is by no means perfect. A process in a given control group can easily cause external memory consumption that either escapes accounting altogether, or is charged to another control group. An example is the way kernel slabs are assigned to control groups. There were some patches in late 2020 from Facebook to help fix this, but I am not sure if they are merged and released yet. Those patches are also just better, not perfect.
Another example is a control group that is chronically out of memory. It may page vigorously, which will have external effects on other control groups through unaccounted resources like nvme controller time, memory fragmentation, global vmscan, etc.
A third popular way to abuse the resources of other containers is through networking. The way Linux handles network traffic is frankly hostile to proportional resource sharing. A control group with zero CPU quota can still easily cause external CPU time consumption.
Thanks for explaining, while I still think what you said was a bit hyperbolic, I definitely didn't know about these... externalities, I guess you could call them? Learned something today.
I think of control groups as being adequate for what their inventor intended: approximate sharing of resources between non-antagonistic, first-party processes. Outside of that zone they can't be relied on for much of anything.
GP is noting the security limitations of Linux containers, the mechanism, not some management daemon like containerd or some API or workflow. firecracker-containerd is not a container, per-se, as stated in their README.
> Like traditional containers, Firecracker microVMs offer fast start-up and shut-down and minimal overhead. Unlike traditional containers, however, they can provide an additional layer of isolation via the KVM hypervisor.
I think the security mechanism of note there is the hypervisor -- a VM technology. The fact that it is manageable via a daemon/API that was created for containers, does not mean it provides security via Linux containers, the mechanism.
A similar story can be told regarding capabilities, a mechanism that is orthogonal to containers -- all combinations of with and without capabilities and containers have their uses.
I do think GP is overly glib and assumes a particular threat model in order to dismiss Linux containers so thoroughly, and you are right to note "they can absolutely aid in security posture", etc. But GP is correct to note that VMs (though also imperfect) provide certain kinds of security/isolation that Linux containers, the mechanism, cannot yet match.
I don't understand why this isn't being done more. systemd has pretty much every convenience Docker has these days, without any of the ridiculous overhead and cognitive load that comes with managing a Docker daemon.
Yes, you'd need to learn how to make a systemd unit file but honestly, that shouldn't be a problem at all. I've had much worse headaches fighting Docker's networks and firewall-overruling network configuration in the past. Every time I forget to specify a restart option, I have to dig through my shell history again to see how I launched the image so I can kill and recreate it, or I have to look up that oneliner that shows you the docker run command for a running container.
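(For the restart-option case specifically there is a small escape hatch, assuming a reasonably recent Docker: the restart policy can be changed on an existing container without recreating it.)

docker update --restart unless-stopped <container-name>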
I run stuff in Docker for one simple reason: I have it installed and I'm too lazy to think about software sometimes. If you're packaging software, that laziness isn't a reason to pick a distribution tool.
Static binaries have other problems, perhaps most importantly package management. How do you distribute your app? Traditional deb/rpm packages? curl-to-bash installers? Snap? Flatpak? Binaries that people flog into /usr/bin? What about automated updates, do you set up a repository, do you include a self-updater in your code? The list goes on. Docker solves that by having one general source of packages with the option of adding a repo of your own (take note, Canonical, your shitty Snap Store is practically useless without that last bit).
For services I run on servers, I much prefer traditional packages with dynamically linked executables, for a simple reason: automatically fixing security issues and bugs across applications with a single update command, instead of having to wait for every maintainer of their statically-linked tool to update their dependencies.
It's one of the big problems I have with Rust; many dependencies and applications embed their own, sometimes very specific, versions of dependencies, and when there eventually is a massive security problem in one of the TLS packages, I'm dependent on recompiles and code modifications from random open source maintainers.
Here, we use docker-compose to automate container creation.
And we have just one Makefile to automate docker-compose (to fetch credentials from secrets storage).
Our images are made from Go binaries, so thanks to Go modules, we can manage dependencies centrally from the main module and rebuild.
Yep, and you'd generally expect to spend time understanding the complexity that systemd brings, the subsystems that power it, and how you can separate resources there! There is no free lunch.
Restated, my point here is that while containers can be more complex and have pitfalls (a lot of which have been worked out somewhat at this point), there is no complexity free lunch -- `[docker|podman|crictl] run --rm --cpus 2 --memory 500mb ...` is pretty darn easy, and more so than writing properly portable and well-considered systemd unit files (and putting them in the right place, with the right permissions, under the right slice, etc). It's easier than most of the options out there (including the old methods of per-user resource segregation).
I'm not about to argue against systemd, it's great software, and it's in every distro right now for a reason. Understanding apt (+/- how to package for it), systemctl, and the options you've laid out in your unit file are not trivial, and I would argue that they are less trivial (or harder) than understanding what's happening with containers, especially if you're running rootless containers, and/or using a container tool like podman which does without the daemon.
--cpus 2 --memory 500M
is easier, and gets you the same results, though they may not be as permanent or as well managed -- the management and external stuff is an orthogonal concern, and that's not the situation I was addressing. The original point insinuated that you could just throw up a binary and get it to run. Your filesystem is also not available to the container by default, and in this way Docker sort of fails closed. If you're running a rootless container, the story is even better.
One thing you have not covered is filesystem isolation, which docker also does very easily. There is a lot to configure on the systemd side[0] and the parts that are overlapping are just easier to configure and run with docker. Systemd is the better tool to build repeatable installs for pet processes, but again, there is a lot of knowledge underneath that is related. People to this day still complain that systemd does too much (I personally like it a lot, and it's great to have everything in one place).
[EDIT] Just to make myself clear, systemd is an amazing tool -- I like it, I run it, I'm not smart enough to administer a more complicated setup -- but docker is easier, for a large part of the small subset of systemd's capabilities that docker covers.
But there is a lot of work to be done before you can do the simple apt install. I (gladly) don't know how it is nowadays, but before Dockerfiles/Docker, creating your own packages according to the various standards was a PITA. Most companies needed a 'packaging specialist/release engineer' role, as most developers were not up to the task. Solutions like FPM[0] did help somewhat, but it was still hard when dealing with non-homogeneous environments. Containers solved that problem universally for all distributions.
Appending those options to the argument vector doesn't liberate their definer from "understanding [...] the subsystems that power it", or from reflecting on their potential impact. And I'd argue that THIS is the real complexity incurred by these kinds of resource constraints that seem so simple on the surface - not the specific syntax or location you have to use to introduce them. All of which makes systemd unit files and their settings' implications (which are amazingly well-documented btw) as good as any other option, imho.
Systemd is great software -- it is useful, and powerful. A systemd unit file, the idea of what a unit is, when they run, how they run, what shells they use, what permissions they run under, and lots of other complexities are expressed as options in the unit file, configuration files on the system, and in other places.
If I want to run a useful piece of software like redis let's say, but I want to run it with a resource constraint to make sure that it doesn't take more than 2CPUs and 500MB of memory, it is far easier to do that with the following command line:
docker run --rm --cpus 2 --memory 500m -p 6379:6379 redis
Than to write the equivalent systemd unit file, set up the isolated filesystems that docker would let you easily bind mount in, etc. This is like comparing systemd-nspawn to systemd -- if systemd-nspawn isn't simpler than systemd then what are we even doing.
Docker won because of its developer ergonomics (containers weren't new), systemd won because of its feature set, convenience and sturdiness. They're different tools with different primary use-cases.
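For comparison, a hedged sketch of what the resource-limited Redis service could look like as a unit file (the directive names are real systemd ones; the binary path, user, and values are illustrative, and MemoryMax is the cgroup-v2 name - older setups use MemoryLimit):

[Unit]
Description=Redis with resource limits (sketch)
After=network.target

[Service]
ExecStart=/usr/bin/redis-server /etc/redis/redis.conf
# roughly the equivalent of --cpus 2 / --memory 500m
CPUQuota=200%
MemoryMax=500M
# a slice of the filesystem isolation docker gives you by default
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
StateDirectory=redis
User=redis
Restart=on-failure

[Install]
WantedBy=multi-user.target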
Then you will want to manage the configuration file and to browse logs, and you'll have to mount files into your container. And your deployment starts to reimplement (the subset you need of) what distros did. System integration complexity didn't disappear thanks to containers; it was just moved to another place.
I'd add that being convenient for developers doesn't mean it's convenient for hosting management. In the end, when one side's convenience is prioritized, it often translates to complexity and pain on the other side.
In some companies, K8s has become more of a mandatory runtime requirement than a hosting commodity/facility.
As a (former) developer I'd rather run Redis in a container during development, but for production I'd rather rely on boring VMs unless some scaling is required (the managed k8s case set apart).
> Then you will want to manage the configuration file and to browse logs, and you'll have to mount files into your container. And your deployment starts to reimplement (the subset you need of) what distros did. System integration complexity didn't disappear thanks to containers.
Agree, but my view on this is that the implicit answer of how you do all that (i.e. the file system, syslog) is now gone. There will be pain (complexity) in the short term, but at the end of the day, we're going to be able to build much better orchestration and systems. To restate that: before, you had to worry about where a process wrote its output (stdout? /var/log/<program>? /etc/<program>/logs? /home/<user>/<program>/logs? syslog?); now, if you want the non-stdout/stderr logs of the thing you're running, you'd better give it a volume to write to (which may be fake, and actually write everything to some remote storage or something), and I think that's a step forward.
Of course, I'm not saying containers should go everywhere -- relying on boring VMs over containers is fine too -- but I think the rich world of functionality available to container-driven workflows is popular for good and bad reasons, and the good reasons are worth exploring/beneficial to me.
Can anyone explain why you might need to limit the memory consumption of your service?
I thought the kernel has virtual memory, and if you consume more it will swap some memory and that's it.
Couldn't you just manage memory from inside your app? If it's Redis, then check the db size and shut down/clean up gracefully, and not crash Redis with an OOM?
> Can anyone explain why you might need to limit the memory consumption of your service?
> I thought the kernel has virtual memory, and if you consume more it will swap some memory and that's it.
Well, just to make sure the right memory goes to the right places -- if someone uploads a large file and you've made a mistake in your code that tries to hold it all in memory instead of buffering it straight to disk, for example, you'd want that process to crash, not your machine.
Also, you generally don't want to swap, so much so that Kubernetes disables it immediately[0]. Not that Google is the only group with the right answer, but they seem to think nothing good can come of a machine having to swap. Maybe they're right. Even if they're not, a world where one service swaps[1] (I've never tried this with docker though) is probably better than one where it uses all the memory and everything swaps.
> Couldn't you just manage memory from inside your app? If it's Redis, then check the db size and shut down/clean up gracefully, and not crash Redis with an OOM?
You'd be surprised -- some languages just don't have a way to easily get feedback from GC[2]. It's also something I don't think most people think about; messing with the -Xmx/-Xms settings in Java is definitely year-2/3/4 Java development for most people.
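To make the limit itself concrete, a hedged example of capping a container's memory and effectively disabling swap for it (the flags are standard Docker flags; image name and values are illustrative):

# hard cap of 512MB; setting --memory-swap to the same value means no swap for this container
docker run --rm --memory 512m --memory-swap 512m example/java-app
# for JVM workloads you'd typically also set the heap below the container limit, e.g. -Xmx384m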
Wow, given how expensive RAM is, it is no surprise they will never allow Kubernetes to work with swap, because it directly translates to $$$ for GCP and other cloud providers.
Also, since everyone loves using Java/Spring/Node and other memory-hungry frameworks, it prints enormous $ for cloud providers to require users to overallocate RAM and disable swap.
Or am I just spitballing a conspiracy theory here, and there is no conflict between decisions like these and vendors' revenue streams?
For years, nay decades, the standard for data centers has been no swapping. An incident caused by a machine going slow due to swapping is almost as embarrassing as one caused by a disk filling up. Better to orchestrate a restart every night for your leaky code.
I think it is just different thinking: the Dev mindset, where software is a pet, where I want full control of my memory, take care of it carefully, and don't want anyone shutting down my service violently.
Contrast that with the Ops mentality of software as cattle (here's your memory quota, and if it OOMs, just kill/restart the service and hope the next run won't).
Didn't doubt your experience (I read and enjoy your comments all the time on HN), just pointed out that the feature set offered by single static binary (which arguably is not easy to build depending on what language you're using, etc) is not the same as what's offered by linux containerization in the current day.
Half the "static" binaries that used to float around weren't even fully static because of glibc, until building with musl (and getting bit by getaddrinfo) became widespread -- the situation is not as simple as you made it seem.
Agreed -- containers are certainly being overused these days, I find a big indicator of when people know what they're talking about is when they mention/realize that docker is containerd these days (docker is simply a shim over containerd which uses runc underneath), but it is absolutely still very buzzword-y.
I think we've basically stumbled into a very effective and widespread packaging paradigm though -- now you don't even have to pick the right language/toolset to get static binaries easily -- just throw a container over the wall and your filesystem requirements (mounts), network requirements, etc will be made pretty obvious to the person doing the deployment
Assuming that source and target hardware are the same, otherwise VMs enter into the picture, and stuff like Kata Containers add even more layers.
Then I remember Linus's point of view about monolithic kernels and how they are supposed to beat micro-kernels -- what for, when people put hundreds of virtualization layers on top?
Well, as Docker even abstracts VMs (you can build and run a Linux/amd64 container from macOS/amd64, and soon macOS/arm64 as well), the target OS and hardware become more and more just a deployment detail.
yeah, but that bit of complexity wasn't mentioned :). If every process you need to run is in its own per-instance VM, then you've got best-in-class isolation already, no need to bother with containers, and you get your resource and namespace isolation. Not quite the same as just throwing a static binary over the wall though; a little bit more complexity there.
Which is probably unnecessary if you've got the deployment platform to yourself and can decide that just-run-this-binary is suitable, safe, and manageable. Personally I prefer "just install this .deb" because it wraps up dependency management along with everything else, but that's not been a popular opinion for a long time.
Maybe because creating custom deb packages can be a hair-pulling experience?
But this is our flow too. We have a Jenkins job that builds our master branch and uploads deb packages to JFrog, another to create a GCP image with these packages, and a third to actually deploy these images to GCE. Containers are used for the various Jenkins jobs, but not on the actual GCE instances.
It depends which tooling layer you're using, really. If you're just saying "bundle up these files and put them on that part of the filesystem" it can be extremely simple. If you're trying to provide the full build-from-source chain... less so. Being able to ignore the Debian Packaging Guidelines is the path to sanity.
In a previous incarnation I wrote https://github.com/regularfry/au to take the grunt-work out of packaging up ruby apps, and pretty much all it does is generate the minimum set of files that `dpkg-deb` needs to spit out an archive.
Reminder: the Debian Packaging Guidelines are meant for official Debian packages.
Making a package with dpkg-deb -b <dir> <packagename> is very easy and gives all the features of APT (dependency tracking, config management, atomic deployment).
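A hedged sketch of just how little that takes (package name, paths, and control fields are illustrative):

mkdir -p myapp_1.0.0/DEBIAN myapp_1.0.0/usr/local/bin
cp ./myapp myapp_1.0.0/usr/local/bin/
cat > myapp_1.0.0/DEBIAN/control <<'EOF'
Package: myapp
Version: 1.0.0
Architecture: amd64
Maintainer: You <you@example.com>
Description: My application
EOF
# build the .deb from the directory tree
dpkg-deb -b myapp_1.0.0 myapp_1.0.0_amd64.deb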
Yep. If you're packaging for Debian, the packaging guidelines make sense. If you're packaging something nobody outside your org is ever going to see, you can be a lot more flexible.
Or just do the work like a professional, link dynamically and build OS packages for the operating systems which you have qualified and can guarantee that your software will work on. And take pride in the level of quality you are capable of delivering. Few in this day and age are capable of writing software at such a high level of quality as to be able to make guarantees that it will JustWork(SM); creating operating system packages out of one's work makes it possible to deliver software like a professional, perform integration testing and provide guarantees to the consumers, while minimizing the load on the users of said software.
I tried using Nix to build the Node version ("pkgs.dockerTools.buildImage"). That brought the image size down from 943M to 205M. Using a musl64 "crossSystem" (which required rebuilding everything) brought it down to 188M.
Interesting discussion. Just a comment about the size of the Go binary: the post's example uses the standard fmt library, and fmt by itself adds several megabytes to the compiled binary. Removing it would reduce the binary size.
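A quick, hedged way to check that claim for yourself (file names are illustrative; println is the built-in that avoids pulling in fmt):

cat > with_fmt.go <<'EOF'
package main

import "fmt"

func main() { fmt.Println("hi") }
EOF
cat > without_fmt.go <<'EOF'
package main

func main() { println("hi") }
EOF
go build -o with_fmt with_fmt.go
go build -o without_fmt without_fmt.go
# compare the two binary sizes
ls -lh with_fmt without_fmt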
If you imagine a container as a tarball, then scratch is an empty tarball. Each time you copy a file or set of files into the container during the build, you get a new tarball or layer. Some base containers provided by OS vendors, e.g. CentOS, Debian, are a single layer with the entire OS filesystem inside including tooling, libraries etc.
Scratch has no tooling or libraries or anything. It's empty. You typically add a binary into it and then just run that binary, but due to the lack of libraries, it must be statically compiled. So in answer to your question, you can run statically compiled C. You can't run Java due to the lack of a JVM, which I believe has its own set of filesystem dependencies, but I'm not a Java dev, so I'm not entirely sure.
Most distros force you to dynamically link every dependency if you want them to package it. So the default build for most projects is dynamic
You've stumbled on a holy war between distros and guys like you, me, and Linus Torvalds[1] that want to deploy a binary and just have it work everywhere.
Sure this is understandable when we are building an OS.
But here we are using Docker so we have full control over the application we are building. Why this craziness of having these huge images containing who knows what?
Is it just because we can? And the cloud providers like us when we do it?
It mostly happened because we carried over the previous assumptions, practices and limitations when moving into containers.
I agree with you and the parent commenter, this should be the default, but some people are against static linking, even in cases where dynamic linking provides no advantages.
There are downsides to statically linking, particularly around what to do when vulnerabilities are found in your dependencies. If you use Debian, for example, you can scan packages to detect versions with known vulnerabilities and rebuild the container to upgrade them; if it's a static binary, that moves the detection into scanning your dependencies. In modern GitHub (other VCS hosts are available with similar features) this is relatively easy assuming you depend exclusively on things in your language of choice. Outside of that it gets harder and more awkward. That said, using a distro's image also means you may have a bunch of false positives in your scans, though it's a matter of taste imo.
Go binaries that are statically linked work great in scratch images. If you or your dependencies start dynamically linking against libraries then they don't so much, but that's relatively unusual in my experience.
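A hedged sketch of that usual pattern (Go version, file and binary names are illustrative):

### build stage ###
FROM golang:1.16 as builder
WORKDIR /src
COPY . .
# CGO_ENABLED=0 forces a fully static binary so it runs in scratch; -s -w strips symbols and DWARF
RUN CGO_ENABLED=0 go build -ldflags="-s -w" -o /server .

### run stage ###
FROM scratch
COPY --from=builder /server /server
ENTRYPOINT ["/server"]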
Using asm is a bit extreme and can lead to many problems. The toy project used is quite limited anyway; you could use one of the tiny C httpds and still get the size under 15k for similar functionality.
Would anyone use this in production? I doubt it; you want a reliable, battle-tested solution. Nginx can be compiled to 500kB without OpenSSL; if you need HTTPS then it's just a few megabytes, which I find acceptable for non-embedded uses like containers.