Computing is an endless cycle of inventing ways to isolate code in a private machine, followed by inventing ways to make it easier for those machines to interoperate.
Absolutely. I feel like society goes through changes in a similar cyclic way. We, as humans, have a finite span of understanding and attention, and so we end up creating cycles that are longer than that.
I keep wondering where we would be now if we had not spent a decade or two eagerly running down that lane before reassessing the be-all-end-all solution OOP promised encapsulation to be. Perhaps not as far along as we are after all the meandering we did; I think OOP has contributed a lot to the post-OOP world we live in now. It's not gone, it has just been demoted from ideology to tool.
We could argue that we've gone down the agents road since the microservices craziness. But that one backfired visibly and very quickly, and most people noped out of it in a blink.
The OOP humanity is so heavily invested in has very little relation to the vision in the GP.
This is a really interesting way to think about the progression.
As a timeline I like to plot the ratio of users to isolated compute. We've moved along points like users per building, users per room, users per computer, computers per user, kernels per user, processes per user.
If you use chroot to run something, it's interesting how the dynamic libs you need to get in place grows until you are mirroring a whole linux in a subtree. It gives you a sense for how you end up with containers.
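For the record, a rough sketch of that dance (paths assume a Debian-ish x86-64 host; the library list for bash is longer than shown, which is exactly the point):

    mkdir -p /srv/jail/bin /srv/jail/lib/x86_64-linux-gnu /srv/jail/lib64
    cp /bin/bash /srv/jail/bin/
    ldd /bin/bash    # lists libtinfo, libc, the ld-linux loader, ...
    cp /lib/x86_64-linux-gnu/libc.so.6 /srv/jail/lib/x86_64-linux-gnu/   # repeat per listed lib
    cp /lib64/ld-linux-x86-64.so.2 /srv/jail/lib64/
    sudo chroot /srv/jail /bin/bash

Keep iterating on whatever ldd (or the loader's error messages) complains about, and you quickly end up mirroring half the host.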
One thing that is wild to me is how nix solves this problem of things needing to be linked together. It doesn't solve it with containers, but by rewriting the locations of the links in the executable to point into the nix store. You can run ldd and see it in action.
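Something like this, illustratively (store hashes shortened/invented, but the shape of the output is the point):

    ldd ~/.nix-profile/bin/hello
        libc.so.6 => /nix/store/<hash>-glibc-2.38/lib/libc.so.6
        /nix/store/<hash>-glibc-2.38/lib/ld-linux-x86-64.so.2

Every dependency resolves into a versioned, hash-addressed path under /nix/store rather than /lib or /usr/lib.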
To me, all that points at containers being in some way a solution to dynamic linking. And maybe an over-the-top one.
Should we be doing more static linking? Not even depending on libc? What are the challenges with that?
Containers are a solution to dependency management for sure. In the world of the FHS, the dynamic linker is meant to solve a number of problems, like space saving and security updates, at the OS level by discovering dynamic deps installed in special library paths. One thing I've never liked about FHS is how everything is organized by kind.
The interesting thing about how Nix approaches the problem is to replace the concept of FHS almost entirely (only a couple binaries are linked to /) by hijacking PATH, and the linker configs like you mentioned. The biggest difference being that the whole version-pinned dep tree is encoded in a nix package (and in the linker config of the binaries it produces) rather than just the package itself.
At some level you could say there is no "dynamic" runtime linking in nix: instead of the linker discovering partially specified deps at run time, all of the link bindings happen at build time.
The FHS did attempt to solve the issue of multi-version dependencies with an interesting name and symlink setup, but they are usually still bound by fairly loose version constraints (like major version). Containers are a lot more like nix in this way, where deps are "resolved" at build time by the distro's package manager by virtue of controlling the process' filesystem.
This is one major issue with the reproducibility of container builds: the distro package managers are not deterministic, so you could run a build back-to-back and get different deps depending on your timing (yes, even between test and build CI steps).
Disclaimer: I'm not a strong containerization proponent.
The good part of containers is that you isolate the thing you're running. I'm very against resource waste, but if I can spend 90MB on a container image instead of installing a complete software stack to run a task which is executed weekly and runs for 10 minutes, I'd prefer that. Plus, I can create a virtual network and storage stack around the container(s) if I need to.
Case in point: I use imap-backup to backup my e-mail accounts, but it's a ruby application and I need to install the whole stack of things, plus the gems. Instead I containerize it, and keep my system clean.
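For anyone curious, the shape of that is roughly this (illustrative only, not my actual setup; imap-backup's CLI and config layout may differ, and a non-slim image may be needed if any gem has native extensions):

    docker run --rm \
      -v "$HOME/.imap-backup:/root/.imap-backup" \
      -v "$HOME/mail-backup:/root/mail-backup" \
      ruby:3.3-slim \
      sh -c 'gem install imap-backup && imap-backup backup'

The Ruby stack and the gems live and die with the container; the host only keeps the config and the backed-up mail via the bind mounts.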
Nix is something different and doesn't solve the "many foreign processes not seeing each-other on the same OS" problem. Heck, even Docker doesn't solve all problems, so we have "user level containers" which do not require root access and are designed to be run in multitenant systems.
> many foreign processes not seeing each-other on the same OS
For sure, I was thinking of the packaging nature of containers, not the 'security' nature of containers. The pivot_root part. Though I guess being able to have namespaces does make packaging clearer in certain cases.
For a horrible analogy: With actual shipping containers, we don't have each shipping container be a stripped down model of a ship, so that the things in it aren't confused.
The packaging is generally a side-effect of isolation in my experience. I never chrooted a piece of software because I needed a different library stack for it to run, but to isolate it from the rest of the system in one way or another (for security/access reasons).
Docker just made the interface more practical, and built the ecosystem around it. LXC, Apptainer and Podman have improved upon the idea in different ways to cater to different use cases.
So for me containers were never simplified ships to begin with. These different perspectives probably arise because people come at the thing from different angles at initial contact, which is normal.
Agreed, it isn't great. Shipping containers are more like tarballs: a bunch of stuff (files) bundled into one "stuff" (file) so they're easy to ship around and pieces don't get lost in the process.
I don't think there is a good physical-world analogy for what containers are doing, though. Maybe takeout including utensils and napkins to "simulate a kitchen" in case you eat it on the go, so you don't have to rely on the "system" forks and napkins? Still kinda rubbish honestly.
>>> To me, all that points at containers being in some way a solution to Dynamic linking. And maybe an over the top solution. ... Should we be doing more static linking? Not even depending on libc? What are the challenges with that?
>> but it's a ruby application and I need to install the whole stack of things, plus the gems. Instead I containerize it, and keep my system clean.
There was a bit of drama recently with the Bottles project and some distro maintainers, with the Bottles devs saying "we only want the AppImage as a distribution method".
I see containers as a means of distributing software. If that ruby app was a binary, then the container would become baggage.
>>> Good for them having wishes, yet free software doesn't work that way in most cases. :)
Historically I would agree with you.
But Bottles, and some other upstream projects, have an attitude of "modern", "faster", "support the latest version only"... Software devs who live in user space want a more "App Store"-like delivery mechanism where they have control. The kernel is inclined to never break backwards compatibility. Distros are sort of stuck in the middle...
>> I see containers ...
> That's OK. Docker is a tool,
One of these is a subset of the other... (depending on your perspective). Docker Desktop running a Linux image on a Windows box is most certainly a tool... that same container being ported to LXC in production makes it a package manager. Who is a subset of whom is something we could debate, but these things can be both.
Honestly I don't understand developers' obsession with building and distributing binaries themselves. That is the least fun part.
When I write something, I just release the source code and let others do the non-fun part for me. You just have to document the dependencies and how to build, and that's it.
Because it's easier to support, easier to answer bug reports and/or crash reports, etc. And distros can no longer mess up the build too much, which happens a lot with any sufficiently complicated piece of software.
Besides your average user does not want to deal with building something like OBS or Ardour or any of these programs that have a ton of dependencies and no real unified package manager to install them.
> Software devs who live in user space want a more "App Store" like delivery mechanism where they have control.
They can have control. They can say that they have flatpaks and AppImages they publish and support, that they only support the latest version (or the same minor version, whatever), and that packages obtained from distros may not be the latest.
There's no need for a yelling match, IMHO.
Distros also can do whatever they want. Like rclone. You can get the packages from rclone.org or from your distros. There's no yelling match, but trade-offs.
> ...that same container being ported to LXC in production makes it a package manager.
I don't think so. I have containers which work like binaries (in the form of "./binary infile outfile" fashion) and exit after processing what I give them. For me that container is a utility program as a whole. Same for the imap-backup example.
When you think of services which are always on, Docker might be a package manager, but I pack my own containers, for example, so it's more like compiling for me.
So, what docker or containers is depends on your perspective, or like a chameleon which changes its color according to the landscape it's in.
So, it's a tool in the end. Package managers are tools, too.
The thing with Bottles is that it also relies on a very specific setup of all underlying software. So in that case it's more about complexity that leads to the need for containerization than about keeping systems clean or containerizing for sandboxing.
Linux doesn't solve that problem at all. Qubes tries to do that with a hack, that is probably sufficient, but quite complex.
I'd say that we don't have good solutions to that problem. But nobody even tried our best candidates, so I'm not completely sure. (For a while, it looked like Android would finally try some. But then Google turned it into a user-hostile anti-privacy OS.)
One issue with static linking is that your dependencies will likely have critical CVEs over time. If you keep all your libraries separate on the filesystem, you can just do an "apt update; apt upgrade" and you will have all the latest patches. This will patch security issues in e.g. libssl or libc for all your applications that are dynamically linked against these shared libraries, which can be quite a few. In static binaries, the version of the libraries is not obvious from the outside. If you have, for example, 100 fully static binaries, these can come with 100 different major/minor/patch-level versions of their dependencies. You now have to patch each binary separately by upgrading and recompiling 100 times, which requires much more time and energy.
That all makes sense. But when those 100 binaries end up as 100 OCI images, and then to patch them you need to update those 100 OCI images to have the new version, it does seem like we've gone in a circle a bit.
I mean, there are some advantages, if they all share the same base layer, maybe they share those libs at least on disk via a shared layer. But practically, though you are maybe not back where you started, you are at a place that seems to share some similarities.
This is one of the biggest issues with containers IMO. This and the layering system, which I think is poorly designed both to configure and to actually do the tasks it's meant to (build and delivery caching).
The solutions to this problem in the space have basically been to provide scanners to crack open the containers and detect things with known vulnerabilities. But I have not seen (m)any solutions around these scanners to facilitate the lifecycle of landing fixes.
Even if you provided a tree of approved base containers for each deployment language in your org, you can't just update the base and deploy the world; there's not even tooling to automate working over the "FROM hierarchy" of images so you could detect which ones need to pick up new bases.
Because of the difficulty of managing large container hierarchies, in some orgs this drives a common methodology of making image tags mutable, i.e. `ruby:myorg-v2`, which makes the FROM more like a dynamic-link reference that gets updated automatically on the next build. I view this workaround as a regression brought on by the _still_ incredibly poor and complex SDLC tooling around managing images.
It's like for a few years the whole computing world forgot about security updates and went on a container bender, then woke up one morning, realized what it had done, and started adding various clunky solutions on top: mutable tags, tooling to take apart container images and inspect the junk in there, scanners and notifications for issues, etc.
You mean like proposing JSON as XML replacement, rebuilding from scratch all the XML tooling including validation, and when almost done with it, replacing JSON with YAML?
You are right. If you put 100 dynamically linked binaries into 100 OCI images, then you have the same security issues all over again. As best practice, I would recommend using a container vulnerability scanner that can identify containers requiring updates (list CVEs). I think all major cloud providers have such a service available, and there are some free and open-source tools available, such as Trivy and Clair. It is also beneficial to use official container images that have frequent patches available for their base images. If you use a base image like 3.9-slim instead of 3.9.19-slim, you can, for example, pin your Python version to 3.9, but you get patches. But this again only works if you do not have a "FROM scratch" image with just a single fully static binary.
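Concretely, the difference between a floating and a fully pinned tag looks like this (tags are just examples):

    docker pull python:3.9-slim      # floating: re-resolves to the newest 3.9.x patch build
    docker pull python:3.9.19-slim   # frozen at that exact patch level
    docker build --pull -t myapp .   # --pull re-checks the FROM tag at build time

So periodic rebuilds against the floating tag pick up base-image security fixes, at the cost of the build no longer being strictly reproducible.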
Containers are powerful because they solve many computing issues, one of which is being able to act as a (lower-case c) static container for dynamically linked apps as well as for cross-language, multi-executable meta-apps.
Containers also provide many forms of isolation (network, file system, etc.), a modern versioning and distribution scheme, and composability (use another container as a base image).
All of these things can, and perhaps should, be done at the language level as well, but containers also work across languages, across linking paradigms, and with existing binaries.
I haven't seen any other response mention it yet, but containers are also heavily used for web-exposed services in part because of address space and port contention. Network namespaces allow you to graft an overlay network onto your physical network in a relatively simple and easy way (not that it's actually easy, but networking never is).
Otherwise, sure, nix can rewrite the RPATH in your ELF file to make it pull dynamic libs from the nix store, but what does it do when two processes both want to listen on ports 80 and 443?
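Network namespaces are what answer that. A rough sketch (assumes iproute2 and python3 on the host):

    sudo ip netns add web1
    sudo ip netns add web2
    sudo ip netns exec web1 python3 -m http.server 80 &
    sudo ip netns exec web2 python3 -m http.server 80 &

Both bind :80 on their own stacks without conflict; wiring them to the outside world (veth pairs, a bridge, NAT) is the genuinely-not-easy part.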
Possibly, if the Internet ever actually goes pure IPv6, one LAN will have enough addresses to assign one to each process instead of each host.
There are, of course, other ways to handle it. People used vhosts predominantly defined in a dedicated web server that was really only a reverse proxy, but now you need nix and nginx. Then you discover you also want resource isolation. Is there a userspace alternative to cgroups? I don't see how there could even in principle be an alternative to PID/UID namespaces and UID/GID submapping. Some things have to happen in the kernel and that means containers of some sort. It doesn't have to be the exact OCI standard that eventually grew out of Docker and then Kubernetes, but some kind of container.
I think what you are saying is true, and my knowledge of networking is pretty slim.
But, to play along with my static linking thought-exercise: if you take a process and put it in a network namespace then is it a container? I wouldn't say it is. Container runtimes might have a nice interface for namespacing, but namespacing something doesn't make it a container.
I guess my thought experiment is: if everything were statically linked binaries and you had a way to run them with the cgroup and namespace settings you wanted, would the packaging aspect of containers add anything?
The elites don't want you to know it, but namespaces are just there for the taking. You can grab as many as you want. You can set the memory limit on any process with cgroups, no docker desktop required. :)
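A rough cgroup v2 sketch of doing exactly that by hand (assumes the usual /sys/fs/cgroup mount and that the memory controller is enabled in the parent's cgroup.subtree_control; on a systemd host you'd normally let systemd delegate this):

    sudo mkdir /sys/fs/cgroup/demo
    echo 200M | sudo tee /sys/fs/cgroup/demo/memory.max
    echo $$   | sudo tee /sys/fs/cgroup/demo/cgroup.procs   # move this shell under the limit
    sudo unshare --pid --mount --fork --mount-proc bash     # and a namespace is just there for the taking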
Anyways, just a thought experiment about how the industry sometimes seems to be going in a circle, in the fashion of the lady who swallowed a fly.
They are complementary. If you want to reuse the shared parts of binaries, then you need a way to separate the binary image into parts that are specific to the application and libraries that can be reused across binary images, plus some metadata on how to reconstruct the image from its parts. That's exactly what dynamic libraries provide.
In general, it's much easier to link a binary with its libraries than go in to the opposite direction (extracting common code from static linked binaries), because once you statically link a binary the library code will vary slightly due to differences in memory addresses, compiler optimizations, unused code that has been omitted, different input library versions, etc.
Even if you were, in theory, able to write a complex filesystem driver that is able to extract the common parts of statically linked libraries so they can be deduplicated, then to reconstruct the original binary in memory, you'd have to perform something similar to dynamic linking, except now in the kernel, which really isn't an improvement.
> If you want to reuse the shared parts of binaries
But aren't we, when we use OCI images as a packaging mechanism, using containers to essentially throw away that sharing and arrive at a complicated version of static linking, where everything dynamically linked is shipped with the program?
Same goes for arguments about ease of patching things. When the software's package is actually an image, you are patching each image individually that is running on the system.
No, because you can share common libraries across containers by putting them in a separate layer.
For example, if you have a complex service that consists of multiple binaries all written in C++ using boost, then for each binary you can create a container that contains a layer of a base OS (shared), C++ libraries (shared), boost libraries (shared), application binary (unique).
All the services can now share their common libraries, both on disk and in memory, which reduces I/O and memory use. That's one of the main advantages of containers over virtual machines (VMs): each VM instance has a distinct region of memory that is not shared with others even if they happen to load bit-for-bit identical binaries into memory.
(I know, VM memory deduplication exists to ameliorate this problem, but here my previous comment applies: it's much easier to start from shared components and link them together than extract the shared data after the fact. And typically VMs have lots of nonsharable state that containers do share, like pretty much all writable kernel pages.)
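A sketch of what that sharing looks like in practice (image names invented):

    docker build -t svc-a ./svc-a   # FROM myorg/cpp-boost-base:1.0
    docker build -t svc-b ./svc-b   # FROM myorg/cpp-boost-base:1.0
    docker system df -v             # the per-image SHARED SIZE column shows the common base layers counted once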
A docker image consists of several layers, each of which contains only the modifications to the layers below it. Each layer and the final image is immutable. Docker uses OverlayFS to provide a unified view of the various layers.
A running container is based on an immutable image and a single writable layer. That writable layer is unique to the container which contains all modifications made to the immutable image by processes running in the container.
Docker relies on the immutability of layers to share them between containers. This is not much different from how regular Linux processes all share the readonly contents of binaries and libraries that they load, while each process has its own private heap space that is not shared with other processes.
That means that deleting a file from a base layer, either when building an image or at runtime from the container, doesn't actually modify the contents of that layer. It only adds a tombstone marker to the writable layer, that indicates the file was deleted, and OverlayFS creates the illusion that the file no longer exists inside that container.
(The flipside is that deleting files from immutable layers doesn't actually free up space because the actual file contents don't go anywhere, but that's rarely a problem.)
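It's easy to see the tombstone behaviour for yourself (sizes approximate):

    printf 'FROM alpine\nRUN dd if=/dev/zero of=/big bs=1M count=100\nRUN rm /big\n' > Dockerfile
    docker build -t whiteout-demo .
    docker history whiteout-demo    # the dd layer still weighs ~100MB, the rm layer ~0B
    docker image ls whiteout-demo   # the image size still includes the "deleted" 100MB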
In the same way a zip file is static linking, I guess, since Docker images are just gzipped tarballs. But .deb packages can include files and scripts that create directories, too.
I think a better analogy is, with containers you maintain more "servers" in the end, and patch and reboot them all.
So there's no free lunch and everything is a trade off. People thinking that containers are no more work than managing servers or services are in the wrong.
They do more than that. For instance, they can be swapped easily. Think of a security library being updated by the distro security team. Dynamic linking also makes it easier to depend on an LGPL library, nicely allowing users to modify the LGPL library without having to recompile your program.
Then there's probably a reason, like the changes haven't been tested yet or verified to make sure they don't FUBAR your machine. It's okay to go a little slower to ensure reliability. Plus, do you really want the latest bugs in HEAD anyway?
That's an interesting way to look at it. If all dynamic libs lived on a read-only location, then the file system could actually only store the libs in one place and the other "copies" would be just symlinks to that... and when the OS loaded such lib, it would automatically know that despite being in different locations, the libs were the same (they're all symlinks to the same place). Is this something that has been attempted before?
If you name the actual location of a file after a hash of its contents and symlink the file to that location, you naturally get deduplication. Fuchsia does this [1]. You still end up wanting to try and coordinate your packages to share as many deps as possible for resource optimization reasons, but you no longer depend on it.
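As a toy version of that idea (file names and paths invented):

    store=/var/lib/cas
    h=$(sha256sum libfoo.so | cut -d' ' -f1)
    sudo install -D libfoo.so "$store/$h"
    sudo ln -sf "$store/$h" /opt/app-a/lib/libfoo.so   # assuming those app dirs exist
    sudo ln -sf "$store/$h" /opt/app-b/lib/libfoo.so   # the second "copy" costs nothing

Identical contents hash to the same blob, so deduplication falls out for free.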
Slight tangent, but in the nodejs / npm ecosystem, that's one of the things that makes pnpm unique (and IMHO far superior to npm or yarn) -- its node_modules are deduped using symlinks.
This is exactly how nix works, except instead of symlinks it actually modifies the binaries at build time to point their dynamic library paths at absolute paths in the store, which include a source-derived hash in place of a version.
This is basically dynamic linking: you have foo.so sitting somewhere on disk, and multiple processes can just load it whenever. It can't be read-only, though; you need to add new libraries to that directory eventually.
Deduplication, in the sense of physical space reduction, seems like the least strong argument one could make for dynamic linking these days.
The strongest argument I can think of, is enabling system managers (distro maintainers, even end users) to update dependencies. This might be to apply a security patch, enable some kind of tracing for profiling an application, and so on.
Funny - it looks like you’re being downvoted for asking what I think is a very natural question. It’s one I’ve asked before; have we just created a more elaborate statically-linked executable via containerization? In the end, Docker/OCI seems like the universal Linux package manager.
I’m sure I don’t have the full picture since I’m far more ops than dev though.
I don't think so. Instead we created an artifact which can live inside an OS, but cannot see or touch the rest of the OS.
cgroups and namespaces together are a deceptively powerful mechanism. You can isolate a process resource-wise (X cores, Y amount of memory, Z amount of swap), network-wise (a different virtual network adapter with its own IP, bandwidth limits, etc.) and FS-wise (running in its own filesystem with only the devices it can see).
It can keep your system tidy by encapsulating elaborate stacks which makes system management painful, allows deterministic operation and image generation if you tag everything with version.
The downside is that you can do bad things with it, like terminating HTTPS at a gateway container and talking plain HTTP among your backends instead of configuring tools properly, or writing shoddy software and getting away with it because it works, or gets automatically restarted when it crashes 6 times a day.
I don't run every service as a container, because some services suffocate when they are in a container, but for short-running things which need system-wide changes to function, or for test-driving small services before fully committing, it's a good tool to have.
However, they're abused without end, and their popular use leaves a bad taste in your mouth.
If you have a go/rust/zig binary as your application do you need a container to run it?
Maybe, but it makes less sense at that point. If you're doing Node/Ruby/Python/PHP then yes, the container makes sense to drag your runtime to the server...
Do containers (Docker) make sense for dev? Sure, to a point, because our dev box (Win/Mac) might not look like our deploy target (Linux)... If we move to a standardized remote dev model then Docker makes less and less sense.
>> It’s one I’ve asked before; have we just created a more elaborate statically-linked executable via containerization?
The Bottles project only supports their AppImage, as they no longer want to be responsible for supporting the distro-maintained packaged version of their product.
Yes, containers are becoming a way of dealing with linking and dependency management. It's a blunt instrument for dealing with software packaging and distribution.
The huge advantage of containers is that you can use the same mechanism to run anything. You don't care if it is a single Go binary, a dynamically linked binary, or a Ruby interpreter and a thousand files.
You might start off writing Go programs, but then need to run a Postgres database for development. Or discover that you need a special library in some other language and it's easier to make it its own service. Or need to run a third-party service. With Docker, you run the image and don't care what's inside, and the isolation gives some assurance that it won't escape.
For me as a mere user, wanting to run some homelab services, the main advantages to containers are that they make updates easier (don't need to wait for distro), and it makes it much clearer where configuration and data lives, easing backup and rollback by orders of magnitude.
Static vs dynamic linking is an implementation detail as far as I'm concerned. If all the dynamic libs needed were in a well-defined location it wouldn't matter that much.
The benefit of waiting for maintainers to update your software is that you have a stronger guarantee that it won't break your system, or otherwise fubar something. Maintainers are the adults in the room saying "no, fix your shit" when sloppy developers release crap, which seems to be happening more and more frequently lately.
As for where configuration and data live, that's always available in the docs, and Linux convention puts stuff in /etc, so I'm not sure how containers help. And dynamic libraries are in a well-defined location, with environment variables and other tools that allow you to specify where they live. It's not like dynamic linking is an unsolved problem.
Maintainers are also the ones breaking software, so realistically the difference is basically moot. And for a container to fubar the system you have to really mess up. At worst that specific container fubars itself and you roll back a tag.
There are just fewer things that can go wrong when you get to a sufficient number of services. And lastly, moving to a new host is infinitely easier too: export the volume, import it on the new host, and off you go. And stuff like Kubernetes will just handle this for you (and more).
And as for those Linux conventions, they vary a lot from distro to distro; you can never be quite sure where that specific version of that specific distro puts its files. So having them just not be able to touch the host ever is a good thing.
Right, but a piece of software having a distribution maintainer doesn't mean it will never have bugs, and if it's a container there's already much less risk of it breaking my system.
As for your second paragraph, that's very idealistic. Config can live in /var, /usr, /home, /usr/local, literally anywhere. I find it much nicer when all data / configuration for a piece of software is all self contained.
I think of myself as a mere user as well, though I manage the container system/orchestration for a small SaaS company as well (we're weirdos who use Swarm instead of Kubernetes) and agree with you regarding the management benefits.
I wish I had read this article a decade ago. For many years I have been wondering "why the heck would I use containers when I have chroot, cgroups and namespaces?"
Turns out that's exactly what containers are a packaging of! And I only found out about two years ago.
Although this article doesn't go into it, the benefits I've found of using containers rather than rolling isolation by hand is that a lot of semi-standardised monitoring, deployment, and workload management tooling expects things to come packaged as containers.
> Turns out that's exactly what containers are a packaging of!
Well, no. When people say "containers", they always mean "Docker".
And Docker also comes with a daemon with full root permissions and ridiculous security policies. (Like, for example, forcefully turning off your machine's firewall, #yolo. WTF!)
P.S. I actually run systemd-nspawn in production, but I am probably the only person on earth to do so.
Those in the know are familiar with OCI, etc. but (without hard data to back me up) I think it's still fair to say that the majority of people (lay people, if you will) consider them the same thing by virtue of ignorance.
>By default, all external source IPs are allowed to connect to the Docker host. To allow only a specific IP or network to access the containers, insert a negated rule at the top of the DOCKER-USER filter chain.
Yikes. Should people read the docs? Yes. Should Docker not do this? Also yes.
Perhaps I belong to the minority, but I really don't think about containers as Docker. Actually, I don't remember the last time I used Docker for anything. For the past several years, I've been using either Podman or systemd-nspawn, as yourself.
> P.S. I actually run systemd-nspawn in production, but I am probably the only person on earth to do so.
Mind sharing a good practical introduction article, or set of articles, for using VEs (virtual environments) with it? I'm tied to LXD at the moment, which provides both operational ease and easy configuration fine-tuning when needed. I.e., for the projects I take care of, I understand and have tested how to set up network bridges, resource limiting, snapshot/rollback/creating new images for VEs, storage profiles (say, some I want to put on BTRFS, some on ZFS, some ...), and the simple `lxc ls` and `lxc shell <VE-name>` interfaces. Maybe systemd has all this kind of stuff as well? Or maybe it shines in a different area?
> P.S. I actually run systemd-nspawn in production, but I am probably the only person on earth to do so.
You're not alone, systemd-nspawn is very much underrated. I have used it a lot for machine containers, though I'm using podman+quadlet+systemd more right now.
systemd-nspawn with mkosi for generating workload images is still a nice & powerful ecosystem.
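For anyone wanting a taste, the basic flow is roughly this (Debian-ish host; the debootstrap step and paths are illustrative):

    sudo debootstrap stable /var/lib/machines/demo
    sudo systemd-nspawn -D /var/lib/machines/demo    # one-off shell inside the tree
    sudo machinectl start demo                       # boot it via systemd-nspawn@demo.service
    machinectl list                                  # rough analogue of `lxc ls`
    sudo machinectl shell demo                       # rough analogue of `lxc shell`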
I can't speak for anyone else, but I definitely feel like there's no time to actually learn about all of these tools before being thrown into them by management/other well-meaning ICs. The end result is everyone is using a tool they know nothing about, with predictable results.
Totally agree. There are many, many pulls for attention - I don't really fault these people I mention. It's just notable that with all of this noise, the smallest bit of specialization can go a long way.
Honestly, habits/being adaptable are most of it. For example, don't waste time searching if you know 'ansible-doc' or 'man 5 something.conf' has it.
It's a nice blog post but it still misses a few important building blocks without which it would be trivial to escape a container running as root.
Apart from chroot, cgroups and namespaces, containers are also built upon:
1) Linux capabilities - these split the privileges of the root user into "capabilities", which allows limiting the actions a root user can perform (see `man 7 capabilities`, `cat /proc/self/status | grep Cap` or `capsh --decode=a80425fb`)
2) seccomp - which is used to filter syscalls and their arguments that a process can execute. (fwiw Docker renders its seccomp policy based on the capabilities requested by the container)
3) AppArmor (or SELinux, though AppArmor is the default) - an LSM (Linux Security Module) used to limit access to certain paths on the system and to syscalls
4) masked paths - container engines bind-mount over certain sensitive paths so they can't be read or written to (like /proc/sysrq-trigger, /proc/irq, /proc/kcore etc.)
5) NoNewPrivs flag - while not enabled by default (e.g., in Docker), this prevents the process from gaining more privileges (e.g., suid binaries won't change the uid). A quick way to inspect these from inside a container is sketched below.
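Roughly, from inside a running container (standard /proc fields; capsh ships with libcap and may need installing):

    grep Cap /proc/self/status                                     # capability sets as hex masks
    capsh --decode=$(awk '/CapEff/ {print $2}' /proc/self/status)  # turn the mask into names
    grep Seccomp /proc/self/status                                 # 2 means a seccomp filter is active
    grep NoNewPrivs /proc/self/status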
If anyone is interested in reading more about those topics and security of containers, you may want to read a blog post [0] where I dissected a privileged docker escape technique (note: with --privileged, you could just mount the disk device and read/write to it) and slides from a talk [1] I have given which details the Docker container building blocks and shows how we can investigate them etc.
Excellent info! I started head-deving a project similar to nix-snapshotter[0] and I was thinking "ok, I can probably just build a CRI impl that builds a rootfs dir with nix and shells out to bubblewrap to make a 'container'".
But once I went through that mental exercise I started reading code in containerd and cri-o. Wow, these are _not_ simple projects; containerd itself has a full gRPC-based service registry for driving dynamic logic via config.
One thing I was pretty disappointed about is how deeply ingrained OCI images are in the whole ecosystem. You can replace almost all functional parts of the runtime, but not really the concept of images. I think images are a poor solution to the problem they solve, and a big downside of this is a bunch of complexity in the runtimes trying to work around how images work (like remote snapshotters).
Containers are a bad take on a solved problem. The problem was encountered, studied[0] and solved, decades ago.
During the Viet Nam conflict, the Air Force needed to plan missions with multiple levels of classified data. This couldn't be done with the systems of that era. This resulted in research and development of multi-level security, the Bell-LaPadula model[2], and capability based security[1].
Conceptually, it's elegant, and requires almost no changes in user behavior while solving entire classes of problems with minimal code changes. It's a matter of changing the default from all access to no access, all the way down to the kernel.
Conceptually, I've come to think of containers as a kind of "known-good starting point", the origin of a coordinate system where "movement" is adding things. A set of Dockerfiles form a trie where each line of the Dockerfile is a node in that trie's branch. The great benefit of containers is that they allow you to reach any possible point in the space for a single process, without affecting any other. The other features of containers are, to me, secondary, things like container images, or even access or resource control. The main draw of the tool is giving the user a declarative way to move reliably and repeatedly through system-space, and to do so for any number of processes. (The main cost is the ~20% overhead such a system incurs).
The point of containers is to run a process in an isolated environment. Microkernels by design allow isolating any process with very fine grain control, by allowing or disallowing certain IPC connections for a given process. Those connections can be enabled or disabled for a running process as well, which would essentially be like moving a process in and out of a container while it is running. Individual processes can also run entirely isolated stacks for things like networking, storage, etc. in an unprivileged way. The former can be particularly painful to deal with in Linux containers.
Containers are basically monolithic kernels playing catch-up with the features designed into microkernel-based operating systems.
If I'm remembering correctly from when I ran through the instructions at home, it was written for the original cgroup sysfs interface rather than the more modern cgroup2 [0]. You can figure out which you're running with
> mount | grep cgroup
> cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
which turns the examples into a nice "check your understanding"
Note: if you look into the details of how setting up namespaces and cgroups works you'll run away in horror. The APIs are very iteratively evolved piecework, not really a coherent(ly designed) abstraction.
I'm still not sold on the "why" wrt kubernetes. I hate that my resource hog map reduce jobs run on the same kernel and contend for the same resources as my user facing live site service.
That is one of the reasons why. Containers share the kernel and system resources. When you want to start running a bunch of containers in a particular configuration, that's when you'd use a container orchestration tool like kubernetes to define how and where you want those containers to run across multiple systems.
While you could schedule containers manually, or just run your application on VMs or hardware manually, something like kubernetes will let you define rules which it will dynamically evaluate against your infrastructure. You can instruct kubernetes to run your map reduce jobs on different nodes than your user-facing site... and you can give kubernetes an arbitrary number of nodes to work with, and it can scale your workloads for you automatically while also following your rules.
I guess I would prefer "kubernetes but with VMs instead of containers". The overhead of running in a VM is not very high, and a hypervisor can restrict resource usage much more effectively - so that we could still bin pack map reduce jobs on the same machines as live site services
Kubernetes has pluggable container runtimes. There are ones for running VMs, including the new lightweight VMs. They use standard OCI images.
Using VMs with Kubernetes only makes sense when you need the strict isolation. If you are running own code, then containers are faster. Containers also perform better because they can share resources on host. In Kubernetes, containers can have minimum and maximum limits, which means they can dynamically use space not used by other containers. VMs need to be allocated memory when they start.
But Kubernetes has very good support for segmenting applications and long-running processes; you don't even have to segment the nodes, you can just "let it happen" (although you should probably segment the nodes somewhat). You can set (anti-)affinity, for example, to keep applications from being scheduled together, etc. And there are quite a few more knobs on the scheduler that you can tune.
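A rough kubectl sketch of that kind of separation (node names and labels invented):

    kubectl label node node-1 node-2 workload=live-site
    kubectl label node node-3 node-4 workload=batch
    kubectl taint node node-1 dedicated=live-site:NoSchedule
    kubectl taint node node-2 dedicated=live-site:NoSchedule

Then the live-site Deployment gets a nodeSelector for workload=live-site plus a matching toleration, and the map-reduce jobs get a nodeSelector for workload=batch, so the scheduler keeps them apart.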
This might be what you're looking for? IIRC it was written for the older cgroup (v1) sysfs interface, so you may need to cross reference it with the cgroup2 documentation
Something that is nice in the container world -- better than Docker -- are LXC containers, but the steward of the project, Canonical, seems to have done a bad job with it. Last time I played with LXC the UX was clunky.
If you could have the automation/configuration of Docker/Podman for LXC, that would be nice.
I've used Proxmox to manage my LXC workloads for years and it's been great, although I'm unaware to what extent it meets your criteria of offering automation. I find its interface to do roughly what a VM host (VirtualBox, VMWare, etc.) can do, but for LXC containers (and QEMU VMs) instead of VMs.