I am fairly comfortable with Linux as a user for things like understanding processes, ports, key files and utilities, etc. The way I model abstractions like containers is to learn the underlying OS primitives: cgroups, changing root, network isolation. Once you see how those pieces come together to create the container abstraction, you can map them to the system calls the OS provides. There are usually also bundled utilities (like `chroot`) for interfacing with those primitives as an operator.
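The utility-to-syscall mapping is often a thin wrapper, which `strace` makes visible (illustrative, abridged output; `/srv/rootfs` is a placeholder):

```
$ sudo strace -e trace=chroot chroot /srv/rootfs /bin/true
chroot("/srv/rootfs")                   = 0
$ sudo strace -e trace=unshare unshare --net /bin/true
unshare(CLONE_NEWNET)                   = 0
```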
Solaris Zones by Price and Tucker:
Jails: Confining the omnipotent root by Kamp and Watson:
Many of those pieces are useful on their own. For example, you don't need a full container if all you want is to ensure that some applications use a VPN while others use your public network address. A network namespace is all you need; those are accessible through simple CLI tools such as `unshare` and `ip netns` and don't require behemoths like dockerd.
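A minimal sketch of that split, with placeholder names and addresses (attaching the VPN to the namespace's uplink is a separate step):

```
sudo ip netns add vpn
sudo ip link add veth-host type veth peer name veth-vpn
sudo ip link set veth-vpn netns vpn
sudo ip addr add 10.0.0.1/24 dev veth-host && sudo ip link set veth-host up
sudo ip -n vpn addr add 10.0.0.2/24 dev veth-vpn
sudo ip -n vpn link set veth-vpn up
sudo ip -n vpn route add default via 10.0.0.1
sudo ip netns exec vpn some-app   # sees only the namespace's routes
```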
The tricky part is using them all together correctly, initializing them in the right order, not getting the control daemons confused by running in the wrong context and so on. That's where many of the security vulnerabilities come from.
Unionfs: A Stackable Unification File System:
> This project builds a stackable unification file system, which can appear to merge the contents of several directories (branches), while keeping their physical content separate.
> Unionfs allows any mix of read-only and read-write branches, as well as insertion and deletion of branches anywhere in the fan-out.
> To maintain unix semantics, Unionfs handles elimination of duplicates, partial-error conditions, and more.
If it is the same thing (but perhaps better maintained, or with more features...), could we implement something like the Porg package manager on top of Overlayfs?
(Porg is a package manager that tracks all files installed by `make install` and mounts them on a Unionfs layer.)
Suppose three layers A, B, and C are union-mounted, each containing /foo/bar, and you delete the file through the mount. What should happen?
1. Does /foo/bar get removed from the topmost layer, exposing the one below?
2. Does /foo/bar get removed from all three layers?
3. Does /foo/bar get replaced with a "tombstone" record to pretend that it was deleted, while still appearing in some of A, B, or C on its own?
These semantics are tricky to get right, and during the process of upstreaming unionfs into the kernel, some incompatible changes were made to the model, different answers were chosen for these questions, and the result was renamed overlayfs.
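Overlayfs's answer is the tombstone, which it calls a whiteout, and you can see it directly (directory names here are arbitrary):

```
mkdir -p lower upper work merged
echo hello > lower/foo
sudo mount -t overlay overlay \
    -o lowerdir=lower,upperdir=upper,workdir=work merged
rm merged/foo      # delete through the union
ls merged          # foo is gone from the merged view
ls -l upper        # foo reappears as a 0/0 character device: the whiteout
ls lower           # the original foo is untouched
```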
Unionfs is significantly more complicated.
> Union mounts have also been available in BSD since at least 1995.
btrfs subvols/snapshots have their own costs: they can get slow once you accumulate thousands of them (it's fine if you just use a few at a time). But you can create a single btrfs subvolume, store your overlays in there, and then delete the subvolume when you're done.
If you're going to try btrfs, check that your system/tools handle the space overcommit correctly. Some ways of checking the available space don't play well with snapshots (as in, they report less space available).
Proceed by making another r/w snapshot for each subprocess.
Just work inside that folder (which behaves ~like a CoW bind-mount) and copy the artifacts into the "common" snapshot.
`btrfs sub del -c path/to/temp/build/snapshot` will clean it up and ensure it won't reappear after a crash. You can skip the `-c` if you don't care about it reappearing.
If you have enough space to handle up to maybe a couple minutes delay in garbage collecting, you don't need to force any special syncing or such, as it just happens in the background.
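Putting that workflow together (all paths hypothetical):

```
btrfs subvolume create common                # shared base
btrfs subvolume snapshot common build-tmp    # cheap r/w snapshot per task
( cd build-tmp && make )                     # writes are CoW'd; base untouched
cp build-tmp/output/artifact common/         # keep what you need
btrfs subvolume delete -c build-tmp          # -c: commit, so no post-crash ghost
```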
As for the phases, I don't think this is a problem. btrfs snapshots are like git branches that don't allow merging, leaving explicit copy as the only recourse.
Just to make sure: this is per point in time, and not just because of history? You can usually make things somewhat cheaper by doing a shallow clone. But of course that doesn't help if, at any given point in time, the repo is still too big.
And he said even deleting the diffs in an overlay is too expensive, even though that should normally be cheaper than deleting a worktree.
Very interesting. Thanks for sharing!
Maybe it's out there and I've just missed it, but I really wish there were richer ways to build up a container FS than the usual layers approach, with sets of invocations to get from one layer to the next. It's common to see invocations that mean totally different things depending on when they're run (e.g. "apt update"), and then special cache-busting steps like printing the current time.
I know part of this is just using a better package manager like Nix, but I feel like on the Docker side you could do interesting things like independently running steps A, B, and C against some base image X, and then creating a container whose FS is X+A+B+C, even though the original source containers were X+A, X+B, and X+C.
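As a sketch of what the filesystem already permits: if A, B, and C were available as plain directories of layer diffs (an assumption; extracting them from the images is the missing tooling), overlayfs can mount the combination directly, since lowerdir takes a stack:

```
# leftmost lowerdir is the topmost layer; X is the base at the bottom
sudo mount -t overlay overlay \
    -o lowerdir=C:B:A:X,upperdir=upper,workdir=work merged
```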
It looks like the layer limitation comes from the underlying union filesystem stuff and not from anything inherent in Docker itself. I wonder if it would be possible to build a new filesystem driver for Docker that could serve up an infinite ecosystem of immutable nix packages without having to actually unpack them. Whether such a thing would actually be of value would probably depend a lot on the use case, but I could imagine a CI type scenario where you wanted to have a shared binary cache and individual nodes able to come up very fast without having to unpack the pkgs and prepare a full rootfs each time.
As a side note: I’m somewhat bewildered by the docker-controlled-by-kubelet-on-systemd architectures as there is huge overlap in what these layers do and how they are implemented.
> make install (or ./weird-application or 'npm install' or whatever)
and all changes made to the filesystem in that shell are written to the upperdir instead. In the example above, the upperdir would contain files such as upperdir/usr/bin/app and upperdir/usr/include/header.h. (A minimal setup sketch follows the list below.)
It's useful when
* You want to create a deb/rpm/tgz package, but the makefile doesn't support DESTDIR
* An application writes configuration files somewhere, but you don't know where
* An application would pollute your filesystem by writing files in random places, and you want to keep this application local to "this directory"
* In general when you want to know "which files does this application write to" without resorting to strace
* Or when you want to isolate the writes, but not the reads, that an application does
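A minimal sketch of such a setup, assuming overlayfs plus a mount namespace (paths are arbitrary, and `upper` plays the role of the upperdir mentioned above):

```
mkdir -p upper work merged
sudo unshare --mount sh -c '
    mount -t overlay overlay \
        -o lowerdir=/,upperdir=upper,workdir=work merged &&
    exec chroot merged /bin/bash
'
# run `make install` in that shell, exit, then inspect the diff:
find upper -type f   # e.g. upper/usr/bin/app, upper/usr/include/header.h
```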
I got this idea many years ago, when the first personal cloud storage services appeared and offered some limited space for free. I thought it would be nice if I could pool them and fill them with their different capacities taken into account. And if I could also stripe and erasure-code them for higher availability...
I still wonder if there's something that can do this, and if there isn't, I'd like to know why, since it looks like a useful and obvious idea.
Using the mirror raid functionality is fine: it can do n-copy redundancy with >= n disks and efficiently uses unequal disk sizes.
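Assuming this refers to btrfs's mirror profiles (my reading of the thread), that looks like:

```
# raid1 keeps 2 copies across any >=2 devices of possibly unequal size;
# raid1c3 / raid1c4 raise that to 3 or 4 copies
mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc
btrfs balance start -dconvert=raid1c3 -mconvert=raid1c3 /mnt
```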
I used this to build custom ISO images based on an existing ISO file. The old method mounted the ISO image, rsync'd the entire contents, copied and updated some files, and then created a new image. That took quite a while and (temporarily) wasted a lot of disk space. It was initially sped up by copying the temp ISO to a RAM disk, which presented its own challenges and wasn't as fast as the eventual solution: aufs on top of the ISO mount to patch the image. Worked like a charm and sped up the ISO building considerably :)
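A rough sketch of that trick (image names and paths are placeholders; boot-catalog options omitted):

```
mkdir -p /mnt/iso /tmp/patch /mnt/union
mount -o loop,ro original.iso /mnt/iso
mount -t aufs -o br=/tmp/patch=rw:/mnt/iso=ro none /mnt/union
cp vmlinuz-new /mnt/union/boot/vmlinuz       # patch files "in place"
genisoimage -o custom.iso -R -J /mnt/union   # master the new image
```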
edit: oops, I've read the article now; I guess aufs acts the same and creates a tombstone file.
Is it just me, or is the title a bit inaccurate? There's more to "How containers work" than overlays; it made me think it would cover more than it actually does, e.g. cgroups, namespaces, etc.
Does anyone know of a more in-depth article on the building blocks of containers, one that lets you build a rudimentary container from scratch to appreciate what goes into one?
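In the meantime, a rudimentary "container" is surprisingly small when built from the standard tools (the rootfs tarball is a placeholder; any minimal rootfs works):

```
mkdir -p rootfs && tar -xf alpine-minirootfs.tar.gz -C rootfs
sudo unshare --pid --fork --mount --uts --ipc --net \
    chroot rootfs /bin/sh -c 'mount -t proc proc /proc && exec /bin/sh'
# namespaces + chroot done; cgroup limits and an overlayfs "image"
# would be the next pieces to bolt on
```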
This was posted on here a few months back, and it's a cool little tool for seeing how Docker fits layers together:
I'm currently using overlay because I leave my Docker mount on my root partition, since ZFS has always been a pain for root. I hear Ubuntu 19.10 is smoothing that over a bit, though.
It's too bad, because I've run into issues on lower-memory boxes where the Linux block cache and the ARC compete for memory and cause performance slowdowns or errors. If all the filesystems were ZFS, this wouldn't be an issue.
Also all the other ZFS benefits like resisting bitrot etc.
So all you need to do is manually set the storage driver to overlay2 when your FS is ZFS, and this just works?
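For what it's worth, the driver is selected in the daemon config, and `docker info` shows what's active (whether overlay2 over a ZFS-backed /var/lib/docker behaves well is a separate question):

```
echo '{ "storage-driver": "overlay2" }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker
docker info --format '{{.Driver}}'   # prints the active storage driver
```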
On a block-based CoW filesystem, shared executables and libraries will not be shared among containers, because the page cache is file-based, not block-based.
With Overlayfs, or other file-level CoW storage, when one starts several containers from the same image, shared libraries and executables are loaded only once (and so are actually shared) across containers.
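A sketch of why: two containers from one image are two overlay mounts over one lowerdir, so an unmodified file is a single inode, and the page cache holds a single copy (paths arbitrary, `busybox` standing in for a shared library):

```
mkdir -p lower up1 w1 m1 up2 w2 m2
cp /bin/busybox lower/
sudo mount -t overlay overlay -o lowerdir=lower,upperdir=up1,workdir=w1 m1
sudo mount -t overlay overlay -o lowerdir=lower,upperdir=up2,workdir=w2 m2
# m1/busybox and m2/busybox hit the same lower inode until a write
# triggers copy-up; a block-level clone gives each its own inode up front
```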
A userspace implementation of union directories, with control over which half of the union gets precedence for certain operations such as file creation.