How Containers Work: Overlayfs (jvns.ca)
582 points by saranshk 19 days ago | 72 comments



I enjoyed this blog post. Julia does a great job of distilling an idea down with examples.

I am fairly comfortable with Linux as a user for things like understanding processes, ports, key files and utilities, etc. The way I understand how to model abstractions like containers is to know the various OS primitives like cgroups, changing root, network isolation. Once one sees how those pieces come together to create the container abstraction, they can be mapped to the system calls provided by the OS. Usually they also have utilities bundled (like `chroot`) to interface with those primitives as an operator.


I have been confused about containers for so long but having read your comment and looking up the terms you mentioned allowed me to finally find the right articles that explained containers to me. Thanks!


Can YOU post links to the articles so we can learn?

Thanks!



On Linux, containers usually involve more primitives than just cgroups and namespaces. Bind mounts, overlayfs (TFA), veth network interfaces (to connect the network namespaces), network bridges, memfd, seccomp, procfs, etc. are all bits and pieces that are used by most containers/sandboxes.

Many of those pieces can be useful on their own. For example you don't need a full container if all you want to do is to ensure that some applications use a VPN and others use your public network address. A network namespace is all you need and those are accessible through simple cli tools such as `unshare` and `ip netns` and don't require behemoths like dockerd.
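To make the network-namespace point concrete, here is a minimal sketch that only builds the `ip netns` command lines rather than running them (running them needs root). The namespace name `vpn` and the `curl` invocation are placeholders, not anything from the comment above.

```python
import shlex


def netns_commands(name, app_argv):
    """Return the argv lists that create a network namespace and run an app in it."""
    return [
        ["ip", "netns", "add", name],                 # create the named namespace
        ["ip", "netns", "exec", name, *app_argv],     # run the app inside it
    ]


if __name__ == "__main__":
    # Print the commands instead of executing them.
    for argv in netns_commands("vpn", ["curl", "https://example.com"]):
        print(shlex.join(argv))
```

A fresh namespace starts with only a loopback interface that is down, so in practice you would still have to move the VPN interface into it or connect it with a veth pair before the application has any connectivity.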

The tricky part is using them all together correctly, initializing them in the right order, not getting the control daemons confused by running in the wrong context and so on. That's where many of the security vulnerabilities come from.


Hmm... how are Overlayfs and Unionfs different? From the explanation I can't find any differences...

Unionfs: A Stackable Unification File System[0]:

> This project builds a stackable unification file system, which can appear to merge the contents of several directories (branches), while keeping their physical content separate.

> Unionfs allows any mix of read-only and read-write branches, as well as insertion and deletion of branches anywhere in the fan-out.

> To maintain unix semantics, Unionfs handles elimination of duplicates, partial-error conditions, and more.

If it is the same thing (but maybe better maintained or with more features...), can we implement something like the trip[1][2] package manager on top of Overlayfs?

(Trip is a package manager where all files installed by make install are tracked and mounted on a Unionfs layer.)

[0] http://unionfs.filesystems.org

[1] https://github.com/grencez/trip

[2] http://www.linuxfromscratch.org/hints/downloads/files/packag...


They're mostly the same, with different answers to tricky questions. e.g. if I stack filesystems A, B, C, and have the same file /foo/bar in all of the layers, and then do rm /foo/bar, what happens:

1. Does /foo/bar get removed from the topmost layer, exposing the one below?

2. Does /foo/bar get removed from all three layers?

3. Does /foo/bar get replaced with a "tombstone" record to pretend that it was deleted, while still appearing in some of A, B, or C on its own?

These semantics are tricky to get right, and during the process of upstreaming unionfs to the kernel, they made some incompatible changes to the model and chose different answers for these questions, and as a result, renamed it overlayfs.


It looks like overlayfs removes most of the tricky questions by only allowing the topmost filesystem to be read-write. In that case rm is pretty simple: remove it from the topmost filesystem if it exists there, and make a tombstone if it exists in a lower layer.
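The rule described above can be sketched as a toy model in which each layer is a plain dict. This is only an illustration of the semantics, not how the kernel implements it; `WHITEOUT`, `lookup`, and `remove` are made-up names.

```python
WHITEOUT = object()  # stand-in for a whiteout ("tombstone") entry


def lookup(path, upper, lowers):
    """Resolve path: the upper layer wins, then lower layers from top to bottom."""
    if path in upper:
        entry = upper[path]
        return None if entry is WHITEOUT else entry
    for layer in lowers:
        if path in layer:
            return layer[path]
    return None


def remove(path, upper, lowers):
    """Delete path from the merged view, writing only to the upper layer."""
    if lookup(path, upper, lowers) is None:
        raise FileNotFoundError(path)
    upper.pop(path, None)              # drop the upper copy if there is one
    if any(path in layer for layer in lowers):
        upper[path] = WHITEOUT         # hide the copies that remain below
```

With `/foo/bar` present in every layer, `remove` leaves the lower layers untouched and the file still disappears from the merged view, which is option 3 from the list further up restricted to a single writable layer.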

Unionfs is significantly more complicated.


Note that there is also a BSD UnionFS filesystem doing much the same thing. I don't know what relation it has to the Linux OverlayFS, only that they (superficially) do very similar things.


Interesting. It has apparently been around for decades.

>Union mounts have also been available in BSD since at least 1995.

See: https://en.wikipedia.org/wiki/UnionFS


They are very similar. Overlayfs is in the default kernel tree, which would be the biggest difference.


I recently made a system that uses overlays to provide work spaces for a complex build process. However, there is significant overhead paid for unmounting and deleting the files in the overlay after all the work is done. I was thinking about changing the system such that I allocate a partition ahead of time, write all the overlays there, and on success just blow away the partition and with it the overlays. This is kind of a pain in the ass. Can anyone suggest a method for rapidly deleting all the data generated by using 100's of overlays? Maybe BtrFS snapshots would be better? What are the pros and cons? Thank you so much and I apologize for "anything" up front :)


With recent kernels you can combine overlayfs and btrfs.

btrfs subvols/snapshots have their own costs: they can get slow once you accumulate thousands of them (it's fine if you just use a few at a time). But you can create a single btrfs subvolume, store your overlays in there and then delete the subvolume when you're done.


Good idea.


I'm not sure I understand your use case well, but have you tried lvm's snapshots? That could be the simplest solution.

If you're going to try btrfs, check if your system/tools handle the space overcommit correctly. Some ways to check the available space don't really play well with snapshots. (As in, they report less space available)


The use case: A build system runs 100's of compile/test jobs that use the same underlying git repo. The git repo is massive so giving each process its own copy is not practical. Instead each subprocess gets a unique overlay to do its work. After a subprocess finishes, if it has any artifacts that will be used by a later stage of the build, they are copied to a "common" overlay, which gets mounted on top of the original lower file system used in the previous phase. So the system needs to be able to use multiple lower (RO) overlays as it moves through each phase. After it's all over, all the data needs to be quickly deleted so that the system is free to perform the next job.
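For readers who have not stacked several lower layers before, this is roughly what the per-job mount looks like. The sketch only assembles the `mount` command; all paths are hypothetical, and the real build system may differ. Overlayfs takes the lower directories as a colon-separated list with the topmost layer first.

```python
def overlay_mount_argv(lowers, upper, work, merged):
    """Build a mount command for an overlay with several read-only lower layers.

    lowers is ordered topmost first, matching the order overlayfs expects
    in its lowerdir= option.
    """
    opts = f"lowerdir={':'.join(lowers)},upperdir={upper},workdir={work}"
    return ["mount", "-t", "overlay", "overlay", "-o", opts, merged]


if __name__ == "__main__":
    # Hypothetical layout: the "common" artifacts sit on top of the git checkout.
    print(" ".join(overlay_mount_argv(
        ["/build/common", "/srv/repo"],
        "/build/job1/upper",
        "/build/job1/work",
        "/build/job1/merged",
    )))
```

The upper and work directories have to live on the same filesystem, which is one reason it is convenient to keep every job's scratch space in a single place that can be discarded in one go.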


You can use a btrfs subvolume for the build location, and take snapshots at all rollback points. This is fast. Basically, you create an r/w snapshot at first that will evolve the current batch's "common" overlay.

Proceed by making another r/w snapshot for the/each subprocess.

Just work inside that folder (which behaves ~like a CoW bind-mount) and copy the artifacts into the "common" snapshot.

btrfs sub del -c path/to/temp/build/snapshot

will clean it up and ensure it won't reappear after a crash. You can skip the -c if you don't care about it reappearing.

If you have enough space to handle up to maybe a couple minutes delay in garbage collecting, you don't need to force any special syncing or such, as it just happens in the background.

As for the phases, I don't think this is a problem. btrfs snapshots are like git branches that don't allow merging, leaving explicit copy as the only recourse.
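A minimal sketch of the snapshot-per-job flow described above, again only building the `btrfs` command lines (the paths are invented for illustration). It spells out `subvolume delete` where the comment uses the abbreviated `sub del`.

```python
def snapshot_argv(source, dest):
    """Command for a writable snapshot of source at dest (btrfs snapshots are r/w by default)."""
    return ["btrfs", "subvolume", "snapshot", source, dest]


def delete_argv(path, commit=True):
    """Command that deletes a snapshot; -c waits for the transaction to commit."""
    argv = ["btrfs", "subvolume", "delete"]
    if commit:
        argv.append("-c")
    return argv + [path]


if __name__ == "__main__":
    # Hypothetical paths: one snapshot per job, removed when the job is done.
    job = "/build/jobs/job1"
    print(" ".join(snapshot_argv("/build/common", job)))
    print(" ".join(delete_argv(job)))
```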


I'd definitely try lvm snapshots in that case. They free you from the specific choice of the filesystem. But I don't know your exact workload, so... testing and profiling some options. Make sure you try larger lvm snapshot blocks to lower the overhead - 4KB may still be the default and it kills performance.


> The git repo is massive so giving each process its own copy is not practical.

Just to make sure: This is per point-in-time, and not just because of history? You can usually make things somewhat cheaper just by doing a shallow clone. But of course that doesn't help if at any given point-in-time the repo is still too big.


Even then, you can actually use multiple external worktrees, saving on the .git folder (or at least any significant size contributions of it).


I don't think git uses copy_file_range yet (at least I can't find it looking through its source), so it'll make expensive copies for each worktree.

And he said even deleting the diffs in an overlay is too expensive, which normally should be cheaper than deleting a worktree.


Correct, it's per point in time. Either way the overlays are a TON faster. :)


It is likely a lot of work, but sounds like working towards out-of-tree builds will pay great dividends going forward in your codebase.


Another way to do this with just userspace tools could be git clone -l, combined with committing artifacts to a temporary branch to be used by later stages. But maybe the space savings would still be suboptimal because there would be copies of the artifacts both in the workspace and the git db?


I have done something similar using a ramfs for the work directories. That way when you are finished you simply save what you want and delete it. Building is much faster than to disk. If your data does not fit in RAM, just use an image file in the same way.


I think there is a limit on how many layers Docker can handle in an image and if I recall correctly the number is like 63 layers. I don’t have any idea why it is like this but it could be chosen for performance reasons.


> The use case: A build system runs 100's of compile/test jobs that use the same underlying git repo.

Very interesting. Thanks for sharing!


Can you use a Linux tmpfs as the overlay? It's much faster to create/delete files, and you can simply unmount it at the end and its memory is immediately reclaimed.


Unfortunately, the overlays are too big and would eat too much RAM. I have not figured out why they get so big, as they are only modifying a small number of files. Sometimes I think the copy-on-write functionality is not working properly.


Try adding initial /build-tmpfs then copying or mounting /lib{n} ... then making new /specific-build-tmps with subdirectories or readonly for mounts /lib{n}. This way writes are discarded after a build (you destroy /specific-build on completion). If you can't easily split files between disparate sources, use a COW filesystem only for /specific/build/tmpfs. Likely to exhibit substantially different performance characteristics, with easily limited memory requirements. Did a lot of this with ZFS in the past, worked very well.


make sure you have redirect_dir=on,metacopy=on


Tangentially related: are there any good solutions for shipping data with containers (for which the CoW mechanism is not suited particularly well)? Is there a "hub" for volumes?


I feel compelled to point out that this idea of union/overlay FS layers has nothing to do with containers per se. On the other hand, it is critical to why containers got popular, as it is what makes the whole thing efficient in terms of both hardware resources and developer time.


Yeah the title should be "unionfs: a kernel feature having nothing whatever to do with containers, and some ways to use it" but I guess that's too long :-) . Problem is there is not some central marketing department for Linux that can even tell us what "containers" means. There are lots of people who think they are "using containers" who do not use this style of mount, and there are lots of people using this style of mount who do not consider themselves container users.


They really don't, and it was funny that period where you'd see Dockerfiles with all the commands in a single invocation to avoid "bloating" the resulting image with unnecessary intermediate products that ended up deleted.

Maybe it's out there and I've just missed it, but I really wish there were richer ways to build up a container FS than just the usual layers approach with sets of invocations to get from one layer to the next, especially when it's common to see invocations that mean totally different things depending on when they're run (eg "apt update") and then special cache-busting steps like printing the current time.

I know part of this is just using a better package manager like nix, but I feel like on the docker side you could do interesting stuff like independently run steps A, B, and C against some base image X, and then create a container that's the FS of X+A+B+C, even though the original source containers were X+A, X+B, and X+C.
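The X+A+B+C idea can be sketched with a toy model where a filesystem is a dict from path to content. This is a thought experiment, not an existing Docker feature; `layer_diff` and `compose` are made-up names, and a real tool would also have to handle metadata and directories.

```python
def layer_diff(base, result):
    """Paths a step added, changed, or removed relative to base (None means removed)."""
    diff = {p: c for p, c in result.items() if base.get(p) != c}
    diff.update({p: None for p in base if p not in result})
    return diff


def compose(base, diffs):
    """Apply independently produced diffs to base, refusing overlapping paths."""
    merged = dict(base)
    seen = set()
    for diff in diffs:
        clash = seen.intersection(diff)
        if clash:
            raise ValueError(f"steps conflict on: {sorted(clash)}")
        seen.update(diff)
        for path, content in diff.items():
            if content is None:
                merged.pop(path, None)   # the step deleted this path
            else:
                merged[path] = content   # the step added or changed this path
    return merged
```

The conflict check is the interesting design choice: steps that ran independently against X can only be combined safely when they touch disjoint paths, which is exactly where something like "apt update" in two steps would get in the way.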


Not sure if you're aware since you already mentioned Nix, but Nixpkgs has a nice function to export "graph-layered" Docker images[0]. There is some overhead in the conversion, but the rest of the build can be parallelized by Nix as usual.

[0]: https://grahamc.com/blog/nix-and-layered-docker-images


Fantastic, this is exactly what I was looking for, thanks for the pointer!

It looks like the layer limitation comes from the underlying union filesystem stuff and not from anything inherent in Docker itself. I wonder if it would be possible to build a new filesystem driver for Docker that could serve up an infinite ecosystem of immutable nix packages without having to actually unpack them. Whether such a thing would actually be of value would probably depend a lot on the use case, but I could imagine a CI type scenario where you wanted to have a shared binary cache and individual nodes able to come up very fast without having to unpack the pkgs and prepare a full rootfs each time.


That's already possible to some extent by combining staged builds and the experimental buildkit engine, which can run independent steps concurrently.


I’m still researching, but I got the impression that buildah from Red Hat can do this.


As I understand it, containers are just a set of concepts and kernel features put together to provide an abstraction that's not that different from virtual machines for common use cases.


It is a way to create a reasonably isolated environment without the VM overhead. On Linux it is usually called a container and implemented in terms of clone() and cgroups; BSDs have jails, Solaris has zones, and for Plan 9 it is a trivial concept. What is special about the Linux and Plan 9 cases is that the implementation is based around stuff that is generic and not strictly tied to this use case.

As a side note: I’m somewhat bewildered by the docker-controlled-by-kubelet-on-systemd architectures as there is huge overlap in what these layers do and how they are implemented.


I'm not an avid container user, but I think at least part of the popularity for certain dev/ops/CI use cases is how easy it is to get stuff in and out of them using bind mounts and the like. For example, having a run environment that's totally different from a build environment. The only way to do this with conventional VMs is to access them over some in-band protocol like SSH (or to have to re-build the run rootfs and reboot that machine, which is typically extremely slow).


In common usage, "containers" seems to be synonymous with "what Docker does". Meanwhile, "Docker" itself has blurred in meaning, since various other things have been called Docker, such as the non-native product "Native Docker", or whatever "Docker Enterprise" is (impossible to tell from the landing page description).


True, you can use other stacking filesystems with Docker (I believe it had/has a ZFS driver at one time?). The examples she shows in the comic are just about the filesystem and leave out the Docker pieces, so I'm wondering if this is just one part of a series.


I wrote this script[1] a while ago which creates an overlay and chroots into the overlay's workdir. It's pretty useful: with it, you can do something like

> overlay-here

> make install (or ./weird-application or 'npm install' or whatever)

> exit

and all changes made to the filesystem in that shell are written to the upperdir instead. So in the above example, the upperdir would contain files such as upperdir/usr/bin/app and upperdir/usr/include/header.h.

It's useful when

* You want to create a deb/rpm/tgz package, but a makefile does not have support for DESTDIR

* An application writes configuration files somewhere, but you don't know where

* An application would pollute your filesystem by writing files in random places, and you want to keep this application local to "this directory"

* In general when you want to know "which files does this application write to" without resorting to strace

* or when you want to isolate writes, but not reads, that an application does

[1]: https://gist.github.com/dbeecham/183c122059f7ba288397e8c3320...
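Once such a session has exited, answering "which files did this application write" comes down to walking the upperdir. A small sketch, independent of the script linked above; `written_paths` is a made-up helper name.

```python
import os


def written_paths(upperdir):
    """Yield paths, relative to upperdir, of every non-directory entry found there."""
    for root, _dirs, files in os.walk(upperdir):
        for name in files:
            yield os.path.relpath(os.path.join(root, name), upperdir)


if __name__ == "__main__":
    # "upperdir" is a placeholder for wherever the overlay's upper layer lives.
    for path in sorted(written_paths("upperdir")):
        print(path)
```

Deletions also leave entries in the upperdir (overlayfs records them as whiteout device nodes), so this listing covers removed files as well as created and modified ones.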


I'd be wary of that last point depending on what you mean by "isolate". Chroot is not a security feature so the isolation is not perfect. This shouldn't matter if you trust the application but if it could be malicious (or manipulated by something malicious) then you'd want a harder boundary. `pivot_root` perhaps?


Debian's schroot was made to do pretty much this, though largely obviated by modern container runtimes.


There was a practice of using MergerFS/OverlayFS for pooling multiple drives (often by SnapRAID users), but what's still missing (to my knowledge) is some sort of a balancing layer, that could distribute writes.

I got this idea many years ago, when the first personal cloud storage services appeared and offered some limited space for free. I thought it would be nice if I could pool them and fill them taking their different capacities into account. And if I could also stripe and EC them for higher availability...

I still wonder if there's something that can do this and if there isn't I would like to know why, since it looks like a useful and obvious idea.


What do you mean by "distribute writes"? One of mergerfs' primary features has always been its policies, which provide different algorithms for choosing a branch to apply a particular function to.

https://github.com/trapexit/mergerfs#policy-descriptions


aufs sort of could do that at the file layer. The issue is you run into a bunch of incompatibilities with how applications can expect it to work. As soon as you want to start looking at striping and EC then you really need to just go with something like ZFS or btrfs.


Please don't try erasure coding within btrfs until they declare it stable...

Using mirror raid functionality is fine and is able to do n-copies redundancy with >=n disks and efficiently uses unequal disk sizes.


This is crucial tech to understand docker-style containers...

I used this to build custom iso images based on an existing iso file. The old method mounted the ISO image, rsync'd the entire contents, copied and updated some files, and then created a new image. This took quite a while and (temporarily) wasted a lot of disk space, and was initially sped up by copying the temp ISO to RAM disk, which also presented some challenges, and wasn't as fast as the eventual solution, using aufs on top of the ISO mount to patch the image. Worked like a charm and sped up the ISO building considerably :)


What happens if you want to delete a file from the ISO?

edit: oops, I've read the article now, I guess aufs acts the same and creates a tombstone file.
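For anyone curious what the tombstones look like on disk: as far as I know, aufs marks a deleted file `bar` with an entry named `.wh.bar` in the writable branch, while overlayfs uses a character device with device number 0/0 in the upperdir. A sketch for recognising both, with made-up helper names:

```python
import os
import stat

AUFS_PREFIX = ".wh."


def aufs_deleted_name(entry_name):
    """Return the hidden file's name if entry_name is an aufs whiteout, else None."""
    if entry_name.startswith(AUFS_PREFIX + AUFS_PREFIX):
        return None  # aufs-internal bookkeeping entries, not user deletions
    if entry_name.startswith(AUFS_PREFIX):
        return entry_name[len(AUFS_PREFIX):]
    return None


def is_overlayfs_whiteout(path):
    """True if path is a character device with device number 0/0."""
    st = os.lstat(path)
    return stat.S_ISCHR(st.st_mode) and st.st_rdev == 0
```

Either way, the read-only ISO underneath is never modified; the marker in the writable layer just hides the file when the new image is assembled from the merged view.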


Great article, and a nice style of distilling all this into bite-size chunks.

Is it just me, or is the title a little bit inaccurate, in the sense that there's more to "How containers work" than overlays? It made me think that it covers more than it actually does, e.g. cgroups, namespaces, etc.

Does anyone know of a more in-depth, building-blocks type of article on containers that allows one to build a rudimentary container from scratch, to appreciate what goes into building one?


The title just means it's one piece of the puzzle. She's thinking about making a comic about how containers work, and one important piece of that is overlays. So this is that piece.


fair enough, thank you :)


Yea, it did a great job of covering overlays, but didn't get into how Docker uses a hash value for each overlay piece. Maybe this will be part of a series where she does more of that?

This was posted a few months back on here, and it's a cool little tool for seeing how Docker fits layers together:

https://github.com/wagoodman/dive


That's interesting, thanks for (re)sharing


OverlayFS is pretty useful for day to day things like union folders for your media collection spanning across different hard drives. It does have a few quirks like inotify not working properly, so changes need to be watched for on the underlying fs.


A lot of utility in Docker comes from incremental (cached) builds based on overlay but in fact you can get it from any CoW system such as LVM/ZFS/BtrFS snapshots.


Docker defaults to overlayfs but you don't have to use it. If you use ZFS or another storage driver it will leverage their capabilities to provide the same functionality: https://docs.docker.com/storage/storagedriver/select-storage...


Docker on ZFS is a pain in the ass and slow as hell. And I say this as both a happy docker/container and ZFS user. But combining them is a bad idea.


I definitely noticed slowness as well when I tried it.

I'm currently using overlay because I leave my docker mount on my root partition since ZFS has always been a pain for root. I hear Ubuntu 19.10 is smoothing that over a bit though.

It's too bad because I've run into issues on lower-memory boxes where the Linux block cache and the ARC compete for memory and cause performance slowdowns or errors. If all the filesystems were ZFS this wouldn't be an issue.

Also all the other ZFS benefits like resisting bitrot etc.


I just use overlay on top of ZFS. It's just the native ZFS docker backend that is very slow.


Why didn't I think of that?

So all you need to do is manually set the storage driver to overlay2 when your FS is ZFS, and this just works?


Facebook, for one, uses Btrfs for its containers: https://facebookmicrosites.github.io/btrfs/docs/btrfs-facebo...


fyi, btrfs as /var/lib/docker/btrfs has a disadvantage under certain circumstances: running commands (du,rsync,..) on the subvolume folders altering the access time will incur a storage cost if the filesystem is mounted without noatime option: https://github.com/moby/moby/issues/39815 - not sure if it applies to ZFS as well (see https://lwn.net/Articles/499648/)


Another difference that may matter for some:

On a block-based CoW FS, shared executables and libs will not be shared among containers, because the file cache is file-based, not block-based.

Using Overlayfs, or other file-level CoW storage, when one starts several containers from the same image, shared libraries and executables will be loaded only once (so they will actually be shared) across containers.


In fact, this is explained in the article which also has an interesting anecdote about btrfs.


When working with containers, also consider hardware abstractions like virtual machines. Startup time can be optimized from minutes down to milliseconds. And also consider statically linked binaries if all you want is to solve DLL hell.


s/o to unionfs for plan9 by kvik: http://code.a-b.xyz/unionfs

Userspace implementation of union directories, with control over which half of the union gets precedence for certain operations such as file creation, etc.


Used overlayfs for an embedded Linux project! Great stuff!



