
How Containers Work: Overlayfs
https://jvns.ca/blog/2019/11/18/how-containers-work--overlayfs/
======
Cedricgc
I enjoyed this blog post. Julia does a great job of distilling an idea down
with examples.

I am fairly comfortable with Linux as a user for things like understanding
processes, ports, key files and utilities, etc. The way I understand
abstractions like containers is to know the various OS primitives: cgroups,
changing root, network isolation. Once one sees how those pieces come
together to create the container abstraction, they can be mapped to the
system calls provided by the OS. The OS usually also bundles utilities (like
`chroot`) that let an operator interface with those primitives.
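
For a concrete feel, here's a rough sketch of wiring a few of those primitives
together by hand (the rootfs tarball name is a placeholder):

    # rough sketch: build a bare-bones "container" from the raw primitives
    mkdir -p /tmp/rootfs
    tar -xf alpine-minirootfs.tar.gz -C /tmp/rootfs   # any root filesystem tarball
    # new mount, PID, UTS and network namespaces, then change root into the tree
    sudo unshare --mount --pid --fork --uts --net \
        chroot /tmp/rootfs /bin/sh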

~~~
whytaka
I have been confused about containers for so long but having read your comment
and looking up the terms you mentioned allowed me to finally find the right
articles that explained containers to me. Thanks!

~~~
opendomain
Can YOU post links to the articles so we can learn?

Thanks!

~~~
phi12ip
Cgroups and Namespaces also by Julia Evans:
[https://jvns.ca/blog/2016/10/10/what-even-is-a-container/](https://jvns.ca/blog/2016/10/10/what-even-is-a-container/)

Solaris Zones by Price and Tucker:
[https://www.usenix.org/legacy/event/lisa04/tech/full_papers/price/price.pdf](https://www.usenix.org/legacy/event/lisa04/tech/full_papers/price/price.pdf)

Jails: Confining the omnipotent root by Kamp and Watson:
[http://www.watson.org/~robert/freebsd/sane2000-jail.pdf](http://www.watson.org/~robert/freebsd/sane2000-jail.pdf)

chroot(2): [http://man7.org/linux/man-pages/man2/chroot.2.html](http://man7.org/linux/man-pages/man2/chroot.2.html)

~~~
the8472
On Linux, containers usually involve more primitives than cgroups and
namespaces. Bind mounts, overlayfs (TFA), veth network interfaces (to connect
the network namespaces), network bridges, memfd, seccomp, procfs, etc. are
all bits and pieces used by most containers/sandboxes.

Many of those pieces are useful on their own. For example, you don't need a
full container if all you want is to ensure that some applications use a VPN
and others use your public network address. A network namespace is all you
need; those are accessible through simple CLI tools such as `unshare` and
`ip netns` and don't require behemoths like dockerd.
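
A rough sketch of that, with example interface names and addresses (the
actual VPN routing is omitted):

    # rough sketch: names/addresses are examples; VPN routing omitted
    ip netns add vpn                          # a named network namespace
    ip link add veth0 type veth peer name veth1
    ip link set veth1 netns vpn               # move one end into the namespace
    ip addr add 10.0.0.1/24 dev veth0
    ip link set veth0 up
    ip netns exec vpn ip addr add 10.0.0.2/24 dev veth1
    ip netns exec vpn ip link set veth1 up
    ip netns exec vpn ip link set lo up
    ip netns exec vpn some-application        # sees only this namespace's network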

The tricky part is using them all together correctly, initializing them in the
right order, not getting the control daemons confused by running in the wrong
context and so on. That's where many of the security vulnerabilities come
from.

------
pcr910303
Hmm... how are Overlayfs and Unionfs different? From the explanation I can't
find any differences...

Unionfs: A Stackable Unification File System[0]:

> This project builds a stackable unification file system, which can appear to
> merge the contents of several directories (branches), while keeping their
> physical content separate.

> Unionfs allows any mix of read-only and read-write branches, as well as
> insertion and deletion of branches anywhere in the fan-out.

> To maintain unix semantics, Unionfs handles elimination of duplicates,
> partial-error conditions, and more.

If it is the same thing (but maybe more maintained or with more features...),
can we implement something like the trip[1][2] package manager on top of
Overlayfs?

(Trip is a package manager where all files installed by make install are
tracked and mounted on a Unionfs layer.)

[0] [http://unionfs.filesystems.org](http://unionfs.filesystems.org)

[1] [https://github.com/grencez/trip](https://github.com/grencez/trip)

[2]
[http://www.linuxfromscratch.org/hints/downloads/files/package_management_using_trip.txt](http://www.linuxfromscratch.org/hints/downloads/files/package_management_using_trip.txt)

~~~
Jasper_
They're mostly the same, with different answers to tricky questions. E.g., if
I stack filesystems A, B, C, all of which contain the same file /foo/bar, and
then do rm /foo/bar, what happens?

1. Does /foo/bar get removed from the topmost layer, exposing the one below?

2. Does /foo/bar get removed from all three layers?

3. Does /foo/bar get replaced with a "tombstone" record to pretend that it
was deleted, while still appearing in some of A, B, or C on its own?

These semantics are tricky to get right, and during the process of
upstreaming unionfs to the kernel, incompatible changes were made to the
model and different answers were chosen for these questions; the result was
renamed overlayfs.

~~~
Dylan16807
It looks like overlayfs removes most of the tricky questions by only allowing
the topmost filesystem to be read-write. In that case rm is pretty simple:
remove it from the topmost filesystem if it exists there, and make a tombstone
if it exists in a lower layer.

Unionfs is significantly more complicated.
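
You can watch the tombstone mechanism directly; a minimal sketch with
throwaway directories (overlayfs represents a deleted file as a 0/0 character
device in the upper layer):

    # minimal sketch: observe an overlayfs whiteout
    mkdir -p lower upper work merged
    touch lower/foo.txt
    sudo mount -t overlay overlay \
        -o lowerdir=$PWD/lower,upperdir=$PWD/upper,workdir=$PWD/work $PWD/merged
    sudo rm merged/foo.txt
    ls merged      # foo.txt is gone from the merged view
    ls -l upper    # upper/foo.txt is now a character device (0, 0): the whiteout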

------
nwellinghoff
I recently made a system that uses overlays to provide workspaces for a
complex build process. However, there is significant overhead in unmounting
and deleting the files in the overlay after all the work is done. I was
thinking about changing the system so that I allocate a partition ahead of
time, write all the overlays there, and on success just blow away the
partition and with it the overlays. This is kind of a pain in the ass. Can
anyone suggest a method for rapidly deleting all the data generated by using
hundreds of overlays? Maybe Btrfs snapshots would be better? What are the
pros and cons? Thank you so much and I apologize for "anything" up front :)

~~~
the8472
With recent kernels you can combine overlayfs and btrfs.

btrfs subvols/snapshots have their own costs; they can get slow once you
accumulate thousands of them (it's fine if you just use a few at a time). But
you can create a single btrfs subvolume, store your overlays in there, and
then delete the subvolume when you're done.
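
A rough sketch of that flow, with example paths:

    # rough sketch: /mnt/scratch is a btrfs filesystem; paths are examples
    btrfs subvolume create /mnt/scratch/job1
    mkdir -p /mnt/scratch/job1/{upper,work,merged}
    mount -t overlay overlay \
        -o lowerdir=/srv/base,upperdir=/mnt/scratch/job1/upper,workdir=/mnt/scratch/job1/work \
        /mnt/scratch/job1/merged
    # ... run the build inside merged ...
    umount /mnt/scratch/job1/merged
    btrfs subvolume delete /mnt/scratch/job1   # drops all the overlay data at once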

~~~
nwellinghoff
Good idea.

------
dfox
I somehow feel compelled to point out that this idea of union/overlay FS
layers has nothing to do with containers per se. But on the other hand it is
critical to why containers got popular, since it is what makes the whole
thing efficient both in terms of hardware resources and developer time.

~~~
mikepurvis
They really don't, and it was funny, that period where you'd see Dockerfiles
with all the commands chained into a single invocation to avoid "bloating"
the resulting image with unnecessary intermediate products that ended up
deleted.

Maybe it's out there and I've just missed it, but I really wish there were
richer ways to build up a container FS than the usual layers approach, with
sets of invocations to get from one layer to the next, especially when it's
common to see invocations that mean totally different things depending on
when they're run (e.g. "apt update") and then special cache-busting steps
like printing the current time.

I know part of this is just using a better package manager like nix, but I
feel like on the docker side you could do interesting stuff like independently
run steps A, B, and C against some base image X, and then create a container
that's the FS of X+A+B+C, even though the original source containers were X+A,
X+B, and X+C.

~~~
Nullabillity
Not sure if you're aware since you already mentioned Nix, but Nixpkgs has a
nice function to export "graph-layered" Docker images[0]. There is some
overhead in the conversion, but the rest of the build can be parallelized by
Nix as usual.

[0]: [https://grahamc.com/blog/nix-and-layered-docker-images](https://grahamc.com/blog/nix-and-layered-docker-images)

~~~
mikepurvis
Fantastic, this is exactly what I was looking for, thanks for the pointer!

It looks like the layer limitation comes from the underlying union filesystem
stuff and not from anything inherent in Docker itself. I wonder if it would be
possible to build a new filesystem driver for Docker that could serve up an
infinite ecosystem of immutable nix packages without having to actually unpack
them. Whether such a thing would actually be of value would probably depend a
lot on the use case, but I could imagine a CI type scenario where you wanted
to have a shared binary cache and individual nodes able to come up very fast
without having to unpack the pkgs and prepare a full rootfs each time.

------
2019119
I wrote this script[1] a while ago which creates an overlay and chroots into
the overlays workdir. It's pretty useful, with it, you can do something like

> overlay-here

> make install (or ./wierd-application or 'npm install' or whatever)

> exit

and all changes made to the filesystem in that shell is written to the
upperdir instead. So for example in the above example, the upperdir would
contain files such as upperdir/usr/bin/app and upperdir/usr/include/header.h.
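
The core of such a script might look something like this (a rough sketch, not
the linked gist):

    #!/bin/sh
    # rough sketch of an "overlay-here" style helper (not the linked gist)
    set -e
    dir=$(mktemp -d)
    mkdir -p "$dir/upper" "$dir/work" "$dir/merged"
    sudo mount -t overlay overlay \
        -o "lowerdir=/,upperdir=$dir/upper,workdir=$dir/work" "$dir/merged"
    sudo chroot "$dir/merged" /bin/sh    # exit this shell when done
    sudo umount "$dir/merged"
    echo "captured writes are under $dir/upper"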

It's useful when

* You want to create a deb/rpm/tgz package, but a makefile does not have support for DESTDIR

* An application writes configuration files somewhere, but you don't know where

* An application would pollute your filesystem by writing files in random places, and you want to keep this application local to "this directory"

* In general when you want to know "which files does this application write to" without resorting to strace

* or when you want to isolate writes, but not reads, that an application does

[1]:
[https://gist.github.com/dbeecham/183c122059f7ba288397e8c3320234d3](https://gist.github.com/dbeecham/183c122059f7ba288397e8c3320234d3)

~~~
ChrisSD
I'd be wary of that last point, depending on what you mean by "isolate".
Chroot is not a security feature, so the isolation is not perfect. This
shouldn't matter if you trust the application, but if it could be malicious
(or manipulated by something malicious) then you'd want a harder boundary.
`pivot_root` perhaps?
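
A rough sketch of that harder boundary, using a new mount namespace plus
pivot_root (rootfs is a placeholder directory containing a root filesystem):

    # rough sketch: rootfs is a placeholder
    sudo unshare --mount --fork /bin/sh -c '
        mount --bind rootfs rootfs    # pivot_root requires a mount point
        cd rootfs
        mkdir -p old_root
        pivot_root . old_root         # swap the root of this mount namespace
        cd /
        umount -l /old_root           # detach the old root so it is unreachable
        exec /bin/sh
    '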

------
ZoomZoomZoom
There was a practice of using MergerFS/OverlayFS for pooling multiple drives
(often by SnapRAID users), but what's still missing (to my knowledge) is some
sort of balancing layer that could distribute writes.

I got this idea many years ago, when the first personal cloud storage
services appeared and offered some limited space for free. I thought it would
be nice if I could pool them and fill them with their different capacities
taken into account. And if I could also stripe and erasure-code (EC) them for
higher availability...

I still wonder if there's something that can do this and if there isn't I
would like to know why, since it looks like a useful and obvious idea.
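
For the local-drive pooling half of this, mergerfs does expose write-placement
("create") policies; a minimal sketch with example paths, where mfs steers
each new file to the branch with the most free space:

    # minimal sketch: pool three drives, balancing new files by most free space
    mergerfs -o defaults,category.create=mfs \
        /mnt/disk1:/mnt/disk2:/mnt/disk3 /mnt/pool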

~~~
asdfaoeu
aufs sort of could do that at the file layer. The issue is that you run into
a bunch of incompatibilities with how applications expect it to work. As soon
as you want striping and EC, you really need to just go with something like
ZFS or btrfs.

~~~
namibj
Please don't try erasure coding within btrfs until they declare it stable...

Using the mirror RAID functionality is fine; it can do n-copies redundancy
with >=n disks and uses unequal disk sizes efficiently.

------
koffiezet
This is crucial tech to understand docker-style containers...

I used this to build custom ISO images based on an existing ISO file. The old
method mounted the ISO image, rsync'd the entire contents, copied and updated
some files, and then created a new image. This took quite a while and
(temporarily) wasted a lot of disk space. It was initially sped up by copying
the temp ISO to a RAM disk, which presented its own challenges and still
wasn't as fast as the eventual solution: using aufs on top of the ISO mount
to patch the image. Worked like a charm and sped up the ISO building
considerably :)
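
The same trick sketched with overlayfs (aufs never made it into the mainline
kernel); paths are examples and boot-related image options are omitted:

    # rough sketch: patch an ISO via an overlay instead of a full copy
    mkdir -p /mnt/iso /tmp/ov/{upper,work,merged}
    mount -o loop,ro original.iso /mnt/iso
    mount -t overlay overlay \
        -o lowerdir=/mnt/iso,upperdir=/tmp/ov/upper,workdir=/tmp/ov/work \
        /tmp/ov/merged
    # edit files in /tmp/ov/merged, then repack:
    xorriso -as mkisofs -o patched.iso /tmp/ov/merged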

~~~
brokenmachine
What happens if you want to delete a file from the ISO?

edit: oops, I've read the article now, I guess aufs acts the same and creates
a tombstone file.

------
hackerm0nkey
Great article and nice style, distilling all this into bite-size chunks.

Is it just me, or is the title a little inaccurate, in the sense that there's
more to "How containers work" than overlays? It made me think the post covers
more than it actually does, e.g. cgroups, namespaces, etc...

Does anyone know of a more in-depth, building-blocks type of coverage of
containers, one that lets you build a rudimentary container from scratch to
appreciate what goes into building one?

~~~
skywhopper
The title just means it's one piece of the puzzle. She's thinking about making
a comic about how containers work, and one important piece of that is
overlays. So this is that piece.

~~~
hackerm0nkey
fair enough, thank you :)

------
wooptoo
OverlayFS is pretty useful for day-to-day things like union folders for a
media collection spanning several hard drives. It does have a few quirks,
like inotify not working properly, so changes need to be watched for on the
underlying fs.
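
The read-only flavour of that union needs nothing but lower layers; a minimal
sketch with example paths (no upperdir means the union mounts read-only):

    # minimal sketch: read-only union of two media drives
    mkdir -p /mnt/media
    mount -t overlay overlay -o lowerdir=/mnt/disk1:/mnt/disk2 /mnt/media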

------
marmaduke
A lot of Docker's utility comes from incremental (cached) builds based on
overlays, but in fact you can get that from any CoW system, such as
LVM/ZFS/Btrfs snapshots.

~~~
seabrookmx
Docker defaults to overlayfs but you don't have to use it. If you use ZFS or
another storage driver it will leverage their capabilities to provide the
same functionality: [https://docs.docker.com/storage/storagedriver/select-storage-driver/](https://docs.docker.com/storage/storagedriver/select-storage-driver/)

~~~
koffiezet
Docker on ZFS is a pain in the ass and slow as hell. And I say this as both a
happy docker/container and ZFS user. But combining them is a bad idea.

~~~
seabrookmx
I definitely noticed slowness as well when I tried it.

I'm currently using overlay because I leave my docker mount on my root
partition since ZFS has always been a pain for root. I hear Ubuntu 19.10 is
smoothing that over a bit though.

It's too bad, because I've run into issues on lower-memory boxes where the
Linux block cache and the ARC compete for memory and cause performance
slowdowns or errors. If all the filesystems were ZFS this wouldn't be an
issue.

Also all the other ZFS benefits like resisting bitrot etc.

~~~
koffiezet
I just use overlay on top of ZFS. It's just the native ZFS docker backend that
is very slow.

~~~
seabrookmx
Why didn't I think of that?

So all you need to do is manually set the storage driver to overlay2 when
your FS is ZFS, and it just works?
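
i.e. something like this, if I understand it right (whether overlay2 accepts
ZFS as its backing filesystem depends on the kernel and OpenZFS version):

    # sketch: set the daemon-wide storage driver, then restart and verify
    echo '{ "storage-driver": "overlay2" }' | sudo tee /etc/docker/daemon.json
    sudo systemctl restart docker
    docker info --format '{{.Driver}}'    # should print: overlay2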

------
z3t4
When working with containers, also consider hardware abstractions like
virtual machines. Startup time can be optimized from minutes down to
milliseconds. And also consider statically linked binaries if all you want is
to solve DLL hell.

------
henesy
s/o to unionfs for plan9 by kvik:
[http://code.a-b.xyz/unionfs](http://code.a-b.xyz/unionfs)

Userspace implementation of union directories with control over which half of
the union gets precedence for certain operations such as file creation, etc.

------
crtlaltdel
used overlayfs for an embedded Linux project! great stuff!

