
Linux containers in a few lines of code - benjaminjosephw
https://zserge.com/posts/containers/
======
dmayle
A little bit of education about container systems in Linux[1]. A container
system is typically made up of a number of components:

 _isolation layer_ : the piece that limits privileges and resource usage. (On
Linux, this is usually handled by namespaces and cgroups in the kernel, but
could also be handled by something like KVM for VM-based containers)

 _raw container configuration_ : Given an image and some metadata (like cpu
limits), launch an isolated process. (On linux, this is usually handled by
runc when working with cgroups)

 _container api daemon_ : Manage the list of container processes and available
images. Provide a unix socket based API for manipulating isolated processes,
launching, deleting, connecting, etc. (In the case of docker, they provide a
daemon which abstracts the containerd daemon, or you can use containerd alone
without docker)

 _container command line tool_ : Provide a user/developer interface to the
three things above. This is the docker command. When you install containerd
without docker this is the ctr command.

Docker, which is probably the most famous container distribution, pairs the
docker command with the docker daemon to abstract away the containerd daemon,
runc, and cgroups.

If you use containerd alone, you get ctr/containerd/runc/cgroups.

There's a standalone command line tool (crictl) which replaces both ctr and
docker and can be used on top of either the docker daemon or containerd.
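
As a rough illustration of the "raw container configuration" layer: runc reads an OCI runtime `config.json` that describes the process, the root filesystem, the namespaces, and the resource limits. The sketch below builds a minimal fragment of such a config. The helper name `make_oci_config` and the values are made up for illustration, and a real config (for example one generated by `runc spec`) has many more fields.

```python
import json

def make_oci_config(rootfs, args, cpu_quota_us, cpu_period_us=100_000):
    """Build a minimal, incomplete OCI runtime config as a dict (illustrative only)."""
    return {
        "ociVersion": "1.0.2",
        "process": {"args": list(args), "cwd": "/"},
        "root": {"path": rootfs, "readonly": True},
        "linux": {
            # Each entry asks the runtime to create a new namespace of that type.
            "namespaces": [{"type": t} for t in ("pid", "mount", "ipc", "uts", "network")],
            # quota / period = fraction of one CPU the cgroup may use.
            "resources": {"cpu": {"quota": cpu_quota_us, "period": cpu_period_us}},
        },
    }

# Hypothetical values: limit /bin/sh to half a CPU.
config = make_oci_config("rootfs", ["/bin/sh"], cpu_quota_us=50_000)
print(json.dumps(config, indent=2))
```

The layers above runc (containerd, docker) mostly exist to produce and manage configs like this one, plus the images they point at.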

[1] Container systems seem to have a relatively complex abstraction over what
is a relatively simple architecture.

~~~
Spivak
I feel like podman is proving that you don't really need the api daemon and a
porcelain over runc with a one-off process supervisor is sufficient for a good
number of workloads.

Being able to run containers like any other process and leave the lifecycle
management to systemd is actually really nice.

~~~
freedomben
Could not agree more. As a Fedora user I was mildly intrigued when Podman
showed up, I played with it briefly but stopped because most of my projects
used docker-compose, which doesn't work with Podman.

When I went to work at Red Hat I decided to really try Podman, and I love it
now. Once I discovered that Podman supports Kubernetes Pods (same YAML and
all) I realized how clunky docker-compose actually is. Since most of my
projects now run in K8s anyway, it's awesome to have the app run as a Pod
locally so it can easily be tested with any sidecars in the same
configuration.

There are still uses for docker-compose, and I don't expect it to disappear
from my life completely any time soon, but Podman does have a great place in
my toolkit.

~~~
throwaway8941
podman-compose works fine for me, despite being advertised as "still under
development".

[https://github.com/containers/podman-
compose](https://github.com/containers/podman-compose)

~~~
terrywang
podman-compose works better than expected; however, it is NOT a "drop-in"
replacement for docker-compose, at least for now.

------
benjaminjosephw
I love minimal code like this as a way of really understanding how something
works. The author is pretty consistent too - he's got some great projects like
an ultra-minimal electron alternative:
[https://github.com/zserge/webview](https://github.com/zserge/webview)

~~~
faizshah
Same, I really like the book 500 Lines or Less for this as well:
[https://github.com/aosabook/500lines](https://github.com/aosabook/500lines)

I’ve personally been getting in to writing minimal code because I’ve grown
frustrated with how simple tasks can result in complex code that’s difficult
to maintain. Minimal code is easier to come back to. A 100 line script can be
easier to understand than a 500 line script.

------
mmastrac
Big caution here. Do NOT use this style of code to invoke ip tools. This was
the cause of a huge number of security vulnerabilities on Android in the first
few years. Even if you're hardcoding interfaces to start, it's likely someone
else will drive by later on and replace one of the args with %s.

> system("ip link add veth0 type veth peer name veth1");

Always, always, always use exec*() APIs.
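
The same idea carries over to other languages. Here is a minimal Python sketch of the argument-vector style: the command is passed as a list, so no shell ever parses it and a hostile interface name stays a single argument. The helper names and the validation pattern are assumptions for illustration (the pattern is deliberately stricter than what the kernel accepts).

```python
import re
import subprocess

# Assumed allow-list: short names made of letters, digits, '_', '.', '-'.
_IFNAME = re.compile(r"[A-Za-z0-9_.-]{1,15}")

def veth_argv(name, peer):
    """Return the argv for creating a veth pair, rejecting odd-looking names."""
    for n in (name, peer):
        if not _IFNAME.fullmatch(n):
            raise ValueError(f"suspicious interface name: {n!r}")
    return ["ip", "link", "add", name, "type", "veth", "peer", "name", peer]

def add_veth(name, peer):
    """Run `ip` directly via exec, without going through /bin/sh (needs root)."""
    subprocess.run(veth_argv(name, peer), check=True)
```

Even without the validation, something like `"veth0; reboot"` would reach `ip` as one odd argument instead of being interpreted by a shell, which is the property the exec*() family gives you in C.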

~~~
alexchamberlain
Are there not better C APIs to call that `ip` et al are wrapping?

~~~
vocram
Yes, rtnetlink is used to configure networking via a netlink socket, and
libnetlink provides an abstraction over sending raw datagrams.
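
To give a feel for what "raw datagrams" means here, this is a small sketch that packs an RTM_GETLINK dump request (the rough equivalent of `ip link show`) by hand and sends it over a NETLINK_ROUTE socket. The constants come from the kernel headers; the function names are made up, and parsing the reply is left out to keep it short.

```python
import socket
import struct

RTM_GETLINK = 18        # from <linux/rtnetlink.h>
NLM_F_REQUEST = 0x01    # from <linux/netlink.h>
NLM_F_DUMP = 0x300      # NLM_F_ROOT | NLM_F_MATCH

def build_getlink_request(seq=1):
    """Pack an nlmsghdr followed by an ifinfomsg asking for a dump of all links."""
    # struct ifinfomsg: family, pad, type, index, flags, change (16 bytes)
    ifinfomsg = struct.pack("=BBHiII", socket.AF_UNSPEC, 0, 0, 0, 0, 0)
    length = 16 + len(ifinfomsg)  # struct nlmsghdr is 16 bytes
    # struct nlmsghdr: len, type, flags, seq, pid
    header = struct.pack("=IHHII", length, RTM_GETLINK,
                         NLM_F_REQUEST | NLM_F_DUMP, seq, 0)
    return header + ifinfomsg

def query_links_raw():
    """Send the request and return the first raw reply datagram (Linux only)."""
    with socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE) as s:
        s.bind((0, 0))  # let the kernel pick our netlink port id
        s.send(build_getlink_request())
        return s.recv(65536)
```

Libraries like libnetlink exist precisely so you don't have to pack and unpack these structures (and the attribute lists in the replies) yourself.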

------
lidHanteyk
Nice work. I'm reminded of bocker [0], which also implements this sort of
functionality in only a few dozen lines of code. The function which
corresponds to this post [1] is relatively short and readable.

[0] [https://github.com/p8952/bocker](https://github.com/p8952/bocker)

[1]
[https://github.com/p8952/bocker/blob/master/bocker#L61-L90](https://github.com/p8952/bocker/blob/master/bocker#L61-L90)

------
fwip
Julia Evans has an excellent zine on how containers work, including a 15-line
bash implementation: [https://jvns.ca/blog/2020/04/27/new-zine-how-containers-
work...](https://jvns.ca/blog/2020/04/27/new-zine-how-containers-work/)

Definitely worth the $12.

~~~
silentguy
I loved it. It helped me understand the nitty gritty of containers.

You can download it for free too: [https://gumroad.com/l/containers-
zine/buyonegiveone](https://gumroad.com/l/containers-zine/buyonegiveone)

There is an official option to either pay or get it for free. Do support her
if you can.

------
ShorsHammer
DIY Containers on Linux is probably a better term here given that Linux
Containers is already heavily in use around the world and included in Ubuntu
by Canonical?

For me this is enough to get a container running:

    
    
        lxd init
        lxc launch ubuntu mycontainer
        lxc exec mycontainer bash
    
    

[https://linuxcontainers.org/](https://linuxcontainers.org/)

~~~
throwaway55554
I don't know, I find "Linux Containers" to be sufficiently generic as Linux
has _native_ support for them. LXD/LXC, Docker, etc, are simply tools built
upon that.

~~~
sp332
I agree. Most people who read "Linux Containers" are going to think of Docker,
not LXD.

~~~
josteink
People are different.

When I think about containers, I think about isolated system images,
lightweight VMs, which I can use and adapt to solve a problem at hand.

When I hear Docker, I think about static, locked-down _application-images_
made by others, to deploy in the cloud, which I'm not meant to adapt to my
needs, and I also think about things which are not natively integrated into my
Linux distro and for which I will have to provide glue manually. I also think
about a startling amount of complexity in a new stack which I would have to
learn, just to manage what are actually really just basic Linux systems.

While all that may or may not be true, that's what I _think_ about when I hear
"Docker".

Hearing "containers" on the other hand, makes me happy. And yeah, I'll go
with LXC/LXD any day.

------
mehrdadn
Could someone comment on how secure such a container is, at least nominally?
Should I be able to theoretically run untrusted code on such a container if
the system is bug-free and I add proper error-checking to the code? Or are
there things that you'd need to worry about the code being able to access? Any
considerations regarding sudo permissions?

~~~
uvuv
Definitely not secure. The author did a great job explaining container
runtimes in basic terms, but there are a lot of security features missing.
Mainly:

* Reducing the container's capabilities
* Restricting access to resources through cgroups
* Applying seccomp filters to prevent certain syscalls.

As another comment suggested, user namespaces are another hardening feature,
but not all container runtimes enable them by default. Podman does, Docker
doesn't. In fact user namespaces are so powerful that I believe they pretty
much cover most of the hardening provided by the three features I listed
above. If you're wondering why they're not enabled by default in Docker, take
a look at this [1].

Exploiting the missing isolation mechanisms, the following bash commands will
allow you to escape from the author's containers:

    # find the root fs device (e.g. /dev/sda1) major and minor device
    # numbers (e.g. {maj=8, min=1}, {maj=259, min=1})
    $ ls -al /sys/dev/block

    $ mknod --mode 0600 /dev/host_fs_dev b $major $minor

    $ mkdir /host_fs && mount /dev/host_fs_dev /host_fs
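
The escape above needs mknod (CAP_MKNOD) and mount (CAP_SYS_ADMIN) and is not blocked by any seccomp filter. A quick way to see what a process inside a container still holds is to decode `/proc/self/status`. The sketch below does that for a handful of capabilities; the function name and the choice of which bits to list are mine, the bit numbers are from `<linux/capability.h>`.

```python
# Bit positions from <linux/capability.h>; only a few are listed here.
CAP_BITS = {
    12: "CAP_NET_ADMIN",
    18: "CAP_SYS_CHROOT",
    21: "CAP_SYS_ADMIN",
    27: "CAP_MKNOD",
}
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}

def hardening_report(status_text):
    """Decode the CapEff and Seccomp fields of a /proc/<pid>/status dump."""
    fields = dict(line.split(":", 1) for line in status_text.splitlines() if ":" in line)
    cap_eff = int(fields["CapEff"].strip(), 16)  # hex bitmask of effective caps
    held = [name for bit, name in sorted(CAP_BITS.items()) if cap_eff >> bit & 1]
    seccomp = SECCOMP_MODES.get(int(fields["Seccomp"].strip()), "unknown")
    return {"capabilities": held, "seccomp": seccomp}
```

Inside a container you would call it as `hardening_report(open("/proc/self/status").read())`. If CAP_SYS_ADMIN and CAP_MKNOD both show up and seccomp is "disabled", the three commands above are likely to work.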

(warning: shameless plug to my posts follows:)

If you want more details, I wrote a post on this exact same problem in the
context of three vulnerabilities I found in rkt (another container runtime)
[2].

Besides the issues above, the author's runtime also exposes host file
descriptors like /proc/self/exe that can be used to escape the container. I
wrote a post on runC CVE-2019-5736 that explains this kind of issue [3].

[1] [https://docs.docker.com/engine/security/userns-remap/#user-n...](https://docs.docker.com/engine/security/userns-remap/#user-namespace-known-limitations)

[2] [https://unit42.paloaltonetworks.com/breaking-out-of-coresos-...](https://unit42.paloaltonetworks.com/breaking-out-of-coresos-rkt-3-new-cves/)

[3] [https://unit42.paloaltonetworks.com/breaking-docker-via-runc...](https://unit42.paloaltonetworks.com/breaking-docker-via-runc-explaining-cve-2019-5736/)

~~~
rdslw
Thank you for detailed answer and interesting links!

Could you please explain, or point me to some source on, why Docker can't use
the host network namespace (--net=host) if userns is enabled, while
rootlesskit[1], which uses userns by default, doesn't have a problem with
using the host netns (--net=host)?

[1] [https://github.com/rootless-
containers/rootlesskit](https://github.com/rootless-containers/rootlesskit)

------
notRobot
See also:

Linux containers in 500 lines of code (2016)
[https://news.ycombinator.com/item?id=22232705](https://news.ycombinator.com/item?id=22232705)

------
bswamina
A few weeks back I published a write-up on how to build a container in the Go
programming language. In case you're interested, here are the links:

[https://www.polarsparc.com/xhtml/Containers-1.html](https://www.polarsparc.com/xhtml/Containers-1.html)
[https://www.polarsparc.com/xhtml/Containers-2.html](https://www.polarsparc.com/xhtml/Containers-2.html)

------
davexunit
Small container implementation in Scheme:
[http://git.savannah.gnu.org/cgit/guix.git/tree/gnu/build/lin...](http://git.savannah.gnu.org/cgit/guix.git/tree/gnu/build/linux-
container.scm)

------
accelbred
One thing to note, is that using a PID namespace in that way is incorrect.
PID1 in a PID namespace has to perform the duties normally performed by a
PID1, so you will normally want PID1 in the namespace to be a minimal init. If
not, there may be issues, like unreaped zombie processes.
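
For reference, the core of a minimal init is small: start the real workload, then keep calling wait() so that every child, including orphans that get reparented to PID 1, is reaped. A rough Python sketch (the function names are made up, and signal forwarding, which a real init also needs, is left out):

```python
import os

def _exit_code(status):
    """Turn a wait() status into a shell-style exit code."""
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return 128 + os.WTERMSIG(status)

def run_as_init(argv):
    """Run argv as a child and reap children until the main one has exited."""
    main_pid = os.fork()
    if main_pid == 0:
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed

    main_code = None
    while main_code is None:
        # Blocks until any child exits; as PID 1 this includes adopted orphans.
        pid, status = os.wait()
        if pid == main_pid:
            main_code = _exit_code(status)

    # Sweep up zombies that are already dead, without blocking.
    while True:
        try:
            pid, _ = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left
        if pid == 0:
            break  # remaining children are still running
    return main_code
```

Small init programs such as tini do essentially this, plus forwarding signals to the main child.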

------
devxpy
While this is interesting, it doesn't really show how containers _actually
work_ , only lists the specific syscall flags to _tell Linux to create one_.

A similar snippet[1] exists for Go, and it doesn't do anything particularly
special either.

I don't know, maybe David Beazley has altered my sense of what "from scratch"
means.

[1]
[https://gist.github.com/lizrice/a5ef4d175fd0cd3491c7e8d71682...](https://gist.github.com/lizrice/a5ef4d175fd0cd3491c7e8d716826d27)

~~~
sp332
This is how containers work though. Or did you want more detail about the
internals of pivot_root or something?

~~~
devxpy
Yes, something like that.

Like here's David's "Build Your Own Async" [1], which I prefer over Philip's
(still extremely good) "What the heck is the event loop anyway?" [2].

It's one thing to _tell_ how something works, but successfully showing what
the hell it's actually doing under the covers conveys much more
information.

[1]
[https://www.youtube.com/watch?v=Y4Gt3Xjd7G8](https://www.youtube.com/watch?v=Y4Gt3Xjd7G8)
[2]
[https://www.youtube.com/watch?v=8aGhZQkoFbQ](https://www.youtube.com/watch?v=8aGhZQkoFbQ)

------
p4bl0
I love this kind of post. It's never something you will actually use
instead of the real product (here, Docker), but it's really a great way to
learn new little things.

------
jeffbee
Linux containers in one shell statement

    
    
      $ echo $$ > tasks

~~~
mmastrac
I'm not sure if this is serious or a joke, but can you explain it further?

~~~
jeffbee
Given certain initial conditions, this statement moves the current process
(and any process it subsequently creates) into a control group, which meets
minimal definitions of containerization.
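
Spelled out a bit more, and assuming the "initial conditions" are that the shell's working directory is a cgroup directory that already has limits configured, the one-liner amounts to something like this sketch (the function name is made up):

```python
import os

def join_cgroup(cgroup_dir, pid=None):
    """Move a process into the cgroup at cgroup_dir by writing its PID there."""
    pid = os.getpid() if pid is None else pid
    # cgroup v1 directories have a "tasks" file; v2 only has "cgroup.procs".
    for name in ("tasks", "cgroup.procs"):
        path = os.path.join(cgroup_dir, name)
        if os.path.exists(path):
            with open(path, "w") as f:
                f.write(str(pid))
            return path
    raise FileNotFoundError(f"{cgroup_dir} does not look like a cgroup directory")
```

Children forked afterwards start in the same cgroup, which is why everything the shell launches later is covered by the same limits. Note that on v1, writing to `tasks` moves a single thread, while `cgroup.procs` moves the whole process.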

~~~
cyphar
It really doesn't. I would say that "unshare -mpf ; pivot_root" matches the
most minimal definition of a container more accurately than joining a cgroup
(it's an isolated system which can't directly interact with the host).

Otherwise you'd have to argue that configuring rlimits actually makes your
shell a container, which is too much of a stretch (for me at least).

~~~
jeffbee
I think that just underscores my point, which is that containment means
different things to different people. To me, it means only the resource
limiting features from cgroups. I have no use for namespaces, bind mounts,
virtual network interfaces, or any of that stuff. In my application all that
stuff is either pointless or harmful. But to you, container means at least PID
and mount namespaces.

rlimit is sort of a thing but it's not actually effective so to me it's not
part of the picture. If unix limits worked, Google would not have needed to
contribute cgroups before deploying Borg. Indeed, in this LWN article which is
actually about control groups, they call control groups "containers". Just
shows there is not a universal meaning of the term.

The earliest control groups patch I can find says "We use the term container
to indicate a structure against which we track and charge utilization of
system resources like memory, tasks etc for a workload." It doesn't say
anything about isolation, namespaces, or security, but it uses the term
container to describe resource control.

[https://lwn.net/Articles/236038/](https://lwn.net/Articles/236038/)

~~~
kelnos
Regardless of what was said when cgroups was first implemented, the current
industry term "container" does actually mean isolation. I suspect you'd find
yourself in a very small minority of people who use it to mean simply "running
in a cgroup".

> _Just shows there is not a universal meaning of the term._

Yes, there is. That meaning has just evolved since 2007.

------
streb-lo
Anyone want to chime in on why pivot_root is preferable to a chroot jail? It's
kind of hand-waved in the article.

~~~
cyphar
This is mostly to do with the implementation of chroot(). Because it only
applies to a single process (and mount tables are per mount namespace), it was
implemented in such a way that directories above the root of the chroot are
still technically accessible (the mounts above the root directory are still
present in the mount hierarchy). This results in all sorts of fun bugs where
if you chroot() inside a chroot() you can get out of the chroot() entirely.
Container runtimes generally block CAP_SYS_CHROOT by default for this reason,
but there are all sorts of other subtle security bugs which pop up because of
this fundamental design choice.

pivot_root() doesn't suffer from this problem because it applies to the entire
mount namespace, and thus its implementation could be made much safer. Instead
of just changing the current process's filesystem root, the actual / mount of
the mount namespace is swapped with another mountpoint on the system (and the
old root is mounted elsewhere). Thus once the old root is unmounted there
isn't a way to get back to the old mountpoints. This isn't perfect protection
(magic-links and other mounts could expose the host filesystem) but it is a
damn sight better than chroot(). Oh, and nesting pivot_root()s doesn't cause a
breakout.

Note that this different behaviour in relation to mounts has resulted in
completely unrelated security bugs with containers (such as being able to
bypass procfs masks because chroot() doesn't hide the unmasked procfs in the
host mount namespace). This is why we container runtime authors always tell
people they should never use chroot() and always use pivot_root() -- though
sadly sometimes chroot() is needed because pivot_root() doesn't work on
initramfs.

(I'm one of the maintainers of runc, the runtime which underlies
Docker/podman/containerd/cri-o/...)

~~~
kelnos
> _though sadly sometimes chroot() is needed because pivot_root() doesn't
> work on initramfs._

Are people actually attempting to boot a super-minimalist system that just has
a kernel and an initramfs with something like docker into it where they don't
bother with a rootfs at all and just start running containers directly from
the initramfs? That's kinda cool, if that's the case.

~~~
cyphar
You could do that (though one could argue that there's no real benefit to
using containers in that case), but the issue is sadly more general than that.
You cannot use pivot_root() if the _current root_ is on initramfs. The reason
is fairly historic, and boils down to "you cannot unmount initramfs" in the
same way that "you cannot kill pid1".

This means that in setups where you have the entire OS image in initramfs,
trying to run a container (even if it has a different filesystem as its
rootfs) will fail with pivot_root(). There are solutions for this but they
require changing how the system is started (which can be a bit complicated
depending on what system you're using to build your initramfs). From memory,
minikube has used --no-pivot-root for a while precisely for this reason,
though I believe they have switched away from it sometime recently.
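
If you want to detect this situation up front, one heuristic is to look at the filesystem type of `/` in `/proc/self/mountinfo`. The sketch below assumes that a root still on initramfs shows up with type `rootfs`; that assumption and the function names are illustrative, not something taken from runc.

```python
def root_fstype(mountinfo_text):
    """Return the filesystem type of the last mount whose mount point is '/'."""
    fstype = None
    for line in mountinfo_text.splitlines():
        # Format: ID parent maj:min root mountpoint opts [optional...] - fstype source superopts
        left, sep, right = line.partition(" - ")
        fields = left.split()
        if sep and len(fields) >= 5 and fields[4] == "/":
            fstype = right.split()[0]  # later mounts on '/' hide earlier ones
    return fstype

def pivot_root_likely_to_fail(mountinfo_text):
    """Heuristic (assumption): root on 'rootfs' means we are still on initramfs."""
    return root_fstype(mountinfo_text) == "rootfs"
```

Usage would be `pivot_root_likely_to_fail(open("/proc/self/mountinfo").read())`, falling back to the chroot()-based path (what `--no-pivot-root` selects) when it returns True.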

------
josteink
The title is about _Docker_ containers in a few lines of code.

Could the mods consider renaming the submission?

------
brauzi
thanks for sharing!

------
jart
Software history should be written that Docker gave Phil Katz the Oliver
Cromwell treatment.

