Linux containers in a few lines of code (zserge.com)
458 points by benjaminjosephw 21 days ago | 85 comments



A little bit of education about container systems in Linux[1]. A container system is typically made up of a number of components:

isolation layer: the piece that limits privileges and resource usage. (On linux, this is usually handled by cgroups and the kernel, but could also be handled by something like kvm for vm-based containers)

raw container configuration: Given an image and some metadata (like cpu limits), launch an isolated process. (On linux, this is usually handled by runc when working with cgroups)

container api daemon: Manage the list of container processes and available images. Provide a unix socket based API for manipulating isolated processes, launching, deleting, connecting, etc. (In the case of docker, they provide a daemon which abstracts the containerd daemon, or you can use containerd alone without docker)

container command line tool: Provide a user/developer interface to the three things above. This is the docker command. When you install containerd without docker this is the ctr command.

Docker, which is probably the most famous container distribution, pairs the docker command with the docker daemon to abstract away the containerd daemon, runc, and cgroups.

If you use containerd alone, you get ctr/containerd/runc/cgroups.

There's a standalone command line tool (crictl) which replaces both ctr and docker and can be used on top of either the docker daemon or containerd.

[1] Container systems seem to have a relatively complex abstraction over what is a relatively simple architecture.


I feel like podman is proving that you don't really need the API daemon, and that a porcelain over runc with a one-off process supervisor is sufficient for a good number of workloads.

Being able to run containers like any other process and leave the lifecycle management to systemd is actually really nice.


Could not agree more. As a Fedora user I was mildly intrigued when Podman showed up, I played with it briefly but stopped because most of my projects used docker-compose, which doesn't work with Podman.

When I went to work at Red Hat I decided to really try Podman, and I love it now. Once I discovered that Podman supports Kubernetes Pods (same YAML and all) I realized how clunky docker-compose actually is. Since most of my projects now run in K8s anyway, it's awesome to have the app run as a Pod locally so it can easily be tested with any sidecars in the same configuration.

There are still uses for docker-compose, and I don't expect it to disappear from my life completely any time soon, but Podman does have a great place in my toolkit.


podman-compose works fine for me, despite being advertised as "still under development".

https://github.com/containers/podman-compose


podman-compose works better than expected; however, it is not a "drop-in" replacement for docker-compose, at least for now.


Thanks, it might be time for me to give it another try.


There are imho 3 key features that made Docker great:

* the combination of all technologies touched here in the example, but it's missing a very important one: the layered filesystem

* A defined packaging format and way of distributing images

* Having an API to talk to these things easily.

That is what kickstarted and revolutionised the container space, but there were still many technology gaps and questions. They tried to put everything in one single tool, which was a double-edged sword I think: it made usage easy, which accelerated adoption, but also caused it to become a big 'monolith' project, which is a bit ironic given that it ended up being extremely suitable for micro-service architectures.

For me as an early adopter, it was very clear from the start that the main problem was how you actually deploy this in production. Docker saw a very large part of the conceptual bigger picture, but they didn't have a clue how to address this, though to be fair, very few did.

Right now, a lot of lessons have been learned, and standards have been created in the container space.

Nowadays, for production loads, k8s has become the de-facto standard container API, and Docker itself doesn't really carry that much weight anymore; the runtime itself isn't important anymore. Whether it's podman, runc, or still a full-blown docker, I couldn't really care less. Docker's stronghold will remain the development environment for a while though, just due to the massive amount of resources and easy-to-use solutions out there...


Yep. Because the use case that seems to be most common is not running untrusted containers, but rather something more like static compilation, but with ruby or javascript code. So on a single machine, just considering it another process, but with better isolation of dependencies (including encapsulation of distinct processes and in-container network activity), is actually the use case. And once we scale up to multiple machines for resource needs, something like k8s starts to make sense, because it just moves the abstraction to the cluster level rather than the machine. The more you get k8s to be a "cluster systems" the saner the management.


Systemd has some pretty nice sandboxing settings built in if you (like me) prefer to not use docker-like containers.

With stuff like RootImage and the various isolation settings you can have a configurably sandboxed container right in systemd.

Or just use systemd-nspawn if you want it more preconfigured.


do you know if any of these non-daemon containers give you network isolation, and if so, how they do it? Not that most workloads need it.


> isolation layer: the piece that limits privileges and resource usage. (On linux, this is usually handled by cgroups and the kernel

Clarification: cgroups only control access to compute resources, like CPU time and memory. To control privileges, e.g. access to the filesystem or network, you need namespaces, which are a completely separate kernel feature.
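
To make the split visible, here is a rough Python sketch (my own illustration, assuming a Linux /proc layout; the helper names and sample strings are made up). cgroup membership is reported in /proc/<pid>/cgroup, while namespace membership is exposed separately as symlinks under /proc/<pid>/ns:

```python
import os

def parse_cgroup_file(text):
    """Parse /proc/<pid>/cgroup content into (hierarchy_id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        hierarchy_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), names, path))
    return entries

def parse_ns_link(link):
    """Split a /proc/<pid>/ns/* symlink target such as 'mnt:[4026531840]'."""
    kind, _, rest = link.partition(":")
    return kind, int(rest.strip("[]"))

def describe_self():
    """Return (cgroups, namespaces) for the current process. Linux only."""
    with open("/proc/self/cgroup") as f:
        cgroups = parse_cgroup_file(f.read())
    namespaces = {}
    for name in os.listdir("/proc/self/ns"):
        target = os.readlink(os.path.join("/proc/self/ns", name))
        namespaces[name] = parse_ns_link(target)[1]
    return cgroups, namespaces
```

Two processes can share every namespace ID and still sit in different cgroups, or the other way around, which is why a runtime has to set up both.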


> Container systems seem to have a relatively complex abstraction over what is a relatively simple architecture

Yep, it's really not rocket science.

I find that very useful and usable resource isolation (i.e. most of what I actually use containers for) can be achieved with processes + cgroups + chroot.

Unfortunately networking complicates things since DNS doesn't return ports, only IP addresses. A local DNS service that returned IP:port (and appropriate connection APIs at the application level) could eliminate much of the need for network namespaces. Putting services behind front-end gateways also works (indeed it's usually the model for web services.)

(Disclaimer: I actually do use network namespaces frequently and have fun doing so, but I tend to think that networking APIs are too low-level.)


> DNS doesn't return ports, only IP addresses

DNS can return IP:Port pairs, using SRV records instead of A/AAAA records, but for some reason it never took off :( I guess because these days everything uses HTTP as the transport layer, and HTTP doesn’t support it?
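
For reference, a small Python sketch of what an SRV answer carries (my own illustration; it assumes the target name in the RDATA is not compressed, and the sample host is hypothetical). The record data holds a priority, a weight, a port and a target host name:

```python
import struct

def parse_srv_rdata(rdata):
    """Decode SRV RDATA: priority, weight, port (16-bit big-endian each), then the target name."""
    priority, weight, port = struct.unpack("!HHH", rdata[:6])
    labels = []
    pos = 6
    while True:
        length = rdata[pos]          # each label is prefixed with its length
        pos += 1
        if length == 0:              # a zero-length label ends the name
            break
        labels.append(rdata[pos:pos + length].decode("ascii"))
        pos += length
    return {"priority": priority, "weight": weight, "port": port, "target": ".".join(labels)}
```

The client still has to resolve the target to an address with a second A/AAAA lookup, so what SRV really gives you is a (host, port) pair.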


I would further mention something people don't immediately grasp: Docker is sort of a version control system too.


I love minimal code like this as a way of really understanding how something works. The author is pretty consistent too - he's got some great projects like an ultra-minimal electron alternative: https://github.com/zserge/webview


Same, I really like the book 500 Lines or Less for this as well: https://github.com/aosabook/500lines

I’ve personally been getting into writing minimal code because I’ve grown frustrated with how simple tasks can result in complex code that’s difficult to maintain. Minimal code is easier to come back to. A 100 line script can be easier to understand than a 500 line script.


Big caution here. Do NOT use this style of code to invoke ip tools. This was the cause of a huge number of security vulnerabilities on Android in the first few years. Even if you're hardcoding interfaces to start, it's likely someone else will drive by later on and replace one of the args with %s.

> system("ip link add veth0 type veth peer name veth1");

Always, always, always use exec*() APIs.
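
The same idea in Python terms, as a hedged sketch (my own illustration, not the article's code): passing a list to subprocess.run starts the child without a shell, so each element arrives as exactly one argument. The child here is a stand-in that only reports what it received; it is not `ip`, and the helper names are made up.

```python
import ast
import subprocess
import sys

def veth_argv(name, peer):
    """Build the argument vector for `ip link add <name> type veth peer name <peer>`."""
    return ["ip", "link", "add", name, "type", "veth", "peer", "name", peer]

def show_args_without_shell(args):
    """Run a stand-in child (not `ip`) that prints the arguments it was given.

    No shell is involved, so characters like ';' are never interpreted.
    """
    child = [sys.executable, "-c", "import sys; print(sys.argv[1:])"]
    result = subprocess.run(child + args, capture_output=True, text=True, check=True)
    return ast.literal_eval(result.stdout.strip())
```

Feeding it a hostile name such as "veth0; touch /tmp/pwned" shows the whole string landing in a single argv slot, whereas the system() version would hand the ';' to the shell.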


Are there not better C APIs to call that `ip` et al are wrapping?


Yes, rtnetlink is used to configure networking via a netlink socket, and libnetlink provides an abstraction over sending raw datagrams.
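
To give a feel for how low-level that is, here is a Python sketch that only builds the bytes of a "list all links" request (my own illustration; constants are copied from the kernel headers as I remember them, and nothing is sent):

```python
import struct

RTM_GETLINK = 18        # from <linux/rtnetlink.h>
NLM_F_REQUEST = 0x001   # from <linux/netlink.h>
NLM_F_DUMP = 0x300      # NLM_F_ROOT | NLM_F_MATCH
AF_UNSPEC = 0

def build_getlink_dump(seq=1):
    """Build an RTM_GETLINK dump request: nlmsghdr followed by ifinfomsg."""
    # struct ifinfomsg: family (u8), pad (u8), type (u16), index (s32), flags (u32), change (u32)
    body = struct.pack("=BBHiII", AF_UNSPEC, 0, 0, 0, 0, 0)
    # struct nlmsghdr: len (u32), type (u16), flags (u16), seq (u32), pid (u32)
    header = struct.pack("=IHHII", 16 + len(body), RTM_GETLINK,
                         NLM_F_REQUEST | NLM_F_DUMP, seq, 0)
    return header + body
```

Actually using it would mean sending these bytes over an AF_NETLINK / NETLINK_ROUTE socket and parsing a multipart reply, which is the part libnetlink takes off your hands.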


They are low-level APIs, so strictly worse for the purpose.


Yeah, netlink


Nice work. I'm reminded of bocker [0], which also implements this sort of functionality in only a few dozen lines of code. The function which corresponds to this post [1] is relatively short and readable.

[0] https://github.com/p8952/bocker

[1] https://github.com/p8952/bocker/blob/master/bocker#L61-L90


Julia Evans has an excellent zine on how containers work, including a 15-line bash implementation: https://jvns.ca/blog/2020/04/27/new-zine-how-containers-work...

Definitely worth the $12.


I loved it. It helped me understand the nitty gritty of containers.

You can download it for free too: https://gumroad.com/l/containers-zine/buyonegiveone

There is an official option to either pay or get it for free. Do support her if you can.


DIY Containers on Linux is probably a better term here given that Linux Containers is already heavily in use around the world and included in Ubuntu by Canonical?

For me this is enough to get a container running:

    lxd init
    lxc launch ubuntu mycontainer
    lxc exec mycontainer bash

https://linuxcontainers.org/


I don't know, I find "Linux Containers" to be sufficiently generic as Linux has native support for them. LXD/LXC, Docker, etc, are simply tools built upon that.


I agree. Most people who read "Linux Containers" are going to think of Docker, not LXD.


People are different.

When I think about containers, I think about isolated system images, lightweight VMs, which I can use and adapt to solve a problem at hand.

When I hear Docker, I think about static, locked-down application-images made by others, to deploy in the cloud, which I'm not given to adapt to my needs, and I also think about things which are not natively integrated into my Linux-distro and for which I will have to provide glue manually. I also think about a startling amount of complexity in a new stack which I would have to learn, just to manage what is actually really just basic Linux-systems.

While all that may or may not be true, that's what I think about when I hear "Docker".

Hearing "containers" on the other hand, makes me happy. And yeah, I'll go with LXC/LXD any day.


Which is ironic given that LXC long predates Docker (having been involved in the original discussions of the kernel APIs being used by containers today), and Docker used LXC for a long portion of its history.


I'll add systemd-nspawn to the list.


Could someone comment on how secure such a container is, at least nominally? Should I be able to theoretically run untrusted code on such a container if the system is bug-free and I add proper error-checking to the code? Or are there things that you'd need to worry about the code being able to access? Any considerations regarding sudo permissions?


Definitely not secure. The author did a great job explaining container runtimes in basic terms, but there's a lot of security features missing. Mainly:

* Reducing the container's capabilities

* Restricting access to resources through cgroups

* Applying seccomp filters to prevent certain syscalls.

As another comment suggested, user namespaces are another hardening feature, but not all container runtimes enable it by default. Podman does, Docker doesn't. In fact user namespaces are so powerful that I believe they pretty much cover most of the hardening provided by the three features I listed above. If you're wondering why they're not enabled by default in Docker, take a look at this [1].

Exploiting the missing isolation mechanisms, the following bash commands will allow you to escape from the author's containers:

$ ls -al /sys/dev/block # find the root fs device (e.g. /dev/sda1) major and minor device numbers (e.g. {maj=8, min=1}, {maj=259, min=1})

$ mknod --mode 0600 /dev/host_fs_dev b $major $minor

$ mkdir /host_fs && mount /dev/host_fs_dev /host_fs

(warning: shameless plug to my posts follows:)

If you want more details, I wrote a post on this exact same problem in the context of three vulnerabilities I found in rkt (another container runtime) [2].

Besides the issues above, the author's runtime also exposes host file descriptors like /proc/self/exe that can be used to escape the container. This is a post I wrote on runC CVE-2019-5736 that explains this kind of issue [3].

[1] https://docs.docker.com/engine/security/userns-remap/#user-n...

[2] https://unit42.paloaltonetworks.com/breaking-out-of-coresos-...

[3] https://unit42.paloaltonetworks.com/breaking-docker-via-runc...


Thank you for detailed answer and interesting links!

Could you please explain, or point me to some information/source on, why docker can't use the host network namespace (--net=host) if userns is enabled, while on the other hand rootlesskit[1], which uses userns by default, doesn't have a problem with using the host netns (--net=host)?

[1] https://github.com/rootless-containers/rootlesskit


Wow thank you!!


The code from the article misses an important thing w.r.t. security - user namespaces, which can be used to map container UIDs to a subset of host UIDs and container root to non-root user. But it is a namespace that is more complicated to configure than others.
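
A small Python sketch of what such a mapping means (my own illustration; the helper names and the 100000 range are hypothetical). Each line of /proc/<pid>/uid_map has three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range:

```python
def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside_start, outside_start, count) tuples."""
    return [tuple(int(field) for field in line.split())
            for line in text.splitlines() if line.strip()]

def to_host_id(id_map, inside_id):
    """Translate an ID seen inside the namespace to the ID outside it, or None if unmapped."""
    for inside_start, outside_start, count in id_map:
        if inside_start <= inside_id < inside_start + count:
            return outside_start + (inside_id - inside_start)
    return None
```

With a map of "0 100000 65536", root inside the container is UID 100000 on the host, i.e. an ordinary unprivileged user as far as host files and devices are concerned.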


Ah thank you! That's what I was looking for.


Docker has a concept of layered images where only the top layer is writable. The layer above the "scratch"[1] image usually contains all the files of the base image's OS, and that's what you set your root directory to. The writable layer disappears when the container is stopped. If you mount your DIY container into /tmp for example, the process running inside your container won't be able to access any OS functionality. You couldn't run a web server in such a container for instance. On the other hand, whatever your containerized process writes into the mounted part of your hard disk won't disappear when the container stops. Because of that, I wouldn't run untrusted code in it.

[1] https://hub.docker.com/_/scratch


this is the file system not docker itself. you can get the overlay behavior without any docker


Can you point me to some online resources? I'd like to learn more about this.



nice. thank you


start with: https://windsock.io/the-overlay-filesystem/

after that, read more about overlay and overlayfs, and, for historical reasons, aufs.
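
As a starting point, a tiny Python sketch that only builds the mount command for an overlay (my own illustration; the function name and directory layout are hypothetical, and actually running the command needs root). Read-only layers go in lowerdir, writes land in upperdir, and workdir is scratch space on the same filesystem as upperdir:

```python
def overlay_mount_argv(lower_dirs, upper_dir, work_dir, target):
    """Build a `mount -t overlay` command line; lower_dirs are listed top-most first."""
    options = "lowerdir={},upperdir={},workdir={}".format(
        ":".join(lower_dirs), upper_dir, work_dir)
    return ["mount", "-t", "overlay", "overlay", "-o", options, target]
```

Throwing away the upper directory after the process exits gives you the "writable layer disappears" behavior, no docker needed.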


I'm not sure if this is what the parent is referring to but there are overlayfs and unionfs in Ubuntu for example.


I see, thanks. It's just a matter of preparing the directory before and cleaning it up afterward though, right? Not a security hole exactly?


Correct.


Big “if”. There has never, in thirty years, been a Linux that lacked a user-to-root privilege escalation path. Running untrusted code in containers is the same as it’s ever been: totally unsafe. VMs are safer, or, minimally, ptrace sandboxes intercepting all syscalls.


One can write the same sentence about VMs. The most recent Xen escalation bug I can find in 5 minutes of Googling has a publication date of Jan 2020.

Emulating an entire machine, all of its diverse hardware, their bespoke protocols, and all the weirdness of the x86/amd64 ISA is fraught with peril. It is a large attack space. So too is the Linux kernel.

And frankly, inside the VM, half of us are running Linux anyways. I feel like for a lot of use cases, compromising the VM's guest OS (Linux) is enough to have a really bad day. Compromising the hypervisor? Bad, yes, but now it's AWS's problem.

There's more to containers, too, than just the security thing, and I think there are enough other advantages (I can more easily bin-pack services together; I can separate the FS and thus dependencies of unrelated components; I can more easily test them; etc.) that containers are worth it. Often and even on top of a VM. (My current work is with containers, and we run them on VMs.)


Are you saying because of bugs or are you saying it's by design? I explicitly said ignore OS bugs.


Yes you did, but that’s as useless as a discussion based on ignoring the laws of thermodynamics.


If you find it useless that doesn't mean I do. There's no obligation to contribute if you don't have anything to.


That may be hard. Some bugs are elevated to features and then become part of the design.


This is too vague to be useful.


here is a secret: there is no such thing as a container. it’s an abstraction we made up and containers rely on kernel features. if you use those features correctly it’s as secure as it gets - chances are that if you’re going to roll your own you’ll miss some things.


That doesn't answer my question at all. I'm well aware containers aren't a "real thing". My entire question was about the "if" part that you didn't address. Is anything missed here? is my question.


in theory, it’s as secure as the kernel if your code does the right thing.

that being said, the attack surface is wider than say if you would run it in a VM or its own physical machine


Well obviously "if your code does the right thing" then it's going to be secure rather than insecure... by definition. That's again a pretty unhelpful tautology.

I'm asking about the code in the webpage, not code I'm writing personally. I'm saying let's assume it has error-handling added to it. That's it. I am not writing any code otherwise. Is that code doing "the right thing"? Or are there more things it needs to be doing?


The container process still has full access to /dev, /sys, all capabilities of root, and the ability to insmod.


I didn't notice that, thanks! Although I imagine they can't do much with /dev etc. unless they get sudo.


Unless you're using user namespaces (which this doesn't) then root inside a container is equal to root outside the container. You don't even need access to /dev, because the container process could just mknod(2) any device and access it with full permissions.

This is only possible in this example because the container has the full capability set (including CAP_MKNOD) and the devices cgroup hasn't been configured to restrict device access. Real container runtimes always restrict device creation by default, and usually don't allow CAP_MKNOD by default.
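
If you want to check this from inside a container, here is a rough Python sketch that decodes the CapEff mask from /proc/<pid>/status (my own illustration; the helper names are made up, and the capability numbers are the ones from <linux/capability.h>):

```python
CAP_SYS_MODULE = 16   # load kernel modules (insmod)
CAP_MKNOD = 27        # create device nodes with mknod(2)

def effective_caps_from_status(status_text):
    """Extract the CapEff hex mask from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split(":", 1)[1].strip()
    raise ValueError("no CapEff line found")

def has_capability(cap_hex, cap_number):
    """Return True if bit `cap_number` is set in the hexadecimal capability mask."""
    return bool((int(cap_hex, 16) >> cap_number) & 1)
```

Running it against /proc/self/status in the article's container should show both bits set, since nothing there drops capabilities.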


Thanks!


See also:

Linux containers in 500 lines of code (2016) https://news.ycombinator.com/item?id=22232705


A few weeks back I published a couple of posts on how to build a container in Go. In case you're interested, here are the links:

https://www.polarsparc.com/xhtml/Containers-1.html

https://www.polarsparc.com/xhtml/Containers-2.html


Small container implementation in Scheme: http://git.savannah.gnu.org/cgit/guix.git/tree/gnu/build/lin...


One thing to note, is that using a PID namespace in that way is incorrect. PID1 in a PID namespace has to perform the duties normally performed by a PID1, so you will normally want PID1 in the namespace to be a minimal init. If not, there may be issues, like unreaped zombie processes.
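
A minimal sketch of such an init in Python (my own illustration, needs Python 3.9+; a real init would also forward signals, which this one does not). It starts the real workload as a child and keeps calling wait(), so any orphan that gets reparented to it is reaped instead of lingering as a zombie:

```python
import os

def mini_init(argv):
    """Run argv as the main child and reap every child that exits until the main one does."""
    main_pid = os.fork()
    if main_pid == 0:
        try:
            os.execvp(argv[0], argv)   # replace the child with the real workload
        finally:
            os._exit(127)              # only reached if exec failed
    while True:
        pid, raw_status = os.wait()    # blocks until any child, including adopted orphans, exits
        if pid == main_pid:
            return os.waitstatus_to_exitcode(raw_status)
```

Outside a PID namespace the orphans go to the host's PID 1 instead, so the reaping only matters once this process really is PID 1 in its namespace.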


While this is interesting, it doesn't really show how containers actually work; it only lists the specific syscall flags that tell Linux to create one.

A similar snippet[1] exists for go, and it doesn't do anything particularly special either.

I don't know, maybe David Beazley has altered my sense of what "from scratch" means.

[1] https://gist.github.com/lizrice/a5ef4d175fd0cd3491c7e8d71682...


This is how containers work though. Or did you want more detail about the internals of pivot_root or something?


Yes, something like that.

Like here's David's "Build Your Own Async" [1], which I prefer over Philip's (still extremely good) "What the heck is the event loop anyway?" [2].

It's one thing to tell how something works, but successfully showing what the hell it's actually doing under the covers just conveys much more information.

[1] https://www.youtube.com/watch?v=Y4Gt3Xjd7G8

[2] https://www.youtube.com/watch?v=8aGhZQkoFbQ


Agree, you can't make your own code to isolate from the OS. You need the OS to do that.


I love this kind of post. It's never something you will effectively use instead of the actual product (here, Docker), but it's really a great way to learn new little things.


Linux containers in one shell statement

  $ echo $$ > tasks


I'm not sure if this is serious or a joke, but can you explain it further?


Given certain initial conditions, this statement moves the current process (and any process it subsequently creates) into a control group, which meets minimal definitions of containerization.


Ahh. [1]

           $ mkdir /dev/cpuset
           $ mount -t cpuset cpuset /dev/cpuset
           $ cd /dev/cpuset
           $ mkdir Charlie
           $ cd Charlie
           $ /bin/echo 2-3 > cpuset.cpus
           $ /bin/echo 1 > cpuset.mems
           $ /bin/echo $$ > tasks
           # The current shell is now running in cpuset Charlie
           # The next line should display '/Charlie'
           $ cat /proc/self/cpuset
[1] http://man7.org/linux/man-pages/man7/cpuset.7.html
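
The last step of that man page example, written as a Python sketch (my own illustration; it assumes the cgroup v1 layout shown above, where the file is called `tasks`, and the dry run uses a scratch directory so it can run without root):

```python
import os
import tempfile

def join_cgroup_v1(cgroup_dir, pid=None):
    """Move a process into a cgroup v1 group by writing its PID to the `tasks` file."""
    pid = os.getpid() if pid is None else pid
    with open(os.path.join(cgroup_dir, "tasks"), "w") as tasks:
        tasks.write(f"{pid}\n")
    return pid

def dry_run():
    """Exercise join_cgroup_v1 against a scratch directory instead of a real cgroup mount."""
    with tempfile.TemporaryDirectory() as scratch:
        join_cgroup_v1(scratch, pid=1234)
        with open(os.path.join(scratch, "tasks")) as tasks:
            return tasks.read()
```

On a real mount you would call join_cgroup_v1("/dev/cpuset/Charlie"); on cgroup v2 the equivalent file is cgroup.procs.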


It really doesn't. I would say that "unshare -mpf ; pivot_root" matches the most minimal definition of a container more accurately than joining a cgroup (it's an isolated system which can't directly interact with the host).

Otherwise you'd have to argue that configuring rlimits actually makes your shell a container, which is too much of a stretch (for me at least).


I think that just underscores my point, which is that containment means different things to different people. To me, it means only the resource limiting features from cgroups. I have no use for namespaces, bind mounts, virtual network interfaces, or any of that stuff. In my application all that stuff is either pointless or harmful. But to you, container means at least PID and mount namespaces.

rlimit is sort of a thing but it's not actually effective so to me it's not part of the picture. If unix limits worked, Google would not have needed to contribute cgroups before deploying Borg. Indeed, in this LWN article which is actually about control groups, they call control groups "containers". Just shows there is not a universal meaning of the term.

The earliest control groups patch I can find says "We use the term container to indicate a structure against which we track and charge utilization of system resources like memory, tasks etc for a workload." It doesn't say anything about isolation, namespaces, or security, but it uses the term container to describe resource control.

https://lwn.net/Articles/236038/


Regardless of what was said when cgroups was first implemented, the current industry term "container" does actually mean isolation. I suspect you'd find yourself in a very small minority of people who use it to mean simply "running in a cgroup".

> Just shows there is not a universal meaning of the term.

Yes, there is. That meaning has just evolved since 2007.


Anyone want to chime in on why pivot_root is preferable to a chroot jail? It's kind of hand-waved in the article.


This is mostly to do with the implementation of chroot(). Because it only applies to a single process (and mount tables are per mount namespace), it was implemented in such a way that directories above the root of the chroot are still technically accessible (the mounts above the root directory are still present in the mount hierarchy). This results in all sorts of fun bugs where if you chroot() inside a chroot() you can get out of the chroot() entirely. Container runtimes generally block CAP_SYS_CHROOT by default for this reason, but there are all sorts of other subtle security bugs which pop up because of this fundamental design choice.

pivot_root() doesn't suffer from this problem because it applies to the entire mount namespace, and thus its implementation could be made much safer. Instead of just changing the current process's filesystem root, the actual / mount of the mount namespace is swapped with another mountpoint on the system (and the old root is mounted elsewhere). Thus once the old root is unmounted there isn't a way to get back to the old mountpoints. This isn't perfect protection (magic-links and other mounts could expose the host filesystem) but it is a damn sight better than chroot(). Oh, and nesting pivot_root()s doesn't cause a breakout.

Note that this different behaviour in relation to mounts has resulted in completely unrelated security bugs with containers (such as being able to bypass procfs masks because chroot() doesn't hide the unmasked procfs in the host mount namespace). This is why us container runtime authors always tell people they should never use chroot() and always use pivot_root() -- though sadly sometimes chroot() is needed because pivot_root() doesn't work on initramfs.

(I'm one of the maintainers of runc, the runtime which underlies Docker/podman/containerd/cri-o/...)


> though sadly sometimes chroot() is needed because pivot_root() doesn't work on initramfs.

Are people actually attempting to boot a super-minimalist system that just has a kernel and an initramfs with something like docker into it where they don't bother with a rootfs at all and just start running containers directly from the initramfs? That's kinda cool, if that's the case.


You could do that (though one could argue that there's no real benefit to using containers in that case), but the issue is sadly more general than that. You cannot use pivot_root() if the current root is on initramfs. The reason is fairly historic, and boils down to "you cannot unmount initramfs" in the same way that "you cannot kill pid1".

This means that in setups where you have the entire OS image in initramfs, trying to run a container (even if it has a different filesystem as its rootfs) will fail with pivot_root(). There are solutions for this but they require changing how the system is started (which can be a bit complicated depending on what system you're using to build your initramfs). From memory, minikube has used --no-pivot-root for a while precisely for this reason, though I believe they have switched away from it sometime recently.


pivot_root is supposed to switch the whole system to a new root. chroot applies to a process, but the underlying system keeps going with what it had.


This is true (though it's scoped per mount namespace), but it isn't the primary security reason container runtimes use pivot_root().


The title is about Docker containers in a few lines of code.

Could the mods consider renaming the submission?


thanks for sharing!


Software history should be written that Docker gave Phil Katz the Oliver Cromwell treatment.



