As the maintainer of a Go container runtime (runc), and having worked with Rust on various other projects: while Go and Rust can be better languages for building large projects, they make it harder to understand what exactly your program is doing when you're writing software like this.
One example that immediately comes to mind from Rust is a bug with O_PATH file descriptors I found a while ago[1], which would've made certain code we use in runc not work. And from Go, here is a bug I just found in their code for handling file descriptors for ForkExec[2], which is causing issues in a runc patch I'm working on. Neither of these issues exists in C programs. Though of course, C programs have their own issues. For better or worse, the Linux kernel APIs are easiest to use from C.
In runc we actually implement the core container setup code in C, because Go doesn't allow you to do everything we need for setting up a container. (It has gotten better, though: in the past it was completely impossible to set up a container properly in pure Go; now you can set one up, but there are still certain configurations that are not possible to implement in pure Go, such as "docker exec".) You also cannot run Go in single-threaded mode, which means that certain kernel APIs (unshare(CLONE_NEWUSER), for instance) simply cannot be used from regular Go code.
> One example that immediately comes to mind from Rust is a bug with O_PATH file descriptors I found a while ago[1], which would've made certain code we use in runc not work. [...] Neither of these issues exist in C programs.
This issue doesn't intrinsically affect Rust as a language (when compared to C), because you can just do exactly the same thing as you'd have done in C:
let fd = unsafe { libc::open(b"/path\0".as_ptr().cast(), libc::O_CLOEXEC | libc::O_PATH) };
Or just make the syscall directly-ish:
let fd = unsafe { libc::syscall(libc::SYS_open, b"/path\0".as_ptr(), libc::O_CLOEXEC | libc::O_PATH) };
Or use rustix if you want more convenient idiomatic wrappers.
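For example, a minimal sketch (assuming a recent rustix, where fs::open takes explicit OFlags and returns an OwnedFd):

    use rustix::fs::{open, Mode, OFlags};

    // O_PATH is passed through as-is, and the returned OwnedFd closes
    // itself on drop, so ownership of the descriptor stays explicit.
    let fd = open("/path", OFlags::PATH | OFlags::CLOEXEC, Mode::empty())?;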
And for setting up containers you'll have to do this anyway, because Rust's standard library doesn't expose all of the necessary functionality.
I'm aware you can work around it; there are workarounds for issues in Go as well.
In general, C programs do not require workarounds for dealing with kernel APIs for the simple reason that the vast majority of kernel APIs are developed with test programs written in C, so kernel developers will usually not design an API that is awful to use in C.
Another thing that surprised me when I first started programming in Rust is that:
let fd = File::open("foo")?.as_raw_fd();
and
let f = File::open("foo")?;
let fd = f.as_raw_fd();
have different behaviour, with the former being incorrect and a possible security bug if you use the file descriptor directly afterwards. But I guess this behaviour is obvious to a seasoned Rust developer (at least, it seems obvious to me now).
It's not a workaround - the `File` in Rust was neither meant nor designed to support full `open` semantics. If you want to use `open`, you should use `open` (or an idiomatic wrapper meant to model it) instead of forcing it through `File`.
And `open` is not a kernel API either. It's a libc API. If you want to directly access the API provided by the kernel you're supposed to make a syscall, which essentially is exactly the same in Rust and in C.
And to make the point of `open` in C *not* being a kernel API more clear, in glibc the `open` function *doesn't* actually call the `open` syscall, but `openat` with `AT_FDCWD`. Glibc doesn't guarantee that a given function will actually call a given syscall, and new versions of glibc often change which syscalls are called by a given function. This is important if you're also doing e.g. seccomp sandboxing, because suddenly your program might stop working if glibc is updated. For example, glibc 2.34 started using the `clone3` syscall under the hood, which broke Chromium Embedded Framework's sandbox.
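To make that concrete, a hedged sketch using the libc crate: if your seccomp policy has to match the exact syscall, issue it yourself, mirroring what modern glibc does under the hood:

    // glibc's open() actually performs openat(AT_FDCWD, ...), so calling
    // openat directly pins the exact syscall your seccomp filter will see.
    let fd = unsafe {
        libc::syscall(
            libc::SYS_openat,
            libc::AT_FDCWD,
            b"/path\0".as_ptr(),
            libc::O_CLOEXEC | libc::O_PATH,
        )
    };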
So, again, your argument that a language like Rust "makes it harder to understand what exactly your program is doing" compared to C in this particular case isn't really valid, because C has exactly the same problem if you use libc functions, and the only way to guarantee that the program is doing exactly what you want is to use syscalls, which is the same both in C and Rust.
> Another thing that surprised me when I first started programming in Rust
Yep. That's one of Rust's few badly designed APIs.
As I understand it, that as_raw_fd() issue is a big reason that they added the BorrowedFd<'_> type [0] and corresponding AsFd trait in 1.63.0, to prevent the raw file descriptor from outliving its logical owner. Still, I agree that there is lots of potential for issues on the boundary between Rust's implicit lifetime management vs. C APIs' explicit lifetime management, since there won't always be a convenient preexisting mechanism to bridge the gap.
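A minimal sketch of what that buys you (assuming a toolchain where the std::os::fd module is stable):

    use std::fs::File;
    use std::os::fd::{AsFd, BorrowedFd};

    fn main() -> std::io::Result<()> {
        let f = File::open("foo")?;
        // BorrowedFd carries a lifetime tied to `f`, so the descriptor
        // cannot outlive the File that owns it.
        let fd: BorrowedFd<'_> = f.as_fd();
        // let fd = File::open("foo")?.as_fd(); // rejected at compile time:
        //                                      // temporary dropped while borrowed
        println!("{fd:?}");
        Ok(())
    }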
I'm not sure if there is a document that mentions this in particular, but it is a consequence of how lifetimes work. The core issue is that .as_raw_fd() takes &File and returns a plain integer, which carries no lifetime information. As a result, the temporary File is dropped at the end of the statement, and thus the number you got from .as_raw_fd() is invalidated.
This does happen elsewhere in Rust, but often when you have methods on &self that return something you use later, the method returns something with the same lifetime (fn foo(&'a self) -> Foo<'a>), and thus the original object is kept alive until the end of the scope. It just so happens that file descriptors are tied to the lifetime of the File in a way that Rust can neither express nor detect.
I don't know if clippy has a warning for this particular case. It might be useful to add it.
> Those Rust and Go bugs aren't much different from C gotchas when writing portable UNIX code.
Maybe, but:
1. It's irrelevant to this project (no one is writing portable UNIX code when they are writing Linux-specific software, like container implementations).
and
2. It's irrelevant to the author's goals (learning Linux kernel stuff using the language that the interface to the kernel uses is a better idea than using a different language and hacking shims for all the stuff you want to do).
and
3. The cost to switch to a new language is substantial, and only makes sense if you're either joining a team and project that uses that new language, or if the goal is to learn that new language.
It depends what you mean by "container". As far as I know, Windows containers don't use namespaces, cgroups, and seccomp. BSD jails are definitely a different thing. So if you want to know how exactly Linux containers work, it's probably easiest to use what the Linux docs provide (which is C).
No, that only pins the current goroutine to a single OS thread (which is needed for some APIs -- namely, all of the other namespace APIs and some thread-related APIs).
There is no way to make an entire Go program run as a single-threaded program without using CGo the way we do in runc. Even GOMAXPROCS=1 doesn't work. CLONE_NEWUSER will always fail in a multi-threaded program.
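The restriction comes from the kernel rather than from Go, so it's easy to demonstrate from any language; a minimal Rust sketch using the libc crate:

    use std::{thread, time::Duration};

    fn main() {
        // Once a second thread exists, the process is multi-threaded
        // for the lifetime of that thread.
        let _t = thread::spawn(|| thread::sleep(Duration::from_secs(1)));
        let ret = unsafe { libc::unshare(libc::CLONE_NEWUSER) };
        // unshare(2): CLONE_NEWUSER requires a single-threaded process,
        // so this fails with EINVAL.
        assert_eq!(ret, -1);
        assert_eq!(
            std::io::Error::last_os_error().raw_os_error(),
            Some(libc::EINVAL)
        );
    }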
I can't answer for the developer, but the answer to that with most small one-person-show projects is familiarity/comfort/ability.
The head-space that adopting a new language for a specific project takes is immense compared to tackling it in a familiar language you already know you're capable in. There is rarely a benefit to doing so outside of team environments where a certain level of on-boarding is expected, or when you have a really niche language requirement/feature that your project is begging for.
I came across this last week when reading about different container runtimes -- crun is implemented in C[0].
Their explanation:
"While most of the tools used in the Linux containers ecosystem are written in Go, I believe C is a better fit for a lower level tool like a container runtime. runc, the most used implementation of the OCI runtime specs written in Go, re-execs itself and use a module written in C for setting up the environment before the container process starts.
crun aims to be also usable as a library that can be easily included in programs without requiring an external process for managing OCI containers."
At least in Spanish, my rule of thumb is that "barco" is for bigger boats and "barca" is for smaller ones (and then you have Barça which is the football/soccer team).
> barco enforces a minimal set of restrictions to run untrusted code, which is not recommended for production use, where a more robust solution should be used.
Aren't containers categorically unsuitable for running untrusted code? You need AppArmor, bwrap, or something similar, AFAIK.
bwrap is a container tool, and AppArmor is used by basically every container runtime if the system is using AppArmor (otherwise they use SELinux). Seccomp is also enabled by default, and I would argue it is a more significant protection against container breakouts, because it also protects against kernel 0-days and doesn't rely on LSM hooks to block operations. The real question is whether you are using user namespaces.
Jessica Frazelle ran a public bug bounty to break out of a container image that was properly secured, and as far as I know nobody collected the bounty. The website isn't up at the moment; maybe she took it down. https://contained.af/
If built to spec, then the various container technologies in the kernel used together are theoretically secure. It closes all of the holes that we know about, aside from a few trivial things like the container spying on process id numbers on the host, and of course the vast potential to accidentally misconfigure it.
However, all this code is quite complex, and the kernel and the software ecosystem are lacking in having a layered approach to security that goes all the way down to the low-level nitty gritty stuff. For example, kernel memory structures are not robustly protected against the usual memory exploits, and there isn't as strong W^X protection as desired. Windows, in contrast, is able to provide layered security through a variety of approaches, including running the entire operating system in a virtual machine, with the host ensuring integrity of kernel memory. These sorts of layered approaches to security are desirable because there will always be defects in any complex software.
Side note: AppArmor and bwrap are distinct. Bubblewrap is a relatively simple userspace tool that makes use of existing kernel containerization features (the same ones that Docker/Podman use), whereas AppArmor and SELinux are new security features that are patched into the kernel itself. AppArmor and SELinux have made some progress in adding layered low-level security to the kernel, but it's not particularly impressive. Bubblewrap has done great work in exposing the kernel's existing tech to users, but it is not a fundamental improvement to the kernel itself.
> aside from a few trivial things like the container spying on process id numbers on the host
Containers with their own PID namespace can't spy on process IDs on the host, though? Not sure what you mean here.
> and there isn't as strong W^X protection as desired
What level is desired? Bootup warnings for W^X got merged a while ago. Changes that try to include anything violating it are rejected (see bcachefs).
> Windows, in contrast, is able to provide layered security through a variety of approaches, including running the entire operating system in a virtual machine, with the host ensuring integrity of kernel memory.
What? Xen has existed for years; that's not "in contrast". Secure Boot and lockdown exist on Linux too. There are also per-service Firecracker microVMs.
> whereas AppArmor and SELinux are new security features that are patched into the kernel itself
That's very misleading. They're not new - SELinux is over two decades old. They're also not "patched in" - LSMs have been integrated into Linux for a very long time, with multiple implementations available. SELinux supports multilevel security, created for government use. It's quite impressive, actually.
> That's very misleading. They're not new - selinux is over 2 decades old.
I misspoke on SELinux and AppArmor being "new". What I was getting at is that they are distinct kernel features, compared to bwrap which is just a user of kernel features already familiar in this discussion. So "new", as in, "additional", e.g. "We turned up some new evidence from the old files".
And yes, SELinux is included as a first-class kernel feature. AppArmor is a bit different because it still has a lot of hurdles before all its features can be upstreamed. However, upstreaming is not the end-all-be-all so it's not necessarily a bad thing that parts of AppArmor are patched in, so I'm not emphasizing this point at all.
> What? Xen existed for years, that's not "in contrast".
Does the Xen or KVM ecosystems provide anything comparable to Windows hypervisor-enforced code integrity? That is, the host is aware of what kernel memory needs to be set to read-only or checked regularly for corruption or irregularities, in a system that is impossible to interfere with without a VM break. (https://learn.microsoft.com/en-us/windows-hardware/design/de... )
Secure VMs are great, VMs that are actually monitoring and enhancing the security of the code running inside are even better.
> Does the Xen or KVM ecosystems provide anything comparable to Windows hypervisor-enforced code integrity?
Xen can do it; KVM had some attempts at memory-enforcement patches, but I'm not sure where that ended up. I'm also not sure how well utilised it is from the guest side out of the box (poorly, I think).
I would probably point at a virtual machine for a convenient place to run untrusted code. It's not perfect -- there are VM escapes -- but it's more convenient than a dedicated, air-gapped machine.
Depends what you mean by suitable. If you run the service as a new user, it's more secure than running without a new namespace (you're isolated from other apps) and potentially less secure than running on host (one more layer of indirection for system resource access).
Since in reality most attacks will be against your app itself before the attacker has direct access to syscalls, I see namespaces/containers as extra protection.
This is tautologically true -- "Is X secure? Yes, assuming the technology X uses is secure."
The more nuanced answer is that containers have several layers of protections (seccomp, LSMs, user namespaces, namespaces, cgroups, capabilities, and standard process permissions by running as an unprivileged user) which all act together to help protect against container attacks. It's not perfect, but most container breakout attacks we've had so far are related to when container runtimes have to operate on a container during process setup (IMHO because the process for creating a container process is far from atomic) -- some of these attacks were enabled by kernel bugs which we went and fixed as well. It is very difficult to break out of a container once it has been configured and left alone.
Possibly, but I'd say Google's experience with kCTF was that allowing io_uring on hosts running containers led to multiple breakouts. They paid out over $1m in io_uring-related bug bounties https://security.googleblog.com/2023/06/learnings-from-kctf-...
Also, while user namespaces help in theory, in practice expanding the attack surface of the kernel exposed to unprivileged users has consequences that allow container breakout (e.g. CVE-2022-0185 and, more recently, CVE-2023-3390).
io_uring was blocked by the default Docker seccomp profile until somewhat recently. The primary issue with io_uring is that there is no mechanism to apply seccomp-like rules to the operations submitted through a ring.
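For reference, blocking it at the syscall boundary (as the Docker profile effectively did) is simple. A hedged, minimal sketch using the libc crate, x86-64 only; a production filter must also validate seccomp_data.arch before trusting the syscall number:

    use libc::{
        prctl, sock_filter, sock_fprog, BPF_ABS, BPF_JEQ, BPF_JMP, BPF_K, BPF_LD,
        BPF_RET, BPF_W, PR_SET_NO_NEW_PRIVS, PR_SET_SECCOMP, SECCOMP_MODE_FILTER,
        SECCOMP_RET_ALLOW, SECCOMP_RET_ERRNO,
    };

    fn main() {
        // Load seccomp_data.nr (offset 0), fail io_uring_setup(2) with
        // ENOSYS, and allow everything else. Note this only stops new
        // rings from being created; operations submitted through an
        // already-existing ring cannot be filtered this way.
        let filter = [
            sock_filter { code: (BPF_LD | BPF_W | BPF_ABS) as u16, jt: 0, jf: 0, k: 0 },
            sock_filter {
                code: (BPF_JMP | BPF_JEQ | BPF_K) as u16,
                jt: 0,
                jf: 1,
                k: libc::SYS_io_uring_setup as u32,
            },
            sock_filter {
                code: (BPF_RET | BPF_K) as u16,
                jt: 0,
                jf: 0,
                k: SECCOMP_RET_ERRNO | libc::ENOSYS as u32,
            },
            sock_filter { code: (BPF_RET | BPF_K) as u16, jt: 0, jf: 0, k: SECCOMP_RET_ALLOW },
        ];
        let prog = sock_fprog {
            len: filter.len() as u16,
            filter: filter.as_ptr() as *mut sock_filter,
        };
        unsafe {
            // Required so an unprivileged process may install a filter.
            assert_eq!(prctl(PR_SET_NO_NEW_PRIVS, 1u64, 0u64, 0u64, 0u64), 0);
            assert_eq!(
                prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER as u64, &prog as *const sock_fprog),
                0
            );
        }
    }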
> Also while user namespaces help in theory
You should use user namespaces to contain untrusted code; you absolutely should not enable CLONE_NEWUSER inside a container. I was referring to the former, you're talking about the latter.
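Setting up the former is cheap; a hedged sketch of a single-threaded, unprivileged process entering its own user namespace (libc crate; the proc writes follow user_namespaces(7)):

    use std::fs;

    fn main() -> std::io::Result<()> {
        let uid = unsafe { libc::getuid() };
        let gid = unsafe { libc::getgid() };
        if unsafe { libc::unshare(libc::CLONE_NEWUSER) } != 0 {
            return Err(std::io::Error::last_os_error());
        }
        // Unprivileged processes must deny setgroups before writing gid_map
        // (Linux >= 3.19), then map exactly one uid/gid into the namespace.
        fs::write("/proc/self/setgroups", "deny")?;
        fs::write("/proc/self/uid_map", format!("0 {uid} 1"))?;
        fs::write("/proc/self/gid_map", format!("0 {gid} 1"))?;
        // From here, the process looks like root inside the namespace but
        // holds no extra privileges on the host.
        Ok(())
    }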
For io_uring, I feel like it's a combination of bypassing seccomp and the complexity of the code, which seems to be fertile ground for privilege-escalation vulnerabilities.
I'd agree that unprivileged user namespaces should not be available inside containers. However, the default stances of some Linux distros (enable unprivileged user namespaces) and Kubernetes (disable the CRI's seccomp filter by default) mean that an awful lot of environments will end up in a situation where this is possible.
When I did a talk about Docker, I also wanted to show a bit of what it does under the hood without going through all the layers and without too much detail. This ~120-line shell script is really good at providing just an intro into what's needed for containers: https://github.com/p8952/bocker/blob/master/bocker
(not mine)
Question: for sandboxing untrusted code, should I invest time in learning more Linux container stuff or switch to learning WASI? I am inclined towards WASI myself.
I have more faith in WASI. Linux containers are inherently in a whack-a-mole position: they're trying to retrofit security onto something that was never built for it, which very rarely works.
There's a huge performance hit for many programming languages when you run them inside a WASM runtime. Memory also behaves very differently from normal applications. Autovectorization also isn't universally supported by WASM compilers yet, which can be costly for performance.
Properly configured WASI runtimes are great for security, but they're worse than containers on most other fronts. I don't think the downsides make sense unless you're building a business that lets random customers upload WASM files you execute.
Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.
I still think it is tangential. The author stated that they wrote this project to learn. The readme says that it is not intended for production use and that there is no networking set up in the containers.
With that context, I doubt that name collisions outside of the containers space are top of mind.
Why would there be any conflict? They're in completely different domains so there is no potential for confusion. Why would copyright be relevant in any way? Do you even know what copyright is?
Legally it's probably fine, but as the Belgian American Radio Corporation has been around since the 1930s, the title of the article did confuse me: my first guess was something about how Barco was using Linux containers in their products.