As the maintainer of a Go container runtime (runc), and having worked with Rust on various other projects: while Go and Rust can be better languages for building large projects, they make it harder to understand what exactly your program is doing when you're writing software like this.
One example that immediately comes to mind from Rust is a bug with O_PATH file descriptors I found a while ago[1], which would've made certain code we use in runc not work. And from Go, here is a bug I just found in their code for handling file descriptors for ForkExec[2], which is causing issues in a runc patch I'm working on. Neither of these issues exists in C programs. Though of course, C programs have their own issues. For better or worse, the Linux kernel APIs are easiest to use from C.
In runc we actually implement the core container setup code in C, because Go doesn't allow you to do everything we need for setting up a container. (It has gotten better, though: in the past it was completely impossible to set up a container properly in pure Go; now you can set one up, but there are still certain configurations that are not possible to implement in pure Go, such as "docker exec".) You also cannot run Go in single-threaded mode, which means that certain kernel APIs (unshare(CLONE_NEWUSER), for instance) simply cannot be used from regular Go code.
> One example that immediately comes to mind from Rust is a bug with O_PATH file descriptors I found a while ago[1], which would've made certain code we use in runc not work. [...] Neither of these issues exist in C programs.
This issue doesn't intrinsically affect Rust as a language (when compared to C), because you can just do exactly the same thing as you'd have done in C:
let fd = unsafe { libc::open(b"/path\0".as_ptr().cast(), libc::O_CLOEXEC | libc::O_PATH) };
Or just make the syscall directly-ish:
let fd = unsafe { libc::syscall(libc::SYS_open, b"/path\0".as_ptr(), libc::O_CLOEXEC | libc::O_PATH) };
Or use rustix if you want more convenient idiomatic wrappers.
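For example, a minimal sketch (assuming a recent rustix, where fs::open takes explicit OFlags and returns an OwnedFd):

    use rustix::fs::{open, Mode, OFlags};

    // O_PATH is passed through as-is, and the returned OwnedFd closes
    // itself on drop, so ownership of the descriptor stays explicit.
    let fd = open("/path", OFlags::PATH | OFlags::CLOEXEC, Mode::empty())?;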
And for setting up containers you'll have to do this anyway, because Rust's standard library doesn't expose all of the necessary functionality.
I'm aware you can work around it; there are workarounds for issues in Go as well.
In general, C programs do not require workarounds for dealing with kernel APIs for the simple reason that the vast majority of kernel APIs are developed with test programs written in C, so kernel developers will usually not design an API that is awful to use in C.
Another thing that surprised me when I first started programming in Rust is that:
let fd = File::open("foo")?.as_raw_fd();
and
let f = File::open("foo")?;
let fd = f.as_raw_fd();
have different behaviour, with the former being incorrect and a possible security bug if you use the file descriptor directly afterwards. But I guess this behaviour is obvious to a seasoned Rust developer (at least, it seems obvious to me now).
It's not a workaround - the `File` in Rust was neither meant nor designed to support full `open` semantics. If you want to use `open`, you should use `open` (or an idiomatic wrapper meant to model it) instead of forcing it through `File`.
And `open` is not a kernel API either. It's a libc API. If you want to directly access the API provided by the kernel you're supposed to make a syscall, which essentially is exactly the same in Rust and in C.
And to make the point of `open` in C *not* being a kernel API more clear, in glibc the `open` function *doesn't* actually call the `open` syscall, but `openat` with `AT_FDCWD`. Glibc doesn't guarantee that a given function will actually call a given syscall, and new versions of glibc often change which syscalls are called by a given function. This is important if you're also doing e.g. seccomp sandboxing, because suddenly your program might stop working if glibc is updated. For example, glibc 2.34 started using the `clone3` syscall under the hood, which broke Chromium Embedded Framework's sandbox.
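To make that concrete, a hedged sketch using the libc crate: if your seccomp policy has to match the exact syscall, issue it yourself, mirroring what modern glibc does under the hood:

    // glibc's open() actually performs openat(AT_FDCWD, ...), so calling
    // openat directly pins the exact syscall your seccomp filter will see.
    let fd = unsafe {
        libc::syscall(
            libc::SYS_openat,
            libc::AT_FDCWD,
            b"/path\0".as_ptr(),
            libc::O_CLOEXEC | libc::O_PATH,
        )
    };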
So, again, your argument that a language like Rust "makes it harder to understand what exactly your program is doing" compared to C in this particular case isn't really valid, because C has exactly the same problem if you use libc functions, and the only way to guarantee that the program is doing exactly what you want is to use syscalls, which is the same both in C and Rust.
> Another thing that surprised me when I first started programming in Rust
Yep. That's one of Rust's few badly designed APIs.
As I understand it, that as_raw_fd() issue is a big reason that they added the BorrowedFd<'_> type [0] and corresponding AsFd trait in 1.63.0, to prevent the raw file descriptor from outliving its logical owner. Still, I agree that there is lots of potential for issues on the boundary between Rust's implicit lifetime management vs. C APIs' explicit lifetime management, since there won't always be a convenient preexisting mechanism to bridge the gap.
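A minimal sketch of what that buys you (assuming a toolchain where the std::os::fd module is stable):

    use std::fs::File;
    use std::os::fd::{AsFd, BorrowedFd};

    fn main() -> std::io::Result<()> {
        let f = File::open("foo")?;
        // BorrowedFd carries a lifetime tied to `f`, so the descriptor
        // cannot outlive the File that owns it.
        let fd: BorrowedFd<'_> = f.as_fd();
        // let fd = File::open("foo")?.as_fd(); // rejected at compile time:
        //                                      // temporary dropped while borrowed
        println!("{fd:?}");
        Ok(())
    }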
I'm not sure if there is a document that mentions this in particular, but it is a consequence of how lifetimes work. The core issue is that .as_raw_fd() takes &File and returns a plain integer, which carries no lifetime information. As a result, the temporary File is dropped at the end of the statement, and thus the number you got from .as_raw_fd() is invalidated.
This does happen elsewhere in Rust, but often when you have methods on &self that return something you use later, the method returns something with the same lifetime (fn foo(&'a self) -> Foo<'a>), and thus the original object is kept alive until the end of the scope. It just so happens that file descriptors are tied to the lifetime of the File in a way that Rust can neither express nor detect.
I don't know if clippy has a warning for this particular case. It might be useful to add it.
> Those Rust and Go bugs aren't much different from C gotchas when writing portable UNIX code.
Maybe, but:
1. It's irrelevant to this project (no one is writing portable UNIX code when they are writing Linux-specific software, like container implementations).
and
2. It's irrelevant to the author's goals (learning Linux kernel stuff using the language that the interface to the kernel uses is a better idea than using a different language and hacking shims for all the stuff you want to do).
and
3. The cost to switch to a new language is substantial, and only makes sense if you're either joining a team and project that uses that new language, or if the goal is to learn that new language.
It depends what you mean by "container". As far as I know, Windows containers don't use namespaces, cgroups, and seccomp. BSD jails are definitely a different thing. So if you want to know how exactly Linux containers work, it's probably easiest to use what the Linux docs provide (which is C).
No, that only pins the current goroutine to a single OS thread (which is needed for some APIs -- namely, all of the other namespace APIs and some thread-related APIs).
There is no way to make an entire Go program run as a single-threaded program without using CGo the way we do in runc. Even GOMAXPROCS=1 doesn't work. CLONE_NEWUSER will always fail in a multi-threaded program.
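The restriction comes from the kernel rather than from Go, so it's easy to demonstrate from any language; a minimal Rust sketch using the libc crate:

    use std::{thread, time::Duration};

    fn main() {
        // Once a second thread exists, the process is multi-threaded
        // for the lifetime of that thread.
        let _t = thread::spawn(|| thread::sleep(Duration::from_secs(1)));
        let ret = unsafe { libc::unshare(libc::CLONE_NEWUSER) };
        // unshare(2): CLONE_NEWUSER requires a single-threaded process,
        // so this fails with EINVAL.
        assert_eq!(ret, -1);
        assert_eq!(
            std::io::Error::last_os_error().raw_os_error(),
            Some(libc::EINVAL)
        );
    }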
I can't answer for the developer, but the answer to that with most small one-person-show projects is familiarity/comfort/ability.
The head-space that adopting a new language for a specific project takes is immense compared to tackling it in a familiar language you already know you're capable in. There is rarely a benefit to doing so outside of team environments where a certain level of on-boarding is expected, or when you have a really niche language requirement/feature that your project is begging for.
I came across this last week when reading about different container runtimes -- crun is implemented in C[0].
Their explanation:
"While most of the tools used in the Linux containers ecosystem are written in Go, I believe C is a better fit for a lower level tool like a container runtime. runc, the most used implementation of the OCI runtime specs written in Go, re-execs itself and use a module written in C for setting up the environment before the container process starts.
crun aims to be also usable as a library that can be easily included in programs without requiring an external process for managing OCI containers."
At least in Spanish, my rule of thumb is that "barco" is for bigger boats and "barca" is for smaller ones (and then you have Barça which is the football/soccer team).
> barco enforces a minimal set of restrictions to run untrusted code, which is not recommended for production use, where a more robust solution should be used.
Aren't containers categorically unsuitable for running untrusted code? You need AppArmor, bwrap, or something similar, AFAIK.
bwrap is a container tool, and AppArmor is used by basically every container runtime if the system is using AppArmor (otherwise they use SELinux). Seccomp is also enabled by default, and I would argue it is a more significant protection against container breakouts, because it also protects against kernel 0-days and doesn't rely on LSM hooks to block operations. The real question is whether you are using user namespaces.
Jessica Frazelle ran a public bug bounty to break out of a container image that was properly secured, and as far as I know nobody collected the bounty. The website isn't up at the moment; maybe she took it down. https://contained.af/
If built to spec, then the various container technologies in the kernel used together are theoretically secure. It closes all of the holes that we know about, aside from a few trivial things like the container spying on process id numbers on the host, and of course the vast potential to accidentally misconfigure it.
However, all this code is quite complex, and the kernel and the software ecosystem are lacking in having a layered approach to security that goes all the way down to the low-level nitty gritty stuff. For example, kernel memory structures are not robustly protected against the usual memory exploits, and there isn't as strong W^X protection as desired. Windows, in contrast, is able to provide layered security through a variety of approaches, including running the entire operating system in a virtual machine, with the host ensuring integrity of kernel memory. These sorts of layered approaches to security are desirable because there will always be defects in any complex software.
Side note: AppArmor and bwrap are distinct. Bubblewrap is a relatively simple userspace tool that makes use of existing kernel containerization features (the same ones that Docker/Podman use), whereas AppArmor and SELinux are new security features that are patched into the kernel itself. AppArmor and SELinux have made some progress in adding layered low-level security to the kernel, but it's not particularly impressive. Bubblewrap has done great work in exposing the kernel's existing tech to users, but it is not a fundamental improvement to the kernel itself.
> aside from a few trivial things like the container spying on process id numbers on the host
Containers with their own PID namespace can't spy on process IDs on the host, though? Not sure what you mean here.
> and there isn't as strong W^X protection as desired
What level is desired? Bootup warnings for W^X got merged a while ago. Changes that try to include anything violating it are rejected (see bcachefs).
> Windows, in contrast, is able to provide layered security through a variety of approaches, including running the entire operating system in a virtual machine, with the host ensuring integrity of kernel memory.
What? Xen has existed for years; that's not "in contrast". Secure Boot and lockdown exist on Linux too. There are also per-service Firecracker microVMs.
> whereas AppArmor and SELinux are new security features that are patched into the kernel itself
That's very misleading. They're not new - SELinux is over two decades old. They're also not "patched in" - LSMs have been integrated into Linux for a very long time, with multiple implementations available. SELinux supports multilevel security, created for government use. It's quite impressive, actually.
> That's very misleading. They're not new - selinux is over 2 decades old.
I misspoke on SELinux and AppArmor being "new". What I was getting at is that they are distinct kernel features, compared to bwrap which is just a user of kernel features already familiar in this discussion. So "new", as in, "additional", e.g. "We turned up some new evidence from the old files".
And yes, SELinux is included as a first-class kernel feature. AppArmor is a bit different because it still has a lot of hurdles before all its features can be upstreamed. However, upstreaming is not the end-all-be-all so it's not necessarily a bad thing that parts of AppArmor are patched in, so I'm not emphasizing this point at all.
> What? Xen existed for years, that's not "in contrast".
Does the Xen or KVM ecosystems provide anything comparable to Windows hypervisor-enforced code integrity? That is, the host is aware of what kernel memory needs to be set to read-only or checked regularly for corruption or irregularities, in a system that is impossible to interfere with without a VM break. (https://learn.microsoft.com/en-us/windows-hardware/design/de... )
Secure VMs are great, VMs that are actually monitoring and enhancing the security of the code running inside are even better.
> Does the Xen or KVM ecosystems provide anything comparable to Windows hypervisor-enforced code integrity?
Xen can do it; KVM had some attempts at memory-enforcement patches, but I'm not sure where that ended up. I'm also not sure how well utilised it is from the guest side out of the box (poorly, I think).
I would probably point at a virtual machine for a convenient place to run untrusted code. It's not perfect -- there are VM escapes -- but it's more convenient than a dedicated, air-gapped machine.
Depends what you mean by suitable. If you run the service as a new user, it's more secure than running without a new namespace (you're isolated from other apps) and potentially less secure than running on host (one more layer of indirection for system resource access).
Since in reality most attacks will be against your app itself before the attacker has direct access to syscalls, I see namespaces/containers as extra protection.
This is tautologically true -- "Is X secure? Yes, assuming the technology X uses is secure."
The more nuanced answer is that containers have several layers of protections (seccomp, LSMs, user namespaces, namespaces, cgroups, capabilities, and standard process permissions by running as an unprivileged user) which all act together to help protect against container attacks. It's not perfect, but most container breakout attacks we've had so far are related to when container runtimes have to operate on a container during process setup (IMHO because the process for creating a container process is far from atomic) -- some of these attacks were enabled by kernel bugs which we went and fixed as well. It is very difficult to break out of a container once it has been configured and left alone.
Possibly, but I'd say Google's experience with kCTF was that allowing io_uring on hosts running containers led to multiple breakouts. They paid out over $1m in io_uring-related bug bounties https://security.googleblog.com/2023/06/learnings-from-kctf-...
Also, while user namespaces help in theory, in practice expanding the attack surface of the kernel exposed to unprivileged users has consequences that allow container breakout (e.g. CVE-2022-0185 and, more recently, CVE-2023-3390).
io_uring was blocked by the default Docker seccomp profile until somewhat recently. The primary issue with io_uring is that there is no mechanism to apply seccomp-like rules to the operations submitted through a ring.
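For reference, blocking it at the syscall boundary (as the Docker profile effectively did) is simple. A hedged, minimal sketch using the libc crate, x86-64 only; a production filter must also validate seccomp_data.arch before trusting the syscall number:

    use libc::{
        prctl, sock_filter, sock_fprog, BPF_ABS, BPF_JEQ, BPF_JMP, BPF_K, BPF_LD,
        BPF_RET, BPF_W, PR_SET_NO_NEW_PRIVS, PR_SET_SECCOMP, SECCOMP_MODE_FILTER,
        SECCOMP_RET_ALLOW, SECCOMP_RET_ERRNO,
    };

    fn main() {
        // Load seccomp_data.nr (offset 0), fail io_uring_setup(2) with
        // ENOSYS, and allow everything else. Note this only stops new
        // rings from being created; operations submitted through an
        // already-existing ring cannot be filtered this way.
        let filter = [
            sock_filter { code: (BPF_LD | BPF_W | BPF_ABS) as u16, jt: 0, jf: 0, k: 0 },
            sock_filter {
                code: (BPF_JMP | BPF_JEQ | BPF_K) as u16,
                jt: 0,
                jf: 1,
                k: libc::SYS_io_uring_setup as u32,
            },
            sock_filter {
                code: (BPF_RET | BPF_K) as u16,
                jt: 0,
                jf: 0,
                k: SECCOMP_RET_ERRNO | libc::ENOSYS as u32,
            },
            sock_filter { code: (BPF_RET | BPF_K) as u16, jt: 0, jf: 0, k: SECCOMP_RET_ALLOW },
        ];
        let prog = sock_fprog {
            len: filter.len() as u16,
            filter: filter.as_ptr() as *mut sock_filter,
        };
        unsafe {
            // Required so an unprivileged process may install a filter.
            assert_eq!(prctl(PR_SET_NO_NEW_PRIVS, 1u64, 0u64, 0u64, 0u64), 0);
            assert_eq!(
                prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER as u64, &prog as *const sock_fprog),
                0
            );
        }
    }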
> Also while user namespaces help in theory
You should use user namespaces to contain untrusted code; you absolutely should not enable CLONE_NEWUSER inside a container. I was referring to the former, you're talking about the latter.
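Setting up the former is cheap; a hedged sketch of a single-threaded, unprivileged process entering its own user namespace (libc crate; the proc writes follow user_namespaces(7)):

    use std::fs;

    fn main() -> std::io::Result<()> {
        let uid = unsafe { libc::getuid() };
        let gid = unsafe { libc::getgid() };
        if unsafe { libc::unshare(libc::CLONE_NEWUSER) } != 0 {
            return Err(std::io::Error::last_os_error());
        }
        // Unprivileged processes must deny setgroups before writing gid_map
        // (Linux >= 3.19), then map exactly one uid/gid into the namespace.
        fs::write("/proc/self/setgroups", "deny")?;
        fs::write("/proc/self/uid_map", format!("0 {uid} 1"))?;
        fs::write("/proc/self/gid_map", format!("0 {gid} 1"))?;
        // From here, the process looks like root inside the namespace but
        // holds no extra privileges on the host.
        Ok(())
    }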
For io_uring, I feel like it's a combination of bypassing seccomp and the complexity of the code, which seems to be fertile ground for privilege-escalation vulnerabilities.
I'd agree that unprivileged user namespaces should not be available inside containers. However, the default stances of some Linux distros (enable unprivileged user namespaces) and Kubernetes (disable the CRI's seccomp filter by default) mean that an awful lot of environments will end up in a situation where this is possible.
When I did a talk about Docker, I also wanted to show a bit of what it does under the hood without going through all the layers and without too much detail. This ~120-line shell script is really good at providing just an intro into what's needed for containers: https://github.com/p8952/bocker/blob/master/bocker
(not mine)
Question: for sandboxing untrusted code, should I invest time in learning more Linux container stuff or switch to learning WASI? I am inclined towards WASI myself.
I have more faith in WASI. Linux containers are inherently in a whack-a-mole position: they're trying to retrofit security onto something that was never built for it, which very rarely works.
There's a huge performance hit for many programming languages when you run them inside a WASM runtime. Memory also behaves very differently from normal applications. Autovectorization also isn't universally supported by WASM compilers yet, which can be costly for performance.
Properly configured WASI runtimes are great for security, but they're worse than containers on most other fronts. I don't think the downsides make sense unless you're building a business that lets random customers upload WASM files you execute.
Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.
I still think it is tangential. The author stated that they wrote this project to learn. The readme says that it is not intended for production use and that there is no networking set up in the containers.
With that context, I doubt that name collisions outside of the containers space are top of mind.
Why would there be any conflict? They're in completely different domains so there is no potential for confusion. Why would copyright be relevant in any way? Do you even know what copyright is?
Legally it's probably fine, but as the Belgian American Radio Corporation has been around since the 1930s, the title of the article did confuse me: my first guess was something about how Barco was using Linux containers in their products.