You can call cap_enter(), which disables open(), unlink(), mkdir(), etc. entirely. You can, however, still use openat(), unlinkat(), mkdirat() with relative paths that resolve to locations underneath a directory file descriptor. This achieves the same effect as chroot, except that you can now have as many "chroots" as you want, not just one.
Unfortunately, the idea never caught on, because virtually no software on UNIX uses the *at() functions. Also: the non-*at() functions are still available as symbols, meaning that you can't perform simple compile-time checks to ensure that your application works properly when this form of sandboxing is enabled. Turns out that off-the-shelf software (e.g., libraries) ends up misbehaving in unpredictable ways if you disable ~50% of the POSIX API.
It's a shame, because this feature effectively requires you to treat the file system in an object-oriented, dependency-injected way. Pretty good from a reusability/testability perspective.
It's hard for me to blame programmers for not using these functions more when hardly any language properly exposes them. But since nobody exposes them, nobody's aware they should use them... chicken & egg strikes again.
It's the difference between

dirHandle = open("some path");
fileInDir = openat(dirHandle, "some file");

and

dir = open("some path");
// examine the directory, then
fileInDir = open("some path/some file");
In the first case, you have atomically safe operations: you either get the directory or you don't, then either get the file handle or you don't, and once you have the handle nobody else can take it from you, even if they rename the file out from under you. It means that if you are writing logic like "if the file is setuid, do this", there's no way for an external process to wedge in between the check and the use.
In other words, you ought to be able to not just read from a file handle, but also open relative to the handle directly, and do all those other things. Any API that operates in terms of paths is pretty much intrinsically open to TOCTOU, because any time you "check" a path vs. "use" the path, which is fairly common, you have a window of opportunity for lossage. I'm not sure I've yet seen a non-C way of doing this built into a standard library.
Also... before you jump in with some "what ifs": no, these functions don't magically make your code more secure. You still have to use them correctly, and it's still pretty easy to let path-based logic slip back in accidentally. They don't make insecure code secure; they make code that was guaranteed insecure (in security-sensitive contexts; obviously, a lot of the time this isn't a security issue) possible to write securely.
> In the second case, between those two lines, you can have something else jump in and modify or remove or repermission or whatever the "some file".
Modifying/removing/repermissioning "some file" is still possible even with openat() if you do it between the time you open("some path") and the time you call openat(dirfd, "some file"). There is still a race condition there in either case if you are examining the contents of the directory (e.g. "stat"ing the file and then calling openat). You can also modify/repermission "some path" as well. The only thing openat() protects you from is removing/replacing "some path" (not "some file"), and I agree that that is valuable for security purposes.
This is part of what I was trying to head off with my parenthetical. You still have to use it correctly to do secure things. But at least it's possible. This kind of security is basically impossible with pure path-based APIs. Plus, as mentioned elsewhere, there are some additional flags you can use for even more security that you can't get out of an API that is "open(filename)", simply because that API is mathematically incapable of carrying such flags (assuming you don't start trying to encode them in the filename itself, but that way lies madness).
filefd, err := syscall.Openat(int(dir.Fd()), filename, os.O_RDONLY, 0)
if err != nil { log.Fatal(err) }
file := os.NewFile(uintptr(filefd), filename) // for use with library functions
For example, RESOLVE_IN_ROOT "is as though the calling process had used chroot(2) to (temporarily) modify its root directory (to the directory referred to by dirfd)".
If you need to make network connections, you have to do that before entering capability mode, because there is no capability that allows creating them later. You can work through a proxy program, but adding that complexity doesn't seem worthwhile to me unless the program being sandboxed is very complex.
I haven't worked with OpenBSD's pledge, but the idea of being able to give up the use of specific dangerous facilities seems more widely applicable.
I would love it if all network connections of all programs were created through a proxy. It would allow me to do load balancing, firewalling, tunneling, packet capturing, etc. etc. etc. entirely in userspace, without needing to rely on administrative features like pf/iptables, tun/tap, bpf, etc..
You see that in Kubernetes land folks are trying to achieve the same thing by using so-called service meshes (e.g., https://istio.io ). Right now those systems launch a proxy next to every container. For projects like these, it would have been so much easier if UNIX-like systems already had a standard for making the network stack used by a program injectable.
So if you start with a system that has some form of persistent objects, then very quickly a root namespace object is created to solve those library issues.
And then you are mostly back to a Unix root directory.
A single jailed root is where you end up when you take the route of putting software into sandboxes for which they weren't designed, because now you need to emulate a traditional environment.
pledge and unveil are a middle ground, albeit closer to Capsicum, in that they're much more accommodating of existing software patterns. But they do still require application refactoring. OpenBSD has refactored their entire userland codebase this way. That typically involves identifying the necessary resources a program needs and either shifting their acquisition to before privilege dropping (i.e. early in main), or arranging so that they're subsequently accessible (e.g. using unveil).
It's a shame Linux never merged the Capsicum patches. While pledge and unveil are more convenient from a developer perspective, they can't easily be adopted in a standardized way by other operating systems, like Linux. Capsicum was the closest thing we could have gotten to a standardized sandboxing model in the POSIX universe. If it became widely available (cough Linux), I believe a large chunk of software, especially critical network-facing software, would slowly migrate; and an ecosystem of idioms, patterns, and libraries would evolve to increasingly smooth the transition.
What's doubly shameful is that Capsicum is architecturally extremely simple. In principle it would be easy for any POSIX system to adopt. The APIs are trivial, and Linux is already nearly there now that it has process descriptors and an openat that can prevent parent directory traversal. Most of the leg work is in blocking access, after cap_enter has been invoked, to non-standard interfaces and syscalls that expose resources.
Not sure if anything changes for symlinks.
$ mkdir rootfs
$ docker export $(docker create ubuntu:20.04) | tar -C rootfs -xf -
$ unshare -r chroot rootfs bash
# ls /
bin dev home ...
$ mkdir -p root/bin
$ cp /bin/busybox root/bin/
$ bwrap --bind root / /bin/busybox sh
BusyBox v1.27.2 (Ubuntu 1:1.27.2-2ubuntu3.3) built-in shell (ash)
Enter 'help' for a list of built-in commands.
/ $ ls -l /
drwxrwxr-x 2 1000 1000 60 Jul 22 11:07 bin
It can do a few things userns can't, like load kernel modules. I've had to use this to deal with bugs in Btrfs before.
(another good site: https://wiki.archlinux.org/title/User-mode_Linux )
I don't think containers should be needed for that.
unshare --user --mount --map-root-user chroot /path/to/whatever
$ unshare --user --mount --map-root-user
# mount --bind /proc /path/to/whatever/proc
# mount --bind /sys /path/to/whatever/sys
# chroot /path/to/whatever
If you want to do this at scale, a handy tool is bwrap(1) from https://github.com/containers/bubblewrap . (The README talks about how bwrap is a setuid program to prevent the need for that sysctl, but it also works great as a non-setuid program when that sysctl is enabled, and its value is it has a bunch of handy command-line flags for this sort of thing. We use it extensively at my workplace in non-setuid mode for things that don't quite need containers but need to see alternative root directories etc.)
(And I say "vanishing fraction" relative to the pool of developers as a whole; even if a particular subcommunity uses it extensively that doesn't make it a pervasive request. I can name subcommunities with all sorts of exotic interests that have not penetrated the mainstream yet, like the capabilities-based security community. Someday, when that emerges, we'll all point back to E as a pioneer, but in the meantime, effectively nobody wants it right now.)
Yes, you can do this on Linux with a user namespace, but a user namespace changes the view of user accounts. You have to map every usable UID inside the namespace to a UID you control outside the namespace. At best, you can map a range of UIDs you control to "real" users (root, 1000, etc.) inside the namespace, but they won't be real users outside the namespace. If you're on a multi-user system, seeing other people's files as owned by "nobody" is confusing.
It should be enough to use NO_NEW_PRIVS mode, meaning setuid transitions are not allowed. Then it doesn't matter what user IDs you see inside the chroot.
In fact, back when Linux introduced the NO_NEW_PRIVS flag (almost a decade ago!), this was one of the motivating use cases.
The problem with traditional chroot is that you can typically import setuid applications into the new root, where they can get confused, for example by a new /etc/passwd file. For this reason, chroot can be used only by root.
The advantage of such a NO_NEW_PRIVS flag is that this kind of abuse of setuid applications is not possible.
This should make it safe to allow ordinary users to use chroot.
Some people like to run for example FTP servers in a chroot so that users have access only to a specific directory and its subdirectories, rather than being able to browse other files on the system.
FreeBSD also has a technology called jails which is what you’d rather use for containerization.
Anyway, previously you had to be root (the Unix admin user) in order to use chroot. With FreeBSD now implementing unprivileged chroot, regular users are able to run processes in a chroot as well.
So, for example, if you are a regular user on a system, you can now create a subdirectory in your home directory and run an FTP daemon chrooted to that directory and bound to an unprivileged port. You can then give someone else FTP access to that directory without them being able to see the other files in your home directory, keeping your private data private from them.
$ mkdir /tmp/etc
$ echo root::0:0::/:/bin/sh > /tmp/etc/passwd
$ mkdir /tmp/bin
$ cp /bin/sh /tmp/bin/sh
$ cp /bin/chmod /tmp/bin/chmod
$ cp /bin/login /tmp/bin/login
$ chroot /tmp /bin/login
# chmod 4755 /bin/sh
Now, log out of the chroot and use your newly minted setuid shell.
Can someone enlighten me if this will be part of FreeBSD 14, or if there is a chance it will become available earlier, perhaps with FreeBSD 13.1?
EDIT: The commit message does NOT indicate etc. Silly me.
Also, you may look for the commit hash (a40cf4175c90142442d0c6515f6c83956336699) at https://mfc.kernelnomicon.org/ to see the back-porting status.