Hacker News
Implement unprivileged chroot (freebsd.org)
231 points by 0mp 6 days ago | 57 comments

FreeBSD already supported something like this, in effect, but in what is in my opinion a better way.

You can call cap_enter(), which disables open(), unlink(), mkdir(), etc. entirely. You can, however, still use openat(), unlinkat(), mkdirat() with relative paths that expand to a location underneath a directory file descriptor. This achieves the same thing, except that you can now have as many chroots as you want. Not just one.

Unfortunately, the idea never caught on, because virtually no software on UNIX uses the *at() functions. Also: the non-*at() functions are still available as symbols, meaning that you can't perform simple compile-time checks to ensure that your application works properly when this form of sandboxing is enabled. It turns out that off-the-shelf software (e.g., libraries) ends up misbehaving in unpredictable ways if you disable ~50% of the POSIX API.

It's a shame, because this feature effectively requires you to treat the file system in an object oriented/dependency injected way. Pretty good from a reusability/testability perspective.

One of my minor disappointments with Go, considering the time it came out and the UNIX heritage it descended from, was that it didn't prioritize the *at() functions. It's difficult, if not virtually impossible, to write secure code with the "traditional" path-based API, because every time you do one thing, and then some other thing, to a path that has security implications, you've written a TOCTOU problem: anybody who can wedge in between those two operations can change some critical aspect of the file.

It's hard for me to blame programmers for not using these functions more when hardly any language properly exposes them. But since nobody exposes them, nobody's aware they should use them.... chicken & egg strike again.

But openat, for example, is still path-based; it just changes the directory that the path is relative to. If you give it an absolute path, it will open it, and I didn't see any reason in the man page why you couldn't just pass in a bunch of ../../ as the usual exploits do. Maybe you're referring to another category of bugs?

Sorry, I was unclear. Too much context in my head from the times I've jousted with this and I forgot to contextualize properly. (Which is ironic since part of my complaint is precisely that too few people know this stuff.) That family of functions allows you to open things based on handles more easily. So you can open a directory, and while holding on to the handle for that directory, know that you are still in that directory, even potentially open files in that directory and then, once you do that, know that you have a file in that directory (or, atomically, don't).

It's the difference between

     dirHandle = open("some path");
     fileInDir = openat(dirHandle, "some file");

     dir = open("some path");
     // examine the directory, then
     fileInDir = open("some path/some file");

In the second case, between those two lines, something else can jump in and modify, remove, or re-permission "some file". It has never been the largest security issue, but it's been a running undercurrent of security issues for decades.

In the first case, you have atomically-safe operations; you either get the directory or don't, then either get the file handle or don't, etc, and once you have the handle nobody else can take it from you, even if they rename the file under you, etc. It means that if you are writing logic like "if the file is setuid, do this", there's no way for an external process to wedge in between the two things.

In other words, you ought to be able to not just read from a file handle, but also open relative to the handle directly, and do all those other things. Any API that operates in terms of paths is pretty much intrinsically open to TOCTOU, because any time you "check" a path vs. "use" the path, which is fairly common, you have a window of opportunity for lossage. I'm not sure I've yet seen a non-C way of doing this built into a standard library.

Also... before you jump in with some "what ifs", no, these functions don't magically make your code more secure. You still have to use them correctly and it's still pretty easy to mistakenly let path-based logic slip in accidentally even so. It doesn't make insecure code secure; it makes guaranteed insecure (in security-sensitive contexts, obviously a lot of time this isn't a security issue) code possible to write securely.

Makes sense, but I think that you only gain safety when you are checking attributes of the directories leading to the file, but not when you are checking the file itself. For example, you said

> In the second case, between those two lines, you can have something else jump in and modify or remove or repermission or whatever the "some file".

Modifying/removing/repermissioning "some file" is still possible even with openat() if you do it between the time you open("some path") and openat("some file"). There is still a race condition there in either case if you are examining the contents of the directory (e.g. "stat"ing the file and then calling openat). You can also modify/repermission "some path" as well. The only thing openat() protects you from is removing/replacing "some path" (not "some file") and I agree that that is valuable for security purposes.

'Modifying/removing/repermissioning "some file" is still possible even with openat() if you do it between the time you open("some path") and openat("some file").'

This is part of what I was trying to head off with my parenthetical. You still have to use it correctly to do secure things. But at least it's possible. This kind of security is basically impossible with pure path-based APIs. Plus, as mentioned elsewhere, there are some additional flags you can use for even more security that you can't get out of an API that is "open(filename)", simply because that API is mathematically incapable of carrying such flags (assuming you don't start trying to encode them in the filename itself, but that way lies madness).

It's doable when you need it, something like:

    filefd, err := syscall.Openat(int(dir.Fd()), filename, os.O_RDONLY, 0)
    if err != nil { /* handle the error */ }
    file := os.NewFile(uintptr(filefd), filename) // for use with library functions

For what it's worth, Linux 5.6 introduced openat2 [1] which accepts some additional flags controlling path resolution.

For example, RESOLVE_IN_ROOT "is as though the calling process had used chroot(2) to (temporarily) modify its root directory (to the directory referred to by dirfd)".

[1] https://man7.org/linux/man-pages/man2/openat2.2.html

He was - TOCTOU has its own wiki page [1]. These can be nastier, because they don't require the attacker to be able to submit strings or file names.

[1] https://en.wikipedia.org/wiki/Time-of-check_to_time-of-use

I guess I'm not sure how you would use open() in a way that would expose a TOCTOU bug that openat() wouldn't. Can you give an example?

I was unclear. See my other cousin reply; you can't use it yourself to have a directory handle and securely open files in that directory. You can only open things by path.

I’m confused. How would using *at() APIs prevent race conditions?

FWIW (and iirc), with programs using recent-ish glibc, you will never see a call to open() in the wild unless the program takes special care to bypass the implicit libc wrapper. glibc will transparently convert these calls to openat() under its own hood. I do notice that this probably doesn't do you any good on FreeBSD, though :)

This is mostly true on FreeBSD as well. The real problem is that capability mode also disallows openat(AT_FDCWD) - there has to be an explicit directory descriptor.

Capabilities mode is useful, but it's very difficult to apply to programs that don't fit the model.

If you need to make network connections, you have to do that before entering capabilities mode, because there is no capability to allow it later. You can work through a proxy program, but adding that complexity doesn't seem worthwhile to me unless your program to be sandboxed is very complex.

I haven't worked with OpenBSD's pledge, but the idea of being able to give up specific dangerous abilities seems more widely applicable.

> You can work through a proxy program, but adding that complexity doesn't seem worthwhile to me unless your program to be sandboxed is very complex.

I would love it if all network connections of all programs were created through a proxy. It would allow me to do load balancing, firewalling, tunneling, packet capturing, etc. etc. etc. entirely in userspace, without needing to rely on administrative features like pf/iptables, tun/tap, bpf, etc..

You see that in Kubernetes land folks are trying to achieve the same thing by using so-called service meshes (e.g., https://istio.io ). Right now those systems launch a proxy next to every container. For projects like these, it would have been so much easier if UNIX-like systems already had a standard for making the network stack used by a program injectable.

That's an interesting thought, but you'd probably end up with many different (captive) proxy programs that enabled the different types of sockets their clients needed, so it likely wouldn't be any easier than say LD_PRELOADing all the libc socket calls, or one of the tap/tun things and/or some sort of network namespace.

Mildly off-topic note, the parent is the author of CloudABI (https://github.com/NuxiNL/cloudlibc), which was (in my opinion) a truly brilliant approach to running untrusted code in a FreeBSD system.

The problem is that many libraries need access to configuration files or other stuff that comes with the library.

So if you start with a system that has some form of persistent objects, then very quickly a root namespace object is created to solve those library issues.

And then you are mostly back to a Unix root directory.

cap_enter can be invoked after library initialization. Libraries can open the files and directories they need during initialization.

A single jailed root is where you end up when you take the route of putting software into sandboxes for which they weren't designed, because now you need to emulate a traditional environment.

pledge and unveil are a middle ground, albeit closer to Capsicum, in that they're much more accommodating of existing software patterns. But they do still require application refactoring. OpenBSD has refactored their entire userland codebase this way. That typically involves identifying the necessary resources a program needs and either shifting their acquisition to before privilege dropping (i.e. early in main), or arranging so that they're subsequently accessible (e.g. using unveil).

It's a shame Linux never merged the Capsicum patches. While pledge and unveil are more convenient from a developer perspective, they can't easily be adopted in a standardized way by other operating systems, like Linux. Capsicum was the closest thing we could have gotten to a standardized sandboxing model in the POSIX universe. If it became widely available (cough Linux), I believe a large chunk of software, especially critical network-facing software, would slowly migrate; and an ecosystem of idioms, patterns, and libraries would evolve to increasingly smooth the transition.

What's doubly shameful is that Capsicum is architecturally extremely simple. In principle it would be easy for any POSIX system to adopt. The APIs are trivial, and Linux is already nearly there now that it has process descriptors and an openat that can prevent parent directory traversal. Most of the leg work is in blocking access, after cap_enter has been invoked, to non-standard interfaces and syscalls that expose resources.

You would need to standardize passing of current root as a file handle, I think? Probably will break some software...

Why not treat open(path) as openat(AT_FDCWD,path)?

Because cap_enter() blocks that too.

Specifically, it blocks going higher than the handle, so using either absolute paths or paths with a ".." component.

Not sure if anything changes for symlinks.

On many Linux distros you can already do this with user namespaces:

    $ mkdir rootfs
    $ docker export $(docker create ubuntu:20.04) | tar -C rootfs -xf -
    $ unshare -r chroot rootfs bash
    # ls
    bin   dev  home ...

Very often when you use chroot you also want unprivileged mounts, in particular overlay mounts if you don't want to mutate the underlying rootfs. You can do that with mount namespaces (`unshare -rm`), but you need Linux kernel 5.13 (or a distro with a patched kernel, like Ubuntu) to allow unprivileged overlayfs.

An alternative to unshare is also bubblewrap (https://github.com/containers/bubblewrap) which also sets up a new namespace. You can build up your own new filesystem by binding existing paths into the new root and then run a process within it:

    $ mkdir -p root/bin
    $ cp /bin/busybox root/bin/
    $ bwrap --bind root / /bin/busybox sh

    BusyBox v1.27.2 (Ubuntu 1:1.27.2-2ubuntu3.3) built-in shell (ash)
    Enter 'help' for a list of built-in commands.

    / $ ls -l /
    total 0
    drwxrwxr-x    2 1000     1000            60 Jul 22 11:07 bin

I used bubblewrap to build lightweight containers on top of Arch + pacman. Basically you could install packages on overlays of the host and do whatever there without affecting the host fs. It was pretty nice.

So how does this work? Can you mount / as the lower layer of the overlayfs? Doesn't that create a weird recursion because the mountpoint is a path inside /?

I used unionfs first to combine the sandbox and the host /, with the host / read-only. Then I simply bubblewrapped into it. I also mounted / at /host in case you wanted to access the host fs from inside the sandbox for some reason.

Interesting. Going to check out Bubblewrap

As an alternative, one can also use User Mode Linux (UML) to implement a pretty fancy chroot (and fakeroot).

It can do a few things userns can't, like loading kernel modules. I've had to use this to deal with bugs in Btrfs before.

I love UML. I used to use it all the time. Is it still developed? It was really a pretty slick system and very easy to work with.


(another good site: https://wiki.archlinux.org/title/User-mode_Linux )

It's in mainline kernel (can be compiled as an architecture), but I don't think it's being actively developed.

*BSD have been quite innovative recently. The pledge and unveil syscalls, although achievable by other means on Linux, are very simple and effective for what they do. I don't know of a way on Linux to chroot into a directory without being root; even if it were possible, I'd still need root to mount --bind some dirs, but it's definitely something I'd like to do.

I don't think containers should be needed for that.

On Linux, you can do

    unshare --user --mount --map-root-user chroot /path/to/whatever
and if you need to bind-mount some directories, you can do that before the chroot, e.g.,

    $ unshare --user --mount --map-root-user
    # mount --bind /proc /path/to/whatever/proc
    # mount --bind /sys /path/to/whatever/sys
    # chroot /path/to/whatever
without being root. (This requires a sysctl to be enabled for unprivileged user namespaces, which is on by default in the kernel.org tree and I think all major distro kernels have it on now. The feature has been in the upstream kernel since 2013.)

If you want to do this at scale, a handy tool is bwrap(1) from https://github.com/containers/bubblewrap . (The README talks about how bwrap is a setuid program to prevent the need for that sysctl, but it also works great as a non-setuid program when that sysctl is enabled, and its value is it has a bunch of handy command-line flags for this sort of thing. We use it extensively at my workplace in non-setuid mode for things that don't quite need containers but need to see alternative root directories etc.)

"containers" are just a combination of multiple kernel features, one of which does precisely that (user namespaces).

And were known as vaults on HP-UX 11, back in 2000.

Arguably, the issue with these features isn't their existence, since adding them to a kernel isn't even that hard, relative to the difficulty of adding anything to a kernel. The problem has been the need for mass awareness of, and desire for, the feature, and that's what's taken multiple decades to emerge. It does no good for a kernel to have a security feature that only a vanishing fraction of developers care about and use.

(And I say "vanishing fraction" relative to the pool of developers as a whole; even if a particular subcommunity uses it extensively that doesn't make it a pervasive request. I can name subcommunities with all sorts of exotic interests that have not penetrated the mainstream yet, like the capabilities-based security community. Someday, when that emerges, we'll all point back to E as a pioneer, but in the meantime, effectively nobody wants it right now.)

Sounds like Jails in FreeBSD. Wikipedia says they were added in 1999.

And Zones on Solaris :) phk was the original author of Jails; he wrote an excellent paper called “Defying the omnipotent root”, which I can highly recommend.

*Confining the omnipotent root

Yes indeed! Brain fart, apologies.

And LPARs on System z :)

Both LPARs and z/VM look more like hypervisors to me. Things like containers and chroot probably don't make much sense in the mainframe world since they already had granular facilities to limit access to networks, data sets, etc.

And VMs on IBM VM/370.

Aren’t lpars quite a lot different in nature than zones and jails though?

I wish Linux would do this. Patches are available: https://lwn.net/Articles/849125/

Yes, you can do this on Linux with a user namespace, but a user namespace changes the view of user accounts. You have to map every usable UID inside the namespace to a UID you control outside the namespace. At best, you can map a range of UIDs you control to "real" users (root, 1000, etc.) inside the namespace, but they won't be real users outside the namespace. If you're on a multi-user system, seeing other people's files as owned by "nobody" is confusing.

It should be enough to use NO_NEW_PRIVS mode, meaning setuid transitions are not allowed. Then it doesn't matter what user IDs you see inside the chroot.

In fact, back when Linux introduced the NO_NEW_PRIVS flag (almost a decade ago!), this was one of the motivating use cases.

For those, like me, lacking context, what are the implications of this?

The key feature of chroot is that you can provide a process with a completely different filesystem view. You can leave stuff out that exist in the standard view, or change things. Change the contents of system directories.

The problem with traditional chroot is that you can typically bring setuid applications into this new space, where they can get confused, for example by a crafted /etc/passwd file. For this reason, chroot can be used only by root.

The advantage of such a NO_NEW_PRIVS flag is that this kind of abuse of setuid applications is not possible.

This should make it safe to allow ordinary users to use chroot.

chroot is a system call that assigns a limited view of the file system to a process. In particular it makes it so that the specific directory will appear as the top level directory to the process.

Some people like to run for example FTP servers in a chroot so that users have access only to a specific directory and its subdirectories, rather than being able to browse other files on the system.

FreeBSD also has a technology called jails which is what you’d rather use for containerization.

Anyway, previously you had to be root (the Unix admin user) in order to use chroot. FreeBSD now implementing unprivileged chroot means that regular users are able to run processes in chroot as well.

So for example, if you were a regular user on a system, you can now create a subdirectory in your home directory and run an FTP daemon chrooted to that directory and bound to an unprivileged port, and then you can give someone else FTP access to that directory without them being able to see the other files in your home directory, keeping your private data private from them.

chroot existed, but could only be run as the root user. It was that way to prevent things like this (old actual exploit for Ultrix):

  $ mkdir /tmp/etc
  $ echo root::0:0::/:/bin/sh > /tmp/etc/passwd
  $ mkdir /tmp/bin
  $ cp /bin/sh /tmp/bin/sh
  $ cp /bin/chmod /tmp/bin/chmod
  $ chroot /tmp /bin/login
  # whoami
  root
  # chmod 4700 /bin/sh
  now, log out of the chroot and use your newly minted setuid shell

Since they now have the "NO_NEW_PRIVS" protection, they can let regular users safely use chroot.

You can for example run a build in a chroot as a unprivileged user.

The commit message does NOT indicate when this will be available to mere mortals like myself.

Can someone enlighten me if this will be part of FreeBSD 14, or if there is a chance it will become available earlier, perhaps with FreeBSD 13.1?

EDIT: The commit message does NOT indicate etc. Silly me.

The commit message does not mention any MFC timeline [1] so this feature is not planned to be merged back into existing stable branches. In other words, the first release with this feature is going to be FreeBSD 14.0-RELEASE.

[1]: Also, you may look for the commit hash (a40cf4175c90142442d0c6515f6c83956336699) at https://mfc.kernelnomicon.org/ to see the back-porting status.

This feature should be in the weekly snapshot pretty soon:


In Linux there's "PRoot" - used by Termux on Android to provide userspace chroot-like functionality (can run Debian, for instance).

