Using seccomp with a default-open filter is a terrible idea to begin with; it wasn't really designed for any of this. Seccomp in its most basic form didn't even have a filter list, it just allowed read() and write(). (And close() or something, don't quote me on the details, the point is it was a fixed list.) You're supposed to use it with a default-closed filter and fully enumerate what you need. (Yes, that's hard in a lot of cases, but still.)
There have been other cases where syscalls got cloned, mostly to add new parameters, but either way seccomp with an "open" filter can only ever be defense-in-depth, not a critical line in itself.
(Don't misunderstand, defense-in-depth is good, and keep using seccomp for it. But an open seccomp filter MUST be considered bypassable.)
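To make the default-open vs. default-closed point concrete, here is a minimal sketch. It is a toy model in plain Python, not real seccomp: `make_filter` and the rule sets are made up for illustration, and `openat2` stands in for a "cloned" syscall that appears after the filter was written.

```python
# Toy model of seccomp policies; nothing here talks to the kernel.
ALLOW, DENY = "allow", "deny"

def make_filter(rules, default):
    """Return a function mapping a syscall name to ALLOW or DENY."""
    def check(syscall):
        return rules.get(syscall, default)
    return check

# Default-open: deny what we know is dangerous, allow everything else.
open_filter = make_filter({"open": DENY, "openat": DENY}, default=ALLOW)

# Default-closed: allow only what the program needs, deny everything else.
closed_filter = make_filter(
    {"read": ALLOW, "write": ALLOW, "close": ALLOW}, default=DENY
)

# A newer kernel adds a variant syscall the open filter has never heard of.
for name in ("openat", "openat2"):
    print(name, open_filter(name), closed_filter(name))
```

The open filter lets the new variant through, while the closed one denies it without needing an update.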
>it just allowed read() and write().
A fun consequence of this is that even though there was a function to check if seccomp is enabled or not, it could only ever do one of two things: return "not enabled" or crash the process.
I agree with everything you wrote. I'll add that maintaining a whitelist is not easy either: I've witnessed many situations where seccomp sandbox broke because glibc/python interpreter started using a different syscall (for example openat with AT_FDCWD instead of open).
> I've witnessed many situations where seccomp sandbox broke because glibc/python interpreter started using a different syscall (for example openat with AT_FDCWD instead of open)
ACK, that's what I meant with "hard in a lot of cases"… to be honest I think this is a failure of the ecosystem at large. It's a bit of a half feature without some kind of higher-level userspace mechanism to collect who needs what, especially when a bunch of libraries are involved. It's admittedly a very hard problem, e.g. just because something is linking libcurl as a 2nd or 3rd level dependency doesn't mean you intend your process to ever make network connections… I don't think it's unsolvable though.
This seems like an instance of an anti-pattern I've seen, which is conflating "permission" and "API call" as if they were the same thing.
IIRC, AWS does this, where permission is by API call. As an example, you can have permission to call ssm:GetParameter n times, but if you try to combine those n API calls into a batch with GetParameters, that's a different IAM perm, even though exactly the same thing is occurring.
I find that so frustrating. Another example is uploading an image to ECR (elastic container registry). You need like four different permissions to do it, which I think correspond to individual http requests, but it is usually just a single docker/podman/skopeo command, and I can't think of a situation where you would want to grant permission to initiate an upload but not complete it.
And Google, in ChromeOS, Android, and purportedly, Google production servers, for around a year and a half, as well. For this reason it's also disabled in several of the kernelCTF configurations and in the ones where it remains (GKE), it only pays out at half-rate in bug bounty.
As far as I know, io_uring is quite secure: a user cannot perform a syscall through it unless they have the privileges required to perform that syscall directly.
I would gladly get more details about the exact purpose of seccomp in a container environment. From a bit of reading online, I find that docker "uses seccomp to block mount(2), which could be used to escape the container", which makes no sense to me because mount(2) requires CAP_SYS_ADMIN.
seccomp is used for defense in depth.
If someone managed to escalate privileges through some means the seccomp policy will still prevent them from doing nasty things or escalating further.
That would be impossible to know.
The main thing with io_uring is it makes it so you don't need to context switch (ie make system calls) to perform a number of operations.
I was thinking about how one would change io_uring design to be compatible with seccomp, and came up with a very simple one:
A new io_uring fd comes with all operations disabled by default. User has to call "io_uring_register(fd, ENABLE_OP, op)" before operation is used for the first time. Then seccomp filter can easily filter enable_op calls to prohibit certain operations.
It could even be added now in a backward-compatible way: add a new feature flag to io_uring_setup that enables it. Then one could set a seccomp filter to only accept setup requests with this feature set, and deny all others. Together, this should allow cooperating programs to pass the seccomp filter, while programs that don't register ops could not use io_uring at all.
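A rough sketch of how that proposal would behave, as a toy model. To be clear about assumptions: `ENABLE_OP`, the opt-in setup flag, `ToyRing`, `toy_setup`, and `policy` are all hypothetical names for illustration, not part of the real io_uring API.

```python
# Toy model of the proposal; ENABLE_OP and the opt-in setup flag are
# hypothetical and are not claimed to exist in the real io_uring API.

class ToyRing:
    def __init__(self):
        self.enabled = set()  # every operation starts disabled

    def enable_op(self, op, seccomp_allows):
        # Stand-in for io_uring_register(fd, ENABLE_OP, op). It is an
        # ordinary syscall, so the seccomp filter gets to inspect `op`.
        if not seccomp_allows(op):
            raise PermissionError(f"seccomp denied ENABLE_OP({op})")
        self.enabled.add(op)

    def submit(self, op):
        # Submission bypasses seccomp, so the ring itself enforces the set.
        if op not in self.enabled:
            raise PermissionError(f"{op} was never enabled on this ring")
        return f"{op} submitted"


def toy_setup(opt_in):
    # Stand-in for io_uring_setup under a filter that only accepts rings
    # created with the hypothetical opt-in feature flag.
    if not opt_in:
        raise PermissionError("seccomp denied setup without opt-in flag")
    return ToyRing()


def policy(op):
    return op in {"READ", "WRITE"}  # example seccomp policy


ring = toy_setup(opt_in=True)
ring.enable_op("READ", policy)
print(ring.submit("READ"))
try:
    ring.enable_op("CONNECT", policy)
except PermissionError as err:
    print(err)
```

The point of the design is that the only place seccomp has to look is the enable call; everything submitted later is constrained by what was enabled up front.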
I agree and think your approach would work, but I need to point out that seccomp BPF filters can also match on syscall arguments. For example, you can allow fcntl(F_DUPFD, …) but deny fcntl(F_SETLEASE, …). For some syscalls (fcntl, ioctl, setsockopt, …), this is rather important.
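A small sketch of what argument matching means for the fcntl example. This is a toy decision function, not a real BPF program: `fcntl_filter` is a made-up name, and the constants are the Linux values of `F_DUPFD` and `F_SETLEASE`.

```python
# Toy model of argument-aware filtering; a real filter would be a BPF
# program inspecting the syscall number and argument registers.
F_DUPFD = 0        # Linux value
F_SETLEASE = 1024  # Linux value (F_LINUX_SPECIFIC_BASE + 0)

def fcntl_filter(syscall, args):
    """Allow fcntl(fd, F_DUPFD, ...) and deny everything else."""
    if syscall != "fcntl":
        return "deny"           # default-closed for this sketch
    cmd = args[1]               # fcntl(fd, cmd, ...)
    return "allow" if cmd == F_DUPFD else "deny"

print(fcntl_filter("fcntl", (3, F_DUPFD, 0)))
print(fcntl_filter("fcntl", (3, F_SETLEASE, 0)))
```

Filtering on the syscall name alone could not tell these two calls apart, which is why the same kind of argument check would matter for any op-enabling call.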
Surely this is a seccomp shortcoming, or kernel auth shortcoming, rather than an io_uring problem?
That is, seccomp is (apparently? I’ve never used it myself) capable of intercepting direct calls. Obviously, that design isn’t going to be able to handle “indirect” calls in its default implementation.
Either seccomp needs a way to act on the buffer or intercept io_uring calls, or there’s a need for a new auth mechanism that’s capable of handling io_uring-style APIs.
Torpedoing the whole api (a la gcp) feels like throwing the baby out with the bath water.
That framing doesn't make sense. System calls and their arguments are an obvious security boundary and have been a sandboxing component for decades. io_uring blows that boundary apart. The "problem" is io_uring, not seccomp.
If you want to make a case for io_uring being benign for security, the right argument is probably against all unmediated shared-kernel multitenancy (ie: multitenancy either through virtualization, or WASM/V8-type language runtimes, and nothing else). It doesn't make sense to say system call filters are flawed because someone came up with an omni-syscall that breaks those filters.
The syscall implementations themselves do checks and return EPERM/EACCES when appropriate. The mechanism for doing the syscall can change. I mean, in the 90s it happened via int 0x80, then we got sysenter, then the vdso. io_uring just moved part of it to user mode.
It seems like a totally reasonable design to me to "just" put the right hooks into the filter mechanism and make it get called the same way regardless of the syscall mechanism.
The obvious solution is to block operations over io_uring if the equivalent syscall would have been blocked by seccomp. But I'm not sure if there is some reason that wouldn't work.
Another possibility would be to allow setting restrictions on all io_uring operations for the current and all child processes, although that would be less convenient than using the existing seccomp system.
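The first option above could look roughly like this. It is a toy model under loud assumptions: the opcode names are real io_uring opcodes, but the one-to-one mapping to syscalls is a simplification for illustration, and `submit` is a made-up stand-in for the kernel's submission path.

```python
# Toy model: check each io_uring opcode against the seccomp policy for
# its "equivalent" syscall. The mapping is illustrative, not complete.
OPCODE_TO_SYSCALL = {
    "IORING_OP_READ": "read",
    "IORING_OP_WRITE": "write",
    "IORING_OP_OPENAT": "openat",
    "IORING_OP_CONNECT": "connect",
}

def submit(opcode, seccomp_allowed):
    """Return 'submitted' only if the equivalent syscall is allowed."""
    syscall = OPCODE_TO_SYSCALL.get(opcode)
    # Opcodes with no known equivalent are blocked (default-closed).
    if syscall is None or syscall not in seccomp_allowed:
        return "blocked"
    return "submitted"

allowed = {"read", "write", "close"}  # example default-closed policy
for op in ("IORING_OP_READ", "IORING_OP_CONNECT", "IORING_OP_UNKNOWN"):
    print(op, submit(op, allowed))
```

One wrinkle this glosses over is that argument-level rules would need a mapping from submission-queue fields to syscall arguments, not just from opcode to syscall name.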
Well the article brings up containers as an example. If the sysadmin controls “your” parent or root process (e.g. the login shell), they can just perform seccomp filtering there and it applies to everything within it (like any other sandbox).
(author here) I'm one of the maintainers of HashiCorp's Nomad, so that example was likely inspired by the separation of duties that's part of our security model. In that environment, there's a subset of task (ex. container) configuration that's controlled by the cluster admin and a subset that's controlled by the job author deploying onto the cluster.
Author here! The motivating example of this post is frankly pretty lousy in retrospect (and was even so soon after writing, given the friendly reminder from Giovanni Campagna that `socket` wasn't one of the io_uring opcodes). At best this is an interesting limitation of seccomp. Maybe relevant if you were using gVisor?