It also ignores that seccomp-bpf allows far more fine-grained syscall rules (such as requiring that certain bits be cleared or that certain arguments equal a given value), and more features are being added to it over time. I don't get why you would use ptrace (and if you don't use ptrace then you don't need another layer -- just play with the existing OCI support for seccomp and use runc directly).
seccomp-bpf doesn't let you follow pointers, so you can't even implement most pledge() restrictions in it. For instance, pledge() always permits open("/etc/localtime"), but at the point seccomp is run, all you know is open(some pointer to userspace).
You could imagine combining seccomp-bpf with some other system that reads the arguments after they've been copied to kernelspace, which is basically Landlock's approach https://lwn.net/Articles/698226/. But I've been personally waiting for something like this since 2011 or so when people were saying seccomp mode 2 should use ftrace, and Landlock itself has been in review (slash argument) for two years. An approach like gVisor works today.
> seccomp-bpf doesn't let you follow pointers, so you can't even implement most pledge() restrictions in it. For instance, pledge() always permits open("/etc/localtime"), but at the point seccomp is run, all you know is open(some pointer to userspace).
This is being worked on (separately from Landlock, but along similar lines) in the form of seccomp syscall emulation (I don't remember the actual name of the patchset at the moment, but I think it was proposed a month ago). However, after talking to some seccomp folks, I was told that in theory eBPF maps could be used for this purpose (though to be honest I'm not really convinced).
The real downside of ptrace is that you cannot filter which syscalls you're interested in -- so you pay the price of tracing for every syscall. seccomp doesn't have this problem.
You can use SECCOMP_RET_TRACE to kick complicated cases back to the ptracer but handle the easy cases without the slowdown. So you can write a seccomp policy that does something like this pseudocode:
    if syscall == SYS_open:
        if flags == O_RDONLY:
            return (SECCOMP_RET_TRACE, 0)
        else:
            return (SECCOMP_RET_ERRNO, EPERM)
    else if syscall in (SYS_read, SYS_write, ...):
        return (SECCOMP_RET_ALLOW, 0)
    else:
        return (SECCOMP_RET_ERRNO, ENOSYS)
and it would be much much faster than tracing every system call, since most programs call open() rarely and read() and write() very often.
That said, the ptracer's job here is kind of hard, because the kernel still gets an untrusted userspace pointer, and another thread, another process, etc. can modify that memory between the ptracer okaying it and the kernel reading it. (See "Argument races" in Tal Garfinkel's 2003 "Traps and Pitfalls" paper.) So you either want the filtering to happen in the kernel after the argument has been copied to kernelspace (which is Landlock's approach), or do the open from a trusted process and send the fd over (which I think is gVisor's approach).
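The "open from a trusted process and send the fd over" option can be sketched with plain SCM_RIGHTS descriptor passing over a Unix-domain socket. This is only the generic mechanism, not a claim about how gVisor actually does it; `send_fd`, `recv_fd` and `broker_open` are hypothetical helper names, and the single-path allowlist is a stand-in for a real policy. The point is that the broker checks and opens a path string that lives in its own memory, so the sandboxed side has nothing to race against.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control-message buffer sized and aligned for exactly one descriptor. */
union fd_ctl {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send one open descriptor over a Unix-domain socket. Returns 0 or -1. */
static int send_fd(int sock, int fd)
{
    assert(fd >= 0);
    char byte = 0;  /* one byte of ordinary data carries the control message */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;

    memset(&ctl, 0, sizeof ctl);
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof fd);

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor. Returns the new fd in this process, or -1. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;

    memset(&ctl, 0, sizeof ctl);
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (c == NULL || c->cmsg_level != SOL_SOCKET ||
        c->cmsg_type != SCM_RIGHTS || c->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    int fd;
    memcpy(&fd, CMSG_DATA(c), sizeof fd);
    return fd;
}

/* Hypothetical broker step: `path` is the broker's own copy of the request,
 * so it cannot change between the check and the open. */
static int broker_open(int sock, const char *path, const char *allowed)
{
    if (strcmp(path, allowed) != 0) {
        errno = EACCES;
        return -1;
    }
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int rc = send_fd(sock, fd);
    close(fd);  /* the receiver holds its own reference once the message is sent */
    return rc;
}
```

In a real setup the two ends of a socketpair(AF_UNIX, SOCK_STREAM, 0, sv) would live in different processes, with the sandboxed one denied open() entirely and left with only recv_fd() on its end.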
> seccomp-bpf doesn't let you follow pointers, so you can't even implement most pledge() restrictions in it.
That's not even the half of it: it's impossible to implement the ratcheting-down semantics of pledge() using seccomp-bpf. For example, an initial pledge("stdio rpath") may later be reduced to pledge("stdio").
"Subsequent calls to pledge can reduce the abilities further, but abilities can never be regained."
Personally I don't think pledge()'s semantics are the best idea in the world, especially since all of the restrictions are cleared on exec() if it is permitted IIRC -- so it's useless for sandboxing.
Of course. If you promise "exec", then you're allowed exec. This allows pledge to be used in software where it would otherwise be impossible, for example text editors and shells. It still reduces attack surface, as the shell itself can no longer open sockets or issue arbitrary device ioctls. The alternative is no protection at all.
> Package ptrace provides a ptrace-based implementation of the platform interface. This is useful for development and testing purposes primarily, and runs on stock kernels without special permissions.