Hacker News
VDSO, 32-bit time, and seccomp (lwn.net)
53 points by Tomte 80 days ago | 34 comments

This class of problem for seccomp has been known for ages. It makes seccomp quasi-unusable IMO (or reserved for fully self-contained binaries, maybe in some JIT scenario but probably not a lot else), and even quasi in contradiction with the kernel's no-regression rules (because by design seccomp cannot achieve that in the general case -- at least when you consider the bigger picture, that is a classic userspace design and ecosystem, like in a mainstream standard distro). To give more detail: under this "standard" model applications rarely make syscalls themselves, but use intermediate libraries (e.g. glibc) that provide (quasi-)POSIX interfaces and reserve some of the syscalls, or even provide syscall-like functions that actually use others -- or just start using new syscalls in new versions to provide higher levels of abstraction. But they do not provide another abstraction offering a seccomp-like service suitable for this model. So seccomp is basically unusable in this context, which is one of the most important.

Now the kernel itself made the mistake, proving that the whole idea was not practical; it breaks too easily, and it breaks even when used in the most restricted ways or when the whole (non-kernel provided) userspace has been designed for it (which was not a practical condition at all, to begin with).

IMO seccomp should be phased out entirely and eventually replaced by something else. Trying to "fix" it will lead nowhere: it has been broken by design since forever.

Or you could argue that seccomp lacks a libc-like wrapper library with easier to use abstractions around the syscall.

Then you would only have to update the library to work with new kernel versions.

Yes, but like I said: this layer does not exist, and to my knowledge nobody intends to create it. It would be a large amount of work too, compared to the pledge approach...

The other issues: a parent cannot seccomp itself without being highly coupled to the child (filters persist across exec), and different syscall numbers per architecture, socketcall, etc. make it almost mandatory to use some library to build the filter.

But there isn't really anything that Linux can replace it with. Pledge works because OpenBSD controls both libc and the kernel.
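
To make the syscall-number and socketcall point concrete, here is a minimal Python sketch. It is only an illustration, not how any real filter library works: the `resolve` helper and the table layout are made-up names, and the tables are a tiny excerpt of the x86_64 and i386 syscall tables.

```python
# Numbers for a handful of syscalls on two Linux ABIs (tiny excerpt).
SYSCALL_NUMBERS = {
    "x86_64": {"read": 0, "write": 1, "open": 2, "close": 3, "socket": 41},
    "i386": {"read": 3, "write": 4, "open": 5, "close": 6, "socketcall": 102},
}

# Socket operations that i386 historically multiplexes through socketcall(2).
VIA_SOCKETCALL = {"socket", "bind", "connect", "listen", "accept"}


def resolve(arch, names):
    """Translate syscall names into the numbers a filter for `arch` must match.

    Hypothetical helper for illustration only.
    """
    table = SYSCALL_NUMBERS[arch]
    numbers = set()
    for name in names:
        if name not in table and name in VIA_SOCKETCALL and "socketcall" in table:
            name = "socketcall"  # the whole family hides behind one number
        numbers.add(table[name])
    return numbers
```

With these tables, `resolve("x86_64", ["read", "socket"])` gives `{0, 41}` while the same request for `"i386"` gives `{3, 102}`, so a filter written against raw numbers for one ABI is wrong on the other.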

> Parent cannot seccomp itself without being highly coupled to the child

You can work around this sometimes, though. Unless you actually care about fork with a copy of the current memory (i.e. if you only care about spawning new processes), you can fork a "spawner" process early which is only a thin proxy for pipe->exec commands. You apply seccomp after the spawner is ready and you're all good.
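
A rough Python sketch of that spawner idea, under some assumptions of mine: the names `start_spawner`, `spawn` and `_serve` are hypothetical, the JSON-lines protocol over two pipes is just one way to do it, and the actual seccomp installation is left out (it would happen in the parent right after `start_spawner()` returns).

```python
import json
import os
import subprocess
import sys


def _serve(cmd_r, res_w):
    """Spawner body: run each requested argv and report its exit status."""
    with os.fdopen(cmd_r, "r") as requests, os.fdopen(res_w, "w") as replies:
        for line in requests:
            argv = json.loads(line)
            status = subprocess.run(argv, check=False).returncode
            replies.write(f"{status}\n")
            replies.flush()


def start_spawner():
    """Fork the spawner *before* the parent restricts itself."""
    cmd_r, cmd_w = os.pipe()
    res_r, res_w = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: stays unfiltered, only proxies exec requests
        os.close(cmd_w)
        os.close(res_r)
        try:
            _serve(cmd_r, res_w)
        finally:
            os._exit(0)
    os.close(cmd_r)
    os.close(res_w)
    return pid, os.fdopen(cmd_w, "w"), os.fdopen(res_r, "r")


def spawn(requests, replies, argv):
    """Ask the spawner to run argv; return the exit status it reports."""
    requests.write(json.dumps(argv) + "\n")
    requests.flush()
    return int(replies.readline())


if __name__ == "__main__":
    pid, req, rep = start_spawner()
    # ... the parent would install its seccomp filter here (not shown) ...
    print(spawn(req, rep, [sys.executable, "-c", "print('hello from child')"]))
    req.close()  # EOF on the command pipe makes the spawner exit
    os.waitpid(pid, 0)
```

The parent never calls fork or exec after the filter is in place, so its policy does not have to account for whatever the spawned programs need. The cost is that you lose fork-with-current-memory semantics, as noted above.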

There are precedents for Linux shipping associated libraries, e.g. libbpf and liburing. But yeah, it's not likely to happen anytime soon.

seccomp is such a sad story. In theory seccomp can do a lot more than OpenBSD's pledge. In practice, the OpenBSD devs added pledge support with comparatively little effort to 100s of programs, while seccomp is a constant headache for the few programs which use it.

Worse is better is a simple minded and wrong interpretation. In reality the outward simplicity of pledge(2) masks a great deal of high quality engineering and research. The categories for pledge were not just pulled out of someone’s ass.

Seccomp, like so many Linux interfaces, is the “fuck it, here’s an exhaustive yet half-baked set of tools, you can do anything!” approach. This barely works out in general-purpose programming, but is always an unmitigated disaster in anything security related.

Afaik it's not even possible to implement a pledge-like wrapper around seccomp because seccomp is inherited while pledge leaves the responsibility of securing itself to each process.

> Allows a process to call execve(2). Coupled with the proc promise, this allows a process to fork and execute another program. If execpromises has been previously set the new program begins with those promises, unless setuid/setgid bits are set in which case execution is blocked with EACCES. Otherwise the new program starts running without pledge active, and hopefully makes a new pledge soon.


Pledge can be inherited by child processes too.

I keep getting raged at over e-mail/GitHub because end-packagers and some users don't comprehend why it's broken. I completely regret adding seccomp support in the first place: it breaks randomly, packagers enable it by default, then don't run tests and ship broken binaries, then hurl abuse "because security" when you try to report it.

By comparison I've never gotten a complaint about pledge support (or the FreeBSD equivalent), which we also support.

You could make a quiz based on "what syscalls does this familiar libc API end up calling". SELinux and seccomp are dead ends.

Yet somehow we run SELinux successfully on every Android device. I've never understood the antipathy towards SELinux (which by the way, doesn't operate at syscall granularity).

There were also enterprise and military customers using it for damage reduction. Some used products from Tresys that made configuration easier. Unfortunately, the tremendous number of 0-days and leaks in the Linux platform made the high-assurance community push for putting apps in VMs on separation kernels. The rise in hardware attacks meant even those aren't enough: physical separation with EMSEC shielding is required.

Most aren't going to do that, though. So, their boxes are still nice targets.

One reason to dislike SELinux is the arbitrary assumptions baked in by default that barely anyone knows how to or cares to change, which makes carefully designed software look broken when the framework itself is broken. One example I've faced is UNIX pipes, which for all purposes were obsoleted by UNIX domain sockets in the mid 80s as part of the original intent of the BSD socket API, but to SELinux they are profoundly different things. You can pass pipes across a user->root boundary but not a socket.

End users only see your code broken by SELinux, and assume you haven't done your job. On the other hand, SELinux is codifying rules about UNIX that never existed and amount to emotional heuristics about the risk of Internet domain sockets being inherited around the system by the wrong process. The effect isn't to prevent Internet domain sockets being inherited by privileged processes, but to break all sockets. That's garbage design, hidden behind marketing suggesting that, because the NSA contributed some code, the problem couldn't possibly be SELinux.

But unix domain sockets are not pipes, not least in the important way that you can reliably determine the identity of the other end. That can in some cases be an infoleak, and therefore needs independent access control.

If SCM_CREDENTIALS passing were a real problem in _any_ scenario, SELinux should target that instead, not an entire subsystem on which half the system is built.

The point is you need the ability to express both "these are the same thing" and "these are different". You can do both with SELinux. What's your alternative?

> I've never understood the antipathy towards SELinux

Most people first encounter it when it breaks their program in opaque and surprising ways. They then dig in... to find that the solution is non-obvious. So they turn it off and remember it as that thing that breaks stuff.

I don't understand. Why are they dead ends?

The problem with seccomp is that it operates at the level of system calls. An application could use seccomp to, for example, restrict what system calls it can invoke. However libraries usually consider what system calls they use an implementation detail that can be changed at any time. This forces the application programmer to pay attention to library internals.

An effect system would solve this problem, as new system calls imply new effects, causing a breaking API change.

libc functions may need unexpected syscalls. For example, some of the date and time functions will very likely access the filesystem, just to read /etc/localtime and the corresponding files in /usr/share/zoneinfo/. This is unexpected from the point of view of the programmer, who did not add any open(2) syscall to their application.

Now depending on which libraries are in use, the set of syscalls actually required often has to be determined with strace or trial and error through all code paths. And with another version of the library, this set of system calls could even change without any changes to API or ABI, leaving the responsibility to track this to the application programmer.

This is why seccomp (and SELinux) policy should be the province of the platform developer, not the application developer. That's how it works on Android (disclaimer: I worked on that) and it works very well.

It works very well until even kernel hackers hit the very problem we are talking about -- which is precisely the same class of problem that classic GNU/Linux userspace is having when trying to use it -- and then it does not work very well anymore.

So maybe there are some contexts where it kind of works (until it doesn't, like here), but another model which works in more cases would be more useful...

If you have such a model I would welcome it. But realistically, the complexity in SELinux (mostly) doesn't come from SELinux. It comes from the irreducible complexity of doing general purpose fine grained access control on Linux.

You could avoid some of that complexity by building something more opinionated, but it turns out that 99.9% of users do something that only 0.1% of users do. The odds that you break enough to generate the same negative sentiment SELinux has, but without the tools to dig those users out, are quite high.

Something like pledge seems to have more success...

Maybe a middle ground could be achieved, but honestly I prefer simple proven approaches to grand overcomplicated designs. And it has to be in the kernel (or at least shipped and installed by it) and not just one more userspace layer on top of seccomp or the like, because otherwise I can't update the kernel without the risk of breaking everything, which in the traditional GNU/Linux distro world is an important workflow.

I mean, SELinux is deployed successfully on north of a billion devices with essentially no false positives and a huge ecosystem to support. It would seem that it has been proven, and at a considerably larger scale than pledges. Not sure where else to go with that.

I'll believe "essentially no false positives" when people don't set SELinux enforcement to permissive and forget about it.

Ok. It isn't in permissive on Android, and there's a CTS test for it.

Android is typically the kind of system that does not permit the workflow I was talking about. That does not necessarily make it the worst thing ever; it is simply an extremely different system from GNU/Linux. Now obviously it also uses Linux, so it is useful to have some features even if they are (really) usable in one kind of system and not the other, but my point is that it would be better to have features usable by both, and we kind of know how to do it thanks to the pledge example: group the syscalls by theme, maintain the group definitions in (or at least shipped by) the kernel, done -- that virtually solves the intermediate 3rd party library problem, and would have prevented that VDSO fail.

Really, the Linux kernel being so decoupled from userspace is one of the points that let it be so successful. A technology which drops or reduces that characteristic so much is by definition not going to be used everywhere... Which is a big opportunity loss compared to practical solutions.
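
To show what "group the syscalls by theme" could look like, here is a small Python sketch. Everything in it is an assumption for illustration: the group names borrow from pledge promises, but the membership is invented and is not OpenBSD's actual lists, and `allowed_syscalls` is a hypothetical helper.

```python
# Hypothetical groups, loosely named after pledge(2) promises.
# The membership is illustrative only, not OpenBSD's real lists.
GROUPS = {
    "stdio": {"read", "write", "close", "clock_gettime", "gettimeofday"},
    "rpath": {"open", "openat", "stat", "fstat"},
    "inet": {"socket", "connect", "bind", "listen", "accept"},
}


def allowed_syscalls(promises):
    """Expand a list of group names into the set of permitted syscalls."""
    unknown = set(promises) - set(GROUPS)
    if unknown:
        raise ValueError(f"unknown groups: {sorted(unknown)}")
    allowed = set()
    for promise in promises:
        allowed |= GROUPS[promise]
    return allowed


if __name__ == "__main__":
    # The application only ever names groups, never individual syscalls.
    print(sorted(allowed_syscalls(["stdio", "rpath"])))
```

The application's policy is just `["stdio", "rpath"]`. If whoever ships the groups later adds a new time syscall such as `clock_gettime64` to `"stdio"`, every application picks it up without touching its own policy, which is the property a raw syscall allowlist lacks.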

I'm confused.

First, Android is a Linux system, and underneath the app framework is really a pretty normal one. So I'm not sure what distinction you're drawing here.

Second, the grouping of related syscalls is something SELinux already does-- open, for example, doesn't care if you open() or openat(). That grouping lives in the kernel as well. So that would appear to do what you wanted?

Third, pledge does not do what you described. Although it restricts access to syscalls, the more important thing is how it restricts the behavior of those syscalls. That policy is something you could write in SELinux, but not in seccomp. And of course, pledge isn't capable of doing the useful things SELinux and seccomp can, like forbidding certain ioctls but allowing others.

Fourth, SELinux was not bitten by the VDSO issue.

Fifth, both pledge and SELinux get hit by the third party library issue in exactly the same way: you have a policy which was sufficient under the old version of the library, but something has changed and now they need something else. The only difference is that with SELinux the person who might know that the third party library had changed is also responsible for the policy, where with pledges you don't have that visibility.

Finally, I don't know what you mean with your last paragraph at all. The kernel has a contract with userspace and kernel developers broke it, is that what you mean by decoupling here? I'm not sure it was a good thing, and the kernel folks seem to agree...

I don't want to segregate Android from other Linux systems or anything like that, but I was just making the remark that this is a (very) different system compared to classic GNU/Linux distros, with different actors at different points of the lifecycle and so on. In practice, as an end user I can't update the kernel on my Android phone, which is a huge shame IMO (it could as well use the NT kernel as far as I'm concerned; it would change nothing about my practical freedoms).

Also I was never really talking about SELinux because I don't know much about it -- the article was initially talking about seccomp and so was I. If SELinux groups syscalls, then that's good and way better than seccomp on that topic in my opinion. BUT: I'm not a fan of having "security policies" potentially separated from applications and potentially written by yet another party -- at least not at that level -- in most cases that makes no sense IMO; only the application (1) authors know, or are best placed to specify, what is needed; and if further restrictions are wanted they ought to be tunable by the end user with a nice GUI and a few on/off controls.

About libraries evolving under your feet: if the libraries are not insane, and the groups of syscalls (or "subsyscalls" if needed, like the ioctl case) are good enough, then you should avoid virtually all problems.

At least, for sure, you will avoid trivial problems like the VDSO one.

EDIT (1): "application" in the GNU/Linux sense. Android applications are completely different beasts.

MySQL also still has Y-2038 issues, even on 64 bit machines.
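
For reference, the arithmetic behind the 2038 limit as a short Python sketch (the variable names are mine): a signed 32-bit count of seconds since the Unix epoch tops out at 2^31 - 1, and one second later it wraps to the most negative value.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1

# Last instant representable as a signed 32-bit time_t.
last = EPOCH + timedelta(seconds=INT32_MAX)
print(last.isoformat())  # 2038-01-19T03:14:07+00:00

# One more second, stored back into 32 signed bits, wraps around.
wrapped_seconds = (INT32_MAX + 1 + 2**31) % 2**32 - 2**31
wrapped = EPOCH + timedelta(seconds=wrapped_seconds)
print(wrapped.isoformat())
```

Any field that stores timestamps in 32 signed bits hits this regardless of whether the machine itself is 64-bit.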
