A subsystem to restrict programs into a “reduced feature operating model” (marc.info)
230 points by cnst on July 18, 2015 | 96 comments

This doesn't seem particularly better than Linux's seccomp-bpf or OS X's seatbelt. In particular, I don't really understand the complaint about not wanting to write a program, then turning around and writing a system call, running with full privilege on the system, that hardcodes all sorts of things about userspace.

I wish he'd acknowledge and discuss prior, effective work in this space instead of saying things "showed up" and they're "insane". For instance, a direct comparison to either seatbelt or seccomp-bpf would make it clear that the distinction between initialization vs. steady state is well-explored in production systems using this (like sandboxed Chrome renderers) and not novel.

The implementation is better than OS X, because OS X has a ton of kernel API surface available to userland and much of it can't even be sandboxed, while tame starts with one syscall and works its way up.

I agree it's not better than seccomp-bpf plus a library to supply these kinds of common policies in userland, with the caveat that I don't know of any such library in common use on Linux.

Yeah, seatbelt is not that compelling in practice, but that's true of the OS X kernel/userspace boundary in general. :/ But the design of it seems to match what's wanted here, exactly: you can enable restrictions during the running of a process, and OS X ships with a handful of standard seatbelt profiles.

pcwalton's gaol, which has both seccomp-bpf and seatbelt backends, has a concept of "profiles," which I think matches the general idea here:


To be fair I wouldn't use it in production yet, but it's quite a bit more reviewed than tame(2) is right now.

I don't think it would be particularly useful. The most common usage of seccomp is hardening an existing sandbox by reducing kernel attack surface. Fine-grained control is crucial for that, since it's the entire point. Every unnecessary system call or permitted system call flag is added attack surface.

When seccomp is actually being used to implement the sandbox semantics, the application usually needs to be designed around it. It's very difficult to apply it this way to an existing application. I just don't think there's a strong use case for coarse control when it's so far from being the real hard problem.

I agree, this isn't particularly new or inspired. It's similar to existing mechanisms, which the author pointed out.

Seccomp is really hard to integrate into existing programs, especially if they are monolithic and don't have privilege separation already. You can't narrow down the syscall footprint in complex programs enough to get decent protection unless you already have something along the lines of privilege separation. Third party libraries cause headaches too because their syscall footprint might conflict (now or in the future!) with the main program's seccomp configuration.

Capsicum looks interesting. The challenge with fancy security mechanisms is that many people contributing to the codebase may be unaware or care only about use-cases where fine-grained capabilities aren't needed.

At the end of the day, all these mechanisms can work well but it's best to use them from the start and make them one of the key concerns that all developers know about.

The tame call isn't far from being an extremely coarse version of seccomp. In general, both require splitting up programs into components to implement a sandbox. Otherwise, they're only able to reduce the kernel attack surface for an existing sandbox and not nearly as much as they could if it was split up well. They're both vulnerable to changes in third party libraries and don't have control over filesystem access once those system calls are allowed (it's all or nothing, beyond tame's hard-wired paths).

The main missing feature on Linux is the ability for programs to apply their own path-based MAC policy as seccomp can't be used to do string comparisons and couldn't realistically be extended to offer it. It's one of the main reasons that robust seccomp sandboxes require so much work.

You might well be able to implement tame() entirely in userspace using seccomp-bpf.

And that wouldn't be a bad thing; seccomp could do with a simpler user interface (in addition to the current, more powerful one).

You can almost do it, but you can't whitelist paths. Luckily, none of the hard-wired paths is particularly compelling. With either feature, you end up needing a multi-process architecture as soon as you want to allow access to certain paths in a sandbox. It's just that tame hard-wired a few used by the base system, but they aren't generally useful outside of it.

So, do I have it right that this is effectively a way of a program being able to declare to the operating system "I shouldn't ever do <x>"?

Because, if so, that makes a whole lot of sense. (Adding security "for free" generally does).

This could conflict with on-the-fly upgrades, though. If it turns out that some later version of your program does in fact require <x>, then you'll have to kill and restart the process as opposed to upgrading on-the-fly. Perhaps not the end of the world, but worth noting.

It can also be used by a parent process to limit the functionality of a child process, by calling tame() after fork() and before exec(). I can imagine this being used with a "tame" command in the shell to run untrusted programs. Of course I don't know how thorough the sandboxing is, and I wouldn't trust it to make unsafe programs completely safe.

> I can imagine this being used with a "tame" command in the shell to run untrusted programs.

This might not be useful in practice. The inspiration behind tame() is the observation that a program usually needs many more rights at initialisation-time than in its main-loop. Thus, tame()ing a program before it even starts is unlikely to be practical - if you give it enough permissions to successfully start (including whatever is needed for dynamic linking), you may not have reduced its capabilities much.

> Of course I don't know how thorough the sandboxing is, and I wouldn't trust it to make unsafe programs completely safe.

That's the best part about OpenBSD -- the APIs may not be binary or even source-code compatible between the releases, but the source code is usually as readable and as clear as it gets.

If you read through the examples, it's even better. The default case when you call tame() is that you don't get any privileges, so you explicitly have to call and declare to the operating system, "I need to be able to do <x> - don't let me do anything else."

This sounds like Tcl's "safe interpreter" [1], but for syscalls.

[1] http://www.tcl.tk/man/tcl8.4/TclCmd/safe.htm

That's my interpretation of it. Basically your program does its initialization, then says to the system "Ok, here's all I need from this point forward." Making this level of security this easy is a huge win.

This is pretty brilliant/obvious in hindsight. In addition to the sandboxing protection, you also have a really good inventory of what privileges the application requires. Looking over the diffs in the applications - most of them are two or three lines - a #include <sys/tame.h> followed by something simple like tame(TAME_STDIO | TAME_DNS | TAME_INET);

What I really like about a lot of the OpenBSD initiatives is that they don't overthink their solutions - they make them as simple as possible, but no simpler. Signify, which avoided the entire web-of-trust/PKI complication, is another example.

Yeah, it’s a basic dynamic role/coeffect system, which goes a long way toward enforcing correctness and safety, much as types do (whether static or dynamic).

Having fine-grained capabilities and the ability to turn them off is always useful. The usual problem is that some component needs to, say, open a file, so all code gets "open file" privileges.

There's a tool like this for Android phones. It not only can turn privileges off for an application, but also offers the option to provide apps fake info for things they don't need. You can, for example, deny address book access; if the app tries to access the address book, it gets a fake empty one. You can deny camera access; the app gets some canned image. This allows you to run overreaching apps while keeping them from overreaching.

Yeah, for 'open file' kind of stuff it would be better to have a real sandbox. (I think Windows began doing something like this in Vista or 7: if programs want to write to certain restricted places they can, but the writes go to their sandbox, so if they read the files later they see them, with no effect on the system files.)

Vista did that as a compatibility workaround. Older Windows apps from the 9x and XP era were used to being run as Administrator and being able to write directly to Program Files and the like. In order to make some of these work on Vista without running them as Administrator, the solution was to lie to the applications and make their writes go through to sandboxes instead.

The sandboxing was flawed, though. It causes problems for some applications. For example, Gang Garrison 2, a game I have worked on, fails to update itself if stuck in Program Files and not run as admin.

This exists today, with selinux sandbox.


What's the name of this app? Can it run without root permissions?

xprivacy/xposed framework, no

I wish Android allowed power users to create "root accounts" on Android, similar to how you can create Admin accounts on Windows, but be completely isolated from the default safe account.

For another example of a similarly beautiful interface that echoes "difficult solution made stupidly simple to use", checkpointing under DragonFly BSD: http://leaf.dragonflybsd.org/cgi/web-man?command=sys_checkpo...

On an unrelated note, I've always had respect for how Theo de Raadt is both the project leader of a complete BSD system, yet also an active hacker. Contrast to Linus Torvalds, who's mostly a manager nowadays.

On an unrelated note, but tangential to your link, why can I not save and restore processes? It seems like something that would be relatively easy to do - suspend all threads, save the thread control blocks, mark all pages as fault-on-write, resume threads, and start saving pages, unmarking them once they are saved. If/when a thread faults, copy (or save immediately) that page and then resume that thread. (Or potentially don't resume the threads until after you save everything, although that may cause problems...) You need to deal with file handles / etc, but that can be done too.

Not really different than hibernation.

You can, there's a process freezing utility for Linux; I'll see if I can remember its name.

EDIT: Here it is: CryoPID (https://github.com/maaziz/cryopid)

CryoPID allows you to capture the state of a running process in Linux and save it to a file. This file can then be used to resume the process later on, either after a reboot or even on another machine.

cryopid and cryopid2 are long abandoned, possibly not working on 3.x and 4.x kernels anymore.

Modern and more sophisticated equivalents are CRIU and DMTCP.

You've touched on the problem but not quite nailed it.

> You need to deal with file handles / etc, but that can be done too.

That's actually the hard part. To get a real image of that process in time, you need to snapshot the full filesystem state, too. Or it could change out from beneath your program. Even more complicated: network state.

Why is network more complicated? I would think network doesn't have any atomic/uninterruptible states filesystem might?

It's easy to re-open a file (assuming it's still there), but with sockets your IP may have changed, the remote IP may have changed (which you may have stored in your working memory that got checkpointed), DNS may point you to a different service entirely, you could have had to do some kind of port knocking or something to get that connection open in the first place.

I know this kind of stuff is being worked on so VMs/containers/namespaces can be moved around but it seems to be one of those things that gets really complicated when you try to do it transparently for userspace.

IPs, DNS settings change on running programs all the time, that doesn't seem as unusual as re-opening a file that's actually not there. A unix socket is an interesting mixed case :)

If a process has a stream socket open to another process, or to another system over the network, what happens to that socket when the process is "thawed"?

How about if it's listening on a TCP port -- what happens if that port is in use by another process when the original one is thawed?

I understand this can't go 'right', but are those things more difficult than filehandles to files that have been deleted?

Handles to deleted files are relatively uncommon in practice. Network sockets aren't.

"Handles to deleted files are relatively uncommon in practice."

Could you please expand on your reasoning here? We're talking about restoring processes at arbitrary points in the future. That means we're not just talking about handles to files that were deliberately deleted while the process was running, but also anything that the process had open that was frozen that may have been subsequently deleted. That would seem to include any log file that gets rotated, which is not exactly rare, plus a ton more things.

I also think that treating network sockets as if they were disconnected is likely to go better than treating files that way - existing programs probably make more assumptions about disk state not changing unexpectedly than about network state not changing unexpectedly (even if both are technically not well founded).

IIRC the CRIU developers went into some detail about this on FLOSS Weekly some time back:


I can't remember exactly where in the podcast they discussed it, but I believe it was just before the part where you could hear brains exploding in the background.

Elaborate? I don't think there's much of anything that could change out from under a suspended process that couldn't change out from under a running process.

(Case in point: you can have a system hibernate, have a supposedly locked file change, and have the system resume.)

This is a very good point. I definitely see the value in it, and making it simple to use means that programs are more likely to actually use it.

See also, Solaris' Role-Based Access Control and Privileges models.

Privileges (seems to fit the post): https://blogs.oracle.com/casper/entry/solaris_privileges


Programming with Privileges Example: http://docs.oracle.com/cd/E23824_01/html/819-2145/ch3priv-25...

Overview: http://www.c0t0d0s0.org/archives/4077-Less-known-Solaris-fea...

In particular, the Solaris privileges model allows a program to gracefully degrade functionality and drop and reinstate privileges at different points of execution.

This is an interesting idea, but something about having various "behaviours" baked into the logic concerns me. I'm certain that Theo has thought about this and understands the implications better than I do though.

For example, if your process has TAME_GETPW, opening /var/run/ypbind.lock enables TAME_INET. The reasoning behind this makes sense, but now it means that yp always has to open that file before it can do its thing. The behaviour of yp always opening that file before accessing the network is now required by the kernel.

The saving grace is that OpenBSD (and the other BSDs) are developed as a unified system, so if yp ever changes to no longer use that file, that change will only come as part of a version upgrade that includes the kernel, etc.

Windows 8 has an equivalent of this, using a "mitigation policy" called ProcessSystemCallDisablePolicy, which is set using SetProcessMitigationPolicy().

Chrome uses this for their sandbox of rendering processes.

It looks like you can only disable GUI calls with it (the very sparse documentation seems to contradict itself a bit though). Also, a model in which you specify which calls you want to enable (and everything else gets disabled) is stronger than one in which you specify which ones you want to disable.

Indeed, it's the whole whitelist vs blacklist [1] debate. Whitelists are radically safer.

[1] - https://farm9.staticflickr.com/8669/16418068728_b8dd8aa200_c...

I recently asked for exactly this on StackOverflow, but for Linux[0]. Is anyone aware of an interface to seccomp-bpf on Linux that is as easy to use as this tame() syscall?

If not, does anyone want to join forces to create one? An ultra-simple library that provides tame()-like functionality on all capable platforms should make writing secure software a lot easier.

[0] https://stackoverflow.com/questions/31373203/drop-privileges... if anyone's curious

I would look into the Chrome/Chromium sandbox code, as they seem to have at least some facility for parsing simple profiles.

This page has some details:


The kernel pieces of tame(2) have just been committed:


I can't see where in this patch forked children inherit the `ps_tame` from the parent process. I don't know this kernel at all but it seems like something like this should be in sys/kern/kern_fork.c

    pr->ps_tame = parent->ps_tame
Otherwise tamed processes could just fork a process to do things they've declared that they won't do.

A tamed process cannot even call fork(2) unless the TAME_PROC flag is passed, in which case I'd assume the child process would then inherit a copy from the parent as part of a normal fork(2) operation. It can be later revoked by the child, but it may want to keep it for kill(2).

I'm `cvs get`ting as fast as I can to read the rest of sys_fork.c to sate my curiosity, but CVS is incredibly slow. I'm spoiled by how git packs the repo.

turned out to be way faster just to grab http://mirrors.sonic.net/pub/OpenBSD/5.7/sys.tar.gz , if anybody else wants to poke around and doesn't want to wait for cvs.

You can also view the source on the cvs web mirror: http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/

Sounds like you're doing something unnecessarily more complicated than this:

$ time cvs -d anoncvs@anoncvs1.ca.openbsd.org:/cvs export -rHEAD src/sys/kern/kern_fork.c

U src/sys/kern/kern_fork.c

    0m2.73s real     0m0.06s user     0m0.04s system

Yep, I was grabbing `src` instead of `src/sys/kern/kern_fork.c`. I've completely forgotten how to CVS, and I'm okay with that.

Perhaps you'll like this, then: https://github.com/ustuehler/git-cvs

What are the reasons why CVS is still in use?

  1. it works
  2. they're used to it
  3. there isn't enough reason to change
  4. lots of infrastructure would need to be rebuilt if they changed

Because we're already too busy with porting our code base to Ruby.

I wonder if this is related to the "capsicum needs you to rewrite programs" bit. I.e. the capsicum man page explicitly says

> Once set, the flag is inherited by future children processes, and may not be cleared.

And mentions that specific new APIs should be used in order to manage processes through capabilities.

Note that about half of the parent proc struct is simply copied into the child.

Ah yeah, just found that, the patch has a couple lines added to sys/proc.h,

    +	u_int	ps_tame;
, right before the end of what's defined as `ps_endcopy`, which is copied from parent to child in `process_initialize`.

Makes sense now.

This won't work for programs that allow for plugins, which are arguably those that need the most protection. Programs don't generally know what permissions plugins will need when they are compiled.

Could you provide examples?

The stuff I'm thinking of that would plug in to Firefox or Photoshop probably does things that would already be allowed (read/write files, allocate memory, access network).

Either way this seems like an extremely simple way to lock down all the little command line utilities and small programs that make up a working unix system, so that if someone does get arbitrary command execution by other programs it gets much more difficult to chain exploits.

Think more like emacs or vim. Plugins can add any number of things that aren't just file read/write. They can add new syscalls for a built-in debugger (ptrace, strace, etc.) or even add an OpenGL layer for coding in 3D.

That's basically what I meant, thanks for explaining it clearly.

Hrm. I don't see why not? The plugin declares its behaviors in a manifest. During program initialization, the program reads all the manifests, sums the permissions, and declares those.

Of course, if you download an infected plugin, you're going to have a bad time. But that is likewise a problem if you download an infected program and run it. It's not an attack vector this syscall is meant to prevent.

I don't think that the plugin should declare its behavior. Consider the case where you have a shared webserver and a PHP plugin for one user (say a WordPress installation). Then the user (or anyone with write access to the user directory) can control the permissions of the server.

On the other hand, a well designed plugin interface could set default permissions. For example the plugin interface could have a SQL method, so that a plugin does not need to talk to a socket directly.

Most of the issues with firefox, drupal, whatever are coming from plugins anyway.

Protocol-based interactions (which require a clear API) do a better job at isolation than modules.

In a certain way, Linux's requirement that drivers run in kernel space, and the adaptations made so drivers can access resources, have made parts of the kernel's internal API kludgy. On the other hand, it is true that using modules without a stable internal API lets Linux present an external API that does not change while the behaviour of the kernel and its internals can be modified.

(I guess there is a price to pay for everything, but not everybody is a genius like Torvalds)

I think the OpenBSD philosophy is to run plugins in separate processes.

I'd presume child processes would inherit the same restrictions as the parent process.

I guess a root process could remain privileged, the main restricted process could be a child of that, and that main process could ask the root process to spawn plugins. But, that'd weaken the model a bit.

I think it makes sense for child plugins to inherit restrictions at the time of the fork, not indefinitely. So the main process could spawn its plugins, then drop its own privileges.

But then you can just fork() and run your exploit in the child process after it raises its permissions. I don't see how that could work.

The unixy solution to this is rather obvious. Main app runs at relatively-higher privilege. Main app forks proc for plugin. Drops privilege in the child after the fork. Then calls exec on the plugin. The plugin is stuck at that lower privilege, but still talks to the main app through a pipe etc.

That's how you should do it, yes. But the OP was talking about increasing privs in the forked process, which would defeat the security.

Who, me? No, I wasn't. I was talking about the plugins keeping the high privileges (not raising them). And that would be because I assumed the plugins, like the main process, were trusted - my idea (like the proposed syscall) would be for the plugins to tame themselves, knowing what privileges they would need to work, and dropping everything else, so that they couldn't be used as an exploit. It wasn't to protect the system from the plugin authors themselves.

And my point is nobody who knows what they are doing would build it that way. Nobody suggested it ever work that way. OP's doubt on the subject was rooted in a misunderstanding of unix philosophy.

Privileges should be inherited across fork and monotonic down to solve this problem.

I don't see why this would need to be known at compile time. You could load the plugins, ask them what they need, then drop everything else.

The flags will need to be configurable through a config file (e.g. nginx, apache). When you explicitly grant more access because of a plugin, you know what you are getting into.

But assume nginx supports tame now, you know at least what the process can do and what it cannot do explicitly. If one day a zero-day attack was discovered in nginx, nginx running tame will have a lesser security impact, at least in theory.

This interface makes a system call at run time to reduce privileges. I wonder if there is a way to do this statically and automatically, either with some header file magic, or by analyzing the symbols in the executable: just assume at link time that any system calls it makes (and only those) are allowed.

Looking at a program statically and limiting its privileges automatically would be good (autotame?), but it misses the point that tame appears to be trying to solve. Specifically, it is often the case that programs need more privilege once at initialization time, and less privileges later. So at startup, your program needs to open some files, and open listening network sockets, etc., but once initialization is done it just needs CPU. A static analysis would see that the program opens files and listening sockets and stuff, and grant those privileges forever. If you want to capture the reduced privileges required after the one-time initialization then you need to modify the program to use something like tame (or capsicum, etc.).

Well, if it can create new address-space mappings, it can simply create a new text mapping and execute the call from there.

If anything, the place to implement it statically could be either a virtual machine jit, or at the compilation stage.

For comparison, the man-page for FreeBSD's capsicum API: http://mdoc.su/f/capsicum.4.

Great to see another attempt at this model. I do like capsicum (seems pretty straightforward), but it seems like it can require some added complexity with things like casperd for dns.

Very interested to see how this works out.

Let's say I received a SIGHUP and I need to reload my configuration after I dropped my right to read /etc/ and all the syscalls for provisioning my resources.

How screwed am I?

You would have a parent process that retains permissions to read configuration and sends changes to children, e.g. imsg_*(3) on OpenBSD.


This sounds very similar to code access security (CAS) that Microsoft's CLR had from around 2000 (but at the OS level rather than the VM level).

It reminds me of SELinux.

Not really.

A simple, easy way to keep a lid on privilege escalation is to remove all the files that your computer does not absolutely require to do its job.

Especially the development tools: the Morris worm enabled portability by distributing itself in source code form then building its binary on its target hosts.

My sister once read a novel about some very traditional, strictly religious people who fastened their shirts with string ties as they felt buttons were hooks that the Devil could use to grab hold of you.

I feel much the same way about files. I don't know what tomorrow's zero-day will look like but the chances are quite good that it will depend on a file that is installed by default. Cliff Stoll wrote in "The Cuckoo's Egg" of a subtle bug in a subprogram used by GNU emacs for email. Had the Lawrence Berkeley Laboratory used vi rather than emacs they would not have been vulnerable. ;-D

Yes, it is a step in the right direction not to run daemons or Windows services you don't need, but it's even better to remove them.

In 1990 I wrote an A/UX 2.0 remote root exploit to drive home my objection to one single file having incorrect permissions. Its source was about a dozen lines. That particular file was required, but our default installs have many files we don't really need.

Also, if you can read - not just execute - the binary of any program or library, then your malware can load it into its memory and execute it. We have no way of knowing who is going to do that tomorrow, but we do know there are many binaries we do not really need.

If you develop code for your server, install the same distro in a vm on your desktop box then compile it there.

> A simple easy way to keep a lid on privelige escalation is to remove all the files that you computer does not absolutely require to do its job.

Taken to its logical conclusion, you sort of end up with a unikernel system, like Mirage OS[1]: only the code necessary for the execution of the service is compiled into the kernel. These systems don't even have a shell.

[1] https://mirage.io

Many embedded systems are that way.

While it is helpful that hardware memory management will protect against erroneous and malicious code, even better is for the code to be correct and ethical.

This is because MMU hardware takes up power, costs money, generates heat, and uses real estate. Also, the software is complex and uses a lot of memory for page tables and complex allocation schemes.

The Oxford Semiconductor 911, 912 and 922 didn't even have a kernel, nor did they have dynamic memory allocation - just stack and static memory with an infinite loop operating a state machine. A huge PITA to debug, but the memory and flash were quite cheap because there wasn't very much of either.

It is common for executables to have these permissions:

    -rwxr-xr-x

maybe this is better:

    -rwx--x--x

What that means is that you can run the program but you cannot read it as a regular file.

To delete or create a file you must have write permission to the directory it is or will be found in.

Yes, it's a PITA to take away your own permissions, but your server is not the box you take with you when you hang out at Starbucks.

So when is OpenBSD getting rewritten in Rust?

