
Why pledge(2) or, how I learned to love web application sandboxing - zdw
https://learnbchs.org/pledge.html
======
qwertyuiop924
pledge(2) is great. However, BCHS sucks.

No. Do not write your webapp in C. I'm serious here. Write it in Python, write
it in TCL, write it in Lisp. But write it in an environment that is at least
semi-managed.

FFIs are pretty good. You can call pledge from your app. But don't write your
app in C. If you're doing anything serious, it's a bad idea.

~~~
lilyball
If you're considering writing it in C, you probably have some reason why you
don't want to use a scripting language. To that end, I'd suggest using
something like Rust instead of C.

~~~
clarry
Maybe I'm writing it in C because I like the language, enjoy using it and am
good with it? And I know it works everywhere I want it, even years later? I
hate to be told what I should or shouldn't use.

~~~
Nullabillity
And your users will certainly enjoy all the security issues your code will
have.

~~~
z3t4
Someone with say 10-20 years of experience in a language will write a more
secure program than someone who just started in _any_ language ... And even if
you program in a "safe" language, you still make bugs, and all those bugs
passed the compiler/parser, so your language obviously didn't help you there.
There is _a lot_ of code written in C in the wild, which means there is also a
lot of bad code.

~~~
Nullabillity
There are business logic bugs, and there are memory corruption bugs. One of
those leads to RCEs a lot more often than the other. One of those is a lot
easier for the compiler to prevent than the other.

~~~
Sanddancer
Yes, but there are ways to prevent most memory corruption bugs, at least if
you're using an OS with a non-terrible malloc implementation. The GNU malloc
that most people know is a terrible antiquity that should have been replaced
decades ago -- mallocs like OpenBSD's make a lot of memory corruption bugs a
lot harder to exploit.

~~~
qwertyuiop924
But not harder to have exist in the first place.

------
loeg
The author dismisses capsicum without much thought. Really, any application
you've already restructured to be sandboxed under pledge(2) is trivial to
capsicumize.

~~~
aomix
That is probably the reason for the quick uptake of pledge in OpenBSD's tree.
A lot of their software had already been restructured or rewritten for
privilege separation. But looking at a few capsicum diffs in the FreeBSD
codebase I'm not sure I'd call those trivial. You can do more with capsicum,
so there is more code to deal with those abilities. But even simple cases hit
20+ lines of code with multiple failure states to deal with. Compared to the
usual two-line pledge diffs, that's quite a bit more work to do.

~~~
loeg
> Compared to the usual two line pledge diffs that's quite a bit more work to
> do.

I think you're comparing apples and oranges. OpenBSD has just already
done the restructuring as a separate commit; in FreeBSD you see it all at the
same time, so it looks bigger. Capsicum for simple stuff is only a few lines,
especially with the "helper" subroutines.

In some cases we can get more functionality with the same restriction, or more
restriction than OpenBSD due to more specific constraints. So there's more
code, but it also does more. It can be a trade-off.

~~~
aomix
That statement compared applying capsicum to programs that had also been
pledged in around two lines of code. A few of the programs the wiki gives as
examples of capsicum applications originated from OpenBSD, so they were primed
and ready to go. Even those best cases are much more involved than pledge.

To use the example of the Unix tr utility, the change to use pledge required
the standard two-line diff.

    
    
      if (pledge("stdio", NULL) == -1)
    		  err(1, "pledge");
    

tr.c was one of the earliest programs pledged (back when it was called tame).
The original diff was a one-liner, before they started doing the "pledge or
error out" pattern.

[http://marc.info/?l=openbsd-tech&m=144070638327053](http://marc.info/?l=openbsd-tech&m=144070638327053)

    
    
      +	tame(TAME_STDIO, NULL);
    

The capsicum diff required the following

[https://reviews.freebsd.org/D7928](https://reviews.freebsd.org/D7928)
[https://reviews.freebsd.org/file/data/4exxbzvuc3dayrvdj6qe/P...](https://reviews.freebsd.org/file/data/4exxbzvuc3dayrvdj6qe/PHID-FILE-2kbbfbdvu7dpl2u3ztnx/D7928.diff)

    
    
      +	cap_rights_t rights;
      +	unsigned long cmd;
    

(...)

    
    
      +	cap_rights_init(&rights, CAP_FSTAT, CAP_IOCTL, CAP_READ);
      +	if (cap_rights_limit(STDIN_FILENO, &rights) < 0 && errno != ENOSYS)
      +		err(1, "unable to limit rights for stdin");
      +	cap_rights_init(&rights, CAP_FSTAT, CAP_IOCTL, CAP_WRITE);
      +	if (cap_rights_limit(STDOUT_FILENO, &rights) < 0 && errno != ENOSYS)
      +		err(1, "unable to limit rights for stdout");
      +	if (cap_rights_limit(STDERR_FILENO, &rights) < 0 && errno != ENOSYS)
      +		err(1, "unable to limit rights for stderr");
      +
      +	/* Required for isatty(3). */
      +	cmd = TIOCGETA;
      +	if (cap_ioctls_limit(STDIN_FILENO, &cmd, 1) < 0 && errno != ENOSYS)
      +		err(1, "unable to limit ioctls for stdin");
      +	if (cap_ioctls_limit(STDOUT_FILENO, &cmd, 1) < 0 && errno != ENOSYS)
      +		err(1, "unable to limit ioctls for stdout");
      +	if (cap_ioctls_limit(STDERR_FILENO, &cmd, 1) < 0 && errno != ENOSYS)
      +		err(1, "unable to limit ioctls for stderr");
      +
      +	if (cap_enter() < 0 && errno != ENOSYS)
      +		err(1, "unable to enter capability mode");
    

No one would dispute that capsicum is more capable, but it is also
significantly more complex. Pledge trades finer control over capabilities for
a "work or die" usage model. Capsicum requires that you be aware of all the
potential failure cases and account for them.

~~~
loeg
> The capsicum diff [to tr] required the following ...

This is a slightly out of date example. We've since added simplifying wrappers
for stdio.

The current equivalent of if pledge("stdio", ...) / err is:

    
    
        if (caph_limit_stdio() < 0 || (cap_enter() < 0 && errno != ENOSYS))
          err(1, "capsicum");
    

Some examples:
[https://reviews.freebsd.org/D8307](https://reviews.freebsd.org/D8307)

> a "work or die" usage model

This is an option (or will shortly become an option) in capsicum.

~~~
aomix
That's really cool, thanks.

------
egh5oon
Firejail is one of the most underrated:
[https://firejail.wordpress.com/](https://firejail.wordpress.com/)

------
marktangotango
I'm confused about what exactly 'sandboxing' entails here. Seems to be
limiting syscalls and file system interactions. But what about process heap
and cpu usage? Are these handled by either seccomp or capsicum?

~~~
loeg
No, capsicum is a security sandbox. Resource usage is not governed by
capsicum. For that you can use traditional ulimits or RCTL for finer-grained
control (similar to Linux cgroups).

------
catern
Since the author is using cgi:

>However, this is definitely something in the crosshairs of ksql(3): forking a
process, like kcgi(3) does, that handles the database I/O and communicates
with the master over pipes.

If they do this, then they can just use seccomp SECCOMP_SET_MODE_STRICT, which
is the first thing described in the seccomp man page and in the Wikipedia
page. And is also the original/first mode of seccomp to be available.

I believe capsicum would likewise be trivial to use: You can just immediately
lock things down completely. (Perhaps just cap_enter() is sufficient?)

~~~
justincormack
SECCOMP_SET_MODE_STRICT is pretty hard to use - you cannot allocate memory for
example.

~~~
catern
I did not realize that. Given that, I was incorrect when I said that the
author (or anyone) could easily use seccomp MODE_STRICT.

Capsicum may still be easy to use. And seccomp-bpf is not that complex either.

------
Sean1708

> allows resource requests or denies them and (hopefully) kills the
> application

Why is killing the application desirable behaviour over just denying the
request?

~~~
shanemhansen
There's a school of thought in security that believes you should hard fail at
any attempt to violate an invariant. I believe the linux grsec patches do
this. They will just crash the kernel if they see something at runtime that
they don't like.

~~~
aaronmdjones
Indeed; if you have the Active Kernel Exploit Response feature enabled in
grsec, and a root process triggers a kernel OOPS, it will panic() the kernel.

You can combine this with the panic= kernel command-line argument to set a
reboot delay, but for headed (rather than headless) boxes it may be more
desirable to keep the panic message up to see what process triggered it.

------
acidx
Shameless plug, inspired by pledge(2), if you're a Linux rather than a BSD
person: how to use elementary virus-writing techniques to sandbox programs
under Linux using seccomp-bpf:
[https://tia.mat.br/posts/2016/11/08/infect_to_protect.html](https://tia.mat.br/posts/2016/11/08/infect_to_protect.html)

------
comex
The difference between pledge and seccomp is not just a matter of fine-
grainedness. There's some of that, but I'd say there's also a difference in
philosophy: sandboxing resources versus sandboxing kernel attack surfaces.

At one end of the spectrum is the macOS sandbox, which pretty much only
restricts how the process is allowed to interact with the outside world. It's
not actually simplistic like the article claims; that's just the public API.
If you look at the raw sandbox profiles, which Apple writes for daemons and
such, it's quite fine-grained: sandbox profiles can specify which paths can be
read and which can be written (with an arbitrary number of allow/deny rules,
each of which can match on an arbitrary regular expression), which network
ports can be opened or listened to, which preferences can be read or written,
which IPC services can be accessed, and so on... there are a ton of
categories. If you're on macOS, look in /usr/share/sandbox and
/System/Library/Sandbox/Profiles to see what they look like. _However_ ,
sandboxing only occurs when a given syscall's kernel implementation explicitly
calls for it, and even the most locked-down process can access hundreds of
syscalls and Mach kernel IPC calls, which based on their intended purpose
should be harmless. And most of them really are harmless, but all it takes is
a small oversight in one of them for an attacker to potentially corrupt memory
in the kernel and gain control over the system. *
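For a flavor of what those raw profiles look like, here is a hand-written fragment in the same Scheme-like SBPL style; the paths and service name are made up for illustration:

```scheme
(version 1)
(deny default)
;; file access, with per-path allow rules
(allow file-read* (subpath "/usr/lib"))
(allow file-write* (literal "/private/var/tmp/example.log"))
;; a specific Mach IPC service
(allow mach-lookup (global-name "com.example.helper"))
```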

At the other end is Linux seccomp, which is designed to allow minimizing
kernel attack surface. Chrome's renderer processes only get access to a tiny
set of syscalls, and even the arguments to the syscalls are locked down,
corresponding as closely as possible to what the process actually uses. (This
is possible in part because Chrome has an unsandboxed master process that
handles things like reading files for it. The renderer sandbox doesn't even
allow open(2).) Thus Chrome is protected from most vulnerabilities in the
Linux kernel, because there's just no way to get to the buggy code in question
from within the sandbox. This isn't perfect: I've found multiple Linux kernel
bugs (only one exploitable so far) that _are_ accessible from the sandbox. And
of course it doesn't make Chrome immune to privilege escalation, since
attacking the kernel directly isn't the only option; IPC communication with
the master process is a much larger attack surface. But that's basically
unavoidable, and the situation for Chrome is still a big improvement over
other operating systems. Unfortunately, as the article says, it takes quite a
lot of work to adapt to new applications.

OpenBSD pledge seems somewhere in between. Unlike the other two, I don't have
personal experience using it, but based on the manpage - an empty "promises"
set restricts the process to _exit(2), which like seccomp should protect
against kernel flaws. However, if you supply "stdio", suddenly the process
gets access to 69 system calls, which is not terrible (many of the calls in
the set are very basic and unlikely to contain bugs) but not as minimal as one
might want. There are a bunch of random functionality lockdowns, which is good
- for example, ioctl operations are locked down, "setsockopt(2) has been
reduced in functionality substantially"... on the other hand, the promise
categories generally seem more aligned to 'resources the program might want to
access' than 'kernel API surface the program might need to use'.

Overall, definitely not bad, and I'd like to see something comparable on
Linux, but I'd be more comfortable if OpenBSD also supported a more fine-
grained sandbox.

* (Okay, that's a little unfair to macOS, since there are some sandbox filters that lock down kernel attack surface, such as the ioctl filter and the IOKit filters - the latter of which has been tightened over time. But that still leaves a _ton_ of surface exposed.)

~~~
mtgx
Would you say Chrome is more sandboxed on Linux due to seccomp than on Windows
10?

~~~
comex
Yes; I'm not too familiar with Windows but I know there's no syscall-by-
syscall lockdown. However, they were able to disable all Win32k syscalls
starting in Windows 8, which is a significant attack surface reduction
compared to the past:

[https://www.chromium.org/developers/design-documents/sandbox#TOC-Win32k.sys-lockdown](https://www.chromium.org/developers/design-documents/sandbox#TOC-Win32k.sys-lockdown)

------
loosescrews
Go (golang) is a great language for web applications and has comparatively
small syscall coverage, making it easy to sandbox with seccomp(2). It also has
most of the protections of a fully managed language, makes issuing raw
syscalls easy, and can be used without libc. If you don't use cgo or SWIG
(pure Go and Go assembly are fine), building statically linked pure Go
binaries without libc is just a couple of command-line flags. If you don't
import the net package or a few other problematic packages, no command-line
flags are required.

~~~
Someone
Apart from the 'easy to sandbox' (also debatable, as that depends way more on
the functionality of the application than on what it is written in, but let's
ignore that) does that matter much? pledge isn't only, or even primarily, a
defense against _you_ accidentally making syscalls you don't want to make, but
a defense against malware that, having invaded your application's process,
attempts to make such syscalls pretending to be you. That's code you didn't
write, making the programming language choice moot.

~~~
loosescrews
Not really. An effective sandbox is restrictive. Managed languages tend to use
a more limited set of system calls than non-managed languages because most
libraries don't make any system calls of their own. The Go runtime and
standard library use a more limited set of system calls than most
alternatives, so a more limited set of sandbox filters is required.

Injected instructions are more likely to invoke their own system calls, so you
are more likely to catch them if you have a more restrictive set of filters.

~~~
Someone
_" Managed languages tend to use a more limited set of system calls than non-
managed languages because most libraries don't make any system calls of their
own"_

I have a hard time understanding how that could be true. A library either
needs to use OS functionality, or it doesn't. If it needs to, managed
languages cannot avoid doing so, and if it doesn't, why would non-managed
languages make them, more so since a major argument for the most popular of
them, C, is performance?

The only argument I can see is that managed language libraries could use
(fewer) other libraries or the runtime to make such calls. Neither decreases
the number of system calls made by any program.

------
mioelnir
Does pledge(2) still enable its simple programming interface by allowing
certain default actions based on a filesystem hierarchy layout, statically
defined in the kernel/libc/somewhere? The first presented versions of tame(2)
had this iirc.

~~~
wahern
If I followed the tech@ list properly, I believe OpenBSD has added a
specialized socket type for the "dns" pledge. That allows removing the code
that permits certain socket calls only if /etc/resolv.conf has been opened.

There are other things in the works related to pledge. For example, they're
going to ship a native local stub resolver that will be a functional extension
of libc, permitting them to remove and simplify code from libc. That would
allow sandboxing networked services without necessarily requiring any
filesystem access after invoking pledge, while still permitting DNS to
seamlessly work even after configuration changes. (The trick is that most
client resolvers, third-party and in libc, default to 127.0.0.1:53 if
/etc/resolv.conf is unavailable.)

