Unveil(2) – Unveil parts of a restricted filesystem view (openbsd.org)
206 points by notaplumber 9 months ago | 67 comments



Not a very insightful comment but I really like the way OpenBSD names system calls - both `unveil` and `pledge` are great, descriptive Unix-y names for what these system calls do.


I'm curious. To me, unveil means reveal information, while the intended purpose for this tool is to hide it. Why do you think it's such a descriptive name?


While it implicitly hides everything on the first call, it unveils the arguments. Reads really natural. I imagine typical usage would be:

    fork();
    unveil("/home/jakob/", "w");
    unveil("/etc/some_config_file", "r");
    unveil(0,0);
    exec();


> While it implicitly hides everything on the first call

I think that's the really weird bit, I guess they didn't want multiple functions but it would make more sense to veil() (hide everything), unveil(path, mode) (show that path) and lockveil(), something along those lines. Or maybe use some sort of mode constants e.g. veil(VEIL_INIT), veil(VEIL_REVEAL, path, mode), veil(VEIL_LOCK).


What benefit would requiring veil() before unveil() have? There is no point in calling unveil() if the file system isn't hidden. Making the hiding implicit reduces the number of possible mistakes people can make when using the API.


You could be more sure that something else didn't already unveil things you want hidden in your current invocation.


The end game might be that everything is veiled by default, so if you have no unveil calls then your process can't read anything


That would work just as well with a better API. The initial call would just become a no-op at that point.


what do the 0,0 mean here?


Disallows further use of unveil()


It is stated in the 2nd line of the Description in the manual that this will reveal things with subsequent calls. It’s just the initial call that hides everything.


Tbh, to me the title "Unveil parts of a restricted filesystem view" read like it was about a hack.


Yeah, that was my thought too. This is more like “veil_except”.

That said, “unveil” is good enough.


Why not "restrict"?


Here be dragons! "restrict" is a keyword in C.


I agree. Naming things is so hard. It's always nice when people do manage to come up with terse names.


Reminds me more of Perl naming conventions. Take an arbitrary non-technical expression, even where established names already exist, and end up clashing with the semantics.

Also the use of strings for flags instead of int bits just makes me cry out. This is pure nonsense.


Pledge (previously `tame`) did use bitmasks in its earlier iterations. See Ted Unangst's post about it: https://www.tedunangst.com/flak/post/string-interfaces


I know this change, that's why I bothered. He can do this in his private app as he likes, but not in the libc which is the most widely copied libc. It's about security, consistency and being a positive role model.


I don't like it, it's un-C-like.


Too many vowels? Would you prefer 'pldg' ? ;) Unless you meant the stringly-typed arguments in which case I'm 100% with you, they're asking for trouble.


It's like everything else in openbsd. The people who don't use it don't like it; the people who live with it every day for years realize benefits the haters don't perceive.


Maybe Theo wanted fopen-like semantics?


fopen is not a system call, it is a library call. The system call is open, and it uses bits for flags.


Which makes it that much harder to extend.

There's nothing wrong with the use of strings here. It's very readable, easily understandable to anyone who knows C, can be checked by automated tools (or at runtime) for invalid values, and is easily extensible in the future. It's also not on a hot path (I mean, you shouldn't have to do this at any point other than process creation).


Right but is there really any reason why the implementation should call strncmp or anything that would be costlier than a bitwise AND? It's not like "r" reads better than O_RDONLY.


Pledge happens on process startup only, so since it’s not like it is called over and over again during the life of the process, the overhead is certainly negligible for all modern servers, desktops and laptops.


But compiler errors (typo in a flag) become runtime errors (typo in the unveil string).
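To make the runtime-error point concrete, here is a minimal sketch of how a string interface could map words to bits. The names (`demo_parse_promises`, the `DEMO_*` bits) and the bit values are made up for illustration; this is not OpenBSD's actual implementation. A misspelled word still compiles and is only rejected when the call runs.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical flag bits, for illustration only; not OpenBSD's real values. */
enum {
    DEMO_STDIO = 1 << 0,
    DEMO_RPATH = 1 << 1,
    DEMO_WPATH = 1 << 2,
    DEMO_PROC  = 1 << 3,
    DEMO_EXEC  = 1 << 4
};

static const struct {
    const char *name;
    int bit;
} demo_promises[] = {
    { "stdio", DEMO_STDIO },
    { "rpath", DEMO_RPATH },
    { "wpath", DEMO_WPATH },
    { "proc",  DEMO_PROC  },
    { "exec",  DEMO_EXEC  },
};

/*
 * Turn a space-separated promise string into a bitmask.
 * Returns -1 and sets errno to EINVAL on an unknown word.
 */
int
demo_parse_promises(const char *s)
{
    int mask = 0;

    while (*s != '\0') {
        size_t len, i, n = sizeof(demo_promises) / sizeof(demo_promises[0]);

        s += strspn(s, " ");        /* skip separators */
        len = strcspn(s, " ");      /* length of the next word */
        if (len == 0)
            break;                  /* only trailing spaces were left */
        for (i = 0; i < n; i++) {
            if (strlen(demo_promises[i].name) == len &&
                strncmp(demo_promises[i].name, s, len) == 0)
                break;
        }
        if (i == n) {
            errno = EINVAL;         /* a typo is only caught here, at run time */
            return -1;
        }
        mask |= demo_promises[i].bit;
        s += len;
    }
    return mask;
}
```

With this sketch, `demo_parse_promises("stdio rpath")` yields the two bits OR'd together, while `demo_parse_promises("stdio rpaht")` returns -1 with `errno` set to `EINVAL`. The cost is a handful of short string compares per call.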


You're only one character away from FLAG1 || FLAG2.
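A small sketch of that one-character slip, with hypothetical flag values (`FLAG1` and `FLAG2` here are placeholders, not real constants from any header). Both functions are valid C, but the logical OR collapses the result to 1 instead of combining the bits.

```c
/* Hypothetical flags, for illustration only. */
#define FLAG1 0x1
#define FLAG2 0x4

/* What was meant: set both bits. */
int
combine_bitwise(int a, int b)
{
    return a | b;
}

/* The typo: logical OR yields 0 or 1, never a combined mask. */
int
combine_logical(int a, int b)
{
    return a || b;
}
```

Here `combine_bitwise(FLAG1, FLAG2)` is 0x5, while `combine_logical(FLAG1, FLAG2)` is 1, so the `FLAG2` bit is silently lost.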


It is trivial to run a script to check the values are correct.


It's also pretty trivial to run the program. Maybe it's different in other projects, but I'm pretty sure in OpenBSD you're encouraged to actually test the code, not just ship it because it compiles. There's 2 programs in OpenBSD's base that don't abort on the pledge call failing, they're both shells, and they don't fail quietly.


You don't see an issue with needing to write and maintain a verification script for something that is essentially a solved problem? Why add more development runtime overhead? What do string-based flags get you that bit-based don't?


Removes the need to add a separate include for namespacing, ie. no pledge.h header needed.

Makes the calls shorter and easier to read. Consider:

    if (pledge("stdio rpath tmppath proc exec", NULL) == -1)
        err(1, "pledge");
vs.

    if (pledge(PLEDGE_STDIO | PLEDGE_RPATH | PLEDGE_TMPPATH | PLEDGE_PROC | PLEDGE_EXEC, NULL) == -1)
        err(1, "pledge");
Makes it easier to add a pledge interface to other languages without having to keep chasing bit flag changes, ie. there have been several new promises added since 2015 but the OpenBSD::Pledge perl module hasn't needed to change.

And really, unless you're using pledge wrong, ie. by using it outside of the startup path and/or not checking the return value, what does compile-time checking get you?


Nope, I really don't. I don't fear strings in code because I have to run verification scripts on whatever config format we end up using or you get stuff like [1]. It's just another part of the compile process once set up, and their reasoning was sound[2], so why stress. Heck, the script isn't even complicated.

Now, at a language level, would I like the C committee to start looking at some of these things? Yes, but why bother hoping for miracles.

1) https://www.engadget.com/2018/07/13/aliens-colonial-marines-...

2) https://www.tedunangst.com/flak/post/string-interfaces


If I’m not mistaken, you can even have the best of both worlds in languages with proper macro support (I’m sure you know which lang I’m talking about): the macro call can take the string input and "desugar" it into bitmasks / function calls / whatever you need at compile time. Then you have:

1) easily readable, string-like input; 2) built-in-your-code verification and 3) compile-time check; 4) zero-overhead execution.

Just reading Ted’s blog post gave me an idea of how lang-specific libraries / (thin-layer) abstractions (or simply additional magic helper functionality) can improve upon legacy API designs (e.g. the GLX example) by using lang-specific functionality I never knew I’d need or even cared about. Sometimes you learn something when you’d never expect it. :D

If I’m mistaken and have misunderstood the whole thing, please correct me.


I think the issue is that for most languages you have to update the libraries with the new constants for bit-wise operations and not for the string approach. Some languages would be fine, but this is multi-language in the end.


This seems like a pretty major limitation of the interface:

> It is important to consider that directory results are remembered at the time of a call to unveil(). This means that a directory that is removed and recreated after a call to unveil() will appear to not exist. Non directories are remembered by name within their containing directory, and so may be created, removed, or re-created after a call to unveil() and still appear to exist.

So it's not really path-based, but rather translates a path at the time of the call to some UUID (or just an inode? But then how is inode re-use handled?) for that particular directory.

So if you unveil("~/myapp", ...) and the user does something like `mv ~/myapp ~/myapp.back && mkdir ~/myapp` the program won't see anything under ~/myapp anymore, since it's not the same directory, even though it's at the same path.
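The "same path, different directory" effect can be observed with plain POSIX calls, independent of unveil. A rough sketch, assuming a writable /tmp; the function name and the scratch paths are made up for illustration. It recreates a directory the way the `mv` and `mkdir` above would, and compares the (device, inode) pair before and after.

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create <tmp>/myapp, remember its identity, then replace it with
 * "mv myapp myapp.back && mkdir myapp". Returns 1 if the path now names
 * a different directory, 0 if it names the same one, -1 on error.
 */
int
demo_path_identity_changed(void)
{
    char base[] = "/tmp/unveil-demo-XXXXXX";
    char dir[128], backup[128];
    struct stat before, after;
    int changed;

    if (mkdtemp(base) == NULL)
        return -1;
    snprintf(dir, sizeof(dir), "%s/myapp", base);
    snprintf(backup, sizeof(backup), "%s/myapp.back", base);

    if (mkdir(dir, 0700) == -1 || stat(dir, &before) == -1)
        return -1;

    /* mv myapp myapp.back && mkdir myapp */
    if (rename(dir, backup) == -1 || mkdir(dir, 0700) == -1)
        return -1;
    if (stat(dir, &after) == -1)
        return -1;

    /* Same path string, so compare the (device, inode) pair instead. */
    changed = before.st_dev != after.st_dev || before.st_ino != after.st_ino;

    /* Clean up the scratch directories. */
    rmdir(dir);
    rmdir(backup);
    rmdir(base);
    return changed;
}
```

Because the old directory still exists under its new name, the freshly created one cannot share its inode, so anything that remembered the old directory object rather than the path string no longer matches.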


Why would the user do that? You don't move away a program's data dir while it's running and expect everything to just work, right?


There are programs that rely on this working, which would continue to work with unveil(2) if it wasn't for this artificial limitation.

Think e.g. a daemon that processes data in /var/run/something and uploads it somewhere else, and gracefully handles failure if the directory disappears (e.g. an admin had to remove all existing data). Due to the unveil(2) interface such an action would require a full restart of the daemon, but that's not a normal limitation on *nix systems.

I can't think of any reason for it to work this way except that it's exposing the underlying limits of the implementation, or if they're making the trade-off of tying it to the inode so `mv ~/myapp ~/myapp.back` will continue to allow access to files under ~/myapp.back, but they don't document that.


Even though the function's name is unveil, I'm thinking that this feature is for denying access rather than allowing. In that sense, I think it's a useful protection measure that the new directory doesn't suddenly become visible in your example -- smells like an attack vector. Also, I would guess that doing these checks via pathname every time files are accessed would have to be pretty bad for performance.

If such a feature is wanted, I would think that the daemon can just call unveil again when it's doing an automatic recovery.


Perhaps it's for performance reasons. I don't see how it would be an attack vector to permit /some/path/dir to be read after the "dir" is re-created or replaced. You'd guard against that with standard *nix ACLs. If someone can replace the directory they can also modify it.

> [...]the daemon can just call unveil again[...]

Not if it previously called unveil(NULL, NULL), which is a significant part of the security selling point of this feature. If so it'll need a full restart.

You also won't be able to tell if the directory can't be accessed or if it truly doesn't exist, since syscalls will return ENOENT instead of EACCES in this case. So the state machine to recover from this becomes complex. You might do a full restart just to find that no, the directory really doesn't exist anymore, rather than getting replaced.

On the other hand this is consistent with how chroot(8) works, but in that case you're cd'd to the directory in question, which you'll lose access to if it gets replaced (standard *nix semantics, nothing to do with chroot per se).

Overall I really like these simple security restriction mechanisms OpenBSD is adding, and wish these sorts of APIs were available on more popular OSs, but if say Linux cloned this I hope they don't carry over this caveat, since it's not consistent with how path access works in general.


I think in linux you could already do this if you wanted to.

unprivileged user namespace + mount namespace -> mount some tmpfs, bind-mount (optionally read-only or noexec), pivot_root and you got an isolated view of the filesystem. You could write a wrapper around that which provides a streamlined API in the fashion of unveil.

It's what sandboxes and container runtimes do. Instead it could also be made into a security library which processes could use as part of their startup sequence.

Basically, the linux building-blocks (seccomp, namespaces, privileges) are more low-level and you need to assemble pledge/unveil-like abstractions from them. But I am not aware of any established library that does that in a convenient fashion.


The difference between pledge and seccomp, or unveil and namespaces, is that pledge and unveil limitations are not inherited.

Intuitively you would think that makes them less secure. But inheritance means that it becomes impossible to use system utilities. On OpenBSD /bin/sh is pledge'd; you could never do the same thing using seccomp because it would make the shell useless. At best you'd dynamically seccomp a single shell session according to a particular task, and you'd have to do it from the invoking process. Unsurprisingly, nobody bothers doing this.

So it's impossible to emulate pledge, and impractical to emulate unveil, on Linux. At this point, almost every program on OpenBSD uses pledge. No configuration. No option switches. It's done. Everything works like before, except now the security risk of bugs in all that code is substantially mitigated.

All the criticisms of pledge and unveil miss the basic point of these interfaces--to be easily useable by developers and easily used in all programs, not as low-level primitives to be used to write libraries to be used to write tools to be used by sysadmins for sandboxing services.


in linux, systemd can do that when starting a service. the config is in a unit file, not a library

probably better, as such code is likely to be very unportable. even if you standardise the calls in a library, on linux the right paths will probably vary between distributions and architectures


Pledge and unveil allow you to do privileged setup within the process, e.g. obtaining a bunch of file descriptors, or doing other things that need caps, then holding onto them and then dropping those privs.

That's not possible with an external container/sandbox launcher, you need to interleave it with the control flow.

Seccomp&co are too low-level. Jail launchers are too coarse.


If the daemon is going to call unveil on every file access then what purpose does unveil serve?


The trick is to unveil all the files the server needs to operate on and then lock further unveiling.

This way bugs/exploits of the server can't suddenly go read/write to e.g. /etc/passwd.


I assume that is intended as a security feature to prevent TOCTTOU bugs, especially when a process does not use the *at family of syscalls.

I.e. a malicious process can't swap directories under your nose.


> or just an inode? But then how is inode re-use handled?

inodes have generations. Each reuse increments generation. (ino, generation) tuple uniquely identifies an object within the file system.


This is correct, however generation is mostly needed (and was created) for NFS, and rarely used for anything else. Unveil lives in the kernel, it just keeps a reference to the vnode (https://github.com/openbsd/src/blob/03249988e9bbffb48a568839...)


Thanks. I was wrong about that, I was confusing this with PID re-use.


Note, this is the same as bindmounts in linux.

    $ mkdir a b
    $ sudo mount --bind a b
    $ rmdir a && mkdir a
    $ touch a/foo
    $ ls b # no foo
    
This can be seen in docker containers if you bindmount things around as well.

The solution for both is the same: simply bindmount a level higher. If you need to mutate "data", instead make the directory "myapp/data" and mutate the sub directory only while mounting/unveiling the parent.
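As an analogy only (this uses ordinary POSIX directory descriptors, not unveil or bind mounts), the sketch below holds one reference to the parent and one to the "data" subdirectory, then recreates "data". The function name and scratch path are hypothetical, and it assumes a writable /tmp. Lookups through the parent find the new directory; the reference to the old "data" does not.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 1 if data/foo is reachable through the parent reference but not
 * through the stale reference to the old "data"; 0 otherwise; -1 on error.
 */
int
demo_parent_vs_child_reference(void)
{
    char base[] = "/tmp/unveil-parent-XXXXXX";
    int parent, child, fd, via_parent, via_child;

    if (mkdtemp(base) == NULL)
        return -1;
    parent = open(base, O_RDONLY | O_DIRECTORY);
    if (parent == -1 || mkdirat(parent, "data", 0700) == -1)
        return -1;
    child = openat(parent, "data", O_RDONLY | O_DIRECTORY);
    if (child == -1)
        return -1;

    /* rmdir data && mkdir data && touch data/foo */
    if (unlinkat(parent, "data", AT_REMOVEDIR) == -1 ||
        mkdirat(parent, "data", 0700) == -1)
        return -1;
    fd = openat(parent, "data/foo", O_WRONLY | O_CREAT, 0600);
    if (fd == -1)
        return -1;
    close(fd);

    /* The parent resolves "data" afresh; the child still names the old one. */
    via_parent = faccessat(parent, "data/foo", F_OK, 0) == 0;
    via_child = faccessat(child, "foo", F_OK, 0) == 0;

    /* Clean up. */
    unlinkat(parent, "data/foo", 0);
    unlinkat(parent, "data", AT_REMOVEDIR);
    close(child);
    close(parent);
    rmdir(base);
    return via_parent && !via_child;
}
```

The design point is the same as in the comment above: anchor the long-lived reference one level up, and only mutate what sits beneath it.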


This design is so thoughtful. Thanks to fork, you can easily sandbox a child process to a specific subtree of the filesystem: fork, unveil(0, 0), run untrusted user code.

I wish OS X and windows had this.


Build systems can use such a feature, without going through heavy-duty isolation (sandboxing) like Bazel.

It's done there so that "accidental" dependencies are not followed, e.g. if your a.cpp source included b.h, the rule compiling a.cpp had better state explicitly that it depends on a rule "exporting" b.h

And possibly for other uses too :)


Is sandboxing on macOS not exactly this (albeit with a different API)?


It is. You have to write the rules as a Lisp script, and there's a scary "this is sort of deprecated, use the new Mac App Store stuff" paragraph in the manpage, but yes, it does allow exactly that.


I wish instead of a patchwork of hacks like this we had a nice fundamental mechanism, e.g. capabilities exposed via a union file system (i.e. you make everything your process might want to access available via a file handle and run it in some chroot where you "mounted in" the things you want it to be able to touch; e.g. /bin has all the programs it can run, network, window manager and process tree access is mapped to /net /x and /proc etc).


Can you?

I haven't used OpenBSD in around fifteen years.

Is the dynamic linker somehow exempt from unveil? Or does it somehow know what's going on?


Here's a presentation from Bob Beck entitled Pledge, and Unveil, in OpenBSD: https://www.openbsd.org/papers/BeckPledgeUnveilBSDCan2018.pd...


This discussion from 39 days ago has some good links plus whether there is a Linux equivalent:

https://news.ycombinator.com/item?id=17277067


I want a "merge" or "link" posts UX feature. Conceptually like merging bug reports.


We do merge posts while the discussion is still happening, but afterwards we aim to preserve context as much as possible.


This is neat but I am still partial to slower but fully generic approaches for syscall filtering, such as BPF.

Sooner or later you run out of clever English verbs for the plethora of fixed-function ones.


> fully generic approaches for syscall filtering, such as BPF

How do you constrain filesystem access to be under a list of allowed directories using BPF?

I briefly looked at it, and there doesn't seem to be a good way to manipulate strings. Everyone seems to be using chroot / bind mounts for that on Linux, which adds crap lines to `mount` output.


> unveil can be locked, preventing further filesystem exposure by calling unveil with two NULL arguments.

This is nice and I bet it will be preferable to blocking further unveil calls via pledge. Doing it with pledge would depend on the code path. Just as an example, are you connecting to an IP address or a hostname (requiring a "dns" pledge)? Are you reading from stdin or opening a file directly (requiring an "rpath" pledge)? The current pledge state wouldn't matter when removing unveil access via unveil itself, much simpler.

Pledge is already outstanding but combined with unveil I feel like a kid in a candy store.


I don't think the unveil(0,0) call to disable/ignore further calls is intuitive. But I'm not sure how you would improve on it beyond creating an unveil_disable() call.


The hardcoded zero literal for the flags could be replaced by a named constant.

unveil(0, UNV_LOCK)



