This guy's story is really impressive. Sounds like he managed to quit his job to work on this full time, livestreaming the development, as part of a drug addiction recovery strategy. [1] I hope the best for him going forward with that. It looks like a pretty interesting introduction to operating system development, a huge topic with tons of stuff to cover.
Why is the interface string-based? Seems very hacky and unusual. Also, special bonus evil points for whoever decided that "rpath" should have nothing to do with RPATH.
Seriously please can all of you other programmers stop pointlessly abbreviating things that are already quite short?! "read_path" is perfectly fine.
Yeah I know, I was really asking why OpenBSD did it that way. I don't think string-based arguments are easier to use. If it was a struct with a load of `bool` fields you'd get code completion, compile time type checking, built in documentation, discoverability, etc. Much easier!
You can't trivially extend structs in a kernel ABI (to be fair this is worse in Linux as there is more than one libc and many programs do raw syscalls bypassing the libc anyway -- though it can still be done[1]) but string APIs are simpler to use and upgrade. They can also be far more ergonomic in some cases.
The core issue is that userspace programs and libraries can be compiled using structs with the old size (in theory the libc can abstract this using symbol versioning but then the same issue lies within the libc) causing out-of-bounds memory accesses when the kernel tries to access the struct fields. There's also forwards-compatibility issues but bad memory accesses are marginally worse.
>This mechanism works by marshaling parameters to a system call into a single C structure; a pointer to that structure and the size of the structure are passed as the parameters to the system call. That size parameter acts as a sort of version number.
> You can't trivially extend structs in a kernel ABI
IMO the article you linked is evidence that this is trivial. Especially for syscalls that will only be called a few times in a process’s lifetime. I dunno why there’s so much bike shedding about this on the mailing list.
Edit: Ahh I didn’t realize you were the Aleksa mentioned in the article. I wish you good luck.
That Windows article gives a better argument than I ever could as to why putting the size in the argument list is better than in the struct. Some pre-openat2 Linux syscalls are designed in the same way. (Having no forwards nor backwards compatibility but still having struct size versioning really is an interesting design choice...)
But yes, it is relatively trivial -- my point was more that you can't just use a struct in the way the first comment suggested, you need to come up with some scheme (even if it seems trivial in retrospect).
As for the bike-shedding, that's LKML for you (though in fairness it is a bit of a tall order to try to come up with some enforceable API design rules in Linux -- syscalls with half-baked designs being added is less rare than one would hope, so clearly there's not an overarching design principle being applied already, though thankfully it's becoming pretty rare to see a completely borked syscall that clearly has no users being merged).
The kernel can easily version APIs using struct size, as long as new members are only appended. The libc function would pass the size of the struct to the syscall, or you could have pledge() itself be a macro that computes sizeof in the caller.
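A rough userspace sketch of that size-as-version idea, in case it helps. The struct, the function name and the macro are all made up for illustration; a real kernel would also have to copy across the user/kernel boundary, which is skipped here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical argument struct: v1 had only `flags`, v2 appended `mode`. */
struct demo_args {
    unsigned long long flags;   /* present since v1 */
    unsigned long long mode;    /* appended in v2 */
};

/*
 * Copy a caller-supplied struct of `usize` bytes into a receiver-side
 * struct of `ksize` bytes.
 *  - usize < ksize: older caller; the missing trailing fields read as zero.
 *  - usize > ksize: newer caller; accepted only if the extra bytes are zero.
 * Returns 0 on success, or -E2BIG if fields the receiver does not know are set.
 */
static int copy_versioned_struct(void *dst, size_t ksize,
                                 const void *src, size_t usize)
{
    const unsigned char *bytes = src;
    size_t common = usize < ksize ? usize : ksize;

    assert(dst != NULL && src != NULL);

    /* Reject non-zero bytes beyond what the receiver understands. */
    for (size_t i = common; i < usize; i++) {
        if (bytes[i] != 0)
            return -E2BIG;
    }

    memset(dst, 0, ksize);      /* fields the caller did not send become zero */
    memcpy(dst, src, common);
    return 0;
}

/* The size is computed where the caller's struct was compiled. */
#define DEMO_CALL(kargs, uargs) \
    copy_versioned_struct((kargs), sizeof(struct demo_args), \
                          (uargs), sizeof(*(uargs)))
```

The rule that new members are only ever appended is what makes the `common` prefix safe to copy blindly.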
That is the exact solution described in the link I included in my comment (I am the "Aleksa" in that article). It is not entirely trivial (certain edge cases need to be handled) but it is entirely doable. But string arguments also work if you don't have complicated data parsing requirements.
I (obviously) prefer the extensible struct solution but there are downsides (and other solutions weren't an option for Linux anyway).
There's a great blog post[1] by one of the OpenBSD developers about why they did so. tl;dr using bitmasks necessitates namespaced enums/defines that take up horizontal space, strings are easier and don't need to go through the C pre-processor.
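For a feel of what the receiving side of a string interface involves, here is a minimal sketch of parsing a space-separated promise string into a bitmask. The promise names are real OpenBSD ones, but the bit values, the table and the function name are assumptions for illustration, not the actual kernel code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct promise_name {
    const char *name;
    uint32_t bit;       /* bit values are invented for this sketch */
};

static const struct promise_name promise_table[] = {
    { "stdio", 1u << 0 },
    { "rpath", 1u << 1 },
    { "wpath", 1u << 2 },
    { "cpath", 1u << 3 },
    { "inet",  1u << 4 },
    { "exec",  1u << 5 },
};

/*
 * Parse a string such as "stdio rpath" into promise bits.
 * Returns 0 on success, or -1 if any word is not a known promise.
 */
static int parse_promises(const char *s, uint32_t *out)
{
    const size_t n = sizeof promise_table / sizeof promise_table[0];
    uint32_t bits = 0;

    assert(s != NULL && out != NULL);

    while (*s != '\0') {
        size_t len = strcspn(s, " ");   /* length of the next word */
        if (len > 0) {
            size_t i;
            for (i = 0; i < n; i++) {
                if (strlen(promise_table[i].name) == len &&
                    memcmp(promise_table[i].name, s, len) == 0) {
                    bits |= promise_table[i].bit;
                    break;
                }
            }
            if (i == n)
                return -1;              /* unknown word, e.g. a typo */
        }
        s += len;
        if (*s == ' ')
            s++;                        /* skip the separator */
    }
    *out = bits;
    return 0;
}
```

Note that a misspelled promise is only caught at run time here, which is the trade-off the type-checking argument is about.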
Nice find. That article is highly unconvincing though and mostly argues against straw men.
> Although using strings subverts C’s already weak type checking, that’s probably not a major concern. One can screw up bit masks by using || in place of |. Or, as above, one can incorrectly pack the magic array. It’s usually much easier to visually audit a string than the C code used to plaster a dozen option together.
It's pretty easy to design an interface that is way way less error-prone than strings (especially ones full of single-letter differences!) and the visual auditing argument falls apart as soon as you have to `snprintf()` some string together from parts.
This code is way more readable, way less error prone, more discoverable, faster and more easily extendable than strings:
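Something along these lines, as a hypothetical sketch only: the struct, its field names and the helper are invented for illustration and are not an actual OpenBSD or SerenityOS interface. The helper shows how a typed front end could still feed the existing string-based call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical typed promise set; field names are illustrative. */
struct promise_set {
    bool stdio;
    bool read_path;
    bool write_path;
    bool create_path;
    bool inet;
    bool exec;
};

/*
 * Render the set as a space-separated promise string ("stdio rpath ...").
 * Returns 0 on success, or -1 if `buf` is too small.
 */
static int promise_set_format(const struct promise_set *p, char *buf, size_t len)
{
    const struct { bool on; const char *name; } table[] = {
        { p->stdio,       "stdio" },
        { p->read_path,   "rpath" },
        { p->write_path,  "wpath" },
        { p->create_path, "cpath" },
        { p->inet,        "inet"  },
        { p->exec,        "exec"  },
    };
    size_t used = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (!table[i].on)
            continue;
        size_t n = strlen(table[i].name);
        size_t need = n + (used ? 1 : 0);   /* word plus separator */
        if (used + need + 1 > len)
            return -1;                      /* no room for word and NUL */
        if (used)
            buf[used++] = ' ';
        memcpy(buf + used, table[i].name, n);
        used += n;
        buf[used] = '\0';
    }
    return 0;
}
```

A caller would write `struct promise_set p = { .stdio = true, .read_path = true };`, so a misspelled field is a compile error rather than a run-time one.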
Kinda except not. pledge/unveil is about privilege dropping: in searching for ways to better secure the system, the OpenBSD developers came to the conclusion that the average program (especially things like network daemons and such) tends to have a complicated setup phase where it reads config files, opens sockets, queries the system state, etc… then a much simpler “steady” phase which needs much less access.
With application permissions or external constraints, this is not really helpful, because the application needs to do its setup.
However if the application can pledge not to do the setup things between the setup and the steady state, and it gets corrupted or owned during the steady state (e.g. because it’s a network daemon and there’s a bug), it becomes a lot harder to exploit since there should be very little the would-be exploiter can do or explore before the OS kills the program.
So this is not really about protecting the system against the application, it’s about the application participating in the system’s protection by dynamically reducing its own permissions while running.
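A minimal sketch of that setup-then-pledge shape. The `pledge("stdio inet", NULL)` call is the real OpenBSD interface from `<unistd.h>`; the daemon skeleton, the stub functions and the choice of promises are assumptions for illustration, and on other systems the pledge step is compiled out so the sketch still builds.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-ins for a real daemon's phases (hypothetical). */
static int demo_phase = 0;
static int demo_setup(void) { demo_phase = 1; return 0; }   /* read config, open sockets */
static int demo_serve(void) { return demo_phase == 1 ? 0 : 1; } /* network-facing loop */

/*
 * Drop to the steady-state promise set. On OpenBSD this calls pledge(2);
 * elsewhere it does nothing.
 */
static int enter_steady_state(void)
{
#ifdef __OpenBSD__
    /* From here on only basic I/O and internet sockets are allowed. */
    if (pledge("stdio inet", NULL) == -1) {
        perror("pledge");
        return -1;
    }
#endif
    return 0;
}

/* Setup first, then pledge, then serve. */
static int run_daemon(int (*setup)(void), int (*serve)(void))
{
    assert(setup != NULL && serve != NULL);

    if (setup() != 0)
        return 1;
    if (enter_steady_state() != 0)
        return 1;
    return serve();
}
```

The point of the ordering is that anything an attacker triggers inside `serve` runs with the reduced set, not with what `setup` needed.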
How does this work for child processes? What if a service regularly starts new processes to accomplish various tasks over its lifetime. Would each process also declare promises that have to be a subset of the parent? If the parent is compromised, could it then cause its children to ask for more permissions than it needs?
Exec can indeed cause a pledge escalation; the caller can cause the child to inherit its promises or to have predetermined limits, but in practice this isn't used since the child would have to do similar setup tasks. As such, "exec" is a commonly pledged-out permission!
AFAIK, with pledge() a process can tell the kernel “I’m only going to use X, Y, Z features” (e.g. read, write from file system)
After the process has told this to the kernel, it can only do these things for the rest of its lifetime. You can pledge() again later, but you can only restrict your pledge, never expand it.
This is a nice feature because it limits the number of processes that can potentially be security liabilities even if they have bugs.
unveil() is a similar feature but for file system paths.
It’s a feature of SerenityOS (inspired/borrowed from OpenBSD), and not a feature of C/C++.
Try reading the article, it’s pretty easy to follow :)
It's not the broadest permissions from the parent, but the promises at the time of the fork. For example, you can set up the parent in such a way that you fork off an unprivileged (or privileged) child early, one that has a different set of promises from the parent.
If you have exec permission (pledge "exec") you can exec another program and it starts with a clean slate. It's about dropping privileges, so it's assumed you know what you're doing, and in the best-case scenario the executed binary will pledge itself.
Pledge is not some external security feature but something that every program itself manages.
The syscalls pledge and unveil were created on OpenBSD to easily limit what a process can do. The pledge syscall specifies what syscalls that process can access while the unveil syscall specifies what directories it can access. After calling each in a certain way it is no longer possible to call any syscall not allowed by pledge or access any directory not allowed by unveil.
The idea is that every process should call each early when it starts running and, after that, if the process is ever compromised, it will not be able to do much harm since the files and syscalls it can interact with are limited.
For example, a browser should never access /etc/passwd or call the exec syscall. So, a browser, when run, should as early as possible call pledge and unveil to prevent itself from accessing /etc/passwd or calling exec if it ever becomes compromised.
FWICT, only sort of.
It's like giving up permissions that you might already have (presumably to reduce potential security problems).
And it's a bit more specific to the kernel.
Pledge: "I will at most use these kernel facilities" (don't let me do otherwise)
Unveil: "I will at most access these fs paths" (hide all other paths)
Most programs have a pretty good idea of what they’ll be doing in their lifetime. They’ll open some files, read some inputs, generate some outputs. Maybe they’ll connect to a server over the Internet to download something. Maybe they’ll write something to disk.
pledge() allows programs to declare up front what they’ll be doing. Functionality is divided into a reasonably small number of “promises” that can be combined. Each promise is basically a subset of the kernel’s syscalls.
Once you’ve pledged a set of promises, you can’t add more promises, only remove ones you’ve already made.
If a program then attempts to do something that it said it wouldn’t be doing, the kernel immediately terminates the program.
In Unix-like userspaces, applications have access to a lot of stuff by default (such as the complete filesystem accessible to the executing user, and a lot of system calls). For security purposes tighter restrictions would be better. Pledge() and Unveil() allow application authors to opt in at run time to restrict what they can do in the future.
The way Andreas Kling builds up code in his livestream makes it seem so easy to do OS development. After watching one of his videos I feel I could easily build up any system level program or feature, but I know from prior experience that it is not so easy for me.
No, simply because using unveil is not mandatory. It's opt-in. The filesystem becomes "veiled" to the application only after the first call to unveil(). Subsequent calls "unveil" files/directories until the final call, which locks it and prevents any future unveils.
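Roughly what that sequence looks like with the OpenBSD API, as a sketch. `unveil(path, permissions)` and the locking call `unveil(NULL, NULL)` are the real interface from `<unistd.h>`; the wrapper function and the read-only choice are assumptions for illustration, and on other systems the body is compiled out.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Expose only `dir` (read-only) to this process, then lock the unveil
 * list so nothing more can be added. A no-op outside OpenBSD.
 */
static int restrict_filesystem(const char *dir)
{
    assert(dir != NULL);

#ifdef __OpenBSD__
    /* First call: everything except `dir` becomes invisible. */
    if (unveil(dir, "r") == -1) {
        perror("unveil");
        return -1;
    }
    /* Final call: no further unveil() calls are permitted. */
    if (unveil(NULL, NULL) == -1) {
        perror("unveil lock");
        return -1;
    }
#else
    (void)dir;
#endif
    return 0;
}
```

A program that never calls this keeps its normal view of the filesystem, which is the opt-in part.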
This is awesome, now if someone asks me what a promise is, I can say "it's an argument to pledge about which system resources I am declaring access to" and watch their faces glaze over. "No I'm talking about the async promise..."
SerenityOS has had pledge and unveil for a while now, to the point where many system tools implement pledge by default. The video covering a reimplementation of justine.lol's pledge.com was uploaded yesterday (https://youtu.be/T6YkQF6ohoA) leveraging these APIs to replicate the original tool's behaviour (though it needs some pledges just to spawn the child process which I think are a bit clunky, but surely can be worked around).
Pledge() and Unveil() in SerenityOS - https://news.ycombinator.com/item?id=22116914 - Jan 2020 (28 comments)
Related from yesterday:
Show HN: Porting OpenBSD Pledge() to Linux - https://news.ycombinator.com/item?id=32096801 - July 2022 (114 comments)