Hacker News new | past | comments | ask | show | jobs | submit login
splice() and the ghost of set_fs() (lwn.net)
210 points by chmaynard on May 26, 2022 | hide | past | favorite | 81 comments



Our software[1] got broken by a kernel change a few years ago. The experience was quite interesting.

We were making use of `/proc/PID/pagemap`, which is a kernel-generated file that previously would show the physical addresses of all the pages in a process's address space. Unfortunately, with the Rowhammer exploit, exposing this information - even for one's own processes - to unprivileged users went from being harmless to a security risk.

The first we saw of the change was when newer kernels started reporting zeros for all physical addresses, unless we ran as root. We raised this the LKML, explaining that we'd been relying on this feature to implement a somewhat esoteric optimisation.

Linus replied very helpfully - the security fix trumped userspace compatibility but he could see a secure way of getting us the information we really needed, given the technique we'd described. He invited us to submit a kernel patch and gave a few hints about potential gotchas.

I did work one up but the kernel community actually jumped on it as an opportunity to do more cleanup, so I ended up just signing off on the patch they produced. It was all a remarkably smooth and efficient process.

[1] Time travel debugging - http://undo.io


Heh, as someone who maintains another exotic debugger[1] my experience has been that the kernel development process is kind of a pain. We've had good experiences with people fixing regressions once they're discovered but getting new features into the kernel has been difficult. I think it took 10 revisions for me to get cpuid faulting into the kernel, including multiple review cycles where I was first told "change X to Y" and then in a subsequent cycle told "change Y to X".

[1] https://rr-project.org/


Yeah, for similar reasons, the /proc/PID/wchan now shows just "0" (for other users' processes) on newer kernels, unless you run as root. Same with /proc/PID/stack, but it's implemented in a different way, I can open() that file successfully, but the read() syscall on the opened file descriptor returns EACCESS error...

  $ ls -l /proc/$$/stack
  -r-------- 1 tanel tanel 0 May 26 21:52 /proc/967141/stack
  $ 
  $ cat /proc/$$/stack
  cat: /proc/967141/stack: Permission denied
  $ 
  $ sudo cat /proc/$$/stack
  [<0>] do_wait+0x1c3/0x230
  [<0>] kernel_wait4+0xaf/0x150
  [<0>] __do_sys_wait4+0x85/0x90
  [<0>] __x64_sys_wait4+0x1e/0x20
  [<0>] do_syscall_64+0x49/0xc0
  [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Edit: Adding one more comment - my impression has been that the "no userland-visible changes" promise applies to system calls - how procfs presents data as human-readable text in the /proc files has changed every now and then before (I recall the sar command showing wrong numbers after a kernel update, for example).


We've found userspace ABI does change in some surprising ways occasionally but mostly it doesn't matter to anybody.

e.g. we've seen the format of signal stack frames change in the past but nobody relies on that layout so it's OK.

/proc is another one along those lines - technically somebody could probably complain if it changes but if nobody shouts then it will just drift over time.


> the security fix trumped userspace compatibility

TIL

All stories about kernels describe it as an absolute that userspace is never broken, but that actually makes sense


Yeah, there are rare exceptions to The Rule.


In case anyone else is curious here is the LKML thread he mentioned https://lkml.org/lkml/2015/4/24/379 and this is the followup thread with the patches https://lkml.org/lkml/2015/6/9/804


Is there a way to try out liverecorder without having to speak with a salesperson?


Apologies for the slow response - but, yes (mostly).

If you go here: https://undo.io/udb-form/

Then you do have to fill out a contact details form but you will be able to download a trial of UDB, our interactive debugger. That gets you the Time Travel functionality.

(if you're a VSCode user then you can download the extension https://marketplace.visualstudio.com/items?itemName=Undo.udb and start from there instead)

If you want the full LiveRecorder experience - additional tool, library, etc then you do have to request a demo. https://undo.io/about-us/contact/request-demo/ - Mention that you had exchanged messages with me (Mark Williamson - Architect @ Undo) and I can help from my side if you have any technical issues.

(edit: s/issues/technical issues/)


Even if you don't want to directly support splice() for every possible filesystem or file descriptor, I don't understand why it couldn't be "emulated" in those cases, by having the kernel just pretend it was given a series of read()/write() calls. That might not be as efficient, but surely it would be better than just breaking completely.


The point of splice(2) is to be a fast-path; it's not in POSIX, so software that wants to guarantee portability has to write the slow path too, and only use splice(2) if it's present.

Such a blind/naive compatibility shim, would likely be much slower than whatever hand-written slow path the developer has in place. If the whole point of a call is to be a fast/efficient version of something else, then the call is breaking its semantics if it isn't more fast/efficient than the alternative.

Because of this, it's better to just make autoconf et al detect splice(2) as absent for the given use-case (and fall back to the hand-rolled portable slow path), rather than relying on a naive kernel shim.


You cannot in general use autoconf to detect runtime behavior.


Many of autoconf's more fraught checks work by attempting to compile and run little C programs to see what happens. These checks enable not just static analysis (seeing what the compiler does), but also "edge-case analysis" ala a fuzzer — e.g. seeing if the runtime environment of the compilation environment is one where passing certain printf(3) format-specifiers makes it choke, etc.

So if splice(2) no longer works when the source is e.g. /dev/random, then autoconf could attempt to compile+run a program that splice(2)s from /dev/random to a known-working sink, and see whether that program SIGSEGVs or not when run; and use that to decide whether to allow an --enable-splice configure flag to be passed, vs. bailing out if such a flag is passed.

Of course, there's the implicit assumption that if you're passing such a flag, the build environment is going to be the deploy environment; or at least, the build environment's feature-set will be a subset of the deploy environment's, such that the build environment's runtime features can be used as a conservative underestimate of the deploy environment's runtime features.

This "build is a subset of deploy" is usually a sensible assumption. Build environments are controllable, while deploy environments are arbitrary; so anyone who wants to make a build that works on many different deploy environments, can just set up their build environment to have the "lowest common denominator" of the features of the systems they want to target.

(Compare and contrast: microarchitectural optimizations. Same story.)


> then autoconf could attempt to compile+run a program that splice(2)s from /dev/random to a known-working sink

And then the next day you, or your user, update your kernel, the result is out of date and your program crashes. That's just not how autoconf (or cmake or meson for that matter) are used.

> running autoconf checks inside a target-machine emulator

Run tests are quite rare. Older autoconf used to use printf to detect the size or alignment of a type, but for 15-20 years it has instead been doing binary search (basically "guess the number") so that only compilation tests are needed instead.

(I am a former autoconf developer and GCC build system maintainer).


I don't think they're suggesting runtime use, but rather building software after the splice(2) change. That's how I read it at least.


Aside from the “where you build isn’t where you run problem”, there’s the “this syscall doesn’t work when running on filesystem X or accessing /dev/urandom”.

Not only is the autoconf solution not fixing the problem, it’s placing a massive undue burden on developers. Linus has been inconsistent here. Telemetry would have been helpful here in aiding this work (ie support it with the slow path but report the event so that distro maintainers could provide feedback on broken paths). Once you think you’ve eliminated the long tail of issues, then remove and see if anything remains broken that telemetry didn’t catch.


It would break if compiled on old kernel and run on new kernel. When CI is containerized the kernel might not even have anything to do with the distro that you're building on.


Doesn't that happen often anyway?


Nope. You can run programs to detect some behavior, for example printing sizeof(unsigned long) or checking how some rounding is performed. Those tests of course may affect the program behavior at runtime, but what they test is still the compilation environment.


That however is not how the linux kernel guarantee of the stability of userspace APIs works.


It used to. But this emulation worked by doing set_fs() so that the internal (outdated) read/write implementation could follow a pointer into kernel memory, and that facility is gone now.


Ah, I think I get it. So the problem isn't just that these drivers don't specifically support splice() -- it's that they also only support reading/writing with userspace buffers, and set_fs() was just a hack around that limitation that is no longer supported. That's kind of unfortunate.


Correct. The modern interface supports reading and writing from a rather generic concept of a buffer, and drivers that support that are usable with splice().


Perhaps because some of the semantics of a spliced pipe aren't exactly the same as read() followed by write()?


I think that leaving splice() as it was, is worse than breaking the "NEVER BREAK USERSPACE" rule, in this specific instance. It should not become a regular occurence, but I don't see a better way. They tried to stick to it, things happened, but I am optimistic that they will fix it and try better the next time.


If rule #1 is don't break userspace, we might need a rule #0, which would be don't support insecure behaviour. In the end, people might need to balance competing concerns...


The only time I’ve had a kernel update break things was a weird Proxmox bug where if you booted with the (at the time) latest kernel, it would fail to start the first VM, and nothing you did from the UI or the command line could touch it without timing out. Rebooting to the previous kernel release and it just started working again with no other changes.

That definitely broke my assumption that kernel updates were well vetted for regressions.


"Don't break userspace" is more like a _goal_ than a law.

The kernel is an extremely complex piece of code, the developers can't be asked to test every kernel release against every piece of userspace software on every hardware configuration. That's one of the reasons that new code and significant changes require a bunch of mailing list discussion, reviews, and signing-off.

Also, Proxmox ships with its own supported kernel, it shouldn't be a big surprise that you ran into issues while straying off the beaten path.


> The kernel is an extremely complex piece of code,

With no regression tests, nearly no unit tests, and lots of bits of supported hardware that nobody on the Dev team uses day to day...

It's a miracle that quality is as high as it is.


There are plenty of tests, just not in Linus's tree. KVM has not one but three suites of unit and integration tests, for example, but only 60ish tests are in the kernel's tools/testing/selftests directory.


I suspect kernel updates aren't vetted that well in general; the "don't break user space" is a ruling from Linus about what not to do - usually with regards to the user space API (don't remove or change things that already exist).

And this article is an example of where the decision was made to break user space.


Is it even a hard break? It seems to be more of a short lived regression while patches for filesystem drivers are still coming in to restore support and I don't think Linux ever guaranteed a stable driver API.


> The only time I’ve had a kernel update break things

> That definitely broke my assumption that kernel updates were well vetted for regressions.

Shouldn't that be evidence for the contrary? One break in a decade or two of use isn't so bad. It's actually pretty good.


> That definitely broke my assumption that kernel updates were well vetted for regressions.

I would argue that it's also a failure by the distro; unless there was something special about your exact setup that would have made the bug not show up in testing, I would argue that proxmox should have been testing updates to catch that kind of problem before users noticed.


Proxmox VE provides Enterprise and No-Subscription repository. Possibly GP uses latter repo that's for testing (and free!).

Maybe this? https://forum.proxmox.com/threads/latest-update-5-13-19-4-pv...


Ah, so free users hitting bugs is the test suite:)


> kernel updates were well vetted for regressions.

That seems like the kind of things that is easier said than done.


I've just had that with the most recent Proxmox update and also had no option but to regress


I had the issue as well but I managed to fix it by I changing my boot parameters (exact change differs per bootloader and hardware). Now I happily boot with .15 series (up from .13).

I expect some more users of Proxmox given Broadcom taking over VMware (e.g. I'd like to merge away from ESXi to Proxmox as I don't trust Broadcom). Hopefully it does the product and community well.


Not sure if here is the place for this but here goes. Personally I've been dealing with a regression of sorts on my laptop: it no longer suspends without being quickly woken up again on the main kernel release (works fine on LTS). Does anyone know if and where I could file a bug report?


> Does anyone know if and where I could file a bug report?

The linux kernel does take bug reports: https://docs.kernel.org/admin-guide/reporting-issues.html

However, that bug probably isn't specific enough as you've described it, unless you can find the commit causing it (such as via a git bisect https://docs.kernel.org/admin-guide/bug-bisect.html), or come up with a clearer repro.

Alternatively, if you're seeing the issue on a distro-maintained kernel (such as on fedora/ubuntu/debian with their kernel package), reporting the issue to the distro maintainers may be more appropriate.


i've had an ubuntu employee fix a kernel bug that rendered my machine unbootable on newer kernels. it was a newer thinkpad so there were quite a few affected users, it'd probably be lesser priority with more obscure hardware. Still, just to support your statement with some concrete experience, reporting to your distro can definitely help


I've seen similar things. Messing with /proc/acpi/wakeup (https://unix.stackexchange.com/questions/698185/laptop-wakes...) made the problem go away for me. I've been meaning to try to bisect it down and find the real problem, but haven't had the time to do that yet...


I have some issues as well; I heard that [1] was the cause for some breakage, which should be fixed in 5.18, but I didn't verify as I haven't had much reason to use suspend lately.

[1]: https://github.com/torvalds/linux/commit/dfbba2518aac4204203...


> my laptop: it no longer suspends without being quickly woken up again on the main kernel release

Long story short: there are a lot of things in your laptop generating interrupts.

Some of them you want to ignore, because it would cause the behavior you describe (preventing sleep)

Some of them you really want to listen to closely, because if it's an interrupt generated by say brushing on the power button, not listening to it means not waking up from sleep (traditional example: GPE96 on dells, cf for example https://bugzilla.kernel.org/show_bug.cgi?id=102281) until a longer press on the button generates an ACPI event or a powerup.

You can configure or finetune that with /proc/acpi/wakeup which hopefully will give you more context as to what other people have explained here.


I noticed it recently too, but disconnecting my Bluetooth headset first allowed it to suspend properly (it seems like the wifi chip kept it awake before making like a yo-yo back and forth).


Me 3 and it's maddening. I've posted this [1] on the Arch Linux forums but haven't found a fix. I've seen it mentioned around the web that it's a known bug, but I have yet to find a bug report

1: https://bbs.archlinux.org/viewtopic.php?id=276064


It's an open question whether Bluetooth should prevent sleep, or let sleep go through.

On a well done "modern suspend/suspend to idle", I use disconnected modern sleep + Windows Media Player on Windows 11 to listen to songs on my bluetooth headphones for a cost of about 1% of the battery per hour (as measured and plotted with powercfg) which can come down to about half of it, 0.5%/h when not using Bluetooth.

I wouldn't want Bluetooth to prevent sleep (a 1% drop per hour is better than not sleeping!)

I also wouldn't want sleep to prevent me from using my Bluetooth headset (the difference between 0.5% and 1% is significant, but it doesn't matter much in practice if my computer can be usable in the morning)

This is one of the rare examples of "modern suspend" delivering on its promises, and blowing the good old ACPI S3 away: I never saw a drop of battery <5% on S3 suspend-to-ram unless it also involved S4 in a hybrid sleep of "ACPI S3 suspend-to-ram then S4 suspend-to-disk after a while or when I run out of power whichever comes first"!


Be that as it may, acting like a yo-yo is certainly the wrong direction on a "desired behavior" scale. Especially when the headset itself is announcing the up and down events in synthesized speech interrupting what else is going on.


I have noticed this as well


with_resource(mutex file handle etc etc *o, func callback) is a safer pattern, for this reason. Thank you, Python

(with_resource is a generic example. Search for such capabilities in your library.)

Computers exist to automate. Let the computer remember to close resources upon scope end.

Go's defer is also helpful in this regard.


That was my first thought - would they leave the set_fs solution in place if C had a native solution for resource blocks / RAII+drop.


It's not that hard to do in C. Make a macro that takes a function, it runs:

  set_fs(KERN)
  *function_pointer()
  set_fs(USER)
There did it before, so they knew about that option, and if they didn't do it, they must have had a reason.


It's true, unfortunately the lack of lambdas in C makes this pattern cumbersome. You could achieve it with a macro instead, I suppose.


There is a GCC extension (__attribute__ cleanup) [1] gcc which they probably could use if they wanted, given that the kernel relies primarily on gcc (and clang supports gcc extensions). Probably not allowed through the kernel coding style?

[1] https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attribute...


A simple policy that both set_fs() calls need to happen within the same function body with corresponding CI test based on AST/DWARF inspection would have also prevented it. Do you really want to rely on stack unwinding/destructors for security sensitive code when stack is usually the first thing that gets controlled by the attacker? Exception handling (SEH) on Windows is an exploitation vector of it's own.


I'm talking about the general idea not specific implementation. Having something happen at function/block exit doesn't mean a runtime configurable behaviour. If you don't have exceptions, it's pretty easy to statically compile that behaviour and guarantee it rather than rely on checks.


You still need to implement a CI rule so that I don't just call `set_fs()` without using `preferred_syntax(set_fs)`, don't you?


We're talking a theoretical language here, so... maybe? It could be also that set_fs is usable only in the preferred_syntax mode.


Except that things are weird in a kernel, one cannot guarantee that language behavior is enforceable. Even with a RAII/drop style solution there is a possibility that kernel weirdness will prevent it from running correctly.

The issue with forgetting function calls like this are the edge cases, and the kernel has a lot more of those.


The problem isn't that it relies on developer foresight to make the second call to set_fs. The problem (if you look through other LWN articles on the subject) is that the set_fs calls can be prone to not getting restored if the kernel code is interrupted.


Don't like the lecturing tone of this article.

This seems entirely successful to me and retired crappy and insecure behavior.

The idea that you can never break anything means that some historical mistakes can never get fixed.

The someone complains about how the language / operating system is a bolted on mess of nonsense and people go chasing after the next hot thing which has the benefit of just being newer and being able to get away with breaking changes because people on the bleeding edge put up with it.

The linux kernel is probably striking the right balance here, and I doubt that this one breakage is a sign of the decay and downfall of the Empire.


This could have interesting implications for Go programs. Go optimizes io.Copy down to splice if dst and src both support the necessary bits. Unlike C code where you could easily grep this out, this happens transparently, and could even by influenced by user input in various cases. The tail could be quite long for bugs that show up from dropping support. I'm sure some dynlangs may have similar optimizations.

edit: checked the current implementation in go, and it's only doing that for unix and tcp sockets, so it'll be fine.


Not too long ago I was constantly facing this kernel bug [0] that would kill audio whenever system memory would be under moderately high load. It's fixed now but it's kind of insane that the kernel would mess around with components as critical as sound for desktop users. [0]https://gitlab.freedesktop.org/pipewire/pipewire/-/issues/22...


The "don't break user space" guideline should just be called "break user space, and if someone complains, fix it afterwards."


Systemd had a regression recently that has broken booting a secondary encrypted LUKS device.


What does “booting a secondary […] device” mean? Either you boot from a device, and that makes it the primary device, or you boot off another device, in which case this is now the primary device.

Or did you mean unlocking an encrypted device after unlocking and booting on the primary device? What does systemd have to do with that? Unless you want to type in the password manually every boot, you store the password (for the secondary device) on the primary device, add the password file to /etc/crypttab and be done with it, right?


> But it is true that this type of episode makes the kernel's "no regressions" rule look a bit more like just a guideline. It does not take too many of those to create breakage to the project's reputation that is hard to splice back together.


Shouldn't they have left it for 6.0?


Kernel doesn't adhere to a SemVer policy


Maybe they should. Maybe, they, should.


Naw. Kernel version number increases. Can't be more than 255 in any field, except the EXTRAVERSION which is a string. Kernel version numbering is exposed in the userspace ABI, so changing it to have meaning would be a breaking change for no substantial benefit. This[1] LWN article is a good overview of the history. Effectively the major gets incremented when the minor gets to 20 and Linus runs out of fingers & toes to keep track.

[1] https://lwn.net/Articles/871989/


How many digits does a penguin have?


I believe it's 14. (Birds feet have 4 toes and their wings have 3 fingers)


> Can't be more than 255 in any field, except the EXTRAVERSION which is a string. Kernel version numbering is exposed in the userspace ABI, so changing it to have meaning would be a breaking change.

Sounds like they should leave this change for 6.0 then.


Why? The kernel version number has no meaning, other than to provide an ordering of releases.


In 6.0 it will.


>Can't be more than 255 in any field

They can just change the max. It's not that big of a deal.


No, it's one byte per field (except EXTRAVERSION, which is a string). That's part of the userspace ABI, so it can't change without extremely good reason (critical security issue that can't be mitigated in another way).


As an aside, if you enjoy this article please consider subscribing to LWN at [0]. Normally articles are only available for free a week after they're posted; one of the perks of subscribing is being able to send a link like the one posted here so non-subscribers can view it before the week has passed. LWN's main source of revenue as I understand it is subscriptions, and the work they do is definitely worth it

[0] https://lwn.net/subscribe/Info


yep, LWN is a great resource, highly recommend for any developer (even/especially those who aren't linux people, they have lots of articles regarding other programming language development and the like).

I used to complain that they didn't have recurring subscriptions, but I believe they have fixed that! Just sign up yearly, at the worst you are a patron to good journalism that is not really happening elsewhere.

Just reading an article or two a week will give you loads of insight into what is going on in the tools you have everyday




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: