
markjdb · 2026-01-22T16:27:24

At least FreeBSD's syscall ABI is guaranteed to be stable: one can run ancient binaries on a modern kernel. I believe the same is not true of OpenBSD, and maybe not of NetBSD either.

markjdb · 2025-11-25T02:16:53

The pf maintainer in FreeBSD has been doing a ton of work to bring more recent improvements over from OpenBSD, trying to bring them in sync as much as possible without breaking compatibility: https://cgit.freebsd.org/src/log/sys/netpfil/pf

The state of affairs you described is much less the case now than in the past.

markjdb · on Dec 29, 2024

I don't think the article does a good job of arguing its premise, which I think is that kqueue is a less general interface than epoll.

When adding a new descriptor type, one can define semantics for existing filters (e.g., EVFILT_READ) as one pleases.

To give an example, FreeBSD has EVFILT_PROCDESC to watch for events on process descriptors, which are basically analogous to pidfds. Right now, using that filter, kevent() can tell you that the process referenced by a procdesc has exited. That could have been defined using the EVFILT_READ filter instead of, or in addition to, adding EVFILT_PROCDESC. There was no specific need to introduce EVFILT_PROCDESC, except that the implementor presumably wanted to leave space to add additional event types, and it seemed cleaner to introduce a new EVFILT_PROCDESC filter. Process descriptors don't implement EVFILT_READ today, but there's no reason they can't.

So if one wants to define a new event type using kevent(), one has the option of adding a new definition (new filter, new note type for an existing filter, etc.), or adding a new type of file descriptor which implements EVFILT_READ and other "standard" filters. kqueue doesn't really constrain you either way.

In FreeBSD, most of the filter types correspond to non-fd-based events. But nothing stops one from adding new fd types for similar purposes. For instance, we have both EVFILT_TIMER (a non-fd event filter) and timerfd (which implements EVFILT_READ and in particular didn't need a new filter). Both are roughly equivalent; the latter behaves more like a regular file descriptor from kqueue's perspective, which might be better, but it'd be nice to see an example illustrating how.

One could argue that the simultaneous existence of timerfds and EVFILT_TIMER is technical debt, but that's not really kqueue's fault. EVFILT_TIMER has been around forever, and timerfd was added to improve Linux compatibility.

So, I think the article is misguided. In particular, the claim that "any time you want kqueue to do something new, you have to add a new type of event filter" is just wrong. I'm not arguing that there isn't technical debt here, but it's not really because of kqueue's design.
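To make the "standard filter" point concrete, here is a minimal userspace sketch, assuming Python's select module (my choice of language, nothing from the article). The helper name wait_readable is hypothetical. On kqueue systems it registers an ordinary fd with EVFILT_READ; on Linux it does the equivalent with EPOLLIN, purely for comparison.

```python
import os
import select


def wait_readable(fd, timeout=1.0):
    """Return True if `fd` becomes readable within `timeout` seconds."""
    if hasattr(select, "kqueue"):
        # BSD/macOS: use the standard EVFILT_READ filter; no new filter needed.
        kq = select.kqueue()
        try:
            change = select.kevent(fd, filter=select.KQ_FILTER_READ,
                                   flags=select.KQ_EV_ADD)
            return len(kq.control([change], 1, timeout)) > 0
        finally:
            kq.close()
    if hasattr(select, "epoll"):
        # Linux: the same question, asked through epoll.
        ep = select.epoll()
        try:
            ep.register(fd, select.EPOLLIN)
            return len(ep.poll(timeout)) > 0
        finally:
            ep.close()
    # Fallback for platforms with neither interface.
    ready, _, _ = select.select([fd], [], [], timeout)
    return bool(ready)


if __name__ == "__main__":
    r, w = os.pipe()
    os.write(w, b"x")
    print(wait_readable(r))
    os.close(r)
    os.close(w)
```

Any fd type whose kernel side implements the read filter can be passed to the kqueue branch unchanged, which is the sense in which a new fd type doesn't force a new filter.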

nobodyandproud · on Dec 29, 2024

Thanks.

Then it seems like there are more similarities than differences here: Both solve the same problem of “select” by being the central (kernel-level) events queue; though with different APIs.

The other bit that caught my eye was the author saying epoll can do nearly everything kqueue can do.

What is that slight bit that epoll can’t do?

markjdb · on Dec 29, 2024

I'm not sure. Maybe it's "wait for events that aren't tied to an fd."

For instance, FreeBSD (and I think other BSDs) also have EVFILT_PROC, which lets you monitor a PID (not an fd) for events. One such event is NOTE_FORK, i.e., the monitored process just forked. Can you wait for such events with epoll? I'm not sure.

More generally, suppose you wanted to automatically start watching all descendants of the process for events as well. If I was required to use a separate fd to monitor that child process, then upon getting the fork event I'd have to somehow obtain an fd for that child process and then tell epoll about it, and in that window I may have missed the child process forking off a grandchild.

I'm not sure how to solve this kind of problem in the epoll world. I guess you could introduce a new fd type which represents a process and all of its descendants, and define some semantics for how epoll reports events on that fd type. In FreeBSD we can just have a dedicated EVFILT_PROC filter, no need for a new fd type. I'm not really sure whether that's better or worse.

oasisaimlessly · on Dec 29, 2024

AFAIK, just aio[1] (async file IO).

[1]: https://man.freebsd.org/cgi/man.cgi?query=aio&sektion=4&apro...

markjdb · on Dec 29, 2024

The same is true of kqueue/kevent though... the driver just needs to decide which filters it wants to implement. There's no need to extend kqueue when adding some custom driver or fd type. One just needs to define some semantics for the existing filters.

ajross · on Dec 29, 2024

> the driver just needs to decide

That's pretty much the definition of technical debt though. "This interface works fine for you if you take steps to handle it specially according to its requirements". It makes kqueue into a point of friction for everything in the system wanting to provide a file descriptor-like API.

markjdb · on Dec 30, 2024

Well, no, it's "this interface works fine for you if you implement it."

The kernel doesn't magically know whether your device file has data available to read, your device file has to define what that means. That's all I'm referring to. Hooking that up to kqueue usually involves writing 5-10 lines of code.

netbsdusers · on Dec 29, 2024

No it isn't. Letting files be poll/select/epoll'd isn't free either. They don't get support for that by magic. A poll operation has to be coded, and this is just as much a "point of friction" as supporting kqueue. (It bears mentioning as well that at least DragonFly BSD and OpenBSD have reimplemented poll()/select() to use kqueue's knotes mechanism, so now you only have to write a VOP_KQFILTER implementation and not a VOP_POLL too.)

ajross · on Dec 30, 2024

> Letting files be poll/select/epoll'd isn't free either.

Yes, but those slashes are showing the lie in the statement. Letting files be polled/selected isn't "free", but it's standard. The poll() method has been in struct file_operations for literally decades[1]. Adding "epoll support" requires no meaningful changes to the API, for any device that ever supported select().

That kind of evolutionary flexibility (the opposite of "technical debt") is generally regarded as good design. And it's something that epoll had designed in and something that kqueue lacks, having decided to go its own way. And it's not unreasonable to call that out, IMHO.

[1] It's present in commit 1da177e4c3f4 ("Linux-2.6.12-rc2"), which is the very first git commit. I know people maintain archives of older trees, but I'm too lazy to dig. Suffice it to say that epoll relies on an interface that is likely older than many of the driver developers using it.

markjdb · on Dec 30, 2024

The article clearly isn't talking about technical debt within the kernel implementations of epoll and kqueue, and if one wanted, it'd be easy to define fallback EVFILT_READ/WRITE filters using a device's poll implementation.

I don't really understand what argument you're making. Is io_uring also a bad design because it requires new file_operations?

ajross · on Dec 30, 2024

> if one wanted, it'd be easy to

Which, again, is a statement that gets to the root of the idea of "technical debt". You can excuse almost anything like that. It still doesn't make it better than a design that works by default. I remain shocked that this seems to be controversial.

FWIW: io_uring has been very loudly criticized for being hard to implement, maintain and use, via some of this same logic, yes. This isn't a senseless platform flame. Linux does bad stuff too. There are good designs and bad designs everywhere, and io_uring is probably not one of the good ones (though to be fair it does have some extremely attractive performance characteristics, so I guess one might be tempted to forgive a few warts in the interface layers).

markjdb · on Dec 30, 2024

A design that works by default isn't automatically better either though. You have to look at the details.

> I guess one might be tempted to forgive a few warts in the interface layers

... well, yeah, that's exactly my sentiment about kqueue here. What you're talking about is basically a small wart that no one's bothered to address because it's inconsequential.

markjdb · on Oct 1, 2024

For what it's worth, the default root shell is now /bin/sh instead of csh. I think that's true as of 14.0. /bin/sh is also a better interactive shell than it used to be, though yeah, I don't use it to do anything other than install my preferred shell.

foldr · on Oct 1, 2024

Right, but isn't the default non-root shell still tcsh? Zsh and ksh are both under BSD(-compatible) licenses, and it would be a huge usability upgrade if they could just set one of those as the default shell for everyone.

markjdb · on Sept 19, 2023

CHERI does more than help eliminate security vulnerabilities. Consider that today we rely on the MMU to provide memory isolation between Unix processes; CHERI enables isolation without switching page tables, at a smaller hardware cost (though it's not like you can drop unmodified software into such an architecture). So I don't think it's correct to consider this yet another layer of complexity. If anything it has the potential to lead to simpler system designs.

gpderetta · on Sept 19, 2023

It will be nice to see CHERI applied to single address space operating systems!

markjdb · on April 6, 2022

CHERI does permit tricks like storing flags in the low bits of a pointer, at least to some extent. Quite a lot of low level C code (including some in the CheriBSD kernel) needs that to work.

markjdb · on Feb 23, 2022

I use "ktrace -t f" once in a while for debugging and it's really handy. Output looks like:

  78436 cat PFLT 0x6c71f99cda8 0x2<VM_PROT_WRITE>
  78436 cat PRET KERN_SUCCESS
  78436 cat PFLT 0x3c6efd36c280 0x2<VM_PROT_WRITE>
  78436 cat PRET KERN_SUCCESS
  78436 cat PFLT 0x3c6efd36e158 0x2<VM_PROT_WRITE>
  78436 cat PRET KERN_SUCCESS
  ...

Obviously not nearly as flexible as ebpf though. For instance it'll log all page faults happening in the context of the process, and so includes page faults that happen in the kernel due to copyin()/copyout() etc. Sometimes it's helpful and other times confusing.

markjdb · on Dec 10, 2021

I'd be amazed if it isn't a configuration error of some kind.

vermaden · on Dec 10, 2021

Even if it is, the defaults should be as bulletproof as possible ...

grahamjperrin · on Dec 12, 2021

Does the source help?

https://openbenchmarking.org/innhold/bc38edf6b112784a5e15803...

grahamjperrin · on Dec 12, 2021

FYI <https://old.reddit.com/r/BSD/comments/rdkw0j/-/ho8g0t3/?cont...> tl;dr

* the article showed FreeBSD getting 20.3 for compression level 8

* with vastly inferior hardware I got around thirty-six percent better (27.7)

markjdb · on Sept 19, 2021

How much data ends up being served from RAM? I had the impression that it was negligible and that the page cache was mostly used for file metadata and infrequently accessed data.

drewg123 · on Sept 19, 2021

It depends. Normally about 10-ish percent. I've seen well over that in the past for super popular titles on their release date.