
A new kernel polling interface - doener
https://lwn.net/Articles/743714/
======
FooBarWidget
Does anybody know why they won't just copy kqueue? kqueue seems to be a very
good interface to me. It can batch multiple events and operations in a single
syscall. It supports many types of events, not just file descriptor readiness
changes. Lots of people are already familiar with it and there is already code
out there to take advantage of it.
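
To make the batching point concrete, here is a minimal sketch using Python's `select.kqueue` wrapper. It assumes a BSD or macOS host; the function name `kqueue_batch_demo` and the socketpair setup are made up for illustration, and on systems without kqueue (such as Linux) the function simply returns None. A single `control()` call both submits two registrations and collects the pending events.

```python
import select
import socket


def kqueue_batch_demo():
    """Register two sockets and collect their events in one kevent() call.

    Returns (ready_fds, expected_fds), or None where kqueue is unavailable
    (for example on Linux). Hypothetical helper, for illustration only.
    """
    if not hasattr(select, "kqueue"):
        return None

    a_recv, a_send = socket.socketpair()
    b_recv, b_send = socket.socketpair()
    kq = select.kqueue()
    try:
        # Make both receiving ends readable before registering them.
        a_send.send(b"x")
        b_send.send(b"y")

        # The change list carries two registrations ...
        changes = [
            select.kevent(a_recv.fileno(), select.KQ_FILTER_READ, select.KQ_EV_ADD),
            select.kevent(b_recv.fileno(), select.KQ_FILTER_READ, select.KQ_EV_ADD),
        ]
        # ... and the same call also returns up to 8 pending events.
        events = kq.control(changes, 8, 1.0)
        ready = {ev.ident for ev in events}
        expected = {a_recv.fileno(), b_recv.fileno()}
        return ready, expected
    finally:
        kq.close()
        for s in (a_recv, a_send, b_recv, b_send):
            s.close()
```

With epoll, the same setup would take one `epoll_ctl()` call per descriptor before the wait, which is the difference the batching argument is about.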

~~~
rhinoceraptor
Linux is the shining example of NIH syndrome.

Everyone else (IOCP, kqueue, etc) solved this problem. And then, Linux created
the famously broken epoll().

~~~
_wmd
I downvoted this because you provided no evidence that either IOCP or kqueue
was the one true solution, no evidence that the epoll designer was even aware
of the existence of kqueue, and, if they were, no evidence that it was
relevant to their work at all.

There is no reference to the context behind rejecting a kqueue-like API
(opinion was that it looks a lot like ioctl - very true)

There is no rationale for why IOCP or kqueue is obviously a better design for
Linux compared to epoll etc etc.

But just chalk it up to NIH, because that is the easiest and cheapest
explanation for just about anything.

FWIW in the context of network APIs before kqueue, there was the then-standard
STREAMS, and the decision-making that led to Linux having epoll is the same
that led to it avoiding the tragedy that was STREAMS. But if Linux had STREAMS
today and not epoll or kqueue, the people who'd cry NIH and
not-following-standards today would instead be crying about how much STREAMS
sucks, and why Linux doesn't do its own thing.

~~~
deathanatos
> _There is no rationale for why IOCP or kqueue is obviously a better design
> for Linux compared to epoll etc etc._

I can, I think, partially address this, though unfortunately in the negative.

My understanding of IOCP is that the model requires the following: if you want
to, say, receive data on a socket, you supply the IOCP w/ a buffer/length.
After some data has been received, the buffer and the amount received are
returned to you.

The problem with this model is that the buffer is effectively locked up in
kernel space until the I/O is complete. Compare that with readiness
notifications (select() and derivatives, such as poll/epoll/kqueue), which can
share the same receive buffer among _all_ receiving sockets. (They might need to buffer
things like partially received commands, but you can perhaps use a much
_smaller_ buffer there, and/or only allocate it in the event you require it.)
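
A minimal sketch of the shared-buffer point, assuming Python's `selectors` module (epoll on Linux, kqueue on the BSDs) and local socketpairs standing in for network connections. The function name `drain_with_shared_buffer` is made up for this example, and it assumes each payload is non-empty and small enough to arrive in one read. Every socket is read into the same 4 KiB `bytearray`, because the buffer is only needed once a socket is known to be readable.

```python
import selectors
import socket


def drain_with_shared_buffer(payloads):
    """Receive one message per socket, reusing a single buffer for all.

    Hypothetical helper: payloads must be non-empty bytes objects.
    """
    sel = selectors.DefaultSelector()
    shared = bytearray(4096)  # the only receive buffer in the program
    pairs = []
    for index, data in enumerate(payloads):
        recv_end, send_end = socket.socketpair()
        recv_end.setblocking(False)
        sel.register(recv_end, selectors.EVENT_READ, data=index)
        send_end.sendall(data)
        pairs.append((recv_end, send_end))

    received = {}
    try:
        while len(received) < len(payloads):
            events = sel.select(timeout=1.0)
            if not events:
                break  # nothing became readable; give up rather than spin
            for key, _mask in events:
                # The buffer is borrowed only for the duration of this read.
                count = key.fileobj.recv_into(shared)
                received[key.data] = bytes(shared[:count])
                sel.unregister(key.fileobj)
    finally:
        sel.close()
        for recv_end, send_end in pairs:
            recv_end.close()
            send_end.close()
    return [received.get(i) for i in range(len(payloads))]
```

Under a completion model, each outstanding receive would instead need its own buffer handed over in advance, so the memory cost scales with the number of pending operations rather than with the number of reads in flight at one moment.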

It has been a _long_ time since I've done Windows programming, so please
correct me if I've gotten the above wrong. But that's a fundamental difference
and _advantage_ that epoll/kqueue have over IOCP.

Now, I do think the IOCP model is very much conceptually simpler, and I would
guess that it is easier to write correct IOCP code than epoll code, but at the
expense of memory in some situations. But IOCP doesn't (I don't think) cover
as many situations as epoll/kqueue.

~~~
wahern
People confuse IOCP with Windows' Overlapped I/O. kqueue supports _completion_
notifications; it all depends on the semantics of the event filter and its
flags. Likewise for epoll. For example, both kqueue and epoll support pollable
I/O _completion_ notifications for AIO, which is the analogous Unix API
for Overlapped I/O. (Similarly, people conflate implementation details with
architecture, such as when people explain that AIO isn't like Overlapped I/O
by describing how AIO and Overlapped I/O are _implemented_, without explaining
how the _API_ necessarily makes it so.)

The benefit of IOCP and Overlapped I/O on Windows isn't the design. The
benefit is that it comes complete out of the box, whereas on Linux and *BSD
you either need to roll your own or supplement inconsistent kernel interfaces
that people tend to avoid. But both IOCP and Overlapped I/O are higher-level
APIs than traditional Unix readiness notification. The problem on Windows is
that there's nothing like epoll or kqueue, which is critically important when
you're trying to write library code that works with different event models.
(Even on Windows Overlapped I/O isn't always ideal, especially in libraries
trying to avoid callbacks or support multi-threading strategies different than
those dictated by Overlapped I/O). Windows does implement something equivalent
to traditional Unix readiness notification internally--it's how IOCP and
Overlapped I/O are implemented--but it's unpublished and exceptionally opaque.
See
[https://github.com/piscisaureus/wepoll](https://github.com/piscisaureus/wepoll)

------
hinkley
The punchline:

> Multiple notifications can be consumed without the need to enter the kernel
> at all, and polling for multiple file descriptors can be re-established with
> a single io_submit() call. The result, Hellwig said in the patch posting, is
> an up-to-10% improvement in the performance of the Seastar I/O framework.
> More recently, he noted that the improvement grows to 16% on kernels with
> page-table isolation turned on.
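
A toy model of why that helps, in plain Python. This is not the kernel's actual ring layout or the real AIO API; the class `ToyCompletionRing` and its methods are invented for illustration, and the `syscalls` counter just stands in for trips into the kernel. The idea sketched is that completed events sit in memory the application can read directly, so reaping them costs no system call, and one submit call re-arms many descriptors.

```python
class ToyCompletionRing:
    """Toy model of a completion ring shared between kernel and application."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0      # next slot the application will read
        self.tail = 0      # next slot the pretend kernel will write
        self.syscalls = 0  # counts simulated kernel entries

    def kernel_post(self, event):
        """Stand-in for the kernel appending a completed poll event."""
        next_tail = (self.tail + 1) % len(self.slots)
        if next_tail == self.head:
            raise BufferError("ring full")
        self.slots[self.tail] = event
        self.tail = next_tail

    def reap(self):
        """Consume every pending event by reading memory only."""
        events = []
        while self.head != self.tail:
            events.append(self.slots[self.head])
            self.head = (self.head + 1) % len(self.slots)
        return events

    def resubmit(self, fds):
        """Stand-in for one submit call that re-arms polling on many fds."""
        self.syscalls += 1
        return len(fds)
```

In this model, reaping three events and re-arming three descriptors costs one simulated kernel entry in total, where an epoll-style loop would pay at least one for the wait itself.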

------
Someone
_“But sometimes three is not enough; there is now a proposal circulating for a
fourth kernel polling interface”_

I’m not convinced. Can anybody explain what’s the set of desirable properties
of polling interfaces, and why we need at least four different interfaces to
implement all of them?

~~~
scottlamb
> I’m not convinced. Can anybody explain what’s the set of desirable
> properties of polling interfaces, and why we need at least four different
> interfaces to implement all of them?

I'll give you a partial answer.

One desirable property is that each iteration of your event loop is not O(n)
with the total number of descriptors being watched. select() and poll() are
flawed for that reason—the entire list of file descriptors is passed in on
each iteration and has to at least be compared to the previous iteration. No
one writes things using these ancient interfaces anymore. (That's a bit
unfortunate given that all the modern interfaces are single-platform, so
everyone who cares about portability needs an abstraction layer, but it is
what it is.) epoll() is better.
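
A rough sketch of the difference in calling convention, assuming Python's `select` and `selectors` modules on a Unix-like system (the default selector is epoll-backed on Linux). The function name `compare_wait_styles` is made up. With `select()` the full descriptor list is handed over on every wait; with the registered style, the interest set is stored once and each wait passes no list at all.

```python
import select
import selectors
import socket


def compare_wait_styles(n_idle=50):
    """Return (select_ready_fds, registered_ready_fds, busy_fd).

    Hypothetical helper: watches n_idle quiet sockets plus one busy one.
    """
    pairs = [socket.socketpair() for _ in range(n_idle + 1)]
    watched = [recv_end for recv_end, _ in pairs]
    busy_recv, busy_send = pairs[-1]

    # Registered style: the interest set is stored once, up front.
    sel = selectors.DefaultSelector()
    for sock in watched:
        sel.register(sock, selectors.EVENT_READ)
    try:
        busy_send.send(b"!")

        # select(): all watched sockets are passed in on every call,
        # however few of them are actually ready.
        readable, _, _ = select.select(watched, [], [], 1.0)

        # Registered style: the wait takes no descriptor list and
        # returns only the descriptors that are ready.
        events = sel.select(timeout=1.0)

        return (
            [sock.fileno() for sock in readable],
            [key.fd for key, _mask in events],
            busy_recv.fileno(),
        )
    finally:
        sel.close()
        for recv_end, send_end in pairs:
            recv_end.close()
            send_end.close()
```

Both calls report the same single ready descriptor; the difference is how much has to be passed in and scanned on each iteration to get that answer.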

The kernel doesn't break old programs, so old interfaces stick around,
basically no matter how bad they are. There are three interfaces now, so there
have to be at least three. Whether there can be only three or there have to be
four comes down to whether epoll is (or can become) good enough. If folks
keep coming up with new requirements that can't be met with existing
interfaces, there will just be more and more interfaces over time...

~~~
Someone
So, the answer is “we don’t think we _need_ three, let alone four, but we
happen to have two bad ones that we don’t want to get rid of”?

If those old ones don’t have any unique properties, couldn’t they be
implemented on top of a single syscall, or is the syscall interface sacred on
Linux, and calls cannot be retired?

~~~
caf
What tends to happen is that the kernel is changed to implement the old syscall
using the new infrastructure on the kernel side. There's no real cost to
having another syscall number used.

~~~
Someone
But it grows the amount of code in the kernel, making it more likely to be
buggy, more so if some of these old interfaces get used less and less (and,
hence, likely tested less and less), and the underlying root interface gets
refactored again and again to support new polling interfaces.

~~~
caf
We are generally talking about very simple wrapper functions here that don't
have a lot of scope for hidden bugs to creep in.
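
As a user-space analogy only (this is not kernel code), here is a sketch of how thin such a wrapper can be: a select()-style function built on top of `poll()`, assuming Python's `select.poll`, which exists on Unix-like systems but not on Windows. The name `select_via_poll` is made up for this example.

```python
import select
import socket


def select_via_poll(rlist, wlist, timeout=None):
    """Emulate the select() calling convention on top of poll().

    Hypothetical helper: timeout is in seconds, as with select().
    """
    # Translate the two lists into one event mask per descriptor.
    interest = {}
    for obj in rlist:
        interest[obj.fileno()] = interest.get(obj.fileno(), 0) | select.POLLIN
    for obj in wlist:
        interest[obj.fileno()] = interest.get(obj.fileno(), 0) | select.POLLOUT

    poller = select.poll()
    for fd, mask in interest.items():
        poller.register(fd, mask)

    # poll() takes milliseconds; select() takes seconds.
    millis = None if timeout is None else int(timeout * 1000)
    fired = dict(poller.poll(millis))

    # Translate the results back into select()'s two-list shape.
    read_bits = select.POLLIN | select.POLLHUP | select.POLLERR
    write_bits = select.POLLOUT | select.POLLERR
    readable = [o for o in rlist if fired.get(o.fileno(), 0) & read_bits]
    writable = [o for o in wlist if fired.get(o.fileno(), 0) & write_bits]
    return readable, writable
```

All of the logic is argument and result translation; the waiting itself is done entirely by the newer interface.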

------
cbsmith
This is from January...

~~~
doener
Yes, but the new feature will only arrive in Linux 4.19, which has only been
released as rc1 so far:

[https://lore.kernel.org/lkml/CA+55aFw9mxNPX6OtOp-
aoUMdXSg=gB...](https://lore.kernel.org/lkml/CA+55aFw9mxNPX6OtOp-
aoUMdXSg=gBkQudGGamo__sh_ts_LdQ@mail.gmail.com/)

[https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bfe4037e722ec672c9dafd5730d9132afeeb76e9)

