
Epoll is broken – part 2/2 - protomyth
https://idea.popcount.org/2017-03-20-epoll-is-fundamentally-broken-22/
======
int_19h
I always found it interesting that WinNT got this one right very early on
(IOCP; first appeared in NT 3.5 in 1994), and thus avoided all these problems,
and a series of subsequent APIs trying to correct them. The way you do
_efficient_ asynchronous I/O in Win32 today is still the same as it was 20
years ago.

For those unfamiliar with that model, here's a brief description:
[https://web.archive.org/web/20101101112358/http://doc.sch130...](https://web.archive.org/web/20101101112358/http://doc.sch130.nsc.ru/www.sysinternals.com/ntw2k/info/comport.shtml)

My understanding is that kqueue in FreeBSD is also conceptually similar, but I
never had a chance to take a close look at that.

~~~
jstarks
I dunno, NT tied IOs to the initiating thread until Vista. This means that you
had to keep the initiating thread around or your IOs would be cancelled.

NT also has the same flaw mentioned in the article; it ties the IOCP
registration to the file object instead of the file handle. This means that
you can't use a handle with an IOCP in one process then hand it off to another
process.

The NT IO model has some nice properties but also some pretty serious warts. I
think no one got it right.

~~~
wahern
kqueue got it right. Not just wrt descriptors but with many other aspects of
the API.

The BSDs tend to put more thought and abstraction into their designs. Linux
and Windows designs tend to be primarily driven by the [expected] most common
use case, which often leads to premature design and performance optimizations
that over the years prove ill conceived or shortsighted as programming
patterns shift.

OTOH, Linux and Windows APIs tend to be more immediately useable. On BSD
things tend to be more "some assembly required".

That's all very general but I have very specific examples in mind, like IOCP
vs polling, signalfd vs EVFILT_SIGNAL, inotify v EVFILT_VNODE, containers v
jail, and seccomp v pledge[1], others.

[1] With seccomp v pledge, seccomp seems like it needs more assembly. But I
think the most common expected use case, given that seccomp restrictions are
inherited across fork, was that seccomp sandboxes would be created by a core
system utility which then invoked other services, sandboxing them without
having to modify the services. pledge requires source code modification.
pledge, I think, is the better and more useful approach but it necessarily
requires some assembly. Same story for containers vs jail, at least until
recently.

~~~
_delirium
Regarding pledge: I agree it's a good pragmatic design, but I wouldn't see it
as an example of BSD vs. Linux broadly. To my mind it's a very specifically
OpenBSD style design, a kind of ruthless reduction to a dozen or so hardcoded
cases that cover 90% of the benefit, sane defaults, opposition to
configurability if it can at all be avoided, and screw coming up with a clean
"general" solution. Other BSDs have gone in different directions. FreeBSD's
solution here is capsicum, which predates pledge and is more powerful in
principle, but is more difficult to use, so not that many programs in FreeBSD
actually use it.

~~~
wahern
Ah, good point. Capsicum is a much better example of the difference in
approaches.

------
mbrumlow
From the man page...

> Q6 Will closing a file descriptor cause it to be removed from all epoll sets
> automatically?

> A6 Yes, but be aware of the following point. A file descriptor is a
> reference to an open file description (see open(2)). Whenever a file
> descriptor is duplicated via dup(2), dup2(2), fcntl(2) F_DUPFD, or
> fork(2), a new file descriptor referring to the same open file description
> is created. An open file description continues to exist until all file
> descriptors referring to it have been closed. A file descriptor is
> removed from an epoll set only after all the file descriptors referring to
> the underlying open file description have been closed (or before if the
> file descriptor is explicitly removed using epoll_ctl(2) EPOLL_CTL_DEL).
> This means that even after a file descriptor that is part of an epoll set
> has been closed, events may be reported for that file descriptor if other
> file descriptors referring to the same underlying file description remain
> open.

EDIT:

Lots of comments here seem to think this should be unexpected or is a bug.
Closing a FD you are using is a bug. I think epoll does a fairly good job of
letting the user know that it is watching the description and not the
descriptor. Failing to read the man page for dup would also leave you in a
blind spot. I have been writing code for Linux for a while now and I did not think
it was any secret that a file is still open until all of the fds pointing to
it are closed. That is why you have to take care and close your duplicated fds
at the right time otherwise you will end up with file handles leaking. The
example code provided illustrates this perfectly.

As a side note, using dup2 to point your original FD number (the one passed to
epoll) back at the still open description via the duped fd should allow you to
remove it.
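
A minimal sketch of the behavior the man page describes, assuming Linux and
Python's `select.epoll` wrapper around the epoll syscalls. The function name
`events_after_close` is made up for illustration. It registers the read end of
a pipe, duplicates it, closes the registered descriptor, and then polls:

```python
import os
import select

def events_after_close():
    """Return (closed_fd, events) after closing a registered, dup'ed fd."""
    r, w = os.pipe()
    ep = select.epoll()
    try:
        ep.register(r, select.EPOLLIN)
        r_dup = os.dup(r)      # second descriptor, same open file description
        os.close(r)            # the registered descriptor number is gone now
        os.write(w, b"x")      # make the description readable
        events = ep.poll(0.1)  # list of (fd, eventmask) pairs
        os.close(r_dup)
        return r, events
    finally:
        ep.close()
        os.close(w)
```

On Linux this should report one readable event tagged with the number of the
descriptor that was already closed, because the description stays open through
the duplicate.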

~~~
ajross
> Closing a FD you are using is a bug.

Sort of, but think of the problem epoll is addressing: having an epoll-
registered descriptor is semantically equivalent to having an active read() on
the thing. That's sort of the whole point, of course. Calling close() on a
descriptor blocked in read() works fine and produces the expected (synchronous
EOF) behavior. Doing it on the epoll one produces surprising behavior
depending on whether or not it got dup'd somewhere (which may have happened
somewhere outside your process!).

It's a bug for sure. Not sure it rises to the level of "fundamentally broken",
but it's a bug.

~~~
mark4o
POSIX does not require EOF to be returned from a blocking read() if the same
file descriptor is passed to close() in another thread. Even if it appears to
work on some particular OS, consider the case where the reading thread may
have called read() but is preempted just before entering the kernel. If
another thread then closes the file descriptor, expecting the first thread to
see EBADF or EOF, but then the file descriptor is reused (e.g. by accept())
before the first thread resumes, the reading thread may block reading from an
unrelated socket. Even if you can guarantee that accept(), open(), or other
calls that explicitly allocate a file descriptor will not be called, ordinary
library functions are generally allowed to use file descriptors internally
(e.g. to read a configuration file), so closing the file descriptor while
another thread is reading from it is almost certainly a bug.
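
The descriptor reuse hazard can be sketched in a single thread, assuming a
POSIX system where new descriptors get the lowest free number. The function
name `stale_fd_points_elsewhere` and the use of `/dev/null` as the "unrelated"
file are illustrative choices only:

```python
import os

def stale_fd_points_elsewhere():
    """Return (number_was_reused, data_read_through_stale_number)."""
    r, w = os.pipe()
    stale = r                 # the number a preempted reader would still hold
    os.close(r)               # stands in for the other thread's close()
    # An unrelated open, e.g. a library reading a config file, takes the
    # lowest free descriptor number, which is the one just closed.
    other = os.open(os.devnull, os.O_RDONLY)
    try:
        reused = (other == stale)
        data = os.read(stale, 1)  # succeeds, but reads the unrelated file
        return reused, data
    finally:
        os.close(other)
        os.close(w)
```

The read through the stale number does not fail with EBADF; it quietly
operates on whatever file now owns that number.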

------
jgh
Interesting writeup, seems like this only applies to people writing epoll
abstraction layers though. Everybody else is able to avoid this bug by
deregistering the fds before calling close. So I'm not sure calling it
"fundamentally broken" is necessarily accurate.

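A minimal sketch of the deregister-before-close ordering, assuming Linux and
Python's `select.epoll`. The helper names `close_registered` and
`poll_after_clean_close` are hypothetical:

```python
import os
import select

def close_registered(ep, fd):
    """Deregister fd from the epoll object, then close it (order matters)."""
    ep.unregister(fd)   # EPOLL_CTL_DEL while the number still resolves
    os.close(fd)

def poll_after_clean_close():
    """Return the events seen after a clean close, with a duplicate left open."""
    r, w = os.pipe()
    ep = select.epoll()
    try:
        ep.register(r, select.EPOLLIN)
        r_dup = os.dup(r)        # a duplicate keeps the description open
        close_registered(ep, r)
        os.write(w, b"x")        # readable, but nothing is registered anymore
        events = ep.poll(0.1)
        os.close(r_dup)
        return events
    finally:
        ep.close()
        os.close(w)
```

With the registration removed first, the leftover duplicate no longer causes
events to be delivered.
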
~~~
joosters
Yeah, I think 'fundamentally broken' is going into clickbait territory - these
are API calls that are successfully used by a huge number of programs. epoll()
is the most recent of these multiplexer syscalls and it's been around for a
decade or so. The underlying message is that there are corner cases that exist
and should be thought about, but that is true of most syscalls.

The epoll() call itself is a strange API. As the series of articles says, it
was introduced to solve the problem of the kernel having to register and
unregister every listed FD on every invocation. However, as a trade-off, your
program now needs to make a syscall each time it wants to add/remove an FD. So
where you once had a single poll() call per event loop, you now have to fire
off extra syscalls for each single FD on each new connection.

Compare this to the API of FreeBSD's kqueue - it is another solution to avoid
the linear costs of select()/poll(), but instead of making a syscall for
registering/unregistering each new FD, kevent() is called with a list of FDs
that you want to change. So an event loop using kqueue consists of just one
syscall per loop.

~~~
zrm
> Compare this to the API of FreeBSD's kqueue - it is another solution to
> avoid the linear costs of select()/poll(), but instead of making a syscall
> for registering/unregistering each new FD, kevent() is called with a list of
> FDs that you want to change. So an event loop using kqueue consists of just
> one syscall per loop.

It would be simple enough to add a version of epoll_ctl that took array
arguments, but the existing epoll_ctl makes a lot of sense for the most common
use cases.

You either know all the descriptors ahead of time or you don't. If you do then
you can add them all on startup and never touch it again, and one-time costs
usually aren't worth optimizing. If you don't then you're usually modifying
them one at a time anyway.

And if the syscall takes them one at a time then you don't have to worry about
maintaining the state yourself or memory management based on how many
descriptors you have or anything like that.

------
protomyth
[https://illumos.org/man/5/epoll](https://illumos.org/man/5/epoll)

 _While a best effort has been made to mimic the Linux semantics, there are
some semantics that are too peculiar or ill-conceived to merit accommodation.
In particular, the Linux epoll facility will -- by design -- continue to
generate events for closed file descriptors where/when the underlying file
description remains open. For example, if one were to fork(2) and subsequently
close an actively epoll'd file descriptor in the parent, any events generated
in the child on the implicitly duplicated file descriptor will continue to be
delivered to the parent -- despite the fact that the parent itself no longer
has any notion of the file description! This epoll facility refuses to honor
these semantics; closing the EPOLL_CTL_ADD'd file descriptor will always
result in no further events being generated for that event description._

------
al452
"Using these toolkits is like trying to make a bookshelf out of mashed
potatoes." - Jamie Zawinski

------
ericbb
For people reading this thread who want to better understand the distinctions
between files, file descriptions, and file descriptors, I recommend Michael
Kerrisk's book The Linux Programming Interface, which has great coverage of
this topic.

------
mbrumlow
I am making a new post because I feel like this is getting out of hand. There
is a lot of misinformation that needs to be cleared up.

When you are working with system calls you use fds to tell the kernel what
file you are working with. These are simply numbers, an int to be exact. When
you make a system call that takes an fd, you pass the int that references the
file you want to work with, and it is looked up in a table to find the kernel
structure that represents the file. When you dup an fd, all that does is add
an entry to the lookup table pointing to the same kernel structure --
increasing the refcount at the same time. The refcount lives on the kernel
structure, not with the int. The fd system is really simple: open, dup, and
dup2 add an entry, while close removes the entry. Once the entry is removed,
the fd is useless for any communication with the kernel over system calls,
because the lookup table can no longer translate your fd to a kernel
structure.

Now the question: is epoll broken, or does it have a bug? I don’t think so.
You just have to keep in mind you are using a system call and not a library.
This is controlled access to kernel level functions. In the example, adding an
fd to epoll, then duping it, then closing it produces some surprising
results. Understanding how userspace talks to kernel space and how system
calls translate into running kernel code is key to understanding why this is
not a bug but just how things work. You might also be surprised that other
system calls behave the same way!

When you tell epoll to wait you use an fd to tell it which file you want to
wait on. You are not telling a library or a userspace program to do this. You
are telling the kernel. So the fd is translated into a kernel object and that
is what epoll is working with. This translation is done only when adding and
removing. The kernel subsystem that provides epoll will continue to work with
kernel objects -- not fds.

When you call dup you are adding a new entry to your program's fd lookup table
for a new fd that points to the same kernel structure -- at the same time you
are also incrementing the refcount of the kernel structure. The file is not
actually closed until both the original fd and the duped fd have had close
called on them, or more correctly, until the refcount of the kernel structure
is zero. This means that if you tell epoll to wait on a file and do so by an
fd, and then close the fd -- you will only see a close event if the refcount
is zero. By calling dup before closing your original fd you have incremented
the refcount and thus the file remains open -- no close event.

If you then try to remove your file from epoll using the now invalid (closed)
fd it will fail. This is because the fd is no longer in the fd lookup table
and the system call cannot translate your fd into a kernel object. Now you
might say this is not how other system calls work; another thread pointed out
that “read” does not behave this way. On the contrary -- read actually works
the same way. If you dup your socket fd, and use a thread to close your socket
fd while in an active read -- your read will not fail. It will finish
delivering the data to you. This is because the fd is not the actual object
you are working with. It is just a reference to the object. Once you enter the
read function you are in kernel land, and the kernel only knows about the
kernel object. Because you duped the fd before closing it, the kernel object
has a refcount of 2 and your close during the read leaves the refcount at 1,
so there is nothing for the kernel to do other than finish its read. If you
attempt to use the now closed and invalid fd for a subsequent read or close,
you will get an EBADF error -- much like you do when you attempt to use it
with epoll's functions to remove the file.
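
A rough sketch of those two failure cases, assuming Linux and Python's
`select.epoll`. The function name `closed_fd_errors` is made up for
illustration. Note that Python releases before 3.9 silently ignored EBADF in
`unregister`, so the first value may come back as `None` there:

```python
import errno
import os
import select

def closed_fd_errors():
    """Collect what happens when a closed fd number is reused in later calls."""
    r, w = os.pipe()
    ep = select.epoll()
    result = {}
    try:
        ep.register(r, select.EPOLLIN)
        r_dup = os.dup(r)   # two table entries now share one description
        os.close(r)         # number r leaves the table; description lives on

        try:
            ep.unregister(r)   # EPOLL_CTL_DEL with a number that no longer resolves
            result["unregister"] = None
        except OSError as e:
            result["unregister"] = e.errno

        try:
            os.read(r, 1)      # same lookup failure for a plain read
            result["read"] = None
        except OSError as e:
            result["read"] = e.errno

        os.write(w, b"x")
        result["dup_read"] = os.read(r_dup, 1)  # the duplicate still works
        os.close(r_dup)
        return result
    finally:
        ep.close()
        os.close(w)
```

Both calls through the closed number are expected to fail with EBADF, while
the read through the duplicate returns the byte that was written.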

~~~
Asooka
But no other syscall interface presumes that you're directly twiddling kernel
structs and I struggle to think of any other syscall that operates on fds, for
which you need to keep in mind that it actually operates on internal kernel
structs.

It would be trivial for the kernel to attach the epoll struct to the file
descriptor and give you a close notification when the descriptor is closed
(even if there are other open descriptors to the same file), but it's really
hard to emulate that behaviour in userland, except this is the precise
behaviour most people want! The only way to emulate this behaviour is to write
a full abstraction over all file syscalls and make sure you never ever call
them directly.

~~~
mbrumlow
The entire point of my post was to say that yes, all the other system calls
do work with kernel structs internally once the system call passes into
kernel land. Again, please do not forget a system call is a controlled
exposure to kernel code -- it is not a library.

Every function call works the same way as epoll once the fd is translated
into an actual kernel structure. Conflating the fd with the file is where all
the assumptions go wrong. The fd is the means by which you tell epoll which
file you want event notifications for. A file is NOT closed until there are no
more references. This is simply how Linux works. To change that would be
non-trivial and break years of software.

Let's explore your idea a bit more. If you were to want to know when a "fd"
was closed, what would you be using that for? You really want to know when a
file was closed right? If so, any code that you execute after you get the
close on the fd would be wrong and cause even more surprising results. This is
because the file is not actually closed. It is open. Whoever has the remaining
handles can modify it while you think the file is closed -- that is a bug.

------
mnarayan01
I would think you could use dup (or dup2/ioctl) to de-register (assuming
nothing grabbed that FD)?

~~~
DSMan195276
That could possibly work, but only if you dup() the same original FD. If you
just try dup()ing any random FD the internal file description pointer won't
match, so epoll won't recognize it.

But of course, if you have any threads then it's a big race condition.
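
A single-threaded sketch of that dup2 workaround, assuming Linux, Python's
`select.epoll`, and that nothing else grabbed the closed number in the
meantime. The function name `deregister_via_dup2` is hypothetical:

```python
import errno
import os
import select

def deregister_via_dup2():
    """Return (errno_for_wrong_number, events_after_dup2_deregistration)."""
    r, w = os.pipe()
    ep = select.epoll()
    try:
        ep.register(r, select.EPOLLIN)
        r_dup = os.dup(r)
        os.close(r)              # registration is now orphaned under number r

        # Same description but a different number: epoll does not find it.
        try:
            ep.unregister(r_dup)
            wrong_number_errno = None
        except OSError as e:
            wrong_number_errno = e.errno

        os.dup2(r_dup, r)        # bring number r back, same description
        ep.unregister(r)         # matches the original registration again
        os.close(r)

        os.write(w, b"x")
        events = ep.poll(0.1)    # nothing registered anymore
        os.close(r_dup)
        return wrong_number_errno, events
    finally:
        ep.close()
        os.close(w)
```

Deregistering through the duplicate's own number is expected to fail with
ENOENT, since the registration is matched on both the description and the
original number. With other threads running, the window between close and
dup2 is exactly the race mentioned above.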

