
Epoll is fundamentally broken - Philipp__
https://idea.popcount.org/2017-02-20-epoll-is-fundamentally-broken-12/
======
ktRolster
The article explains how to use epoll() correctly to solve all the problems he
raises. That's not 'fundamentally broken,' because it works. The worst you can
say is _confusing and painful._ Maybe that doesn't make good headlines,
though.

When you're dealing with shared resources (like a listening socket), and you
are using threads, and you are trying to maximize performance with networking,
_confusing and painful_ is kind of the nature of the problem.

~~~
gizmo
The linked interview with Bryan Cantrill gives a good explanation of why it's
broken: a straightforward implementation behaves in a way you never want, with
bizarre behavior where the kernel can wake up a thread to accept() on an fd
that has already been closed. Linux should just have adopted kqueue or an IOCP
model instead.

[https://youtu.be/l6XQUciI-Sc?t=3643](https://youtu.be/l6XQUciI-Sc?t=3643)

~~~
dboreham
I believe at the time there was concern that Microsoft would sue for patent
infringement relating to IOCP.

Source: at the time I worked on high-performance server products, and whenever
I mentioned implementing IOCP to people active on the kernel side of the
problem, I encountered much head-shaking and muttering about patents.

~~~
aijeiy9X
Didn't Linus say that all he cared about was getting code back, with no
concern over patent grants such as the one in GPLv3? Head-shaking and
muttering is all well and good, but the kernel developers have had a long time
to take an active stance against software patents and have consistently chosen
not to do so. As a project they have chosen not to get political about
patents, so this is the reality they have embraced.

~~~
dboreham
I think we're saying the same thing.

fwiw there was an IOCP-like capability added to AIX, again supposedly because
IBM did not have fears about patent infringement (likely due to cross-
licensing arrangements).

------
viraptor
Is this actually the case?

> Waking up "Thread B" was completely unnecessary and wastes precious
> resources. Epoll in level-triggered mode scales out poorly.

In the analysed situation thread B was already in a wait state and there
weren't enough incoming connections to immediately accept the next one. Of
course some resources (CPU time) were wasted, but does that impact performance
in any way? (Assuming one "main" application on that host.)

~~~
bonzini
Because the time wasted by all the threads trying to accept() or read() on the
ready socket introduces latency for all other sockets. And since throughput is
roughly bounded by the number of threads divided by the per-request latency
(Little's law), increasing the latency costs you scalability.

In addition, because a single "readiness event" has to wake up many threads,
it can introduce lock contention and cacheline bouncing in the kernel's data
structures.

------
bluejekyll
I don't feel like this was given enough time:

> One option is to use SO_REUSEPORT and create multiple listen sockets sharing
> the same port number. This approach has problems though - when one of the
> file descriptors is closed, the sockets already waiting in the accept queue
> will be dropped

Yes, it's a problem if you use it across processes... but if you have a single
long lived process with each thread listening on a separate queue, isn't that
the simplest solution?
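
Roughly what I have in mind (just a sketch, untested, error handling omitted):
each thread creates its own listening socket on the same port, so the kernel
gives every thread a private accept queue.

    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    
    /* Called once per worker thread; no fork() anywhere. */
    static int make_listener(unsigned short port)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
    
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));
        listen(fd, SOMAXCONN);
    
        /* Each thread accept()s only on its own fd, i.e. its own queue. */
        return fd;
    }

(Closing one of these fds drops whatever is still queued on it, which is the
caveat the article mentions.)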

~~~
richardwhiuk
That process isn't allowed to fork, I assume, which is problematic.

~~~
bluejekyll
I'm not proposing a fork either. Each thread would open a separate socket with
the SO_REUSEPORT set. The kernel would then have a separate queue per socket.

Please correct me if I'm wrong.

~~~
richardwhiuk
I'm saying that none of those threads can start any programs. That may or may
not be required by your use-case, but it's certainly a disadvantage.

------
arielweisberg
You probably shouldn't share epoll FDs across threads for performance reasons.
A shared nothing design is likely to perform better with a simpler
implementation in both the application and the kernel.

I don't see sharing FDs across threads as a useful thing to aspire to.

The common design I see these days is load balancing FDs across shared-nothing
threads. The thread that receives the notification via the selector is the
thread that does the IO (no other thread has that FD). Keep adding threads as
makes sense and never let them block.
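
For concreteness, a rough sketch of that pattern (names like acceptor_loop and
worker_epfd are made up, error handling omitted): one acceptor hands each new
connection to a worker's private epoll instance, and from then on only that
worker ever touches the fd.

    #include <sys/epoll.h>
    #include <sys/socket.h>
    
    #define NWORKERS 4
    static int worker_epfd[NWORKERS];   /* one epoll instance per worker */
    
    /* Runs on its own thread, one per worker. */
    static void *worker_loop(void *arg)
    {
        int epfd = *(int *)arg;
        struct epoll_event ev[64];
        for (;;) {
            int n = epoll_wait(epfd, ev, 64, -1);
            for (int i = 0; i < n; i++) {
                int fd = ev[i].data.fd;
                /* This worker is the fd's only owner: it reads, processes
                   and writes back without coordinating with other threads. */
                (void)fd;
            }
        }
        return 0;
    }
    
    static void acceptor_loop(int listen_fd)
    {
        int next = 0;
        for (;;) {
            int conn = accept(listen_fd, 0, 0);
            struct epoll_event e = { .events = EPOLLIN };
            e.data.fd = conn;
            /* Registering into another thread's epoll instance is the
               handoff; the acceptor never touches conn again. */
            epoll_ctl(worker_epfd[next], EPOLL_CTL_ADD, conn, &e);
            next = (next + 1) % NWORKERS;
        }
    }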

A combined queue makes sense when task sizes are large. For small tasks the
performance is poor. I see the queuing decision as something you make after
you have already retrieved the message from the network and presented it to an
application layer which makes a decision on how to dispatch.

It would be cool if the kernel would do this for you under the hood! That
would be amazing. Just making it correct is not enough though.

~~~
nine_k
If no data are shared, what is the point of having threads, as opposed to
processes?

~~~
tyingq
One example is in NGINX. They mostly follow a "no threads" pattern, but
specifically for Linux, they support "some threads" to get around issues with
Linux aio.

[https://www.nginx.com/blog/thread-pools-boost-performance-9x/](https://www.nginx.com/blog/thread-pools-boost-performance-9x/)

That's not "shared nothing" threads, of course, so it doesn't answer your
question at a high level. It does highlight, though, that Linux is especially
challenging in this space as compared to FreeBSD. And not just because of
epoll().

------
xorblurb
I'm curious if you can really design a practical API that avoids all the
issues the author talks about.

Even on something as simple as interrupt delivery to a single consumer, you
MUST be prepared to handle merged and spurious interrupts -- I would argue
that any driver that is not prepared to do so (in the general case) is buggy.

Maybe it's easier to do perfectly with an epoll/kqueue API for some reason,
but, without having tried to think much about it, I can't imagine why it
should be. I have the intuition this is way harder. Actually I'm not even sure
I can have any intuition about the difficulty of achieving the behavior wanted
by the author of that article, because the author never actually specified the
behavior he desires...

------
rdtsc
> The best and the only scalable approach is to use recent Kernel 4.5+ and use
> level-triggered events with EPOLLEXCLUSIVE flag. This will ensure only one
> thread is woken for an event, avoid "thundering herd" issue and scale
> properly across multiple CPU's

Exactly. Use EPOLLEXCLUSIVE for accept; that seems to work and has been in the
kernel for more than a year.
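
For reference, the accept side looks something like this (sketch only,
non-blocking listen socket assumed, error handling omitted): every worker has
its own epoll instance and registers the shared listen socket with
EPOLLEXCLUSIVE, so the kernel wakes roughly one waiter per connection.

    #include <sys/epoll.h>
    #include <sys/socket.h>
    
    /* Each worker thread has its own epfd and calls this once. */
    static void watch_listener(int epfd, int listen_fd)
    {
        struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE };
        ev.data.fd = listen_fd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
    }
    
    static void accept_loop(int epfd, int listen_fd)
    {
        struct epoll_event ev;
        for (;;) {
            if (epoll_wait(epfd, &ev, 1, -1) < 1)
                continue;
            /* Another thread may still have raced us here, so a
               non-blocking accept can return EAGAIN; just loop. */
            int conn = accept(listen_fd, 0, 0);
            if (conn >= 0) {
                /* register conn with this thread's epfd and move on */
            }
        }
    }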

For reads, how reasonable is it to even share the file descriptor between
multiple threads? Just hand it to a thread and let that thread (possibly bound
to a CPU core for a possible extra performance boost) handle it from then on.

Am I crazy to be surprised by that architecture choice?

------
dijit
Anecdatum: I work with a guy who is scarily brilliant when it comes to
programming; he programs on Windows and his software is of really good
quality. I asked him why he used Windows and he gave two reasons:

1) Windows is mandated by HQ as the only supported desktop. You're not getting
anything else to run on your desktop, so you either write Windows software
locally or use a VM (which is cumbersome).

2) Epoll is trash.

I have successfully convinced him to make some software on FreeBSD because
kqueue is fundamentally better. But it's shocking that epoll is so bad even
compared to Windows :(

~~~
ktRolster
Windows has no right to trash-talk: WaitForMultipleObjects() is one of the
most depressing select-like APIs out there.

~~~
GauntletWizard
Not being familiar with it, it looks pretty standard. What are its negative
attributes?

~~~
ktRolster
Conceptually, select() can wait on anything that's a file descriptor.
WaitForMultipleObjects() can only work on things that were specifically added
to its capability list.

Practically, MAXIMUM_WAIT_OBJECTS is set to such a low number (64) that people
resort to this kind of hack:
[https://msdn.microsoft.com/en-us/library/windows/desktop/ms687055\(v=vs.85\).aspx](https://msdn.microsoft.com/en-us/library/windows/desktop/ms687055\(v=vs.85\).aspx)

~~~
xorblurb
IIRC there are even advanced Win32 APIs that automatically handle a thread
pool for you, so that you can go above MAXIMUM_WAIT_OBJECTS!

That kind of "solution" to such a problem is way more insane than epoll()
eventually providing the correct flags to fix most of its own issues, even if
we had to wait a decade for them...

So praising NT because epoll is "trash" is uninformed at best, malicious at
worst.

------
bitwize
The least broken modern OS when it comes to async I/O is Windows.

I/O completion ports are the correct solution when it comes to waking up one
of multiple threads to handle I/O events.

------
bogomipz
I have a question. The post states:

>"This is because "level triggered" (aka: normal) epoll inherits the
"thundering herd" semantics from select(). Without special flags, in level-
triggered mode, all the workers will be woken up on each and every new
connection."

Isn't this behavior similar to disk I/O, where the kernel wakes up (via
wake_up()?) all tasks that are sleeping on a wait queue for disk I/O? I
believe all processes sleeping on a disk I/O wait queue will be woken up
regardless of whether their disk I/O is complete or not, and will be put back
to sleep in the case where it is not.

------
yokohummer7
So, the kernel is free to assign a connection to a new thread even if there is
a previous thread working on it? Because the kernel has no idea when the
previous worker will be done processing the data. So the user has to manually
unregister and register the connection each time?

How about a concept like a "transaction"? Maybe something like `epoll_begin`
and `epoll_end`? One thread acquires exclusive access when entering the block,
and releases it when leaving.
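
(From the man page, EPOLLONESHOT looks like the closest existing thing to
that: the fd is disarmed after delivering one event, and the worker has to
explicitly re-arm it when it's done. Something like this sketch, with a
non-blocking fd assumed and error handling omitted:)

    #include <sys/epoll.h>
    #include <unistd.h>
    
    static void handle_ready(int epfd, int fd)
    {
        char buf[4096];
        ssize_t n;
        /* fd is disarmed: no other thread can be woken for it right now */
        while ((n = read(fd, buf, sizeof(buf))) > 0) {
            /* process buf[0..n) */
        }
    
        struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT };
        ev.data.fd = fd;
        /* "end of transaction": make the fd visible to epoll_wait again */
        epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
    }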

Does this flaw also exist in IOCP or kqueue?

~~~
Philipp__
I think not; this[0] could be a nice read.

[0] [http://people.eecs.berkeley.edu/~sangjin/2012/12/21/epoll-vs-kqueue.html](http://people.eecs.berkeley.edu/~sangjin/2012/12/21/epoll-vs-kqueue.html)

~~~
yokohummer7
Well, just at a glance, kqueue seems to pose similar problems. I mean, the
problem mentioned in the OP:

    
    
      1. Kernel receives some data, and wakes up A
      2. A reads some data
      3. Kernel receives some other data, and wakes up B (cause A didn't call `epoll_wait` yet)
      4. B reads some data, leading to a race condition (out-of-order reading).
    

kqueue's `kevent` seems to be more or less similar to `epoll_wait`, in that it
doesn't seem to provide a way to send the kernel a "we're done" signal. Am I
missing something here? Does `kevent` also signal the end of exclusive access?

~~~
Philipp__
From Bryan's interview it's clear they've solved the problem that way; the
article I linked doesn't indicate it, though.

Take a look here[1], at 'EVFILT_SIGNAL'. Does that mean we have to manually
attach a signal to monitor, and then we receive the "we're done"? That's kinda
similar to what you have to do with epoll; the difference is that the kqueue
interface encapsulates the functionality of 'epoll_wait' and 'epoll_ctl'?

[1]
[https://www.freebsd.org/cgi/man.cgi?query=kqueue&sektion=2](https://www.freebsd.org/cgi/man.cgi?query=kqueue&sektion=2)

Edit: sorry for spamming with links, but I find these things really
interesting. Look here
[http://austingwalters.com/io-multiplexing/](http://austingwalters.com/io-multiplexing/),
specifically at the 4 steps after the kqueue code; I think that explains it
well. So it looks like we wait for the 'done' signal, after which we re-watch.
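
If I'm reading the man page right, the closest kqueue analogue to epoll's
one-shot re-arm is EV_DISPATCH: the event source is disabled after delivery
and the worker re-enables it once it's done. Roughly (sketch only, untested,
non-blocking fd assumed):

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <unistd.h>
    
    static void handle_ready(int kq, int fd)
    {
        char buf[4096];
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf))) > 0) {
            /* process buf[0..n); the event stays disabled meanwhile */
        }
    
        struct kevent change;
        /* "we're done": re-enable the (EV_ADD | EV_DISPATCH) registration */
        EV_SET(&change, fd, EVFILT_READ, EV_ENABLE | EV_DISPATCH, 0, 0, 0);
        kevent(kq, &change, 1, 0, 0, 0);
    }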

------
lngnmn
It is pthreads which are fundamentally broken indeed, not epoll.

~~~
unscaled
All of the problems outlined are relevant whether you pre-fork into multiple
processes or use different threads. I don't see what it has to do with
pthreads.

------
beastman82
So what's the best way to multiplex then?

~~~
AaronFriel
(Caveat: this is a contentious opinion on Hacker News)

(Caveat #2: I'm not terribly familiar with such low level programming. Take
anything I say with a grain of salt.)

It's my opinion that IO Completion Ports on Windows are superior to the
approach taken by *nix and BSDs.

Instead of having the usermode application sleep and wake up, do some checks,
etc., the application provides an entry point to be called when an event
occurs or data is available. Essentially, a callback for the kernel to use.
The kernel then jumps directly to this, and can manage the threads involved,
using a thread pool to balance requests. This gives much better utilization of
threads than with poll/epoll/kqueue, but does place some other constraints on
how the code is written.

The fundamental difference is that the Unix kin use a readiness-based model:
they wake up a thread to tell it that a descriptor is ready to be read. IOCP
on Windows is a completion-based model, and wakes up a thread with the data
(or error) already present in a data structure provided to the thread.
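
The worker side, as I understand it, is just a loop like this (Win32, sketch
only, error handling omitted); the point is that the byte count and the buffer
posted with the original ReadFile/WSARecv come back together with the wakeup.

    #include <windows.h>
    
    DWORD WINAPI worker(LPVOID arg)
    {
        HANDLE iocp = (HANDLE)arg;
        for (;;) {
            DWORD bytes;
            ULONG_PTR key;      /* per-handle context, set at association */
            OVERLAPPED *ov;     /* per-operation context, owns the buffer */
            if (GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE)) {
                /* `bytes` of data are already sitting in the buffer that was
                   supplied when the read was issued; no further read needed */
            }
        }
        return 0;
    }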

~~~
zzzcpan
> a completion based model, and wakes up threads with the data (or error)

Which means that in this model you have to allocate and provide a buffer for
that data long before the kernel is going to fill it. It just sits there
waiting, wasting memory. In the Unix model, by contrast, you don't have to
allocate a buffer until you know there is some data to copy from the kernel,
which is easier for the user and much more efficient.

The completion model makes sense if your entire networking stack lives in
userspace and you can allocate memory at the lowest layer but pass it as a
reference all the way up. Or if you can at least do syscall batching, to make
operating on very small buffers efficient.

~~~
eklitzke
This exact same problem is also present in the concurrency model provided by
Go. To read from the network you need to provide a buffer to read into, which
means that a buffer has to be allocated for every goroutine (instead of just
the goroutines that actually have data to read).

