Hacker News new | past | comments | ask | show | jobs | submit login
Epoll is fundamentally broken (popcount.org)
160 points by Philipp__ on Feb 26, 2017 | hide | past | favorite | 58 comments

The article explains how to use epoll() correctly to solve all the problems he raises. That's not 'fundamentally broken,' because it works. The worst you can say is confusing and painful. Maybe that doesn't make good headlines, though.

When you're dealing with shared resources (like a listening socket), and you are using threads, and you are trying to maximize performance with networking, confusing and painful is kind of the nature of the problem.

The linked interview with Bryan Cantrill gives a good explanation of why it's broken: a straightforward implementation behaves in a way you never want, with bizarre behavior where the kernel can wake up a thread with an accept() on a fd that has already been closed. linux should just have adopted kqueue or an IOCP model instead.


I believe at the time there was concern that Microsoft would sue for patent infringement relating to IOCP.

Source: at the time I worked on high-performance server products and had discussions with various people active on the kernel side of the problem and encountered much head-shaking and muttering about patents, when I mentioned implementing IOCP.

Didn't Linus said that all he cared about is getting code back, with no concern over patent grants such as the one in gplv3? Head-shaking and muttering is all good but the kernel developers has had a long time to take an active stance against software patents and has consistently chosen not to do so. As a project they have chosen not to get political about patents, so this is the reality they have embraced.

I think we're saying the same thing.

fwiw there was an IOCP-like capability added to AIX, again supposedly because IBM did not have fears about patent infringement (likely due to cross-licensing arrangements).

   >accept() on a fd that has already been closed.
Why are people doing that?

Threads, like drugs, make you do strange and unpredictable things.

They would never EVER adopt something NIH. Linux is riddled with proof of that. They should still adopt kqueue though. But he is right, it is, just.. broken isn't even the right word. Crime against humanity maybe?

What is so disturbing is this has been known forever that epoll should be drown in a bathtub but no one does, it's just "la dee da I can't here you! I can't hear you!" and Linux trudges on with yet another limp.

As he also says "kqueue and event ports are T-Ball, compared to ZFS and Dtrace and jails are a lot harder... if you got the little stuff wrong..."

It's just frustrating. It's broken. You know it's broken. FIX IT!

The most generous interpretation is that it can work - if you're careful, and if you're using a kernel from 2016.

While I trust the author to do this (thankfully, as he's my coworker) there is a lot of Linux software that doesn't, even assuming it was updated in the last year and you're running something vaguely bleeding edge (not Debian).

A new Debian is about to be released, featuring Linux 4.9.

Some people using a completely different kind of OS insist in continuing to use a version from 2009, instead of its more recent version. I guarantee you that this version from 2009 also has its share of dark APIs, some of which have been fixed in the modern version...

Yet I've yet to find articles titled like if all of them are equally broken, while the content would precisely describe the caveats to do not-broken things on the capable recent releases.

Yes, exactly.

when you deal with multi threads, not only epoll will cause some problems, but also global variable, memory, etc. global variable solved by mutex, but epoll solved by avoid using it to epoll_wait fds in multi threads.

Don't we usually hide warts like this behind libuv or some other popular library?

I wonder if we could put a sane fix into libc to fix these problems.

As far as I know in libuv it is not possible to share socket in different event loops. But the point is while it might be possible to build an abstraction to fix the broken API, is not it would be better to have fixed API instead

> The worst you can say is confusing and painful

Everyone's approach towards these mechanisms is broken. Just don't treat it as a reliable notification mechanism, do your own scheduling using information from epoll/kqueue only as hints and everything will be fine.

I admit, sometimes when I read these articles, I have trouble understanding why these people are having so much difficulty getting it to work.

It's also hard for me to imagine a scenario where accept() takes longer than servicing a request and becomes a bottleneck. That is, why would you need multiple threads accepting on the same socket?

Author here. I showed an example - short lived http 1.0 connections. I haven't done benchmarks but my hunch is that you can run accept at in low tens of thousands times per second from one CPU. If you have more than, say, 10k qps the accept() might well be the bottleneck. You can imagine having a box with 64 CPU's, doing short-lived connections, and being limited by accept() done on only one CPU since it doesn't scale.

Second issue is cache locality. If you do accept() in one thread only, then you will need to move the new accept-ed client socket to another worker thread. Depending on details this might not be efficient - aRFS comes into mind. (but frankly, epoll alone won't help here, you need SO_REUSEPORT with SO_INCOMING_CPU).

Even if you don't agree that scaling out accept is a real concern - that's missing the point. The point is: the epoll() model should take this into account and at least support this problem. Or loudly say that scaling out accept() with epoll is not possible. But neither things happened. Up till kernel 4.5 it was impossible to do it correctly, but undocumented, from 4.5 you can use the EPOLLEXCLUSIVE flag, which I feel is a hack.

I haven't done benchmarks but my hunch

Your writing will be more convincing if you profile.

Is this actually the case?

> Waking up "Thread B" was completely unnecessary and wastes precious resources. Epoll in level-triggered mode scales out poorly.

In the analysed situation thread B was already in a wait state and there aren't enough incoming connections to immediately accept the next one. Of course the resources (CPU time) were wasted, but does that impact the performance in any way? (Assuming one "main" application on that host)

Because the time wasted by all the threads trying to accept() or read() on the ready socket introduces latency for all other sockets. And since throughput is bounded by the number of threads divided by inverse of latency, increasing the latency makes you lose in scalability.

In addition, because a single "readiness event" has to wake up many threads, it can introduce lock contention and cacheline bouncing in the kernel's data structures.

I don't feel like this was given enough time:

> One option is to use SO_REUSEPORT and create multiple listen sockets sharing the same port number. This approach has problems though - when one of the file descriptors is closed, the sockets already waiting in the accept queue will be dropped

Yes, it's a problem if you use it across processes... but if you have a single long lived process with each thread listening on a separate queue, isn't that the simplest solution?

That process isn't allowed to fork, I assume, which is problematic.

I'm not proposing a fork either. Each thread would open a separate socket with the SO_REUSEPORT set. The kernel would then have a separate queue per socket.

Please correct me if I'm wrong.

I'm saying that none of those threads can start any programs. That may or not be required by your use-case, but it's certainly a disadvantage.

You probably shouldn't share epoll FDs across threads for performance reasons. A shared nothing design is likely to perform better with a simpler implementation in both the application and the kernel.

I don't see sharing FDs across threads as a useful thing to aspire to.

The common design I see these days is load balancing FDs across shared nothing threads. The thread that receives the notification via the selector is the thread that does the IO (no other thread has that FD). Keep adding threads as makes sense and never block let them block.

A combined queue makes sense when task sizes are large. For small tasks the performance is poor. I see the queuing decision as something you make after you have already retrieved the message from the network and presented it to an application layer which makes a decision on how to dispatch.

It would be cool if the kernel would do this for you under the hood! That would be amazing. Just making it correct is not enough though.

Yes, in a multithreaded scenario it's possible to think of poll() and epoll() primarily as an alternative to pthread_cond_timedwait().

These functions are not about 'event dispatching' so much as they are about putting the current thread to sleep until something of interest might have occurred.

Using poll() or epoll() means that the thread gets to wake up on file descriptor events in addition to intrathread signals, and epoll() seems to work fine in this regard.

If no data are shared, what is the point of having threads, as opposed to processes?

One example is in NGINX. They mostly follow a "no threads" pattern, but specifically for Linux, they support "some threads" to get around issues with Linux aio.


That's not "shared nothing" threads, of course, so it doesn't answer your question at a high level. It does highlight, though, that Linux is especially challenging in this space as compared to FreeBSD. And not just because of epoll().

"Message passing" might be considered a component of a "shared nothing" system. If so, a thread-based message passing implementation might still use shared process memory behind the scenes to efficiently pass messages between local threads. Worker threads might be written so that the responsibility for managing a resource external to the process is assigned to max 1 thread simultaneously, even if the message passing abstraction uses a bit of shared process memory behind the scenes.

FD =\= data

Get the data from the FD and then let the application decide the best dispatch policy.

Threads are still the best way to get asynchronous filesystem I/O in a reliable way. I know that people will reply with various explanations of how to do it for some cases some of the time; however, nobody will post a way that works as reliably as threads, because none exist.

At the same time, threads are a vile abomination and we should shun them whenever possible. Computers are terrible, aren't they?

I'm curious if you can really design a practical API that avoid all the issues the author talk about.

Even on something as simple as interrupt delivery to a single consumer, you MUST be prepared to handle merged and spurious interrupts -- I would argue that any driver that is not prepared to do so (in the general case) is buggy.

Maybe it's easier to do perfectly with an epoll/kqueue API for some reasons, but, without having tried to think much about it, I can't imagine why it should be. I have the intuition this is way harder. Actually I'm not even sure if I can have any intuition about the difficulty to achieve the behavior wanted by the author of that article, because the author did not actually specified the behavior he desires...

> The best and the only scalable approach is to use recent Kernel 4.5+ and use level-triggered events with EPOLLEXCLUSIVE flag. This will ensure only one thread is woken for an event, avoid "thundering herd" issue and scale properly across multiple CPU's

Exactly use EPOLLEXCLUSIVE for accept, that seems to work and has been in the kernel for more than a year.

For reads, how reasonable is to even share the file descriptor with multiple threads. Just hand it to a thread and let that thread (possibly bound to a CPU core for possible more performance boost) handle it from then on.

Am I crazy to be surprised by that architecture choice?

Anecdatum: I work with a guy who is pretty scary brilliant when it comes to programming; He programs on windows and his software is of a really good quality- I asked him why he used Windows and he gave two reasons:

1) Windows is mandated by HQ as the only supported desktop. You're not getting anything else to run on your desktop so you either write Windows software locally or use a VM (which is cumbersome0

2) Epoll is trash.

I have successfully convinced him to make some software on freebsd because kqueue is fundamentally better. But it's shocking that epoll is so bad even compared to windows :(

But even with IOCP, Windows IIS implements request handling as a kernel module to accelerate it, and yet still isn't faster than nginx on linux with many cpu cores.


But can be noticeably faster on fewer cores: https://www.rootusers.com/linux-vs-windows-web-server-benchm...

But still in many cases has more performance pitfalls in practice: http://serverfault.com/questions/317199/linux-vs-windows-7-w...

So, sure, epoll() does not help for optimal distribution of accept() or read() of the same fds over multiple threads. But even with a single thread accepting, or with "thundering herd" problems of other strategies, linux is still usually more efficient in practice. Often dumb non-optimal designs are faster just due to subtle problems caused by overall system complexity.

Side note: macOS has kqueue, but it's been buggy in enough releases that libevent and libev usually have to use select() on macOS to be safe. http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod#OS_X_AN... https://github.com/libevent/libevent/pull/377

So for your "optimal socket handling" tasks you're really limited to choosing between Linux and FreeBSD, and there are other issues to consider there, like drivers.

Windows has no right to trash-talk: WaitForMultipleObjects() is one of the most depressing select-like APIs out there.

WaitForMultipleObjects sucks, but that's because it got very little love since Windows 95. That's because Microsoft seems to have completely abandoned the Reactor (select/epoll wake-on-readiness) pattern in favor of the Proactor (IOCP-like wake-on-event/completion) pattern.

With the proactor pattern, you register events you care about explicitly, and your event-loop gets the next event to handle and schedules the respective callback/coroutine/future to handle that event.

With the reactor pattern you don't have explicit registration for completion status when you do an I/O operation, and you need to explicitly specify the file descriptors you want to select on in the point where you select on them.

Many people (including the author of the OP article) think that proactor is superior, and this was what the Windows guy in the parent post was referring to. In Windows you're NEVER SUPPOSED to use any version of select (including WSAAsyncSelect or WaitForMultipleObjects) if you have a large number of events. They don't scale.

Instead, Microsoft decided to put its scaling effort into the proactor approach, which they introduced in Windows NT 3.5 -almost a decade before Linux came out with epoll - and this was the right way to do async I/O on Windows ever since.

Linux, on the other hand, chose to perfect the reactor model, and with epoll it gave a very powerful select-like reactor implementation[1]. So yes, if you specifically want to program a reactor, go with Linux. Reactor-centric software and libraries (like libev or nginx) just don't scale on Windows. But that doesn't mean async I/O on Windows sucks - it's just optimized for a different model.

[1] kqueue could be better, but I don't know it well enough to make a call. And while it's I/O events are readiness based, as far as I understand it's general enough that it can be used to implement a proactor as well.

Not being familiar, it looks pretty standard. What're it's negative attributes?

Conceptually, select() can wait on anything that's a file descriptor. WaitForMultipleObjects() can only work on things that were specifically added to its capability list.

Practically, MAXIMUM_WAIT_OBJECTS is set to such a low number (64) that people resort to this kind of hack: https://msdn.microsoft.com/en-us/library/windows/desktop/ms6...

IIRC there are even advanced Win32 API that, so that you can go above MAXIMUM_WAIT_OBJECTS, ... automatically handle a thread poll!

That kind of "solution" to such a problem is way more insane than epoll() eventually providing the correct flags to fix most of its own issues, even if we had to wait during a decade for it...

So praising NT because epoll is "trash" is uninformed at best, malicious at worst.

That is actually an excellent API; you can wait for up to 64 arbitrary events in a relatively efficient manner.

You should never, ever use it for high performance I/O.

> even compared to windows

Windows is an expensive commercial OS, if anything it should be better, and it usually is better than the Linux desktop.

Power use (no rip battery), "it just works", fast and efficient in most cases.

CLI is a whole different story, but bash in ubuntu in windows is already pretty good, and there's no reason it can't be equal to bash on any other Linux distribution.

Powershell is pretty cool, but it's not as integrated into the "essence" of windows like sh is for linux. There's been a bunch of projects that make extensive use of it lately, like chocolatey - I hope it keeps getting better.

The least broken modern OS when it comes to async I/O is Windows.

I/O completion ports are the correct solution when it comes to waking up one of multiple threads to handle I/O events.

I have a question. The post states:

>"This is because "level triggered" (aka: normal) epoll inherits the "thundering herd" semantics from select(). Without special flags, in level-triggered mode, all the workers will be woken up on each and every new connection."

Isn't this behavior similar to disk I/O where the kernel wakes up(via wake_up()?) all tasks that are sleeping on a wait queue for disk I/O? I believe all processes sleeping on a disk I/O wait queue will be woken up regardless of whether their disk I/O is complete or not and will be put back to sleep in the whre it is not.

So, the kernel is free to assign a connection to a new thread even if there is a previous thread working on it? Because the kernel has no idea when the previous worker will be done processing the data. So the user has to manually unregister and register the connection each time?

How about a concept like "transaction"? Maybe such as `epoll_begin` and `epoll_end`? One thread achieves exclusive access when entering the block, and releases it when leaving.

Does this flaw also exist in IOCP or kqueue?

I think not, this[0] could be nice read.

[0] http://people.eecs.berkeley.edu/~sangjin/2012/12/21/epoll-vs...

Well, just by looking at a glance, kqueue seems to pose similar problems. I mean, the problem mentioned in the OP:

  1. Kernel receives some data, and wakes up A
  2. A reads some data
  3. Kernel receives some other data, and wakes up B (cause A didn't call `epoll_wait` yet)
  4. B reads some data, leading to a race condition (out-of-order reading).
kqueue's `kevent` seems to be more or less similar to `epoll_wait`, in that it doesn't seem to provide a way to notify the kernel the "we're done" signal. Am I missing here? Does `kevent` also signal the end of the exclusive access?

From Bryan's interview it's clear they've solved the problem like that. From article I linked, it doesn't indicate so.

Take a look here[1], at 'EVFILT_SIGNAL'. But does that mean that we have to manually attach signal to monitor, and then we receive "we're done". But that's kinda similar to what you have to do with epoll, the difference is that kqueue structure encapsulates functiona of 'epoll_wait' and 'epoll_ctl'?

[1] https://www.freebsd.org/cgi/man.cgi?query=kqueue&sektion=2

Edit: sorry for spamming with links, but I find these things really interesting, look here http://austingwalters.com/io-multiplexing/ specifically at 4 steps after kqueue code, I think that explains well. So it looks like we wait for signal 'done' after which we rewatch.

It is pthreads which are fundamentally broken indeed, not epoll.

All of the problems outlined are relevant whether you pre-fork into multiple processes or use different threads. I don't see what it has to do with pthreads.

So what's the best way to multiplex then?

(Caveat: this is a contentious opinion on Hacker News)

(Caveat #2: I'm not terribly familiar with such low level programming. Take anything I say with a grain of salt.)

It's my opinion that IO Completion Ports on Windows are superior to the approach taken by *nix and BSDs.

Instead of having the usermode application sleep and wake up, do some checks, etc., the application provides an entry point when an event occurs or data is available. Essentially, a callback for the kernel to use. The kernel then jumps directly to this, and can manage the threads involved, using a thread pool to balance requests. This gives much better utilization of threads than with poll/epoll/kqueue, but does place some other constraints on how the code is written.

The fundamental difference is that the Unix-kin is a readiness based model. They wake up a thread to tell it that it is ready to read an event. IOCP on Windows is a completion based model, and wakes up threads with the data (or error) already present in a data structure provided to the thread.

> a completion based model, and wakes up threads with the data (or error)

Which means that in this model you have to allocate and provide a buffer for that data long before the kernel is going to fill it. It's going to just sit there waiting, wasting memory. While in unix model you don't have to allocate a buffer until you know there is some data to copy from the kernel, which is easier for the user and much more efficient.

Completion model makes sense if your entire networking stack lives in userspace and you can allocate memory on the lowest layer, but pass it as a reference all the way up. Or if you at least can do syscall batching, to make operating on very small buffers efficient.

This exact same problem is also present in the concurrency model provided by Go. To read from the network you need to provide a buffer to read into, which means that a buffer has to be allocated for every goroutine (instead of just the goroutines that actually have data to read).

Hard to answer. Varnish and Nginx are similar in performance, but not similar at all in implementation. So, which one is better?

Perhaps the most comprehensive single place you can look to see different approaches and their respective strengths/weaknesses: http://www.kegel.com/c10k.html

According to the article, use FreeBSD

This is far too simple of an answer. There are many different reasons to use Linux and not FreeBSD. In this one case FreeBSD might be better than Linux (I offer no opinion), but there are probably other more substantial reasons why one can't easily switch from Linux to FreeBSD.

dup the listening socket, and use epoll on the dups in threads/processes. (At least that's what nginx does.)

Don't share the epoll fd if you're not ready to manage the things covered by it. (Because that means you just shared everything :|)

Applications are open for YC Summer 2023

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact