When you're dealing with shared resources (like a listening socket), and you are using threads, and you are trying to maximize performance with networking, confusing and painful is kind of the nature of the problem.
Source: at the time I worked on high-performance server products, had discussions with various people active on the kernel side of the problem, and encountered much head-shaking and muttering about patents when I mentioned implementing IOCP.
fwiw there was an IOCP-like capability added to AIX, again supposedly because IBM did not have fears about patent infringement (likely due to cross-licensing arrangements).
What is so disturbing is that it has been known forever that epoll should be drowned in a bathtub, but no one does it; it's just "la dee da I can't hear you! I can't hear you!" and Linux trudges on with yet another limp.
As he also says "kqueue and event ports are T-Ball, compared to ZFS and Dtrace and jails are a lot harder... if you got the little stuff wrong..."
It's just frustrating. It's broken. You know it's broken. FIX IT!
>accept() on a fd that has already been closed.
While I trust the author to do this (thankfully, as he's my coworker) there is a lot of Linux software that doesn't, even assuming it was updated in the last year and you're running something vaguely bleeding edge (not Debian).
Some people using a completely different kind of OS insist on continuing to use a version from 2009 instead of its more recent version. I guarantee you that this version from 2009 also has its share of dark APIs, some of which have been fixed in the modern version...
Yet I have yet to find articles titled as if all of them were equally broken whose content then precisely describes the caveats for doing non-broken things on the capable recent releases.
Everyone's approach towards these mechanisms is broken. Just don't treat it as a reliable notification mechanism, do your own scheduling using information from epoll/kqueue only as hints and everything will be fine.
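A rough sketch of what "only as hints" can look like for accept(), assuming Linux and a non-blocking listening socket. The helper name `accept_ready` and the `limit` parameter are made up for illustration; the point is only that "woken up but nothing to accept" is handled as a normal outcome rather than an error.

```python
import errno
import select
import socket


def accept_ready(listener, limit=64):
    """Drain up to `limit` pending connections from a non-blocking listener.

    The readiness event is treated purely as a hint: another thread may have
    taken the connection already, so finding nothing to accept is normal.
    """
    accepted = []
    for _ in range(limit):
        try:
            conn, _addr = listener.accept()
        except BlockingIOError:
            break  # spurious wakeup or queue drained: not an error
        except OSError as exc:
            if exc.errno == errno.ECONNABORTED:
                continue  # peer gave up before we got to it
            raise
        conn.setblocking(False)
        accepted.append(conn)
    return accepted
```

The caller would invoke this whenever epoll/kqueue reports the listener as readable and then decide for itself which thread services each returned connection.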
It's also hard for me to imagine a scenario where accept() takes longer than servicing a request and becomes a bottleneck. That is, why would you need multiple threads accepting on the same socket?
Second issue is cache locality. If you do accept() in one thread only, then you will need to move the newly accepted client socket to another worker thread. Depending on details this might not be efficient - aRFS comes to mind. (But frankly, epoll alone won't help here; you need SO_REUSEPORT with SO_INCOMING_CPU.)
Even if you don't agree that scaling out accept is a real concern - that's missing the point. The point is: the epoll() model should take this into account and at least support this problem. Or loudly say that scaling out accept() with epoll is not possible. But neither happened. Up until kernel 4.5 it was impossible to do it correctly, and that was undocumented; from 4.5 you can use the EPOLLEXCLUSIVE flag, which I feel is a hack.
Your writing will be more convincing if you profile.
When you deal with multiple threads, it is not only epoll that causes problems, but also global variables, memory, etc. Global variables are handled with a mutex; the epoll problem is handled by not calling epoll_wait on the same fds from multiple threads.
I wonder if we could put a sane fix into libc to fix these problems.
> Waking up "Thread B" was completely unnecessary and wastes precious resources. Epoll in level-triggered mode scales out poorly.
In the analysed situation thread B was already in a wait state and there aren't enough incoming connections to immediately accept the next one. Of course the resources (CPU time) were wasted, but does that impact the performance in any way? (Assuming one "main" application on that host)
In addition, because a single "readiness event" has to wake up many threads, it can introduce lock contention and cacheline bouncing in the kernel's data structures.
> One option is to use SO_REUSEPORT and create multiple listen sockets sharing the same port number. This approach has problems though - when one of the file descriptors is closed, the sockets already waiting in the accept queue will be dropped
Yes, it's a problem if you use it across processes... but if you have a single long lived process with each thread listening on a separate queue, isn't that the simplest solution?
Please correct me if I'm wrong.
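For what it's worth, the "one listening socket per thread" setup can be sketched like this. It assumes Linux 3.9+ (or another platform that exposes SO_REUSEPORT); the function name `make_reuseport_listeners` is invented for the example.

```python
import socket


def make_reuseport_listeners(host, port, count):
    """Create `count` listening sockets that all share one address and port.

    Every socket must set SO_REUSEPORT before bind() for this to work.
    """
    listeners = []
    for _ in range(count):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind((host, port))
        if port == 0:
            port = s.getsockname()[1]  # reuse the kernel-chosen port
        s.listen(128)
        listeners.append(s)
    return listeners
```

Each thread would then keep exactly one of these sockets for the lifetime of the process and run its own accept loop on it, so the kernel spreads incoming connections across the per-socket queues.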
I don't see sharing FDs across threads as a useful thing to aspire to.
The common design I see these days is load balancing FDs across shared-nothing threads. The thread that receives the notification via the selector is the thread that does the IO (no other thread has that FD). Keep adding threads as makes sense and never let them block.
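A minimal sketch of that design, assuming Linux epoll and a trivial echo workload. The `Worker` class, `make_dispatcher`, and the round-robin policy are all illustrative choices, not anything prescribed above; each worker owns a private epoll instance and is the only thread that ever does IO on the connections it adopts.

```python
import itertools
import select
import socket
import threading


class Worker(threading.Thread):
    """Owns a private epoll instance and every connection handed to it."""

    def __init__(self):
        super().__init__(daemon=True)
        self.ep = select.epoll()
        self.conns = {}  # fd -> socket
        self.stopping = threading.Event()

    def adopt(self, conn):
        # Called once by the dispatcher; afterwards only this worker uses conn.
        conn.setblocking(False)
        self.conns[conn.fileno()] = conn
        self.ep.register(conn.fileno(), select.EPOLLIN)

    def run(self):
        while not self.stopping.is_set():
            for fd, _events in self.ep.poll(0.1):
                conn = self.conns[fd]
                try:
                    data = conn.recv(4096)
                except BlockingIOError:
                    continue  # readiness was only a hint
                except OSError:
                    data = b""  # treat a reset like a close
                if data:
                    # Echo back; real code would buffer partial writes.
                    conn.sendall(data)
                else:
                    self.ep.unregister(fd)
                    del self.conns[fd]
                    conn.close()


def make_dispatcher(workers):
    """Return a function that hands each new connection to the next worker."""
    ring = itertools.cycle(workers)

    def dispatch(conn):
        worker = next(ring)
        worker.adopt(conn)
        return worker

    return dispatch
```

A single accepting thread would call `dispatch(conn)` for each accepted connection and never touch it again.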
A combined queue makes sense when task sizes are large. For small tasks the performance is poor. I see the queuing decision as something you make after you have already retrieved the message from the network and presented it to an application layer which makes a decision on how to dispatch.
It would be cool if the kernel would do this for you under the hood! That would be amazing. Just making it correct is not enough though.
These functions are not about 'event dispatching' so much as they are about putting the current thread to sleep until something of interest might have occurred.
Using poll() or epoll() means that the thread gets to wake up on file descriptor events in addition to intrathread signals, and epoll() seems to work fine in this regard.
That's not "shared nothing" threads, of course, so it doesn't answer your question at a high level. It does highlight, though, that Linux is especially challenging in this space as compared to FreeBSD. And not just because of epoll().
Get the data from the FD and then let the application decide the best dispatch policy.
At the same time, threads are a vile abomination and we should shun them whenever possible. Computers are terrible, aren't they?
Even on something as simple as interrupt delivery to a single consumer, you MUST be prepared to handle merged and spurious interrupts -- I would argue that any driver that is not prepared to do so (in the general case) is buggy.
Maybe it's easier to do perfectly with an epoll/kqueue API for some reason, but, without having tried to think much about it, I can't imagine why it should be. I have the intuition this is way harder. Actually I'm not even sure if I can have any intuition about the difficulty of achieving the behavior wanted by the author of that article, because the author did not actually specify the behavior he desires...
Exactly, use EPOLLEXCLUSIVE for accept; that seems to work and has been in the kernel for more than a year.
For reads, how reasonable is it to even share the file descriptor with multiple threads? Just hand it to a thread and let that thread (possibly bound to a CPU core for a possible further performance boost) handle it from then on.
Am I crazy to be surprised by that architecture choice?
1) Windows is mandated by HQ as the only supported desktop. You're not getting anything else to run on your desktop, so you either write Windows software locally or use a VM (which is cumbersome).
2) Epoll is trash.
I have successfully convinced him to make some software on FreeBSD because kqueue is fundamentally better. But it's shocking that epoll is so bad even compared to Windows :(
But can be noticeably faster on fewer cores:
But still in many cases has more performance pitfalls in practice:
So, sure, epoll() does not help for optimal distribution of accept() or read() of the same fds over multiple threads. But even with a single thread accepting, or with "thundering herd" problems of other strategies, Linux is still usually more efficient in practice. Often dumb non-optimal designs are faster just due to subtle problems caused by overall system complexity.
Side note: macOS has kqueue, but it's been buggy in enough releases that libevent and libev usually have to use select() on macOS to be safe.
So for your "optimal socket handling" tasks you're really limited to choosing between Linux and FreeBSD, and there are other issues to consider there, like drivers.
With the proactor pattern, you register events you care about explicitly, and your event-loop gets the next event to handle and schedules the respective callback/coroutine/future to handle that event.
With the reactor pattern you don't have explicit registration for completion status when you do an I/O operation, and you need to explicitly specify the file descriptors you want to select on in the point where you select on them.
Many people (including the author of the OP article) think that proactor is superior, and this was what the Windows guy in the parent post was referring to. In Windows you're NEVER SUPPOSED to use any version of select (including WSAAsyncSelect or WaitForMultipleObjects) if you have a large number of events. They don't scale.
Instead, Microsoft decided to put its scaling effort into the proactor approach, which they introduced in Windows NT 3.5 - almost a decade before Linux came out with epoll - and this has been the right way to do async I/O on Windows ever since.
Linux, on the other hand, chose to perfect the reactor model, and with epoll it gave a very powerful select-like reactor implementation.
So yes, if you specifically want to program a reactor, go with Linux. Reactor-centric software and libraries (like libev or nginx) just don't scale on Windows. But that doesn't mean async I/O on Windows sucks - it's just optimized for a different model.
kqueue could be better, but I don't know it well enough to make a call. And while its I/O events are readiness-based, as far as I understand it's general enough that it can be used to implement a proactor as well.
Practically, MAXIMUM_WAIT_OBJECTS is set to such a low number (64) that people resort to this kind of hack: https://msdn.microsoft.com/en-us/library/windows/desktop/ms6...
That kind of "solution" to such a problem is way more insane than epoll() eventually providing the correct flags to fix most of its own issues, even if we had to wait a decade for it...
So praising NT because epoll is "trash" is uninformed at best, malicious at worst.
You should never, ever use it for high performance I/O.
Windows is an expensive commercial OS, if anything it should be better, and it usually is better than the Linux desktop.
Power use (no rip battery), "it just works", fast and efficient in most cases.
CLI is a whole different story, but bash in ubuntu in windows is already pretty good, and there's no reason it can't be equal to bash on any other Linux distribution.
I/O completion ports are the correct solution when it comes to waking up one of multiple threads to handle I/O events.
>"This is because "level triggered" (aka: normal) epoll inherits the "thundering herd" semantics from select(). Without special flags, in level-triggered mode, all the workers will be woken up on each and every new connection."
Isn't this behavior similar to disk I/O, where the kernel wakes up (via wake_up()?) all tasks that are sleeping on a wait queue for disk I/O? I believe all processes sleeping on a disk I/O wait queue will be woken up regardless of whether their disk I/O is complete or not, and will be put back to sleep in the case where it is not.
How about a concept like "transaction"? Maybe such as `epoll_begin` and `epoll_end`? One thread achieves exclusive access when entering the block, and releases it when leaving.
Does this flaw also exist in IOCP or kqueue?
1. Kernel receives some data, and wakes up A
2. A reads some data
3. Kernel receives some other data, and wakes up B (cause A didn't call `epoll_wait` yet)
4. B reads some data, leading to a race condition (out-of-order reading).
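One commonly cited way to avoid the interleaving in steps 1-4 on Linux is EPOLLONESHOT, which this comment does not mention, so take the following as an assumption about how one might address it. After an event is delivered, the fd stays disabled until the thread that handled it re-arms it, so thread B cannot be woken for the same socket while A is still reading. The helper names are hypothetical.

```python
import select
import socket


def register_oneshot(ep: select.epoll, sock: socket.socket) -> None:
    """Watch `sock` so that at most one event is delivered until re-armed."""
    ep.register(sock.fileno(), select.EPOLLIN | select.EPOLLONESHOT)


def read_then_rearm(ep: select.epoll, sock: socket.socket, bufsize: int = 4096):
    """Drain a non-blocking `sock`, then re-enable it. Returns (data, eof)."""
    chunks = []
    eof = False
    while True:
        try:
            data = sock.recv(bufsize)
        except BlockingIOError:
            break  # nothing left to read for now
        if not data:
            eof = True  # peer closed; the caller should close the socket
            break
        chunks.append(data)
    if not eof:
        # Only now may another thread be woken up for this socket.
        ep.modify(sock.fileno(), select.EPOLLIN | select.EPOLLONESHOT)
    return b"".join(chunks), eof
```

The cost is one extra epoll_ctl call per event, which is part of why people complain about the approach.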
Take a look here, at 'EVFILT_SIGNAL'. But does that mean that we have to manually attach a signal to monitor, and then we receive "we're done"? That's kind of similar to what you have to do with epoll; is the difference that the kqueue structure encapsulates the functionality of 'epoll_wait' and 'epoll_ctl'?
Edit: sorry for spamming with links, but I find these things really interesting, look here http://austingwalters.com/io-multiplexing/ specifically at 4 steps after kqueue code, I think that explains well. So it looks like we wait for signal 'done' after which we rewatch.
(Caveat #2: I'm not terribly familiar with such low level programming. Take anything I say with a grain of salt.)
It's my opinion that IO Completion Ports on Windows are superior to the approach taken by *nix and BSDs.
Instead of having the usermode application sleep and wake up, do some checks, etc., the application provides an entry point when an event occurs or data is available. Essentially, a callback for the kernel to use. The kernel then jumps directly to this, and can manage the threads involved, using a thread pool to balance requests. This gives much better utilization of threads than with poll/epoll/kqueue, but does place some other constraints on how the code is written.
The fundamental difference is that the Unix-kin is a readiness based model. They wake up a thread to tell it that it is ready to read an event. IOCP on Windows is a completion based model, and wakes up threads with the data (or error) already present in a data structure provided to the thread.
Which means that in this model you have to allocate and provide a buffer for that data long before the kernel is going to fill it. It's going to just sit there waiting, wasting memory. In the Unix model, by contrast, you don't have to allocate a buffer until you know there is some data to copy from the kernel, which is easier for the user and much more efficient.
The completion model makes sense if your entire networking stack lives in userspace and you can allocate memory on the lowest layer, but pass it as a reference all the way up. Or if you at least can do syscall batching, to make operating on very small buffers efficient.
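To put rough numbers on the buffer argument, here is a back-of-the-envelope calculation. The connection count, thread count and buffer size are assumed figures picked for illustration, not measurements.

```python
def pinned_buffer_bytes(pending_reads, buffer_size):
    """Bytes tied up in buffers handed over before any data has arrived."""
    return pending_reads * buffer_size


# Assumed figures, for illustration only.
IDLE_CONNECTIONS = 10_000
WORKER_THREADS = 8
BUFFER_SIZE = 16 * 1024

# Completion style: every idle connection has a read posted with its own buffer.
completion_bytes = pinned_buffer_bytes(IDLE_CONNECTIONS, BUFFER_SIZE)

# Readiness style: a buffer is only needed while a worker is actually reading.
readiness_bytes = pinned_buffer_bytes(WORKER_THREADS, BUFFER_SIZE)

print(completion_bytes // 2**20, "MiB vs", readiness_bytes // 2**10, "KiB")
```

With these assumptions that is about 156 MiB sitting idle versus 128 KiB in use.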
Perhaps the most comprehensive single place you can look to see different approaches and their respective strengths/weaknesses: http://www.kegel.com/c10k.html
Don't share the epoll fd if you're not ready to manage the things covered by it. (Because that means you just shared everything :|)