Back in the old single-threaded Unix days, select() was a good choice because it let you write a program handling async inputs in an easy style.
Thanks for writing it -- it's a good contribution!
Really? that's how MySQL works.
EDIT: Also IIRC when using select() writes can still block your thread...
You should only ever use select() with file descriptors in non-blocking mode. Otherwise there's an inherent race condition with any readiness-signaling mechanism along these lines. But yes, the read/write syscalls are not asynchronous, and POSIX requires select() to always signal regular files as ready for reading and writing. Basically, descriptor readiness as per select() does not mean, and has never meant, that the thread won't be temporarily descheduled and put on an IO wait queue. Its semantics are only a useful fit for things like UARTs, sockets, pipes, etc.
By the way, even with asynchronous/overlapped IO on Windows, there are cases where asynchronous reads/writes on files will effectively be synchronous, especially in the common case of file-extending writes. See https://support.microsoft.com/en-us/kb/156932 for details. The only work-around I know of is to use SetFileValidData to pre-extend the file, but that requires increased privileges and so is not often feasible (and usually not worth it compared to queuing the writes to a dedicated IO thread).
Whereas essentially all platforms (including Windows) have select() and most have something equivalent but better like epoll() or kqueue().
And for most applications the benefits of IOCP over epoll() or kqueue() are only theoretical. A call to send() or recv() isn't literally synchronous, it's buffered by the OS.
So using IOCP looks better on paper than in practice.
Between select/poll/epoll/kqueue/etc and each other that is easy. You have a few functions like "register(ctx, fd, event)" and "wait_for_events(ctx)" that map straight to each of the implementations.
To use IOCP it's not just those you have to wrap. It's also send(), sendto(), sendmsg(), write(), recv(), recvfrom(), recvmsg(), read(), connect(), accept() and others. In ways that don't make for trivial or efficient implementations. And then you have to contend with any third-party library that uses any of those internally.
There is a reason libev doesn't support IOCP and libevent only supports it by exposing a different API on Windows.
It is also telling, as to how difficult these interfaces actually are to interconvert in practice, that Mark Heily's libkqueue does not implement EVFILT_AIO, EVFILT_FS, or EVFILT_PROCDESC, and implements some of the others in only a limited fashion.
Granted, libaio only really works well on XFS, and even there it has issues.
 - https://lwn.net/Articles/251413/
 - http://man7.org/linux/man-pages/man2/eventfd.2.html
 - https://www.fsl.cs.sunysb.edu/~vass/linux-aio.txt
 - http://www.scylladb.com/2016/02/09/qualifying-filesystems/
But this proves the next point - select() is a poor abstraction. accept() is doing something more than just waiting for readability (it attempts round-robin dispatch), which is something you can't express with select().
In the article I used the accept() case to illustrate the thundering herd problem. Non-blocking connect() taking a long time makes a good case. The same experiment could be done, though, measuring write() or sendmsg() syscalls.
Second, accept() drains a queue of established connections created by the listen() call.
select() and accept() are meant for different things. select/poll/epoll/kqueue work with a list of file descriptors to detect I/O; accept() works with a single listening socket.
To be honest, I'm not sure why the author spends time beating up on select. It's a three-decade old API that isn't really used for high performance applications any more. The epoll problem seems to be resolved by EPOLLEXCLUSIVE-- I don't understand why he feels that the kernel dispatch time would still be O(num processes), when clearly the goal of this flag is to only wake one process. It's still a warty API, but certainly a usable one. Maybe kqueue is better, but this post does little to convince us of that.
Suppose that at any given time, most of your threads are running, not blocked (which is why you need this many threads). The kernel will always wake the first blocked thread in the list, so the few blocked threads will cluster towards the end of the list. And the kernel will have to walk most of the list to find a thread to unblock.
This is speculation; I haven't benchmarked anything.
You just have to use select() correctly:
1) You can raise the 1024 limit on fd_set size with "#define FD_SETSIZE 65536" (on SunOS this is required to make it use select_large_fdset() instead of select()) and by allocating the memory for the fd_set yourself.
2) Do not loop over descriptors and use FD_ISSET to check whether each file descriptor is in the set. Instead, loop over the fd_set one word at a time: if a word != 0, then go and analyze each bit of that word (see how the Linux kernel does it).
3) The other thing is to limit the number of select() calls you make per second and do short sleeps if needed. That allows events to be processed in batches, and the cost of the select() calls becomes relatively smaller compared to the "real work" done. It also increases latency, but you can work out a reasonable number of select() calls per second. I got this idea from "Efficient Network I/O Polling with Fine-Grained Interval Control" by Eiji Kawai, Youki Kadobayashi, and Suguru Yamaguchi.
I learned how to use accept() correctly from "Scalability of Linux Event-Dispatch Mechanisms" by Abhishek Chandra and David Mosberger. The main idea is to call accept() in a loop until EAGAIN or EWOULDBLOCK is returned, or until you have accepted "enough" connections.
I don't get why the author claims that epoll() fixes the problem with registering and unregistering descriptors. If you use epoll, then adding or removing a descriptor is a system call; with select() you just modify a local data structure and call select() when you're done adding and removing all the descriptors. And you shouldn't call accept() from multiple threads - a single thread calling accept() is enough for most of us, unless you're web scale ;-D
Because it is edge-triggered, not level-triggered, the application has no way to tell epoll(2) that a thread is already handling events on an fd (and thus that it shouldn't be handed to a different thread - which will lead to a race). There are ways to hack around it, but it's just a bad design.
When you use epoll in edge-triggered mode, it reports a resource as available only once: at the moment it switches from unavailable to available. If you enter epoll again, it won't return just to re-report that the resource is still available.
In level-triggered mode, whenever a polled resource is available, epoll returns immediately to report it as available.
Edge-triggered is useful to make sure you only act once on a resource becoming available (e.g. by handing it to a worker thread that will work in the background, you want to enter epoll in the main thread again to wait for other changes) or if you have to wait for something else to become available to actually be able to do something with it and don't want to waste time spinning in the waiting state (e.g. when moving data between two sockets, both have to be available, so you only want to wake up when the second one changes as well).
For example, if a socket is reported as readable but you only read some of the available data before calling epoll() again, with the edge-triggered interface it won't be reported as readable the next time you check, but with the level-triggered interface it will be.
And it seems like both are quite different from IOCP too... kqueue_qos aside, you get the readiness state(s) and then do the operation(s) that look likely to be possible. So from this viewpoint epoll/kqueue/poll/select are actually basically the same, in contrast to the IOCP approach of doing the operation and then getting a notification when it completes. kqueue/epoll vs select/poll then looks like a more efficient way of doing the same stuff (with improvements - e.g., because the OS has more information to hand in the kqueue/epoll case, it probably has more opportunity to minimize multiple wakeups, etc.).
Also, poll/epoll get even more versatile with current Linux systems, which take "everything is a file" much further with signalfd, timerfd, and eventfd.
That's why there are workarounds. For example TCP SO_REUSEPORT:
One reasonable way to design a server is to queue all requests onto a single queue, and then have multiple worker threads that take requests from the queue and process them.
When requests arrive from the outside, they are coming in multiplexed onto one data stream (assuming one network cable). It seems wasteful to have the kernel demultiplex these into separate streams (one per client), just for your application to remultiplex them when it puts them onto its queue.
If the server application is handling all traffic for the port, let it handle the demultiplexing.
TCP flow control might get a bit tricky, because I think you'd want to still handle that in the kernel.
With this kind of system your server application would have one file descriptor for reading network data, and it could have a single thread dedicated to reading that in blocking mode.
The point of the "socket" abstraction is so that the kernel can make the arrival of packets - which may appear out-of-order, or not at all - seem like a continuous stream of data to the application. Get rid of the multiplexing, and you also get rid of the socket receive buffer. The application basically needs to implement a TCP layer in userspace.
BTW, UDP sockets provide exactly the abstraction you're looking for: the sending IP address is filled in along with the data, and there's no socket to wait on.
The "non-broadcast wakeup" solution - i.e. EPOLLEXCLUSIVE - will also work just as well there.
- kqueue_qos - lets you do an atomic poll+receive of a Mach message on a Mach port
- EV_ONESHOT - once you get the event, it's gone, and you have to reactivate it. I think this is the analogue to EPOLLEXCLUSIVE
I've used kqueue but both of these are new to me, so I could be wrong. (I'm pretty sure kqueue_qos didn't exist when I was doing this OS X stuff a few years ago. And I only had one waiting thread, so if I noticed EV_ONESHOT at the time, I didn't pay much attention to it.)
(Interestingly, EPOLLEXCLUSIVE promises to wake up "at least" one thread, suggesting that it might wake up all of them. Whereas EV_ONESHOT sounds like it will only wake up one - since once retrieved by one call to kqueue, the event will be canceled, and it will never be returned by another. But the Linux man page doesn't say what "at least" actually means... and, to be honest, the OS X one is barely any clearer anyway, AS USUAL. So who knows for certain without scraping through the source.)
EPOLLEXCLUSIVE is acting at a different layer - in kevent terms it would cause only one of the registered kevents watching an object to be activated when the object is ready instead of all of the kevents watching for the same thing.
-epoll: epoll is an absolute mess. Just read the manpage. Cantrill said it, and in this case he's right. But since Cantrill is hardly an independent source, here's what the libev documentation has to say about epoll:
>The epoll mechanism deserves honorable mention as the most misdesigned of the more advanced event mechanisms: mere annoyances include silently dropping file descriptors, requiring a system call per change per file descriptor (and unnecessary guessing of parameters), problems with dup, returning before the timeout value, resulting in additional iterations (and only giving 5ms accuracy while select on the same platform gives 0.1ms) and so on. The biggest issue is fork races, however - if a program forks then both parent and child process have to recreate the epoll set, which can take considerable time (one syscall per file descriptor) and is of course hard to detect.
>Epoll is also notoriously buggy - embedding epoll fds should work, but of course doesn't, and epoll just loves to report events for totally different file descriptors (even already closed ones, so one cannot even remove them from the set) than registered in the set (especially on SMP systems).
>Epoll is truly the train wreck among event poll mechanisms, a frankenpoll, cobbled together in a hurry, no thought to design or interaction with others. Oh, the pain, will it ever stop...
-kqueue: kqueue isn't epoll, which is a massive advantage. However, once again according to libev (I do like to cite people who know what they're doing, because I don't), it's at least somewhat broken on every system other than NetBSD: it only works with sockets and pipes on FreeBSD, and on OSX it's totally broken. If this is wrong, or has been fixed, please let me know: I'd love that.
-Eventports: according to libev (once again) eventports are probably the least broken: it's slower than select and poll on small scales, but scales up well. Apparently the interface is a bit quirky, and it has some problems ("The event polling function sometimes returns events to the caller even though an error occurred, but with no indication whether it has done so or not"), but it's a heck of a lot better than everything else.
So yes, Solaris wins. Again.
Dammit, Solaris: it's getting increasingly hard to argue with your fans, because the system is actually really nice.
Where did that rubbish come from? It certainly was not the FreeBSD doco. I've been happily using kevent with regular files, directories, child processes, pseudo-terminals, and signals for years. On OpenBSD, too.
I cannot likewise personally attest to its support for process descriptors, timers, AIO requests, or BPF devices, but they're documented.
So now Solaris doesn't win, it's just that Linux totally loses...
It's possible they're incompetent, of course, but I find that unlikely.
I've never particularly liked select but it has solved many, many problems in practice.
Can someone clarify what this is trying to say? Thanks.
So that when an event happens on a socket, the kernel instantly knows which processes to wake up. A reverse lookup: given a socket, return the list of blocked processes.
I'm arguing that maintaining this reverse lookup is hard work, especially when you need to set it up and tear it down on every call to select().
If that's a large number of fds, then that's quite some overhead.
Has anyone around here ever experienced that problem of single process accept(2) throughput being insufficient? If so, what were you doing?
Divide your FDs into groups of 1024 each. Put each group in a different worker thread, each connected to the main thread with a Unix socket. Select over the group in each worker thread; when a socket wakes up, notify the main thread. Select over the worker connections in the main thread.
You laugh, but it's a documented approach on Windows.
But I don't touch the kernel, so I could be wrong there.
My bike is broken, it doesn't go 550mph!!1
Edit: adapted stupid example to be differently stupid.
It would be more like "My bike is broken, when I try to ride more than 12 blocks the wheels fall off".
Also, it's not even that select() can't handle more than 1024 fds; it's that it can't handle any file descriptor numbered 1024 or higher at all, so if your app opens a bunch of data files before accepting connections, the limit could be as low as 10 concurrent connections before select() fails.
As for ideally: using select() as a default is a good idea for as long as you can get away with it. Everything else has too many caveats, and working around them is tricky - you must relearn everything libev and nginx learned over the years if you haven't been paying close attention to the problems they encountered.
The author didn't even mention the glaring problem with select.