
Select(2) is fundamentally broken - panic
https://idea.popcount.org/2017-01-06-select-is-fundamentally-broken/
======
cjensen
The phrase "fundamentally broken" is overblown when referring to a syscall
that works fine for most applications. "Doesn't scale" may not sound as
glamorous in the title, but it's a lot more accurate.

Back in the old single-threading Unix days, select() was a good choice because
it allows you to write a program handling async inputs in an easy style.

~~~
duskwuff
And select() is still a perfectly good option for situations where the number
of FDs involved is constant, or at least stays small. For instance: for
networking in an X11 application, using a select()-based event loop is much
simpler than creating a thread for each socket.
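A minimal sketch of such a single-threaded loop, in Python for brevity (the `select` module wraps the syscall; the echo behavior and the iteration cap are illustrative, not part of any particular application):

```python
import select
import socket

def echo_loop(server, max_iterations=100):
    """Single-threaded select() loop: accept clients and echo data back."""
    fds = [server]
    for _ in range(max_iterations):
        ready, _, _ = select.select(fds, [], [], 0.5)
        for sock in ready:
            if sock is server:                 # new connection pending
                conn, _addr = server.accept()
                fds.append(conn)
            else:                              # data (or EOF) on a client
                data = sock.recv(4096)
                if data:
                    sock.sendall(data)         # echo it back
                else:
                    fds.remove(sock)
                    sock.close()
        if len(fds) == 1 and not ready:        # idle with no clients left
            break
```

One loop, one thread, any number of sockets: the X11 connection would simply be one more fd in `fds`.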

~~~
jwatte
You should never, ever, create one thread per socket. The choice is between
select, polled IO, or events. Sadly, UNIX doesn't have a good unified
asynchronous IO model, like OVERLAPPED and IOCP on NT.

~~~
snnn
> You should never, ever, create one thread per socket

Really? That's how MySQL works.

------
YZF
It's also worth noting that the Windows IOCP model unifies network and other
I/O. This allows something like a web server to do asynchronous reads and
writes on files without needing more threads, while still servicing the
network. It would be nice to get similar capabilities in Linux.

EDIT: Also IIRC when using select() writes can still block your thread...

~~~
zrm
The difficult thing with IOCP is that it requires architectural changes to the
application and there are platforms that have no reasonable equivalent to it.
So portable applications that use IOCP semantics are hard to support on those
platforms.

Whereas essentially all platforms (including Windows) have select() and most
have something equivalent but better like epoll() or kqueue().

And for most applications the benefits of IOCP over epoll() or kqueue() are
only theoretical. A call to send() or recv() isn't literally synchronous, it's
buffered by the OS.

So using IOCP looks better on paper than in practice.

~~~
TwoBit
For many use cases, either of the two can be written as an emulation over the
other.

~~~
zrm
> For many use cases, either of the two can be written as an emulation over
> the other.

Between select/poll/epoll/kqueue/etc and each other that is easy. You have a
few functions like "register(ctx, fd, event)" and "wait_for_events(ctx)" that
map straight to each of the implementations.
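As a sketch, the whole readiness family fits behind two tiny entry points (the names come from the comment above and are illustrative; select() is used as the backing implementation here, but an epoll or kqueue backend would expose the same two calls):

```python
import select

class SelectBackend:
    """Hypothetical portable readiness API: register() + wait_for_events(),
    backed by select(). Swapping in epoll or kqueue changes only the
    internals, not the two entry points."""

    def __init__(self):
        self.readers = set()
        self.writers = set()

    def register(self, fd, event):
        # Remember which fds we care about, and for which direction.
        (self.readers if event == "read" else self.writers).add(fd)

    def wait_for_events(self, timeout=None):
        # Map straight down to the underlying mechanism.
        r, w, _ = select.select(self.readers, self.writers, [], timeout)
        return [(fd, "read") for fd in r] + [(fd, "write") for fd in w]
```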

To use IOCP it's not just those you have to wrap. It's also send(), sendto(),
sendmsg(), write(), recv(), recvfrom(), recvmsg(), read(), connect(), accept()
and others. In ways that don't make for trivial or efficient implementations.
And then contend with any third party library that uses any of those
internally.

There is a reason libev doesn't support IOCP and libevent only supports it by
exposing a different API on Windows.

~~~
JdeBP
The devil is in the details. Consider Mark Heily's libkqueue library for Linux
that emulates kqueue/kevent. Because of the way that signalfd, upon which it
is built, works, code that uses kevent with signals on the BSDs has to be
written one way and on Linux has to be written another. It's a subtle
difference in the manual page; and it's all too easy to read the wrong
implementation's manual page on the WWW (if one makes the mistake of using a
search engine to find manual pages) and write things incorrectly.

It is also telling, with regard to how difficult interconversion actually is
in practice, that Heily's libkqueue does not implement EVFILT_AIO, EVFILT_FS,
or EVFILT_PROCDESC, and implements some of the other filters in only a
limited fashion.

------
alexginzburg
It seems that the author is slightly confused. The description of the
thundering herd is mostly correct. Initially it referred to multiple
processes calling accept() on a single listening socket, which used to cause
all of the processes to wake up. That was fixed a long time ago. Currently,
multiple processes blocking on accept() on a single listening socket are
woken in a round-robin fashion (AFAIK). Calling select() from multiple
processes on a single fd should wake all of the processes when IO comes in.
This is documented behavior.

\--Edit--

s/write/IO/

~~~
majke
You are correct. Directly blocking on accept() in multiple processes does not
have the "thundering herd" behaviour. This is good to know.

But this proves the next point - select is a poor abstraction. It means that
accept() is doing something more than just waiting for readability (it
attempts round robin) - a thing you can't express with select().

In the article I used the accept() case to illustrate the thundering herd
problem. Non-blocking connect() taking a long time makes a good case. The
same experiment could be done, though, measuring write() or sendmsg()
syscalls.

~~~
alexginzburg
First, the article starts by talking about accept() and the thundering herd,
but the example shows use of select().

Second, accept() goes over a queue of established connections created by a
listen() call.

select() and accept() are meant for different things. select/poll/epoll/kqueue
work with a list of file descriptors to detect I/O; accept() works with a
single socket.

------
colin_mccabe
The author criticizes select(2) because it requires the kernel to iterate over
a big data structure in memory, which has to be passed to the kernel every
time select is invoked. He criticizes epoll(7) because it didn't work well
when multiple processes try to "split the work" of handling a big batch of
file descriptors.

To be honest, I'm not sure why the author spends time beating up on select.
It's a three-decade-old API that isn't really used for high-performance
applications any more. The epoll problem seems to be resolved by
EPOLLEXCLUSIVE-- I don't understand why he feels that the kernel dispatch time
would still be O(num processes), when clearly the goal of this flag is to only
wake one process. It's still a warty API, but certainly a usable one. Maybe
kqueue is better, but this post does little to convince us of that.

~~~
danarmak
The kernel walks the list of threads until it finds one that's actually
blocked in epoll_wait().

Suppose that at any given time, most of your threads are running, not blocked
(which is why you need this many threads). The kernel will always wake the
first blocked thread in the list, so the few blocked threads will cluster
towards the end of the list. And the kernel will have to walk most of the list
to find a thread to unblock.

This is speculation; I haven't benchmarked anything.

------
aivarsk
I did some tests years ago and a single-threaded event loop was able to handle
more than 10,000 short-lived connections per second sending and receiving ISO
8583 bitmaps (38,000 with 2 threads if I remember correctly). And the
bottleneck was not in networking APIs or code but in middleware (Oracle
Tuxedo) calls that were done in the same thread. I think you can do even more
with modern hardware.

You just have to use select() correctly:

1) You can raise the 1024 limit on fd_set size with "#define FD_SETSIZE 65536"
(required on SunOS to use select_large_fdset() instead of select()) and by
allocating memory for the fd_set yourself.

2) Do not loop over descriptors and use FD_ISSET to check if file descriptor
is in set. Instead loop over fd_set one word at a time: if word != 0 then go
and analyze each bit of that word (see how Linux kernel does it).

3) The other thing is to limit the number of select() calls you make per
second and do short sleeps if needed. That allows events to be processed in
batches, and the cost of the select() calls gets relatively smaller compared
to the "real work" done. It also increases latency, but you can work out a
reasonable number of select() calls per second. I got this idea from
"Efficient Network I/O Polling with Fine-Grained Interval Control" by Eiji
Kawai, Youki Kadobayashi, Suguru Yamaguchi.
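Tip 2 can be illustrated with a Python int standing in for the fd_set bitmap (in C you would walk the fd_set's array of longs the same way; the function name and word size here are illustrative):

```python
# Word-at-a-time scan of a readiness bitmap: skip whole zero words and
# only decode bits inside non-zero words, instead of testing every fd
# with FD_ISSET.
WORD_BITS = 64

def ready_fds(bitmap, max_fd):
    fds = []
    nwords = (max_fd + WORD_BITS) // WORD_BITS
    for w in range(nwords):
        word = (bitmap >> (w * WORD_BITS)) & ((1 << WORD_BITS) - 1)
        while word:                          # only non-zero words get here
            bit = (word & -word).bit_length() - 1   # lowest set bit
            fds.append(w * WORD_BITS + bit)
            word &= word - 1                 # clear that bit and continue
    return fds
```

With a mostly-empty set, the loop touches one word per 64 descriptors instead of one test per descriptor.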

I learned how to use accept() correctly from "Scalability of Linux Event-
Dispatch Mechanisms" by Abhishek Chandra and David Mosberger. The main idea
is to call accept() in a loop until EAGAIN or EWOULDBLOCK is returned or you
have accepted "enough" connections.
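The accept-until-EWOULDBLOCK idea looks roughly like this (Python sketch; the function name and batch limit are illustrative, and the listening socket must be non-blocking):

```python
import socket

def accept_batch(server, limit=64):
    """Drain the accept queue: call accept() until it would block,
    or until `limit` connections have been taken in this batch."""
    conns = []
    while len(conns) < limit:
        try:
            conn, _addr = server.accept()
        except BlockingIOError:   # EAGAIN / EWOULDBLOCK: queue is drained
            break
        conns.append(conn)
    return conns
```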

I don't get why the author claims that epoll() fixes the problem with
registering and unregistering descriptors. If you use epoll, then adding or
removing a descriptor is a system call, but in the case of select() you just
modify a local data structure and call select() when you're done adding and
removing all of the descriptors. And you shouldn't call accept() from
multiple threads; a single thread calling accept() is enough for most of us
unless you're web scale ;-D

~~~
ejanus
Great! Do you have a simple code base that illustrates the above?

------
cyphar
Bryan Cantrill has given many rants about epoll(2) as well. In particular, one
of the biggest issues with epoll(2) compared to kqueue (BSD) or eventports
(Solaris) is that epoll(2) is effectively useless when it comes to multi-
threaded processes in the "worker pool" model.

Because it is edge triggered not level triggered, the application has no way
to tell epoll(2) that a thread is already handling events on a fd (and thus
shouldn't hand it to a different thread -- which will lead to a race). There
are ways to hack around it, but it's just a bad design.

~~~
jzwinck
Epoll is level triggered by default. Edge triggered is an option called
EPOLLET.

~~~
bogomipz
I know edge triggered vs level triggered in the context of electronics but can
you elaborate on their meaning in a software context?

~~~
jnbiche
It's the same. If you imagine an async signal (notification), for example for
reading from a file handle, as an electronic signal, the two are exactly the
same.
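The difference can be seen directly with epoll (Linux-only sketch; `select.epoll` is Python's wrapper). With data left unread, a level-triggered watch keeps reporting the fd on every wait, while an EPOLLET watch reports only the transition to readable:

```python
import os
import select

r, w = os.pipe()
os.set_blocking(r, False)
os.write(w, b"x")              # the read end is now readable

level = select.epoll()
level.register(r, select.EPOLLIN)
# Level-triggered: while the data stays unread, every poll reports it.
assert len(level.poll(0)) == 1
assert len(level.poll(0)) == 1

edge = select.epoll()
edge.register(r, select.EPOLLIN | select.EPOLLET)
# Edge-triggered: reported once (registering an already-readable fd
# counts as an edge), then not again until new data arrives.
assert len(edge.poll(0)) == 1
assert len(edge.poll(0)) == 0
```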

~~~
bogomipz
Thanks for the explanations, good to know.

------
putzdown
"Broken" is the wrong word. The argument is actually "it's slow." Broken
would imply logically flawed, impossible to make work consistently, something
along those lines. Changing the title to "Select(2) is fundamentally slow"
would be more accurate (but less interesting).

------
tedunangst
As one bandaid, one can accept() in one process and pass the fd to a worker.

------
JoshTriplett
The article's conclusion about select() seems correct; however, after
exploring select() performance in general, it then mentions epoll's exclusive
flag in passing, and then dismisses it without actually doing any performance
analysis.

~~~
to3m
I'll be interested to hear why epoll and kqueue are so very different. Strikes
me that both are quite similar: you attach waitobject-specific events to fds,
and then wait on the waitobject until one of the events occurs. Much like one
another, and not much like select/poll!

And seems like both are quite different from IOCP too... kqueue_qos aside, you
get the readiness state(s) and then do the operation(s) that look likely to be
possible. So from this viewpoint epoll/kqueue/poll/select are actually
basically the same - in contrast to the IOCP approach of doing the operation
and then getting a notification when it completes. kqueue/epoll vs select/poll
then looks like a more efficient way of doing the same stuff (with
improvements - e.g., because the OS has more information to hand in the
kqueue/epoll case it probably has more opportunity to minimize multiple
wakeups, etc.).

~~~
xenadu02
IOCP is the sensible way to do it. Unix, for a system where supposedly
everything is a file, has a lot of very specific behaviors for "special" kinds
of files. In contrast IOCP lets you treat all async operations from timers to
sockets to disk the same way and the thread pool does a decent job of scaling
too.

~~~
JoshTriplett
I've used both programming models, and I find "can I read/write this fd" much
more intuitive and versatile than "try to do it and let me know when actually
done". In particular, you can use many different models to distribute and
parallelize work with poll/epoll.

Also, poll/epoll get even more versatile with current Linux systems, which
take "everything is a file" much further with signalfd, timerfd, and eventfd.

------
kazinator
I don't think it's a useful idea to use a thread or process pool to accept on
the same passive socket. It's perfectly fine to have one thread doing this and
handing the connected socket off to a pool for processing. If your bottleneck
is in the accept loop, what that proves is that you're not writing a real
service application, but a benchmark contrived to produce such a bottleneck.

~~~
majke
Depending on your server workload, sometimes accept() is the bottleneck.

That's why there are workarounds. For example TCP SO_REUSEPORT:
[https://lwn.net/Articles/542629/](https://lwn.net/Articles/542629/)

------
tzs
I'd like to try an interface that provides all traffic on a given port,
regardless of the IP address on the other end, via a single file descriptor.

One reasonable way to design a server is to queue all requests onto a single
queue, and then have multiple worker threads that take requests from the queue
and process them.

When requests arrive from the outside, they are coming in multiplexed onto one
data stream (assuming one network cable). It seems wasteful to have the kernel
demultiplex these into separate streams (one per client), just for your
application to remultiplex them when it puts them onto its queue.

If the server application is handling all traffic for the port, let it handle
the demultiplexing.

TCP flow control might get a bit tricky, because I think you'd want to still
handle that in the kernel.

With this kind of system your server application would have one file
descriptor for reading network data, and it could have a single thread
dedicated to reading that in blocking mode.

~~~
nostrademons
How would you handle requests that span more than one packet?

The point of the "socket" abstraction is so that the kernel can make the
arrival of packets - which may appear out-of-order, or not at all - seem like
a continuous stream of data to the application. Get rid of the multiplexing,
and you also get rid of the socket receive buffer. The application basically
needs to implement a TCP layer in userspace.

BTW, UDP sockets provide exactly the abstraction you're looking for: the
sending IP address is filled in along with the data, and there's no
per-client socket to wait on.

------
caf
The thundering herd issue applies just as much in the kqueue case, if you set
it up with N processes all interested in an event on the same socket. Waking
them all up is _still_ going to be an O(N) operation.

The "non-broadcast wakeup" solution - ie EPOLLEXCLUSIVE - will also work just
as well there.

~~~
to3m
Modern OS Xs appear to have two extra things to help with this:

\- kqueue_qos - lets you do an atomic poll+receive of a Mach message on a Mach
port

\- EV_ONESHOT - once you get the event, it's gone, and you have to reactivate
it. I think this is the analogue to EPOLLEXCLUSIVE

I've used kqueue but both of these are new to me, so I could be wrong. (I'm
pretty sure kqueue_qos didn't exist when I was doing this OS X stuff about a
year ago. And I only had one waiting thread, so if I noticed EV_ONESHOT at
the time, I didn't pay much attention to it.)

(Interestingly, EPOLLEXCLUSIVE promises to wake up "at least" one thread,
suggesting that it might wake up all of them. Whereas EV_ONESHOT sounds like
it will only wake up one - since once retrieved by one call to kqueue, the
event will be canceled, and it will never be returned by another. But the
Linux man page doesn't say what "at least" actually means... and, to be
honest, the OS X one is barely any clearer anyway, AS USUAL. So who knows for
certain without scraping through the source.)

~~~
daniel02216
EV_ONESHOT refers to the individual kevent registration, and is basically an
automatic EV_DELETE on delivery of the event.

EPOLLEXCLUSIVE is acting at a different layer - in kevent terms it would cause
only one of the registered kevents watching an object to be activated when the
object is ready instead of all of the kevents watching for the same thing.

~~~
caf
And there's an epoll() equivalent of EV_ONESHOT: EPOLLONESHOT.
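A quick demonstration of the one-shot semantics (Linux-only sketch using Python's epoll wrapper): after one delivery the registration is disarmed, and it stays silent until re-armed with a modify call.

```python
import os
import select

r, w = os.pipe()
os.write(w, b"x")

ep = select.epoll()
ep.register(r, select.EPOLLIN | select.EPOLLONESHOT)
assert len(ep.poll(0)) == 1     # delivered once...
assert len(ep.poll(0)) == 0     # ...then disarmed, even with data unread

# Re-arming requires an explicit epoll_ctl(EPOLL_CTL_MOD).
ep.modify(r, select.EPOLLIN | select.EPOLLONESHOT)
assert len(ep.poll(0)) == 1
```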

------
qwertyuiop924
The problem is that all of select(2)'s competitors have a lot of problems as
well. Because, as will be repeated below, I have no idea what I'm talking
about, I will be citing a relatively independent source, namely, the libev
documentation.

-epoll: epoll is an absolute mess. Just read the manpage. Cantrill said it, and in this case he's right. But since Cantrill is hardly an independent source, here's what the libev documentation has to say about epoll:

>The epoll mechanism deserves honorable mention as the most misdesigned of the
more advanced event mechanisms: mere annoyances include silently dropping file
descriptors, requiring a system call per change per file descriptor (and
unnecessary guessing of parameters), problems with dup, returning before the
timeout value, resulting in additional iterations (and only giving 5ms
accuracy while select on the same platform gives 0.1ms) and so on. The biggest
issue is fork races, however - if a program forks then both parent and child
process have to recreate the epoll set, which can take considerable time (one
syscall per file descriptor) and is of course hard to detect.

>Epoll is also notoriously buggy - embedding epoll fds should work, but of
course doesn't, and epoll just loves to report events for totally different
file descriptors (even already closed ones, so one cannot even remove them
from the set) than registered in the set (especially on SMP systems).

[...]

>Epoll is truly the train wreck among event poll mechanisms, a frankenpoll,
cobbled together in a hurry, no thought to design or interaction with others.
Oh, the pain, will it ever stop...

-kqueue: kqueue isn't epoll, which is a massive advantage. However, once again according to libev (I do like to cite people who know what they're doing, because I don't), it's at least somewhat broken on every system other than NetBSD: it only works with sockets and pipes on FreeBSD, and on OS X it's totally broken. If this is wrong, or has been fixed, please let me know: I'd love that.

-Eventports: according to libev (once again) eventports are probably the least broken: it's slower than select and poll on small scales, but scales up well. Apparently the interface is a bit quirky, and it has some problems ("The event polling function sometimes returns events to the caller even though an error occurred, but with no indication whether it has done so or not"), but it's a heck of a lot better than everything else.

So yes, Solaris wins. Again.

Dammit, Solaris: it's getting increasingly hard to argue with your fans,
because the system is actually really nice.

~~~
JdeBP
> _it only works with sockets and pipes on FreeBSD_

Where did that rubbish come from? It certainly was not the FreeBSD doco. I've
been happily using kevent with regular files, directories, child processes,
pseudo-terminals, and signals for years. On OpenBSD, too.

I cannot likewise personally attest to its support for process descriptors,
timers, AIO requests, or BPF devices, but they're documented.

~~~
qwertyuiop924
Oh. Like I said, that's only what I've heard. It's good to know that's working
now.

So now Solaris doesn't win, it's just that Linux totally loses...

~~~
JdeBP
I am not aware of it ever _not_ working, since its creation.

~~~
qwertyuiop924
I _did_ point to my source: the documentation for libev, a widely used event
library that abstracts over such OS-specific mechanisms. While it may be
outdated, I'm inclined to trust that it was true at some point, as it's more
or less the libev maintainers' job to work with these interfaces day in and
day out.

It's possible they're incompetent, of course, but I find that unlikely.

------
vbezhenar
If the kernel uses some kind of callbacks internally, why isn't this callback
API directly presented to userspace? A callback API is easier to use than
select. I remember emulating a callback API on top of select, and if the
kernel is emulating select on top of callbacks, it seems pretty weird.

------
luckydude
Just curious, are there any real world examples where select is a bottleneck?
I guess if I had a threaded web server or something but is there such a thing?

I've never particularly liked select but it has solved many, many problems in
practice.

~~~
exDM69
Yes, it quickly becomes the bottleneck with a large number of sockets. Web
servers like nginx use epoll/kqueue and other techniques described in the
article.

------
ensiferum
"It is heavyweight. It requires constantly registering and unregistering
processes from the file descriptors, potentially thousands of descriptors,
thousands times per second."

Can someone clarify what this is trying to say? Thanks.

~~~
majke
Imagine that the kernel has a data structure for each socket. When a process
blocks on that socket, the kernel must "register" the process on it.

That way, when an event happens on a socket, the kernel instantly knows which
processes to wake up. A reverse lookup: given a socket, return the list of
blocked processes.

I'm arguing that maintaining this reverse lookup is hard work, especially
when you need to set it up and tear it down on every call to select().
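The contrast is visible in the shape of the two APIs (Python sketch, Linux assumed for the epoll part): select() receives the whole interest set on every call, so the kernel rebuilds and tears down those per-fd registrations each time, while epoll keeps registrations alive across waits.

```python
import os
import select

pipes = [os.pipe() for _ in range(4)]
read_ends = [r for r, _w in pipes]

# select(): the full fd list crosses into the kernel on every call, and
# the per-fd registrations are built up and torn down inside each call.
select.select(read_ends, [], [], 0)
select.select(read_ends, [], [], 0)     # same setup work, from scratch

# epoll: one epoll_ctl() registration per fd, then each wait reuses it.
ep = select.epoll()
for r in read_ends:
    ep.register(r, select.EPOLLIN)
ep.poll(0)
ep.poll(0)                              # no re-registration needed
```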

------
marcosdumay
Now I'm very curious.

Has anyone around here ever experienced that problem of single process
accept(2) throughput being insufficient? If so, what were you doing?

------
styfle
I came here thinking this was about select2.js[1] the javascript dropdown
plugin.

[1]: [https://select2.github.io/](https://select2.github.io/)

------
justinsaccount
If you think select isn't broken, show me a small server written using it that
can handle more than 1024 concurrent clients.

~~~
anarazel
Something not being usable for a specific usecase doesn't mean $thing is
broken. It means it's not applicable for that usecase.

My bike is broken, it doesn't go 550mph!!1

Edit: adapted stupid example to be differently stupid.

~~~
pekk
This kind of broken reasoning is common in certain communities. Example: this
programming language is broken, it can't be used to write drivers for the
Linux kernel!

~~~
justinsaccount
No, it's more like "This feature is broken, if you attempt to use it for its
intended purpose it will fail horribly once you try running it at scale".

------
zlskefjj
Sensationalist title for an article exposing nothing new about select(). As
the old UNIX programming adage goes: select is not broken.

The author didn't even mention the glaring problem with select.

~~~
somethingsimple
> The author didn't even mention the glaring problem with select.

Which is...?

