
Using select(2) the right way - aivarsk
http://aivarsk.github.io/2017/04/06/select/
======
ts4z
I have worked on servers that had this problem with select(2). poll was
better, but kqueue was much, much better (from near-100% CPU utilization with
a few thousand open fds to < 1% utilization; please forgive the vague numbers,
it was 7 years ago at a company that I no longer work for).

If you're using nonblocking I/O, or threads, select can become the bottleneck
with a few thousand interesting connections, at counts well below the
FD_SETSIZE maximum.

The user-space CPU utilization should be near zero, since this is really a
kernel I/O problem; the fact that it is much higher indicates that a lot of
wasted user-space work is being done.

Select, by design, has a flaw: the kernel must scan the bitmask up to the
highest-numbered fd passed in, even if there is little change in the fd_set
bitmask (as is usually the case). This means incurring a cost if any fd in the
set has a high fd number. The fd on which you accept probably has a low
number, but your client I/O fds will get higher and higher numbers as the
server's load grows.

kqueue (and epoll, but I haven't used it) resolves this problem nicely: only
things that change state must be communicated. poll is better than select in
this respect if the array is sparse, but if a large number of fds are
interesting, it also gets expensive.

~~~
aivarsk
What you wrote about the design flaws of select is correct, but an fd set with
10,000 sockets is just 1250 bytes, or 157 longs, and hardware is very good at
sequential memory access. I agree it will be slower than kqueue or epoll for
long-lived connections, because there is more work for the kernel and the
application to do, but for many applications select is good enough when used
correctly.

I have my doubts about select causing 100% CPU utilization; I suspect you were
doing other suboptimal things as well. The sample code I wrote runs well with
both 100 and 10,000 connections. I have my own anecdotal evidence where an
application was barely handling just 100 mostly inactive connections and the
OPS guys suggested a limit of just 50 connections per application. After I
fixed how the fd sets were created and how the result of select was processed,
the same application ran just fine with 8,000 connections. We had to support
Linux, Solaris, AIX and HP-UX at that time, and select/poll were available on
all of them. That's why I invested time in optimizing the code instead of
switching to epoll. The OPS guys still suggested a limit of 1,000 connections
per application, but this time it was for availability and other
non-performance reasons.

~~~
joosters
In my experience, there's definitely a point where poll() beats select() when
handling lots of connections, long-lived or short-lived. But at that point you
are much better off moving to epoll() or kqueue() depending upon the OS. And
if you support those syscalls, there's little point falling back to select().

For most people, if your program is handling 10s or 100s of concurrent
connections, select() should be just fine.

If you are going to be handling more, it's worthwhile looking into the other
syscalls to improve performance. In any case, I'd recommend abstracting away
the event handling so that you can switch between the different syscalls, and
using benchmarking to try to replicate the traffic you want to handle. There
are lots of libraries that will do this work for you, unless you need to get
into the low-level stuff.

Trying to judge syscall performance by reasoning about the amount of data
transferred between your program and the OS, the number of syscalls, etc., is
very difficult. You are much better off measuring the actual behaviour. For
example, epoll() seems like a terribly designed API to me, as it involves
making many syscalls (whereas kqueue() is just one per loop). However, I found
epoll() to be very high performance. I guess the cost of syscalls on Linux can
be very low in some cases.

~~~
valarauca1
System calls on Linux that deal with per-process state try very hard not to
invalidate userland memory mappings.

The kernel's _real_ range and userland's _virtual_ range won't overlap, so for
a lot of functions kernel memory just has to be mapped/unmapped on the call,
without invalidating _all_ userland bindings.

Well, okay, they will overlap. So yeah, your mappings may get invalidated, but
for synchronized, higher-performance system calls they _shouldn't_ be.

This lets them be in the 10s to 100s of nanoseconds.

Normally the most _expensive_ part of a Linux syscall is the TLB misses after
one.

---

Your model of memory transfer size assumes data is being copied.

The Linux kernel has a lot of features to let userland, devices, and itself
all share the same memory, copy-free.

~~~
koverstreet
This is wrong.

Kernel address space and userspace address space don't overlap at all. There's
zero mucking with memory mappings, TLB etc. on syscall.

------
hawkinsw
I've written lots of servers in C and found this article quite interesting.

The techniques that the author presents are new to me and I know that I will
take them into consideration when I write a server in C the next time.

More importantly, they are techniques that I think library writers should use.
My uneducated guess is that the underlying implementations of servers in
Node.js and Iron, etc, could be improved with these ideas, even though they
might not be directly applicable since those servers are written in different
languages.

I can't vouch for the results, but the author's experiments seem conclusive.

~~~
DSMan195276
His allocating an fd_set to an arbitrary size is somewhat iffy standards-wise,
but if it works, it works - especially as a solution for specific OSes like
Linux, where this won't break in the future. On that note, it is worth
wondering how expensive the copying is, but I would wager he is right that the
copying isn't actually that expensive - it's only an 8K buffer.

Also, when creating an array that is indexed by fd, if you want to use only
the memory you need instead of creating an extremely large array for the max
fd of the system, you can just allocate the memory using `mmap`. Systems like
Linux won't actually allocate all the memory right when you do the `mmap`;
instead they map a zero page into each spot. When you write to it, the kernel
will automatically allocate a page to back that memory. When you combine that
with the kernel always selecting the lowest fd, it means that the memory
backing your array will only grow as large as needed for the maximum number of
fds you use. (Note: it's likely that such a large array would be allocated via
`mmap` in its own spot by malloc/calloc anyway, but using `mmap` ensures you
get the behavior you intended.)

------
kelnos
The thing that really resonated for me in this article was the point that for
most of us, the performance of select() or accept() really doesn't matter,
because any time spent there is easily dwarfed by application logic, database
lookups, queries to other services, etc.

You only really need to care about this if you're working on something mainly
concerned with the connections themselves, like HAProxy.

~~~
morecoffee
The right way to think about these kinds of things is in terms of
responsiveness. For example, when I think of any request initiated by a user,
there is a ~200ms budget of time to service that request. If you can shave off
a couple hundred microseconds, those can be spent on returning a better
answer.

Everyone has a piece to contribute to servicing user requests, and everyone
should be mindful of how much of the budget they consume. Using select (or
epoll) more efficiently makes everyone else's job easier. It is important for
everyone in the pipeline to think this way, you included. The user experience
depends on it.

~~~
astrobe_
In a nutshell: performance still matters because your program isn't the only
one running on the computer.

------
Matthias247
Regarding this:

> epoll uses system calls (“slow”) to add or remove descriptor to the set.
> select does the same thing in userspace. If you have to add and remove many
> descriptors per iteration which is typical for short-lived connections then
> select might be more efficient.

I would have estimated that on a select system call the kernel needs to
internally attach some kind of event listener for each passed FD and uninstall
it before the system call returns. Which means select is internally a sequence
of "iterate over all FDs, check if an event is available and if not install a
listener", "wait for at least one event to happen" and finally "uninstall all
listeners". While epoll requires only "wait for at least one event to be
available and then iterate over the FDs and record all available events". So I
would guess that while epoll requires more syscalls for setup, it should be a
lot more efficient if an FD is polled more than once. With edge triggering
most of the registration/unregistration calls can also be avoided.

Or do kernels have special optimizations that make select internally more
efficient than I described? Maybe some memoization of formerly used poll sets?

~~~
aivarsk
[http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod](http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod)
and search for "EVBACKEND_EPOLL (value 4, Linux)" to read how much fun epoll
is :-D

~~~
masklinn
Or read this: [https://idea.popcount.org/2017-02-20-epoll-is-fundamentally-...](https://idea.popcount.org/2017-02-20-epoll-is-fundamentally-broken-12/)

------
loeg
A lot of the advice seems misguided, if not wrong. Rather than optimize around
the limitations of select(2), I'd suggest using different APIs instead.

For example:

> Literature ussually says that fd_set is limited to 1024 descriptors but some
> mention that you can increase that by defining FD_SETSIZE to a larger
> number.

If you just need a one off tool for a few descriptors, why not just use
poll(2) instead? No problems with high fd numbers, and the API is quite
simple.

> To find out which sockets have events some loop through data structures and
> check each socket they contain. But a much better way is to iterate through
> fd_set as array of long and check 32-64 bits at a time.

If you've reached the point at which this trade-off matters, you're better off
switching to a socket API that scales well (epoll or kevent) and does this
filtering for you. Or like another commenter suggested, using a library that
abstracts this functionality.

> select modifies descriptor sets passed in. A common mistake is to create
> descriptor set from zero before calling select by looping through data
> structures and adding all sockets they contain. I’ve seen that to take huge
> amount of time due to locking and synchronization between threads. The
> correct way is to maintain a descriptor set all the time and create a copy
> of it and pass the copy to select.

Again -- if you've reached the point where this tradeoff matters, just go
directly to epoll/kevent.

> Most programs will need to store some state associated with the socket. Map
> seems to be a good choice at first but you should use an array indexed by
> socket (it’s an int). The POSIX API says that kernel always has to use the
> lowest unused file descriptor when creating a new one. So the array will be
> as big as many connections your program handles and kernel will take care of
> finding a free slot in your array.

Maybe your program deals with non-socket fds, and the set of socket fds is
fairly sparse. Using a map is actually pretty reasonable even if fds are
dense.

~~~
aivarsk
I probably should have started by saying that this was years ago, when
applications had to work on OSes other than Linux and BSD (AIX, HP-UX,
Solaris). Today it's libev or Boost.Asio.

> why not just use poll(2) instead?

Because, as I mentioned later, poll is basically the same as select but
requires more memory to be copied to/from the kernel, and it was slower in
some test cases, although I can come up with cases where poll will be faster
than select. Networking libraries like libev and others allocate fd_set the
same way.

>> To find out which sockets have events some loop through data structures and
check each socket they contain. But a much better way is to iterate through
fd_set as array of long and check 32-64 bits at a time.

> you're better off switching to a socket API that scales well (epoll or
> kevent) and does this filtering for you. Or like another commenter
> suggested, using a library that abstracts this functionality.

That's how kernel, libev and others work with fd_set.

>> The correct way is to maintain a descriptor set all the time and create a
copy of it and pass the copy to select.

> Again -- if you've reached the point where this tradeoff matters, just go
> directly to epoll/kevent.

Again -- libev and others do this

> Maybe your program deals with non-socket fds, and the set of socket fds is
> fairly sparse. Using a map is actually pretty reasonable even if fds are
> dense.

Yes, there might be cases where you have non-sockets and you can't use an
array indexed by socket. But it's great in most cases, and the kernel will
keep it as dense as possible. It might be that you have 10,001 connections,
then 10,000 are closed while the highest socket is still in use, and the
array memory is wasted. But it will never require more memory than it did
during the peak.

libev use of select:
[http://cvs.schmorp.de/libev/ev_select.c?revision=1.56](http://cvs.schmorp.de/libev/ev_select.c?revision=1.56)

------
adynatos

    unsigned long *wfds = (unsigned long *)wfds_out;

Unless fd_set has the same or stricter alignment requirements as unsigned
long, and fd_set and unsigned long are compatible effective types, this line
of code provokes undefined behaviour twice (it violates alignment requirements
and breaks the effective-type, i.e. strict-aliasing, rules).

~~~
beeforpork
In this case, it is most probably OK, because wfds_out comes from calloc(),
which uses the largest possible alignment so as to work for any type (it
returns void* and cannot know which types will be stored in the allocated
memory, so it has to assume the maximum for the architecture).

------
t5e
Another problem with select() is that trying to use a file descriptor above
FD_SETSIZE in FD_SET() and the like will actually overwrite the stack and
cause random segfaults in other areas of the code. Note that sockets aren't
the only thing using up file descriptors, and it is quite easy to run into
this issue. The solution is to use poll() as an essentially drop-in
replacement. Strangely, the winsock version of select() acts more like poll:
instead of a bitmask indexed by socket fds, it uses an array of sockets, so
you can actually use up to FD_SETSIZE sockets regardless of their file
descriptor values.

------
dorianm
Btw "ussually" instead of usually

------
known
Does Apache web server use this trick?

~~~
dfox
In general it cannot, because it does not do asynchronous IO on "sane"
platforms. (what it does/did on Netware might pass for async IO)

------
styfle
It took me a few seconds to realize this article is _not_ about select2[0],
the javascript library.

[0]: [https://select2.github.io/](https://select2.github.io/)

------
peterwwillis
The correct way to use select() is not to use it. You should instead use a
library designed to do, at a high level, whatever it is you were going to use
select() for. That library will then (hopefully) figure out what your
application needs as efficiently as it can, rather than you having to spend a
week researching how modern operating systems work, only to finally learn that
you should have been using a library to do what you were about to reinvent.

~~~
linkregister
Are there really no applications for using low-level async I/O APIs/syscalls?
Embedded systems? Game servers? Proxies? Software routers? Hidden command &
control servers?

~~~
MrBuddyCasino
For embedded systems, LWIP has a socket API that supports select(). Try
finding an event library for that one! I had to roll my own, and I learned
quite a bit in the process.

~~~
Matthias247
Actually, if you use the low-level LWIP API (sometimes called the raw API)
instead of the socket API, it is actually event-driven (single-threaded,
callback-driven, etc.) by design. The socket abstraction layer on top of it
then provides blocking APIs and multithreading support - at the cost of lower
performance.

~~~
MrBuddyCasino
I'm planning to eventually migrate to the netconn API, but was deterred by a
lack of documentation. Do you know of any sources? This doesn't help much:
[http://www.nongnu.org/lwip/2_0_x/index.html](http://www.nongnu.org/lwip/2_0_x/index.html)

~~~
Matthias247
I also had quite a hard time with the lack of documentation for LWIP. The only
other source that I found was the wiki, which is also linked there:
[http://lwip.wikia.com/wiki/LwIP_Wiki](http://lwip.wikia.com/wiki/LwIP_Wiki)

But I still had to figure out lots of things from the actual code.

~~~
MrBuddyCasino
Rats. I had a feeling you'd say that, thanks!

~~~
Matthias247
I found something more in my work bookmarks:
[http://www.atmel.com/Images/Atmel-42233-Using-the-lwIP-Netwo...](http://www.atmel.com/Images/Atmel-42233-Using-the-lwIP-Network-Stack_AP-Note_AT04055.pdf)

Maybe it also has some extra information which can help.

~~~
MrBuddyCasino
Thanks!

