Hacker News new | past | comments | ask | show | jobs | submit login
Select(2) is fundamentally broken (popcount.org)
117 points by panic on Jan 5, 2017 | hide | past | favorite | 85 comments

The phrase "fundamentally broken" is overblown when referring to a syscall that works fine for most applications. "Doesn't scale" may not sound as glamorous in the title, but it's a lot more accurate.

Back in the old single-threading Unix days, select() was a good choice because it allows you to write a program handling async inputs in an easy style.

It's not just works, it's actually the only one that works properly for most things and is the most portable across OSes.

I was kinda embarrassed about making a new program with select() a few years ago. I'd rather have written something with a epoll or kpoll or something like that, but the documentation was pretty bad and I never got them to work. Anyway it doesn't need to scale, it only ever has three servers talking to it. Works great, whether I like the style or not.

select() has its uses and servicing a small handful of sockets is one of them. Nothing to be embarassed about.

Hence libevent (&c). Coding direct to select is a little like reading directly from /dev/bpf0.

That's probably fair, but beej's guide corrupted me. It's just too user-friendly. I'll look around again for the next project though.

In my defense, I never said "don't use select(2)". The point I want to get through: due to the semantics, it's impossible to create a fast kernel side implementation of select().

I liked your article and it got me thinking about your point a lot. I'm just whining about the title since I think it could be improved. Would like to also see some text that mentioned that the inefficiencies add up so that it mostly matters when scaling to large numbers.

Thanks for writing it -- it's a good contribution!

And select() is still a perfectly good option for situations where the number of FDs involved is constant, or at least stays small. For instance: for networking in an X11 application, using a select()-based event loop is much simpler than creating a thread for each socket.

You should never, ever, create one thread per socket. The choice is between select, polled IO, or events. Sadly, UNIX doesn't have a good unified asynchronous IO model, like OVERLAPPED and IOCP on NT.

> You should never, ever, create one thread per socket

Really? that's how MySQL works.

And even when the number of FDs grow large, epoll() (which is like select()) in Linux is often preferable spawning threads.

i'll just leave this here is case someone walks by and thinks it looks interesting


It's also worth noting that the Windows IOCP model unifies network and other I/O. This allows something like a Web server to do asynchronous reads and writes on files without needing more threads and still also servicing the network. Would be nice to get similar capabilities in Linux.

EDIT: Also IIRC when using select() writes can still block your thread...

> EDIT: Also IIRC when using select() writes can still block your thread...

You should only ever use select() with file descriptors in non-blocking mode. Otherwise there's an inherent race condition with any readiness signaling mechanism along these lines. But yes, the read/write syscalls are not asynchronous, and select() on regular files are required by POSIX to always be signaled as ready for reading and writing. Basically, descriptor readiness as per select() does not mean and has never meant that the thread won't be temporarily descheduled and put on an IO wait queue. Its semantics are only a useful fit for things like UARTs, sockets, pipes, etc.

By the way, even with asynchronous/overlapped IO on Windows, there are cases where asynchronous reads/writes on files will effectively be synchronous, especially in the common case of file-extending writes. See https://support.microsoft.com/en-us/kb/156932 for details. The only work-around I know of is to use SetFileValidData to pre-extend the file, but that requires increased privileges and so is not often feasible (and usually not worth it compared to queuing the writes to a dedicated IO thread).

The difficult thing with IOCP is that it requires architectural changes to the application and there are platforms that have no reasonable equivalent to it. So portable applications that use IOCP semantics are hard to support on those platforms.

Whereas essentially all platforms (including Windows) have select() and most have something equivalent but better like epoll() or kqueue().

And for most applications the benefits of IOCP over epoll() or kqueue() are only theoretical. A call to send() or recv() isn't literally synchronous, it's buffered by the OS.

So using IOCP looks better on paper than in practice.

For many use cases, either of the two can be written as an emulation over the other.

> For many use cases, either of the two can be written as an emulation over the other.

Between select/poll/epoll/kqueue/etc and each other that is easy. You have a few functions like "register(ctx, fd, event)" and "wait_for_events(ctx)" that map straight to each of the implementations.

To use IOCP it's not just those you have to wrap. It's also send(), sendto(), sendmsg(), write(), recv(), recvfrom(), recvmsg(), read(), connect(), accept() and others. In ways that don't make for trivial or efficient implementations. And then contend with any third party library that uses any of those internally.

There is a reason libev doesn't support IOCP and libevent only supports it by exposing a different API on Windows.

The devil is in the details. Consider Mark Heily's libkqueue library for Linux that emulates kqueue/kevent. Because of the way that signalfd, upon which it is built, works, code that uses kevent with signals on the BSDs has to be written one way and on Linux has to be written another. It's a subtle difference in the manual page; and it's all too easy to read the wrong implementation's manual page on the WWW (if one makes the mistake of using a search engine to find manual pages) and write things incorrectly.

It is also telling, of how difficult to interconvert it actually is in practice, that the Heily libkqueue does not implement EVFILT_AIO, EVFILT_FS, or EVFILT_PROCDESC, and implements some of the others in only a limited fashion.

As I understand it, the Windows caching layer doesn't actually support asynchronous IO. Any asynchronous cached file IO runs on a small, fixed-size system-wide thread pool using blocking IO, and once that runs out all such operations block until completion. That's why Windows software generally uses its own thread pools for file IO.

And also user defined events and timers(through the new thread pool API)

This is all possible on Linux today with timerfd [0] and eventfd [1]. You can use the same mechanism for async file-IO with libaio and the IOCB_FLAG_RESFD [2] flag.

Granted libaio only really works well on XFS [3], and even there it has issues.

[0] - https://lwn.net/Articles/251413/

[1] - http://man7.org/linux/man-pages/man2/eventfd.2.html

[2] - https://www.fsl.cs.sunysb.edu/~vass/linux-aio.txt

[3] - http://www.scylladb.com/2016/02/09/qualifying-filesystems/

It seems that the author is slightly confused. Description of the thundering herd is mostly correct. Initially it referred to multiple processes calling the accept() on a single listen socket, it used to cause all processes to wake up. This has been fixed a long time ago. Currently multiple processes blocking on the accept() on a single listening socket works in a round-robin fashion (AFAIK). Calling select() from multiple processes on a single fd, should wake all the processes up when IO comes in. This is a documented behavior.



Yes, all this 'fundamentally broken' talk is overblown rubbish. I was working on servers handling large numbers of sockets across lots of processes over a decade ago. Scaling non-blocking I/O was a well known area a long time ago, and problems like the 'thundering herd' and others had multiple solutions across a variety of operating systems.

You are correct. Directly blocking on accept() in multiple processes does not have the "thundering herd" behaviour. This is good to know.

But this proves next point - select is a poor abstraction. This means that accept() is doing something more then just wait for readability (it attempts round robin) - a thing you can't express with select().

In the article I used the accept() case for illustration of the thundering herd problem. Non-blocking connect() taking a long time makes a good case. The same experiment could be done though measuring write() or sendmsg() syscalls.

First, the article starts talking about the accept() and thundering herd but the example shows use of the select().

Second, accept() goes over a queue of the established connections created by a listen() call

select() and accept() are meant for different things. selec/poll/epoll/kqueue work with a list of file descriptors to detect I/O, accept works with a socket.

The author criticizes select(2) because it requires the kernel to iterate over a big data structure in memory, which has to be passed to the kernel every time select is invoked. He criticizes epoll(7) because it didn't work well when multiple processes try to "split the work" of handling a big batch of file descriptors.

To be honest, I'm not sure why the author spends time beating up on select. It's a three-decade old API that isn't really used for high performance applications any more. The epoll problem seems to be resolved by EPOLLEXCLUSIVE-- I don't understand why he feels that the kernel dispatch time would still be O(num processes), when clearly the goal of this flag is to only wake one process. It's still a warty API, but certainly a usable one. Maybe kqueue is better, but this post does little to convince us of that.

The kernel walks the list of threads until it finds one that's actually blocked in epoll_wait().

Suppose that at any given time, most of your threads are running, not blocked (which is why you need this many threads). The kernel will always wake the first blocked thread in the list, so the few blocked threads will cluster towards the end of the list. And the kernel will have to walk most of the list to find a thread to unblock.

This is speculation; I haven't benchmarked anything.

EPOLLEXCLUSIVE was introduced very recently, and frankly, I don't have a 4.5+ kernel to test.

I did some tests years ago and a single-threaded event loop was able to handle more than 10,000 short-lived connections per second sending and receiving ISO 8583 bitmaps (38,000 with 2 threads if I remember correctly). And the bottleneck was not in networking APIs or code but in middleware (Oracle Tuxedo) calls that were done in the same thread. I think you can do even more with modern hardware.

You just have to use select() correctly:

1) You can raise the 1024 limit of feed set size by "#define FD_SETSIZE 65536" (required for SunOS to use select_large_fdset() instead of select()) and allocating memory for fd_set yourself.

2) Do not loop over descriptors and use FD_ISSET to check if file descriptor is in set. Instead loop over fd_set one word at a time: if word != 0 then go and analyze each bit of that word (see how Linux kernel does it).

3) The other thing is to limit number of select() calls you make per second and do short sleeps if needed. That allows for events to be processed in batches and the cost of select() calls gets relatively smaller compared to the "real work" done. It also increases latency but you can work out a reasonable number of select() per second. This idea I got from "Efficient Network I/O Polling with Fine-Grained Interval Control by Eiji Kawai, Youki Kadobayashi, Suguru Yamaguchi"

I learned how to use accept() correctly from "Scalability of Linux Event-Dispatch Mechanisms by Abhishek Chandra, David Mosberger". The main idea is to call accept() in a loop until EAGAIN or EWOULDBLOCK is returned or you have accepted "enough" connections.

I don't get why author claims that epoll() fixes the problem with registering and unregistering descriptors. If you use epoll then adding or removing a descriptor is a system call but in case of select() you just modify a local data structure and call select() when you're done adding and removing all of descriptors. And you shouldn't call accept() from multiple threads, a single thread calling accept() is enough for most of us unless you're web scale ;-D

Great! Do you have simple code base that illustrates the a over?

Bryan Cantrill has given many rants about epoll(2) as well. In particular, one of the biggest issues with epoll(2) compared to kqueue (BSD) or eventports (Solaris) is that epoll(2) is effectively useless when it comes to multi-threaded processes in the "worker pool" model.

Because it is edge triggered not level triggered, the application has no way to tell epoll(2) that a thread is already handling events on a fd (and thus shouldn't hand it to a different thread -- which will lead to a race). There are ways to hack around it, but it's just a bad design.

Epoll is level triggered by default. Edge triggered is an option called EPOLLET.

I know edge triggered vs level triggered in the context of electronics but can you elaborate on their meaning in a software context?

Similar to the meaning in electronics: edge-triggered reports on changes of state, level-triggered on state.

When you use epoll in edge-triggered mode it returns and reports a resource as available once, when it was unavailable and just switched to being available. If you enter epoll again, it won't return just to report the resource is available.

In level-triggered mode, when a resource polled on is available it always returns immediately to report it as available.

Edge-triggered is useful to make sure you only act once on a resource becoming available (e.g. by handing it to a worker thread that will work in the background, you want to enter epoll in the main thread again to wait for other changes) or if you have to wait for something else to become available to actually be able to do something with it and don't want to waste time spinning in the waiting state (e.g. when moving data between two sockets, both have to be available, so you only want to wake up when the second one changes as well).

In the case of epoll(), using the edge-triggered interface it will only report an event when the status of the monitored file descriptor changes (eg from "not readable" to "readable"), whereas when using the level-triggered interface it will continue to report the event as long as the monitored file descriptor remains in the relevant state.

For example, if a socket is reported as readable but you only read some of the available data before calling epoll() again, with the edge-triggered interface it won't be reported as readable the next time you check, but with the level-triggered interface it will be.

They're the same; at some point in history edge and level triggering somehow leaked over into operating systems through interrupt controllers, and from there into event and signal APIs for yet another set of reasons.

It's the same. If you imagine an async signal (notification), for example for reading from a file handle, as an electronic signal, the two are exactly the same.

Thanks for the explanations, good to know.

"Broken" is the wrong word. The argument is actually, "it's slow." Broken would imply logically flawed, impossible to make work consistently, something along those lines. The title to "Select(2) is fundamentally slow" would be more accurate (but less interesting).

As one bandaid, one can accept() in one process and pass the fd to a worker.

The article's conclusion about select() seems correct; however, after exploring select() performance in general, it then mentions epoll's exclusive flag in passing, and then dismisses it without actually doing any performance analysis.

I'll be interested to hear why epoll and kqueue are so very different. Strikes me that both are quite similar: you attach waitobject-specific events to fds, and then wait on the waitobject until one of the events occurs. Much like one another, and not much like select/poll!

And seems like both are quite different from IOCP too... kqueue_qos aside, you get the readiness state(s) and then do the operation(s) that look likely to be possible. So from this viewpoint epoll/kqueue/poll/select are actually basically the same - in contrast to the IOCP approach of doing the operation and then getting a notification when it completes. kqueue/epoll vs select/poll then looks like a more efficient way of doing the same stuff (with improvements - e.g., because the OS has more information to hand in the kqueue/epoll case it probably has more opportunity to minimize multiple wakeups, etc.).

IOCP is the sensible way to do it. Unix, for a system where supposedly everything is a file, has a lot of very specific behaviors for "special" kinds of files. In contrast IOCP lets you treat all async operations from timers to sockets to disk the same way and the thread pool does a decent job of scaling too.

I've used both programming models, and I find "can I read/write this fd" much more intuitive and versatile than "try to do it and let me know when actually done". In particular, you can use many different models to distribute and parallelize work with poll/epoll.

Also, poll/epoll get even more versatile with current Linux systems, which take "everything is a file" much further with signalfd, timerfd, and eventfd.

The problem with IOCP is that it is more memory intensive as all memory for outstanding operations must be pre-allocated. With the readiness model, you can use pools of memory instead for dramatically less overall memory usage. There is a hack to use 1B reads with IOCP to get around this, but it doesn't feel very clean.

kqueue allows for batch updates on fds that are being polled for readiness. In addition, it has less hacky support for non-socket files such as timers, events, signals, and disk IO. Setup for all of these on epoll requires extra unique syscalls. I'm not sure I've even ever seen someone use epoll for disk IO (via AIO). Overalls, kqueue just seems like a more cohesive, unified async solution.

I don't think it's a useful idea to use a thread or process pool to accept on the same passive socket. It's perfectly fine to have one thread doing this and handing the connected socket off to a pool for processing. If your bottleneck is in the accept loop, what that proves is that you're not writing a real service application, but a benchmark contrived to produce such a bottleneck.

Depending on your server workload, sometimes accept() is the bottleneck.

That's why there are workarounds. For example TCP SO_REUSEPORT: https://lwn.net/Articles/542629/

Came here to say this. And if you do have one of those rare applications where the time spent in accept rivals the time spent doing useful work, then have one thread accepting per CPU. Not 10k.

I'd like to try an interface that provides all traffic on a given port, regardless of the IP address on the other end, via a single file descriptor.

One reasonable way to design a server is to queue all requests onto a single queue, and then have multiple worker threads that take requests from the queue and process them.

When requests arrive from the outside, they are coming in multiplexed onto one data stream (assuming one network cable). It seems wasteful to have the kernel demultiplex these into separate streams (one per client), just for your application to remultiplex them when it puts them onto its queue.

If the server application is handling all traffic for the port, let it handle the demultiplexing.

TCP flow control might get a bit tricky, because I think you'd want to still handle that in the kernel.

With this kind of system your server application would have one file descriptor for reading network data, and it could have a single thread dedicated to reading that in blocking mode.

How would you handle requests that span more than one packet?

The point of the "socket" abstraction is so that the kernel can make the arrival of packets - which may appear out-of-order, or not at all - seem like a continuous stream of data to the application. Get rid of the multiplexing, and you also get rid of the socket receive buffer. The application basically needs to implement a TCP layer in userspace.

BTW, UDP sockets provide exactly the abstraction you're looking for: the sending IP address is filled in along with the data, and there's no socket to wait on.

> I'd like to try an interface that provides all traffic on a given port, regardless of the IP address on the other end, via a single file descriptor.


The thundering herd issue applies just as much in the kqueue case, if you set it up with N processes all interested in an event on the same socket. Waking them all up is still going to be an O(N) operation.

The "non-broadcast wakeup" solution - ie EPOLLEXCLUSIVE - will also work just as well there.

Modern OS Xs appear to have two extra things to help with this:

- kqueue_qos - lets you do an atomic poll+receive of a Mach message on a Mach port

- EV_ONESHOT - once you get the event, it's gone, and you have to reactivate it. I think this is the analogue to EPOLLEXCLUSIVE

I've used kqueue but both of these are new to me, so I could be wrong. (I'm pretty sure kqueue_qos didn't exist when I was doing this OS X stuff about a years ago. And I only had one waiting thread, so if I noticed EV_ONESHOT at the time, I didn't pay much attention to it.)

(Interestingly, EPOLLEXCLUSIVE promises to wake up "at least" one thread, suggesting that it might wake up all of them. Whereas EV_ONESHOT sounds like it will only wake up one - since once retrieved by one call to kqueue, the event will be canceled, and it will never be returned by another. But the Linux man page doesn't say what "at least" actually means... and, to be honest, the OS X one is barely any clearer anyway, AS USUAL. So who knows for certain without scraping through the source.)

EV_ONESHOT refers to the individual kevent registration, and is basically an automatic EV_DELETE on delivery of the event.

EPOLLEXCLUSIVE is acting at a different layer - in kevent terms it would cause only one of the registered kevents watching an object to be activated when the object is ready instead of all of the kevents watching for the same thing.

And there's a epoll() equivalent of EV_ONESHOT, EPOLLONESHOT.

The "at least" is meant to convey that the API doesn't promise only one will be woken. It's very similar to the way that pthread_cond_signal() is documented by POSIX to wake at least one thread waiting on the condition variable, and for the same underlying reason.

The problem is that all of select(2)'s competitors have a lot of problems as well. Because, as will be repeated below, I have no idea what I'm talking about, I will be citing a relatively independant source, namely, the libev documentation.

-epoll: epoll is an absolute mess. Just read the manpage. Cantrill said it, and in this case he's right. But since Cantrill is hardly an independant souce, here's what the libev documentation has to say about epoll:

>The epoll mechanism deserves honorable mention as the most misdesigned of the more advanced event mechanisms: mere annoyances include silently dropping file descriptors, requiring a system call per change per file descriptor (and unnecessary guessing of parameters), problems with dup, returning before the timeout value, resulting in additional iterations (and only giving 5ms accuracy while select on the same platform gives 0.1ms) and so on. The biggest issue is fork races, however - if a program forks then both parent and child process have to recreate the epoll set, which can take considerable time (one syscall per file descriptor) and is of course hard to detect.

>Epoll is also notoriously buggy - embedding epoll fds should work, but of course doesn't, and epoll just loves to report events for totally different file descriptors (even already closed ones, so one cannot even remove them from the set) than registered in the set (especially on SMP systems).


>Epoll is truly the train wreck among event poll mechanisms, a frankenpoll, cobbled together in a hurry, no thought to design or interaction with others. Oh, the pain, will it ever stop...

-kqueue: kqueue isn't epoll, which is a massive advantage. However, once again according to libev (I do like to site people who know what they're doing, because I don't), it's at least somewhat broken on every system other than NetBSD: it only works with sockets and pipes on FreeBSD, and on OSX it's totally broken. If this is wrong, or has been fixed, please let me know: I'd love that.

-Eventports: according to libev (once again) eventports are probably the least broken: it's slower than select and poll on small scales, but scales up well. Apparently the interface is a bit quirky, and it has some problems ("The event polling function sometimes returns events to the caller even though an error occurred, but with no indication whether it has done so or not"), but it's a heck of a lot better than everything else.

So yes, Solaris wins. Again.

Dammit, Solaris: it's getting increasingly hard to argue with your fans, because the system is actually really nice.

> it only works with sockets and pipes on FreeBSD

Where did that rubbish come from? It certainly was not the FreeBSD doco. I've been happily using kevent with regular files, directories, child processes, pseudo-terminals, and signals for years. On OpenBSD, too.

I cannot likewise personally attest to its support for process descriptors, timers, AIO requests, or BPF devices, but they're documented.

Oh. Like I said, that's only what I've heard. It's good to know that's working now.

So now Solaris doesn't win, it's just that Linux totally loses...

I am not aware of it ever not working, since its creation.

I did point to my source: the documentation for libev, a widely used event library that abstracts over such OS-specific mechanisms. Whilebit may be outdated, I'm inclined to trust that it was true at some point, as it's more or less the libev maintainers' jobs to work with these interfaces day in and day out.

It's possible they're incompotent, of course, but I find that unlikely.

If kernel uses some kind of callbaks internally, why this callback API isn't directly presented to userspace API? Callback API is easier to use than select. I remember emulating callback API on top of select and if kernel emulating select on top of callbacks, it seems pretty weird.

Just curious, are there any real world examples where select is a bottleneck? I guess if I had a threaded web server or something but is there such a thing?

I've never particularly liked select but it has solved many, many problems in practice.

Yes, it quickly becomes the bottleneck with a large number of sockets. Web servers like nginx use epoll/kqueue and other techniques described in the article.

"It is heavyweight. It requires constantly registering and unregistering processes from the file descriptors, potentially thousands of descriptors, thousands times per second."

Can someone clarify what this tries to say. Thanks.

Imagine a kernel has a data structure for each socket. When a process blocks on that socket, a kernel must "register" it on it.

So that - when an event on a socket happens, the kernel instantly knows which processes to wake up. A reverse lookup: given socket, return list of blocking processes.

I'm arguing that maintaining this reverse lookup is a hard work, especially when you need to set it up it and tear it down on every call to select().

Before each call to select, you'll have to FD_ZERO the fd set(s), then individually FD_SET each fd you're interested in.

If that's a large number of fds, then that's quite some overhead.

Now I'm very curious.

Has anyone around here ever experienced that problem of single process accept(2) throughput being insufficient? If so, what were you doing?

I came here thinking this was about select2.js[1] the javascript dropdown plugin.

[1]: https://select2.github.io/

If you think select isn't broken, show me a small server written using it that can handle more than 1024 concurrent clients.

> If you think select isn't broken, show me a small server written using it that can handle more than 1024 concurrent clients.

Divide your FDs into groups of 1024 each. Put each group in a different worker thread, each connected to the main thread with a Unix socket. Select over the group in each worker thread; when a socket wakes up, notify the main thread. Select over the worker connections in the main thread.

You laugh, but it's a documented approach on Windows [1].

[1]: https://msdn.microsoft.com/en-us/library/windows/desktop/ms6...

That actually doesn't sound like the worst solution in the world. Not really a good one, either, but not as awful as you make it out to be.

But I don't touch the kernel, so I could be wrong there.

Something not being usable for a specific usecase doesn't mean $thing is broken. It means it's not applicable for that usecase.

My bike is broken, it doesn't go 550mph!!1

Edit: adapted stupid example to be differently stupid.

Really? multiplexing file descriptors IS the specific use case of select.

It would be more like "My bike is broken, when I try to ride more than 12 blocks the wheels fall off".

Also, it's not even that select can't handle more than 1024 fds, it's that it can't handle a file descriptor above 1024 at all, so if your app opens a bunch of data files before accepting connection, the limit could be as low as 10 concurrent connections before select fails.

This kind of broken reasoning is common in certain communities. Example: this programming language is broken, it can't be used to write drivers for the Linux kernel!

No, it's more like "This feature is broken, if you attempt to use it for its intended purpose it will fail horribly once you try running it at scale".

This is an artificial limit. On many systems it can handle way more FDs than that if you use it directly, bypassing wrappers. Libev does that I believe, so any server on top of libev can use unlimited select.

Well, using libev to work around limitations in select is not really a simple server using select.. and if you're using libev wouldn't you not ideally be using select in the first place?

It's not like that. As a system call select doesn't really have those limitations. Choosing an interface and wrappers to work with it is up to the programmer, be it libc or libev.

As for ideally, using select as a default is a good idea for as long as you can get away with, everything else has too many caveats and working around them is tricky as you must relearn everything libev and nginx learned over the years if you haven't been paying close attention to the problems they encountered.

Intersting. So it's just libc that has the broken FD_SETSIZE stuff?

This is a problem today with PHP unfortunately, particularly when compiled into Apache.

Sensationalist title for an article exposing nothing new about select(), as the old UNIX programming adage goes: select is not broken.

The author didn't even mention the glaring problem with select.

> The author didn't even mention the glaring problem with select.

Which is...?

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact