
Async IO on Linux: select, poll, and epoll - matthewwarren
https://jvns.ca/blog/2017/06/03/async-io-on-linux--select--poll--and-epoll/
======
StillBored
Hmmm, the article links to another article called "select is fundamentally
broken", but even that article fails to note that select has a nasty gotcha.
Many (most?) examples
([http://www.tutorialspoint.com/unix_system_calls/_newselect.h...](http://www.tutorialspoint.com/unix_system_calls/_newselect.htm))
simply use 'fd_set' and the FD_* macros.

So, struct fd_set is a fixed-size structure, and the macros (and the kernel)
can only handle fds below FD_SETSIZE. It's quite possible your application
has an fd >= FD_SETSIZE even if you don't think you're using a lot of fds
(libraries and such can drive up the value of fd quite easily).

The result is random buffer smashing, both by userspace via the FD_* macros
and by the kernel, which doesn't have any knowledge beyond the nfds (which is
the highest fd plus one, not the number of fds) it's given. That is why you
use poll().

I can't tell you how many times I've had to fix that bug in programs that
appear to work fine for $SHORTTIME, but fail when running for $LONGTIME.
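
To make the poll() suggestion concrete, here is a minimal sketch in Python, assuming a Linux-like system where `select.poll` is available. The helper name `readable_fds` is made up for illustration. poll() takes a list of fds rather than a fixed-size bitmap, so the numeric value of an fd doesn't matter.

```python
import os
import select

def readable_fds(fds, timeout_ms=0):
    """Return the fds that poll() reports as readable (hypothetical helper)."""
    poller = select.poll()
    for fd in fds:
        # Each fd is registered individually; there is no FD_SETSIZE-style cap.
        poller.register(fd, select.POLLIN)
    return [fd for fd, events in poller.poll(timeout_ms) if events & select.POLLIN]

# Usage: a pipe becomes readable once something has been written to it.
r, w = os.pipe()
os.write(w, b"ping")
print(readable_fds([r]) == [r])
```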

------
wyldfire
Julia's articles are an excellent reminder of all the great stuff I take for
granted. I learned much of this stuff on the job and I only wish that I had
easy-to-digest articles like these to read early on in my career. Especially
if you are writing high-level application software, this is great info to help
you understand what's happening below.

> “level-triggered” vs “edge-triggered” ... I’d never heard this terminology
> before (I think it comes from electrical engineering maybe?).

This is IMO easiest to understand when using a logic analyzer or an o-scope.
Often you want to see what condition the system was in right when something
interesting happens and you can either detect a pulse (like from a momentary
button, e.g. elevator call button) or a level change (like from a pole/knife
switch like a room's light switch). Also, if you've programmed interrupts
before, you'll already have run into this distinction.
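
For epoll specifically, the difference can be seen with a rough sketch like the one below (Linux only, since it relies on Python's `select.epoll`; the function name `count_wakeups` is made up for illustration). It writes one byte to a pipe, never reads it, and counts how many consecutive non-blocking polls report the fd as ready.

```python
import os
import select

def count_wakeups(edge_triggered, polls=3):
    """Count how many of `polls` epoll waits report the unread pipe as ready."""
    r, w = os.pipe()
    ep = select.epoll()
    try:
        flags = select.EPOLLIN | (select.EPOLLET if edge_triggered else 0)
        ep.register(r, flags)
        os.write(w, b"x")  # one state change: the pipe goes from empty to readable
        # The byte is never read, so the "level" stays high the whole time.
        return sum(1 for _ in range(polls) if ep.poll(0))
    finally:
        ep.close()
        os.close(r)
        os.close(w)

print(count_wakeups(edge_triggered=False))  # 3: reported for as long as data is there
print(count_wakeups(edge_triggered=True))   # 1: reported once, on the change
```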

~~~
vram22
> > “level-triggered” vs “edge-triggered” ... I’d never heard this terminology
> > before (I think it comes from electrical engineering maybe?).

Yes, me neither, but my guess is that it comes from the level or edge of a
square wave, respectively.

[https://en.wikipedia.org/wiki/Square_wave](https://en.wikipedia.org/wiki/Square_wave)

------
IgorPartola
I always appreciate when these topics are covered, but the article could be
better. The performance table, for one, has no units, so I can't tell what
I'm looking at.

The description of edge vs level triggering is short and IMO not the easiest
to understand.

The highlight of differences between the states that poll vs select can return
doesn't actually explain the extra states that poll gives you.

The memory considerations of select vs poll vs epoll are only slightly touched
upon.

If you want to understand the how, you'll probably have to read the book she
mentions (I haven't read it), or other articles around the web (what I did
read). IMO the rule of thumb is that if you are writing a piece of software
that isn't performance-critical, just use select. It is widely available and
easy to understand. If you want performance, you will want to use epoll _and_
kqueue with a fallback to poll. Understanding kqueue and epoll will give you a
better idea of why these are different from poll/select.

The last time I touched this level of code, I did it in Python and I wrote a
generic listener set of classes that wrap all four of these calls to make it
work as performantly as possible on as many systems as possible. I suggest
this as an exercise to budding networking programmers.
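
For comparison with that exercise, Python's standard library now ships a `selectors` module that does this kind of wrapping: `selectors.DefaultSelector` is an alias for the most efficient implementation available on the platform (epoll, kqueue, and so on, falling back to poll or select). A minimal sketch, where the helper name `first_ready_message` is made up for illustration:

```python
import selectors
import socket

def first_ready_message(sock, timeout=1.0):
    """Wait until sock is readable, then return one recv() from it (or None)."""
    with selectors.DefaultSelector() as sel:
        sel.register(sock, selectors.EVENT_READ)
        for key, _events in sel.select(timeout):
            return key.fileobj.recv(1024)
    return None  # timed out with nothing readable

# Usage: a connected socket pair stands in for a real network connection.
a, b = socket.socketpair()
b.sendall(b"hello")
print(selectors.DefaultSelector.__name__)  # which backend was picked on this system
print(first_ready_message(a))
```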

~~~
Thaxll
Usually you use a framework and never use those syscalls directly.

~~~
IgorPartola
Sometimes you don't want to drag in all of Twisted to create a small server.
Sometimes you are writing code in C and don't want many external dependencies
to keep the memory footprint small. Sometimes you want very tight control over
what happens. Sometimes you want to process OS signals along with network file
descriptors. Sometimes your desired concurrency model doesn't match that of
the popular frameworks. Sometimes you have a more complex workload than just
listening on sockets.

Besides, the article is going down to the level of reading kernel source. Most
people don't do that because they want to modify it, but because they want to
understand it. Just saying "use Node" is not the answer to those types of
questions.

------
zokier
> Then it makes 2 DNS queries for example.com (why 2? I don’t know!)

If I had to guess, I'd say it makes separate A and AAAA queries in parallel.

~~~
js2
That's exactly correct. Looking at the output:

    
    
       write(3, "\3048\1\0\0\1\0\0\0\0\0\0\7example\3com\0\0\34\0\1", 29
       write(4, ";\251\1\0\0\1\0\0\0\0\0\0\7example\3com\0\0\1\0\1", 29
    

We can make these a little more legible like this:

    
    
        $ printf "%b" '\3048\1\0\0\1\0\0\0\0\0\0\7example\3com\0\0\34\0\1' | xxd
        00000000: c438 0100 0001 0000 0000 0000 0765 7861  .8...........exa
        00000010: 6d70 6c65 0363 6f6d 0000 1c00 01         mple.com.....
    
        $ printf "%b" ';\251\1\0\0\1\0\0\0\0\0\0\7example\3com\0\0\1\0\1' | xxd -a
        00000000: 3ba9 0100 0001 0000 0000 0000 0765 7861  ;............exa
        00000010: 6d70 6c65 0363 6f6d 0000 0100 01         mple.com.....
    

RFC 1035 and later RFCs explain how to decode these. We only care about
decoding the "question section", which follows the format QNAME/QTYPE/QCLASS.

The QNAME is a sequence of labels ("example", "com", ""), each preceded by its
length. So we see "example" (preceded by its length of 7), "com" (preceded by
its length of 3), and the null label (empty string with a length of 0)
representing the root.

Each QNAME is then followed by the QTYPE (two bytes) and QCLASS (two bytes).
The first QTYPE is 0x1c, or 28 decimal, which is indeed an AAAA query. The
second QTYPE is 1, which is an A query.
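
The same decoding can be sketched in a few lines of Python. This is only an illustration, assuming a query with a single question and no name compression (true for the two packets above); `parse_question` is a made-up helper name, and the input bytes are copied from the xxd output.

```python
import struct

def parse_question(packet):
    """Return (name, qtype, qclass) for the first question in a DNS query."""
    pos = 12  # the fixed DNS header is 12 bytes; the question section follows
    labels = []
    while packet[pos] != 0:  # a zero length byte is the root label, ending QNAME
        length = packet[pos]
        labels.append(packet[pos + 1 : pos + 1 + length].decode("ascii"))
        pos += 1 + length
    pos += 1  # step over the root label
    qtype, qclass = struct.unpack_from("!HH", packet, pos)  # two big-endian u16s
    return ".".join(labels), qtype, qclass

first = bytes.fromhex(
    "c438 0100 0001 0000 0000 0000 0765 7861 6d70 6c65 0363 6f6d 0000 1c00 01"
)
second = bytes.fromhex(
    "3ba9 0100 0001 0000 0000 0000 0765 7861 6d70 6c65 0363 6f6d 0000 0100 01"
)
print(parse_question(first))   # ('example.com', 28, 1) -> AAAA, class IN
print(parse_question(second))  # ('example.com', 1, 1)  -> A, class IN
```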

------
twic
> Interestingly, you can give it lots of different kinds of file descriptors
> (pipes, FIFOs, sockets, POSIX message queues, inotify instances, devices, &
> more), but not regular files. I think this makes sense – pipes & sockets
> have a pretty simple API (one process writes to the pipe, and another
> process reads!), so it makes sense to say “this pipe has new data for
> reading”. But files are weird! You can write to the middle of a file! So it
> doesn’t really make sense to say “there’s new data available for reading in
> this file”.

And yet kqueue and IO completion ports do support regular disk files [1]. It's
true that epoll's model doesn't make sense in the context of disk files, but
that doesn't mean that you can't do asynchronous notification on disk files;
it means that its model isn't general enough.

[1]
[http://people.eecs.berkeley.edu/~sangjin/2012/12/21/epoll-vs-kqueue.html](http://people.eecs.berkeley.edu/~sangjin/2012/12/21/epoll-vs-kqueue.html)
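
The epoll half of this is easy to check. A small sketch, assuming Linux (Python's `select.epoll`) and an ordinary temp file on a regular filesystem; `epoll_accepts` is a made-up helper name. On Linux, epoll_ctl fails with EPERM for a regular file, which Python surfaces as `PermissionError`.

```python
import os
import select
import tempfile

def epoll_accepts(fd):
    """Return True if epoll lets us register fd, False if it refuses with EPERM."""
    ep = select.epoll()
    try:
        ep.register(fd, select.EPOLLIN)
        return True
    except PermissionError:  # EPERM: this kind of file does not support epoll
        return False
    finally:
        ep.close()

with tempfile.TemporaryFile() as f:
    print(epoll_accepts(f.fileno()))  # False: regular disk file
r, w = os.pipe()
print(epoll_accepts(r))               # True: pipes are fine
```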

------
pjungwir
I'm reading this book right now too. It is sort of my bedtime reading. I've
used Linux since 2000 and Solaris before that, and I do occasional C but
mostly Ruby/Python/Java. At first a lot of it was stuff I knew, but reading it
still helped me make a few connections between things I was hazy on. One of my
favorite parts was about process groups, sessions, and controlling terminals;
it helped me understand shell job control and daemons a lot better. I also
learned a lot of new nuances about signal handlers. Now I'm on SysV
semaphores, and it has gotten a lot more interesting!

I would definitely recommend this to anyone working on Linux systems. Even if
you're using a higher-level language, I really think you get extra superpowers
if you can go down the stack a few levels, and it is great for that.

The author is one of the most thorough I've known. He gives you every caveat
and detail. I don't think there is anything he only pretends to know; every
statement feels verified.

------
api
Isn't all this just an artifact of the file descriptor API? What would happen
if programs could register callbacks to be called directly by the kernel in
any of N threads?

This would be a bit like signal driven I/O. I'm thinking nobody uses that
because it's single threaded and antiquated.

~~~
Ded7xSEoPKYNsDd
What difference do you propose between "just call this function" and "just
call this signal handler"?

My feeling is that nobody uses signal driven I/O because signal handling is
error prone and complicated. In fact, a very common pattern is to only set a
flag in the signal handler, which is checked at defined points in the main
program... not completely dissimilar from doing non-blocking calls at defined
points in the main program.
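
A rough sketch of that flag pattern in Python, assuming a Unix system (SIGUSR1) and code running in the main thread; the names `on_usr1` and `run_until_signalled` are made up for illustration. The handler only records that the signal arrived, and the loop looks at the flag at one defined point per iteration.

```python
import os
import signal
import time

got_signal = False

def on_usr1(signum, frame):
    # Keep the handler trivial: just set a flag for the main loop to notice.
    global got_signal
    got_signal = True

def run_until_signalled(max_ticks=100):
    """Return the tick at which the flag was seen, or None if it never was."""
    global got_signal
    got_signal = False
    previous = signal.signal(signal.SIGUSR1, on_usr1)
    try:
        os.kill(os.getpid(), signal.SIGUSR1)  # stand-in for an external event
        for tick in range(max_ticks):
            if got_signal:  # the one defined point where the flag is checked
                return tick
            time.sleep(0.01)  # stand-in for real work
        return None
    finally:
        if previous is not None:
            signal.signal(signal.SIGUSR1, previous)  # restore the old handler

print(run_until_signalled() is not None)
```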

------
SwellJoe
I follow the author on Twitter because her occasionally posted comics/zines
are _great_. Bite-sized chunks of actually correct technical information about
systems-level topics. I don't know that they are artistic or funny (not to be
overly critical, I merely mean aside from the knowledge they transfer, I don't
know if one would read them solely for enjoyment), but they are technically
excellent, and often a good supplement to the much drier technical docs they
derive from.

I like this style of writing, as well: "I'm learning this new thing, come with
me and I'll explain it as I figure it out."

------
sroussey
It would be a wonderful project for someone to bring epoll and kqueue (with a
fallback to select) to PHP (and its related libs).

