
Asynchronous I/O in Windows for Unix Programmers (2011) - luu
http://www.tinyclouds.org/iocp-links.html
======
kogir
Crazy that IOCPs were in NT 3.5 in 1994[1], but it took FreeBSD until 2000 to
add kqueue[2], and Linux until 2002 to add epoll[3][4].

I guess it just must not have mattered that much? Is it only recently that
machines became primarily I/O bound?

    
    
      [1]https://en.wikipedia.org/wiki/Windows_NT_3.5
      [2]https://en.wikipedia.org/wiki/History_of_FreeBSD
      [3]https://en.wikipedia.org/wiki/Epoll
      [4]https://lwn.net/Articles/13197/

~~~
chubot
I think being I/O bound is orthogonal to whether you need kqueue or epoll.

You could be I/O bound on a small number of simultaneous sockets, in which
case select() would work fine. Or you could be I/O bound on a large number of
them, in which case select() would take too much CPU by scanning the
descriptor table repeatedly.

[http://stackoverflow.com/questions/970979/what-are-the-differences-between-poll-and-select](http://stackoverflow.com/questions/970979/what-are-the-differences-between-poll-and-select)
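
To make that concrete, here is a minimal sketch of a select()-based loop
(handle_client() is a made-up callback, not from the article): the fd_set has
to be rebuilt before every call, and after select() returns every descriptor
has to be checked again, so each wakeup costs O(n) in the number of watched
sockets even if only one of them is ready.

    /* Sketch only: handle_client() is a hypothetical callback. */
    #include <sys/select.h>

    void handle_client(int fd);   /* hypothetical: service one readable socket */

    void event_loop(const int *fds, int nfds)
    {
        for (;;) {
            fd_set readfds;
            int maxfd = -1;

            FD_ZERO(&readfds);
            for (int i = 0; i < nfds; i++) {     /* rebuild the set every pass */
                FD_SET(fds[i], &readfds);
                if (fds[i] > maxfd)
                    maxfd = fds[i];
            }

            if (select(maxfd + 1, &readfds, NULL, NULL, NULL) <= 0)
                continue;                        /* EINTR etc.; errors elided */

            for (int i = 0; i < nfds; i++)       /* scan the whole set again  */
                if (FD_ISSET(fds[i], &readfds))  /* to find which fds fired   */
                    handle_client(fds[i]);
        }
    }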

~~~
kogir
Oh, that makes sense.

Older servers couldn't handle enough clients for connection-related limits to
matter.

------
majke
Actually, there is quite a bit of context here. IOCP had been a headache for
Unix programmers for quite a while. A few examples:

- libevent 2.0 was created specifically to support IOCP:
https://blog.torproject.org/blog/some-notes-progress-iocp-and-libevent

- Apache 2.0, likewise, was created to support the Windows threading model and
IOCP

- this particular article was written in the context of libev not supporting
IOCP, which was needed for decent node.js performance on Windows:
[https://nodejs.org/codeconf.pdf](https://nodejs.org/codeconf.pdf)

(correct me if I'm wrong!)

~~~
nitwit005
Porting Unix applications that used select/poll/epoll in a performant way was
a common enough problem that Microsoft added WSAPoll to make things easier:
[http://blogs.msdn.com/b/wndp/archive/2006/10/26/wsapoll.aspx](http://blogs.msdn.com/b/wndp/archive/2006/10/26/wsapoll.aspx)

The biggest abstraction headaches are usually non-blocking file I/O and DNS
lookups, as many OSes don't have any native support for doing them
asynchronously. You have to fall back to a thread pool.
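
The fallback usually looks something like the sketch below, which is only an
illustration of the general pattern (a single detached worker rather than a
real pool; struct dns_req and dns_lookup_start() are made-up names): the
blocking getaddrinfo() runs on a worker thread, and completion is signalled by
writing a byte to a pipe so the main poll()/select()/epoll loop can pick up
the result like any other fd.

    /* Thread fallback for a blocking DNS lookup: a worker thread runs
     * getaddrinfo() and writes one byte to a pipe when it is done, so the
     * event loop can poll() the pipe's read end like any other descriptor.
     * All names here are hypothetical, not from any particular library. */
    #include <netdb.h>
    #include <pthread.h>
    #include <string.h>
    #include <unistd.h>

    struct dns_req {
        char host[256];
        int notify_fd;              /* write end of the pipe the loop watches */
        struct addrinfo *result;    /* filled in by the worker on success */
        int error;                  /* getaddrinfo() return code */
    };

    static void *dns_worker(void *arg)
    {
        struct dns_req *req = arg;
        req->error = getaddrinfo(req->host, NULL, NULL, &req->result);
        write(req->notify_fd, "x", 1);           /* wake up the event loop */
        return NULL;
    }

    /* Kick off a lookup; the caller waits for POLLIN on the pipe's read end,
     * then inspects req->error and req->result. */
    int dns_lookup_start(struct dns_req *req, const char *host, int pipe_wr)
    {
        pthread_t tid;

        strncpy(req->host, host, sizeof(req->host) - 1);
        req->host[sizeof(req->host) - 1] = '\0';
        req->notify_fd = pipe_wr;
        req->result = NULL;

        if (pthread_create(&tid, NULL, dns_worker, req) != 0)
            return -1;
        return pthread_detach(tid);
    }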

------
csense
> regular file system files do not support non-blocking operations.
> (Disturbingly no man page mentions this rather important fact.)

This is the most disturbing thing I've read about UNIX file I/O in a long
time. Think about what the article is claiming:

- Non-blocking filesystem calls _don't work on files_!

- This fact _isn't documented_!

- The situation has persisted for a long time (I kind of assume it's been 10+
years since non-blocking I/O was introduced in popular UNIXes)

Can any gurus confirm these allegations the article makes?

~~~
phs2501
* Non-blocking filesystem calls do not in fact work on files. If you think about it for a minute this makes sense. Network data is flowing in to you asynchronously; it will show up even if you don't do anything (and trigger poll). But you have to request to read a file to make the OS go get it from the disk. This is why the async read model works for disk files where the poll/read model doesn't: a disk file is always reported as readable, since the data is already "there" on disk, even though actually reading it may still have to wait on the disk (see the sketch after this list). In Linux in particular the kernel does not (I think, I may be getting the explanation wrong) have a wait queue for page faults; it can only block the faulting process, so you can't do async (non-blocking) cached file I/O at all.

* It is not well documented in man pages. It's a relatively well-known problem if you google for it.

* The situation for traditional non-blocking reads of disk files has persisted because it is unsolvable in general: poll/read just doesn't work, for the reasons described above. You need async read/write, which Linux in particular does not have good support for (see above regarding cached reads). I think the situation on the BSDs may be better, but I'm not sure. There are a number of articles in the kernel section on LWN discussing async I/O that may be enlightening if you want to know more about what's been done here. ([http://lwn.net/Kernel/Index/#Asynchronous_IO](http://lwn.net/Kernel/Index/#Asynchronous_IO))
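
For what it's worth, the first point is easy to see for yourself. In the
sketch below (the path is just an example), the file is opened with
O_NONBLOCK, yet poll() reports it readable immediately, and the read() will
still wait on the disk if the data isn't cached rather than failing with
EAGAIN the way a non-blocking socket would.

    /* Quick check that O_NONBLOCK is a no-op for regular files: poll()
     * always reports POLLIN, and read() may still sleep on disk I/O
     * instead of returning EAGAIN.  The path is arbitrary. */
    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        int fd = open("/var/log/syslog", O_RDONLY | O_NONBLOCK);
        if (fd < 0) { perror("open"); return 1; }

        struct pollfd p = { .fd = fd, .events = POLLIN };
        int n = poll(&p, 1, 0);                   /* zero timeout: don't wait */
        printf("poll returned %d, revents=%#x\n", n, (unsigned)p.revents);

        ssize_t r = read(fd, buf, sizeof(buf));   /* may still block on disk */
        printf("read returned %zd\n", r);

        close(fd);
        return 0;
    }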

------
userbinator
_unlike the modern constant-time multiplexers like epoll_

I'm curious how exactly this works - I mean, if you have 10,000 file
descriptors and you need to process all of them, there's no way around doing
that processing 10,000 times. To me it appears O(n) is a lower bound, and what
epoll does is just move the selection loop from the application into the
kernel.

~~~
asveikau
The O(1) part is:

* It's O(1) for user mode to know which fd in the set is signaled after returning from the wait syscall.

With select() or poll(), the caller needs to traverse the whole list to find
out which fds are ready. With epoll(), kqueue(), or IOCP, you are draining a
queue of events; each event is removed from the queue in O(1), without
re-scanning the entire table.

* It's O(1) to add an fd to the set of watched fds, or to re-enter the "blocked" state.

Because select() and poll() pass the entire list of watched fds on every
call, the kernel side ends up copying and scanning the whole list at every
wait step [O(n)], whereas epoll/kqueue/IOCP just leave the set of fds in
kernel memory across calls.
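
For contrast, a bare-bones epoll loop looks roughly like this (handle_event()
is a made-up callback): each fd is registered once with epoll_ctl(), and each
epoll_wait() call returns only the fds that actually fired, so the work per
wakeup is proportional to the number of ready events rather than the number
of watched fds.

    /* Sketch of an epoll loop: fds are registered once up front, and
     * epoll_wait() hands back only the ready ones.  handle_event() is a
     * hypothetical callback, not part of the epoll API. */
    #include <sys/epoll.h>

    void handle_event(int fd);   /* hypothetical: service one ready fd */

    void event_loop(const int *fds, int nfds)
    {
        struct epoll_event ev, events[64];
        int epfd = epoll_create1(0);             /* error handling elided */

        for (int i = 0; i < nfds; i++) {         /* register each fd once */
            ev.events = EPOLLIN;
            ev.data.fd = fds[i];
            epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev);
        }

        for (;;) {                               /* then just drain events */
            int n = epoll_wait(epfd, events, 64, -1);
            for (int i = 0; i < n; i++)          /* n = number of READY fds, */
                handle_event(events[i].data.fd); /* not the number watched   */
        }
    }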

~~~
userbinator
I was curious enough about the O(1) claims to look at the source code[1], and
I don't think this is true:

 _It's O(1) to add an fd to the set of watched fds_

It uses a red-black tree to keep track of the set of watched fds associated
with each epoll fd, which makes insertion O(log n). However, waiting and
notification can be done in O(1).

[1] [http://lxr.free-electrons.com/source/fs/eventpoll.c](http://lxr.free-electrons.com/source/fs/eventpoll.c)

~~~
asveikau
OK, so I was being a bit fast and loose with my description of insert. What I
was going for is that it doesn't copy and scan the entire list every time.

It's also not _too_ hard to imagine a constant-time implementation; the API
would allow for it. poll, on the other hand, is broken at the API level: it
assumes O(n) work at every call and there's no good way to get around it.

