Yes, it's true that many APIs would theoretically allow kernel-level asynchronous I/O, but in practice the story is not so rosy.
* Asynchronous disk I/O is in practice often not actually asynchronous. Some of these cases are documented (https://support.microsoft.com/en-us/kb/156932), but asychronous I/O also actually blocks in cases that are not listed in that article (unless the disk cache is disabled). This is the reason that node.js always uses threads for file i/o.
* For sockets, the downside of the 'completion' model that windows is that the user must pre-allocate a buffer for every socket that it wants to receive data on. Open 10k sockets and allocate a 64k receive buffer for all of them - that adds up quickly. The unix epoll/kqueue/select model is much more memory-efficient.
* Many APIs may support asynchronous operation, but there are blatant omissions too. Try opening a file without blocking, or reading keyboard input.
* Windows has many different notification mechanisms, but none of them are both scalable and work for all types of events. You can use completion ports for files and sockets (the only scalable mechanism), but you need to use events for other stuff (waiting for a process to exit), and a completely different API to retrieve GUI events. That said, unix uses signals in some cases which are also near impossible to get right.
* Windows is overly modal. You can't use asynchronous operations on files that are open in synchronous mode or vice versa. That mode is fixed when the file/pipe/socket is created and can't be changed after the fact. So good luck if a parent process passes you a synchronous pipe for stdout - you must special case for all possible combinations.
* Not to mention that there aren't simple 'read' and 'write' operations that work on different types of I/O streams. Be ready to ReadFileEx(), Recv(), ReadConsoleInput() and whatnot.
IMO the Windows designers got the general idea to support asynchronous I/O right, but they completely messed up all the details.
> * Asynchronous disk I/O is in practice often not actually asynchronous. Some of these cases are documented (https://support.microsoft.com/en-us/kb/156932), but asychronous I/O also actually blocks in cases that are not listed in that article (unless the disk cache is disabled). This is the reason that node.js always uses threads for file i/o.
The key to NT asynchronous I/O is understanding that the cache manager, memory manager and file system drivers all work in harmony to allow a ReadFile() request to either immediately return the data if it is available in the cache, and if not, indicate to the caller that an overlapped operation has been started.
Things like extending a file, opening a file, that's not typically hot-path stuff. If you're doing a network oriented socket server, you would submit such a blocking operation to a separate thread pool (I set up separate thread pools for wait events, separate to the normal I/O completion thread pools), and then that I/O thread moves on to the next completion packet in its queue.
> * For sockets, the downside of the 'completion' model that windows is that the user must pre-allocate a buffer for every socket that it wants to receive data on. Open 10k sockets and allocate a 64k receive buffer for all of them - that adds up quickly. The unix epoll/kqueue/select model is much more memory-efficient.
Well that's just flat out wrong. You can set your socket buffer size as large or as small as you want. For PyParallel I don't even use an outgoing send buffer.
Also, the new registered I/O model in 8+ is a much better way to handle socket buffers without the constant memcpy'ing between kernel and user space.
> IMO the Windows designers got the general idea to support asynchronous I/O right, but they completely messed up all the details.
I disagree. Write a kernel driver on Linux and NT and you'll see how much more superior the NT I/O subsystem is.
The Microsoft article cited above (https://support.microsoft.com/en-us/kb/156932) directly contradicts you:
> Be careful when coding for asynchronous I/O because the system reserves the right to make an operation synchronous if it needs to. Therefore, it is best if you write the program to correctly handle an I/O operation that may be completed either synchronously or asynchronously.
Microsoft is directly saying that it reserves the right to violate the guarantee you are counting on at any time, and it documents several known cases of this. You can try to guess when this will happen and put those I/O operations on a different thread pool, but you're just playing whack-a-mole. And you're violating Microsoft's own recommendations.
You wouldn't be using compression or encryption for a file that you wanted to be able to submit asynchronous file I/O writes to in a highly concurrent network server. Those have to be synchronous operations. You'd do everything you can to use TransmitFile() on the hot path.
If you need to sequentially write data, wanted to employ encryption or compression, and reduce the likelihood of your hot-path code blocking, you'd memory map file-sector-aligned chunks at a time, typically in a windowed fashion, such that when you consume the next one you submit threadpool work to prepare the one after that (which would extend the file if necessary, create the file mapping, map it as a view, and then do an interlocked push to the lookaside list that the hot-path thread will use).
I use that technique, and also submit prefaults in a separate threadpool for the page ahead of the next page as I consume records I'm writing to. Before you can write to a page, it needs to be faulted in, and that's a synchronous operation, so you'd architect it to happen ahead of time, before you need it, such that your hot-path code doesn't get blocked when it writes to said page.
That works incredibly well, especially when you combine it with transparent NTFS compression, because the file system driver and the memory manager are just so well integrated.
If you wanted to do scatter/gather random I/O asynchronously, you'd pre-size the file ahead of time, then simply dispatch asynchronous writes for everything, possibly leveraging SetFileIoOverlappedRange such that the kernel locks all the necessary sections into memory ahead of time.
And finally, what's great about I/O completion ports in general is they are self-aware of their concurrency. The rule is always "never block". But sometimes, blocking is inevitable. Windows can detect when a thread that was servicing an I/O completion port has blocked and will automatically mark another thread as runnable so the overall concurrency of the server isn't impacted (or rather, other network clients aren't impacted by a thread's temporary blocking). The only service that's affected is to the client that triggered whatever blocking I/O call there was -- it would be indistinguishable (from a latency perspective) to other clients, because they're happily being picked up by the remaining threads in the thread pool.
I describe that in detail here: https://speakerdeck.com/trent/pyparallel-how-we-removed-the-...
> > Be careful when coding for asynchronous I/O because the system reserves the right to make an operation synchronous if it needs to. Therefore, it is best if you write the program to correctly handle an I/O operation that may be completed either synchronously or asynchronously.
That's not the best wording they've used given the article is also talking about blocking. If you've followed my guidelines above, a synchronous return is actually advantageous for file I/O because it means your request was served directly from the cache, and no overlapped I/O operation had to be posted.
And you know all of the operations that will block (and they all make sense when you understand what the kernel is doing behind the scenes), so you just don't do them on the hot path. It's pretty straight forward.
I wrote Windows drivers and file systems for about 10 years, and Unix drivers and file systems also for about 10 years.
I'd rather practice substance agriculture for the rest of my life than deal with Windows drivers again.
Can programming against the userspace interface the I/O subsystem really be compared to programming against the kernel driver interface to I/O subsystem? In Linux, kernel drivers have access to structures, services, and layers that userspace doesn't. And can these be compared between a monolithic and a micro-kernel approach, other than what has been debated ad nauseam for micro/monolithic kernels in general (not just used for I/O)?
I just meant that writing an NT kernel driver will really give you an appreciation of what's going on behind the scenes in order to facilitate awesome userspace things like overlapped I/O, threadpool completion routines, etc.