You can do hundreds of thousands of random reads a second against an in-memory data set from a single thread submitting tasks to a thread pool. For a larger-than-memory data set on an SSD you can do tens of thousands of reads a second; I was able to get the advertised number of 4K IOPS out of the SSD (Crucial m4) with an Intel i5 desktop CPU.
I frequently have to multiplex data as it becomes available into a single file (to keep the IO sequential for the disk and filesystem). I always use a thread per file, and I got up to 250 megabytes/sec on a 4-disk RAID-0. I don't currently have a use case that needs more sequential write throughput than that, so I haven't tried attaching more disks, and SSDs weren't as fast or common at the time.
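For anyone who wants to try the thread pool approach, here is a rough sketch of the pattern: one thread submits random 4K reads to a pool and collects the results. This is not the code behind the numbers above. The names `read_block`, `random_reads`, and `make_test_file`, the pool size of 8, and the block size are all assumptions for illustration, and `os.pread` is only available on Unix-like systems.

```python
import os
import random
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4096  # assumed read size (4K blocks)


def read_block(fd, block_index):
    """Read one block. os.pread takes an explicit offset, so worker
    threads can share the descriptor without seeking."""
    return os.pread(fd, BLOCK_SIZE, block_index * BLOCK_SIZE)


def random_reads(path, num_reads, workers=8, seed=0):
    """Submit num_reads random block reads from the calling thread to a
    pool and return the total number of bytes read."""
    num_blocks = os.path.getsize(path) // BLOCK_SIZE
    rng = random.Random(seed)
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [
                pool.submit(read_block, fd, rng.randrange(num_blocks))
                for _ in range(num_reads)
            ]
            return sum(len(f.result()) for f in futures)
    finally:
        os.close(fd)


def make_test_file(num_blocks):
    """Hypothetical helper: create a scratch file of random data."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(os.urandom(num_blocks * BLOCK_SIZE))
        return f.name
```

Wrapping a call to `random_reads` in a timer gives a crude reads-per-second figure, though a small scratch file will sit in the page cache and so measures the in-memory case rather than the SSD.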
My reading of buffered IO in Linux is that it translates to a combination of page cache interactions and async IO under the hood so we are technically always using async IO.
While it's true that the Windows API seems to be the best thought through, I was surprised to learn that the implementation may randomly fall back to synchronous IO in unpredictable ways, which (depending on the app, but likely for something that's attempting to juggle a lot of work like a bittorrent implementation) means you need a thread pool anyway.
We ran into issues where lots of threads attempting disk writes were causing latency problems. We were able to get around this by having a single thread dedicated to disk IO.
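A minimal sketch of what a single dedicated disk IO thread can look like, assuming producers hand complete chunks to a queue and never touch the file themselves. The `DiskWriter` class and its method names are hypothetical, not the code described above.

```python
import queue
import threading


class DiskWriter:
    """Funnels writes from many threads through one writer thread."""

    def __init__(self, path):
        self._queue = queue.Queue()
        self._file = open(path, "ab")
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def write(self, data):
        # Called from any thread; only enqueues, so it does not block on disk.
        self._queue.put(data)

    def _run(self):
        # The only thread that writes to the file, so writes stay in order
        # and never interleave within a chunk.
        while True:
            data = self._queue.get()
            if data is None:  # sentinel pushed by close()
                break
            self._file.write(data)

    def close(self):
        # Drain everything queued so far, then stop the thread.
        self._queue.put(None)
        self._thread.join()
        self._file.close()
```

The queue here is unbounded, so a slow disk shows up as memory growth rather than producer latency; passing a `maxsize` to `queue.Queue` would trade that for back-pressure on the producers.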
As a result, there is little incentive to improve AIO (on all systems, but especially on Linux --- a lot of the Direct I/O work was done to make the enterprise database vendors happy). And since AIO isn't good enough, very few people want to use it, and since making it better is difficult, few people are interested in working to solve the problem, and the cycle repeats again.
I'm also not surprised that Windows fared better here; with IOCP they had a chance to redo async I/O completely.
The Evolution of MonoTorrent - FOSDEM 2010
Simplified Threading API
This sounds like a bad idea. If the improvements aren't tied to async, why weigh them down with the async albatross instead of merging them into the mainline?
In general, unless you're doing something super high-performance, you should not bother with AIO. It's kind of one of those "if you have to ask, you don't need to know," situations.