My experience with Linux and buffered IO (ext4) from multiple threads has been very positive. The only beef I have is that you can't prevent the data you write/read from polluting the page cache without resorting to posix_fadvise (or madvise for mappings), neither of which is available from Java. I don't usually care about the contents of the page cache, so it isn't a showstopper.
You can do hundreds of thousands of random reads a second from a single thread submitting tasks to a thread pool, on an in-memory data set. For a larger-than-memory data set on an SSD you can do tens of thousands of reads a second; I was able to get the advertised 4K random-read IOPS out of the SSD (a Crucial m4) with an Intel i5 desktop CPU.
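A minimal Java sketch of that pattern, assuming one shared FileChannel and positional reads (the file name, block count, and task count here are illustrative, not from the original setup):

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.*;

// Sketch: random 4k reads submitted to a thread pool over one shared
// FileChannel. The positional read(buf, pos) overload is thread-safe, so no
// per-thread file handles or seeks are needed. Sizes are illustrative.
public class RandomReads {
    public static void main(String[] args) throws Exception {
        Path path = Files.createTempFile("reads", ".dat");
        int blocks = 256;
        Files.write(path, new byte[4096 * blocks]);   // 1 MiB test file

        ExecutorService pool = Executors.newFixedThreadPool(4);
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            CompletionService<Integer> done = new ExecutorCompletionService<>(pool);
            int tasks = 1000;
            for (int i = 0; i < tasks; i++) {
                done.submit(() -> {
                    ByteBuffer buf = ByteBuffer.allocate(4096);
                    long pos = 4096L * ThreadLocalRandom.current().nextInt(blocks);
                    return ch.read(buf, pos);         // positional: no shared cursor
                });
            }
            long total = 0;
            for (int i = 0; i < tasks; i++) total += done.take().get();
            System.out.println(total);                // 1000 * 4096 = 4096000
        } finally {
            pool.shutdown();
            Files.delete(path);
        }
    }
}
```

The key detail is using the positional read overload rather than a shared file position, which is what makes a single channel safe to share across the pool.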
I frequently have to multiplex data as it becomes available into a single file (to keep the IO sequential for the disk and filesystem). I always use one thread per file, and I got up to 250 megabytes/sec on a 4-disk RAID-0. I don't currently have a use case for more sequential write throughput than that, so I haven't tried attaching more disks; SSDs weren't as fast or as common at the time.
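The thread-per-file multiplexing could look something like this sketch: producers hand chunks to a queue, and a single writer thread drains it so the on-disk write pattern stays sequential. The zero-length poison pill and all sizes are my assumptions for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: several producers hand chunks to one writer thread per file so the
// write pattern the disk sees is sequential. A zero-length chunk is the
// (hypothetical) poison pill that tells the writer to stop.
public class SingleWriter {
    public static void main(String[] args) throws Exception {
        Path path = Files.createTempFile("mux", ".dat");
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(1024);

        Thread writer = new Thread(() -> {
            try (FileChannel ch = FileChannel.open(path,
                    StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
                while (true) {
                    byte[] chunk = queue.take();
                    if (chunk.length == 0) return;       // poison pill
                    ch.write(ByteBuffer.wrap(chunk));    // sequential append
                }
            } catch (Exception e) { throw new RuntimeException(e); }
        });
        writer.start();

        // Four producers, each contributing 100 chunks of 4 KiB.
        Thread[] producers = new Thread[4];
        for (int p = 0; p < producers.length; p++) {
            producers[p] = new Thread(() -> {
                try {
                    for (int i = 0; i < 100; i++) queue.put(new byte[4096]);
                } catch (InterruptedException e) { throw new RuntimeException(e); }
            });
            producers[p].start();
        }
        for (Thread t : producers) t.join();
        queue.put(new byte[0]);                          // stop the writer
        writer.join();

        System.out.println(Files.size(path));            // 4 * 100 * 4096 = 1638400
        Files.delete(path);
    }
}
```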
My reading of buffered IO in Linux is that it translates to a combination of page cache interactions and async IO under the hood, so technically we are always using async IO.
I actually very much doubt that they can do better. Writing any sort of large LRU cache on a machine with swap turned on is a bad idea: a lot of your cache will get swapped out and then swapped back in unnecessarily when you try to use that memory for something else. mlock can be used to mitigate this, but the default mlock limit is so low that it's useless in practice. Another thing to consider is that using mmap gives you a big advantage over writing your own disk cache, because the OS can take advantage of the paging hardware in the processor, which you can't do from userspace.
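In Java terms, letting the kernel do the caching could look like this sketch using FileChannel.map: page faults bring data in on demand and the kernel evicts under memory pressure, with the MMU doing the lookups a userspace LRU would do in software. The file contents here are made up for illustration:

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: mapping a file instead of maintaining a userspace LRU cache. The
// first access to each page triggers a fault that pulls it from the page
// cache (or disk); the kernel handles eviction, not your code.
public class MappedCache {
    public static void main(String[] args) throws Exception {
        Path path = Files.createTempFile("cache", ".dat");
        byte[] data = new byte[1 << 20];                 // 1 MiB of zeroes
        data[12345] = 42;
        Files.write(path, data);

        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            System.out.println(map.get(12345));          // faults the page in
        }
        Files.delete(path);
    }
}
```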
One reason the bittorrent layer could make better decisions about what to flush is that it always needs to hash pieces that are downloaded. When blocks are downloaded out of order, it doesn't make sense to flush unhashed blocks before ones the hash cursor has already passed. If an unhashed block is flushed, it has to be read back from disk when the piece completes, which is very expensive.
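That flush policy boils down to a simple predicate. A hypothetical sketch (names and offsets are mine, not from any actual bittorrent implementation): a block is cheap to flush only once the hash cursor has passed it, because flushing it earlier forces a re-read when the piece completes.

```java
// Hypothetical illustration of the policy described above: blocks entirely
// below the hash cursor have already been hashed and are safe to flush;
// blocks ahead of the cursor should stay in memory until hashed.
public class FlushPolicy {
    static boolean safeToFlush(long blockOffset, int blockLen, long hashCursor) {
        return blockOffset + blockLen <= hashCursor;     // already hashed
    }

    public static void main(String[] args) {
        long hashCursor = 64 * 1024;                     // hashed up to 64 KiB
        System.out.println(safeToFlush(0, 16 * 1024, hashCursor));          // behind cursor
        System.out.println(safeToFlush(96 * 1024, 16 * 1024, hashCursor));  // ahead of cursor
    }
}
```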