You can do hundreds of thousands of random reads a second from a single thread submitting tasks to a thread pool on an in memory data set. You can do tens of thousands of reads for a > memory data set with an SSD and I was able to get the advertised number of 4k IOPs out of the SSD (Crucial m4) and an Intel i5 desktop CPU.
I frequently have to multiplex data as it becomes available into a single file (to keep the IO sequential for the disk and filesystem) and I always use a thread per file and I got up to 250 megabytes/sec on a 4 disk RAID-0. I don't currently have a use case for needing more sequential write throughput than that so I haven't tried to attaching more disk and SSDs weren't as fast or common at the time.
My reading of buffered IO in Linux is that it translates to a combination of page cache interactions and async IO under the hood so we are technically always using async IO.