I'm not so sure of your policework on mmap() vs. read():
* The "extra copies" you make with read happen in the L1/L2 cache, making them comparable to register spills. Buffer copying just isn't expensive.
* (and here I start paraphrasing Matt Dillon) On the other hand, you are absolutely going to take extra page faults and blow your TLB if you address a huge file using mmap, which is not only going to slow down I/O but also hurt performance elsewhere.
It seems to me like you did mmap() so you could easily externalize vectors and maps. Which actually leads me to a second question:
Isn't the reason people use things like B-Trees that they are optimized to touch the disk a minimal number of times? Isn't that kind of not the case with a C++ container "ported" to the disk?
I was planning a big blog entry just on the backend options that we used since I tried several combinations of I/O backends with different numbers of reader threads in our profiling dataset with different I/O elevator scheduling algorithms (switching the algorithms, disappointingly, had a negligible effect on performance and I/O throughput degraded when increasing the number of reader threads to more than twice the number of active cores) -- but that kind of slipped into the background as we started filling out the bits of the database to give it an acceptable level of robustness.
The hashing scheme that we're using is optimized for keeping a tight memory profile -- and hence disk profile. Again, much of the rationale was applied after the fact to try to explain the results of profiling. At first we tried things with B-trees and with the combination of the VM's paging and our access patterns the hashes were faster. It's possible that if we were using the direct I/O APIs that many databases use and doing all of our own caching internally that we'd be able to achieve higher throughput with B-trees.
In our case, letting the OS handle our caching and keeping identical C++ structures to our disk structures simplified the code enough to merit leaving things this way for the time being. At the end of the day, our product is a recommendation engine, not a database, so we'd like to keep the codebase relatively lean.
So, yeah, a lot of the explanations are applied after the fact from what I know of systems programming, but the results were validated through actual test runs through multiple competing backend implementations. The one we described gave the best overall results.
Scott's conclusions agree with my experiences very well: if you design around mmap(), and let the system handle the caching, you can end up with something several times faster than the traditional alternatives. This isn't to say that your criticisms are completely wrong, just that they don't match up with the actual testing.
* "extra copies" [are cheap]
True, but the real cost is the greater memory footprint. Less application buffering means more room for cached pages. And this cache is transparent across multiple processes.
* extra page faults
I think the opposite turns out to be true. Letting the system handle the buffering results in more cache hits, since the memory is used more efficiently.
* blow your TLB
Theoretically a problem, but in practice one doesn't linearly access the entire file. The beauty of mmap() is that it allows for brilliantly efficient non-sequential access.
* B-trees vs C++ containers
While it's true that you have to think closely about the memory layout of your containers, if you do so the access patterns can be even better than a B-Tree. If the container has been designed for efficient memory-access with regard to cache-lines and cache-sizes, it tends to have great disk-access as well.
What's really beautiful about the mmap() approach is the simplicity it offers. In this model, RAM can be viewed as a 16 Gig L4 cache, and disk as a multi-Terabyte L5. Just as one currently writes code that doesn't distinguish between a fetch from L1 and a fetch from main memory, mmap() allows extending this syntax all the way to a fetch from disk.
Now, this doesn't mean that one can just substitute mmap() for fread() and get any significant improvement. One needs to re-optimize the data structures as well. But the nice part is that these techniques are the same techniques used to optimize existing cache accesses, and certain 'cache-oblivious' algorithms already work out of the box.
Anyway, thanks to Scott for the writeup!
Each opens a 10M file and accesses aligned pages. Depending on how many bytes in the page you ask the mmap() case to touch, mmap ranges from 10x faster to 10x slower for me. Reading straight through without seeking, it's no contest for me; read() wins. But you knew that.
I was having trouble comparing results, so I combined your two into one, tried to make the cases more parallel, took out the alarm() stuff, and just ran it under oprofile.
My conclusions were that for cases like this, where the file is small enough to remain in cache, there really isn't any difference between the performance of read() and mmap(). I didn't find any of 10x differences you found, found that the mmap() version ranged from twice as fast for small chunks to about equal for full pages.
You might argue that I'm cheating a little bit, as I'm using memcpy() to extract from the mmap(). When I don't do this, the read() version often comes out up to 10% faster. But I'm doing it so that the code in the loop can be more similar --- I presume that a buf can optimize better.
I'd be interested to know how you constructed the case where read() was 10x faster than mmap(). This doesn't fit my mental model, and if it's straight up, I'd be interested in understanding what causes this. For example, even when I go to linear access, I only see read() being 5% faster.
In current Linux, I'm pretty sure both of them use the same underlying page cache. fread() adds a small amount of management overhead, but read() does just as much system level buffering. mmap() uses the same cache, but just gives direct access to it.
But it's possible I'm wrong, and I don't seem to be able to find a solid source for this online. This page references this, though:
I feel like I've read other more explicit descriptions, although possibly offline.
I don't follow the rest of your caching arguments, though. read(2) exploits the buffer cache; in fact, the rap on mmap() is that it makes worse use of the buffer cache, because it doesn't provide the kernel with enough information to read ahead. Apocryphal, though.
The big issue is that the mmap() case is much more demanding on the VM system. You're thinking only of the buffer cache when you talk about caching, but the X86 is also at pains to cache the page directory hierarchy (that's what the TLB is doing). Hopping all over your process' address space rips up the TLB, which is expensive. There are also hardware cycle penalties for dicking with page table entries.