When you say "fread()", I wonder whether you're considering that fread() does its own stdio buffering in userland, above and beyond the small window of memory you reuse on every read (and which is going to stay in-cache) when you use the read(2) syscall directly.
Each opens a 10M file and accesses aligned pages. Depending on how many bytes in the page you ask the mmap() case to touch, mmap ranges from 10x faster to 10x slower for me. Reading straight through without seeking, it's no contest for me; read() wins. But you knew that.
I was having trouble comparing results, so I combined your two into one, tried to make the cases more parallel, took out the alarm() stuff, and just ran it under oprofile.
My conclusions were that for cases like this, where the file is small enough to remain in cache, there really isn't much difference between the performance of read() and mmap(). I didn't find any of the 10x differences you found; instead, the mmap() version ranged from about twice as fast for small chunks to about equal for full pages.
You might argue that I'm cheating a little bit, as I'm using memcpy() to extract from the mmap(). When I don't do this, the read() version often comes out up to 10% faster. But I'm doing it so that the code in the loop can be more similar --- I presume that working out of a local buffer lets the compiler optimize better.
I'd be interested to know how you constructed the case where read() was 10x faster than mmap(). This doesn't fit my mental model, and if it's straight up, I'd be interested in understanding what causes this. For example, even when I go to linear access, I only see read() being 5% faster.
I went back and forth on whether to use read() or fread() in my example, and I wasn't sure which to choose. For the purposes of this example, I don't think there is a functional difference between them.
In current Linux, I'm pretty sure both of them use the same underlying page cache. fread() adds a small amount of management overhead, but read() does just as much system level buffering. mmap() uses the same cache, but just gives direct access to it.
stdio does its own buffering, which is why you have to turn output buffering off with setbuf() when you want to do debug prints. But I may be on crack in the read case, vs. the write.
I don't follow the rest of your caching arguments, though. read(2) exploits the buffer cache; in fact, the rap on mmap() is that it makes worse use of the buffer cache, because it doesn't provide the kernel with enough information to read ahead. Apocryphal, though.
The big issue is that the mmap() case is much more demanding on the VM system. You're thinking only of the buffer cache when you talk about caching, but the X86 is also at pains to cache the page directory hierarchy (that's what the TLB is doing). Hopping all over your process' address space rips up the TLB, which is expensive. There are also hardware cycle penalties for dicking with page table entries.