I think that MAP_POPULATE here will fill the page table with entries up front, rather than leaving it empty and letting the CPU trap (almost) every time a new page is accessed. That's roughly 260k fewer page-fault interrupts for a 1 GB file (2^30 bytes / 4 KB pages = 262,144).
MAP_POPULATE will probably also do the whole disk read in one go rather than in a lazy+speculative manner.
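A minimal sketch of the kind of mapping I mean (the sum loop is my stand-in for the actual benchmark; error handling mostly omitted):

    /* Sketch: map a file with MAP_POPULATE so the page table is filled
     * (and the read-ahead happens) up front, instead of faulting page
     * by page during the sweep. MAP_POPULATE is Linux-specific. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        uint8_t *p = mmap(NULL, st.st_size, PROT_READ,
                          MAP_PRIVATE | MAP_POPULATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        uint64_t sum = 0;
        for (off_t i = 0; i < st.st_size; i++)
            sum += p[i];
        printf("%llu\n", (unsigned long long)sum);

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }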
Page size is probably not affected, and neither is the number of TLB misses. That said, in my testing the size of the file (and the mapping) did seem to affect the page size: a 4 GB file had significantly fewer page-fault interrupts than a 500 MB file.
And obviously, MAP_POPULATE is bad if physical memory is getting exhausted.
> in my testing the size of the file (and the mapping) did seem to affect the page size
I'm doubtful of this, although it might depend on how you have "transparent huge pages" configured. But even then, I don't think Linux currently supports huge pages for file-backed memory. I think something else is causing the difference you see. Maybe just the fact that the active TLB entries no longer fit in L1?
I'm confused by this, but it does appear to be the case. It seems strange to me that MAP_POPULATE|MAP_NONBLOCK is no longer possible. I was slow to realize this may be closely related to Linus's recent post: https://plus.google.com/+LinusTorvalds/posts/YDKRFDwHwr6
The best wall times (that is, with OS time included) I get come from reading L1-sized chunks into a small buffer instead of using mmap. YMMV.
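Something like this, roughly (the 32 KB chunk size is an assumption; tune it to your L1):

    /* Sketch: stream the file through a small, L1-resident buffer with
     * plain read() calls; the kernel's read-ahead keeps the disk busy. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define CHUNK (32 * 1024)   /* assumed L1d size; tune per machine */

    int main(int argc, char **argv)
    {
        int fd = open(argv[1], O_RDONLY);
        static uint8_t buf[CHUNK];
        uint64_t sum = 0;
        ssize_t n;

        while ((n = read(fd, buf, sizeof buf)) > 0)
            for (ssize_t i = 0; i < n; i++)
                sum += buf[i];

        printf("%llu\n", (unsigned long long)sum);
        close(fd);
        return 0;
    }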
For a 1 GB file, I get wall times of:
Original: 0.22 sec
MAP_POPULATE: 0.17 sec
Hugepages: 0.11 sec
Hugepages with prefetch: 0.07 sec
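The "with prefetch" variant's inner loop presumably looks something like this (sum_with_prefetch and PREFETCH_DIST are my names and a guessed distance, not the original code):

    /* Hypothetical inner loop for the "with prefetch" case: issue a
     * software prefetch a fixed distance ahead of the running sum.
     * On x86, prefetch hints don't fault, so running a little past
     * the end of the mapping is harmless. */
    #include <stddef.h>
    #include <stdint.h>

    #define PREFETCH_DIST 512   /* bytes ahead; guessed, tune per machine */

    static uint64_t sum_with_prefetch(const uint8_t *p, size_t len)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < len; i += 64) {   /* one cache line per step */
            __builtin_prefetch(p + i + PREFETCH_DIST, 0 /* read */, 0);
            size_t end = (i + 64 < len) ? i + 64 : len;
            for (size_t j = i; j < end; j++)
                sum += p[j];
        }
        return sum;
    }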
It means walking whatever structures the OS uses to keep things in cache. We generally neither know what those structures are nor control them.
> What I don't understand is what MAP_POPULATE is doing that gets the speedup that it does.
MAP_POPULATE minimizes the number of page faults during the main loop, which are far more expensive (each one traps into the kernel) than TLB misses. Plus, TLB misses can be avoided in our loop, especially with such a friendly linear sweep.
The main problem here, in my view: trying to coax the OS into using memory the way we want. Huge pages surely help in that regard, but they help the most in code that we do not control. The sum itself over 1 GB of memory would be roughly the same speed, regardless of page size.
To put this to the test: generate 1 GB of random bytes on the fly, instead of reading them from a file, and do the same sum. Does the speed change much with the page size? I'd be interested in the results, especially if accompanied by fine-grained performance counter data.
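A sketch of what I have in mind, using madvise(MADV_HUGEPAGE) as the page-size knob (names and sizes are illustrative, not a definitive harness):

    /* Sketch of the experiment: sum 1 GB of randomly generated bytes in
     * anonymous memory, no file involved. Toggle the madvise() line to
     * compare 4 KB pages against transparent huge pages. */
    #define _GNU_SOURCE
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define LEN (1UL << 30)   /* 1 GB */

    int main(void)
    {
        uint8_t *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        madvise(p, LEN, MADV_HUGEPAGE);   /* the page-size knob */

        srand(42);
        for (size_t i = 0; i < LEN; i++)  /* generation: keep out of the timed region */
            p[i] = (uint8_t)rand();

        uint64_t sum = 0;                 /* the measured linear sweep */
        for (size_t i = 0; i < LEN; i++)
            sum += p[i];
        printf("%llu\n", (unsigned long long)sum);
        return 0;
    }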
Yes, I'm pretty sure this is the case, and I had in fact been assuming it is the major effect. It's a little trickier to measure than the file case, since you don't want to include the random-number generation in the measurement. That essentially rules out 'perf', but luckily 'likwid' works great for this.
I'll try to post some numbers here in the next hour or so.
What performance counters are you interested in?
Here's the standard case:
sudo likwid-perfctr -C 1 -g DTLB_LOAD_MISSES_WALK_COMPLETED:PMC0,DTLB_LOAD_MISSES_WALK_DURATION:PMC1 -m ./anonpage
| RDTSC Runtime [s]               | 0.0732768   |
| INSTR_RETIRED_ANY               | 5.78816e+08 |
| CPU_CLK_UNHALTED_CORE           | 2.62541e+08 |
| DTLB_LOAD_MISSES_WALK_COMPLETED | 262211      |
| DTLB_LOAD_MISSES_WALK_DURATION  | 5.52261e+06 |
And the huge-page case:
sudo likwid-perfctr -C 1 -g DTLB_LOAD_MISSES_WALK_COMPLETED:PMC0,DTLB_LOAD_MISSES_WALK_DURATION:PMC1 -m ./hugepage
| RDTSC Runtime [s]               | 0.0716703   |
| INSTR_RETIRED_ANY               | 5.78816e+08 |
| CPU_CLK_UNHALTED_CORE           | 2.56794e+08 |
| DTLB_LOAD_MISSES_WALK_COMPLETED | 63          |
| DTLB_LOAD_MISSES_WALK_DURATION  | 4891        |