It means walking whatever structures the OS uses to keep things in cache. We generally don't know what those structures are, nor do we control them.
> What I don't understand is what MAP_POPULATE is doing that gets the speedup that it does.
MAP_POPULATE minimizes the number of page faults during the main loop, and those are much more expensive than TLB misses, since each one traps into the kernel. Plus, the cost of TLB misses is largely hidden in our loop, especially with such a friendly linear sweep.
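To make that concrete, the whole difference is a single mmap flag. A minimal sketch (my own code, not the article's benchmark):

    /* Map a file and sum its bytes.  With MAP_POPULATE the kernel pre-faults
     * every page inside mmap(), so the sum loop takes (almost) no page faults;
     * drop the flag and each first touch of a page costs a trip into the kernel. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Remove MAP_POPULATE to watch the page-fault cost move into the loop. */
        unsigned char *p = mmap(NULL, st.st_size, PROT_READ,
                                MAP_PRIVATE | MAP_POPULATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        unsigned long long sum = 0;
        for (off_t i = 0; i < st.st_size; i++)
            sum += p[i];

        printf("sum = %llu\n", sum);
        return 0;
    }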
The main problem here, in my view, is trying to coax the OS into using memory the way we want. Huge pages certainly help in that regard, but they help the most in code that we do not control. The sum itself over 1 GB of memory would be roughly the same speed regardless of page size.
To put this to the test: generate 1 GB of random bytes on the fly, instead of reading them from a file, and do the same sum. Does the speed change much with the page size? I'd be interested in the results, especially if accompanied by fine-grained performance counter data.
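Something along these lines is what I have in mind (just a sketch; the xorshift PRNG and the MAP_HUGETLB hint are arbitrary choices on my part):

    /* Fill 1 GiB of anonymous memory with pseudo-random bytes, then time only
     * the summing pass.  Add MAP_HUGETLB (with huge pages reserved), or rely on
     * transparent huge pages, to compare page sizes. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <time.h>

    #define SIZE (1ULL << 30)   /* 1 GiB */

    int main(void)
    {
        unsigned char *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        /* Generation phase: cheap xorshift PRNG.  Not part of the measurement. */
        unsigned long long x = 88172645463325252ULL;
        for (size_t i = 0; i < SIZE; i++) {
            x ^= x << 13; x ^= x >> 7; x ^= x << 17;
            buf[i] = (unsigned char)x;
        }

        /* Measurement phase: the same linear sweep as the file-backed version. */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        unsigned long long sum = 0;
        for (size_t i = 0; i < SIZE; i++)
            sum += buf[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("sum = %llu in %.4f s\n", sum, secs);
        return 0;
    }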
Yes, I'm pretty sure this is the case, and had in fact been assuming it was the major effect. It's a little trickier to measure than the file-backed case, since you don't want to include the random number generation in the measurement. That essentially rules out 'perf', but luckily 'likwid' works great for this.
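Concretely, you wrap only the sum loop in a LIKWID marker region so the generation phase stays outside the counts. Roughly like this (a sketch with placeholder names; older LIKWID releases ship the macros in likwid.h rather than likwid-marker.h):

    /* Build with:  gcc -O2 -DLIKWID_PERFMON sum.c -llikwid
     * Run under:   likwid-perfctr ... -m ./sum                          */
    #include <stdio.h>
    #include <stdlib.h>
    #include <likwid-marker.h>

    #define SIZE (1ULL << 30)

    int main(void)
    {
        unsigned char *buf = malloc(SIZE);
        if (!buf) return 1;

        for (size_t i = 0; i < SIZE; i++)     /* generation: not measured */
            buf[i] = (unsigned char)(i * 2654435761u);

        LIKWID_MARKER_INIT;
        LIKWID_MARKER_START("sum");           /* counters cover only this region */
        unsigned long long sum = 0;
        for (size_t i = 0; i < SIZE; i++)
            sum += buf[i];
        LIKWID_MARKER_STOP("sum");
        LIKWID_MARKER_CLOSE;

        printf("sum = %llu\n", sum);
        free(buf);
        return 0;
    }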
I'll try to post some numbers here in the next hour or so.
What performance counters are you interested in?
Here's the standard case (4 KiB pages):
sudo likwid-perfctr -C 1 -g DTLB_LOAD_MISSES_WALK_COMPLETED:PMC0,DTLB_LOAD_MISSES_WALK_DURATION:PMC1 -m ./anonpage
| RDTSC Runtime [s]               | 0.0732768   |
| INSTR_RETIRED_ANY               | 5.78816e+08 |
| CPU_CLK_UNHALTED_CORE           | 2.62541e+08 |
| DTLB_LOAD_MISSES_WALK_COMPLETED | 262211      |
| DTLB_LOAD_MISSES_WALK_DURATION  | 5.52261e+06 |
And the huge-page case:

sudo likwid-perfctr -C 1 -g DTLB_LOAD_MISSES_WALK_COMPLETED:PMC0,DTLB_LOAD_MISSES_WALK_DURATION:PMC1 -m ./hugepage
| RDTSC Runtime [s]               | 0.0716703   |
| INSTR_RETIRED_ANY               | 5.78816e+08 |
| CPU_CLK_UNHALTED_CORE           | 2.56794e+08 |
| DTLB_LOAD_MISSES_WALK_COMPLETED | 63          |
| DTLB_LOAD_MISSES_WALK_DURATION  | 4891        |

So huge pages take the walk count from roughly one per 4 KiB page (262211 ≈ 1 GiB / 4 KiB) down to essentially zero, but those walks only cost about 5.5e6 out of 2.6e8 cycles to begin with, around 2%, which matches the ~2% runtime difference.