It's moving page faults away from the main addition loop, which is what I'm interested in measuring anyway. It also reads the whole file in one go, instead of page by page with the default lazy approach.
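In code, the MAP_POPULATE variant looks roughly like this (a sketch rather than my exact benchmark; the file name and the minimal error handling are just placeholders):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(void) {
      int fd = open("data.bin", O_RDONLY);   /* placeholder file name */
      struct stat st;
      fstat(fd, &st);

      /* MAP_POPULATE asks the kernel to fault the whole mapping in up
         front, so the page faults happen here, not inside the sum loop. */
      uint8_t *p = mmap(NULL, st.st_size, PROT_READ,
                        MAP_PRIVATE | MAP_POPULATE, fd, 0);
      if (p == MAP_FAILED)
          return 1;

      uint64_t sum = 0;
      for (off_t i = 0; i < st.st_size; i++)
          sum += p[i];

      printf("%llu\n", (unsigned long long)sum);
      munmap(p, st.st_size);
      close(fd);
      return 0;
  }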

The best wall times (that is, with OS time included) I get are obtained by reading L1-sized chunks into a small buffer instead of using mmap. YMMV.
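The chunked-read version is roughly this (again only a sketch; 32 KB is a guess at an L1-ish chunk size, tune it for your cache):

  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <unistd.h>

  #define CHUNK (32 * 1024)   /* assumed "L1-sized"; adjust to taste */

  int main(void) {
      int fd = open("data.bin", O_RDONLY);   /* placeholder file name */
      static uint8_t buf[CHUNK];
      uint64_t sum = 0;
      ssize_t n;

      /* The buffer stays hot in L1; the remaining memory traffic is the
         kernel copying out of the page cache on each read(). */
      while ((n = read(fd, buf, CHUNK)) > 0)
          for (ssize_t i = 0; i < n; i++)
              sum += buf[i];

      printf("%llu\n", (unsigned long long)sum);
      close(fd);
      return 0;
  }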




Those are sort of fuzzy concepts for me. At the level of the processor, what does "in one go" really mean? And what does it mean to read it if it's already in memory? Since there are only 512 standard TLB entries, there's no way that all of the file's pages can be 'hot' at once with 4K pages.

For a 1 GB file, I get wall times of:

  Original:                 .22 sec
  MAP_POPULATE:             .17 sec
  Hugepages:                .11 sec
  Hugepages with prefetch:  .07 sec
While I generally agree with the idea that mmap() is no faster than read()/fread(), I'm dubious that one could achieve equally good performance without using huge pages. What I don't understand is what MAP_POPULATE is doing that gets the speedup that it does. I've confirmed that it is not changing the number of TLB page walks. It stays at the expected ~250,000 per GB whether it's used or not.
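For concreteness, one way to set up the hugepage case for a test like this (an illustration of the idea, not necessarily the exact setup behind the numbers above; it assumes huge pages have already been reserved, e.g. via /proc/sys/vm/nr_hugepages, and the prefetch distance is arbitrary):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define SIZE (1UL << 30)   /* 1 GB */

  int main(void) {
      /* Anonymous hugepage-backed buffer; read() the file into it, then sum. */
      uint8_t *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
      if (buf == MAP_FAILED) { perror("mmap"); return 1; }

      int fd = open("data.bin", O_RDONLY);   /* placeholder file name */
      size_t off = 0;
      ssize_t n;
      while (off < SIZE && (n = read(fd, buf + off, SIZE - off)) > 0)
          off += (size_t)n;

      uint64_t sum = 0;
      for (size_t i = 0; i < off; i++) {
          /* "with prefetch" could be as simple as a software prefetch a
             few cache lines ahead, issued once per cache line. */
          if ((i & 63) == 0)
              __builtin_prefetch(buf + i + 512);
          sum += buf[i];
      }

      printf("%llu\n", (unsigned long long)sum);
      return 0;
  }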


> And what does it mean to read it if it's already in memory?

It means walking whatever structures the OS uses to keep things in cache. We generally don't know what they are, nor control them.

> What I don't understand is what MAP_POPULATE is doing that gets the speedup that it does.

MAP_POPULATE minimizes the number of page faults during the main loop, which are more expensive (and require a context switch) than TLB misses. Plus, TLB misses can be avoided in our loop, especially with such a friendly linear sweep.

The main problem here, in my view: trying to coax the OS into using memory the way we want. Huge pages surely help in that regard, but they help the most in code that we do not control. The sum itself over 1 GB of memory would be roughly the same speed, regardless of page size.

To put this to the test: generate 1 GB of random bytes on the fly, instead of reading them from a file, and do the same sum. Does the speed change much with the page size? I'd be interested in the results, especially if accompanied by fine-grained performance counter data.


Thanks for walking me through it. I've usually dealt with hugepages as buffers (as you mention in the last test) and haven't previously thought much about how they work as shared memory.

> To put this to the test: generate 1 GB of random bytes on the fly, instead of reading them from a file, and do the same sum. Does the speed change much with the page size? I'd be interested in the results, especially if accompanied by fine-grained performance counter data.

Yes, I'm pretty sure this is the case, and in fact had been assuming that it is the major effect. It's a little trickier to measure than the file-based case, since you don't want to include the random number generation as part of the measurement. This essentially excludes the use of 'perf', but luckily 'likwid' works great for this.
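For example, likwid's marker API lets you fence off just the sum loop; a minimal harness looks something like this (region and buffer names are my own; build with -DLIKWID_PERFMON and -llikwid, then run under likwid-perfctr with -m as in the commands below):

  #include <likwid.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define SIZE (1UL << 30)   /* 1 GB */

  int main(void) {
      uint8_t *buf = malloc(SIZE);
      if (!buf)
          return 1;

      /* Fill with pseudo-random bytes; this also faults every page in,
         so no page faults land inside the measured region. */
      uint32_t x = 1;
      for (size_t i = 0; i < SIZE; i++) {
          x = x * 1664525u + 1013904223u;   /* cheap LCG stand-in for "random" */
          buf[i] = (uint8_t)(x >> 24);
      }

      LIKWID_MARKER_INIT;
      LIKWID_MARKER_START("sum");
      uint64_t sum = 0;
      for (size_t i = 0; i < SIZE; i++)
          sum += buf[i];
      LIKWID_MARKER_STOP("sum");
      LIKWID_MARKER_CLOSE;

      printf("%llu\n", (unsigned long long)sum);
      free(buf);
      return 0;
  }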

I'll try to post some numbers here in the next hour or so. What performance counters are you interested in?


Cycle counts and DTLB_LOAD_* events should be enough here. Note that the random number generation also doubles as the "populate" flag in this case, since the pages malloc returns aren't actually faulted in until they're first written. I fully expect huge pages to be faster, all other things being equal, but I wonder by how much.


OK, I've tested, and I'm excited to report that you were about 98% correct that TLB misses are _not_ the main issue! For a prewarmed buffer, using 1GB hugepages instead of standard 4KB pages makes only about a 2% difference.

Here's the standard case:

  sudo likwid-perfctr -C 1 \
    -g INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,DTLB_LOAD_MISSES_WALK_COMPLETED:PMC0,DTLB_LOAD_MISSES_WALK_DURATION:PMC1 \
    -m ./anonpage

  | RDTSC Runtime [s]               | 0.0732768   |
  | INSTR_RETIRED_ANY               | 5.78816e+08 |
  | CPU_CLK_UNHALTED_CORE           | 2.62541e+08 |
  | DTLB_LOAD_MISSES_WALK_COMPLETED | 262211      |
  | DTLB_LOAD_MISSES_WALK_DURATION  | 5.52261e+06 |
And here is the version using 1GB hugepages:

  sudo likwid-perfctr -C 1 \
    -g INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,DTLB_LOAD_MISSES_WALK_COMPLETED:PMC0,DTLB_LOAD_MISSES_WALK_DURATION:PMC1 \
    -m ./hugepage

  | RDTSC Runtime [s]               | 0.0716703   |
  | INSTR_RETIRED_ANY               | 5.78816e+08 |
  | CPU_CLK_UNHALTED_CORE           | 2.56794e+08 |
  | DTLB_LOAD_MISSES_WALK_COMPLETED | 63          |
  | DTLB_LOAD_MISSES_WALK_DURATION  | 4891        |
The hugepages are indeed faster, by about the difference reported as DTLB_LOAD_MISSES_WALK_DURATION. This means that, as you surmised, the majority of the earlier savings is not due to the avoidance of TLB misses per se. I need to think about this more.


Interesting. I wonder if the slowdown of small pages is a result of excessive pointer chasing on the kernel side, which intuitively does a better job of thrashing the TLB than sequential accesses would.



