
Thanks for walking me through. I've usually dealt with hugepages as buffers (as you mention in the last test) and haven't thought much previously about how they work as shared memory.

To put this to the test: generate 1 GB of random bytes on the fly, instead of reading them from a file, and do the same sum. Does the speed change much with the page size? I'd be interested in the results, especially if accompanied by fine-grained performance counter data.

Yes, I'm pretty sure this is the case, and had in fact been assuming that it is the major effect. It's a little trickier to measure than the case from the file, since you don't want to include the random number generation as part of the measurement. This essentially excludes the use of 'perf', but luckily 'likwid' works great for this.

I'll try to post some numbers here in the next hour or so. What performance counters are you interested in?




Cycle counts and DTLB_LOAD_* events should be enough here. Note that the random number generation also doubles as the "populate" flag in this case, since malloc returns uninitialized pages. I fully expect huge pages to be faster, all other things being equal, but I wonder by how much.
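For anyone following along, the shape of the measurement can be sketched roughly like this. It is only an illustration of keeping the fill step outside the timed region, not the actual `anonpage` or `hugepage` program from this thread (those are not shown here). The function name `timed_sum` and the 1 MiB chunk size are made up for the example, and it uses an ordinary anonymous mapping with default pages and a wall-clock timer rather than hugepages or hardware counters.

```python
import mmap
import os
import time


def timed_sum(size_bytes):
    """Fill an anonymous buffer with random bytes, then time only the sum.

    Hypothetical helper for illustration; not the benchmark from the thread.
    """
    buf = mmap.mmap(-1, size_bytes)  # anonymous mapping, default page size
    try:
        # Writing the random bytes touches every page, so this step also
        # populates the mapping. It is deliberately outside the timed region.
        chunk = 1 << 20  # assumed chunk size, chosen arbitrarily
        for off in range(0, size_bytes, chunk):
            n = min(chunk, size_bytes - off)
            buf[off:off + n] = os.urandom(n)

        # Only the read-and-sum pass is measured.
        with memoryview(buf) as view:
            start = time.perf_counter()
            total = sum(view)
            elapsed = time.perf_counter() - start
        return total, elapsed
    finally:
        buf.close()


if __name__ == "__main__":
    # A small buffer keeps the pure-Python sum quick; the thread used 1 GB.
    total, elapsed = timed_sum(1 << 20)
    print(f"sum={total} elapsed={elapsed:.6f}s")
```

The same split is what a marker-based counter tool gives you in a compiled program: the counters only run around the summing loop, so the cost of generating the data does not show up in the numbers.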


OK, I've tested, and I'm excited to report that you are about 98% correct that TLB misses are _not_ the main issue! For a prewarmed buffer, using 1GB hugepages instead of standard 4KB pages makes only about a 2% difference.

Here's the standard case:

  sudo likwid -C 1 -g 
    INSTR_RETIRED_ANY:FIXC0,
    CPU_CLK_UNHALTED_CORE:FIXC1,
    DTLB_LOAD_MISSES_WALK_COMPLETED:PMC0,
    DTLB_LOAD_MISSES_WALK_DURATION:PMC1  -m ./anonpage

  |RDTSC Runtime [s]                | 0.0732768   |
  |INSTR_RETIRED_ANY                | 5.78816e+08 |
  |CPU_CLK_UNHALTED_CORE            | 2.62541e+08 |
  |DTLB_LOAD_MISSES_WALK_COMPLETED  |   262211    |
  |DTLB_LOAD_MISSES_WALK_DURATION   | 5.52261e+06 |

And here is the version using 1GB hugepages:

  sudo likwid -C 1 -g 
    INSTR_RETIRED_ANY:FIXC0,
    CPU_CLK_UNHALTED_CORE:FIXC1,
    DTLB_LOAD_MISSES_WALK_COMPLETED:PMC0,
    DTLB_LOAD_MISSES_WALK_DURATION:PMC1  -m ./hugepage

   | RDTSC Runtime [s]               | 0.0716703   |
   | INSTR_RETIRED_ANY               | 5.78816e+08 |
   | CPU_CLK_UNHALTED_CORE           | 2.56794e+08 |
   | DTLB_LOAD_MISSES_WALK_COMPLETED |     63      |
   | DTLB_LOAD_MISSES_WALK_DURATION  |    4891     |

The hugepages are indeed faster, by about the amount reported as DTLB_LOAD_MISSES_WALK_DURATION. This means that, as you surmised, the majority of the savings is not due to the avoidance of TLB misses per se. I need to think about this more.
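To make that comparison concrete, here is the arithmetic on the counter values from the two runs above, as a quick sketch. The only assumption beyond the posted numbers is that "1 GB" means 2^30 bytes, which the walk count in the 4KB run seems to support.

```python
# Counter values copied from the two likwid runs above.
std = {"runtime_s": 0.0732768, "cycles": 2.62541e8,
       "walks": 262211, "walk_cycles": 5.52261e6}
huge = {"runtime_s": 0.0716703, "cycles": 2.56794e8,
        "walks": 63, "walk_cycles": 4891}

# How many cycles the hugepage run saved, and how much of that
# is covered by the drop in page-walk duration.
cycle_gap = std["cycles"] - huge["cycles"]
walk_gap = std["walk_cycles"] - huge["walk_cycles"]
walk_share = walk_gap / cycle_gap

# Relative runtime improvement from the RDTSC runtimes.
speedup_pct = 100.0 * (std["runtime_s"] - huge["runtime_s"]) / std["runtime_s"]

# Assumption: the buffer is 2**30 bytes, so this is the number of 4KB pages.
pages_4k = (1 << 30) // 4096

# Average cost of one completed page walk in the 4KB run.
cycles_per_walk = std["walk_cycles"] / std["walks"]

print(f"cycle gap:         {cycle_gap:.4g}")
print(f"walk-duration gap: {walk_gap:.4g} ({walk_share:.0%} of the cycle gap)")
print(f"runtime gain:      {speedup_pct:.1f}%")
print(f"4KB pages in 1GB:  {pages_4k} vs {std['walks']} completed walks")
print(f"cycles per walk:   {cycles_per_walk:.1f}")
```

So the 4KB run does roughly one completed walk per page touched, each walk costs on the order of 21 cycles, and the walk time covers about 96% of the cycle gap between the two runs.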


Interesting. I wonder if the slowdown of small pages is a result of excessive pointer chasing on the kernel side, which intuitively does a better job of trashing the TLB than sequential accesses would.



