
What do you think MAP_POPULATE is actually doing here? Unless it's changing the page size, I don't see how it would be significantly reducing the number of TLB misses. Is it perhaps doing the preloading in a separate thread on a different core? And the timing happens to work out so that the L3 cache is getting filled at the same rate it's being drained?



> What do you think MAP_POPULATE is actually doing here? Unless it's changing the page size, I don't see how it would be significantly reducing the number of TLB misses.

I think that MAP_POPULATE here will fill the page table with entries up front rather than leaving the page table empty and letting the CPU fault (almost) every time a new page is accessed. That would be roughly 250k fewer page faults for a 1 GB file (1 GiB / 4 KiB ≈ 262,000 pages).

MAP_POPULATE will probably also do the whole disk read in one go rather than in a lazy+speculative manner.

Page size is probably not affected and neither is the number of TLB misses. I did see in my testing that the size of the file (and the mapping) will affect the page size: a 4 GB file had significantly fewer page fault interrupts than a 500 MB file.

And obviously, MAP_POPULATE is bad if physical memory is getting exhausted.
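
For concreteness, a minimal sketch of the kind of mapping being discussed (the file name is a placeholder and error handling is minimal); dropping MAP_POPULATE from the flags gives the default fault-per-page behaviour:

  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(void) {
      int fd = open("data.bin", O_RDONLY);   /* placeholder input file */
      struct stat st;
      if (fd < 0 || fstat(fd, &st) != 0) return 1;

      /* MAP_POPULATE asks the kernel to read the file and fill the page
         table up front; without it, each 4K page faults on first touch. */
      uint8_t *p = mmap(NULL, st.st_size, PROT_READ,
                        MAP_PRIVATE | MAP_POPULATE, fd, 0);
      if (p == MAP_FAILED) return 1;

      uint64_t sum = 0;
      for (off_t i = 0; i < st.st_size; i++)
          sum += p[i];

      printf("%llu\n", (unsigned long long)sum);
      munmap(p, st.st_size);
      close(fd);
      return 0;
  }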


I came across this link, which helped me understand the process a bit better: http://kolbusa.livejournal.com/108622.html. So yes, the main savings seems to be that the page table is created in a tight loop rather than ad hoc. Given the number of pages in the scan, it's still going to be a TLB miss for each page, but it will be just a lookup (no context switch).

> in my testing that the size of the file (and the mapping) will affect the page size

I'm doubtful of this, although it might depend on how you have "transparent huge pages" configured. But even then, I don't think Linux currently supports huge pages for file-backed memory. I think something else might be happening that causes the difference you see. Maybe just the fact that the active part of the page table can no longer fit in L1?

> And obviously, MAP_POPULATE is bad if physical memory is getting exhausted.

I'm confused by this, but it does appear to be the case. It seems strange to me that MAP_POPULATE|MAP_NONBLOCK is no longer possible. I was slow to realize this may be closely related to Linus's recent post: https://plus.google.com/+LinusTorvalds/posts/YDKRFDwHwr6


It's moving page faults away from the main addition loop, which is what I'm interested in measuring anyway. It also reads the whole file in one go, instead of page by page with the default lazy approach.

The best wall times (that is, with OS time included) I get are obtained by reading L1-sized chunks into a small buffer instead of using mmap. YMMV.
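
Something along these lines, as a sketch (32 KB standing in for the L1 data cache size, and the file name is a placeholder):

  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <unistd.h>

  #define CHUNK (32 * 1024)   /* roughly L1d-sized; tune for your CPU */

  int main(void) {
      int fd = open("data.bin", O_RDONLY);   /* placeholder input file */
      if (fd < 0) return 1;

      static uint8_t buf[CHUNK];
      uint64_t sum = 0;
      ssize_t n;

      /* The buffer stays hot in cache while read() streams the file through it. */
      while ((n = read(fd, buf, sizeof buf)) > 0)
          for (ssize_t i = 0; i < n; i++)
              sum += buf[i];

      printf("%llu\n", (unsigned long long)sum);
      close(fd);
      return 0;
  }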


Those are sort of fuzzy concepts for me. At the level of the processor, what does "in one go" really mean? And what does it mean to read it if it's already in memory? Since there are only around 512 standard TLB entries, there's no way that all of the file's pages can be 'hot' at a time with 4K pages.

For a 1 GB file, I get wall times of:

  Original:     .22 sec
  MAP_POPULATE: .17 sec
  Hugepages:    .11 sec
  Hugepages with prefetch: .07 sec

While I generally agree with the idea that mmap() is no faster than read()/fread(), I'm dubious that one could achieve equally good performance without using huge pages. What I don't understand is what MAP_POPULATE is doing to get the speedup it does. I've confirmed that it is not changing the number of TLB page walks: it stays at the expected ~250,000 per GB whether it's used or not.


> And what does it mean to read it if it's already in memory?

It means walking whatever structures the OS uses to keep things in cache. We generally don't know what they are, nor control them.

> What I don't understand is what MAP_POPULATE is doing that gets the speedup that it does.

MAP_POPULATE minimizes the number of page faults during the main loop, which are more expensive (and require a context switch) than TLB misses. Plus, the cost of TLB misses is largely hidden in our loop, especially with such a friendly linear sweep.

The main problem here, in my view: trying to coax the OS into using memory the way we want. Huge pages surely help in that regard, but they help the most in code that we do not control. The sum itself over 1 GB of memory would be roughly the same speed, regardless of page size.

To put this to the test: generate 1 GB of random bytes on the fly, instead of reading them from a file, and do the same sum. Does the speed change much with the page size? I'd be interested in the results, especially if accompanied by fine-grained performance counter data.
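
A minimal sketch of that test, assuming plain 4K anonymous pages (the huge-page variant would only change the mmap flags, as shown further down the thread):

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>

  #define SIZE (1ULL << 30)   /* 1 GB */

  int main(void) {
      /* Anonymous mapping backed by ordinary 4K pages. */
      uint8_t *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (buf == MAP_FAILED) return 1;

      /* Writing the random bytes faults every page in, so the sum below
         runs against a fully populated mapping. */
      for (uint64_t i = 0; i < SIZE; i++)
          buf[i] = (uint8_t)rand();

      /* Only this loop should be timed. */
      uint64_t sum = 0;
      for (uint64_t i = 0; i < SIZE; i++)
          sum += buf[i];

      printf("%llu\n", (unsigned long long)sum);
      munmap(buf, SIZE);
      return 0;
  }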


Thanks for walking me through this. I've usually dealt with hugepages as buffers (as you mention in the last test) and haven't thought much previously about how they work as shared memory.

> To put this to the test: generate 1 GB of random bytes on the fly, instead of reading them from a file, and do the same sum. Does the speed change much with the page size? I'd be interested in the results, especially if accompanied by fine-grained performance counter data.

Yes, I'm pretty sure this is the case, and had in fact been assuming that it is the major effect. It's a little trickier to measure than the case from the file, since you don't want to include the random number generation as part of the measurement. This essentially excludes the use of 'perf', but luckily 'likwid' works great for this.

I'll try to post some numbers here in the next hour or so. What performance counters are you interested in?
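
For what it's worth, a sketch of how the measurement can be restricted to just the sum with LIKWID's marker API (compile with -DLIKWID_PERFMON and link against -llikwid; the -m flag in the runs below picks up the marker regions):

  #include <likwid.h>   /* marker macros; newer releases ship likwid-marker.h */
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define SIZE (1ULL << 30)   /* 1 GB */

  int main(void) {
      uint8_t *buf = malloc(SIZE);
      if (!buf) return 1;

      /* The random fill also populates the pages, but is not measured. */
      for (uint64_t i = 0; i < SIZE; i++)
          buf[i] = (uint8_t)rand();

      LIKWID_MARKER_INIT;
      LIKWID_MARKER_START("sum");   /* counters are reported for this region only */

      uint64_t sum = 0;
      for (uint64_t i = 0; i < SIZE; i++)
          sum += buf[i];

      LIKWID_MARKER_STOP("sum");
      LIKWID_MARKER_CLOSE;

      printf("%llu\n", (unsigned long long)sum);
      free(buf);
      return 0;
  }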


Cycle counts and DTLB_LOAD_* events should be enough here. Note that the random number generation also doubles as the "populate" flag in this case, since malloc returns uninitialized pages. I fully expect huge pages to be faster, all other things being equal, but I wonder by how much.


OK, I've tested, and I'm excited to report that you are about 98% correct that TLB misses are _not_ the main issue! For a prewarmed buffer, using 1GB hugepages instead of standard 4KB pages makes only about a 2% difference.

Here's the standard case:

  sudo likwid-perfctr -C 1 -g 
    INSTR_RETIRED_ANY:FIXC0,
    CPU_CLK_UNHALTED_CORE:FIXC1,
    DTLB_LOAD_MISSES_WALK_COMPLETED:PMC0,
    DTLB_LOAD_MISSES_WALK_DURATION:PMC1  -m ./anonpage

  |RDTSC Runtime [s]                | 0.0732768   |
  |INSTR_RETIRED_ANY                | 5.78816e+08 |
  |CPU_CLK_UNHALTED_CORE            | 2.62541e+08 |
  |DTLB_LOAD_MISSES_WALK_COMPLETED  |   262211    |
  |DTLB_LOAD_MISSES_WALK_DURATION   | 5.52261e+06 |

And here is the version using 1GB hugepages:

  sudo likwid-perfctr -C 1 -g 
    INSTR_RETIRED_ANY:FIXC0,
    CPU_CLK_UNHALTED_CORE:FIXC1,
    DTLB_LOAD_MISSES_WALK_COMPLETED:PMC0,
    DTLB_LOAD_MISSES_WALK_DURATION:PMC1  -m ./hugepage

   | RDTSC Runtime [s]               | 0.0716703   |
   | INSTR_RETIRED_ANY               | 5.78816e+08 |
   | CPU_CLK_UNHALTED_CORE           | 2.56794e+08 |
   | DTLB_LOAD_MISSES_WALK_COMPLETED |     63      |
   | DTLB_LOAD_MISSES_WALK_DURATION  |    4891     |

The hugepages are indeed faster, by about the difference reported as DTLB_LOAD_MISSES_WALK_DURATION. This means that, as you surmised, the majority of the savings is not due to the avoidance of TLB misses per se. I need to think about this more.
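
(For reference, a 1GB-hugepage buffer for a test like this can be obtained roughly as follows; this is my assumption about the setup, and it requires 1GB pages to be reserved beforehand, e.g. with hugepagesz=1G hugepages=1 on the kernel command line:)

  #define _GNU_SOURCE
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/mman.h>

  #ifndef MAP_HUGE_SHIFT
  #define MAP_HUGE_SHIFT 26
  #endif
  #ifndef MAP_HUGE_1GB
  #define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)   /* page size encoded as log2(2^30) */
  #endif

  #define SIZE (1ULL << 30)   /* exactly one 1GB page */

  int main(void) {
      uint8_t *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                          -1, 0);
      if (buf == MAP_FAILED) {
          perror("mmap");   /* fails if no 1GB pages are reserved */
          return 1;
      }

      /* Touch and sum as before; the whole buffer is covered by a single TLB entry. */
      uint64_t sum = 0;
      for (uint64_t i = 0; i < SIZE; i++) {
          buf[i] = (uint8_t)i;
          sum += buf[i];
      }
      printf("%llu\n", (unsigned long long)sum);
      munmap(buf, SIZE);
      return 0;
  }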


Interesting. I wonder if the slowdown of small pages is a result of excessive pointer chasing on the kernel side, which intuitively does a better job of thrashing the TLB than sequential accesses would.



