1/4 second to plow through 1 GB of memory is certainly fast compared to some things (like a human reader), but it seems oddly slow relative to what a modern computer should be capable of. Sure, it's a lot faster than a human, but that's only 4 GB/s! A number of comments here have mentioned adding some prefetch statements, but for linear access like this that's usually not going to help much. The real issue (if I may be so bold) is all the TLB misses. Let's measure.
Here's the starting point on my test system, an Intel Sandy Bridge E5-1620 with 1600 MHz quad-channel RAM:
$ perf stat bytesum 1gb_file
Size: 1073741824
The answer is: 4
Performance counter stats for 'bytesum 1gb_file':
262,315 page-faults # 1.127 M/sec
835,999,671 cycles # 3.593 GHz
475,721,488 stalled-cycles-frontend # 56.90% frontend cycles idle
328,373,783 stalled-cycles-backend # 39.28% backend cycles idle
1,035,850,414 instructions # 1.24 insns per cycle
0.232998484 seconds time elapsed
Hmm, those 260,000 page-faults don't look good. And we've got 40% idle cycles on the backend. Let's try switching to 1 GB hugepages to see how much of a difference it makes:
$ perf stat hugepage 1gb_file
Size: 1073741824
The answer is: 4
Performance counter stats for 'hugepage 1gb_file':
132 page-faults # 0.001 M/sec
387,061,957 cycles # 3.593 GHz
185,238,423 stalled-cycles-frontend # 47.86% frontend cycles idle
87,548,536 stalled-cycles-backend # 22.62% backend cycles idle
805,869,978 instructions # 2.08 insns per cycle
0.108025218 seconds time elapsed
It's entirely possible that I've done something stupid, but the checksum comes out right, and the 10 GB/s read speed is getting closer to what I'd expect for this machine. Using these 1 GB pages for the contents of a file is a bit tricky, since they need to be allocated from the hugetlbfs filesystem, which does not allow writes and requires that the pages be reserved at boot time. My solution was to run one program that creates a shared map, copy the file in, pause that program, and then have the bytesum program read the copy that uses the 1 GB pages.
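For concreteness, here's a rough sketch of the copy-and-pause program (the hugetlbfs mount point and file names are placeholders, error handling is stripped, and it assumes a single 1 GB page reserved at boot):

    /* Copy the file into a shared hugetlbfs-backed mapping, then sleep so
       that the bytesum program can map the same hugetlbfs file. */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define HUGE_SIZE (1UL << 30)   /* exactly one 1 GB page */

    int main(void) {
        int huge_fd = open("/mnt/huge_1gb/1gb_copy", O_CREAT | O_RDWR, 0600);
        void *dst = mmap(NULL, HUGE_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, huge_fd, 0);

        int src_fd = open("1gb_file", O_RDONLY);
        void *src = mmap(NULL, HUGE_SIZE, PROT_READ, MAP_PRIVATE, src_fd, 0);
        memcpy(dst, src, HUGE_SIZE);   /* write() to hugetlbfs fails, but mmap+memcpy works */

        pause();   /* keep the shared mapping alive for the reader */
        return 0;
    }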
Now that we've got the page faults out of the way, the prefetch suggestion becomes more useful:
$ perf stat hugepage_prefetch 1gb_file
Size: 1073741824
The answer is: 4
Performance counter stats for 'hugepage_prefetch 1gb_file':
132 page-faults # 0.002 M/sec
265,037,039 cycles # 3.592 GHz
116,666,382 stalled-cycles-frontend # 44.02% frontend cycles idle
34,206,914 stalled-cycles-backend # 12.91% backend cycles idle
579,326,557 instructions # 2.19 insns per cycle
0.074032221 seconds time elapsed
That gets us up to 14.5 GB/s, which is more reasonable for a single stream read on a single core. Based on prior knowledge of this machine, I'm issuing one prefetch 512B ahead per 128B double-cacheline. Why one per 128B? Because the hardware "buddy prefetcher" is grabbing two lines at a time. Why do prefetches help? Because the hardware "stream prefetcher" doesn't know that it's dealing with 1 GB pages, and otherwise won't prefetch across 4K boundaries.
What would it take to speed it up further? I'm not sure. Suggestions (and independent confirmations or refutations) welcome. The most I've been able to reach in other circumstances is about 18 GB/s by doing multiple streams with interleaved reads, which allows the processor to take better advantage of open RAM banks. The next limiting factor (I think) is the number of line fill buffers (10 per core) combined with the cache latency in accordance with Little's Law.
Good stuff! The low numbers (4 GB/s, one tenth the available memory bandwidth) that we're seeing suggest that a lot of time is wasted in page faults (and indeed, perf stat confirms it). However, the solution you propose sounds difficult: you need a special file system and need to know what kind of page sizes the CPU supports.
Can you think of ways to reduce the number of page faults from inside the application itself? Or methods that would be portable to architectures with different page sizes?
I tried a simple call to madvise and posix_fadvise to inform the operating system ahead of time that I am going to need the memory but that did not have any effect on the number of page faults.
Any other tips for squeezing some more perf out? Did you happen to do any cache miss stats for your benchmarks?
Using MAP_POPULATE, I'm seeing this go from 104,276 page faults down to 130 page faults. However, I'm only seeing a modest increase in overall performance: it goes from 160 ms to 130 ms for my ~500 MB test file. MAP_POPULATE + MAP_NONBLOCK is as bad as not using MAP_POPULATE (same number of page faults).
Could the reduction in TLB misses from using huge pages be a major factor too? Page faults alone do not explain the huge perf difference. (edit: perf stat tells me I'm only missing 0.19% of TLB lookups, and only 3% of dcache loads miss)
I also tried with a 4 GB file and I see two interesting effects. First of all, using MAP_POPULATE is slower, which I presume is because physical memory is exhausted (6 GB memory total). Secondly, I see fewer than 100k page faults, which suggests that bigger pages (2M or 1G?) are being used (10x bigger file, slightly fewer page faults).
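For reference, here's a minimal sketch of the kind of mapping involved (not my exact code; the file name comes from argv and error handling is omitted). MAP_POPULATE is the only change from a plain mmap:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        /* MAP_POPULATE pre-faults the whole mapping at mmap() time,
           so the summing loop below takes (almost) no page faults. */
        unsigned char *p = mmap(NULL, st.st_size, PROT_READ,
                                MAP_PRIVATE | MAP_POPULATE, fd, 0);

        unsigned long sum = 0;
        for (off_t i = 0; i < st.st_size; i++)
            sum += p[i];
        printf("sum = %lu\n", sum);
        return 0;
    }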
This is well worth trying, but not directly applicable to reading from a file. This would help if you were allocating a buffer with malloc() and you wanted it to be backed by hugepages. But if you have control of the source, you are probably better off using MAP_HUGETLB with mmap() (or MADV_HUGEPAGE with madvise()), or calling get_huge_pages() directly instead of going through the LD_PRELOAD.
What do you think MAP_POPULATE is actually doing here? Unless it's changing the page size, I don't see how it would be significantly reducing the number of TLB misses. Is it perhaps doing the preloading in a separate thread on a different core? And the timing happens to work out so that the L3 cache is getting filled at the same rate it's being drained?
> What do you think MAP_POPULATE is actually doing here? Unless it's changing the page size, I don't see how it would be significantly reducing the number of TLB misses.
I think that MAP_POPULATE here will fill the page table with entries rather than leaving the page table empty and letting the CPU fault (almost) every time a new page is accessed. That would be about 200k fewer interrupts for a 1 GB file.
MAP_POPULATE will probably also do the whole disk read in one go rather than in a lazy+speculative manner.
Page size is probably not affected and neither is the number of TLB misses. Though I noticed in my testing that the size of the file (and the mapping) will affect the page size: a 4 GB file had significantly fewer page fault interrupts than a 500 MB file.
And obviously, MAP_POPULATE is bad if physical memory is getting exhausted.
I came across this link, which helped me understand the process a bit better: http://kolbusa.livejournal.com/108622.html. So yes, the main savings seems to be that the page table is created in a tight loop rather than ad hoc. Given the number of pages in the scan, it's still going to be a TLB miss for each page, but it will be just a lookup (no context switch).
in my testing that the size of the file (and the mapping) will affect the page size
I'm doubtful of this, although it might depend on how you have "transparent huge pages" configured. But even then, I don't think Linux currently supports huge pages for file backed memory. I think something else might be happening that causes the difference you see. Maybe just the fact that the active TLB can no longer fit in L1?
And obviously, MAP_POPULATE is bad if physical memory is getting exhausted.
I'm confused by this, but it does appear to be the case. It seems strange to me that the MAP_POPULATE|MAP_NONBLOCK combination is no longer supported. I was slow to realize this may be closely related to Linus's recent post: https://plus.google.com/+LinusTorvalds/posts/YDKRFDwHwr6
It's moving page faults away from the main addition loop, which is what I'm interested in measuring anyway. It also reads the whole file in one go, instead of page by page with the default lazy approach.
The best wall times (that is, with OS time included) I get are obtained by reading L1-sized chunks into a small buffer instead of using mmap. YMMV.
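Roughly like this, assuming a 32 KB L1 data cache (a sketch, not the exact code I benchmarked):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        enum { CHUNK = 32 * 1024 };        /* assumed L1 data cache size */
        static unsigned char buf[CHUNK];
        int fd = open(argv[1], O_RDONLY);

        unsigned long sum = 0;
        ssize_t got;
        while ((got = read(fd, buf, CHUNK)) > 0)   /* stream through a small buffer */
            for (ssize_t i = 0; i < got; i++)
                sum += buf[i];

        printf("sum = %lu\n", sum);
        return 0;
    }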
Those are sort of fuzzy concepts for me. At the level of the processor, what does "in one go" really mean? And what does it mean to read it if it's already in memory? Since there are only 512 standard TLB entries, there's no way that all of the file's pages can be 'hot' at a time with 4K pages.
While I generally agree with the idea that mmap() is no faster than read()/fread(), I'm dubious that one could achieve equally good performance without using huge pages. What I don't understand is what MAP_POPULATE is doing that gets the speedup that it does. I've confirmed that it is not changing the number of TLB page walks. It stays at the expected ~250,000 per GB whether it's used or not.
> And what does it mean to read it if it's already in memory?
It means walking whatever structures the OS uses to keep things in cache. We generally don't know what they are, nor control them.
> What I don't understand is what MAP_POPULATE is doing that gets the speedup that it does.
MAP_POPULATE minimizes the number of page faults during the main loop, which are more expensive (and require a context switch) than TLB misses. Plus, TLB misses can be avoided in our loop, especially with such a friendly linear sweep.
The main problem here, in my view: trying to coax the OS into using memory the way we want. Huge pages surely help in that regard, but they help the most in code that we do not control. The sum itself over 1 GB of memory would be roughly the same speed, regardless of page size.
To put this to the test: generate 1 GB of random bytes on the fly, instead of reading them from a file, and do the same sum. Does the speed change much with the page size? I'd be interested in the results, especially if accompanied by fine-grained performance counter data.
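Something along these lines should do it (just a sketch: time only the summing loop, and for the hugepage variant add MAP_HUGETLB to the mmap flags):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <time.h>

    #define SIZE (1UL << 30)

    int main(void) {
        unsigned char *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        for (size_t i = 0; i < SIZE; i++)      /* fill (and fault in) the buffer */
            buf[i] = rand();

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);   /* time only the sum */
        unsigned long sum = 0;
        for (size_t i = 0; i < SIZE; i++)
            sum += buf[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("sum = %lu, %.2f GB/s\n", sum, (SIZE / 1e9) / secs);
        return 0;
    }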
Thanks for walking me through this. I've usually dealt with hugepages as buffers (as you mention in the last test) and haven't thought much previously about how they work as shared memory.
To put this to the test: generate 1 GB of random bytes on the fly, instead of reading them from a file, and do the same sum. Does the speed change much with the page size? I'd be interested in the results, especially if accompanied by fine-grained performance counter data.
Yes, I'm pretty sure this is the case, and had in fact been assuming that it is the major effect. It's a little trickier to measure than the file-based case, since you don't want to include the random number generation as part of the measurement. This essentially excludes the use of 'perf', but luckily 'likwid' works great for this.
I'll try to post some numbers here in the next hour or so.
What performance counters are you interested in?
Cycle counts and DTLB_LOAD_* events should be enough here. Note that the random number generation also doubles as the "populate" flag in this case, since malloc returns uninitialized pages. I fully expect huge pages to be faster, all other things being equal, but I wonder by how much.
OK, I've tested, and I'm excited to report that you are about 98% correct that TLB misses are _not_ the main issue! For a prewarmed buffer, using 1GB hugepages instead of standard 4KB pages is only about a 2% difference.
The hugepages are indeed faster by about the amount of the difference reported as DTLB_LOAD_MISSES_WALK_DURATION. This means that, as you surmised, the majority of the savings is not due to the avoidance of TLB misses per se. I need to think about this more.
Interesting. I wonder if the slowdown of small pages is a result of excessive pointer chasing on the kernel side, which intuitively does a better job of thrashing the TLB than sequential accesses would.
> Any other tips for squeezing some more perf out?
Ditch the OS and go into huge unreal mode a la LoseThos. Disable interrupts in the loop. No MMU means no page faults, no TLB, no interrupts; you can't possibly get any faster than that! :-)
More seriously, I think 14.5 GB/s is getting close to the limits of memory bandwidth on a single core already. Multicore CPUs' memory controllers are optimised for multiple cores accessing memory, so a single one trying to saturate the bus might not perform so well.
Seriously, someone should take TempleOS (if it's open source) or else make something similar and market it in some form of really high-speed server, I'm sure there'd be demand. Of course, that's assuming it would be possible to achieve decent security on the software level to ensure safe operation without memory protection.
Something Snabb Switch-like, then, where they disable the Linux task scheduler on 6 out of 8 cores, and do all the scheduling manually: https://news.ycombinator.com/item?id=7250505
Places where people care about this (like HPC) already have customized kernels that run with large pages by default and don't suffer from the poor quality of TempleOS.
What exactly do you mean by poor quality? Have you ever run the OS? Do you have a reference to some results that reveal some aspect of it that is poor?
However, the solution you propose sounds difficult: you need a special file system and need to know what kind of page sizes the CPU supports.
Yes, I've definitely offered a "proof of concept" rather than a solution. But it has gotten simpler with recent Linux kernel versions: http://lwn.net/Articles/533499/.
One difficulty in emulating Julia's particular benchmark is that (at least on Linux) transparent hugepages can't be used for file-backed memory. It's easier if you are just allocating an anonymous buffer, and even easier if you don't need to share that buffer between unrelated processes.
Can you think of ways to reduce the number of page faults from inside the application itself?
"Page faults" is an overloaded term (used for semi-related OS and hardware level events), so probably best to avoid it (even though I was using it, and even though 'perf' does). I was actually using 'likwid' to do most of the measuring, and switched to 'perf' for the post since it's more generally available. Our goal here is to avoid TLB misses, or the consequent 'page walks'.
With that in mind, the subgoal becomes allocating memory that uses hugepages. If you are OK with the default 2MB pages, this is reasonably straightforward. 'libhugetlbfs' (a library distributed independently of the 'hugetlbfs' distributed with the kernel) may make this easier.
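For the simple case, here's a minimal sketch of an anonymous 2 MB hugepage allocation with plain mmap() (it assumes hugepages have already been reserved, e.g. via /proc/sys/vm/nr_hugepages):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 512UL * 2 * 1024 * 1024;   /* 1 GB worth of 2 MB pages */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");        /* most likely: no hugepages reserved */
            return 1;
        }
        /* ... copy the file contents in and run the byte sum as before ... */
        munmap(buf, len);
        return 0;
    }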
Or methods that would be portable to architectures with different page sizes?
The number of page sizes supported is not a big issue. There are only a few sizes out there. In practice, all 64-bit systems will support the 2MB hugepages, and this is almost always the default hugepage size. In this particular case with a linear scan of memory, there is almost no difference in performance between these and the harder to specify 1GB pages.
I tried a simple call to madvise and posix_fadvise to inform the operating system ahead of time that I am going to need the memory but that did not have any effect on the number of page faults.
This is the overloading of the terms getting you. Presuming a system with ample free memory, after the first run you won't have any OS level page faults. But anytime you try to have a working set of more than 512 4K regions in RAM, you will have TLB misses aka "page exceptions".
The option that should help here is MADV_HUGEPAGE with madvise() (the mmap() counterpart is MAP_HUGETLB), but not (to my knowledge) if you are doing a file-backed allocation.
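For an anonymous buffer, the transparent hugepage route looks roughly like this (a sketch; it assumes THP is enabled in 'madvise' or 'always' mode, and the 2 MB alignment is only there to give the kernel a fair chance to use huge pages):

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 1UL << 30;
        unsigned char *buf = aligned_alloc(2 * 1024 * 1024, len);  /* 2 MB aligned */
        madvise(buf, len, MADV_HUGEPAGE);   /* hint: back this region with huge pages */
        /* ... touch the memory and sum it; the kernel may promote to 2 MB pages ... */
        free(buf);
        return 0;
    }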
Any other tips for squeezing some more perf out?
Not really, learning these is my goal as well. For me, it's probably going to involve getting more familiar with the "uncore" performance monitors. From Sandy Bridge on, these are accessed via PCI rather than MSRs, and are difficult to use from 'perf'. 'likwid' supports them better, and is definitely worth exploring.
The other place to get further gains is by trying to understand the actual memory access patterns, and trying to get them to read full DRAM pages each time a bank is opened. John McCalpin (author of the Stream benchmark) goes into it here: http://blogs.utexas.edu/jdm4372/2010/11/09/optimizing-amd-op...
Did you happen to do any cache miss stats for your benchmarks?
This is directly related to the performance improvement, but I didn't look at the numbers separately. Glancing now, it looks like practically everything is hitting L3, but a lot of things are missing L1. This is probably worth exploring further. I think one issue is that the hardware prefetcher deposits things in L2 (~12 cycles), but not L1 (~6 cycles).
Naively, it sounds about right just on dimensional grounds. CPU clock: ~2 GHz, so that means
    (2 GHz) / (4 GB/s) = 0.5 cycles per byte
For a 64 bit processor, you might imagine the best you can do is, say, 2 instructions per 8 bytes (load 64 bits, add+store accumulator), or roughly 0.25 cycles per byte at one instruction per cycle.
But of course in practice, the CPU is occupied by other things (i.e. operating system), so I think it's still pretty amazing to get this close to the "theoretical" maximum efficiency.
I'm not sure if this is the right way to look at the problem. With the right numbers, it does provide a theoretical limit, but the problem is that there are other lower I/O limits that get in the way first.
As a stab at the right numbers, on Sandy Bridge you can issue two 16B loads per cycle. Since modern processors are superscalar, in that same cycle you can also issue the two vector adds. So your actual CPU limit is more like 32B per cycle (about 0.03 cycles per byte).
Very interesting; of course I expect there are many refinements to be made. As a physicist my first reaction is always to mash numbers together based on dimension. If I can get within one or two orders of magnitude in a problem I know nothing about, I'm pretty happy ;)
I agree completely with the thought process, but I think there might be better 'dimensions' to be used. The one that I'd recommend here is based on Little's Law, which gives a limit on throughput based on the number and latency of outstanding requests: Concurrency = Latency * Bandwidth.
It turns out that each core can only have 10 requests for memory outstanding (line fill buffers). Since in the end these requests are coming from RAM, each request has a latency of about 100 cycles. Since each request is for 64B, this gives us a maximum throughput of about:
    Bandwidth = (10 * 64B in flight) / (100 cycles) ≈ 6B per cycle
At a 3.5 GHz clock frequency, this would suggest that we have a hard cap at about 3.5 billion cycles/s * 6 bytes/cycle = 21 GB/s, which is mighty close to the actual limit! The numbers are fudged a little bit because it's not actually a flat 100 cycle latency to access RAM, but I think this limit is more relevant and indicative here than the instruction count.
    /* Assumes vk0 = _mm_setzero_si128(), vk1 = _mm_set1_epi16(1), and vsum
       starting at zero; 'a' is the byte buffer and 'n' its length. */
    for (int i = 0; i < n; i += 8*16) {                /* 128B per outer iteration */
        __builtin_prefetch(&a[i + 512]);               /* one prefetch, 512B ahead */
        for (int j = 0; j < 8; j++) {
            __m128i v  = _mm_load_si128((__m128i *)&a[i + j*16]);
            __m128i vl = _mm_unpacklo_epi8(v, vk0);    /* low 8 bytes -> 16-bit words */
            __m128i vh = _mm_unpackhi_epi8(v, vk0);    /* high 8 bytes -> 16-bit words */
            vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vl, vk1));
            vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vh, vk1));
        }
    }
The goal is to issue one prefetch for each 128B block of data you read. There are probably better ways to do this than what I did. I'm hoping the compiler did something reasonable, and haven't really looked at the generated assembly.
Also, if it indeed is the case that TLB misses are the major factor (and I think it is), I don't think you will have much success by adding prefetch alone. Trying right now, I get a slight slowdown with just the prefetch. It may be only in combination with hugepages that you get a positive effect.
I think 'perf' is probably lying to you. Although maybe it's not a lie, as your Gist does contain that line '<not supported> dTLB-misses'. Perf tries very hard to be non-CPU specific, and thus doesn't do a great job of handling the CPU specific stuff.
What processor are you running this on? If Intel, you might have luck with some of the more Intel specific wrappers here: https://github.com/andikleen/pmu-tools
The other main advantage of 'likwid' is that it allows you to profile just a section of the code, rather than the program as a whole. For odd political reasons, 'perf' doesn't make this possible.
ps. I think your 'argc' check is off by one. Since the name of the program is in argv[0], and argc is the length of argv, you want to check 'argc != 2' to confirm that a filename has been given.
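That is, something like this (a sketch of just the check):

    #include <stdio.h>

    int main(int argc, char **argv) {
        /* argv[0] is the program name, so exactly one filename means argc == 2 */
        if (argc != 2) {
            fprintf(stderr, "usage: %s <filename>\n", argv[0]);
            return 1;
        }
        /* ... open argv[1] and sum its bytes ... */
        return 0;
    }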
> That gets us up to 14.5 GB/s, which is more reasonable for a single stream read on a single core.
If you want to see roughly the limit of memory speed on your machine, try booting into memtest86+; in addition to testing memory, it benchmarks memory.