1/4 second to plow through 1 GB of memory is certainly fast compared to some things (like a human reader), but it seems oddly slow relative to what a modern computer should be capable of. Sure, it's a lot faster than a human, but that's only 4 GB/s! A number of comments here have mentioned adding some prefetch statements, but for linear access like this that's usually not going to help much. The real issue (if I may be so bold) is all the TLB misses. Let's measure.
Here's the starting point on my test system, an Intel Sandy Bridge E5-1620 with 1600 MHz quad-channel RAM:
$ perf stat bytesum 1gb_file
Size: 1073741824
The answer is: 4
Performance counter stats for 'bytesum 1gb_file':
262,315 page-faults # 1.127 M/sec
835,999,671 cycles # 3.593 GHz
475,721,488 stalled-cycles-frontend # 56.90% frontend cycles idle
328,373,783 stalled-cycles-backend # 39.28% backend cycles idle
1,035,850,414 instructions # 1.24 insns per cycle
0.232998484 seconds time elapsed
Hmm, those 260,000 page-faults don't look good. And we've got 40% idle cycles on the backend. Let's try switching to 1 GB hugepages to see how much of a difference it makes:
$ perf stat hugepage 1gb_file
Size: 1073741824
The answer is: 4
Performance counter stats for 'hugepage 1gb_file':
132 page-faults # 0.001 M/sec
387,061,957 cycles # 3.593 GHz
185,238,423 stalled-cycles-frontend # 47.86% frontend cycles idle
87,548,536 stalled-cycles-backend # 22.62% backend cycles idle
805,869,978 instructions # 2.08 insns per cycle
0.108025218 seconds time elapsed
It's entirely possible that I've done something stupid, but the checksum comes out right, and the 10 GB/s read speed is getting closer to what I'd expect for this machine. Using these 1 GB pages for the contents of a file is a bit tricky, since they need to be allocated from the hugetlbfs filesystem, which does not allow writes and requires that the pages be reserved at boot time. My solution was to run one program that creates a shared map, copy the file in, pause that program, and then have the bytesum program read the copy that uses the 1 GB pages.
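For concreteness, here's a rough sketch of the copy-and-pause program (the hugetlbfs mount point and file names are placeholders, error handling is stripped, and it assumes a single 1 GB page reserved at boot):

    /* Copy the file into a shared hugetlbfs-backed mapping, then sleep so
       that the bytesum program can map the same hugetlbfs file. */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define HUGE_SIZE (1UL << 30)   /* exactly one 1 GB page */

    int main(void) {
        int huge_fd = open("/mnt/huge_1gb/1gb_copy", O_CREAT | O_RDWR, 0600);
        void *dst = mmap(NULL, HUGE_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, huge_fd, 0);

        int src_fd = open("1gb_file", O_RDONLY);
        void *src = mmap(NULL, HUGE_SIZE, PROT_READ, MAP_PRIVATE, src_fd, 0);
        memcpy(dst, src, HUGE_SIZE);   /* write() to hugetlbfs fails, but mmap+memcpy works */

        pause();   /* keep the shared mapping alive for the reader */
        return 0;
    }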
Now that we've got the page faults out of the way, the prefetch suggestion becomes more useful:
$ perf stat hugepage_prefetch 1gb_file
Size: 1073741824
The answer is: 4
Performance counter stats for 'hugepage_prefetch 1gb_file':
132 page-faults # 0.002 M/sec
265,037,039 cycles # 3.592 GHz
116,666,382 stalled-cycles-frontend # 44.02% frontend cycles idle
34,206,914 stalled-cycles-backend # 12.91% backend cycles idle
579,326,557 instructions # 2.19 insns per cycle
0.074032221 seconds time elapsed
That gets us up to 14.5 GB/s, which is more reasonable for a single stream read on a single core. Based on prior knowledge of this machine, I'm issuing one prefetch 512B ahead per 128B double-cacheline. Why one per 128B? Because the hardware "buddy prefetcher" is grabbing two lines at a time. Why do prefetches help? Because the hardware "stream prefetcher" doesn't know that it's dealing with 1 GB pages, and otherwise won't prefetch across 4K boundaries.
What would it take to speed it up further? I'm not sure. Suggestions (and independent confirmations or refutations) welcome. The most I've been able to reach in other circumstances is about 18 GB/s by doing multiple streams with interleaved reads, which allows the processor to take better advantage of open RAM banks. The next limiting factor (I think) is the number of line fill buffers (10 per core) combined with the cache latency in accordance with Little's Law.
Good stuff! The low numbers (4 GB/s, one tenth the available memory bandwidth) that we're seeing suggest that a lot of time is wasted in page faults (and indeed, perf stat confirms it). However, the solution you propose sounds difficult: you need a special file system and need to know what kind of page sizes the CPU supports.
Can you think of ways to reduce the number of page faults from inside the application itself? Or methods that would be portable to architectures with different page sizes?
I tried a simple call to madvise and posix_fadvise to inform the operating system ahead of time that I am going to need the memory but that did not have any effect on the number of page faults.
Any other tips for squeezing some more perf out? Did you happen to do any cache miss stats for your benchmarks?
Using MAP_POPULATE, I'm seeing this go from 104,276 page faults down to 130 page faults. However, I'm only seeing a modest increase in overall performance: it goes from 160 ms to 130 ms for my ~500 MB test file. MAP_POPULATE + MAP_NONBLOCK is as bad as not using MAP_POPULATE (same number of page faults).
Could the reduction in TLB misses from using huge pages be a major factor too? Page faults alone do not explain the huge perf difference. (edit: perf stat tells me I'm only missing 0.19% of TLB lookups, and only 3% of dcache loads miss)
I also tried with a 4 GB file and I see two interesting effects. First of all, using MAP_POPULATE is slower, which I presume is because physical memory is exhausted (6 GB memory total). Secondly, I see fewer than 100k page faults, which suggests that bigger pages (2M or 1G?) are being used (10x bigger file, slightly fewer page faults).
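For reference, here's a minimal sketch of the kind of mapping involved (not my exact code; the file name comes from argv and error handling is omitted). MAP_POPULATE is the only change from a plain mmap:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        /* MAP_POPULATE pre-faults the whole mapping at mmap() time,
           so the summing loop below takes (almost) no page faults. */
        unsigned char *p = mmap(NULL, st.st_size, PROT_READ,
                                MAP_PRIVATE | MAP_POPULATE, fd, 0);

        unsigned long sum = 0;
        for (off_t i = 0; i < st.st_size; i++)
            sum += p[i];
        printf("sum = %lu\n", sum);
        return 0;
    }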
This is well worth trying, but not directly applicable to reading from a file. This would help if you were allocating a buffer with malloc() and you wanted it to be backed by hugepages. But if you have control of the source, you are probably better off using MAP_HUGETLB with mmap() (or MADV_HUGEPAGE with madvise()), or calling get_huge_pages() directly instead of going through the LD_PRELOAD.
What do you think MAP_POPULATE is actually doing here? Unless it's changing the page size, I don't see how it would be significantly reducing the number of TLB misses. Is it perhaps doing the preloading in a separate thread on a different core? And the timing happens to work out so that the L3 cache is getting filled at the same rate it's being drained?
> What do you think MAP_POPULATE is actually doing here? Unless it's changing the page size, I don't see how it would be significantly reducing the number of TLB misses.
I think that MAP_POPULATE here will fill the page table with entries rather than leaving the page table empty and letting the CPU fault (almost) every time a new page is accessed. That would be about 200k fewer interrupts for a 1 GB file.
MAP_POPULATE will probably also do the whole disk read in one go rather than in a lazy+speculative manner.
Page size is probably not affected and neither is the number of TLB misses. Though I noticed in my testing that the size of the file (and the mapping) will affect the page size: a 4 GB file had significantly fewer page fault interrupts than a 500 MB file.
And obviously, MAP_POPULATE is bad if physical memory is getting exhausted.
I came across this link, which helped me understand the process a bit better: http://kolbusa.livejournal.com/108622.html. So yes, the main savings seems to be that the page table is created in a tight loop rather than ad hoc. Given the number of pages in the scan, it's still going to be a TLB miss for each page, but it will be just a lookup (no context switch).
in my testing that the size of the file (and the mapping) will affect the page size
I'm doubtful of this, although it might depend on how you have "transparent huge pages" configured. But even then, I don't think Linux currently supports huge pages for file backed memory. I think something else might be happening that causes the difference you see. Maybe just the fact that the active TLB can no longer fit in L1?
And obviously, MAP_POPULATE is bad if physical memory is getting exhausted.
I'm confused by this, but it does appear to be the case. It seems strange to me that the MAP_POPULATE|MAP_NONBLOCK combination is no longer supported. I was slow to realize this may be closely related to Linus's recent post: https://plus.google.com/+LinusTorvalds/posts/YDKRFDwHwr6
It's moving page faults away from the main addition loop, which is what I'm interested in measuring anyway. It also reads the whole file in one go, instead of page by page with the default lazy approach.
The best wall times (that is, with OS time included) I get are obtained by reading L1-sized chunks into a small buffer instead of using mmap. YMMV.
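Roughly like this, assuming a 32 KB L1 data cache (a sketch, not the exact code I benchmarked):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        enum { CHUNK = 32 * 1024 };        /* assumed L1 data cache size */
        static unsigned char buf[CHUNK];
        int fd = open(argv[1], O_RDONLY);

        unsigned long sum = 0;
        ssize_t got;
        while ((got = read(fd, buf, CHUNK)) > 0)   /* stream through a small buffer */
            for (ssize_t i = 0; i < got; i++)
                sum += buf[i];

        printf("sum = %lu\n", sum);
        return 0;
    }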
Those are sort of fuzzy concepts for me. At the level of the processor, what does "in one go" really mean? And what does it mean to read it if it's already in memory? Since there are only 512 standard TLB entries, there's no way that all of the file's pages can be 'hot' at a time with 4K pages.
While I generally agree with the idea that mmap() is no faster than read()/fread(), I'm dubious that one could achieve equally good performance without using huge pages. What I don't understand is what MAP_POPULATE is doing that gets the speedup that it does. I've confirmed that it is not changing the number of TLB page walks. It stays at the expected ~250,000 per GB whether it's used or not.
> And what does it mean to read it if it's already in memory?
It means walking whatever structures the OS uses to keep things in cache. We generally don't know what they are, nor control them.
> What I don't understand is what MAP_POPULATE is doing that gets the speedup that it does.
MAP_POPULATE minimizes the number of page faults during the main loop, which are more expensive (and require a context switch) than TLB misses. Plus, TLB misses can be avoided in our loop, especially with such a friendly linear sweep.
The main problem here, in my view: trying to coax the OS into using memory the way we want. Huge pages surely help in that regard, but they help the most in code that we do not control. The sum itself over 1 GB of memory would be roughly the same speed, regardless of page size.
To put this to the test: generate 1 GB of random bytes on the fly, instead of reading them from a file, and do the same sum. Does the speed change much with the page size? I'd be interested in the results, especially if accompanied by fine-grained performance counter data.
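Something along these lines should do it (just a sketch: time only the summing loop, and for the hugepage variant add MAP_HUGETLB to the mmap flags):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <time.h>

    #define SIZE (1UL << 30)

    int main(void) {
        unsigned char *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        for (size_t i = 0; i < SIZE; i++)      /* fill (and fault in) the buffer */
            buf[i] = rand();

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);   /* time only the sum */
        unsigned long sum = 0;
        for (size_t i = 0; i < SIZE; i++)
            sum += buf[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("sum = %lu, %.2f GB/s\n", sum, (SIZE / 1e9) / secs);
        return 0;
    }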
Thanks for walking me through this. I've usually dealt with hugepages as buffers (as you mention in the last test) and haven't thought much previously about how they work as shared memory.
To put this to the test: generate 1 GB of random bytes on the fly, instead of reading them from a file, and do the same sum. Does the speed change much with the page size? I'd be interested in the results, especially if accompanied by fine-grained performance counter data.
Yes, I'm pretty sure this is the case, and had in fact been assuming that it is the major effect. It's a little trickier to measure than the file-based case, since you don't want to include the random number generation as part of the measurement. This essentially excludes the use of 'perf', but luckily 'likwid' works great for this.
I'll try to post some numbers here in the next hour or so.
What performance counters are you interested in?
Cycle counts and DTLB_LOAD_* events should be enough here. Note that the random number generation also doubles as the "populate" flag in this case, since malloc returns uninitialized pages. I fully expect huge pages to be faster, all other things being equal, but I wonder by how much.
OK, I've tested, and I'm excited to report that you are about 98% correct that TLB misses are _not_ the main issue! For a prewarmed buffer, using 1GB hugepages instead of standard 4KB pages is only about a 2% difference.
The hugepages are indeed faster by about the amount of the difference reported as DTLB_LOAD_MISSES_WALK_DURATION. This means that, as you surmised, the majority of the savings is not due to the avoidance of TLB misses per se. I need to think about this more.
Interesting. I wonder if the slowdown of small pages is a result of excessive pointer chasing on the kernel side, which intuitively does a better job of thrashing the TLB than sequential accesses would.
> Any other tips for squeezing some more perf out?
Ditch the OS and go into huge unreal mode a la LoseThos. Disable interrupts in the loop. No MMU means no page faults, no TLB, no interrupts; you can't possibly get any faster than that! :-)
More seriously, I think 14.5 GB/s is getting close to the limits of memory bandwidth on a single core already. Multicore CPUs' memory controllers are optimised for multiple cores accessing memory, so a single one trying to saturate the bus might not perform so well.
Seriously, someone should take TempleOS (if it's open source) or else make something similar and market it in some form of really high-speed server, I'm sure there'd be demand. Of course, that's assuming it would be possible to achieve decent security on the software level to ensure safe operation without memory protection.
Something Snabb Switch-like, then, where they disable the Linux task scheduler on 6 out of 8 cores, and do all the scheduling manually: https://news.ycombinator.com/item?id=7250505
Places where people care about this (like HPC) already have customized kernels that run with large pages by default and don't suffer from the poor quality of TempleOS.
What exactly do you mean by poor quality? Have you ever run the OS? Do you have a reference to some results that reveal some aspect of it that is poor?
However, the solution you propose sounds difficult: you need a special file system and need to know what kind of page sizes the CPU supports.
Yes, I've definitely offered a "proof of concept" rather than a solution. But it has gotten simpler with recent Linux kernel versions: http://lwn.net/Articles/533499/.
One difficulty in emulating Julia's particular benchmark is that (at least on Linux) transparent hugepages can't be used for file-backed memory. It's easier if you are just allocating an anonymous buffer, and even easier if you don't need to share that buffer between unrelated processes.
Can you think of ways to reduce the number of page faults from inside the application itself?
"Page faults" is an overloaded term (used for semi-related OS and hardware level events), so probably best to avoid it (even though I was using it, and even though 'perf' does). I was actually using 'likwid' to do most of the measuring, and switched to 'perf' for the post since it's more generally available. Our goal here is to avoid TLB misses, or the consequent 'page walks'.
With that in mind, the subgoal becomes allocating memory that uses hugepages. If you are OK with the default 2MB pages, this is reasonably straightforward. 'libhugetlbfs' (a library distributed independently of the 'hugetlbfs' distributed with the kernel) may make this easier.
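For the simple case, here's a minimal sketch of an anonymous 2 MB hugepage allocation with plain mmap() (it assumes hugepages have already been reserved, e.g. via /proc/sys/vm/nr_hugepages):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 512UL * 2 * 1024 * 1024;   /* 1 GB worth of 2 MB pages */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");        /* most likely: no hugepages reserved */
            return 1;
        }
        /* ... copy the file contents in and run the byte sum as before ... */
        munmap(buf, len);
        return 0;
    }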
Or methods that would be portable to architectures with different page sizes?
The number of page sizes supported is not a big issue. There are only a few sizes out there. In practice, all 64-bit systems will support the 2MB hugepages, and this is almost always the default hugepage size. In this particular case with a linear scan of memory, there is almost no difference in performance between these and the harder to specify 1GB pages.
I tried a simple call to madvise and posix_fadvise to inform the operating system ahead of time that I am going to need the memory but that did not have any effect on the number of page faults.
This is the overloading of the terms getting you. Presuming a system with ample free memory, after the first run you won't have any OS level page faults. But anytime you try to have a working set of more than 512 4K regions in RAM, you will have TLB misses aka "page exceptions".
The option that should help here is MADV_HUGEPAGE with madvise() (the mmap() counterpart is MAP_HUGETLB), but not (to my knowledge) if you are doing a file-backed allocation.
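For an anonymous buffer, the transparent hugepage route looks roughly like this (a sketch; it assumes THP is enabled in 'madvise' or 'always' mode, and the 2 MB alignment is only there to give the kernel a fair chance to use huge pages):

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 1UL << 30;
        unsigned char *buf = aligned_alloc(2 * 1024 * 1024, len);  /* 2 MB aligned */
        madvise(buf, len, MADV_HUGEPAGE);   /* hint: back this region with huge pages */
        /* ... touch the memory and sum it; the kernel may promote to 2 MB pages ... */
        free(buf);
        return 0;
    }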
Any other tips for squeezing some more perf out?
Not really, learning these is my goal as well. For me, it's probably going to involve getting more familiar with the "uncore" performance monitors. From Sandy Bridge on, these are accessed via PCI rather than MSRs, and are difficult to use from 'perf'. 'likwid' supports them better, and is definitely worth exploring.
The other place to get further gains is by trying to understand the actual memory access patterns, and trying to get them to read full DRAM pages each time a bank is opened. John McCalpin (author of the Stream benchmark) goes into it here: http://blogs.utexas.edu/jdm4372/2010/11/09/optimizing-amd-op...
Did you happen to do any cache miss stats for your benchmarks?
This is directly related to the performance improvement, but I didn't look at the numbers separately. Glancing now, it looks like practically everything is hitting L3, but a lot of things are missing L1. This is probably worth exploring further. I think one issue is that the hardware prefetcher deposits things in L2 (~12 cycles), but not L1 (~6 cycles).
Naively, it sounds about right just on dimensional grounds. CPU clock: ~2 GHz, so that means
    (2 GHz) / (4 GB/s) = 0.5 cycles per byte
For a 64 bit processor, you might imagine the best you can do is, say, 2 instructions per 8 bytes (load 64 bits, add+store accumulator), or roughly 0.25 cycles per byte at one instruction per cycle.
But of course in practice, the CPU is occupied by other things (i.e. operating system), so I think it's still pretty amazing to get this close to the "theoretical" maximum efficiency.
I'm not sure if this is the right way to look at the problem. With the right numbers, it does provide a theoretical limit, but the problem is that there are other lower I/O limits that get in the way first.
As a stab at the right numbers, on Sandy Bridge you can issue two 16B loads per cycle. Since modern processors are superscalar, in that same cycle you can also issue the two vector adds. So your actual CPU limit is more like 32B per cycle (about 0.03 cycles per byte).
Very interesting; of course I expect there are many refinements to be made. As a physicist my first reaction is always to mash numbers together based on dimension. If I can get within one or two orders of magnitude in a problem I know nothing about, I'm pretty happy ;)
I agree completely with the thought process, but I think there might be better 'dimensions' to be used. The one that I'd recommend here is based on Little's Law, which gives a limit on throughput based on the number and latency of outstanding requests: Concurrency = Latency * Bandwidth.
It turns out that each core can only have 10 requests for memory outstanding (line fill buffers). Since in the end these requests are coming from RAM, each request has a latency of about 100 cycles. Since each request is for 64B, this gives us a maximum throughput of about:
    Bandwidth = (10 * 64B in flight) / (100 cycles) ≈ 6B per cycle
At a 3.5 GHz clock frequency, this would suggest that we have a hard cap at about 3.5 billion cycles/s * 6 bytes/cycle = 21 GB/s, which is mighty close to the actual limit! The numbers are fudged a little bit because it's not actually a flat 100 cycle latency to access RAM, but I think this limit is more relevant and indicative here than the instruction count.
    /* Assumes vk0 = _mm_setzero_si128(), vk1 = _mm_set1_epi16(1), and vsum
       starting at zero; 'a' is the byte buffer and 'n' its length. */
    for (int i = 0; i < n; i += 8*16) {                /* 128B per outer iteration */
        __builtin_prefetch(&a[i + 512]);               /* one prefetch, 512B ahead */
        for (int j = 0; j < 8; j++) {
            __m128i v  = _mm_load_si128((__m128i *)&a[i + j*16]);
            __m128i vl = _mm_unpacklo_epi8(v, vk0);    /* low 8 bytes -> 16-bit words */
            __m128i vh = _mm_unpackhi_epi8(v, vk0);    /* high 8 bytes -> 16-bit words */
            vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vl, vk1));
            vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vh, vk1));
        }
    }
The goal is to issue one prefetch for each 128B block of data you read. There are probably better ways to do this than what I did. I'm hoping the compiler did something reasonable, and haven't really looked at the generated assembly.
Also, if it indeed is the case that TLB misses are the major factor (and I think it is), I don't think you will have much success by adding prefetch alone. Trying right now, I get a slight slowdown with just the prefetch. It may be only in combination with hugepages that you get a positive effect.
I think 'perf' is probably lying to you. Although maybe it's not a lie, as your Gist does contain that line '<not supported> dTLB-misses'. Perf tries very hard to be non-CPU specific, and thus doesn't do a great job of handling the CPU specific stuff.
What processor are you running this on? If Intel, you might have luck with some of the more Intel specific wrappers here: https://github.com/andikleen/pmu-tools
The other main advantage of 'likwid' is that it allows you to profile just a section of the code, rather than the program as a whole. For odd political reasons, 'perf' doesn't make this possible.
ps. I think your 'argc' check is off by one. Since the name of the program is in argv[0], and argc is the length of argv, you want to check 'argc != 2' to confirm that a filename has been given.
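That is, something like this (a sketch of just the check):

    #include <stdio.h>

    int main(int argc, char **argv) {
        /* argv[0] is the program name, so exactly one filename means argc == 2 */
        if (argc != 2) {
            fprintf(stderr, "usage: %s <filename>\n", argv[0]);
            return 1;
        }
        /* ... open argv[1] and sum its bytes ... */
        return 0;
    }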
> That gets us up to 14.5 GB/s, which is more reasonable for a single stream read on a single core.
If you want to see roughly the limit of memory speed on your machine, try booting into memtest86+; in addition to testing memory, it benchmarks memory.