When I was implementing a simple hash table that needed to be fast, I did some investigation in this area.
I created a huge array of 32-bit ints - a few gigabytes in size. Then I timed this code:
    uint32_t checksum = 0;
    for (int i = 0; i < 1000000; i++) {
        unsigned offset = rand32() & (TABLE_SIZE - 1);
        checksum += table_ints[offset];
    }
I was surprised to see it only took about 10 ns per iteration, which is 5 to 10 times faster than a DRAM access. Because the table was large and the access pattern was random, each iteration has to do a DRAM access (there is a low probability of getting a cache hit).
The processor was able to execute 5 to 10 iterations of that loop in parallel in a single thread. Quite amazing.
BTW, is there somewhere suitable that I can write-up and publish this information without going to the trouble of a formal academic paper?
Along the way I also spent months learning a few things about how the virtual memory page tables, memory controller, and DRAM subsystem work. I suspect this will be useful to other people, and it would be nice to have a forum where this research could be discussed.
Did you initialize table_ints by writing to every page before running the test? If you didn't, this could instead be measuring accesses to the zero page.
What's the zero page? Where can I read more about it? Wikipedia[1] suggests that modern operating systems "actually make the zero page inaccessible to trap uses of NULL pointers".
Yeah, it's not that zero page. This is just a page of zeros. When you allocate virtual memory, the OS gives you the virtual allocation, but the physical mapping points to the page of zeros. So when you read, you get zeros. Only when you write does the OS allocate a real physical page of memory (which it should zero before handing it over to the application). The reason this matters for the random read test is that if he doesn't initialize the memory, he might not be reading from DRAM but constantly reading the zero page, which would be very fast.
I'd very much enjoy reading a blog post about your findings, if you are interested in writing one. Thanks for sharing, modern CPUs really are quite incredible.
I like the 7-cpu.com benchmarks that also look at both the latency and the throughput (latency/parallelism) of the memory hierarchy. https://www.7-cpu.com/
Another relevant paper on low MLP, in this case in shared-memory graph algorithms: http://www.scottbeamer.net/pubs/beamer-iiswc2015.pdf. It would be interesting to see whether Cimple can significantly improve performance for graph algorithms (the paper does mention it as a potential use-case).
For anyone curious, I'm not affiliated with the authors of the paper. It's scheduled to appear at PACT'19. AFAIK the code is not publicly available yet.
Does anyone know if the tooling is available in either source or binary format?
The conclusion includes "We offer an optimization methodology for experts, and a tool usable by end-users today." but doesn't provide any pointers on where to find them.
The study cites a govt. grant but I couldn't find much using the grant number either.
The random number generator looked like this:
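(The snippet itself isn't shown in this copy of the thread. For illustration only, a typical fast generator matching that rand32() usage would be Marsaglia's xorshift32 - this is a stand-in, not the original code:)

```c
#include <stdint.h>

/* Stand-in only: Marsaglia xorshift32. The comment's actual rand32()
   may have been different; any fast full-width PRNG works here. */
static uint32_t rng_state = 0x12345678u;  /* any nonzero seed */

static uint32_t rand32(void) {
    uint32_t s = rng_state;
    s ^= s << 13;
    s ^= s >> 17;
    s ^= s << 5;
    rng_state = s;
    return s;
}
```

The important property for the benchmark is that it's cheap (a few shifts and XORs, no memory accesses), so the measured time is dominated by the table lookups rather than the generator.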