When I was implementing a simple hash table that needed to be fast, I did some investigation in this area.
I created a huge array of 32-bit ints - a few gigabytes in size. Then I timed this code:
    uint32_t checksum = 0;
    for (int i = 0; i < 1000000; i++) {
        unsigned offset = rand32() & (TABLE_SIZE - 1);
        checksum += table_ints[offset];
    }
I was surprised to see it only took about 10 ns per iteration, which is 5 to 10 times faster than a DRAM access. Because the table was large and the access pattern was random, each iteration has to do a DRAM access (there is a low probability of getting a cache hit).
The processor was able to execute 5 to 10 iterations of that loop in parallel in a single thread. Quite amazing.
BTW, is there somewhere suitable that I can write-up and publish this information without going to the trouble of a formal academic paper?
Along the way I also spent months learning a few things about how the virtual memory page tables, memory controller, and DRAM subsystem work. I suspect this will be useful to other people, and it would be nice to have a forum where this research could be discussed.
Did you initialize table_ints by writing to every page before running the test? If you didn't, this could instead be measuring accesses to the zero page.
What's the zero page? Where can I read more about it? Wikipedia[1] suggests that modern operating systems "actually make the zero page inaccessible to trap uses of NULL pointers".
Yeah, it's not that zero page. This is just a page of zeros. When you allocate virtual memory, the OS gives you the virtual allocation, but the physical mapping points to the page of zeros. So when you read, you get zeros. Only when you write does the OS allocate a real physical page of memory (which it should zero before handing it over to the application). The reason this matters for the random read test is that if he doesn't initialize the memory, he might not be reading from DRAM but constantly reading the zero page, which would be very fast.
I'd very much enjoy reading a blog post about your findings, if you are interested in writing one. Thanks for sharing, modern CPUs really are quite incredible.
I like the 7-cpu.com benchmarks that also look at both the latency and the throughput (latency/parallelism) of the memory hierarchy. https://www.7-cpu.com/
Another relevant paper on low MLP, in this case in shared-memory graph algorithms: http://www.scottbeamer.net/pubs/beamer-iiswc2015.pdf. It would be interesting to see whether Cimple can significantly improve performance for graph algorithms (the paper does mention it as a potential use-case).
For anyone curious, I'm not affiliated with the authors of the paper. It's scheduled to appear at PACT'19. AFAIK the code is not publicly available yet.
Does anyone know if the tooling is available in either source or binary format?
The conclusion includes "We offer an optimization methodology for experts, and a tool usable by end-users today." but doesn't provide any pointers on where to find them.
The study cites a govt. grant but I couldn't find much using the grant number either.
The random number generator looked like this:
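(The snippet itself isn't shown in this copy of the thread. For illustration only, a typical fast generator matching that rand32() usage would be Marsaglia's xorshift32 - this is a stand-in, not the original code:)

```c
#include <stdint.h>

/* Stand-in only: Marsaglia xorshift32. The comment's actual rand32()
   may have been different; any fast full-width PRNG works here. */
static uint32_t rng_state = 0x12345678u;  /* any nonzero seed */

static uint32_t rand32(void) {
    uint32_t s = rng_state;
    s ^= s << 13;
    s ^= s >> 17;
    s ^= s << 5;
    rng_state = s;
    return s;
}
```

The important property for the benchmark is that it's cheap (a few shifts and XORs, no memory accesses), so the measured time is dominated by the table lookups rather than the generator.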