
Cimple: Instruction and Memory Level Parallelism - moab
https://arxiv.org/abs/1807.01624
======
abainbridge
When I was implementing a simple hash table that needed to be fast, I did some
investigation in this area.

I created a huge array of 32 bit ints - a few gigabytes in size. Then timed
this code:

    
    
      uint32_t checksum = 0;
      for (int i = 0; i < 1000000; i++) {
          unsigned offset = rand32() & (TABLE_SIZE - 1);
          checksum += table_ints[offset];
      }
    

I was surprised to see that it took only about 10 ns per iteration, which is 5
to 10 times faster than a single DRAM access. Because the table was large and
the access pattern was random, nearly every iteration had to go to DRAM (the
probability of a cache hit is low).

So the processor must have been overlapping 5 to 10 iterations of that loop,
with their memory accesses in flight at the same time, in a single thread.
Quite amazing.

The random number generator looked like this:

    
    
      uint32_t g_seed = 12345;
      uint32_t rand32() {
          g_seed = 214013 * g_seed + 2531011;
          return g_seed;
      }
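
For anyone who wants to try this, here is a minimal sketch of a complete harness around that loop. It is not my original code: `sum_random_reads`, `now_ns` and `ns_per_random_read` are names made up for this example, the timing assumes a POSIX system with `clock_gettime`, and the table size is a parameter that must be a power of two for the mask to work.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime on strict compilers */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Same LCG as in the comment above. */
static uint32_t g_seed = 12345;
static uint32_t rand32(void) {
    g_seed = 214013u * g_seed + 2531011u;
    return g_seed;
}

/* Sum `iters` pseudo-randomly chosen elements of `table`.
   `table_size` must be a power of two. No load depends on the result of
   an earlier load, so the CPU is free to overlap them. */
static uint32_t sum_random_reads(const uint32_t *table, uint32_t table_size,
                                 long iters) {
    uint32_t checksum = 0;
    for (long i = 0; i < iters; i++) {
        uint32_t offset = rand32() & (table_size - 1);
        checksum += table[offset];
    }
    return checksum;
}

/* Monotonic time in nanoseconds (POSIX). */
static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

/* Hypothetical driver: allocate the table, write every element so the
   pages are really mapped, then time the loop.
   Returns ns per iteration, or -1.0 if the allocation fails. */
static double ns_per_random_read(uint32_t table_size, long iters,
                                 uint32_t *checksum_out) {
    uint32_t *table = malloc((size_t)table_size * sizeof *table);
    if (!table) return -1.0;
    for (uint32_t i = 0; i < table_size; i++) table[i] = i;

    double t0 = now_ns();
    uint32_t checksum = sum_random_reads(table, table_size, iters);
    double t1 = now_ns();

    free(table);
    *checksum_out = checksum;  /* use the result so the loop is not optimized away */
    return (t1 - t0) / (double)iters;
}
```

Calling `ns_per_random_read(1u << 28, 1000000, &sum)` from `main` would use a 1 GiB table; print both the return value and `sum`. The number you get will depend on the machine, the compiler flags and whether huge pages are in use.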

~~~
abainbridge
BTW, is there somewhere suitable where I can write up and publish this
information without going to the trouble of a formal academic paper?

I also spent months learning a few things about how the virtual memory page
table, memory controller and DRAM subsystem work. I suspect this will be
useful to other people, and it would be nice to have a forum where this
research could be discussed.

~~~
petermcneeley
If you want to see the actual memory latency you can make the next access
depend on the results of the previous.

      uint32_t checksum = 0;
      uint32_t prevRes = 0;
      for (int i = 0; i < 1000000; i++) {
          unsigned offset = (rand32() ^ prevRes) & (TABLE_SIZE - 1);
          prevRes = table_ints[offset];
          checksum += prevRes;
      }

~~~
abainbridge
Yeah, I did that. It took 60 ns per iteration, as long as the page table for
the pages I was hitting was in cache.
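
Putting the two measurements side by side gives a rough estimate of how many loads were overlapping in the independent version. A minimal sketch, with made-up function names (`sum_dependent_reads`, `implied_parallelism`) and assuming the table size is a power of two:

```c
#include <stdint.h>

/* Same LCG as earlier in the thread. */
static uint32_t g_seed = 12345;
static uint32_t rand32(void) {
    g_seed = 214013u * g_seed + 2531011u;
    return g_seed;
}

/* Dependent variant: each offset is computed from the value loaded in the
   previous iteration, so the next load cannot be issued until the previous
   one has returned. Time per iteration then approximates the load latency. */
static uint32_t sum_dependent_reads(const uint32_t *table, uint32_t table_size,
                                    long iters) {
    uint32_t checksum = 0;
    uint32_t prev = 0;
    for (long i = 0; i < iters; i++) {
        uint32_t offset = (rand32() ^ prev) & (table_size - 1);
        prev = table[offset];
        checksum += prev;
    }
    return checksum;
}

/* Rough estimate of the number of loads in flight in the independent loop:
   latency-bound time per access divided by throughput-bound time per access. */
static double implied_parallelism(double dependent_ns, double independent_ns) {
    return dependent_ns / independent_ns;
}
```

With the numbers from this thread, `implied_parallelism(60.0, 10.0)` gives 6, which fits the "5 to 10 iterations in parallel" estimate above. Treat it as a back-of-the-envelope figure, since the 60 ns only holds when the page table lookups hit in cache.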

------
moab
Another relevant paper on low MLP, in this case in shared-memory graph
algorithms: http://www.scottbeamer.net/pubs/beamer-iiswc2015.pdf

It would be interesting to see whether Cimple can significantly improve
performance for graph algorithms (the paper does mention it as a potential
use case).

For anyone curious, I'm not affiliated with the authors of the paper. It's
scheduled to appear at PACT'19. AFAIK the code is not publicly available yet.

------
zeusk
Does anyone know if the tooling is available in either source or binary
format?

The conclusion includes "We offer an optimization methodology for experts, and
a tool usable by end-users today." but doesn't provide any pointers on where
to find them.

The study cites a govt. grant but I couldn't find much using the grant number
either.

------
petermcneeley
Similar results can be achieved with C++ nano-coroutines:
https://www.youtube.com/watch?v=j9tlJAqMV7U&feature=youtu.be&t=2326

