
Cache Rules Everything Around Me - scaramanga
http://iainkfraser.blogspot.co.uk/2013/01/cache-money-hoes-attached-code.html
======
smcl
Just adding a bit of background on the title - it's a play on Wu Tang's
"C.R.E.A.M." ("Cash Rules Everything Around Me").

~~~
themckman
...You got stick up kids, corrupt cops and crack rocks and stray shots all on
a block that stays hot...

Rae's and Deck's verses are two of my favorites in all of hip hop. There's
just a certain flow to them, and the beat is just spooky as hell.

------
joelthelion
How do you reliably assess the cache behavior of your software? Thinking in
terms of cache behavior is certainly very useful for making software that runs
fast, but if you have no reliable way to check what is actually happening,
you're pretty much walking in the dark.

~~~
moconnor
If you've got C/C++ (or Fortran!) code it's pretty straightforward - I work on
a profiling tool that tracks things like CPU utilization and memory contention
for HPC codes (<http://www.allinea.com/products/map/>). For single-machine
programs you could just hack something together with the perf counters fairly
easily.
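
For example, on Linux you can get the basic numbers out of perf(1) with no
code changes at all (exact event names vary by CPU; ./myprog is just a
stand-in for your binary):

    perf stat -e cache-references,cache-misses,instructions ./myprog

That gives you whole-program miss rates; the harder part is attributing
them to source lines, which is where the dedicated tools earn their keep.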

~~~
vonmoltke
A clustered signal processing application I previously worked on used custom
middleware with perfmon2[1] built in for collecting performance counter
data across all the MPI nodes the application was running on. Building the
framework out to multiple nodes is not all that difficult.

[1] <http://perfmon2.sourceforge.net/>

------
eamsen
I have been planning to release an article with the exact same title, so
naturally I hate the title. The article gives a quick insight into cache-aware
algorithms, but fails to state that it's really about the data structures.
Taking the example to the extreme: assuming you had huge, sparse
matrices/vectors, how would you encode them? Or more generally, when does
asymptotic complexity become a secondary criterion in the choice of a data
structure? Otherwise it's short but good; there can't be enough articles on
this topic.
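
To make the data-structure point concrete, here is one answer for the sparse
case (a minimal C sketch, my own example, not from the article): store only
the nonzero entries as parallel index/value arrays, so a dot product becomes
a single sequential merge over packed memory:

    #include <stdio.h>

    /* A sparse vector stored as parallel (index, value) arrays.
       The payoff is that both arrays are scanned front to back,
       so the hardware prefetcher can stream them through cache. */
    typedef struct {
        int     nnz;   /* number of nonzero entries       */
        int    *idx;   /* sorted indices of those entries */
        double *val;   /* the corresponding values        */
    } sparse_vec;

    double sparse_dot(const sparse_vec *a, const sparse_vec *b)
    {
        double sum = 0.0;
        int i = 0, j = 0;
        while (i < a->nnz && j < b->nnz) {
            if      (a->idx[i] < b->idx[j]) i++;
            else if (a->idx[i] > b->idx[j]) j++;
            else    sum += a->val[i++] * b->val[j++];
        }
        return sum;
    }

    int main(void)
    {
        int ia[] = { 2, 5, 9 };  double va[] = { 1.0, 2.0, 3.0 };
        int ib[] = { 5, 9, 12 }; double vb[] = { 4.0, 5.0, 6.0 };
        sparse_vec a = { 3, ia, va }, b = { 3, ib, vb };
        printf("%f\n", sparse_dot(&a, &b)); /* 2*4 + 3*5 = 23 */
        return 0;
    }

The point is the layout: two packed arrays scanned in order, instead of
chasing pointers through a tree or hash keyed by index.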

------
lautis
If you're interested in how this could be automated, check out Polly[1]. There
is also a similar framework in GCC[2], but as can be seen from the blog post,
there is still room for improvement. A sketch of the kind of loop
transformation these frameworks automate follows the links.

[1] <http://polly.llvm.org>

[2] <http://gcc.gnu.org/wiki/Graphite>
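
Roughly what such a framework does for you (a hand-written C sketch; Polly
and Graphite actually work on the compiler's internal representation, not on
source):

    #include <stdio.h>
    #define N 1024
    static double a[N][N];

    int main(void)
    {
        double sum = 0.0;

        /* Before: the inner loop walks down a column of a row-major
           array, jumping N*sizeof(double) bytes per step -- roughly
           one cache miss per access. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];

        /* After interchange: the inner loop walks along a row, so
           consecutive accesses share cache lines. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        printf("%f\n", sum);
        return 0;
    }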

------
signa11
here is Harald Prokop's oldish thesis on the subject:
<http://supertech.csail.mit.edu/papers/Prokop99.pdf>

edit: also, i guess it never hurts to plug Drepper's most excellent paper,
"What Every Programmer Should Know About Memory"

~~~
npsimons
Link to Drepper's article: <http://lwn.net/Articles/250967/>

Also informative reading: "A Memory Allocator"
<http://gee.cs.oswego.edu/dl/html/malloc.html>

------
codex
Cache does rule everything, and the proof lies in the laws of physics.

How fast can you access a bit of memory? The lower bound is set by the speed
of light. How do you organize your memory so that access is as fast as
possible? In a sphere with the processor at the center. There is no more
efficient overall packing of bits, assuming each bit takes up some finite
"bit" of space (pun intended).

This means that random access to memory can never be O(1) at the limit; it is
actually O(n^(1/3)), the cube root, since the volume of a sphere is
(4/3) pi r^3. Yes, even for hash tables. So random-access memory always gets
slower the more of it you have.
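
Spelling the cube root out (rho is the bit density, c the speed of light;
constants elided):

    n bits at fixed density fill a volume  V = n / rho = (4/3) pi r^3
    so the radius grows as                 r ~ n^(1/3)
    and a round trip at light speed takes  t >= 2r / c = Theta(n^(1/3))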

Conversely, the less memory you have to access, the faster it can be. By
putting more frequently accessed bits closer to the center of the sphere, you
have created a cache.

------
martinced
TFA is a bit simplistic but, indeed, cache does rule everything.

When you need really fast algorithms and very low latencies, you _have_ to get
your hands dirty.

If you're interested in much more advanced writings on the subject I'd suggest
googling for "How to do 100K+ TPS at less than 1ms latency".

It explains how LMAX's Disruptor pattern is now able to process 12 million
events per second on a single core in... Java!

It covers, among other things, the sizes and speeds of the various caches and
how to minimize cache misses.
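
One of the tricks behind those numbers is keeping independent hot variables
on separate cache lines. The Disruptor does this in Java; here is a minimal
C sketch of the same idea (my reconstruction, not LMAX's actual code):

    #include <stdio.h>
    #include <stdalign.h>

    /* Two counters written by two different threads. Without the
       alignment they could land on the same 64-byte cache line, and
       every write on one core would invalidate the line on the other
       ("false sharing"). Padding trades a few bytes for independence. */
    struct counters {
        alignas(64) long produced;  /* written by the producer */
        alignas(64) long consumed;  /* written by the consumer */
    };

    int main(void)
    {
        printf("%zu\n", sizeof(struct counters)); /* 128: a line each */
        return 0;
    }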

~~~
stbullard
For the lazy:

Slides: <http://bit.ly/11aeJfc>

Video (InfoQ): <http://www.infoq.com/presentations/LMAX>

Blogpost (2011): <http://martinfowler.com/articles/lmax.html?t=1319912579>

Discussion (2011): <http://news.ycombinator.com/item?id=3173993>

~~~
brown9-2
The "Disruptor" library mentioned can also be found at <http://lmax-
exchange.github.com/disruptor/>.

Funny that this was posted here as their whitepaper has been sitting in my
Downloads folder for months waiting to be read.

The slides mention "Hotspot likes small compact methods". Does anyone happen
to know a source or have more information on this?

~~~
octo_t
The HotSpot inliner gives up after only 35 bytes of bytecode
(<http://blog.headius.com/2009/01/my-favorite-hotspot-jvm-flags.html>)
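
If you want to watch it happen, something like this should print the
inliner's decisions (MyApp is a placeholder; PrintInlining is a diagnostic
flag, and 35 is, if I remember right, the default value of MaxInlineSize):

    java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining \
         -XX:MaxInlineSize=35 MyApp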

~~~
brown9-2
Fascinating, thanks!

