Cache Rules Everything Around Me (iainkfraser.blogspot.co.uk)
117 points by scaramanga on Jan 31, 2013 | 26 comments

Just adding a bit of background on the title - it's a play on Wu Tang's "C.R.E.A.M." ("Cash Rules Everything Around Me").

...You got stick up kids, corrupt cops and crack rocks and stray shots all on a block that stays hot...

Rae's and Deck's verses are two of my favorites in all of hip hop. There's just a certain flow, and the beat is just spooky as hell.

RZA must have cut the line about how being cache oblivious becomes asymptotically equivalent to being cache aware.

How do you reliably assess the cache behavior of your software? Thinking in terms of cache behavior is certainly very useful for making software that runs fast, but if you have no reliable way to check what is actually happening, you're pretty much walking in the dark.

Most CPUs have performance counters that let you determine when performance impacting events occur (e.g. pipeline flushes, reduced dispatch width, cache misses, etc.). In their simplest form they are a simple count, but there is usually some way to trigger an interrupt every N events rather than simply incrementing a counter, so that you can store the offending program counter somewhere for later aggregation. This functionality isn't always exposed very well by the OS, though. If you're using Linux then perf and oprofile are the standard ways to access performance counters.

Alternatively you can execute your code in a simulator with a cache model. Of course, this model won't actually match your hardware (unless you're using a serious cycle-accurate simulator), but it might be more insightful than performance counters alone. I don't know if any of the standard x86 emulators include a good cache model.

AMD has CodeAnalyst software which helps you analyze many aspects of your software performance, including cache behavior.


Just from the front page I see a presentation, "Cache Line Utilization with AMD CodeAnalyst Software", which touches on the subject.


Intel VTune would be one option. Not cheap, but it gives very high quality results.


Cachegrind (part of Valgrind) will do this for you: http://valgrind.org/docs/manual/cg-manual.html

kcachegrind on Linux is great for that. It shows your code annotated with the number of cache misses per line, and it can also show a call graph with aggregated running time, etc. Internally it uses callgrind.

Great tool for optimizing C/C++ code.

Check out PAPI[1], it lets you examine CPU performance counters such as data- and instruction cache misses.

[1] http://icl.cs.utk.edu/projects/papi/wiki/Main_Page

If you've got C/C++ (or Fortran!) code it's pretty straightforward - I work on a profiling tool that tracks things like CPU utilization and memory contention for HPC codes (http://www.allinea.com/products/map/). For single-machine programs you could just hack something together with the perf counters fairly easily.

A clustered signal processing application I previously worked on used custom middleware with perfmon2[1] built in for collecting performance counter data across all the MPI nodes the application was running on. Extending the framework to multiple nodes is not all that difficult.

[1] http://perfmon2.sourceforge.net/

indeed, or perf record -e cache-misses using the hardware performance counters

There are a bunch of existing tools that use the underlying architecture's performance measuring capabilities to give you things like cache hits/misses and branch prediction performance; I really like oprofile (but it only runs on linux).

If you're using multiple threads, then you might have issues with false sharing, which is a bitch. Valgrind can help you there.

I have been planning to release an article with the exact same title, so naturally I hate the title. The article gives a quick insight into cache-aware algorithms, but fails to point out that it's really about the data structures. Taking the example to the extreme: assuming you had huge, sparse matrices/vectors, how would you encode them? Or more generally, when does asymptotic complexity become a secondary criterion for the choice of a data structure? Otherwise, it's short but good; there can't be enough articles on this topic.

If you're interested in how this could be automated, check out Polly[1]. There is also similar framework in GCC[2], but as can be seen from the blog post, there is room for improvement.

[1] http://polly.llvm.org

[2] http://gcc.gnu.org/wiki/Graphite

here is Harald Prokop's oldish thesis on the subject: http://supertech.csail.mit.edu/papers/Prokop99.pdf

edit: also, i guess, it never hurts to plug drepper's most excellent paper: "What Every Programmer Should Know About Memory"

Link to Drepper's article: http://lwn.net/Articles/250967/

Also informative reading: "A Memory Allocator" http://gee.cs.oswego.edu/dl/html/malloc.html

Cache does rule everything, and the proof lies in the laws of physics.

How fast can you access a bit of memory? The lower bound is the speed of light. How do you organize your memory such that access is the fastest? In a sphere with the processor in the center. There is no more overall efficient packing of bits assuming that each bit takes up some finite "bit" of space (pun intended).

This means that random access to memory can never be O(1) at the limit; it is actually O(n^(1/3)) -- the cube root, since the volume of a sphere is (4/3)πr^3. Yes, even for hash tables. So random access memory always gets slower the more you have of it.

Conversely, the less memory you have to access, the faster it can be. By putting more frequently accessed bits closer to the center of the sphere, you have created a cache.
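A back-of-the-envelope version of that argument: if each bit occupies a fixed volume, then $n$ bits packed around the processor need a sphere of radius $r$ with

```latex
n \propto r^3 \quad\Rightarrow\quad r \propto n^{1/3},
\qquad
t_{\text{access}} \;\ge\; \frac{r}{c} \;=\; \Theta\!\left(n^{1/3}\right),
```

where $c$ is the speed of light -- hence the cube-root lower bound on random access time.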


Actually, the title of this piece is "Cache Money Hoes"

ducks before another HN mod title debate begins

You're right, but it's the hook that matters.

TFA is a bit simplistic but, indeed, cache does rule everything.

When you need really fast algorithms and very low latencies, you have to get your hands dirty.

If you're interested in much more advanced writings on the subject I'd suggest googling for "How to do 100K+ TPS at less than 1ms latency".

It explains how the LMAX Disruptor pattern is now able to process 12 million events per second on a single core in... Java!

It explains, among other things, the size and speed of the various caches and how to minimize cache misses.

The "Disruptor" library mentioned can also be found at http://lmax-exchange.github.com/disruptor/.

Funny that this was posted here as their whitepaper has been sitting in my Downloads folder for months waiting to be read.

The slides mention "Hotspot likes small compact methods". Does anyone happen to know a source or have more information on this?

The hotspot inliner gives up after only 35 bytes (http://blog.headius.com/2009/01/my-favorite-hotspot-jvm-flag...)

Fascinating, thanks!
