
Analysis of Processor Cache Effects - socratees
http://igoro.com/archive/gallery-of-processor-cache-effects/
======
yan
For those interested about RAM timings, effects of cache, and other low level
memory topics explained exceptionally well, check out "What Every Programmer
Should Know About Memory"[1]

[1] <http://people.redhat.com/drepper/cpumemory.pdf>

~~~
barrkel
That book is detailed, but that's also the problem with it. The relevant
signal is all but drowned in the noise of 114 pages of details of CAS/RAS,
circuit diagrams for SRAM and DRAM cells, bus protocols for hardware
transactional memory, descriptions of Linux APIs for configuring NUMA support,
etc.; frankly, things that need not be known unless dealing with those issues
specifically.

In the article this post links to, on the other hand, a handful of key
characteristics of the cache hierarchy are displayed graphically using pretty
minimal code snippets and attractive graphs. I believe programmers would
generally be better served by the intuitions gained from the blog post than
daunted by the exhaustive detail and verbose presentation of Ulrich's book.

IOW, keep in mind simple things: measure; be aware of the cheapness of
adjacent accesses, of the possibility of false sharing with certain strides,
of the cost of following pointers, of the pitfalls of concurrent access to
adjacent memory locations, and even of serial access to the same memory
location (dependencies w.r.t. ILP). Concentrate on building a mental model
that provides intuitions about cache behaviour at the appropriate level of
abstraction, but always be prepared to measure to verify those intuitions,
since the hardware changes all the time and different architectures can give
surprisingly different results.
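
To make the pointer-following point concrete, here's a rough sketch (not from
the article; the Node class and the sizes are made up for illustration)
comparing a straight pass over an int array with a walk over a linked list
whose nodes are linked in shuffled order. The absolute numbers vary wildly by
machine, which is rather the point: measure.

    import java.util.Random;

    public class PointerChase {
        static final int N = 1 << 21;               // ~2M elements

        static class Node { int value; Node next; }

        public static void main(String[] args) {
            int[] array = new int[N];

            // Allocate nodes in order, then link them in a shuffled order so
            // that following 'next' jumps around the heap instead of moving
            // to an adjacent object.
            Node[] nodes = new Node[N];
            for (int i = 0; i < N; i++) nodes[i] = new Node();
            Random rnd = new Random(42);
            for (int i = N - 1; i > 0; i--) {        // Fisher-Yates shuffle
                int j = rnd.nextInt(i + 1);
                Node tmp = nodes[i]; nodes[i] = nodes[j]; nodes[j] = tmp;
            }
            for (int i = 0; i < N - 1; i++) nodes[i].next = nodes[i + 1];
            Node head = nodes[0];

            // Adjacent accesses: hardware prefetch does most of the work.
            long t0 = System.nanoTime();
            long sum = 0;
            for (int i = 0; i < N; i++) sum += array[i];
            long t1 = System.nanoTime();

            // Pointer chase: every load depends on the previous one.
            long sum2 = 0;
            for (Node n = head; n != null; n = n.next) sum2 += n.value;
            long t2 = System.nanoTime();

            System.out.printf("array: %d ms, list: %d ms (sums %d/%d)%n",
                    (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sum, sum2);
        }
    }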

~~~
ajross
Then perhaps you'd be better served by the linked article, which is kinda the
"dumbed down" version of the Drepper whitepaper. But I don't see why detail
is bad. The Drepper paper is pretty much the _only_ place, short of JEDEC
specs and manufacturer datasheets, where you can get an exposition of how
modern DRAM works.

(FWIW: his similar whitepapers on ELF shared libraries and glibc TLS are
equally great, and also pretty much the canonical review documents on those
subjects.)

~~~
palish
_But I don't see why detail is bad._

Too much detail is bad because it hides the relevant knowledge. That is, there
are certain paragraphs / certain images that the programmer will remember.
Those are the important components. Brevity is good.

------
ewjordan
Has anyone run similar tests on the JVM after warmup?

Theoretically the JVM should be pushing memory around to negate as much of the
cache penalty as possible, and I've definitely noticed this in some algorithms
(this is how people come up with code where Java outperforms almost identical
code in pure C, and I've seen 3:1 speed differences in favor of the JVM), but
I'd be curious to see the graphs.

Maybe I'll do it later today if I get a few minutes...

~~~
ajross
That seems very unlikely. Effects like the ones in the article (which is
great, btw) can't be "fixed" simply by moving the data. In the first example,
there's simply no way for any runtime environment to "know" at creation time
that you would be walking the array in strides of 16 elements. And if it tried to
reorder the data dynamically, it would obviously have to _touch_ all that
memory, thus invalidating whatever caching could be done anyway.
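
For reference, the experiment being discussed looks roughly like this in Java
(the original article uses C#; the array size is the article's, the rest is
illustrative). No runtime can rearrange `arr` so that the second loop touches
fewer cache lines without rewriting the program's indexing:

    // Run with a heap of at least ~512 MB for the 256 MB array.
    public class StrideDemo {
        public static void main(String[] args) {
            int[] arr = new int[64 * 1024 * 1024];

            // Loop 1: touch every element.
            long t0 = System.nanoTime();
            for (int i = 0; i < arr.length; i++) arr[i] *= 3;
            long t1 = System.nanoTime();

            // Loop 2: touch every 16th element. Only ~6% of the arithmetic,
            // but it still pulls in every 64-byte cache line, so it tends to
            // take nearly as long as Loop 1.
            for (int i = 0; i < arr.length; i += 16) arr[i] *= 3;
            long t2 = System.nanoTime();

            System.out.printf("full: %d ms, stride-16: %d ms%n",
                    (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
        }
    }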

Where JVMs beat naive C, it's almost always in the locking and synchronization
code. See the Sun paper on Biased Locking for an example of the kind of tricks
you can do by (for example) "guessing" that newly locked objects are only ever
going to be referenced by one thread.
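
The pattern biased locking targets is simply an uncontended monitor. A
minimal sketch (class name made up; whether the lock actually gets biased
depends on the JVM version and flags, e.g. -XX:+UseBiasedLocking on older
HotSpot releases):

    public class UncontendedLock {
        private int count;

        public synchronized void increment() { count++; }

        public static void main(String[] args) {
            UncontendedLock c = new UncontendedLock();
            long t0 = System.nanoTime();
            // A single thread hammering a synchronized method: once the lock
            // is biased toward this thread, re-acquiring it can skip the
            // atomic compare-and-swap an unbiased lock would need.
            for (int i = 0; i < 100_000_000; i++) c.increment();
            long t1 = System.nanoTime();
            System.out.printf("100M synchronized increments: %d ms (count=%d)%n",
                    (t1 - t0) / 1_000_000, c.count);
        }
    }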

But memory is memory. The JVM's view of it lacks a raw pointer, but is
otherwise pretty much the same as C's.

~~~
sketerpot
On that note, I'm a fan of the speculative lock elision in the Azul Systems
machines. When the JVM sees a synchronized section, it can create a snapshot
of the current register and memory state and execute the critical section
speculatively, checking for memory access conflicts with other threads, and
roll back the changes if there's a conflict. Actual locks are a fall-back
option.

It's a special case of hardware transactional memory. (Specifically, bounded
best-effort HTM; it can only track a limited number of memory changes, and
aborts if you push it too far.) It can give small but non-trivial scaling
improvements on multithreaded code.

For example, suppose you have several threads that want to change a hash
table. They could lock the whole hash table first -- or they could go ahead
without locking and abort if there was an actual memory conflict.
Essentially, it makes the locking finer-grained, automatically.
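
The hardware does this transparently for synchronized blocks. As a rough
software analogy (not Azul's mechanism; names here are illustrative), the
same "speculate, validate, fall back to a lock" shape shows up in Java's
StampedLock optimistic reads:

    import java.util.concurrent.locks.StampedLock;

    public class OptimisticPoint {
        private final StampedLock lock = new StampedLock();
        private double x, y;

        public void move(double dx, double dy) {
            long stamp = lock.writeLock();
            try {
                x += dx;
                y += dy;
            } finally {
                lock.unlockWrite(stamp);
            }
        }

        public double distanceFromOrigin() {
            // Speculative read: no lock taken, just a stamp to validate later.
            long stamp = lock.tryOptimisticRead();
            double curX = x, curY = y;
            if (!lock.validate(stamp)) {
                // A writer got in the way: retry under a real read lock.
                stamp = lock.readLock();
                try {
                    curX = x;
                    curY = y;
                } finally {
                    lock.unlockRead(stamp);
                }
            }
            return Math.sqrt(curX * curX + curY * curY);
        }
    }

Readers call distanceFromOrigin() concurrently with move(); in the common
no-conflict case they never touch the lock at all.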

------
lutorm
Nice summary. His "example 6", about false cache sharing, bit me once when I
was allocating data structures for my threads and they ended up close enough
in memory that threads shared cache lines. It took a while before I figured
it out, and probably the only reason I did was that it didn't happen all the
time, so I kept wondering why the code sometimes ran slower. If it had been
consistent I would just have thought the code was slow.

While there are profilers that spit out cache hit rates, they usually report
at the scale of whole threads, which makes it very difficult to pin down
which code is at fault. You also have to have enough experience to know what
a "normal" hit rate is, which I'm not sure I have.
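
For anyone who hasn't hit this: a minimal sketch of the situation (field
names made up, and field layout is ultimately up to the JVM, so treat it as
an illustration rather than a guaranteed repro). Two threads hammer counters
that most likely share a 64-byte line, then two counters padded apart:

    public class FalseSharingDemo {
        static final long ITERS = 100_000_000L;

        // Hot counters declared back to back: very likely on one cache line.
        static volatile long a0, b0;

        // Hot counters with unused longs between them, so they very likely
        // land on different cache lines.
        static volatile long a1;
        static long p1, p2, p3, p4, p5, p6, p7;   // padding, never read
        static volatile long b1;

        static long time(Runnable r1, Runnable r2) throws InterruptedException {
            Thread t1 = new Thread(r1), t2 = new Thread(r2);
            long t0 = System.nanoTime();
            t1.start(); t2.start();
            t1.join();  t2.join();
            return (System.nanoTime() - t0) / 1_000_000;
        }

        public static void main(String[] args) throws InterruptedException {
            System.out.println("likely same line:      " +
                    time(() -> { for (long i = 0; i < ITERS; i++) a0++; },
                         () -> { for (long i = 0; i < ITERS; i++) b0++; }) + " ms");
            System.out.println("likely different line: " +
                    time(() -> { for (long i = 0; i < ITERS; i++) a1++; },
                         () -> { for (long i = 0; i < ITERS; i++) b1++; }) + " ms");
        }
    }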

------
RiderOfGiraffes
The next level after caching relates to paging and virtual memory. A
submission explaining this in a humorous fashion is here:

<http://news.ycombinator.com/item?id=1032528>

------
wingo
Remarkable clarity, and excellent diagrams.

------
palish
Keep in mind that 99% of the time, choice of algorithm determines your code's
performance way more than worrying about cache lines.

~~~
djcapelis
Yes, but be careful with this if your algorithms knowledge comes from a
traditional algorithms course. These are often taught as if computational
complexity were the only concern, when an algorithm's _space_ behaviour is
often more of an issue.

So while the choice of algorithm is a large issue, one must be careful to
select an algorithm that makes the appropriate space vs. computation tradeoff.
That knowledge is often guided by an awareness of where the system's
bottlenecks are.
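
As a small (made-up) illustration of that tradeoff: a sorted long[] probed
with binary search does more "work" per lookup on paper than a HashSet<Long>,
but it stores its keys contiguously and unboxed. Which one wins depends on
the data size and the machine, which is exactly why you measure:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Random;
    import java.util.Set;

    public class CompactLookup {
        public static void main(String[] args) {
            int n = 1_000_000;
            Random rnd = new Random(1);

            // Same keys stored two ways: compact sorted array vs. hash set.
            long[] sorted = new long[n];
            Set<Long> set = new HashSet<>(2 * n);
            for (int i = 0; i < n; i++) {
                long v = rnd.nextLong();
                sorted[i] = v;
                set.add(v);
            }
            Arrays.sort(sorted);

            // Probe with keys known to be present.
            long[] probes = new long[1_000_000];
            for (int i = 0; i < probes.length; i++) probes[i] = sorted[rnd.nextInt(n)];

            long t0 = System.nanoTime();
            int hits = 0;
            for (long p : probes) if (Arrays.binarySearch(sorted, p) >= 0) hits++;
            long t1 = System.nanoTime();

            int hits2 = 0;
            for (long p : probes) if (set.contains(p)) hits2++;
            long t2 = System.nanoTime();

            System.out.printf("binary search: %d ms, HashSet: %d ms (hits %d/%d)%n",
                    (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, hits, hits2);
        }
    }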

Once you've chosen the correct algorithm and you're down to optimizing code,
which is what this article discusses, understanding cache effects is critical,
and this article provides a nice view of the basics. The usual rules about not
optimizing unless you need to still apply, but I don't see why that always
merits an algorithms junkie popping out of the woodwork to essentially say
the same thing everyone always says: premature optimization is dumb.

~~~
palish
Man, you are reading way _way_ too far into what I said. Aren't introverts
like us supposed to always be taking statements at face value? And isn't what
I said true? Attacks like "algorithms junkie", and putting words in my mouth,
just make me not want to be here anymore.

For the record, I write software rasterizers for fun. I optimize code using
SSE2. I code mainly in C++. I am the target audience for this article. Maybe
you were trying to illustrate a contrasting point of view, which is fine, but
it is stupid to make it so personal.

~~~
sketerpot
I don't think that "algorithms junkie" is an insult; I'd happily call myself
an algorithms junkie, for example. Algorithms are awesome, and so are people
who write software rasterizers.

(And you know what? Figuring out algorithms that work well on modern hardware
is loads of fun.)

