
New cache design speeds up processing time by 15% - srikar
http://arstechnica.com/science/2014/03/new-cache-design-speeds-up-processing-time-by-15/
======
AceJohnny2
Reminds me of the hilarious column by James Mickens, "The Slow Winter", on how
these kinds of improvements used to be easy:

"“I wish that we could predict branches more accurately,” and you’d think,
“maybe we can leverage three bits of state per branch to implement a simple
saturating counter,” and you’d laugh and declare that such a stupid scheme
would never work, but then you’d test it and it would be 94% accurate, and the
branches would wake up the next morning and read their newspapers and the
headlines would say OUR WORLD HAS BEEN SET ON FIRE."

PDF:
[https://www.usenix.org/system/files/1309_14-17_mickens.pdf](https://www.usenix.org/system/files/1309_14-17_mickens.pdf)

~~~
gumby
That is a profoundly awesome column. Thanks for forwarding it!

~~~
MetaCosm
Indeed, THAT is how you cover a somewhat dry topic.

------
habosa
I just recently took an advanced computer architecture course, and I was
stunned by how little I knew about how computers ~actually~ work.

Most people think "memory is fast, disk is slow", but the reality is "the innermost caches are fast, the outer cache levels are slower, memory is really slow, and stay the hell away from disk". You're always taught to think in terms of memory, but the good news is that very intelligent cache designers have made accessing memory directly a relatively infrequent event, and that's one of the main reasons our processors get to do any work at all.

~~~
tinco
Ulrich Drepper wrote a terrible paper titled "What every programmer should know about memory"[1]. It covers basically everything involving memory. It's much too long and is really only relevant for programmers who want (need) to make optimal use of memory. I wouldn't recommend it, but every time someone realises how slow some memory things are compared to other memory things, I guess it's obligatory to link it. So here, don't read it ;)

[1] [http://www.akkadia.org/drepper/cpumemory.pdf](http://www.akkadia.org/drepper/cpumemory.pdf)

~~~
munificent
If you want a friendlier introduction to the topic with some basic patterns to
optimize for it, I wrote a chapter in my book on game programming about
caching:

[http://gameprogrammingpatterns.com/data-locality.html](http://gameprogrammingpatterns.com/data-locality.html)

While the book is ostensibly about games, there's relatively little game-
specific about it.

------
solarexplorer
Link to the paper: [http://people.csail.mit.edu/devadas/pubs/acc-hpca14.pdf](http://people.csail.mit.edu/devadas/pubs/acc-hpca14.pdf)

------
valarauca1
Cache misses are a major problem that many people don't think about, or shy away from because you just have to assume they'll happen.

A load that misses L1 but hits L2 costs on the order of 10 cycles; miss L2 as well and an L3 hit costs several times that. Miss L3 too and go all the way out to RAM, and you're dealing in hundreds of processor cycles of stall.

Glad to see we're moving forward.

~~~
FooBarWidget
That's easy to say, but try to actually debug/optimize CPU caching! The available tools are extremely primitive and hard to use. It's very hard to gain visibility into which parts of your code are causing caching problems, why, and what you can do about it. Practical, comprehensive, useful literature about CPU cache optimization is also almost nonexistent.

The L1 cache is so small, I don't even know how to optimize for it in general-
purpose software. It's not too hard if you have very narrow use cases, e.g. if
you're doing heavy matrix calculations on small localities at a time. But I'm
writing an HTTP server, and a single read() into a 16 KB buffer already blows
my entire L1 cache. Great, how do I optimize that?

~~~
stephencanon
1. Open your favorite sampling profiler (VTune, Zoom, Instruments, whatever).

2. Set it up to sample every N cache misses. Absent extra information, good choices of N are usually prime numbers between ~1000 and ~1000000 (why prime? to avoid artifacts where cache misses are divided evenly among M locations, with M|N).

3. Exercise your program under a typical workload. If the resulting data is too coarse-grained, decrease N; if your program runs too slowly because of sampling overhead, increase N.

4. You now have a detailed accounting of which call stacks and instructions triggered cache misses. Starting with the worst offenders, rearrange your algorithms to maximize the amount of data reuse.

This isn't trivial, but I’d say it’s a very long way from primitive.
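On Linux, the steps above look something like this with perf as the concrete profiler (a sketch: `./your_program` is a placeholder, and the event name and period are one reasonable choice among several):

```shell
# Sample every N cache-miss events; -c sets the period N,
# and 10007 is a prime in the suggested range.  -g records
# call stacks so step 4 can attribute misses to call sites.
perf record -e cache-misses -c 10007 -g ./your_program

# Inspect the samples, heaviest offenders first.
perf report
```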

