
Cache Lines Are The New Registers - nkurz
http://simonask.tumblr.com/post/30645840195/cache-lines-are-the-new-registers
======
kabdib
The AT&T Hobbit processor didn't have general-purpose registers, just a stack
and aggressive caching of its top entries -

<http://en.wikipedia.org/wiki/AT%26T_Hobbit>

It's an interesting idea. It probably makes thread switching very cheap (just
change a pointer, a PC and maybe a condition code register).

[As the wiki article says, we were going to use it for the Newton, but the
price kept going up and the chip was buggy.]

On game consoles, at least, successful developers know how the cache works at
a detailed level; it's surprising how often it's cheaper to recompute
something than to fetch it from memory again. It's a waste to fetch a whole
cache line for a single byte, and you'll see good titles make very good
utilization of their cache fetches (e.g., 70%).
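The utilization point can be made concrete. This sketch is illustrative (the struct layout and sizes are made up, not from any particular title): reading one 4-byte field through an array-of-structs wastes most of each 64-byte line fetched, while a struct-of-arrays layout makes nearly every fetched byte useful.

```c
#include <assert.h>

/* Array-of-structs: summing only 'hp' still drags in the whole 64-byte
 * entity, so each cache line yields just 4 useful bytes (~6% utilization). */
struct entity { float x, y, z; int hp; char pad[48]; };

/* Struct-of-arrays: the hp values are packed, so a 64-byte line carries
 * 16 useful ints (~100% utilization for this pass). */
struct entities { float x[256], y[256], z[256]; int hp[256]; };

static int total_hp_aos(const struct entity *e, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += e[i].hp;          /* one new cache line per element */
    return sum;
}

static int total_hp_soa(const struct entities *e, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += e->hp[i];         /* one new cache line per 16 elements */
    return sum;
}
```

Both loops compute the same answer; only the memory traffic differs.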

DSP programming environments often provide ways to lock cache lines so
replacement doesn't hose performance.

At the shared L2/L3 level, things get really complicated, since you suffer
cache replacement from other parts of the system. (Why did our frame rate go
down? / Oh, we sent an HTTPS request, which brought in the crypto stuff and
...). It's not something you can analyze or deal with statically, and good
tools and test methodologies are essential.

~~~
mmagin
The TMS 9900 (and its predecessor, the TI 990 minicomputer) had a similar sort
of idea, except back in the days of fewer layers of caching.
[https://en.wikipedia.org/wiki/Texas_Instruments_TMS9900#Arch...](https://en.wikipedia.org/wiki/Texas_Instruments_TMS9900#Architecture)

(Of course, those days, there weren't orders of magnitude between the time to
execute a single instruction and the time to access DRAM or ROM.)

------
thedudemabry
I like this premise. Unfortunately, cache lines aren't nearly as visible as
registers in a debugger, so the author's claim that next-generation systems
languages may need to provide convenient controls over (and implied visibility
into) this class of optimizations seems particularly apt. Nowadays,
cache-related optimization appears to center on experimentation against a
particular hardware setup. Hopefully, that will improve in the future.

In the meantime, Herb Sutter has been demonstrating some interesting cases of
cognitive dissonance in terms of code optimization. His work is a lot of fun
to read.

------
gillianseed
From the blog post:

> The best any of them can do, however, is to make guesses as to what the
> optimal performance characteristics of the cache structure on each CPU is
> going to be.

Well, actually, most mainstream compilers support profile-guided optimization,
which lets the compiler do a good job of reorganizing code to minimize
instruction cache misses: with profile data in hand, the compiler has the
runtime hot/cold codepath statistics it needs.
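As a rough illustration of the hot/cold split that such profile data drives: GCC and Clang also accept hand annotations that mimic it. The function names here are made up; the attributes and builtin are real GCC/Clang extensions.

```c
#include <assert.h>
#include <stdio.h>

/* The cold attribute pushes this function toward a separate
 * .text.unlikely section, keeping the hot path dense in the
 * instruction cache, roughly the layout decision PGO would make
 * from real profile data. */
__attribute__((cold, noinline))
static int handle_error(int code) {
    fprintf(stderr, "error %d\n", code);
    return -1;
}

static int process(int x) {
    /* __builtin_expect biases block placement the same way a profile
     * showing this branch as rarely taken would. */
    if (__builtin_expect(x < 0, 0))
        return handle_error(x);
    return x * 2;   /* hot path stays contiguous */
}
```

With actual PGO (`-fprofile-generate`, run, then `-fprofile-use`) the compiler derives these hints itself instead of trusting the programmer.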

------
chj
Any sane computer architecture should assume high availability of stack
memory. The idea of reserving a memory block to replace the stack won't buy
you anything.

Btw: why not write a test instead of just guessing?

------
primitur
Seems to me the answer lies in having the compiler more tightly bound to the
nature of the OS - such that, for example, adjustable stack sizes and
generalized metrics about application performance can be used to make apps
fill cache lines most optimally.

In embedded/safety-critical work, such metrics are available; often, though,
compiler/OS levels of support/integration/inter-operation are insufficient for
high-performance tuning .. and of course, for a lot of industrial
applications, not necessary (or encouraged, at least) in the first place.

However, I wonder what an attempt such as the one to make an exclusively Lua-
based OS (and, thus, application set) might produce in terms of optimizing
cache line fills. I say "Lua" because it has a pretty good mechanism for
determining memory usage patterns ("all elements", "each element", "diverse
access patterns" .. something tables and metatables can easily assist with).
With a strong app base on a LuaOS tweaked specifically for the cache/CPU
combination it's running on, perhaps we will see some interesting new
endeavours in this department ..

------
csense
The article's last point is a little questionable:

> compilers could defer stack usage in favor of a reserved area of program
> memory for temporary values, that has a very high likelihood of always being
> in cache.

"A reserved area of program memory for temporary values, that has a very high
likelihood of always being in cache" -- this is a fair definition of the
stack.

> the stack is function-local.

Depends on your definition of "function-local". Yes, variables local to an
individual function are stored on the stack, but the stack frames for multiple
function calls are adjacent in memory. It's not like entering or exiting a
function will always switch to a different cache line. Someone who looks at
"disassembly of optimized code in the wild" should be well aware of this.

> For temporaries that don’t need to persist past calls to other functions,
> some performance could be gained by avoiding cache misses on stack memory
> that isn’t going to be used after the function returns anyway.

You should be able to achieve this in the current world with aggressive use of
scopes to explicitly expire particular locals before the end of a function, or
use of inline functions.
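A minimal sketch of the scope trick (illustrative only): the inner block lets the compiler treat a large temporary as dead before the rest of the function runs, so its stack slots, and the cache lines backing them, can be reused.

```c
#include <assert.h>

static int checksum(void) {
    int result;
    {
        /* 'scratch' lives only inside this block; after the closing
         * brace its stack memory is dead, and the compiler is free to
         * reuse those slots (and cache lines) for later locals. */
        int scratch[64];
        for (int i = 0; i < 64; i++)
            scratch[i] = i * i;
        result = scratch[63];
    }
    /* code below here cannot legally touch scratch's memory */
    return result;
}
```

Whether a given compiler actually exploits the early death is, as the comment says, something you would have to verify in the disassembly.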

It is an idea for a compiler optimization, however: expire a local variable at
the last point it is used in a function. There are a number of subtleties,
though: the compiler would have to be able to reorder local variables to make
sure the expirable ones are at the top of the stack when they expire, would
have to mark variables whose addresses are taken as non-expirable (unless the
pointers, or values derived from them, don't escape the function), and would
make stack frames more complicated to parse for debuggers and backtrace
displays.

------
DannyBee
Prefetch insertion was done years ago, but disabled (at least in GCC) because
it has never shown any performance benefit, even when we disassemble and see
it inserting prefetches in exactly the places it should. In general, the
hardware prefetchers are spectacular in most x86 implementations, and it's
roughly impossible to beat them in all but the weirdest situations.

(This is true of at least every processor since Core Duo, AFAIK)
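For reference, the kind of prefetch such a pass would insert can be written by hand with GCC/Clang's `__builtin_prefetch`. The loop below is a made-up example of the irregular, index-driven access pattern that is one of the few places hardware prefetchers can struggle:

```c
#include <assert.h>

/* Gather through an index array: the data[] accesses follow idx[], so
 * the hardware prefetcher sees no regular stride. The prefetch looks 8
 * iterations ahead; arguments 0 and 1 are a read hint and a low
 * temporal-locality hint. On most modern x86 parts this changes little,
 * matching the observation above. */
static long sum_indirect(const int *idx, const int *data, int n) {
    long s = 0;
    for (int i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&data[idx[i + 8]], 0, 1);
        s += data[idx[i]];
    }
    return s;
}
```

Measuring with and without the builtin on your own workload is the only way to know whether it helps.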

Graphite/et al (and similar implementations) are the future. They are a very
large amount of work to get right, working, and performing well, but the
benefits are immense.

They can basically transform loops optimally for any set of cost functions you
can come up with.

(This is actually the easy part, the hard part is generating reasonable code
from the resulting loops :P)

There are numbers, but not from those particular implementations.

------
Dylan16807
_Every time the operating system decides that it’s time to run a different
process, all caches are invalidated (on account of the virtual memory
system)._

They what? I know most processors will blow the TLB, but I don't think any
processor would toss out megabytes of cache.

------
Lagged2Death
Don't modern CPUs (like x86) have about a zillion registers internally, which
are, behind the scenes, allocated to the architectural registers (AX, BX, CX,
etc.) that are seen from the assembly level?

Itanium has a register stack, and I'm pretty sure it wasn't the first.

[http://software.intel.com/en-us/articles/itaniumr-processor-...](http://software.intel.com/en-us/articles/itaniumr-processor-family-performance-advantages-register-stack-architecture/)

~~~
stephencanon
Yes, modern out-of-order processors have on the order of 200 physical
registers which are mapped to program registers by renaming.

Cache and register utilization are critical (register utilization not so much
because of the latency of loads from cache, which are fast, but because loads
from cache are a scarce resource; a typical core can manage only one or two
per cycle), but maximizing TLB reach turns out to be even more important than
cache for some algorithms that deal with large data sets.

------
opinali
HotSpot (Java's JIT compiler) generates prefetchw.

------
zurn
The high level problem is that compilers take the memory layout of data as
given and punt on optimizing the memory behaviour of programs. Partly the
fault of languages that aren't amenable to dynamic data layout optimizations
but it doesn't seem a very active area of research either.

~~~
DannyBee
What?

Compilers have been studying and doing data and loop reordering (for memory
optimization) optimizations for many many years.

In fact, some of these are even famous
([http://www.hpcwire.com/hpcwire/2007-11-09/compilers_and_more...](http://www.hpcwire.com/hpcwire/2007-11-09/compilers_and_more_gloptimizations.html)).
Look at the art hack.

This is a very active area of research. CGO, one of the main compiler
conferences, has had many papers on memory and data optimization in the past 5
years, and PLDI did before that.

Google for "structure layout optimization compiler", "Data structure
optimization compiler", "polyhedral loop optimization compiler".

------
bstx
> Every time the operating system decides that it’s time to run a different
> process, all caches are invalidated (on account of the virtual memory
> system).

Unless you have a virtually indexed cache, I don't see why an OS would want to
do that.

------
zippie
This is also important for index-heavy applications, llds for instance treats
cache lines as registers and heavily optimizes for cache friendly operations:

<https://github.com/johnj/llds>

------
MusicOS
I'm feeling smug. My compiler for LoseThos is not very optimal, but it doesn't
matter! I made many improvements, only to discover no benefit.

For old school programmers, it's just not right, for example, that floating
point math is almost as fast as fixed-point. (I did my graphic library as
fixed-point.)

~~~
rwmj
I guess he's worked out that his last account was hell-banned ...

~~~
Dylan16807
He's known about it. I wonder why he's trying a new one...

