

Optimizing code for instruction cache - smcl
http://www.eetimes.com/design/automotive-design/4017551/Optimizing-for-instruction-caches-part-1

======
Hoff
If you don't know some basics, you can get in trouble.

But if you do and if you get particularly clever, you can get in trouble on
the next generation of whatever you're designing for.

TANSTAAFL.

With some of that C code shown, I'd try a "static inline" on the declaration,
and see how well the C compiler and its code generator dealt with it. (If I
can get out of dealing with it and cede the work to the compiler, all the
better...)

And the caching locality discussions are the same ones that arise with virtual
memory out in main memory; if you bounce around all over the place in address
space (e.g., setting the wrong stride on an array, or partitioning your data
differently from your access pattern), you'll incur page faults.

And if you get particularly good with your packing of your data, you can
sometimes trigger word tearing when you're accessing adjacent data within the
same granule from different processes or different threads.

Caches and cache designs, too, are individually funky. There are processors
which got faster by going to smaller and fewer caches and better (lower)
latency. So if you fell out of cache in the new design, you'd assume you'd do
worse, but with lower latency on the box out to memory, you actually did
better pulling from main memory than if you were pulling from second-level on
the previous generation.

~~~
Groxx
Actually, in just about anything non-JITted (without faking it by hinting the
compiler), I see no reason why (basic) locality optimizations would change
with different architectures. A sufficiently smart branch predictor deciding
what to load into caches could take care of such a thing, but that can't
really be relied on (too hard to predict), and it would likely benefit from
such hints as well.

This is all taking into account that there is indeed such a thing as too much
of a "good" thing. More extreme methods of optimization can definitely shoot
you in the foot, and not _just_ while you're writing them.

~~~
Hoff
The word-tearing case had unrelated data values packed within the same granule
of cache storage, and that derailed the running environment in a very subtle
way. With just the right (wrong) timing, you very occasionally saw slightly
different values in the adjacent variable within the granule when (apparently
unrelated) threads were spun up and their updates got tangled and torn. No
shared references. Just sharing that granule.

------
okmjuhb
It's worth mentioning that they're writing for DSP systems. More mainstream
CPUs have a larger instruction cache (and smarter eviction policies), but the
size of the hot code paths is (I imagine; I'm not terribly familiar with DSP)
likely not much bigger, so the problems they mention need solving less
frequently for typical applications.

~~~
anamax
> It's worth mentioning that they're writing for DSP systems.

Yes.

> More mainstream CPUs have a larger instruction cache (and smarter eviction
> policies), but the size of the hot code paths is (I imagine; I'm not
> terribly familiar with DSP) likely not much bigger, so the problems
> they mention need solving less frequently for typical applications.

I'd have assumed the exact opposite. I assume that the size of the "hot
enough" spots grows at least somewhat with the size of the binary. (It
probably grows much more slowly than linearly.) DSP binaries are smaller, so
I'd assume that their hot spots are smaller even given "equivalent" programs.

But, DSP programs are different. Their data is more "regular" and their
instruction sets are specialized for their typical uses. I'd expect better
instruction cache behavior from DSP programs.

Fun fact: access times and feature sizes are related, so if you maximize clock
rate, the maximum size of a one-cycle cache is basically independent of the
feature size. I suspect that almost all implementations can afford such a
cache.

I suspect that the same rule applies to L2 caches, but it's not clear that
every implementation can afford the area. However, I suspect that
implementations that have an L3 have maximized their L2.

DSP implementations are more likely to have traded cache size for dollars, so
I expect that their caches are somewhat less effective on the same programs.

These two factors have opposing effects.

