

The other hidden cycle-eating demon: code cache misses - DarkShikari
http://x264dev.multimedia.cx/?p=201

======
Hoff
Cache sizing and memory latency are linked.

Up to a point, larger caches can improve aggregate performance in
higher-latency processor designs. Beyond that point, larger caches
aren't performance- or cost-effective. Conversely, in lower-latency
designs cache misses are less costly, so the L1 or particularly the L2
cache can be smaller.

For a reasonable comparison of what changing the latencies within a design can
provide, here is a LANL write-up from the Alpha microprocessor environment:
the Alpha EV7 had (for its time) low interprocessor and low memory latency,
with toroidal processor links, as compared with its Alpha EV68 predecessors
and with hierarchical or bus-based systems:

<http://www.c3.lanl.gov/PAL/publications/papers/kerbyson02:EV7machine.pdf>

Among the x86 designs, the Xeon Nehalem-class processors have substantially
better memory latencies (around 27 ns local and 54 ns remote) than previous
generations of Xeon processors, and rather better than the Alpha latencies
discussed in the LANL document. Which means the effects of different cache
sizes or access patterns can change.

Branches, too, can play havoc with the instruction stream and with the
efficacy of caching and of instruction decode. Branch often and performance
can suffer. Deeply pipelined designs take bigger performance hits from
mispredicted branches.

------
DarkShikari
This post is a followup to this one:
<http://news.ycombinator.com/item?id=803826>

------
briansmith
Why do you assume that you get the whole L1 code cache to yourself? I would
think that on a real desktop system you would be lucky to get even half of it.

~~~
DarkShikari
The L1 cache is only 32 (or 64) kilobytes. One program's time slice lasts at
least a few milliseconds, easily 100 times longer than would be necessary to
fill the cache with that program's code.

The OS can't dedicate parts of the L1 cache to different applications (the CPU
doesn't offer any feature to allow it), nor would doing so be a good idea.

~~~
ntoshev
I would expect the hyperthreaded cores share their cache between both threads
though.

~~~
DarkShikari
That could actually be a good explanation for why reducing code cache pressure
can help even in cases where it doesn't seem like it should: another thread is
also using that cache.

Though I wonder if that's true of all SMT chips; do any chips have duplicated
L1 caches for exactly this reason?

