
How L1 and L2 CPU caches work, and why they’re an essential part of modern chips - nkurz
http://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu-caches-work-and-why-theyre-an-essential-part-of-modern-chips
======
userbinator
In the late 70s and early-to-mid 80s, memory was actually _faster_ than the
CPU, and this is partly what motivated the RISC philosophy - at the time,
instruction decoding was the main bottleneck and there was plenty of memory
bandwidth available, so it made sense to design CPUs to make better use of
that. Now that the situation is exactly the opposite, and we have multilevel
caches, I wonder how different CPUs would be today if the engineers back then
had realised that their surplus of memory performance would be very
short-lived - that cores would keep getting faster while memory access
latencies kept growing - and had designed for that long-term situation.

~~~
colanderman
Does memory really bottleneck instruction throughput though? I've never ever
seen code that stalled on an instruction fetch.

CPUs generally hide latency by auto-prefetching the instruction stream into
cache, and memory _bandwidth_ has been keeping up just fine with CPU speed.
Hence memory latency rarely hinders instruction throughput, unless your code
jumps around unpredictably every few instructions (most doesn't; the only
common class of program I can think of that does is emulators/interpreters).
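
To make that concrete, here is a minimal C sketch (opcodes and handlers are
invented for illustration) of the dispatch loop at the heart of a bytecode
interpreter; the switch compiles down to an indirect branch that is taken
every few instructions, which is exactly the hard-to-predict jumping
described above:

    #include <stdio.h>

    /* Toy bytecode interpreter: the switch is an indirect branch taken
       once per (very short) virtual instruction. With a hard-to-predict
       opcode stream, the frontend keeps fetching the wrong target. */
    enum { OP_INC, OP_DEC, OP_DBL, OP_HALT };

    int run(const unsigned char *code) {
        int acc = 0;
        for (;;) {
            switch (*code++) {          /* unpredictable indirect branch */
            case OP_INC:  acc += 1; break;
            case OP_DEC:  acc -= 1; break;
            case OP_DBL:  acc *= 2; break;
            case OP_HALT: return acc;
            }
        }
    }

    int main(void) {
        unsigned char prog[] = { OP_INC, OP_DBL, OP_DBL, OP_DEC, OP_HALT };
        printf("%d\n", run(prog));      /* ((0+1)*2)*2 - 1 = 3 */
        return 0;
    }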

Probably the only differences in ISA design/implementation between this
universe and one where memory latency tracked CPU speed would be that (a)
instructions would not need to be prefetched into cache (possibly there'd be
no cache at all), and (b) we'd still have wacky instructions like the
TMS9900's X instruction, which executes the instruction at a given memory
address. [1]

[1]
[http://en.wikipedia.org/wiki/Texas_Instruments_TMS9900#Instr...](http://en.wikipedia.org/wiki/Texas_Instruments_TMS9900#Instruction_set_and_addressing)

~~~
sliverstorm
You totally can hide latency with auto-prefetching. But there are a few
problems.

\- This depends on correct branch prediction; otherwise you can easily wind
up having fetched the wrong path

\- L2 latency is, what, 12 cycles? L3 is 28? (Sandy Bridge). To completely
hide this latency, we would need the prefetcher to run 28 cycles ahead.
Considering caches work in _cachelines_ and not _instructions_, this means we
have to run potentially hundreds of instructions ahead of execution, without
mispredicting a single branch... (see the pointer-chase sketch below for a
way to observe these latencies)

Whether your code is going to bottleneck depends entirely on its size and
code patterns. If your software fits in the L1 icache, you will only miss
when context-switching. If your software is 100MB and does not exhibit good
code locality, you might be stalling all over the place.
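
To actually see those latency steps, a dependent pointer chase over a
shuffled array works well: each load depends on the previous one, so the
prefetcher can't help, and the time per load jumps as the working set
outgrows each cache level. A minimal sketch (the sizes, iteration counts,
and cycle numbers above are illustrative; measure on your own machine):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Each load's address depends on the previous load, so we see raw
       load-to-use latency. As the array outgrows L1, then L2, then L3,
       ns/load steps up. */
    int main(void) {
        for (size_t n = 1 << 10; n <= (size_t)1 << 24; n <<= 2) {
            size_t *next = malloc(n * sizeof *next);
            for (size_t i = 0; i < n; i++) next[i] = i;
            /* Sattolo's algorithm: one random cycle through the array. */
            for (size_t i = n - 1; i > 0; i--) {
                size_t j = rand() % i;
                size_t t = next[i]; next[i] = next[j]; next[j] = t;
            }
            size_t p = 0, steps = 10 * 1000 * 1000;
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (size_t s = 0; s < steps; s++) p = next[p];
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                      + (t1.tv_nsec - t0.tv_nsec);
            printf("%8zu KiB: %5.2f ns/load (p=%zu)\n",
                   n * sizeof *next / 1024, ns / steps, p);
            free(next);
        }
        return 0;
    }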

~~~
ganzuul
Is this code locality what makes functional style programming perform well?

~~~
Athas
No. There are only a few properties of functional programming that help
performance, and they are mostly spent compensating for the overhead typically
imposed by high-level languages:

* Easier parallelisation (although compiler auto-parallelisation has turned out to not work well in practice).

* Compacting garbage collectors can minimise fragmentation, and thus improve locality. This is not specific to functional languages, though, but a general benefit of automatic memory management.

* For pure languages, you may be able to implement generational garbage collectors slightly more efficiently, as no object of an older generation will have references to objects in a young generation. This is a very fragile property though - even laziness breaks it - and I'm not sure anyone has tried to use it in practice.

* Purity means that the compiler has to be less careful about optimisations and can make more assumptions about the code. This is part of what makes automatic (or semi-automatic) loop fusion practical in functional languages (see the sketch after this list).
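
To illustrate the loop-fusion point, here is a hedged C sketch of the
rewrite such compilers perform automatically; in C it is only safe because
we can see that f and g (invented here) have no side effects, which is
exactly what purity guarantees for free:

    /* Unfused: writes and re-reads an intermediate array, costing n extra
       stores and n extra loads through the cache. */
    static int f(int x) { return x * 2; }
    static int g(int x) { return x + 1; }

    void unfused(const int *a, int *tmp, int *c, int n) {
        for (int i = 0; i < n; i++) tmp[i] = f(a[i]);
        for (int i = 0; i < n; i++) c[i] = g(tmp[i]);
    }

    /* Fused: the intermediate value stays in a register; the tmp array
       (and its memory traffic) disappears entirely. */
    void fused(const int *a, int *c, int n) {
        for (int i = 0; i < n; i++) c[i] = g(f(a[i]));
    }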

In general, I would not say that functional-style languages "perform well".
They perform "well enough", but naive performance is not the reason to pick a
functional language over an imperative one.

~~~
m0th87
> For pure languages, you may be able to implement generational garbage
> collectors slightly more efficiently, as no object of an older generation
> will have references to objects in a young generation. This is a very
> fragile property though - even laziness breaks it - and I'm not sure anyone
> has tried to use it in practice.

That's how it works in Haskell, despite the laziness:
[https://www.haskell.org/haskellwiki/GHC/Memory_Management](https://www.haskell.org/haskellwiki/GHC/Memory_Management)

~~~
Athas
Huh - how can that work? Take some partially evaluated object 'F <thunk>',
where F is a data constructor. This object is old and thus stored in the
oldest generation. When we force the thunk, we have to update the reference
in the object to point to the resulting value - which, as it is newly
allocated, should be in the nursery. Is the trick simply to allocate the
result in whichever generation the referencing object lives in? As I see it,
the basic problem is that lazy evaluation requires mutation at the heap
level.
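
For concreteness, a minimal C sketch of the mutation being described (an
invented representation, not GHC's actual one; as I understand it, GHC
handles the resulting old-to-young pointer with a write barrier that records
the updated thunk in a remembered set):

    /* Sketch: forcing an old-generation thunk overwrites a field with a
       pointer to a freshly allocated (young-generation) result -- exactly
       the old-to-young reference a naive generational GC assumes away. */
    typedef struct Value Value;
    struct Value {
        int evaluated;
        union {
            Value *result;          /* after forcing */
            Value *(*code)(void);   /* before forcing */
        } u;
    };

    Value *force(Value *thunk) {
        if (!thunk->evaluated) {
            Value *r = thunk->u.code();  /* allocated in the nursery */
            thunk->u.result = r;         /* heap mutation: old -> young */
            thunk->evaluated = 1;        /* a real GC's write barrier would
                                            record thunk in a remembered set */
        }
        return thunk->u.result;
    }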

------
hcrisp
Relevant article by Francesc Alted, creator of PyTables: "Why Modern CPUs Are
Starving And What Can Be Done About It".
[http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf](http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf)

------
jokoon
Isn't there any C/C++ tutorial on how to avoid cache misses?

I've already read the classical argument for data-oriented design and
avoiding linked lists, but I've never really come across anything concrete
on this kind of performance work.

All I hear is "profile, profile, profile". I guess most of the time
profiling will point out a place where my code is slow, but if I find one,
how do I solve it?

For example, can I create some simple bit of code that causes a cache miss
at will? Any example of such code? How do I detect a cache miss?

~~~
bboreham
Cache misses are independent of language; however, C++ gives you a lot of
control over where items are placed in memory.

To create a cache miss at will, all you need to do is force the processor to
look at M different memory addresses that map to the same cache location,
where M is higher than the N-way associativity of your cache. The specifics
differ for each processor, but as an example, Intel Sandy Bridge CPUs have
an 8-way associative 32KB L1 data cache, so if you repeatedly fetch 9
addresses that are 4096 bytes apart (32KB / 8 ways) they all map to the same
set and you will cause cache misses.
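
A hedged C sketch of exactly that experiment (the stride and way count
assume the Sandy-Bridge-like L1 above; adjust both for your own chip):

    #include <stdio.h>
    #include <stdlib.h>

    #define STRIDE 4096  /* 32KB L1 / 8 ways: same-set lines are 4KB apart */
    #define WAYS   9     /* one more than the associativity -> thrashing */

    /* Touch WAYS addresses that all map to the same L1 set. With 9 lines
       competing for 8 ways under LRU, every access evicts a line that is
       needed again next pass, so the loop misses continuously. Set WAYS
       to 8 and the working set fits: the loop runs from L1. */
    int main(void) {
        volatile char *buf = calloc((size_t)WAYS, STRIDE);
        long sum = 0;
        for (int pass = 0; pass < 10 * 1000 * 1000; pass++)
            for (int w = 0; w < WAYS; w++)
                sum += buf[(size_t)w * STRIDE];   /* all in one cache set */
        printf("%ld\n", sum);  /* keep the loads alive */
        return 0;
    }

On Linux you can confirm the difference with something like 'perf stat -e
L1-dcache-load-misses ./a.out', comparing the WAYS=8 and WAYS=9 runs.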

Lots of small tips and links if you Google, e.g.

[http://stackoverflow.com/questions/8744088/what-is-the-best-...](http://stackoverflow.com/questions/8744088/what-is-the-best-way-to-detect-cpu-cache-misses-when-running-an-algorithm)

[http://stackoverflow.com/questions/3359524/detecting-cache-m...](http://stackoverflow.com/questions/3359524/detecting-cache-misses-and-hits-pragmatically-in-linux)

The full detail is in books like
[http://www.intel.co.uk/content/www/uk/en/architecture-and-te...](http://www.intel.co.uk/content/www/uk/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html)

~~~
bboreham
Another interesting case: suppose you have a contiguous array of objects
that are 48 bytes each, and your cache line is 64 bytes. Then half of the
objects will straddle two cache lines. For random access to any one such
object, padding them out to 64 bytes each will run faster on average, as
each access touches main memory only once and evicts fewer cache lines. But
you can now fit only 3/4 as many objects in your cache, so if you need to
access a lot of them then your program will run slower.
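
A hedged C11 sketch of the two layouts (names invented; the _Static_asserts
document the sizes the argument relies on):

    #include <stdalign.h>

    /* Layout A: tightly packed. Half of the objects straddle a 64-byte
       line boundary, so a random access can touch (and evict) two lines. */
    struct obj48 { char payload[48]; };

    /* Layout B: padded and aligned to one full line. Any single access
       touches exactly one line, but only 3/4 as many fit in cache. */
    struct obj64 { alignas(64) char payload[48]; };

    _Static_assert(sizeof(struct obj48) == 48, "packed: 48 bytes");
    _Static_assert(sizeof(struct obj64) == 64, "padded: one line each");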

So, "profile, profile, profile", but also understand factors like this that
pull in different directions in different scenarios.

------
ganzuul
ARM is using L3 cache as a DMA-like interconnect, on the SoC side of things.

~~~
shortsightedsid
I'm not sure about ARMv8, but in ARMv7 SoCs the ARM subsystem does not
include an L3. Plus, each L1 is private to its core.

~~~
ganzuul
I believe this is a relevant source: [http://www.arm.com/products/system-ip/interconnect/corelink-...](http://www.arm.com/products/system-ip/interconnect/corelink-ccn-504-cache-coherent-network.php)

The L3 cache is 'integrated', which I presume means these are different
names for the same thing.

~~~
shortsightedsid
Ah! Thanks for sharing. It is for ARMv8, which I really need to read up on.

------
bluedino
This is probably a simple question, but the answer may not be obvious to
others either: how effective are caches on a system that is running many
different processes at once, let alone hundreds? Is the working set that
each process uses at any given time small enough to fit in the cache
alongside all the others?

Looking back at caches that used to be 4k, or 64k (external), and are now
256k or even measured in megabytes, I suppose that could be the case.

~~~
pron
The number of threads a CPU can run at once is equal to the number of cores
(we'll ignore hyperthreading for the sake of simplicity), and the number of
cores is equal to the number of (L1) caches.

If you're referring to kernel-level multitasking, then these processes don't
really run concurrently from the CPU's perspective; in fact, their switching
interval is quite long (relative to the processor speed).

However, the cost of bringing a thread's data into the cache once it's been
scheduled by the kernel is not negligible at all, and is a significant part
of the total task-switching cost.

------
mmphosis
All that complexity (and space) to handle the cache - why not simply add
more cache, lots more cache, and not use RAM at all?

~~~
acveilleux
Cost. More cache means a bigger die and lower yield. Also, once the area of
the cache gets too big, the latency increases too much. That's why the P3
moved the L2 cache back onto the die, while on the P2 it sat in a separate
package...

~~~
Sharlin
Yes. It's important to realize that at current clock speeds, much of the
latency is simply a direct result of the finite speed of light. You can't
fit gigabytes of memory in a location that's physically close enough to the
processor.
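
A hedged back-of-envelope in C (3 GHz is an assumed clock, and signals in
copper travel at roughly half of c, so the real budget is even tighter):

    #include <stdio.h>

    /* How far can a signal travel in one clock cycle, at best? */
    int main(void) {
        double c_cm_per_ns = 29.98; /* speed of light in vacuum, cm/ns */
        double clock_ghz = 3.0;     /* assumed core clock */
        double cycle_ns = 1.0 / clock_ghz;
        printf("cycle = %.3f ns; one-way %.1f cm; same-cycle round trip %.1f cm\n",
               cycle_ns, c_cm_per_ns * cycle_ns,
               c_cm_per_ns * cycle_ns / 2.0);
        return 0;
    }

So even at the vacuum limit, memory more than a few centimetres away cannot
answer within a single 3 GHz cycle.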

~~~
techrat
Hence memory slots are always placed physically parallel to the CPU socket,
and the innermost traces from the memory socket to the CPU have squiggles in
them (to match the length of the longer traces along the edges of the
socket, so timings stay in sync). It's also why die shrinks matter more for
performance than scaling out: if a chip is physically too big, the speed of
light can limit its performance before other inefficiencies do.

Really fascinating, and it reinforces that no matter how much "faster" one
imagines computers getting in the future, there is still that speed limit on
information.

