
Why do CPUs have multiple cache levels? - panic
https://fgiesen.wordpress.com/2016/08/07/why-do-cpus-have-multiple-cache-levels/
======
jgord
When my son was 5 or 6 I had a great discussion about salt containers - the
little one you have on the table, the big packet in the pantry, the pallet
that gets delivered to the supermarket, the vast piles of salt at the salt
mine, and all that salt in the ocean.

Next time we had an egg and he wanted salt I scratched my head and asked him
what should we do .. take our egg to the supermarket or we could take a pack
lunch to the beach, and maybe wave the egg around in the water ? "No daddy,
remember, we have a little salt cache in the kitchen." hehe.

I guess a lot of the world can be seen through the lens of caching data or
physical things.

~~~
csours
That's a beautiful analogy. I wonder if we would still use small salt shakers
if they cost 1000x a large salt container.

Edit: Cunningham's Law in action!
[https://meta.wikimedia.org/wiki/Cunningham%27s_Law](https://meta.wikimedia.org/wiki/Cunningham%27s_Law)

~~~
zepolen
From 100x to 1250x

    
    
                  container (40’)      shaker
        payload:  27,600 kg            100 g
           cost:  $1,000 to $5,000     $2 to $5
        cost/kg:  $0.04 to $0.18/kg    $20 to $50/kg
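
For the curious, the headline range falls out of the rounded cost/kg row; a
trivial check (figures copied from the table above):

    #include <stdio.h>
    
    int main(void) {
        /* Per-kg cost ranges from the table above. */
        double container_lo = 0.04, container_hi = 0.18;  /* $/kg, bulk */
        double shaker_lo = 20.0, shaker_hi = 50.0;        /* $/kg, retail */
        /* Best-case vs. worst-case markup for the shaker. */
        printf("%.0fx to %.0fx\n",
               shaker_lo / container_hi, shaker_hi / container_lo);
        /* prints: 111x to 1250x */
        return 0;
    }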

~~~
dghughes
Farming is similar: $200/tonne of potatoes, yet one 25 kg bag is $10. Quite a
difference between what a farmer gets and what the end product costs, although
not as bad as salt.

------
rwmj
I was "enlightened" many years ago when I asked a colleague (a great
electronic engineer) why we didn't do fast task switching by having two sets
of registers. His reply was that this would require every regular access to a
register to go through an extra gate (to decide which bank of registers you
want to hit), making every access slightly slower.

Larger registers/caches/memories are slower because they need more address
decoding, with the decode time growing by roughly a constant amount each time
the storage doubles in size.
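
You can see the size-vs-latency effect from software with a dependent pointer
chase over growing working sets: average load latency steps up as the buffer
falls out of each cache level. A minimal sketch in C, assuming a POSIX clock
(buffer sizes and iteration count are arbitrary choices, not measurements):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    
    static volatile size_t sink;   /* defeats dead-code elimination */
    
    /* Average latency of a dependent load chain over an n-element buffer.
       Sattolo's algorithm builds one random cycle, so every load depends
       on the previous one and the prefetcher can't help. */
    static double chase_ns(size_t n, long iters) {
        size_t *next = malloc(n * sizeof *next);
        for (size_t i = 0; i < n; i++) next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = rand() % i;               /* j < i: single cycle */
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t p = 0;
        for (long i = 0; i < iters; i++) p = next[p];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        sink = p;
        free(next);
        return ((t1.tv_sec - t0.tv_sec) * 1e9 +
                (t1.tv_nsec - t0.tv_nsec)) / iters;
    }
    
    int main(void) {
        /* From comfortably inside L1 (16 KB) out to DRAM (64 MB). */
        for (size_t kb = 16; kb <= 64 * 1024; kb *= 4)
            printf("%6zu KB: %5.1f ns/load\n", kb,
                   chase_ns(kb * 1024 / sizeof(size_t), 20000000L));
        return 0;
    }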

~~~
sargun
What's the "cost" of registers on an x86-64 chip? How hard would it be to
introduce 10 more general purpose registers? Ones that are used by the kernel,
say, and one set for user space.

~~~
exDM69
There are about 180 registers in the register file already. There are only
names for 16 or so registers; the CPU internally does renaming for them. It
would be possible to specify an ABI that leaves part of the named registers
reserved for the kernel, but that wouldn't really help.

When an interrupt or system call happens, all the registers get pushed to the
stack. The stack is typically in L1 cache (and subject to various CPU
optimizations), so it's really fast to push the 16*64 bit registers to the
stack.

System calls, interrupts and context switches are "slow" not because of
trivial overheads like pushing registers. What's really consuming the time is
secondary effects like TLB flushes, changes to the page tables, polluting the
branch predictor, etc. It takes a very long time for the CPU to "warm up"
again after a context switch.
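
One way to see this split is to time a cheap system call in a tight loop. The
per-call figure is already far more than spilling 16 registers (128 bytes) to
an L1-resident stack would explain, and it still excludes the "warm up" costs
described above, which hit the code around the call. A Linux-specific sketch
(SYS_getpid is invoked raw so libc can't cache the result):

    #include <stdio.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>
    
    int main(void) {
        const long iters = 1000000;
        struct timespec t0, t1;
    
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)
            syscall(SYS_getpid);   /* raw syscall, no libc caching */
        clock_gettime(CLOCK_MONOTONIC, &t1);
    
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 +
                    (t1.tv_nsec - t0.tv_nsec);
        /* Expect on the order of 100 ns or more per call on x86-64. */
        printf("%.0f ns per syscall\n", ns / iters);
        return 0;
    }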

~~~
joseraul

      it's really fast to push the 16*64 bit registers to the stack
    

Since the CPU has 180 registers (with only 16 names), why don't we need to
push all 180 to store context?

~~~
Dylan16807
All that extra state is there to help overlap instructions on a time scale of
nanoseconds. Every instruction has its own name->register mapping. If you're
switching to kernel mode or maybe if you miss a branch or for whatever other
reason, the CPU consolidates you back down to running one instruction. Once
that happens, you only have one name->register mapping and all the registers
have their correct values. The hidden state is reduced to nothing and you only
have to save the "real" state of those 16.

------
daly
Suppose you want to make a salad.

register: a tomato in your hand
level 1 cache: a tomato on the counter
level 2 cache: a tomato in the refrigerator
level 3 cache: a tomato at the store
main memory: a tomato on the plant at the farm
disk: a tomato seed being planted

~~~
nabla9
This analogy is not explaining why.

Why not just make more hands? Why multiple levels?

~~~
seanmcdirmid
We are limited to two hands. The kitchen can only hold so many tomatoes and
other things you need to cook. You could buy all the produce you needed for
the year and store it at your house, but it is wasteful (you actually don't
need it, storing it is expensive, and it displaces other things you need in
your house like your toilet).

In the old days, humans would live near their wild food and not cache much.
Then the Neolithic revolution happened, we started caching seeds and then
surplus produce in cities, and eventually refrigeration came along and we
could cache more. Each new level of caching allowed us to do more (increase
population). Yeah, we could still live as hunter-gatherers, but we would be
doing less. We could still be using ENIACs, too; we would just be computing
slower.

------
i336_
> _For a “Haswell” or later Core i7 at 3GHz, we’re talking aggregate code+data
> L1 bandwidths well over 300GB /s per core if you have just the right
> instruction mix; very unlikely in practice, but you still get bursts of very
> hot activity sometimes._

Reading that reminded me of [http://stackoverflow.com/questions/8389648/how-do-i-achieve-the-theoretical-maximum-of-4-flops-per-cycle](http://stackoverflow.com/questions/8389648/how-do-i-achieve-the-theoretical-maximum-of-4-flops-per-cycle).
I don't 100% understand either
domain, but I think this link is relevant - it's asking how to achieve the
theoretical max of 4 FLOPs per CPU cycle.

~~~
vardump
> it's asking how to achieve the theoretical max of 4 FLOPs per CPU cycle.

Nowadays you can do 32 FLOPs per core per cycle, single precision (counting
FMA as add + mul).
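
The 32 comes from Haswell's two 256-bit FMA ports: 2 FMAs/cycle x 8
single-precision lanes x 2 flops each. A sketch of the kind of loop that can
approach that rate, assuming AVX2+FMA and something like -O2 -mfma (the 10
accumulators, covering the ~5-cycle FMA latency across both ports, are a
tuning assumption):

    #include <immintrin.h>
    #include <stdio.h>
    
    int main(void) {
        /* 10 independent chains: ~5-cycle FMA latency x 2 ports. */
        __m256 acc[10];
        const __m256 x = _mm256_set1_ps(1.000001f);
        const __m256 y = _mm256_set1_ps(0.999999f);
        for (int i = 0; i < 10; i++) acc[i] = _mm256_set1_ps((float)(i + 1));
    
        for (long n = 0; n < 100000000L; n++)
            for (int i = 0; i < 10; i++)
                acc[i] = _mm256_fmadd_ps(x, y, acc[i]);  /* 8 lanes x 2 flops */
    
        /* Fold the accumulators so the work isn't optimized away. */
        __m256 s = acc[0];
        for (int i = 1; i < 10; i++) s = _mm256_add_ps(s, acc[i]);
        float out[8];
        _mm256_storeu_ps(out, s);
        printf("%f\n", out[0]);
        return 0;
    }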

~~~
i336_
Wow. What's the minimum CPU series for that?

~~~
vardump
Haswell.

~~~
i336_
Ah, thanks.

------
forgotpwtomain
I find the real-world analogies quite weak and unnecessary.

If you haven't read it already, this is worth mentioning:
[https://people.freebsd.org/~lstewart/articles/cpumemory.pdf](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf)

~~~
mungoman2
I think they are great for building intuition. What don't you like about them?

~~~
Sylos
I generally like real-world analogies, but in this case I find this one too
elaborate and unnecessary as well.

As a straight answer to the question, it would have sufficed to explain that
accessing a larger cache takes more time and resources than accessing a small
cache.

Then one could have compared that to desk vs. cabinet once to make it visual,
but there's no need to extend that analogy for each individual cache level.

That just exhausts the reader and makes it near-impossible to tell which parts
of the analogy are relevant/accurate and which parts are just fluff to make
the analogy work.

Alternatively, if you really want to explain each individual detail of caches,
then do go with such an elaborate analogy, but then explain at every step to
what it corresponds and which part of the analogy is relevant.

You shouldn't write out a page's worth of mostly accurate text and then write
a paragraph afterwards explaining how the analogy fits.

Chances are you've lost half of your readership at that point and many (myself
included to be honest) will quit reading at exactly that point, because they
feel like you're repeating yourself.

~~~
smallnamespace
> it would have sufficed to explain that accessing a larger cache takes more
> time and resources than accessing a small cache.

I think the point of the long analogy is to hammer in the intuition that
physical locality has a concrete price in the real world, and the cache
hierarchy is simply a consequence.

------
SixSigma
One article worth reading is:

Machine perception of time, if only nanoseconds were seconds

[1] [http://umumble.com/blogs/hardware/machine-perception-of-time,-if-only-nanoseconds-were-seconds/](http://umumble.com/blogs/hardware/machine-perception-of-time,-if-only-nanoseconds-were-seconds/)

~~~
dTal
This is great.

------
ybaumes
I thought having multiple cache levels was about a trade-off between
performance and cost. The closer to the CPU (or the faster the cache level),
the more expensive it is.

~~~
crististm
Yes. I don't know where he gets the idea that a large L1 cache is to a CPU the
same as a 150m x 150m desk is to a human. Address decoding is done in parallel,
not sequentially. And desks are as large as people find comfortable to produce
and use.

Likewise, if SRAM were as cheap to produce as DRAM, RAM would be as fast as the
CPU (since it uses the same technology as the CPU) and we would not need caches
at all. Imagine gigabytes of L1 cache!

~~~
Symmetry
Well, address decoding can be _started_ in parallel with translation if your
page size lets you do virtually indexed, physically tagged caches, which
applies to only some processors. But that's a separate issue from the
relationship between cache
size and cache speed. That's governed by three things.

First, the larger your cache the more layers of muxing you need to select the
data you want, meaning more FO4s (fanout-of-4 gate delays) of transistor
delay.

Second, the larger your cache's capacity, the physically bigger it is. That
means more physical distance between the memory location and where the data is
used, and thus more speed-of-light delay.

And third there's the issue of resolving contention for shared versus unshared
caches.

So even though you're using the same SRAM in both your L1 and L3, access to
the former takes 4 clock cycles while access to the latter takes around 80.
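
The speed-of-light point is easy to put numbers on; a quick sketch (the 3 GHz
clock is the article's example figure, and on-die signals propagate well below
c, so the real budget is even tighter):

    #include <stdio.h>
    
    int main(void) {
        double c = 3.0e8;           /* speed of light, m/s */
        double clock_hz = 3.0e9;    /* 3 GHz core */
        /* ~10 cm per cycle in vacuum; a few mm of extra wire on a
           physically larger cache eats real fractions of a cycle. */
        printf("%.1f cm per clock cycle\n", c / clock_hz * 100);
        return 0;
    }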

~~~
gchadwick
There's also the fact that as you go down the cache hierarchy the cache
becomes more complicated. An L1 does lookups for a single processor and
responds to snoops. An L3 probably has several processors hanging off it and
may deal with running the cache coherency protocol (e.g. it implements a
directory of which lines are where and sends clean or invalidate snoops when
someone wants to upgrade a line from shared to unique). As a result you've got
layers of buffering, arbitration and hazarding to get through before you can
even touch the memory array.

------
kristianp
Why have L1 caches been the same size for quite a few generations of Intel
Core processors?

"Currently Intel's L1D (level 1 data) cache is 512 lines with 64 bytes each,
32 kB. Been that way for a pretty long time. L1D latency with a pointer is
mostly 4 cycles. Not sure, but I think having 1024 entries would increase that
to 5 cycles.." - vardump, 550 days ago:
[https://news.ycombinator.com/item?id=9001238](https://news.ycombinator.com/item?id=9001238)

The total L1 cache increases as you increase the number of cores though.
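
As a sanity check on the quoted geometry (the 8-way associativity is Intel's
published figure for these parts, not from the quote), the numbers also hint
at why 32 kB is sticky: the set index fits inside the 4 kB page offset, which
preserves the virtually indexed, physically tagged lookup Symmetry describes
above.

    #include <stdio.h>
    
    int main(void) {
        int line = 64, lines = 512, ways = 8;  /* ways: assumed, see above */
        int size = line * lines;               /* 32768 bytes = 32 kB */
        int sets = lines / ways;               /* 64 sets */
        int bits = 0;                          /* set index + line offset */
        for (int n = sets * line; n > 1; n >>= 1) bits++;
        printf("%d kB, %d sets, %d index+offset bits (4 kB page = 12 bits)\n",
               size / 1024, sets, bits);
        /* Doubling the sets would need address bit 12, which isn't known
           until translation finishes -- goodbye parallel lookup. */
        return 0;
    }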

------
rsync
Isn't cost the reason?

That is, the reason you don't use battery backed DRAM for all of your photos
is not because you don't want to, but because 8TB of the stuff would be very
expensive. And so most of us have RAM leading to SSD leading to spinning
platters.

So the reason to have a CPU cache at all is (insert interesting explanations
of caches here). But the reason to have more than one CPU cache is the
relative cost of the first cache, right?

If cost were no object, wouldn't you just have a huge primary cache?

~~~
vvanders
Not really. SRAM (what most caches use) isn't just more costly, it also
consumes a _lot_ more power. You'd be hitting thermal limits much sooner than
with traditional DRAM.

~~~
Symmetry
Is that right? Traditionally DRAM has used much more power because it requires
a periodic refresh whereas the power consumption of SRAM is purely leakage.
Now, maybe leakage power has grown so much in recent years that this is no
longer true but if so I'd find that very surprising. Do you have any numbers?

~~~
vvanders
In SRAM you get leakage + switching current. For something like a cache, where
you're constantly churning bits, this can drive up power quite a bit.

I don't claim to be an expert, so there are probably cases where it's lower.
However, when I was looking into fast RAM to store data from an FPGA for a
simple logic analyzer, most of the SRAM parts were almost 2-4x what DRAM was
for power consumption at the same storage size (with SRAM offering blazingly
fast cycle access).

------
lamontcg
If you're going to combine all the CPU caches into L1, why not also put the
RAM and the SSD into the CPU L1 cache as well and just have 1 TB of L1 CPU
cache?

Working out the cost and power consumption and die size of that might be
instructive -- as a kind of reductio ad absurdum...
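
Taking that literally: a back-of-the-envelope sketch, assuming a textbook
6-transistor SRAM cell and counting only the storage cells (no decoders, tags
or wiring, so this is generous):

    #include <stdio.h>
    
    int main(void) {
        double bytes = 1e12;                 /* the proposed 1 TB of L1 */
        double transistors = bytes * 8 * 6;  /* 6T cell per bit */
        double big_die = 5e9;                /* large ~2016 CPU die, roughly */
        printf("%.1e transistors, ~%.0f entire high-end dies of SRAM\n",
               transistors, transistors / big_die);
        return 0;
    }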

------
Noseshine
So this document surely belongs here:

What Every Programmer Should Know About Memory

[https://www.akkadia.org/drepper/cpumemory.pdf](https://www.akkadia.org/drepper/cpumemory.pdf)

You also get an answer to the question asked in the headline of this thread,
in great detail (most people will probably skip a lot of the details).

~~~
amelius
I also recommend: [https://www.amazon.com/Computer-Architecture-Fifth-Quantitative-Approach/dp/012383872X](https://www.amazon.com/Computer-Architecture-Fifth-Quantitative-Approach/dp/012383872X)

