
Cache Organization in Intel CPUs (2009) - jxub
http://duartes.org/gustavo/blog/post/intel-cpu-caches/
======
etep
I had wondered why the L1 caches are not growing while L2 and L3 capacities
continue to grow: a significant limitation on L1 cache size is actually the
fixed relationship between capacity, associativity, and page size (i.e. the
4KB pages allocated by the OS to processes).

Because a 4 KB page holds 64 cache lines, you can have at most 64 cache
sets. With an 8-way associative cache this works out to 32 KB (arithmetic
sketched below). Using 128 sets would cause aliasing, but with 64 sets the
cache index is built entirely from the LSBs that just index into the page
(i.e. bits not used in the TLB lookup). Thus, the only ways to increase L1
capacity are to:

- totally abandon 4KB pages in favor of (e.g.) 2MB pages (not likely)
- increase cache associativity (likely imo)
- stop using virtual index + physical tag (not likely imo)
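
A quick back-of-the-envelope version of that constraint (a sketch with
typical numbers: 64-byte lines, 4 KB pages, 8-way L1; not tied to any
particular core):

    #include <stdio.h>

    int main(void) {
        const unsigned page_size = 4096; /* 4 KB pages handed out by the OS */
        const unsigned line_size = 64;   /* bytes per cache line            */
        const unsigned ways      = 8;    /* L1 associativity                */

        /* With virtual index + physical tag, the set index has to come
         * entirely from the page-offset bits, so at most
         * page_size / line_size sets can be used without aliasing.        */
        unsigned max_sets = page_size / line_size;        /* 64 sets        */
        unsigned max_size = max_sets * ways * line_size;  /* bytes          */

        printf("max alias-free L1 size: %u KB (%u sets x %u ways x %u B lines)\n",
               max_size / 1024, max_sets, ways, line_size);
        return 0;
    }

This prints 32 KB; bumping ways to 12 gives 48 KB, which is why growing
associativity is the easiest lever here.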

~~~
CalChris
I think a simpler argument is that for L1 you want fast, not big. Same thing
with registers (a form of cache at a lower level). Why did MIPS only have 32
registers?

 _Design Principle 2: Smaller is faster._ [1]

BTW, if you look at Agner Fog's latency tables [2], _mov mem,r_ (load) went
from 3 cycles in Haswell to 2 cycles in Skylake. So Intel has been
concentrating on faster, which is nice.

And by way of comparison, AMD increased their μop cache size in Ryzen, but
only slightly: way size went from 6 μops to 8. This matches their increase in
EUs.

[1] Patterson and Hennessy. _Computer Organization and Design_ , 5th edition,
p. 67.

[2] http://www.agner.org/optimize/instruction_tables.pdf

~~~
etep
At a high level it's true that smaller is faster, but it's also true that
those L1s could have grown by adding sets (not ways) and achieved the same
latency. L2 has grown, but stayed iso-latency. This seems to say that "smaller
is faster" does not always hold.

Always impressed that Agner Fog takes the time to publish his results. Pretty
amazing. But I think focusing your thinking on the register count in MIPS or
the uarch for some random opcode does not get at the real constraints on L1
cache design at all. One could say that x86 should be even faster, because
hey, it has far fewer than 32 registers (or historically at least it did).

My response is like this: yes, the L1 has to be small to be fast, but it has
been stuck at 32KB forever now. It could have grown! So it's not as simple as
small is fast.

~~~
CalChris
x86_64 has 16 integer registers but Haswell has a 192 entry ROB. Skylake has
224. So Intel does increase these numbers. It's just that there has to be a
good reason. In the 90s maybe something like clock speed could win a marketing
spec battle. Not today.

I think at 6 transistors per bit we really aren't talking about a lot of die
area. Still, I'm stone cold certain the Intel architects would increase L1
cache size if that were beneficial, if it modeled out. (However they may want
to keep performance similar+predictable unless there's a solid win.)
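
Rough arithmetic behind that die-area point (a sketch that counts only the 6T
data cells and ignores tags, decoders, and ECC):

    #include <stdio.h>

    int main(void) {
        const unsigned long long l1_bytes  = 32ULL * 1024;  /* 32 KB L1 data cache */
        const unsigned long long t_per_bit = 6;             /* 6T SRAM cell        */
        unsigned long long transistors = l1_bytes * 8 * t_per_bit;
        /* ~1.6M transistors for the data array -- tiny next to a modern die */
        printf("~%.1fM transistors\n", transistors / 1e6);
        return 0;
    }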

Agner is showing they've _reduced_ L1 latency. So this smaller is faster seems
to have gotten them something.

So you really have to work backwards and ask why they didn't/don't. There may
be more than one reason; but they don't, and haven't in quite some time.

I'm an old school assembly/compiler hack. I read Agner and the Intel
Optimization Manual _a lot_. VTune, IACA and the PMCs. Someone has to do it.

~~~
etep
I think maybe we were talking past each other. Yes there is more than one
reason.

It's far easier to add capacity by adding sets than by adding ways, but they
can't add sets in the L1 because of the aliasing problem. So when they do
increase L1 capacity, if nothing else has changed, it will be by adding
ways.
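
To make the aliasing point concrete, here's a minimal sketch of where a VIPT
L1 gets its set index from (hypothetical helper, 64-byte lines assumed):

    #include <stdint.h>
    #include <stdio.h>

    /* Set index for a VIPT cache: a hypothetical helper, just to show which
     * address bits the index comes from. */
    static unsigned set_index(uint64_t addr, unsigned line_size, unsigned sets) {
        return (unsigned)((addr / line_size) % sets);
    }

    int main(void) {
        uint64_t va = 0x7f1234567abcULL;   /* arbitrary virtual address */

        /* 64 sets: index uses bits 6..11, all inside the 4 KB page offset,
         * so the virtual and physical addresses pick the same set.       */
        printf("64-set index:  %u\n", set_index(va, 64, 64));

        /* 128 sets: the index now needs bit 12, which the TLB translates
         * and which can differ between virtual and physical -- the same
         * physical line could end up in two different sets, i.e. aliasing. */
        printf("128-set index: %u\n", set_index(va, 64, 128));
        return 0;
    }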

------
en4bz
I think the 2009 tag is really important here. The memory hierarchy of Intel's
chips has changed quite a bit since 2009. Core 2 chips didn't even have an L3
cache at the time. The next article in this series talks about the Northbridge,
which has long since been replaced by on-chip integrated memory controllers and
a PCIe root complex. On top of that, brand new chips from both Intel and AMD no
longer have inclusive caches, which is a big departure from the inclusive
hierarchy that has been around for years. As mentioned in the article, the
inclusive nature of the cache is important for multi-threaded applications,
since the L3 cache always contains the master copy of a cache line and threads
can always load the latest copy from L3. I doubt that the behavior of chips
will change, but it will be interesting to see how this affects multi-threaded
programs going forward.
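
If you want to see what the hierarchy looks like on whatever you're running
today, the kernel exposes it under sysfs; a quick sketch (Linux only, cpu0
assumed, standard /sys/devices/system/cpu/.../cache layout):

    #include <stdio.h>
    #include <string.h>

    /* Read one sysfs cache attribute for cpu0, e.g. "level", "type", "size". */
    static void read_field(int idx, const char *field, char *out, size_t n) {
        char path[160];
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu0/cache/index%d/%s", idx, field);
        FILE *f = fopen(path, "r");
        out[0] = '\0';
        if (f) {
            if (fgets(out, (int)n, f)) out[strcspn(out, "\n")] = '\0';
            fclose(f);
        }
    }

    int main(void) {
        for (int i = 0; ; i++) {
            char level[16], type[32], size[16];
            read_field(i, "level", level, sizeof level);
            if (level[0] == '\0') break;        /* no more cache index dirs */
            read_field(i, "type", type, sizeof type);
            read_field(i, "size", size, sizeof size);
            printf("L%s %-12s %s\n", level, type, size);
        }
        return 0;
    }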

