This isn't "fill the reticle with CPU", it's "make a dozen separate chips and package them on a network fabric". The amount of cache can then increase linearly with the number of cores without a problem.
There is some outstanding uncertainty about cache coherency vs performance as N goes up, which shows up in NUMA-cliff fashion. My pet theory is that's what will ultimately kill x64 - the concurrency semantics are skewed really hard towards convenience and thus away from scalability.
I know next to nothing about CPU architecture so please forgive a stupid question.
Are you saying that the x86 memory model means RAM latency is more impactful than on some other architectures?
Is this (tangentially) related to the memory mode that Apple reportedly added to the M1 to emulate the x86 memory model and make emulation faster - presumably to account for assumptions that compilers make about the state of the CPU after certain operations?
The preceding comment was about CPU cache coherency, which is the bane of symmetric multiprocessor (SMP) system design. The problem arises because main memory is shared across processors (or CPU cores). Consider this rough sketch:
Each CPU (or CPU core) has its own private L1 cache that the other CPUs/CPU cores cannot access. Now,
code running on CPU 0 has modified 32 bytes at address 0x1234, but the modification does not occur directly in main memory; it takes place within the cache, and the change now has to be written back into main memory. Depending on the complexity of the system design, the change has to propagate back through a hierarchy of L2/L3/L4 caches (POWER CPUs have an L4 cache) until the main memory that is shared across all CPUs is updated.
It is easy and simple if no other CPU is trying to access the address 0x1234 at the same time – the change is simply written back and the job is done.
But when another CPU tries to access the same address 0x1234 while the change has not yet made it back into main memory, there is a problem: stale data reads are typically not allowed, so the other CPU/CPU core has to wait for CPU 0 to complete the write-back. With multiple cache levels involved in a modern system design, keeping them all in agreement - the cache coherency problem - is very complex to solve in SMP designs.
This is a grossly oversimplified description of the problem, but it should illustrate what the parent was referring to.
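To make the cost concrete, here is a minimal sketch of my own (not from the parent; the 64-byte line size and the exact slowdown are assumptions that vary by machine): two threads increment counters that happen to sit on the same cache line, so every write drags the line between the cores' private caches, versus the same work with the counters padded onto separate lines.

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    // Counters packed onto one cache line: every increment invalidates the
    // line in the other core's private cache, so the line ping-pongs.
    struct Shared {
        std::atomic<long> a{0};
        std::atomic<long> b{0};
    };

    // Counters padded onto separate (assumed 64-byte) lines: each core keeps
    // its own line, and the coherency traffic between them disappears.
    struct Padded {
        alignas(64) std::atomic<long> a{0};
        alignas(64) std::atomic<long> b{0};
    };

    template <typename T>
    double run(T& s) {
        auto t0 = std::chrono::steady_clock::now();
        std::thread t1([&] {
            for (long i = 0; i < 50000000; ++i) s.a.fetch_add(1, std::memory_order_relaxed);
        });
        std::thread t2([&] {
            for (long i = 0; i < 50000000; ++i) s.b.fetch_add(1, std::memory_order_relaxed);
        });
        t1.join();
        t2.join();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        Shared s;
        Padded p;
        std::printf("same cache line:      %.2fs\n", run(s));
        std::printf("separate cache lines: %.2fs\n", run(p));
    }

On a typical multi-core box the shared-line version is noticeably slower even though the threads never touch each other's counter; the difference is pure coherency traffic, and that is the machinery which gets harder to scale as core counts climb.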
Thanks for the explanation; however, I was more curious about why ARM would have an advantage over x86.
I think the sibling comment explains it - x86 makes memory consistency promises that are increasingly expensive to keep, suggesting that x86’s future success might be limited by how much it can scale in a single package.
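To illustrate what those promises look like in code, here is a hedged sketch of the classic message-passing litmus test (my own example, assuming the compiler emits the relaxed operations in program order): x86 hardware does not reorder stores with other stores or loads with other loads, so the "flag seen but data missing" outcome should not occur there, while ARM's weaker model allows it unless you add barriers or use release/acquire operations. Providing that stronger guarantee across ever more cores and caches is the expense being referred to.

    #include <atomic>
    #include <cstdio>
    #include <functional>
    #include <thread>

    std::atomic<int> data{0}, flag{0};

    void writer() {
        data.store(42, std::memory_order_relaxed);  // on ARM these two stores may
        flag.store(1, std::memory_order_relaxed);   // become visible out of order
    }

    void reader(int& seen_flag, int& seen_data) {
        seen_flag = flag.load(std::memory_order_relaxed);  // on ARM these two loads
        seen_data = data.load(std::memory_order_relaxed);  // may also be reordered
    }

    int main() {
        int anomalies = 0;
        for (int i = 0; i < 100000; ++i) {
            data = 0;
            flag = 0;
            int f = 0, d = 0;
            std::thread t1(writer), t2(reader, std::ref(f), std::ref(d));
            t1.join();
            t2.join();
            if (f == 1 && d == 0) ++anomalies;  // ruled out by x86 TSO, permitted on ARM
        }
        std::printf("saw the flag without the data %d times\n", anomalies);
    }

In portable C++ you would write these as release/acquire operations anyway; the point is that on x86 those orderings come essentially for free from the hardware, whereas weaker architectures pay for them only where the code asks.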