All guns are pointed at _memory_
Memory is an uncompetitive industry, a cash cow unseen in history, comparable only to oil. The SEL empire is built not on top of Galaxy Notes, but on a pile of memory chips.
The easiest way to get an order of magnitude improvement right away is to put more memory on die and closer to execution units and eliminate the I/O bottleneck, but no mem co. will sell you the memory secret sauce.
Not only is memory made on proprietary equipment, but decades of research were done entirely behind the closed doors of the Hynix/SEL/Micron triopoly hydra, unlike in the wider semi community, where even Intel's process leaks out a bit in their research papers.
SEL makes a lot of money not only by selling you the well-known rectangular pieces, but also by effectively forcing all top-tier players to buy their fancy interface IP if they want to jump on the bandwagon of the next DDR generation earlier than others: https://www.design-reuse.com/samsung/ddr-phy-c-342/ . This makes them want to keep the memory chip a separate piece even more.
Many companies tried to break the cabal, or work around it, but with no results. Even Apple's only way around this was just to put a whopping 13 megs of SRAM on die.
Changing the classical von Neumann style CPU for a GPU or the trendy neural net streaming processor changes little when it comes to _hardware getting progressively worse_ at running synchronous algorithms because of memory starvation.
You see, the first-gen Google TPU is rumored to have the severest memory starvation problem, as do embedded GPUs without the steroid-pumped memory buses of gaming-grade hardware.
When the PS3 came out, outstanding results on typical PC benchmark tasks were wrongly attributed to it having 8 DSP cores, while they were not used in any way. It was all due to it reverting to skinnier memory that is friendlier to synchronous operation. The amazing SPU performance was all thanks to that too. DSP-style loads benefited enormously from nearly synchronous memory behaviour.
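To put rough numbers on what memory starvation means, here is a minimal sketch using the well-known roofline model: attainable throughput is capped either by peak compute or by memory bandwidth times the work done per byte fetched. The function name and all figures below are made up for illustration and do not describe the TPU or any real chip.

```python
def attainable_gflops(peak_gflops, mem_bandwidth_gbs, flops_per_byte):
    """Roofline bound: the lower of the compute ceiling and the memory ceiling."""
    memory_ceiling = mem_bandwidth_gbs * flops_per_byte
    return min(peak_gflops, memory_ceiling)


# Hypothetical accelerator: lots of compute, a narrow memory bus.
PEAK_GFLOPS = 1000.0    # assumed peak compute
BANDWIDTH_GBS = 30.0    # assumed off-chip memory bandwidth

for intensity in (0.5, 2.0, 100.0):  # FLOPs performed per byte fetched
    bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, intensity)
    print(f"{intensity:6.1f} FLOP/byte -> at most {bound:7.1f} GFLOP/s")
```

Under these assumed numbers, a kernel doing 2 FLOPs per byte can use at most 60 of the 1000 GFLOP/s on paper. The execution units sit idle waiting on memory no matter what style of processor they are.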
Probably didn't work out for the reasons you mention.
Next question: What prevents Intel or AMD from combining a logic process and a DRAM process, or Samsung and others from combining logic into their DRAM chips?
It might be impossible to integrate CMOS logic and DRAM so that the price/speed comes out better than using separate chips. Combining two processes increases the price. CPU/GPU makers don't have the latest DRAM knowledge. The reverse is also true. A partnership is required.
Then there are technological problems: yields differ between processes. CPUs/GPUs operate at high temperatures. DRAM technology needs to adjust, or there needs to be a halfway solution.
It's possible that at some time in the future a new technology called STT-MRAM could replace low-density DRAM and SRAM, and it could be integrated into logic because it can use existing CMOS manufacturing techniques and processes. It will take time. (STT = Spin-Transfer Torque)
If Intel wants to add more cache, they simply paste the template again. What's limiting on-die caches is the competition for space. Chip yields sink roughly in proportion to die size.
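A rough way to see the yield point is the textbook Poisson yield model, Y = exp(-A * D0), where A is die area and D0 is defect density. This is only a sketch: the defect density below is an assumed illustrative value, not a figure from any real fab, and real foundries use more elaborate models.

```python
import math


def poisson_yield(die_area_cm2, defect_density_per_cm2):
    """Expected fraction of defect-free dies under a simple Poisson model."""
    return math.exp(-die_area_cm2 * defect_density_per_cm2)


D0 = 0.1  # assumed defects per cm^2, purely illustrative

for area in (1.0, 2.0, 4.0):  # die area in cm^2
    print(f"{area:.1f} cm^2 die -> yield {poisson_yield(area, D0):.1%}")
```

For small A * D0 the loss is close to linear (Y is about 1 - A * D0), which matches the "roughly proportional" rule of thumb. Past that, every doubling of die area squares the yield fraction, so extra cache area gets expensive fast.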
> The REX Neo architecture gains its performance and efficiency improvements with a reexamining of the on chip memory system, but retains general programmability with breakthrough software tools.
https://insidehpc.com/2017/02/rex-neo-energy-efficient-new-p... (check out the linked video)
Did not seem to be an issue in the paper with the first generation.
An in-depth look at Google's first Tensor Processing Unit (TPU ...
Even with monstrous HBM2 memory, they still have it.
It is probably hard to predict what matrix set to prefetch when you deal with a neural net. So you have cache misses there too.
The other mammoth problem, however, is scaling the deep trench capacitors.
However, if you have a great idea for logic and DRAM on the same die, you are unlikely to be able to build it economically, since you can't get the DRAM IP.