The GPU has a cache on it. So does the CPU. Blow the cache and performance is gone anyway. So uniform memory access is really annoying to implement, really convenient for developers, and a non-issue performance-wise.
> So uniform memory access is really annoying to implement,
Is it really, though? It seems like almost every SoC small enough to be implemented as a single piece of monolithic silicon has gone the route of unified memory shared by the CPU and GPU.
NVIDIA's GH200 and GB200 are NUMA, but they put the CPU and GPU in separate packages and re-use the GPU silicon for GPU-only products. Among solutions that actually put the CPU and GPU chiplets in the same package, I think everyone has gone with a unified memory approach.
Indeed, much like the upcoming AMD Strix Halo and the already-shipping AMD MI300A.
Much like dual-socket servers, where each socket can address all memory, these new servers have two memory systems, one optimized for the CPU and another optimized for the GPU. Seems like a good idea to me: why serialize/deserialize complex data structures between the CPU and GPU, bulk-transfer them, and then poll for completion? With NUMA you can just pass a pointer, the caches help, everything is coherent, and it "just works". No more failures when you don't have enough memory for textures or an LLM; it would just gracefully page to the CPU's memory.
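To make the contrast concrete, here's a minimal sketch of the "just pass a pointer" model, using CUDA managed memory as a stand-in for a coherent shared address space (the kernel, sizes, and names are purely illustrative, not tied to any product mentioned above). The CPU and GPU touch the same allocation directly; there's no staging buffer, no explicit copy, and no separate completion check beyond the sync:

    // Minimal sketch: one allocation, one pointer, visible to both CPU and GPU.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *data, int n, float factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;   // GPU reads/writes the shared allocation
    }

    int main() {
        const int n = 1 << 20;          // illustrative size
        float *data = nullptr;

        // Single allocation addressable from both sides; no cudaMemcpy,
        // no serialize/deserialize step.
        cudaMallocManaged(&data, n * sizeof(float));

        for (int i = 0; i < n; ++i) data[i] = 1.0f;      // CPU writes

        scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);  // GPU uses the same pointer
        cudaDeviceSynchronize();

        printf("data[0] = %f\n", data[0]);               // CPU reads the result
        cudaFree(data);
        return 0;
    }

On an APU-style part that pointer is backed by genuinely shared, coherent memory; on a discrete GPU the runtime approximates it by migrating pages on demand, which is where the performance caveats upthread come back in.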