It seems like unified memory has to be the goal. This all just feels like a kludgy workaround until that happens (kind of like segmented memory in the 16-bit era).
Is unified memory practical for a "normal" desktop/server configuration, though? Apple has been doing unified memory, but they also put the GPU on the same die as the CPU. I would be interested to know whether a discrete GPU plugged into a PCIe slot adds enough latency to make unified memory impractical.
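For what it's worth, CUDA's managed memory already gives a software approximation of this on a discrete card: one pointer is valid on both host and device, and the driver migrates pages over PCIe on demand. A minimal sketch, assuming a CUDA-capable discrete GPU and the CUDA toolkit (the kernel and sizes are purely illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel: host and device both touch the same allocation.
__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *x = nullptr;

    // One allocation, one pointer, visible to CPU and GPU.
    // On a discrete card the runtime migrates pages across PCIe on fault.
    cudaMallocManaged(&x, n * sizeof(float));

    for (int i = 0; i < n; ++i) x[i] = 1.0f;      // CPU writes (pages resident on host)

    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);  // GPU touches them (pages migrate to device)
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);                  // CPU reads (pages migrate back)
    cudaFree(x);
    return 0;
}
```

It works, but every migration is a page fault plus a PCIe round trip, which is exactly the latency question; on Apple's parts there is only one pool of memory, so there is nothing to migrate.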
Sure, but we're talking about RPC/syscall for disk and network transfers from the CPU side. Almost nothing on the CPU side can sustain 1 TB/s anyway; you only see that for GPU->GPU transforms, and even then only for very specific workloads. And the only reason you reach off the GPU at all is that you either need new data or you need the CPU to chew on something the GPU can't handle because it's too branchy.
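And when you do reach off the GPU for fresh data, the usual idiom is to stage it asynchronously and overlap the PCIe copy with compute, so what you end up caring about is sustained link bandwidth rather than per-transfer latency. A rough double-buffering sketch, with made-up chunk sizes and a stand-in kernel:

```cuda
#include <cuda_runtime.h>

__global__ void crunch(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = buf[i] * 2.0f + 1.0f;   // stand-in for real work
}

int main() {
    const int chunk = 1 << 22;                  // floats per chunk (illustrative)
    const int nchunks = 8;

    float *h_buf;                               // pinned host memory so copies can be async
    cudaMallocHost(&h_buf, (size_t)nchunks * chunk * sizeof(float));

    float *d_buf[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&d_buf[b], chunk * sizeof(float));
        cudaStreamCreate(&s[b]);
    }

    // Double-buffer: while chunk k computes, chunk k+1 is already crossing PCIe.
    for (int k = 0; k < nchunks; ++k) {
        int b = k % 2;
        cudaMemcpyAsync(d_buf[b], h_buf + (size_t)k * chunk,
                        chunk * sizeof(float), cudaMemcpyHostToDevice, s[b]);
        crunch<<<(chunk + 255) / 256, 256, 0, s[b]>>>(d_buf[b], chunk);
        cudaMemcpyAsync(h_buf + (size_t)k * chunk, d_buf[b],
                        chunk * sizeof(float), cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();

    for (int b = 0; b < 2; ++b) { cudaFree(d_buf[b]); cudaStreamDestroy(s[b]); }
    cudaFreeHost(h_buf);
    return 0;
}
```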