> does the data move from the CPU cache to RAM to the GPU cache
Probably not, because that would need a dedicated channel at the hardware level.
- GPUs are mostly built for streaming applications with large blocks of data, so the CPU cache architecture is usually too different from the GPU's to simply copy (move) data across. Plus, they sit on different chiplets, and a dedicated channel means extra interface pins on the interposer, which are definitely very expensive.
So, while it is possible to build an SoC with a dedicated CPU<->GPU channel (or a channel between chiplets), that is usually done only on very expensive architectures like Xeon or IBM Power, and not on consumer products.
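To illustrate from the software side (my own minimal CUDA sketch, not anything from a specific product; the kernel and sizes are made up): on a typical discrete setup, getting data to the GPU is an explicit copy that stages through system RAM and the PCIe/system bus, never cache-to-cache.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Trivial kernel, just so something on the GPU consumes the data.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *host = (float *)malloc(n * sizeof(float));  // ordinary RAM, cached by the CPU
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    float *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));               // GPU memory (VRAM)

    // The DMA copy reads from system RAM (on x86 the CPU caches are snooped for
    // coherency) and writes into GPU memory over PCIe / the system bus.
    // Nothing here moves CPU cache lines straight into the GPU cache.
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(dev, n);

    // Same in reverse: results come back through RAM, not cache-to-cache.
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("host[1] = %f\n", host[1]);

    cudaFree(dev);
    free(host);
    return 0;
}
```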
For example, on older AMD APUs the graphics core usually has priority over the CPU when accessing the unified RAM, but the CPU cache has no extra logic to handle the memory it shares with the GPU.
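AMD APUs are normally programmed through OpenCL/HSA/ROCm, but the same "shared RAM, separate caches" idea can be sketched with CUDA's mapped (zero-copy) host memory, where the GPU gets a pointer straight into host RAM instead of a copy (names and sizes below are made up):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void add_one(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;   // GPU touches host RAM directly, no cudaMemcpy
}

int main() {
    const int n = 1024;
    cudaSetDeviceFlags(cudaDeviceMapHost);   // allow mapping host memory into the GPU

    // Page-locked host memory that the GPU is allowed to map into its address space.
    float *host = nullptr;
    cudaHostAlloc(&host, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    float *dev = nullptr;
    cudaHostGetDevicePointer(&dev, host, 0); // device-side alias of the same buffer

    add_one<<<(n + 255) / 256, 256>>>(dev, n);

    // The CPU and GPU share the RAM but not their caches, so we must wait until
    // the GPU's writes have actually completed before reading from the CPU side.
    cudaDeviceSynchronize();
    printf("host[0] = %f\n", host[0]);

    cudaFreeHost(host);
    return 0;
}
```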
On the latest IBM Power, and similarly on Xeon, there is a shared L4 cache architecture, where blocks of an extremely large L4 (close to 1 GB per socket on Power, and, as I remember, somewhere around 128 GB on Xeon) can be assigned programmatically to specific core(s). That can give an extremely high performance gain for the applications running on those cores (usually this is most beneficial for databases or things like zip compression).
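I do not know the exact programming interface for that L4, but as a rough analogy: on x86 Linux there is Intel CAT, exposed through the resctrl filesystem, which lets you reserve a slice of the shared L3 for a chosen group of tasks. A host-only sketch (the group name `dbgroup`, the mask value, and the single cache domain are assumptions; it needs root and a CPU with CAT):

```cuda
// Host-only sketch (no GPU involved): reserve part of the shared L3 for this
// process via Linux resctrl (Intel CAT). Mask width and domain count depend on
// the actual CPU, so the values here are purely illustrative.
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

static void write_file(const char *path, const char *text) {
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return; }
    fputs(text, f);
    fclose(f);
}

int main() {
    // 1. Create a resource group (assumes resctrl is already mounted).
    mkdir("/sys/fs/resctrl/dbgroup", 0755);

    // 2. Give the group a slice of L3 on cache domain 0 (here: the low 4 ways).
    write_file("/sys/fs/resctrl/dbgroup/schemata", "L3:0=f\n");

    // 3. Move the current process into the group; its cache fills are now
    //    restricted to (and protected inside) that slice.
    char pid[32];
    snprintf(pid, sizeof(pid), "%d\n", (int)getpid());
    write_file("/sys/fs/resctrl/dbgroup/tasks", pid);
    return 0;
}
```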
Added: an example of how CPU caches differ from GPU caches. For a CPU the usual transaction size is 64 bits or less, maybe 128..256 bits on current hardware, but that is not common on consumer parts (it may be on server SoCs), simply because many consumer applications are not written to make good use of large blocks. For a GPU a 256..1024-bit bus is normal, so its cache certainly works with 256-bit and larger blocks too.
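A small sketch of what "wide blocks" mean in practice on a GPU: when the 32 threads of an NVIDIA warp read 32 consecutive 4-byte values, the hardware merges that into one wide 128-byte cache-line transaction, which is exactly what the GPU cache is built around; a strided pattern breaks it (illustrative kernels only):

```cuda
#include <cuda_runtime.h>

// Coalesced: thread i handles column i of a row-major matrix. In every loop
// iteration the 32 threads of a warp touch 32 consecutive floats, i.e. one
// wide 128-byte transaction.
__global__ void sum_cols(const float *m, float *out, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= width) return;
    float s = 0.0f;
    for (int row = 0; row < height; ++row) s += m[row * width + col];
    out[col] = s;
}

// Not coalesced: thread i handles row i, so neighbouring threads are 'width'
// floats apart. Every thread pulls in its own cache line and the wide bus is
// mostly wasted, even though the total amount of data is the same.
__global__ void sum_rows(const float *m, float *out, int width, int height) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= height) return;
    float s = 0.0f;
    for (int col = 0; col < width; ++col) s += m[row * width + col];
    out[row] = s;
}
```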
Plus, the main idea of a GPU is that its "Compute Units" do not work alone.
I mean, in a CPU you could cut out any single core and it would keep working completely on its own, without the other cores.
In a GPU you typically have blocks of, say, 6 CUs that share one pipeline, and that is how they reach a thousand CUs or more. So all CUs in a block basically run the same program; some architectures allow limited independent branching with huge speed penalties, but mostly there is just one execution path for all CUs.
This is very similar to a SIMD CPU; some GPUs were basically SIMD CPUs with an extremely wide data bus (or even just VLIW). So a GPU cache is certainly optimized for that usage: it provides a buffer wide enough to feed all CUs at the same time.
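That "one execution path for all" shows up directly in GPU code: within one group of lanes (a warp in NVIDIA terms) an if/else is not two independent paths, both sides are executed one after the other with the inactive lanes masked off (illustrative sketch):

```cuda
#include <cuda_runtime.h>

// All 32 threads of a warp share one instruction stream. When they disagree on
// a branch, the hardware runs BOTH sides in sequence, masking off the lanes
// that did not take the current side -- roughly halving throughput here.
__global__ void divergent(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)
        data[i] = data[i] * 2.0f;   // runs with the odd lanes masked off
    else
        data[i] = data[i] + 1.0f;   // then runs with the even lanes masked off
}

// Same work arranged so each warp takes a single path (threads 0..31 all go one
// way, 32..63 the other), so there is no divergence penalty inside a warp.
__global__ void uniform(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / 32) % 2 == 0)
        data[i] = data[i] * 2.0f;
    else
        data[i] = data[i] + 1.0f;
}
```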
I have seen a video where it is stated explicitly that the CPU does not have access to the GPU cache, because "they ran tests, and with this configuration some applications saw a double-digit speed increase, but almost none of the applications they tested showed significant gains when the CPU had access to the GPU cache".
So when the CPU accesses GPU memory, it just reads RAM directly over the system bus and does not try to check the GPU cache. And yes, this means there can be a large delay between the GPU writing to its cache and the data actually reaching RAM where the CPU can see it, but that delay is probably smaller than with a discrete GPU on PCIe.
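That visibility gap is exactly why GPU code has to publish its results explicitly before the CPU reads them. A simplified CUDA sketch (in a real program you would normally just call cudaDeviceSynchronize(); the spin-wait is only here to show the ordering problem, and it assumes a 64-bit Linux system where mapped host pointers are usable on the device):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// The GPU writes a result, forces it out of its caches/write buffers so the
// host can see it (__threadfence_system), and only then raises a flag.
__global__ void produce(volatile float *result, volatile int *flag) {
    *result = 42.0f;          // may linger in GPU caches / write buffers
    __threadfence_system();   // push the write out towards system RAM
    *flag = 1;                // signal "the data is really visible now"
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *result = nullptr;
    int *flag = nullptr;
    cudaHostAlloc(&result, sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc(&flag, sizeof(int), cudaHostAllocMapped);
    *result = 0.0f;
    *flag = 0;

    produce<<<1, 1>>>(result, flag);

    // The CPU never looks into the GPU cache; it only sees what has reached RAM,
    // so it waits for the flag that the GPU raises after its fence.
    while (*(volatile int *)flag == 0) { /* spin */ }
    printf("result = %f\n", *result);

    cudaDeviceSynchronize();
    cudaFreeHost(result);
    cudaFreeHost(flag);
    return 0;
}
```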