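One more point to add: SIMD-style execution on GPUs is also what makes their memory reads efficient. Because a group of threads runs the same instruction in lockstep, loads that touch neighboring addresses can be batched into a few wide transactions against GPU memory instead of many small ones.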
I can't say I'm an expert yet, but the more I read about highly optimized code on any platform, the more I realize that most of the problem (90%, it feels like) is dealing with memory.
Virtually every optimization guide or tutorial on highly optimized code spends an enormous amount of time on memory problems. Memory bandwidth seems to be the single thing HPC coders think about most.
If you don't have enough computation per byte to perform on the GPU, your total job time starts to be dominated by the time it takes to stage data in and out of the GPU, and you can't keep the GPU cores busy. Even if the CPU is 5-10x slower in issue rate and RAM bandwidth, it can keep calculating steadily at a higher duty cycle, since system RAM can be much larger.
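To put rough numbers on the computation-per-byte point, here is a back-of-envelope sketch. It assumes a deliberately crude model: the data is copied to the GPU, processed, and copied back with no overlap between transfer and compute, and the output is the same size as the input. The function names and the bandwidth and throughput figures are made up for illustration, not measurements of any real hardware.

```python
def offload_estimate(n_bytes, flops_per_byte, link_bytes_per_s,
                     gpu_flops_per_s, cpu_flops_per_s):
    """Crude timing model: copy in, compute, copy out, with no overlap.

    Returns (gpu_total_s, cpu_total_s, transfer_fraction), where
    transfer_fraction is the share of the GPU job spent moving data.
    """
    work = n_bytes * flops_per_byte
    transfer_s = 2 * n_bytes / link_bytes_per_s  # assumes output size == input size
    gpu_total_s = transfer_s + work / gpu_flops_per_s
    cpu_total_s = work / cpu_flops_per_s         # data is already in system RAM
    return gpu_total_s, cpu_total_s, transfer_s / gpu_total_s


def break_even_flops_per_byte(link_bytes_per_s, gpu_flops_per_s, cpu_flops_per_s):
    """Computation per byte at which the GPU path ties the CPU path in this model."""
    if gpu_flops_per_s <= cpu_flops_per_s:
        return float("inf")  # the GPU never catches up once transfers are counted
    return 2.0 / (link_bytes_per_s * (1.0 / cpu_flops_per_s - 1.0 / gpu_flops_per_s))


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
LINK = 10e9   # host <-> GPU link, bytes per second
GPU = 10e12   # GPU throughput, operations per second
CPU = 1e12    # CPU throughput, operations per second

if __name__ == "__main__":
    for intensity in (1, 100, 1000):
        gpu_s, cpu_s, frac = offload_estimate(1e9, intensity, LINK, GPU, CPU)
        print(f"{intensity:>5} ops/byte: gpu={gpu_s:.4f}s cpu={cpu_s:.4f}s "
              f"transfer share={frac:.1%}")
    print("break-even ops/byte:", break_even_flops_per_byte(LINK, GPU, CPU))
```

Real systems can overlap transfers with compute and reuse data already resident on the GPU, so treat this as a pessimistic bound that shows the shape of the trade-off, not a prediction.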
That said, the CPU also benefits from locality, so you should still prefer to structure your work into block-decomposed units if possible. A decomposition that lets you work through a large problem as a series of sub-problems sized for a modest amount of GPU RAM will also let each sub-problem sit higher in the CPU cache hierarchy, for better effective throughput. But if the decomposition adds too much sequential overhead for marshalling or the final reduction of results, it may not beat a monolithic algorithm with reasonably good vectorization and streaming access to the full data.
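As a minimal sketch of that structure, here is a block-decomposed reduction: the input is cut into fixed-size blocks, each block is processed on its own, and the partial results are combined in a final sequential step. The function name and the sum-of-squares workload are hypothetical stand-ins, and pure Python won't show any cache effect. The point is only the shape of the decomposition and where the reduction overhead sits.

```python
def blocked_sum_of_squares(data, block_size):
    """Sum of squares computed one block at a time, then reduced.

    Each block stands in for a sub-problem small enough to fit in GPU RAM
    or a CPU cache level; the final sum over partials is the sequential
    reduction step that has to stay cheap for blocking to pay off.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")

    partials = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]      # marshal one sub-problem
        partials.append(sum(x * x for x in block))  # independent per-block work

    return sum(partials)                            # final reduction


if __name__ == "__main__":
    values = list(range(1_000))
    print(blocked_sum_of_squares(values, block_size=64))
```

With a very small `block_size` the per-block bookkeeping and the longer list of partials start to dominate, which is the overhead trade-off described above.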