The data bandwidth between main memory and the GPU's on-board memory isn't speeding up by the same factors, though. That's fine when the dataset fits in the GPU's memory pool and the workload involves few or no changes to it, because you can transfer it once and repeatedly ask the GPUs to analyse it in whatever ways you like. But as soon as the dataset no longer fits neatly in the GPU's RAM (leaving enough spare for scratch space), you end up thrashing the PCIe channel and the transfer becomes the main bottleneck.
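To make that concrete, here's a minimal CUDA sketch that times the host-to-device copy separately from the on-device work. The 1 GiB buffer and the trivial kernel are just assumptions for illustration; on a typical PCIe link the copy will usually dwarf a single light pass over the data:

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Placeholder workload: one light pass over the data.
__global__ void scale(float *d, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const size_t n = 256u << 20;               // 256M floats = 1 GiB (assumed size)
    float *h = (float *)malloc(n * sizeof(float));
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

    cudaEventRecord(t0);
    // PCIe transfer (pinned memory via cudaMallocHost would be faster,
    // but the bus is still the ceiling).
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    scale<<<(unsigned)((n + 255) / 256), 256>>>(d, n);   // on-device work
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float copy_ms, kern_ms;
    cudaEventElapsedTime(&copy_ms, t0, t1);
    cudaEventElapsedTime(&kern_ms, t1, t2);
    printf("copy: %.1f ms, kernel: %.1f ms\n", copy_ms, kern_ms);

    cudaFree(d); free(h);
    return 0;
}
```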
That's true. There are some workarounds, though, like distributing the workload across multiple GPUs, which is under active development in machine learning, for instance.
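A hedged sketch of what that distribution can look like: shard one large array across all visible devices so each GPU only has to hold (and receive over its own PCIe lanes) its slice. The sizes, the kernel, and the 16-device cap are assumptions for illustration, and error handling is omitted:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *d, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);
    if (ngpu < 1) return 1;

    const size_t n = 256u << 20;          // assumed total size
    const size_t shard = n / ngpu;        // assume it divides evenly

    float *h;
    cudaMallocHost(&h, n * sizeof(float)); // pinned, so async copies overlap

    float *d[16];                          // assumes at most 16 GPUs
    for (int g = 0; g < ngpu; ++g) {
        cudaSetDevice(g);
        cudaMalloc(&d[g], shard * sizeof(float));
        // Each device's copy and kernel run on its own default stream,
        // so the shards are transferred and processed concurrently.
        cudaMemcpyAsync(d[g], h + g * shard, shard * sizeof(float),
                        cudaMemcpyHostToDevice);
        scale<<<(unsigned)((shard + 255) / 256), 256>>>(d[g], shard);
    }
    for (int g = 0; g < ngpu; ++g) {
        cudaSetDevice(g);
        cudaDeviceSynchronize();
        cudaFree(d[g]);
    }
    printf("processed %zu floats across %d GPU(s)\n", n, ngpu);
    cudaFreeHost(h);
    return 0;
}
```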
Still, I would like to see Xeon Phi (Knights Landing) compared to GPUs for database workloads.