FPGAs in industry are used for a very small number of specific applications: smart NICs, early stages of wireless networks (5G whilst the standards are being hammered out), military (where you need high performance with no consideration of cost), and embedded and professional video (where the custom I/O is essential).
Generally, unless you're doing something that fits those applications well, the FPGA will not look good, and the same mistakes get made in research time after time. For data centre work these are twice as bad. The four really glaring ones are always:
* Quoting performance without taking into account the time to get the data onto the FPGA (generally via a PCIe link that kills any chance of winning vs. a CPU).
* Assuming performance scales linearly to fill up an FPGA (full FPGAs can't run as fast as 10% full ones without significant effort).
* Profiling only the part of the problem or set of data that your code performs well on, and not reporting how it handles the corner cases that CPUs would obviously do well on.
* Comparing against some noddy s/w solution when you've literally spent the last 3 years of your PhD optimizing the FPGA solution, and doing no background reading to see what the state of the art s/w does.
It just destroys a load of the research we see. The good applications are far less exciting, but the MS Catapult is a great example: the reason it's competitive is that they're using the custom I/O of the FPGA to move data around really quickly. It's almost like a custom smart NIC.
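To make the first mistake in the list above concrete, here is a rough sketch of how a kernel-only speedup changes once the transfer over the link is counted. All the numbers (CPU time, kernel time, data size, a 10 GB/s link) and the function names are hypothetical, picked only to illustrate the arithmetic; they are not measurements from any real system.

```python
def kernel_only_speedup(cpu_s, fpga_kernel_s):
    """Speedup as often quoted: CPU time over FPGA compute time alone."""
    return cpu_s / fpga_kernel_s


def end_to_end_speedup(cpu_s, fpga_kernel_s, bytes_in, bytes_out,
                       link_bytes_per_s):
    """Speedup once moving data to and from the device is included."""
    transfer_s = (bytes_in + bytes_out) / link_bytes_per_s
    return cpu_s / (fpga_kernel_s + transfer_s)


# Hypothetical workload: 1 GB of input, negligible output.
cpu_s = 0.100              # assumed CPU time for the whole job
fpga_kernel_s = 0.005      # assumed FPGA compute time, data already on device
bytes_in = 1_000_000_000
bytes_out = 0
link_bytes_per_s = 10_000_000_000  # assumed effective link rate (10 GB/s)

quoted = kernel_only_speedup(cpu_s, fpga_kernel_s)
actual = end_to_end_speedup(cpu_s, fpga_kernel_s, bytes_in, bytes_out,
                            link_bytes_per_s)
print(f"kernel-only speedup: {quoted:.1f}x")
print(f"end-to-end speedup:  {actual:.2f}x")
```

With these made-up numbers the 20x kernel-only figure drops to roughly 0.95x end to end, i.e. the CPU wins, because the transfer alone takes as long as the whole CPU run.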
In a number of the applications I've seen, the other killers are that, on top of the transfer costs to the device that you mentioned, you also:
1. Have to get information back from the device - and in regular expression matching this might be 1 match in 1000 or 1 match in 5 if you're unlucky, and
2. Have to have a lot of parallelism to hit peak performance, yielding great throughput but so-so latency. At Sensory Networks during our hardware stage, we had a "2 Gbps regex accelerator" (hah) that didn't even hit that modest number on a single stream - it actually required 14 streams or so, each running at 142 Mbps.
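The arithmetic behind that second point can be sketched as follows. The stream count and per-stream rate are the approximate figures quoted above; the 1000-megabit payload and the helper names are hypothetical, and "2 Gbps" is taken as 2000 Mbps.

```python
def aggregate_mbps(streams, per_stream_mbps):
    """Total throughput when every stream runs at the same rate."""
    return streams * per_stream_mbps


def seconds_to_scan(payload_megabits, rate_mbps):
    """Time to push one payload through at a given rate."""
    return payload_megabits / rate_mbps


HEADLINE_MBPS = 2000      # the advertised "2 Gbps", assumed decimal
STREAMS = 14              # "14 streams or so"
PER_STREAM_MBPS = 142

total = aggregate_mbps(STREAMS, PER_STREAM_MBPS)

# Hypothetical single flow of 1000 megabits: it only ever sees one stream.
payload_megabits = 1000
single_stream_s = seconds_to_scan(payload_megabits, PER_STREAM_MBPS)
headline_s = seconds_to_scan(payload_megabits, HEADLINE_MBPS)

print(f"aggregate: {total} Mbps across {STREAMS} streams")
print(f"one flow at per-stream rate: {single_stream_s:.2f} s")
print(f"one flow at headline rate:   {headline_s:.2f} s")
```

The aggregate comes to 1988 Mbps, which is where the headline number comes from, but any single flow takes about 14 times longer than the headline rate would suggest.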
Many of the same sins are repeated for GPGPU.
The other thing that I notice is that the "noddy s/w solution" sometimes is the only thing out there. I looked at some accelerator work on Random Forest inference (not training) and - wow - all the RF implementations are naive. There are a lot of s/w tasks out there that no-one has bothered to optimize with any effort at all.
However, when your adviser says "make a GPGPU/FPGA thesis" I think a smart PhD student just goes and does that, rather than sinking 6 months into building a really great s/w comparison. :-)