I note the only metric they offer is power usage, which is a poor substitute for actual performance numbers. On any given CPU or GPU you could have all cores busily computing, with nothing stalled, and still be a good way below maximum power. To hit maximum power you need carefully constructed software that fully utilizes all functional units and also causes maximum toggling within the computation paths (1 + 2 will toggle fewer bits than 0xffffffff + 0xffffffff).
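To make the toggling point concrete, here is a rough sketch in Python. It is only a crude proxy, not a power model: it assumes a 32-bit adder whose operand and result lines all start at zero, and it counts how many of those lines flip for one addition. The function names (`popcount`, `toggled_bits`) are mine, and real dynamic power also depends on internal carry logic, the previous state of the datapath, clocking and much else.

```python
MASK32 = 0xFFFFFFFF  # assume a 32-bit datapath for illustration


def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def toggled_bits(a: int, b: int) -> int:
    """Crude toggle estimate for one 32-bit add.

    Assumes both operand buses and the result bus start at all zeros,
    so every bit that ends up set counts as one toggle.
    """
    a &= MASK32
    b &= MASK32
    result = (a + b) & MASK32  # wrap around like 32-bit hardware
    return popcount(a) + popcount(b) + popcount(result)


if __name__ == "__main__":
    print(toggled_bits(1, 2))                    # 4
    print(toggled_bits(0xFFFFFFFF, 0xFFFFFFFF))  # 95
```

Under these assumptions the small operands flip 4 lines and the all-ones operands flip 95, so two workloads that both keep the adder "100% busy" can differ a lot in switching activity, and therefore in power.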

The thread doesn't give any clear indication whether their explanation, that the GPU sees significant stalls while waiting on TLB misses, has actual data behind it or is pure conjecture based on observed power usage.