For reference, the latest Titan X offers 12 TFLOPs [1] and the upcoming AMD card for deep learning offers 13 [2]. Though it's not clear whether the TPU's performance is quoted at fp16 or fp32. The best GPUs currently available on AWS offer a mere 2 TFLOPs per GPU [3].
Tesla V100 is the thing to compare against, as it's the first chip optimized for training via the Tensor Core operation (a 4x4 matrix multiply-and-accumulate with mixed fp16/fp32 precision: the inputs being multiplied are fp16, the accumulation is fp32). Measured this way, V100 performance is around 100 TFLOPs.
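To make the mixed-precision idea concrete, here's a minimal NumPy sketch of that primitive, D = A×B + C: the multiplicands are stored at fp16, but each product is widened to fp32 before being accumulated. (This is just an illustration of the numerics, not how Tensor Cores are actually programmed.)

```python
import numpy as np

def mixed_precision_mac(A, B, C):
    """Sketch of a Tensor-Core-style 4x4 multiply-accumulate:
    fp16 inputs, fp32 accumulation."""
    # Inputs are rounded to half precision, as they would be in memory.
    A16 = A.astype(np.float16)
    B16 = B.astype(np.float16)
    # Each product is widened back to fp32 before accumulation, so the
    # running sum keeps single precision and avoids fp16 overflow/rounding.
    return A16.astype(np.float32) @ B16.astype(np.float32) + C.astype(np.float32)

A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
C = np.zeros((4, 4))
D = mixed_precision_mac(A, B, C)
print(D.dtype)  # float32
```

The point of the fp32 accumulator is that summing many fp16 products directly would quickly lose precision; accumulating in fp32 keeps the result usable for training.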
Tesla V100 is apparently ridiculously expensive, at around $65,000.
In fact, to an extent even NVidia has realized that there is more money in building a GPU cloud from scratch than in selling GPUs.
I think the net losers are Apple, Amazon/AWS (I believe NVidia is responsible for their lackluster GPU offerings), and Intel (who are still hoping for multi-core to work, and are on track to be disappointed, just as they lost the mobile market to ARM while hoping Atom would eventually be adopted).
[1] https://blogs.nvidia.com/blog/2017/04/06/titan-xp/
[2] http://pro.radeon.com/en-us/vega-frontier-edition/
[3] http://images.nvidia.com/content/pdf/tesla/NVIDIA-Kepler-GK1...