- TF can't do anything quantized on GPUs. It just falls back to the CPU/TPU.
- TF gets relatively poor utilization out of the GPU and tends not to be careful with memory use.
- I was able to do certain types of classification hundreds of times faster by looking at what TF was doing and hand-writing it in OCL, using https://docs.rs/ocl/0.14.1/ocl/. It's a super cool library for Rust (see the sketch below). Users should also check out TensorRT: https://github.com/NVIDIA/gpu-rest-engine/tree/master/tensor.... It's not super well supported and may go away, but it is fast.
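For anyone curious, here's a minimal sketch of what driving a hand-written OpenCL kernel through the `ocl` crate looks like. The kernel is a toy element-wise op, not the classifier mentioned above, and the builder method names follow a more recent release of the crate than the 0.14 docs linked (the older API used `create_kernel`/`arg_buf` instead):

```rust
// A hedged sketch: runs a trivial element-wise OpenCL kernel via the `ocl` crate.
// This mirrors the crate's own "trivial" example; it is not the classification
// kernel described above.
use ocl::ProQue;

fn main() -> ocl::Result<()> {
    // Hand-written OpenCL C source: multiply every element by a scalar.
    let src = r#"
        __kernel void scale(__global float* buffer, float factor) {
            buffer[get_global_id(0)] *= factor;
        }
    "#;

    // Compile the program and pick a default device and work size in one go.
    let pro_que = ProQue::builder()
        .src(src)
        .dims(1 << 20)
        .build()?;

    // Device buffer sized to match the work dimensions above.
    let buffer = pro_que.create_buffer::<f32>()?;

    let kernel = pro_que
        .kernel_builder("scale")
        .arg(&buffer)
        .arg(2.0f32)
        .build()?;

    // Enqueue is unsafe because Rust can't check the kernel source.
    unsafe { kernel.enq()?; }

    // Read the result back to the host and spot-check one element.
    let mut host = vec![0.0f32; buffer.len()];
    buffer.read(&mut host).enq()?;
    println!("buffer[0] = {}", host[0]);
    Ok(())
}
```

The win in my case came from fusing what TF ran as several separate ops into one kernel like this, so the data never left the GPU between steps.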
They were proposing a chip that did nothing but a limited set of linear algebra operations at gigabit rates. They were former Transmeta people.
Perhaps the biggest recent success story in this field is Anton.
They claim to have 3.5x as much on-chip memory as a GPU, but the R9 Fury X has 16.7 MiB of register memory compared to their 28 MiB. And then of course there are caches on top of that (which, funnily enough, add up to less than the register memory, I believe).
I also don't get how they come up with those MAC numbers. An RX Vega 64 can do 27 TFlop/s of half-precision arithmetic, which is way more than 1/25 of the 92 TOp/s they claim for the TPU (92/27 is only about 3.4x). In fact, it makes the GPU look pretty damn good, considering the TPU only does 8-bit ops.
Of course I'd expect the TPU to beat a GPU in terms of perf/watt, but that's not what they're comparing on that particular slide.
There's the whole question of how you manage latency in inference, but in that case I'd expect them to talk about utilization of GPU resources relative to the theoretical peak.
Also, is that 25x claim really about the rate of operations? It reads to me like they're talking about the number of execution units.
Are the cards you mention from 2015? Are they for gaming or servers? Do they use ECC? If not, that rules them out right away.
You can find the paper with the methodology, theoretical peaks and latency management at
The first one, for the memory comparison, is indeed from 2015.
The second one isn't -- 2015 desktop/server GPUs didn't have good half-float performance yet, since there hadn't really been a market for it. However, the first-mentioned GPU from 2015 does 8.6 single-precision TFlop/s, which is also more than 2x their comparison baseline for GPUs.
The gaming/server distinction, and especially the ECC thing, is pretty moot. First, while I'm not sure what kind of server SKUs were available at the time, it hardly affects the architectural results. Second, even market availability shouldn't matter much. They're Google. If they had wanted different SKUs in volume, they almost certainly could have gotten them.
I mean, it's clear that a special-purpose chip is going to beat anything else at a task like this. It's just odd that they apparently felt the need to make themselves look better than they really are when the result is impressive enough with a real comparison.
I think it's fair to say that 92 TOp/s of 8-bit arithmetic is much less than 25x the 27 TFlop/s of half-float operations a GPU does.
I don't think that accusation is justified.
The part about float operations being better is only a side note. The core of the comment is that they are not inferior. If you needed to, you could snip wires to turn that half-float unit into an 8-bit unit, so treat the numbers as if they measured the same thing: 27 vs. 92. That's not a 25x increase; it's not even close (about 3.4x). Something about this comparison is either unfair or misleading. For example, if a GPU doesn't engage most of its ALUs for certain sizes of input (cough GP10x cough), that's not a point in favor of the Google design; that's just the GPU being broken.
But the actually important part here is that the register and bypass networks needed to route four int8 values in place of a single float32 are way more complicated, and that's where Google's decision to restrict the flexibility of its TPU pays big dividends. NVidia's GPUs do not have broken designs. They're just making compromises based on the need to handle a wider variety of use cases.
Yeah, but nobody was suggesting doing that. Just by raw op count, the 25x claim has issues.
> NVidia's GPUs do not have broken designs.
The part where the current generation sticks a single FP16x2 unit next to every 128 FP32 units, so that if your code triggers them it runs 64x slower on FP16 (2 FP16 ops per cycle against 128 FP32 ops) while leaving all the FP32 units idle? That's broken as far as I can see, and it's there to upsell you to the pro cards.
Anything that makes 8-bit math slower than 32-bit math is just a fundamental lack of forethought. It's not inherent to GPU design, and it shouldn't be used as a point against GPUs in general.
TPU v2 is in alpha right now, but if you're a researcher you can apply to use it on Google Cloud.