Google TPU Performance Analysis (anandtech.com)
129 points by kartD 10 months ago | 27 comments

So many details that people gloss over. I have used TensorFlow (TF), and it is true that GPUs suck at inference with it. But it's not always the GPU's fault:

- TF can't do anything quantized on GPUs. It just switches back to the CPU/TPU.
- TF gets relatively poor utilization of the GPU and tends not to be careful with memory use.
- I was able to do certain types of classification hundreds of times faster by seeing what TF was doing and hand-writing it in OCL, using https://docs.rs/ocl/0.14.1/ocl/. It's a super cool library for Rust.

Also, users should check out TensorRT: https://github.com/NVIDIA/gpu-rest-engine/tree/master/tensor.... It's not super well supported and may go away, but it is fast.

Do you know why TF is getting poor GPU util? Is it the pipeline feeding data to the core compute ops, inefficiencies in the core compute ops, or something else?

Just wanted to chime in on TensorRT: it's a well-supported product, and it's different from gpu-rest-engine. That GitHub repo is simply an example of how to use TensorRT in a specific situation.

Question about TensorRT: It takes in Caffe trained models. Is this BVLC Caffe or NV-Caffe? In my experience the models aren't compatible between the two.

Either, but I've never tried.

Seems very much "back to the future." Systolic array processors were used to accelerate neural networks in the 1980s. Great for matrix math too. (ref: http://repository.cmu.edu/cgi/viewcontent.cgi?article=2939&c...). These aren't quite the systolic array processors of old, but they're too close to be considered a new arch/micro-arch. The formula is simple: if you have low-precision matrix multiplication (MM) to accelerate, drop in a matrix multiply unit that can be blocked for, add high-bandwidth memory to feed it, and let it go. I'm waiting for more new takes on old archs. As fabbing chips becomes more economical, I hope to see more retro chips. Especially things that didn't quite make the jump from research to production b/c of scaling (or some other reason) might now make sense.

Back in the early noughties, I remember that there was a company developing an accelerator chip for seismic data analysis for oil exploration companies. I can't remember the name now. Can anybody remember?

They were proposing a chip that did nothing but a limited set of linear algebra operations at gigabit rates. They were former Transmeta people.

Clearspeed? The HPC history books are littered with the bankrupt corpses of special purpose hardware.

I remember an ASIC that was supposed to accelerate multigrid preconditioners, out of some big German university. They were never able to get stuff to market fast enough to beat Intel and Moore's law.

Perhaps the biggest recent success story in this field is Anton.


Looks to be all about TPU1, which is inference-only? AFAIK TPU2 allows for training as well; I'm much more interested in that. Last line: "There was a TPU2 talk earlier that I missed that I need to look through the slides of and write up later"

The Hot Chips talks will eventually make their way onto YouTube...

I really don't get how they came up with those numbers comparing CPUs to GPUs.

They claim to have 3.5x as much on-chip memory as a GPU, but the R9 Fury X has 16.7 MiB of register memory compared to their 28 MiB. And then of course there are caches on top of that (which, funnily, add up to less than the register memory, I believe).

I also don't get how they come up with those MAC numbers. An RX Vega 64 can do 27 TFlop/s of half-precision arithmetic, which is way more than 1/25x the 92 TOp/s they claim for the TPU. In fact, it makes the GPU look pretty damn good, considering the TPU only does 8-bit ops.

Of course I'd expect the TPU to beat a GPU in terms of perf/watt, but that's not what they're comparing on that particular slide.

There's the whole question of how you manage latency in inference, but then I'd expect them to talk about the utilization of the GPU resources relative to the theoretical peak.

It's an old chip, it needs to be compared to the 2015 competition rather than the current one. While it's not on the slides, the notes suggest that the CPU is a Haswell and the GPU a Tesla K80.

Also, is that 25x claim really about the rate of operations? It reads to me like they're talking about the number of execution units.

They compared TPU v1 with server hardware available at the time (2015, i.e. K80 and Haswell).

Are the cards you mention from 2015? Are they for gaming or servers? Do they use ECC? If not, that rules them out right away.

You can find the paper with the methodology, theoretical peaks and latency management at https://arxiv.org/abs/1704.04760

Fair enough.

The first one, for the memory comparison, is indeed from 2015.

The second one isn't -- 2015 desktop/server GPUs didn't have good half-float operations yet, as that hadn't really been a market. However, the first-mentioned GPU from 2015 has 8.6 single-precision TFlop/s, which is also more than 2x higher than their comparison baseline for GPUs.

The gaming/server and especially the ECC thing is pretty moot. First, while I'm not sure what kind of server SKUs were available at the time, it hardly affects architectural results. Second, even market availability shouldn't matter much. They're Google. If they had wanted different SKUs in volume, they almost certainly could have gotten them.

I mean, it's clear that a special-purpose chip is going to beat anything else at a task like this. It's just odd that they apparently felt the need to make themselves look better than they really are when the result is impressive enough with a real comparison.

I would argue ECC is completely pointless during neural network inference: the amount of change a single bit flip, even a sign change, is likely to generate is minimal.

An exponent bit flip, especially at later layers, would completely break inference.
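To make the disagreement above concrete, here is a minimal sketch of what a single flipped bit does to one float32 value. The helper name `flip_bit` and the choice of 0.5 as the test value are illustrative assumptions, not anything from the article or the paper; it only shows the per-value effect and says nothing about how a real network would absorb the error.

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit (0 = lowest mantissa bit, 31 = sign) of x stored as float32."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))[0]

weight = 0.5  # hypothetical weight or activation

mantissa_flip = flip_bit(weight, 0)   # lowest mantissa bit: tiny change
sign_flip = flip_bit(weight, 31)      # sign bit: same magnitude, opposite sign
exponent_flip = flip_bit(weight, 30)  # top exponent bit: enormous change

print(mantissa_flip, sign_flip, exponent_flip)
```

For this value, the mantissa and sign flips keep the magnitude near 0.5, while the top exponent bit turns it into 2^127, which would swamp everything downstream of it.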

I'd be concerned about code, too. It's not unheard of for a single bit flip to make petabytes go away (true story).

Floating point operations (TFlop) != Tensor Operations (TOp)

Sure, but read the slide. They have 64k multiply-accumulate units, running at 700 MHz. That means ~46T/s multiply-accumulates, which means ~92T/s individual arithmetic ops. It's a standard way to measure this.

I think it's fair to say that 92T/s 8-bit arithmetic ops is much less than 25x the 27T/s half-float operations of a GPU.
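The arithmetic above, spelled out as a short sketch. It uses only the figures quoted in this thread; reading "64k" as 65,536 units is an assumption, and the variable names are illustrative.

```python
MAC_UNITS = 64 * 1024   # "64k" multiply-accumulate units (assumed to mean 65,536)
CLOCK_HZ = 700e6        # 700 MHz
OPS_PER_MAC = 2         # one multiply plus one add

macs_per_second = MAC_UNITS * CLOCK_HZ
ops_per_second = macs_per_second * OPS_PER_MAC

GPU_FP16_FLOPS = 27e12  # RX Vega 64 half-precision figure quoted upthread
ratio = ops_per_second / GPU_FP16_FLOPS

print(f"{macs_per_second / 1e12:.1f} T MAC/s")  # about 46
print(f"{ops_per_second / 1e12:.1f} T op/s")    # about 92
print(f"{ratio:.1f}x the quoted GPU figure")    # about 3.4
```

This compares peak 8-bit ops against peak half-float ops only, so it says nothing about achieved utilization on either chip.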

If an 8-bit integer is sufficient for a problem, there isn't anything to be gained by moving to 32-bit floats. You can't just use a 32-bit floating point operation to emulate four 8-bit integer operations for free (or vice versa), so you just can't compare the two the way you're trying to. Especially since moving to higher-precision values would balloon memory and bandwidth requirements. For an honest comparison, find out what the 8-bit integer performance of the GPU is.

> you just can't compare the two the way you're trying to

I don't think that accusation is justified.

The part about float operations being better is only a side note. The core of the comment is that they are not inferior. If you needed to, you could snip wires to turn that half-float unit into an 8-bit unit. So treat the numbers as if they were the same thing: 27 vs. 92. That's not a 25x increase. Not even close. Something about this comparison seems either unfair or misleading. For example, if a GPU doesn't engage most of its ALUs for certain sizes of input (cough GP10x cough), that's not a point in favor of the Google design; that's just the GPU being broken.

They aren't inferior, but you can't just multiply by 4 here either. You could turn an int32 adder into four int8 adders if the larger adder works on a ripple-carry principle, but really everybody uses ripple-carry or carry-bypass. And a float32 is more complex: you could get three int8s out of it, but you'd have lots of transistors left over in the execution logic. But simple quantity of execution logic is almost never an interesting constraint in a design.

But the actually important part here is that the register and bypass networks to pass four pieces of int8 data around are way more complicated than those required to pass a single float32 around, and that's where Google's decision to restrict the flexibility of its TPU pays big dividends. NVidia's GPUs do not have broken designs. They're just making compromises based on the need to handle a wider variety of use cases.
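On the adder-splitting point above, a software analogy may help. This is only a sketch of the carry problem, not a model of any real GPU or TPU datapath: a plain 32-bit add lets carries leak between packed 8-bit lanes, so a lane-wise add has to cut the carry chain explicitly. The function name `add_u8x4` and the test values are made up for illustration.

```python
MASK32 = 0xFFFFFFFF
HIGH_BITS = 0x80808080  # top bit of each 8-bit lane
LOW_BITS = 0x7F7F7F7F   # lower 7 bits of each 8-bit lane

def add_u8x4(a: int, b: int) -> int:
    """Add four packed unsigned 8-bit lanes, wrapping each lane independently."""
    # Add the low 7 bits of every lane; a carry can reach bit 7 but not the next lane.
    low_sum = (a & LOW_BITS) + (b & LOW_BITS)
    # Fold the top bits back in with XOR so no carry crosses a lane boundary.
    return (low_sum ^ ((a ^ b) & HIGH_BITS)) & MASK32

a = 0x01FF7F80
b = 0x01010180

naive = (a + b) & MASK32   # carries leak into neighbouring lanes
lanewise = add_u8x4(a, b)  # each byte wraps on its own

print(hex(naive), hex(lanewise))
```

The extra masking is the software cost of splitting one wide adder into four narrow ones; the comment's argument is that in hardware the surrounding register and bypass wiring matters more than the adder itself.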

> They aren't inferior but you can't just multiply by 4 here either.

Yeah but there wasn't a suggestion to do so. Just by raw count there are issues with 25x.

> NVidia's GPUs do not have broken designs.

The part where the current generation sticks in a single FP16x2 unit per 128 FP32 units, so that if your code triggers them it runs 64x slower on FP16 while leaving all the FP32 units idle? That's broken as far as I can see, there to upsell you the pro cards.
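The 64x figure follows from the unit counts in the comment, as the sketch below shows. It assumes each FP32 unit retires one op per clock and that "FP16x2" means two FP16 ops per clock per unit; both are assumptions for illustration, not vendor specifications.

```python
FP32_UNITS = 128         # FP32 units per group, as stated in the comment
FP16X2_UNITS = 1         # FP16x2 units in the same group, as stated in the comment
FP16_LANES_PER_UNIT = 2  # assumption: "x2" means two FP16 ops per clock

fp32_ops_per_clock = FP32_UNITS
fp16_ops_per_clock = FP16X2_UNITS * FP16_LANES_PER_UNIT

slowdown = fp32_ops_per_clock / fp16_ops_per_clock
print(f"FP16 path is {slowdown:.0f}x slower than FP32")
```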

Anything that would make 8-bit math slower than 32-bit math is just a fundamental lack of forethought. It's not inherent to GPU design, and shouldn't be used as a point against GPUs in general.

This article just seems odd. They're still quoting numbers from how they compared 2 years ago to Kepler GPUs. Unless they have a new TPU out, these are worse than the V100 GPU out today, so it's strange that in a field moving so fast they're constantly quoting old data. It doesn't matter anymore that you had the fastest chip in 2015. If you haven't iterated since then, you are probably losing.

The link is about TPUv1, but Google is already using TPUv2 (or maybe TPUv3; they don't talk too much about these things).

At the bottom of the article, the author says they missed the TPU v2 talk.

TPU v2 is in alpha right now, but if you're a researcher you can apply to use it over at Google Cloud.
