This makes no sense, the V100 has *more* memory bandwidth than both the TPU and ...

jlebar · on Feb 12, 2018

V100 has 900gb/s memory bandwidth [0].

TPUv2 has 600gb/s per chip x 4 chips, so 2400gb/s [1].

As we've discussed elsewhere [2], comparing TPUv2 to V100 on a per chip basis doesn't make much sense. Who cares how many chips are on the board? If Google announced tomorrow that TPUv3 is coming out, which is identical to TPUv2 but the four chips are glued together, nobody would care.

The questions that we should instead be asking are, how fast can I train my model and how much does it cost?

Per elsewhere in thread [3], on Volta you have 900gb/s per 100Tops/s = 0.9 bytes/s per op/s, whereas on TPUv2 you have 2400gb/s memory bandwidth over 180Tops/s = 1.33 bytes/s per op/s. This means that TPUv2's memory-bandwidth-to-compute ratio is 1.33/9 = 1.5x higher than Volta's.

We can do a similar comparison for memory available. V100 has 16gb per 100Tops, TPUv2 has 64gb per 180Tops. So the memory-to-compute ratio for Volta is 16g/100T = .16 milli while for TPUv2 it's 64g/180T = .36 milli, for a ratio of .36/.16 = 2.25x higher on TPUv2.

Does any of this matter? Does it translate into faster and/or cheaper training? Do models actually need and benefit from this additional memory and memory bandwidth? My guess from working on GPUs is yes, at least insofar as bandwidth is concerned, but it's just a guess. I'm excited to find out for real.

(Disclaimer: I work at Google on XLA, and used to work on TPUs.)

[0] https://images.nvidia.com/content/technologies/volta/pdf/437... [1] https://supercomputersfordl2017.github.io/Presentations/Imag... [2] https://news.ycombinator.com/item?id=16360212 [3] https://news.ycombinator.com/item?id=16359531

twtw · on Feb 12, 2018

I responded to your other comment to disagree, and I'll do so again here.

Nobody is comparing DGX1-V to a single TPUv2 chip, because it doesn't make any sense to do so. they are totally different kinds of machines. But for some reason everyone is comparing a cluster of 4 TPUv2 chips to a single V100 chip.

It only makes sense to compare 4xTPUv2 to 1xV100 if they are equivalent in some meaningful metric, like total die size, power, etc.

In lieu of any available data, I'm going to continue to assume that each TPUv2 chip is roughly comparable in terms of power & die size to each V100 chip. If this was grossly wrong, I would expect that all four would be condensed into a single chip, which would dramatically increase the performance of the interconnects.

We could resolve this rapidly if there were any data available about die size, TDP, anything of TPUv2.

jlebar · on Feb 12, 2018

> But for some reason everyone is comparing a cluster of 4 TPUv2 chips to a single V100 chip.

I agree that some people are doing that. Marketing, I suppose. But that comparison is explicitly not the point of my parent post. I'm comparing the "shapes" of the chips -- specifically, the compute/memory and compute/memory-bandwidth ratios. These ratios stay the same regardless of whether you multiply the chips by 4 or by 400.

The point I was trying to make is that V100 has a higher peak-compute-to-memory(-bandwidth) ratio than TPUv2. This much seems clear from the arithmetic. Whether this matters in practice, I don't know, but I think it is relevant if one believes (as I do, based on the evidence I have as an author of an ML compiler targeting the V100) that the V100 is starved for memory bandwidth.

> In lieu of any available data, I'm going to continue to assume that each TPUv2 chip is roughly comparable in terms of power & die size to each V100 chip. If this was grossly wrong, I would expect that all four would be condensed into a single chip, which would dramatically increase the performance of the interconnects.

I'm sure Google's hardware engineers operate under a lot of constraints that I'm not aware of; I'm not about to make assumptions. But more to the point, as we've said, things like die size and TDP don't directly affect consumers. The questions we have to ask are, how fast can you train your model, and at what cost?

Just as you don't like it when people (incorrectly, I agree) insist on comparing one V100 to four TPUs, because that's totally arbitrary (why not compare one V100 to 128 TPUs?), I don't like it when people insist on comparing TPUv2 to V100 on arbitrary metrics like die size, or peak flops/chip, or whatever. So I disagree that we could resolve anything if we had more info about the TPUv2 chip itself. None of that matters.

deepnotderp · on Feb 12, 2018

Well, if you ignore power consumptiom because ",it doesn't matter to the end user", you're talking about economic comparisons, not technical comparisons.

BTW, I absolutely agree that memory bandwidth is the bottleneck, I've built my company around that assertion and the data for that exists (Mitra's publications come to mind)

twtw · on Feb 12, 2018

Alright, I understand better now what you are saying. I'm eager to see some benchmarks that can answer those meaningful questions.

Thank you for your courteous reply.

boulos · on Feb 12, 2018

We mostly focus on the “whole board” numbers. So it’s not only units <=> “local” HBM, but NVLINK versus TPU to TPU. Sorry for the confusion.

Edit for this part of the thread: the best public numbers are in the linked presentation [1].

[1] https://supercomputersfordl2017.github.io/Presentations/Imag...

deepnotderp · on Feb 12, 2018

That's... a skewed ... comparison, NVLINK is a board to board connection whereas you're talking about TPU to TPU on board communication if I understand correctly?

boulos · on Feb 12, 2018

That's sort of the point though! We're actually selling these as the "board". So the right way to compare things is sort of DGX-1 style "deep learning rig" versus a board of four TPU units (or several connected). The on-chip network is a big part of its overall efficiency.

I don't recall what (if anything) we've said about how we link up the boards across racks, but the folks at Next Platform looked pretty carefully at the pictures: https://www.nextplatform.com/2017/05/22/hood-googles-tpu2-ma...

deepnotderp · on Feb 12, 2018

It's not the point though. You're comparing whole board tpu FLOPs (4x 45) but then comparing tpu single chip chip2chip communication with nVidia board2board communication.

You can't have your cake and eat it too.

jamesblonde · on Feb 12, 2018

Yes, when training DNNs memory bandwidth is the only figure you need to look at. That's why the 1080Ti is by far and away the best bang for buck right now (ignore the EULA nonsense). It has about 55% of the memory b/w of the V100 for 10% of the price.