Hacker News
Benchmarking Google’s new TPUv2 (riseml.com)
192 points by henningpeters 9 months ago | 93 comments



Disclosure: I work on Google Cloud.

While not perfect, I want to commend the RiseML folks for not only doing a "just out of the box" run in both regular and fp16 mode (for V100), but also adding their own LSTM experiment to the mix. We need third-party benchmarks whenever vendors sell new hardware or software (reminder: I benefit from you buying Google Cloud!).

I hope the authors are able to collect some of the feedback here and update their benchmark and blog post. The question about batch size comparisons is probably the most direct, but like others, I’d encourage a run on 1, 2, 4 and 8 V100s as well.


Author here.

Thanks for your feedback and your suggestions (and everybody else's)! We'll make sure to gather all of the valuable feedback and run additional experiments. Different batch sizes and a comparison against >1 GPUs are already planned (and partly executed).


So this is a chip that no one outside of Google is going to be able to get a physical copy of ever?

It makes any benchmarks become Google-cloud benchmarks, right?

Edit: I am complaining a bit about the lack of availability, but there's also a real point here. If there's no source for TPUs outside of Google, Google Cloud competes only with other cloud providers and with owning physical GPUs - long term, it has no incentive to be anything but a little bit more efficient than these, however much its cost of producing TPUs declines.


It's going to be a very exciting multi-company arms race -- at minimum, Google, Intel, Nvidia. Microsoft has their FPGAs, Amazon has their rumors. And there are several startups trying to enter the space. I don't think we're looking at stagnation; very much the opposite. It's going to be fantastic for the field.

(I'm saying this with my CMU hat, not my Google hat.)


I don't see things stagnating either, but it seems like there's a potential for the individual to get cut out of this excitement if each of these entities keeps its chips close to its chest.

The era of the mainframe, with each provider competing with a custom chip, wasn't necessarily beneficial for individuals buying computing power.


Industries go through cycles of innovation and concentration. During innovation cycles, many new non-standard products appear with innovative solutions, the entire pie grows really fast. As growth eventually stabilizes, standards become more relevant and consolidation happens, eventually leading to a stagnation that makes the industry ripe for disruption and change again.

If you look at processors, you see this with the early custom processors, followed by some standardization and copying around the IBM S/360, followed by more proprietary innovation around the PC era, finally resulting in the x86, eventually disrupted by mobile chips, which then consolidated around ARM, and so on.


I second this.

With cloud computing, we are essentially going back to the era of time-shared mainframes with remote access.


What are the Amazon rumors?


https://www.google.com/amp/s/www.theverge.com/platform/amp/2...

The (expensive) report from The Information upon which that article was based speculates a little more about chips for training.


Making it available outside would require hiring more people and doing more work that they don't need to take on right now.

For a while, it's very likely that Google will be the main user of these, so there's still plenty of incentive for it to increase efficiency and reduce costs.


I thought they were going to provide to other cloud providers? I’m also guessing if you’re willing to purchase a lot of them then they’re willing to talk...


I'd be interested if anyone has details.

It may be that the other cloud providers would then sell them to those individuals. Indeed, it's the job of entities called "distributors" to buy big lots from manufacturers and break them up.

And of course, I don't know what the point of (apparently) keeping them out of the average person's hand would be.


For starters, Google is known to limit its customer-support burden, especially for enterprise-level hardware; supporting the chip would fall on Google, not on any third parties.


Google claims 29x better performance-per-Watt with TPUs than contemporary GPUs[0]. Interesting to contrast that with the images-per-$ figure in this post, which is more like 2x.

I assume there's a high capital cost for this new hardware, but when they scale it up I wonder if the ratio of TPU to GPU cost will trend towards the ratio of performance-per-Watt between the platforms? Seems like a natural limit, even if it never quite gets there.

[0] https://cloud.google.com/blog/big-data/2017/05/an-in-depth-l...
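For concreteness, the images-per-$ metric just normalizes throughput by the hourly rental price. A tiny sketch with hypothetical numbers (purely illustrative, not taken from the post or Google's price list):

```python
def images_per_dollar(images_per_sec, price_per_hour_usd):
    """Normalize raw training throughput by the hourly rental price."""
    return images_per_sec * 3600 / price_per_hour_usd

# Hypothetical illustrative numbers, NOT the post's measurements:
tpu = images_per_dollar(images_per_sec=2000, price_per_hour_usd=6.50)
gpu = images_per_dollar(images_per_sec=600, price_per_hour_usd=3.00)
print(f"TPU/GPU images-per-$ ratio: {tpu / gpu:.2f}")
```

The point being that a chip can win big on raw throughput yet only narrowly on this metric, depending entirely on how the vendor prices it.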


That was TPUv1 (inference only). This article is about the new Cloud TPU (or TPUv2 as they call it), which handles both inference and training. The competitive landscape also changed a lot in the interim - NVidia added tensor cores to Volta to accelerate deep learning computation.


I think you might want to factor in that they are adding their own fees. Also, the market may have changed since then (AWS lowered prices and new GPUs have come out). Google's workload is also different from the benchmark given in this post.


> Google claim 29x better performance-per-Watt with TPUs than contemporary GPUs[0]. Interesting to contrast that to the images-per-$ figure in this post, which is more like 2x.

But you aren't paying for the electricity, you're paying for processing, which is an unconnected parameter. They only "sell" these chips per use, not on the open market.

Presuming power is a major cost input (which I assume) their profit/op is much higher. So they could sell for less than an equivalent GPU and make more money. But they think they can get away with value pricing it (processing more/unit time is presumably worth it to many customers) and more power to them.

(that last bit was not an intentional pun; only noticed after typing it)


Maybe Google will pivot from high tech into crypto mining


[Edited] The top line results focus on comparing four TPUs in a rack node (which marketing cleverly named “one cloud TPU”), running ~16 bit mixed precision, to one GPU (out of 8 in a rack node), also capable of 16 bit or mixed precision, but handicapped to 32 bit IEEE 754. That is a misleading comparison. Images/$ are obviously more directly comparable, but again the emphasized comparisons are at different precision. Very different batch sizes make this significantly more misleading, still. Images/$ also only tells us that Google has chosen to look at the competition and set a competitive price; the per-die or per-package comparison is much more relevant to understand any intrinsic architectural advantage, since these are all large dies on roughly comparable process nodes.


Disclosure: I work on Google Cloud.

Depends on your metric, Jonathan! If you focus on the per dollar numbers, then it’s actually net favorable to the V100, because a second GPU over NVLINK won’t be as cost-efficient. If what you care about is raw throughput “in a single box”, then 8xV100 probably comes out ahead here.

Like someone else below though, I worry about the “hey wait a minute, changing the batch size just for the TPU seems unfair” and the whole “the LSTM didn’t converge” bit. Not a bad first draft, but hopefully the authors can do some more comparisons.


Author here.

Thanks for your feedback. As I noted above, we will report further results with larger batch sizes (and smaller ones for the TPU). The LSTM not converging is one of the experiences we wanted to share. We are working on solving this issue and will update the post accordingly. Our goal is really a fair and valuable comparison, which is not easy, so we value all of the feedback.


That's why you scroll down the page to the cost comparison, which places it on a more even keel. They do also compare float16 on Volta. Physical packaging is irrelevant -- what matters is dollars to convergence and time to convergence.

(I'm obviously biased - I helped with parts of the cloud-side of cloud TPU - but I presume this comment stands on its own. :-)


To be clear, I had read the whole post, I was just being terse since the emphasis seemed to be so heavily on an apples to bananas comparison (I believe 100% of the results cited in the prose, many in bold, are with mismatched precision and batch size), with minimal articulation of the many axes of nuance here. Precision isn't defined at all in the LSTM case, and could easily be the cause of the failure of the TPU run to converge where the GPU runs do. To a non-expert audience I think the end result is confusing and misleading.

Also, while I certainly agree that the performance/dollar comparison is highly relevant to customers at a given instant, that may only tell us that Google is subsidizing this hardware now that they've deployed it, and/or that, lacking serious competition, NVIDIA has been building crazy margins into their P100/V100 prices. In understanding fundamental technological tradeoffs, and even the limits of what the pricing in a more competitive market could be, it is relevant to compare performance per unit of hardware resources (mm^2, die/package, watt, GB of HBM, etc.)

In short, these comparisons are hard, and no single one tells a complete story. I pushed back because, while the post includes some nuance, it brushes a great deal under the rug and focuses primarily on a problematic comparison.

(Further disclosure: I'm at least the third person in this sub-thread with some Google Brain/Cloud affiliation. I am speaking in my independent academic voice. I also think TPUs are great, having them publicly available now is great, and competition and diversity of architectural approach in accelerators is great. I appreciate the effort of the authors, but think the subtlety of these comparisons requires serious care.)


(1) I think we should all collectively agree to ignore the LSTM results -- without the model converging, it's impossible to say whether or not the bug that prevented convergence also affected speed. I can build an arbitrarily fast LSTM that doesn't converge in a few print statements. :-)

(2) Unless Google releases the specs to the h/w, I'd argue that cost is our best proxy. But if you assume that both Google and Amazon want to make a profit on their cloud rentals, it at least gives us a way to get to something we can normalize to (the V100's list price is public, though who knows how much Amazon pays). And, given that you can't buy a Cloud TPU, the price Google charges really is the meaningful answer. It doesn't tell us about fundamentals, but it's the right answer from a consumer standpoint.

I think it's a fair bigger-picture question to ask how we fairly and informatively benchmark cloud-only services in ways that we can not only get consumer-oriented price comparisons, but also learn from the underlying technical choices. The longer-term answer is that we beg Google to write a paper about TPUv2, as they (surprisingly!) did about TPUv1 -- because without that, we just get black box numbers combined with informed speculation based upon glossy board and heatsink photos.

btw - the best current source of specs about TPUv2 is Jeff's NIPS talk: http://learningsys.org/nips17/assets/slides/dean-nips17.pdf

which mentions a few details like 16GB of HBM per chip with 600GB/s memory bandwidth.

(3) I agree completely with you that the comparisons are hard. I'm very glad the authors of the blog post are listening to the feedback they're getting here -- on the LSTM, on batch size comparisons, and about precision and being clear about which things they're measuring.

(Reminder disclosure: It's awkward talking about Google in the third person since they pay me part time, but I'm trying to take this discussion with my academic hat also. This nested series of disclaimers is an amusing commentary about how small the machine learning + systems community is.)


Thank you for your feedback! (author here)

Our intention is really to provide a sound comparison. I think we agree that these kinds of comparisons can be hard given the constraints (e.g., lack of available technical information on TPUv2 or public implementations of optimized models for certain architectures). As I stated elsewhere, we are collecting all of the feedback and will run additional experiments.

If you know of an implementation of a mixed-precision/fp16 model that you'd like to see results for, please let us know! I may also reach out directly to you for that if you don't mind.


The number of devices is completely irrelevant.

It's all about performance per dollar.


Disclosure: I work on Google Cloud.

Not necessarily. The DGX-1, for example, has pretty poor perf/$$ but reduces the time a data scientist spends waiting. For some organizations, their people time is so valuable that what matters is “what gets me my answers back faster”, because that employee is easily $100/hr+.

That’s actually why the 8xV100 with NVLINK is so attractive (and why the TPUs also have board to board networking, not just chip to chip).


Sure, for a customer. But from a technological point of view, performance per dollar doesn't tell us everything. A company could subsidize their compute service and get astounding perf/dollar with a not particularly impressive chip.

I'd like to know perf/watt, for instance, even if it doesn't matter to the customer.


Agreed, it's almost purposefully misleading. He's not even using the same version of TensorFlow, or the current version of CUDA (9.1).


Would you expect a big performance difference from using CUDA 9.1?


Author here.

Point well taken, we'll make sure to add a comparison to 4 and 8 GPUs. For now, a "Cloud TPU" (containing 8 cores) seems to be the smallest unit to allocate. The question of what exactly makes up a single device and how many to compare against each other is not easy to answer.


The bar graph seems a little whacky. It groups the TPU (which can only do FP16) with the FP32 results from the GPUs, then puts the FP16 GPU results off to the side even though that's much closer to what the TPU is doing.

Impressive results regardless though; the margin over the V100 is quite a bit bigger than the paper specs would suggest.


It also seems like the price comparison should compare with the fp16 numbers on both platforms, not the fp32 numbers.


Author here.

Good point, I agree that the FP16 GPU results should be closer or grouped with the TPU results. We'll try to update accordingly.


Wait but, the batch size is 8x bigger for the TPU? That's not a fair comparison; increasing batch size always speeds things up...


Author here.

Note that the TPU supports larger batch sizes because it has more RAM. We tested multiple batch sizes for GPUs and reported the fastest one. We'll try increasing the batch sizes as far as possible and report. The overall comparison will likely not change by much - we saw speed increases of around 5% doubling the batch size from 64 to 128. (https://www.tensorflow.org/performance/benchmarks also reports numbers for batch sizes of 32 and 64 on the P100)
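The throughput-vs-batch-size effect described above is easy to reproduce on any device. A rough sketch using a numpy matmul as a stand-in for a model step (the real benchmark runs ResNet-50 via tf_cnn_benchmarks; this only illustrates the measurement method):

```python
import time
import numpy as np

def images_per_sec(batch_size, feature_dim=512, steps=50):
    """Measure throughput of a toy fully-connected 'layer' at a batch size.

    Larger batches amortize per-step overhead, so throughput usually
    rises with batch size until memory or parallelism saturates.
    """
    x = np.random.rand(batch_size, feature_dim).astype(np.float32)
    w = np.random.rand(feature_dim, feature_dim).astype(np.float32)
    start = time.perf_counter()
    for _ in range(steps):
        _ = x @ w  # stand-in for one forward pass
    return batch_size * steps / (time.perf_counter() - start)

for bs in (64, 128, 256):
    print(f"batch {bs}: {images_per_sec(bs):,.0f} images/sec")
```

On real accelerators the curve typically flattens once the chip is saturated, which is why doubling from 64 to 128 only bought ~5% here.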


Disclosure: I work on Google Cloud.

Oh! You should definitely say that. It's semi-reasonable then to choose the batch size that is optimal for the part. It'd be good to make sure this isn't why your LSTM didn't converge though...


I tested many different batch sizes for the LSTM, so I am pretty confident it's not the reason.


But typically does not have an impact on power usage.

They are claiming a 29x improvement in that area.


Just to clarify, is this benchmark leveraging mixed-precision mode on the Volta V100? The major innovation of the Volta generation is mixed-precision which NVIDIA claims is a huge performance increase over the Pascal generation (P100 in the case of your benchmark).

Link to NVIDIA documentation on mixed-precision TensorCores: https://devblogs.nvidia.com/inside-volta/


Where specified "fp16", the V100 benchmarks use the code from https://github.com/tensorflow/benchmarks/tree/master/scripts... with the flag --use_fp16=true which enables fp16 for some but not all Tensors.
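A quick illustration of why fp16 is applied to only some tensors: small gradient values underflow in half precision, which is why mixed-precision training commonly pairs fp16 storage with fp32 master weights and loss scaling. This is a generic numpy sketch of the underflow problem, not the benchmark script's actual code:

```python
import numpy as np

grad = np.float32(1e-8)  # a plausibly tiny gradient value

# Casting straight to fp16 underflows to zero
# (the smallest fp16 subnormal is 2**-24, about 6e-8).
naive = np.float16(grad)
assert naive == 0.0

# Loss scaling: scale the loss (and hence all gradients) up before
# the fp16 cast, then divide the scale back out in fp32.
scale = np.float32(1024.0)
scaled = np.float16(grad * scale)        # ~1e-5: representable in fp16
recovered = np.float32(scaled) / scale   # close to the original 1e-8
print(naive, recovered)
```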


It's my understanding that fp16 (available on the previous-generation P100) and mixed precision (the major innovation of the V100) are different things, and the speedup of TensorCores is entirely missing from this benchmark. Unlike the general-purpose P100, the TPU is a heavily optimized chip built for deep learning, hence its performance increase. However, the V100 is also heavily optimized for deep learning (arguably NVIDIA's first chip that isn't purely general-purpose). I'm in no position to defend NVIDIA here haha but it seems like the benchmark misses the point if this is indeed the case.


It was my understanding that the TensorFlow benchmarks do make use of TensorCores on the V100. We'll verify and update accordingly.


Specialization brings speedups.

TPUv2 is specially optimized for deep learning.

Nvidia's Volta microarchitecture is a graphics processor with additional tensor units. It's a general-purpose (GPGPU) chip designed with graphics and other scientific computing tasks in mind. Nvidia has enjoyed monopoly power in the market, and a single microarchitecture has been enough in every high-performance category.

Next logical step for Nvidia is to develop specialized deep learning TPU to compete with TPUv2 and others.


> Next logical step for Nvidia is to develop specialized deep learning TPU to compete with TPUv2 and others

I don't know, this benchmark seems to show the V100 doing pretty well against a specialized ASIC. It may well be that all NVIDIA has to do is cut costs on the V100 to make two V100s about as expensive as the cloud TPUv2. With increased batch size, it looks like two V100s would have performance comparable to TPUv2.


Volta V100 already has "tensor cores" which are basically little matrix multiplication ASICs.
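Conceptually, each tensor core performs D = A·B + C on small tiles, multiplying fp16 inputs but accumulating in fp32. A numpy model of that numeric contract (tile size and data flow heavily simplified; this is not how the hardware is scheduled):

```python
import numpy as np

def tensor_core_mma(A, B, C):
    """Model a tensor-core style op on 4x4 tiles: D = A @ B + C.

    Products are formed from fp16 inputs but summed in fp32,
    which preserves far more accuracy than a pure-fp16 matmul.
    """
    A16, B16 = A.astype(np.float16), B.astype(np.float16)
    D = C.astype(np.float32).copy()
    for k in range(A16.shape[1]):
        # rank-1 update: fp16 products, accumulated in fp32
        D += np.outer(A16[:, k], B16[k, :]).astype(np.float32)
    return D
```

The fp32 accumulation is the detail that makes "16-bit" training workable in practice on both tensor cores and the TPU.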


That's what I said.

The microarchitecture has many unnecessary things and it's not optimized as a whole for deep learning.


I believe it was either the last MICRO* or the one before that when Dally addressed this point. The specialized hardware for graphics ends up comprising such a small portion of the overall chip that it wasn't worth it to remove it. The "GPUs were made for graphics thus aren't good for DL" argument really doesn't hold a lot of water IMO.

* It might've been a different conference now that I think of it.


The entire idea that people are going to gain some huge advantage over nvidia with hardware softmax seems dubious. I do think it will buy them some time but eventually it seems as though nvidia will win this one.


I'd be interested how the superior perf/watt claim holds up in Google's practical setup. The additional networking gear, power-supply losses, and so on might make the difference smaller.

I'm also not sure how we can take Google's word for the numbers, since they might as well be eating a less-than-ideal power cost to promote their platform. Any upfront cost will probably be offset by locked-in customers later on.

I might just be a bit cynical though.


IIRC, TPUv2 uses 16 bit floating point in some format with higher dynamic range and lower precision than standard fp16. Can someone confirm?

If that is right, is the "Tensorflow-optimized" Resnet-50 using 16bit floats when running on TPUv2?


Re: fp16 dynamic range: yes.
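Presumably this is the format Google calls bfloat16: fp32's 8 exponent bits with the mantissa truncated to 7 bits, versus fp16's 5 exponent / 10 mantissa bits. The dynamic-range difference falls straight out of the bit layout:

```python
def max_finite(exp_bits, frac_bits):
    """Largest finite value of an IEEE-style binary float format."""
    bias = 2 ** (exp_bits - 1) - 1          # 15 for fp16, 127 for bfloat16
    # top mantissa pattern at the largest non-reserved exponent
    return (2 - 2.0 ** -frac_bits) * 2.0 ** bias

fp16_max = max_finite(exp_bits=5, frac_bits=10)   # 65504.0
bf16_max = max_finite(exp_bits=8, frac_bits=7)    # ~3.4e38, close to fp32's max

print(fp16_max, bf16_max)
```

So bfloat16 trades ~3 decimal digits of precision for the full fp32 exponent range, which largely removes the underflow/overflow headaches of standard fp16.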


Does this take into account the fact that you might need fewer epochs if you reduce the batch size? (as is done for the CPU?)


Author here.

No, this is really only comparing the throughput on the devices. A thorough comparison should really focus on time to reach a certain quality - including all of the tricks available for a certain architecture.


Would I be able to buy one of these for my home? Or just in the cloud? If I can buy, how much would it cost?


Author here.

These are only available on the Google Cloud right now. I don't think there are plans to sell them anytime soon.


> In order to efficiently use TPUs, your code should build on the high-level Estimator abstraction.

Does this mean it's inference-only? (I only quickly scanned the article)


No, this whole blog post is about training models.


I wonder if Chinese companies will use (or be allowed to use) TPUs. It seems like a pretty obvious way to have the NSA scoop up any Chinese AI advancements China may want to keep secret.


It's interesting to me that the assumption here is that government actors are interested in this as opposed to the companies hosting them.


Oh, I totally agree with you there. It's just I consider Google a government actor too.


Does this mean you consider Google a government unto itself, or part of an existing government?


Google is the same as NSA, but exists as a dance around the 4th Amendment. Google can do the sort of spying the US government can't constitutionally do, then hand that over to the government, constitutionally, under gag order if necessary. It's all stagecraft. Same for Apple, Amazon, Intel, etc. Eric Schmidt runs HRC's campaign. Al Gore is on Apple's board.

They are all set up to spy on us. Deep state. They hunt sys admins. If you're here, you're a target.


> Google is the same as NSA, but exists as a dance around 4th amendment

You are a conspiracy theorist.

> Google can do the sort of spying the US government can't constitutionally do, then hand that over to the government, constitutionally

This is an agenda-driven redefinition of the word "spy" that I find disingenuous in the extreme.

> It's all stagecraft. Same for Apple, Amazon, Intel, etc. Eric Schmidt runs HRCs campaign. Al Gore is on Apple's board.

Firstly: Eric Schmidt founded a company designed to legally channel lots of money via analytics expertise into his favorite candidate, like every rich person does under the current set of laws. Neither of us has to like it, but he did not "run her campaign" and if they actually did? Wow, not a great job there.

As for Al Gore?

I've worked with Al Gore. He did some advisory work for my financial data startup, as part of our first round's venture firm. Our data was some of the most valuable data about consumers that can possibly exist, and had incredible applications for both surveillance and law enforcement.

To the best of my knowledge, we were never pressured to hand over a byte to anyone. Quite the opposite. I specifically remember a conversation where he mentioned user data privacy was the single most important priority he felt we could have.

So between my lying eyes, ears, and email history and your wild gesticulations about the evil overlordship, I'm gonna have to lean towards my own personal experience.

Unless, of course, you got some actual evidence and not a room full of old coffee cups and red string up on the walls.


Only on HN does a conspiracy theory get shut down with “I once worked with Al Gore...” :)


He works with KP so it's not that unusual. He even tweeted about my product!


I wonder which Chinese companies are developing their own processors like TPUs.


Well, they do have the fastest supercomputer in the world currently, and it's made with homegrown chips. No Intel ME backdoors there. Smaller Chinese companies could, for a little more money, get similar performance buying 8x V100 machines from NVidia. I don't think they want to share their advancements in AI fighter pilots with the USA. They have a big lead.


What is the hardest thing to accomplish with something like a TPU? Is it the IP or the fabrication?

How does the TPU design offer improved performance? By leveraging IP or fabrication improvements?


Scale. Even if you design the fastest chip, you need to convince people to use your proprietary solution that has 0 users currently. Until then you don't have enough volume and scale.

Becomes a chicken-and-egg problem.

Google was able to solve it by designing TPUs for internal usage first, therefore reaching minimum scale, and then making changes and offering it publicly.

So you'd need to be:

1) Chip designer

2) Massive internal user

3) Cloud service provider

I don't see any other company that has all 3. Amazon has (3), and maybe (1) if they hired the right people. Microsoft could have enough (2) to justify it, but they're going the way of FPGAs.


Neither; it's a matrix-multiply systolic array ASIC, which was done decades ago.

There are a host of Chinese companies developing similar processors.
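For intuition, a systolic-array matmul streams operands through a grid of multiply-accumulate cells on a skewed schedule. A small Python simulation of just the accumulation order (wiring and pipelining omitted; this is an illustration, not the TPU's actual design):

```python
import numpy as np

def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing A @ B.

    Each cell (i, j) accumulates one output element. Operands are fed
    in skewed so that A[i, t] and B[t, j] meet in cell (i, j) at step
    i + j + t; only the accumulation schedule is modeled here.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for step in range(n + m + k - 2):      # last cell finishes at step n+m+k-3
        for i in range(n):
            for j in range(m):
                t = step - i - j           # which operand pair arrives now
                if 0 <= t < k:
                    C[i, j] += A[i, t] * B[t, j]
    return C
```

The appeal of the layout is that every cell does useful work each cycle with only nearest-neighbor communication, which is what makes the design so area- and power-efficient for dense matmuls.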


Why is Google investing in its own?


Cost advantage.


Bitmain already announced some.


It is hard for Google to make money on these TPUs, as the whole engineering cost has to be made back from its pricing on Google Cloud, whereas NVIDIA can pay back its engineering costs via multiple mature channels (games, supercomputers, and multiple cloud providers).

I wonder which is higher, the cost for creating the TPUs in terms of engineering and manufacturing or the cost differential in terms of usage as compared to NVIDIA's latest?

I worry about Google long term here. I am surprised the TPU doesn't kick the ass of the NVIDIA chips.


Disclosure: I work on Google Cloud.

By the logic above, you would conclude that TPUv1 (the inference-only chip) might have been a mistake, but we’ve been very public about how it “saved us from building lots of datacenters”.

That wasn’t ever sold as part of Cloud, so the benefit there is all from the second bit you mentioned: cheaper and more efficient than GPUs at the time. The paper also goes into more detail, but the size of that initial engineering team and time to market were both quite small.

For training, before Volta (and kind of Pascal), GPUs were the best option but not particularly efficient. Volta does the same "we should have a single instruction that does lots of math in one shot" by cleverly reusing the existing functional units. That the V100 is a great chip is a good outcome for the whole industry. But GPUs aren't (and shouldn't be) just focused on ML. My bet is that there's still a decent amount of runway left in specialized chips for ML, just as GPUs carved out their own niche versus CPUs.

But again, the “even just for Google” benefit is really enormous so I wouldn’t assume that Cloud has to pay for the entire effort. Could GPU manufacturers improve the cost:performance ratio of ML workloads enough that Google doesn’t have to build TPUs anymore? Perhaps, but like the V100 improvements that would be a great outcome!


Is there going to be an updated paper on performance per Watt, now that TPUv2 is public and V100 has been preannounced on the Google blog?


There's no real need to worry about Google in the long term - nVidia can make back their money solely with their GPUs; Google probably made their expenses back this weekend with searches around the Olympics. It'd be pointless for them to not use their TPUs themselves, and their main product, Adsense, uses ML.


Are you sure about Adsense? Talked to ad pros recently and they all complained Adsense is ancient (still MySQL?) and often broken; doesn't look like Google emphasizes it despite being their cash cow, more like a deep state of neglect.



> still mySQL?

The F1 distributed database was developed to move the AdWords business off of MySQL.

https://research.google.com/pubs/pub41344.html


Adsense isn’t Google’s cash cow, Adwords is.


Ah OK, I might have been confused then. Thanks!


Google probably got back a lot of the engineering costs before it even rented out the first TPU, simply by virtue of running its own workloads, without having to buy tons of CPUs or GPUs. They're also very, very good at reducing computing resource waste (I know this firsthand).

I wouldn't be surprised if public TPUs are to some degree a way to print money: at least for a while, Google can probably just rent out its unused capacity that it had already planned and paid for. :-)


> I am surprised the TPU doesn't kick the ass of the NVIDIA chips.

30% cheaper e2e price for the company's first public offering, compared to the market leader's top-of-the-line chip sounds...pretty good to me?


30% list price. Who knows what the underlying margins are and how much cheaper Google can go with an offline agreement.


Since TPUs are used at Google to process data for its own service offerings (e.g. image classification, voice recognition, language translation, NLP, route planning, etc.) wouldn't it be fair to say that they will also be able to recoup the sunk costs (R&D) by purchasing fewer GPUs?


> TPU doesn't kick the ass of the NVIDIA chips

It used to, until Volta came out with basically TPUs embedded on the board. We will see if AMD will join them, as Vega in theory should be around Volta as well; the tooling is just not there.


How long has Google had the TPUv2 for internal use? I was under the impression that V100 and TPUv2 were developed around the same time. They were certainly announced around the same time at least. Just seems weird to say "it used to," when V100 has been shipping since mid-summer 2017.


I think at least for inference TPUv1 was beating all previously available GPUs by a wide margin. TPUv2 did that for training as well, with the exception of Volta.


>I am surprised the TPU doesn't kick the ass of the NVIDIA chips.

Yeah, I'm a bit disappointed myself. When announced initially, it seemed Google had a huge lead. But they dragged their feet for two years getting it to market, and now NVidia is nipping at their heels already.

I suspect they are using the TPUs internally for competitive advantage, and these are the leftovers they are done with. They're probably using v4 or v5 internally already.


I agree with you that the cost of TPU development probably outweighs the dollars that Google will earn renting TPUs. The thing is, no one else has a TPU but Google, and that doesn't look like it will change any time soon. That means that if you want to run the fastest machine learning models, you have to use Google Cloud. So Google doesn't just benefit from the TPUs directly; they can use them to attract more customers to their cloud. After that starts happening, all of the best machine learning people will have Google Cloud experience. Then when they start something new, they will use what they know: Google Cloud. Also, they will create tooling that only works with TPUs and gives an advantage you cannot get outside of Google Cloud. So it will be a net win for Google even if running a TPU costs more than what they rent them for.

tl;dr TPU helps Google Clouds' network effect.


Furthermore, computer hardware is not static. Is this a real long term investment by Google?

If they do not continue to improve on process, they will fall behind in just a few years.



