While not perfect, I want to commend the RiseML folks for not only doing a “just out of the box” run in both regular and fp16 mode (for the V100), but also adding their own LSTM experiment to the mix. We need third-party benchmarks whenever new hardware or software is being sold by vendors (reminder: I benefit from you buying Google Cloud!).
I hope the authors are able to collect some of the feedback here and update their benchmark and blog post. The question about batch size comparisons is probably the most direct, but like others, I’d encourage a run on 1, 2, 4 and 8 V100s as well.
Thanks for your feedback and suggestions (and everybody else's)! We'll make sure to gather all of the valuable feedback and run additional experiments. Different batch sizes and a comparison against more than one GPU are already planned (and partly executed).
It makes any benchmark a Google Cloud benchmark, right?
Edit: I am complaining a bit about the lack of availability, but there's also a real point here. If there's no source for TPUs outside of Google, Google Cloud competes only with other cloud providers and with owning physical GPUs. Long term, it has no incentive to be anything but a little more efficient than these, however much its cost of producing TPUs declines.
(I'm saying this with my CMU hat, not my Google hat.)
The era of the mainframe, with each provider competing with a custom chip, wasn't necessarily beneficial for individuals buying computing power.
If you look at the history of processors, you see early custom processors, followed by some standardization and copying around the IBM S/360, followed by more proprietary innovation in the PC era that finally consolidated around x86, which was eventually disrupted by mobile chips, which in turn consolidated around ARM, and so on.
With cloud computing, we are essentially going back to the era of time-shared mainframes with remote access.
The (expensive) report from The Information upon which that article was based speculates a little more about chips for training.
For a while, it's very likely that Google will be the main user of these, so there's still plenty of incentive for it to increase efficiency and reduce costs.
It may be that the other cloud providers would then sell them to those individuals. Indeed, it's the job of entities called "distributors" to buy big lots from manufacturers and break them up.
And of course, I don't know what the point of (apparently) keeping them out of the average person's hands would be.
I assume there's a high capital cost for this new hardware, but as they scale it up, I wonder if the ratio of TPU cost to GPU cost will trend towards the ratio of performance-per-watt between the platforms? Seems like a natural limit, even if it never quite gets there.
But you aren't paying for the electricity, you're paying for processing, which is an unconnected parameter. They only "sell" these chips per use, not on the open market.
Presuming power is a major cost input (which I assume it is), their profit per operation is much higher. For example, if a TPU does twice the work per watt of a GPU, Google can match GPU pricing per unit of work and still roughly halve its energy bill. So they could sell for less than an equivalent GPU and make more money. But they think they can get away with value pricing it (processing more per unit time is presumably worth it to many customers), and more power to them.
(that last bit was not an intentional pun; only noticed after typing it)
Depends on your metric, Jonathan! If you focus on the per dollar numbers, then it’s actually net favorable to the V100, because a second GPU over NVLINK won’t be as cost-efficient. If what you care about is raw throughput “in a single box”, then 8xV100 probably comes out ahead here.
Like someone else below though, I worry about the “hey wait a minute, changing the batch size just for the TPU seems unfair” and the whole “the LSTM didn’t converge” bit. Not a bad first draft, but hopefully the authors can do some more comparisons.
Thanks for your feedback. As I noted above, we will report further results with larger batch sizes (and smaller ones for the TPU). The LSTM not converging is one of the experiences we wanted to share. We are working on solving this issue and will update the post accordingly. Our goal is really a fair and valuable comparison, which is not easy, so we value all of the feedback.
(I'm obviously biased - I helped with parts of the cloud-side of cloud TPU - but I presume this comment stands on its own. :-)
Also, while I certainly agree that the performance/dollar comparison is highly relevant to customers at a given instant, that may only tell us that Google is subsidizing this hardware now that they've deployed it, and/or that, lacking serious competition, NVIDIA has been building crazy margins into their P100/V100 prices. In understanding fundamental technological tradeoffs, and even the limits of what the pricing in a more competitive market could be, it is relevant to compare performance per unit of hardware resources (mm^2, die/package, watt, GB of HBM, etc.)
In short, these comparisons are hard, and no single one tells a complete story. I pushed back because, while the post includes some nuance, it brushes a great deal under the rug and focuses primarily on a problematic comparison.
(Further disclosure: I'm at least the third person in this sub-thread with some Google Brain/Cloud affiliation. I am speaking in my independent academic voice. I also think TPUs are great, having them publicly available now is great, and competition and diversity of architectural approach in accelerators is great. I appreciate the effort of the authors, but think the subtlety of these comparisons requires serious care.)
(2) Unless Google releases the specs to the h/w, I'd argue that cost is our best proxy. But if you assume that both Google and Amazon want to make a profit on their cloud rentals, it at least gives us a way to get to something we can normalize to (the V100's list price is public, though who knows how much Amazon pays). And, given that you can't buy a Cloud TPU, the price Google charges really is the meaningful answer. It doesn't tell us about fundamentals, but it's the right answer from a consumer standpoint.
I think it's a fair bigger-picture question to ask how we fairly and informatively benchmark cloud-only services, in ways that not only give us consumer-oriented price comparisons but also let us learn from the underlying technical choices. The longer-term answer is that we beg Google to write a paper about TPUv2, as they (surprisingly!) did about TPUv1 -- because without that, we just get black-box numbers combined with informed speculation based upon glossy board and heatsink photos.
btw - the best current source of specs about TPUv2 is Jeff's NIPS talk: http://learningsys.org/nips17/assets/slides/dean-nips17.pdf
Which mentions a few details like 16GB HBM per chip with 600GB/s memory bandwidth.
(3) I agree completely with you that the comparisons are hard. I'm very glad the authors of the blog post are listening to the feedback they're getting here -- on the LSTM, on batch size comparisons, and about precision and being clear about which things they're measuring.
(Reminder disclosure: It's awkward talking about Google in the third person since they pay me part time, but I'm trying to take this discussion with my academic hat also. This nested series of disclaimers is an amusing commentary about how small the machine learning + systems community is.)
Our intention is really to provide a sound comparison. I think we agree that these kinds of comparisons can be hard given the constraints (e.g., lack of available technical information on TPUv2 or public implementations of optimized models for certain architectures). As I stated elsewhere, we are collecting all of the feedback and will run additional experiments.
If you know of an implementation of a mixed-precision/fp16 model that you'd like to see results for, please let us know! I may also reach out directly to you for that if you don't mind.
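In case it helps frame the request: the kind of setup we have in mind looks roughly like the sketch below (illustrative TF 1.x-style code, not our benchmark implementation; build_model, the custom getter, and the loss-scale constant are placeholders). fp16 is used for the activations and matmuls so the TensorCores get engaged, the master weights stay in fp32, and the loss is scaled to keep small gradients from underflowing:

    import tensorflow as tf

    LOSS_SCALE = 128.0  # illustrative static loss scale

    def fp32_getter(getter, name, shape=None, dtype=None, **kwargs):
        # Store master weights in fp32; hand fp16 casts to the model.
        var = getter(name, shape, tf.float32, **kwargs)
        return tf.cast(var, tf.float16) if dtype == tf.float16 else var

    images = tf.placeholder(tf.float32, [None, 224, 224, 3])
    labels = tf.placeholder(tf.int64, [None])

    with tf.variable_scope('model', custom_getter=fp32_getter):
        # build_model is a hypothetical stand-in for the network.
        logits = build_model(tf.cast(images, tf.float16))

    loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=tf.cast(logits, tf.float32)))

    opt = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)
    grads_and_vars = opt.compute_gradients(loss * LOSS_SCALE)
    train_op = opt.apply_gradients(
        [(g / LOSS_SCALE, v) for g, v in grads_and_vars if g is not None])

The fp32 master copy matters because fp16 weight updates can be smaller than fp16 can represent - that's also the kind of thing we'll be looking at for the LSTM convergence issue.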
It's all about performance per dollar.
Not necessarily. The DGX-1, for example, has pretty poor perf/$$ but reduces the time a data scientist spends waiting. For some organizations, their people's time is so valuable that what matters is “what gets me my answers back faster”, because that employee is easily $100/hr+.
That’s actually why the 8xV100 with NVLINK is so attractive (and why the TPUs also have board to board networking, not just chip to chip).
I'd like to know perf/watt, for instance, even if it doesn't matter to the customer.
Point well taken, we'll make sure to add a comparison to 4 and 8 GPUs. For now, a "Cloud TPU" (containing 8 cores) seems to be the smallest unit to allocate. The question of what exactly makes up a single device and how many to compare against each other is not easy to answer.
Impressive results regardless, though; quite a bit faster relative to the V100 than the paper specs would suggest.
Good point, I agree that the FP16 GPU results should be closer or grouped with the TPU results. We'll try to update accordingly.
Note that the TPU supports larger batch sizes because it has more RAM. We tested multiple batch sizes for the GPUs and reported the fastest one (a sweep along the lines sketched below). We'll try increasing the batch sizes as far as possible and report back. The overall comparison will likely not change by much - we saw speed increases of around 5% when doubling the batch size from 64 to 128. (https://www.tensorflow.org/performance/benchmarks also reports numbers for batch sizes of 32 and 64 on the P100.)
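For concreteness, the sweep looks roughly like this, using the tf_cnn_benchmarks script from the benchmarks page linked above (flag values are illustrative, not our exact configuration):

    python tf_cnn_benchmarks.py --model=resnet50 --num_gpus=1 --batch_size=64
    python tf_cnn_benchmarks.py --model=resnet50 --num_gpus=1 --batch_size=128

We then report whichever batch size yields the highest images/sec.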
Oh! You should definitely say that. It's semi-reasonable then to choose the batch size that is optimal for the part. It'd be good to make sure this isn't why your LSTM didn't converge though...
They are claiming a 29x improvement in that area.
Link to NVIDIA documentation on mixed-precision TensorCores: https://devblogs.nvidia.com/inside-volta/
TPUv2 is specially optimized for deep learning.
Nvidia's Volta microarchitecture is a graphics processor with additional tensor units - a general-purpose (GPGPU) chip designed with graphics and other scientific computing tasks in mind. Nvidia has enjoyed monopoly power in the market, and a single microarchitecture has been enough in every high-performance category.
The next logical step for Nvidia is to develop a specialized deep learning chip to compete with TPUv2 and others.
I don't know, this benchmark seems to show the V100 doing pretty well against a specialized ASIC. It may well be that all NVIDIA has to do is cut prices on the V100 to make two V100s about as expensive as the Cloud TPUv2. With an increased batch size, it looks like two V100s would have performance comparable to TPUv2.
The microarchitecture has many unnecessary things and is not optimized as a whole for deep learning.
* It might've been a different conference, now that I think of it.
I'm also not sure how we can take Google's word for the numbers, since they might as well be eating a less-than-ideal power cost to promote their platform. Any upfront cost will probably be offset by locked-in customers later on.
I might just be a bit cynical though.
If that is right, is the "Tensorflow-optimized" ResNet-50 using 16-bit floats when running on TPUv2?
No, this is really only comparing the throughput on the devices. A thorough comparison should really focus on time to reach a certain quality - including all of the tricks available for a certain architecture.
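To make that concrete, here's a minimal sketch of what we mean by time-to-quality (train_step and evaluate are hypothetical stand-ins for the framework-specific calls):

    import time

    def time_to_quality(train_step, evaluate, target_acc,
                        max_steps, eval_every=1000):
        # Wall-clock seconds until validation accuracy first reaches
        # target_acc; (None, max_steps) means it never converged.
        start = time.time()
        for step in range(1, max_steps + 1):
            train_step()
            if step % eval_every == 0 and evaluate() >= target_acc:
                return time.time() - start, step
        return None, max_steps

Comparing devices on this metric would let each architecture use its own best tricks (batch size, precision, input pipeline) while still measuring what users actually care about.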
These are only available on the Google Cloud right now. I don't think there are plans to sell them anytime soon.
Does this mean it's inference-only? (I only quickly scanned the article)
They are all set up to spy on us. Deep state. They hunt sys admins. If you're here, you're a target.
You are a conspiracy theorist.
> Google can do the sort of spying the US government can't constitutionally do, then hand that over to the government, constitutionally
This is an agenda-driven redefinition of the word "spy" that I find disingenuous in the extreme.
> It's all stagecraft. Same for Apple, Amazon, Intel, etc. Eric Schmidt runs HRCs campaign. Al Gore is on Apple's board.
Firstly: Eric Schmidt founded a company designed to legally channel lots of money via analytics expertise into his favorite candidate, like every rich person does under the current set of laws. Neither of us has to like it, but he did not "run her campaign" and if they actually did? Wow, not a great job there.
As for Al Gore?
I've worked with Al Gore. He did some advisory work for my financial data startup, as part of our first round's venture firm. Our data was some of the most valuable data about consumers that can possibly exist, and had incredible applications for both surveillance and law enforcement.
To the best of my knowledge, we were never pressured to hand over a byte to anyone. Quite the opposite. I specifically remember a conversation where he mentioned user data privacy was the single most important priority he felt we could have.
So between my lying eyes, ears, and email history and your wild gesticulations about the evil overlordship, I'm gonna have to lean towards my own personal experience.
Unless, of course, you got some actual evidence and not a room full of old coffee cups and red string up on the walls.
How does the TPU design offer improved performance? By leveraging IP or fabrication improvements?
Becomes a chicken-and-egg problem.
Google was able to solve it by designing TPUs for internal usage first, therefore reaching minimum scale, and then making changes and offering it publicly.
So you'd need to be:
1) Chip designer
2) Massive internal user
3) Cloud service provider
I don't see any other company that has all 3. Amazon has (3), and maybe (1) if they hired the right people. Microsoft could have enough (2) to justify it, but they're going the way of FPGAs.
There are a host of Chinese companies developing similar processors.
I wonder which is higher: the cost of creating the TPUs in terms of engineering and manufacturing, or the cost differential in terms of usage compared to NVIDIA's latest?
I worry about Google long term here. I am surprised the TPU doesn't kick the ass of the NVIDIA chips.
By the logic above, you would conclude that TPUv1 (the inference-only chip) might have been a mistake, but we’ve been very public about how it “saved us from building lots of datacenters”.
That wasn’t ever sold as part of Cloud, so the benefit there is all from the second bit you mentioned: cheaper and more efficient than GPUs at the time. The paper also goes into more detail, but the size of that initial engineering team and time to market were both quite small.
For training, before Volta (and kind of Pascal), GPUs were the best option but not particularly efficient. Volta does the same “we should have a single instruction that does lots of math in one shot” trick by cleverly reusing the existing functional units. That the V100 is a great chip is a good outcome for the whole industry. But GPUs aren't (and shouldn't be) just focused on ML. My bet is that there's still a decent amount of runway left in specialized chips for ML, just as GPUs carved out their own niche versus CPUs.
But again, the “even just for Google” benefit is really enormous so I wouldn’t assume that Cloud has to pay for the entire effort. Could GPU manufacturers improve the cost:performance ratio of ML workloads enough that Google doesn’t have to build TPUs anymore? Perhaps, but like the V100 improvements that would be a great outcome!
The F1 distributed database was developed to move the AdWords business off of MySQL.
I wouldn't be surprised if public TPUs are to some degree a way to print money: at least for a while, Google can probably just rent out its unused capacity that it had already planned and paid for. :-)
30% cheaper e2e price for the company's first public offering, compared to the market leader's top-of-the-line chip sounds...pretty good to me?
It did until Volta came out with what are basically TPUs embedded on the chip. We'll see if AMD joins them, as Vega in theory should be around Volta's level as well; the tooling just isn't there yet.
Yeah, I'm a bit disappointed myself. When announced initially, it seemed Google had a huge lead. But they dragged their feet for two years getting it to market, and now NVidia is nipping at their heels already.
I suspect they are using the TPUs internally for competitive advantage, and these are the leftovers they are done with. They're probably using v4 or v5 internally already.
tl;dr TPU helps Google Clouds' network effect.
If they do not continue to improve on process, they will fall behind in just a few years.