I want to highlight this paragraph from the post:
> Here at Google Cloud, we want to provide customers with the best cloud for every ML workload and will offer a variety of high-performance CPUs (including Intel Skylake) and GPUs (including NVIDIA’s Tesla V100) alongside Cloud TPUs.
We fundamentally want Google Cloud to be the best place to do computing. That includes AI/ML, so you’ll see us both invest in our own hardware and provide the latest CPUs, GPUs, and so on. Don’t take this announcement as “Google is going to start excluding GPUs”; rather, we’re adding an option that we’ve found internally to be an excellent balance of time-to-trained-model and cost. We’re still happily buying GPUs to offer to our Cloud customers, and as I said elsewhere, the V100 is a great chip. All of this competition in hardware is great for folks who want to see ML progress in the years to come.
AMD hopefully has a team writing such patches now. It makes business sense for them to do so.
Google is getting price-gouged by Nvidia even more than the general public is, and has even more incentive to level the playing field.
We know the V100 does FP16/FP32 on its tensor cores; when will you follow suit?
Edit: sort of, from https://www.theregister.co.uk/2017/12/14/google_tpu2_specs_i...
"32-bit floating-point precision math units for scalars and vectors, and 32-bit floating-point-precision matrix multiplication units with reduced precision for multipliers."
So what does "reduced" mean exactly?
Once DX9 was split into DX9b and DX9c, that “advantage” went away and NVIDIA proved that 16/32-bit was better, something ATI also had to adopt once MSFT told them enough is enough.
24-bit is only better as long as it can do everything 32-bit can do and it’s advantageous to build hardware with 24-bit FPUs instead of 32-bit FPUs that can also do 2x 16-bit ops per cycle.
Basically, only if the silicon cost allows you to fit far more 24-bit FPUs than 32/16-bit ones.
And history proved that this isn’t the case.
For gaming, even 2:1 FPUs eventually went away, since they are costlier than plain 32-bit FPUs with promotion.
Maybe in the future we’ll have a 24-bit FPU that can also do three 8-bit ops or a 16-bit + 8-bit op per cycle, if that proves more beneficial than the current 2:1 16/32-bit model.
Lower. Network. Egress. Pricing. By. Two. Orders. Of. Magnitude.
Market rate is close to $1 per TB outbound. Your rate is $80-$120 per TB. That's just embarrassing.
Any plans for Ryzen?
But I’d like to note that even if we were to use parts internally at Google (or not!), what matters for Cloud is market demand. If there really were enormous customer demand for, say, ARM64, then we would look into it, even if the rest of Google wasn’t interested.
Attaching a VM of similar power to a TPU on Google Compute Engine is much cheaper (https://cloud.google.com/compute/pricing, n1-highmem-64, +$3.78/hr to the TPU cost for $10.28/hr total).
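A quick sanity check on those numbers, as a sketch (the $6.50/hr Cloud TPU rate below is inferred from the quoted total rather than taken from the pricing page, so treat it as an assumption):

```python
# Back-of-the-envelope check of the figures quoted above.
tpu_hourly = 6.50   # implied: 10.28 - 3.78; verify against current GCP pricing
vm_hourly = 3.78    # n1-highmem-64, as quoted

total_hourly = tpu_hourly + vm_hourly
print(f"Cloud TPU + n1-highmem-64: ${total_hourly:.2f}/hr")  # -> $10.28/hr
```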
Per recent benchmarks for training on ImageNet (https://dawn.cs.stanford.edu/benchmark/), training on a p3.16xlarge cost $358, while this post claims it'll cost less than $200. (EDIT: never mind; the benchmark uses ResNet-152, and Google compares TPU performance on ResNet-50.) Interesting.
It might be worth it if I were willing to pay a huge amount to get back results from an experiment faster by using lots of TPUs; distributed learning on GPUs doesn't seem easy yet.
Peak ops/second isn’t the only thing that matters though. You have to be able to feed the units. The V100 does lots of finer-grained matrix multiplies, which can make it harder to keep them fed.
Don’t get me wrong, the V100 is a great chip. And we’re all looking forward to more (preferably third-party) benchmark results, to tease out when one is the better choice for a workload. But don’t just compare ops/second or any other architectural number.
TPUv2 has 600 GB/s per chip × 4 chips, so 2400 GB/s.
As we've discussed elsewhere, comparing TPUv2 to V100 on a per-chip basis doesn't make much sense. Who cares how many chips are on the board? If Google announced tomorrow that TPUv3 is coming out, which is identical to TPUv2 but with the four chips glued together, nobody would care.
The questions that we should instead be asking are, how fast can I train my model and how much does it cost?
Per elsewhere in the thread, on Volta you have 900 GB/s of memory bandwidth against 100 Tops/s of compute, whereas on TPUv2 you have 2400 GB/s against 180 Tops/s. This means that TPUv2's memory-bandwidth-to-compute ratio is (2400/180) / (900/100) ≈ 1.5x higher than Volta's.
We can do a similar comparison for memory capacity. V100 has 16 GB per 100 Tops/s; TPUv2 has 64 GB per 180 Tops/s. So the memory-to-compute ratio for Volta is 16/100 = 0.16 GB per Tops/s, while for TPUv2 it's 64/180 ≈ 0.36 GB per Tops/s, a ratio of 0.36/0.16 = 2.25x higher on TPUv2.
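For concreteness, the same arithmetic as a short Python sketch, using only the peak figures quoted above (the units cancel, so only the final ratios matter):

```python
# "Shape" comparison from peak numbers; real utilization will differ.
v100_bw, v100_ops, v100_mem = 900e9, 100e12, 16e9    # bytes/s, ops/s, bytes
tpu2_bw, tpu2_ops, tpu2_mem = 2400e9, 180e12, 64e9   # 4-chip Cloud TPU totals

bw_ratio = (tpu2_bw / tpu2_ops) / (v100_bw / v100_ops)
mem_ratio = (tpu2_mem / tpu2_ops) / (v100_mem / v100_ops)
print(f"TPUv2 bandwidth per op: {bw_ratio:.2f}x the V100's")   # ~1.48x
print(f"TPUv2 memory per op/s:  {mem_ratio:.2f}x the V100's")  # ~2.22x
```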
Does any of this matter? Does it translate into faster and/or cheaper training? Do models actually need and benefit from this additional memory and memory bandwidth?
My guess from working on GPUs is yes, at least insofar as bandwidth is concerned, but it's just a guess. I'm excited to find out for real.
(Disclaimer: I work at Google on XLA, and used to work on TPUs.)
Nobody is comparing DGX1-V to a single TPUv2 chip, because it doesn't make any sense to do so. They are totally different kinds of machines. But for some reason everyone is comparing a cluster of 4 TPUv2 chips to a single V100 chip.
It only makes sense to compare 4xTPUv2 to 1xV100 if they are equivalent in some meaningful metric, like total die size, power, etc.
In the absence of any available data, I'm going to continue to assume that each TPUv2 chip is roughly comparable in terms of power and die size to each V100 chip. If this were grossly wrong, I would expect all four to be condensed into a single chip, which would dramatically increase the performance of the interconnects.
We could resolve this rapidly if there were any data available about die size, TDP, or anything else about TPUv2.
I agree that some people are doing that. Marketing, I suppose. But that comparison is explicitly not the point of my parent post. I'm comparing the "shapes" of the chips -- specifically, the compute/memory and compute/memory-bandwidth ratios. These ratios stay the same regardless of whether you multiply the chips by 4 or by 400.
The point I was trying to make is that V100 has a higher peak-compute-to-memory(-bandwidth) ratio than TPUv2. This much seems clear from the arithmetic. Whether this matters in practice, I don't know, but I think it is relevant if one believes (as I do, based on the evidence I have as an author of an ML compiler targeting the V100) that the V100 is starved for memory bandwidth.
> In the absence of any available data, I'm going to continue to assume that each TPUv2 chip is roughly comparable in terms of power and die size to each V100 chip. If this were grossly wrong, I would expect all four to be condensed into a single chip, which would dramatically increase the performance of the interconnects.
I'm sure Google's hardware engineers operate under a lot of constraints that I'm not aware of; I'm not about to make assumptions. But more to the point, as we've said, things like die size and TDP don't directly affect consumers. The questions we have to ask are, how fast can you train your model, and at what cost?
Just as you don't like it when people (incorrectly, I agree) insist on comparing one V100 to four TPUs, because that's totally arbitrary (why not compare one V100 to 128 TPUs?), I don't like it when people insist on comparing TPUv2 to V100 on arbitrary metrics like die size, or peak flops/chip, or whatever. So I disagree that we could resolve anything if we had more info about the TPUv2 chip itself. None of that matters.
BTW, I absolutely agree that memory bandwidth is the bottleneck; I've built my company around that assertion, and the data for it exists (Mitra's publications come to mind).
Thank you for your courteous reply.
Edit for this part of the thread: the best public numbers are in the linked presentation.
I don't recall what (if anything) we've said about how we link up the boards across racks, but the folks at Next Platform looked pretty carefully at the pictures: https://www.nextplatform.com/2017/05/22/hood-googles-tpu2-ma...
You can't have your cake and eat it too.
I know you hedged by saying "back of the envelope", but I'd much rather compare on real benchmarks than based on cited peak performance numbers, which are kind of meaningless.
In any case I agree, we shouldn't be looking at the stated peak compute of either of the chips.
(Disclaimer: I work at Google on XLA, and have in the past worked on TPUs.)
which claims 2400 GB/s for the board and 600 GB/s per “chip”.
Also, that benchmark cost of $358 is for ResNet-152, not ResNet-50.
Note that the post says “less than $200” not $200. There are lots of values between 0 and 200. What we’d love is for third-party folks like yourself to do the comparison (which I know you can, Max!)
TensorFlow benchmarks show ResNet-152 having 2.4x lower throughput than ResNet-50.
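Assuming cost scales roughly inversely with throughput (a rough assumption; the two models also need different numbers of epochs to converge), the $358 ResNet-152 figure can be rescaled as a sketch:

```python
# Rough rescaling only; not a benchmark result.
resnet152_cost = 358.0   # DAWNBench p3.16xlarge figure quoted above
throughput_ratio = 2.4   # ResNet-50 throughput relative to ResNet-152

estimated_resnet50_cost = resnet152_cost / throughput_ratio
print(f"~${estimated_resnet50_cost:.0f}")  # ~$149
```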
Edit: I'm genuinely curious about why this comment is getting downvotes.
I didn’t downvote you, but presumably people disagree with “Here’s an FPGA” as comparable to being given a working piece of hardware. That is, would you have said that the best comparison to a V100 is this same FPGA box?
I (and others) get what you were trying to say: TPUs are ASICs that aren’t general purpose at all, so an FPGA is a better comparison than a more general-purpose GPU. As an end user, that just isn’t true though. If someone hands you an f1.16xlarge, you have to build your own pseudo-chip for machine learning, while with this offering TensorFlow handles the acceleration/offload for you.
On the retail side: Using computer vision to deliver alerts about shelf condition.
For farming: Using computer vision + ML to devise and track health monitoring for crops.
For manufacturing: Predictive maintenance of equipment has been a very popular area of focus.
There have been countless use cases on the finance side of things. For instance, anomaly detection techniques help with reconciling accounts and detecting fraud.
The energy industry seems to never run out of use cases for tracking commodities and/or helping predict load.
In HR, predicting turnover and education demands are some of the early use cases being approached but I expect a lot more over time.
Logistics is another area that will have a seemingly endless supply of use cases: things like loss tracking, warehouse optimization, and raw material allocation and sourcing. I don't think I've ever been involved in a logistics/manufacturing project that couldn't have used some ML to add efficiency to the process.
We all see success stories for very refined and well-defined problems with huge amounts of training data, with models created by the top 1% of engineers. But for the average business such conditions may not be achievable: to train a model to recognize various shelf conditions in different situations, buildings, etc., you need a nontrivial set of training data, and you will have unclear expectations about model performance.
I believe NYT does this already, but using minimal oversight to prevent any edge case misses or false positives.
Presently there’s not much in the way of suitable options for large media companies that build their models in house. At the same time, media tends to prefer not to invest too heavily in hardware if they don’t have to. Convincing leadership to use a cloud service to train an AI/ML model sounds leaner and lets them tick off even more buzzwords for the executive, etc. That said, results from efforts in the aforementioned application sound promising.
Don't startups want to win a big exit though? Google won't need to buy the startup for billions, because the TOS already grants them permission to use all the models and training data for free. Seems like a Faustian bargain to me.
The TOS you are quoting only refers to the information you provide in the survey. Here are the Google Cloud TOS: https://cloud.google.com/terms/ if you're interested in what Cloud does with customers data.
5.2 Use of Customer Data. Google will not access or use Customer Data, except as necessary to provide the Services to Customer.
Your training data and models are secure.
Also, even Google's general consume terms of service really isn't what you think: https://www.google.com/policies/terms/
"You retain ownership of any intellectual property rights that you hold in that content. In short, what belongs to you stays yours."
1) AI startups usually don't have a lot of value to potential acquirers based on their data, but based on other things (e.g., talent, customers, business model, platform, brand). That's like saying you shouldn't use AWS because Amazon can just steal and commercialize all your data.
2) There are other companies than Google that acquire startups
Having said that, I highly doubt that Google can just use all the training data on GCloud to launch their own products. They can surely look at it and maybe do stuff with it internally, but I am pretty sure that they can't use it commercially.
How would you ever know if they did? People who worked at Google have been accused, by Google, of stealing the entire self driving car program and taking it to a competitor.
It's also vastly different. Of course someone working at Google on a project has access to that project. It doesn't mean they have access to your stuff.
I shiggy diggy.
Hard to get employees to not steal from you if you are stealing from your customers.
I think this is what tripped me up before. I closed my business years ago but it was completely impossible to get Google to fix this. Now it fixed it "by itself".
Just a warning to everyone before signing up with your main Google account :-)
A "single TPU" is 4 ASICs. It is not clear if it makes sense to compare a "single TPU" to a "single GPU."
As a point of reference, NVIDIA's numbers are 6 hours for Resnet-50 on Imagenet when training with 8xV100. From a naive extrapolation, 4xV100 would probably take ~12 hours and 1xV100 about two days.
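That extrapolation assumes perfectly linear scaling across GPUs, which is optimistic; as a sketch:

```python
# Naive linear extrapolation from the 8xV100 ResNet-50 figure quoted above.
# Real multi-GPU scaling is sublinear, so the single-GPU estimate is rough.
hours_on_8 = 6.0
for n_gpus in (8, 4, 1):
    print(f"{n_gpus} x V100 -> ~{hours_on_8 * 8 / n_gpus:.0f} hours")
# 8 -> ~6h, 4 -> ~12h, 1 -> ~48h (about two days)
```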
Google has previously only compared TPUs to K80, so it will be interesting to see some benchmarks that compare TPUs to more recent GPUs. K80 was released in 2014, and the Kepler architecture was introduced in 2012.
Why does the number of chips matter?
Put another way, suppose Google tomorrow announced Cloud TPU v3 which was one ASIC identical in all ways to four v2 ASICs glued together. Would that be notable in any way? Seems like it would be a nop to me.
I think what matters is, how fast can you train a model, and at what cost? Doesn't really matter if it's one chip or 10,000 behind the scenes.
I only mention the number of chips because loads of people are comparing the "single TPU" to a single V100 with the assumption that it is meaningful. I don't know the TDP, die size, etc. of the TPUv2 chip, so it may well make more sense for ballpark comparisons to compare "single TPU" to 4xV100.
For example, a "single TPU" has 64 GB of memory, whereas a "single GPU" has 16 GB (V100). Is this meaningful? I don't know.
It just seems like something worth noting. I could buy a DGX1-V with 8xV100, rebrand it as the TWTW TPU, and then go around and tell everyone how my TPU is 8x faster than GPUs. It appears that everyone is normalizing by marketing unit until benchmarks come out, which is potentially flawed.
But ultimately it comes down to the cost to complete some amount of work. Google also offers Nvidia GPUs in their cloud for training and should be able to compare the cost of using one over the other as both are supported by TF.
That is the ultimate guide on how good or not good the TPUs really are.
Hard for competitors (AWS and Azure) to replicate.
The article makes it sound like this is a new thing...
> On our production AI workloads that utilize neural network inference, the TPU is 15x to 30x faster than contemporary GPUs and CPUs.
As far as I know this will be the first opportunity for the public to prove those claims, as until now they've not been available on GCP. I don't mean to sound skeptical–I'm quite confident they're not exaggerating.
It will have very different performance characteristics.
From the paper:
"Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU."
In-Datacenter Performance Analysis of a Tensor Processing Unit
Price is about 5x a cloud Nvidia GPU instance on an hourly basis.
I suspect that we'll see more information about the ASICs over time, but it'll take time to really understand their characteristics vs a Nvidia GPU - which are at least right now a bit better understood.
The TPU is only an improvement in (b). It's not going to result in a big-O style speedup, because the same training algorithms and architectures will run on it that we run on CPUs & GPUs today.
I'm not sure what counts as "breaking new ground" - is that 10%? 100%? 1000%? :-) The things to watch out for in benchmarks will be:
(a) Perf/$. This is actually a big deal - one of my students recently blew through $5000 of Google Cloud credits running ImageNet experiments in a week. And we didn't finish them! As this cost really drops, it enables things like Neural Architecture Search, which uses tons of compute capability to explore architectural variants automatically.
(b) Absolute perf.
(c) Performance scaling. To what degree will the fast, 2D toroidal mesh allow a full pod of Cloud TPUs to scale nearly linearly? Absolute training times matter from a user productivity standpoint. Waiting 30 minutes for a result is very different from waiting 12 hours (you can do one of these while you sneak out to go running! :-).
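As a toy illustration of how (a) and (b) trade off (all numbers below are placeholders, not benchmark results):

```python
# Hypothetical numbers only: a faster accelerator can still win on total cost
# even at a higher hourly rate, which is why perf/$ is the number to watch.
setups = {
    "accelerator A": {"hours_to_converge": 12.0, "dollars_per_hour": 7.00},
    "accelerator B": {"hours_to_converge": 30.0, "dollars_per_hour": 3.00},
}
for name, s in setups.items():
    cost = s["hours_to_converge"] * s["dollars_per_hour"]
    print(f"{name}: {s['hours_to_converge']}h at ${s['dollars_per_hour']}/hr -> ${cost:.0f}")
```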
The NIPS'17 slides have more technical context for some of this: https://supercomputersfordl2017.github.io/Presentations/Imag...
And everything that entails: lower energy consumption, higher throughput, lower cost at volume, higher profits for GCP, etc.
> Google didn't break any new ground in terms of training perf?
Relative to GPUs, sure, but I can't say how well they stack up against other custom ASICs for DL applications.
Seems like in terms of cost per performance, both AWS P3 spot instances and Paperspace V100 offerings are more economical.
Are these prices expected to become more competitive once it is out of beta?
(Yes TF is OSS, but realistically Google is putting much more resources into it)
(I'm part time on Brain, but, of course, this isn't some kind of Official Statement(tm)).
So I'm going to go with "pretty low".
Is it more performant, or less?
If it's less performant, why mention it at all?
If it's more performant, why not simply say "better performance per dollar"?
It also would have significant export restrictions: neural-network-related ASICs are very tightly export controlled:
(search for neural network)
My 2c: It would be an expensive waste of time for Google :)
Though certainly, not gonna disagree it would be cool for the sake of progress.
Presumably, there's a whole server behind that address that has all the right drivers and libraries: details you don't need to care about.
The only partial lock-in is that not all ops are supported, and you need to figure out whether any parts of the graph on the critical path will run on the CPU instead. There's a tool for that:
Competitors could launch something similar that uses GPUs tomorrow. Now, if you don't already use TF and don't want to switch, that's another story.
The hardest part in DNNs is the model and data. That's basically platform-independent. My students mix and match TensorFlow and Caffe, for example, on several different models.
The next part is getting the model implemented in a framework (TensorFlow? Caffe? MXNet? PyTorch?). That's work to change, particularly if you're in a production environment. But it's not the same amount of work as collecting data and building a model.
The final part is running training - CPUs, GPUs, TPUs, etc. This is really fungible. The platform-specific optimizations are relatively small here.
Looking at it from a customer perspective:
- Can a trained model be exported (weights included) for use on another platform? (yes; see the sketch after this list)
- Can the code written for training be used on the customer's own hardware? (yes, absent any small tweaks needed for TPU, but they're *small*).
- Might the customer not want to leave because of ease-of-use, particularly at scale, or performance, or total cost of ownership? (yes, and I think that's what the sales pitch is).
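For the first point, here's a minimal TF 1.x-style sketch of why the weights are portable: they land in an ordinary checkpoint that any other TensorFlow setup can restore (the variable name and checkpoint path below are made up):

```python
import tensorflow as tf

# Train anywhere (CPU/GPU/TPU); the checkpoint format is the same.
w = tf.get_variable("w", shape=[3], initializer=tf.zeros_initializer())
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, "/tmp/my_model.ckpt")   # hypothetical path

# ...later, on different hardware or a different cloud:
with tf.Session() as sess:
    saver.restore(sess, "/tmp/my_model.ckpt")
```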
Google is aspiring to be the leader in cloud machine learning. Let's do on-premise too.
"When you upload, submit, store, send or receive content to or through our Services, you give Google (and those we work with) a worldwide license to use, host, store, reproduce, modify, create derivative works (such as those resulting from translations, adaptations or other changes we make so that your content works better with our Services), communicate, publish, publicly perform, publicly display and distribute such content. The rights you grant in this license are for the limited purpose of operating, promoting, and improving our Services, and to develop new ones."
All your training data are belong to us.
We can use your models to improve ours.
The terms will prevent me from using it. I can't grant Google permission to redistribute HIPAA PHI.