Cloud TPUs in Beta (googleblog.com)
248 points by saeta 8 months ago | 129 comments

Disclosure: I work on Google Cloud.

I want to highlight this paragraph from the post:

> Here at Google Cloud, we want to provide customers with the best cloud for every ML workload and will offer a variety of high-performance CPUs (including Intel Skylake) and GPUs (including NVIDIA’s Tesla V100) alongside Cloud TPUs.

We fundamentally want Google Cloud to be the best place to do computing. That includes AI/ML and so you’ll see us both invest in our own hardware, as well as provide the latest CPUs, GPUs, and so on. Don’t take this announcement as “Google is going to start excluding GPUs”, but rather that we’re adding an option that we’ve found internally to be an excellent balance of time-to-trained-model and cost. We’re still happily buying GPUs to offer to our Cloud customers, and as I said elsewhere the V100 is a great chip. All of this competition in hardware is great for folks who want to see ML progress in the years to come.

Any plans to support AMD GPUs and the Radeon Open Compute project? The AI/ML community really needs viable alternatives to NVIDIA; otherwise it will continue to flex its pricing power. Google, via TensorFlow, is in a phenomenal position to promote open source alternatives to the proprietary Deep Learning software ecosystem that we see today with CUDA/cuDNN.

Google would happily accept patches to enable support for it.

AMD hopefully has a team writing such patches now. It makes business sense for them to do so.

Google is getting even more price gouging from Nvidia than the general public, and has even more incentive to level the playing field.

Or the opposite - they're getting nice savings in return for not actively developing or encouraging CUDA/cuDNN alternatives.

Did you guys ever reveal the internal math model of TPU 2?

We know V100 is FP16/FP32 on their tensor cores, when will you follow suit?

Edit: sort of, from https://www.theregister.co.uk/2017/12/14/google_tpu2_specs_i...

"32-bit floating-point precision math units for scalars and vectors, and 32-bit floating-point-precision matrix multiplication units with reduced precision for multipliers."

So what does "reduced" mean exactly?

We still don’t document it exactly, but [1] shows that bfloat16 is supported on lots of ops.

[1] https://cloud.google.com/tpu/docs/tensorflow-ops

That doesn’t prove that the chip operates at 16 bits. For example, we could do 18-bit multipliers (or anything >= 16) and still use 16-bit floats.
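For readers unfamiliar with the format: bfloat16 keeps float32's sign bit and 8 exponent bits but only 7 explicit mantissa bits, so a value can be converted by dropping the low 16 bits of its float32 encoding. The sketch below illustrates that in plain Python. The function names are made up for illustration, truncation stands in for whatever rounding real hardware applies, and nothing here says anything about how wide the TPU's multipliers actually are.

```python
import struct

def to_bfloat16_bits(x: float) -> int:
    """Encode x as float32 and keep only the top 16 bits (truncation, an assumption)."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16

def from_bfloat16_bits(bits16: int) -> float:
    """Expand a 16-bit bfloat16 pattern back into a Python float."""
    (value,) = struct.unpack(">f", struct.pack(">I", (bits16 & 0xFFFF) << 16))
    return value

def roundtrip(x: float) -> float:
    """Show what survives a float32 -> bfloat16 -> float32 round trip."""
    return from_bfloat16_bits(to_bfloat16_bits(x))

if __name__ == "__main__":
    # 1.0 + 2**-8 needs an 8th mantissa bit, so it collapses back to 1.0.
    for x in (1.0, 3.140625, 1.0 + 2.0 ** -8, 1e30):
        print(x, "->", roundtrip(x))
```

The range of float32 is preserved (1e30 still round-trips to a finite value), while precision drops to roughly two or three decimal digits.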

ATI demonstrated FP24 was frickin' awesome over a decade and a half ago. It wouldn't surprise me in the least if you went somewhere like that, but it perplexes me why you think that's secret sauce in any way, long after ATI nearly destroyed NVIDIA with FP24 back in the early days of DirectX 9 and NV3x.

This isn’t exactly correct. ATI pulled a “fast one” and went with 24bit even though the initial DX9 spec called for 16/32 bit floats, which NVIDIA followed.

Once DX9 was split into DX9b and c that “advantage” went away and NVIDIA proved that 16/32 bit was better, something that ATI also had to adopt once MSFT told them enough is enough.

24bit is only better as long as it can do everything 32bit can do and it’s advantageous to build hardware with 24bit FPUs instead of 32bit FPUs that can also do 2x16bit ops per cycle.

Basically, only if the silicon cost allows you to fit far more 24bit FPUs than 32/16bit ones.

And history proved that this isn’t the case.

For gaming eventually even 2:1 FPUs went away since they are costlier than only 32bit FPUs with promotion.

Maybe in the future we’ll have a 24bit FPU that can also do 3 8bit ops or 16bit+8bit op per cycle if it will be more beneficial than the current 2:1 16/32bit model.

I personally would stick to FP32 across the board for my ML efforts, but we have an entire cottage industry of people coming up with approximations to drive up perf and perf/W, all of which will prove irrelevant until Moore's Law runs out IMO. And even then, I'll still stick to FP32 personally. Speaking from direct experience, bulletproof mixed precision is tough.

I don't think it is secret sauce. If you're gonna let customers send operations to these TPUs, one could figure out what kind of multiplier is used almost immediately upon inspection of a few inputs and outputs.

>We fundamentally want Google Cloud to be the best place to do computing.

Lower. Network. Egress. Pricing. By. Two. Orders. Of. Magnitude.

Market rate is close to $1 per TB outbound. Your rate is $80-$120 per TB. That's just embarrassing.

> high-performance CPUs (including Intel Skylake)

Any plans for Ryzen?

We’re always exploring the best hardware for the dollar. We’re a founding member of OpenPOWER and to your question about AMD parts, we’ve previously (publicly) run Opterons when they were the best choice. At this time, we don’t have any announcements to make :).

But I’d like to note that even if we were to use parts internally at Google (or not!), what matters for Cloud is market demand. If there really were enormous customer demand for, say, ARM64, then we would look into it, even if the rest of Google wasn’t interested.

That $6.50/hr rate might be the big deal here. Amazon does offer instances with a V100 GPU (https://aws.amazon.com/ec2/pricing/on-demand/, the P3 instances), but if you're training something like ImageNet, you'll want the biggest image (p3.16xlarge) at $24.48/hr.

Attaching a VM of similar power to a TPU on Google Compute Engine is much cheaper (https://cloud.google.com/compute/pricing, n1-highmem-64, +$3.78/hr to the TPU cost for $10.28/hr total).

Per recent benchmarks for training ImageNet (https://dawn.cs.stanford.edu/benchmark/), training ImageNet on a p3.16xlarge cost $358, when this post claims it'll cost less than $200. (EDIT: never mind; the benchmark uses ImageNet-152, and Google compares TPU performance against ImageNet-50) Interesting.

Back of the envelope, a TPU costs a little more than 2x as much as a Volta on AWS P3, and delivers a little less than 2x the performance (180 TOPs for the TPU, 100 for Volta). On a raw performance/$ metric, I'm not sure the TPU is that interesting.
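The back-of-the-envelope comparison above can be reproduced from figures quoted in this thread. This is only a sketch: the per-V100 price is an assumption (the p3.16xlarge hourly price split evenly across its 8 GPUs), the TOPS values are vendor peak numbers, and the cost of the VM attached to the TPU is ignored.

```python
# Figures quoted in the thread. The per-V100 price is an assumption:
# the p3.16xlarge on-demand price divided evenly across its 8 GPUs.
TPU_PRICE_PER_HOUR = 6.50
TPU_PEAK_TOPS = 180.0
V100_PRICE_PER_HOUR = 24.48 / 8
V100_PEAK_TOPS = 100.0

def peak_tops_per_dollar(peak_tops: float, price_per_hour: float) -> float:
    """Peak TOPS bought per dollar-hour (a vendor peak figure, not a benchmark)."""
    return peak_tops / price_per_hour

price_ratio = TPU_PRICE_PER_HOUR / V100_PRICE_PER_HOUR
perf_ratio = TPU_PEAK_TOPS / V100_PEAK_TOPS
value_ratio = (peak_tops_per_dollar(TPU_PEAK_TOPS, TPU_PRICE_PER_HOUR)
               / peak_tops_per_dollar(V100_PEAK_TOPS, V100_PRICE_PER_HOUR))

print(f"price ratio (TPU / V100):        {price_ratio:.2f}x")
print(f"peak perf ratio (TPU / V100):    {perf_ratio:.2f}x")
print(f"peak TOPS per $ (TPU / V100):    {value_ratio:.2f}x")
```

Under these assumptions the TPU comes out slightly behind on peak TOPS per dollar, which matches the comment; it says nothing about achieved throughput on a real model.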

It might be worth it if I were willing to pay a huge amount to get back results from an experiment faster, by using lots of TPUs- distributed learning on GPUs doesn't seem easy yet.

Disclosure: I work on Google Cloud.

Peak ops/second isn’t the only thing that matters though. You have to be able to feed the units. The V100 does lots of finer-grained matrix multiplies which can make it harder to keep up.

Don’t get me wrong, the V100 is a great chip. And we’re all looking forward to more (preferably third-party) benchmark results, to tease out when one is the better choice for a workload. But don’t just compare ops/second or any other architectural number.

This makes no sense, the V100 has more memory bandwidth than both the TPU and TPUv2

V100 has 900 GB/s memory bandwidth [0].

TPUv2 has 600 GB/s per chip x 4 chips, so 2400 GB/s [1].

As we've discussed elsewhere [2], comparing TPUv2 to V100 on a per chip basis doesn't make much sense. Who cares how many chips are on the board? If Google announced tomorrow that TPUv3 is coming out, which is identical to TPUv2 but the four chips are glued together, nobody would care.

The questions that we should instead be asking are, how fast can I train my model and how much does it cost?

Per elsewhere in thread [3], on Volta you have 900 GB/s per 100 Tops/s = 0.009 bytes per op, whereas on TPUv2 you have 2400 GB/s memory bandwidth over 180 Tops/s = 0.0133 bytes per op. This means that TPUv2's memory-bandwidth-to-compute ratio is 0.0133/0.009 = roughly 1.5x higher than Volta's.

We can do a similar comparison for memory available. V100 has 16 GB per 100 Tops, TPUv2 has 64 GB per 180 Tops. So the memory-to-compute ratio for Volta is 16G/100T = .16 milli while for TPUv2 it's 64G/180T = .36 milli, for a ratio of .36/.16 = 2.25x higher on TPUv2.
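For anyone who wants to redo the "shape" arithmetic, here is a minimal sketch using only the peak figures cited in this thread. The dictionary layout and function names are illustrative, and the TPUv2 entry is the whole four-chip board.

```python
GB = 1e9      # bytes
TERA = 1e12   # ops per second

# Peak figures as quoted in the thread (vendor numbers, not measurements).
v100 = {"bandwidth": 900 * GB, "memory": 16 * GB, "peak_ops": 100 * TERA}
tpuv2_board = {"bandwidth": 2400 * GB, "memory": 64 * GB, "peak_ops": 180 * TERA}

def bytes_per_op(chip: dict) -> float:
    """Memory bandwidth available per peak operation."""
    return chip["bandwidth"] / chip["peak_ops"]

def memory_per_op_rate(chip: dict) -> float:
    """Memory capacity per unit of peak compute (bytes per op/s)."""
    return chip["memory"] / chip["peak_ops"]

bandwidth_advantage = bytes_per_op(tpuv2_board) / bytes_per_op(v100)
memory_advantage = memory_per_op_rate(tpuv2_board) / memory_per_op_rate(v100)

print(f"bandwidth-to-compute, TPUv2 vs V100: {bandwidth_advantage:.2f}x")
print(f"memory-to-compute, TPUv2 vs V100:    {memory_advantage:.2f}x")
```

Because both are ratios, multiplying the number of chips on either side leaves them unchanged. Without rounding the intermediate values, the memory ratio comes out nearer 2.2x than 2.25x.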

Does any of this matter? Does it translate into faster and/or cheaper training? Do models actually need and benefit from this additional memory and memory bandwidth? My guess from working on GPUs is yes, at least insofar as bandwidth is concerned, but it's just a guess. I'm excited to find out for real.

(Disclaimer: I work at Google on XLA, and used to work on TPUs.)

[0] https://images.nvidia.com/content/technologies/volta/pdf/437... [1] https://supercomputersfordl2017.github.io/Presentations/Imag... [2] https://news.ycombinator.com/item?id=16360212 [3] https://news.ycombinator.com/item?id=16359531

I responded to your other comment to disagree, and I'll do so again here.

Nobody is comparing DGX1-V to a single TPUv2 chip, because it doesn't make any sense to do so. They are totally different kinds of machines. But for some reason everyone is comparing a cluster of 4 TPUv2 chips to a single V100 chip.

It only makes sense to compare 4xTPUv2 to 1xV100 if they are equivalent in some meaningful metric, like total die size, power, etc.

In lieu of any available data, I'm going to continue to assume that each TPUv2 chip is roughly comparable in terms of power & die size to each V100 chip. If this was grossly wrong, I would expect that all four would be condensed into a single chip, which would dramatically increase the performance of the interconnects.

We could resolve this rapidly if there were any data available about die size, TDP, anything of TPUv2.

> But for some reason everyone is comparing a cluster of 4 TPUv2 chips to a single V100 chip.

I agree that some people are doing that. Marketing, I suppose. But that comparison is explicitly not the point of my parent post. I'm comparing the "shapes" of the chips -- specifically, the compute/memory and compute/memory-bandwidth ratios. These ratios stay the same regardless of whether you multiply the chips by 4 or by 400.

The point I was trying to make is that V100 has a higher peak-compute-to-memory(-bandwidth) ratio than TPUv2. This much seems clear from the arithmetic. Whether this matters in practice, I don't know, but I think it is relevant if one believes (as I do, based on the evidence I have as an author of an ML compiler targeting the V100) that the V100 is starved for memory bandwidth.

> In lieu of any available data, I'm going to continue to assume that each TPUv2 chip is roughly comparable in terms of power & die size to each V100 chip. If this was grossly wrong, I would expect that all four would be condensed into a single chip, which would dramatically increase the performance of the interconnects.

I'm sure Google's hardware engineers operate under a lot of constraints that I'm not aware of; I'm not about to make assumptions. But more to the point, as we've said, things like die size and TDP don't directly affect consumers. The questions we have to ask are, how fast can you train your model, and at what cost?

Just as you don't like it when people (incorrectly, I agree) insist on comparing one V100 to four TPUs, because that's totally arbitrary (why not compare one V100 to 128 TPUs?), I don't like it when people insist on comparing TPUv2 to V100 on arbitrary metrics like die size, or peak flops/chip, or whatever. So I disagree that we could resolve anything if we had more info about the TPUv2 chip itself. None of that matters.

Well, if you ignore power consumption because "it doesn't matter to the end user", you're talking about economic comparisons, not technical comparisons.

BTW, I absolutely agree that memory bandwidth is the bottleneck. I've built my company around that assertion, and the data for that exists (Mitra's publications come to mind).

Alright, I understand better now what you are saying. I'm eager to see some benchmarks that can answer those meaningful questions.

Thank you for your courteous reply.

We mostly focus on the “whole board” numbers. So it’s not only units <=> “local” HBM, but NVLINK versus TPU to TPU. Sorry for the confusion.

Edit for this part of the thread: the best public numbers are in the linked presentation [1].

[1] https://supercomputersfordl2017.github.io/Presentations/Imag...

That's... a skewed ... comparison, NVLINK is a board to board connection whereas you're talking about TPU to TPU on board communication if I understand correctly?

That's sort of the point though! We're actually selling these as the "board". So the right way to compare things is sort of DGX-1 style "deep learning rig" versus a board of four TPU units (or several connected). The on-chip network is a big part of its overall efficiency.

I don't recall what (if anything) we've said about how we link up the boards across racks, but the folks at Next Platform looked pretty carefully at the pictures: https://www.nextplatform.com/2017/05/22/hood-googles-tpu2-ma...

It's not the point though. You're comparing whole-board TPU FLOPs (4x 45) but then comparing TPU chip-to-chip communication with NVIDIA board-to-board communication.

You can't have your cake and eat it too.

Yes, when training DNNs memory bandwidth is the only figure you need to look at. That's why the 1080Ti is by far and away the best bang for buck right now (ignore the EULA nonsense). It has about 55% of the memory b/w of the V100 for 10% of the price.

I know people don't know what to expect from TPU performance, but does anyone actually get 100 TOPS out of Volta? I thought you'd have to spin the tensor cores and never touch memory, which is...not realistic.

I know you hedged by saying "back of the envelope", but I'd much rather compare on real benchmarks than based on cited peak performance numbers, which are kind of meaningless.

This is true of the TPU as well, check out their paper's utilization numbers. If you ignore one outlier at ~90% utilization, their utilization plummets. I'm glad people are finally looking past the b.s. "peak" numbers for once though.

Has Google published data on the memory bandwidth of TPU v2 (aka "cloud TPU")? I'm having trouble finding it.

In any case I agree, we shouldn't be looking at the stated peak compute of either of the chips.

(Disclaimer: I work at Google on XLA, and have in the past worked on TPUs.)

The blog post links to the fairly recent NIPS presentation: https://supercomputersfordl2017.github.io/Presentations/Imag...

which claims 2400 GB/s for the board and 600 GB/s per “chip”.

This is in comparison to 900 GB/s for V100.

It'll make a lot more sense when the TPU pods they alluded to come out.

Why does it make any sense to compare the price/hour for a single TPU (4 ASICs) to the price/hour for p3.16xlarge, which has 8x V100?

Also, that benchmark cost of $358 is for Resnet-152, not Resnet-50.

Whoops, I misread, added edit.

Disclosure: I work on Google Cloud.

Note that the post says “less than $200” not $200. There are lots of values between 0 and 200. What we’d love is for third-party folks like yourself to do the comparison (which I know you can, Max!)

P3.16x benchmark is ResNet-152, TPU cost of $200 was for ResNet-50.

Tensorflow benchmarks show ResNet-152 resulting in 2.4x lower throughput than ResNet-50. [0]

[0] https://www.tensorflow.org/performance/benchmarks
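To make the two figures roughly comparable, one could scale the ResNet-152 cost by the throughput gap. This is a sketch under a strong assumption: that training cost scales inversely with throughput and that both models train for the same number of epochs on the same hardware. The variable names are illustrative.

```python
# Figures quoted in the thread.
P3_16XLARGE_RESNET152_COST = 358.0   # dollars, from the DAWNBench entry cited above
RESNET152_SLOWDOWN = 2.4             # ResNet-152 throughput is 2.4x lower than ResNet-50

def estimated_resnet50_cost(resnet152_cost: float, slowdown: float) -> float:
    """Naive estimate: same epochs, cost inversely proportional to throughput."""
    return resnet152_cost / slowdown

estimate = estimated_resnet50_cost(P3_16XLARGE_RESNET152_COST, RESNET152_SLOWDOWN)
print(f"Estimated ResNet-50 cost on p3.16xlarge: ${estimate:.0f}")
```

This is an extrapolation, not a measurement; a real ResNet-50 run on the same instance would be the only reliable comparison.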

Just a minor nitpick that ResNet is the model you are referring to, ImageNet is the dataset that ResNet is trained on.

A better comparison would be the f1.16xlarge[1] instance @ ~$4/hr. It comes with 8 FPGAs (12 Gbps link) and 64 vCPUs.

[1]: https://aws.amazon.com/ec2/instance-types/f1/

Edit: I'm genuinely curious about why this comment is getting downvotes.

Disclosure: I work on Google Cloud.

I didn’t downvote you, but presumably people disagree with “Here’s an FPGA” as comparable to being given a working piece of hardware. That is, would you have said that the best comparison to a V100 is this same FPGA box?

I (and others) get what you were trying to say: TPUs are ASICs that aren’t general purpose at all, so an FPGA is a better comparison than a more general purpose GPU. As an end user, that just isn’t true though. If someone hands you an f1.16xlarge, you have to build your own pseudo-chip for machine learning. While with this offering, TensorFlow handles the acceleration / offload for you.

Fair enough. I wonder if there are any model compilers targeting FPGA training backends...

Do much deep learning training with the FPGAs?

Nope, but I'm sure others do.

Microsoft uses FPGAs for Deep Learning. Source:


... for inference. I don't know of anyone who takes training on FPGAs seriously. They tend to get crushed by GPU/TPU/other ASIC in throughput, perf/watt, and perf/$.

Training on FPGAs starts to make a lot of sense when you consider low bit precision computation (e.g. DoReFa-Net).

This is exciting. There are lots of specific reasons to choose Google Cloud over AWS (and vice versa), but proprietary hardware is surely an advantage that is going to be hard to replicate / compete with. If TPUs hold up to the hype, GCloud may become the de facto for ML/AI startups.

Having had the chance to attend a fireside chat with leadership from Google and SAP, I get the sense that the hype is likely to hold up. There are a lot of big bets happening in the Enterprise space around this notion of efficient, easy to implement ML.

Can you describe a line of business function that makes novel use of ML?

I don't know what qualifies as novel for you but some use cases I've seen:

On the retail side: Using computer vision to deliver alerts about shelf condition.

For farming: Using computer vision + ML to devise and track health monitoring for crops.

For manufacturing: Predictive maintenance of equipment has been a very popular area of focus.

There have been countless use cases on the finance side of things. For instance, anomaly detection techniques help with reconciling accounts and detecting fraud.

The energy industry seems to never run out of use cases for tracking commodities and/or helping predict load.

In HR, predicting turnover and education demands are some of the early use cases being approached but I expect a lot more over time.

Logistics is another area that will have a seemingly endless supply of use cases. Things like loss tracking, warehouse optimization, raw material allocation and sourcing. I don't think I've ever been involved in a logistics/manufacturing project that couldn't have used some ML to add efficiency to the process.

I am curious if DL really can deliver good results in such spaces.

We all see success stories for very refined, well-defined problems with huge amounts of training data and models created by the top 1% of engineers, but for the average business such conditions may not be achievable. To train a model to recognize various shelf conditions in different situations, buildings, etc., you need a nontrivial set of training data, and you will have unclear expectations about model performance.

Most businesses will probably not develop and train their own systems, but rather implement solutions developed by the folks with the expertise and training data.

From a media standpoint: frontline comment moderation. It would take a lot of the legwork out of filtering for advertisements, uncivil discussion, attacks, off topic posts, and trolling.

I believe NYT does this already, but using minimal oversight to prevent any edge case misses or false positives.

Presently there’s not much in the way of suitable options for large media that build their modules in house. At the same time media tends to prefer to not invest too heavily in hardware if they don’t have to. Convincing leadership of using a cloud service to train an AI/ML model sounds leaner and lets them tick off even more buzzwords for the executive, etc. That said, results from efforts in the aforementioned application sound promising.

For those who want to read more: https://www.nytimes.com/2017/06/13/insider/have-a-comment-le... [not particularly technical, but given the GP seemed to be skeptical about real world use I think this is still appropriate]

Thanks! Coming from a company that isn't currently implementing anything like this (you'll find many do not as of yet), it would help a great deal to improve the quality of the content, which is an obvious precursor to ad impressions and subscriptions, especially for media companies that do not introduce [hard/any] paywalls.

If there is one thing we should hope for, it's that the next generation of deep learning processors will NOT be owned by Google or NVIDIA.

>If TPUs hold up to the hype, GCloud may become the de facto for ML/AI startups.

Don't startups want to win a big exit though? Google won't need to buy the startup for billions, because the TOS already grants them permission to use all the models and training data for free. Seems like a Faustian bargain to me.

Cloud TPU product manager here. As I said in another thread:

The TOS you are quoting only refers to the information you provide in the survey. Here are the Google Cloud TOS: https://cloud.google.com/terms/ if you're interested in what Cloud does with customers' data.

5.2 Use of Customer Data. Google will not access or use Customer Data, except as necessary to provide the Services to Customer.

Your training data and models are secure.

It's worth noting that Google Cloud has its own terms of service that is very different from what you may be thinking of: https://cloud.google.com/terms/

Also, even Google's general consumer terms of service really aren't what you think: https://www.google.com/policies/terms/

"You retain ownership of any intellectual property rights that you hold in that content. In short, what belongs to you stays yours."

Regardless of the TOS saying that or not (I haven't read them), I can think of at least two reasons why your statement doesn't hold:

1) AI startups usually don't have a lot of value to potential acquirers based on their data, but based on other things (e.g., talent, customers, business model, platform, brand). That's like saying you shouldn't use AWS because Amazon can just steal and commercialize all your data.

2) There are other companies than Google that acquire startups

Having said that, I highly doubt that Google can just use all the training data on GCloud to launch their own products. They can surely look at it and maybe do stuff with them internally, but I am pretty sure that they can't use them commercially.

>They can surely look at it and maybe do stuff with them internally, but I am pretty sure that they can't use them commercially.

How would you ever know if they did? People who worked at Google have been accused, by Google, of stealing the entire self driving car program and taking it to a competitor.

That is just not true, the suit was about LIDAR.

It's also vastly different. Of course someone working at Google on a project has access to that project. It doesn't mean they have access to your stuff.

First of all, that's wrong (as another comment pointed out). Of course, the probability of them stealing your stuff is non-zero, but it's very rare. Even if you use all your own hardware and software, people can still steal your stuff :-)

I can be hacked by malware which can leak secrets from air gapped, Faraday caged machines. Therefore, I should put my billion dollar idea on the public cloud and just trust Google.

I shiggy diggy.

Because you do not stay in business if you operate in such a manner. Plus it is not good from an employee standpoint in retaining. Most people prefer to conduct themselves in an ethical manner.

Hard to get employees to not steal from you if you are stealing from your customers.

Where's this TOS you are speaking of?

Ha! No Google does not get the models and data.

Interestingly, GCP now appears to be available to individuals in Europe. It wasn't like that before, no idea when that policy got changed. Before, GCP wasn't even a consideration compared to AWS (which always handled that).

"You can’t change the tax status of your Google Cloud Platform billing account."

I think this is what tripped me up before. I closed my business years ago but it was completely impossible to get Google to fix this. Now it fixed it "by itself".

Just a warning to everyone before signing up with your main Google account :-)

Some things:

A "single TPU" is 4 ASICs. It is not clear if it makes sense to compare a "single TPU" to a "single GPU."

As a point of reference, NVIDIA's numbers are 6 hours for Resnet-50 on Imagenet when training with 8xV100. From a naive extrapolation, 4xV100 would probably take ~12 hours and 1xV100 about two days.
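The "naive extrapolation" above is just constant GPU-hours divided across fewer GPUs. A minimal sketch, assuming perfectly linear scaling (which real multi-GPU training rarely achieves); the function name is illustrative.

```python
# Reference point quoted above: 6 hours for ResNet-50 on ImageNet with 8x V100.
REFERENCE_GPUS = 8
REFERENCE_HOURS = 6.0

def naive_training_hours(num_gpus: int) -> float:
    """Assume total GPU-hours stay constant, i.e. perfectly linear scaling."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return REFERENCE_HOURS * REFERENCE_GPUS / num_gpus

for n in (8, 4, 1):
    hours = naive_training_hours(n)
    print(f"{n}x V100: {hours:.0f} hours ({hours / 24:.2f} days)")
```

Since scaling efficiency usually drops as GPUs are added, a single V100 might in practice finish somewhat sooner than this estimate suggests.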

Google has previously only compared TPUs to K80, so it will be interesting to see some benchmarks that compare TPUs to more recent GPUs. K80 was released in 2014, and the Kepler architecture was introduced in 2012.

> A "single TPU" is 4 ASICs. It is not clear if it makes sense to compare a "single TPU" to a "single GPU."

Why does the number of chips matter?

Put another way, suppose Google tomorrow announced Cloud TPU v3 which was one ASIC identical in all ways to four v2 ASICs glued together. Would that be notable in any way? Seems like it would be a nop to me.

I think what matters is, how fast can you train a model, and at what cost? Doesn't really matter if it's one chip or 10,000 behind the scenes.

It doesn't matter in the ways you are considering. The ultimate comparisons are going to be time, cost, and power to complete some benchmark, just as you say.

I only mention the number of chips because loads of people are comparing the "single TPU" to a single V100 with the assumption that it is meaningful. I don't know the TDP, die size, etc. of the TPUv2 chip, so it may well make more sense for ballpark comparisons to compare "single TPU" to 4xV100.

For example, a "single TPU" has 64 GB of memory, whereas a "single GPU" has 16 GB (V100). Is this meaningful? I don't know.

It just seems like something worth noting. I could buy a DGX1-V with 8xV100, rebrand it as the TWTW TPU, and then go around and tell everyone how my TPU is 8x faster than GPUs. It appears that everyone is normalizing by marketing unit until benchmarks come out, which is potentially flawed.

It matters when defining parallel work distribution. Unless memory bandwidth is homogeneous across the whole board (i.e. each TPU on a board gets 600 GB/s to its peers), we can't do model parallelism across ASICs efficiently, and must fall back to data parallelism. Which is fine, until you run into limits on maximum batch size (e.g. up to 8192, as FAIR was able to manage [1] with some tweaks to SGD).

[1] https://arxiv.org/abs/1706.02677
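The batch-size ceiling translates directly into a ceiling on data-parallel workers. A rough sketch, where the per-device batch size of 32 is a hypothetical choice made for illustration and the 8192 limit is the figure mentioned above:

```python
def max_data_parallel_devices(max_global_batch: int, per_device_batch: int) -> int:
    """In synchronous data parallelism, global batch = per-device batch * devices.

    Returns the largest device count that keeps the global batch within the limit.
    """
    if per_device_batch <= 0:
        raise ValueError("per_device_batch must be positive")
    return max_global_batch // per_device_batch

MAX_GLOBAL_BATCH = 8192    # limit cited in the comment above
PER_DEVICE_BATCH = 32      # hypothetical per-device batch size

print(max_data_parallel_devices(MAX_GLOBAL_BATCH, PER_DEVICE_BATCH))
```

Shrinking the per-device batch raises the device count, but usually at the cost of lower utilization per device, which is why the cap matters for scaling.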

The comparison was with the first-generation TPUs, not the second generation, which is what these are.

But ultimately it comes down to the cost to complete some amount of work. Google also offers Nvidia GPUs in their cloud for training and should be able to compare the cost of using one over the other as both are supported by TF.

That is the ultimate guide on how good or not good the TPUs really are.

On 4x1080Ti it takes 2 days to train ResNet-50. 4xASICs doing it in a day is not that impressive.

That may be Google Cloud's competitive edge for AI startups, both in terms of development cycle and cost efficiency.

Hard to replicate by competitors: AWS and Azure.

How does this compare to Nvidia GPUs on AWS price/perf-wise?

The article makes it sound like this is a new thing...

Google claims[0] the TPU is many times faster for the workloads they've designed it for.

> On our production AI workloads that utilize neural network inference, the TPU is 15x to 30x faster than contemporary GPUs and CPUs.

As far as I know this will be the first opportunity for the public to prove those claims, as until now they've not been available on GCP. I don't mean to sound skeptical–I'm quite confident they're not exaggerating.

[0]: https://cloudplatform.googleblog.com/2017/04/quantifying-the...

Keep in mind that what you linked refers to TPUv1, which is built for quantized 8-bit inference. The TPUv2, which was announced in this blog post, is for general purpose training and uses 32-bit weights, activations, and gradients.

It will have very different performance characteristics.

Thanks for pointing that out!

The reserve TPU button has been available on the dashboard for the last few months. But I assume instances have been prioritized for large customers such as Two Sigma.

From the paper:

"Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU."

In-Datacenter Performance Analysis of a Tensor Processing Unit


Price is about 5x cloud nvidia gpu instance on an hourly basis.

It will be interesting to see some benchmarks that compare TPUs to V100, since all previously published comparisons from Google compare TPU to K80 (3 GPU architectures ago).

Perf per watt matters to Google but not you. You should only think of it on a perf/$ basis, right?

They're closely related though, since if the perf per watt is higher then Google can charge you fewer dollars per perf. The price they charge you is ultimately tied to the operating cost.

I wonder how these would compare with Amazon's FPGA instances with a comparable core running.

I would imagine that (by design) they're not directly comparable.

I suspect that we'll see more information about the ASICs over time, but it'll take time to really understand their characteristics vs a Nvidia GPU - which are at least right now a bit better understood.

It is. TPUs perform calculations on weights using low-precision floating point and integer types. This saves a ton of computation, but doesn't matter much for training models.

But GPU is also able to use lower resolution types. There must be more to the TPU advantage.

GPUs are much more complex (general-purpose) and therefore cannot be optimized beyond a certain point due to timing requirements and PVT (process, voltage, temperature) variations. In other words, the more stuff you have on an ASIC, the more careful you have to be to ensure a margin of tolerance for variations.

So the only advantage of the TPU is it's a simpler and more specialized asic? Google didn't break any new ground in terms of training perf?

So, a way to think of this is: The speed (and therefore, cost) of training a machine learning model depends on (a) the ML techniques (how rapidly the model converges and to what accuracy); and (b) how quickly the processor executes the operations involved in the ML techniques.

The TPU is only an improvement in (b). It's not going to result in a big-O style speedup, because the same training algorithms and architectures will run on it that we run on CPUs & GPUs today.

I'm not sure what counts as "breaking new ground" - is that 10%? 100%? 1000%? :-) The things to watch out for in benchmarks will be:

(a) Perf/$. This is actually a big deal - one of my students recently blew through $5000 of Google Cloud credits running Imagenet experiments, in a week. And we didn't finish them! As this cost really drops, it enables things like Neural Architecture Search, which uses tons of compute capability to explore architectural variants automatically.

(b) Absolute perf.

(c) Performance scaling. To what degree will the fast, 2D toroidal mesh allow a full pod of Cloud TPUs to scale nearly-linearly? Absolute training times matter from a user productivity standpoint. Waiting 30 minutes for a result is very different from waiting 12 hours (you can do one of these while you sneak out to go running! :-).

The NIPS'17 slides have more technical context for some of this: https://supercomputersfordl2017.github.io/Presentations/Imag...

> So the only advantage of the TPU is it's a simpler and more specialized asic

And everything that entails: lower energy consumption, higher throughput, lower cost at volume, higher profits for GCP, etc.

> Google didn't break any new ground in terms of training perf?

Relative to GPUs, sure, but I can't say how well they stack up against other custom ASICs for DL applications.

This is a new thing. Google also has Nvidia GPUs. These are new custom ASICs Google has designed for certain ML tasks.

This seems a bit pricey compared to other offerings. Wouldn't an ASIC make things more economical?

Seems like in terms of cost per performance, both AWS P3 spot instances and Paperspace v100 offerings are more economical.

Are these prices expected to become more competitive once it is out of beta?

isn't the tpu kind of a deep learning asic?

Is this just go-faster-juice for Tensorflow code or does it have other implications? If you train on TPUs can you still run the model efficiently elsewhere?

I assume Azure and AWS are buddying up with Intel/Nervana and Nvidia on some counterstroke to Google's TPUs. I can’t quite imagine what it will be though.

Amazon announced today they are working on their own TPU type chips.

Do you have a link for that?

"Amazon is reportedly following Apple and Google by designing custom AI chips for Alexa"


What are the chances of TensorFlow code gradually optimizing for TPUs over GPUs?!

(Yes TF is OSS, but realistically Google is putting much more resources into it)

Very low. A lot of the performance on GPUs comes from Nvidia's optimizations in cuDNN -- it's mostly a matter of making sure TensorFlow feeds the right formats/etc. to cuDNN for core NN ops. TF should run well on CPUs, GPUs, TPUs, and likely future embedded accelerators (via TensorFlow Lite, which already supports the Android Neural Networks API).

(I'm part time on Brain, but, of course, this isn't some kind of Official Statement(tm)).

TF funds one of my teams explicitly just to optimize CPUs and GPUs. Every discussion I've had with them tells me they care about making customers succeed, period.

So I'm going to go with "pretty low".

There is already preliminary support for TPU devices in the TF API.

Is there any way to use these for applications other than tensorflow/machine learning?

I'm puzzled by the phrase "differentiated performance per dollar."

Is it more performant, or less?

If it's less performant, why mention it at all?

If it's more performant, why not simply say "better performance per dollar"?

It is both more performant overall as well as per dollar.

We really need a standard easy to run benchmark.

When is off the shelf edition coming?

If there ever was a chance for a hardware start-up to become the next big thing, it's by entering this space. Unfortunately Nervana sold out to Intel.

My guess: Never

I hope that's not true, for the sake of progress. Today's clouds wouldn't have happened if AMD and Intel had restricted cloud use of their processors.

Among other things, it would be expensive (in a ton of ways), a digression, and would require providing direct end user support in a way they aren't good at.

It also would have significant export restrictions: Neural network related asics are very tightly export controlled:


(search for neural network)

My 2c: It would be an expensive waste of time for Google :)

Though certainly, not gonna disagree it would be cool for the sake of progress.

Has anyone thought about cryptocurrency mining?


I don't like it. Google is mixing too many things. No way to buy a TPU. No competition from other cloud providers. Proprietary hardware and vendor lock-in.

This is really Tensorflow as a service. You get an IP address and a port you send gRPC requests to:


Presumably, there's a whole server behind that address that has all the right drivers and libraries: details you don't need to care about.

The only partial lock-in is that not all ops are supported, and you need to figure out whether any parts of the graph on the critical path will run on the CPU instead. There's a tool for that:


Competitors could launch something similar that uses GPUs tomorrow. Now, if you don't already use TF and don't want to switch, that's another story.

That's my point. Competitors are largely moated out by high costs of TPU production and proprietary drivers.

Amazon is reportedly looking into building their own. Nvidia not only added Tensor cores to the Volta series (impressively quickly, might I add), but they're also creating the NVIDIA GPU Cloud. Intel has been acquiring DNN hardware startups left and right (Nervana, Movidius, Mobileye) and trying to roll those into their production series.

The hardest part in DNNs is the model and data. That's basically platform-independent. My students mix and match TensorFlow and Caffe, for example, on several different models.

The next part is getting the model implemented in a framework (TensorFlow? Caffe? MXNet? PyTorch?). That's work to change, particularly if you're in a production environment. But it's not the same amount of work as collecting data and building a model.

The final part is running training - CPUs, GPUs, TPUs, etc. This is really fungible. The platform-specific optimizations are relatively small here.

Looking at it from a customer perspective:

  - Can a trained model be exported (weights included) for use on another platform?  (yes)
  - Can the code written for training be used on the customer's own hardware?  (yes, absent any small tweaks needed for TPU, but they're *small*).
  - Might the customer not want to leave because of ease-of-use, particularly at scale, or performance, or total cost of ownership?  (yes, and I think that's what the sales pitch is).

(disclaimer: I worked on part of the Cloud TPU stuff. I'm funded academically by Intel. I have friends at NVidia and own a lot of their GPUs. I love everyone. :)

Why are proprietary drivers a blocker? As long as you expose the same gRPC interface, your customers don't need to know what happens behind the scenes. You could have an FPGA or a Beowulf cluster of Raspberry Pis hiding.

I should clarify. I like all the individual pieces (hardware, cloud services, gRPC interface); I just wish you could opt into them independently.

Someone at Dell/HPE headquarters: when can we start selling "Integrated TPU" machines? ;)

Google aspires to be the leader in cloud machine learning. Let's do on-premise.

Reading the TOS it seems like this is a really great deal for Google:

"When you upload, submit, store, send or receive content to or through our Services, you give Google (and those we work with) a worldwide license to use, host, store, reproduce, modify, create derivative works (such as those resulting from translations, adaptations or other changes we make so that your content works better with our Services), communicate, publish, publicly perform, publicly display and distribute such content. The rights you grant in this license are for the limited purpose of operating, promoting, and improving our Services, and to develop new ones."

All your training data are belong to us.

We can use your models to improve ours.

The terms will prevent me from using it. I can't grant Google permission to redistribute HIPAA PHI.

Cloud TPU product manager here.

The TOS you are quoting only refers to the information you provide in the survey. Here are the Google Cloud TOS: https://cloud.google.com/terms/ if you're interested in what Cloud does with customers data.

5.2 Use of Customer Data. Google will not access or use Customer Data, except as necessary to provide the Services to Customer.

Your training data and models are secure.

This URL isn't on the TPU beta signup page. The Google TOS is. Perhaps you can see the confusion? I would be reluctant to trust a random 37-karma guy on a Hacker News message board on this particularly important consideration.
