Hacker News
Who uses Google TPUs for inference in production?
116 points by arthurdelerue on March 11, 2024 | 48 comments
I am really puzzled by TPUs. I've been reading everywhere that TPUs are powerful and a great alternative to NVIDIA.

I have been playing with TPUs for a couple of months now, and to be honest I don't understand how people can use them in production for inference:

- almost no resources online showing how to run modern generative models like Mistral, Yi 34B, etc. on TPUs

- poor compatibility between JAX and PyTorch

- very hard to understand the memory consumption of the TPU chips (no nvidia-smi equivalent)

- rotating IP addresses on TPU VMs

- almost impossible to get my hands on a TPU v5

Is it only me? Or did I miss something?

I totally understand that TPUs can be useful for training though.




We've previously tried and almost always regretted the decision. I think the tech stack needs another 12-18 months to mature (it doesn't help that almost all work outside Google is being done in PyTorch).


> I think the tech stack needs another 12-18 months to mature

Google has been doing AI before any other company even thought about it. They are on the 6th generation of TPU hardware.

I don't think there is any maturity issue, just an availability issue because they are all being used internally.


100% agree. If you have access to the TPU team internally, it's very easy to use in production.

If you aren't internal, the documentation, support, and even just general bug fixing is impossible to get.


(Has an expert team dedicated solely to optimizing for exotic hardware) = an option

(Doesn't have a team like that) = stick to mass-use, commodity hardware

That's generally been the trade-off since ~1970. And usually, the performance isn't worth the people-salaries.

How many examples are there of successful hardware that isn't well-documented and doesn't have drop-in 1:1 SDK coverage compared to the more popular solution?

It seems like a heavy-lift to even get something that does have parity in those ways adopted, given you're fighting market inertia.


Google sells access to TPUs in its cloud platform, so you'd think they would be more open about sharing development and tooling frameworks for TPUs. It's like Borg (closed source, never used outside Google, made them no profit) vs. Kubernetes (open source, used everywhere, makes them profit).


> Google has been doing AI before any other company even thought about it

This is not even remotely true. SRI was working on AI in various forms long before Google existed.


Who or what is SRI?


next to NASA, probably the most innovative organization in human history

https://www.sri.com/timeline-of-innovation/



I feel like I have been hearing that since the v1 TPU. I think TPUs are a perfect fit inside Google because there are teams whose job is to take a model and TPUify it. Elsewhere there is no such team, so it's no fun.


I agree with that, and I'm not sure they'll be able to improve the stack dramatically by themselves without the open-source community being more involved.


They aren’t really an alternative to anything. For one thing they’re now often slower on per-accelerator basis than NVIDIA stuff. They’re cheaper, of course, but because of disparity in performance you’ll need to estimate cost per flop on your own particular workload. They are also more difficult and slower to develop against, and SWE cost is always an issue if you don’t own a money printer like Google. Furthermore, for advanced users who can do their own CUDA kernels or Triton, that too can unlock additional efficiency from GPU. Such capability can’t even be contemplated on the TPU side because you basically get a black box. Then there’s the issue of limited capacity, further exacerbated by the fact that this capacity is provided by a single supplier who is struggling to fulfill its internal needs (which is why you can’t get v5). You can’t just get TPUs elsewhere. You can’t get them under your desk for dev work either.

That said, it wouldn’t be too difficult to port most models to Jax, load in the existing weights, and export the result for serving. Should you bother? IMO, no, unless we’re talking really large scale inference. Your time and money are almost certainly better spent iterating on the models.
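
As a minimal sketch of what "load in the existing weights" means, assuming a single PyTorch nn.Linear and an equivalent flax.linen.Dense (the layer and shapes are purely illustrative, not any particular model):

    import jax.numpy as jnp
    import flax.linen as nn
    import torch

    # Toy "checkpoint": one PyTorch Linear layer standing in for a real model.
    torch_layer = torch.nn.Linear(4, 2)
    state = torch_layer.state_dict()

    # Flax Dense stores its kernel as (in_features, out_features),
    # so the PyTorch (out, in) weight gets transposed on the way over.
    params = {"params": {
        "kernel": jnp.asarray(state["weight"].detach().numpy().T),
        "bias": jnp.asarray(state["bias"].detach().numpy()),
    }}

    y = nn.Dense(features=2).apply(params, jnp.ones((1, 4)))
    print(y.shape)  # (1, 2)

A real port is mostly this, repeated across the whole state dict, plus checking the numerics layer by layer.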


I agree, except about this statement: "it wouldn’t be too difficult to port most models to Jax"

--> We tried such ports at https://kwatch.io (the company I work for), and it turned out to be much harder than expected (at least for us). I don't think many people are capable of porting an LLM from PyTorch + GPU to JAX + TPU.


Well, I should have said “it wouldn’t be too difficult for me” then. I keep forgetting why I get paid so much.


I would love for you to expound. I found it interesting that you qualified your "should you bother? no" with "unless you are doing inference at scale", when in the previous paragraph you explained why you can get better performance with GPUs.

So is there some advantage of TPU, assuming there was SWE/API parity between GPUs?


Could be cheaper, depending on workload, and if you’re large that could justify the cost of additional SWE time required to port and support. Triton/CUDA requires people who know both DL and low level programming. Whether you get better performance _per dollar_ really depends on workload and also on the size of your workload. Here I don’t just mean the cost of buying compute in cloud, I mean the more broad definition: total cost of doing business, all in, including SWE cost. If you’re huge (eg Anthropic), SWE cost at scale is a lot easier to justify. If you’re on the smaller side, SWE cost matters a lot more. It’s way easier to hire PyTorch people (market share 60%) than eg Jax (market share 3%). And yeah I know there’s Torch XLA, but it’s basically the same thing with a different frontend.


> Such capability can’t even be contemplated on the TPU side because you basically get a black box.

I'll just leave this here: https://jax.readthedocs.io/en/latest/pallas/index.html
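
For anyone who hasn't looked at it, a Pallas kernel looks roughly like this (a trivial element-wise add adapted from the quickstart, nothing model-specific):

    import jax
    import jax.numpy as jnp
    from jax.experimental import pallas as pl

    # Each ref is a block of the input/output living in fast on-chip memory.
    def add_kernel(x_ref, y_ref, o_ref):
        o_ref[...] = x_ref[...] + y_ref[...]

    @jax.jit
    def add(x, y):
        return pl.pallas_call(
            add_kernel,
            out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
        )(x, y)

    x = jnp.arange(1024, dtype=jnp.float32)
    print(add(x, x)[:4])  # [0. 2. 4. 6.]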


Pallas is very new. Given how difficult these things are to debug and how half assed the XLA tooling generally is, I’d give it at least another year, if not two, before I touch it for anything prod related.


Apparently Midjourney uses it. GCP put out a press release a while ago: https://www.prnewswire.com/news-releases/midjourney-selects-...


The quote from the linked press release is that they do training on TPUv4, while inference is running on GPUs. I have also heard this separately from people associated with Midjourney recently, and that they solely do training on TPUs.


Google is using them in prod. I think they're so hungry for chips internally that cloud isn't getting much support in selling them.


I think this is right, in part because I've been told exactly this by people who work for Google and whose job is to sell me cloud stuff: they say they have so much internal demand that they aren't pushing TPUs for external use. Hence external pricing and support just isn't that great right now. But presumably when capacity catches up they'll start pushing TPUs again.


Feels like a bad point in the curve to try and sell them. "Oh, our internal hype cycle is done… we'll put them on the market now that they're all worn out."


Sounds like ButterflyLabs.


They're getting swallowed up by Anthropic and the other huge spenders:

https://www.prnewswire.com/news-releases/google-announces-ex...

"Partnership includes important new collaborations on AI safety standards, committing to the highest standards of AI security, and use of TPU v5e accelerators for AI inference "


I would guess that Google's Vertex AI managed solution uses TPUs. Also, Google uses them internally to train and infer for all their research products.


80 to 90% are consumed internally. Only from version 5 onward is it planned to be customer focused.


While you can use TPUs with Vertex AI, it's just virtual machines; you can have one with an NVIDIA card if you like.


Or maybe they are just using NVIDIA. Who knows...


Beyond the fact that this is hardly a secret, there’s lots of other signs.

1. They have bought far less from NVIDIA than other hyperscalers, and they literally can't vomit without saying "AI". They have to be running those models on something. They have purchased huge amounts of chips from fabs, and what else would that be?

2. They have said they use them. Should be pretty obvious here.

3. They maintain a whole software stack for them, they design the chips, etc., yet they don't really try to sell the TPU. Why else would they do this?


They have announced publicly using TPUs for inference, as far back as 2016. They did not offer TPUs for Cloud customers until 2017. The development is clearly driven by internal use cases. One of the features they publicly disclosed as TPU-based was Smart Reply and that launched in 2015. So their internal use of TPUs for inference goes back nearly a decade.


Lots of people know.


marks as solved


We tried hard to move some of our inference workloads to TPUs at NLP Cloud, but finally gave up (at least for the moment), basically for the reasons you mention. We now only run our fine-tuning jobs on TPUs using JAX (see https://nlpcloud.com/how-to-fine-tune-llama-openllama-xgen-w...) and we are happy with that.

It seems to me that Google does not really want to sell TPUs but only showcase their AI work and maybe get some early adopters feedback. It must be quite a challenge for them to create a dynamic community around JAX and TPUs if TPUs stay a vendor locked-in product...


I tried to use a Google Coral. I have no idea how to make it work. I could follow a tutorial using TensorFlow, but I could not figure out how to use it for anything else. Is there some way to run CUDA stuff on it? I always assumed it required someone with actual skills (not me). I have used CUDA stuff before, but more for mass calculation and simulation (for financial stuff). It is great when it works. I worked at a shop that had these Xeon Phi systems that worked great, but I had no clue how; they only worked with their pre-canned tools.

Just as an example, over a decade ago I replaced a few cases filled with racks and a SAN that made up a compute cluster with one box (plus SAN) and a backup box (both boxes were basically the same in case one failed), but basically like dozens of servers were replaced by a two CPU box with a couple Tesla cards (probably one A100 later). The entire model had to be re-written, but it was not that bad. I wanted to do with AMD cards, but there was no easy way.

I would also say that modern networking has made all kinds of stuff more interesting (also lining Nvidia's pockets). Those TPUs do not make sense to me. I have no idea how to use them. They should release their version of CUDA.



> Cheers,

> The PyTorch/XLA Team at Google

Meanwhile you have an issue from 5 years ago with 0 support

https://github.com/pytorch/xla/issues/202


5 years ago PyTorch wasn’t owned by the Linux foundation. Give ‘em a chance now.

On my wish list for PyTorch is that the apt-installed version works out of the box on Jetson SBCs.


TPUs are tightly coupled to JAX and the XLA compiler. If your model is based on PyTorch you can use a bridge to export it to StableHLO and then compile it for a TPU. In theory the XLA compiler should be more performant than the PyTorch Inductor.
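
Roughly, the bridge looks like this; the torch_xla.stablehlo entry points below are an assumption about a recent torch_xla release, so treat the exact names as illustrative rather than authoritative:

    import torch
    import torch_xla.stablehlo as stablehlo  # assumed API, torch_xla >= 2.1

    # Toy model standing in for "your PyTorch model".
    model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU()).eval()
    example_args = (torch.randn(1, 16),)

    # torch.export traces the model into an ExportedProgram...
    exported = torch.export.export(model, example_args)

    # ...and the bridge lowers it to StableHLO, which the XLA compiler
    # can then compile for a TPU.
    shlo = stablehlo.exported_program_to_stablehlo(exported)
    print(shlo.get_stablehlo_text("forward"))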


There's a cubesat using a Coral TPU for pose estimation.

https://aerospace.org/article/aerospaces-slingshot-1-demonst...


They were lucky to get that going. Software support for the USB TPU was abandoned by Google years ago now. It works fine if you run Ubuntu 16.04, I think.


To see memory consumption on the TPU while running on GKE, you can look at kubernetes.io/node/accelerator/memory_used:

https://cloud.google.com/kubernetes-engine/docs/how-to/tpus#...
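
As a sketch, that metric can also be pulled programmatically with the Cloud Monitoring Python client; the project ID, time window, and value field below are placeholders and assumptions, not something from the linked docs page:

    import time
    from google.cloud import monitoring_v3

    client = monitoring_v3.MetricServiceClient()
    name = "projects/my-gcp-project"  # placeholder project ID

    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": now}, "start_time": {"seconds": now - 600}}
    )

    # Last 10 minutes of TPU memory usage as reported by GKE nodes.
    series_iter = client.list_time_series(
        request={
            "name": name,
            "filter": 'metric.type = "kubernetes.io/node/accelerator/memory_used"',
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    for series in series_iter:
        for point in series.points:
            # Assumes the metric is reported as an INT64 byte count.
            print(series.resource.labels["node_name"], point.value.int64_value)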


Ask HN:


[flagged]


Looks like this was generated by an LLM.


Indeed. The user made no comments in the first 6 months of the account, then starting 4 hours ago has been somewhat prolific.


Did you forget to include the links? I searched "TPU inference docs" but the results are either general TPU docs or just some inference examples.


I've seen people connecting these to Raspberry Pis to run local LLMs but I'm not sure how effective it is. Check YouTube for some videos about it.

Speaking of SBCs, prior to the Raspberry Pi I was looking at the Orange Pi 5, which has a Rockchip RK3588S with an NPU (Neural Processing Unit). That was the first I had heard of such a thing, and I was curious what exactly it does. Unfortunately, there's very little support for the Orange Pi and not a large community around it, so I couldn't find any feedback on how well it worked or what it did.

http://www.orangepi.org/html/hardWare/computerAndMicrocontro...


The Rockchip NPU can do object recognition a la OpenCV, but not LLMs.



