
Actually, the TPU Research Cloud program is still going strong! We've expanded the compute pool significantly to include Cloud TPU v4 Pod slices, and larger projects still use hundreds of chips at a time. (TRC capacity has not been reclaimed for internal use.)

Check out this list of recent TRC-supported publications: https://sites.research.google/trc/publications/

Demand for Cloud TPUs is definitely intense, so if you're using preemptible capacity, you're probably seeing more frequent interruptions; reserved capacity is also available, though. I hope you'll email the TRC support team to say hello!


Zak, I love you buddy, but you should have some of your researchers try to use the TRC program. They should pretend to be a nobody (like I was in 2019) and try to do any research with the resources they’re granted. I guarantee you those researchers will all tell you “we can’t start any training runs anymore because the TPUs die after 45 minutes.”

This may feel like an anime betrayal, since you basically launched my career as a scientist. But it’s important for hobbyists and tinkerers to be able to participate in the AI ecosystem, especially today. And TRC just does not support them anymore. I tried, many times, over the last year and a half.

You don’t need to take my word for it. Here’s some unfiltered DMs on the subject: https://imgur.com/a/6vqvzXs

Notice how their optimism dries up, and not because I was telling them how bad TRC has become. It’s because their TPUs kept dying.

I held out hope for so long. I thought it was temporary. It ain’t temporary, Zak. And I vividly remember when it happened. Some smart person at Google proposed a new allocation algorithm back near the end of 2021, and poof, overnight our ability to create TPUs went from dozens to a handful. It was quite literally overnight; we had monitoring graphs that flatlined. I can probably still dig them up.

I’ve wanted to email you privately about this, but given that I am a small fish in a pond that’s grown exponentially bigger, I don’t think it would’ve made a difference. The difference is in your last paragraph: you allocate reserved instances to those who deserve it, and leave everybody else to fight over 45 minutes of TPU time when it takes 25 minutes just to create and fill your TPU with your research data.

Your non-preemptible TPUs are frankly a lie. I didn’t want to drop the L word, but a TPUv3 in euw4a will literally delete itself — aka preempt — after no more than a couple hours. I tested this over many months. That was some time ago, so maybe things have changed, but I wouldn’t bet on it.

There’s some serious “left hand doesn’t know that right hand detached from its body and migrated south for the winter” energy in the TRC program. I don’t know where it embedded itself, but if you want to elevate any other engineers from software devs to researchers, I urge you to make some big changes.

One last thing. The support staff of TRC is phenomenal. Jonathan Colton has worked more miracles than I can count, along with the rest of his crew. Ultimately he had to send me an email like “by the way, TRC doesn’t delete TPUs. This distinction probably won’t be too relevant, but I wanted to let you know” (paraphrasing). Translation: you took the power away from the people who knew where to put it (Jonathan) and gave it to some really important researchers, probably in Brain or some other division of Google. And the rest is history. So I don’t want to hear that one of the changes is “ok, we’ve punished the support staff” - as far as I can tell, they’ve moved mountains with whatever tools they had available, and I definitely wouldn’t have been able to do any better in their shoes.

Also, hello. Thanks for launching my career. Sorry that I had to leave this here, but my duty is to the open source community. The good news is that you can still recover, if only you’d revert this silly “we’ll slip you some reserved TPUs that don’t kamikaze themselves after 45 minutes if you ask in just the right way” stuff. That wasn’t how the program was in 2019, and I guarantee that I couldn’t have done the work I did then under the current conditions.


A few quick comments:

> But it’s important for hobbyists and tinkerers to be able to participate in the AI ecosystem

Totally agree! This was a big part of my original motivation for creating the TPU Research Cloud program. People sometimes assume that e.g. an academic affiliation is required to participate, but that isn't true; we want the program to be as open as possible. We should find a better way to highlight the work of TRC tinkerers - for now, the GitHub and Hugging Face search buttons near the top of https://sites.research.google/trc/publications/ provide some raw pointers.

I'm sorry to hear that you've personally had a hard time getting TPU v3 capacity in europe-west4-a. In general, TRC TPU availability varies by region and by hardware generation, and we've experimented with different ways of prioritizing projects. It's possible that something was misconfigured on our end if your TPU lifetimes were so short. Could you email Jonathan the name of the project(s) you were using and any other data you still have handy so we can figure out what was going wrong?

Also, thanks for the kind words for Jonathan and the rest of the TRC team. They haven't lost any power or control, and they are allocating a lot more Cloud TPU capacity than ever. However, now that everyone wants to train LLMs, diffusion models, and other exciting new things, demand for TPU compute is way up, so juggling all of the inbound TRC requests is definitely more challenging than it used to be.


It’s not euw4a. It’s everywhere. The allocation algorithm across the board kills off TPUs after no more than a couple hours. usc1f, usc1a, usc1c, euw4a; it makes no difference.

It would be funny if someone had configured gpt-2-15b-poetry (our project) in some special way that prevents us from creating TPUs that ever last more than a few hours, but from what I’ve heard from other people, this isn’t the case. That’s what I mean about the left hand not knowing what the right hand is doing. It’s not a misconfiguration. Again, pretend to be some random person who just wants to apply for TPU access, fill out your form, then try to do research with the TPUs that are available to you. You’ll have a rough time, but it’ll also cure this misconception that it’s a special case or was just me.

Again, no need to take my word for it; here’s an organic comment from someone who was rolling their eyes whenever I was cheerleading TRC, because their experience was so bad: https://news.ycombinator.com/item?id=36936782

I think that the experience is probably great for researchers who get special approval. And that’s fine, if that’s how the program is designed to be. But at least tell people that they shouldn’t expect more than an hour or two of TPU time.


It sounds like you're primarily using preemptible TPU quota, which doesn't come with any availability or uptime expectations at all.

By default, the TRC program grants both on-demand quota and preemptible quota. If you are able to create a TPU VM with your on-demand quota, it should last quite a bit longer than a few hours. (There are situations in which on-demand TRC TPU VMs can be interrupted, but these ought to be rare.) If your on-demand TPU VMs are being interrupted frequently, please email TRC support and provide the names of the TPU hosts that were interrupted so folks can try to help.

When there is very high demand for Cloud TPUs, it's certainly possible for preemptible TPU VMs to be interrupted frequently. It would be an interesting engineering project to make a very robust training system that could make progress even with low TPU VM uptime, and I hope someone does it! Until then, though, you should have a better experience with on-demand resources when you're able to create them. Reserved capacity is even better since it provides an expectation of both availability and uptime.
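
To sketch what I mean (this is just an illustrative toy, not an official recommendation; the path, step counts, and "model" are all placeholders), a loop that checkpoints frequently and always resumes from the latest checkpoint can keep making progress across interruptions:

    import os
    import pickle
    import random

    # Placeholders: in a real run the checkpoint lives on durable storage (e.g. GCS)
    # and "params" are real model weights rather than a single float.
    CKPT_PATH = "/tmp/train_state.pkl"
    CHECKPOINT_EVERY = 100
    TOTAL_STEPS = 1000

    def load_state():
        # Resume from the last checkpoint if a previous VM was interrupted mid-run.
        if os.path.exists(CKPT_PATH):
            with open(CKPT_PATH, "rb") as f:
                return pickle.load(f)
        return {"step": 0, "params": 0.0}

    def save_state(state):
        # Write atomically so an interruption mid-write can't corrupt the checkpoint.
        tmp = CKPT_PATH + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, CKPT_PATH)

    def train_step(params):
        # Stand-in for a real optimizer update.
        return params - 0.01 * random.random()

    state = load_state()
    while state["step"] < TOTAL_STEPS:
        state["params"] = train_step(state["params"])
        state["step"] += 1
        if state["step"] % CHECKPOINT_EVERY == 0:
            save_state(state)

In a real setup you'd also checkpoint optimizer and data-pipeline state, and write to durable storage rather than local disk, but the basic shape is the same.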


I was using on-demand TPUs primarily, and preemptible TPUs secondarily. Neither would last more than an hour or two. And two was something of a minor miracle by the end.


For future reference, the team looked into this, and it appears that the interruptions you experienced were specific to your project and a small number of other projects. The vast majority of TRC projects should see much longer Cloud TPU uptimes when they are able to create on-demand TPUs.

I'm sorry that you had such a frustrating time and that we weren't able to sort it out via email while it was happening. If you decide to try TRC again and run into issues like this, please be sure to engage with TRC support!


> You don’t need to take my word for it. Here’s some unfiltered DMs on the subject: https://imgur.com/a/6vqvzXs

> Notice how their optimism dries up, and not because I was telling them how bad TRC has become. It’s because their TPUs kept dying.

Unless I'm misreading this, they sound pretty happy and you sound pessimistic? Their last substantial comment was "I'm sure Zak could hook you up with something better"?


TRC is supposed to be the “something better”. This insider TPU stuff is for the birds. If TRC can only offer 4 hours with no preemptions, that’s fine, but they need to be up front about that. Saying that TPUs preempt every 24 hours and then killing them off after 45 minutes is… not very productive.

As for their comments, the third screenshot is the key; they’re agreeing that the situation is bad. They’re a friend, and they’re a little indirect with the way they phrase things. (If you’ve ever had a friend who really doesn’t want to be wrong, you know what I mean; they kind of say things in a circular way in order to agree without agreeing. After a while it’s pretty cute and endearing though.)

I was particularly pessimistic in those DMs because it came a couple months after I thought I’d give TRC one last try, back in January, which was roughly a year after I’d started my “ok, I’m losing hope, but I’ll wait and see” journey. In the meantime I kept cheerleading TRC and driving people to their signup page. But after the TPUs all died in less than two hours yet again, that was that.

I have a really high tolerance for faulty equipment. This is free compute; me complaining is just ungrateful. But I saw what things were like in 2019. “Different” would be the understatement of the century. If my baby wasn’t being incubated in the NICU today, I’d show the charts where our usage went from thousands of cores down to almost zero, and not for lack of trying.

It also would’ve been fine to say “sorry, this is unsustainable, the new limits are one TPU per person per project” and then give me a rock-solid TPU. We had those in 2021. One of our TPUv3s stayed online for so long that I started to host my blog on it just to show people that TPUs were good for more than AI; the uptime was measured in months. Then poof, now you can barely fire one up.


I don't have a qualified opinion on the subject of TPU availability.

I'm just pointing out that your summary of the DMs ("Notice how their optimism dries up, and not because I was telling them how bad TRC has become. It’s because their TPUs kept dying") is the opposite of what the DMs show.


As mentioned in another comment, it sounds like you're using preemptible TRC TPU quota. If you use on-demand TRC TPU quota instead, that should improve your uptime substantially.


This is totally fascinating.

Frankly, it sounds to me like they're having severe yield+reliability problems with the TPUv4s that aren't getting caught by wafer-level testing, and have binned the flakiest ones for use by outsiders.

A lot of yield issues show up as spontaneous resets/crashes.


It's more likely Google preempting researchers who are on a preemptible research grant, and it is happening a lot more often because there are more paying customers.


"Preemptable money" sounds like the kind of bullshit I would use to cover up failed chips. And yes, I am a VLSI engineer.


Main problem with the TPU Research Cloud is you get dragged down a LOT by the buggy TPU API: not just the Google Cloud API being awful, but the TensorFlow/JAX/PyTorch support being awful too. You also basically must use Google Cloud Storage, which is also slow and can be really expensive to get anything into or out of.

The Googlers maintaining the TPU GitHub repo also just basically don't care about your PR unless it's somehow gonna help them in their own perf review.

In contrast, with a GPU-based grid, you can not only run the latest & greatest out of the box but also do a lot of local testing that saves tons of time.

Finally, the OP here appears to be offering real customer engagement, which is totally absent from my own GCloud experiences across several companies.


Could you share a few technical details about the issues you've encountered with TF / JAX / PyTorch on Cloud TPUs? The overall Cloud TPU user experience improved a whole lot when we enabled direct access to TPU VMs, and I believe the newer JAX and PyTorch integrations are improving very rapidly. I'd love to know which issues are currently causing the most friction.


No, Cloud TPUs support JAX, PyTorch, and TensorFlow, and the new TPU VM architecture provides enough low-level access that users could add support for additional frameworks themselves if they are willing to put in substantial effort.


We try to make it easy to switch back and forth between Cloud TPUs and other hardware platforms using JAX, PyTorch, and TensorFlow. This is a difficult technical challenge, but the XLA compiler helps a lot, and switching is easier now than it has ever been.
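
To illustrate (a minimal sketch, not an official example, with placeholder shapes), the same jitted JAX function runs unchanged whether the attached backend is CPU, GPU, or TPU; XLA compiles it for whichever devices jax.devices() reports:

    import jax
    import jax.numpy as jnp

    @jax.jit
    def predict(params, x):
        w, b = params
        return jnp.tanh(x @ w + b)

    params = (jnp.ones((4, 2)), jnp.zeros(2))
    x = jnp.ones((8, 4))

    print(jax.devices())               # the CPU, GPU, or TPU devices JAX found
    print(predict(params, x).shape)    # same code, compiled by XLA for that backend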


Yes, there are a couple of ways to use Cloud TPUs at lower priority and lower cost. If you are a hobbyist, I highly recommend trying out Cloud TPUs for free via the TPU Research Cloud: https://sites.research.google/trc/

If you are flexible, you (or your scripts) can access a lot of compute power at odd hours!


That’s for researchers, however. For hobbyists, I guess we just have Colab?


I started the TRC program alongside the Cloud TPU program to make interesting amounts of ML compute available to a broad group of creative people, not only to academic researchers.

The TRC program welcomes hobbyists, artists, students, independent learners, technical writers, and a variety of others. We love it when the TPU Research Cloud enables people to do something that wouldn't have been possible otherwise.

I definitely recommend applying to the TRC program - please feel free to say directly that you are a hobbyist. The sign-up form is short, and it's likely that you can access a lot of compute if you are flexible and persistent.


The MLPerf 1.0 results provided an apples-to-apples comparison of large-scale TPU and GPU systems across several ML workloads: https://cloud.google.com/blog/products/ai-machine-learning/g...

In MLPerf 1.1, we showcased model training at larger scale: https://cloud.google.com/blog/topics/tpus/google-showcases-c...

The deep learning workloads that people find most interesting and the underlying hardware and software systems are all changing very rapidly. In addition to following MLPerf, we generally recommend that people run rigorous performance and cost comparisons on the actual workloads that they care about accelerating.
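
To show the sort of measurement I mean (a minimal JAX sketch with a placeholder workload; the same idea applies in any framework), note that JAX dispatches asynchronously, so you need block_until_ready() before stopping the clock:

    import time
    import jax
    import jax.numpy as jnp

    @jax.jit
    def step(w, x):
        # Stand-in for a real training step on the workload you actually care about.
        return w - 0.01 * (x.T @ (x @ w))

    w = jnp.ones((1024, 1024))
    x = jnp.ones((4096, 1024))

    step(w, x).block_until_ready()    # warm-up, so compilation time isn't counted
    start = time.perf_counter()
    for _ in range(10):
        w = step(w, x)
    w.block_until_ready()             # dispatch is async; wait before stopping the clock
    print((time.perf_counter() - start) / 10, "seconds per step")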


Founder of the Cloud TPU program here. If you'd like to experiment with TPU VMs for free and are willing to share your work with the world somehow (e.g. via publications or open-source projects), you can apply to participate in the TPU Research Cloud (TRC) program here: https://sites.research.google/trc/


Hi Zak, this question is a little out of your scope, but perhaps you may know the answer: do you know if/when Colab TPU runtimes are likely to be updated to support newer JAX functionality like pjit? (Or put another way: are Colab and Cloud TPU runtimes expected to be in sync at all?) I'd written some model code that worked great on a TPU VM and that I was excited to share on Colab (which is likely far more accessible), but then I found out that Colab simply doesn't support pjit (https://github.com/google/jax/issues/8300).

Other than that, really like TRC!


We love Colab and would love to upgrade the Colab TPU integration to support TPU VMs! No timeframe yet, but the right folks across JAX / Colab / Cloud TPU are very aware of this issue.


Just wanted to pop in to say congrats on making GA! Really happy to see the program develop from the early days :)


Thanks, Frank! You personally helped more Cloud TPU and TRC users than I can count, and you always came through when something needed to get done, and fast. I really appreciated it!


What are some of the things on the roadmap for the platform? Any immediate plans to close the command-line gap for TPU utilization, etc.?

My overall impression is TPUs are pretty awesome but the software stack is still a bit hard to use compared to mature GPU tools. I’d imagine it’s pretty hard for inexperienced users to use them.


If you haven't used Cloud TPUs in a while, I'd encourage you to try them now with TPU VMs and the latest versions of JAX, PyTorch / XLA, or TensorFlow. We've gotten a lot of positive feedback from customers and TRC users, so we think the overall experience has improved a lot, though there's always more we want to do.

People especially seem to find Cloud TPUs easy to use in comparison to alternatives when they are scaling up ML training workloads. Once you have a model running on a single TPU core, it is relatively straightforward from a systems perspective to scale it out to thousands of cores. You still need to work through the ML challenges of scaling, but that is more tractable when you aren't simultaneously struggling with systems-level issues.

In particular, you don't need to master a sequence of different networking technologies as you scale up, and the TPU interconnect is so much faster at scale than other technologies (10X last time I checked) that you don't have to work as hard to avoid network bottlenecks. Support for model parallelism on Cloud TPUs is improving across the ML frameworks, too.
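
As a rough sketch of that scaling path in JAX (placeholder model and data; nothing here is specific to any particular setup), the single-core step gets wrapped in pmap and gradients are averaged with an all-reduce over the named device axis; the same pattern carries from 8 cores on one host to a large Pod slice:

    import jax
    import jax.numpy as jnp

    n_dev = jax.local_device_count()   # e.g. 8 TPU cores on one host; 1 on CPU

    def loss(w, x, y):
        # Placeholder model: plain linear regression.
        return jnp.mean((x @ w - y) ** 2)

    def step(w, x, y):
        grads = jax.grad(loss)(w, x, y)
        grads = jax.lax.pmean(grads, axis_name="devices")  # all-reduce across cores
        return w - 0.01 * grads

    p_step = jax.pmap(step, axis_name="devices")

    # Replicate the parameters across devices and shard the batch on the leading axis.
    w = jnp.broadcast_to(jnp.zeros((16, 1)), (n_dev, 16, 1))
    x = jnp.ones((n_dev, 32, 16))
    y = jnp.ones((n_dev, 32, 1))

    for _ in range(100):
        w = p_step(w, x, y)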

To be clear, training ML models at scales that we currently consider large is still very challenging on any platform - for example, the logbooks that Meta recently published are fascinating: https://github.com/facebookresearch/metaseq/blob/main/projec...

We aim for Cloud TPUs to simplify the process of training models at these scales and far beyond: https://ai.googleblog.com/2022/04/pathways-language-model-pa...


Nice achievement and great to see you are still there after all that time to see it through to GA. Congrats! It must be a bit like seeing your child cycle to school alone for the first time :)


Thanks very much! We've come a long way, but there is always more interesting work required to keep up with the deep learning frontier and enable Cloud TPU customers and TRC users to expand it further.


Thanks Zak, already applied.

Just wondering, do TPU VMs support Vectorflow?

https://github.com/Netflix/vectorflow


No, Vectorflow is not supported out of the box, and I'm not sure the workloads it targets are the right fit for Cloud TPU hardware. However, be sure to check out the "Ranking and recommendation" section of the linked blog post above - Cloud TPUs are able to accelerate the ML models with very large embeddings that are increasingly common in state-of-the-art ranking and recommendation systems.


Congrats Zak!


Thanks, and congratulations to many others across many teams who have supported the Cloud TPU program over the years!


Here's a quick overview of TPUs from last year's Google I/O keynote: https://www.youtube.com/watch?v=XFFrahd05OM&t=1565s


The section on error-correcting qubits right after the bit you timestamped is really interesting too; I wonder what the status of that is.


Yes, TPU VMs dramatically improve the Cloud TPU user experience. You now have direct access to the VM on each TPU host whether you are using JAX, PyTorch, or TensorFlow, which provides a lot more flexibility and control and can often improve performance.


I struggle to understand precisely what you mean by user experience and 'often improved performance'.

Previously, there was no actual support for crucial features of the TPU related to data loading when using PyTorch, say. In turn, using a TPU over a GPU in that setup was frequently not worth it due to that exact issue. Your answer suggests it might be different now: are TF, JAX, and PyTorch now on par at all stages?


In the previous Cloud TPU architecture, PyTorch and JAX users had to create a separate CPU VM for every remote TPU host and arrange for these CPU hosts to communicate indirectly with the TPU hosts via gRPC. This was cumbersome and made debugging difficult.

With TPU VMs, none of this is necessary. You can SSH directly into each TPU host machine and install arbitrary software on a VM there to handle data loading and other tasks with much greater flexibility.

The blog post provides an example of training cost improvement using PyTorch / XLA on TPU VMs in the "Local execution of input pipeline" section. Hopefully we will be able to provide more tutorials on using PyTorch / XLA with TPU VMs soon.

With TPU VMs, workloads that require lots of CPU-TPU communication can now do that communication locally instead of going over the network, which can improve performance.
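
As a minimal sketch of what that looks like with PyTorch / XLA on a TPU VM (placeholder model and data, not a tuned example), the input pipeline and the training step both run on the same host the TPU chips are attached to:

    import torch
    import torch.nn as nn
    import torch_xla.core.xla_model as xm

    device = xm.xla_device()                      # the TPU device visible from this host

    model = nn.Linear(128, 10).to(device)         # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    # Placeholder data; in practice this is a DataLoader running locally on the TPU VM,
    # so the input pipeline no longer crosses the network from a separate user VM.
    x = torch.randn(64, 128)
    y = torch.randint(0, 10, (64,))

    for _ in range(10):
        optimizer.zero_grad()
        loss = loss_fn(model(x.to(device)), y.to(device))
        loss.backward()
        optimizer.step()
        xm.mark_step()                            # materialize the lazily built XLA graph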


Here is an example post that shows how to train a PyTorch/XLA model with a data pipeline reading from cloud storage: https://cloud.google.com/blog/topics/developers-practitioner...


Thanks for the very kind feedback! We've wanted to provide TPU VMs since the beginning of the Cloud TPU program, and I'm delighted that you're enjoying them. Many people across Google contributed to this launch.

We're definitely working on GKE integration, and we intend to provide more support for orchestration over time since individual Cloud TPU workloads are getting bigger and bigger. We'd love to hear any more detailed feedback you might have.


Hi Zak, great achievement.

Xoogler; I had connections with people working on TPUs.

AFAIK, TPU support on GCP was started in 2018? Or earlier? It's been 4 years to GA.

Rumors have it that the project was incredibly difficult to get out due to Google's poor support for cross-org collaboration. It has been a sad story, another one on the list of things that Google could have put more resources into and made available to the public sooner.


I started pitching the Cloud TPU program in 2016. Many, many people have contributed since then to build the products that are available today.

Google is a large and complicated place, but we're getting closer to providing the magical interactive supercomputing experience we've wanted for a long time.

The deep learning landscape is evolving very rapidly, and there is a lot of interest in scaling up further, so the next few years will be exciting.


Curious what the difficulty with GKE integration is. From afar, it seems like the flavor of VM shouldn’t impact the ability to run K8s.


The GKE team probably has other priorities.


The question was about what makes these instance types harder to support than any others.


(I'm one of the Cloud TPU product leads)

We've seen multiple BERT-related PyTorch models training successfully on Cloud TPUs, including training at scale on large, distributed Cloud TPU Pod slices.

Would you consider filing a GitHub issue at https://github.com/pytorch/xla or emailing pytorch-tpu@googlegroups.com to provide a bit more context about the specific issue you encountered?

Here's the current PyTorch/TPU troubleshooting guide, which provides information on how to collect and interpret metrics that are very helpful for debugging: https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.m...
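
For example (a small sketch; the exact counters vary by release), the metrics report described in that guide can be printed from inside a training script to see compile/execute counts and whether ops are falling back to the CPU:

    import torch
    import torch_xla.core.xla_model as xm
    import torch_xla.debug.metrics as met

    device = xm.xla_device()

    # Do a little work so there is something to report.
    a = torch.randn(1024, 1024, device=device)
    b = (a @ a).sum()
    xm.mark_step()
    print(b.item())

    # Prints compile/execute counters and timings; "aten::" counters flag ops
    # that fell back to the CPU instead of running on the TPU.
    print(met.metrics_report())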

Thanks!


> BERT-related PyTorch models training successfully on Cloud TPUs

How do you see it? Do you look at your client's code?


Google wrote BERT and they provide technical support to the FB PyTorch TPU port, so it's not entirely surprising. RoBERTa (FB's variant) would be a good candidate to test it with.


We only see code when customers open-source it or otherwise explicitly share it with us. We are directly in touch with several customers who are using the PyTorch / TPU integration, so we hear feedback from them, and we also run a variety of open-source PyTorch models on Cloud TPUs ourselves as we continue to improve the integration.

