How to train large models on many GPUs? (2021) (lilianweng.github.io)
216 points by eternalban on Feb 11, 2023 | 33 comments



When training over multiple GPUs, it's hard not to think about Ray (https://docs.ray.io/en/latest/train/train.html). Ray, as an open-source project, has exploded over the last few years and helps with the memory bottleneck by separating memory from compute.

FYI, I am not affiliated with Ray. However, I did write the following paper on scaling data-parallel training for large ML models ;) https://openreview.net/pdf?id=rygFWAEFwS

Also, another one of my papers discusses reducing the communication bottleneck in distributed training: https://dl.acm.org/doi/pdf/10.1145/3447548.3467080


I'm one of the Ray developers, thanks for the shoutout :)

If you're curious about how Ray is used for LLMs, here are some interesting examples of LLM projects using Ray!

- Alpa does training and serving with 175B parameter models https://github.com/alpa-projects/alpa

- GPT-J https://github.com/kingoflolz/mesh-transformer-jax

- Another HN thread on training LLMs with Ray (on TPUs in this case) https://news.ycombinator.com/item?id=27731168

- OpenAI fireside chat on the evolution of their infrastructure and usage of Ray for training https://www.youtube.com/watch?v=CqiL5QQnN64

- Cohere on their architecture for training LLMs https://www.youtube.com/watch?v=For8yLkZP5w&t=3s

Some other thoughts

1. There is a lot more we want to do to make Ray better for working with large language models and for making training, serving, and batch inference work well out of the box.

2. The original post is about training, but we actually see even more interest in fine-tuning and serving with LLMs, in part because there are good pre-trained models.

3. For LLMs, we see a lot of interest in Ray + Jax or Ray + TPUs relative to what we see in other use cases.


Do you see any convergence on wire (arrow?) and storage (pandas?) formats?


And we can make Ray more efficient by optimizing GPU hardware utilization https://centml.ai/


Will it work with a PC that has 7 AMD Vega GPUs?


Yes, but this will largely come down to whether the deep learning framework that you're using (PyTorch, TensorFlow, Jax, etc) works well in that setting. Ray is pretty framework and hardware agnostic and can be used to schedule / scale different ML frameworks on different types of devices (CPUs, GPUs, TPUs, etc), but the actual logic for running code on the accelerators lives in the deep learning framework.
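
For a rough sense of what that looks like, here is a minimal sketch of scheduling per-GPU work with Ray; the training function body is just a made-up placeholder, and everything inside it is ordinary PyTorch rather than anything Ray-specific:

    import ray
    import torch

    ray.init()  # connects to an existing cluster, or starts a local one

    # Ray reserves one GPU per copy of this task; the code that actually
    # touches the GPU is plain framework code (PyTorch here).
    @ray.remote(num_gpus=1)
    def train_step(seed: int) -> float:
        torch.manual_seed(seed)
        device = torch.device("cuda")  # Ray sets CUDA_VISIBLE_DEVICES for this task
        model = torch.nn.Linear(512, 512).to(device)
        x = torch.randn(64, 512, device=device)
        loss = model(x).sum()
        loss.backward()
        return float(loss.item())

    # Launch one task per GPU and gather the results.
    futures = [train_step.remote(i) for i in range(7)]
    print(ray.get(futures))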


Why isn't there a framework that does all this automatically for you?

I tried torch FSDP, but it only managed to increase the usable memory to something like 150% of a single GPU.

I eventually ended up sharding my model manually with .cuda() and .to(), which works much better, but now I am limited to one module per GPU. I would like to scale even further, and that would mean spinning up more nodes and splitting the model across however many GPUs by hand.
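
For reference, what I mean by manual sharding is roughly this kind of thing (a toy two-stage model, one module per GPU, with activations moved between devices in forward):

    import torch
    import torch.nn as nn

    class TwoStageModel(nn.Module):
        """Naive model parallelism: one block per GPU, activations hop between devices."""

        def __init__(self):
            super().__init__()
            self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
            self.stage1 = nn.Linear(4096, 1024).to("cuda:1")

        def forward(self, x):
            x = self.stage0(x.to("cuda:0"))
            x = self.stage1(x.to("cuda:1"))  # move activations, not parameters
            return x

    model = TwoStageModel()
    out = model(torch.randn(8, 1024))
    out.sum().backward()  # autograd follows the tensors across devices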

I would be interested if anyone knows of a framework that manages this automatically and just works.

EDIT: BTW, I am talking about model sharding, not data parallelism, which works very well with DDP.


DeepSpeed became popular soon after this post was originally published and is natively supported by many PyTorch training frameworks.

https://www.deepspeed.ai

https://www.deepspeed.ai/training/
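
Roughly, wiring a model into DeepSpeed's ZeRO looks something like the sketch below; the config values are only illustrative, so check the docs for the options that fit your setup:

    import deepspeed
    import torch

    model = torch.nn.Linear(1024, 1024)

    # ZeRO stage 3 shards parameters, gradients, and optimizer state across GPUs.
    ds_config = {
        "train_micro_batch_size_per_gpu": 8,
        "zero_optimization": {"stage": 3},
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    }

    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )

    x = torch.randn(8, 1024).to(model_engine.local_rank)
    loss = model_engine(x).sum()
    model_engine.backward(loss)  # the engine handles gradient sharding/communication
    model_engine.step()

You would launch this with the deepspeed launcher (e.g. "deepspeed train.py") so it spans all local GPUs.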


I tried that as well, but maybe I did not use it correctly. I did not see the full sharding that I was hoping for; I only saw results similar to FSDP.



>Why isn't there a framework that does all this automatically for you?

Check whether MosaicML might help in your case. I haven't tried it myself, but they have most of the customizations and speed-up optimizations I've come across recently.

https://www.mosaicml.com/blog/supercharge-training-composer

Also worth checking out their “training from scratch” blog posts.

Training StableDiffusion: https://www.mosaicml.com/blog/training-stable-diffusion-from...

Training GPT-3: https://www.mosaicml.com/blog/billion-parameter-gpt-training...


Mosaic's open source library is excellent: Composer https://github.com/mosaicml/composer.

It gives you PyTorch DDP for free, makes FSDP about as easy as can be, and provides best-in-class performance monitoring tools: https://docs.mosaicml.com/en/v0.12.1/notes/distributed_train...

Here's a nice intro to using Huggingface models: https://docs.mosaicml.com/en/v0.12.1/examples/finetune_huggi...

I'm just a huge fan of their developer experience. It's up there with Transformers and Datasets as the nicest tools to use.


> Transformers and Datasets

Are these specific libraries?



It's a fair question.

Nvidia's NCCL and AMD's RCCL provide collective communication primitives that really are hidden at the framework level (e.g. in PyTorch).
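
For example, a PyTorch all-reduce already goes through NCCL (or RCCL) without the user ever touching the collective library directly. A minimal sketch, assuming it's launched with torchrun so the rank/world-size environment variables are set:

    import os
    import torch
    import torch.distributed as dist

    # "nccl" resolves to NCCL on CUDA builds and RCCL on ROCm builds of PyTorch.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor; this is the kind of collective that
    # DDP issues on your behalf for every gradient bucket during backward.
    grad = torch.full((4,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: {grad}")

Run with e.g. "torchrun --nproc_per_node=8 script.py".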

However, I don't think that you would want to hide model, data, or tensor parallelism. It's too important a consideration for performance and training convergence impact.

At least in scientific computing, I've never observed effective means of automatic parallelism expressed across many nodes despite decades of research. I'm not optimistic this will be effective anytime soon.


Any framework that "just works" tends to not work when some small change is needed or a new model with new data/compute roofline comes out.


There is - https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en...

Supports data, tensor, pipeline, and sequence parallelism, activation checkpointing, distributed optimizers, fused kernels, and more.


Beyond the other answers, I'll point out that PyTorch is developing tools that will make doing this work by hand, or implementing it in a framework, much easier. They're building a native DTensor implementation and testing out SPMD-style distributed models with pipelining. DTensor lives in torch.distributed, and the SPMD code is in the repo called Tau under the PyTorch org on GitHub.
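
A rough sketch of the DTensor idea; the API still lives under torch.distributed._tensor and is in flux, so treat the exact module paths and names as illustrative:

    import torch
    import torch.distributed as dist
    from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

    dist.init_process_group(backend="nccl")

    # A 1-D mesh over all ranks; meshes can also be 2-D to combine
    # tensor parallelism with data parallelism.
    mesh = DeviceMesh("cuda", list(range(dist.get_world_size())))

    # A "global" 8192x8192 tensor, sharded along dim 0 across the mesh;
    # each rank only materializes its own slice.
    big = torch.randn(8192, 8192)
    sharded = distribute_tensor(big, mesh, placements=[Shard(0)])
    print(sharded.to_local().shape)  # the local shard on this rank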


Might this be what you're looking for: https://github.com/bigscience-workshop/petals ?


> Why isn't there a framework that does all this automatically for you?

Question: could this be implemented in PyTorch in an opaque way? Or would it require changes to its API?


Discussed (a bit) at the time:

How to train large models on many GPUs? - https://news.ycombinator.com/item?id=28657797 - Sept 2021 (9 comments)


Somewhat amazed, dang, that this topic is not discussed more widely here or elsewhere. There is a lot of HPC and DS expertise out there which lacks understanding of ML system architecture (in the sense of the deployed machinery in toto).

Her follow-up post is also recommended for those who (like me) are experienced but not in ML and finally had things click because of the OP writeup:

Large Transformer Model Inference Optimization (2023)

https://lilianweng.github.io/posts/2023-01-10-inference-opti...

A very cool cite from that article is LLM.int8(): https://arxiv.org/abs/2208.07339


I'm waiting for GPU cards that allow the user to plug in memory modules.


Not going to happen:

https://www.servethehome.com/wp-content/uploads/2022/08/NVID...

That is, unless some utterly revolutionary new interconnect technology comes along...

LPDDR is getting adopted more and more too, with CPUs losing memory expansion capabilities in exchange for huge power savings.


I bet a lot of people would be happy if their GPU could train large models, even if it took 5x as long.


That's getting into multi-level DRAM configs; see the Grace Superchip for an example.

Or just a plain slower GPU, like Apple's GPU series, but that comes at much higher price tags than desktop GPUs at a given performance level.


The bandwidth and latency are as much a bottleneck as the capacity is. That's why recent ML and GPU chips have moved to on-package "high-bandwidth memory." Even if you were to add more high-bandwidth memory off-package (via lots of parallel DIMMs, say), it's difficult to actually manage this memory well, because of a host of competing factors, including:

a) on-chip resources (controllers etc) that must consume area and power to manage the new memory,

b) increased design cost for validating the handling of the extra (crappier than HBM) memory,

c) software not knowing how to hide the (much higher) latency of this memory - deep learning hardware isn't as good at memory hierarchies as CPUs are,

d) a chip that has spare on-chip bus bandwidth for this kind of thing has wasted bus bandwidth.

Et cetera.

So if you are building GPUs or AI accelerators, you tend to just go ahead and build this in.


The AMD Radeon Pro SSG had 4 NVMe slots on the card itself, but that was 2017. With a DirectStorage-style API, that approach might be able to offer some gains for large models.


I could never get a solid answer whether that was presented as memory to the GPU, or just as a PCIe switch with NVMe drives hanging off one side and the GPU on the other.


It's a PCIe switch with NVMe drives and the GPU behind it; the drives are actually presented straight to Windows, even.

https://www.youtube.com/watch?v=-fEjoJO4lEM

In principle they could be used with an API like DirectStorage or CUDA GPUDirect RDMA (which dates back to Kepler), and in that case they would never need to talk to the CPU, given appropriate software support. But it's never going to be presented as GPU memory; it's going to work like a block storage device you can do RDMA requests against, most likely.

https://docs.nvidia.com/cuda/gpudirect-rdma/

https://developer.download.nvidia.com/video/gputechconf/gtc/...

Now technically, it all depends on what you mean by "as GPU memory", because PCIe is all RDMA anyway; even CPU-to-GPU is an RDMA operation. That's why there's the whole thing about "resizable BAR" etc - that's the aperture window in the CPU's address space through which GPU memory gets mapped in.

So technically yes you can map those SSDs in "as GPU memory" via GPUDirect RDMA block storage (or DirectStorage), but you can do that with a regular NVMe SSD in an adapter card too. The SSG is just a "combo GPU+SSD card" in the same way QNAP makes those "combo network+SSD cards", but with a lot of fanfare/marketing around it.

To be absolutely fair, Fiji/Vega is a good design for that, since it doesn't have a bunch of memory packages around it. But it wasn't what AMD trumpeted it as; as the LTT video describes, it was a very specific reaction to the question of "workstation GPUs are using a lot of memory, HBM can't be scaled as high, how do we put more memory on a Fiji/Vega GPU for workstation users?" And AMD's claims that HBM meant you could just swap everything around and not have to worry about framebuffer size were never true; the PCIe bus itself is not fast enough for that.

https://www.youtube.com/watch?v=NQ8YlXh-jbI&t=280s


As far as I remember, it was presented as a drive and was good for sequential reads, but you had to use AMD's API to get the full benefit.


Instead of waiting for the future, maybe you could look to the past? That's how graphics cards used to work a couple of decades ago.






