Switch Transformers C – 2048 experts (1.6T params for 3.1 TB) (2022) (huggingface.co)
73 points by tosh 10 months ago | 36 comments



The title should be changed to reflect that the checkpoints were released 1.5 years ago [1], and the paper was published almost 3 years ago [2]. IMO "releases" means that it was released recently, but that's not the case.

[1] https://github.com/google-research/t5x/commit/199f226eeff5f8...

[2] https://arxiv.org/abs/2101.03961


Changed now. Thanks!


It's an experimental model trained on a small dataset. It's been out there for months. It wasn't popular because it's pretty much useless.


The top comment says a lot

> It's pretty much the rumored size of GPT-4. However, even when quantized to 4 bits, one would need ~800 GB of VRAM to run it.


It's a sparse model, which means you only need to load the "experts" that you're going to use for a given input.
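
Switch-style routing is top-1: a small gating layer scores the experts and sends each token to exactly one of them, so for a given batch only the selected experts' FFN weights ever need to be resident. A rough sketch of the routing idea (illustrative only, placeholder sizes, not the released code):

    import torch
    import torch.nn as nn

    class Top1Router(nn.Module):
        # Tiny gating layer: scores each token against every expert and
        # keeps only the single best-scoring expert per token.
        def __init__(self, d_model, num_experts):
            super().__init__()
            self.gate = nn.Linear(d_model, num_experts, bias=False)

        def forward(self, x):                        # x: [tokens, d_model]
            probs = self.gate(x).softmax(dim=-1)     # [tokens, num_experts]
            gate_val, expert_id = probs.max(dim=-1)  # top-1 expert per token
            return expert_id, gate_val

    router = Top1Router(d_model=4096, num_experts=2048)  # placeholder sizes
    tokens = torch.randn(8, 4096)
    expert_id, _ = router(tokens)
    print(expert_id.unique())  # only these experts' weights are needed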


But you still need to page in the weights from disk to the GPU at each layer, right?


You should only need the weights for the experts you want to run. The experts clock in at around 400 MB each (based on the 800 GB figure given elsewhere). A 24 GB GPU could fit around 60 experts, so it might be usable with a couple of old M40s.
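
Back-of-the-envelope, assuming the ~800 GB figure holds and the expert weights dominate:

    total_bytes = 800 * 1024**3            # ~800 GB for the 4-bit quantized model
    per_expert = total_bytes / 2048        # 2048 experts
    print(per_expert / 1024**2)            # ~400 MB per expert
    print((24 * 1024**3) // per_expert)    # ~61 experts fit on a 24 GB card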


You can get a second-hand RTX 3090 for $800. So in theory you could build a cluster for less than $40k in hardware. Not that this is pocket change, but given the capabilities of such a model the expense doesn't sound like that much.


3090s probably aren't gonna cut it because of the generational improvements in architecture (even if you could hack the drivers together to make it work).


You don't have NVLink, so how would this work?


You don’t need the entire VRAM to be accessed as one. Everything is parallelized by the framework. You lose some speed, but it’s the only way to do it. There’s no freaking way to have 800 GB of unified VRAM.
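
Concretely, you just shard the layers across cards and ship activations between them. A minimal PyTorch sketch of that kind of split (layer sizes and the two-GPU layout are made up for illustration; it needs two visible GPUs to actually run):

    import torch
    import torch.nn as nn

    def block():
        return nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True)

    stage0 = nn.Sequential(block(), block()).to("cuda:0")  # first half on GPU 0
    stage1 = nn.Sequential(block(), block()).to("cuda:1")  # second half on GPU 1

    x = torch.randn(1, 128, 1024, device="cuda:0")
    h = stage0(x)        # runs entirely on GPU 0
    h = h.to("cuda:1")   # only the activations cross PCIe/NVLink
    y = stage1(h)        # runs entirely on GPU 1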


GH200 would like a word ;-)


Hahaha I love this.


As I understand it, NVLink isn't actually required for much of anything. It allows P2P transfers and atomics. That same functionality can be used over PCIe on some GPUs with reduced performance. And the P2P features are not transparent and are lower-performance than staying inside one GPU, so generally software avoids P2P transfers even when they are available.

The Nvidia documentation on this is absolutely horrible, the features vary wildly by model of GPU, and I claim no real expertise.
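
If you want to see what your own box supports, PyTorch exposes the P2P capability query directly:

    import torch

    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i != j:
                print(i, "->", j, torch.cuda.can_device_access_peer(i, j))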


Or one can try to rent a cluster in the cloud.


It's no big deal to run an EPYC host with 2TB of RAM. It'll run slowly, but the Falcon 180B model runs at several tokens per second that way, and it's nowhere near as sparse as this should be.

I also have 960GB of VRAM in my garage (40x P40), though I suspect getting this model distributed in a useful way might be annoying and not worth the effort, particularly since it would probably be close to turn-key (if a bit slow) on a high-RAM EPYC host.


How many motherboards did you need to supply PCIe lanes to all of those P40s?

I have a similarly-sized pile of MI25s and recently managed to get eight of them running on a single plentiful-on-eBay Supermicro motherboard (custom bifurcating fanout riser, x8 to each card). It was a "learning experience".

Falcon 180B with 4-bit quantization fits in VRAM and works, but at under 10 tokens/sec because llama.cpp's support for multi-GPU on AMD is kind of an afterthought.


I have them across 4 hosts with PCIe switches. Really only 9 per host are maximally useful due to NVIDIA mailbox limits in CUDA (also because one per host is on x8, IIRC).


> I have them across 4 hosts with PCIe switches.

Nice!

Sorry to pester you, last question: did you find a source for those at a reasonable price?

Going by pricing from the last time I investigated, those would've cost more than the P40s, so I'm assuming you did. Apparently the actual ICs are/were incredibly expensive and only Pericom sells them as standalone chips AFAICT. (Edit: oh wow, Pericom got bought by Diodes Inc; this explains a lot)

Thanks!


be careful, the Compute Enforcement Administration might be reading


Wow... I won't be running anything like that locally anytime soon unfortunately...


MoE = Mixture of Experts

Not the anime kind. ;)


I'm much more interested in lower parameter models that are optimized to punch above their weight. There is already interesting work done in this space with Mistral and Phi. I see research coming out virtually every week trying to address the low hanging fruit.


Interesting, all 3 authors left Google. 2 went to OpenAI, the 3rd cofounded Character.AI.


Noob question - why is it a must to load the whole model into VRAM all at once?

Why can't it be streamed from disk layer by layer: load a sliding window of it, compute, hold the temporary results, offload it, and load the next window. Repeat until the whole inference is done.

Also, if these calculations are so repetitive in nature that you need CUDA cores to compute them, why can't the inference be streamed and spread across a cluster of machines, each with multiple commodity GPUs, where one central "conductor/orchestrator" machine collects all the results from the cluster participants?


That is actually possible. For example, someone wrote Python code to do this for the massive open-source model BLOOM.

However, it's still slow as tar. When I was running the BLOOM model I think my inference time was around 1 token per minute.

See: https://towardsdatascience.com/run-bloom-the-largest-open-ac...


It's not a formal must, it's a practical "must". Paging and streaming and distribution (and collection) all bring their own overhead, and given the architecture of most of these systems, the overhead of those techniques is very large.

So for researchers who can access a system with enough memory, which is most of the people professionally exploring these models, nobody needs to invest much preparatory effort into those other techniques or reducing their overhead. For them, there's basically no ROI for it.

But a lot of that work to optimize for those other techniques is gradually being accomplished in the open source community, where people don't have access to expensive clouds and lab systems. It takes time though, and can still only achieve so much.


Yes, it can be done and is done when needed, but your life will be simpler if you can fit your models in GPU memory. Even better if you can fit it in a single GPU memory.


No, it's not necessary to load it all in at once; you could in theory make it stream in chunks of layers.
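
Conceptually something like this (a sketch with a made-up one-file-per-layer checkpoint layout; the disk and PCIe transfers are exactly why it ends up so slow):

    import torch

    def stream_forward(x, layer_files, build_layer):
        # Run the model one layer at a time: load weights, compute, free.
        x = x.to("cuda")
        for path in layer_files:
            layer = build_layer()      # empty layer skeleton on the CPU
            layer.load_state_dict(torch.load(path, map_location="cpu"))
            layer.to("cuda")           # page this layer's weights onto the GPU
            with torch.no_grad():
                x = layer(x)           # only the activations stay resident
            del layer                  # drop the weights again
            torch.cuda.empty_cache()
        return x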


Assuming you have the hardware, what could you use this for?

Would it be difficult to wire up for conversations like ChatGPT? Could you run it against a local photo store to let you search by names of objects/people? Or is it basically an intermediate model that needs further training to fine tune to your application?


This is what I have been struggling to understand too. The accolades speak to the size and other input-related capabilities of the model, but not so much to the output.


I reckon it would be useful if there were a set of open-source MoE models in different sizes, like Llama, for example 1.5B, 7B, 34B and 72B (or perhaps those numbers x the number of experts). It would enable much experimentation in the community. Not many are GPU-rich enough to play with a 1.6T model!
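
Rough sizing, under the assumption that the expert FFNs dominate the parameter count (all numbers here are made up for illustration):

    shared_params = 1.5e9      # attention + embeddings, hypothetical
    per_expert_ffn = 0.7e9     # FFN parameters per expert, hypothetical
    for num_experts in (8, 64, 2048):
        total = shared_params + num_experts * per_expert_ffn
        print(num_experts, f"experts -> ~{total / 1e9:.0f}B params")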


... in November of 2022.


It's wild that you could run a* quant (160GB) of this on a $6k Mac Studio. Times are a changin'.


Google is so irrelevant


Google is dead in the water. Their products are hard to use, all of them. Remember when you could set up most of the GA & Ads stuff for your business yourself? Gone are the days. The UI is cluttered, broken, stupid. You need advisors and professional services for simple things. It's only Google.com that keeps things afloat. The CEO is reactive and doesn't know what the right and left hands are doing. Google will be gone in 10 years.



