A from-scratch implementation of a sparse mixture-of-experts language model in a single file of PyTorch. This is inspired by, and largely based on, Andrej Karpathy's project 'makemore', and borrows a number of reusable components from that implementation. Just like makemore, makeMoE is an autoregressive character-level language model, but it uses the aforementioned sparse mixture-of-experts architecture. I added Expert Capacity to this implementation to make it more complete.
Adding scaled unit Gaussian noise to the logits:
noise = torch.randn_like(logits) * F.softplus(noise_logits)
noisy_logits = logits + noise
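For context, here is a minimal sketch of how that noise term could slot into a noisy top-k router; the module and variable names below are illustrative rather than necessarily the exact ones used in makeMoE:

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopkRouter(nn.Module):
    # Sketch: route each token to its top_k experts using noise-perturbed logits.
    def __init__(self, n_embed, num_experts, top_k):
        super().__init__()
        self.top_k = top_k
        self.topkroute_linear = nn.Linear(n_embed, num_experts)  # clean routing logits
        self.noise_linear = nn.Linear(n_embed, num_experts)      # per-token noise scale

    def forward(self, x):
        logits = self.topkroute_linear(x)
        noise_logits = self.noise_linear(x)
        # scaled unit Gaussian noise, as in the snippet above
        noise = torch.randn_like(logits) * F.softplus(noise_logits)
        noisy_logits = logits + noise
        # keep only the top_k experts per token; mask the rest to -inf before the softmax
        top_k_logits, indices = noisy_logits.topk(self.top_k, dim=-1)
        sparse_logits = torch.full_like(noisy_logits, float('-inf')).scatter(-1, indices, top_k_logits)
        router_output = F.softmax(sparse_logits, dim=-1)
        return router_output, indices

The softplus keeps the learned noise scale positive, and the -inf masking means experts outside the top_k receive exactly zero routing weight.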
Question: if you swapped this Gaussian noise for Gumbel noise, you'd get something like Gumbel softmax, right? I'm curious why it isn't used here. Isn't it the usual way to implement differentiable discrete selection? I ask about the effectiveness of Gumbel softmax because I've had some trouble using it in practice, so I'd like to know whether there are downsides to it compared to other methods. Honestly, just adding Gaussian noise like this seems simpler anyway.
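For concreteness, the swap I was imagining is roughly this (just a sketch; tau is a temperature hyperparameter I'm introducing here, and PyTorch's F.gumbel_softmax covers the full soft-selection version):

tau = 1.0                                      # temperature hyperparameter
eps = 1e-9
u = torch.rand_like(logits).clamp(eps, 1.0 - eps)
gumbel_noise = -torch.log(-torch.log(u))       # Gumbel(0, 1) noise via inverse-CDF sampling
noisy_logits = (logits + gumbel_noise) / tau   # softmax of this approaches a hard argmax as tau -> 0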
This is a good point. I have yet to try it, as I've kind of let this project sit for a couple of months and am only now getting back to it. I went with this because it's simpler, but I'm not sure simpler is necessarily better in this case.
Ah ok, I was wondering if there was some theory here that I wasn't aware of, but if it's just experimentation, no problem ;) Good to know in any case!
I do find it a bit difficult to locate resources describing the properties of the various options for this kind of discrete choice and clustering, apart from a few papers and blogs describing the idea.
Quite honestly, not in my experiments. I wanted to do some Bayesian hyperparameter optimization over discretized options like noise/no-noise and n_expert/top_k, but I haven't been able to find the time, or free capacity on one of our GPU clusters. I plan on using perplexity as the metric, since the model is not yet instruction fine-tuned.
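By perplexity I just mean exponentiated mean cross-entropy per token on held-out text; a minimal sketch of that evaluation (the model signature here is an assumption, not necessarily makeMoE's):

import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def eval_perplexity(model, xb, yb):
    # xb, yb: (B, T) token ids; assumes the model returns logits of shape (B, T, vocab_size)
    logits = model(xb)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), yb.reshape(-1))
    return math.exp(loss.item())   # perplexity = exp(mean cross-entropy)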
Oh nice. What's new here would be noisy top-k routing and expert capacity. It also seems to use the nanoGPT base from Andrej Karpathy. Mine is from January as well. Here's the original blog: https://huggingface.co/blog/AviSoori1x/makemoe-from-scratch
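To give a sense of what I mean by expert capacity: each expert only processes up to a fixed token budget per batch, and anything routed beyond that budget is dropped. A rough, self-contained sketch of the bookkeeping, with illustrative sizes and the usual tokens-per-expert formula (not necessarily the exact code):

import torch

batch_size, seq_len, top_k, num_experts, capacity_factor = 4, 64, 2, 8, 1.25
tokens_per_batch = batch_size * seq_len * top_k                      # total routed token slots
expert_capacity = int((tokens_per_batch / num_experts) * capacity_factor)

# indices: (batch*seq_len, top_k) expert ids from the router; random here just for illustration
indices = torch.randint(0, num_experts, (batch_size * seq_len, top_k))
for i in range(num_experts):
    routed = torch.nonzero((indices == i).any(dim=-1)).squeeze(-1)   # tokens that picked expert i
    kept = routed[:expert_capacity]                                  # overflow tokens are not processed by expert i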
It was inspired by Mixtral 8x7B, of course. I think the same approach, soft-to-hard MoE, could be used in other domains, like video/image processing. It would be interesting to take it to an extreme, say 4 experts out of 100.
Thanks! So this is something I tried, and qualitatively I didn't see a huge difference. I'd like to swap out my hand-rolled modules for standard PyTorch modules for self-attention etc. and train it on the Wikipedia English split. That's on my to-do list for sure.
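For the attention swap, something along these lines is what I have in mind (just a sketch with illustrative sizes, using nn.MultiheadAttention with a causal mask):

import torch
import torch.nn as nn

n_embed, n_head, block_size = 128, 4, 64                  # illustrative sizes
mha = nn.MultiheadAttention(n_embed, n_head, batch_first=True)

x = torch.randn(8, block_size, n_embed)                   # (batch, time, channels)
causal_mask = torch.triu(torch.ones(block_size, block_size, dtype=torch.bool), diagonal=1)
out, _ = mha(x, x, x, attn_mask=causal_mask)              # boolean True positions are masked out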
I ran some tests. A single model of the same size is better than the MoE. A single expert out of N is better than a model of the same size as one expert. Two experts are better than one. That was on a small LLM, so I'm not sure whether it scales.