A from-scratch implementation of a sparse mixture-of-experts language model in a single file of PyTorch. This is inspired by, and largely based on, Andrej Karpathy's project 'makemore', and borrows a number of reusable components from that implementation. Just like makemore, makeMoE is an autoregressive character-level language model, but it uses the aforementioned sparse mixture-of-experts architecture. I added Expert Capacity to this implementation to make it more complete.
Adding scaled unit Gaussian noise to the logits:
noise = torch.randn_like(logits) * F.softplus(noise_logits)
noisy_logits = logits + noise
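For context, here is a minimal sketch of how that noise term could slot into a noisy top-k router; the module and variable names below are illustrative rather than necessarily the exact ones used in makeMoE:

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopkRouter(nn.Module):
    # Sketch: route each token to its top_k experts using noise-perturbed logits.
    def __init__(self, n_embed, num_experts, top_k):
        super().__init__()
        self.top_k = top_k
        self.topkroute_linear = nn.Linear(n_embed, num_experts)  # clean routing logits
        self.noise_linear = nn.Linear(n_embed, num_experts)      # per-token noise scale

    def forward(self, x):
        logits = self.topkroute_linear(x)
        noise_logits = self.noise_linear(x)
        # scaled unit Gaussian noise, as in the snippet above
        noise = torch.randn_like(logits) * F.softplus(noise_logits)
        noisy_logits = logits + noise
        # keep only the top_k experts per token; mask the rest to -inf before the softmax
        top_k_logits, indices = noisy_logits.topk(self.top_k, dim=-1)
        sparse_logits = torch.full_like(noisy_logits, float('-inf')).scatter(-1, indices, top_k_logits)
        router_output = F.softmax(sparse_logits, dim=-1)
        return router_output, indices

The softplus keeps the learned noise scale positive, and the -inf masking means experts outside the top_k receive exactly zero routing weight.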
Question: if you swapped this Gaussian noise for Gumbel noise, you'd get something like Gumbel softmax, right? I'm curious why it isn't used here. Isn't it the usual way to implement differentiable discrete selection? I ask about the effectiveness of Gumbel softmax because I've had some trouble using it in practice, so I'd like to know whether there are downsides to it compared to other methods. Honestly, just adding Gaussian noise like this seems simpler anyway.
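For concreteness, the swap I was imagining is roughly this (just a sketch; tau is a temperature hyperparameter I'm introducing here, and PyTorch's F.gumbel_softmax covers the full soft-selection version):

tau = 1.0                                      # temperature hyperparameter
eps = 1e-9
u = torch.rand_like(logits).clamp(eps, 1.0 - eps)
gumbel_noise = -torch.log(-torch.log(u))       # Gumbel(0, 1) noise via inverse-CDF sampling
noisy_logits = (logits + gumbel_noise) / tau   # softmax of this approaches a hard argmax as tau -> 0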
This is a good point. I have yet to try it, as I've kind of let this project sit for a couple of months and am only now getting back to it. I went with this because it's simpler, but I'm not sure simpler is necessarily better in this case.
Ah ok, I was wondering if there was some theory here that I wasn't aware of, but if it's just experimentation, no problem ;) Good to know in any case!
I do find it a bit difficult to locate resources describing the properties of the various options for this kind of discrete choice and clustering, apart from a few papers and blogs describing the idea.
Quite honestly, not in my experiments. I wanted to do some Bayesian hyperparameter optimization over discretized options like noise/no-noise and n_expert/top_k, but I haven't been able to find the time, or free capacity on one of our GPU clusters. I plan on using perplexity as the metric, since the model is not yet instruction fine-tuned.
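By perplexity I just mean exponentiated mean cross-entropy per token on held-out text; a minimal sketch of that evaluation (the model signature here is an assumption, not necessarily makeMoE's):

import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def eval_perplexity(model, xb, yb):
    # xb, yb: (B, T) token ids; assumes the model returns logits of shape (B, T, vocab_size)
    logits = model(xb)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), yb.reshape(-1))
    return math.exp(loss.item())   # perplexity = exp(mean cross-entropy)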
Oh nice. What's new here would be noisy top-k routing and expert capacity. It also seems to use the nanoGPT base from Andrej Karpathy. Mine is from January as well. Here's the original blog: https://huggingface.co/blog/AviSoori1x/makemoe-from-scratch
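To give a sense of what I mean by expert capacity: each expert only processes up to a fixed token budget per batch, and anything routed beyond that budget is dropped. A rough, self-contained sketch of the bookkeeping, with illustrative sizes and the usual tokens-per-expert formula (not necessarily the exact code):

import torch

batch_size, seq_len, top_k, num_experts, capacity_factor = 4, 64, 2, 8, 1.25
tokens_per_batch = batch_size * seq_len * top_k                      # total routed token slots
expert_capacity = int((tokens_per_batch / num_experts) * capacity_factor)

# indices: (batch*seq_len, top_k) expert ids from the router; random here just for illustration
indices = torch.randint(0, num_experts, (batch_size * seq_len, top_k))
for i in range(num_experts):
    routed = torch.nonzero((indices == i).any(dim=-1)).squeeze(-1)   # tokens that picked expert i
    kept = routed[:expert_capacity]                                  # overflow tokens are not processed by expert i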
It was inspired by Mixtral 8x7B, of course. I think the same approach, soft-to-hard MoE, could be used in other domains, like video/image processing. It would be interesting to take it to an extreme, say 4 experts out of 100.
Thanks! So this is something I tried, and qualitatively I didn't see a huge difference. I'd like to swap out my hand-rolled modules for standard PyTorch modules for self-attention etc. and train it on the Wikipedia English split. That's on my to-do list for sure.
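For the attention swap, something along these lines is what I have in mind (just a sketch with illustrative sizes, using nn.MultiheadAttention with a causal mask):

import torch
import torch.nn as nn

n_embed, n_head, block_size = 128, 4, 64                  # illustrative sizes
mha = nn.MultiheadAttention(n_embed, n_head, batch_first=True)

x = torch.randn(8, block_size, n_embed)                   # (batch, time, channels)
causal_mask = torch.triu(torch.ones(block_size, block_size, dtype=torch.bool), diagonal=1)
out, _ = mha(x, x, x, attn_mask=causal_mask)              # boolean True positions are masked out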
I ran some tests. A single model of the same size is better than the MoE. A single expert out of N is better than a model of the same size as one expert. Two experts are better than one. That was on a small LLM, so I'm not sure whether it scales.