Hacker News

However, if you need to swap experts on each token, you might as well run on CPU.


> Presumably, the same expert would frequently be selected for a number of tokens in a row

In other words, if you ask a coding question and there's a coding expert in the mix, that one expert would presumably handle the whole answer.


See my poorly educated answer above. I don’t think that’s how MoE actually works: a new mixture of experts is chosen for every token, not once per question.
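To make the per-token routing concrete, here is a minimal sketch of top-k gating as used in typical MoE layers. All names and sizes are illustrative, not any specific model's implementation: a router matrix scores every expert for every token, and each token independently keeps its top-k experts.

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens, d_model, n_experts, top_k = 4, 8, 4, 2

# Hypothetical router: a single linear layer scoring each expert per token.
W_router = rng.normal(size=(d_model, n_experts))
tokens = rng.normal(size=(n_tokens, d_model))

logits = tokens @ W_router                      # shape (n_tokens, n_experts)

# Each token independently selects its own top-k experts.
chosen = np.argsort(logits, axis=-1)[:, -top_k:]

print(chosen)  # rows can differ: expert choice is per token (and per layer)
```

The point of the sketch is that `chosen` is computed row by row, so two adjacent tokens can route to entirely different experts; any caching scheme only wins if the rows happen to repeat.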


Yes, I read that. Do you think it's reasonable to assume the same expert will be selected so consistently that model-swapping time won't dominate total runtime?


No idea TBH, we'll have to wait and see. Some say it might be possible to efficiently swap the expert weights if you can fit everything in RAM: https://x.com/brandnarb/status/1733163321036075368?s=20
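One way that swapping could be made cheap, assuming all expert weights fit in host RAM, is to treat fast (GPU) memory as a small LRU cache over the experts. This is a toy sketch of that idea, not the scheme from the linked tweet; the class name, sizes, and the copy counter are all made up for illustration.

```python
from collections import OrderedDict
import numpy as np

# All expert weights live in host RAM (hypothetical sizes).
N_EXPERTS, D = 8, 4
ram_experts = {i: np.full((D, D), float(i)) for i in range(N_EXPERTS)}

class DeviceCache:
    """Fixed-capacity LRU cache standing in for fast (GPU) memory."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()
        self.copies = 0  # number of simulated host->device transfers

    def get(self, expert_id):
        if expert_id in self.slots:
            self.slots.move_to_end(expert_id)  # mark as recently used
            return self.slots[expert_id]
        if len(self.slots) >= self.capacity:
            self.slots.popitem(last=False)     # evict least recently used
        self.copies += 1                       # simulate the RAM->VRAM copy
        self.slots[expert_id] = ram_experts[expert_id].copy()
        return self.slots[expert_id]

cache = DeviceCache(capacity=2)
for eid in [0, 0, 1, 0, 2, 1]:  # a made-up sequence of per-token expert picks
    cache.get(eid)
print(cache.copies)  # prints 4: repeats of a cached expert cost nothing
```

Whether this wins in practice is exactly the open question in the thread: if consecutive tokens rarely reuse experts, nearly every `get` is a miss and the transfers dominate.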
