Hacker News

However, if you need to swap experts on each token, you might as well run on CPU.


> Presumably, the same expert would frequently be selected for a number of tokens in a row

In other words, if you ask a coding question and there's a coding expert in the mix, that one expert would presumably handle the whole answer.


See my poorly educated answer above. I don’t think that’s how MoE actually works: a new mixture of experts is chosen for every token, not once per question.
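To make the per-token routing concrete, here is a minimal sketch of top-k gating as used in typical MoE layers. All names and sizes are illustrative, not any specific model's implementation: a router matrix scores every expert for every token, and each token independently keeps its top-k experts.

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens, d_model, n_experts, top_k = 4, 8, 4, 2

# Hypothetical router: a single linear layer scoring each expert per token.
W_router = rng.normal(size=(d_model, n_experts))
tokens = rng.normal(size=(n_tokens, d_model))

logits = tokens @ W_router                      # shape (n_tokens, n_experts)

# Each token independently selects its own top-k experts.
chosen = np.argsort(logits, axis=-1)[:, -top_k:]

print(chosen)  # rows can differ: expert choice is per token (and per layer)
```

The point of the sketch is that `chosen` is computed row by row, so two adjacent tokens can route to entirely different experts; any caching scheme only wins if the rows happen to repeat.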


Yes, I read that. Do you think it's reasonable to assume the same expert will be selected so consistently that model-swapping time won't dominate total runtime?


No idea TBH, we'll have to wait and see. Some say it might be possible to efficiently swap the expert weights if you can fit everything in RAM: https://x.com/brandnarb/status/1733163321036075368?s=20
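One way that swapping could be made cheap, assuming all expert weights fit in host RAM, is to treat fast (GPU) memory as a small LRU cache over the experts. This is a toy sketch of that idea, not the scheme from the linked tweet; the class name, sizes, and the copy counter are all made up for illustration.

```python
from collections import OrderedDict
import numpy as np

# All expert weights live in host RAM (hypothetical sizes).
N_EXPERTS, D = 8, 4
ram_experts = {i: np.full((D, D), float(i)) for i in range(N_EXPERTS)}

class DeviceCache:
    """Fixed-capacity LRU cache standing in for fast (GPU) memory."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()
        self.copies = 0  # number of simulated host->device transfers

    def get(self, expert_id):
        if expert_id in self.slots:
            self.slots.move_to_end(expert_id)  # mark as recently used
            return self.slots[expert_id]
        if len(self.slots) >= self.capacity:
            self.slots.popitem(last=False)     # evict least recently used
        self.copies += 1                       # simulate the RAM->VRAM copy
        self.slots[expert_id] = ram_experts[expert_id].copy()
        return self.slots[expert_id]

cache = DeviceCache(capacity=2)
for eid in [0, 0, 1, 0, 2, 1]:  # a made-up sequence of per-token expert picks
    cache.get(eid)
print(cache.copies)  # prints 4: repeats of a cached expert cost nothing
```

Whether this wins in practice is exactly the open question in the thread: if consecutive tokens rarely reuse experts, nearly every `get` is a miss and the transfers dominate.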
