In a sense it's eight 7B models in a trench coat: it runs about as fast as a 14B (two experts at a time, apparently) but takes up as much memory as a ~40B model (70% * 8 * 7B). There is some routing process trained into it that chooses which experts to use based on the question posed. GPT-4 is allegedly based on the same architecture, but at 8*222B.
> GPT-4 is based on the same architecture, but at 8*222B.
Do we actually know either that it is MoE or that it's that size? IIRC both of those started as outsider guesses that somehow just became accepted knowledge without any actual confirmation.
In a MoE model with experts_per_token = 2 and each expert having 7B params, once the experts are picked it should run as fast as the slowest 7B expert (assuming the two run in parallel), not as a comparable 14B model.
Does anyone here know roughly how an expert gets chosen? It seems like a very open-ended problem, and I'm not sure how it could be implemented easily.
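In most MoE implementations the chooser is a small learned "router": a linear layer that scores every expert for each token, keeps the top-k scores, and mixes the chosen experts' outputs with the softmax of those scores. It's trained end to end with the rest of the network, usually alongside an auxiliary load-balancing loss so a couple of experts don't hog all the traffic. A minimal sketch of that idea (shapes, sizes, and the expert internals here are illustrative, not Mixtral's actual code):

```python
import torch
import torch.nn as nn

class TopKMoELayer(nn.Module):
    """Toy top-k gated mixture-of-experts layer (illustrative only)."""

    def __init__(self, dim=4096, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts, bias=False)  # the learned "chooser"
        # Each expert is just an independent feed-forward block.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                             # x: (num_tokens, dim)
        scores = self.router(x)                       # (num_tokens, num_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)             # mixing weights for the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e           # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

The key detail is that the choice happens per token (and per layer), not per question, so different tokens in the same prompt can end up on different experts.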
It's just a rough estimate, given that these things are fairly linear: the original 7B Mistral was 15 GB and the new one is 86 GB, whereas a fully duplicated 8 * 15 GB would suggest a 120 GB size, so 86/120 ≈ 0.72 of the naive size, suggesting roughly 28% memory savings. This of course doesn't really account for any multi-file vs single-file overhead and such, so it's likely to be a bit off.
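My understanding is the reason it isn't a full 8x is that only the feed-forward blocks are duplicated per expert, while attention, embeddings, etc. are shared. Plugging the sizes above in (rough checkpoint sizes in GB; the rounding is mine):

```python
dense = 15            # original Mistral 7B download, GB
moe = 86              # Mixtral 8x7B download, GB
naive = 8 * dense     # eight fully independent copies would be 120 GB

ratio = moe / naive   # ~0.72
print(f"{ratio:.2f} of the naive size -> roughly {1 - ratio:.0%} saved by sharing")
```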