
When visualised this way, the scale of GPT-3 is insane. I can't imagine what GPT-4 would look like here.


IIRC, GPT-4 would actually be a bit _smaller_ to visualize than GPT-3. Details are not public, but according to the leaks GPT-4 (at least some by-now old version of it) was a mixture of experts, with each expert having around 110B parameters [1]. So while the total parameter count is much bigger than GPT-3's (1800B vs. 175B), it is "just" 16 copies of a smaller (110B-parameter) model. If you wanted to visualize it in any meaningful way, the plot wouldn't grow much bigger. It would if you drew every expert separately, but they are just copies of the same architecture with different weights, which is not all that useful for visualization purposes.

[1] https://medium.com/@daniellefranca96/gpt4-all-details-leaked...
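As a rough sanity check on those leaked (and unconfirmed) numbers, treating GPT-4 as 16 experts of roughly 110B parameters each gives about 1.76T total, in line with the ~1800B figure:

    # Back-of-the-envelope check of the leaked figures (assumed, not confirmed)
    n_experts = 16
    params_per_expert = 110e9                  # ~110B parameters per expert
    total = n_experts * params_per_expert
    print(f"~{total / 1e12:.2f}T parameters")  # ~1.76T, close to the quoted 1800B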


A Mixture of Experts is not just 16 copies of a network; it's a single network in which tokens are routed to different experts for the feed-forward layers, while the attention layers are still shared. There are also interesting choices around how the routing works, and I believe the exact details of what OpenAI is doing are not public. In fact, I think a visualization of that would dispel a ton of myths about what MoEs are and how they work.
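For anyone curious what that routing looks like in practice, here is a minimal sketch (in PyTorch) of a top-2-routed MoE feed-forward layer. The layer sizes, the top-2 choice, and the module names are illustrative assumptions, not OpenAI's unpublished design; in a real Transformer block this would replace the dense FFN while the self-attention layers stay shared across all tokens.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoEFeedForward(nn.Module):
        def __init__(self, d_model=512, d_ff=2048, n_experts=16, top_k=2):
            super().__init__()
            self.top_k = top_k
            # One router (gating network) scores every expert for every token.
            self.router = nn.Linear(d_model, n_experts)
            # Each expert is an ordinary two-layer feed-forward block.
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                              nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            )

        def forward(self, x):                     # x: (batch, seq, d_model)
            tokens = x.reshape(-1, x.shape[-1])   # flatten to (n_tokens, d_model)
            gate_logits = self.router(tokens)
            # Each token picks its top_k experts and mixes their outputs.
            weights, chosen = gate_logits.topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)

            out = torch.zeros_like(tokens)
            for i, expert in enumerate(self.experts):
                for k in range(self.top_k):
                    mask = chosen[:, k] == i      # tokens whose k-th pick is expert i
                    if mask.any():
                        out[mask] += weights[mask, k].unsqueeze(-1) * expert(tokens[mask])
            return out.reshape(x.shape)

Only the selected experts run for a given token, which is why the compute per token stays close to that of a single ~110B dense model even though the total parameter count is far larger.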



