MoEs would work well with this paradigm. The whole point is to have discrete, fully-separate experts, so if you train on one task and I train on another, our patches likely won't touch the same experts even without any special tricks. You could even go so far as to patch the dispatch layer and plug in a brand-new expert (or several). MoEs would be able to accumulate lots of patches and merge them with little difficulty. If this paradigm catches on, it might well justify MoEs on its own, regardless of the touted benefits of more efficient training & much cheaper forward passes.
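To make the merge story concrete, here is a minimal, purely illustrative sketch in Python/NumPy (names and structure are my own assumptions, not anyone's actual patching API): if a "patch" is just a dict of per-expert weight deltas, then patches from different tasks that routed to different experts merge by simple dictionary union, and conflicts are immediately visible as key collisions.

```python
import numpy as np

def merge_patches(*patches):
    """Merge MoE patches (expert index -> weight delta); flag any expert edited twice."""
    merged = {}
    for patch in patches:
        for expert_idx, delta in patch.items():
            if expert_idx in merged:
                raise ValueError(f"conflict: expert {expert_idx} edited by two patches")
            merged[expert_idx] = delta
    return merged

def apply_patch(expert_weights, patch):
    """Add each delta onto the corresponding expert's weights, leaving the rest untouched."""
    return {i: w + patch.get(i, 0.0) for i, w in expert_weights.items()}

# Toy example: 8 experts, two patches trained independently on different tasks.
experts = {i: np.zeros((4, 4)) for i in range(8)}
patch_a = {1: np.full((4, 4), 0.1), 5: np.full((4, 4), 0.2)}   # my task touched experts 1 & 5
patch_b = {2: np.full((4, 4), -0.3)}                           # your task touched expert 2
experts = apply_patch(experts, merge_patches(patch_a, patch_b))  # disjoint experts: merges cleanly
```

The point of the sketch is just that sparse routing makes "which weights did this patch touch?" a short, explicit list, so merging is bookkeeping rather than arithmetic compromise.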
Perceiver would have more trouble. Perceiver is like an RNN for Transformers: it's relatively few weights, applied repeatedly & intensively to a small latent encoding knowledge about the input. Even with tricks, patches are going to fight over how to change those shared weights and the encoded knowledge. A few patches might work, but a lot of them will get ugly.
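By contrast, here is an equally toy sketch (again an assumption-laden illustration, not a real Perceiver patching scheme) of why shared weights make merging painful: every patch is a delta on the *same* matrix, so a naive merge has to average them, and each task's edit gets diluted and distorted by everyone else's.

```python
import numpy as np

rng = np.random.default_rng(0)

# All latent-processing weights are shared, so every patch targets the same matrix.
shared_w = np.zeros((4, 4))
patch_a = rng.normal(size=(4, 4)) * 0.1   # delta from my task
patch_b = rng.normal(size=(4, 4)) * 0.1   # delta from your task

merged = shared_w + (patch_a + patch_b) / 2   # naive average-merge of conflicting patches

# Task A's intended edit is no longer what the merged model contains:
# it is halved and perturbed by task B's delta, and the distortion only
# grows as more patches pile onto the same shared weights.
drift_a = np.linalg.norm((merged - shared_w) - patch_a)
print(drift_a)  # nonzero: task A's patch got partially overwritten
```

Nothing clever is going on here; it's just that "who touched what" collapses to "everyone touched everything," which is exactly the regime where a few patches are tolerable and many are not.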