The latter two aim to reduce latency with Mixture-of-Experts (MoE) models. If the results of the former hold, they show that with a simple algorithmic transformation you can merge two independently trained models in weight space and get performance functionally equivalent to a model trained monolithically.
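To make the weight-space merging idea concrete, here is a minimal sketch of the most naive version: element-wise parameter averaging, in PyTorch. To be clear, this is my own illustration of the general idea, not the actual transformation from the paper (which I haven't dug into yet); the tiny model and the plain averaging scheme are assumptions standing in for the real thing.

```python
import torch
import torch.nn as nn

# Stand-in architecture; imagine two copies trained independently
# on different data shards.
class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

    def forward(self, x):
        return self.net(x)

model_a, model_b = TinyMLP(), TinyMLP()
# ... each model would be trained separately here ...

# Naive weight-space merge: average every parameter tensor element-wise.
merged = TinyMLP()
merged_state = {
    name: (model_a.state_dict()[name] + model_b.state_dict()[name]) / 2
    for name in model_a.state_dict()
}
merged.load_state_dict(merged_state)
```

My understanding is that plain averaging like this generally only works when the two models sit in the same loss basin (e.g., fine-tuned from a shared initialization); for truly independently trained models you usually need some alignment step first, such as permuting neurons to match, which is presumably where the "simple algorithmic transformation" comes in.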
I too would like to go down this rabbit hole. I am going to poke around using the terms “distributed learning” and “federated learning” (they’re different areas, but somewhat related, as far as I understand).