My own experience has been that the performance differences among embedding models largely disappear if you take the time to fine-tune the hyperparameters for each one individually, which is not something that is typically done when reporting performance results in the literature.
I haven't read the paper closely, but it seems like there's room for that to have happened here: a search for "hyperparameter" yields one mention, where they describe the choice of algorithm as itself being a hyperparameter. I only skimmed the methods, but I didn't notice any mention there of individually optimizing each of the embedding models they tested, either.
So, unless I'm missing something, there's plenty of room to question whether this approach would significantly outperform a single well-tuned embedding model. It may be that it's mostly useful for ensembling a bunch of off-the-shelf pre-trained models.
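To make concrete what I mean by tuning each model individually, here's a rough sketch (not anything from the paper; the embed_fns mapping and the search space are made up) of giving every embedding model its own search before comparing them:

    # Sketch: give each embedding model its own hyperparameter search before
    # comparing, instead of one fixed configuration for all of them.
    # embed_fns maps a (hypothetical) model name to a function that turns a
    # list of texts into a 2-D feature array.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    def compare_embeddings(embed_fns, texts, labels):
        scores = {}
        for name, embed in embed_fns.items():
            X = embed(texts)  # (n_samples, n_dims)
            # Every embedding gets its own downstream search, so the comparison
            # reflects a reasonably well-tuned setup for each model.
            search = GridSearchCV(
                LogisticRegression(max_iter=1000),
                param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
                cv=5,
                scoring="accuracy",
            )
            search.fit(X, np.asarray(labels))
            scores[name] = search.best_score_
        return scores

The comparison only tells you something about the models themselves once each one has been given a fair shot at a good configuration.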
We've definitely seen that tuning the embedding hyperparameters (along with the others) can have a significant impact on performance. 
Additionally, whenever you open up the space of tunable parameters to include the embeddings or feature representations themselves, you can usually significantly outperform just a well-tuned classifier.
This model seems to trade complexity in tuning for complexity in the ensemble, but I wonder what would happen if you tried to have your cake and eat it too and just tuned everything.
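Roughly what I mean by "tuning everything" is a single search over both the representation hyperparameters and the training hyperparameters, along the lines of the sketch below. It assumes fastText's supervised mode and Optuna; the file paths and search ranges are placeholders, not anything from the paper:

    # Sketch of "tuning everything": jointly search the representation
    # hyperparameters (dim, word n-grams) and the training hyperparameters
    # (lr, epochs) rather than fixing the embeddings and only tuning the
    # classifier. File paths and search ranges are placeholders.
    import fasttext
    import optuna

    def objective(trial):
        model = fasttext.train_supervised(
            input="train.txt",                                 # placeholder path
            dim=trial.suggest_int("dim", 50, 300, step=50),    # embedding size
            wordNgrams=trial.suggest_int("wordNgrams", 1, 3),  # representation choice
            lr=trial.suggest_float("lr", 0.05, 1.0, log=True),
            epoch=trial.suggest_int("epoch", 5, 50),
        )
        _, precision, _ = model.test("valid.txt")              # placeholder path
        return precision

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=50)
    print(study.best_params)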
The downside is a very heavy slow-down in training and prediction.
After making fastText, Facebook should have called this implementation slowtext. I'm surprised there isn't more discussion of training speed in the paper.