
Improving AI language understanding by combining multiple word representations - jamesjue
https://code.fb.com/ai-research/dynamic-meta-embeddings/
======
bunderbunder
Point of armchair skepticism --

My own experience has been that the performance differences among the
different embedding models largely disappear if you take the time to
fine-tune the hyperparameters for each one individually, which is not
something that is typically done when reporting performance results in the
literature.

I haven't read the paper closely, but it seems like there's room for that to
have happened here: a search for "hyperparameter" yields one mention, where
they describe the choice of algorithm as itself being a hyperparameter. I only
skimmed the methods, and didn't notice any mention of individually optimizing
each of the embedding models they tested, either.

So, unless I'm missing something, that means there's plenty of room to
question whether this approach would significantly outperform a single well-
tuned embedding model. It might be that it's mostly useful for ensembling
together a bunch of off-the-shelf pre-trained models.

~~~
Zephyr314
It would be interesting to see what would happen if you also tried to tune the
ensemble towards a specific task in the same way that you could tune a single
model.

We've definitely seen that tuning the embedding hyperparameters (along with
the others) can have a significant impact on performance. [1]

Additionally, whenever you open up the space of tunable parameters to include
the embeddings or feature representations themselves, you can usually
significantly outperform just a well-tuned classifier. [2]

This model seems like it trades off complexity in tuning for complexity of an
ensemble, but I wonder what would happen if you tried to have your cake and
eat it too and just tuned everything.

[1]: https://aws.amazon.com/blogs/machine-learning/fast-cnn-tuning-with-aws-gpu-instances-and-sigopt/

[2]: https://blog.sigopt.com/posts/unsupervised-learning-with-even-less-supervision-using-bayesian-optimization
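
For concreteness, here is a rough sketch of what "just tune everything" might
look like, using Optuna as a stand-in optimizer. The search space, embedding
names, and the train_and_score stub are all made up for illustration; the
point is only that the embedding choice lives in the same search space as the
usual classifier knobs.

    import optuna

    def objective(trial):
        # Treat the choice of embedding model as just another hyperparameter,
        # alongside the usual encoder/classifier knobs.
        embedding = trial.suggest_categorical(
            "embedding", ["fasttext", "glove", "word2vec"])
        hidden_dim = trial.suggest_int("hidden_dim", 128, 1024, log=True)
        lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
        dropout = trial.suggest_float("dropout", 0.0, 0.5)
        return train_and_score(embedding, hidden_dim, lr, dropout)

    def train_and_score(embedding, hidden_dim, lr, dropout):
        # Stand-in for "train on the task with this config and return dev
        # accuracy" -- replace with a real training loop for your dataset.
        return 0.0

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=100)
    print(study.best_params)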

------
fizx
It's not surprising that an ensemble of embeddings beats single embeddings. I
can't imagine it beats a single embedding fine-tuned to the task at hand,
given sufficient training data. Where ensemble embeddings might shine is when
there is sufficient data to determine how to weight the ensemble, but
insufficient data to effectively fine-tune. I'm not sure how often this case
comes up.
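
To make the "just enough data to weight the ensemble" case concrete, here is a
minimal PyTorch sketch of learning a softmax weighting over a couple of frozen
pretrained lookup tables. It is only illustrative of the general idea, not
Facebook's exact model; the dimensions and names are placeholders.

    import torch
    import torch.nn as nn

    class WeightedMetaEmbedding(nn.Module):
        """Project frozen embeddings to a shared size and learn how much to
        weight each one. A rough sketch, not the paper's exact method."""

        def __init__(self, embeddings, proj_dim=256):
            super().__init__()
            self.embeddings = nn.ModuleList(embeddings)
            for emb in self.embeddings:
                emb.weight.requires_grad = False  # keep pretrained tables frozen
            self.projections = nn.ModuleList(
                nn.Linear(emb.embedding_dim, proj_dim) for emb in self.embeddings)
            self.scores = nn.Parameter(torch.zeros(len(embeddings)))  # one weight per table

        def forward(self, token_ids):
            # token_ids: (batch, seq_len); assumes the tables share a vocabulary.
            stacked = torch.stack(
                [proj(emb(token_ids))
                 for emb, proj in zip(self.embeddings, self.projections)])
            weights = torch.softmax(self.scores, dim=0)
            return (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)

    # Toy usage with two random stand-ins for pretrained tables.
    glove_like = nn.Embedding(1000, 300)
    fasttext_like = nn.Embedding(1000, 300)
    meta = WeightedMetaEmbedding([glove_like, fasttext_like])
    out = meta(torch.randint(0, 1000, (4, 12)))  # -> (4, 12, 256)

Only the handful of projection and score parameters are learned here, which is
why this kind of weighting could plausibly work with less task data than fully
fine-tuning a single embedding.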

------
bratao
Another popular and promising approach is using contextual word
representations, such as ELMo
([https://allennlp.org/elmo](https://allennlp.org/elmo) - RIP Paul Allen) and
Flair
([https://github.com/zalandoresearch/flair](https://github.com/zalandoresearch/flair)).

The downside is a significant slowdown in training and prediction.
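
For anyone who wants to try it, stacking a classic lookup table with
contextual embeddings is only a few lines in Flair (roughly following its
README; the model names and API may differ across versions):

    from flair.data import Sentence
    from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings

    # Combine GloVe vectors with forward/backward contextual Flair embeddings.
    stacked = StackedEmbeddings([
        WordEmbeddings("glove"),
        FlairEmbeddings("news-forward"),
        FlairEmbeddings("news-backward"),
    ])

    sentence = Sentence("The grass is green .")
    stacked.embed(sentence)

    for token in sentence:
        print(token.text, token.embedding.shape)

The forward passes through the character-level language models behind the
Flair embeddings are presumably most of the slowdown mentioned above.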

------
minimaxir
> We use 256-dimensional embedding projections, 512-dimensional BiLSTM
> encoders and an MLP with 512-dimensional hidden layer in the classifier.

After having made fastText, Facebook should have called this implementation
_slow_ text. I'm surprised there isn't more discussion of training speed in
the paper.
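
For reference, the quoted sizes describe a fairly ordinary sentence encoder. A
rough PyTorch sketch of that shape (the pooling choice and other details are
guesses, not taken from the paper):

    import torch
    import torch.nn as nn

    class SentenceClassifier(nn.Module):
        """Embedding projection -> BiLSTM encoder -> MLP classifier, sized as
        in the quote (256 / 512 / 512). Details are illustrative only."""

        def __init__(self, embedding_dim=300, num_classes=3):
            super().__init__()
            self.project = nn.Linear(embedding_dim, 256)    # 256-dim projection
            self.encoder = nn.LSTM(256, 512, batch_first=True,
                                   bidirectional=True)      # 512-dim BiLSTM
            self.classifier = nn.Sequential(                # MLP, 512-dim hidden layer
                nn.Linear(2 * 512, 512),
                nn.ReLU(),
                nn.Linear(512, num_classes),
            )

        def forward(self, embedded):
            # embedded: (batch, seq_len, embedding_dim) word vectors.
            hidden, _ = self.encoder(self.project(embedded))
            pooled, _ = hidden.max(dim=1)  # max-pool over time (a guess)
            return self.classifier(pooled)

    model = SentenceClassifier()
    logits = model(torch.randn(4, 20, 300))  # -> (4, num_classes)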

------
yters
Ensemble models have higher VC dimension and are more prone to overfitting.

~~~
mightybird707
I thought the whole point of ensemble models was to reduce variance.

