Hacker News new | past | comments | ask | show | jobs | submit login
Improving AI language understanding by combining multiple word representations (fb.com)
86 points by jamesjue 5 months ago | hide | past | web | favorite | 7 comments

Point of armchair skepticism --

My own experience has been that the performance differences among the different embedding models largely disappear if you take the time to individually fine-tune the hyperparameters for each one. Which is not something that is typically done when reporting performance results in the literature.

I haven't really read the paper, but it seems like there's room for that to have happened here - a search for "hyperparameter" yields one mention, where they describe the choice of algorithm as itself being a hyperparameter. I only skimmed the methods, but didn't notice any mention about individually optimizing each of the embedding models they tested with there, either.

So, unless I'm missing something, that means there's plenty of room to question whether this approach would significantly outperform a single well-tuned embedding model. It might be that it's mostly useful for ensembling together a bunch of off-the-shelf pre-trained models.

It would be interesting to see what would happen if you also tried to tune the ensemble towards a specific task in the same way that you could tune a single model.

We've definitely seen that tuning the embedding hyperparameters (along with the others) can have a significant impact on performance. [1]

Additionally, whenever you open up the space of tunable parameters to include the embeddings or feature representations themselves you can usually significantly outperform just a well tuned classifier. [2]

This model seems like it trades off complexity in tuning for complexity of an ensemble, but I wonder what would happen if you tried to have your cake and eat it too and just tuned everything.

[1]: https://aws.amazon.com/blogs/machine-learning/fast-cnn-tunin...

[2]: https://blog.sigopt.com/posts/unsupervised-learning-with-eve...

It's not surprising that an ensemble of embeddings beats single embeddings. I can't imagine it beats a single embedding fine-tuned to the task at hand, given sufficient training data. Where ensemble embeddings might shine is when there is sufficient data to determine how to weight the ensemble, but insufficient data to effectively fine-tune. I'm not sure how often this case comes up.

Another popular and promising approach is using contextual word representations, such as ELMo ( https://allennlp.org/elmo - RIP Paul Allen) and Flair (https://github.com/zalandoresearch/flair)

The downside is a very heavy slow-down in training/prediction.

> We use 256-dimensional embedding projections, 512-dimensional BiLSTM encoders and an MLP with 512-dimensional hidden layer in the classifier.

After having made fasttext, Facebook should have called this implementation slowtext. I'm surprised there isn't more discussion of training speed in the paper.

Ensemble models have higher VC dimension and are more prone to overfitting.

I thought the whole point of ensemble models is to reduce the variance

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact