GloVe: Global Vectors for Word Representation (stanford.edu)
22 points by ot on Oct 9, 2014 | hide | past | favorite | 7 comments

This is really interesting work and a good paper. But it should also be mentioned that some concerns have been raised about the evaluation. Troublesome as this is, it is a common problem with word-representation papers, since there is not yet a solid, standardized way to approach evaluation.


Yoav Goldberg's comments - and those of the anonymous reviewers - were indeed very useful in encouraging us to do a better job at the evaluation in the paper. These were comments on the submission version, and so the improvements were included in the camera-ready version following the usual procedure.

To say that there were troublesome concerns about the evaluation is a bit too strong. Yoav's experiment showed that, for the one data setup he ran, training both models on the same data, GloVe outperformed word2vec by a smaller margin than in our results, where we compared against the publicly released word2vec vectors. But it still outperformed word2vec. And Yoav's comparison isn't the last word: in his experiments, he ran GloVe for only 15 iterations, but, as we already knew and were taking advantage of, GloVe's performance continues to improve over many more iterations. This is now documented much more clearly in the final version of the paper (see fig. 4).

But at the end of the day, these numeric differences were never the point of the paper. The contribution of the paper is to show that the kind of good results word2vec gets with online learning on a token stream can also be achieved by working from a global co-occurrence count matrix, more in the style of the traditional SVD, but with a changed loss function and frequency scaling, and that you could expect working this way to be somewhat more statistically efficient. Yoav has actually been involved in some very interesting work along the same lines himself: https://levyomer.files.wordpress.com/2014/09/neural-word-emb...
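For concreteness, here is a minimal sketch of the kind of weighted least-squares objective GloVe fits over a global co-occurrence matrix. This is not the paper's code: the dimensions, random data, and initialization are toy assumptions of my own, only the form of the objective follows the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                                          # toy vocabulary size, embedding dim
X = rng.integers(1, 50, size=(V, V)).astype(float)   # toy co-occurrence counts (all > 0)

W  = 0.1 * rng.standard_normal((V, d))   # word ("input side") vectors
Wc = 0.1 * rng.standard_normal((V, d))   # context ("output side") vectors
b  = np.zeros(V)                          # word biases
bc = np.zeros(V)                          # context biases

def weight(x, x_max=100.0, alpha=0.75):
    # Frequency scaling f(x): down-weights rare pairs, caps very frequent ones.
    return np.minimum((x / x_max) ** alpha, 1.0)

def glove_loss(W, Wc, b, bc, X):
    # J = sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2
    err = W @ Wc.T + b[:, None] + bc[None, :] - np.log(X)
    return np.sum(weight(X) * err ** 2)

print(glove_loss(W, Wc, b, bc, X))
```

The f(X_ij) weighting is what does the frequency scaling: rare (noisy) pairs contribute little, and the influence of very frequent pairs is capped.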

@jo_9: We don't use word2vec for training, only for experimental comparison, and the style of training is fairly different.

Dr. Pennington? I really like the paper, as I said ("This is really interesting work and a good paper."), but I disagree that "troublesome" would be too strong. The original version claimed an 11% improvement; Professor Goldberg found it to be more along the lines of 2-3%. Even if this is not the point of the paper, it is how the paper was empirically evaluated, it was the key performance statement in the original abstract, and it could mislead a reader into expecting larger model improvements. Nevertheless, I am a fan of your work and excited to see what you publish next.

The original version of the paper was what Yoav was criticizing. It's worth noting that the authors made nontrivial changes to address most of Yoav's comments (and as a result, ended up with a much higher quality paper).

I specialize in word representations and in using them for various tasks. Word-similarity prediction has been used as a basic first evaluation for many years now, with analogies becoming an additional standard task in the past couple of years. But it's worth noting that word representations have a LOT of open parameters (which model? how many dimensions? do I remove stopwords and low-frequency words beforehand? do I use a bag-of-words context or a syntactic context?).

The optimal parameter choices for one task are very frequently not the optimal parameters for another. While there are usually "reasonable defaults" for when you don't want to optimize everything, a solid standardized approach risks vastly overfitting to one task, possibly at the expense of more useful tasks.
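As a concrete illustration of the analogy task mentioned above, here is a toy sketch of the standard vector-arithmetic evaluation: for "a is to a* as b is to ?", pick the vocabulary word closest to vec(a*) - vec(a) + vec(b), excluding the query words. The vocabulary and vector values are made up for illustration, not trained embeddings.

```python
import numpy as np

vocab = ["king", "queen", "man", "woman"]
E = np.array([
    [0.9, 0.8],   # king
    [0.9, 0.2],   # queen
    [0.1, 0.8],   # man
    [0.1, 0.2],   # woman
], dtype=float)
E /= np.linalg.norm(E, axis=1, keepdims=True)   # unit-normalize rows
idx = {w: i for i, w in enumerate(vocab)}

def analogy(a, a_star, b):
    # Nearest neighbor (by cosine similarity) of vec(a*) - vec(a) + vec(b).
    target = E[idx[a_star]] - E[idx[a]] + E[idx[b]]
    target /= np.linalg.norm(target)
    scores = E @ target
    for w in (a, a_star, b):                     # exclude the query words
        scores[idx[w]] = -np.inf
    return vocab[int(np.argmax(scores))]

print(analogy("man", "woman", "king"))           # expected to pick "queen"
```

Even this tiny example shows why the open parameters matter: the answer depends entirely on the geometry the training choices induce.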

I was not aware of these updates and will have to read the new version. It is great that non-reviewer feedback can have such an impact on a paper before it is published and presented. Was the camera-ready version also changed?

The most remarkable take-away from this whole genre of word embedding is that just by taking 'dumb averages' of word contexts and then optimizing 'vector[word]' on the input (and output) sides, you end up with a SEMANTIC understanding of the English language in the word vectors.

This paper is the latest in the series (across multiple researchers), and seems to boil the task down to its bare minimum: just a raw least-squares optimization works. And rather than the amount of 'linguistic knowledge' smuggled into the problem set-up increasing over time (initially, in the 2003-era papers, people used tree embeddings and WordNet bootstrapping), this work gets rid of almost all structure, and ends up with better results.

So, instead of semantics being a naturally very deep problem, apparently common sense understanding can be derived from surface statistics. IMHO, more people should be excited about this (from an AI standpoint).
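A minimal sketch of that "raw least-squares" idea, with toy data and hyperparameters that are my own assumptions rather than any particular paper's recipe: fit input-side and output-side vectors so their inner products approximate log co-occurrence counts, and watch the reconstruction error fall.

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, lr = 6, 4, 0.1                                  # toy sizes, learning rate
X = rng.integers(1, 20, size=(V, V)).astype(float)    # toy co-occurrence counts
logX = np.log(X)

W = 0.1 * rng.standard_normal((V, d))   # input-side vectors
C = 0.1 * rng.standard_normal((V, d))   # output-side (context) vectors

losses = []
for step in range(500):
    err = W @ C.T - logX                 # residual of the least-squares fit
    gW = (2.0 / X.size) * err @ C        # gradients of mean squared error
    gC = (2.0 / X.size) * err.T @ W
    W -= lr * gW
    C -= lr * gC
    losses.append(np.mean(err ** 2))

print(losses[0], losses[-1])             # reconstruction error should shrink
```

Nothing in the loop knows anything about syntax or meaning; whatever "semantics" the vectors end up with comes purely from the co-occurrence statistics being fit.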

How is this different from the heretofore prolific Word2Vec? I see they mention it but don't provide information about how it is distinct from their approach.

EDIT: My fault. I was only reading through the site instead of the paper. It looks like they utilize a similar approach to training (even making use of Word2Vec), but their approach involves using a smaller, specially chosen subset of the data to improve the robustness of the comparison between two word vectors.
