To say that there were troublesome concerns about the evaluation is a bit too strong. Yoav's experiment showed that, for the one data setup he ran, training both models on the same data, GloVe outperformed word2vec by a smaller margin than in our results, where we compared against the publicly released word2vec vectors. But it still outperformed word2vec. And Yoav's comparison isn't the last word: in his experiments, he ran GloVe for only 15 iterations, but, as we already knew and were taking advantage of, GloVe's performance continues to improve over many more iterations. This is now much more clearly documented in the final version of the paper (see fig. 4).
But at the end of the day, these numeric differences were never the point of the paper. The contribution of the paper is to show that the kind of good results word2vec gets with online learning on a token stream can also be achieved by working from a global co-occurrence count matrix, more in the style of the traditional SVD, but with a changed loss function and frequency scaling, and that working this way can be expected to be somewhat more statistically efficient. Yoav has actually been involved in some very interesting work along the same lines himself: https://levyomer.files.wordpress.com/2014/09/neural-word-emb...
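To make the "co-occurrence matrix plus a changed loss and frequency scaling" idea concrete, here is a minimal sketch of a GloVe-style weighted least-squares objective. This is not the authors' code, just an illustration in plain Python; the weighting function caps the influence of very frequent pairs, with x_max=100 and alpha=0.75 as the defaults reported in the paper, and all vectors and counts below are toy values.

```python
import math
import random

def f(x, x_max=100.0, alpha=0.75):
    # Frequency scaling: down-weights rare pairs, saturates at 1 for
    # counts above x_max so very common pairs don't dominate.
    return min((x / x_max) ** alpha, 1.0)

def glove_loss(W, W_ctx, b, b_ctx, X):
    # Sum of f(X[i][j]) * (w_i . w~_j + b_i + b~_j - log X[i][j])^2
    # over the nonzero entries of the co-occurrence matrix X.
    total = 0.0
    for i, row in enumerate(X):
        for j, x_ij in enumerate(row):
            if x_ij == 0:
                continue
            dot = sum(wi * wj for wi, wj in zip(W[i], W_ctx[j]))
            err = dot + b[i] + b_ctx[j] - math.log(x_ij)
            total += f(x_ij) * err * err
    return total

# Toy setup: 4 words, 2-dimensional vectors, random co-occurrence counts.
random.seed(0)
V, d = 4, 2
X = [[random.randint(0, 9) for _ in range(V)] for _ in range(V)]
W = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(V)]
W_ctx = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(V)]
b = [0.0] * V
b_ctx = [0.0] * V
print(glove_loss(W, W_ctx, b, b_ctx, X))
```

Because the objective only runs over nonzero entries of X, each pass costs time proportional to the number of observed co-occurrence pairs, not to corpus length, which is where the statistical-efficiency argument comes from.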
@jo_9: We don't use word2vec for training, only for experimental comparison, and the style of training is fairly different.
I specialize in word representations and using them for various tasks. Word similarity prediction has been used as a basic first evaluation for many years now, with analogies becoming an additional standard task in the past couple of years. But it's worth noting that word representations have a LOT of open parameters (which model? how many dimensions? do I remove stopwords and low-frequency words beforehand? do I use a bag-of-words context or a syntactic context?).
The optimal parameter choices for one task are very frequently not the optimal parameters for another. While there are usually "reasonable defaults" for when you don't want to optimize everything, a single standardized setup risks vastly overfitting to one task, possibly at the expense of more useful tasks.
This paper is the latest in a series (across multiple researchers), and seems to boil the task down to its bare minimum: just a raw least-squares optimization works. And instead of the amount of 'linguistic knowledge' smuggled into the problem set-up increasing (initially, in the 2003 papers, people used tree embeddings and WordNet bootstrapping), this gets rid of almost all of that structure, and ends up with better results.
So, instead of semantics being an inherently deep problem, a fair amount of common-sense understanding can apparently be derived from surface statistics. IMHO, more people should be excited about this (from an AI standpoint).
EDIT: My fault. I was only reading through the site instead of the paper. It looks like they use a similar approach to training (even making use of word2vec for comparison), but their approach uses a smaller, specially chosen subset of the data to make the comparison between the two sets of word vectors more robust.