Word embeddings in 2017: Trends and future directions (ruder.io)
158 points by stablemap on Oct 21, 2017 | 25 comments

https://github.com/kudkudak/word-embeddings-benchmarks has a pretty nice evaluation of existing embedding methods. Notably missing from this article are GloVe (https://nlp.stanford.edu/projects/glove/) and LexVec (https://github.com/alexandres/lexvec), both of which tend to outperform word2vec in both intrinsic and extrinsic tasks. Also of interest are methods that perform retrofitting, i.e. improving already trained embeddings. Morph-fitting (ACL 2017) is a good example. Hashimoto et al. (2016) offer some interesting insight into how embedding methods perform metric recovery. Lots of exciting stuff in this area.

Alex Gittens also has a nice paper this year showing how Skipgram enables vector additivity. See http://www.aclweb.org/anthology/P17-1007

No mention of StarSpace (from Facebook)? It figures, with the rapid pace of innovation these days.

StarSpace can compute 6 types of entity embeddings, of which word embeddings are just one type. It's a whole family of algorithms.


Note for those to whom it's relevant: this is not usable in a commercial setting.

I don't really understand the implications of this license. Does it forbid using the resulting vectors for commercial purposes? Or does it only forbid things like packaging their code into a product or offering to run it as a service?

That particular implementation isn’t (FAIR research code, non-commercial), but are there known patents on the algorithms that would prevent a clean-room implementation from being used in a commercial setting?

Given that this was a discussion of word vectors specifically it seems reasonable. StarSpace doesn't have anything new in terms of word vectors.

Good point. I've added a reference to StarSpace anyway as it's useful for applications outside of NLP.

My question is what are they really good for.

I mean, king = queen - woman + man

That's the kind of thing we have ontologies for.

This article mentions that word embeddings are useful inside translators, but from the viewpoint of somebody who wants to extract meaning from text, what use is something that doesn't handle polysemy and phrases?
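For readers unfamiliar with the king/queen arithmetic mentioned above, here is a minimal sketch of how such an analogy is usually evaluated: compute the offset vector and pick the nearest remaining word by cosine similarity. The 3-dimensional vectors are invented for illustration and are not trained embeddings; the `analogy` helper is hypothetical.

```python
import math

# Toy vectors, made up for this example (assumption: real embeddings
# would have hundreds of dimensions and come from a trained model).
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# queen - woman + man
result = analogy("queen", "woman", "man", vectors)
print(result)
```

Excluding the three input words from the candidates is the usual convention in analogy benchmarks; without it, one of the inputs is often the nearest neighbour.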

Word embeddings (or subword embeddings) are used in nearly all recent NLP algorithms, both shallow (e.g. FastText) and deep (e.g. Google's Neural Machine Translation). Unless you're using a basic bag-of-words approach, you need to translate your words into some vector format, so you probably want some kind of embeddings. In practice, the state-of-the-art approaches for translation, language modeling, classification (e.g. sentiment analysis), etc. all sit on top of embeddings.

It's not the job of word embeddings to handle phrases, but nearly all modern phrase embedding algorithms sit on top of word embeddings. They often create a weighted average of embeddings by using an attention model, or they can use a more complex model such as an LSTM with attention (e.g. CoVe - https://arxiv.org/abs/1708.00107 ).
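As a rough illustration of the "weighted average by attention" idea, here is a minimal sketch, assuming dot-product scores against a single query vector followed by a softmax. In a real model the query would be learned; here it is a hypothetical hand-picked value, and the word vectors are toy data.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(word_vectors, query):
    """Average word vectors, weighting each by softmax(dot(vector, query))."""
    scores = [sum(v * q for v, q in zip(vec, query)) for vec in word_vectors]
    weights = softmax(scores)
    dim = len(query)
    pooled = [
        sum(weight * vec[i] for weight, vec in zip(weights, word_vectors))
        for i in range(dim)
    ]
    return pooled, weights

# Toy 2-dimensional "embeddings" for a three-word phrase (made up).
phrase = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]  # hypothetical stand-in for a learned attention query

pooled, weights = attention_pool(phrase, query)
print(weights, pooled)
```

Words whose vectors align with the query get more weight, so the phrase vector is dominated by them rather than being a plain mean.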

Word embeddings can handle polysemy - high dimensional vectors can (and do) hold information of various types that is used in different contexts in different ways. Some approaches deal with this more directly (e.g. including part-of-speech as part of the vocab item), and that sometimes can help a bit.

Maybe I'm not reading it right, but that arXiv paper about CoVe doesn't seem to be getting anywhere near commercially useful results.

For instance, the random result for IMDb in Table 2 is 88.4 and the best one is 92.1; that's really not a lot of lift. I could see the TREC-6 and TREC-50 results being good enough to let off the leash, but I still have a hard time picturing this being useful in the real world.

Oh, BTW, the "random" result means randomly initialized vectors. It's still using embeddings, just not pretrained ones.

I have a version that gets 94.5 on IMDB although not published yet.

It's incredibly useful for search, given the property that similar words are close in the vector space. And given that it's purely numbers, it's really fast to compute.

To see an example, type "fuel" in the search input on this page: https://openvoyce.com//products/quuu

You'll see many relevant results, none of them using the word "fuel". This is done purely with postgres, computing an L2 distance sort - no elasticsearch.

Would you be willing to go into a little more detail about what this is actually doing?

What is the shape of the database? Do you normalize each document into a single vector which is compared, or are you keeping per-word vectors? I'm imagining you probably don't have a database with a row for every word, but maybe you do?

How do you pre-filter the list of documents to compute L2 for? If no pre-filtering, can this approach scale into millions of documents?

The main trick is to average the word embeddings in a given document, an idea I took from the YouTube paper on recommendation engines [1].

I have a separate service that contains the word embeddings, generated with word2vec. The idea is to generate an embedding for the document by averaging the embeddings of the words it contains, each with a coefficient based on the word's rarity (so a rarer word has more weight than a stop word).
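A minimal sketch of that rarity-weighted average follows. The exact coefficient is not given in the comment, so this assumes an IDF-style weight, log(N / document frequency), which gives words that appear in every document a weight of zero. The function names and toy data are hypothetical.

```python
import math

def rarity_weights(documents):
    """IDF-style weights (an assumed formula): log(n_docs / doc_frequency)."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

def embed_document(tokens, embeddings, weights):
    """Weighted average of the known words' vectors; unknown words are skipped."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token not in embeddings:
            continue  # ignore words the embedding service doesn't know
        weight = weights.get(token, 0.0)
        for i, value in enumerate(embeddings[token]):
            total[i] += weight * value
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # nothing but unknown or zero-weight words
    return [value / weight_sum for value in total]

# Toy corpus and 2-dimensional vectors, made up for illustration.
corpus = [["the", "fuel", "price"], ["the", "diesel", "engine"], ["the", "new", "category"]]
embeddings = {"the": [1.0, 1.0], "fuel": [0.0, 1.0], "price": [1.0, 0.0]}

weights = rarity_weights(corpus)
doc_vec = embed_document(["the", "fuel", "unknownword"], embeddings, weights)
print(doc_vec)
```

Here "the" occurs in every document, so it contributes nothing and the document vector equals the vector for "fuel".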

When saving a document, OpenVoyce contacts this API and asks it to generate an embedding for the document, then saves only that in its own database (as a "cube" vector of 200 dimensions).

From there, searching for something new is just a matter of asking for an embedding of the search terms and using `cube_distance()` [2] as the sort function. It does not require pre-filtering since stop words are already weighted down (although there is some filtering in the API, as it ignores words it doesn't know).
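The ranking step can be sketched outside the database as well. This is a plain-Python stand-in for sorting by Euclidean (L2) distance, not the actual OpenVoyce query; the `search` helper, document ids and 2-dimensional vectors are all hypothetical.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search(query_vec, documents, limit=10):
    """Return document ids ordered by increasing L2 distance to the query."""
    ranked = sorted(documents.items(), key=lambda item: l2_distance(item[1], query_vec))
    return [doc_id for doc_id, _ in ranked[:limit]]

# Toy document embeddings (made up); in the setup described above these
# would be the stored per-document vectors.
documents = {
    "diesel": [0.9, 0.1],
    "essential oils": [0.5, 0.5],
    "categories": [0.0, 1.0],
}

results = search([1.0, 0.0], documents, limit=2)
print(results)
```

This is a full scan over every stored vector, which is why the scaling question raised above matters once the table grows.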

It would still help to be able to define user-specific stop words, though. For example, on Quuu's OpenVoyce, most suggestions are about adding new categories, so "category" should be considered a stop word. That's something I plan to implement.

I can't tell yet how it scales to millions of records because we're very far from there for now (there are 4500+ suggestions and comments on OpenVoyce at present). My bet is that if the amount of data becomes a problem, it may be fixed by reducing the number of dimensions of the vectors.

Oh, there's also something to know: the cube extension for postgres doesn't allow more than 100 dimensions. This is configurable, but only by editing a header file in the extension (that's the author's recommended method). I've detailed the problem and solution in my pg350d repo [3]

[1] https://static.googleusercontent.com/media/research.google.c...

[2] https://www.postgresql.org/docs/current/static/cube.html

[3] https://github.com/oelmekki/postgres-350d

I think the results for "fuel" are incredibly bad. Sure, "Diesel" shows up, but so does "Essential Oils".

Have you done any evaluation to show that this is better than a more conventional search engine? The lesson of TREC is that 95% of the things that will "obviously" improve search results will not.

The goal is not to avoid showing any bad result; it's to show the good results.

Word vectors are "useful" as an alternative representation for words in any machine learning task.

Previously one used TF-IDF to represent a document, but now one uses Word2Vec and will usually get better results.

For example, "From Word Embeddings To Document Distances" [1] shows how to use a new distance measure (Word Mover's Distance) in classification tasks. This leads to state-of-the-art performance on 6 out of 8 classic text classification datasets (and very close on the other 2).
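To give a feel for the measure, here is a minimal sketch of the relaxed variant described in that paper, not the full Word Mover's Distance (which needs an optimal-transport solver). In the relaxation, each word sends all of its mass to the nearest word in the other document, and the larger of the two directions is taken as a lower bound on the true distance. The embeddings are toy values and the function names are hypothetical.

```python
import math
from collections import Counter

def nbow(tokens):
    """Normalized bag-of-words: word -> fraction of the document's mass."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def one_sided_cost(source, target, embeddings):
    """Each source word moves all its mass to its nearest target word."""
    return sum(
        mass * min(euclidean(embeddings[word], embeddings[other]) for other in target)
        for word, mass in source.items()
    )

def relaxed_wmd(doc_a, doc_b, embeddings):
    """Lower bound on Word Mover's Distance: max of the two one-sided relaxations."""
    a, b = nbow(doc_a), nbow(doc_b)
    return max(one_sided_cost(a, b, embeddings), one_sided_cost(b, a, embeddings))

# Toy 2-dimensional embeddings, made up for illustration.
embeddings = {
    "obama": [1.0, 0.0], "president": [1.0, 0.1],
    "speaks": [0.0, 1.0], "greets": [0.1, 1.0],
    "banana": [5.0, 5.0],
}

close = relaxed_wmd(["obama", "speaks"], ["president", "greets"], embeddings)
far = relaxed_wmd(["obama", "speaks"], ["banana", "banana"], embeddings)
print(close, far)
```

The two sentences share no words, yet their distance is small because each word has a near neighbour in the other document; that is the property a bag-of-words or TF-IDF comparison misses.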

In most other practical NLP tasks you find similar results: replacing an older representation with a word embedding almost always improves performance.

[1] http://proceedings.mlr.press/v37/kusnerb15.pdf

Even if that example were the only use case (it's not; they are used for word similarity, sentiment analysis and more...), word embeddings would still be useful, since creating ontologies is not easy and takes time.

Somewhat unrelated, but the old joke goes: give one problem to two ontology experts, and they will come up with three different ontologies.

I also think that there is still room for improvement for embeddings based on other contexts, as pointed out in the blog entry. Another example from this year is leveraging dictionary entries as external context - http://aclweb.org/anthology/D17-1024 ()

Selecting context words differently is also an option for improvement. The fact that using dependency structures to "filter" the context window seems to work better than "filtering" by subsampling frequent words illustrates that there is room. We may see other ways of selecting context words in the future, since context selection is such a basic building block, especially lately with the StarSpace hype advocating the idea of general-purpose, task-agnostic embeddings.
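The comment doesn't say which subsampling scheme is meant. As a minimal sketch, assuming the frequency-based rule described in the word2vec paper (a word with relative corpus frequency f is discarded with probability 1 - sqrt(t / f), for a small threshold t such as 1e-5), the "filtering" looks like this. The frequencies below are made up, and the helper names are hypothetical.

```python
import math
import random

def keep_probability(relative_freq, threshold=1e-5):
    """Probability of keeping a word: min(1, sqrt(threshold / relative_freq))."""
    return min(1.0, math.sqrt(threshold / relative_freq))

def subsample(tokens, relative_freqs, rng, threshold=1e-5):
    """Randomly drop frequent tokens before building context windows."""
    return [
        token for token in tokens
        if rng.random() < keep_probability(relative_freqs[token], threshold)
    ]

# Made-up relative frequencies: a very common word and a rare one.
freqs = {"the": 0.05, "embedding": 0.000001}

p_the = keep_probability(freqs["the"])
p_rare = keep_probability(freqs["embedding"])
kept = subsample(["the", "embedding", "the", "embedding"], freqs, random.Random(0))
print(p_the, p_rare, kept)
```

Note that this rule only looks at frequency; a dependency-based filter instead decides which neighbours count as context from the sentence structure.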

Or we could also consider that the expected improvements are insignificant compared with the improvements from the model learnt on top of those embeddings for a downstream task, which may update the embeddings specifically for that task...

() disclaimer: I am a co-author

Thanks for the note, Christophe. I had missed your paper. I've added a short paragraph with regard to improving negative sampling by incorporating contextual information.

Thank you Sebastian.

Keep up the great work!

You will note that negative sampling improved by leveraging information on word pairs from dictionary entries (we called it "controlled negative sampling") does help, though not much. It actually really depends on the rate of rare words (see section 5.4; the improvement ranges from 0.7% up to 10%). But I guess it is already an interesting, somewhat counter-intuitive, observation.
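One plausible reading of "controlled negative sampling" is that words forming a dictionary pair with the target are never drawn as negatives. The sketch below illustrates only that reading and the paper's actual procedure may differ. It also samples uniformly for simplicity, whereas word2vec-style training usually draws negatives from a smoothed unigram distribution. All names and data are hypothetical.

```python
import random

def sample_negatives(target, vocab, related_pairs, k, rng):
    """Draw k negatives for `target`, never picking the target itself or a
    word that the dictionary data links to it (assumed behaviour)."""
    forbidden = related_pairs.get(target, set()) | {target}
    candidates = [word for word in vocab if word not in forbidden]
    if not candidates:
        return []  # nothing left to sample from
    return [rng.choice(candidates) for _ in range(k)]

# Toy vocabulary and dictionary-derived pairs, made up for illustration.
vocab = ["car", "vehicle", "automobile", "banana", "river", "cloud"]
related_pairs = {"car": {"vehicle", "automobile"}}

negatives = sample_negatives("car", vocab, related_pairs, 5, random.Random(0))
print(negatives)
```

The intuition is that plain negative sampling may push "car" away from "vehicle" by accident, and excluding known related words avoids that.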

Another very interesting observation is that you can also just take a general-purpose dataset and expand it with external contextual information (meaning not using it for supervision, but simply appending it to the end of the training corpus in raw form [^]). In our case, we call those corpora:

- corpus A: plain old Wikipedia dump
- corpus B: plain old Wikipedia dump + dictionary text appended at the end of it.

It sounds a bit naive: the latter part of the training corpus is really small w.r.t. the full Wikipedia dump. Nonetheless, it has a significant impact on word similarity (see the improvement in Table 2 to see how these training corpora influence the representations learnt by word2vec, fasttext and dict2vec). (Related to the effect of the training corpus: https://arxiv.org/pdf/1507.05523v1.pdf)

I mention this effect of training corpus content here since it sounds like useful information for working natural language processing practitioners (get a mid-size general training corpus, add as many contextual corpora as possible => may yield useful embeddings...).

[^] to be entirely fair, this was suggested to us by an anonymous reviewer; many thanks to him/her for pointing this out: I found the results surprising.

Typo from the blog : "to move related works" => "to move related words"
