
Word embeddings in 2017: Trends and future directions - stablemap
http://ruder.io/word-embeddings-2017/
======
serveboy
[https://github.com/kudkudak/word-embeddings-
benchmarks](https://github.com/kudkudak/word-embeddings-benchmarks) has a
pretty nice evaluation of existing embedding methods. Notably missing from
this article are GloVe (
[https://nlp.stanford.edu/projects/glove/](https://nlp.stanford.edu/projects/glove/))
and LexVec (
[https://github.com/alexandres/lexvec](https://github.com/alexandres/lexvec) )
both of which tend to outperform word2vec on both intrinsic and extrinsic tasks.
Also of interest are methods that perform retrofitting, i.e. improving already-trained
embeddings; Morph-fitting (ACL 2017) is a good example. Hashimoto et al. (2016) offers
some interesting insight into how embedding methods perform metric recovery. Lots of
exciting stuff in this area.
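
(If anyone wants to reproduce the intrinsic side of that comparison without the benchmark
harness: a word-similarity evaluation is just a Spearman correlation. Rough sketch below;
the vector file path and the dataset rows are placeholders, not from the repo.)

    # Word-similarity evaluation: Spearman correlation between human
    # judgements and the cosine similarity of the embeddings.
    from gensim.models import KeyedVectors
    from scipy.stats import spearmanr

    # Placeholder: any vectors in word2vec text format (GloVe can be converted).
    vectors = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)

    # Placeholder rows (word1, word2, human score), e.g. from WordSim-353 or MEN.
    pairs = [("tiger", "cat", 7.35), ("car", "automobile", 8.94)]

    human, predicted = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:  # skip out-of-vocabulary pairs
            human.append(score)
            predicted.append(vectors.similarity(w1, w2))

    print("Spearman rho:", spearmanr(human, predicted).correlation)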

~~~
serveboy
Alex Gittens also has a nice paper this year showing how Skipgram enables
vector additivity. See
[http://www.aclweb.org/anthology/P17-1007](http://www.aclweb.org/anthology/P17-1007)

------
visarga
No mention of StarSpace (from Facebook)? It figures, with the rapid pace of
innovation these days.

StarSpace can compute 6 types of entity embeddings, of which word embeddings
are just one type. It's a whole family of algorithms.

[https://github.com/facebookresearch/Starspace/](https://github.com/facebookresearch/Starspace/)

~~~
tensor
Note for those to whom it's relevant: this is not usable in a commercial
setting.

~~~
Matumio
I don't really understand the implications of this license. Does it forbid
using the resulting vectors for commercial purposes? Or does it only forbid
stuff like packaging their code into a product, or offering to run it as a
service?

------
PaulHoule
My question is: what are they really good for?

I mean king = queen - woman + man

That's the kind of thing we have ontologies for.
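
(For reference, that arithmetic is usually run as a nearest-neighbour query over the
vocabulary; a sketch with gensim, where the vector file is a placeholder:)

    from gensim.models import KeyedVectors

    kv = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)

    # king ~= queen - woman + man, phrased as a nearest-neighbour lookup
    # (the query words themselves are excluded from the results).
    print(kv.most_similar(positive=["queen", "man"], negative=["woman"], topn=3))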

This article mentions that word embeddings are useful inside translators, but
from the viewpoint of somebody who wants to extract meaning from text, what
use is something that doesn't handle polysemy and phrases?

~~~
jph00
Word embeddings (or subword embeddings) are used for nearly all recent NLP
algorithms, both shallow (e.g. FastText) and deep (e.g. Google Neural
Translation). Unless you're using a basic bag-of-words approach, you need to
translate your words into some vector format, so you probably want some kind
of embeddings. In practice, the state-of-the-art approaches for translation,
language modeling, classification (e.g. sentiment analysis), etc. all sit on
top of embeddings.
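
A minimal sketch of what "sitting on top of embeddings" looks like in practice
(PyTorch, made-up dimensions; random numbers stand in for pretrained vectors):

    import torch
    import torch.nn as nn

    vocab_size, emb_dim = 10000, 300

    # In a real model these rows would be loaded from word2vec/GloVe/fastText.
    pretrained = torch.randn(vocab_size, emb_dim)

    embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)
    encoder = nn.LSTM(emb_dim, 128, batch_first=True)

    token_ids = torch.randint(0, vocab_size, (2, 7))  # a batch of 2 "sentences"
    outputs, _ = encoder(embedding(token_ids))        # the rest of the model
                                                      # builds on these vectors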

It's not word embeddings' job to handle phrases - but nearly all modern phrase
embedding algorithms sit on top of word embeddings. They often create a
weighted average of embeddings by using an attention model, or they can use a
more complex model such as an LSTM with attention (e.g. CoVe -
[https://arxiv.org/abs/1708.00107](https://arxiv.org/abs/1708.00107) ).
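
The "weighted average via an attention model" part, stripped down to toy numpy
(in a real model the query/scoring parameters are learned):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    emb_dim, n_words = 300, 5
    word_vectors = np.random.randn(n_words, emb_dim)  # embeddings of one phrase
    query = np.random.randn(emb_dim)                  # learned in a real model

    scores = word_vectors @ query           # one relevance score per word
    weights = softmax(scores)               # attention weights, sum to 1
    phrase_vector = weights @ word_vectors  # attention-weighted average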

Word embeddings can handle polysemy - high dimensional vectors can (and do)
hold information of various types that is used in different contexts in
different ways. Some approaches deal with this more directly (e.g. including
part-of-speech as part of the vocab item), and that sometimes can help a bit.
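
The part-of-speech trick is just preprocessing: tag the corpus and make the tag part of
the token, so e.g. "bank_NOUN" and "bank_VERB" get separate vectors. A sketch with spaCy
(any English model works; the model name here is just an example):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def pos_tagged_tokens(text):
        # "bank_NOUN" and "bank_VERB" become distinct vocabulary items,
        # so each word/POS combination gets its own embedding at training time.
        return [f"{tok.text.lower()}_{tok.pos_}" for tok in nlp(text) if not tok.is_space]

    print(pos_tagged_tokens("I bank at the bank by the river."))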

~~~
PaulHoule
Maybe I'm not reading it right, but that arXiv paper about CoVe doesn't seem
to be getting anywhere near commercially useful results.

For instance, the random result for IMDb in Table 2 is 88.4 and the best one
is 92.1; that's really not a lot of lift. I could see TREC-6 and TREC-50
results being good enough to let off the leash, but I still have a hard time
picturing this being useful in the real world.

~~~
jph00
Oh, BTW, the "random" result means randomly initialized vectors. It's still
using embeddings, just without pretraining.

------
cgravier
I also think that there is still room for improvement for embeddings based on
other contexts, as pointed out in the blog entry. Another example from this year is
leveraging dictionary entries as external context -
[http://aclweb.org/anthology/D17-1024](http://aclweb.org/anthology/D17-1024) (*)

Selecting context words differently is also an option for improvement. Using
dependency structures to "filter" the context window seems to work better than
"filtering" by subsampling frequent words, which illustrates that there is room. We
may see other ways of selecting context words in the future, since it is a building
block in its own right - especially lately, with the StarSpace hype advocating the
idea of general-purpose, task-agnostic embeddings.
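
For the dependency-based contexts, here is a sketch of how (word, context) pairs can be
pulled from a parse, in the spirit of Levy & Goldberg's dependency contexts (spaCy here,
details simplified):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def dependency_contexts(text):
        # Contexts are syntactic neighbours (head and children) rather than
        # whatever happens to fall inside a fixed-size window.
        pairs = []
        for tok in nlp(text):
            if tok.head is not tok and not tok.is_punct:
                pairs.append((tok.text.lower(), f"{tok.head.text.lower()}/{tok.dep_}"))
            for child in tok.children:
                if not child.is_punct:
                    pairs.append((tok.text.lower(), f"{child.text.lower()}/{child.dep_}"))
        return pairs

    print(dependency_contexts("Australian scientist discovers star with telescope"))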

Or we can also consider that the expected improvements are insignificant
compared to the improvements from the model learnt on top of those embeddings for
the downstream task, which may fine-tune the embeddings for that task anyway...

(*) disclaimer: I am a co-author

~~~
sebastianruder
Thanks for the note, Christophe. I had missed your paper. I've added a short
paragraph with regard to improving negative sampling by incorporating
contextual information.

~~~
cgravier
Thank you, Sebastian.

Keep up the great work!

You will note that negative sampling improved by leveraging information on
word pairs from dictionary entries (we called it "controlled negative
sampling") does help, though not by much. It actually really depends on the rare-word
rate (see section 5.4; the improvement ranges from 0.7% up to 10%). But I
guess it is already an interesting, somewhat counter-intuitive, observation.

Another very interesting observation is that you can also choose to just take
a general-purpose dataset and expand it with external contextual information
(meaning not using it for supervision, but rather just appending it at the end
of the training corpus in raw form [^]). In our case, we call those corpora:

- corpus A: a plain old Wikipedia dump
- corpus B: a plain old Wikipedia dump + the dictionary text appended at the end of it.

It sounds a bit naive: the latter part of the training corpus is really small
w.r.t. the full Wikipedia dump. Nonetheless, it has a significant impact on
word similarity (see Table 2 for how those training corpora influence the
representations learnt by word2vec, fastText and dict2vec).
(Related to the effect of the training corpus:
[https://arxiv.org/pdf/1507.05523v1.pdf](https://arxiv.org/pdf/1507.05523v1.pdf))

I mention this effect of the training corpus content here since it sounds like
interesting info for working natural language processing practitioners
(get a mid-size general training corpus, add as many contextual corpora as
possible => this may yield useful embeddings...).
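
That recipe is easy to try with off-the-shelf tooling; a rough sketch with gensim
(file paths and hyperparameters are placeholders, API as in gensim 4.x; for a real
Wikipedia dump you would stream rather than build lists in memory):

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # corpus A: a general corpus, one sentence per line.
    # corpus B: the same, with the "contextual" text (dictionary definitions,
    #           glosses, ...) simply appended at the end.
    sentences = list(LineSentence("wiki.txt")) + list(LineSentence("definitions.txt"))

    model = Word2Vec(sentences, vector_size=300, window=5, min_count=5,
                     sg=1, negative=5, workers=4)
    model.wv.save("corpus_b_vectors.kv")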

[^] To be entirely fair, this was suggested to us by an anonymous reviewer;
many thanks to him/her for pointing it out: I found the results surprising.

~~~
cgravier
Typo in the blog post: "to move related works" => "to move related words"

