
Multilingual word vectors in 78 languages - RPeres
https://github.com/Babylonpartners/fastText_multilingual
======
rspeer
Cool. I'm going to be trying this out and comparing to ConceptNet Numberbatch
[1] (the current state-of-the-art multilingual vectors). I'd like to see how
they compare on existing evaluations of word similarity.

EDIT: never mind the comparison, I just got to the part where the monolingual
performance should be the same as fastText's. Which I don't think is a great
thing. The precomputed fastText vectors aren't very good in most languages
because most Wikipedias are small. Even the Japanese fastText vectors perform
only slightly better than chance on Sakaizawa's evaluation of Japanese word
similarity [2].

I would expect that, having data from both Wikipedia and Google Translate, you
should be able to make a system that's monolingually much better than fastText
on Wikipedia alone. That's what I was hoping to compare to. Don't limit
yourself to the performance of fastText's data.

I also observe that the data files could be made much, much smaller. Every
value is written out as a double-precision number in decimal. Most of these
digits convey no information, and they're effectively random, limiting the
effectiveness of compression. These could all be rounded to about 4 digits
after the decimal point with no loss of performance.
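
For example, a minimal Python sketch of that rounding (assuming the usual .vec
text layout of a "<count> <dim>" header followed by one "word v1 ... v300" line
per word; the file names are just illustrative):

```python
def round_vec_file(src_path, dst_path, digits=4):
    """Rewrite a fastText-style .vec file with each value rounded to `digits` decimals."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(src.readline())  # copy the "<count> <dim>" header unchanged
        for line in src:
            word, *values = line.rstrip().split(" ")
            rounded = (f"{float(v):.{digits}f}" for v in values)
            dst.write(word + " " + " ".join(rounded) + "\n")

round_vec_file("wiki.en.vec", "wiki.en.rounded.vec")
```

The rounded file is smaller on disk, and the repetitive trailing digits also
compress far better with gzip.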

[1] [https://github.com/commonsense/conceptnet-numberbatch](https://github.com/commonsense/conceptnet-numberbatch)

[2] [https://github.com/tmu-nlp/JapaneseWordSimilarityDataset](https://github.com/tmu-nlp/JapaneseWordSimilarityDataset)

~~~
nl
Are the Numberbatch vectors aligned in the same vector space? That seems to be
the primary innovation here, and it's a pretty important one.

~~~
rspeer
They are.

There was a SemEval competition about having vectors that are both accurate
for 5 individual languages and aligned between all the pairs of languages [1],
where Numberbatch was the first-place system [2].

The fastText_multilingual system does have something going for it: it can
discover multilingually aligned vectors after the fact, for a system that was
previously separate for each language, while leaving it "the same system" in a
sense. But incorporating the ConceptNet knowledge graph also gives you a
multilingually aligned system.

[1] [http://alt.qcri.org/semeval2017/task2/](http://alt.qcri.org/semeval2017/task2/)

[2] [https://blog.conceptnet.io/2017/03/02/how-luminoso-made-conceptnet-into-the-best-word-vectors-and-won-at-semeval/](https://blog.conceptnet.io/2017/03/02/how-luminoso-made-conceptnet-into-the-best-word-vectors-and-won-at-semeval/)

~~~
sls56
Cool rspeer, we will check Numberbatch out! Yes, as you say, our goal here is
to show that you can get something for nothing from pre-trained embeddings (I
can learn the 78 matrices on my MacBook in about ten minutes...).

The alignment procedure is completely agnostic to what word vectors you use.
It only requires that cosine similarity be a good measure of word similarity
in each space. We used fastText because there were lots of open-sourced
vectors available...
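
The alignment learned here is an orthogonal map, and the standard closed-form
way to fit one is the orthogonal Procrustes solution via an SVD. A minimal
sketch (the function and variable names are mine, not the repository's API;
`source` and `target` are matrices whose rows are the vectors for the same
bilingual-dictionary entries):

```python
import numpy as np

def learn_orthogonal_map(source, target):
    """Orthogonal matrix W minimizing ||source @ W - target||_F (orthogonal Procrustes)."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt  # product of orthogonal matrices, hence orthogonal

# Usage: fit W on the dictionary pairs, then map every source-language vector
# into the target space and compare words across languages by cosine similarity.
# W = learn_orthogonal_map(src_dict_vecs, tgt_dict_vecs)
# aligned_src = all_src_vecs @ W
```

Because W is orthogonal it preserves dot products and norms, so the monolingual
cosine similarities (and hence the original monolingual performance) are untouched.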

------
sharp11
In case I'm not the only one wanting to know what a "word vector" is:
[https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/](https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/)

~~~
paradite
A more technical term for that in NLP is "word embedding":

[https://en.wikipedia.org/wiki/Word_embedding](https://en.wikipedia.org/wiki/Word_embedding)

~~~
sharp11
Thanks, very helpful!

------
supermdguy
I'd be interested in trying to do something like this to put image vectors and
word vectors in the same space. It might be tricky, because some images can
look visually similar without being semantically similar at all. Still, you
could probably get some interesting results.

~~~
sls56
People have actually tried things like this, though I suspect a linear
transformation wouldn't work well. Chris Olah talks about it in his awesome
blog: [https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/](https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/)

------
nicodemus26
The next trick would be to create a matrix that is the closest to all of the
languages and doesn't have gaps from missing words. Using English as the
identity is probably a bit ethnocentric -- another reification of the three-
percent problem of literary translation:
[http://www.rochester.edu/College/translation/threepercent/](http://www.rochester.edu/College/translation/threepercent/).

Maybe we can call this new matrix Mondoshawan?

~~~
sls56
Yes, definitely! We didn't want to complicate the repository, but from a few
in-house experiments we already know that it is possible to improve the
rotation matrices by:

1. First aligning to a reference language (English)
2. Then defining a new reference as the mean vector of all the languages for each entry in the training dictionary
3. Re-aligning the languages to this new reference "language"
4. Iteratively repeating 2 and 3 to convergence

As you suggest, this mean language could potentially itself be a really
high-quality set of word vectors, but we haven't looked at this yet...
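
A rough sketch of that loop, for the curious (purely illustrative names, not
the repository's code; it assumes each `dict_vecs[lang]` matrix holds that
language's vectors for the shared training dictionary, row-aligned across
languages):

```python
import numpy as np

def procrustes(source, target):
    # Orthogonal matrix W minimizing ||source @ W - target||_F.
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

def align_to_mean(dict_vecs, n_iters=10):
    reference = dict_vecs["en"]                       # 1. start by aligning to English
    maps = {}
    for _ in range(n_iters):                          # 4. repeat to convergence
        maps = {lang: procrustes(vecs, reference)     # 3. re-align every language
                for lang, vecs in dict_vecs.items()}
        aligned = [vecs @ maps[lang] for lang, vecs in dict_vecs.items()]
        reference = np.mean(aligned, axis=0)          # 2. new mean "language"
    return maps
```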

------
isoprophlex
I'm amazed that, apparently, multiplication by an orthogonal matrix (simply
rotating and reflecting the word vectors) is able to align two languages!

~~~
cs702
It's really not that surprising, because word embeddings like word2vec are
_linear projections_ from one-hot vectors (i.e., of dimension equal to
dictionary size) to dense lower-dimensional vectors, typically of dimension
300 (because in practice 300 works really well).

Note also that word embedding models are typically trained to optimize a loss
function whose terms include dot products of embeddings for words that appear
near each other in sliding context windows, so words that are used more
similarly end up with embeddings that have greater cosine similarity, and vice
versa.

What's incredible (to me) is how well these embeddings work in practice :-)
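
A quick way to convince yourself that a rotation can't hurt the monolingual
side: orthogonal maps preserve dot products and norms, and therefore cosine
similarity. A toy check with random vectors (not real embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
vecs = rng.normal(size=(5, 300))                  # five fake 300-d "embeddings"
q, _ = np.linalg.qr(rng.normal(size=(300, 300)))  # a random orthogonal matrix

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

before = [[cosine(a, b) for b in vecs] for a in vecs]
after = [[cosine(a @ q, b @ q) for b in vecs] for a in vecs]
print(np.allclose(before, after))                 # True: similarities are unchanged
```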

~~~
isoprophlex
Thanks for the additional information :) Agreed, it's pretty phenomenal
stuff!

------
unignorant
This is great! We recently released a tool that allows computational social
scientists to analyze text using a set of lexicons generated from word
embeddings ([https://github.com/Ejhfast/empath-
client](https://github.com/Ejhfast/empath-client)). Would be awesome to port
this automatically to new languages.

~~~
gnaddel
Looks interesting!

The link to your paper seems to be down:
[http://hci.stanford.edu/publications/empath-chi-2016.pdf](http://hci.stanford.edu/publications/empath-chi-2016.pdf)

~~~
unignorant
Thanks, fixed!

[https://hci.stanford.edu/publications/2016/ethan/empath-chi-2016.pdf](https://hci.stanford.edu/publications/2016/ethan/empath-chi-2016.pdf)

------
J_cst
Quote: "In general, the procedure works best for other European languages like
French, Portuguese and Spanish."

[I may easily be wrong, but] might the highest "similarities" be due to the
neo-Latin roots of some European languages?

~~~
tempay
All the languages are aligned to English which I presume biases it towards
romance/germanic languages. Later on comparisons are are made showing how
better performance can be obtained by choosing a more suitable language to
align to, for example Russian and Ukrainian.

~~~
slezyr
Ukrainian language is language that has a lot of common words with Czech and
Polish. I can simply take article written in one of those languages and start
reading without any knowledge about those languages.

And this matrix shows that it has nothing common. Instead only Belorussian(?)
and Russian are darkest marks on UK line.

It just compares LETTERS and not SOUNDS of those words.

------
shacharz
Did you try to quantify the similarity or dissimilarity of language pairs
(using GSVD for example)?

~~~
sls56
We haven't done this in a rigorous way yet, no, but it would be really
interesting! In particular, I wonder if it's possible to reconstruct the
linguistic tree of languages evolving over time, without any prior knowledge,
solely from the word vector spaces...
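
Purely as a speculative sketch (nothing from the repository): one crude way
would be to use the residual error of the best orthogonal alignment between
each language pair as a dissimilarity, and feed that matrix to hierarchical
clustering:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

def alignment_error(source, target):
    # Residual Frobenius error after the best orthogonal map source -> target,
    # computed over a shared, row-aligned bilingual dictionary.
    u, _, vt = np.linalg.svd(source.T @ target)
    return np.linalg.norm(source @ (u @ vt) - target)

# langs: list of language codes; vecs[lang]: dictionary-aligned vector matrix.
# dists = np.array([[alignment_error(vecs[a], vecs[b]) for b in langs] for a in langs])
# dists = (dists + dists.T) / 2          # symmetrise
# np.fill_diagonal(dists, 0.0)
# tree = linkage(squareform(dists), method="average")
# dendrogram(tree, labels=langs)         # a crude "family tree" of languages
```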

