Multilingual word vectors in 78 languages (github.com)
205 points by RPeres 151 days ago | 23 comments

Cool. I'm going to be trying this out and comparing to ConceptNet Numberbatch [1] (the current state-of-the-art multilingual vectors). I'd like to see how they compare on existing evaluations of word similarity.

EDIT: never mind the comparison, I just got the part where the monolingual performance should be the same as fastText's. Which I don't think is a great thing. The precomputed fastText vectors aren't very good in most languages because most Wikipedias are small. Even the Japanese fastText vectors perform only slightly better than chance on Sakaizawa's evaluation of Japanese word similarity [2].

I would expect that, having data from both Wikipedia and Google Translate, you should be able to make a system that's monolingually much better than fastText on Wikipedia alone. That's what I was hoping to compare to. Don't limit yourself to the performance of fastText's data.

I also observe that the data files could be made much, much smaller. Every value is written out as a double-precision number in decimal. Most of these digits convey no information, and they're effectively random, limiting the effectiveness of compression. These could all be rounded to about 4 digits after the decimal point with no loss of performance.
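To make the rounding point concrete, here is a minimal sketch using only the Python standard library. The vectors are synthetic stand-ins (an assumption, not the released fastText data), and the 300-dimension size and 0.1 scale are illustrative. It compares cosine similarity before and after rounding to 4 decimals, and the text length of one vector written both ways.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Synthetic vectors (assumption): two correlated 300-dimensional vectors.
random.seed(0)
dim = 300
u = [random.gauss(0.0, 0.1) for _ in range(dim)]
v = [0.5 * a + random.gauss(0.0, 0.1) for a in u]

# Round every component to 4 digits after the decimal point.
u_rounded = [round(a, 4) for a in u]
v_rounded = [round(b, 4) for b in v]

# Text size of one vector: full double-precision repr vs. 4 decimals.
full_text = " ".join(repr(a) for a in u)
short_text = " ".join(f"{a:.4f}" for a in u)

similarity_shift = abs(cosine(u, v) - cosine(u_rounded, v_rounded))
print(f"cosine shift after rounding: {similarity_shift:.2e}")
print(f"characters per vector: {len(full_text)} -> {len(short_text)}")
```

On real vectors the exact numbers will differ, but the rounding error per component is at most 0.00005, which is small next to typical component magnitudes.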

[1] https://github.com/commonsense/conceptnet-numberbatch

[2] https://github.com/tmu-nlp/JapaneseWordSimilarityDataset

Are the Numberbatch vectors aligned in the same vector space? That seems to be the primary innovation here, and it's a pretty important one.

They are.

There was a SemEval competition about having vectors that are both accurate for 5 individual languages and aligned between all the pairs of languages [1], where Numberbatch was the first-place system [2].

The fastText_multilingual system does have something going for it: it can discover multilingually aligned vectors after the fact, for a system that was previously trained separately for each language, and leave it "the same system" in a sense. But incorporating the ConceptNet knowledge graph also gives you a multilingually aligned system.

[1] http://alt.qcri.org/semeval2017/task2/

[2] https://blog.conceptnet.io/2017/03/02/how-luminoso-made-conc...

Cool rspeer, we will check Numberbatch out! Yes, as you say, our goal here is to show you can get something for nothing from pre-trained embeddings (I can learn the 78 matrices on my MacBook in about ten minutes...).

The alignment procedure is completely agnostic to what word vectors you use. It only requires that cosine similarity is a good distance metric. We used fastText because there were lots of open-sourced vectors available...
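For readers wondering what such an alignment can look like, below is a minimal NumPy sketch of the standard orthogonal Procrustes solution via SVD. It is an illustration under assumptions, not necessarily the repository's exact code: `learn_rotation` and `normalize_rows` are hypothetical names, and the "dictionary" is synthetic data built by rotating one set of vectors and adding a little noise.

```python
import numpy as np

def normalize_rows(m):
    """Scale each row to unit length so dot products are cosine similarities."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def learn_rotation(source, target):
    """Orthogonal W minimising ||source @ W - target|| (orthogonal Procrustes).

    Row i of `source` and row i of `target` are assumed to be the vectors
    of a translation pair from a bilingual dictionary.
    """
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

# Synthetic stand-in for a bilingual dictionary (assumption): the "source"
# language is the "target" language under a hidden rotation, plus noise.
rng = np.random.default_rng(0)
dim, n_pairs = 50, 500
target = normalize_rows(rng.normal(size=(n_pairs, dim)))
hidden, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
source = normalize_rows(target @ hidden.T + 0.01 * rng.normal(size=(n_pairs, dim)))

w = learn_rotation(source, target)
aligned = source @ w

# Rows are unit length, so the row-wise dot product is the cosine similarity.
mean_cosine = float(np.mean(np.sum(aligned * target, axis=1)))
print(f"mean cosine between aligned pairs: {mean_cosine:.4f}")
```

Because `w` is orthogonal, it leaves all cosine similarities within the source language unchanged, which is why the monolingual behaviour of the vectors is preserved.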

In case I'm not the only one wanting to know what a "word vector" is: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-wor...

A more technical term for that in NLP is "word embedding":


Thanks, very helpful!

I'd be interested in trying to do something like this to put image vectors and word vectors in the same space. It might be tricky because some images would look similar without being semantically similar at all. However, you could probably get some interesting results.

People have actually tried to do things like this; though I suspect a linear transformation wouldn't work well. Chris Olah talks about it in his awesome blog: https://colah.github.io/posts/2014-07-NLP-RNNs-Representatio...

The next trick would be to create a matrix that is the closest to all of the languages and doesn't have gaps from missing words. Using English as the identity is probably a bit ethnocentric -- another reification of the three-percent problem of literary translation: http://www.rochester.edu/College/translation/threepercent/ .

Maybe we can call this new matrix Mondoshawan?

Yes definitely! We didn't want to complicate the repository, but from a few in-house experiments we already know that it is possible to improve the rotation matrices by:

1. First aligning to a reference language (English)
2. Then defining a new reference as the mean vector of all the languages for each entry in the training dictionary
3. Re-aligning the languages to this new reference "language"
4. Iteratively repeating 2 and 3 to convergence
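The four steps above can be sketched roughly as follows. This is a guess at the shape of the procedure, not the authors' in-house code: `align_to_mean`, `learn_rotation`, the tolerance, and the synthetic three-language data are all assumptions made for illustration.

```python
import numpy as np

def learn_rotation(source, target):
    """Orthogonal W minimising ||source @ W - target|| (orthogonal Procrustes)."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

def align_to_mean(embeddings, max_iters=50, tol=1e-6):
    """Align every language to an iteratively refined mean "language".

    embeddings: list of (n_entries, dim) arrays; row i holds the vector of
    the same dictionary entry in each language. Index 0 is the initial
    reference (English in the comment above).
    """
    reference = embeddings[0]
    for _ in range(max_iters):
        # Steps 1 and 3: align every language to the current reference.
        rotations = [learn_rotation(e, reference) for e in embeddings]
        aligned = [e @ w for e, w in zip(embeddings, rotations)]
        # Step 2: the new reference is the per-entry mean over languages.
        new_reference = np.mean(aligned, axis=0)
        change = np.linalg.norm(new_reference - reference)
        converged = change < tol * np.linalg.norm(reference)
        reference = new_reference
        if converged:  # Step 4: stop once the reference stops moving.
            break
    return rotations, reference

def mean_row_cosine(a, b):
    """Average cosine similarity between corresponding rows of a and b."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / den))

# Synthetic data (assumption): three "languages" that are rotated, noisy
# copies of one shared space.
rng = np.random.default_rng(0)
n_entries, dim = 400, 20
shared = rng.normal(size=(n_entries, dim))
languages = []
for _ in range(3):
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    languages.append(shared @ q + 0.05 * rng.normal(size=(n_entries, dim)))

rotations, reference = align_to_mean(languages)
aligned = [e @ w for e, w in zip(languages, rotations)]
print(f"cosine between languages 1 and 2: {mean_row_cosine(aligned[1], aligned[2]):.4f}")
```

The fixed tolerance is a simple stand-in for "to convergence"; a real run would probably track dictionary translation accuracy instead.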

As you suggest, this mean language could itself be a set of really high-quality word vectors, but we haven't looked at this yet...

I'm amazed that apparently, multiplication with an orthogonal matrix (simply rotating and mirroring the word vecs) is able to align two languages!!

It's really not that surprising, because word embeddings like word2vec are linear projections from one-hot vectors (i.e., of dimension equal to dictionary size) to dense lower-dimensional vectors, typically of dimension 300 (because in practice 300 works really well).
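A tiny sketch of the "linear projection from one-hot vectors" point, with a made-up four-word vocabulary and a random 3-dimensional projection matrix (both assumptions; real models use large vocabularies and around 300 dimensions). Multiplying a one-hot vector by the matrix just selects one row, so the embedding table is the linear map.

```python
import numpy as np

# Toy vocabulary and projection matrix (assumptions for illustration).
vocab = ["cat", "dog", "house", "tree"]
dim = 3
rng = np.random.default_rng(0)
projection = rng.normal(size=(len(vocab), dim))  # one row per word

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

dog = vocab.index("dog")

# Projecting the one-hot vector is the same as looking up row `dog`.
embedded = one_hot(dog, len(vocab)) @ projection
print(embedded)
print(projection[dog])
```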

Note also that word embedding models are typically trained to optimize a loss function whose terms include the dot product of embeddings based on the respective words' location in relation to each other in sliding context windows, such that words that are more similar end up with embeddings that have greater cosine similarity, and vice versa.
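As an illustration of a loss built from dot products, here is a sketch assuming a skip-gram objective with negative sampling, which the comment above does not name explicitly. The three toy vectors, the learning rate, and the function names are hypothetical. Gradient steps on the word vector pull it towards a context vector it co-occurs with and push it away from a sampled negative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sgns_loss(word, context, negatives):
    """Negative-sampling loss for one (word, context) pair."""
    positive = -np.log(sigmoid(word @ context))
    negative = -np.sum(np.log(sigmoid(-(negatives @ word))))
    return float(positive + negative)

def sgns_grad_word(word, context, negatives):
    """Gradient of sgns_loss with respect to the word vector only."""
    pull = (sigmoid(word @ context) - 1.0) * context  # towards the context
    push = sigmoid(negatives @ word) @ negatives      # away from negatives
    return pull + push

# Toy vectors (assumption): mutually orthogonal to start with.
word = np.array([1.0, 0.0, 0.0])
context = np.array([0.0, 1.0, 0.0])
negatives = np.array([[0.0, 0.0, 1.0]])

cosine_before = cosine(word, context)
loss_before = sgns_loss(word, context, negatives)

learning_rate = 0.5
for _ in range(30):
    word = word - learning_rate * sgns_grad_word(word, context, negatives)

cosine_after = cosine(word, context)
loss_after = sgns_loss(word, context, negatives)
print(f"cosine with context: {cosine_before:.3f} -> {cosine_after:.3f}")
print(f"loss: {loss_before:.3f} -> {loss_after:.3f}")
```

Only the word vector is updated here to keep the example short; actual training also updates the context and negative vectors.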

What's incredible (to me) is how well these embeddings work in practice :-)

Thanks for the additional information :) agreed, it's pretty phenomenal stuff..!

This is great! We recently released a tool that allows computational social scientists to analyze text using a set of lexicons generated from word embeddings (https://github.com/Ejhfast/empath-client). Would be awesome to port this automatically to new languages.

Looks interesting!

The link to your paper seems to be down: http://hci.stanford.edu/publications/empath-chi-2016.pdf

Quote: "In general, the procedure works best for other European languages like French, Portuguese and Spanish."

[I could easily be wrong, but] might the highest "similarities" be due to the Neo-Latin roots of some European languages?

All the languages are aligned to English, which I presume biases it towards Romance/Germanic languages. Later on, comparisons are made showing how better performance can be obtained by choosing a more suitable language to align to, for example Russian and Ukrainian.

Ukrainian is a language that has a lot of words in common with Czech and Polish. I can simply take an article written in one of those languages and start reading it without any knowledge of those languages.

Yet this matrix shows it as having nothing in common with them. Instead, only Belarusian(?) and Russian are the darkest marks on the UK line.

It just compares LETTERS and not SOUNDS of those words.

Yes tempay, precisely: because we use English as our reference, there is a bias in favour of languages similar to English. Additionally, the bigger a language's Wikipedia, the higher the quality of its word vectors, which also tends to improve performance.

Did you try to quantify the similarity or dissimilarity of language pairs (using GSVD for example)?

We haven't done this in a rigorous way yet, no, but it would be really interesting! In particular, I wonder if it's possible to reconstruct the linguistic tree of languages evolving over time, without any prior knowledge, solely from the word vector spaces...
