EDIT: never mind the comparison, I just got the part where the monolingual performance should be the same as fastText's. Which I don't think is a great thing. The precomputed fastText vectors aren't very good in most languages because most Wikipedias are small. Even the Japanese fastText vectors perform only slightly better than chance on Sakaizawa's evaluation of Japanese word similarity.
I would expect that, having data from both Wikipedia and Google Translate, you should be able to make a system that's monolingually much better than fastText on Wikipedia alone. That's what I was hoping to compare to. Don't limit yourself to the performance of fastText's data.
I also observe that the data files could be made much, much smaller. Every value is written out as a double-precision number in decimal. Most of these digits convey no information, and they're effectively random, limiting the effectiveness of compression. These could all be rounded to about 4 digits after the decimal point with no loss of performance.
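For illustration, here's a rough sketch of what I mean (the function name and file format are just made up for the example, not the actual export code): each value gets written with 4 digits after the decimal point instead of a full double-precision decimal, which also makes the file compress far better.

```python
import gzip

import numpy as np

def write_rounded_vectors(words, matrix, path, decimals=4):
    """Write vectors in the usual 'word v1 v2 ...' text format, with each
    value rounded to `decimals` digits after the decimal point."""
    matrix = np.asarray(matrix)
    with gzip.open(path, "wt", encoding="utf-8") as out:
        # Standard header: vocabulary size and dimensionality.
        out.write(f"{len(words)} {matrix.shape[1]}\n")
        for word, row in zip(words, matrix):
            values = " ".join(f"{value:.{decimals}f}" for value in row)
            out.write(f"{word} {values}\n")
```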
There was a SemEval competition about having vectors that are both accurate for 5 individual languages and aligned between all the pairs of languages, where Numberbatch was the first-place system.
The fastText_multilingual system does have something going for it: it can discover multilingually aligned vectors after the fact, for a system that was previously separate for each language, while leaving it "the same system" in a sense. But incorporating the ConceptNet knowledge graph also gives you a multilingually aligned system.
The alignment procedure is completely agnostic to what word vectors you use. It only requires that cosine similarity is a good distance metric. We used fastText because there were lots of open-sourced vectors available...
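For reference, the kind of alignment being described is essentially the orthogonal Procrustes problem: given matched rows of source-language and target-language vectors (e.g. from a bilingual dictionary), a single SVD gives the rotation that best maps one space onto the other, and because the map is orthogonal it preserves cosine similarities within each space. This is just an illustrative sketch; the variable names aren't taken from the actual code.

```python
import numpy as np

def learn_orthogonal_map(source_vecs, target_vecs):
    """source_vecs, target_vecs: (n_pairs, dim) arrays of matched word vectors.
    Returns the orthogonal matrix Omega minimizing ||source @ Omega - target||."""
    u, _, vt = np.linalg.svd(source_vecs.T @ target_vecs)
    return u @ vt

def apply_map(omega, vecs):
    # Map vectors from the source space into the target space.
    return vecs @ omega
```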
Maybe we can call this new matrix Mondoshawan?
As you suggest, this "mean language" could potentially itself be a set of really high-quality word vectors, but we haven't looked at this yet...
Note also that word embedding models are typically trained to minimize a loss whose terms involve dot products between the embeddings of words that appear near each other in sliding context windows. As a result, words that occur in similar contexts end up with embeddings that have higher cosine similarity, and words that don't end up with lower cosine similarity.
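As a toy illustration (the vectors below are made up, not trained), that's why similarity can be read straight off the trained embeddings with a cosine:

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical 3-dimensional embeddings, just for the example.
cat = np.array([0.8, 0.1, 0.3])
dog = np.array([0.7, 0.2, 0.4])
car = np.array([-0.2, 0.9, -0.5])

print(cosine_similarity(cat, dog))  # high: words seen in similar contexts
print(cosine_similarity(cat, car))  # lower: words seen in different contexts
```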
What's incredible (to me) is how well these embeddings work in practice :-)
The link to your paper seems to be down:
[I am easily wrong, but] might the highest "similarities" be due to the Neo-Latin roots of some European languages?
And this matrix shows that they have nothing in common. Instead, only Belarusian(?) and Russian are the darkest marks on the UK line.
It just compares LETTERS and not SOUNDS of those words.