
Polyglot Word Embeddings Discover Language Clusters - shriphani
http://blog.shriphani.com/2020/02/03/polyglot-word-embeddings-discover-language-clusters/
======
pattusk
I read the title and got excited, thinking this would use embeddings to
gather insights about language families. As in: if you ran k-means on the
same corpus of n languages with k < n, how would, say, Finnish, Mongolian,
Turkish, and Japanese fall into the clusters? I'm also curious whether the
results could be interpreted rigorously enough to draw scientifically valid
linguistic conclusions.
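A sketch of the kind of experiment I mean, with synthetic stand-ins for per-document embedding averages (a real run would average fastText or similar vectors per document; the 2-d centers and the plain-numpy k-means below are invented purely so the example runs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for averaged document embeddings in n = 4 languages.
# These 2-d centers are toy values, not real linguistic geometry.
centers = {"fi": [1.0, 0.0], "tr": [0.9, 0.1], "ja": [0.0, 1.0], "en": [-1.0, 0.0]}
docs, labels = [], []
for lang, c in centers.items():
    for _ in range(20):
        docs.append(np.array(c) + rng.normal(scale=0.05, size=2))
        labels.append(lang)
X = np.stack(docs)

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means: random init, then alternate assignment / mean updates."""
    rng = np.random.default_rng(seed)
    cent = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - cent[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                cent[j] = X[assign == j].mean(axis=0)
    return assign

# k < n: with k = 3 over 4 "languages", two of them must share a cluster;
# which pair merges depends on the (toy) geometry and the init.
assign = kmeans(X, k=3)
for lang in centers:
    members = assign[[i for i, l in enumerate(labels) if l == lang]]
    print(lang, "-> cluster", np.bincount(members).argmax())
```

Whether the merges line up with actual language families is exactly the open question: with real embeddings the clusters may track script or loanword overlap rather than genealogy.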

Instead it looks like this just performs language detection. Is there a
significant advantage to this method over reusing one of the many existing
open-source solutions based on simpler models, such as [1], and retraining
them on a corpus that includes the language(s) that weren't supported? You
offer a comparative table for FastText & GCP; how do you explain FastText's
abysmal precision on English? The value just seems way too low not to be a
bug of some sort.

[1] [https://code.google.com/archive/p/language-detection/](https://code.google.com/archive/p/language-detection/)
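For context on what those simpler models do, a toy version of the character-n-gram profiling approach behind libraries like [1] fits in a few lines (the training snippets and the overlap scoring here are invented for illustration; real profiles are built from large corpora):

```python
from collections import Counter

# Toy character-trigram language detector, in the spirit of the simpler
# n-gram models mentioned above. Training snippets are tiny and purely
# illustrative.
TRAIN = {
    "en": "the quick brown fox jumps over the lazy dog and runs away",
    "fr": "le renard brun rapide saute par dessus le chien paresseux",
    "fi": "nopea ruskea kettu hyppasi laiskan koiran yli ja juoksi pois",
}

def trigrams(text):
    padded = f"  {text}  "  # pad so word boundaries become trigram features
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

PROFILES = {lang: trigrams(text) for lang, text in TRAIN.items()}

def detect(text):
    grams = trigrams(text)
    def overlap(profile):
        # Shared trigram mass, normalised by profile size so a longer
        # training snippet doesn't win by default.
        return sum(min(n, profile[g]) for g, n in grams.items()) / sum(profile.values())
    return max(PROFILES, key=lambda lang: overlap(PROFILES[lang]))

print(detect("the dog runs over the fox"))          # -> "en"
print(detect("le chien saute par dessus le renard"))  # -> "fr"
```

Retraining a model of this shape on an unsupported language is just a matter of adding a profile, which is what makes the "why not reuse [1]" question worth answering in the post.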

~~~
shriphani
In the Indian subcontinent, most native content features a lot of code-mixing
- so in a Bengali or Hindi document you'll see some English words for sure.

And the vast majority of native content is written in the Roman script
(native-script keyboards are poorly designed or unavailable, I suppose).

Thus a large chunk of content gets labeled as English - granted, it won't be
a high-confidence prediction, but it is still the label that gets produced.

Corpora in the subcontinent can span 10 - 12 languages. Take the Rohingya
language, for instance - it is near impossible to find an annotator who
speaks it. Getting a monolingual corpus out with zero annotation is quite
useful.

------
nl
This is nice, but the blog post should point out that FastText has language
identification built in[1].

The authors knew this, because they compare against it in the paper, but
don't call it out in the post!

Edit: just realised the link on _popular "open source"_ goes to the FastText
post I linked below. Still, I think it would have been good to note this
explicitly!

[1] [https://fasttext.cc/blog/2017/10/02/blog-post.html](https://fasttext.cc/blog/2017/10/02/blog-post.html)

~~~
shriphani
Sorry about that; I'll edit the post with an explicit mention right away.

