
Fast and accurate language identification using fastText - exgrv
https://fasttext.cc/blog/2017/10/02/blog-post.html
======
allan_s
Nice, the author created this based on tatoeba.org data. I used to be
Tatoeba's main developer, and I created a language detector for it (because it
was painful for people to have to input a sentence AND its language,
especially for polyglots). So the language data used to train this detector
was most likely itself labelled by a language detector; funny when you think
about it :)

[https://github.com/allan-simon/Tatodetect](https://github.com/allan-simon/Tatodetect)
(I should rewrite it in Rust some day); it's a simple N-gram detector.
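
For anyone curious, the core idea fits in a few lines. A minimal sketch
(illustrative only, not Tatodetect's actual code): build a character trigram
frequency profile per language, then score new text against each profile:

    from collections import Counter

    def ngrams(text, n=3):
        padded = f" {text.lower()} "
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    def train(samples):
        # samples: {language: training text} -> {language: trigram counts}
        return {lang: Counter(ngrams(text)) for lang, text in samples.items()}

    def detect(text, profiles):
        # Score each language by the relative frequency of the text's
        # trigrams in that language's profile; highest score wins.
        grams = ngrams(text)
        scores = {lang: sum(prof[g] for g in grams) / sum(prof.values())
                  for lang, prof in profiles.items()}
        return max(scores, key=scores.get)

    profiles = train({"eng": "the quick brown fox jumps over the lazy dog",
                      "fra": "le renard brun saute par-dessus le chien paresseux"})
    print(detect("the dog jumps", profiles))  # -> eng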

------
matthberg
Really fascinating from a linguistics perspective. I'm curious how this works
and whether the approach could be generalized to help with the cataloguing of
dying languages.

------
wyldfire
I think it would be cool to see how easily they could create a WASM/asm.js
target.

------
visarga
Why is it just 93% accurate on Wikipedia? Is it that hard to identify
languages?

~~~
microcolonel
I suspect it's due to mixed-language content on Wikipedia. A lot of Wikipedia
articles talk about foreign-language art and culture; that is one of the
largest (if not the largest) single categories of content on non-English
Wikipedias.
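
You can see this directly by asking the model for several guesses instead of
one. A sketch, assuming the `fasttext` Python package and the published
lid.176.bin model, with a made-up mixed English/French sentence:

    import fasttext

    # Assumes the published language-identification model was downloaded:
    # https://fasttext.cc/docs/en/language-identification.html
    model = fasttext.load_model("lid.176.bin")

    labels, probs = model.predict(
        "La Joconde is a famous peinture by Leonardo da Vinci", k=3)
    for label, prob in zip(labels, probs):
        print(label.replace("__label__", ""), round(float(prob), 3))

    # Mixed sentences tend to split probability mass across languages,
    # so the single top label is wrong more often.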

~~~
alexott
Yes, it's not so good on samples that mix several languages.

------
alexott
I'll try to run a comparison with Google's CLD tomorrow.
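Something along these lines, assuming the `fasttext` and `pycld2` (a CLD2
wrapper) Python packages, the published lid.176.bin model, and made-up sample
sentences:

    import fasttext
    import pycld2 as cld2

    model = fasttext.load_model("lid.176.bin")

    # Hypothetical (text, expected language) pairs.
    samples = [("Bonjour tout le monde", "fr"),
               ("Hello world", "en"),
               ("Hallo Welt", "de")]

    for text, gold in samples:
        ft = model.predict(text)[0][0].replace("__label__", "")
        _, _, details = cld2.detect(text)
        cld = details[0][1]  # language code of CLD2's top match
        print(f"{gold}: fastText={ft}, CLD2={cld}")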

~~~
microcolonel
Bearing in mind that fastText supports many more languages than CLD.

~~~
alexott
Depends on the mode, but I've only compared ~60 languages.

