

Show HN: Franc – Detect natural languages - wooorm
https://github.com/wooorm/franc

======
jules
Seems like this just compares the L1 distance of the trigram count vector to
some preselected document in each language. That won't be very accurate. A
much better way to go here is naive Bayes. There are more sophisticated
approaches, but naive Bayes will already get you much further than this. If you
train it with Wikipedia articles for the most popular languages you would
most likely get >99% accuracy.
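
For the curious, here is a minimal sketch of that naive Bayes idea in JavaScript (the names are made up, and `counts`, the per-language trigram counts built offline from a corpus such as Wikipedia dumps, is assumed to exist already):

      // counts: { lang: { trigram: count, ... }, ... }, built offline from a corpus.
      function trigrams(text) {
        var t = ' ' + text.toLowerCase().replace(/[\d.,;:!?()"']/g, '') + ' ';
        var result = [];
        for (var i = 0; i < t.length - 2; i++) result.push(t.substr(i, 3));
        return result;
      }

      function classify(text, counts) {
        var best = null;
        var bestScore = -Infinity;
        Object.keys(counts).forEach(function (lang) {
          var model = counts[lang];
          var keys = Object.keys(model);
          var total = keys.reduce(function (s, k) { return s + model[k]; }, 0);
          // Sum of log P(trigram | lang), with add-one smoothing for unseen trigrams.
          var score = trigrams(text).reduce(function (s, tri) {
            return s + Math.log(((model[tri] || 0) + 1) / (total + keys.length));
          }, 0);
          if (score > bestScore) {
            bestScore = score;
            best = lang;
          }
        });
        return best;
      }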

~~~
breuderink
One method that I have used in the past was über-simple, yet extremely
effective. It exploits ZIP compression, based on the insight/assumption
that two concatenated texts compress better when they share their language.

I think I found it in this paper [1]. The implementation was like 13 lines of
Python code. I wonder how it would compare.

[1]
[http://www.ccs.neu.edu/home/jaa/CSG399.05F/Topics/Papers/Ben...](http://www.ccs.neu.edu/home/jaa/CSG399.05F/Topics/Papers/BenedettoCaLo.pdf)
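
For reference, a rough Node sketch of that zipping trick (the original was Python, and the reference-text setup here is purely illustrative): compress a known sample per language with and without the unknown text appended, and pick the language whose sample makes the extra compressed bytes smallest.

      var zlib = require('zlib');

      // references: { lang: 'a reasonably long sample text in that language', ... }
      function detect(unknown, references) {
        var best = null;
        var bestCost = Infinity;
        Object.keys(references).forEach(function (lang) {
          var ref = references[lang];
          var alone = zlib.deflateSync(ref).length;
          var together = zlib.deflateSync(ref + '\n' + unknown).length;
          var cost = together - alone; // extra bytes needed to encode `unknown`
          if (cost < bestCost) {
            bestCost = cost;
            best = lang;
          }
        });
        return best;
      }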

~~~
wooorm
It’s a very interesting idea. Would it work accurately enough when scaled to
160+ languages?

~~~
breuderink
I don't know; I think I used about 40 languages. The beauty is that zip
compression captures rich statistical properties of the languages, so
representation-wise it should go a long way. But counting compressed output
length discretises the lang-lang distance. For shorter texts this might be
troubling, since it could easily result in ties. So, maybe. Perhaps I should
try :).

~~~
wooorm
Perhaps you should ;) If you do, I’d be interested to know how it goes!

------
allan_s
The project does not seem to state clearly how the detection is done: does it
call an external web service, or does it rely on an offline database created
at some point?

Shameless plug: [https://github.com/allan-simon/Tatodetect](https://github.com/allan-simon/Tatodetect) covers 179
languages (actually as many as the Tatoeba project does) and can run offline,
with an explanation of how to generate your own database from a CC-BY corpus.

That said, the advantage of Franc is that it can be used directly as an npm
library while Tatodetect is a micro-webservice, and for some edge languages
Tatodetect is certainly not as good as Franc (I haven't yet tested both to
compare).

~~~
hywel
Based on a 2-sec look at the code, it's using a built-in database of trigrams
as a predictor of the language.

[https://github.com/wooorm/franc/blob/master/lib/data.json](https://github.com/wooorm/franc/blob/master/lib/data.json)
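
Not necessarily what data.json encodes exactly, but a sketch of the classic top-trigram "out-of-place" comparison a database like this would support: rank the input's trigrams by frequency and sum how far each one is from its rank in a language's stored profile.

      // Build the input's top-N trigram profile, most frequent first.
      function profile(text, max) {
        var freq = {};
        var t = ' ' + text.toLowerCase() + ' ';
        for (var i = 0; i < t.length - 2; i++) {
          var tri = t.substr(i, 3);
          freq[tri] = (freq[tri] || 0) + 1;
        }
        return Object.keys(freq)
          .sort(function (a, b) { return freq[b] - freq[a]; })
          .slice(0, max || 300);
      }

      // "Out-of-place" distance: lower means the input looks more like the language.
      function distance(inputProfile, langProfile) {
        return inputProfile.reduce(function (sum, tri, rank) {
          var pos = langProfile.indexOf(tri);
          return sum + (pos === -1 ? langProfile.length : Math.abs(pos - rank));
        }, 0);
      }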

~~~
riffraff
The question would be where he got the language data.

If the original language data is available I'd suggest classifying the
trigrams as "high" and "low" frequency, which should improve performance
without needing to keep full frequency data.

~~~
wooorm
No full frequency data is kept; only the top 300 trigrams are identified. A quick
look through the source also reveals wooorm/trigrams and wooorm/udhr as sources!

~~~
riffraff
Yes, I meant: keeping full frequency data could have been avoided to save
space/memory, but having two classes (high/low) could be a good tradeoff.

~~~
wooorm
It’s an interesting thought. I might fiddle on it, but I’m not sure it would
work in practice (d’oh). Thanks!

------
perlgeek
What I'd really like to see is code that takes a body of text and extracts
parts that are written in another language.

That's quite common, like in mixed-language IRC channels, quotes from English
documents in documents mostly written in another language, and so on.

And stemming and indexing such a document for full text search usually gives
crappy results.

(Bonus points for detecting programming code samples, so that those parts
aren't stemmed at all).
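
A crude way to approximate this with any whole-text detector is to classify each paragraph (or sliding window) separately and pull out the runs that disagree with the dominant language. A sketch using franc itself; short windows will of course run into the short-text accuracy problem discussed elsewhere in this thread:

      var franc = require('franc');

      // Flag paragraphs whose detected language differs from the document's dominant one.
      function foreignParts(text) {
        var paragraphs = text.split(/\n\s*\n/);
        var langs = paragraphs.map(function (p) { return franc(p); });
        var tally = {};
        langs.forEach(function (l) { tally[l] = (tally[l] || 0) + 1; });
        var dominant = Object.keys(tally).sort(function (a, b) {
          return tally[b] - tally[a];
        })[0];
        return paragraphs.filter(function (p, i) {
          return langs[i] !== dominant && langs[i] !== 'und';
        });
      }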

~~~
wooorm
That would be awesome :)

------
grimborg
Interesting!

Sometimes it gets it almost right: I tried it with this piece of text in Catalan
(Balearic variant) and it classifies it as Portuguese (with Catalan as the 2nd
option): "I s'horabaixa la deixam passar i me mires tan a prop que me fa mal,
que surt es sol i encara plou, que t'estim massa i massa poc, que no sé com ho
hem d'arreglar, que som amics, que som amants."

It's strange, because it's pretty different from Portuguese...

The Catalan poem "tirallonga de monosíl·labs" gets classified as French.
([http://www.rodamots.com/calaix.asp?text=tirallonga](http://www.rodamots.com/calaix.asp?text=tirallonga))

~~~
wooorm
It sucks, right? Currently it’s good with long passages, but for shorter
values the results are pretty poor. The number of supported languages is just
too damn high!

------
lifthrasiir
The 60% threshold for single-language scripts seems to be way too strict for CJK
languages. And your method of calculating the occurrence ratio is flawed.

CJK scripts and languages tend to be relatively more concise (in terms of the
number of Unicode codepoints) than many other languages, so it is possible that
the ratio of CJK script over non-CJK script can be lower than average. And
the occurrence ratio is currently calculated over the number of characters
including _non-letters_, making the ratio much lower. Maybe a custom
threshold per script based on an actual corpus (90th percentile, maybe?) and a
better occurrence calculation would improve detection for those languages.
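
To illustrate the calculation being suggested here: compute a script's share over letter-ish characters only, so digits, punctuation and whitespace stop dragging the ratio below the threshold (the punctuation class below is only a rough approximation):

      // Ratio of codepoints matching `scriptRe` among letter-ish characters only.
      function scriptRatio(text, scriptRe) {
        // Strip whitespace, digits, and common (Latin + CJK) punctuation first.
        var letters = text.replace(/[\s\d.,!?()%"'«»:;、。「」・]/g, '');
        if (!letters.length) return 0;
        return (letters.match(scriptRe) || []).length / letters.length;
      }

      // e.g. Hangul syllables: digits and '%' no longer count against the script.
      scriptRatio('한국어 문서 4.1%', /[\uAC00-\uD7A3]/g); // => 1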

~~~
wooorm
I’m not sure. I don’t know any CJK languages myself. I’d like some test-cases
where the current methods do not work, as the example in the Readme seems to
work pretty well: `এটি একটি ভাষা একক IBM স্ক্রিপ্ট` is classified as Bengali?

~~~
lifthrasiir
Some examples follow. I really did test with arbitrary text on the Web, and I
agree that these are somewhat marginal examples. (But I do think that Franc's
margin for CJK languages is way too wide.)

한국어 문서가 전 세계 웹에서 차지하는 비중은 2004년에 4.1%로, 이는 영어(35.8%), 중국어(14.1%), 일본어(9.6%),
스페인어(9%), 독일어(7%)에 이어 전 세계 6위이다. 한글 문서와 한국어 문서를 같은 것으로 볼 때, 웹상에서의 한국어 사용 인구는 전
세계 69억여 명의 인구 중 약 1%에 해당한다.

This text from Korean Wikipedia is about the share of Korean documents among
all documents on the Internet. The digits distort the overall ratio, and Franc
doesn't give any candidates (not even "und").

現行の学校文法では、英語にあるような「目的語」「補語」などの成分はないとする。英語文法では "I read a book." の "a book"
はSVO文型の一部をなす目的語であり、また、"I go to the library." の "the library"
は前置詞とともに付け加えられた修飾語と考えられる。

This text from Japanese Wikipedia concerns the distinction between objects
and complements in English syntax. In this bilingual text it looks like
Japanese should reach the 60% threshold, but by codepoint count it doesn't.

~~~
wooorm
I pushed a fix incorporating your suggestions, and added your examples to the
specs.

Thanks a lot!

------
jodent
Quick test:

    
    
      ron? snn
      fra? cat
      swe? nds
      ita? und
      nld? gax
    

Source:

    
    
      var franc = require('franc');
      console.log('ron?', franc('Cate bere ai baut?'));
      console.log('fra?', franc('C\'est quoi le bordel la, putain'));
      console.log('swe?', franc('Jag kanner en bot, hon heter Anna'));
      console.log('ita?', franc('che guai'));
      console.log('nld?', franc('graag gedaan'));

~~~
indubitably
Testing a statistical language identifier with texts this short is absurd. If
you type in four or five words from

[https://en.wikipedia.org/wiki/List_of_English_words_of_Frenc...](https://en.wikipedia.org/wiki/List_of_English_words_of_French_origin)

…do you expect it to return French or English?

~~~
andreasvc
It is not absurd. Generally, if humans can do it, it is a reasonable task for
NLP to attempt.

Yes you can present edge cases where there is no definite answer, like the one
you cite, but this doesn't mean that the task in general is impossible or
useless.

~~~
wooorm
I agree the task is neither impossible nor useless. There’s work to do; short
passages should be supported. I do, however, think franc does a good job, and
it adds support for some languages which before today have (I think) never been
supported. Franc certainly “attempts” to fix language detection, which I
would argue is an AI-complete problem.

------
robin_reala
The supported languages file
([https://github.com/wooorm/franc/blob/master/Supported-
Langua...](https://github.com/wooorm/franc/blob/master/Supported-
Languages.md)) lists Matu Chin as having 182,000,000 speakers. Having never
heard of it this surprised me, but the Wikipedia page for it
([http://en.wikipedia.org/wiki/Matu_Chin_language](http://en.wikipedia.org/wiki/Matu_Chin_language))
lists 40,000 speakers. Mistake to fix?

~~~
wooorm
You seem to be completely right, I hand-crawled the data
([https://github.com/wooorm/speakers](https://github.com/wooorm/speakers)),
but seem to have made big typo there! Thanks!

------
mholmes680
+1 for using those ISO codes. I introduced them at work 4 years ago, and
everyone looked at me like I had ten heads.

~~~
allan_s
+1 indeed, but I think most people already have a hard time seeing why we
need to distinguish between country codes and language codes, and even
more so that something people consider a "dialect" can actually be a
totally different language (for example, in China a lot of "dialects"/fangyan
are actually not dialects of Mandarin, e.g. Shanghainese (the Wu language)
and the languages of Hunan province).

Then you can also try to explain to them that the common "represent a language by
a flag" approach quickly breaks down and leads to strong arguments between people
(what flag do you put for the Tibetan language, for example? Or for each of the
Indian languages?)

------
michaelmior
It would be interesting to see comparisons with language detection libraries
written in other languages as well. Not just in terms of runtime, but also
accuracy. Actually, it seems like this would be useful as a separate project
to help the decision-making process when choosing a library.

~~~
wooorm
Agreed :)

~~~
allan_s
For the case of "one sentence detection" you can use the Tatoeba project
database dump: [http://tatoeba.org/eng/downloads](http://tatoeba.org/eng/downloads)

You get a CSV of ISO code => sentence, which should be 99% accurate (as it gets
proofread by users), and against which you can compare your tool.

I think for longer texts one could use a Wikipedia dump or the like?
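
A rough sketch of such a comparison, assuming a tab-separated dump laid out as id, ISO 639-3 code, sentence; adjust the column indices to whatever the actual file uses:

      var fs = require('fs');
      var franc = require('franc');

      // Count how often franc's guess matches the Tatoeba-annotated language.
      var lines = fs.readFileSync('sentences.csv', 'utf8').split('\n');
      var total = 0;
      var correct = 0;
      lines.forEach(function (line) {
        var cols = line.split('\t');
        if (cols.length < 3) return;
        total++;
        if (franc(cols[2]) === cols[1]) correct++;
      });
      console.log('accuracy:', (100 * correct / total).toFixed(1) + '%');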

~~~
michaelmior
Thanks for the pointer. I might decide to whip something up one of these days.
I really have no need for language detection, but I just find it interesting
and I'm curious to see which libraries will win out.

------
1ris
"»Butter and cheese« is proper English and proper Fries."

Unfortunately Fries is not supported, but I'd be interested in the results.
But I don't think polyglots for natural languages are common, this is in fact
the only one I know.

~~~
wooorm
And it doesn’t have a Universal Declaration of Human Rights:
[http://www.unicode.org/udhr/index_by_name.html](http://www.unicode.org/udhr/index_by_name.html)

~~~
Luc
It does have several translations of the Bible, though. I guess it would be a
lot of work to find Bible translations for all those languages - or was there
another reason for using the Human Rights Declaration?

P.S. Kudos, very cool project!

EDIT: Frisian version should you want it:
[https://www.google.com/search?q=Yn+betinken+nommen+dat+it+er...](https://www.google.com/search?q=Yn+betinken+nommen+dat+it+erkennen+fan+de+ynherinte+weardichheid+en+fan+de+gelikense)

~~~
wooorm
Thanks! Currently the UDHRs are crawled, and I’d rather not include
exceptions and maintain their plain-text and XML/JSON versions by hand. If
you’re into growing the language, I suggest contacting the Office of the UN
High Commissioner for Human Rights and the Unicode project, or forking
wooorm/udhr and adding support; I’ll merge :)

------
BenjaminN
Tried "hey how are you?", gives me Haitian first.

~~~
wooorm
That’s because Haitians always say that! No, joking: it’s just that, with so
many supported languages, the accuracy for very short inputs is extremely
low.

~~~
ppod
A regularized prior would help.
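
Something like the following sketch, perhaps: re-rank the detector's candidates by a smoothed prior over how likely each language is in the first place (speaker counts, or traffic statistics for your application). The candidate format, (language, score) pairs where higher means a better match, is an assumption; adapt it to whatever the detector actually returns.

      // candidates: [[lang, score], ...] from the detector, higher score = better match.
      // speakers:   { lang: approximateSpeakerCount, ... }
      function rerank(candidates, speakers) {
        var total = Object.keys(speakers).reduce(function (s, l) {
          return s + speakers[l];
        }, 0);
        return candidates.map(function (c) {
          // Additive smoothing keeps the prior regularized: never exactly zero.
          var prior = ((speakers[c[0]] || 0) + 1) / (total + candidates.length);
          return [c[0], c[1] * prior];
        }).sort(function (a, b) { return b[1] - a[1]; });
      }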

~~~
wooorm
I’m also really interested in trying something like this:
[http://www.slideshare.net/shuyo/short-text-language-
detectio...](http://www.slideshare.net/shuyo/short-text-language-detection-
with-infinitygram-12949447) (slide 6). But I’d need a lot of training data,
more than UDHR.

------
apierre
I am using IDOL OnDemand which gives good results too.

------
melling
On a slight tangent, are there open source dictionaries that developers can
use for app localization, etc?

