
Language detection with Google's Compact Language Detector - davidw
http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html
======
dangoldin
Nice. I've been using a call to the Google online translator to achieve the
same result -
[http://ajax.googleapis.com/ajax/services/language/translate?...](http://ajax.googleapis.com/ajax/services/language/translate?q=Hello&v=1.0&langpair=|en)

~~~
toisanji
Are there published API limits on this service?

~~~
dangoldin
I'm not aware of any. I'm using this in a high-volume, non-critical capacity
and it seems to be okay. I should keep better stats though.

------
tha-dude
Nice! I wrote a .NET wrapper myself, but never got around to a Python
extension. One question: did you experience any memory leak issues with CLD?
My .NET wrapper DLL seems to leak, and I never really checked whether it was
the C++/CLI layer I added on top or the actual native C++ CLD code. I doubt
the latter since (according to my basic understanding) nothing is created in
the original code that needs to be cleaned up manually. Before I start
debugging mixed-mode .NET applications I just wanted to be sure.

------
lstrojny
Great library! Here are bindings for PHP:
<https://github.com/lstrojny/php-ccld>

------
sick-boy
_"You must provide it clean (interchange-valid) UTF-8, so any encoding issues
must be sorted out before-hand."_

In most cases you have to know the language in order to guess the encoding and
convert to UTF-8 if necessary. Mutual recursion...
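
The first half of that requirement, at least, needs no knowledge of the
language: checking whether bytes are already valid UTF-8 is mechanical. A
minimal Python sketch using only the standard library follows. The function
name `to_clean_utf8` and the fallback list are assumptions for illustration,
not anything CLD provides, and the fallback order is exactly the guess that
the recursion problem is about.

```python
def to_clean_utf8(raw, fallbacks=("cp1252", "iso-8859-1")):
    """Return (text, encoding_used) for a byte string.

    Strict UTF-8 is tried first. The fallback encodings are a guess
    (hypothetical defaults) and would need to be chosen per corpus.
    """
    try:
        return raw.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        pass
    for enc in fallbacks:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Reached only if every fallback rejects the bytes:
    # decode lossily, replacing bad sequences with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"
```

The returned text can then be re-encoded with `.encode("utf-8")` before it is
handed to a detector. Whether the fallback guess was right is still something
only the language (or a statistical detector) can tell you.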

~~~
ninjin
Mark Pilgrim reverse-engineered (or ripped out, I can't remember) the encoding
detection that Firefox uses. It has done a fairly good job for my web crawling:

<http://pypi.python.org/pypi/chardet>

~~~
e98cuenc
In my experience chardet very often misclassifies ISO-8859-1 as ISO-8859-2. I
saw the misclassification even on small Spanish pages that used only the
typical Spanish characters.
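
One plausible contributing factor is that the two encodings agree on most of
the bytes Spanish needs. The sketch below (standard library only; the
character list and function names are my own choices) shows what each
character's ISO-8859-1 byte turns into when read as ISO-8859-2:

```python
# Characters chosen as "typical Spanish" for this illustration.
SPANISH_CHARS = "áéíóúüñÁÉÍÓÚÜÑ¿¡"

def latin2_view(chars=SPANISH_CHARS):
    """Map each character to what its ISO-8859-1 byte means in ISO-8859-2."""
    return {ch: ch.encode("iso-8859-1").decode("iso-8859-2") for ch in chars}

def ambiguous(chars=SPANISH_CHARS):
    """Characters whose byte decodes identically in both encodings."""
    return [ch for ch, other in latin2_view(chars).items() if ch == other]

if __name__ == "__main__":
    for ch, other in latin2_view().items():
        print(ch, "->", other, "" if ch == other else "(differs)")
```

The accented vowels and ü come out unchanged; only ñ, Ñ, ¿ and ¡ map to
different characters. A short page that happens to contain few of those gives
a byte-level detector very little to separate the two encodings.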

------
johnx123-up
I thought detection would be easier with a Unicode ordinal value map table.

~~~
abhaga
I am assuming it will recognize languages even when they are using the same
character sets. No?
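
To make the distinction concrete, here is a rough sketch of what an ordinal
lookup can and cannot do. It is an assumption-laden illustration, not how CLD
works: it takes the first word of each letter's Unicode character name as a
stand-in for the script, using Python's standard `unicodedata` module. The
function name `dominant_script` is hypothetical.

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    """Guess the script from the first word of each letter's Unicode name.

    This is a heuristic: names such as "LATIN SMALL LETTER A" or
    "CYRILLIC CAPITAL LETTER PE" start with the script, but that is a
    naming convention, not a formal script property.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None
```

`dominant_script("Привет мир")` gives `"CYRILLIC"`, which narrows things down,
but `"hello world"` and `"hola mundo"` both give `"LATIN"`. Telling English
from Spanish needs something beyond the code points, which is where a
statistical detector comes in.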

