

Ask HN: determine language - AssocPars

I'm planning to work on a project that will involve i) harvesting small pieces of text and ii) determining the language of those pieces, so that the pieces can later be displayed sorted by language.
I'd be very interested to hear HN's suggestions on how to do the language identification part the simplest, quickest, cheapest way. Not looking for 100% accuracy here, rather something that can be implemented in a day or 2 of coding and gives satisfactory enough results ;)
Thanks!
======
patio11
Ooh, natural language processing, my first love.

You can take a corpus of text in each language, collect letter frequencies,
and create a vector for each language. Then, when you have sample text, create
a vector of its letter counts. Take the dot product of the sample vector with
each language vector, and rank the languages by that score in descending order
as your guesses.

You can also do this with bigrams or trigrams.

There is a lot of academic work on this question. I can't recall citations
since it has been 7 years since I worked with it.
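
A minimal sketch of the approach above, assuming Python and toy one-sentence
corpora (real profiles would need far more text per language). The names
ngram_profile, dot and rank_languages are made up for illustration. One thing
here is an addition rather than something stated above: each vector is scaled
to unit length, so the dot product becomes a cosine similarity and a bigger
corpus doesn't win just by having larger counts. Setting n=1 gives plain letter
frequencies; n=2 or n=3 gives bigrams or trigrams.

```python
import math
from collections import Counter


def ngram_profile(text, n=1):
    """Return a unit-length vector (as a dict) of character n-gram counts."""
    # Keep letters and spaces only; lowercase so 'The' and 'the' match.
    cleaned = "".join(c for c in text.lower() if c.isalpha() or c == " ")
    counts = Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))
    norm = math.sqrt(sum(v * v for v in counts.values()))
    if norm == 0:
        return {}
    # Normalization is an assumption added here, not part of the description.
    return {gram: v / norm for gram, v in counts.items()}


def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dict
    return sum(v * b.get(gram, 0.0) for gram, v in a.items())


def rank_languages(sample, profiles, n=1):
    """Return (language, score) pairs, best guess first."""
    vec = ngram_profile(sample, n)
    scores = [(lang, dot(vec, prof)) for lang, prof in profiles.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy corpora, purely illustrative; use large text samples in practice.
CORPORA = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "fr": "le renard brun saute par dessus le chien paresseux et puis le chien dort au soleil",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund in der sonne",
}
PROFILES = {lang: ngram_profile(text, n=2) for lang, text in CORPORA.items()}

if __name__ == "__main__":
    for lang, score in rank_languages("the dog is in the sun", PROFILES, n=2):
        print(lang, round(score, 3))
```

The n used for the sample has to match the n used to build the profiles,
otherwise the vectors share no keys and every score comes out as zero.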

------
michael_dorfman
The simplest, quickest, cheapest way to solve any problem is to let somebody
else solve it for you.

Like Google, for example. Their AJAX Language API for Translation and
Detection sounds like it might fit your need.

<http://code.google.com/apis/ajaxlanguage/documentation/>

~~~
AssocPars
Looks like it works well even with very short texts, but I'm afraid we'd get
banned pretty soon by Google: we'll have lots and lots of very short texts, so
we'd need to invoke the API a lot... (with the aforementioned ban as the
outcome, we fear) :(

------
pierrefar
If you're using PHP, a while back I found a library that returned a scored
array of languages the text might be in. It needed quite a bit of text to work
well, but I was working with only a few sentences. I can probably dig it out
for you if you want.

~~~
AssocPars
Hmm, our early tries with a PHP lib were no good, since we have very limited
text to analyse each time (not enough, I mean, for the library to identify the
language reliably) :(

