

Show HN: My Leeds Hack Day project, a language detection API - lgeek
http://polyglossy.com/

======
petercooper
I was also at Leeds Hack Day. I didn't know you were working on this until
the presentations, but I wrote a language detection library for Ruby some
time back: <https://github.com/peterc/whatlanguage>

It uses a slightly weird technique, though: it's dictionary based and uses a
Bloom filter per language for memory efficiency. Going forward, I plan to
rewrite it to use a combination of n-grams and language "fingerprints."
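
For anyone curious, the gist looks something like this (an illustrative
Python sketch only; whatlanguage itself is Ruby and its internals differ):

    # Illustrative sketch of the dictionary + Bloom filter approach
    # (not whatlanguage's actual code, which is Ruby).
    import hashlib

    class BloomFilter:
        def __init__(self, size=100000, hashes=4):
            self.size, self.hashes = size, hashes
            self.bits = bytearray(size // 8 + 1)

        def _positions(self, word):
            # Derive several bit positions per word from salted hashes.
            for i in range(self.hashes):
                h = int(hashlib.md5(("%d:%s" % (i, word)).encode()).hexdigest(), 16)
                yield h % self.size

        def add(self, word):
            for p in self._positions(word):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, word):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(word))

    def detect(text, filters):
        # filters: {language: BloomFilter built from that language's word list}
        words = text.lower().split()
        hits = {lang: sum(w in bf for w in words) for lang, bf in filters.items()}
        return max(hits, key=hits.get)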

~~~
lgeek
That's pretty cool! One of my friends suggested using dictionaries and Bloom
filters, but I wanted to build probabilistic language models.
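
Roughly, the kind of model I have in mind looks like this (a simplified
sketch, not the actual polyglossy code; the real models may use different
n-gram sizes and smoothing):

    # Character-trigram language models: count trigram frequencies per
    # language, then score new text by total log-likelihood.
    import math
    from collections import Counter

    def trigrams(text):
        text = "  " + text.lower() + " "  # pad so short words still contribute
        return [text[i:i + 3] for i in range(len(text) - 2)]

    def train(corpus):
        counts = Counter(trigrams(corpus))
        total = sum(counts.values())
        return {g: math.log(c / total) for g, c in counts.items()}

    def detect(text, models, floor=math.log(1e-7)):
        # floor stands in for unseen trigrams; real smoothing would be better
        scores = {lang: sum(m.get(g, floor) for g in trigrams(text))
                  for lang, m in models.items()}
        return max(scores, key=scores.get)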

Do you have any accuracy stats? I'm guessing my approach might work better in
some cases because the models include frequency information too. Did you
experience significant accuracy loss when adding new languages? Anyway, I'll
run it over my test data and compare.

~~~
petercooper
_Do you have any accuracy stats?_

No, but as you've noted, the method is intrinsically less accurate on short
texts and more accurate the longer the text gets. As my anticipated use was
for documents over 10-20 words, this was OK. I expect the other techniques I
outlined, which I'm switching to, to yield more accurate results across the
board.

------
cemregr
Good luck with your project! Just wanted to say that the Google Translate API
also has language detection and gives you a degree of confidence:
<http://www.google.com/uds/samples/language/detect.html>
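
If you want to call it programmatically, the AJAX Language API exposes a
detect endpoint; something like the sketch below should work (the endpoint
and response fields here are from its documentation as I remember it, so
treat this as an assumption rather than gospel):

    # Sketch against Google's AJAX Language API; the endpoint and response
    # fields are from its docs as remembered and may change or be retired.
    import json, urllib.parse, urllib.request

    def google_detect(text):
        url = ("http://ajax.googleapis.com/ajax/services/language/detect"
               "?v=1.0&q=" + urllib.parse.quote(text))
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)["responseData"]
        # Returns an ISO language code plus Google's confidence estimate.
        return data["language"], data["confidence"], data["isReliable"]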

------
zorander
Uhhh... what?

<http://imgur.com/LH2Hi.png>

~~~
lgeek
Sorry, it only supports English, German, and French at the moment.

~~~
zbanks
"Je t'aime" is wrongly classified as English. (Although understandable, since
it's short)

How does this work? You claim it has "zero language knowledge," so how does it
classify? Did it start out with nothing, and then train it on a corpus?

EDIT: Never mind, found your slides (<http://polyglossy.com/presentation.pdf>
[pdf]). How big a corpus did you use?

~~~
lgeek
I've used around 40KB of text per language for training.

~~~
catch23
Wouldn't something like 2-gram Bayes be more accurate and faster?
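
By that I mean something along these lines (a rough sketch: naive Bayes over
character bigrams with add-one smoothing, assuming equal class priors):

    # Naive Bayes over character bigrams with Laplace (add-one) smoothing.
    # Equal class priors assumed (i.e. similar-sized training corpora).
    import math
    from collections import Counter

    def bigrams(text):
        text = text.lower()
        return [text[i:i + 2] for i in range(len(text) - 1)]

    class BigramNB:
        def __init__(self, corpora):
            # corpora: {language: training text}
            self.counts = {lang: Counter(bigrams(t)) for lang, t in corpora.items()}
            self.totals = {lang: sum(c.values()) for lang, c in self.counts.items()}
            self.vocab = len({g for c in self.counts.values() for g in c})

        def classify(self, text):
            def score(lang):
                c, n = self.counts[lang], self.totals[lang]
                return sum(math.log((c[g] + 1) / (n + self.vocab))
                           for g in bigrams(text))
            return max(self.counts, key=score)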

~~~
lgeek
I don't know; I haven't really done NLP before. My guess is that it would be
slower, but not by a significant amount. It might be more accurate, though,
so I could give it a shot.

~~~
zbanks
Any chance of posting the source?

~~~
lgeek
Yeah, I'll probably push it to github later this week.

~~~
lgeek
...and here it is: <https://github.com/lgeek/polyglossy> (with a slight delay
because I was working on improving the accuracy)

