

Show HN: Language detection as a service - mgaudin
https://getlang.io

======
davidjgraph
I'll ask plainly what others are hinting at: is this actually a service you
built yourself, or are you a proxy for something like the Google Translate API[1]?

If it is your own service, it's critical that you explain the hows and whys of
your forecast availability and scalability numbers for your chosen
architecture, given who you are competing with.

[1] [https://developers.google.com/translate/v2/using_rest#detect-language](https://developers.google.com/translate/v2/using_rest#detect-language)

------
beering
Alternatively, people can just download langid.py[1] and do language detection
locally. This is not a particularly hard problem - I think it's doable by
undergrad ML or NLP classes.

The tricky parts are usually political - are users going to be angry if you
confuse Indonesian with Malay, and so on?

[1] [https://github.com/saffsd/langid.py](https://github.com/saffsd/langid.py)

~~~
danieldk
_I think it's doable by undergrad ML or NLP classes._

In fact, we ran a course for high school students where they learnt how a
language guesser works and had to modify one themselves. A simplistic method
that already works very well is:

* Create an n-gram fingerprint for each language by making a list of character uni-, bi-, and trigrams ordered by their frequency in a text. Retain the (say) 300 most frequent n-grams.

* To categorize a text, create a fingerprint for that text. Then, for each language, compute the sum of the n-gram rank differences. If an n-gram does not occur in a language's fingerprint, the difference is the fingerprint size. Finally, pick the language with the lowest sum.
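A minimal Python sketch of those two steps; the fingerprint size and the toy texts used in training would of course be replaced by real per-language corpora:

```python
from collections import Counter

def fingerprint(text, size=300):
    """Map the `size` most frequent character uni-/bi-/trigrams to their rank."""
    counts = Counter()
    for n in (1, 2, 3):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(size))}

def out_of_place(doc_fp, lang_fp, size=300):
    """Sum of rank differences; absent n-grams cost the full fingerprint size."""
    return sum(abs(rank - lang_fp[gram]) if gram in lang_fp else size
               for gram, rank in doc_fp.items())

def guess(text, lang_fps):
    """Pick the language whose fingerprint is closest to the text's."""
    doc_fp = fingerprint(text)
    return min(lang_fps, key=lambda lang: out_of_place(doc_fp, lang_fps[lang]))
```

Trained on real corpora instead of toy strings, this is essentially the rank-order method of Cavnar and Trenkle (1994).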

Of course, you can do fancier things, such as training an SVM or logistic
regression classifier with n-grams and words as features, etc.
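A sketch of that fancier route using scikit-learn, assuming it is installed; the two training sentences here are placeholders for real corpora:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Character uni-, bi-, and trigram features fed to a logistic regression classifier.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(),
)
model.fit(
    ["the quick brown fox jumps over the lazy dog",
     "de snelle bruine vos springt over de luie hond"],
    ["en", "nl"],
)
```

Swapping `LogisticRegression` for `sklearn.svm.LinearSVC` gives the SVM variant with no other changes.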

An interesting variation is being able to distinguish different languages
within a single text, e.g. a Dutch text with English quotes.

~~~
ma2rten
It's easy to write a language guesser, but it's not easy to write a good one.
Even Google Translate is not perfect (see below).

~~~
Radim
Great point. Often overlooked by people who only know what I call "drive-by
machine learning" (finished an online ML course or something).

There's a multitude of problems with real-world texts that a robust guesser
must deal with gracefully: short texts; texts in none of the languages the
guesser was trained on (can it return "none of the above", or does it return a
random one?); texts in multiple languages (including common noun phrases
inserted into text in another language); texts with parts repeated many times
(web pages and blogs in particular are a bitch!), which skews char/word
distributions and messes up statistical models; etc.

It's the same thing as with spelling correction, really. "But Norvig did it in
1.5 lines of Python!" See "A Spellchecker Used To Be A Major Feat of Software
Engineering" at
[https://news.ycombinator.com/item?id=3466927](https://news.ycombinator.com/item?id=3466927)
Spoiler: it still is, except for "drive-by ML apps".

~~~
danieldk
_Often overlooked by people who only know what I call "drive-by machine
learning" (finished an online ML course or something)._

A bit sour, are we? ;)

The point is that it is an NLP task where it is relatively easy to get good
results on general text (see Cavnar and Trenkle). So, it is a fun and
satisfying exercise.

Saying there is difficult noisy data is pointing out the obvious ;).

~~~
Radim
If it's obvious to you, then you're not the target audience of my disclaimer
:)

But HN responses to posts like these overwhelmingly suggest it's far from
obvious.

_So, it is a fun and satisfying exercise._

I agree. Perhaps you can help evangelize the world of difference between "fun
exercise" and a production-ready system (the OP is a paid service).

~~~
danieldk
_I agree. Perhaps you can help evangelize the world of difference between "fun
exercise" and a production-ready system (the OP is a paid service)._

I used to be a bit upset when someone claimed to have implemented a state-of-
the-art POS tagger when they had just taken the dictionary and rules produced
by Eric Brill's learner verbatim and applied them. Or worse, taken only the
first ten rules ;).

Nowadays I just prefer to let evolution do its work. The best, or the one with
the best marketing, wins :).

~~~
Radim
Liberating approach Daniel!

I'm still in the naive do-it-well phase, but seeing the downvotes, it may be
time to join the hipsters. Or at least shut up ;)

------
chrismorgan
The design is fine, but the language used on the page itself isn't quite
right.

I see three spelling errors in your language list:

\- Panjabi should be Punjabi;

\- Teligu should be Telugu;

\- Ukraininan should be Ukrainian.

There are also a few grammar problems earlier in the document, and style
problems (e.g. English doesn't use a space before sentence-ending punctuation
marks).

------
mdemare
Hmm, it takes 5+ seconds to get a response, and it chokes on the same test
phrase as Google, thinking "Ik hou van vette lettertypes." ("I love bold
typefaces.") is Norwegian...

~~~
ma2rten
It's probably overloaded because it's on Hacker News, and it's likely based on
the same features (character n-grams) as Google Translate. Your text is simply
too short for character n-grams to be 100% reliable.

------
diasks2
Looks interesting. Why not have an input on the landing page where someone can
try it out without even signing up? Then people could give it a spin before
giving away their email address. Otherwise, the user just has to trust your
99% figure; it might be helpful to give some data around that, even if only as
a footnote (on a corpus of x, over x period of time, etc.).

Also, I think it would be clearer if it said "A simple and scalable way to
automatically classify text by language" instead of "A simple and scalable way
to classify automatically text by language".

Design looks very clean though. Nice work.

EDIT: Also, your social media links at the bottom aren't hooked up yet.

~~~
himal
Hint: You can enter any email address you want. You don't have to validate it.
(Well, at least for now.)

------
captn3m0
For those who thought (like me) that this was a programming language detection
service, you can take a look at github/linguist.

------
danieldk
Also, for those who would like to know how you can implement a language
guesser (sources + link to paper):

[http://www.let.rug.nl/vannoord/TextCat/](http://www.let.rug.nl/vannoord/TextCat/)

Python version:

[http://thomas.mangin.com/data/source/ngram.py](http://thomas.mangin.com/data/source/ngram.py)

It's something that is fun to implement and doesn't take more than a few hours
at most.

------
mdemare
Why is this better than the Google or Bing translate APIs, which also offer
language detection?

------
redox_
You should also consider full, unambiguous words before falling back to
trigrams. "marché" occurs only in French, whereas "mar", "arc", ... occur in
lots of languages. This should drastically improve your results.
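A minimal sketch of that shortcut; the word-to-language table here is a toy illustration, not a real lexicon:

```python
# Toy lexicon of words assumed to occur in exactly one language (illustrative only).
UNAMBIGUOUS = {
    "marché": "fr",
    "straße": "de",
    "ciudad": "es",
}

def quick_guess(text):
    """Return a language if any unambiguous word matches; otherwise None,
    signalling a fall-back to the n-gram model."""
    for word in text.lower().split():
        lang = UNAMBIGUOUS.get(word)
        if lang is not None:
            return lang
    return None
```

In practice the table would be built by intersecting per-language lexicons and keeping only words unique to one language.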

~~~
redox_
Store only the top N common unambiguous words if RAM consumption matters ;)

~~~
danieldk
Or store the lexicon in a deterministic acyclic finite state automaton. E.g.
(shameless plug):

[https://github.com/danieldk/dictomaton](https://github.com/danieldk/dictomaton)

Though, having implemented a language guesser myself, it's only an issue with
very short texts (a few words). On longer texts models based on character
n-grams achieve very high accuracies.

------
web64
I've used detectlanguage.com[1] in the past, which seems like a very similar
service to getlang.io. With both of them it is hard to know what is behind the
scenes...

[1] [http://detectlanguage.com/](http://detectlanguage.com/)

------
alexott
And it looks like they are using the following library:
[http://code.google.com/p/language-detection/](http://code.google.com/p/language-detection/) \- at least the
number & list of languages is very similar :-)

~~~
ma2rten
or just the same training data...

------
jhull
I wonder how this performs on short text posts like tweets. At my last gig
where we did social media text analysis we used a few different packages
(chromium, guess-language, and our own ngram classifier) and still had pretty
low accuracy for tweets.

~~~
AznHisoka
Have you looked at the metadata returned with a tweet? It also includes the
language, as well as the location of the tweeter, which gives you some clues.

------
himal
You guys might want to handle GET requests for the /try
URL ([https://getlang.io/try](https://getlang.io/try)) as well. Currently it
returns "Server Error (500)" for GET requests.

------
martingordon
Matthew Kirk spoke about a neural network language predictor at RubyConf a few
weeks ago. Here are his slides and code:
[http://modulus7.com/rubyconf/](http://modulus7.com/rubyconf/)

------
efeamadasun
I don't know why, but I can't stand this sentence: "A simple and scalable way
to classify automatically text by language". "Classify" and "automatically"
need to switch places.

------
alexott
Apache Tika ([http://tika.apache.org/](http://tika.apache.org/)) also has a
language detector, although it may not be as good as CLD...

------
razvvan
If I were to implement this, I'd rather use Google's Prediction API. At least
with that you get a bit of control over what goes into the training data.

------
bkamapantula
It's Telugu, not Teligu. By Panjabi, do you mean Punjabi?

As others already mentioned, it would be good to have users try examples
before signup.

------
phpnode
How does this compare in accuracy to Chromium's Compact Language Detector?

[https://code.google.com/p/chromium-compact-language-detector/](https://code.google.com/p/chromium-compact-language-detector/)

[https://github.com/mzsanford/cld](https://github.com/mzsanford/cld)

~~~
alexott
From my experience, CLD works pretty well in most cases. But you need to take
care of encoding detection...

~~~
dbuxton
Yes, but you presumably need to get that right in order to encode as UTF-8 and
send off to a third-party API...

------
donutdan4114
"test it out" comes back as french...

~~~
oedj
Maybe you've fallen into the 1% error rate?

~~~
afsina
Language guessing is rather hard when only a few letters are available,
especially with statistical methods. I think after 20 or so letters you enter
the >95% accuracy zone. In a simple library I wrote
([https://github.com/ahmetaa/zemberek-nlp/tree/master/lang-id](https://github.com/ahmetaa/zemberek-nlp/tree/master/lang-id);
it works for 60 languages, but no docs yet), the test results for Turkish and
English are:

    For 20 letters: TR=95.90  EN=94.96
    For 50 letters: TR=99.44  EN=99.53

If 50 letters are used per document, it identifies about 20,000 docs per
second on a decent desktop.

------
RBerenguel
Some day I have to rewrite whatlanguageis.com (currently not working) with all
the great ideas I had to improve it...

------
m4tthumphrey
    curl -XPOST -d 'hello' 'https://getlang.io/get?token=...'
    
    { "code": "fi", "name": "suomi, suomen kieli", "name_en": "Finnish" }

O_O

------
ssiddharth
It might be mild OCD, but it'd be great if the list of supported languages
were ordered in some logical way.

------
ismaelc
Where's the login page? I need to get my token

