Ask HN: What is the best open source OCR software supporting multiple languages? - postila
======
contingencies
Tesseract[0] is a system that is split into several parts: at least one does
layout analysis and another does the actual OCR. Output is a separate layer
again. I believe it is an open source adaptation of what Google used for its
books project. The interface was less than polished a few years ago, to the
point where getting it running at all was rather difficult. However, for
multilingual work (including Chinese) it is probably ideal.[1] Note that if
you are scanning books, there are now some interesting open hardware systems
appearing online that turn pages and take photos with cameras, so you can scan
books at high resolution without cutting them up.

[0] [https://github.com/tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract)
[1] [https://github.com/tesseract-ocr/langdata](https://github.com/tesseract-ocr/langdata)

~~~
kidsil
Tesseract can give you nightmares, but unfortunately it's the only solid OCR
library out there.

~~~
postila
What kind of nightmares?

------
frik
Besides Tesseract, which was state-of-the-art OCR software from HP in the
early nineties and was revived by Google a few years ago as open source, there
is CuneiForm, formerly a main competitor to ABBYY FineReader.

CuneiForm was open sourced a few years ago, though in a sad state (project
files were in VS C++ 6 ('98), comments in Russian), but a community fixed that
and ported it to Linux. It's also probably the best one for the Russian
language. It also has a UI and some advanced features that only ABBYY and
CuneiForm have, but none of the other competitors (certainly no other open
source OCR package).
[https://en.wikipedia.org/wiki/CuneiForm_(software)](https://en.wikipedia.org/wiki/CuneiForm_\(software\))

------
deedubaya
Sorry to hijack, but what about the best OCR service? I'd much rather farm the
OCR work out to another service than try to do it myself.

~~~
danso
I bought ABBYY FineReader for Mac for about $99. I find it to be pretty
amazing. My new scanner also came with it and I generally expect it to do a
reasonable text conversion of whatever I throw at it, whether it be newsprint
articles or crumpled receipts.

If you need to do OCR that also preserves table structure -- which is what I
bought FineReader for in the first place -- I don't think there's any open
source alternative, and FineReader does a very capable job.

Here's an example of FineReader in action: OCRing the docs released by the FBI
on Clinton's email system. I've also included the pdftotext output showing how
FineReader's text conversion also attempts to preserve the physical layout of
the text characters:

[https://github.com/dannguyen/clinton-hillary-email-fbi-inves...](https://github.com/dannguyen/clinton-hillary-email-fbi-investigation-docs)

~~~
criddell
That's pretty good! How does it do with your handwriting?

I used to use Evernote quite heavily and I've never been able to replace their
excellent OCR. I would search for some text and was always blown away when it
would find a photo of a whiteboard or a sketch of mine.

Any idea what Evernote uses?

~~~
jumasheff
Since both ABBYY and Evernote were founded by Russians, I bet Evernote uses
FineReader.

~~~
nshm
That is not the case. Evernote was created by Stepan Pachikov
([https://en.wikipedia.org/wiki/Stepan_Pachikov](https://en.wikipedia.org/wiki/Stepan_Pachikov)),
who also founded the company Parascript
([http://www.parascript.com](http://www.parascript.com)), which was doing OCR
long before ABBYY and mostly specializes in handwriting. So both technologies
are from Russia, but the teams are independent.

------
pgodzin
I've used Tesseract with the Tess4J Java wrappers, which has been pretty good.

~~~
postila
Not that good for Russian. For English, it's not the best either: too many
mistakes with some fonts.

~~~
pgodzin
Sorry, thought you meant multiple programming language support. Yes,
definitely ran into some font issues and noise turning '1' into 'L' or 'T',
etc. As people have been saying though, it may not be great out of the box for
you but you can train it on the font you want.
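
The '1' read as 'L' or 'T' kind of error can sometimes be patched after
recognition instead of by retraining. A minimal sketch, assuming the confusion
only matters inside tokens that are otherwise numeric; the mapping table and
both helper names are hypothetical, and this is not something Tesseract or
Tess4J does for you:

```python
# Hypothetical post-processing pass: repair letters that OCR produced in place
# of the digit '1', but only inside tokens that otherwise look numeric.
CONFUSABLE_TO_DIGIT = {"L": "1", "T": "1"}  # assumed mapping, from the errors described above


def fix_numeric_token(token: str) -> str:
    """Replace confusable letters with digits if the token is otherwise all digits."""
    has_digit = any(ch.isdigit() for ch in token)
    only_digits_or_confusables = all(
        ch.isdigit() or ch in CONFUSABLE_TO_DIGIT for ch in token
    )
    if has_digit and only_digits_or_confusables:
        return "".join(CONFUSABLE_TO_DIGIT.get(ch, ch) for ch in token)
    # Ordinary words (no digits, or other letters present) are left untouched.
    return token


def fix_text(text: str) -> str:
    """Apply the token-level fix to every space-separated token."""
    return " ".join(fix_numeric_token(token) for token in text.split(" "))
```

A rule like this is deliberately conservative: a word such as "Linux" has no
digits and is never rewritten, at the cost of missing errors in mixed tokens.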

------
msandford
Do you want to OCR several human languages, or do you want bindings/libraries
in several programming languages? The question as written is a little
ambiguous.

~~~
postila
I need to process millions of images and extract text from them as accurately
as possible. The primary language is Russian, but some texts are in English.
I'm also interested in other languages (Spanish, German, etc.) for future
needs.

What I've tried so far (including Tesseract) is either bad for Russian texts
or cannot work with mixed texts (e.g. Russian with some English words). Or
both.

Programming languages/platforms don't matter, but something Linux-compatible
is better of course.
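
For reference, the mixed-language setup in question looks roughly like this
with the Tesseract command line, which accepts several language codes joined
with `+` in its `-l` option. A minimal sketch, assuming the `tesseract` binary
and the `rus` and `eng` trained data are installed; the helper names and the
file name are hypothetical, and this says nothing about how good the result
will be:

```python
import subprocess
from typing import List, Sequence


def build_tesseract_command(image_path: str, languages: Sequence[str]) -> List[str]:
    """Build a tesseract invocation that recognizes several languages in one pass."""
    if not languages:
        raise ValueError("at least one language code is required")
    # "stdout" as the output base makes tesseract print the text instead of
    # writing a file; "-l rus+eng" loads both language models together.
    return ["tesseract", image_path, "stdout", "-l", "+".join(languages)]


def ocr_image(image_path: str, languages: Sequence[str] = ("rus", "eng")) -> str:
    """Run tesseract on one image and return the recognized text."""
    command = build_tesseract_command(image_path, languages)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


# Example (needs tesseract installed; "scan.png" is a placeholder file name):
# print(ocr_image("scan.png"))
```

For millions of images, a function like `ocr_image` could be mapped over the
file list with a process pool, since each call is independent.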

------
dogma1138
Tesseract can (and has to be) trained, so it can effectively support anything.

OCR usually isn't tied to a language unless you are doing some really high-end
stuff where it does linguistic prediction, and you only need that if you are
working with really poor (image) quality sources.

Overall, OCR is "language" agnostic. It is, however, usually not typeface
agnostic, so what you would want to do is train it for whatever fonts are
common for a particular language.

This gets slightly tricky if you have to do handwritten transcription or very
stylized fonts but in those cases the "language" again is not an issue because
your OCR program doesn't understand language to begin with.
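
The "linguistic prediction" step can be approximated, very roughly, as a
dictionary lookup after recognition. A minimal sketch using Python's standard
`difflib`; the vocabulary, the `correct_word` helper and the 0.8 cutoff are
all assumptions for illustration, not how any particular OCR engine does it:

```python
import difflib
from typing import Iterable


def correct_word(word: str, vocabulary: Iterable[str], cutoff: float = 0.8) -> str:
    """Return the closest vocabulary entry, or the word unchanged if none is close."""
    matches = difflib.get_close_matches(
        word.lower(), list(vocabulary), n=1, cutoff=cutoff
    )
    # Note: a match is returned in the vocabulary's own (lowercase) spelling.
    return matches[0] if matches else word


# Tiny illustrative vocabulary; a real one would be a per-language word list.
VOCABULARY = ["tesseract", "russian", "english"]
```

This is where the language does start to matter: the character recognizer
stays the same, but the word list has to match the text being read.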

------
acd
Deep learning with Caffe possibly outperforms Tesseract.

[https://christopher5106.github.io/computer/vision/2015/09/14...](https://christopher5106.github.io/computer/vision/2015/09/14/comparing-tesseract-and-deep-learning-for-ocr-optical-character-recognition.html)

------
mynewtb
Tesseract

------
postila
The best tool would be something that I can iteratively improve using some ML
methods, that I would run on Linux and integrate into my programs. And open
source, of course. I know, I want too much :)

~~~
avmich
No, not too much, at least I'd agree with you.

There is a lot of information about the current Tesseract here:
[https://github.com/tesseract-ocr/docs/tree/master/das_tutori...](https://github.com/tesseract-ocr/docs/tree/master/das_tutorial2016)

Tesseract is trainable, even though the bulk of capabilities came from
algorithms designed well before deep learning became popular.

One of the slides mentions that it's puzzling that Tesseract is "winning" over
modern ML attempts to solve OCR. However, the latest developments (adding LSTM
networks to Tesseract) are reported to be promising. I wonder when they'll
become available on GitHub...

~~~
postila
This looks interesting, thank you

------
kondro
Is there anything great (even if potentially pricey) for ICR (individual
handwritten characters, usually separated by boxes) or handwriting? Preferably
as a service.

