
Ask HN: Open source OCR library? - whbv
Does anyone have a recommendation for a good character recognition library? I'm trying to pull text out of images. I tried ocrad.js but haven't had much success with it.
======
hbornfree
As others pointed out, Tesseract with OpenCV (for identifying and cropping the
text region) is quite effective. On top of that, Tesseract is fully trainable
with custom fonts.

In our use case, we've mostly had to deal with handwritten text and that's
where none of them really did well. Your next best bet would be to use
HOG (histogram of oriented gradients) features along with SVMs. OpenCV has really good
implementations of both.

Even then, we've had to write extra heuristics to disambiguate between 2 and z
and s and 5, etc. That was too much work and a lot of if-else. We're currently
putting our effort into CNNs (convolutional neural networks). As a start, you
can look at Torch or Caffe.
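The 2-vs-z / 5-vs-s disambiguation above can be sketched as a context pass over the recognized string. This is a hypothetical illustration of that kind of if-else heuristic (the names and the majority-vote rule are mine, not the commenter's):

```python
# Hypothetical post-OCR pass: resolve 2/z and 5/s by looking at the
# surrounding character class (digits tend to appear next to digits,
# letters next to letters). Lowercase only, for brevity.

CONFUSION_PAIRS = {"z": "2", "2": "z", "s": "5", "5": "s"}

def disambiguate(text: str) -> str:
    chars = list(text)
    for i, c in enumerate(chars):
        if c not in CONFUSION_PAIRS:
            continue
        # Only trust neighbors that are not themselves ambiguous.
        neighbors = [chars[j] for j in (i - 1, i + 1)
                     if 0 <= j < len(chars) and chars[j] not in CONFUSION_PAIRS]
        if not neighbors:
            continue
        digit_context = sum(n.isdigit() for n in neighbors) > len(neighbors) / 2
        if digit_context and not c.isdigit():
            chars[i] = CONFUSION_PAIRS[c]   # letter read inside a digit run
        elif not digit_context and c.isdigit():
            chars[i] = CONFUSION_PAIRS[c]   # digit read inside a letter run
    return "".join(chars)
```

Real systems would weigh classifier confidences instead of a hard vote, but the shape of the heuristic is the same.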

~~~
netheril96
In my experience, CNNs offer the best performance. They are also easy to treat
as a black box with many tunable parameters.

But the existing frameworks are mostly _bad_ as engineering products. Caffe,
for example, calls `exit` every time an error occurs, including recoverable
ones like a file that does not exist.

~~~
hbornfree
Yes, we too found Caffe overwhelming. This library seems promising,
considering it has no dependencies and is quite portable to Android or iOS.
[https://github.com/nyanp/tiny-cnn](https://github.com/nyanp/tiny-cnn)

------
kargo
Does it have to be open source? If free but not trainable, and restricted to
Windows desktop/phone apps, is good enough, then I recommend the Microsoft OCR library.
It gives you very, VERY good results out of the box. An excellent piece of
work from Microsoft Research. To test it, see for example
[https://ocr.a9t9.com/](https://ocr.a9t9.com/) which uses Microsoft OCR
inside.

And for comparison, an OCR application with Tesseract inside has a
dramatically lower text recognition rate:
[http://blog.a9t9.com/p/free-ocr-windows.html](http://blog.a9t9.com/p/free-ocr-windows.html)

(Disclaimer: both links are my little open-source side projects)

~~~
piyush_soni
Are you talking about this library from Microsoft:
[https://www.nuget.org/packages/Microsoft.Windows.Ocr/](https://www.nuget.org/packages/Microsoft.Windows.Ocr/)
?

~~~
kargo
Yes

------
physcab
I tried using Tesseract and could not get it to work reliably. I tried a bunch
of different pre-processing techniques and it turned into a very frustrating
experience.

When I compared Tesseract to Abbyy, the difference was night and day. Abbyy
straight out of the box got me 80-90% accuracy on my text. Tesseract got
around 75% at best, with several layers of image pre-processing.

I know you said open source; I just wanted to say that I went down that path
too and discovered that, in my case, proprietary software really was worth the price.

~~~
totalcookie
The Abbyy FineReader Engine for Linux looks fine, but I can't find any pricing.
Is it under $10k/year? A one-time license?

~~~
loumf
Abbyy, and most good commercial OCR, charge a runtime royalty based on the
number of cores you are deploying to. Depends on volume.

~~~
totalcookie
Can you point me to some prices, per core / core-year?

~~~
BlueberryYY
This is from years ago, but it was $4,900 for an SDK license for development
and testing, and then the runtime licenses that you actually deploy are
negotiable depending on pages/month, cores, languages, and other options. I
think if you're just buying 1 runtime license they try to make it so it is
around $5K so the total deal size is $10K. But they are negotiable. Maybe
mention if you're a startup.

They also sold Recognition Server which has fewer options to integrate with
programmatically (it was only 'hot folders' at the time), and I think only
runs on Windows, but costs less.

For their mobile OCR, which is lightweight and designed to run on smartphones,
they worked out deals where they get a percentage of revenue.

------
bittersweet
I was actually looking into Tesseract yesterday, so this post coincides
nicely.

I have a hobby project where I scrape Instagram photos, and I only want to end
up with photos with actual people in them. A lot of images get posted with
motivational text etc. that I want to filter out automatically.

So far I've built a binary that scans images and spits out the dominant color
percentage; if one color is over a certain threshold (black background with
white text, for example), I can be pretty sure it's not something I want to
keep.

I've also tried face detection with OpenCV, but I had a lot of false
positives, with faces being detected in text and random-looking objects. I've
tried out four of the Haar cascades, all with different, but not 'perfect',
end results.

OCR was my next step to check out, maybe I can combine all the steps to get
something nice. I was getting weird texts back from images with no text, so
the pre-processing hints in this thread are gold and I can't wait to check
those out.

This thread is giving me so many ideas and actual names of algorithms to check
out; I love it. But I would really appreciate it if anyone else has more
thoughts on how to filter out images that do not contain people :-)

~~~
kefka
My solution with regards to bad facial detection in OpenCV is to do the
following:

1. Use an LBP cascade on the picture. This is lower quality, with higher false
positives, but it uses integer math, so it's fast. It's named
lbpcascade_frontalface.xml.

2. Capture the regions of interest that the LBP cascade identifies as a face
and throw them into a vector<Mat>. This means you can capture a (potentially)
arbitrary number of faces. Of course, with OpenCV you are limited to a minimum
face size of 28x28 px.

3. Run the Haar cascade for eye detection on the ROIs you saved in the
vector. Ones that return eyes are a good match. Haar cascades are
slower (because they use floats), but the reduction in pixels means it's
relatively fast. It's named haarcascade_eye_tree_eyeglasses.xml.

I can maintain 20fps with this setup at 800x600 on a slow computer.
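The two-stage filter above can be sketched in plain Python, with the detectors injected as functions so the control flow is clear. In a real OpenCV program, detect_faces and detect_eyes would wrap cv2.CascadeClassifier(...).detectMultiScale calls on the two cascade files named above; the stubs here are hypothetical stand-ins:

```python
# Two-stage face confirmation: a fast, noisy detector proposes candidate
# boxes, and a slower, stricter detector vets each candidate's ROI.

def confirmed_faces(image, detect_faces, detect_eyes):
    """Keep only face candidates whose ROI also contains eyes."""
    results = []
    for (x, y, w, h) in detect_faces(image):          # stage 1: fast LBP pass
        roi = [row[x:x + w] for row in image[y:y + h]]  # crop candidate region
        if detect_eyes(roi):                          # stage 2: slower Haar pass
            results.append((x, y, w, h))              # eyes found -> keep face
    return results
```

The win is that the expensive detector only ever sees small crops, which is why the commenter can hold 20fps on modest hardware.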

------
jlangenauer
Tesseract is OK, but I gather that a lot of the good work on it in the last
few years has remained closed source within Google.

If you want to do text extraction, look at things like Stroke Width Transform
to extract regions of text before passing them to Tesseract.

~~~
bouke
Is it possible to extract text from pre-formatted documents? Let's say I have
a document issued by the government, and I am only interested in the fields
that have been filled in. Could I process such a document fully automatically
using Tesseract? Maybe some image pre-processing would be needed?

~~~
leeoniya
If you'd be interested in using something like this as a paid API/service and
have decent volume, feel free to contact me. I'm currently in pre-beta for API
rollout of such a thing.

In summary, you need templates that map the field positions -> meaningful keys
so that you can get back useful data as json/csv/xml. I have some tools that
are still being polished that automate much of the template creation and do a
lot of the pre-proc for you.

email is my username (at) gmail

------
j-pb
Not really. Everything out there is ancient and uses techniques from the
80s/90s.

It's pretty sad considering that OCR is basically a solved problem. We have
neural nets that are capable of extracting entities from images, and big data
systems that can play Jeopardy. But no one has used that tech for OCR and put
it out there.

~~~
IshKebab
It's because the kind of massive training sets required for accurate
speech/text recognition are difficult and time-consuming to amass. It's just
too difficult for hobbyists at the moment.

It's kind of the same reason there is no good open source solid modelling
software (like Solidworks or NX). The problem has been "solved" for years, but
actually doing it is too mammoth a task for open-source software.

~~~
recmo
[OT] What about FreeCAD? It looks like good open source solid modelling
software: [http://freecadweb.org/](http://freecadweb.org/)

~~~
kefka
FreeCAD is starting to get good, but the problem is that it lacks core
functionality that SolidWorks already has built in. Admittedly, the
underlying Python calls are exposed, and you can use them... if you know how
to make the appropriate calls.

When they get a good GUI hooked up to those calls, it will be at parity with
the paid solutions.

------
sandGorgon
[http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-model.html](http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-model.html)

~~~
riordan
Yep. Back when I was at NYPL Labs[0], we developed a bit of an obsession with
OCR. It turns out that so much of it depends on the context of the documents,
original scan quality (unpaper and deskewing be damned), typography, how much
handwriting you're willing to suffer (can someone please pick up the work of
applying Convnets to large handwriting corpora that aren't just MNIST?).

A lot of this comes down to how much time you're willing to put into it to get
it up and running and if you're willing to put in any effort to additionally
train/refine a model.

(ps all the elements I'm about to mention here have been mentioned throughout
this thread. I'm just providing a bit more context and bringing what I know
together).

_________________

Gotta have something now: Tesseract

Not saying it's "bad" but it ain't as great as you want. Bindings for every
language, lots of configs, documented well enough that it's 80% of the
suggestions here. For what it is, it's a damn miracle. Most languages and
alphabets and lots of kinds of fonts likely start giving you results somewhere
between 80% and the Uncanny Valley immediately. You also get a pretty good
Layout Analysis engine [1] if you're working with complete pages of text.
The existing models it ships with are robust, but if you want to get better
outputs, retraining it is a real pain (you have to segment on a character-by-
character basis). You're better off trying to apply a general preprocessing to
clean the image or...

__________________

Gotta have a proof of concept this afternoon: Send just text to Tesseract

By now you've realized that asking an OCR engine to OCR a tree (or a picture
of a tree alongside some text) somehow always comes back looking like a cat
just napped on your keyboard. As alluded to in a few other places here, Just
Give It Text! Tesseract (and most traditional OCR) was developed in a text
document centric world where you could safely assume it's just a page of
words, and sometimes you might have to deal with columns (that's where the
Layout Analysis comes in). Probably not your real world.

Depending on your type of documents, you might be able to develop a few
heuristics for identifying text regions in the picture, then send only
those sections over to Tesseract. This'll dramatically help Tesseract out, but
it can increase the complexity you'll have to juggle. You might be able to
come up with some heuristics specific to your documents (which can be very
good, especially if they let you infer more information about those regions
that you might want later).

You can also use something like the Stroke Width Transform (all the links I
would use were graciously provided in this earlier comment [2]), which came
out of Microsoft Research's work on spotting text in the wild in street-level
imagery. ccv has a very nice SWT implementation [3], and with a bit of
makefile finagling their HTTP server for the whole library [4] can have a very
nice SWT preprocessor -> Tesseract API over HTTP up in an hour or so.

Also it looks like the OpenOCR[5] project is now using SWT -> Tesseract with a
nice HTTP API written in Go, conveniently packaged up for Docker and very well
documented[6].

_____________________

Take me down the rabbit hole, but maybe get results in 2 days:

The roots of Tesseract were planted by HP in 1985, so there had to be a better
way at some point. OCRopus[7] was supposed to be the great state-of-the-art
hope that would save us all. The approach was incredible and the results being
published were great. But the documentation came in the form of a mind map[8].
Recently the project was picked up again by the original developer and
rechristened OCRopy[9] and has garnered a pretty active and growing community
in the past few months.

You'll have to do a bit more work here than with just Tesseract, but the LSTM
neural network [10] approach completely blows away Tesseract's results with
just a little training.

To get started, Dan Vanderkam's tutorial is excellent to start working with
the out-of-the-box model[11] immediately.

But results get INCREDIBLE when you take some time to train your own
model[12]. I provided Dan with the source images for the project[13], from The
New York Public Library's historical collections, and the source text was in a
font style the default model had never encountered before. But in an hour or
two of transcribing the training data (training data in OCRopus is awesome
because you just feed in a line at a time; no need to align and segment each
letter like in Tesseract), he was getting an under 1% error rate!

Layout analysis isn't OCRopy's strongest suit, so you might get even better
results if you pre-segment with something like SWT, but again, not completely
necessary.

____________

That's it. Way too much for a Newsy Combinator comment, but a pretty decent
tour of the big stuff in open ocr world. There's some magic state of the art
going on inside Google, Microsoft, and Abbyy, but hopefully you'll be able to
teach your computer to appreciate whatever it is you want it to read through
these now.

[0]: [http://nypl.org/labs](http://nypl.org/labs)

[1]: [https://en.wikipedia.org/wiki/Document_layout_analysis](https://en.wikipedia.org/wiki/Document_layout_analysis)

[2]: [https://news.ycombinator.com/item?id=9776034](https://news.ycombinator.com/item?id=9776034)

[3]: [http://libccv.org/doc/doc-swt/](http://libccv.org/doc/doc-swt/) and [http://libccv.org/lib/ccv-swt/](http://libccv.org/lib/ccv-swt/)

[4]: [http://libccv.org/doc/doc-http/](http://libccv.org/doc/doc-http/) and [http://libccv.org/lib/ccv-serve/](http://libccv.org/lib/ccv-serve/)

[5]: [https://github.com/tleyden/open-ocr](https://github.com/tleyden/open-ocr)

[6]: [https://github.com/tleyden/open-ocr/wiki/Stroke-Width-Transform](https://github.com/tleyden/open-ocr/wiki/Stroke-Width-Transform)

[7]: [https://en.wikipedia.org/wiki/OCRopus](https://en.wikipedia.org/wiki/OCRopus)

[8]: [https://www.mindmeister.com/192257150/ocropus-overview-published](https://www.mindmeister.com/192257150/ocropus-overview-published)

[9]: [https://github.com/tmbdev/ocropy](https://github.com/tmbdev/ocropy)

[10]: [https://en.wikipedia.org/wiki/Long_short_term_memory](https://en.wikipedia.org/wiki/Long_short_term_memory)

[11]: [http://www.danvk.org/2015/01/09/extracting-text-from-an-image-using-ocropus.html](http://www.danvk.org/2015/01/09/extracting-text-from-an-image-using-ocropus.html) | though you've got to download the model separately because it's too big for GitHub

[12]: [http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-model.html](http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-model.html)

[13]: [https://www.oldnyc.org/](https://www.oldnyc.org/)

~~~
bouyoul420
Great list of links. You have done the research! Personally, I have used
ocropy to great effect. Right now, I am moving beyond forms at work and
researching cartographic map vectorization using deep learning techniques.
There are few scientific papers on the subject, however. Have you done any
research on that, by any chance?

~~~
riordan
Indeed! Map vectorization is one of the things NYPL Labs is actively working on:
[https://github.com/nypl/map-vectorizer](https://github.com/nypl/map-vectorizer)

It's designed to work primarily with old fire insurance atlases (e.g.
Sanborns) and is a bunch of hand-tuned heuristics. But it powers Building
Inspector
([http://buildinginspector.nypl.org](http://buildinginspector.nypl.org)),
where all the data is validated by consensus crowdsourcing, providing
ground truth for the data coming out.

Unfortunately (or perhaps fortunately if you're trying to get a Computer
Science PhD) I haven't come across anyone applying deep learning to map
vectorization either. Frankly, it could open up a whole new field with respect
to historical mapping (among other things).

Would love to talk more about it and see how deep learning could be applied
here. We always structured the outputs of Building Inspector so they'd be
useful training sets for unsupervised or reinforcement learning so it could
hopefully apply here as well.

------
t0
The accuracy of GOCR is very high and it doesn't require any learning.

[http://manpages.ubuntu.com/manpages/dapper/man1/gocr.1.html](http://manpages.ubuntu.com/manpages/dapper/man1/gocr.1.html)

~~~
jahewson
GOCR is ok, certainly much better than GNU Ocrad, but I've had much better
results with Tesseract.

~~~
ajross
GOCR is simple enough to install from your distro's repository and wrap into a
script without fuss though. Tesseract is wildly more complicated to get
started with.

If you're doing real time street sign recognition or involved with a book
scanning and archival startup, investigate Tesseract for sure. But even then
you'll probably want to prototype with gocr first.

------
rashfael
(Disclaimer: I work for creatale GmbH)

Since you mentioned ocrad.js, I assume you're looking for something in
JS/Node.js. Many others have already recommended Tesseract and OpenCV. We have
built a library around Tesseract for character recognition and OpenCV (and
other libraries) for preprocessing, all for Node.js/io.js:
[https://github.com/creatale/node-dv](https://github.com/creatale/node-dv)

If you have to recognize forms or other structured images, we also created a
higher-level library:
[https://github.com/creatale/node-fv](https://github.com/creatale/node-fv)

~~~
antjanus
I can attest to this. I used Node-DV in a production environment and it's
pretty damn amazing with all its built-in functionality.

There are options to adjust the image in various ways and once Tesseract runs,
it's easy to get the result in various formats.

------
Omnipresent
I've used Tesseract to great effect. I don't know what your images are like,
but if only part of an image has text in it, you should send only that part to
the OCR engine. If you send the entire image and only a portion of it has
text, the chances of the OCR extracting the text are slim. There are
pre-processing techniques [1] you can use to crop out the part of the image
that has text.
[1]:
[https://en.wikipedia.org/?title=Hough_transform](https://en.wikipedia.org/?title=Hough_transform)
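Besides the Hough transform, one simple pre-processing trick along these lines is a horizontal projection profile: binarize, count ink per row, and crop to the rows that actually contain ink. A minimal sketch (my own illustration, not from the comment above), assuming dark text on a light background:

```python
import numpy as np

def crop_text_rows(gray, ink_threshold=128):
    """Crop a grayscale image to the horizontal band of rows that contain ink."""
    arr = np.asarray(gray)
    ink = (arr < ink_threshold).sum(axis=1)   # dark ("ink") pixels per row
    rows = np.flatnonzero(ink)                # indices of rows with any ink
    if rows.size == 0:
        return None                           # no text-like rows found
    return arr[rows[0]:rows[-1] + 1]          # tight band around the text
```

The same idea applied column-wise gives you the left/right crop; together they strip most of the non-text area before the OCR engine ever sees it.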

~~~
roel_v
Could you explain how to use a Hough transform to find areas that have text? I
only recently managed to wrap my head around how it works for line detection
and other shapes with the generalized form, but how would one recognize text?

------
byoung2
[https://code.google.com/p/tesseract-ocr/](https://code.google.com/p/tesseract-ocr/) is pretty good

~~~
WalterGR
Tesseract does no layout analysis.

So if the source image contains text columns or pull quotes or similar, the
output text will just be each row of text, from the far left to the far right.

~~~
NeutronBoy
Could you add another layer on top (eg. image processing) to detect boundaries
of blocks of content, and send each block to Tesseract in sequence?

~~~
kefka
Yeah, you could do that. And I wouldn't think it would be that hard either.

I'd make a cascade that detects all letters and numbers from major font sets.
That shouldn't be too terribly difficult.

Then use the cascade to scan the document and convert it into a list of all
detected characters (we don't actually care what the characters are).

Once you have this, fit bounding boxes around the detections. You'll have to
figure out how large a gap should separate one box from the next.

You should end up with a few boxes indicating the regions of data on the
document. Crop each of these regions of interest and feed them into
Tesseract.
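The bounding-box step could be sketched as a greedy merge of nearby character boxes (a hypothetical illustration; boxes are (x1, y1, x2, y2) tuples):

```python
# Greedily merge detected character boxes whose edges come within
# max_gap pixels of each other, leaving one box per text region that
# can then be cropped and fed to Tesseract.

def group_boxes(boxes, max_gap=10):
    regions = []
    for x1, y1, x2, y2 in boxes:
        for i, (rx1, ry1, rx2, ry2) in enumerate(regions):
            near = (x1 <= rx2 + max_gap and rx1 <= x2 + max_gap and
                    y1 <= ry2 + max_gap and ry1 <= y2 + max_gap)
            if near:  # close enough: grow the region to include this box
                regions[i] = (min(rx1, x1), min(ry1, y1),
                              max(rx2, x2), max(ry2, y2))
                break
        else:         # not near any region: start a new one
            regions.append((x1, y1, x2, y2))
    return regions
```

A single pass like this can leave regions that should still be merged with each other; real code would repeat the merge until nothing changes.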

------
awjr
I've been looking at number plate recognition, and
[https://github.com/openalpr/openalpr](https://github.com/openalpr/openalpr)
has been my go-to solution so far. It uses Tesseract and OpenCV.

------
ivanca
Judging by the comments, there is much room for improvement in open source OCR
libraries. Maybe someone with experience in the field should start a
Kickstarter for it; there must be some here on HN itself. Maybe something with
a CLI, a GUI, and layout recognition.

------
stevewepay
I used Tesseract and OpenCV to process Words With Friends tiles.

~~~
kornish
Haha, this is awesome. Have you written about your program anywhere? Would
love to hear more about it.

~~~
stevewepay
It's pretty much as other people have described: I used OpenCV to detect the
board and isolate each tile, then used Tesseract to figure out the letter. The
biggest problem I had was that whenever the red score bubble landed on a tile,
it would cover part of the letter and lose information, so I ended up having
to create training data and match against it to get the letter. I was able to
get to about 98% accuracy, and it was relatively robust until they released a
new app that I haven't had time to adjust for.

------
awfullyjohn
I've tried using Tesseract before; it's the biggest of the open source
libraries. It goes from "okay" to "terrible" depending on the application.

Our particular application was OCRing brick and mortar store receipts directly
from emulated printer feeds (imagine printing straight to PDF). We found that
Tesseract had too many built-in goodies for image scanning, like warped
characters, lighting and shadow defects, and photographic artifacts. When
applied directly to what should have been 1-to-1 character recognition, it
failed miserably.

We found that building our own software to recognize the characters on a
1-to-1 basis produced much better results. See:
[http://stackoverflow.com/questions/9413216/simple-digit-recognition-ocr-in-opencv-python](http://stackoverflow.com/questions/9413216/simple-digit-recognition-ocr-in-opencv-python)
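The linked Stack Overflow approach is essentially nearest-neighbour matching of fixed-size glyph bitmaps against a labelled template set, which is workable when the input is a clean emulated printer feed. A toy sketch (3x3 bitmaps standing in for real glyph crops; all names are hypothetical):

```python
# Nearest-neighbour glyph classification: pick the labelled template
# that differs from the unknown glyph in the fewest pixels.

def classify_glyph(glyph, templates):
    """Return the label of the closest template by pixel-wise Hamming distance."""
    def distance(a, b):
        return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return min(templates, key=lambda label: distance(glyph, templates[label]))
```

With a fixed printer font and deterministic rasterization, a handful of templates per character is often enough, which is why this beats a general-purpose engine tuned for noisy scans.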

------
dr_hercules
A guide on OCR with Tesseract 3.03:

[http://www.joyofdata.de/blog/a-guide-on-ocr-with-tesseract-3-03/](http://www.joyofdata.de/blog/a-guide-on-ocr-with-tesseract-3-03/)

------
kxyvr
It depends on what you're trying to do. For myself, I wanted to OCR scanned
documents and I've been moderately successful using ScanTailor to process the
images and then Tesseract to OCR the result. Certainly, it's far from perfect
and the documentation on Tesseract and its options is spotty, but I've been
moderately happy. As a note, I've had better luck with the current trunk in
the git repo of Tesseract than the point releases. Some of the older versions
of Tesseract modified the images when processing to a PDF and this was
unacceptable to me.

------
zaphar
I used the Go wrappers for Tesseract [0] in my full-text search indexer for
disks [1] and it worked great. No issues; it handled pretty much anything I
threw at it.

With the caveat that none of the stuff was handwritten.

[0]
[http://godoc.org/gopkg.in/GeertJohan/go.tesseract.v1](http://godoc.org/gopkg.in/GeertJohan/go.tesseract.v1)
[1] [https://bitbucket.org/zaphar/goin/](https://bitbucket.org/zaphar/goin/)

------
JeremyWei
You may try the Baidu OCR API:
[https://github.com/JeremyWei/baidu-ocr](https://github.com/JeremyWei/baidu-ocr)

~~~
dharma1
Is that trainable for new alphabets? API use policy?

------
tiatia
I tried Tesseract; it worked very poorly. I don't think there is anything that
compares well to Abbyy.

Unfortunately the native Linux version is a bit pricey:
[http://www.ocr4linux.com/en:pricing](http://www.ocr4linux.com/en:pricing)

Otherwise I would use the command line version to help me index all my data.

------
zandorg
An important aspect is stripping text from image, before OCR'ing:
[http://sourceforge.net/projects/tirg/](http://sourceforge.net/projects/tirg/)

Try TIRG.

Examples here: [http://funkybee.narod.ru/](http://funkybee.narod.ru/)

------
baken
I'm working on these problems for applications in the financial industry right
now. We are building some interesting technologies and have quite a bit of
funding from top VCs. We are looking to hire people who know this problem
cold. Let me know if you're curious. nate@qualia-app com

------
rajadigopula
[http://projectnaptha.com/](http://projectnaptha.com/)

[https://github.com/naptha](https://github.com/naptha)

Apologies, I am not sure if it's open source.

~~~
lsadam0
It seems this is just built on top of Tesseract as well.

------
steeve
We've had great success with ccv's SWT to identify text zones and Tesseract
to extract the text.

[1] [http://libccv.org/doc/doc-swt/](http://libccv.org/doc/doc-swt/)

------
reinhardt1053
Tesseract: [https://github.com/tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract)

------
ihodes
A coworker had great success using OCRopus (which uses neural nets under the
covers) to extract text from the backs of many photos for Old New York.

------
dharma1
Tesseract with ccv for SWT works, but it's a bit of a slow dog. Has anyone
used deep learning (Caffe, maybe?) for OCR?

~~~
dharma1
Thought I'd reply on my own comment.

Caffe has been used for handwriting, seems like OCR of typefaces would work
just the same with a typeface dataset.

[https://github.com/BVLC/caffe/tree/dev/examples/mnist](https://github.com/BVLC/caffe/tree/dev/examples/mnist)

Here is the only other OCR example with Caffe I could find:

[https://github.com/pannous/caffe-ocr](https://github.com/pannous/caffe-ocr)

------
automentum
Any references on OCR to extract MICR (the E-13B font) for desktop as well as
Android?

~~~
loisaidasam
Dunno about desktop, but as far as Android is concerned, tess-two is a fork of
Tesseract tools for Android:
[https://github.com/rmtheis/tess-two](https://github.com/rmtheis/tess-two)

This dude did a pretty nice writeup on getting it going:
[http://gaut.am/making-an-ocr-android-app-using-tesseract/](http://gaut.am/making-an-ocr-android-app-using-tesseract/)

------
mattdennewitz
Apache Tika + OCR is an excellent combination:
[https://wiki.apache.org/tika/TikaOCR](https://wiki.apache.org/tika/TikaOCR)

------
SXX
A few years ago I used CuneiForm, but it looks like it's dead now.

